Migration Guide: Private Source to Boilerplate
April 13, 2026
This guide explains how to migrate an existing private data source into this boilerplate without leaking organization-specific details.
Goal
Transform source-specific jobs into reusable, source-agnostic boilerplate components while preserving behavior.
Migration Checklist
- Create a dedicated branch for the source migration.
- Scaffold boilerplate files with:
    make scaffold-source SOURCE=<source_name> ENTITY_TYPE=<entity_type>
- Move source-specific extraction logic into the generated Raw job.
- Keep shared logic in libs/common/ and libs/pyspark/generic.
- Replace hardcoded values with config fields and env vars.
- Add unit tests for config defaults and transform behavior.
- Add one integration smoke test for the full raw->gold flow.
- Run validation:
    make test-unit
    make lint
    make type-check
File-by-File Mapping
- Raw extraction job:
  - from: private source raw module
  - to: jobs/raw/<source>_raw_job.py
- Bronze normalization job:
  - from: private source bronze module
  - to: jobs/bronze/<source>_bronze_job.py
- Silver standardization job:
  - from: private source silver module
  - to: jobs/silver/<source>_silver_job.py
- Gold aggregation job:
  - from: private source gold module
  - to: jobs/gold/<source>_gold_job.py
- Source tests:
  - from: private tests
  - to: tests/unit/jobs/test_<source>_jobs.py
De-Identification Rules
Before opening a PR, remove or rename:
- Internal service URLs and hostnames
- Company-specific identifiers and business labels
- Account IDs and bucket names
- Secret names tied to internal conventions
- Proprietary metric names in Gold outputs
Use neutral defaults such as public_api, customers, raw_zone, and gold_zone.
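One way to apply these rules before opening a PR is a small scan over the migrated files. The sketch below is an assumption, not part of the boilerplate: the patterns (an `.internal.example` hostname suffix, 12-digit account IDs, an `acme-` bucket prefix) are placeholders to replace with your organization's own markers.

```python
import re
from pathlib import Path

# Hypothetical patterns; replace with markers specific to your organization.
PRIVATE_PATTERNS = {
    "internal hostname": re.compile(r"\b[\w.-]+\.internal\.example\b"),
    "account id": re.compile(r"\b\d{12}\b"),
    "bucket name": re.compile(r"s3://acme-[\w-]+"),
}


def find_private_identifiers(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, label, matched_text) for every suspicious match."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PRIVATE_PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((line_number, label, match.group(0)))
    return findings


def scan_tree(root: str, suffixes: tuple[str, ...] = (".py", ".md", ".yaml")) -> dict[str, list]:
    """Scan files under root and map each offending path to its findings."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            findings = find_private_identifiers(text)
            if findings:
                report[str(path)] = findings
    return report
```

Calling `scan_tree("jobs")` returns an empty dict when nothing matches, which makes it easy to wire into a pre-PR check.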
Config Migration Pattern
Use this pattern when replacing hardcoded constants:
- Add fields to the corresponding config class (RawJobConfig, BronzeJobConfig, etc.).
- Provide safe defaults for local/dev usage.
- Resolve runtime values via Glue args and env vars.
- Derive table names in model_post_init.
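Example:
    # Assumes the config classes are Pydantic v2 models (model_post_init is a v2 hook).
    from pydantic import Field

    # RawJobConfig is the boilerplate's base class; import it from wherever your checkout defines it.
    class CustomerApiRawConfig(RawJobConfig):
        source_name: str = Field(default="customer_api")
        entity_type: str = Field(default="customers")
        api_base_url: str = Field(default="https://example.com")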
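The remaining steps of the pattern, env-var resolution and derived table names, can be sketched as follows. This is a stdlib-only stand-in rather than the boilerplate's code: it uses a dataclass, with `__post_init__` playing the role that `model_post_init` plays on the real config classes. The env var name `API_BASE_URL`, the `raw_table` field, and the `<source>_<entity>_raw` naming scheme are assumptions made for illustration.

```python
import os
from dataclasses import dataclass, field


def env_or_default(name: str, default: str) -> str:
    """Read an override from the environment, falling back to a safe local default."""
    return os.environ.get(name, default)


@dataclass
class CustomerApiRawConfigSketch:
    source_name: str = "customer_api"
    entity_type: str = "customers"
    # API_BASE_URL is a hypothetical env var name; it is read when the config is created.
    api_base_url: str = field(
        default_factory=lambda: env_or_default("API_BASE_URL", "https://example.com")
    )
    # Derived in __post_init__, so callers never hardcode it.
    raw_table: str = field(init=False)

    def __post_init__(self) -> None:
        # Assumed naming scheme: <source>_<entity>_raw.
        self.raw_table = f"{self.source_name}_{self.entity_type}_raw"
```

Because the table name is derived from `source_name` and `entity_type`, renaming the source during de-identification updates the table name as well.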
Testing Strategy
Minimum expected coverage for migrated sources:
- Unit tests:
  - config defaults and table naming
  - transform behavior on valid and empty input
- Integration tests:
  - single smoke flow raw->bronze->silver->gold
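A minimal sketch of the unit-test layer, using `unittest` and plain Python rows in place of Spark DataFrames so that it runs without a cluster. `normalize_customers` is a hypothetical transform invented for this example; substitute the transform from your migrated job.

```python
import unittest


def normalize_customers(rows: list[dict]) -> list[dict]:
    """Hypothetical transform: trim names, lowercase emails, drop rows without an id."""
    return [
        {
            "id": row["id"],
            "name": row.get("name", "").strip(),
            "email": row.get("email", "").lower(),
        }
        for row in rows
        if row.get("id") is not None
    ]


class NormalizeCustomersTest(unittest.TestCase):
    def test_valid_input(self):
        rows = [
            {"id": 1, "name": " Ada ", "email": "ADA@Example.com"},
            {"id": None, "name": "dropped"},
        ]
        self.assertEqual(
            normalize_customers(rows),
            [{"id": 1, "name": "Ada", "email": "ada@example.com"}],
        )

    def test_empty_input(self):
        # An empty batch should pass through without raising.
        self.assertEqual(normalize_customers([]), [])
```

Saved under tests/unit/jobs/, a file like this would be picked up by a standard `python -m unittest` or pytest run.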
Common Pitfalls
- Reusing private schema names in default table/database values
- Leaving source-specific exceptions in shared libs
- Hardcoding S3 paths instead of using RAW_ZONE_PATH and WAREHOUSE_PATH
- Skipping a Gold assertion for aggregation consistency
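The last two pitfalls can be addressed with small helpers. The sketch below assumes `RAW_ZONE_PATH` and `WAREHOUSE_PATH` are read as environment variables and that data is laid out as `<base>/<source>/<entity>`; the helper names, the `/tmp` fallbacks, the layout, and the `amount` / `total_amount` column names are illustrative assumptions, not taken from the boilerplate.

```python
import math
import os


def raw_path(source_name: str, entity_type: str) -> str:
    """Build the raw landing path from RAW_ZONE_PATH instead of a literal S3 URI."""
    base = os.environ.get("RAW_ZONE_PATH", "/tmp/raw_zone")  # assumed local fallback
    return f"{base.rstrip('/')}/{source_name}/{entity_type}"


def warehouse_path(database: str, table: str) -> str:
    """Build a table location under WAREHOUSE_PATH."""
    base = os.environ.get("WAREHOUSE_PATH", "/tmp/warehouse")  # assumed local fallback
    return f"{base.rstrip('/')}/{database}/{table}"


def check_gold_consistency(silver_rows: list[dict], gold_rows: list[dict]) -> None:
    """Raise if the Gold totals do not add up to the Silver values they aggregate."""
    silver_total = sum(row["amount"] for row in silver_rows)
    gold_total = sum(row["total_amount"] for row in gold_rows)
    if not math.isclose(silver_total, gold_total):
        raise AssertionError(f"Gold total {gold_total} != Silver total {silver_total}")
```

A check like `check_gold_consistency` fits naturally at the end of the smoke test, after the Gold job has written its output.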
Definition of Done
A migration is complete when:
- Source jobs run with boilerplate defaults in local mode.
- No private identifiers remain in generated code or docs.
- Unit tests pass for the new source.
- Smoke integration test passes.
- Lint and type-check are green.