AWS Glue ETL Boilerplate
May 9, 2026
A production-ready starting point for building AWS Glue v5 data pipelines following the Medallion Architecture (Raw → Bronze → Silver → Gold).
Ships with:
- Generic public_api sample jobs across all four layers (using JSONPlaceholder)
- Pydantic v2 four-tier config resolution (Workflow Properties → CLI args → env vars → defaults)
- Apache Iceberg tables via AWS Glue Data Catalog
- LocalStack-based local development environment
- Full unit test suite with no external dependencies
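The four-tier resolution order can be illustrated with plain dictionaries. This is a minimal sketch, not the project's Pydantic implementation: `resolve_config` is a hypothetical helper, and it assumes the arrows list the tiers from highest to lowest priority.

```python
from typing import Mapping


def resolve_config(
    workflow_props: Mapping[str, str],
    cli_args: Mapping[str, str],
    env_vars: Mapping[str, str],
    defaults: Mapping[str, str],
) -> dict[str, str]:
    """Merge the four tiers so that higher-priority tiers override lower ones."""
    merged: dict[str, str] = {}
    # Apply the lowest-priority tier first; each later update overwrites earlier keys.
    for tier in (defaults, env_vars, cli_args, workflow_props):
        merged.update(tier)
    return merged


resolved = resolve_config(
    workflow_props={"ENTITY_TYPE": "comments"},
    cli_args={"ENTITY_TYPE": "posts", "API_ENDPOINT": "/posts"},
    env_vars={"SOURCE_NAME": "public_api"},
    defaults={"SOURCE_NAME": "default_source", "BRONZE_DATABASE_NAME": "bronze_zone"},
)
print(resolved["ENTITY_TYPE"])  # comments (the workflow property wins over the CLI arg)
```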
Architecture
External API / SFTP
│
▼
┌──────────────┐ Python Shell (PyShell)
│ Raw │ Extract → write JSONL to S3
└──────┬───────┘
│
▼
┌──────────────┐ PySpark
│ Bronze │ Read raw JSONL → normalize → Iceberg
└──────┬───────┘
│
▼
┌──────────────┐ PySpark
│ Silver │ Quality / typing / hashing → Iceberg
└──────┬───────┘
│
▼
┌──────────────┐ PySpark
│ Gold │ Business aggregates → Iceberg
└──────────────┘
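Each PySpark layer (Bronze, Silver, Gold) has a dedicated base class in libs/pyspark/, the Raw layer builds on PyShellJobBase in libs/pyshell/, and every layer has a config class in libs/common/config/.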
See docs/ARCHITECTURE.md for more details.
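To make two of the steps in the diagram concrete, the sketch below shows what "write JSONL" in the Raw layer and "hashing" in the Silver layer might look like in plain Python. Both function names are hypothetical, and the real jobs may serialize or hash differently (for example with Spark functions rather than `hashlib`).

```python
import hashlib
import json
from typing import Iterable


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(
        json.dumps(record, sort_keys=True, separators=(",", ":")) + "\n"
        for record in records
    )


def row_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not affect the hash."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


payload = to_jsonl([{"id": 1, "title": "a"}, {"id": 2, "title": "b"}])
print(payload, end="")
print(row_hash({"id": 1, "title": "a"}) == row_hash({"title": "a", "id": 1}))  # True
```

A stable row hash of this kind is one common way to detect changed records between loads.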
Project Structure
.
├── jobs/
│ ├── raw/ # Python Shell extraction jobs
│ ├── bronze/ # PySpark normalization jobs
│ ├── silver/ # PySpark quality/standardization jobs
│ └── gold/ # PySpark aggregation jobs
├── libs/
│ ├── common/ # Shared config, utils, logging
│ ├── pyshell/ # PyShellJobBase for raw layer
│ └── pyspark/ # SparkSessionFactory + Medallion base classes
├── tests/
│ ├── unit/ # Fast, no-Spark tests (<1 s each)
│ └── integration/ # LocalStack + Spark integration tests
├── scripts/ # build, deploy, sync helper scripts
├── docs/ # Extended documentation
├── .devcontainer/ # VS Code dev container (awsglue + localstack)
├── .env.example # Reference env file
└── Makefile # Developer shortcuts
Requirements
| Tool | Version |
|---|---|
| Python | 3.11 |
| uv | latest |
| Docker + Docker Compose | 24+ |
| Java (for Spark) | 11 or 17 |
Quick Start
1. Clone and bootstrap
git clone https://github.com/your-org/aws-glue-etl-boilerplate.git
cd aws-glue-etl-boilerplate
make bootstrap
This sets up uv, creates .venv, installs runtime/dev dependencies, and runs a sanity test pass.
Optional manual setup is still available via uv if you need a custom environment.
Optional NaNLABS baseline checks (if installed):
make check-env
make nan-health
make nan-skills
2. Set up environment variables
cp .env.example .env
# edit .env with your values
Key variables (see docs/ENVIRONMENT_VARIABLES.md for the full list):
| Variable | Default | Description |
|---|---|---|
| SOURCE_NAME | public_api | Identifier for the data source |
| ENTITY_TYPE | posts | Entity being processed |
| RAW_ZONE_PATH | — | S3 path for raw JSONL output |
| WAREHOUSE_PATH | — | S3 path for Iceberg warehouse |
| RAW_DATABASE_NAME | raw_zone | Glue database for raw layer |
| BRONZE_DATABASE_NAME | bronze_zone | Glue database for bronze layer |
| SILVER_DATABASE_NAME | silver_zone | Glue database for silver layer |
| GOLD_DATABASE_NAME | gold_zone | Glue database for gold layer |
| API_BASE_URL | https://jsonplaceholder.typicode.com | Base URL for public API source |
| API_ENDPOINT | /posts | Endpoint path |
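As a rough illustration of how the variables in the table could be consumed, the sketch below applies the listed defaults and fails fast when a path variable is missing. `load_settings` is a hypothetical helper, and treating the two variables without a default as required is an assumption; the project's own config classes are the authority.

```python
from typing import Mapping

# Defaults copied from the table above.
DEFAULTS = {
    "SOURCE_NAME": "public_api",
    "ENTITY_TYPE": "posts",
    "RAW_DATABASE_NAME": "raw_zone",
    "BRONZE_DATABASE_NAME": "bronze_zone",
    "SILVER_DATABASE_NAME": "silver_zone",
    "GOLD_DATABASE_NAME": "gold_zone",
    "API_BASE_URL": "https://jsonplaceholder.typicode.com",
    "API_ENDPOINT": "/posts",
}

# Assumption: variables listed without a default must be supplied by the user.
REQUIRED = ("RAW_ZONE_PATH", "WAREHOUSE_PATH")


def load_settings(env: Mapping[str, str]) -> dict[str, str]:
    """Return defaults overlaid with values from env; raise if a required key is unset."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise ValueError(f"Missing required variables: {', '.join(missing)}")
    settings = dict(DEFAULTS)
    # Only copy keys this sketch knows about, so unrelated env vars are ignored.
    known = set(DEFAULTS) | set(REQUIRED)
    settings.update({k: v for k, v in env.items() if k in known})
    return settings
```

Calling `load_settings(os.environ)` after exporting the values from `.env` would give a single dictionary to pass around. The bucket paths themselves are yours to choose.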
3. Start the local infrastructure
# Starts LocalStack (S3, Glue, SecretsManager) + SFTP server
docker compose -f .devcontainer/compose.yml up -d
Or open the project in VS Code and use Reopen in Container for the full dev container experience.
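Once the containers are up, code can talk to LocalStack instead of AWS by overriding the endpoint. The sketch below is an assumption-laden example: it presumes LocalStack is reachable on its default edge port 4566 on localhost (check .devcontainer/compose.yml for the actual mapping), and both helper names are hypothetical. `boto3` is a third-party package and is imported lazily.

```python
from typing import Mapping, Optional


def localstack_client_kwargs(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Keyword arguments for boto3.client(...) that target LocalStack instead of AWS."""
    env = env or {}
    return {
        # 4566 is LocalStack's default edge port; override via AWS_ENDPOINT_URL if needed.
        "endpoint_url": env.get("AWS_ENDPOINT_URL", "http://localhost:4566"),
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
        # LocalStack accepts dummy credentials.
        "aws_access_key_id": "test",
        "aws_secret_access_key": "test",
    }


def make_s3_client(env: Optional[Mapping[str, str]] = None):
    """Build an S3 client pointed at LocalStack (requires boto3 to be installed)."""
    import boto3  # imported here so the helper above stays stdlib-only

    return boto3.client("s3", **localstack_client_kwargs(env))
```

With the stack running, `make_s3_client().list_buckets()` is a quick way to confirm that S3 is reachable.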
Running Jobs Locally
Use the Makefile shortcuts:
make run-raw DATA_SOURCE=public_api ENTITY_TYPE=posts
make run-bronze DATA_SOURCE=public_api ENTITY_TYPE=posts
make run-silver DATA_SOURCE=public_api ENTITY_TYPE=posts
make run-gold DATA_SOURCE=public_api ENTITY_TYPE=posts
Or invoke directly:
python jobs/raw/public_api_raw_job.py \
--JOB_NAME=local_test \
--ENTITY_TYPE=posts \
--API_BASE_URL=https://jsonplaceholder.typicode.com
spark-submit jobs/bronze/public_api_bronze_job.py \
--JOB_NAME=local_test \
--ENTITY_TYPE=posts
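The direct invocations above pass parameters as `--KEY=value` tokens. A minimal, stdlib-only sketch of turning such tokens into a dictionary is shown below. `parse_job_args` is a hypothetical helper for illustration; on Glue itself, `awsglue.utils.getResolvedOptions` plays this role, and the boilerplate's config classes may parse arguments differently.

```python
import sys


def parse_job_args(argv: list[str]) -> dict[str, str]:
    """Collect --KEY=value tokens into a dict; ignore anything else."""
    args: dict[str, str] = {}
    for token in argv:
        if token.startswith("--") and "=" in token:
            # Split on the first "=" only, so values may themselves contain "=".
            key, value = token[2:].split("=", 1)
            args[key] = value
    return args


if __name__ == "__main__":
    print(parse_job_args(sys.argv[1:]))
```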
Testing
# Unit tests only (no Spark, no AWS, fast)
make test-unit
# or:
python -m pytest tests/unit/ -q
# Integration tests (includes smoke coverage; some tests may require LocalStack/Spark)
make test-integration
# Quality checks
make lint
make type-check
# Optional baseline checks
make check-env
make nan-health
See docs/TESTING.md for conventions and marker usage.
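For orientation, a unit test in the "no Spark, no AWS" style might look like the sketch below: a pure function plus a test that pytest would collect by its `test_` prefix. `normalize_post` is a hypothetical transform written for this example, not a function from the repository. The input fields (`id`, `userId`, `title`, `body`) are those of a JSONPlaceholder post.

```python
def normalize_post(raw: dict) -> dict:
    """Hypothetical pure transform: snake_case keys, integer ids, trimmed strings."""
    return {
        "post_id": int(raw["id"]),
        "user_id": int(raw["userId"]),
        "title": raw["title"].strip(),
        "body": raw["body"].strip(),
    }


def test_normalize_post_renames_and_trims():
    raw = {"id": "1", "userId": 7, "title": "  hello ", "body": "text\n"}
    assert normalize_post(raw) == {
        "post_id": 1,
        "user_id": 7,
        "title": "hello",
        "body": "text",
    }
```

Keeping transforms as plain functions like this is what lets the unit suite run without a Spark session.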
Adding a New Data Source
1. Generate job and unit test templates:
   make scaffold-source SOURCE=my_source ENTITY_TYPE=entities
2. Implement source-specific extraction/transform logic in generated files:
   - jobs/raw/{source}_raw_job.py
   - jobs/bronze/{source}_bronze_job.py
   - jobs/silver/{source}_silver_job.py
   - jobs/gold/{source}_gold_job.py
3. Adapt the generated unit tests under tests/unit/jobs/test_{source}_jobs.py.
4. Add/adjust env vars in .env.example if needed.
The config system resolves parameters automatically — no wiring needed beyond the field definitions.
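The file naming pattern in the steps above is regular enough to check mechanically. The sketch below derives the expected paths for a source name and reports which ones are absent. Both helpers are hypothetical and only encode the path templates listed above, not the behavior of `make scaffold-source`.

```python
from pathlib import Path

LAYERS = ("raw", "bronze", "silver", "gold")


def scaffold_paths(source: str) -> list[str]:
    """Expected job and test files for a source, following the templates above."""
    jobs = [f"jobs/{layer}/{source}_{layer}_job.py" for layer in LAYERS]
    return jobs + [f"tests/unit/jobs/test_{source}_jobs.py"]


def missing_files(source: str, root: str = ".") -> list[str]:
    """Return the expected paths that do not exist under root."""
    return [p for p in scaffold_paths(source) if not (Path(root) / p).exists()]


print(scaffold_paths("my_source")[0])  # jobs/raw/my_source_raw_job.py
```

Running `missing_files("my_source")` from the repository root after scaffolding should return an empty list if all five files were generated.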
Documentation
| Doc | Description |
|---|---|
| docs/ARCHITECTURE.md | Medallion layer overview |
| docs/DEVELOPMENT.md | Local run examples |
| docs/ENVIRONMENT_VARIABLES.md | All supported env vars |
| docs/LIBS_STRUCTURE.md | Library layout |
| docs/MIGRATION_GUIDE.md | Private source migration checklist |
| docs/TESTING.md | Testing conventions |
| CONTRIBUTING.md | Contribution guidelines |
| AGENTS.md | AI agent usage guide |
Contributing
See CONTRIBUTING.md.