AWS Glue ETL Boilerplate

May 9, 2026

A production-ready starting point for building AWS Glue v5 data pipelines following the Medallion Architecture (Raw → Bronze → Silver → Gold).

Ships with:

  • Generic public_api sample jobs across all four layers (using JSONPlaceholder)
  • Pydantic v2 four-tier config resolution (Workflow Properties → CLI args → env vars → defaults)
  • Apache Iceberg tables via AWS Glue Data Catalog
  • LocalStack-based local development environment
  • Full unit test suite with no external dependencies
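The four-tier precedence named in the list above can be illustrated with a minimal sketch. It uses only the standard library (`collections.ChainMap`) rather than Pydantic, and it assumes the tiers are listed from highest to lowest priority. The function name `resolve_config` and the sample values are hypothetical and are not taken from the boilerplate's config classes.

```python
from collections import ChainMap
from typing import Mapping


def resolve_config(
    workflow_props: Mapping[str, str],
    cli_args: Mapping[str, str],
    env_vars: Mapping[str, str],
    defaults: Mapping[str, str],
) -> dict[str, str]:
    """Merge four tiers; earlier tiers shadow later ones (assumed order)."""
    # ChainMap looks each key up left to right and returns the first hit.
    return dict(
        ChainMap(dict(workflow_props), dict(cli_args), dict(env_vars), dict(defaults))
    )


# Hypothetical inputs, one dict per tier.
resolved = resolve_config(
    workflow_props={"ENTITY_TYPE": "comments"},
    cli_args={"ENTITY_TYPE": "posts", "JOB_NAME": "local_test"},
    env_vars={"SOURCE_NAME": "public_api"},
    defaults={"SOURCE_NAME": "unknown", "API_ENDPOINT": "/posts"},
)
print(resolved)
```

In this sketch a key set in Workflow Properties overrides the same key passed on the command line, and a default is used only when no other tier supplies the key.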

Architecture

External API / SFTP
       │
       ▼
┌──────────────┐    Python Shell (PyShell)
│     Raw      │    Extract → write JSONL to S3
└──────┬───────┘
       │
       ▼
┌──────────────┐    PySpark
│    Bronze    │    Read raw JSONL → normalize → Iceberg
└──────┬───────┘
       │
       ▼
┌──────────────┐    PySpark
│    Silver    │    Quality / typing / hashing → Iceberg
└──────┬───────┘
       │
       ▼
┌──────────────┐    PySpark
│     Gold     │    Business aggregates → Iceberg
└──────────────┘

Each layer has a dedicated base class in libs/pyspark/ and config class in libs/common/config/.
See docs/ARCHITECTURE.md for more details.
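To make the division of work between the layers concrete, here is a toy, pure-Python sketch of the same flow. It does not use Spark, Iceberg, or the base classes in libs/pyspark/. The function names, the sample records, and the quality rule (a non-empty title) are all assumptions made for illustration.

```python
import hashlib
import json
from collections import Counter

# Raw: what an extraction job might write, one JSON object per line (JSONL).
# These three records are made up for the example.
raw_jsonl = "\n".join(
    json.dumps(r)
    for r in [
        {"userId": 1, "id": 1, "title": " Hello "},
        {"userId": 1, "id": 2, "title": "World"},
        {"userId": 2, "id": 3, "title": ""},
    ]
)


def to_bronze(jsonl: str) -> list[dict]:
    """Bronze: parse raw JSONL and normalize column names to snake_case."""
    rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    return [{"user_id": r["userId"], "id": r["id"], "title": r["title"]} for r in rows]


def to_silver(rows: list[dict]) -> list[dict]:
    """Silver: enforce types, drop rows failing a quality rule, add a row hash."""
    out = []
    for r in rows:
        title = str(r["title"]).strip()
        if not title:  # hypothetical quality rule: title must be non-empty
            continue
        clean = {"user_id": int(r["user_id"]), "id": int(r["id"]), "title": title}
        payload = json.dumps(clean, sort_keys=True).encode("utf-8")
        clean["row_hash"] = hashlib.sha256(payload).hexdigest()
        out.append(clean)
    return out


def to_gold(rows: list[dict]) -> dict[int, int]:
    """Gold: a business aggregate, here the number of posts per user."""
    return dict(Counter(r["user_id"] for r in rows))


gold = to_gold(to_silver(to_bronze(raw_jsonl)))
print(gold)
```

Each function consumes only the output of the previous layer, which is the property the Medallion layout is meant to preserve when the real jobs write to separate Iceberg tables.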


Project Structure

.
├── jobs/
│   ├── raw/          # Python Shell extraction jobs
│   ├── bronze/       # PySpark normalization jobs
│   ├── silver/       # PySpark quality/standardization jobs
│   └── gold/         # PySpark aggregation jobs
├── libs/
│   ├── common/       # Shared config, utils, logging
│   ├── pyshell/      # PyShellJobBase for raw layer
│   └── pyspark/      # SparkSessionFactory + Medallion base classes
├── tests/
│   ├── unit/         # Fast, no-Spark tests (<1 s each)
│   └── integration/  # LocalStack + Spark integration tests
├── scripts/          # build, deploy, sync helper scripts
├── docs/             # Extended documentation
├── .devcontainer/    # VS Code dev container (awsglue + localstack)
├── .env.example      # Reference env file
└── Makefile          # Developer shortcuts

Requirements

| Tool | Version |
| --- | --- |
| Python | 3.11 |
| uv | latest |
| Docker + Docker Compose | 24+ |
| Java (for Spark) | 11 or 17 |

Quick Start

1. Clone and bootstrap

git clone https://github.com/your-org/aws-glue-etl-boilerplate.git
cd aws-glue-etl-boilerplate
make bootstrap

This sets up uv, creates .venv, installs runtime/dev dependencies, and runs a sanity test pass.

If you need a custom environment, you can still set it up manually with uv.

Optional NaNLABS baseline checks (if installed):

make check-env
make nan-health
make nan-skills

2. Set up environment variables

cp .env.example .env
# edit .env with your values

Key variables (see docs/ENVIRONMENT_VARIABLES.md for the full list):

| Variable | Default | Description |
| --- | --- | --- |
| SOURCE_NAME | public_api | Identifier for the data source |
| ENTITY_TYPE | posts | Entity being processed |
| RAW_ZONE_PATH | | S3 path for raw JSONL output |
| WAREHOUSE_PATH | | S3 path for Iceberg warehouse |
| RAW_DATABASE_NAME | raw_zone | Glue database for raw layer |
| BRONZE_DATABASE_NAME | bronze_zone | Glue database for bronze layer |
| SILVER_DATABASE_NAME | silver_zone | Glue database for silver layer |
| GOLD_DATABASE_NAME | gold_zone | Glue database for gold layer |
| API_BASE_URL | https://jsonplaceholder.typicode.com | Base URL for public API source |
| API_ENDPOINT | /posts | Endpoint path |

3. Start the local infrastructure

# Starts LocalStack (S3, Glue, SecretsManager) + SFTP server
docker compose -f .devcontainer/compose.yml up -d

Or open the project in VS Code and use Reopen in Container for the full dev container experience.


Running Jobs Locally

Use the Makefile shortcuts:

make run-raw    DATA_SOURCE=public_api ENTITY_TYPE=posts
make run-bronze DATA_SOURCE=public_api ENTITY_TYPE=posts
make run-silver DATA_SOURCE=public_api ENTITY_TYPE=posts
make run-gold   DATA_SOURCE=public_api ENTITY_TYPE=posts

Or invoke directly:

python jobs/raw/public_api_raw_job.py \
  --JOB_NAME=local_test \
  --ENTITY_TYPE=posts \
  --API_BASE_URL=https://jsonplaceholder.typicode.com

spark-submit jobs/bronze/public_api_bronze_job.py \
  --JOB_NAME=local_test \
  --ENTITY_TYPE=posts
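
The `--KEY=value` arguments shown above could be read inside a job roughly as follows. This is a sketch using the standard library's `argparse` so that it runs anywhere. On AWS Glue itself, jobs commonly read arguments with `awsglue.utils.getResolvedOptions`, and how the boilerplate's config classes parse them is not shown here. The function name `parse_job_args` is hypothetical; the defaults are copied from the variables table in the Quick Start.

```python
import argparse


def parse_job_args(argv: list[str]) -> dict[str, str]:
    """Parse Glue-style ``--KEY=value`` arguments into a plain dict."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--JOB_NAME", required=True)
    parser.add_argument("--ENTITY_TYPE", default="posts")
    parser.add_argument(
        "--API_BASE_URL", default="https://jsonplaceholder.typicode.com"
    )
    # parse_known_args tolerates extra flags that a runtime may inject.
    known, _unknown = parser.parse_known_args(argv)
    return vars(known)


args = parse_job_args(["--JOB_NAME=local_test", "--ENTITY_TYPE=posts"])
print(args)
```

Using `parse_known_args` rather than `parse_args` is a deliberate choice in this sketch: unknown flags are ignored instead of aborting the job.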

Testing

# Unit tests only (no Spark, no AWS, fast)
make test-unit
# or:
python -m pytest tests/unit/ -q

# Integration tests (includes smoke coverage; some tests may require LocalStack/Spark)
make test-integration

# Quality checks
make lint
make type-check

# Optional baseline checks
make check-env
make nan-health

See docs/TESTING.md for conventions and marker usage.
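
As an illustration of what a fast unit test with no Spark or AWS dependency might look like, the sketch below tests a small JSONL writer by passing in a stub instead of a real HTTP call. Both `write_jsonl` and the test are hypothetical examples and are not taken from tests/unit/.

```python
import io
import json
from typing import Callable, Iterable, TextIO


def write_jsonl(fetch: Callable[[], Iterable[dict]], sink: TextIO) -> int:
    """Write each fetched record as one JSON line; return the record count."""
    count = 0
    for record in fetch():
        sink.write(json.dumps(record) + "\n")
        count += 1
    return count


def test_write_jsonl_without_network() -> None:
    # A stub replaces the HTTP call, so the test needs no API, AWS, or Spark.
    def fake_fetch() -> list[dict]:
        return [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]

    sink = io.StringIO()
    assert write_jsonl(fake_fetch, sink) == 2
    lines = sink.getvalue().splitlines()
    assert [json.loads(line)["id"] for line in lines] == [1, 2]
```

Passing the fetch function and the output sink as parameters is what keeps the test independent of the network and of S3.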


Adding a New Data Source

  1. Generate job and unit test templates:

     make scaffold-source SOURCE=my_source ENTITY_TYPE=entities

  2. Implement source-specific extraction/transform logic in the generated files:

       • jobs/raw/{source}_raw_job.py
       • jobs/bronze/{source}_bronze_job.py
       • jobs/silver/{source}_silver_job.py
       • jobs/gold/{source}_gold_job.py

  3. Adapt the generated unit tests under tests/unit/jobs/test_{source}_jobs.py.

  4. Add/adjust env vars in .env.example if needed.

The config system resolves parameters automatically — no wiring needed beyond the field definitions.
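
The file names in steps 2 and 3 follow a single pattern, so a small helper can list what to expect for a given source. This sketch only derives paths from that pattern. It does not reproduce what `make scaffold-source` does, and the function name `expected_files` is hypothetical.

```python
from pathlib import PurePosixPath

LAYERS = ("raw", "bronze", "silver", "gold")


def expected_files(source: str) -> list[str]:
    """List the job and test files named in the steps above for one source."""
    jobs = [
        str(PurePosixPath("jobs") / layer / f"{source}_{layer}_job.py")
        for layer in LAYERS
    ]
    tests = [str(PurePosixPath("tests/unit/jobs") / f"test_{source}_jobs.py")]
    return jobs + tests


for path in expected_files("my_source"):
    print(path)
```

A list like this can serve as a quick checklist after scaffolding, to confirm that each layer has a job file to fill in.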


Documentation

| Doc | Description |
| --- | --- |
| docs/ARCHITECTURE.md | Medallion layer overview |
| docs/DEVELOPMENT.md | Local run examples |
| docs/ENVIRONMENT_VARIABLES.md | All supported env vars |
| docs/LIBS_STRUCTURE.md | Library layout |
| docs/MIGRATION_GUIDE.md | Private source migration checklist |
| docs/TESTING.md | Testing conventions |
| CONTRIBUTING.md | Contribution guidelines |
| AGENTS.md | AI agent usage guide |

Contributing

See CONTRIBUTING.md.