wxtrain

March 22, 2026

All-Rust weather data pipeline for ML training. Fetches GRIB from operational NWP models, decodes natively, computes derived fields, and exports training-ready datasets.

27,559 lines of Rust across 9 crates. Single binary. One cargo build --release. No eccodes, no Fortran, no C dependencies.

Quick Start

git clone <repo-url>
cd wxtrain
cargo build --release

The binary lands at target/release/wxtrain.

# List supported models
wxtrain models

# Fetch a single HRRR field via byte-range subset (~325 KB instead of ~80 MB)
wxtrain fetch model-subset \
  --model hrrr --product surface --forecast-hour 0 \
  --search "TMP:2 m above ground" \
  --output hrrr_tmp2m.grib2 --limit 1

# See what's in it
wxtrain scan-grib --file hrrr_tmp2m.grib2

# Decode message 1
wxtrain decode-grib --file hrrr_tmp2m.grib2 --message 1

# Render to PNG
wxtrain render grib --file hrrr_tmp2m.grib2 --message 1 \
  --output hrrr_tmp2m.png --colormap heat

# Download full GRIBs in parallel (24 forecast hours, 4 concurrent)
wxtrain fetch batch \
  --model hrrr --product surface \
  --forecast-hours 0-23 --output-dir ./hrrr_data/ --parallelism 4

# Compute thermodynamic quantities
wxtrain calc thermo --temperature-c 20 --dewpoint-c 10 --pressure-hpa 850
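For orientation, the kind of quantities such a command reports can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the Bolton (1980) saturation vapor pressure approximation and Poisson's equation for potential temperature, which are standard textbook formulas and not necessarily the ones wx-calc implements. The function names and constants below are assumptions made for the example, not part of wxtrain.

```python
import math

KAPPA = 0.2857   # R_d / c_p for dry air (approximate, assumed for this sketch)
EPSILON = 0.622  # molar mass ratio of water vapor to dry air (approximate)


def saturation_vapor_pressure_hpa(temp_c: float) -> float:
    """Bolton (1980) approximation; temperature in deg C, result in hPa."""
    return 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))


def relative_humidity(temp_c: float, dewpoint_c: float) -> float:
    """Vapor pressure at the dewpoint divided by saturation value at T (0..1)."""
    return saturation_vapor_pressure_hpa(dewpoint_c) / saturation_vapor_pressure_hpa(temp_c)


def mixing_ratio(dewpoint_c: float, pressure_hpa: float) -> float:
    """Water vapor mixing ratio in kg/kg."""
    e = saturation_vapor_pressure_hpa(dewpoint_c)
    return EPSILON * e / (pressure_hpa - e)


def potential_temperature_k(temp_c: float, pressure_hpa: float) -> float:
    """Poisson's equation with a 1000 hPa reference pressure; result in K."""
    return (temp_c + 273.15) * (1000.0 / pressure_hpa) ** KAPPA


if __name__ == "__main__":
    # Same inputs as the CLI example: 20 deg C, 10 deg C dewpoint, 850 hPa.
    print(f"RH    = {100 * relative_humidity(20.0, 10.0):.1f} %")
    print(f"w     = {1000 * mixing_ratio(10.0, 850.0):.2f} g/kg")
    print(f"theta = {potential_temperature_k(20.0, 850.0):.1f} K")
```

Values from wxtrain may differ slightly from this sketch, since the constants and approximations it uses are not stated here.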

Supported Models

| Model | Resolution | Grid | Source | Subsetting |
|---|---|---|---|---|
| HRRR | 3 km | 1799x1059 Lambert | NOAA NOMADS / AWS | .idx byte-range |
| GFS | 0.25 deg | 1440x721 lat-lon | NOAA NOMADS | .idx byte-range |
| NAM | 12 km | 614x428 Lambert | NOAA NOMADS | .idx byte-range |
| RAP | 13 km | 337x451 Lambert | NOAA NOMADS | .idx byte-range |
| ECMWF IFS | 0.25 deg | 1440x721 lat-lon | ECMWF Open Data | .index inventory |
| ERA5 | 0.25 deg | 1440x721 lat-lon | CDS API | Authenticated retrieval |

Byte-range subsetting uses .idx inventory files to request only the fields you need. A single HRRR surface field is 300-400 KB; the full file is ~80 MB.
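As a rough illustration of how this works: a NOAA .idx file has one line per GRIB message in the form `number:byte_offset:date:variable:level:forecast:`, so a message's byte range runs from its own offset to one byte before the next message's offset. The Python sketch below turns a search string into HTTP Range header values. The function names, the substring-matching rule, and the sample offsets are assumptions for illustration, not wx-fetch's actual logic.

```python
from typing import Optional

# Hypothetical inventory excerpt; offsets are made up for the example.
SAMPLE_IDX = """\
1:0:d=2026032200:REFC:entire atmosphere:anl:
2:375440:d=2026032200:TMP:2 m above ground:anl:
3:700123:d=2026032200:DPT:2 m above ground:anl:
"""


def parse_idx(idx_text: str) -> list:
    """Parse inventory lines into dicts with offset, variable, and level."""
    entries = []
    for line in idx_text.strip().splitlines():
        parts = line.split(":")
        if len(parts) < 6:
            continue  # skip malformed lines
        entries.append(
            {"message": parts[0], "offset": int(parts[1]),
             "variable": parts[3], "level": parts[4]}
        )
    return entries


def byte_ranges(entries: list, search: str) -> list:
    """Return (start, end) pairs for entries whose 'VAR:level' contains search.

    end is None for the last message, meaning "read to end of file".
    """
    ranges = []
    for i, entry in enumerate(entries):
        if search not in f'{entry["variable"]}:{entry["level"]}':
            continue
        start = entry["offset"]
        # The next strictly larger offset marks where this message ends.
        end = next(
            (e["offset"] - 1 for e in entries[i + 1:] if e["offset"] > start),
            None,
        )
        ranges.append((start, end))
    return ranges


def range_header(start: int, end: Optional[int]) -> str:
    """Format an HTTP Range header value (byte positions are inclusive)."""
    return f"bytes={start}-" if end is None else f"bytes={start}-{end}"


if __name__ == "__main__":
    for start, end in byte_ranges(parse_idx(SAMPLE_IDX), "TMP:2 m above ground"):
        print(range_header(start, end))
```

Sending that header with a GET request for the GRIB file returns only the matching message, which is why a single field costs a few hundred KB rather than the full file.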

ERA5 credentials are discovered from CDSAPI_URL/CDSAPI_KEY environment variables or ~/.cdsapirc.
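A minimal Python sketch of this kind of discovery follows. It assumes the environment variables take precedence over the file and that ~/.cdsapirc uses the usual `url: ...` / `key: ...` line format; the precedence order and the function names are assumptions for illustration, not a description of wx-fetch internals.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def parse_cdsapirc(text: str) -> dict:
    """Parse 'name: value' lines; split on the first colon so URLs survive."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        config[name.strip()] = value.strip()
    return config


def discover_cds_credentials(
    env: Optional[Mapping[str, str]] = None,
    rc_path: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (url, key), or None if no complete credential pair is found."""
    env = os.environ if env is None else env
    url, key = env.get("CDSAPI_URL"), env.get("CDSAPI_KEY")
    if not (url and key):
        # Fall back to the rc file for whichever value is still missing.
        path = Path.home() / ".cdsapirc" if rc_path is None else Path(rc_path)
        if path.is_file():
            config = parse_cdsapirc(path.read_text())
            url = url or config.get("url")
            key = key or config.get("key")
    return (url, key) if url and key else None


if __name__ == "__main__":
    creds = discover_cds_credentials()
    print("credentials found" if creds else "no CDS credentials configured")
```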

Architecture

| Crate | Lines | Role |
|---|---|---|
| wx-types | 153 | Domain model: grids, fields, soundings, radar volumes, dataset specs |
| wx-grib | 2,814 | GRIB1/GRIB2 scanner, inventory parser, message decoder (simple, complex+spatial differencing, JPEG 2000, CCSDS/AEC) |
| wx-fetch | 1,305 | Download planning, source templates, byte-range fetch, cache semantics, CDS retrieval |
| wx-calc | 7,391 | Thermodynamics, kinematics, severe weather indices -- 199 functions verified against MetPy |
| wx-radar | 159 | NEXRAD palette parsing, value sampling, color table decode |
| wx-render | 133 | Deterministic PNG rasterization with configurable colormaps |
| wx-export | 283 | Dataset manifests, export targets (Arrow, Parquet, WebDataset, Zarr) |
| wx-train | 948 | ML job specs, training plan expansion, NPY/shard dataset assembly |
| wx-cli | 5,057 | Single binary entrypoint -- all commands, argument parsing, orchestration |
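To give a sense of what the simplest decoder path involves: GRIB2 simple packing stores each grid value as an n-bit unsigned integer X and recovers the physical value as Y = (R + X * 2^E) / 10^D, where R is the reference value, E the binary scale factor, and D the decimal scale factor. The sketch below shows that step in Python for readability. It is not the wx-grib Rust code, and the function names are hypothetical.

```python
def unpack_bits(data: bytes, nbits: int, count: int) -> list:
    """Extract `count` big-endian unsigned integers of `nbits` bits each."""
    if nbits == 0:
        return [0] * count  # constant field: every value equals the reference
    total_bits = len(data) * 8
    if count * nbits > total_bits:
        raise ValueError("not enough packed data for the requested count")
    big = int.from_bytes(data, "big")
    mask = (1 << nbits) - 1
    return [(big >> (total_bits - (i + 1) * nbits)) & mask for i in range(count)]


def unpack_simple(
    data: bytes,
    nbits: int,
    count: int,
    reference: float,
    binary_scale: int,
    decimal_scale: int,
) -> list:
    """Apply Y = (R + X * 2**E) / 10**D to each packed integer X."""
    two_e = 2.0 ** binary_scale
    ten_d = 10.0 ** decimal_scale
    return [
        (reference + x * two_e) / ten_d
        for x in unpack_bits(data, nbits, count)
    ]


if __name__ == "__main__":
    # Four 4-bit values (0, 1, 2, 3) packed into two bytes.
    packed = bytes([0x01, 0x23])
    print(unpack_simple(packed, nbits=4, count=4,
                        reference=250.0, binary_scale=-1, decimal_scale=1))
```

The other packing schemes in the table (complex packing with spatial differencing, JPEG 2000, CCSDS/AEC) add a decompression or differencing stage before this same scaling formula is applied.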

Training Pipeline

The pipeline goes from raw GRIB to GPU-ready tensors. Job planning is architecture-aware, with a default export format and shard size per architecture:

| Architecture | Export Format | Shard Size |
|---|---|---|
| Classical ML (XGBoost) | Parquet | 16 samples |
| Swin Transformer | WebDataset | 96 samples |
| Diffusion (UNet/DiT) | WebDataset | 128 samples |
| Forecast Graph Network | Parquet | 48 samples |
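The effect of these defaults on a dataset can be sketched as a lookup plus a ceiling division. The table values come from above; the dictionary layout, the function name, and the choice to leave the last shard partially filled are assumptions for illustration, not the wx-train planner's actual behavior.

```python
import math

# Defaults from the table above, keyed by the CLI architecture identifiers.
SHARD_DEFAULTS = {
    "classical-ml": ("parquet", 16),
    "swin-transformer": ("webdataset", 96),
    "diffusion": ("webdataset", 128),
    "forecast-graph-network": ("parquet", 48),
}


def plan_shards(architecture: str, n_samples: int) -> dict:
    """Return export format, shard count, and size of the final shard."""
    export_format, shard_size = SHARD_DEFAULTS[architecture]
    n_shards = math.ceil(n_samples / shard_size)
    # Samples left for the final shard (equals shard_size when it divides evenly).
    last = n_samples - (n_shards - 1) * shard_size if n_shards else 0
    return {
        "export_format": export_format,
        "shard_size": shard_size,
        "n_shards": n_shards,
        "last_shard_samples": last,
    }


if __name__ == "__main__":
    print(plan_shards("swin-transformer", 1000))
```

For example, 1000 samples at 96 per shard need 11 shards, with 40 samples in the last one.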

Workflow

# 1. Create a job spec
wxtrain train job-init \
  --output job.json \
  --architecture swin-transformer \
  --task forecasting \
  --dataset-name hrrr_forecast

# 2. Expand to full training plan (features, shards, model recipe)
wxtrain train job-plan --spec job.json

# 3. Build NPY arrays from GRIB files
wxtrain train build-grib-sample \
  --file hrrr_surface.grib2 \
  --output-dir training_data/ \
  --colormap heat

Supported architectures: classical-ml, swin-transformer, diffusion, forecast-graph-network, custom.

Working Training Examples

Four end-to-end examples live in examples/training/. Each was tested from scratch (clone, build, fetch, train, inference) on a fresh Linux node with a Blackwell GPU:

| Script | Task | Architecture |
|---|---|---|
| train_unet.py | CAPE regression from surface fields | UNet (3-level encoder/decoder + skip connections) |
| train_classifier.py | Severe weather classification | MLP classifier |
| train_swin.py | 3-hour forecast | Swin Transformer |
| train_diffusion.py | Super-resolution | Diffusion model |

Each script calls wxtrain fetch batch and wxtrain train build-grib-sample to build its own dataset, then trains with PyTorch.
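The general shape of that orchestration can be sketched with subprocess. The flags below are the ones shown in this README; the binary path, helper names, and the example GRIB file name are hypothetical, and the real scripts in examples/training/ may be structured differently.

```python
import subprocess

WXTRAIN = "target/release/wxtrain"  # assumed path to the built binary


def fetch_batch_cmd(model, product, forecast_hours, output_dir, parallelism=4):
    """Build the argument list for `wxtrain fetch batch`."""
    return [
        WXTRAIN, "fetch", "batch",
        "--model", model, "--product", product,
        "--forecast-hours", forecast_hours,
        "--output-dir", output_dir,
        "--parallelism", str(parallelism),
    ]


def build_sample_cmd(grib_file, output_dir, colormap="heat"):
    """Build the argument list for `wxtrain train build-grib-sample`."""
    return [
        WXTRAIN, "train", "build-grib-sample",
        "--file", grib_file,
        "--output-dir", output_dir,
        "--colormap", colormap,
    ]


def run(cmd, dry_run=False):
    """Run a command and raise on failure; with dry_run, only print it."""
    if dry_run:
        print(" ".join(cmd))
        return
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run so the sketch works without the binary or network access.
    run(fetch_batch_cmd("hrrr", "surface", "0-23", "./hrrr_data/"), dry_run=True)
    # The file name below is a placeholder, not a documented output name.
    run(build_sample_cmd("./hrrr_data/example.grib2", "training_data/"), dry_run=True)
```

Passing arguments as a list avoids shell quoting problems, and `check=True` stops the script at the first failed step before training starts on incomplete data.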

Python Integration

The wxtrain_data package provides PyTorch dataset loaders for wxtrain-exported data.

pip install -e python/

Then, in Python:

from wxtrain_data import WxforgeDataset, WxforgeMultiSampleDataset

# Single sample bundle
ds = WxforgeDataset("training_data/sample_bundle")

# Multi-sample dataset
ds = WxforgeMultiSampleDataset("training_data/")

The Python package reads NPY/JSON artifacts produced by wxtrain train build-* and does not depend on the wxtrain binary at runtime.

CLI Reference

| Command | Description |
|---|---|
| wxtrain models | List all supported models with grid specs and sources |
| wxtrain fetch model-subset | Download specific fields via byte-range .idx requests |
| wxtrain fetch model-download | Download a complete GRIB file (supports ERA5 CDS auth) |
| wxtrain fetch batch | Download full GRIBs for multiple forecast hours in parallel |
| wxtrain scan-grib | List every message in a GRIB file |
| wxtrain decode-grib | Decode a message and print grid geometry, stats, sample values |
| wxtrain calc thermo | Compute thermodynamic quantities from a single observation |
| wxtrain render grib | Render a decoded GRIB message to georeferenced PNG |
| wxtrain train job-init | Create an ML job spec for a given architecture and task |
| wxtrain train job-plan | Expand a job spec into a full training plan |
| wxtrain train build-grib-sample | Build NPY arrays from decoded GRIB fields |

Timings

Measured on a desktop (Windows 11, NVMe SSD, residential internet):

| Operation | Result |
|---|---|
| Fetch single HRRR field via .idx | ~325 KB in ~1s |
| Decode 1799x1059 GRIB2 message | <10ms |
| Full end-to-end (fetch + decode + plan + build) | <5s |
| 24 forecast hours x 9 fields | ~80 MB total transfer |
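The last row can be cross-checked against the per-field sizes quoted earlier (300-400 KB per HRRR surface field, ~80 MB per full file). The short calculation below does that arithmetic. It assumes decimal units (1 MB = 1000 KB), and the function name is made up for the example.

```python
def subset_transfer_mb(hours: int, fields_per_hour: int, field_kb: float) -> float:
    """Total transfer in MB for byte-range subsets (1 MB = 1000 KB assumed)."""
    return hours * fields_per_hour * field_kb / 1000.0


HOURS, FIELDS = 24, 9
FULL_FILE_MB = 80  # approximate size of one full HRRR surface file

low = subset_transfer_mb(HOURS, FIELDS, 300)   # 216 fields at 300 KB
high = subset_transfer_mb(HOURS, FIELDS, 400)  # 216 fields at 400 KB
full = HOURS * FULL_FILE_MB                    # 24 full files

if __name__ == "__main__":
    print(f"subset: {low:.1f}-{high:.1f} MB, full files: {full} MB")
```

The subset range brackets the ~80 MB in the table, and it is roughly what a single full file costs, whereas downloading all 24 full files would transfer about 1,920 MB.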

Verification

Thermodynamic calculations are regression-tested against MetPy:

python examples/metpy_regression_suite.py
python examples/verification_suite.py

Reports are written to examples/ as JSON.

Credits

Thermodynamic calculations verified against MetPy. Color tables by Solarpower07.

License

MIT