ForgeTrain Engine
May 26, 2026
An LLM Pretraining Framework Built End-to-End by an Autonomous Agent Loop
🤖 100% AI-Authored · 🚀 50.9% MFU on H100 · 📈 +8% over Megatron-LM · 🧪 8× H100 single-host bring-up validated
Subproject of the ForgeTrain monorepo.
⭐ Star this repo if you find it useful · 🤝 Built on the shoulders of CUTLASS, FlashAttention & TransformerEngine
TrainingEngine (8B) is a single-host LLM pretraining framework targeting eight H100 SXM5 GPUs (SM90a). The framework code (Python + CuTeDSL, excluding helpers ported from upstream NVIDIA CuTeDSL) was written, debugged, and optimized end-to-end by an AI Agent Loop, with zero manual code edits. Using MiniCPM4-8B as the target workload, it runs the end-to-end training loop on 8× H100 in a tensor_model_parallel_size = 2 / data_parallel_size = 4 layout and delivers a ~8% relative MFU lift over the Megatron-LM baseline (50.9% vs ~47%) at GAS=8 (grad_accum_steps = 8); multi-node scale-out is future work. Data loading goes through HuggingFace datasets: any Hub dataset or local Parquet / Arrow / JSON works with a one-line CLI flag.
✨ Highlights
🤖 Fully AI-Authored
- 100% Agent-Loop Authored — every line of framework code (Python + CuTeDSL), every stress test, and every operator wrapper was produced by an AI Agent running in auto-loop mode. Humans only supplied the training objective and the hardware budget; no manual code edits, no manual hyperparameter tuning, no manual bug fixes.
- Self-Diagnosing Agent Loop — Driven by a harness, the Agent autonomously executes the full loop: read baseline scripts / milestones → implement → launch a job → read logs → locate root cause → patch code — all without human intervention for debugging.
🚀 Faster than Megatron-LM
- MFU 50.9%, ~8% above the Megatron-LM baseline (MFU ~47%) at the production cadence (`micro_batch_size = 2`, `grad_accum_steps = 8`, `seq_length = 4096`, TP=2 / DP=4 on 8× H100).
- Custom CuTeDSL kernels: purpose-built SM90a kernels cover the three hottest call sites, namely `gemm_fc1` (SwiGLU column-parallel GEMM), `gemm_output` (LM-head column-parallel GEMM), and a flash-attention forward DSL (`flash_attn_dsl`). Together they cover the inner-loop bottleneck and leave the remaining call sites on the baseline `torch.matmul` path unchanged; custom kernels for the other GEMMs and a flash-attention backward DSL are future work.
- Self-explored optimization space: in auto-loop mode the Agent enumerated and benchmarked CuTeDSL / cuBLAS / SDPA operator variants plus per-shape kernel-template parameters (stage count, swizzle, cluster mode, epilogue overlap), measuring both MFU and loss alignment; the production defaults are the best configuration the Agent found across the full grid.
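As a quick sanity check on the numbers above, the batch layout and the relative MFU lift can be reproduced with plain arithmetic. This is an illustrative sketch, not code from the repo: the variable names mirror the settings quoted in this README, the baseline uses the rounded "~47%" figure, and the relation global batch = micro batch × grad-accum steps × data-parallel size is the usual convention, assumed here.

```python
# Illustrative arithmetic only; names mirror the README's settings, not the repo's code.
micro_batch_size = 2
grad_accum_steps = 8
data_parallel_size = 4   # 8 GPUs / tensor_model_parallel_size of 2
seq_length = 4096

# Assumed convention: each data-parallel replica processes
# micro_batch_size * grad_accum_steps sequences per optimizer step.
global_batch_size = micro_batch_size * grad_accum_steps * data_parallel_size
tokens_per_step = global_batch_size * seq_length

engine_mfu = 50.9     # percent, as quoted in this README
baseline_mfu = 47.0   # percent, README quotes "~47%" (rounded)
relative_lift = (engine_mfu / baseline_mfu - 1.0) * 100.0

print(global_batch_size)        # 64
print(tokens_per_step)          # 262144
print(round(relative_lift, 1))  # 8.3
```

The result of 64 matches the `--global-batch-size 64` used in the commands in this README, and the "~8%" figure is a relative lift (about 3.9 percentage points in absolute MFU).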
🤖 Agent-Friendly Quick Deploy
This repo was produced by an AI Agent and is friendliest to AI Agents. Paste the prompt below into Cursor / Claude Code / Codex / Cline — it will read the README, install dependencies, run the smoke test and report the MFU, without you typing commands one at a time.
🟢 5-step minimal pretraining demo (fastest install check)
Following https://raw.githubusercontent.com/OpenBMB/ForgeTrain/main/exports/train_engine_8b/README.md,
run a 5-step minimal pretraining demo on the current 8× H100 node:
1. Check the environment (Python ≥ 3.11, CUDA ≥ 12.x, H100 SXM5, PyTorch ≥ 2.4)
and install anything missing;
2. Install the repo: pip install -e . and HF deps:
pip install datasets transformers sentencepiece;
3. Import smoke test:
PYTHONPATH=src python -c "from training_engine_tensor import config; print('OK')"
4. Run 5 steps on HF FineWeb (an open, downloadable English webtext
pretraining corpus; the ``sample-10BT`` subset is ~10 B tokens,
suitable for the 8B pretraining distribution):
export CUSTOM_GEMM=1 OP_ATTENTION=v1
torchrun --standalone --nproc-per-node=8 \
-m training_engine_tensor pretrain \
--num-steps 5 --global-batch-size 64 \
--micro-batch-size 2 --grad-accum-steps 8 \
--seq-length 4096 \
--hf-dataset HuggingFaceFW/fineweb \
--hf-dataset-config sample-10BT \
--hf-text-field text \
--tokenizer-path openbmb/MiniCPM4-8B
5. Print the final loss, step time, and MFU.
If anything fails, dig into the source on your own — do not ask me.
Directory Layout
train_engine/
  src/
    training_engine_tensor/          # Framework core (Python + CuTeDSL)
      __main__.py                    # CLI entry (pretrain subcommand)
      entry.py                       # Training loop driver
      forward.py / backward.py       # Forward / backward
      optimizer.py / parameters.py   # Adam + parameter management
      nccl.py                        # NCCL collectives
      kernels.py / custom_gemm.py    # Operator dispatch shims
      engine_config.py               # Config hub (EngineConfig frozen dataclass)
      op_dispatcher.py               # Per-operator version dispatch
      ENV_WHITELIST.md               # Process-external env var whitelist
      ops/                           # Custom operator subpackage
        gemm_fc1/                    # SwiGLU column-parallel GEMM (CuTeDSL v1)
          kernel.py / _cute_kernel.py / register.toml
        gemm_output/                 # LM-head column-parallel GEMM (CuTeDSL v1)
          kernel.py / register.toml
        gemm_qkv_proj/               # baseline-only (register.toml stub)
        gemm_attn_out_proj/          # baseline-only
        gemm_fc2/                    # baseline-only
    flash_attn_dsl/                  # SM90a flash-attention fwd DSL kernel
    quack/                           # CuTeDSL helpers used by flash_attn_dsl
  tests/                             # Operator stress tests (bare-metal pytest)
  scripts/                           # Training entries & tools
    entry_hf_pretrain.sh             # HuggingFace dataset training entry
  model_spec.toml                    # Model & training spec (L1 SSOT)
  pyproject.toml                     # Package definition
🚀 Quick Start
Short form:
Python ≥ 3.11 · CUDA ≥ 12.x · PyTorch ≥ 2.4 · H100 SXM5 (SM90a) · nvidia-cutlass-dsl ≥ 4.4. Full environment manifest (hardware / driver / dep wheels / env var whitelist) lives in `docs/environment.md`.
1. Install
git clone https://github.com/OpenBMB/ForgeTrain.git
cd ForgeTrain/exports/train_engine_8b
pip install -e .
# HuggingFace data path (required)
pip install datasets transformers sentencepiece
2. Verify install
PYTHONPATH=src python -c "from training_engine_tensor import config; print('OK')"
3. Run the operator stress tests (recommended before the first training job)
pytest tests/ -v
The three smoke tests under tests/ exercise the production GEMM /
attention kernels under concurrent neighbour workloads (cuBLAS GEMMs,
HBM memcpy, TMA-heavy GEMMs, Hopper cluster-mode GEMMs, cross-stream
event chaos, alloc churn). Each takes ~30 s by default; raise
STRESS_DURATION_S for the long-run profile.
4. Single-node training (8× H100, with a HuggingFace dataset)
The recommended open pretraining corpus is
HuggingFaceFW/fineweb
(its sample-10BT config is a ~10 B-token webtext subset).
export HF_DATASET=HuggingFaceFW/fineweb
export HF_DATASET_CONFIG=sample-10BT
export HF_TEXT_FIELD=text
export TOKENIZER_PATH=<YOUR_TOKENIZER>
# Optional: enable the engine's CuTeDSL kernels (gemm_fc1 + gemm_output + FA fwd).
export CUSTOM_GEMM=1
export OP_ATTENTION=v1
scripts/entry_hf_pretrain.sh
HF_DATASET accepts either a HuggingFace Hub name (e.g.
HuggingFaceFW/fineweb) or a local dataset directory (Parquet /
Arrow / JSON / JSONL). Pick one of two ways to declare the text field:
- `HF_TEXT_FIELD=<COLUMN>`: take a single column directly (e.g. fineweb's `text`)
- `HF_TEXT_TEMPLATE="..."`: concatenate multiple columns with Python `.format` syntax (e.g. `"{title}\n\n{content}"`, suitable for datasets whose text is split across multiple columns)
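The two ways of declaring the text field can be sketched on a single dataset row as follows. This is a minimal illustration, not the engine's dataloader: `render_text` and its parameter names are hypothetical, and treating the two options as mutually exclusive is an assumption based on the "pick one of two" wording.

```python
def render_text(row, text_field=None, text_template=None):
    """Return the training text for one dataset row (hypothetical helper)."""
    # Assumption: exactly one of the two options may be set.
    if (text_field is None) == (text_template is None):
        raise ValueError("set exactly one of text_field / text_template")
    if text_field is not None:
        # Single-column mode: take the column as-is.
        return row[text_field]
    # Template mode: fill {column} placeholders from the row's columns.
    return text_template.format_map(row)


row = {"title": "Hopper", "content": "SM90a notes", "text": "plain body"}
single = render_text(row, text_field="text")
joined = render_text(row, text_template="{title}\n\n{content}")
print(single)
print(repr(joined))
```

Template mode only touches the columns named in the placeholders, so extra columns in the row are ignored.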
For finer control, invoke the CLI directly:
PYTHONPATH=src \
torchrun --standalone --nproc-per-node=8 \
-m training_engine_tensor pretrain \
--num-steps 200 \
--global-batch-size 64 \
--micro-batch-size 2 --grad-accum-steps 8 \
--seq-length 4096 \
--hf-dataset HuggingFaceFW/fineweb \
--hf-dataset-config sample-10BT \
--hf-text-field text \
--tokenizer-path <YOUR_TOKENIZER>
Run python -m training_engine_tensor pretrain --help to see the full
flag surface; every EngineConfig field is auto-mirrored as a CLI flag.
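One common way to mirror dataclass fields as CLI flags is to walk `dataclasses.fields` and register one `argparse` option per field. The sketch below shows that pattern on a hypothetical `DemoConfig`; the real `EngineConfig` field list and its mirroring logic live in the repo and may differ.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DemoConfig:
    # Hypothetical stand-in for EngineConfig; fields chosen for illustration.
    num_steps: int = 5
    micro_batch_size: int = 2
    seq_length: int = 4096
    tokenizer_path: str = ""


def build_parser(config_cls):
    """Register one --kebab-case flag per dataclass field."""
    parser = argparse.ArgumentParser(prog="demo")
    for f in fields(config_cls):
        flag = "--" + f.name.replace("_", "-")
        # f.type doubles as the converter because this demo only uses
        # plain int / str annotations.
        parser.add_argument(flag, type=f.type, default=f.default, dest=f.name)
    return parser


def parse_config(config_cls, argv):
    """Parse argv and build an immutable config instance."""
    namespace = build_parser(config_cls).parse_args(argv)
    return config_cls(**vars(namespace))


cfg = parse_config(DemoConfig, ["--num-steps", "200", "--seq-length", "4096"])
print(cfg)
```

Deriving the flags from the dataclass keeps a single source of truth for names, types, and defaults, which fits the SSOT rule described later in this README.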
📦 Models & Versions
Supported models
| Model | Params | Architecture | HuggingFace | ModelScope |
|---|---|---|---|---|
| MiniCPM4-8B | 8 B | 32-layer Transformer · GQA (32Q/2KV) · SwiGLU · RoPE | 🤗 link | link |
Architectural constants (L1 SSOT in model_spec.toml):
hidden_size = 4096, num_layers = 32, num_attention_heads = 32,
num_query_groups = 2, head_dim = 128, ffn_hidden_size = 16384,
seq_length = 4096, vocab_size = 73448.
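These constants are consistent with the "8 B" label, which a rough weight count can confirm. The sketch below is an estimate under stated assumptions, not the repo's accounting: it ignores biases and normalization weights, assumes the standard three-matrix SwiGLU MLP (gate, up, down), and reports both tied and untied embeddings because this README does not say which applies.

```python
hidden_size = 4096
num_layers = 32
num_attention_heads = 32
num_query_groups = 2
head_dim = 128
ffn_hidden_size = 16384
vocab_size = 73448

# Attention with GQA: full-width Q and output projections, narrow K and V.
q_proj = hidden_size * num_attention_heads * head_dim
kv_proj = 2 * hidden_size * num_query_groups * head_dim
out_proj = num_attention_heads * head_dim * hidden_size
attention = q_proj + kv_proj + out_proj

# SwiGLU MLP: gate, up, and down projections (assumed layout).
mlp = 3 * hidden_size * ffn_hidden_size

per_layer = attention + mlp
embedding = vocab_size * hidden_size

total_tied = num_layers * per_layer + embedding   # LM head shares the embedding
total_untied = total_tied + embedding             # separate LM-head matrix

print(f"{total_tied / 1e9:.2f} B (tied), {total_untied / 1e9:.2f} B (untied)")
```

Either way the total lands close to 8 B, and the MLP accounts for most of each layer's weights, which is in line with `gemm_fc1` being one of the hottest call sites.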
Hardware
| GPU | Arch | Status |
|---|---|---|
| NVIDIA H100 SXM5 (80 GB) | SM90a (Hopper) | ✅ 8× H100 single-host bring-up + MFU validated (≠ full pretrain) |
Operator Stack
The engine ships three self-developed CuTeDSL kernels; the remaining
GEMM call sites run the baseline torch.matmul path unconditionally so
the engine stays correct even when CUSTOM_GEMM=0:
| Op | Default | Production source | Replaces |
|---|---|---|---|
| `gemm_fc1` | v1 (CuTeDSL) | `src/training_engine_tensor/ops/gemm_fc1/` | SwiGLU column-parallel GEMM |
| `gemm_output` | v1 (CuTeDSL) | `src/training_engine_tensor/ops/gemm_output/` | LM-head column-parallel GEMM |
| flash-attention fwd | v1 (DSL) | `src/flash_attn_dsl/` | SM90a flash-attention fwd |
| `gemm_qkv_proj` | baseline | `torch.matmul` | - |
| `gemm_attn_out_proj` | baseline | `torch.matmul` | - |
| `gemm_fc2` | baseline | `torch.matmul` | - |
The dispatcher (op_dispatcher.get_op_version) reads each operator's
register.toml at import time and resolves the active version from the
declared env var:
| Env var | Production default | Description |
|---|---|---|
| `CUSTOM_GEMM` | 1 | Master switch; also forces `OP_GEMM_FC1=v1` + `OP_GEMM_OUTPUT=v1` |
| `OP_ATTENTION` | v1 | Enables the SM90a flash-attn DSL forward |
| `OP_GEMM_FC1` | v1 | `gemm_fc1` CuTeDSL kernel |
| `OP_GEMM_OUTPUT` | v1 | `gemm_output` CuTeDSL kernel |
| `OP_GEMM_FC2` | baseline (fixed) | `torch.matmul` fallback |
| `OP_GEMM_QKV_PROJ` | baseline (fixed) | `torch.matmul` fallback |
| `OP_GEMM_ATTN_OUT_PROJ` | baseline (fixed) | `torch.matmul` fallback |
Setting any OP_<NAME>=baseline is always a safe fall-back to the eager
PyTorch path; the engine boots correctly even with every custom op
disabled.
CLI
python -m training_engine_tensor pretrain ...
`pretrain`: random-init pretraining with the HF dataloader (`--checkpoint-root` warm-starts from a Megatron-format checkpoint; `--save-checkpoint-dir` writes a resumable shard set after the final step).
For env-var configuration see src/training_engine_tensor/ENV_WHITELIST.md.
Anything not on the whitelist must flow through EngineConfig.
🛡️ Code Quality
The framework code produced by the Agent Loop follows a strict contract. The repo's stress tests + AST guard turn it into an executable gate — a red CI run or a failing end-to-end training job counts as a violation.
🚨 RED LINE — Zero Tolerance
| Red line | One-liner | Repo evidence |
|---|---|---|
| SSOT | Every fact has exactly one authoritative source | EngineConfig owns every behavioural knob; model_spec.toml L1→L2→L3 one-way |
| DAG | Module deps are acyclic; lower layers never depend on upper layers | engine_config → config → kernels → fwd/bwd → entry is one-way |
| Fail Fast | Errors raise immediately; no layered try/catch or silent fallback | ops/gemm_*/kernel.py refuse to swallow CuTeDSL JIT errors |
| Minimal Public Surface | Hidden by default; every public must justify itself with a real caller | All modules declare __all__; ENV_WHITELIST.md is the only outward env contract |
| TDD | Failing test first, then minimum code to pass | tests/ ships operator stress tests gated on real H100 hardware |
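To make the idea of an AST guard concrete, here is a minimal sketch that checks two of the red lines on a module's source: a top-level `__all__` declaration (Minimal Public Surface) and the absence of bare `except:` handlers (Fail Fast). The `check_module` function and its rules are hypothetical; the repo's actual guard is not shown in this README and may enforce different or additional checks.

```python
import ast


def check_module(source, filename="<module>"):
    """Return a list of contract violations found in one module's source."""
    tree = ast.parse(source, filename=filename)
    violations = []

    # Minimal Public Surface: require a plain top-level `__all__ = ...`.
    # (Annotated assignments are not handled in this sketch.)
    declares_all = any(
        isinstance(node, ast.Assign)
        and any(isinstance(t, ast.Name) and t.id == "__all__" for t in node.targets)
        for node in tree.body
    )
    if not declares_all:
        violations.append(f"{filename}: missing __all__")

    # Fail Fast: a bare `except:` can swallow any error, so flag it.
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            violations.append(f"{filename}:{node.lineno}: bare except")
    return violations


good = '__all__ = ["run"]\n\ndef run():\n    return 1\n'
bad = "def run():\n    try:\n        return 1\n    except:\n        return 0\n"
good_report = check_module(good, "good.py")
bad_report = check_module(bad, "bad.py")
print(good_report)
print(bad_report)
```

A guard like this can run in CI over every file under `src/`, failing the build when the returned list is non-empty.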
Development process
- Root Cause, No Workaround — every fix targets the root cause; no band-aids or symptomatic patches
- Design Before Implementation — design first, then implement, no matter how trivial the change
- Development Completion — finish implementation + verification + tests in a single execution pass; do not stop mid-flight to ask whether to continue
- No Backward Compatibility by Default — assume unpublished, fast-iteration; no migration code