# LoongForge
English | 中文
A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.
🚀 Up to 5.0× training speedup · 🔌 Native NVIDIA GPU & Kunlun XPU support
## 💡 Why LoongForge?
🐉 LoongForge is part of Baidu Baige's Loong open-source series, named after the traditional Chinese loong boat (龙舟), a symbol of coordinated power and forward momentum.
LoongForge is a unified training framework for LLMs, VLMs, VLAs, and diffusion models, covering pre-training, continued pre-training, and SFT. Built on Megatron-LM with deep system-level enhancements across model coverage, training performance, and hardware support, it delivers significant speedups over mainstream open-source baselines.
Before going open source, LoongForge was developed as AIAK-Training-LLM, Baidu Baige's training-acceleration stack. It has supported production training for enterprise customers across education, computer vision, and embodied AI, typically delivering a 30–50% speedup over customer baselines, with the largest production runs spanning 5,000+ XPUs.
## 🔥 Latest News
- [2026/05] 🎉 v0.1.0 – first official tagged release of LoongForge.
- [2026/05] 🚀 Powered the training and public release of LLaVA-OneVision-2.0.
- [2026/05] 🤖 Expanded VLA coverage with GR00T N1.6; 60%+ speedup on Pi0.5 and GR00T training.
- [2026/04] 🧩 Added training support for MiniMax-M2.7 on both NVIDIA GPU and Kunlun XPU.
- [2026/04] 🌟 LoongForge source code publicly available on GitHub. [blog]
- [2025/10] 🚀 Powered the training and public release of LLaVA-OneVision-1.5 under AIAK-Training-LLM, the predecessor of LoongForge. [blog]
## ⚡ Quick Start
See the full documentation for installation, tutorials, and advanced usage: English · 中文.
1. Install – via Docker (prebuilt images coming soon) or source build:
- NVIDIA GPU: Installation Guide
- Kunlun XPU: Installation Guide
2. Launch your first training run – follow a tutorial for your target hardware and modality:
- NVIDIA GPU: LLM · VLM · VLA · Diffusion (WAN)
- Kunlun XPU: Kunlun XPU Tutorials
3. Explore – browse `configs/models/` and `examples/` / `examples_xpu/` for ready-to-run scripts.
## ✨ Key Features
- 🧩 Flexible Multi-Modal Composition – Configuration-driven assembly of VLMs from interchangeable ViT and LLM components.
- ⚡ Heterogeneous Parallelism – Independent TP / DP / recompute per model component (e.g., ViT vs. LLM) for optimal throughput and memory.
- 🔀 Decoupled Encoder-Decoder Training – Separates ViT and LLM into independent tasks, eliminating encoder-induced pipeline bubbles.
- ⚖️ DP Load Balancing – Load-aware data redistribution mitigates sequence-packing imbalance, improving multi-node scaling efficiency (see the first sketch after this list).
- 🔁 MoE-Native Optimization – Overlapped All2All / activation offload / compute, with further memory reduction beyond upstream Megatron-LM on DeepSeek-V3, Qwen3-MoE, etc.
- 🔬 Adaptive FP8 Training – End-to-end FP8 for LLMs and VLMs with standard blockwise FP8; an optional adaptive mode picks per-operator precision by GEMM shape and efficiency (see the second sketch below).
- 🔧 Custom Fused Operators – Fused kernels like FusedDSA for DSA-style models; the TileLang version is open-sourced, and a high-performance CUDA version is available on the Baidu Baige platform.
- 🔄 Flexible Checkpointing – Offline bidirectional Megatron ↔ HuggingFace conversion plus native online HF load/save; no format barriers across your workflow.
- 🧰 Versatile Pipelines & Data Tools – Out-of-the-box Pretrain / MidTrain / SFT / LoRA, with built-in dataset format conversion and sequence packing.
- 🔌 Heterogeneous Hardware – Native support for NVIDIA GPUs and Kunlun XPUs via a minimally intrusive plugin design.
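The DP load-balancing idea is easiest to see in miniature. The sketch below is not LoongForge's implementation (the function name, cost model, and greedy strategy are all illustrative assumptions); it only shows how variable-length packed sequences can be redistributed so every data-parallel rank carries a similar compute load, approximating per-sequence cost as length squared since attention cost grows roughly quadratically with sequence length:

```python
# Minimal sketch of load-aware DP redistribution (hypothetical, not LoongForge's API).
# A rank's load is approximated by the sum of len**2 over its assigned sequences.
import heapq

def balance_across_dp_ranks(seq_lens: list[int], dp_size: int) -> list[list[int]]:
    """Greedily assign sequences (longest first) to the currently least-loaded rank."""
    # Min-heap of (load, rank); heapq always pops the least-loaded rank.
    heap = [(0, rank) for rank in range(dp_size)]
    heapq.heapify(heap)
    buckets: list[list[int]] = [[] for _ in range(dp_size)]
    for length in sorted(seq_lens, reverse=True):  # longest-first improves balance
        load, rank = heapq.heappop(heap)
        buckets[rank].append(length)
        heapq.heappush(heap, (load + length * length, rank))
    return buckets

if __name__ == "__main__":
    lens = [32768, 1024, 512, 16384, 2048, 8192, 4096, 1024]
    for rank, bucket in enumerate(balance_across_dp_ranks(lens, dp_size=4)):
        print(f"rank {rank}: lens={bucket}, load={sum(l * l for l in bucket):,}")
```

Longest-first greedy assignment is a standard scheduling heuristic that keeps the most-loaded rank close to the average, and under data parallelism the most-loaded rank is what determines step time.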
📚 Deep-dive: LLM features · VLM features
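For the adaptive FP8 mode, this README only states that precision is chosen per operator from GEMM shape and efficiency. The following is a rough, hypothetical illustration of that kind of decision rule; the function and thresholds are invented for exposition and do not reflect LoongForge's actual logic:

```python
# Hypothetical shape-based precision selection for a GEMM of size (m, n, k).
# Small or skinny GEMMs are often bound by quantization/cast overhead and
# memory traffic rather than math throughput, so FP8 may not pay off there.
# All thresholds below are made up for illustration.

def pick_gemm_precision(m: int, n: int, k: int,
                        min_dim: int = 512,
                        min_flops: float = 2**34) -> str:
    """Return 'fp8' when the GEMM is large enough to amortize cast/scale
    overhead, otherwise fall back to 'bf16'."""
    flops = 2.0 * m * n * k          # multiply-accumulate count
    if min(m, n, k) < min_dim:       # skinny GEMM: casts dominate
        return "bf16"
    if flops < min_flops:            # too little work to amortize overhead
        return "bf16"
    return "fp8"

if __name__ == "__main__":
    print(pick_gemm_precision(8192, 8192, 8192))  # -> fp8
    print(pick_gemm_precision(8192, 128, 8192))   # -> bf16 (skinny)
```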
## 📊 Benchmark
Measured in v0.1.0 across LLM, VLM, and VLA workloads against mainstream open-source training baselines:
| Model | Type | Baseline | Configuration | Speedup |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE | Megatron-LM† | 32 × A800‡ · GBS 1024 · 32K | 1.16× |
| Qwen3-VL-30B-A3B | VLM | VeOmni† | 32 × A800‡ · GBS 128 · 32K | 1.45× |
| GR00T N1.6 | VLA | LeRobot† | 8 × A800‡ · GBS 128 · 224×224 | 1.60× |
| Pi0.5 | VLA | OpenPI† | 8 × A800‡ · GBS 112 · 224×224 | 1.65× |
Due to test-bed scale limits, DeepSeek-V3.2 was validated separately on a reduced-layer configuration; LoongForge's DSA operator-level optimizations still deliver a ~5× speedup over Megatron-LM and reach a 64K sequence length, where the baseline OOMs beyond 8K.
† Numbers reflect baseline and LoongForge versions at the time of measurement, and may evolve as implementations change.
‡ Validation on additional hardware is rolling out in upcoming releases.
## 🚀 Powered by LoongForge
- LLaVA-OneVision-2.0 – Next-generation multimodal model, with new VideoCaption and Spatial datasets.
- LLaVA-OneVision-1.5 – Fully open framework for democratized multimodal training.
- Qianfan-VL – Domain-enhanced vision-language models for enterprise, 3B to 70B parameters.
## 🗂️ Supported Models
LoongForge supports a broad range of state-of-the-art models across LLM, VLM, diffusion, and VLA.
| Modality | Architectures | Models |
|---|---|---|
| LLM | DeepSeek-V2 | deepseek-v2-lite, deepseek-v2 |
| | DeepSeek-V3 | deepseek-v3, deepseek-v32 |
| | LLaMA2 | llama2-7b, llama2-13b, llama2-70b |
| | LLaMA3 | llama3-8b, llama3-70b |
| | LLaMA3.1 | llama3.1-8b, llama3.1-70b, llama3.1-405b |
| | Qwen | qwen-1.8b – qwen-72b |
| | Qwen1.5 | qwen1.5-0.5b – qwen1.5-72b |
| | Qwen2 | qwen2-0.5b – qwen2-72b |
| | Qwen2.5 | qwen2.5-0.5b – qwen2.5-72b |
| | Qwen3 | qwen3-0.6b – qwen3-480b-a35b, qwen3-coder-30b-a3b |
| | Qwen3-Next | qwen3-next-80b-a3b |
| | MiniMax | minimax-m2.1, minimax-m2.5, minimax-m2.7 |
| | MIMO | mimo-7b |
| | GLM | glm5 |
| VLM | Qwen2.5-VL | qwen2.5-vl-3b – qwen2.5-vl-72b |
| | Qwen3-VL | qwen3-vl-30b-a3b, qwen3-vl-235b-a22b |
| | Qwen3.5 | qwen3.5-0.8b – qwen3.5-397b-a17b |
| | Qwen3.6 | qwen3.6-27b, qwen3.6-35b-a3b |
| | ERNIE4.5-VL | ernie4.5vl-28b-a3b |
| | LLaVA-OneVision-1.5 | llava-onevision-1.5-4b |
| | InternVL2.5 | internvl2.5-8b – internvl2.5-78b |
| | InternVL3.5 | internvl3.5-8b – internvl3.5-241b-a28b |
| | CustomCombinedModel | Flexible ViT + LLM backbone configuration (example; see the sketch below the table) |
| Diffusion | WAN2.2 | wan2.2_i2v_a14b |
| VLA | Pi | pi0.5 |
| | GR00T | groot-n1.6 |
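For CustomCombinedModel, the actual schema lives in the Hydra YAML configs under `configs/models/`; the Python dict below is only a hypothetical mock-up (every key and value is invented) of what configuration-driven ViT + LLM composition with per-component parallelism settings expresses:

```python
# Hypothetical composition config, sketched as a Python dict for illustration.
# Real configs are Hydra YAML under configs/models/; these keys are made up.
custom_vlm = {
    "vision_encoder": {
        "arch": "vit",               # interchangeable ViT component
        "hidden_size": 1152,
        "tensor_parallel_size": 1,   # per-component TP (heterogeneous parallelism)
        "recompute": False,
    },
    "language_model": {
        "arch": "qwen3",             # interchangeable LLM backbone
        "hidden_size": 4096,
        "tensor_parallel_size": 4,   # larger TP for the heavier component
        "recompute": True,
    },
    "projector": {"type": "mlp", "layers": 2},  # glue between ViT and LLM
}
```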
## 📍 Roadmap
### Model Support
- LLM / VLM: ongoing validation and release of new models (e.g., DeepSeek-V4, Kimi 2.5 / 2.6)
- Embodied AI: expanded WAM coverage (e.g., DreamZero, LingBot VA)
### Performance & Scaling
- Adopt next-generation techniques introduced with DeepSeek-V4
- Advanced MoE load-balancing strategies
- Long-sequence training with ChunkPipe scheduling and Context Parallelism
- Further diffusion-model acceleration (e.g., WAN)
- INT4 quantization-aware training
- MTP (Multi-Token Prediction) scaling for speculative decoding (a minimal sketch of the idea follows this list)
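As background for the MTP roadmap item: multi-token prediction trains auxiliary heads to predict tokens more than one position ahead, whose outputs can later serve as draft tokens for speculative decoding. Below is a minimal PyTorch sketch of the combined loss, assuming a single extra head that looks two tokens ahead; this is a simplification for illustration, not LoongForge's design:

```python
# Minimal MTP sketch: alongside the standard next-token head, an auxiliary head
# predicts the token two positions ahead; the losses are combined with a weight.
import torch
import torch.nn.functional as F

def mtp_loss(hidden, next_head, mtp_head, tokens, mtp_weight=0.3):
    """hidden: [batch, seq, d] final hidden states; tokens: [batch, seq] ids."""
    logits1 = next_head(hidden[:, :-1])   # predict t+1 from position t
    loss1 = F.cross_entropy(logits1.reshape(-1, logits1.size(-1)),
                            tokens[:, 1:].reshape(-1))
    logits2 = mtp_head(hidden[:, :-2])    # predict t+2 from position t
    loss2 = F.cross_entropy(logits2.reshape(-1, logits2.size(-1)),
                            tokens[:, 2:].reshape(-1))
    return loss1 + mtp_weight * loss2     # auxiliary loss is down-weighted

if __name__ == "__main__":
    d, vocab = 64, 1000
    hidden = torch.randn(2, 16, d)
    tokens = torch.randint(0, vocab, (2, 16))
    next_head = torch.nn.Linear(d, vocab)
    mtp_head = torch.nn.Linear(d, vocab)
    print(mtp_loss(hidden, next_head, mtp_head, tokens).item())
```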
## 🏗️ Repository Layout
📁 Directory tree:

```
LoongForge/
├── loongforge/                      # Core training framework
│   ├── train/                       # Training entry points & trainers
│   │   ├── pretrain/                # Pretrain (LLM, VLM)
│   │   ├── sft/                     # SFT (LLM, VLM, InternVL, ERNIE)
│   │   ├── diffusion/               # Diffusion (WAN)
│   │   └── embodied/                # Embodied AI (Pi0.5, GR00T)
│   ├── models/                      # Unified model abstractions
│   │   ├── foundation/              # LLM backbones (LLaMA, Qwen, DeepSeek, ...)
│   │   ├── encoder/                 # Vision encoders (ViT, Qwen-VL, InternVL, ...)
│   │   ├── omni_models/             # Multi-modal composition
│   │   ├── diffusion/               # Diffusion models (WAN)
│   │   ├── embodied/                # Embodied models (Pi0.5, GR00T)
│   │   └── common/                  # Shared layers and utilities
│   ├── data/                        # Data pipelines (multi-modal, video, DP balance)
│   ├── tokenizer/                   # Tokenizers
│   └── utils/                       # Config map, constants, etc.
├── third_party/Loong-Megatron/      # Patched Megatron-LM (git submodule)
├── configs/                         # Hydra YAML configs (models, data)
├── examples/                        # GPU launch scripts
├── examples_xpu/                    # Kunlun XPU launch scripts
├── tools/                           # Checkpoint conversion, data preprocessing
├── ops/                             # Custom fused operators (incl. open-sourced TileLang)
├── patches/                         # TransformerEngine patches
├── docker/                          # Dockerfiles (GPU & XPU)
├── tests/                           # E2E test suite (YAML-driven)
└── docs/                            # Documentation
```
## 🤝 Contributing
We warmly welcome community contributions: bug reports, feature proposals, and PRs alike. Please read our Contributing Guidelines before submitting.
## 📄 License
LoongForge is released under the Apache License 2.0. Some files are derived from third-party open-source projects; please refer to the specific file headers for their respective copyright and attribution.
## 📝 Citation
```bibtex
@software{LoongForge2026,
  title  = {LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models},
  author = {{The LoongForge Authors}},
  year   = {2026},
  url    = {https://github.com/baidu-baige/LoongForge}
}
```
## 🙏 Acknowledgments
LoongForge is built upon NVIDIA's Megatron-LM. We also drew inspiration from several excellent open-source projects, including but not limited to HuggingFace Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.
## 💬 Contact
Open a GitHub issue for questions, feedback, or feature requests, or scan the WeChat QR code below to join our developer community.