
LoongForge

A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.

🚀 Up to 5.0× training speedup · 🌐 Native NVIDIA GPU & Kunlun XPU support

💡 Why LoongForge?

🐉 LoongForge is part of Baidu Baige's Loong open-source series — named after the traditional Chinese loong boat (龙舟), a symbol of coordinated power and forward momentum.

LoongForge is a unified training framework for LLMs, VLMs, VLAs, and diffusion models, covering pre-training, continued pre-training, and SFT. Built on Megatron-LM with deep, systematic enhancements to model coverage, training performance, and hardware support, it delivers significant speedups over mainstream open-source baselines.

Before going open source, LoongForge was developed as AIAK-Training-LLM, Baidu Baige's training-acceleration stack. It has supported production training for enterprise customers across Education, Computer Vision, and Embodied AI, typically delivering a 30%–50% speedup over customer baselines, with the largest production runs reaching 5,000+ XPUs.

🔥 Latest News

  • [2026/05] 🎉 v0.1.0 — first official tagged release of LoongForge.
  • [2026/05] 🌟 Powered the training and public release of LLaVA-OneVision-2.0.
  • [2026/05] 🤖 Expanded VLA coverage with GR00T N1.6; 60%+ speedup on Pi0.5 and GR00T training.
  • [2026/04] 🧩 Added training support for MiniMax-M2.7 on both NVIDIA GPU and Kunlun XPU.
  • [2026/04] 🚀 LoongForge source code publicly available on GitHub. [blog]
  • [2025/10] 🌟 Powered the training and public release of LLaVA-OneVision-1.5 under AIAK-Training-LLM, the predecessor of LoongForge. [blog]

⚡ Quick Start

See the full documentation for installation, tutorials, and advanced usage — English · 中文.

1. Install — via Docker (prebuilt images coming soon) or a source build:
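
   A minimal sketch of a source build; the clone URL matches the repository cited below, while the editable pip install is an assumption, so treat the installation docs as authoritative:

   ```bash
   # Clone with submodules: third_party/Loong-Megatron is a git submodule
   git clone --recursive https://github.com/baidu-baige/LoongForge.git
   cd LoongForge

   # Assumed editable install; see the installation docs for the exact steps
   pip install -e .
   ```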

2. Launch your first training run — follow a tutorial for your target hardware and modality:
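
   For instance (the script path below is illustrative, not an actual file; pick a real one under examples/ or examples_xpu/):

   ```bash
   # Hypothetical launch-script name; browse examples/ for the real entries
   bash examples/qwen3/pretrain_qwen3_30b_a3b.sh
   ```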

3. Explore — browse configs/models/ plus examples/ and examples_xpu/ for ready-to-run scripts:
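
   A quick way to see what ships out of the box:

   ```bash
   ls configs/models/           # Hydra YAML model configs
   ls examples/ examples_xpu/   # GPU and Kunlun XPU launch scripts
   ```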

✨ Key Features

  • 🧩 Flexible Multi-Modal Composition — Configuration-driven assembly of VLMs from interchangeable ViT and LLM components.
  • ⚡ Heterogeneous Parallelism — Independent TP / DP / recompute settings per model component (e.g., ViT vs. LLM) for optimal throughput and memory.
  • 🔀 Decoupled Encoder-Decoder Training — Separates ViT and LLM into independent tasks, eliminating encoder-induced pipeline bubbles.
  • ⚖️ DP Load Balancing — Load-aware data redistribution mitigates sequence-packing imbalance, improving multi-node scaling efficiency.
  • 🚀 MoE-Native Optimization — Overlapped All2All / activation offload / compute, with further memory reduction beyond upstream Megatron-LM on DeepSeek-V3, Qwen3-MoE, etc.
  • 🔬 Adaptive FP8 Training — End-to-end FP8 for LLMs and VLMs with standard blockwise FP8; an optional adaptive mode picks per-operator precision based on GEMM shape and efficiency.
  • 🔧 Custom Fused Operators — Fused kernels such as FusedDSA for DSA-style models; the TileLang version is open-sourced, and a high-performance CUDA version is available on the Baidu Baige platform.
  • 🔄 Flexible Checkpointing — Offline bidirectional Megatron ↔ HuggingFace conversion plus native online HF load/save — no format barriers across your workflow (see the sketch after this list).
  • 🧰 Versatile Pipelines & Data Tools — Out-of-the-box Pretrain / MidTrain / SFT / LoRA, with built-in dataset format conversion and sequence packing.
  • 🌐 Heterogeneous Hardware — Native support for NVIDIA GPUs and Kunlun XPUs via a minimally intrusive plugin design.
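
To make the checkpointing workflow concrete, here is a hedged sketch of an offline conversion run; the script name and flags are placeholders, as the real converters live under tools/ with their own CLIs:

```bash
# Placeholder script name and flags: check tools/ for the actual
# checkpoint-conversion entry points and their arguments
python tools/convert_checkpoint.py \
    --source-format huggingface \
    --target-format megatron \
    --load-dir /path/to/hf_model \
    --save-dir /path/to/megatron_ckpt
```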

📖 Deep-dive: LLM features · VLM features

📊 Benchmark

Measured in v0.1.0 across LLM, VLM, and VLA workloads against mainstream open-source training baselines:

| Model | Type | Baseline | Configuration | Speedup |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE | Megatron-LM† | 32 × A800‡ · GBS 1024 · 32K | 1.16× |
| Qwen3-VL-30B-A3B | VLM | VeOmni† | 32 × A800‡ · GBS 128 · 32K | 1.45× |
| GR00T N1.6 | VLA | LeRobot† | 8 × A800‡ · GBS 128 · 224×224 | 1.60× |
| Pi0.5 | VLA | OpenPI† | 8 × A800‡ · GBS 112 · 224×224 | 1.65× |

Due to test-bed scale limits, DeepSeek-V3.2 was validated separately on a reduced-layer configuration — LoongForge's DSA operator-level optimizations still deliver a ~5× speedup over Megatron-LM and reach a 64K sequence length, where the baseline OOMs beyond 8K.

† Numbers reflect the baseline and LoongForge versions at the time of measurement, and may evolve as implementations change.
‡ Validation on additional hardware is rolling out in upcoming releases.

🌟 Powered by LoongForge

  • LLaVA-OneVision-2.0 — Next-generation multimodal model, with new VideoCaption and Spatial datasets.
  • LLaVA-OneVision-1.5 — Fully open framework for democratized multimodal training.
  • Qianfan-VL — Domain-Enhanced Vision-Language Models for Enterprise, 3B to 70B parameters.

πŸ›οΈ Supported Models

LoongForge supports a broad range of state-of-the-art models across LLM, VLM, diffusion, and VLA.

| Modality | Architecture | Models |
|---|---|---|
| LLM | DeepSeek-V2 | deepseek-v2-lite, deepseek-v2 |
| | DeepSeek-V3 | deepseek-v3, deepseek-v32 |
| | LLaMA2 | llama2-7b, llama2-13b, llama2-70b |
| | LLaMA3 | llama3-8b, llama3-70b |
| | LLaMA3.1 | llama3.1-8b, llama3.1-70b, llama3.1-405b |
| | Qwen | qwen-1.8b → qwen-72b |
| | Qwen1.5 | qwen1.5-0.5b → qwen1.5-72b |
| | Qwen2 | qwen2-0.5b → qwen2-72b |
| | Qwen2.5 | qwen2.5-0.5b → qwen2.5-72b |
| | Qwen3 | qwen3-0.6b → qwen3-480b-a35b, qwen3-coder-30b-a3b |
| | Qwen3-Next | qwen3-next-80b-a3b |
| | MiniMax | minimax-m2.1, minimax-m2.5, minimax-m2.7 |
| | MIMO | mimo-7b |
| | GLM | glm5 |
| VLM | Qwen2.5-VL | qwen2.5-vl-3b → qwen2.5-vl-72b |
| | Qwen3-VL | qwen3-vl-30b-a3b, qwen3-vl-235b-a22b |
| | Qwen3.5 | qwen3.5-0.8b → qwen3.5-397b-a17b |
| | Qwen3.6 | qwen3.6-27b, qwen3.6-35b-a3b |
| | ERNIE4.5-VL | ernie4.5vl-28b-a3b |
| | LLaVA-OneVision-1.5 | llava-onevision-1.5-4b |
| | InternVL2.5 | internvl2.5-8b → internvl2.5-78b |
| | InternVL3.5 | internvl3.5-8b → internvl3.5-241b-a28b |
| | CustomCombinedModel | Flexible ViT + LLM backbone configuration (example) |
| Diffusion | WAN2.2 | wan2.2_i2v_a14b |
| VLA | Pi | pi0.5 |
| | GR00T | groot-n1.6 |

🚀 Roadmap

Model Support

  • LLM / VLM: ongoing validation and release of new models (e.g., DeepSeek-V4, Kimi 2.5 / 2.6)
  • Embodied AI: expanded VLA coverage (e.g., DreamZero, LingBot VA)

Performance & Scaling

  • Adopt next-generation techniques introduced with DeepSeek-V4
  • Advanced MoE load-balancing strategies
  • Long-sequence training with ChunkPipe scheduling and Context Parallelism
  • Further diffusion-model acceleration (e.g., WAN)
  • INT4 quantization-aware training
  • MTP (Multi-Token Prediction) scaling for speculative decoding

πŸ—οΈ Repository Layout

πŸ“ Directory tree
LoongForge/
β”œβ”€β”€ loongforge/                   # Core training framework
β”‚   β”œβ”€β”€ train/                    # Training entry points & trainers
β”‚   β”‚   β”œβ”€β”€ pretrain/             #   Pretrain (LLM, VLM)
β”‚   β”‚   β”œβ”€β”€ sft/                  #   SFT (LLM, VLM, InternVL, ERNIE)
β”‚   β”‚   β”œβ”€β”€ diffusion/            #   Diffusion (WAN)
β”‚   β”‚   └── embodied/             #   Embodied AI (Pi0.5, GR00T)
β”‚   β”œβ”€β”€ models/                   # Unified model abstractions
β”‚   β”‚   β”œβ”€β”€ foundation/           #   LLM backbones (LLaMA, Qwen, DeepSeek, ...)
β”‚   β”‚   β”œβ”€β”€ encoder/              #   Vision encoders (ViT, Qwen-VL, InternVL, ...)
β”‚   β”‚   β”œβ”€β”€ omni_models/          #   Multi-modal composition
β”‚   β”‚   β”œβ”€β”€ diffusion/            #   Diffusion models (WAN)
β”‚   β”‚   β”œβ”€β”€ embodied/             #   Embodied models (Pi0.5, GR00T)
β”‚   β”‚   └── common/               #   Shared layers and utilities
β”‚   β”œβ”€β”€ data/                     # Data pipelines (multi-modal, video, DP balance)
β”‚   β”œβ”€β”€ tokenizer/                # Tokenizers
β”‚   └── utils/                    # Config map, constants, etc.
β”œβ”€β”€ third_party/Loong-Megatron/   # Patched Megatron-LM (git submodule)
β”œβ”€β”€ configs/                      # Hydra YAML configs (models, data)
β”œβ”€β”€ examples/                     # GPU launch scripts
β”œβ”€β”€ examples_xpu/                 # Kunlun XPU launch scripts
β”œβ”€β”€ tools/                        # Checkpoint conversion, data preprocessing
β”œβ”€β”€ ops/                          # Custom fused operators (incl. open-sourced TileLang)
β”œβ”€β”€ patches/                      # TransformerEngine patches
β”œβ”€β”€ docker/                       # Dockerfiles (GPU & XPU)
β”œβ”€β”€ tests/                        # E2E test suite (YAML-driven)
└── docs/                         # Documentation

🤝 Contributing

We warmly welcome community contributions — bug reports, feature proposals, and PRs alike. Please read our Contributing Guidelines before submitting.

📄 License

LoongForge is released under the Apache License 2.0. Some files are derived from third-party open-source projects; please refer to the specific file headers for their respective copyright and attribution.

πŸ“ Citation

@software{LoongForge2026,
  title  = {LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models},
  author = {{The LoongForge Authors}},
  year   = {2026},
  url    = {https://github.com/baidu-baige/LoongForge}
}

πŸ™ Acknowledgments

LoongForge is built upon NVIDIA's Megatron-LM. We also drew inspiration from several excellent open-source projects, including but not limited to HuggingFace Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.

💬 Contact

Open a GitHub issue for questions, feedback, or feature requests, or scan the WeChat QR code below to join our developer community.

LoongForge WeChat Community