Primus
Primus (Primus-LM) is a flexible, high-performance training framework designed for large-scale foundation model training and inference on AMD GPUs. It supports pretraining, post-training, and reinforcement learning workflows across multiple backends, including Megatron-LM, TorchTitan, and JAX MaxText, alongside ROCm-optimized components.
Part of the Primus Ecosystem: Primus-LM is the training framework layer of the Primus ecosystem, working together with Primus-Turbo (high-performance operators) and Primus-SaFE (stability & platform).
Key Features
- Multi-Backend Support: Seamlessly switch between Megatron-LM, TorchTitan, and other training frameworks (see the sketch after this list)
- Unified CLI: One command interface for local development, containers, and Slurm clusters (Docs)
- ROCm Optimized: Deep integration with the AMD ROCm stack and optimized kernels from Primus-Turbo
- Production Ready: Battle-tested on large-scale training runs with hundreds of GPUs
- Extensible Architecture: Plugin-based design for easy integration of custom models and workflows
- Enterprise Features: Built-in fault tolerance, checkpoint management, and monitoring
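As a concrete illustration of backend switching, the same `train pretrain` entry point is pointed at different experiment configs. The Megatron-LM config path below is the one used in the Quick Start; the TorchTitan path is only illustrative and may not match the actual layout of `examples/` in the repository:

```bash
# Megatron-LM backend (config path as used in the Quick Start below)
./primus-cli container --image rocm/primus:v26.2 \
  -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml

# TorchTitan backend (illustrative config path; check examples/ for the real one)
./primus-cli container --image rocm/primus:v26.2 \
  -- train pretrain --config examples/torchtitan/configs/MI300X/llama3_8B-BF16-pretrain.yaml
```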
Supported Models (high level)
- Megatron-LM: LLaMA2 / LLaMA3 / LLaMA4 families, DeepSeek-V2/V3, Mixtral-style MoE, and other GPT-style models
- TorchTitan: LLaMA3 / LLaMA4, DeepSeek-V3, and related decoder-only architectures
- MaxText (JAX): LLaMA3.x and other MaxText-supported transformer models (subset; see MaxText docs for details)
For the full and up-to-date model matrix, see Supported Models.
What's New
- [2025/12/17] MoE Training Best Practices on AMD GPUs - MoE Package Blog
- [2025/11/14] Primus CLI 1.0 Released - Unified command-line interface with comprehensive documentation
- [2025/08/22] Primus introduction blog
- [2025/06/18] Added TorchTitan backend support
- [2025/05/16] Added benchmark suite for performance evaluation
- [2025/04/18] Added Preflight cluster sanity checker
- [2025/04/14] Integrated hipBLASLt autotuning for optimized GPU kernel performance
- [2025/04/09] Extended support for LLaMA2, LLaMA3, DeepSeek-V2/V3 models
- [2025/03/04] Released Megatron trainer module
Setup & Deployment
Primus leverages AMD's ROCm Docker images to provide a consistent, ready-to-run environment optimized for AMD GPUs, eliminating manual dependency installation and environment configuration.
Prerequisites
- AMD ROCm drivers (version ≥ 7.0 recommended)
- Docker (version ≥ 24.0) with ROCm support
- ROCm-compatible AMD GPUs (e.g., Instinct MI300 series)
- Proper permissions for Docker and GPU device access
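A quick way to sanity-check these prerequisites, using standard ROCm and Docker tooling (paths and output vary by ROCm and Docker release):

```bash
# List AMD GPUs visible to the ROCm driver.
rocm-smi

# Show the installed ROCm version (file location can differ between releases).
cat /opt/rocm/.info/version

# Confirm the Docker version.
docker --version

# GPU device nodes the container needs; your user usually has to belong to
# the groups that own them (commonly "video" and "render").
ls -l /dev/kfd /dev/dri
```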
Quick Start with Primus CLI
- Pull the latest Docker image

  ```bash
  docker pull docker.io/rocm/primus:v26.2
  ```

- Clone the repository

  ```bash
  git clone --recurse-submodules https://github.com/AMD-AIG-AIMA/Primus.git
  cd Primus
  ```

- Run your first training

  ```bash
  # Run training in container
  # NOTE: If your config downloads weights/tokenizer from Hugging Face Hub,
  # you typically need to pass HF_TOKEN into the container.
  ./primus-cli container --image rocm/primus:v26.2 \
    --env HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
    -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
  ```
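The `container` subcommand above wraps the run in the ROCm Docker image. Since the CLI also targets local development, the same `train pretrain` subcommand should be invocable directly inside an environment that already has ROCm and the Primus dependencies installed; the following is only a sketch under that assumption, so check the CLI User Guide for the supported invocation modes:

```bash
# Sketch: direct invocation on a prepared host (no container wrapper).
# See the CLI User Guide for the officially supported modes and flags.
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
./primus-cli train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
```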
For more detailed usage instructions, see the CLI User Guide.
Documentation
Comprehensive documentation is available in the docs/ directory:
- Quick Start Guide - Get started in 5 minutes
- Primus CLI User Guide - Complete CLI reference and usage
- CLI Architecture - Technical design and architecture
- Backend Patch Notes - Primus-specific backend arguments
- Full Documentation Index - Browse all available documentation
Primus Ecosystem
Primus-LM is part of a comprehensive ecosystem designed to provide end-to-end solutions for large model training on AMD GPUs:
Architecture Overview
```
┌─────────────────────────────────────────────────────┐
│                     Primus-SaFE                     │
│            (Stability & Platform Layer)             │
│  Cluster Management | Fault Tolerance | Scheduling  │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│                      Primus-LM                      │
│                (Training Framework)                 │
│   Megatron | TorchTitan | Unified CLI | Workflows   │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│                    Primus-Turbo                     │
│            (High-Performance Operators)             │
│  FlashAttention | GEMM | Collectives | GroupedGemm  │
│           AITER | CK | hipBLASLt | Triton           │
└─────────────────────────────────────────────────────┘
```
Component Details
| Component | Role | Key Features | Repository |
|---|---|---|---|
| Primus (Primus-LM) | Training Framework | Multi-backend support, unified CLI, production-ready workflows | This repo |
| Primus-Turbo | Performance Layer | Optimized kernels for attention, GEMM, communication, and more | Primus-Turbo |
| Primus-SaFE | Platform Layer | Cluster orchestration, fault tolerance, topology-aware scheduling | Primus-SaFE |
How They Work Together
- Primus-LM provides the training framework and workflow orchestration
- Primus-Turbo supplies highly optimized compute kernels for maximum performance
- Primus-SaFE ensures stability and efficient resource utilization at scale
This separation of concerns allows each component to evolve independently while maintaining seamless integration.
TODOs
- Add support for more model architectures and backends
- Expand documentation with more examples and tutorials
Upstream Optimizations
Primus builds on top of several ROCm-native operator libraries and compiler projects; we couldn't reach current performance levels without them:
- ROCm AITER - AI Tensor Engine kernels (elementwise, attention, KV-cache, fused MoE, etc.)
- Composable Kernel - performance-portable tensor operator generator for GEMM and convolutions
- hipBLASLt - low-level BLAS Lt API with autotuning support for ROCm GPUs
- ROCm Triton - Python-first kernel compiler used for custom attention and MoE paths
If you rely on Primus, please consider starring or contributing to these projects as well; they are foundational to our stack.
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
Primus is released under the Apache 2.0 License.
Built with ❤️ by the AMD AI Brain - Training at Scale (TAS) Team