Primus
Primus (Primus-LM) is a flexible, high-performance training framework designed for large-scale foundation model training and inference on AMD GPUs. It supports pretraining, post-training, and reinforcement learning workflows across multiple backends, including Megatron-LM, TorchTitan, and JAX MaxText, alongside ROCm-optimized components.
Part of the Primus Ecosystem: Primus-LM is the training framework layer of the Primus ecosystem, working together with Primus-Turbo (high-performance operators) and Primus-SaFE (stability & platform).
Key Features
- Multi-Backend Support: Seamlessly switch between Megatron-LM, TorchTitan, and other training frameworks (see the sketch after this list)
- Unified CLI: One command interface for local development, containers, and Slurm clusters (Docs)
- ROCm Optimized: Deep integration with the AMD ROCm stack and optimized kernels from Primus-Turbo
- Production Ready: Battle-tested on large-scale training runs with hundreds of GPUs
- Extensible Architecture: Plugin-based design for easy integration of custom models and workflows
- Enterprise Features: Built-in fault tolerance, checkpoint management, and monitoring
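As a concrete illustration of backend switching, the same `train pretrain` entry point is pointed at different experiment configs. The Megatron-LM config path below is the one used in the Quick Start; the TorchTitan path is only illustrative and may not match the actual layout of `examples/` in the repository:

```bash
# Megatron-LM backend (config path as used in the Quick Start below)
./primus-cli container --image rocm/primus:v26.2 \
  -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml

# TorchTitan backend (illustrative config path; check examples/ for the real one)
./primus-cli container --image rocm/primus:v26.2 \
  -- train pretrain --config examples/torchtitan/configs/MI300X/llama3_8B-BF16-pretrain.yaml
```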
Supported Models (high level)
- Megatron-LM: LLaMA2 / LLaMA3 / LLaMA4 families, DeepSeek-V2/V3, Mixtral-style MoE, and other GPT-style models
- TorchTitan: LLaMA3 / LLaMA4, DeepSeek-V3, and related decoder-only architectures
- MaxText (JAX): LLaMA3.x and other MaxText-supported transformer models (subset; see MaxText docs for details)
For the full and up-to-date model matrix, see Supported Models.
What's New
- [2025/12/17] MoE Training Best Practices on AMD GPUs - MoE Package Blog
- [2025/11/14] Primus CLI 1.0 Released - Unified command-line interface with comprehensive documentation
- [2025/08/22] Primus introduction blog
- [2025/06/18] Added TorchTitan backend support
- [2025/05/16] Added benchmark suite for performance evaluation
- [2025/04/18] Added Preflight cluster sanity checker
- [2025/04/14] Integrated hipBLASLt autotuning for optimized GPU kernel performance
- [2025/04/09] Extended support for LLaMA2, LLaMA3, DeepSeek-V2/V3 models
- [2025/03/04] Released Megatron trainer module
Setup & Deployment
Primus leverages AMD's ROCm Docker images to provide a consistent, ready-to-run environment optimized for AMD GPUs, eliminating manual dependency installation and environment configuration.
Prerequisites
- AMD ROCm drivers (version ≥ 7.0 recommended)
- Docker (version ≥ 24.0) with ROCm support
- ROCm-compatible AMD GPUs (e.g., Instinct MI300 series)
- Proper permissions for Docker and GPU device access
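A quick way to sanity-check these prerequisites, using standard ROCm and Docker tooling (paths and output vary by ROCm and Docker release):

```bash
# List AMD GPUs visible to the ROCm driver.
rocm-smi

# Show the installed ROCm version (file location can differ between releases).
cat /opt/rocm/.info/version

# Confirm the Docker version.
docker --version

# GPU device nodes the container needs; your user usually has to belong to
# the groups that own them (commonly "video" and "render").
ls -l /dev/kfd /dev/dri
```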
Quick Start with Primus CLI
- Pull the latest Docker image

  ```bash
  docker pull docker.io/rocm/primus:v26.2
  ```

- Clone the repository

  ```bash
  git clone --recurse-submodules https://github.com/AMD-AIG-AIMA/Primus.git
  cd Primus
  ```

- Run your first training

  ```bash
  # Run training in container
  # NOTE: If your config downloads weights/tokenizer from Hugging Face Hub,
  # you typically need to pass HF_TOKEN into the container.
  ./primus-cli container --image rocm/primus:v26.2 \
    --env HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
    -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
  ```
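The `container` subcommand above wraps the run in the ROCm Docker image. Since the CLI also targets local development, the same `train pretrain` subcommand should be invocable directly inside an environment that already has ROCm and the Primus dependencies installed; the following is only a sketch under that assumption, so check the CLI User Guide for the supported invocation modes:

```bash
# Sketch: direct invocation on a prepared host (no container wrapper).
# See the CLI User Guide for the officially supported modes and flags.
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
./primus-cli train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
```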
For more detailed usage instructions, see the CLI User Guide.
Documentation
Comprehensive documentation is available in the docs/ directory:
- Quick Start Guide - Get started in 5 minutes
- Primus CLI User Guide - Complete CLI reference and usage
- CLI Architecture - Technical design and architecture
- Backend Patch Notes - Primus-specific backend arguments
- Full Documentation Index - Browse all available documentation
Primus Ecosystem
Primus-LM is part of a comprehensive ecosystem designed to provide end-to-end solutions for large model training on AMD GPUs:
Architecture Overview
```
┌─────────────────────────────────────────────────────┐
│                     Primus-SaFE                     │
│            (Stability & Platform Layer)             │
│  Cluster Management | Fault Tolerance | Scheduling  │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│                      Primus-LM                      │
│                (Training Framework)                 │
│   Megatron | TorchTitan | Unified CLI | Workflows   │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│                    Primus-Turbo                     │
│            (High-Performance Operators)             │
│  FlashAttention | GEMM | Collectives | GroupedGemm  │
│           AITER | CK | hipBLASLt | Triton           │
└─────────────────────────────────────────────────────┘
```
Component Details
| Component | Role | Key Features | Repository |
|---|---|---|---|
| Primus (Primus-LM) | Training Framework | Multi-backend support, unified CLI, production-ready workflows | This repo |
| Primus-Turbo | Performance Layer | Optimized kernels for attention, GEMM, communication, and more | Primus-Turbo |
| Primus-SaFE | Platform Layer | Cluster orchestration, fault tolerance, topology-aware scheduling | Primus-SaFE |
How They Work Together
- Primus-LM provides the training framework and workflow orchestration
- Primus-Turbo supplies highly optimized compute kernels for maximum performance
- Primus-SaFE ensures stability and efficient resource utilization at scale
This separation of concerns allows each component to evolve independently while maintaining seamless integration.
TODOs
- Add support for more model architectures and backends
- Expand documentation with more examples and tutorials
Upstream Optimizations
Primus builds on top of several ROCm-native operator libraries and compiler projects; we couldn't reach current performance levels without them:
- ROCm AITER - AI Tensor Engine kernels (elementwise, attention, KV-cache, fused MoE, etc.)
- Composable Kernel - performance-portable tensor operator generator for GEMM and convolutions
- hipBLASLt - low-level BLAS Lt API with autotuning support for ROCm GPUs
- ROCm Triton - Python-first kernel compiler used for custom attention and MoE paths
If you rely on Primus, please consider starring or contributing to these projects as well; they are foundational to our stack.
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
Primus is released under the Apache 2.0 License.
Built with ❤️ by the AMD AI Brain - Training at Scale (TAS) Team