ATOM

ATOM (AiTer Optimized Model) is a lightweight vLLM-like inference engine focused on integration and optimization on top of AITER.

📢 News

  • [2026/05] Dissecting DeepSeek V4 Compressor: interactive animation visualizing how the CSA/HCA compressor state cache works (overlap mechanism, prefill vs decode, bulk compression vs sequential accumulation).
  • [2026/03] ATOM now supports Prefill/Decode (P/D) disaggregation: run prefill and decode on separate GPU nodes with RDMA-based KV cache transfer via MORI-IO. See disaggregation docs.

🚀 Features

  • ROCm Optimized: Built on AMD's ROCm platform with AITER kernels (ASM, CK, Triton)
  • OpenAI-Compatible API: Drop-in server with /v1/chat/completions and /v1/completions endpoints
  • Piecewise torch.compile: 4 compilation levels with CUDA graph capture for low-latency decode
  • Multi-GPU Parallelism: Tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP) with MORI all-to-all
  • Two-Batch Overlap (TBO): Following DeepSeek's system design, TBO splits each batch into two micro-batches and pipelines them across compute and communication streams, effectively hiding expert-parallel communication latency and reducing peak memory usage. See recipe
  • Quantization: FP8, MXFP4, INT8, INT4 with auto-detection from HuggingFace configs
  • Speculative Decoding: Multi-Token Prediction (MTP) with EAGLE proposer
  • Prefix Caching: xxhash64-based KV cache block sharing across sequences (sketched below)
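
A minimal sketch of the idea behind prefix caching, assuming a 16-token block size and a chained-hash scheme purely for illustration (this is not ATOM's actual implementation): each block's hash combines the parent block's hash with the block's token IDs, so sequences that share a prefix produce identical leading block hashes and can share the corresponding KV cache blocks.

# Illustrative prefix-cache block hashing with xxhash64 (assumptions: 16-token
# blocks, chained hashing; not ATOM's actual code).
import xxhash

BLOCK_SIZE = 16  # tokens per KV cache block (assumed for illustration)

def block_hashes(token_ids: list[int]) -> list[int]:
    """Hash each full block, chaining in the previous block's hash."""
    hashes, parent = [], 0
    for i in range(len(token_ids) // BLOCK_SIZE):
        block = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        h = xxhash.xxh64()
        h.update(parent.to_bytes(8, "little"))          # chain the parent hash
        h.update(" ".join(map(str, block)).encode())    # mix in the block's tokens
        parent = h.intdigest()
        hashes.append(parent)
    return hashes

a = list(range(48))                    # 3 full blocks
b = list(range(32)) + [999] * 16       # shares the first 2 blocks with `a`
assert block_hashes(a)[:2] == block_hashes(b)[:2]   # shared prefix -> shared blocks
assert block_hashes(a)[2] != block_hashes(b)[2]     # sequences diverge in block 3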

Supported Models

| Model Family | HF Architecture | Dense/MoE | Notes |
|---|---|---|---|
| Llama | LlamaForCausalLM | Dense | Llama 2, Llama 3, Llama 3.1 |
| Qwen3 | Qwen3ForCausalLM | Dense | |
| Qwen3-MoE | Qwen3MoeForCausalLM | MoE | 128 experts, top-8 routing |
| Qwen3-Next | Qwen3NextForCausalLM | MoE | Hybrid full attention + Gated DeltaNet |
| DeepSeek V2/V3 | DeepseekV3ForCausalLM | MoE | MLA attention, MTP speculative decoding |
| Mixtral | MixtralForCausalLM | MoE | 8 experts, top-2 routing |
| GLM-4-MoE | Glm4MoeForCausalLM | MoE | |
| GLM-5 | GlmMoeDsaForCausalLM | MoE | MLA attention, similar to DeepSeek V3.2. See recipe |
| GPT-OSS | GptOssForCausalLM | MoE | Sliding window + attention sinks |
| Kimi-K2 | via --trust-remote-code | MoE | See recipe |
| MiMo-V2-Flash | MiMoV2FlashForCausalLM | MoE | Hybrid full + SWA attention, 3-layer MTP. See recipe |
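
To see which row of the table above applies to a given checkpoint, you can read the architecture it declares in its Hugging Face config. A quick check using the transformers library (the model ID is an example):

# Print the HF architecture declared by a checkpoint's config.json.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
print(cfg.architectures)  # e.g. ['Qwen3ForCausalLM'], matching the table row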

📋 Requirements

  • AMD GPU with ROCm support
  • Docker

๐Ÿ› ๏ธ Installation

Option A: Use the Pre-built Image

Pull the pre-built image, which ships with AITER and ATOM ready to use:

docker pull rocm/atom-dev:latest

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/atom-dev:latest

Option B: Build from Base ROCm Image

1. Pull and run the base image

docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

2. Install AITER and ATOM inside the container

pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git && pip install ./ATOM
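
A quick sanity check after installation is to import the packages inside the container; the import names below (aiter for amd-aiter, atom for the ATOM checkout) are assumed from the commands used elsewhere in this README:

# Post-install sanity check: confirm torch, AITER, and ATOM all import.
import importlib

for name in ("torch", "aiter", "atom"):
    module = importlib.import_module(name)
    print(f"{name}: imported from {getattr(module, '__file__', '<namespace package>')}")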

💡 Usage

Basic Example

Before running the example, install ninja and the Hugging Face CLI, then log in to your Hugging Face account.

pip install ninja
pip install -U "huggingface_hub"
hf auth login

The default optimization level is 3 (piecewise torch.compile with CUDA graphs).

python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8

Note: First-time execution may take approximately 10 minutes for model compilation.

Serving

Start an OpenAI-compatible server:

# Single GPU
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8

# Multi-GPU with tensor parallelism
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8

# With MTP speculative decoding
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8 \
  --method mtp --num-speculative-tokens 3
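
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch with the official openai Python package (pip install openai); the model name and port must match the launch command, and the API key is only a placeholder for a local server:

# Query the ATOM server through the OpenAI-compatible chat endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # must match the --model passed to the server
    messages=[{"role": "user", "content": "Summarize tensor parallelism in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)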

📊 Performance

Live Benchmark Dashboard

rocm.github.io/ATOM/benchmark-dashboard

The dashboard tracks nightly performance across models and configurations:

  • Interactive vs Throughput: tok/s/user vs tok/s/GPU tradeoff across concurrency levels
  • Throughput & Latency trends: output throughput, TTFT, and TPOT over time, grouped by model
  • Regression detection: automatic alerts when throughput drops >5% or latency increases >10% (sketched below)
  • Profiler trace collection: on regression, the pipeline automatically re-runs with the PyTorch profiler and uploads traces

Models tracked: DeepSeek-R1-0528 (FP8 & MTP3), GLM-5-FP8, gpt-oss-120b
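
The regression thresholds above translate into a simple per-metric comparison against a baseline run. An illustrative sketch (not the dashboard's actual code; the metric names are assumptions):

# Flag a run when throughput drops >5% or a latency metric rises >10%.
def detect_regressions(baseline: dict, current: dict) -> list[str]:
    alerts = []
    if current["output_tok_per_s"] < baseline["output_tok_per_s"] * 0.95:
        alerts.append("throughput drop > 5%")
    for metric in ("ttft_ms", "tpot_ms"):
        if current[metric] > baseline[metric] * 1.10:
            alerts.append(f"{metric} increase > 10%")
    return alerts

print(detect_regressions(
    {"output_tok_per_s": 1200.0, "ttft_ms": 180.0, "tpot_ms": 22.0},
    {"output_tok_per_s": 1100.0, "ttft_ms": 175.0, "tpot_ms": 25.0},
))  # -> ['throughput drop > 5%', 'tpot_ms increase > 10%']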

Online Serving Throughput

[Figure: DeepSeek R1 online serving performance]

For more information, visit InferenceX.

Benchmarking

Run an online throughput benchmark against a running server:

python -m atom.benchmarks.benchmark_serving \
  --model=deepseek-ai/DeepSeek-R1 --backend=vllm --base-url=http://localhost:8000 \
  --dataset-name=random \
  --random-input-len=1024 --random-output-len=1024 \
  --random-range-ratio=0.8 \
  --num-prompts=1280 --max-concurrency=128 \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el"
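
With --save-result, the benchmark writes its metrics to a JSON file. A hedged helper for inspecting it (the output location and field names below are assumptions based on vLLM-style benchmark output; adjust to what your run actually produces):

# Print key metrics from the most recent saved benchmark result, if present.
import glob, json

latest = sorted(glob.glob("*.json"))[-1]  # assumed: result JSON lands in the CWD
with open(latest) as f:
    result = json.load(f)

for key in ("request_throughput", "output_throughput",
            "mean_ttft_ms", "p99_ttft_ms", "mean_tpot_ms", "p99_tpot_ms"):
    if key in result:
        print(f"{key:>20}: {result[key]}")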

Profiling & Trace Analysis

Collect a Trace

Launch the server with --torch-profiler-dir and --mark-trace:

python -m atom.entrypoints.openai_server \
  --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8 \
  --torch-profiler-dir ./trace --mark-trace

Alternatively, collect traces via the benchmark's --profile flag, which starts and stops profiling automatically:

python -m atom.benchmarks.benchmark_serving \
  --model=deepseek-ai/DeepSeek-R1 --backend=vllm --base-url=http://localhost:8000 \
  --dataset-name=random --random-input-len=1024 --random-output-len=1024 \
  --num-prompts=128 --max-concurrency=128 \
  --request-rate=inf --ignore-eos --profile

Or control profiling manually on a running server:

curl -X POST http://127.0.0.1:8000/start_profile
# ... run your workload ...
curl -X POST http://127.0.0.1:8000/stop_profile
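
The same endpoints can be driven from a script so that profiling brackets exactly the workload you care about. A minimal sketch using requests (the model name is an example; the profile endpoints are the ones shown above):

# Start profiling, run one request through the completions endpoint, stop profiling.
import requests

BASE = "http://127.0.0.1:8000"

requests.post(f"{BASE}/start_profile")
requests.post(
    f"{BASE}/v1/completions",
    json={"model": "deepseek-ai/DeepSeek-R1", "prompt": "Hello, ATOM!", "max_tokens": 32},
)
requests.post(f"{BASE}/stop_profile")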

Analyze the Trace

# Kernel breakdown per layer → Excel
python tools/parse_trace.py ./trace/rank_0/DeepSeek-R1_ts_*.json.gz --layer 3

# Performance summary → Markdown report
python tools/analyze_trace_summary.py ./trace/rank_0/DeepSeek-R1_ts_*.json.gz
| Output | Description |
|---|---|
| prefill_breakdown.xlsx | Per-kernel duration, call count, pct%, module grouping, cross-layer averages |
| decode_breakdown.xlsx | Same for the decode phase, with CUDAGraph kernel mapping |
| performance_summary.md | Prefill/decode/draft step timing, iteration breakdown |
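
For a quick look without the full tooling, the exported trace is a standard Chrome-format PyTorch profiler trace, so kernel time can be aggregated with a few lines of Python. This is an illustrative stand-in, not the logic of tools/parse_trace.py, and the file name is an example:

# Sum GPU kernel durations (microseconds) by kernel name from a .json.gz trace.
import gzip, json
from collections import Counter

def kernel_totals(trace_path: str) -> Counter:
    with gzip.open(trace_path, "rt") as f:
        events = json.load(f)["traceEvents"]
    totals = Counter()
    for ev in events:
        # Complete events (ph == "X") in a kernel category carry a duration in us.
        if ev.get("ph") == "X" and "kernel" in ev.get("cat", "").lower():
            totals[ev["name"]] += ev.get("dur", 0)
    return totals

for name, dur_us in kernel_totals("./trace/rank_0/example_trace.json.gz").most_common(10):
    print(f"{dur_us / 1000:10.2f} ms  {name}")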

Accuracy Validation

pip install lm-eval[api]

# Start server, then run evaluation
lm_eval --model local-completions \
  --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --num_fewshot 5

📚 Documentation

Full documentation: rocm.github.io/ATOM/docs

| Topic | Description | Guide |
|---|---|---|
| Architecture | System overview, request lifecycle, component design | Architecture Guide |
| Configuration | Config classes, CLI arguments, environment variables | Configuration Guide |
| Model Support | Supported models, weight loading, adding new architectures | Model Support Guide |
| Model Operations | AITER kernel integration, linear/attention/MoE/norm wrappers | Model Ops Guide |
| Scheduling & KV Cache | Batch scheduling, block allocation, prefix caching | Scheduling Guide |
| Compilation | torch.compile levels, CUDA graphs, piecewise compilation | Compilation Guide |
| Distributed | Tensor/data/expert parallelism, multi-GPU deployment | Distributed Guide |
| Serving & Benchmarks | OpenAI API server, benchmarking, profiling, speculative decoding | Serving Guide |
| Environment Variables | All ATOM_* variable definitions | Env Vars |

Deployment Recipes:

Framework Integration:

Acknowledgements

This project was adapted from nano-vllm.

Support & Reporting Issues

We welcome issues and contributions! Please use the GitHub Issues page to report bugs or request features: https://github.com/ROCm/ATOM/issues