ATOM Configuration Guide
April 29, 2026
ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine built on AITER kernels for ROCm/HIP GPUs. This guide documents every configuration class, CLI flag, and environment variable that controls ATOM's runtime behaviour.
Quick Reference
| Config Class | Primary Purpose |
|---|---|
Config | Master dataclass -- model path, memory, TP size, scheduler limits, KV cache, profiler, and references to all sub-configs |
CompilationConfig | Compilation level (0-3), CUDA graph capture sizes, piecewise splitting ops, inductor settings |
CompilationLevel | Integer constants for the four compilation levels |
CUDAGraphMode | Enum controlling how CUDA graphs are captured (none / piecewise / full / hybrid) |
QuantizationConfig | Layer-wise quantization orchestrator: global config, per-layer overrides, exclude lists, layer name remapping |
LayerQuantConfig | Per-layer quantization spec (frozen dataclass): quant type, dtype, dynamic flag, method |
ParallelConfig | Data-parallel size, rank, master IP/port |
SpeculativeConfig | Speculative decoding method, draft model, number of speculative tokens |
KVCacheConfig / KVCacheTensor | Per-layer KV cache tensor descriptors (k/v caches and scales) |
SamplingParams | Temperature, max tokens, stop strings, ignore-EOS flag |
EngineArgs | CLI argument parser that builds a Config for LLMEngine |
1. Master Configuration (Config)
Defined in atom/config.py. The root dataclass that the engine consumes.
| Field | Type | Default | Description |
|---|---|---|---|
model | str | (required) | HuggingFace model name or local path |
trust_remote_code | bool | False | Trust remote code when loading the model from HuggingFace |
max_num_batched_tokens | int | 16384 | Maximum number of tokens batched together per scheduler step |
scheduler_delay_factor | float | 0.0 | Multiplicative delay (factor x previous prompt latency) before scheduling the next prompt |
max_num_seqs | int | 512 | Maximum number of sequences batched together |
max_model_len | Optional[int] | None | Maximum context length; defaults to hf_config.max_position_embeddings (capped by it when set) |
gpu_memory_utilization | float | 0.9 | Fraction of GPU memory available for KV cache and weights (0.0 -- 1.0) |
tensor_parallel_size | int | 1 | Number of tensor-parallel GPUs (1 -- 8) |
enforce_eager | bool | False | Disable compilation and CUDA graphs; run in eager mode |
parallel_config | ParallelConfig | ParallelConfig() | Data-parallel configuration (see Section 4) |
kv_cache_block_size | int | 16 | Block size for paged KV cache; must be a multiple of 16 or exactly 1 |
num_kvcache_blocks | int | -1 | Number of KV cache blocks (-1 = auto) |
kv_cache_dtype | str | "bf16" | KV cache data type ("bf16" or "fp8") |
enable_prefix_caching | bool | False | Enable prefix caching to reuse KV blocks across requests sharing the same prefix |
port | int | 8006 | Engine internal communication port |
torch_profiler_dir | Optional[str] | os.getenv("ATOM_TORCH_PROFILER_DIR", None) | Directory for saving PyTorch profiler traces; creates the directory if it does not exist |
compilation_config | CompilationConfig | CompilationConfig() | Compilation and CUDA graph settings (see Section 2) |
quant_config | QuantizationConfig | (auto-detected) | Quantization settings; auto-detected from HuggingFace config during __post_init__ via QuantizationConfig(hf_config) (see Section 3) |
asyncio_mode | bool | False | Enable asyncio-based engine loop |
load_dummy | bool | False | Skip loading model weights (for benchmarking / testing) |
enable_expert_parallel | bool | False | Enable Expert Parallelism for MoE models |
master_addr | str | "127.0.0.1" | Master address for distributed communication |
graph_bs | Optional[list[int]] | None | Explicit list of batch sizes for CUDA graph capture; derived from compilation_config during init |
enable_dp_attention | bool | False | Enable data-parallel attention |
torch_dtype | torch.dtype | (computed) | Inferred from hf_config.torch_dtype; falls back to torch.bfloat16 |
speculative_config | Optional[SpeculativeConfig] | None | Speculative decoding configuration (see Section 5) |
bos_token_id | int | -1 | Beginning-of-sequence token ID (-1 = use model default) |
eos_token_id | int | -1 | End-of-sequence token ID (-1 = use model default) |
stop_token_ids | list[int] | [] | Additional stop token IDs; populated from GenerationConfig.eos_token_id during init |
Auto-derived fields (set in __post_init__ or by ModelRunner.get_num_blocks(), not user-supplied):
| Field | Type | Description |
|---|---|---|
hf_config | PretrainedConfig | Loaded automatically via get_hf_config(model) |
generation_config | GenerationConfig | Loaded automatically via get_generation_config(model) |
per_req_cache_equiv_blocks | int | Number of KV cache block equivalents reserved per request for the per-request stateful-attention cache (currently GDN recurrent state; future stateful attentions plug in via AttentionMetadataBuilder.compute_per_req_cache_bytes()); computed by ModelRunner.get_num_blocks() |
num_per_req_cache_groups | int | Number of per-request slot groups available (= max_num_seqs for stateful-attention models, 0 otherwise); computed by ModelRunner.get_num_blocks() |
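As a usage sketch, the master config can also be built directly in Python. The keyword names below come from the field table above, and everything in the auto-derived table is filled in by __post_init__; whether your ATOM version accepts exactly these keyword arguments is an assumption to verify.

```python
# Minimal sketch: constructing the master Config programmatically.
from atom.config import Config

config = Config(
    model="Qwen/Qwen3-0.6B",     # HuggingFace name or local path (required)
    tensor_parallel_size=1,
    max_num_seqs=256,
    gpu_memory_utilization=0.9,
    kv_cache_dtype="bf16",
    enable_prefix_caching=True,
)
# hf_config, generation_config, torch_dtype, quant_config, etc. are
# derived automatically in __post_init__.
```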
2. Compilation Configuration (CompilationConfig)
Defined in atom/config.py. Controls torch.compile and CUDA graph behaviour.
2.1 Compilation Levels (CompilationLevel)
| Constant | Value | Description |
|---|---|---|
NO_COMPILATION | 0 | No compilation -- pure eager execution |
DYNAMO_AS_IS | 1 | Use torch.compile / TorchDynamo as-is |
DYNAMO_ONCE | 2 | TorchDynamo with a single compilation pass |
PIECEWISE | 3 | Piecewise compilation with CUDA graph capture (recommended for production) |
2.2 CompilationConfig Fields
| Field | Type | Default | Description |
|---|---|---|---|
level | int | 0 | Compilation level (see table above); must be 0 -- 3 |
use_cudagraph | bool | True | Whether to use CUDA graphs |
cudagraph_capture_sizes | Optional[list[int]] | None | Explicit list of batch sizes for CUDA graph capture; overrides cuda_graph_sizes when set |
cuda_graph_sizes | list[int] | [] (post-init: [512]) | CUDA graph sizing strategy: 1 value generates [1,2,4,8] + range(16, N+1, 16); multiple values used as-is; empty defaults to [512] |
debug_dump_path | str | "" | Path to dump debug / compilation information |
cache_dir | str | "" | Directory for compilation caches |
use_inductor | bool | True | Enable TorchInductor backend |
cudagraph_mode | Optional[CUDAGraphMode] | None | CUDA graph capture mode (see below); set to PIECEWISE automatically at level 3 |
splitting_ops | Optional[list[str]] | None | Ops that split the graph into sub-graphs for piecewise compilation; auto-populated at level 3 with ["aiter.unified_attention_with_output", "aiter.mla_attention"] |
cudagraph_copy_inputs | bool | False | Copy input tensors into internally managed buffers before CUDA graph replay; only effective in PIECEWISE mode |
compile_sizes | Optional[list[Union[int, str]]] | None | Sizes to compile for inductor; accepts integers and the string "cudagraph_capture_sizes" |
inductor_compile_config | dict | {} | Additional configuration passed to the inductor backend |
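To make the cuda_graph_sizes strategy concrete, here is a standalone paraphrase of the expansion rule from the table above. This is our re-implementation for illustration, not ATOM's code, and it assumes the post-init default [512] also goes through the single-value expansion.

```python
def expand_cuda_graph_sizes(cuda_graph_sizes: list[int]) -> list[int]:
    """Paraphrase of the documented cuda_graph_sizes strategy."""
    if not cuda_graph_sizes:
        cuda_graph_sizes = [512]      # empty list defaults to [512]
    if len(cuda_graph_sizes) == 1:
        n = cuda_graph_sizes[0]       # one value N expands to a ramp of sizes
        return [1, 2, 4, 8] + list(range(16, n + 1, 16))
    return cuda_graph_sizes           # multiple values are used as-is

# expand_cuda_graph_sizes([64]) == [1, 2, 4, 8, 16, 32, 48, 64]
```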
2.3 CUDA Graph Mode (CUDAGraphMode)
| Mode | Value | Description |
|---|---|---|
NONE | 0 | No CUDA graph capture |
PIECEWISE | 1 | Piecewise CUDA graphs -- attention ops stay outside the graph for flexibility (default at level 3) |
FULL | 2 | Full CUDA graph capture for all batches; best for small models / short prompts |
FULL_DECODE_ONLY | (FULL, NONE) | Full CUDA graphs for decode batches only; mixed prefill-decode runs without graphs (useful in P/D setups) |
FULL_AND_PIECEWISE | (FULL, PIECEWISE) | Full graphs for decode, piecewise for prefill/mixed -- most performant mode for most models |
Helper methods on CUDAGraphMode:
- decode_mode() -- returns the mode used for pure decode batches.
- mixed_mode() -- returns the mode used for mixed prefill-decode batches.
- requires_piecewise_compilation() -- whether the mode needs piecewise compilation.
- has_full_cudagraphs() -- whether the mode includes full CUDA graph capture.
- separate_routine() -- whether decode and mixed batches use different routines.
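An illustrative sketch of how the tuple-valued modes could back these helpers, assuming the (decode, mixed) tuple semantics shown in the table; this is not ATOM's actual enum definition.

```python
from enum import Enum

class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (2, 0)     # (decode, mixed) = (FULL, NONE)
    FULL_AND_PIECEWISE = (2, 1)   # (decode, mixed) = (FULL, PIECEWISE)

    def decode_mode(self) -> "CUDAGraphMode":
        v = self.value
        return CUDAGraphMode(v[0]) if isinstance(v, tuple) else self

    def mixed_mode(self) -> "CUDAGraphMode":
        v = self.value
        return CUDAGraphMode(v[1]) if isinstance(v, tuple) else self

    def separate_routine(self) -> bool:
        # Tuple-valued modes use different routines for decode vs. mixed.
        return isinstance(self.value, tuple)

    def has_full_cudagraphs(self) -> bool:
        return CUDAGraphMode.FULL in (self.decode_mode(), self.mixed_mode())

    def requires_piecewise_compilation(self) -> bool:
        return CUDAGraphMode.PIECEWISE in (self.decode_mode(), self.mixed_mode())
```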
3. Quantization Configuration (QuantizationConfig & LayerQuantConfig)
Defined in atom/config.py and atom/quant_spec.py. The quantization system uses two classes:
- QuantizationConfig -- the top-level orchestrator that holds a global config, per-layer overrides, and exclusion lists.
- LayerQuantConfig -- a frozen dataclass (defined in atom/quant_spec.py) that stores the concrete quantization parameters for a single layer or as the global default. Typed, immutable, with attribute access (e.g., spec.quant_type).
3.1 LayerQuantConfig Fields
LayerQuantConfig is a frozen dataclass. Fields are accessed as typed attributes (e.g., spec.quant_type).
| Field | Type | Default | Description |
|---|---|---|---|
quant_type | QuantType | QuantType.No | Quantization granularity (see below) |
quant_dtype | torch.dtype | torch.bfloat16 | Data type for quantized weights |
is_dynamic | bool | True | Use dynamic quantization (scales computed at runtime) |
quant_method | str | "" | Quantization method (e.g., "quark", "compressed-tensors") |
3.2 QuantizationConfig Attributes
| Attribute | Type | Description |
|---|---|---|
torch_dtype | torch.dtype | The model's default dtype (from hf_config.torch_dtype) |
hf_quant_config | Optional[dict] | Raw quantization_config dict from HuggingFace config |
global_quant_config | LayerQuantConfig | Default quantization spec applied to all layers |
_parsed.layer_pattern_specs | list[tuple[str, LayerQuantConfig]] | Per-layer overrides keyed by layer name pattern (supports fnmatch globs like "*.mlp.*") |
exclude_layers | list[str] | Layer names excluded from quantization (supports exact match and "re:" regex prefix) |
quant_method | str | Top-level quantization method name (e.g., "quark", "compressed-tensors") |
Key methods:
| Method | Description |
|---|---|
get_name() | Returns the quantization method name |
get_layer_quant_config(layer_name) | Returns the LayerQuantConfig for a layer: checks exclusions first, then per-layer overrides, then falls back to global spec |
remap_layer_name(hf_config, packed_modules_mapping) | Remaps layer names for packed/fused modules (e.g., q_a_proj → fused_qkv_a_proj for DeepSeek) |
compute_hash() | Returns a SHA-256 hash of the quantization config for cache invalidation |
3.3 QuantType Values (from AITER)
| Value | Description |
|---|---|
QuantType.No | No quantization |
QuantType.per_Token | Per-token / per-channel quantization |
QuantType.per_1x128 | Block quantization with group size 128 |
QuantType.per_1x32 | Block quantization with group size 32 |
QuantType.per_128x128 | Large 2D block quantization (remapped to per_1x128 in MoE kernels) |
QuantType.per_Tensor | Per-tensor quantization |
3.4 Supported Quantization Dtypes
| Dtype | AITER Key | Notes |
|---|---|---|
| FP8 (E4M3) | "fp8" | 8-bit floating point |
| MXFP4 | "fp4x2" | Microscaling FP4; forces QuantType.per_1x32 |
| INT8 | "i8" | 8-bit integer |
| INT4 | "i4x2" | 4-bit integer (packed) |
3.5 Auto-Detection from HuggingFace
During Config.__post_init__, ATOM constructs QuantizationConfig(hf_config) which
reads hf_config.quantization_config and automatically determines quantization
parameters:
For quark models (quant_method == "quark"):
- Parses the global_quant_config dict via QuarkParser to produce the global LayerQuantConfig.
- Parses each entry in the layer_quant_config dict to produce per-layer overrides.
- Reads the "exclude" list for excluded layers.
- Within each config dict, weight.qscheme determines quant_type ("per_channel" → per_Token, "per_tensor" → per_Tensor, "per_group" → per_1x32) and weight.dtype determines quant_dtype; input_tensors.is_dynamic controls dynamic quantization (defaults to True if absent).
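For orientation, a quark-style quantization_config implied by these parsing rules might look as follows; the concrete field values are illustrative, not taken from a real checkpoint.

```python
# Hypothetical quark-style quantization_config, shaped per the parsing
# rules above; values are examples only.
quantization_config = {
    "quant_method": "quark",
    "global_quant_config": {
        "weight": {"qscheme": "per_channel", "dtype": "fp8_e4m3"},  # -> per_Token, FP8
        "input_tensors": {"is_dynamic": True},
    },
    "layer_quant_config": {
        "*.mlp.*": {
            "weight": {"qscheme": "per_group", "dtype": "fp4"},     # -> per_1x32, MXFP4
        },
    },
    "exclude": ["lm_head"],
}
```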
For other models (compressed-tensors, etc.):
- If quant_method == "compressed-tensors" or channel quantization is detected, sets per_Token.
- If weight_block_size or group_size is found: group size 128 maps to per_1x128, group size 32 maps to per_1x32.
- Otherwise falls back to per_Tensor.
- The dtype is parsed from fields like dtype, weight_dtype, or quant_method, looking for fp8, fp4, mxfp4, int8, int4, or num_bits.
- If activation_scheme is "static", is_dynamic is set to False.
- Excluded layers are read from the "ignore" key.
3.6 Layer-Level Quantization Dispatch
Linear layers, MoE layers, and fused ops call quant_config.get_layer_quant_config(prefix) to obtain the appropriate LayerQuantConfig for their position in the model. This enables mixed-precision quantization where different layers can have different quant types and dtypes (e.g., FP8 for attention, FP4 for MLP).
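A sketch of that dispatch is shown below; the layer prefix is hypothetical, and the AITER import path is an assumption.

```python
from aiter import QuantType  # assumed import path; QuantType is defined in AITER

def pick_kernel(quant_config, prefix: str) -> str:
    """Sketch of the per-layer dispatch described above."""
    spec = quant_config.get_layer_quant_config(prefix)  # LayerQuantConfig
    if spec.quant_type is QuantType.No:
        return "bf16"  # excluded or unquantized layer: plain path
    # quant_dtype / is_dynamic / quant_method select the quantized kernel
    return f"{spec.quant_type.name}/{spec.quant_dtype}"

# e.g. pick_kernel(config.quant_config, "model.layers.0.mlp.down_proj")
```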
4. Parallel Configuration (ParallelConfig)
Defined in atom/config.py. Controls data parallelism. Environment variables
(Section 8) override defaults when set.
| Field | Type | Default | Description |
|---|---|---|---|
data_parallel_size | int | 1 | Number of data-parallel groups; overridden by ATOM_DP_SIZE env var |
data_parallel_size_local | int | 1 | Number of local data-parallel groups |
data_parallel_rank | int | 0 | Rank within the data-parallel group; overridden by ATOM_DP_RANK |
data_parallel_rank_local | Optional[int] | None | Local rank within the data-parallel group (SPMD mode); overridden by ATOM_DP_RANK_LOCAL |
data_parallel_master_port | int | 29500 | Port used by the data-parallel master for process group initialization |
data_parallel_base_port | int | get_open_port() | Base port for data-parallel communication (dynamically assigned) |
data_parallel_master_ip | str | "127.0.0.1" | IP address of the data-parallel master |
Computed properties:
- world_size -- set during init, equals TP x PP.
- world_size_across_dp -- world_size * data_parallel_size.
5. Speculative Decoding Configuration (SpeculativeConfig)
Defined in atom/config.py. Currently only the Multi-Token Prediction (MTP)
method with num_speculative_tokens=1 is supported.
| Field | Type | Default | Description |
|---|---|---|---|
method | Optional[str] | "" | Speculative decoding method; currently only "mtp" is accepted |
model | Optional[str] | None | Draft model name or path (typically the same as the target model for MTP) |
num_speculative_tokens | Optional[int] | None | Number of speculative tokens per iteration; must be 1 |
draft_model_hf_config | Optional[PretrainedConfig] | None | HuggingFace config for the draft model; auto-loaded from model when None |
5.1 Table-Driven MTP Config
MTP configuration uses two class-level lookup tables to support multiple model families without per-model branching.
_MTP_TYPE_MAP -- maps a base model_type to its MTP model_type:
Base model_type | MTP model_type |
|---|---|
deepseek_v3 | deepseek_mtp |
glm_moe_dsa | deepseek_mtp |
qwen3_next | qwen3_next_mtp |
qwen3_5 | qwen3_5_mtp |
qwen3_5_moe | qwen3_5_mtp |
qwen3_5_text | qwen3_5_mtp |
qwen3_5_moe_text | qwen3_5_mtp |
_MTP_CONFIG -- maps MTP model_type to a (n_predict_attr, architecture) tuple:
MTP model_type | n_predict_attr | Architecture |
|---|---|---|
deepseek_mtp | num_nextn_predict_layers | DeepSeekMTPModel |
qwen3_next_mtp | num_nextn_predict_layers | Qwen3NextMTPModel |
qwen3_5_mtp | mtp_num_hidden_layers | Qwen3_5MTPModel |
5.2 Post-init behaviour (hf_config_override)
The static method hf_config_override applies a two-step transformation to the
draft model's HuggingFace config:
1. Resolve model type -- looks up hf_config.model_type in _MTP_TYPE_MAP. If found, rewrites model_type to the MTP variant (e.g. deepseek_v3 -> deepseek_mtp).
2. Apply MTP overrides -- looks up the (possibly rewritten) model_type in _MTP_CONFIG. If found:
   - Reads n_predict from the model-specific attribute (e.g. num_nextn_predict_layers or mtp_num_hidden_layers), defaulting to 1. Warns and forces it to 1 if the original value differs.
   - Sets n_predict=1, num_nextn_predict_layers=1 (universal across all MTP families), and architectures to the corresponding MTP model class.
   - Qwen3.5 MTP only: additionally injects n_shared_experts=1 and n_routed_experts (read from hf_config.num_experts, default 0) so the MTP module can construct its MoE layer.
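The sketch below paraphrases the whole flow, transcribing the 5.1 lookup tables and the steps above. It is a reconstruction from this description, not ATOM's source.

```python
import warnings

# Lookup tables transcribed from the tables in Section 5.1.
_MTP_TYPE_MAP = {
    "deepseek_v3": "deepseek_mtp",
    "glm_moe_dsa": "deepseek_mtp",
    "qwen3_next": "qwen3_next_mtp",
    "qwen3_5": "qwen3_5_mtp",
    "qwen3_5_moe": "qwen3_5_mtp",
    "qwen3_5_text": "qwen3_5_mtp",
    "qwen3_5_moe_text": "qwen3_5_mtp",
}
_MTP_CONFIG = {
    "deepseek_mtp": ("num_nextn_predict_layers", "DeepSeekMTPModel"),
    "qwen3_next_mtp": ("num_nextn_predict_layers", "Qwen3NextMTPModel"),
    "qwen3_5_mtp": ("mtp_num_hidden_layers", "Qwen3_5MTPModel"),
}

def hf_config_override(hf_config):
    # Step 1: rewrite the base model_type to its MTP variant, if mapped.
    hf_config.model_type = _MTP_TYPE_MAP.get(hf_config.model_type,
                                             hf_config.model_type)
    # Step 2: apply MTP overrides when the (possibly rewritten) type is known.
    if hf_config.model_type in _MTP_CONFIG:
        n_predict_attr, architecture = _MTP_CONFIG[hf_config.model_type]
        if getattr(hf_config, n_predict_attr, 1) != 1:
            warnings.warn(f"{n_predict_attr} != 1; forcing n_predict to 1")
        hf_config.n_predict = 1
        hf_config.num_nextn_predict_layers = 1   # universal across MTP families
        hf_config.architectures = [architecture]
        if hf_config.model_type == "qwen3_5_mtp":
            # Qwen3.5 MTP only: extras so the MTP module can build its MoE layer.
            hf_config.n_shared_experts = 1
            hf_config.n_routed_experts = getattr(hf_config, "num_experts", 0)
    return hf_config
```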
Other post-init steps:
- Loads draft_model_hf_config from model if not provided.
- Extracts text_config from multimodal model configs when present.
- Config.__post_init__ raises ValueError if num_speculative_tokens != 1.
6. Sampling Parameters (SamplingParams)
Defined in atom/sampling_params.py. Passed per-request to control generation.
| Field | Type | Default | Description |
|---|---|---|---|
temperature | float | 1.0 | Sampling temperature; lower values make output more deterministic |
max_tokens | int | 64 | Maximum number of tokens to generate |
ignore_eos | bool | False | Continue generating past the EOS token |
stop_strings | Optional[list[str]] | None | List of strings that trigger generation to stop |
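A typical per-request construction, using the fields from the table above:

```python
from atom.sampling_params import SamplingParams

params = SamplingParams(
    temperature=0.2,        # near-deterministic sampling
    max_tokens=128,
    ignore_eos=False,
    stop_strings=["\n\n"],  # stop at the first blank line
)
```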
7. CLI Arguments (EngineArgs)
Defined in atom/model_engine/arg_utils.py. The EngineArgs dataclass exposes
all flags via add_cli_args() and converts them into a Config via
create_engine().
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --model | | str | "Qwen/Qwen3-0.6B" | Model name or path |
| --trust-remote-code | | flag | False | Trust remote code when loading model |
| --tensor-parallel-size | -tp | int | 1 | Tensor parallel size |
| --data-parallel-size | -dp | int | 1 | Data parallel size |
| --enforce-eager | | flag | False | Enforce eager mode execution |
| --enable_prefix_caching | | flag | False | Enable prefix caching |
| --port | | int | 8006 | Engine internal port |
| --kv_cache_dtype | | str | "bf16" | KV cache dtype; choices: bf16, fp8 |
| --block-size | | int | 16 | KV cache block size (maps to kv_cache_block_size) |
| --max-model-len | | int | None | Maximum model context length; defaults to hf_config.max_position_embeddings |
| --cudagraph-capture-sizes | | str | "[1,2,4,8,16,32,48,64,128,256]" | CUDA graph capture sizes as a Python list string |
| --level | | int | 3 | Compilation level (0 -- 3) |
| --load_dummy | | flag | False | Skip loading model weights |
| --enable-expert-parallel | | flag | False | Enable Expert Parallelism (EP MoE) |
| --torch-profiler-dir | | str | None | Directory for torch profiler traces |
| --enable-dp-attention | | flag | False | Enable DP attention |
| --method | | str | None | Speculative method; choices: mtp |
| --num-speculative-tokens | | int | 1 | Number of speculative tokens per iteration |
| --max-num-batched-tokens | | int | 16384 | Maximum number of tokens to batch in the async engine |
| --max-num-seqs | | int | 512 | Maximum number of sequences to batch together |
| --gpu-memory-utilization | | float | 0.9 | Fraction of GPU memory to use (0.0 -- 1.0) |
| --scheduler-delay-factor | | float | 0.0 | Delay factor multiplied by previous prompt latency before scheduling next prompt |
Example:
```bash
python -m atom.entrypoint \
  --model deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --level 3 \
  --cudagraph-capture-sizes "[1,2,4,8,16,32,64,128,256]" \
  --kv_cache_dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 256
```
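The same run can be sketched programmatically. This assumes EngineArgs field names mirror the CLI flags (dashes become underscores) and that create_engine() builds the Config and engine as described above.

```python
from atom.model_engine.arg_utils import EngineArgs

# Field names assumed to mirror the CLI flags above.
args = EngineArgs(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.92,
    max_num_seqs=256,
)
engine = args.create_engine()  # builds a Config and constructs the engine
```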
8. Environment Variables
8.1 Variables Registered in atom/utils/envs.py
All variables use lazy evaluation. Boolean variables treat "1" as True and
anything else (including unset) as False, unless noted otherwise.
| Variable | Type | Default | Description |
|---|---|---|---|
ATOM_DP_RANK | int | 0 | Data-parallel rank of this process |
ATOM_DP_RANK_LOCAL | int | 0 | Local data-parallel rank (for SPMD mode) |
ATOM_DP_SIZE | int | 1 | Total number of data-parallel groups |
ATOM_DP_MASTER_IP | str | "127.0.0.1" | IP address of the data-parallel master |
ATOM_DP_MASTER_PORT | int | 29500 | Port of the data-parallel master |
ATOM_ENFORCE_EAGER | -- | -- | Removed; use the CLI flag --enforce-eager instead |
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION | bool | False | Enable QK-norm + RoPE + cache + quant fusion; enable for Qwen3-MoE models |
ATOM_USE_TRITON_GEMM | bool | False | Use Triton-based GEMM kernels instead of default backends |
ATOM_USE_TRITON_MXFP4_BMM | bool | False | Use Triton-based MXFP4 batched matrix multiply |
ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION | bool | True | Enable fused input RMSNorm + quantization for DeepSeek models |
ATOM_ENABLE_DS_QKNORM_QUANT_FUSION | bool | True | Enable fused QK-norm + quantization for DeepSeek models |
ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION | bool | True | Enable fused all-reduce + RMSNorm kernel |
ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT | bool | True | Enable AITER Triton fused RMSNorm + quantization for LLaMA models |
ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_SILU_MUL_QUANT | bool | True | Enable AITER Triton fused SiLU + multiply + quantization for LLaMA models |
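Since evaluation is lazy, exporting variables before engine startup is sufficient; a sketch with illustrative values:

```python
import os

# Set before starting the engine; values are read lazily at first use.
os.environ["ATOM_DP_SIZE"] = "2"              # two data-parallel groups
os.environ["ATOM_DP_RANK"] = "0"              # rank of this process
os.environ["ATOM_DP_MASTER_IP"] = "10.0.0.1"  # illustrative address
os.environ["ATOM_USE_TRITON_GEMM"] = "1"      # boolean vars: "1" means True
```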
8.2 Additional Environment Variables (Used Outside envs.py)
| Variable | Type | Default | Where Used | Description |
|---|---|---|---|---|
ATOM_TORCH_PROFILER_DIR | str | None | atom/config.py (Config.torch_profiler_dir) | Directory for PyTorch profiler output; sets the default for Config.torch_profiler_dir |
ATOM_PROFILER_MORE | str | "0" | atom/model_engine/model_runner.py | Set to "1" to enable detailed profiling (record_shapes, with_stack, profile_memory) |
HF_TOKEN | str | None | atom/config.py (get_hf_config) | HuggingFace authentication token for gated model downloads |
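A typical profiling setup sets both variables before launch (the directory path is illustrative):

```python
import os

# Becomes the default for Config.torch_profiler_dir.
os.environ["ATOM_TORCH_PROFILER_DIR"] = "/tmp/atom_traces"
# Enables record_shapes, with_stack, and profile_memory.
os.environ["ATOM_PROFILER_MORE"] = "1"
```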
9. Decision Tree -- Choosing a Compilation Level
```
Start
  |
  v
Is this a debugging / development run?
  |-- Yes --> Level 0 (NO_COMPILATION) or --enforce-eager
  |
  v
Do you need torch.compile but no graph splitting?
  |-- Yes, one-shot compile --> Level 2 (DYNAMO_ONCE)
  |-- Yes, keep Dynamo default --> Level 1 (DYNAMO_AS_IS)
  |
  v
Production inference on ROCm/HIP GPU?
  |-- Yes --> Level 3 (PIECEWISE) [default in EngineArgs]
  |           - Auto-sets CUDAGraphMode.PIECEWISE
  |           - Auto-populates splitting_ops for attention ops
  |           - Pair with --cudagraph-capture-sizes for your batch profile
  |
  v
Need maximum decode throughput?
  |-- Yes --> Level 3 + set cudagraph_mode to FULL_AND_PIECEWISE
              (full graphs for decode, piecewise for prefill)
```
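For the last branch, a configuration sketch (field names come from Sections 1 and 2; whether your ATOM version accepts these exact keyword arguments is an assumption to verify):

```python
from atom.config import Config, CompilationConfig, CompilationLevel, CUDAGraphMode

config = Config(
    model="Qwen/Qwen3-0.6B",
    compilation_config=CompilationConfig(
        level=CompilationLevel.PIECEWISE,                 # level 3
        cudagraph_mode=CUDAGraphMode.FULL_AND_PIECEWISE,  # full decode, piecewise prefill
        cudagraph_capture_sizes=[1, 2, 4, 8, 16, 32, 64, 128],
    ),
)
```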
Rules of thumb:
- Level 3 is the default for EngineArgs and is recommended for most production workloads.
- Level 0 / --enforce-eager is useful for debugging, profiling, or when CUDA graphs are incompatible with your model.
- Match --cudagraph-capture-sizes to your expected batch sizes for optimal memory usage and launch latency.
- When using --enable-dp-attention or Expert Parallelism (--enable-expert-parallel), level 3 is still recommended.
Source Files
| File | Description |
|---|---|
atom/config.py | Config, CompilationConfig, CompilationLevel, CUDAGraphMode, QuantizationConfig, ParallelConfig, SpeculativeConfig, KVCacheTensor, KVCacheConfig, get_hf_config |
atom/utils/envs.py | All ATOM_* environment variable definitions with lazy evaluation |
atom/model_engine/arg_utils.py | EngineArgs dataclass and CLI argument parser |
atom/sampling_params.py | SamplingParams dataclass |
atom/model_engine/model_runner.py | Uses ATOM_PROFILER_MORE and ATOM_TORCH_PROFILER_DIR for profiling |