ATOM Configuration Guide
April 29, 2026
ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine built on AITER kernels for ROCm/HIP GPUs. This guide documents every configuration class, CLI flag, and environment variable that controls ATOM's runtime behaviour.
Quick Reference
| Config Class | Primary Purpose |
|---|---|
Config | Master dataclass -- model path, memory, TP size, scheduler limits, KV cache, profiler, and references to all sub-configs |
CompilationConfig | Compilation level (0-3), CUDA graph capture sizes, piecewise splitting ops, inductor settings |
CompilationLevel | Integer constants for the four compilation levels |
CUDAGraphMode | Enum controlling how CUDA graphs are captured (none / piecewise / full / hybrid) |
QuantizationConfig | Layer-wise quantization orchestrator: global config, per-layer overrides, exclude lists, layer name remapping |
LayerQuantConfig | Per-layer quantization spec (frozen dataclass): quant type, dtype, dynamic flag, method |
ParallelConfig | Data-parallel size, rank, master IP/port |
SpeculativeConfig | Speculative decoding method, draft model, number of speculative tokens |
KVCacheConfig / KVCacheTensor | Per-layer KV cache tensor descriptors (k/v caches and scales) |
SamplingParams | Temperature, max tokens, stop strings, ignore-EOS flag |
EngineArgs | CLI argument parser that builds a Config for LLMEngine |
1. Master Configuration (Config)
Defined in atom/config.py. The root dataclass that the engine consumes.
| Field | Type | Default | Description |
|---|---|---|---|
model | str | (required) | HuggingFace model name or local path |
trust_remote_code | bool | False | Trust remote code when loading the model from HuggingFace |
max_num_batched_tokens | int | 16384 | Maximum number of tokens batched together per scheduler step |
scheduler_delay_factor | float | 0.0 | Multiplicative delay (factor x previous prompt latency) before scheduling the next prompt |
max_num_seqs | int | 512 | Maximum number of sequences batched together |
max_model_len | Optional[int] | None | Maximum context length; defaults to hf_config.max_position_embeddings (capped by it when set) |
gpu_memory_utilization | float | 0.9 | Fraction of GPU memory available for KV cache and weights (0.0 -- 1.0) |
tensor_parallel_size | int | 1 | Number of tensor-parallel GPUs (1 -- 8) |
enforce_eager | bool | False | Disable compilation and CUDA graphs; run in eager mode |
parallel_config | ParallelConfig | ParallelConfig() | Data-parallel configuration (see Section 4) |
kv_cache_block_size | int | 16 | Block size for paged KV cache; must be a multiple of 16 or exactly 1 |
num_kvcache_blocks | int | -1 | Number of KV cache blocks (-1 = auto) |
kv_cache_dtype | str | "bf16" | KV cache data type ("bf16" or "fp8") |
enable_prefix_caching | bool | False | Enable prefix caching to reuse KV blocks across requests sharing the same prefix |
port | int | 8006 | Engine internal communication port |
torch_profiler_dir | Optional[str] | os.getenv("ATOM_TORCH_PROFILER_DIR", None) | Directory for saving PyTorch profiler traces; creates the directory if it does not exist |
compilation_config | CompilationConfig | CompilationConfig() | Compilation and CUDA graph settings (see Section 2) |
quant_config | QuantizationConfig | (auto-detected) | Quantization settings; auto-detected from HuggingFace config during __post_init__ via QuantizationConfig(hf_config) (see Section 3) |
asyncio_mode | bool | False | Enable asyncio-based engine loop |
load_dummy | bool | False | Skip loading model weights (for benchmarking / testing) |
enable_expert_parallel | bool | False | Enable Expert Parallelism for MoE models |
master_addr | str | "127.0.0.1" | Master address for distributed communication |
graph_bs | Optional[list[int]] | None | Explicit list of batch sizes for CUDA graph capture; derived from compilation_config during init |
enable_dp_attention | bool | False | Enable data-parallel attention |
torch_dtype | torch.dtype | (computed) | Inferred from hf_config.torch_dtype; falls back to torch.bfloat16 |
speculative_config | Optional[SpeculativeConfig] | None | Speculative decoding configuration (see Section 5) |
bos_token_id | int | -1 | Beginning-of-sequence token ID (-1 = use model default) |
eos_token_id | int | -1 | End-of-sequence token ID (-1 = use model default) |
stop_token_ids | list[int] | [] | Additional stop token IDs; populated from GenerationConfig.eos_token_id during init |
Auto-derived fields (set in __post_init__ or by ModelRunner.get_num_blocks(), not user-supplied):
| Field | Type | Description |
|---|---|---|
hf_config | PretrainedConfig | Loaded automatically via get_hf_config(model) |
generation_config | GenerationConfig | Loaded automatically via get_generation_config(model) |
per_req_cache_equiv_blocks | int | Number of KV cache block equivalents reserved per request for the per-request stateful-attention cache (currently GDN recurrent state; future stateful attentions plug in via AttentionMetadataBuilder.compute_per_req_cache_bytes()); computed by ModelRunner.get_num_blocks() |
num_per_req_cache_groups | int | Number of per-request slot groups available (= max_num_seqs for stateful-attention models, 0 otherwise); computed by ModelRunner.get_num_blocks() |
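As a usage sketch, the master config can also be built directly in Python. The keyword names below come from the field table above, and everything in the auto-derived table is filled in by __post_init__; whether your ATOM version accepts exactly these keyword arguments is an assumption to verify.

```python
# Minimal sketch: constructing the master Config programmatically.
from atom.config import Config

config = Config(
    model="Qwen/Qwen3-0.6B",     # HuggingFace name or local path (required)
    tensor_parallel_size=1,
    max_num_seqs=256,
    gpu_memory_utilization=0.9,
    kv_cache_dtype="bf16",
    enable_prefix_caching=True,
)
# hf_config, generation_config, torch_dtype, quant_config, etc. are
# derived automatically in __post_init__.
```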
2. Compilation Configuration (CompilationConfig)
Defined in atom/config.py. Controls torch.compile and CUDA graph behaviour.
2.1 Compilation Levels (CompilationLevel)
| Constant | Value | Description |
|---|---|---|
NO_COMPILATION | 0 | No compilation -- pure eager execution |
DYNAMO_AS_IS | 1 | Use torch.compile / TorchDynamo as-is |
DYNAMO_ONCE | 2 | TorchDynamo with a single compilation pass |
PIECEWISE | 3 | Piecewise compilation with CUDA graph capture (recommended for production) |
2.2 CompilationConfig Fields
| Field | Type | Default | Description |
|---|---|---|---|
level | int | 0 | Compilation level (see table above); must be 0 -- 3 |
use_cudagraph | bool | True | Whether to use CUDA graphs |
cudagraph_capture_sizes | Optional[list[int]] | None | Explicit list of batch sizes for CUDA graph capture; overrides cuda_graph_sizes when set |
cuda_graph_sizes | list[int] | [] (post-init: [512]) | CUDA graph sizing strategy: 1 value generates [1,2,4,8] + range(16, N+1, 16); multiple values used as-is; empty defaults to [512] |
debug_dump_path | str | "" | Path to dump debug / compilation information |
cache_dir | str | "" | Directory for compilation caches |
use_inductor | bool | True | Enable TorchInductor backend |
cudagraph_mode | Optional[CUDAGraphMode] | None | CUDA graph capture mode (see below); set to PIECEWISE automatically at level 3 |
splitting_ops | Optional[list[str]] | None | Ops that split the graph into sub-graphs for piecewise compilation; auto-populated at level 3 with ["aiter.unified_attention_with_output", "aiter.mla_attention"] |
cudagraph_copy_inputs | bool | False | Copy input tensors into internally managed buffers before CUDA graph replay; only effective in PIECEWISE mode |
compile_sizes | Optional[list[Union[int, str]]] | None | Sizes to compile for inductor; accepts integers and the string "cudagraph_capture_sizes" |
inductor_compile_config | dict | {} | Additional configuration passed to the inductor backend |
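To make the cuda_graph_sizes strategy concrete, here is a standalone paraphrase of the expansion rule from the table above. This is our re-implementation for illustration, not ATOM's code, and it assumes the post-init default [512] also goes through the single-value expansion.

```python
def expand_cuda_graph_sizes(cuda_graph_sizes: list[int]) -> list[int]:
    """Paraphrase of the documented cuda_graph_sizes strategy."""
    if not cuda_graph_sizes:
        cuda_graph_sizes = [512]      # empty list defaults to [512]
    if len(cuda_graph_sizes) == 1:
        n = cuda_graph_sizes[0]       # one value N expands to a ramp of sizes
        return [1, 2, 4, 8] + list(range(16, n + 1, 16))
    return cuda_graph_sizes           # multiple values are used as-is

# expand_cuda_graph_sizes([64]) == [1, 2, 4, 8, 16, 32, 48, 64]
```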
2.3 CUDA Graph Mode (CUDAGraphMode)
| Mode | Value | Description |
|---|---|---|
NONE | 0 | No CUDA graph capture |
PIECEWISE | 1 | Piecewise CUDA graphs -- attention ops stay outside the graph for flexibility (default at level 3) |
FULL | 2 | Full CUDA graph capture for all batches; best for small models / short prompts |
FULL_DECODE_ONLY | (FULL, NONE) | Full CUDA graphs for decode batches only; mixed prefill-decode runs without graphs (useful in P/D setups) |
FULL_AND_PIECEWISE | (FULL, PIECEWISE) | Full graphs for decode, piecewise for prefill/mixed -- most performant mode for most models |
Helper methods on CUDAGraphMode:
- decode_mode() -- returns the mode used for pure decode batches.
- mixed_mode() -- returns the mode used for mixed prefill-decode batches.
- requires_piecewise_compilation() -- whether the mode needs piecewise compilation.
- has_full_cudagraphs() -- whether the mode includes full CUDA graph capture.
- separate_routine() -- whether decode and mixed batches use different routines.
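An illustrative sketch of how the tuple-valued modes could back these helpers, assuming the (decode, mixed) tuple semantics shown in the table; this is not ATOM's actual enum definition.

```python
from enum import Enum

class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (2, 0)     # (decode, mixed) = (FULL, NONE)
    FULL_AND_PIECEWISE = (2, 1)   # (decode, mixed) = (FULL, PIECEWISE)

    def decode_mode(self) -> "CUDAGraphMode":
        v = self.value
        return CUDAGraphMode(v[0]) if isinstance(v, tuple) else self

    def mixed_mode(self) -> "CUDAGraphMode":
        v = self.value
        return CUDAGraphMode(v[1]) if isinstance(v, tuple) else self

    def separate_routine(self) -> bool:
        # Tuple-valued modes use different routines for decode vs. mixed.
        return isinstance(self.value, tuple)

    def has_full_cudagraphs(self) -> bool:
        return CUDAGraphMode.FULL in (self.decode_mode(), self.mixed_mode())

    def requires_piecewise_compilation(self) -> bool:
        return CUDAGraphMode.PIECEWISE in (self.decode_mode(), self.mixed_mode())
```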
3. Quantization Configuration (QuantizationConfig & LayerQuantConfig)
Defined in atom/config.py and atom/quant_spec.py. The quantization system uses two classes:
- QuantizationConfig -- the top-level orchestrator that holds a global config, per-layer overrides, and exclusion lists.
- LayerQuantConfig -- a frozen dataclass (defined in atom/quant_spec.py) that stores the concrete quantization parameters for a single layer or as the global default. Typed, immutable, with attribute access (e.g., spec.quant_type).
3.1 LayerQuantConfig Fields
LayerQuantConfig is a frozen dataclass. Fields are accessed as typed attributes (e.g., spec.quant_type).
| Field | Type | Default | Description |
|---|---|---|---|
quant_type | QuantType | QuantType.No | Quantization granularity (see below) |
quant_dtype | torch.dtype | torch.bfloat16 | Data type for quantized weights |
is_dynamic | bool | True | Use dynamic quantization (scales computed at runtime) |
quant_method | str | "" | Quantization method (e.g., "quark", "compressed-tensors") |
3.2 QuantizationConfig Attributes
| Attribute | Type | Description |
|---|---|---|
torch_dtype | torch.dtype | The model's default dtype (from hf_config.torch_dtype) |
hf_quant_config | Optional[dict] | Raw quantization_config dict from HuggingFace config |
global_quant_config | LayerQuantConfig | Default quantization spec applied to all layers |
_parsed.layer_pattern_specs | list[tuple[str, LayerQuantConfig]] | Per-layer overrides keyed by layer name pattern (supports fnmatch globs like "*.mlp.*") |
exclude_layers | list[str] | Layer names excluded from quantization (supports exact match and "re:" regex prefix) |
quant_method | str | Top-level quantization method name (e.g., "quark", "compressed-tensors") |
Key methods:
| Method | Description |
|---|---|
get_name() | Returns the quantization method name |
get_layer_quant_config(layer_name) | Returns the LayerQuantConfig for a layer: checks exclusions first, then per-layer overrides, then falls back to global spec |
remap_layer_name(hf_config, packed_modules_mapping) | Remaps layer names for packed/fused modules (e.g., q_a_proj → fused_qkv_a_proj for DeepSeek) |
compute_hash() | Returns a SHA-256 hash of the quantization config for cache invalidation |
3.3 QuantType Values (from AITER)
| Value | Description |
|---|---|
QuantType.No | No quantization |
QuantType.per_Token | Per-token / per-channel quantization |
QuantType.per_1x128 | Block quantization with group size 128 |
QuantType.per_1x32 | Block quantization with group size 32 |
QuantType.per_128x128 | Large 2D block quantization (remapped to per_1x128 in MoE kernels) |
QuantType.per_Tensor | Per-tensor quantization |
3.4 Supported Quantization Dtypes
| Dtype | AITER Key | Notes |
|---|---|---|
| FP8 (E4M3) | "fp8" | 8-bit floating point |
| MXFP4 | "fp4x2" | Microscaling FP4; forces QuantType.per_1x32 |
| INT8 | "i8" | 8-bit integer |
| INT4 | "i4x2" | 4-bit integer (packed) |
3.5 Auto-Detection from HuggingFace
During Config.__post_init__, ATOM constructs QuantizationConfig(hf_config) which
reads hf_config.quantization_config and automatically determines quantization
parameters:
For quark models (quant_method == "quark"):
- Parses the global_quant_config dict via QuarkParser to produce the global LayerQuantConfig.
- Parses each entry in the layer_quant_config dict to produce per-layer overrides.
- Reads the "exclude" list for excluded layers.
- Within each config dict, weight.qscheme determines quant_type ("per_channel" → per_Token, "per_tensor" → per_Tensor, "per_group" → per_1x32) and weight.dtype determines quant_dtype; input_tensors.is_dynamic controls dynamic quantization (defaults to True if absent).
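For orientation, a quark-style quantization_config implied by these parsing rules might look as follows; the concrete field values are illustrative, not taken from a real checkpoint.

```python
# Hypothetical quark-style quantization_config, shaped per the parsing
# rules above; values are examples only.
quantization_config = {
    "quant_method": "quark",
    "global_quant_config": {
        "weight": {"qscheme": "per_channel", "dtype": "fp8_e4m3"},  # -> per_Token, FP8
        "input_tensors": {"is_dynamic": True},
    },
    "layer_quant_config": {
        "*.mlp.*": {
            "weight": {"qscheme": "per_group", "dtype": "fp4"},     # -> per_1x32, MXFP4
        },
    },
    "exclude": ["lm_head"],
}
```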
For other models (compressed-tensors, etc.):
- If quant_method == "compressed-tensors" or channel quantization is detected, sets per_Token.
- If weight_block_size or group_size is found: group size 128 maps to per_1x128, group size 32 maps to per_1x32.
- Otherwise falls back to per_Tensor.
- The dtype is parsed from fields like dtype, weight_dtype, or quant_method, looking for fp8, fp4, mxfp4, int8, int4, or num_bits.
- If activation_scheme is "static", is_dynamic is set to False.
- Excluded layers are read from the "ignore" key.
3.6 Layer-Level Quantization Dispatch
Linear layers, MoE layers, and fused ops call quant_config.get_layer_quant_config(prefix) to obtain the appropriate LayerQuantConfig for their position in the model. This enables mixed-precision quantization where different layers can have different quant types and dtypes (e.g., FP8 for attention, FP4 for MLP).
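A sketch of that dispatch is shown below; the layer prefix is hypothetical, and the AITER import path is an assumption.

```python
from aiter import QuantType  # assumed import path; QuantType is defined in AITER

def pick_kernel(quant_config, prefix: str) -> str:
    """Sketch of the per-layer dispatch described above."""
    spec = quant_config.get_layer_quant_config(prefix)  # LayerQuantConfig
    if spec.quant_type is QuantType.No:
        return "bf16"  # excluded or unquantized layer: plain path
    # quant_dtype / is_dynamic / quant_method select the quantized kernel
    return f"{spec.quant_type.name}/{spec.quant_dtype}"

# e.g. pick_kernel(config.quant_config, "model.layers.0.mlp.down_proj")
```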
4. Parallel Configuration (ParallelConfig)
Defined in atom/config.py. Controls data parallelism. Environment variables
(Section 8) override defaults when set.
| Field | Type | Default | Description |
|---|---|---|---|
data_parallel_size | int | 1 | Number of data-parallel groups; overridden by ATOM_DP_SIZE env var |
data_parallel_size_local | int | 1 | Number of local data-parallel groups |
data_parallel_rank | int | 0 | Rank within the data-parallel group; overridden by ATOM_DP_RANK |
data_parallel_rank_local | Optional[int] | None | Local rank within the data-parallel group (SPMD mode); overridden by ATOM_DP_RANK_LOCAL |
data_parallel_master_port | int | 29500 | Port used by the data-parallel master for process group initialization |
data_parallel_base_port | int | get_open_port() | Base port for data-parallel communication (dynamically assigned) |
data_parallel_master_ip | str | "127.0.0.1" | IP address of the data-parallel master |
Computed properties:
- world_size -- set during init, equals TP x PP.
- world_size_across_dp -- world_size * data_parallel_size.
5. Speculative Decoding Configuration (SpeculativeConfig)
Defined in atom/config.py. Currently only the Multi-Token Prediction (MTP)
method with num_speculative_tokens=1 is supported.
| Field | Type | Default | Description |
|---|---|---|---|
method | Optional[str] | "" | Speculative decoding method; currently only "mtp" is accepted |
model | Optional[str] | None | Draft model name or path (typically the same as the target model for MTP) |
num_speculative_tokens | Optional[int] | None | Number of speculative tokens per iteration; must be 1 |
draft_model_hf_config | Optional[PretrainedConfig] | None | HuggingFace config for the draft model; auto-loaded from model when None |
5.1 Table-Driven MTP Config
MTP configuration uses two class-level lookup tables to support multiple model families without per-model branching.
_MTP_TYPE_MAP -- maps a base model_type to its MTP model_type:
Base model_type | MTP model_type |
|---|---|
deepseek_v3 | deepseek_mtp |
glm_moe_dsa | deepseek_mtp |
qwen3_next | qwen3_next_mtp |
qwen3_5 | qwen3_5_mtp |
qwen3_5_moe | qwen3_5_mtp |
qwen3_5_text | qwen3_5_mtp |
qwen3_5_moe_text | qwen3_5_mtp |
_MTP_CONFIG -- maps MTP model_type to a (n_predict_attr, architecture) tuple:
MTP model_type | n_predict_attr | Architecture |
|---|---|---|
deepseek_mtp | num_nextn_predict_layers | DeepSeekMTPModel |
qwen3_next_mtp | num_nextn_predict_layers | Qwen3NextMTPModel |
qwen3_5_mtp | mtp_num_hidden_layers | Qwen3_5MTPModel |
5.2 Post-init behaviour (hf_config_override)
The static method hf_config_override applies a two-step transformation to the
draft model's HuggingFace config:
1. Resolve model type -- looks up hf_config.model_type in _MTP_TYPE_MAP. If found, rewrites model_type to the MTP variant (e.g. deepseek_v3 -> deepseek_mtp).
2. Apply MTP overrides -- looks up the (possibly rewritten) model_type in _MTP_CONFIG. If found:
   - Reads n_predict from the model-specific attribute (e.g. num_nextn_predict_layers or mtp_num_hidden_layers), defaulting to 1. Warns and forces it to 1 if the original value differs.
   - Sets n_predict=1, num_nextn_predict_layers=1 (universal across all MTP families), and architectures to the corresponding MTP model class.
   - Qwen3.5 MTP only: additionally injects n_shared_experts=1 and n_routed_experts (read from hf_config.num_experts, default 0) so the MTP module can construct its MoE layer.
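The sketch below paraphrases the whole flow, transcribing the 5.1 lookup tables and the steps above. It is a reconstruction from this description, not ATOM's source.

```python
import warnings

# Lookup tables transcribed from the tables in Section 5.1.
_MTP_TYPE_MAP = {
    "deepseek_v3": "deepseek_mtp",
    "glm_moe_dsa": "deepseek_mtp",
    "qwen3_next": "qwen3_next_mtp",
    "qwen3_5": "qwen3_5_mtp",
    "qwen3_5_moe": "qwen3_5_mtp",
    "qwen3_5_text": "qwen3_5_mtp",
    "qwen3_5_moe_text": "qwen3_5_mtp",
}
_MTP_CONFIG = {
    "deepseek_mtp": ("num_nextn_predict_layers", "DeepSeekMTPModel"),
    "qwen3_next_mtp": ("num_nextn_predict_layers", "Qwen3NextMTPModel"),
    "qwen3_5_mtp": ("mtp_num_hidden_layers", "Qwen3_5MTPModel"),
}

def hf_config_override(hf_config):
    # Step 1: rewrite the base model_type to its MTP variant, if mapped.
    hf_config.model_type = _MTP_TYPE_MAP.get(hf_config.model_type,
                                             hf_config.model_type)
    # Step 2: apply MTP overrides when the (possibly rewritten) type is known.
    if hf_config.model_type in _MTP_CONFIG:
        n_predict_attr, architecture = _MTP_CONFIG[hf_config.model_type]
        if getattr(hf_config, n_predict_attr, 1) != 1:
            warnings.warn(f"{n_predict_attr} != 1; forcing n_predict to 1")
        hf_config.n_predict = 1
        hf_config.num_nextn_predict_layers = 1   # universal across MTP families
        hf_config.architectures = [architecture]
        if hf_config.model_type == "qwen3_5_mtp":
            # Qwen3.5 MTP only: extras so the MTP module can build its MoE layer.
            hf_config.n_shared_experts = 1
            hf_config.n_routed_experts = getattr(hf_config, "num_experts", 0)
    return hf_config
```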
Other post-init steps:
- Loads draft_model_hf_config from model if not provided.
- Extracts text_config from multimodal model configs when present.
- Config.__post_init__ raises ValueError if num_speculative_tokens != 1.
6. Sampling Parameters (SamplingParams)
Defined in atom/sampling_params.py. Passed per-request to control generation.
| Field | Type | Default | Description |
|---|---|---|---|
temperature | float | 1.0 | Sampling temperature; lower values make output more deterministic |
max_tokens | int | 64 | Maximum number of tokens to generate |
ignore_eos | bool | False | Continue generating past the EOS token |
stop_strings | Optional[list[str]] | None | List of strings that trigger generation to stop |
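A typical per-request construction, using the fields from the table above:

```python
from atom.sampling_params import SamplingParams

params = SamplingParams(
    temperature=0.2,        # near-deterministic sampling
    max_tokens=128,
    ignore_eos=False,
    stop_strings=["\n\n"],  # stop at the first blank line
)
```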
7. CLI Arguments (EngineArgs)
Defined in atom/model_engine/arg_utils.py. The EngineArgs dataclass exposes
all flags via add_cli_args() and converts them into a Config via
create_engine().
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --model | | str | "Qwen/Qwen3-0.6B" | Model name or path |
| --trust-remote-code | | flag | False | Trust remote code when loading model |
| --tensor-parallel-size | -tp | int | 1 | Tensor parallel size |
| --data-parallel-size | -dp | int | 1 | Data parallel size |
| --enforce-eager | | flag | False | Enforce eager mode execution |
| --enable_prefix_caching | | flag | False | Enable prefix caching |
| --port | | int | 8006 | Engine internal port |
| --kv_cache_dtype | | str | "bf16" | KV cache dtype; choices: bf16, fp8 |
| --block-size | | int | 16 | KV cache block size (maps to kv_cache_block_size) |
| --max-model-len | | int | None | Maximum model context length; defaults to hf_config.max_position_embeddings |
| --cudagraph-capture-sizes | | str | "[1,2,4,8,16,32,48,64,128,256]" | CUDA graph capture sizes as a Python list string |
| --level | | int | 3 | Compilation level (0 -- 3) |
| --load_dummy | | flag | False | Skip loading model weights |
| --enable-expert-parallel | | flag | False | Enable Expert Parallelism (EP MoE) |
| --torch-profiler-dir | | str | None | Directory for torch profiler traces |
| --enable-dp-attention | | flag | False | Enable DP attention |
| --method | | str | None | Speculative method; choices: mtp |
| --num-speculative-tokens | | int | 1 | Number of speculative tokens per iteration |
| --max-num-batched-tokens | | int | 16384 | Maximum number of tokens to batch in the async engine |
| --max-num-seqs | | int | 512 | Maximum number of sequences to batch together |
| --gpu-memory-utilization | | float | 0.9 | Fraction of GPU memory to use (0.0 -- 1.0) |
| --scheduler-delay-factor | | float | 0.0 | Delay factor multiplied by previous prompt latency before scheduling next prompt |
Example:
```bash
python -m atom.entrypoint \
  --model deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --level 3 \
  --cudagraph-capture-sizes "[1,2,4,8,16,32,64,128,256]" \
  --kv_cache_dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 256
```
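The same run can be sketched programmatically. This assumes EngineArgs field names mirror the CLI flags (dashes become underscores) and that create_engine() builds the Config and engine as described above.

```python
from atom.model_engine.arg_utils import EngineArgs

# Field names assumed to mirror the CLI flags above.
args = EngineArgs(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.92,
    max_num_seqs=256,
)
engine = args.create_engine()  # builds a Config and constructs the engine
```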
8. Environment Variables
8.1 Variables Registered in atom/utils/envs.py
All variables use lazy evaluation. Boolean variables treat "1" as True and
anything else (including unset) as False, unless noted otherwise.
| Variable | Type | Default | Description |
|---|---|---|---|
ATOM_DP_RANK | int | 0 | Data-parallel rank of this process |
ATOM_DP_RANK_LOCAL | int | 0 | Local data-parallel rank (for SPMD mode) |
ATOM_DP_SIZE | int | 1 | Total number of data-parallel groups |
ATOM_DP_MASTER_IP | str | "127.0.0.1" | IP address of the data-parallel master |
ATOM_DP_MASTER_PORT | int | 29500 | Port of the data-parallel master |
ATOM_ENFORCE_EAGER | -- | -- | Removed; use the CLI flag --enforce-eager instead |
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION | bool | False | Enable QK-norm + RoPE + cache + quant fusion; enable for Qwen3-MoE models |
ATOM_USE_TRITON_GEMM | bool | False | Use Triton-based GEMM kernels instead of default backends |
ATOM_USE_TRITON_MXFP4_BMM | bool | False | Use Triton-based MXFP4 batched matrix multiply |
ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION | bool | True | Enable fused input RMSNorm + quantization for DeepSeek models |
ATOM_ENABLE_DS_QKNORM_QUANT_FUSION | bool | True | Enable fused QK-norm + quantization for DeepSeek models |
ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION | bool | True | Enable fused all-reduce + RMSNorm kernel |
ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT | bool | True | Enable AITER Triton fused RMSNorm + quantization for LLaMA models |
ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_SILU_MUL_QUANT | bool | True | Enable AITER Triton fused SiLU + multiply + quantization for LLaMA models |
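Since evaluation is lazy, exporting variables before engine startup is sufficient; a sketch with illustrative values:

```python
import os

# Set before starting the engine; values are read lazily at first use.
os.environ["ATOM_DP_SIZE"] = "2"              # two data-parallel groups
os.environ["ATOM_DP_RANK"] = "0"              # rank of this process
os.environ["ATOM_DP_MASTER_IP"] = "10.0.0.1"  # illustrative address
os.environ["ATOM_USE_TRITON_GEMM"] = "1"      # boolean vars: "1" means True
```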
8.2 Additional Environment Variables (Used Outside envs.py)
| Variable | Type | Default | Where Used | Description |
|---|---|---|---|---|
ATOM_TORCH_PROFILER_DIR | str | None | atom/config.py (Config.torch_profiler_dir) | Directory for PyTorch profiler output; sets the default for Config.torch_profiler_dir |
ATOM_PROFILER_MORE | str | "0" | atom/model_engine/model_runner.py | Set to "1" to enable detailed profiling (record_shapes, with_stack, profile_memory) |
HF_TOKEN | str | None | atom/config.py (get_hf_config) | HuggingFace authentication token for gated model downloads |
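A typical profiling setup sets both variables before launch (the directory path is illustrative):

```python
import os

# Becomes the default for Config.torch_profiler_dir.
os.environ["ATOM_TORCH_PROFILER_DIR"] = "/tmp/atom_traces"
# Enables record_shapes, with_stack, and profile_memory.
os.environ["ATOM_PROFILER_MORE"] = "1"
```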
9. Decision Tree -- Choosing a Compilation Level
```
Start
  |
  v
Is this a debugging / development run?
  |-- Yes --> Level 0 (NO_COMPILATION) or --enforce-eager
  |
  v
Do you need torch.compile but no graph splitting?
  |-- Yes, one-shot compile --> Level 2 (DYNAMO_ONCE)
  |-- Yes, keep Dynamo default --> Level 1 (DYNAMO_AS_IS)
  |
  v
Production inference on ROCm/HIP GPU?
  |-- Yes --> Level 3 (PIECEWISE) [default in EngineArgs]
  |           - Auto-sets CUDAGraphMode.PIECEWISE
  |           - Auto-populates splitting_ops for attention ops
  |           - Pair with --cudagraph-capture-sizes for your batch profile
  |
  v
Need maximum decode throughput?
  |-- Yes --> Level 3 + set cudagraph_mode to FULL_AND_PIECEWISE
              (full graphs for decode, piecewise for prefill)
```
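For the last branch, a configuration sketch (field names come from Sections 1 and 2; whether your ATOM version accepts these exact keyword arguments is an assumption to verify):

```python
from atom.config import Config, CompilationConfig, CompilationLevel, CUDAGraphMode

config = Config(
    model="Qwen/Qwen3-0.6B",
    compilation_config=CompilationConfig(
        level=CompilationLevel.PIECEWISE,                 # level 3
        cudagraph_mode=CUDAGraphMode.FULL_AND_PIECEWISE,  # full decode, piecewise prefill
        cudagraph_capture_sizes=[1, 2, 4, 8, 16, 32, 64, 128],
    ),
)
```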
Rules of thumb:
- Level 3 is the default for EngineArgs and is recommended for most production workloads.
- Level 0 / --enforce-eager is useful for debugging, profiling, or when CUDA graphs are incompatible with your model.
- Match --cudagraph-capture-sizes to your expected batch sizes for optimal memory usage and launch latency.
- When using --enable-dp-attention or Expert Parallelism (--enable-expert-parallel), level 3 is still recommended.
Source Files
| File | Description |
|---|---|
atom/config.py | Config, CompilationConfig, CompilationLevel, CUDAGraphMode, QuantizationConfig, ParallelConfig, SpeculativeConfig, KVCacheTensor, KVCacheConfig, get_hf_config |
atom/utils/envs.py | All ATOM_* environment variable definitions with lazy evaluation |
atom/model_engine/arg_utils.py | EngineArgs dataclass and CLI argument parser |
atom/sampling_params.py | SamplingParams dataclass |
atom/model_engine/model_runner.py | Uses ATOM_PROFILER_MORE and ATOM_TORCH_PROFILER_DIR for profiling |