Configurations

June 15, 2026

This page is a reference for all configuration parameters available in AReaL's command-line interface. The parameters are defined as dataclasses and can be set in YAML configuration files or overridden with command-line arguments.

Usage

Configuration files are specified using the --config parameter:

python3 train.py --config path/to/config.yaml

You can override specific parameters from the command line:

python3 train.py --config path/to/config.yaml actor.lr=1e-4 seed=42
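To make the override syntax concrete, the following is a minimal sketch of how dotted `key.subkey=value` arguments could be merged into a nested configuration. It is not AReaL's actual parser; the helper names `parse_value` and `apply_overrides` are hypothetical, and the starting dictionary stands in for values loaded from a YAML file.

```python
import ast
from typing import Any


def parse_value(raw: str) -> Any:
    """Interpret a CLI string as a Python literal when possible, else keep the string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply 'a.b.c=value' overrides to a nested dict in place and return it."""
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config


# Hypothetical stand-in for the contents of path/to/config.yaml.
config = {"seed": 1, "actor": {"lr": 1e-3}}
apply_overrides(config, ["actor.lr=1e-4", "seed=42"])
print(config)
```

Values given on the command line take precedence over the file, which is the behavior the command above relies on.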

For detailed examples, see the experiment configurations in the examples/ directory.

Table of Contents

Core Experiment Configurations

Training Configurations

Inference Configurations

Dataset

System and Cluster Configurations

Logging and Monitoring

Others


(section-base-experiment)=

BaseExperiment Configuration

Base configuration class for all experiment types with common settings.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| memory_profiler | MemoryProfilerConfig \| None | None | Memory snapshot profiler configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |

(section-grpo)=

GRPO Configuration

A placeholder GRPO configuration kept for backward compatibility.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| memory_profiler | MemoryProfilerConfig \| None | None | Memory snapshot profiler configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |
| gconfig | GenerationHyperparameters | Required | - |
| eval_gconfig | GenerationHyperparameters \| None | None | Generation hyperparameters for evaluation. If None, use gconfig. |
| rollout | InferenceEngineConfig | Required | - |
| actor | PPOActorConfig | Required | - |
| ref | PPOActorConfig \| None | None | - |
| critic | PPOCriticConfig \| None | None | - |
| teacher | TeacherConfig \| None | None | Optional teacher model configuration used for on-policy distillation during PPO training. If provided, the actor may be trained to match the teacher in addition to the standard PPO objective. |
| dynamic_bs | boolean | False | Enable dynamic batch sizing in prepare_batch. When True, batch collection stops when (accepted + rejected) >= batch_size, returning only accepted results. This results in variable-sized batches of valid data. |

(section-ppo)=

PPO Configuration

Configuration for Proximal Policy Optimization (PPO) reinforcement learning experiments.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| memory_profiler | MemoryProfilerConfig \| None | None | Memory snapshot profiler configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |
| gconfig | GenerationHyperparameters | Required | - |
| eval_gconfig | GenerationHyperparameters \| None | None | Generation hyperparameters for evaluation. If None, use gconfig. |
| rollout | InferenceEngineConfig | Required | - |
| actor | PPOActorConfig | Required | - |
| ref | PPOActorConfig \| None | None | - |
| critic | PPOCriticConfig \| None | None | - |
| teacher | TeacherConfig \| None | None | Optional teacher model configuration used for on-policy distillation during PPO training. If provided, the actor may be trained to match the teacher in addition to the standard PPO objective. |
| dynamic_bs | boolean | False | Enable dynamic batch sizing in prepare_batch. When True, batch collection stops when (accepted + rejected) >= batch_size, returning only accepted results. This results in variable-sized batches of valid data. |
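The `dynamic_bs` description can be read as a stopping rule on the rollout stream. The sketch below illustrates that rule only; `collect_batch` is a hypothetical helper, `None` is used here as an illustrative marker for a rejected rollout, and the behavior with `dynamic_bs=False` (wait until `batch_size` results are accepted) is an assumption rather than something stated in the table.

```python
from typing import Iterable, Optional


def collect_batch(
    results: Iterable[Optional[str]], batch_size: int, dynamic_bs: bool
) -> list[str]:
    """Collect rollout results; None marks a rejected rollout (illustrative convention)."""
    accepted: list[str] = []
    rejected = 0
    for result in results:
        if result is None:
            rejected += 1
        else:
            accepted.append(result)
        if dynamic_bs:
            # Stop once accepted + rejected reaches batch_size; return accepted only.
            if len(accepted) + rejected >= batch_size:
                break
        elif len(accepted) >= batch_size:
            # Assumed fixed-size behavior: keep going until batch_size are accepted.
            break
    return accepted


stream = ["a", None, "b", "c", None, "d"]
dynamic_batch = collect_batch(stream, batch_size=4, dynamic_bs=True)
fixed_batch = collect_batch(stream, batch_size=4, dynamic_bs=False)
print(dynamic_batch, fixed_batch)
```

With dynamic sizing the batch shrinks when rollouts are rejected, which is why the table describes the resulting batches as variable-sized.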

(section-rw)=

RW Configuration

Configuration for Reward Model (RW) training experiments.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| memory_profiler | MemoryProfilerConfig \| None | None | Memory snapshot profiler configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |
| actor | TrainEngineConfig | Required | - |

(section-sft)=

SFT Configuration

Configuration for Supervised Fine-Tuning (SFT) experiments.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| memory_profiler | MemoryProfilerConfig \| None | None | Memory snapshot profiler configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |
| actor | TrainEngineConfig | Required | - |

(section-fsdp-engine)=

FSDPEngine Configuration

Configuration for Fully Sharded Data Parallel (FSDP) training backend.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| wrap_policy | FSDPWrapPolicy \| None | None | FSDP wrap policy, specifying model layers to wrap. |
| offload_params | boolean | False | Whether to offload FSDP parameters to CPU. |
| memory_efficient_load | boolean | False | Enable memory-efficient model loading. When enabled, model weights are initialized on CPU and only rank 0 loads pretrained weights, which are then broadcast to all ranks after FSDP sharding. This reduces peak GPU memory during initialization for large models. Note: For VLMs, rank 0 broadcast is not used; each rank loads weights independently on CPU. |
| per_layer_optim_step | boolean | False | Run Adam step on GPU by streaming optimizer states layer-by-layer with async prefetching, instead of running on CPU. Optimizer states are automatically managed on CPU by the per-layer wrapper regardless of offload_params setting. Requires optimizer type 'adam' (AdamW). |
| optim_step_prefetch_layers | integer | 1 | Number of layers to prefetch during per-layer optim step. |
| shard_vision_across_sp | boolean | False | Shard vision encoder across SP ranks by image. Only effective when context_parallel_size > 1. |

(section-fsdp-wrap-policy)=

FSDPWrapPolicy

Policy configuration for FSDP model layer wrapping. When the policy is None, FSDP wraps the transformer decoder layers defined by the transformers library.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| transformer_layer_cls_to_wrap | list of string \| None | None | A list of transformer layer names for FSDP to wrap. |

(section-micro-batch)=

MicroBatch Specification

Specification for splitting micro-batches during training.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| n_mbs | integer \| None | 1 | Number of micro-batches when max_tokens_per_mb is None, or the minimum number of micro-batches when max_tokens_per_mb is set. |
| granularity | integer | 1 | Granularity of each micro-batch. Adjacent sequences are grouped by this size when dividing micro-batches. |
| max_tokens_per_mb | integer \| None | None | Maximum tokens per micro-batch for each forward pass. When set, n_mbs becomes the minimum number of micro-batches. |
| n_mbs_divisor | integer | 1 | Divisor for the number of micro-batches. The final number of micro-batches will be adjusted to be divisible by this value. |
| packing_algorithm | string | "ffd" | Sequence packing algorithm for micro-batch allocation. Supported values: 'ffd' (First Fit Decreasing, default), 'kk' (Karmarkar-Karp, better balance but slightly slower). KK is recommended when workload balance across DP ranks is critical (e.g., large-scale RL training with variable-length sequences). Choices: ffd, kk |
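As an illustration of the 'ffd' option, here is a minimal sketch of First Fit Decreasing packing under a `max_tokens_per_mb` budget. It shows the textbook algorithm, not AReaL's implementation: `ffd_pack` is a hypothetical name, and the sketch ignores `n_mbs`, `granularity`, and `n_mbs_divisor`.

```python
def ffd_pack(seq_lens: list[int], max_tokens_per_mb: int) -> list[list[int]]:
    """Group sequence indices into micro-batches using First Fit Decreasing."""
    # Visit sequences from longest to shortest.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins: list[list[int]] = []  # sequence indices per micro-batch
    loads: list[int] = []       # token count per micro-batch
    for idx in order:
        length = seq_lens[idx]
        if length > max_tokens_per_mb:
            raise ValueError(f"sequence {idx} exceeds the token budget")
        # Place the sequence in the first micro-batch with enough room.
        for b, load in enumerate(loads):
            if load + length <= max_tokens_per_mb:
                bins[b].append(idx)
                loads[b] += length
                break
        else:
            # No existing micro-batch fits, so open a new one.
            bins.append([idx])
            loads.append(length)
    return bins


seq_lens = [700, 300, 500, 200, 400]
micro_batches = ffd_pack(seq_lens, max_tokens_per_mb=1000)
print(micro_batches)  # [[0, 1], [2, 4], [3]]
```

FFD is greedy, so the last micro-batch can end up much lighter than the others (200 tokens here versus 1000 and 900). That imbalance is the situation in which the table suggests considering 'kk'.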

(section-norm)=

Norm Configuration

Configuration for reward/advantage normalization.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mean_level | string \| None | "batch" | Mean level for normalization. None for no mean normalization. Choices: batch, group, None |
| mean_leave1out | boolean | False | Whether to use leave-one-out average. |
| std_level | string \| None | "batch" | Standard deviation level for normalization. None for no std normalization. Choices: batch, group, None |
| std_unbiased | boolean | True | Whether to use unbiased standard deviation computation. Defaults to True (changed from False in v0.3.4). |
| eps | float | 1e-05 | The eps when dividing by standard deviation to avoid numerical issues. |
| group_size | integer | 1 | Group size for group-level normalization. |
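The interaction of these fields can be sketched in plain Python. This is one possible reading of the parameters, not AReaL's code: the function names are hypothetical, groups are assumed to be consecutive chunks of `group_size` values, and the standard deviation is assumed to be taken around the ordinary (not leave-one-out) mean.

```python
import math
from typing import Optional


def _partition(n: int, level: Optional[str], group_size: int) -> list[range]:
    """Index ranges over which a statistic is computed for the given level."""
    if level == "group":
        return [range(s, min(s + group_size, n)) for s in range(0, n, group_size)]
    return [range(n)]  # "batch": a single range covering every value


def normalize(values, mean_level="batch", std_level="batch", mean_leave1out=False,
              std_unbiased=True, eps=1e-5, group_size=1):
    n = len(values)
    centered = list(values)
    if mean_level is not None:
        for idxs in _partition(n, mean_level, group_size):
            total = sum(values[i] for i in idxs)
            for i in idxs:
                if mean_leave1out and len(idxs) > 1:
                    # Average of the other members of the group or batch.
                    mean = (total - values[i]) / (len(idxs) - 1)
                else:
                    mean = total / len(idxs)
                centered[i] = values[i] - mean
    out = list(centered)
    if std_level is not None:
        for idxs in _partition(n, std_level, group_size):
            mu = sum(values[i] for i in idxs) / len(idxs)
            # Unbiased estimate divides by (count - 1) when possible.
            denom = len(idxs) - 1 if std_unbiased and len(idxs) > 1 else len(idxs)
            std = math.sqrt(sum((values[i] - mu) ** 2 for i in idxs) / denom)
            for i in idxs:
                out[i] = centered[i] / (std + eps)
    return out


# Group-level mean subtraction with two samples per prompt, no std scaling.
group_centered = normalize([1.0, 0.0, 1.0, 1.0], mean_level="group",
                           std_level=None, group_size=2)
print(group_centered)  # [0.5, -0.5, 0.0, 0.0]
```

In the example, the second group has identical rewards, so its centered values are all zero. Mixing levels, for example a group-level mean with a batch-level standard deviation, is expressed by setting `mean_level` and `std_level` independently.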

(section-optimizer)=

Optimizer Configuration

Configuration for model optimization during training.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | "adam" | Optimizer type. For FSDP Engine, adam_bf16 enables memory-efficient BF16 optimizer states. For Megatron Engine, adam_bf16 requires dtype=bfloat16 and is automatically converted to adam with precision-aware optimizer enabled. Choices: adam, sgd, adam_bf16 |
| lr | float | 0.001 | Learning rate. |
| weight_decay | float | 0.01 | Weight decay. |
| beta1 | float | 0.9 | Adam beta1 parameter. Only effective when type is adam or adam_bf16. |
| beta2 | float | 0.999 | Adam beta2 parameter. Only effective when type is adam or adam_bf16. |
| eps | float | 1e-08 | Adam epsilon parameter. Only effective when type is adam or adam_bf16. |
| min_lr_ratio | float | 0.0 | Minimum learning rate ratio after annealing. |
| lr_scheduler_type | string | "constant" | Learning rate scheduler type. Choices: linear, cosine, constant |
| warmup_steps_proportion | float | 0.001 | Proportion of training steps for warmup. |
| initial_loss_scale | float | 4294967296 | Initial loss scaling factor. |
| min_loss_scale | float | 1.0 | Minimum loss scaling factor. |
| loss_scale_window | float | 5 | Window size for loss scaling adjustment. |
| hysteresis | integer | 2 | Hysteresis (scaling factor) for loss scaling. |
| gradient_clipping | float | 1.0 | Gradient clipping threshold. |
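To show how `lr`, `lr_scheduler_type`, `warmup_steps_proportion`, and `min_lr_ratio` might combine, here is a minimal sketch of a warmup-then-decay schedule. The exact formulas used by the training backends are not given on this page, so the linear warmup shape and the decay toward `lr * min_lr_ratio` are assumptions, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, total_steps, lr=1e-3, lr_scheduler_type="constant",
               warmup_steps_proportion=0.001, min_lr_ratio=0.0):
    """Learning rate at a 0-based step under an assumed warmup + decay schedule."""
    warmup_steps = int(total_steps * warmup_steps_proportion)
    if step < warmup_steps:
        # Assumed linear warmup from lr / warmup_steps up to lr.
        return lr * (step + 1) / warmup_steps
    if lr_scheduler_type == "constant":
        return lr
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if lr_scheduler_type == "linear":
        decay = 1.0 - progress
    elif lr_scheduler_type == "cosine":
        decay = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown scheduler type: {lr_scheduler_type}")
    min_lr = lr * min_lr_ratio
    return min_lr + (lr - min_lr) * decay


# Cosine annealing from 1e-3 down to 10% of the peak over 100 steps, no warmup.
schedule = [lr_at_step(s, 100, lr_scheduler_type="cosine",
                       warmup_steps_proportion=0.0, min_lr_ratio=0.1)
            for s in (0, 50, 100)]
print(schedule)
```

Under these assumptions `min_lr_ratio` only matters for the linear and cosine schedulers, since the constant scheduler never anneals.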

(section-ppo-actor)=

PPOActor Configuration

Configuration for PPO actor model, a subclass of a TrainEngine.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | - |
| trial_name | string | Required | - |
| path | string | "" | Path to the Hugging Face checkpoint. |
| attn_impl | string | "flash_attention_2" | Attention implementation for the Hugging Face transformers model. Accepts builtin transformers backends or a Hugging Face kernels repo ID formatted as org/repo[@revision][:entrypoint]. Choices: eager, sdpa, flash_attention_2, flash_attention_3, flex_attention |
| use_kernels | boolean | False | Enable Hugging Face kernels model kernelization after model creation. |
| init_from_scratch | boolean | False | Initialize model weights randomly. |
| is_critic | boolean | False | Whether to use a critic/reward model. |
| temperature | float | 1.0 | Temperature during generation. |
| mb_spec | MicroBatchSpec | Required | - |
| pad_to_maximum | boolean | False | Whether to pad each microbatch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training. |
| disable_dropout | boolean | False | Disable dropout layers during training. |
| gradient_checkpointing | boolean | False | Enable gradient checkpointing. |
| dtype | string | "bfloat16" | Forward/backward compute dtype. |
| grad_reduce_dtype | string | "float32" | Gradient reduction data type. |
| optimizer_dtype | string | "float32" | Underlying parameter storage dtype, also the dtype of optimizer states (exp_avg, exp_avg_sq) since torch.optim.AdamW inherits dtype from model.parameters(). Default 'float32' maintains fp32 master weights matching DeepSpeed ZeRO-3 and Megatron precision-aware optimizer behavior. FSDP2's MixedPrecisionPolicy(param_dtype=dtype) will still cast forward/backward computation to dtype (e.g. bfloat16). Set to 'bfloat16' together with optimizer.type='adam_bf16' to reduce memory at the cost of needing Kahan summation for stability. Currently FSDP-only; Megatron uses use_precision_aware_optimizer instead and ignores this field. |
| optimizer | OptimizerConfig \| None | None | Optimizer configuration. None means no training. |
| weight_update_mode | string | "xccl" | Weight update backend type. Choices: disk, xccl |
| fsdp | FSDPEngineConfig | Required | - |
| archon | ArchonEngineConfig | Required | - |
| megatron | MegatronEngineConfig | Required | - |
| offload | boolean | False | Whether to offload model parameters and optimizer states to CPU. |
| use_lora | boolean | False | Whether to use LoRA. Only supported with FSDP. Note that it should be enabled together with vLLM/SGLang. |
| lora_rank | integer | 32 | LoRA rank. |
| lora_alpha | integer | 16 | LoRA alpha. |
| target_modules | list of string | Required | LoRA target modules. |
| peft_type | string | "lora" | PEFT method type. Only LoRA is supported for now. |
| enable_tree_training | boolean | False | Enable tree training with flex attention module. |
| scheduling_spec | tuple | Required | Train engine schedule specs. Accepts 1 or 2 SchedulingSpec values: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second is for the engine. Currently only used by the TrainController. |
| backend | string | Required | Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'fsdp:d4', 'megatron:d4t2p2', 'archon:d2'. Required. |
| _version | string | "v1" | Train controller implementation version. Use 'v1' for legacy TrainController, 'v2' for GatewayTrainController. Choices: v1, v2 |
| admin_api_key | string | "areal-admin-key" | Admin API key used by gateway/router/data-proxy in controller v2. |
| log_level | string | "warning" | Gateway stack log level for controller v2. |
| request_timeout | float | 3600.0 | Gateway request timeout in seconds for controller v2. |
| setup_timeout | float | 3600.0 | Gateway setup timeout in seconds for controller v2. |
| workers_ready_timeout | float | 30.0 | Timeout (seconds) for initialize() to wait for guards to be ready. |
| scheduling_strategy | SchedulingStrategy | Required | The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController. |
| ppo_n_minibatches | integer | 4 | Number of minibatches for each PPO update. |
| eps_clip | float | 0.2 | Clipping factor for policy ratio. |
| eps_clip_higher | float \| None | None | Clipping factor (higher value) for policy ratio. Default is None. When eps_clip_higher is set (decoupled), eps_clip will be used as the lower value. |
| c_clip | float \| None | None | Dual clipping factor for policy ratio, must be > 1.0. None disables dual clipping. |
| m2_threshold | float \| None | None | The second-moment threshold for M2PO. |
| reward_norm | NormConfig \| None | None | Normalization configuration for rewards. |
| reward_scaling | float | 1.0 | Reward scaling factor. |
| reward_bias | float | 0.0 | Reward bias. |
| reward_clip | float | 20.0 | Maximum absolute value for reward clipping. |
| overlong_reward_penalty | boolean | False | Penalty for overlong sequences. Used within DAPO. |
| overlong_tokens | integer \| None | None | Number of tokens in the tail that will receive a penalty. |
| overlong_penalty_factor | float \| None | None | Penalty factor for tokens in the tail. |
| mask_no_eos_with_zero | boolean | False | Mask truncated generations (no EOS token) and exclude them from training. |
| discount | float | 1.0 | Discount factor for future rewards. |
| gae_lambda | float | 1.0 | Lambda parameter for GAE. |
| adv_norm | NormConfig \| None | None | Normalization configuration for advantages. |
| kl_ctl | float | 0.1 | KL divergence coefficient. |
| kl_estimator | string | "k1" | KL divergence estimator. Choices: k1, k2, k3 |
| use_sapo_loss | boolean | False | Use SAPO loss (mutually exclusive with PPO clipping). |
| sapo_tau_pos | float | 1.0 | SAPO temperature for positive advantages. |
| sapo_tau_neg | float | 1.05 | SAPO temperature for negative advantages. |
| recompute_logprob | boolean | False | Recompute log probability and replace the log probability returned by inference. |
| use_decoupled_loss | boolean | False | Use the decoupled loss. Implicitly enables recompute_logprob. |
| rejection_sampling | RejectionSamplingConfig \| None | None | Rejection sampling configuration for filtering stale samples. None disables filtering (equivalent to old behave_imp_weight_mode='disabled'). Only effective when use_decoupled_loss=True. |
| importance_sampling_level | string | "token" | Level at which to compute importance sampling ratios. 'token': per-token ratios (standard PPO). 'sequence': sequence-level geometric mean of per-token ratios (GSPO). Choices: token, sequence |
| prox_logp_method | string | "recompute" | Method for computing proximal policy log-probabilities in decoupled PPO. Only effective when use_decoupled_loss=True. Options: 'recompute' (default): Standard decoupled PPO, recompute proximal policy via forward pass. 'loglinear': Use log-linear interpolation to approximate proximal policy (skip forward pass). 'metrics': Like 'recompute', but also compute approximation metrics for evaluation. Choices: recompute, loglinear, metrics |
| log_agent_stats | boolean | False | Log statistics for agent trajectories. |
| log_agent_stats_keys | list of string | Required | Keys for logging agent trajectory statistics. |
| max_new_tokens | integer | 1024 | Maximum number of new tokens to generate. |
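The three clipping parameters `eps_clip`, `eps_clip_higher`, and `c_clip` can be illustrated with the standard per-token clipped PPO loss and its dual-clip extension. The sketch below follows the textbook formulation under the descriptions above; it is not taken from AReaL's source, and `ppo_token_loss` is a hypothetical name.

```python
import math


def ppo_token_loss(logp_new, logp_old, advantage,
                   eps_clip=0.2, eps_clip_higher=None, c_clip=None):
    """Per-token clipped PPO policy loss (lower is better)."""
    ratio = math.exp(logp_new - logp_old)
    lower = 1.0 - eps_clip
    # With eps_clip_higher set, the upper bound is decoupled from the lower one.
    upper = 1.0 + (eps_clip_higher if eps_clip_higher is not None else eps_clip)
    clipped_ratio = min(max(ratio, lower), upper)
    # Pessimistic choice between the unclipped and clipped surrogate losses.
    loss = max(-advantage * ratio, -advantage * clipped_ratio)
    if c_clip is not None and advantage < 0:
        # Dual clip: bound the loss for negative advantages by c_clip * |advantage|.
        loss = min(loss, -advantage * c_clip)
    return loss


symmetric = ppo_token_loss(math.log(2.0), 0.0, advantage=1.0)
decoupled = ppo_token_loss(math.log(2.0), 0.0, advantage=1.0, eps_clip_higher=0.5)
dual = ppo_token_loss(math.log(10.0), 0.0, advantage=-1.0, c_clip=3.0)
print(symmetric, decoupled, dual)
```

With a ratio of 2 and a positive advantage, the symmetric setting caps the ratio at 1.2 while `eps_clip_higher=0.5` allows up to 1.5. With a ratio of 10 and a negative advantage, `c_clip=3.0` limits the loss to 3 instead of 10.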

(section-ppo-critic)=

PPOCritic Configuration

Configuration for PPO critic model, a subclass of a TrainEngine.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | - |
| trial_name | string | Required | - |
| path | string | "" | Path to the Hugging Face checkpoint. |
| attn_impl | string | "flash_attention_2" | Attention implementation for the Hugging Face transformers model. Accepts builtin transformers backends or a Hugging Face kernels repo ID formatted as org/repo[@revision][:entrypoint]. Choices: eager, sdpa, flash_attention_2, flash_attention_3, flex_attention |
| use_kernels | boolean | False | Enable Hugging Face kernels model kernelization after model creation. |
| init_from_scratch | boolean | False | Initialize model weights randomly. |
| is_critic | boolean | False | Whether to use a critic/reward model. |
| temperature | float | 1.0 | Temperature during generation. |
| mb_spec | MicroBatchSpec | Required | - |
| pad_to_maximum | boolean | False | Whether to pad each microbatch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training. |
| disable_dropout | boolean | False | Disable dropout layers during training. |
| gradient_checkpointing | boolean | False | Enable gradient checkpointing. |
| dtype | string | "bfloat16" | Forward/backward compute dtype. |
| grad_reduce_dtype | string | "float32" | Gradient reduction data type. |
| optimizer_dtype | string | "float32" | Underlying parameter storage dtype, also the dtype of optimizer states (exp_avg, exp_avg_sq) since torch.optim.AdamW inherits dtype from model.parameters(). Default 'float32' maintains fp32 master weights matching DeepSpeed ZeRO-3 and Megatron precision-aware optimizer behavior. FSDP2's MixedPrecisionPolicy(param_dtype=dtype) will still cast forward/backward computation to dtype (e.g. bfloat16). Set to 'bfloat16' together with optimizer.type='adam_bf16' to reduce memory at the cost of needing Kahan summation for stability. Currently FSDP-only; Megatron uses use_precision_aware_optimizer instead and ignores this field. |
| optimizer | OptimizerConfig \| None | None | Optimizer configuration. None means no training. |
| weight_update_mode | string | "xccl" | Weight update backend type. Choices: disk, xccl |
| fsdp | FSDPEngineConfig | Required | - |
| archon | ArchonEngineConfig | Required | - |
| megatron | MegatronEngineConfig | Required | - |
| offload | boolean | False | Whether to offload model parameters and optimizer states to CPU. |
| use_lora | boolean | False | Whether to use LoRA. Only supported with FSDP. Note that it should be enabled together with vLLM/SGLang. |
| lora_rank | integer | 32 | LoRA rank. |
| lora_alpha | integer | 16 | LoRA alpha. |
| target_modules | list of string | Required | LoRA target modules. |
| peft_type | string | "lora" | PEFT method type. Only LoRA is supported for now. |
| enable_tree_training | boolean | False | Enable tree training with flex attention module. |
| scheduling_spec | tuple | Required | Train engine schedule specs. Accepts 1 or 2 SchedulingSpec values: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second is for the engine. Currently only used by the TrainController. |
| backend | string | Required | Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'fsdp:d4', 'megatron:d4t2p2', 'archon:d2'. Required. |
| _version | string | "v1" | Train controller implementation version. Use 'v1' for legacy TrainController, 'v2' for GatewayTrainController. Choices: v1, v2 |
| admin_api_key | string | "areal-admin-key" | Admin API key used by gateway/router/data-proxy in controller v2. |
| log_level | string | "warning" | Gateway stack log level for controller v2. |
| request_timeout | float | 3600.0 | Gateway request timeout in seconds for controller v2. |
| setup_timeout | float | 3600.0 | Gateway setup timeout in seconds for controller v2. |
| workers_ready_timeout | float | 30.0 | Timeout (seconds) for initialize() to wait for guards to be ready. |
| scheduling_strategy | SchedulingStrategy | Required | The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController. |
| ppo_n_minibatches | integer | 4 | Number of minibatches for each PPO update. |
| eps_clip | float | 0.5 | Clipping factor for value loss. |
| mask_no_eos_with_zero | boolean | False | Mask truncated generations (no EOS token) and exclude them from training. |

(section-train-engine)=

TrainEngine Configuration

Core configuration for model training, including optimization and backend settings.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | - |
| trial_name | string | Required | - |
| path | string | "" | Path to the Hugging Face checkpoint. |
| attn_impl | string | "flash_attention_2" | Attention implementation for the Hugging Face transformers model. Accepts builtin transformers backends or a Hugging Face kernels repo ID formatted as org/repo[@revision][:entrypoint]. Choices: eager, sdpa, flash_attention_2, flash_attention_3, flex_attention |
| use_kernels | boolean | False | Enable Hugging Face kernels model kernelization after model creation. |
| init_from_scratch | boolean | False | Initialize model weights randomly. |
| is_critic | boolean | False | Whether to use a critic/reward model. |
| temperature | float | 1.0 | Temperature during generation. |
| mb_spec | MicroBatchSpec | Required | - |
| pad_to_maximum | boolean | False | Whether to pad each microbatch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training. |
| disable_dropout | boolean | False | Disable dropout layers during training. |
| gradient_checkpointing | boolean | False | Enable gradient checkpointing. |
| dtype | string | "bfloat16" | Forward/backward compute dtype. |
| grad_reduce_dtype | string | "float32" | Gradient reduction data type. |
| optimizer_dtype | string | "float32" | Underlying parameter storage dtype, also the dtype of optimizer states (exp_avg, exp_avg_sq) since torch.optim.AdamW inherits dtype from model.parameters(). Default 'float32' maintains fp32 master weights matching DeepSpeed ZeRO-3 and Megatron precision-aware optimizer behavior. FSDP2's MixedPrecisionPolicy(param_dtype=dtype) will still cast forward/backward computation to dtype (e.g. bfloat16). Set to 'bfloat16' together with optimizer.type='adam_bf16' to reduce memory at the cost of needing Kahan summation for stability. Currently FSDP-only; Megatron uses use_precision_aware_optimizer instead and ignores this field. |
| optimizer | OptimizerConfig \| None | None | Optimizer configuration. None means no training. |
| weight_update_mode | string | "xccl" | Weight update backend type. Choices: disk, xccl |
| fsdp | FSDPEngineConfig | Required | - |
| archon | ArchonEngineConfig | Required | - |
| megatron | MegatronEngineConfig | Required | - |
| offload | boolean | False | Whether to offload model parameters and optimizer states to CPU. |
| use_lora | boolean | False | Whether to use LoRA. Only supported with FSDP. Note that it should be enabled together with vLLM/SGLang. |
| lora_rank | integer | 32 | LoRA rank. |
| lora_alpha | integer | 16 | LoRA alpha. |
| target_modules | list of string | Required | LoRA target modules. |
| peft_type | string | "lora" | PEFT method type. Only LoRA is supported for now. |
| enable_tree_training | boolean | False | Enable tree training with flex attention module. |
| scheduling_spec | tuple | Required | Train engine schedule specs. Accepts 1 or 2 SchedulingSpec values: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second is for the engine. Currently only used by the TrainController. |
| backend | string | Required | Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'fsdp:d4', 'megatron:d4t2p2', 'archon:d2'. Required. |
| _version | string | "v1" | Train controller implementation version. Use 'v1' for legacy TrainController, 'v2' for GatewayTrainController. Choices: v1, v2 |
| admin_api_key | string | "areal-admin-key" | Admin API key used by gateway/router/data-proxy in controller v2. |
| log_level | string | "warning" | Gateway stack log level for controller v2. |
| request_timeout | float | 3600.0 | Gateway request timeout in seconds for controller v2. |
| setup_timeout | float | 3600.0 | Gateway setup timeout in seconds for controller v2. |
| workers_ready_timeout | float | 30.0 | Timeout (seconds) for initialize() to wait for guards to be ready. |
| scheduling_strategy | SchedulingStrategy | Required | The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController. |
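The `backend` strings shown in the table, such as 'fsdp:d4' and 'megatron:d4t2p2', pair a backend prefix with a compact parallelism spec. The sketch below parses only that simple letter-plus-degree form. It is an illustration, not AReaL's parser: `parse_backend` is a hypothetical name, the real grammar may accept more than this, and reading `d`, `t`, and `p` as data, tensor, and pipeline parallel degrees is an assumption based on common usage.

```python
import re


def parse_backend(spec: str) -> tuple[str, dict[str, int]]:
    """Split 'backend:d4t2p2' into ('backend', {'d': 4, 't': 2, 'p': 2})."""
    backend, sep, parallel = spec.partition(":")
    if not sep or not backend:
        raise ValueError(f"missing explicit backend prefix in {spec!r}")
    # Each dimension is one lowercase letter followed by its degree.
    dims = re.findall(r"([a-z])(\d+)", parallel)
    # Reject specs with leftover characters the simple pattern cannot explain.
    if not dims or "".join(k + v for k, v in dims) != parallel:
        raise ValueError(f"unsupported parallelism spec in {spec!r}")
    return backend, {key: int(value) for key, value in dims}


megatron = parse_backend("megatron:d4t2p2")
fsdp = parse_backend("fsdp:d4")
print(megatron, fsdp)
```

A string without a prefix, such as 'd4', is rejected, which mirrors the table's requirement that the backend prefix be explicit.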

(section-generation-hyperparameters)=

GenerationHyperparameters

Controls text generation behavior for rollout.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| n_samples | integer | 1 | Number of sequences to generate per prompt. |
| max_new_tokens | integer | 16384 | Maximum number of tokens to generate. |
| min_new_tokens | integer | 0 | Minimum number of tokens to generate. |
| max_tokens | integer | 32768 | Maximum number of tokens including prompt and generated tokens. |
| greedy | boolean | False | Whether to use greedy decoding (max probability). |
| top_p | float | 1.0 | Nucleus sampling probability threshold (0.0, 1.0]. |
| top_k | integer | 100000000 | Number of highest probability tokens to consider. |
| temperature | float | 1.0 | Sampling temperature. Higher values increase diversity. |
| stop_token_ids | list of integer | Required | Stop generation when encountering these token IDs. |
| ignore_eos | boolean | False | Do not stop generation when EOS is encountered. |
| skip_special_tokens | boolean | True | Skip special tokens when decoding/displaying outputs. |
| stop | list of string \| None | None | One or multiple stop words. Generation will stop if one of these words is sampled. |
| frequency_penalty | float | 0.0 | Penalizes tokens based on their frequency in generation so far. Must be between -2 and 2 where negative numbers encourage repetition. |
| lora_name | string | "default_lora" | LoRA name to be used for this generation. |
| use_beam_search | boolean | False | Enable beam search in the vLLM engine. When enabled, sampling parameters like temperature, top-p, and top-k are automatically ignored. |
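For readers unfamiliar with `top_k` and `top_p`, the following sketch shows the usual filtering step on a toy distribution. Inference backends apply this to logits tensors and their details can differ, so treat this as a conceptual illustration; `filter_top_k_top_p` is a hypothetical name.

```python
def filter_top_k_top_p(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p mass."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept: list[tuple[str, float]] = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.7)
unfiltered = filter_top_k_top_p(probs, top_k=100000000, top_p=1.0)
print(filtered)
```

With the defaults from the table (`top_k=100000000`, `top_p=1.0`) no token is removed, so sampling is controlled by `temperature` alone.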

(section-inference-engine)=

InferenceEngine Configuration

Configuration for inference servers, including off-policyness control.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `experiment_name` | string \| None | `None` | - |
| `trial_name` | string \| None | `None` | - |
| `fileroot` | string \| None | `None` | Root directory for logs and trajectory dumps. |
| `max_concurrent_rollouts` | integer \| None | `None` | Maximum number of concurrent rollouts to the inference engine. Defaults to consumer_batch_size. |
| `queue_size` | integer \| None | `None` | Input/Output queue size for async rollout. |
| `consumer_batch_size` | integer | `1` | Batch size for consuming rollouts from the queue. |
| `max_head_offpolicyness` | integer | `0` | Maximum off-policyness for the head. If the current version is more than this many versions behind, the request will not be accepted. |
| `enable_rollout_tracing` | boolean | `False` | Whether to output verbose tracing messages for each generation request. |
| `check_trajectory_format` | boolean | `False` | Whether to check the format of produced trajectories of a customized workflow. Useful when debugging the workflow in isolation. Should be False during RL training. |
| `tokenizer_path` | string | `""` | Path to tokenizer for trajectory text decoding. |
| `dump_to_file` | boolean | `False` | Whether to dump the trajectories to files under fileroot. |
| `setup_timeout` | float | `300.0` | Timeout in seconds of connecting to remote servers or launching local servers. |
| `workers_ready_timeout` | float | `30.0` | Timeout (seconds) for initialize() to wait for guards to be ready. |
| `request_timeout` | float | `3600` | Timeout for HTTP requests. |
| `request_retries` | integer | `3` | Number of retries for failed requests. |
| `pause_grace_period` | float | `0.0` | The grace period after calling /pause_generation. Wait until all requests have been dropped. |
| `scheduling_spec` | tuple | Required | Inference engine schedule specs. Can accept 1 or 2 SchedulingSpec: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second is for the engine. Currently only used by the RolloutController. |
| `backend` | string | Required | Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'sglang:d4', 'vllm:d2t4'. Required. |
| `scheduling_strategy` | SchedulingStrategy | Required | The scheduling strategy of this InferenceEngine, either separation or colocation. Currently only used by the RolloutController. |
| `use_lora` | boolean | `False` | Whether to use LoRA. Should match the actor's LoRA option. |
| `lora_name` | string | `""` | LoRA adapter name the rollout backend serves. Generation requests select the adapter by this name (plus the weight version). Usually left empty and auto-filled from gconfig.lora_name by `PPOConfig.__post_init__` so load and request sides stay in sync. |
| `agent` | AgentConfig | Required | Agent workflow configuration used by inference-service rollouts. |
| `return_routed_experts` | boolean | `False` | Return routed expert indices for MoE models. Effective only when using SGLang engine with MoE models. |
| `_version` | string | `"v1"` | Rollout controller implementation version. Use 'v1' for legacy RolloutController, 'v2' for RolloutControllerV2. Choices: v1, v2 |
| `model` | string | `"default"` | Model name exposed through the inference-service gateway. |
| `routing_strategy` | string | `"round_robin"` | Routing strategy for the inference-service router. |
| `poll_interval` | float | `5.0` | Health-poll interval in seconds for the inference-service router. |
| `admin_api_key` | string | `"areal-admin-key"` | Admin API key used by the inference-service gateway, router, and data proxies. |
| `api_url` | string \| None | `None` | External OpenAI-compatible base URL for inference-service external model mode. |
| `provider_api_key` | string \| None | `None` | API key for the external OpenAI-compatible provider. |
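
A minimal sketch of an inference engine block, assuming it lives under a top-level `rollout` key (as the `rollout.backend` reference in the `allocation_mode` description suggests). The numbers are placeholders, not tuned values.

```yaml
# Illustrative rollout settings; the `rollout` key is inferred from `rollout.backend`.
rollout:
  backend: "sglang:d4"          # an explicit backend prefix is required
  max_concurrent_rollouts: 64   # falls back to consumer_batch_size when unset
  consumer_batch_size: 16
  max_head_offpolicyness: 2     # requests more than 2 versions behind are not accepted
  request_timeout: 3600
  request_retries: 3
```

With `max_head_offpolicyness` left at its default of 0, no off-policy lag is tolerated; raising it trades freshness for rollout throughput.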

(section-sg-lang)=

SGLang Configuration

Configuration for SGLang runtime. Refer to https://github.com/sgl-project/sglang for detailed documentation.

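| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_path` | string | `""` | - |
| `random_seed` | integer | `1` | - |
| `skip_tokenizer_init` | boolean | `False` | - |
| `disable_cuda_graph` | boolean | `False` | - |
| `disable_radix_cache` | boolean | `True` | - |
| `disable_cuda_graph_padding` | boolean | `False` | - |
| `enable_nccl_nvls` | boolean | `False` | - |
| `disable_outlines_disk_cache` | boolean | `False` | - |
| `disable_custom_all_reduce` | boolean | `False` | - |
| `disable_overlap_schedule` | boolean | `False` | - |
| `enable_mixed_chunk` | boolean | `False` | - |
| `enable_dp_attention` | boolean | `False` | - |
| `enable_ep_moe` | boolean | `False` | - |
| `enable_torch_compile` | boolean | `False` | - |
| `torch_compile_max_bs` | integer | `32` | - |
| `cuda_graph_max_bs` | integer \| None | `None` | - |
| `cuda_graph_bs` | list of integer \| None | `None` | - |
| `torchao_config` | string | `""` | - |
| `enable_nan_detection` | boolean | `False` | - |
| `enable_p2p_check` | boolean | `False` | - |
| `triton_attention_reduce_in_fp32` | boolean | `False` | - |
| `triton_attention_num_kv_splits` | integer | `8` | - |
| `num_continuous_decode_steps` | integer | `1` | - |
| `enable_memory_saver` | boolean | `False` | - |
| `allow_auto_truncate` | boolean | `False` | - |
| `attention_backend` | string \| None | `"fa3"` | - |
| `enable_multimodal` | boolean | `False` | - |
| `sampling_backend` | string \| None | `None` | - |
| `context_length` | integer \| None | `32768` | - |
| `mem_fraction_static` | float \| None | `0.9` | - |
| `max_running_requests` | integer \| None | `None` | - |
| `chunked_prefill_size` | integer \| None | `-1` | - |
| `max_prefill_tokens` | integer | `32768` | - |
| `schedule_policy` | string | `"lpm"` | - |
| `schedule_conservativeness` | float | `1.0` | - |
| `cpu_offload_gb` | integer | `0` | - |
| `dtype` | string | `"bfloat16"` | - |
| `kv_cache_dtype` | string | `"auto"` | - |
| `dp_size` | integer | `1` | - |
| `ep_size` | integer | `1` | - |
| `enable_lora` | boolean \| None | `None` | - |
| `max_lora_rank` | integer \| None | `None` | - |
| `max_loaded_loras` | integer | `8` | - |
| `lora_paths` | list of string \| None | `None` | - |
| `lora_backend` | string | `"triton"` | - |
| `log_level` | string | `"warning"` | - |
| `log_level_http` | string \| None | `"warning"` | - |
| `log_requests` | boolean | `False` | - |
| `log_requests_level` | integer | `0` | - |
| `show_time_cost` | boolean | `False` | - |
| `enable_metrics` | boolean | `True` | - |
| `decode_log_interval` | integer | `1` | - |
| `enable_multithread_load` | boolean | `False` | - |
| `enable_return_routed_experts` | boolean | `False` | - |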
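
For orientation, a small `sglang` block could look like the sketch below. It assumes the top-level key is `sglang` (matching the `sglang` field of the experiment configurations); the model path is a placeholder.

```yaml
# Illustrative SGLang overrides; unlisted fields keep the table defaults.
sglang:
  model_path: /path/to/model   # hypothetical path
  dtype: bfloat16
  context_length: 32768
  mem_fraction_static: 0.8     # lowered from the 0.9 default as an example
  disable_radix_cache: true
```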

(section-v-llm)=

vLLM Configuration

Configuration for vLLM runtime. Refer to https://docs.vllm.ai/en/stable/api/index.html for detailed documentation.

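| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | string | `""` | - |
| `seed` | integer | `1` | - |
| `skip_tokenizer_init` | boolean | `False` | - |
| `enforce_eager` | boolean | `False` | - |
| `dtype` | string | `"bfloat16"` | - |
| `distributed_executor_backend` | string | `"mp"` | - |
| `max_num_seqs` | integer | `256` | - |
| `block_size` | integer | `16` | - |
| `cpu_offload_gb` | float | `0` | - |
| `disable_sliding_window` | boolean | `True` | - |
| `max_model_len` | integer \| None | `32768` | - |
| `no_enable_chunked_prefill` | boolean | `False` | - |
| `no_enable_prefix_caching` | boolean | `True` | - |
| `gpu_memory_utilization` | float | `0.9` | - |
| `worker_extension_cls` | string | `"areal.engine.vllm_ext.vllm_worker_extension.VLLMWorkerExtension"` | - |
| `enable_sleep_mode` | boolean | `False` | - |
| `uvicorn_log_level` | string | `"warning"` | - |
| `gdn_prefill_backend` | string \| None | `None` | GDN prefill backend for hybrid models like Qwen3.5. Choices: triton, flashinfer |
| `enable_lora` | boolean | `False` | - |
| `max_lora_rank` | integer | `16` | - |
| `max_loras` | integer | `8` | - |
| `lora_modules` | list of string \| None | `None` | - |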
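
The vLLM counterpart might be written as shown here, again assuming a top-level `vllm` key that mirrors the `vllm` field of the experiment configurations. Values other than the placeholder path repeat the documented defaults.

```yaml
# Illustrative vLLM overrides; the model path is a placeholder.
vllm:
  model: /path/to/model          # hypothetical path
  dtype: bfloat16
  max_model_len: 32768
  gpu_memory_utilization: 0.9
  max_num_seqs: 256
  enforce_eager: false
```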

(section-train-dataset)=

TrainDataset Configuration

Configuration for training dataset loading and preprocessing.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `split` | string | `"train"` | Dataset split to use, e.g., 'train', 'test'. |
| `path` | string | Required | Path to the dataset. Can be a local path or a HuggingFace dataset name. |
| `type` | string | Required | Type of training method, e.g., 'sft', 'rl', etc. |
| `batch_size` | integer | `1` | Batch size for the dataloader. |
| `shuffle` | boolean | `True` | Whether to shuffle the dataset. |
| `pin_memory` | boolean | `False` | Pin memory for faster data loading (set True for GPU training). |
| `num_workers` | integer | `0` | Number of worker processes for data loading. |
| `num_dataset_workers` | integer | `1` | Number of remote data-service worker processes to launch when using scheduling_spec. |
| `drop_last` | boolean | `True` | Drop the last incomplete batch. |
| `max_length` | integer \| None | `None` | Maximum token length of sequences in dataset. Longer sequences are filtered out. |
| `dataset_kwargs` | dict | Required | Additional keyword arguments for dataset loading. These are passed to the dataset loading function get_custom_dataset. |
| `scheduling_spec` | SchedulingSpec \| None | Required | Scheduling spec for remote data loading workers. If set, dataset loading will be offloaded to a data service with remote workers. |
| `setup_timeout` | float | `120.0` | Timeout in seconds for the data service to load and register a dataset. Increase this value when loading large datasets for the first time (e.g. HuggingFace datasets that require downloading and preprocessing). |
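
A training dataset block could be sketched as below. The `train_dataset` key matches the field name used in the experiment configurations on this page; the dataset path and sizes are placeholders.

```yaml
# Illustrative training dataset settings.
train_dataset:
  path: /path/to/dataset   # hypothetical; a local path or HuggingFace dataset name
  type: rl
  split: train
  batch_size: 128
  shuffle: true
  drop_last: true
  max_length: 1024         # longer sequences are filtered out
```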

(section-valid-dataset)=

ValidDataset Configuration

Configuration for validation dataset loading and preprocessing.

Its defaults differ from TrainDatasetConfig: split defaults to 'test', and shuffle and drop_last default to False.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `split` | string | `"test"` | Dataset split to use, e.g., 'train', 'test'. |
| `path` | string | Required | Path to the dataset. Can be a local path or a HuggingFace dataset name. |
| `type` | string | Required | Type of training method, e.g., 'sft', 'rl', etc. |
| `batch_size` | integer | `1` | Batch size for the dataloader. |
| `shuffle` | boolean | `False` | Whether to shuffle the dataset. |
| `pin_memory` | boolean | `False` | Pin memory for faster data loading (set True for GPU training). |
| `num_workers` | integer | `0` | Number of worker processes for data loading. |
| `num_dataset_workers` | integer | `1` | Number of remote data-service worker processes to launch when using scheduling_spec. |
| `drop_last` | boolean | `False` | Drop the last incomplete batch. |
| `max_length` | integer \| None | `None` | Maximum token length of sequences in dataset. Longer sequences are filtered out. |
| `dataset_kwargs` | dict | Required | Additional keyword arguments for dataset loading. These are passed to the dataset loading function get_custom_dataset. |
| `scheduling_spec` | SchedulingSpec \| None | Required | Scheduling spec for remote data loading workers. If set, dataset loading will be offloaded to a data service with remote workers. |
| `setup_timeout` | float | `120.0` | Timeout in seconds for the data service to load and register a dataset. Increase this value when loading large datasets for the first time (e.g. HuggingFace datasets that require downloading and preprocessing). |
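
The validation dataset follows the same shape. In this sketch only `path`, `type`, and `batch_size` are set explicitly, so the validation-specific defaults from the table apply; the path is a placeholder.

```yaml
# Illustrative validation dataset settings.
valid_dataset:
  path: /path/to/dataset   # hypothetical; a local path or HuggingFace dataset name
  type: rl
  batch_size: 64
  # split, shuffle, and drop_last keep their defaults: test, false, false
```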

(section-cluster)=

Cluster Specification Configuration

Configuration for cluster specification and distributed computing setup.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `name_resolve` | NameResolveConfig | Required | Name resolving configuration. |
| `cluster_name` | string | `"local"` | Name of the cluster. Used to set specific environs. |
| `fileroot` | string | `"/tmp/areal/"` | Root for logs and checkpoints. Should be available on all nodes. |
| `n_nodes` | integer | `32` | The size of the cluster. Used to decide slurm hostname suffix. |
| `n_gpus_per_node` | integer | `8` | Number of GPUs per node (physical). |

(section-name-resolve)=

NameResolve Configuration

Configuration for distributed name resolution and service discovery.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `type` | string | `"nfs"` | Type of the distributed KV store for name resolving. Choices: nfs, etcd3, ray |
| `nfs_record_root` | string | `"/tmp/areal/name_resolve"` | Record root for NFS name resolving. Should be available on all nodes. |
| `etcd3_addr` | string | `"localhost:2379"` | Address of the ETCD3 server. |
| `ray_actor_name` | string | `"ray_kv_store"` | Name of the distributed Ray KV store. |
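
Because `name_resolve` is a field of the cluster specification, the two sections nest as in the sketch below. The node count is illustrative; the other values repeat the documented defaults.

```yaml
# Illustrative single-node cluster using NFS-based name resolution.
cluster:
  cluster_name: local
  fileroot: /tmp/areal/          # should be available on all nodes
  n_nodes: 1
  n_gpus_per_node: 8
  name_resolve:
    type: nfs                    # choices: nfs, etcd3, ray
    nfs_record_root: /tmp/areal/name_resolve
```

Switching `type` to `etcd3` or `ray` would make `etcd3_addr` or `ray_actor_name` the relevant setting instead of `nfs_record_root`.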

(section-evaluator)=

Evaluator Configuration

Configuration for model evaluation scheduling and timing.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `experiment_name` | string | Required | - |
| `trial_name` | string | Required | - |
| `fileroot` | string | Required | - |
| `freq_epochs` | integer \| None | `None` | Trigger frequency in epochs. None disables epoch-based evaluation. |
| `freq_steps` | integer \| None | `None` | Trigger frequency in steps. None disables step-based evaluation. |
| `freq_secs` | integer \| None | `None` | Trigger frequency in seconds. None disables time-based evaluation. |
| `eval_before_train` | boolean | `False` | Run one evaluation before training begins, then continue with the configured evaluation frequency. |
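
One way to schedule evaluation is sketched below. The required `experiment_name`, `trial_name`, and `fileroot` fields are left out for brevity, and the step count is only an example.

```yaml
# Illustrative: evaluate once before training, then every 100 steps.
evaluator:
  freq_epochs: null
  freq_steps: 100
  freq_secs: null
  eval_before_train: true
```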

(section-recover)=

Recover Configuration

Configuration for experiment recovery and fault tolerance.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `experiment_name` | string | Required | - |
| `trial_name` | string | Required | - |
| `fileroot` | string | Required | - |
| `freq_epochs` | integer \| None | `None` | Trigger frequency in epochs. None disables epoch-based saving. |
| `freq_steps` | integer \| None | `None` | Trigger frequency in steps. None disables step-based saving. |
| `freq_secs` | integer \| None | `None` | Trigger frequency in seconds. None disables time-based saving. |
| `mode` | string | `"disabled"` | Recovery mode for the launcher. Options: 'on' or 'auto': Automatically recover from previous runs if recover info and checkpoints are available. 'off' or 'disabled': Never recover from previous runs. |
| `retries` | integer | `3` | Number of recovery retries when recovery is enabled. |
| `no_save_optim` | boolean | `False` | Do not save optimizer state in recovery checkpoints. Required when using use_distributed_optimizer with Megatron (flattened_range incompatibility). |
| `no_load_optim` | boolean | `False` | Do not load optimizer state when recovering from checkpoint. |
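
A recovery setup might look like the following sketch, which omits the required `experiment_name`, `trial_name`, and `fileroot` fields. The interval is a placeholder.

```yaml
# Illustrative: write recovery checkpoints hourly and resume automatically.
recover:
  mode: auto        # 'on' behaves the same; 'off' or 'disabled' never recovers
  freq_secs: 3600
  retries: 3
  no_save_optim: false
  no_load_optim: false
```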

(section-saver)=

Saver Configuration

Configuration for model checkpoint saving scheduling and timing.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `experiment_name` | string | Required | - |
| `trial_name` | string | Required | - |
| `fileroot` | string | Required | - |
| `freq_epochs` | integer \| None | `None` | Trigger frequency in epochs. None disables epoch-based saving. |
| `freq_steps` | integer \| None | `None` | Trigger frequency in steps. None disables step-based saving. |
| `freq_secs` | integer \| None | `None` | Trigger frequency in seconds. None disables time-based saving. |
| `mode` | string | `"auto"` | Checkpoint save mode for HF saves. 'auto': use async for Archon engine, sync for others (default). 'sync': always synchronous. 'async': always process-based async with pinned memory staging, extra CPU pinned memory proportional to per-rank model shard size (e.g., ~17.5GB/rank for 70B model on 8 GPUs). Non-Archon engines fall back to sync with a warning. Choices: auto, sync, async |
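
As a sketch, saving one checkpoint per epoch with the default save mode could be written like this. The required `experiment_name`, `trial_name`, and `fileroot` fields are omitted for brevity.

```yaml
# Illustrative: checkpoint once per epoch.
saver:
  freq_epochs: 1
  freq_steps: null
  freq_secs: null
  mode: auto   # async for the Archon engine, sync for others
```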

(section-stats-logger)=

StatsLogger Configuration

Configuration for experiment statistics logging and tracking services.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `experiment_name` | string | Required | - |
| `trial_name` | string | Required | - |
| `fileroot` | string | Required | - |
| `wandb` | WandBConfig | Required | Weights & Biases configuration. |
| `swanlab` | SwanlabConfig | Required | SwanLab configuration. |
| `tensorboard` | TensorBoardConfig | Required | TensorBoard configuration. Only 'path' field required. |
| `trackio` | TrackioConfig | Required | Trackio configuration (Hugging Face experiment tracking). |

(section-swanlab)=

Swanlab Configuration

Configuration for SwanLab experiment tracking and monitoring.

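| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `project` | string \| None | `None` | - |
| `name` | string \| None | `None` | - |
| `config` | dict \| None | `None` | - |
| `logdir` | string \| None | `None` | - |
| `mode` | string | `"disabled"` | Tracking mode. One of 'cloud', 'local', 'disabled', or 'offline'. Choices: cloud, local, disabled, offline |
| `api_key` | string \| None | `None` | - |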

(section-tensor-board)=

TensorBoard Configuration

Configuration for TensorBoard logging and visualization.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `path` | string \| None | `None` | - |

(section-trackio)=

Trackio Configuration

Configuration for Trackio experiment tracking (Hugging Face).

Trackio is a lightweight, local-first experiment tracking library with a wandb-compatible API. Dashboards can be viewed locally or deployed to Hugging Face Spaces.

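See: https://github.com/gradio-app/trackio

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `mode` | string | `"disabled"` | - |
| `project` | string \| None | `None` | - |
| `name` | string \| None | `None` | - |
| `space_id` | string \| None | `None` | - |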

(section-wand-b)=

WandB Configuration

Configuration for Weights & Biases experiment tracking.

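| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `mode` | string | `"disabled"` | Tracking mode. One of 'online', 'offline', 'disabled', or 'shared'. Choices: online, offline, disabled, shared |
| `wandb_base_url` | string | `""` | - |
| `wandb_api_key` | string | `""` | - |
| `entity` | string \| None | `None` | - |
| `project` | string \| None | `None` | - |
| `name` | string \| None | `None` | - |
| `job_type` | string \| None | `None` | - |
| `group` | string \| None | `None` | - |
| `notes` | string \| None | `None` | - |
| `tags` | list of string \| None | `None` | - |
| `config` | dict \| None | `None` | - |
| `id_suffix` | string \| None | `"train"` | - |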
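
The tracking backends are nested under the stats logger, so a combined block might look like the sketch below. The project name and TensorBoard path are hypothetical, and the required `experiment_name`, `trial_name`, and `fileroot` fields are omitted.

```yaml
# Illustrative: log to W&B offline and to TensorBoard; other trackers stay disabled.
stats_logger:
  wandb:
    mode: offline
    project: my-project            # hypothetical project name
  tensorboard:
    path: /tmp/areal/tensorboard   # hypothetical directory
  swanlab:
    mode: disabled
  trackio:
    mode: disabled
```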

(section-agent)=

Agent Configuration

Configuration for agent workflows and the experimental agent service controller.

Consolidates proxy settings (mode, parsers, export) with agent-service orchestration (scheduling, auth) into a single flat dataclass.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `agent_cls_path` | string | `""` | Fully-qualified import path for the AgentRunnable implementation. |
| `admin_api_key` | string | `"areal-admin-key"` | Admin API key for the proxy server and agent-service inter-service auth. Used to authenticate management operations (grant_capacity, start_session). Cannot be used for chat completions. Each session gets a unique API key allocated via start_session. WARNING: Change this from the default for non-local deployments. |
| `scheduling_spec` | tuple | Required | Scheduling spec for agent-service guard workers. Must contain exactly one SchedulingSpec. Use scheduling_spec[0].env_vars for child-process environment variables. |
| `mode` | string | `"inline"` | OpenAI proxy mode: 'inline' (in-process), 'subproc' (subprocess), or 'online' (external user sessions for online RL training). inline mode runs the provided agent workflow directly in the same process. subproc mode launches a separate process to run the agent. online mode waits for external users to complete sessions via the proxy gateway URL, enabling online RL training. Choices: inline, subproc, online |
| `tool_call_parser` | string | `"qwen"` | Parser for tool calls in model output. |
| `reasoning_parser` | string | `"qwen3"` | Parser for reasoning content (`<think>` tags). |
| `chat_template_type` | string | `"hf"` | Chat template type: 'hf' (standard) or 'concat' (multi-turn concatenation). Choices: hf, concat |
| `engine_max_tokens` | integer \| None | `None` | Maximum total tokens for the engine (prompt + completion). |
| `turn_discount` | float | `1.0` | Discount factor for multi-turn reward propagation. |
| `export_style` | string | `"individual"` | Export style: 'individual' (all interactions) or 'concat' (leaf nodes only). The 'individual' style exports each interaction (input-output-reward) step separately, and treats them as independent samples to train the model. The 'concat' style exports only the final concatenated trajectory from the root. It is only suitable for linear conversation histories without token mismatching (whether valid depends on the tokenizer). Choices: individual, concat |
| `subproc_max_workers` | integer | `4` | Maximum number of worker processes for subprocess mode execution pool. |
| `session_timeout_seconds` | integer | `3600` | Session timeout in seconds. Sessions inactive longer than this will be garbage collected. |
| `set_reward_finish_timeout` | float | `0.0` | Timeout in seconds to wait for additional reward updates before finalizing a session. |
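
Since the inference engine configuration exposes an `agent` field, an agent block plausibly nests under `rollout` as in this sketch. The `rollout` key and the class path are assumptions made for illustration.

```yaml
# Illustrative agent workflow settings, assumed to sit under `rollout.agent`.
rollout:
  agent:
    agent_cls_path: my_package.agents.MyAgent   # hypothetical import path
    mode: subproc               # choices: inline, subproc, online
    subproc_max_workers: 4
    tool_call_parser: qwen
    reasoning_parser: qwen3
    export_style: individual    # each interaction becomes an independent sample
    turn_discount: 1.0
    admin_api_key: change-me    # replace the default for non-local deployments
```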

(section-archon-engine)=

ArchonEngine Configuration

Configuration for Archon Engine training backend.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `attn_type` | string | `"varlen"` | Attention backend type. Use 'tree' for tree training. Choices: varlen, sdpa, tree |
| `offload_params` | boolean | `False` | Whether to offload FSDP parameters to CPU. |
| `enable_compile` | boolean | `False` | Enable torch.compile for TransformerBlocks. |
| `ac_mode` | string | `"selective"` | Activation checkpointing mode. 'memory_budget' requires enable_compile=True. Choices: none, full, selective, memory_budget |
| `selective_ac_option` | string | `"op"` | Selective AC option: 'op' for op-level, or integer string (e.g., '2') for every Nth layer. |
| `ac_memory_budget` | float | `0.5` | Memory budget for 'memory_budget' AC mode. 0.0 = minimum memory (max recompute), 1.0 = default behavior (no recompute). |
| `ac_preserve_rng_state` | boolean | `False` | Preserve RNG state during checkpointing for deterministic output. Enabling this may slow down training. |
| `ac_debug` | boolean | `False` | (Testing only) Capture AC debug information. Will be slower. |
| `pp_schedule` | string | `"Interleaved1F1B"` | Pipeline parallel schedule type. Choices: 1F1B, Interleaved1F1B, InterleavedZeroBubble, ZBVZeroBubble |
| `pp_layers_per_stage` | integer \| None | `None` | Number of transformer layers per (virtual) pipeline stage. If set, num_virtual_stages is calculated from num_layers. If None, stages are inferred from schedule type (1 stage/rank for 1F1B, 2 stages/rank for Interleaved1F1B/InterleavedZeroBubble/ZBVZeroBubble). |
| `pp_first_stage_less_layers` | integer | `1` | Number of layers to reduce in the first pipeline stage. Accounts for embedding layer overhead. |
| `pp_last_stage_less_layers` | integer | `1` | Number of layers to reduce in the last pipeline stage. Accounts for output layer overhead. |
| `reshard_after_forward_policy` | string | `"default"` | FSDP reshard policy after forward pass. 'default': reshard when pipeline parallelism is off; keep unsharded when on to avoid repeated all-gather per microbatch. 'always': always reshard after forward (saves memory). 'never': never reshard after forward. Choices: default, always, never |
| `fp8_config` | ArchonFP8Config | Required | FP8 training configuration. Set mode='blockwise' to enable. |
| `use_deterministic_algorithms` | boolean | `False` | Enable deterministic algorithms for training reproducibility. Sets torch.use_deterministic_algorithms(True, warn_only=True), CUBLAS_WORKSPACE_CONFIG, NCCL_ALGO, and TORCH_COMPILE_DETERMINISTIC. May reduce performance. |
| `moe_router_dtype` | string \| None | `"fp32"` | Data type for MoE router gate GEMM computation. 'fp32' runs gate linear in float32 for numerical stability. None uses model dtype (no override). Choices: fp32, None |
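
The interaction between `ac_mode` and `enable_compile` is easiest to see in a small fragment. This sketch assumes the Archon settings sit under `actor.archon`, following the `archon` field of the train engine configuration; the values are illustrative.

```yaml
# Illustrative Archon settings using memory-budget activation checkpointing.
actor:
  backend: "archon:d2"
  archon:
    attn_type: varlen
    enable_compile: true        # needed because ac_mode is memory_budget
    ac_mode: memory_budget
    ac_memory_budget: 0.5       # 0.0 = max recompute, 1.0 = no recompute
    pp_schedule: Interleaved1F1B
    reshard_after_forward_policy: default
```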

(section-archon-fp8)=

ArchonFP8 Configuration

Archon FP8 training configuration.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `mode` | string | `"disabled"` | FP8 precision mode. 'disabled': FP8 training off (default). 'blockwise': blockwise 128x128 FP8 e4m3fn matmuls (requires Hopper GPU). Choices: disabled, blockwise |
| `exclude_modules` | list of string | Required | FQN substrings of nn.Linear modules to keep in BF16 (not converted to FP8). Any module whose fully-qualified name contains one of these strings is skipped. Meaningful values for Archon models: 'output' (LM head, logit precision sensitive), 'router' (MoE router gate, routing stability sensitive), 'score' (critic head, value precision sensitive). Note: nn.Embedding modules (e.g. tok_embeddings) are never converted regardless of this list. WARNING: Setting this in YAML replaces the entire default list (does not extend it). Include ALL modules you want to keep in BF16. |
| `include_experts` | boolean | `False` | Apply FP8 to MoE expert computation. Uses per-expert blockwise FP8 matmuls via torchao. |
| `use_triton` | boolean | `True` | Use Triton GEMM kernel for FP8 blockwise matmuls instead of cuBLAS. Currently must be True: torchao's blockwise FP8 is a prototype that uses mixed per-operand scaling (1x128 activations + 128x128 weights), which torch._scaled_mm does not support. The Triton kernel (triton_fp8_gemm_1x128_128x128) handles this natively. Revisit when torchao stabilizes mixed-mode cuBLAS dispatch. |
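
The warning about `exclude_modules` is worth illustrating: a list given in YAML replaces the default list outright. The sketch below assumes the block nests under `actor.archon.fp8_config` and spells out every module to keep in BF16.

```yaml
# Illustrative blockwise FP8 setup; the exclude list is written out in full
# because a YAML value replaces the default list instead of extending it.
actor:
  archon:
    fp8_config:
      mode: blockwise
      exclude_modules:
        - output   # LM head
        - router   # MoE router gate
        - score    # critic head
      include_experts: false
      use_triton: true   # currently must be true
```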

(section-dpo)=

DPO Configuration

Configuration for Direct Preference Optimization (DPO) experiments.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `experiment_name` | string | Required | Name of the experiment (no '_' or '/'). Required. |
| `trial_name` | string | Required | Name of the trial (no '-' or '/'). Required. |
| `cluster` | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| `allocation_mode` | string | `""` | DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required. |
| `seed` | integer | `1` | Random seed for reproducibility. |
| `enable_offload` | boolean | `False` | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| `total_train_epochs` | integer | `1` | Total number of epochs to train the model. |
| `total_train_steps` | integer \| None | `None` | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| `total_train_n_seqs` | integer \| None | `None` | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| `tokenizer_path` | string | `""` | Path to the tokenizer. |
| `train_dataset` | TrainDatasetConfig | Required | - |
| `valid_dataset` | ValidDatasetConfig \| None | `None` | - |
| `saver` | SaverConfig | Required | - |
| `evaluator` | EvaluatorConfig | Required | - |
| `stats_logger` | StatsLoggerConfig | Required | - |
| `perf_tracer` | PerfTracerConfig \| None | `None` | Performance tracer configuration. None means disabled. |
| `memory_profiler` | MemoryProfilerConfig \| None | `None` | Memory snapshot profiler configuration. None means disabled. |
| `recover` | RecoverConfig | Required | - |
| `sglang` | SGLangConfig | Required | - |
| `vllm` | vLLMConfig | Required | - |
| `scheduler` | SchedulerConfig | Required | - |
| `actor` | DPOEngineConfig | Required | - |
| `ref` | DPOEngineConfig | Required | - |

(section-dpo-engine)=

DPOEngine Configuration

Engine configuration for DPO training, extending TrainEngineConfig with DPO-specific fields.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `experiment_name` | string | Required | - |
| `trial_name` | string | Required | - |
| `path` | string | `""` | Path to HuggingFace checkpoint. |
| `attn_impl` | string | `"flash_attention_2"` | Attention implementation for huggingface transformers model. Accepts builtin transformers backends or a Hugging Face kernels repo ID formatted as org/repo[@revision][:entrypoint]. Choices: eager, sdpa, flash_attention_2, flash_attention_3, flex_attention |
| `use_kernels` | boolean | `False` | Enable Hugging Face kernels model kernelization after model creation. |
| `init_from_scratch` | boolean | `False` | Initialize model weights randomly. |
| `is_critic` | boolean | `False` | Whether to use a critic/reward model. |
| `temperature` | float | `1.0` | Temperature during generation. |
| `mb_spec` | MicroBatchSpec | Required | - |
| `pad_to_maximum` | boolean | `False` | Whether to pad each microbatch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training. |
| `disable_dropout` | boolean | `False` | Disable dropout layers during training. |
| `gradient_checkpointing` | boolean | `False` | Enable gradient checkpointing. |
| `dtype` | string | `"bfloat16"` | Forward/backward compute dtype. |
| `grad_reduce_dtype` | string | `"float32"` | Gradient reduction data type. |
| `optimizer_dtype` | string | `"float32"` | Underlying parameter storage dtype, also the dtype of optimizer states (exp_avg, exp_avg_sq) since torch.optim.AdamW inherits dtype from model.parameters(). Default 'float32' maintains fp32 master weights matching DeepSpeed ZeRO-3 and Megatron precision-aware optimizer behavior. FSDP2's MixedPrecisionPolicy(param_dtype=dtype) will still cast forward/backward computation to dtype (e.g. bfloat16). Set to 'bfloat16' together with optimizer.type='adam_bf16' to reduce memory at the cost of needing Kahan summation for stability. Currently FSDP-only; Megatron uses use_precision_aware_optimizer instead and ignores this field. |
| `optimizer` | OptimizerConfig \| None | `None` | Optimizer configuration. None means no training. |
| `weight_update_mode` | string | `"xccl"` | Weight update backend type. Choices: disk, xccl |
| `fsdp` | FSDPEngineConfig | Required | - |
| `archon` | ArchonEngineConfig | Required | - |
| `megatron` | MegatronEngineConfig | Required | - |
| `offload` | boolean | `False` | Whether to offload model parameters and optimizer states to CPU. |
| `use_lora` | boolean | `False` | Whether to use LoRA. Only FSDP is supported. Note that it should be enabled together with vLLM/SGLang. |
| `lora_rank` | integer | `32` | LoRA rank. |
| `lora_alpha` | integer | `16` | LoRA alpha. |
| `target_modules` | list of string | Required | LoRA target_modules. |
| `peft_type` | string | `"lora"` | PEFT method type. Only LoRA is supported for now. |
| `enable_tree_training` | boolean | `False` | Enable tree training with flex attention module. |
| `scheduling_spec` | tuple | Required | Train engine schedule specs. Can accept 1 or 2 SchedulingSpec: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second is for the engine. Currently only used by the TrainController. |
| `backend` | string | Required | Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'fsdp:d4', 'megatron:d4t2p2', 'archon:d2'. Required. |
| `_version` | string | `"v1"` | Train controller implementation version. Use 'v1' for legacy TrainController, 'v2' for GatewayTrainController. Choices: v1, v2 |
| `admin_api_key` | string | `"areal-admin-key"` | Admin API key used by gateway/router/data-proxy in controller v2. |
| `log_level` | string | `"warning"` | Gateway stack log level for controller v2. |
| `request_timeout` | float | `3600.0` | Gateway request timeout in seconds for controller v2. |
| `setup_timeout` | float | `3600.0` | Gateway setup timeout in seconds for controller v2. |
| `workers_ready_timeout` | float | `30.0` | Timeout (seconds) for initialize() to wait for guards to be ready. |
| `scheduling_strategy` | SchedulingStrategy | Required | The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController. |
| `beta` | float | `0.1` | KL penalty coefficient for DPO loss. |
| `loss_type` | string | `"sigmoid"` | DPO loss variant. 'sigmoid': original DPO loss (Rafailov et al. 2023). 'ipo': Identity Preference Optimization with per-token length normalization (Azar et al. 2023). Choices: sigmoid, ipo |
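
Because the DPO experiment uses DPOEngineConfig for both `actor` and `ref`, a minimal engine fragment might look like the sketch below. The checkpoint path is a placeholder, and the parallelism strings are examples taken from the `backend` description.

```yaml
# Illustrative DPO engine settings for the trained policy and the reference model.
actor:
  backend: "fsdp:d4"
  path: /path/to/checkpoint   # hypothetical HuggingFace checkpoint
  dtype: bfloat16
  beta: 0.1                   # KL penalty coefficient
  loss_type: sigmoid          # or ipo
ref:
  backend: "fsdp:d4"
  path: /path/to/checkpoint   # hypothetical HuggingFace checkpoint
```

Individual fields can also be overridden on the command line in the same way as `actor.lr=1e-4`, for example `actor.beta=0.2`.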

(section-distributed-data-parallel)=

DistributedDataParallel Configuration

Configuration for Megatron's DistributedDataParallel.

Refer to Megatron-LM documentation for details.

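| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `grad_reduce_in_fp32` | boolean | `True` | - |
| `overlap_grad_reduce` | boolean | `False` | - |
| `overlap_param_gather` | boolean | `False` | - |
| `align_param_gather` | boolean | `False` | - |
| `use_distributed_optimizer` | boolean | `True` | - |
| `check_for_nan_in_grad` | boolean | `False` | - |
| `bucket_size` | integer \| None | `None` | - |
| `average_in_collective` | boolean | `False` | - |
| `fp8_param_gather` | boolean | `False` | - |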
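
These options are reached through the `ddp` field of the Megatron engine configuration. The sketch below assumes that block sits under `actor.megatron`, and pairs the distributed optimizer with `recover.no_save_optim`, which the Recover section describes as required in that case.

```yaml
# Illustrative Megatron DDP settings with the distributed optimizer enabled.
actor:
  backend: "megatron:d4t2p2"
  megatron:
    ddp:
      grad_reduce_in_fp32: true
      overlap_grad_reduce: true
      use_distributed_optimizer: true
recover:
  no_save_optim: true   # see the Recover table: needed with use_distributed_optimizer
```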

(section-fp8-engine)=

FP8Engine Configuration

Configuration for FP8 (8-bit floating point) training.

This configuration encapsulates all FP8-related parameters and can be reused across different engines (e.g., Megatron, FSDP). When None in the parent config, FP8 training is disabled.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `mode` | string | `"e4m3"` | FP8 precision mode. Options: 'e4m3' (uniform e4m3), 'hybrid' (e4m3 for activations/weights, e5m2 for output activation gradients). |
| `recipe` | string | `"delayed"` | FP8 scaling recipe. Options: 'tensorwise', 'delayed', 'mxfp8' (Blackwell only), 'blockwise'. |
| `param` | boolean | `False` | Keep parameters in FP8 precision to save memory. Not all parameters will be converted to fp8; for example, biases will remain unchanged. |
| `margin` | integer | `0` | Margin for FP8 scaling factor computation. |
| `amax_history_len` | integer | `1` | Length of amax history window for scaling factor computation. |
| `amax_compute_algo` | string | `"most_recent"` | Algorithm for choosing amax value. Options: 'max' (largest in history window), 'most_recent'. |
| `wgrad` | boolean | `True` | When False, override FP8 config and compute weight gradients in higher precision. |
| `dot_product_attention` | boolean | `False` | Use FP8 implementation of Dot Product Attention. |
| `multi_head_attention` | boolean | `False` | Use FP8 implementation of Multi Head Attention. |
| `tp_only_amax_red` | boolean | `False` | Reduce FP8 AMAX only in TP or TP-CP domain. |
| `first_last_layers_bf16` | boolean | `False` | Retain first and last N TransformerBlocks in BF16 instead of FP8. |
| `num_layers_at_start_in_bf16` | integer | `1` | Number of layers at start to keep in BF16 when first_last_layers_bf16 is True. |
| `num_layers_at_end_in_bf16` | integer | `1` | Number of layers at end to keep in BF16 when first_last_layers_bf16 is True. |
| `direct_convert` | boolean | `True` | Whether to use direct FP8 conversion during weight updates and save/load. When True, FP8 parameters are directly converted between TE FP8 and PyTorch FP8 without intermediate dequantization/quantization. |
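
A possible FP8 block for the Megatron engine is sketched here, assuming it is supplied through `actor.megatron.fp8_config`. Leaving `fp8_config` unset (None) keeps FP8 training disabled; the history length is an illustrative value.

```yaml
# Illustrative FP8 settings: hybrid format with delayed scaling.
actor:
  megatron:
    fp8_config:
      mode: hybrid                   # e4m3 forward, e5m2 for output activation gradients
      recipe: delayed
      amax_history_len: 16
      amax_compute_algo: max         # largest amax in the history window
      first_last_layers_bf16: true
      num_layers_at_start_in_bf16: 1
      num_layers_at_end_in_bf16: 1
```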

(section-megatron-engine)=

MegatronEngine Configuration

Configuration for Megatron-LM training framework.

Refer to Megatron-LM documentation for implementation details.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `wrap_with_ddp` | boolean | `True` | - |
| `use_torch_fsdp2` | boolean | `False` | - |
| `use_custom_fsdp` | boolean | `False` | - |
| `ddp` | DistributedDataParallelConfig | Required | - |
| `virtual_pipeline_parallel_size` | integer | `1` | Virtual pipeline parallel size for Megatron interleaved schedule. Set to >1 to enable VPP. Default is 1 (disabled). |
| `overlap_param_gather_with_optimizer_step` | boolean | `False` | - |
| `use_precision_aware_optimizer` | boolean | `False` | Enable precision-aware optimizer for Megatron. When using adam_bf16 optimizer type with Megatron Engine, this is automatically enabled with exp_avg_dtype=bfloat16 and exp_avg_sq_dtype=bfloat16. |
| `main_grads_dtype` | string | `"float32"` | - |
| `main_params_dtype` | string | `"float32"` | - |
| `exp_avg_dtype` | string | `"float32"` | - |
| `exp_avg_sq_dtype` | string | `"float32"` | - |
| `async_save` | boolean | `False` | If True, Megatron checkpoint saves run in background processes and save_checkpoint() returns immediately after weights are durably staged off the GPU. Pending saves are drained before the next load_checkpoint() and during engine.destroy(). Reduces per-save sync wait on large MoE checkpoints. |
| `use_checkpoint_opt_param_scheduler` | boolean | `True` | - |
| `use_deterministic_algorithms` | boolean | `False` | - |
| `recompute_granularity` | string \| None | `"full"` | - |
| `recompute_method` | string \| None | `"uniform"` | - |
| `recompute_num_layers` | integer \| None | `1` | - |
| `distribute_saved_activations` | boolean \| None | `None` | - |
| `recompute_modules` | list of string \| None | `None` | - |
| `moe_router_dtype` | string \| None | `"fp32"` | - |
| `moe_shared_expert_overlap` | boolean | `False` | Enable overlapping between shared expert computations and dispatcher communications. Without this, the shared experts execute after the routed experts. |
| `moe_enable_deepep` | boolean | `False` | - |
| `moe_token_dispatcher_type` | string | `"alltoall"` | Type of token dispatcher. Options: 'allgather', 'alltoall', and 'flex'. |
| `moe_permute_fusion` | boolean | `False` | Fuse token rearrangement ops during token dispatching. |
| `fp8_config` | FP8EngineConfig \| None | `None` | - |
| `bridge_type` | string | `"mbridge"` | Bridge backend for MegatronEngine. Choices: mbridge, megatron-bridge |
| `use_mbridge_save` | boolean | `False` | Use mbridge's save method to save GPU memory when saving weights. |
| `use_bridge_for_update_weights` | boolean | `False` | When True and bridge_type='megatron-bridge', delegate live weight sync to bridge.export_hf_weights instead of the hand-rolled convert_to_hf registry. Required for models without a registry entry (e.g. Qwen3.5). FP8 paths fall back to the registry automatically. |
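
Several of these options only matter in combination, so a short fragment may help. This sketch assumes the block sits under `actor.megatron` and uses illustrative values for a pipeline-parallel MoE run.

```yaml
# Illustrative Megatron engine settings.
actor:
  backend: "megatron:d4t2p2"
  megatron:
    virtual_pipeline_parallel_size: 2   # >1 enables VPP
    recompute_granularity: full
    recompute_method: uniform
    recompute_num_layers: 1
    moe_token_dispatcher_type: alltoall
    moe_permute_fusion: true
    bridge_type: megatron-bridge
    use_bridge_for_update_weights: true   # only takes effect with megatron-bridge
    async_save: true
```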

(section-memory-profiler)=

MemoryProfiler Configuration

CUDA memory snapshot profiling configuration.

Attributes:
    profile_steps: Steps at which to record memory snapshots.
    max_entries: Max entries for torch.cuda.memory._record_memory_history.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| profile_steps | list of integer | Required | List of global steps to capture memory snapshots. |
| max_entries | integer | 100000 | Max entries for memory history ring buffer. |

(section-perf-tracer)=

PerfTracer Configuration

Configuration for perf tracer emission.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | string | Required | - |
| trial_name | string | Required | - |
| fileroot | string | Required | - |
| enabled | boolean | False | Explicitly enable or disable perf tracing. Set to true to capture perf traces. |
| save_interval | integer | 1 | Flush trace events to disk every N calls to save(step=...). A value of 1 writes on every step; values <= 0 fall back to 1. |
| profile_steps | list of integer \| None | None | List of step numbers at which to capture detailed profiling traces. If None, no detailed profiling traces are captured. |
| session_tracer | SessionTracerConfig \| None | None | Session tracing configuration. |
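The `save_interval` rule can be read as a simple call counter. The sketch below is illustrative only: `effective_interval` and `flush_calls` are hypothetical helpers, not part of AReaL, and it assumes a flush happens on every Nth call to `save()`, counting calls from 1.

```python
def effective_interval(save_interval: int) -> int:
    """Values <= 0 fall back to 1, as described for save_interval."""
    return save_interval if save_interval > 0 else 1


def flush_calls(num_save_calls: int, save_interval: int) -> list[int]:
    """Return the 1-based save() call indices at which a flush would occur.

    Assumption: the tracer flushes on every Nth call, counted from 1.
    """
    n = effective_interval(save_interval)
    return [i for i in range(1, num_save_calls + 1) if i % n == 0]


# With save_interval=3, six calls to save() flush on the 3rd and 6th call.
calls_with_flush = flush_calls(6, 3)
```

Under this reading, a larger `save_interval` trades fewer disk writes for more trace events held in memory between flushes.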

(section-rejection-sampling)=

RejectionSampling Configuration

Unified configuration for sample filtering based on policy divergence.

Filters tokens/sequences where the divergence between the proximal policy and the behavior policy exceeds a threshold, via two action modes:

- 'mask': zero out loss_mask (rejection, exclude from gradient)
- 'clamp': clamp importance weight to bounds (truncation, bounded gradient)

Supports direct ratio bounds and KL divergence estimators (K1/K2/K3),
at both token-level and sequence-level granularity.
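As a rough illustration of the per-token metrics, the sketch below evaluates `ratio`, `kl_k1`, `kl_k2`, and `kl_k3` from a single log-ratio log r = log(π_proximal/π_behave), using the formulas listed for the `metric` parameter. The function name `divergence_metric` is hypothetical, and the real implementation presumably works on tensors rather than Python floats.

```python
import math


def divergence_metric(log_ratio: float, metric: str = "ratio") -> float:
    """Per-token divergence from log r = log(pi_proximal / pi_behave).

    Hypothetical helper for illustration; not AReaL's API.
    """
    r = math.exp(log_ratio)
    if metric == "ratio":
        return r                        # direct importance ratio
    if metric == "kl_k1":
        return log_ratio                # k1 = log(r), can be negative
    if metric == "kl_k2":
        return 0.5 * log_ratio ** 2     # k2 = 0.5 * (log r)^2, non-negative
    if metric == "kl_k3":
        return r - log_ratio - 1.0      # k3 = r - log(r) - 1, non-negative
    raise ValueError(f"unsupported metric in this sketch: {metric}")


# All three KL estimators vanish when the two policies agree (log r = 0),
# while the ratio itself equals 1.
on_policy = {m: divergence_metric(0.0, m)
             for m in ("ratio", "kl_k1", "kl_k2", "kl_k3")}
```

Only `kl_k1` can go below zero, which is why a `lower` bound is meaningful for that metric.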

Replaces the removed ``behave_imp_weight_cap`` and ``behave_imp_weight_mode``.

Attributes:
    level: Filtering granularity ('token' or 'sequence'). When ``level='sequence'``
        and ``metric='ratio'``, both the filtering decision and the correction
        weight (behave_imp_weight) use the sequence-level geometric mean,
        matching the old ``sequence_mask``/``sequence_truncate`` semantics.
    action: Action mode ('mask' or 'clamp').
    metric: Divergence metric ('ratio', 'kl_k1', 'kl_k2', 'kl_k3').
    agg: Aggregation method for sequence-level ('sum', 'mean', 'max').
        For 'ratio' metric, aggregation is performed in log space (geometric
        mean/sum) to avoid the "length trap" and match GSPO semantics.
        For KL metrics, aggregation is arithmetic.
    upper: Upper bound for filtering.
    lower: Lower bound for filtering (optional).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| level | string | "token" | Filtering granularity. 'token': per-token filtering (each token judged independently). 'sequence': per-sequence filtering (all tokens in a sequence share the same fate). When metric='ratio', both the filtering decision and the correction weight (behave_imp_weight) operate at sequence level using the geometric mean. Choices: token, sequence |
| action | string | "mask" | Action to take when metric exceeds threshold. 'mask': zero out loss_mask for filtered tokens/sequences (rejection, completely excludes from gradient computation). 'clamp': clamp importance weight to [lower, upper] bounds (truncation, tokens still participate in gradient but with bounded weight). Choices: mask, clamp |
| metric | string | "ratio" | Divergence metric for filtering. 'ratio': direct importance ratio π_proximal/π_behave. 'kl_k1': KL estimator k1 = log(r), forward KL unbiased estimator (can be negative). 'kl_k2': KL estimator k2 = 0.5 * (log r)^2, non-negative quadratic approximation. 'kl_k3': KL estimator k3 = r - log(r) - 1, non-negative exact forward KL estimator. 'binary_kl': KPop (symmetric binary KL divergence); masks tokens where either KL(proximal |
| agg | string | "mean" | Aggregation method for sequence-level filtering. Only used when level='sequence'. For 'ratio' metric, aggregation is in log space: 'sum' = exp(sum(log(r_i))), 'mean' = exp(mean(log(r_i))) = geometric mean (length-invariant, consistent with GSPO). For KL metrics, aggregation is arithmetic: 'sum' = sum(kl_i), 'mean' = mean(kl_i). 'max': max of per-token metric values (most conservative). Choices: sum, mean, max |
| upper | float | 5.0 | Upper bound for filtering. Tokens/sequences with metric > upper are filtered out (loss_mask zeroed). For 'ratio' metric: must be > 1.0, typical values are 2.0 or 5.0. For 'kl_k2'/'kl_k3' metrics: typical values are 0.5-2.0. |
| lower | float \| None | None | Lower bound for filtering (optional). None means no lower bound. For 'ratio' metric: typical value is 0.5 (filter out tokens where policy probability dropped significantly). Must be > 0. For 'kl_k1' metric: can be used to filter negative KL estimates. |

(section-scheduler)=

Scheduler Configuration

Configuration for worker scheduling. Used in the single-controller mode. Experimental.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| type | string \| None | None | - |
| endpoint | string | "http://localhost:8081" | - |
| deploy_mode | string | "separation" | - |
| functioncall_service_domain | string | "http://localhost:8080" | - |
| reward_functioncall_config | dict | Required | - |
| reward_model_path | string | "" | - |
| reward_model_service_url | string | "http://localhost:30000/classify" | - |

(section-scheduling)=

Scheduling Specification

Configuration class: SchedulingSpec

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cpu | integer | 8 | Number of CPU cores required per GPU. |
| gpu | integer | 0 | Number of GPU units required. Used only when allocating pods. |
| mem | integer | 32 | Amount of memory (GB) required per GPU. |
| port_count | integer | 2 | Number of ports to expose. |
| image | string | "/storage/openpsi/images/areal-latest.sif" | Docker/Singularity container image to use. Currently only used by Slurm. May be used by Kubernetes in the future. |
| task_type | string | "worker" | Task type (e.g., worker, engine). Choices: worker, engine |
| env_vars | dict | Required | Environment variables for the container. |
| cmd | string \| None | None | Command to execute inside the container. Defaults to AReaL's RPC server. |
| srun_additional_args | string | "--unbuffered --mpi=pmi2 -K --chdir $PWD" | Additional arguments to pass to the srun command. Only used by Slurm. |
| additional_bash_cmds | list of string \| None | None | Additional bash commands to set up the container before running the torchrun command. Only used by Slurm. |
| container_type | string | "apptainer" | Type of container used in Slurm. Choices: apptainer, none |
| mount | string | "/storage:/storage" | Mount path for Slurm. |
| nodelist | string \| None | None | sbatch/srun's --nodelist option for Slurm. |
| exclude | string \| None | None | sbatch/srun's --exclude option for Slurm. |
| ray_placement_strategy | string | "shared" | Placement strategy for Ray scheduling. 'shared' produces one placement group for all workers in the role (training). 'separate' produces one placement group per worker (rollout). 'deferred' behaves like 'separate' but defers accelerator scheduling (multinode rollout). Choices: shared, separate, deferred |

(section-scheduling-strategy)=

SchedulingStrategy

Configuration class: SchedulingStrategy

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | "separation" | Choices: separation, colocation |
| target | string \| None | None | The target role to be colocated with. |
| fork | boolean | True | When True with colocation, the target worker spawns a new process on the same node/GPUs instead of sharing its process. Provides process isolation while sharing GPU resources. |

(section-session-tracer)=

SessionTracer Configuration

Configuration for per-session lifecycle tracing.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | False | Enable per-session lifecycle tracing alongside perf events. When true, session metadata is captured to sessions.jsonl. |
| flush_threshold | integer | 256 | Flush session trace records once this many entries are ready. Values <= 0 fall back to 1. |
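The `flush_threshold` behavior amounts to a small write buffer. The following sketch is a hypothetical stand-in, not AReaL's implementation: the class `SessionBuffer` is invented for illustration, and an in-memory list takes the place of `sessions.jsonl`.

```python
import json


class SessionBuffer:
    """Buffers session records and flushes them in batches (illustrative only)."""

    def __init__(self, flush_threshold: int = 256):
        # Values <= 0 fall back to 1, i.e. flush after every record.
        self.threshold = flush_threshold if flush_threshold > 0 else 1
        self.pending: list[dict] = []
        self.flushed_lines: list[str] = []  # stands in for sessions.jsonl

    def add(self, record: dict) -> None:
        self.pending.append(record)
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self) -> None:
        # One JSON object per line, matching the JSON Lines convention.
        self.flushed_lines.extend(json.dumps(r) for r in self.pending)
        self.pending.clear()


buf = SessionBuffer(flush_threshold=2)
for session_id in range(3):
    buf.add({"session_id": session_id})
# Two records have been written; the third waits for the next flush.
```

A lower threshold makes session records visible on disk sooner at the cost of more frequent writes.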

(section-teacher)=

Teacher Configuration

Configuration class: TeacherConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| engine_type | string | "rollout" | Teacher engine type. 'rollout' uses inference engine scoring; 'train' uses the legacy train-engine teacher path. Choices: rollout, train |
| rollout | InferenceEngineConfig \| None | None | - |
| train | PPOActorConfig \| None | None | Legacy train-engine teacher config. Required when engine_type='train'. |
| path | string | "" | Teacher model path. If set, overrides shared rollout backend model path. |
| offload | boolean | False | Whether to offload teacher rollout model between steps. |
| rl_loss_weight | float | 1.0 | RL loss weight. |
| distill_loss_weight | float | 0.005 | Distillation loss weight. |
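The reference does not state how the two weights are combined. A plausible reading, shown here purely as an assumption, is a weighted sum of the RL loss and the distillation loss; `combined_loss` is a hypothetical helper, not an AReaL function.

```python
def combined_loss(rl_loss: float, distill_loss: float,
                  rl_loss_weight: float = 1.0,
                  distill_loss_weight: float = 0.005) -> float:
    # Assumption: the two terms are mixed as a simple weighted sum.
    return rl_loss_weight * rl_loss + distill_loss_weight * distill_loss


# With the defaults, a distillation loss of 100.0 contributes about 0.5
# to the total, the same order of magnitude as a typical RL loss term.
total = combined_loss(rl_loss=2.0, distill_loss=100.0)
```

If this reading holds, the small default `distill_loss_weight` mainly compensates for the distillation term being numerically much larger than the RL term.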