Configurations

June 15, 2026

This page is a reference for all configuration parameters available in AReaL's command-line interface. These parameters are defined using dataclasses and can be specified in YAML configuration files or overridden via command-line arguments.

Usage

Configuration files are specified using the --config parameter:

python3 train.py --config path/to/config.yaml

You can override specific parameters from the command line:

python3 train.py --config path/to/config.yaml actor.lr=1e-4 seed=42

For detailed examples, see the experiment configurations in the examples/ directory.
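
To illustrate how dotted overrides such as `actor.lr=1e-4` can map onto a nested configuration, here is a minimal sketch. The helpers `parse_value` and `apply_overrides` are hypothetical names introduced for illustration only; AReaL's actual parser is not shown on this page and may behave differently (for example, in how it validates keys or converts types).

```python
from typing import Any


def parse_value(text: str) -> Any:
    """Best-effort conversion of an override value from its string form."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    if lowered in ("none", "null"):
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply dotted `key=value` overrides to a nested dict (hypothetical helper)."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk (or create) the intermediate sections, e.g. "actor".
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return config


# Values that would normally come from the YAML file.
base = {"seed": 1, "actor": {"lr": 1e-3}}
merged = apply_overrides(base, ["actor.lr=1e-4", "seed=42"])
print(merged)  # {'seed': 42, 'actor': {'lr': 0.0001}}
```

Command-line values take precedence over the YAML values in this sketch, which matches the override behavior described above.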

Core Experiment Configurations

Training Configurations

Inference Configurations

Dataset

System and Cluster Configurations

Logging and Monitoring

Others

(section-base-experiment)=

BaseExperiment Configuration

Base configuration class for all experiment types with common settings.

Parameter	Type	Default	Description
`experiment_name`	string	Required	Name of the experiment (no '_' or '/'). Required.
`trial_name`	string	Required	Name of the trial (no '-' or '/'). Required.
`cluster`	`ClusterSpecConfig`	Required	Cluster specification. Mainly used by slurm.
`allocation_mode`	string	`""`	DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required.
`seed`	integer	`1`	Random seed for reproducibility.
`enable_offload`	boolean	`False`	Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD).
`total_train_epochs`	integer	`1`	Total number of epochs to train the model.
`total_train_steps`	integer \| None	`None`	Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training.
`total_train_n_seqs`	integer \| None	`None`	Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training.
`tokenizer_path`	string	`""`	Path to the tokenizer.
`train_dataset`	`TrainDatasetConfig`	Required	-
`valid_dataset`	`ValidDatasetConfig` \| None	`None`	-
`saver`	`SaverConfig`	Required	-
`evaluator`	`EvaluatorConfig`	Required	-
`stats_logger`	`StatsLoggerConfig`	Required	-
`perf_tracer`	`PerfTracerConfig` \| None	`None`	Performance tracer configuration. None means disabled.
`memory_profiler`	`MemoryProfilerConfig` \| None	`None`	Memory snapshot profiler configuration. None means disabled.
`recover`	`RecoverConfig`	Required	-
`sglang`	`SGLangConfig`	Required	-
`vllm`	`vLLMConfig`	Required	-
`scheduler`	`SchedulerConfig`	Required	-
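
The naming constraints on `experiment_name` and `trial_name` can be checked before launching a job. The following is a minimal sketch that takes the forbidden characters directly from the table above; `validate_names` is a hypothetical helper, not part of AReaL's API.

```python
def validate_names(experiment_name: str, trial_name: str) -> None:
    """Raise ValueError if a name is empty or contains a forbidden character.

    Forbidden characters are copied from the parameter table:
    '_' and '/' for experiment_name, '-' and '/' for trial_name.
    """
    if not experiment_name or not trial_name:
        raise ValueError("experiment_name and trial_name are both required")
    bad_exp = [c for c in "_/" if c in experiment_name]
    bad_trial = [c for c in "-/" if c in trial_name]
    if bad_exp:
        raise ValueError(f"experiment_name contains forbidden characters: {bad_exp}")
    if bad_trial:
        raise ValueError(f"trial_name contains forbidden characters: {bad_trial}")


# A valid pair passes silently.
validate_names("gsm8k-grpo", "trial_0")
```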

(section-grpo)=

GRPO Configuration

A placeholder for the GRPO config, kept for backward compatibility.

Parameter	Type	Default	Description
`experiment_name`	string	Required	Name of the experiment (no '_' or '/'). Required.
`trial_name`	string	Required	Name of the trial (no '-' or '/'). Required.
`cluster`	`ClusterSpecConfig`	Required	Cluster specification. Mainly used by slurm.
`allocation_mode`	string	`""`	DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required.
`seed`	integer	`1`	Random seed for reproducibility.
`enable_offload`	boolean	`False`	Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD).
`total_train_epochs`	integer	`1`	Total number of epochs to train the model.
`total_train_steps`	integer \| None	`None`	Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training.
`total_train_n_seqs`	integer \| None	`None`	Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training.
`tokenizer_path`	string	`""`	Path to the tokenizer.
`train_dataset`	`TrainDatasetConfig`	Required	-
`valid_dataset`	`ValidDatasetConfig` \| None	`None`	-
`saver`	`SaverConfig`	Required	-
`evaluator`	`EvaluatorConfig`	Required	-
`stats_logger`	`StatsLoggerConfig`	Required	-
`perf_tracer`	`PerfTracerConfig` \| None	`None`	Performance tracer configuration. None means disabled.
`memory_profiler`	`MemoryProfilerConfig` \| None	`None`	Memory snapshot profiler configuration. None means disabled.
`recover`	`RecoverConfig`	Required	-
`sglang`	`SGLangConfig`	Required	-
`vllm`	`vLLMConfig`	Required	-
`scheduler`	`SchedulerConfig`	Required	-
`gconfig`	`GenerationHyperparameters`	Required	-
`eval_gconfig`	`GenerationHyperparameters` \| None	`None`	Generation hyperparameters for evaluation. If None, use gconfig.
`rollout`	`InferenceEngineConfig`	Required	-
`actor`	`PPOActorConfig`	Required	-
`ref`	`PPOActorConfig` \| None	`None`	-
`critic`	`PPOCriticConfig` \| None	`None`	-
`teacher`	`TeacherConfig` \| None	`None`	Optional teacher model configuration used for on-policy distillation during PPO training. If provided, the actor may be trained to match the teacher in addition to the standard PPO objective.
`dynamic_bs`	boolean	`False`	Enable dynamic batch sizing in prepare_batch. When True, batch collection stops when (accepted + rejected) >= batch_size, returning only accepted results. This results in variable-sized batches of valid data.

(section-ppo)=

PPO Configuration

Configuration for Proximal Policy Optimization (PPO) reinforcement learning experiments.

Parameter	Type	Default	Description
`experiment_name`	string	Required	Name of the experiment (no '_' or '/'). Required.
`trial_name`	string	Required	Name of the trial (no '-' or '/'). Required.
`cluster`	`ClusterSpecConfig`	Required	Cluster specification. Mainly used by slurm.
`allocation_mode`	string	`""`	DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required.
`seed`	integer	`1`	Random seed for reproducibility.
`enable_offload`	boolean	`False`	Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD).
`total_train_epochs`	integer	`1`	Total number of epochs to train the model.
`total_train_steps`	integer \| None	`None`	Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training.
`total_train_n_seqs`	integer \| None	`None`	Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training.
`tokenizer_path`	string	`""`	Path to the tokenizer.
`train_dataset`	`TrainDatasetConfig`	Required	-
`valid_dataset`	`ValidDatasetConfig` \| None	`None`	-
`saver`	`SaverConfig`	Required	-
`evaluator`	`EvaluatorConfig`	Required	-
`stats_logger`	`StatsLoggerConfig`	Required	-
`perf_tracer`	`PerfTracerConfig` \| None	`None`	Performance tracer configuration. None means disabled.
`memory_profiler`	`MemoryProfilerConfig` \| None	`None`	Memory snapshot profiler configuration. None means disabled.
`recover`	`RecoverConfig`	Required	-
`sglang`	`SGLangConfig`	Required	-
`vllm`	`vLLMConfig`	Required	-
`scheduler`	`SchedulerConfig`	Required	-
`gconfig`	`GenerationHyperparameters`	Required	-
`eval_gconfig`	`GenerationHyperparameters` \| None	`None`	Generation hyperparameters for evaluation. If None, use gconfig.
`rollout`	`InferenceEngineConfig`	Required	-
`actor`	`PPOActorConfig`	Required	-
`ref`	`PPOActorConfig` \| None	`None`	-
`critic`	`PPOCriticConfig` \| None	`None`	-
`teacher`	`TeacherConfig` \| None	`None`	Optional teacher model configuration used for on-policy distillation during PPO training. If provided, the actor may be trained to match the teacher in addition to the standard PPO objective.
`dynamic_bs`	boolean	`False`	Enable dynamic batch sizing in prepare_batch. When True, batch collection stops when (accepted + rejected) >= batch_size, returning only accepted results. This results in variable-sized batches of valid data.
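
The stopping rule for `dynamic_bs` can be sketched as follows. This is an illustration, not AReaL's `prepare_batch`: `collect_batch` and `is_accepted` are hypothetical names, and the behavior with `dynamic_bs=False` (collect until `batch_size` accepted results are available) is an assumption made for contrast.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def collect_batch(
    results: Iterable[T],
    is_accepted: Callable[[T], bool],
    batch_size: int,
    dynamic_bs: bool,
) -> list[T]:
    """Collect accepted results from a stream of rollout results."""
    accepted: list[T] = []
    rejected = 0
    for item in results:
        if is_accepted(item):
            accepted.append(item)
        else:
            rejected += 1
        if dynamic_bs:
            # Documented rule: stop once accepted + rejected >= batch_size.
            done = len(accepted) + rejected >= batch_size
        else:
            # Assumption: the fixed-size mode waits for batch_size accepted items.
            done = len(accepted) >= batch_size
        if done:
            break
    return accepted


stream = range(10)
dynamic = collect_batch(stream, lambda x: x % 2 == 0, batch_size=4, dynamic_bs=True)
fixed = collect_batch(stream, lambda x: x % 2 == 0, batch_size=4, dynamic_bs=False)
print(dynamic)  # [0, 2]
print(fixed)    # [0, 2, 4, 6]
```

With `dynamic_bs=True` the batch may hold fewer than `batch_size` items, which is the variable-sized batch described in the table.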

(section-rw)=

RW Configuration

Configuration for Reward Model (RW) training experiments.

Parameter	Type	Default	Description
`experiment_name`	string	Required	Name of the experiment (no '_' or '/'). Required.
`trial_name`	string	Required	Name of the trial (no '-' or '/'). Required.
`cluster`	`ClusterSpecConfig`	Required	Cluster specification. Mainly used by slurm.
`allocation_mode`	string	`""`	DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required.
`seed`	integer	`1`	Random seed for reproducibility.
`enable_offload`	boolean	`False`	Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD).
`total_train_epochs`	integer	`1`	Total number of epochs to train the model.
`total_train_steps`	integer \| None	`None`	Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training.
`total_train_n_seqs`	integer \| None	`None`	Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training.
`tokenizer_path`	string	`""`	Path to the tokenizer.
`train_dataset`	`TrainDatasetConfig`	Required	-
`valid_dataset`	`ValidDatasetConfig` \| None	`None`	-
`saver`	`SaverConfig`	Required	-
`evaluator`	`EvaluatorConfig`	Required	-
`stats_logger`	`StatsLoggerConfig`	Required	-
`perf_tracer`	`PerfTracerConfig` \| None	`None`	Performance tracer configuration. None means disabled.
`memory_profiler`	`MemoryProfilerConfig` \| None	`None`	Memory snapshot profiler configuration. None means disabled.
`recover`	`RecoverConfig`	Required	-
`sglang`	`SGLangConfig`	Required	-
`vllm`	`vLLMConfig`	Required	-
`scheduler`	`SchedulerConfig`	Required	-
`actor`	`TrainEngineConfig`	Required	-

(section-sft)=

SFT Configuration

Configuration for Supervised Fine-Tuning (SFT) experiments.

Parameter	Type	Default	Description
`experiment_name`	string	Required	Name of the experiment (no '_' or '/'). Required.
`trial_name`	string	Required	Name of the trial (no '-' or '/'). Required.
`cluster`	`ClusterSpecConfig`	Required	Cluster specification. Mainly used by slurm.
`allocation_mode`	string	`""`	DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required.
`seed`	integer	`1`	Random seed for reproducibility.
`enable_offload`	boolean	`False`	Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD).
`total_train_epochs`	integer	`1`	Total number of epochs to train the model.
`total_train_steps`	integer \| None	`None`	Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training.
`total_train_n_seqs`	integer \| None	`None`	Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training.
`tokenizer_path`	string	`""`	Path to the tokenizer.
`train_dataset`	`TrainDatasetConfig`	Required	-
`valid_dataset`	`ValidDatasetConfig` \| None	`None`	-
`saver`	`SaverConfig`	Required	-
`evaluator`	`EvaluatorConfig`	Required	-
`stats_logger`	`StatsLoggerConfig`	Required	-
`perf_tracer`	`PerfTracerConfig` \| None	`None`	Performance tracer configuration. None means disabled.
`memory_profiler`	`MemoryProfilerConfig` \| None	`None`	Memory snapshot profiler configuration. None means disabled.
`recover`	`RecoverConfig`	Required	-
`sglang`	`SGLangConfig`	Required	-
`vllm`	`vLLMConfig`	Required	-
`scheduler`	`SchedulerConfig`	Required	-
`actor`	`TrainEngineConfig`	Required	-

(section-fsdp-engine)=

FSDPEngine Configuration

Configuration for Fully Sharded Data Parallel (FSDP) training backend.

Parameter	Type	Default	Description
`wrap_policy`	`FSDPWrapPolicy` \| None	`None`	FSDP wrap policy, specifying model layers to wrap.
`offload_params`	boolean	`False`	Whether to offload FSDP parameters to CPU.
`memory_efficient_load`	boolean	`False`	Enable memory-efficient model loading. When enabled, model weights are initialized on CPU and only rank 0 loads pretrained weights, which are then broadcast to all ranks after FSDP sharding. This reduces peak GPU memory during initialization for large models. Note: For VLMs, rank 0 broadcast is not used; each rank loads weights independently on CPU.
`per_layer_optim_step`	boolean	`False`	Run Adam step on GPU by streaming optimizer states layer-by-layer with async prefetching, instead of running on CPU. Optimizer states are automatically managed on CPU by the per-layer wrapper regardless of offload_params setting. Requires optimizer type 'adam' (AdamW).
`optim_step_prefetch_layers`	integer	`1`	Number of layers to prefetch during per-layer optim step.
`shard_vision_across_sp`	boolean	`False`	Shard vision encoder across SP ranks by image. Only effective when context_parallel_size > 1.

(section-fsdp-wrap-policy)=

FSDPWrapPolicy

Policy configuration for FSDP model layer wrapping. None defaults to wrapping transformer decoder layers defined by transformers.

Parameter	Type	Default	Description
`transformer_layer_cls_to_wrap`	list of string \| None	`None`	A list of transformer layer names for FSDP to wrap.

(section-micro-batch)=

MicroBatch Specification

Specification for splitting micro-batches during training.

Parameter	Type	Default	Description
`n_mbs`	integer \| None	`1`	Number of micro-batches. When max_tokens_per_mb is None, this value is used directly; when max_tokens_per_mb is set, it acts as the minimum count.
`granularity`	integer	`1`	Granularity of each micro-batch. Adjacent sequences are grouped by this size when dividing microbatches.
`max_tokens_per_mb`	integer \| None	`None`	Maximum tokens per micro-batch for each forward pass. When set, n_mbs becomes the minimum number of micro-batches.
`n_mbs_divisor`	integer	`1`	Divisor for the number of micro-batches. The final number of micro-batches will be adjusted to be divisible by this value.
`packing_algorithm`	string	`"ffd"`	Sequence packing algorithm for micro-batch allocation. Supported values: 'ffd' (First Fit Decreasing, default), 'kk' (Karmarkar-Karp, better balance but slightly slower). KK is recommended when workload balance across DP ranks is critical (e.g., large-scale RL training with variable-length sequences). Choices: `ffd`, `kk`
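
As a rough illustration of the `ffd` option, the sketch below packs sequences into micro-batches with First Fit Decreasing under a `max_tokens_per_mb` budget. It is a textbook version of the algorithm, not AReaL's implementation: `ffd_pack` is a hypothetical name, and the `n_mbs` minimum, `n_mbs_divisor`, and `granularity` adjustments are deliberately left out.

```python
def ffd_pack(seq_lens: list[int], max_tokens_per_mb: int) -> list[list[int]]:
    """Pack sequence indices into micro-batches using First Fit Decreasing."""
    # Visit sequences from longest to shortest.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins: list[list[int]] = []  # sequence indices per micro-batch
    loads: list[int] = []       # token count per micro-batch
    for idx in order:
        length = seq_lens[idx]
        if length > max_tokens_per_mb:
            raise ValueError(f"sequence {idx} exceeds max_tokens_per_mb")
        for b, load in enumerate(loads):
            # Place the sequence in the first micro-batch with enough room.
            if load + length <= max_tokens_per_mb:
                bins[b].append(idx)
                loads[b] += length
                break
        else:
            # No existing micro-batch fits, so open a new one.
            bins.append([idx])
            loads.append(length)
    return bins


micro_batches = ffd_pack([5, 3, 7, 2, 4], max_tokens_per_mb=8)
print(micro_batches)  # [[2], [0, 1], [4, 3]]
```

FFD keeps the number of micro-batches low but does not try to equalize their token counts, which is the gap that the `kk` option is described as addressing.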

(section-norm)=

Norm Configuration

Configuration for reward/advantage normalization.

Parameter	Type	Default	Description
`mean_level`	string \| None	`"batch"`	Mean level for normalization. None for no mean normalization. Choices: `batch`, `group`, `None`
`mean_leave1out`	boolean	`False`	Whether to use leave-one-out average.
`std_level`	string \| None	`"batch"`	Standard deviation level for normalization. None for no std normalization. Choices: `batch`, `group`, `None`
`std_unbiased`	boolean	`True`	Whether to use unbiased standard deviation computation. Defaults to True (changed from False in v0.3.4).
`eps`	float	`1e-05`	The eps when dividing by standard deviation to avoid numerical issues.
`group_size`	integer	`1`	Group size for group-level normalization
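
The interaction of these fields can be sketched in plain Python. This assumes the common form `(x - mean) / (std + eps)` and that groups are consecutive runs of `group_size` values; `normalize` is a hypothetical function, and AReaL's actual implementation may differ in details such as masking or how leave-one-out interacts with the standard deviation.

```python
import math


def normalize(values, mean_level="batch", std_level="batch", group_size=1,
              mean_leave1out=False, std_unbiased=True, eps=1e-5):
    """Normalize a flat list of rewards or advantages (illustrative sketch)."""
    n = len(values)

    def index_groups(level):
        if level == "batch":
            return [list(range(n))]
        # Assumption: "group" means consecutive chunks of group_size values.
        return [list(range(s, min(s + group_size, n))) for s in range(0, n, group_size)]

    centered = list(values)
    if mean_level is not None:
        for idxs in index_groups(mean_level):
            total = sum(values[i] for i in idxs)
            for i in idxs:
                if mean_leave1out and len(idxs) > 1:
                    # Leave-one-out: average over the other members only.
                    mean = (total - values[i]) / (len(idxs) - 1)
                else:
                    mean = total / len(idxs)
                centered[i] = values[i] - mean

    out = list(centered)
    if std_level is not None:
        for idxs in index_groups(std_level):
            k = len(idxs)
            mu = sum(values[i] for i in idxs) / k
            # Unbiased estimate divides by k - 1 (Bessel's correction).
            denom = k - 1 if (std_unbiased and k > 1) else k
            std = math.sqrt(sum((values[i] - mu) ** 2 for i in idxs) / denom)
            for i in idxs:
                out[i] = centered[i] / (std + eps)
    return out


# Group-level mean only, with groups of two samples per prompt.
print(normalize([1.0, 3.0, 10.0, 20.0], mean_level="group", std_level=None, group_size=2))
# [-1.0, 1.0, -5.0, 5.0]
```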

(section-optimizer)=

Optimizer Configuration

Configuration for model optimization during training.

Parameter	Type	Default	Description
`type`	string	`"adam"`	Optimizer type. For FSDP Engine, adam_bf16 enables memory-efficient BF16 optimizer states. For Megatron Engine, adam_bf16 requires dtype=bfloat16 and is automatically converted to adam with precision-aware optimizer enabled. Choices: `adam`, `sgd`, `adam_bf16`
`lr`	float	`0.001`	Learning rate
`weight_decay`	float	`0.01`	Weight decay
`beta1`	float	`0.9`	Adam beta1 parameter. Only effective when `type` is `adam` or `adam_bf16`.
`beta2`	float	`0.999`	Adam beta2 parameter. Only effective when `type` is `adam` or `adam_bf16`.
`eps`	float	`1e-08`	Adam epsilon parameter. Only effective when `type` is `adam` or `adam_bf16`.
`min_lr_ratio`	float	`0.0`	Minimum learning rate ratio after annealing
`lr_scheduler_type`	string	`"constant"`	Learning rate scheduler type. Choices: `linear`, `cosine`, `constant`
`warmup_steps_proportion`	float	`0.001`	Proportion of training steps for warmup
`initial_loss_scale`	float	`4294967296`	Initial loss scaling factor
`min_loss_scale`	float	`1.0`	Minimum loss scaling factor
`loss_scale_window`	float	`5`	Window size for loss scaling adjustment
`hysteresis`	integer	`2`	Hysteresis (scaling factor) for loss scaling
`gradient_clipping`	float	`1.0`	Gradient clipping threshold
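
The scheduler-related fields (`lr`, `lr_scheduler_type`, `warmup_steps_proportion`, `min_lr_ratio`) can be read together as a learning-rate curve. The sketch below assumes linear warmup followed by decay toward `lr * min_lr_ratio`; `lr_at_step` is a hypothetical function and the exact warmup and decay formulas in AReaL's backends may differ.

```python
import math


def lr_at_step(step: int, total_steps: int, lr: float = 1e-3,
               lr_scheduler_type: str = "constant",
               warmup_steps_proportion: float = 0.001,
               min_lr_ratio: float = 0.0) -> float:
    """Learning rate at a 0-indexed step (illustrative sketch)."""
    warmup_steps = int(warmup_steps_proportion * total_steps)
    if step < warmup_steps:
        # Assumption: linear warmup from lr / warmup_steps up to lr.
        return lr * (step + 1) / warmup_steps
    if lr_scheduler_type == "constant":
        return lr
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    min_lr = lr * min_lr_ratio
    if lr_scheduler_type == "linear":
        return min_lr + (lr - min_lr) * (1.0 - progress)
    if lr_scheduler_type == "cosine":
        return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown lr_scheduler_type: {lr_scheduler_type}")


# Halfway through the decay phase of a linear schedule.
print(lr_at_step(75, 100, lr=1.0, lr_scheduler_type="linear", warmup_steps_proportion=0.5))
# 0.5
```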

(section-ppo-actor)=

PPOActor Configuration

Configuration for PPO actor model, a subclass of a TrainEngine.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`path`	string	`""`	Path to HuggingFace checkpoint
`attn_impl`	string	`"flash_attention_2"`	Attention implementation for huggingface transformers model. Accepts builtin transformers backends or a Hugging Face kernels repo ID formatted as org/repo[@revision][:entrypoint]. Choices: `eager`, `sdpa`, `flash_attention_2`, `flash_attention_3`, `flex_attention`
`use_kernels`	boolean	`False`	Enable Hugging Face kernels model kernelization after model creation.
`init_from_scratch`	boolean	`False`	Initialize model weights randomly
`is_critic`	boolean	`False`	Whether to use a critic/reward model
`temperature`	float	`1.0`	Temperature during generation.
`mb_spec`	`MicroBatchSpec`	Required	-
`pad_to_maximum`	boolean	`False`	Whether to pad each microbatch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training.
`disable_dropout`	boolean	`False`	Disable dropout layers during training
`gradient_checkpointing`	boolean	`False`	Enable gradient checkpointing
`dtype`	string	`"bfloat16"`	Forward/backward compute dtype.
`grad_reduce_dtype`	string	`"float32"`	Gradient reduction data type.
`optimizer_dtype`	string	`"float32"`	Underlying parameter storage dtype, also the dtype of optimizer states (exp_avg, exp_avg_sq) since torch.optim.AdamW inherits dtype from model.parameters(). Default 'float32' maintains fp32 master weights matching DeepSpeed ZeRO-3 and Megatron precision-aware optimizer behavior. FSDP2's MixedPrecisionPolicy(param_dtype=`dtype`) will still cast forward/backward computation to `dtype` (e.g. bfloat16). Set to 'bfloat16' together with optimizer.type='adam_bf16' to reduce memory at the cost of needing Kahan summation for stability. Currently FSDP-only; Megatron uses use_precision_aware_optimizer instead and ignores this field.
`optimizer`	`OptimizerConfig` \| None	`None`	Optimizer configuration. None means no training.
`weight_update_mode`	string	`"xccl"`	Weight update backend type. Choices: `disk`, `xccl`
`fsdp`	`FSDPEngineConfig`	Required	-
`archon`	`ArchonEngineConfig`	Required	-
`megatron`	`MegatronEngineConfig`	Required	-
`offload`	boolean	`False`	Whether to offload model parameters and optimizer states to CPU.
`use_lora`	boolean	`False`	Whether to use LoRA. Only supported with FSDP. Note that it should be enabled together with vLLM/SGLang.
`lora_rank`	integer	`32`	LoRA rank.
`lora_alpha`	integer	`16`	LoRA alpha.
`target_modules`	list of string	Required	LoRA target modules.
`peft_type`	string	`"lora"`	PEFT method type. Only LoRA is supported for now.
`enable_tree_training`	boolean	`False`	Enable tree training with flex attention module.
`scheduling_spec`	`tuple`	Required	Train engine scheduling specs. Accepts one or two SchedulingSpec entries. With one spec, it is used for both the worker and the engine, and the engine is embedded in the worker. With two specs, the first is for the worker and the second is for the engine. Currently only used by the TrainController.
`backend`	string	Required	Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'fsdp:d4', 'megatron:d4t2p2', 'archon:d2'. Required.
`_version`	string	`"v1"`	Train controller implementation version. Use 'v1' for legacy TrainController, 'v2' for GatewayTrainController. Choices: `v1`, `v2`
`admin_api_key`	string	`"areal-admin-key"`	Admin API key used by gateway/router/data-proxy in controller v2.
`log_level`	string	`"warning"`	Gateway stack log level for controller v2.
`request_timeout`	float	`3600.0`	Gateway request timeout in seconds for controller v2.
`setup_timeout`	float	`3600.0`	Gateway setup timeout in seconds for controller v2.
`workers_ready_timeout`	float	`30.0`	Timeout (seconds) for initialize() to wait for guards to be ready.
`scheduling_strategy`	`SchedulingStrategy`	Required	The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController.
`ppo_n_minibatches`	integer	`4`	Number of minibatches for each PPO update
`eps_clip`	float	`0.2`	Clipping factor for policy ratio
`eps_clip_higher`	float \| None	`None`	Clipping factor (higher value) for policy ratio. Default is None. When eps_clip_higher is set (decoupled), eps_clip will be used as the lower value.
`c_clip`	float \| None	`None`	Dual clipping factor for policy ratio, must be > 1.0. None disables dual clipping.
`m2_threshold`	float \| None	`None`	The second-moment threshold for M2PO.
`reward_norm`	`NormConfig` \| None	`None`	Normalization configuration for rewards
`reward_scaling`	float	`1.0`	Reward scaling factor
`reward_bias`	float	`0.0`	Reward bias
`reward_clip`	float	`20.0`	Maximum absolute value for reward clipping
`overlong_reward_penalty`	boolean	`False`	Penalty for overlong sequences. Used within DAPO.
`overlong_tokens`	integer \| None	`None`	Number of tokens in the tail that will receive a penalty
`overlong_penalty_factor`	float \| None	`None`	Penalty factor for tokens in the tail
`mask_no_eos_with_zero`	boolean	`False`	Mask truncated generations (no EOS token) and exclude from training
`discount`	float	`1.0`	Discount factor for future rewards
`gae_lambda`	float	`1.0`	Lambda parameter for GAE
`adv_norm`	`NormConfig` \| None	`None`	Normalization configuration for advantages.
`kl_ctl`	float	`0.1`	KL divergence coefficient
`kl_estimator`	string	`"k1"`	KL divergence estimator. Choices: `k1`, `k2`, `k3`
`use_sapo_loss`	boolean	`False`	Use SAPO loss (mutually exclusive with PPO clipping)
`sapo_tau_pos`	float	`1.0`	SAPO temperature for positive advantages
`sapo_tau_neg`	float	`1.05`	SAPO temperature for negative advantages
`recompute_logprob`	boolean	`False`	Recompute log probability and replace the log probability returned by inference.
`use_decoupled_loss`	boolean	`False`	Use the decoupled loss. Implicitly enables recompute_logprob.
`rejection_sampling`	`RejectionSamplingConfig` \| None	`None`	Rejection sampling configuration for filtering stale samples. None disables filtering (equivalent to old behave_imp_weight_mode='disabled'). Only effective when use_decoupled_loss=True.
`importance_sampling_level`	string	`"token"`	Level at which to compute importance sampling ratios. 'token': per-token ratios (standard PPO). 'sequence': sequence-level geometric mean of per-token ratios (GSPO). Choices: `token`, `sequence`
`prox_logp_method`	string	`"recompute"`	Method for computing proximal policy log-probabilities in decoupled PPO. Only effective when use_decoupled_loss=True. Options: 'recompute' (default): Standard decoupled PPO, recompute proximal policy via forward pass. 'loglinear': Use log-linear interpolation to approximate proximal policy (skip forward pass). 'metrics': Like 'recompute', but also compute approximation metrics for evaluation. Choices: `recompute`, `loglinear`, `metrics`
`log_agent_stats`	boolean	`False`	Log statistics for agent trajectories
`log_agent_stats_keys`	list of string	Required	Keys for logging agent trajectory statistics
`max_new_tokens`	integer	`1024`	Maximum number of new tokens to generate
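
To show how `eps_clip`, `eps_clip_higher`, and `c_clip` relate, here is a per-token sketch of the standard PPO clipped surrogate with optional decoupled upper clipping and dual clipping. It follows the textbook formulation rather than AReaL's code: `ppo_token_loss` is a hypothetical function, and options such as `use_decoupled_loss`, `importance_sampling_level`, and SAPO are not modeled.

```python
import math
from typing import Optional


def ppo_token_loss(logp_new: float, logp_old: float, advantage: float,
                   eps_clip: float = 0.2,
                   eps_clip_higher: Optional[float] = None,
                   c_clip: Optional[float] = None) -> float:
    """Clipped PPO policy loss for a single token (illustrative sketch)."""
    ratio = math.exp(logp_new - logp_old)
    lower = 1.0 - eps_clip
    # When eps_clip_higher is set, eps_clip is only the lower bound.
    upper = 1.0 + (eps_clip_higher if eps_clip_higher is not None else eps_clip)
    clipped_ratio = min(max(ratio, lower), upper)
    # Pessimistic (larger) of the unclipped and clipped losses.
    loss = max(-advantage * ratio, -advantage * clipped_ratio)
    if c_clip is not None:
        if c_clip <= 1.0:
            raise ValueError("c_clip must be > 1.0")
        if advantage < 0:
            # Dual clipping bounds the loss for negative advantages.
            loss = min(loss, -advantage * c_clip)
    return loss


log2 = 0.6931471805599453  # ln(2), so the policy ratio is 2
print(round(ppo_token_loss(log2, 0.0, advantage=1.0), 6))                       # -1.2
print(round(ppo_token_loss(log2, 0.0, advantage=1.0, eps_clip_higher=0.5), 6))  # -1.5
```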

(section-ppo-critic)=

PPOCritic Configuration

Configuration for PPO critic model, a subclass of a TrainEngine.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`path`	string	`""`	Path to HuggingFace checkpoint
`attn_impl`	string	`"flash_attention_2"`	Attention implementation for huggingface transformers model. Accepts builtin transformers backends or a Hugging Face kernels repo ID formatted as org/repo[@revision][:entrypoint]. Choices: `eager`, `sdpa`, `flash_attention_2`, `flash_attention_3`, `flex_attention`
`use_kernels`	boolean	`False`	Enable Hugging Face kernels model kernelization after model creation.
`init_from_scratch`	boolean	`False`	Initialize model weights randomly
`is_critic`	boolean	`False`	Whether to use a critic/reward model
`temperature`	float	`1.0`	Temperature during generation.
`mb_spec`	`MicroBatchSpec`	Required	-
`pad_to_maximum`	boolean	`False`	Whether to pad each microbatch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training.
`disable_dropout`	boolean	`False`	Disable dropout layers during training
`gradient_checkpointing`	boolean	`False`	Enable gradient checkpointing
`dtype`	string	`"bfloat16"`	Forward/backward compute dtype.
`grad_reduce_dtype`	string	`"float32"`	Gradient reduction data type.
`optimizer_dtype`	string	`"float32"`	Underlying parameter storage dtype, also the dtype of optimizer states (exp_avg, exp_avg_sq) since torch.optim.AdamW inherits dtype from model.parameters(). Default 'float32' maintains fp32 master weights matching DeepSpeed ZeRO-3 and Megatron precision-aware optimizer behavior. FSDP2's MixedPrecisionPolicy(param_dtype=`dtype`) will still cast forward/backward computation to `dtype` (e.g. bfloat16). Set to 'bfloat16' together with optimizer.type='adam_bf16' to reduce memory at the cost of needing Kahan summation for stability. Currently FSDP-only; Megatron uses use_precision_aware_optimizer instead and ignores this field.
`optimizer`	`OptimizerConfig` \| None	`None`	Optimizer configuration. None means no training.
`weight_update_mode`	string	`"xccl"`	Weight update backend type. Choices: `disk`, `xccl`
`fsdp`	`FSDPEngineConfig`	Required	-
`archon`	`ArchonEngineConfig`	Required	-
`megatron`	`MegatronEngineConfig`	Required	-
`offload`	boolean	`False`	Whether to offload model parameters and optimizer states to CPU.
`use_lora`	boolean	`False`	Whether to use LoRA. Only supported with FSDP. Note that it should be enabled together with vLLM/SGLang.
`lora_rank`	integer	`32`	LoRA rank.
`lora_alpha`	integer	`16`	LoRA alpha.
`target_modules`	list of string	Required	LoRA target modules.
`peft_type`	string	`"lora"`	PEFT method type. Only LoRA is supported for now.
`enable_tree_training`	boolean	`False`	Enable tree training with flex attention module.
`scheduling_spec`	`tuple`	Required	Train engine scheduling specs. Accepts one or two SchedulingSpec entries. With one spec, it is used for both the worker and the engine, and the engine is embedded in the worker. With two specs, the first is for the worker and the second is for the engine. Currently only used by the TrainController.
`backend`	string	Required	Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'fsdp:d4', 'megatron:d4t2p2', 'archon:d2'. Required.
`_version`	string	`"v1"`	Train controller implementation version. Use 'v1' for legacy TrainController, 'v2' for GatewayTrainController. Choices: `v1`, `v2`
`admin_api_key`	string	`"areal-admin-key"`	Admin API key used by gateway/router/data-proxy in controller v2.
`log_level`	string	`"warning"`	Gateway stack log level for controller v2.
`request_timeout`	float	`3600.0`	Gateway request timeout in seconds for controller v2.
`setup_timeout`	float	`3600.0`	Gateway setup timeout in seconds for controller v2.
`workers_ready_timeout`	float	`30.0`	Timeout (seconds) for initialize() to wait for guards to be ready.
`scheduling_strategy`	`SchedulingStrategy`	Required	The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController.
`ppo_n_minibatches`	integer	`4`	Number of minibatches for each PPO update
`eps_clip`	float	`0.5`	Clipping factor for value loss
`mask_no_eos_with_zero`	boolean	`False`	Mask truncated generations (no EOS token) and exclude from training
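
For the critic, `eps_clip` is described as a clipping factor for the value loss. A common way to apply such a factor is the PPO-style clipped value loss sketched below. Whether AReaL uses exactly this form is an assumption; `clipped_value_loss` is a hypothetical function.

```python
def clipped_value_loss(value: float, old_value: float, target: float,
                       eps_clip: float = 0.5) -> float:
    """PPO-style clipped squared-error value loss for one token (sketch)."""
    # Keep the new prediction within eps_clip of the old prediction.
    delta = min(max(value - old_value, -eps_clip), eps_clip)
    clipped_value = old_value + delta
    unclipped = (value - target) ** 2
    clipped = (clipped_value - target) ** 2
    # Take the pessimistic (larger) of the two errors.
    return 0.5 * max(unclipped, clipped)


# The new value matches the target, but it moved more than eps_clip
# away from the old value, so the clipped term dominates.
print(clipped_value_loss(value=2.0, old_value=0.0, target=2.0))  # 1.125
```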

(section-train-engine)=

TrainEngine Configuration

Core configuration for model training, including optimization and backend settings.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`path`	string	`""`	Path to HuggingFace checkpoint
`attn_impl`	string	`"flash_attention_2"`	Attention implementation for huggingface transformers model. Accepts builtin transformers backends or a Hugging Face kernels repo ID formatted as org/repo[@revision][:entrypoint]. Choices: `eager`, `sdpa`, `flash_attention_2`, `flash_attention_3`, `flex_attention`
`use_kernels`	boolean	`False`	Enable Hugging Face kernels model kernelization after model creation.
`init_from_scratch`	boolean	`False`	Initialize model weights randomly
`is_critic`	boolean	`False`	Whether to use a critic/reward model
`temperature`	float	`1.0`	Temperature during generation.
`mb_spec`	`MicroBatchSpec`	Required	-
`pad_to_maximum`	boolean	`False`	Whether to pad each microbatch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training.
`disable_dropout`	boolean	`False`	Disable dropout layers during training
`gradient_checkpointing`	boolean	`False`	Enable gradient checkpointing
`dtype`	string	`"bfloat16"`	Forward/backward compute dtype.
`grad_reduce_dtype`	string	`"float32"`	Gradient reduction data type.
`optimizer_dtype`	string	`"float32"`	Underlying parameter storage dtype, also the dtype of optimizer states (exp_avg, exp_avg_sq) since torch.optim.AdamW inherits dtype from model.parameters(). Default 'float32' maintains fp32 master weights matching DeepSpeed ZeRO-3 and Megatron precision-aware optimizer behavior. FSDP2's MixedPrecisionPolicy(param_dtype=`dtype`) will still cast forward/backward computation to `dtype` (e.g. bfloat16). Set to 'bfloat16' together with optimizer.type='adam_bf16' to reduce memory at the cost of needing Kahan summation for stability. Currently FSDP-only; Megatron uses use_precision_aware_optimizer instead and ignores this field.
`optimizer`	`OptimizerConfig` \| None	`None`	Optimizer configuration. None means no training.
`weight_update_mode`	string	`"xccl"`	Weight update backend type. Choices: `disk`, `xccl`
`fsdp`	`FSDPEngineConfig`	Required	-
`archon`	`ArchonEngineConfig`	Required	-
`megatron`	`MegatronEngineConfig`	Required	-
`offload`	boolean	`False`	Whether to offload model parameters and optimizer states to CPU.
`use_lora`	boolean	`False`	Whether to use LoRA. Only supported with FSDP. Note that it should be enabled together with vLLM/SGLang.
`lora_rank`	integer	`32`	LoRA rank.
`lora_alpha`	integer	`16`	LoRA alpha.
`target_modules`	list of string	Required	LoRA target modules.
`peft_type`	string	`"lora"`	PEFT method type. Only LoRA is supported for now.
`enable_tree_training`	boolean	`False`	Enable tree training with flex attention module.
`scheduling_spec`	`tuple`	Required	Train engine scheduling specs. Accepts one or two SchedulingSpec entries. With one spec, it is used for both the worker and the engine, and the engine is embedded in the worker. With two specs, the first is for the worker and the second is for the engine. Currently only used by the TrainController.
`backend`	string	Required	Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'fsdp:d4', 'megatron:d4t2p2', 'archon:d2'. Required.
`_version`	string	`"v1"`	Train controller implementation version. Use 'v1' for legacy TrainController, 'v2' for GatewayTrainController. Choices: `v1`, `v2`
`admin_api_key`	string	`"areal-admin-key"`	Admin API key used by gateway/router/data-proxy in controller v2.
`log_level`	string	`"warning"`	Gateway stack log level for controller v2.
`request_timeout`	float	`3600.0`	Gateway request timeout in seconds for controller v2.
`setup_timeout`	float	`3600.0`	Gateway setup timeout in seconds for controller v2.
`workers_ready_timeout`	float	`30.0`	Timeout (seconds) for initialize() to wait for guards to be ready.
`scheduling_strategy`	`SchedulingStrategy`	Required	The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController.
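
The `backend` value packs a backend name and a parallelism layout into one string. The following is a minimal sketch of how such a string could be split, assuming the layout is a run of letter-number pairs as in the examples above. `parse_backend` is a hypothetical helper, not AReaL's parser, and treating the product of the sizes as the number of GPUs is an assumption.

```python
import re
from math import prod


def parse_backend(spec: str) -> tuple[str, dict[str, int]]:
    """Split e.g. 'megatron:d4t2p2' into ('megatron', {'d': 4, 't': 2, 'p': 2})."""
    name, sep, layout = spec.partition(":")
    if not sep or not name:
        raise ValueError(f"backend needs an explicit prefix, got {spec!r}")
    pairs = re.findall(r"([a-z])(\d+)", layout)
    # Reject layouts containing anything the letter-number pattern did not consume.
    if not pairs or "".join(k + v for k, v in pairs) != layout:
        raise ValueError(f"cannot parse parallelism layout {layout!r}")
    return name, {key: int(size) for key, size in pairs}


name, dims = parse_backend("megatron:d4t2p2")
# Assumption: the total GPU count is the product of all parallel sizes.
print(name, dims, prod(dims.values()))
```

A string without a prefix, such as `d4`, raises `ValueError` in this sketch, mirroring the requirement that the prefix be explicit.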

(section-generation-hyperparameters)=

GenerationHyperparameters

Controls text generation behavior for rollout.

Parameter	Type	Default	Description
`n_samples`	integer	`1`	Number of sequences to generate per prompt.
`max_new_tokens`	integer	`16384`	Maximum number of tokens to generate.
`min_new_tokens`	integer	`0`	Minimum number of tokens to generate.
`max_tokens`	integer	`32768`	Maximum number of tokens including prompt and generated tokens.
`greedy`	boolean	`False`	Whether to use greedy decoding (max probability).
`top_p`	float	`1.0`	Nucleus sampling probability threshold in (0.0, 1.0].
`top_k`	integer	`100000000`	Number of highest probability tokens to consider.
`temperature`	float	`1.0`	Sampling temperature. Higher values increase diversity.
`stop_token_ids`	list of integer	Required	Stop generation when encountering these token IDs.
`ignore_eos`	boolean	`False`	Do not stop generation when EOS is encountered.
`skip_special_tokens`	boolean	`True`	Skip special tokens when decoding/displaying outputs.
`stop`	list of string \| None	`None`	One or multiple stop words. Generation will stop if one of these words is sampled.
`frequency_penalty`	float	`0.0`	Penalizes tokens based on their frequency in generation so far. Must be between -2 and 2 where negative numbers encourage repetition.
`lora_name`	string	`"default_lora"`	LoRA name to be used for this generation.
`use_beam_search`	boolean	`False`	Enable beam search in the vLLM engine. When enabled, sampling parameters like temperature, top-p, and top-k are automatically ignored.
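
To make the interaction of `greedy`, `temperature`, `top_k`, and `top_p` concrete, here is a rough illustration of one common filtering order. It is a sketch only: the rollout backends implement their own samplers, and `filter_probs` is a hypothetical helper that is not part of AReaL.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=100000000, top_p=1.0, greedy=False):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    if greedy:
        # Greedy decoding keeps only the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank tokens by probability and keep at most top_k of them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    return {i: probs[i] / mass for i in kept}


print(filter_probs([2.0, 1.0, 0.0], top_k=2))
```

With the defaults (`top_p=1.0` and a very large `top_k`), nothing is filtered and the function returns the plain softmax distribution.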

(section-inference-engine)=

InferenceEngine Configuration

Configuration for inference servers, including offpolicyness control.

Parameter	Type	Default	Description
`experiment_name`	string \| None	`None`	-
`trial_name`	string \| None	`None`	-
`fileroot`	string \| None	`None`	Root directory for logs and trajectory dumps.
`max_concurrent_rollouts`	integer \| None	`None`	Maximum number of concurrent rollouts to the inference engine. Defaults to consumer_batch_size.
`queue_size`	integer \| None	`None`	Input/Output queue size for async rollout.
`consumer_batch_size`	integer	`1`	Batch size for consuming rollouts from the queue.
`max_head_offpolicyness`	integer	`0`	Maximum off-policyness for the head. If the current version is more than this many versions behind, the request will not be accepted.
`enable_rollout_tracing`	boolean	`False`	Whether to output verbose tracing messages for each generation request.
`check_trajectory_format`	boolean	`False`	Whether to check the format of produced trajectories of a customized workflow. Useful when debugging the workflow in isolation. Should be False during RL training.
`tokenizer_path`	string	`""`	Path to tokenizer for trajectory text decoding.
`dump_to_file`	boolean	`False`	Whether to dump the trajectories to files under fileroot.
`setup_timeout`	float	`300.0`	Timeout in seconds of connecting to remote servers or launching local servers.
`workers_ready_timeout`	float	`30.0`	Timeout (seconds) for initialize() to wait for guards to be ready.
`request_timeout`	float	`3600`	Timeout for HTTP requests.
`request_retries`	integer	`3`	Number of retries for failed requests.
`pause_grace_period`	float	`0.0`	The grace period after calling /pause_generation. Wait until all requests have been dropped.
`scheduling_spec`	`tuple`	Required	Inference engine schedule specs. Accepts 1 or 2 SchedulingSpec entries: if 1 spec is provided, it is used for both worker and engine, with the engine embedded in the worker; if 2 specs are provided, the first is for the worker and the second is for the engine. Currently only used by the RolloutController.
`backend`	string	Required	Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'sglang:d4', 'vllm:d2t4'. Required.
`scheduling_strategy`	`SchedulingStrategy`	Required	The scheduling strategy of this InferenceEngine, either separation or colocation. Currently only used by the RolloutController.
`use_lora`	boolean	`False`	Whether to use LoRA. Should match the actor's LoRA option.
`lora_name`	string	`""`	LoRA adapter name the rollout backend serves. Generation requests select the adapter by this name (plus the weight version). Usually left empty and auto-filled from gconfig.lora_name by `PPOConfig.__post_init__` so load and request sides stay in sync.
`agent`	`AgentConfig`	Required	Agent workflow configuration used by inference-service rollouts.
`return_routed_experts`	boolean	`False`	Return routed expert indices for MoE models. Effective only when using SGLang engine with MoE models.
`_version`	string	`"v1"`	Rollout controller implementation version. Use 'v1' for legacy RolloutController, 'v2' for RolloutControllerV2. Choices: `v1`, `v2`
`model`	string	`"default"`	Model name exposed through the inference-service gateway.
`routing_strategy`	string	`"round_robin"`	Routing strategy for the inference-service router.
`poll_interval`	float	`5.0`	Health-poll interval in seconds for the inference-service router.
`admin_api_key`	string	`"areal-admin-key"`	Admin API key used by the inference-service gateway, router, and data proxies.
`api_url`	string \| None	`None`	External OpenAI-compatible base URL for inference-service external model mode.
`provider_api_key`	string \| None	`None`	API key for the external OpenAI-compatible provider.
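
Two of the fields above describe simple rules: `max_concurrent_rollouts` falls back to `consumer_batch_size`, and `max_head_offpolicyness` bounds how many versions behind a request may be. The sketch below is a simplified reading of those descriptions, not AReaL's staleness control. Both function names are hypothetical.

```python
from typing import Optional


def resolve_max_concurrent_rollouts(
    max_concurrent_rollouts: Optional[int], consumer_batch_size: int = 1
) -> int:
    """Apply the documented fallback: None means 'use consumer_batch_size'."""
    if max_concurrent_rollouts is None:
        return consumer_batch_size
    return max_concurrent_rollouts


def is_accepted(request_version: int, current_version: int, max_head_offpolicyness: int = 0) -> bool:
    """Assumed rule: reject a request that lags by more than the allowed number of versions."""
    return current_version - request_version <= max_head_offpolicyness


print(resolve_max_concurrent_rollouts(None, consumer_batch_size=32))
print(is_accepted(request_version=3, current_version=5, max_head_offpolicyness=2))
```

Under this reading, the default `max_head_offpolicyness=0` only accepts requests at the current version.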

(section-sg-lang)=

SGLang Configuration

Configuration for SGLang runtime. Refer to https://github.com/sgl-project/sglang for detailed documentation.

Parameter	Type	Default	Description
`model_path`	string	`""`	-
`random_seed`	integer	`1`	-
`skip_tokenizer_init`	boolean	`False`	-
`disable_cuda_graph`	boolean	`False`	-
`disable_radix_cache`	boolean	`True`	-
`disable_cuda_graph_padding`	boolean	`False`	-
`enable_nccl_nvls`	boolean	`False`	-
`disable_outlines_disk_cache`	boolean	`False`	-
`disable_custom_all_reduce`	boolean	`False`	-
`disable_overlap_schedule`	boolean	`False`	-
`enable_mixed_chunk`	boolean	`False`	-
`enable_dp_attention`	boolean	`False`	-
`enable_ep_moe`	boolean	`False`	-
`enable_torch_compile`	boolean	`False`	-
`torch_compile_max_bs`	integer	`32`	-
`cuda_graph_max_bs`	integer \| None	`None`	-
`cuda_graph_bs`	list of integer \| None	`None`	-
`torchao_config`	string	`""`	-
`enable_nan_detection`	boolean	`False`	-
`enable_p2p_check`	boolean	`False`	-
`triton_attention_reduce_in_fp32`	boolean	`False`	-
`triton_attention_num_kv_splits`	integer	`8`	-
`num_continuous_decode_steps`	integer	`1`	-
`enable_memory_saver`	boolean	`False`	-
`allow_auto_truncate`	boolean	`False`	-
`attention_backend`	string \| None	`"fa3"`	-
`enable_multimodal`	boolean	`False`	-
`sampling_backend`	string \| None	`None`	-
`context_length`	integer \| None	`32768`	-
`mem_fraction_static`	float \| None	`0.9`	-
`max_running_requests`	integer \| None	`None`	-
`chunked_prefill_size`	integer \| None	`-1`	-
`max_prefill_tokens`	integer	`32768`	-
`schedule_policy`	string	`"lpm"`	-
`schedule_conservativeness`	float	`1.0`	-
`cpu_offload_gb`	integer	`0`	-
`dtype`	string	`"bfloat16"`	-
`kv_cache_dtype`	string	`"auto"`	-
`dp_size`	integer	`1`	-
`ep_size`	integer	`1`	-
`enable_lora`	boolean \| None	`None`	-
`max_lora_rank`	integer \| None	`None`	-
`max_loaded_loras`	integer	`8`	-
`lora_paths`	list of string \| None	`None`	-
`lora_backend`	string	`"triton"`	-
`log_level`	string	`"warning"`	-
`log_level_http`	string \| None	`"warning"`	-
`log_requests`	boolean	`False`	-
`log_requests_level`	integer	`0`	-
`show_time_cost`	boolean	`False`	-
`enable_metrics`	boolean	`True`	-
`decode_log_interval`	integer	`1`	-
`enable_multithread_load`	boolean	`False`	-
`enable_return_routed_experts`	boolean	`False`	-

(section-v-llm)=

vLLM Configuration

Configuration for vLLM runtime. Refer to https://docs.vllm.ai/en/stable/api/index.html for detailed documentation.

Parameter	Type	Default	Description
`model`	string	`""`	-
`seed`	integer	`1`	-
`skip_tokenizer_init`	boolean	`False`	-
`enforce_eager`	boolean	`False`	-
`dtype`	string	`"bfloat16"`	-
`distributed_executor_backend`	string	`"mp"`	-
`max_num_seqs`	integer	`256`	-
`block_size`	integer	`16`	-
`cpu_offload_gb`	float	`0`	-
`disable_sliding_window`	boolean	`True`	-
`max_model_len`	integer \| None	`32768`	-
`no_enable_chunked_prefill`	boolean	`False`	-
`no_enable_prefix_caching`	boolean	`True`	-
`gpu_memory_utilization`	float	`0.9`	-
`worker_extension_cls`	string	`"areal.engine.vllm_ext.vllm_worker_extension.VLLMWorkerExtension"`	-
`enable_sleep_mode`	boolean	`False`	-
`uvicorn_log_level`	string	`"warning"`	-
`gdn_prefill_backend`	string \| None	`None`	GDN prefill backend for hybrid models like Qwen3.5. Choices: `triton`, `flashinfer`
`enable_lora`	boolean	`False`	-
`max_lora_rank`	integer	`16`	-
`max_loras`	integer	`8`	-
`lora_modules`	list of string \| None	`None`	-

(section-train-dataset)=

TrainDataset Configuration

Configuration for training dataset loading and preprocessing.

Parameter	Type	Default	Description
`split`	string	`"train"`	Dataset split to use, e.g., 'train', 'test'.
`path`	string	Required	Path to the dataset. Can be a local path or a HuggingFace dataset name.
`type`	string	Required	Type of training method, e.g., 'sft', 'rl', etc.
`batch_size`	integer	`1`	Batch size for the dataloader
`shuffle`	boolean	`True`	Whether to shuffle the dataset
`pin_memory`	boolean	`False`	Pin memory for faster data loading (set True for GPU training)
`num_workers`	integer	`0`	Number of worker processes for data loading
`num_dataset_workers`	integer	`1`	Number of remote data-service worker processes to launch when using scheduling_spec.
`drop_last`	boolean	`True`	Drop the last incomplete batch
`max_length`	integer \| None	`None`	Maximum token length of sequences in dataset. Longer sequences are filtered out.
`dataset_kwargs`	`dict`	Required	Additional keyword arguments for dataset loading. These are passed to the dataset loading function `get_custom_dataset`.
`scheduling_spec`	`SchedulingSpec` \| None	Required	Scheduling spec for remote data loading workers. If set, dataset loading will be offloaded to a data service with remote workers.
`setup_timeout`	float	`120.0`	Timeout in seconds for the data service to load and register a dataset. Increase this value when loading large datasets for the first time (e.g. HuggingFace datasets that require downloading and preprocessing).

(section-valid-dataset)=

ValidDataset Configuration

Configuration for validation dataset loading and preprocessing.

The defaults differ from TrainDatasetConfig: `split` defaults to 'test', and `shuffle` and `drop_last` default to False.

Parameter	Type	Default	Description
`split`	string	`"test"`	Dataset split to use, e.g., 'train', 'test'.
`path`	string	Required	Path to the dataset. Can be a local path or a HuggingFace dataset name.
`type`	string	Required	Type of training method, e.g., 'sft', 'rl', etc.
`batch_size`	integer	`1`	Batch size for the dataloader
`shuffle`	boolean	`False`	Whether to shuffle the dataset
`pin_memory`	boolean	`False`	Pin memory for faster data loading (set True for GPU training)
`num_workers`	integer	`0`	Number of worker processes for data loading
`num_dataset_workers`	integer	`1`	Number of remote data-service worker processes to launch when using scheduling_spec.
`drop_last`	boolean	`False`	Drop the last incomplete batch
`max_length`	integer \| None	`None`	Maximum token length of sequences in dataset. Longer sequences are filtered out.
`dataset_kwargs`	`dict`	Required	Additional keyword arguments for dataset loading. These are passed to the dataset loading function `get_custom_dataset`.
`scheduling_spec`	`SchedulingSpec` \| None	Required	Scheduling spec for remote data loading workers. If set, dataset loading will be offloaded to a data service with remote workers.
`setup_timeout`	float	`120.0`	Timeout in seconds for the data service to load and register a dataset. Increase this value when loading large datasets for the first time (e.g. HuggingFace datasets that require downloading and preprocessing).
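
The different `drop_last` defaults for training and validation change how many batches one pass over a dataset yields, and `max_length` changes how many samples remain in the first place. The arithmetic can be sketched as follows, assuming the usual dataloader convention that an incomplete final batch is either dropped or kept. `num_batches` and `filter_by_length` are hypothetical helpers.

```python
import math
from typing import Optional


def filter_by_length(token_lengths: list[int], max_length: Optional[int] = None) -> list[int]:
    """Drop sequences longer than max_length; None keeps everything."""
    if max_length is None:
        return list(token_lengths)
    return [n for n in token_lengths if n <= max_length]


def num_batches(num_samples: int, batch_size: int = 1, drop_last: bool = True) -> int:
    """Count the batches in one pass over the dataset."""
    if drop_last:
        return num_samples // batch_size  # the incomplete final batch is discarded
    return math.ceil(num_samples / batch_size)  # the incomplete final batch is kept


lengths = [120, 4096, 900, 20000, 512, 64, 300, 7000, 1500, 2048]
kept = filter_by_length(lengths, max_length=8192)
print(len(kept), num_batches(len(kept), batch_size=4, drop_last=True))
print(len(kept), num_batches(len(kept), batch_size=4, drop_last=False))
```

Here one of ten samples exceeds `max_length`, so nine remain: the training-style setting yields two full batches, while the validation-style setting yields three and evaluates every remaining sample.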

(section-cluster)=

Cluster Specification Configuration

Configuration for cluster specification and distributed computing setup.

Parameter	Type	Default	Description
`name_resolve`	`NameResolveConfig`	Required	Name resolving configuration.
`cluster_name`	string	`"local"`	Name of the cluster. Used to set specific environs.
`fileroot`	string	`"/tmp/areal/"`	Root for logs and checkpoints. Should be available on all nodes.
`n_nodes`	integer	`32`	The size of the cluster. Used to decide slurm hostname suffix.
`n_gpus_per_node`	integer	`8`	Number of GPUs per node (physical).

(section-name-resolve)=

NameResolve Configuration

Configuration for distributed name resolution and service discovery.

Parameter	Type	Default	Description
`type`	string	`"nfs"`	Type of the distributed KV store for name resolving. Choices: `nfs`, `etcd3`, `ray`
`nfs_record_root`	string	`"/tmp/areal/name_resolve"`	Record root for NFS name resolving. Should be available on all nodes.
`etcd3_addr`	string	`"localhost:2379"`	Address of the ETCD3 server.
`ray_actor_name`	string	`"ray_kv_store"`	Name of the distributed Ray KV store.

(section-evaluator)=

Evaluator Configuration

Configuration for model evaluation scheduling and timing.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`fileroot`	string	Required	-
`freq_epochs`	integer \| None	`None`	Trigger frequency in epochs. None disables epoch-based evaluation.
`freq_steps`	integer \| None	`None`	Trigger frequency in steps. None disables step-based evaluation.
`freq_secs`	integer \| None	`None`	Trigger frequency in seconds. None disables time-based evaluation.
`eval_before_train`	boolean	`False`	Run one evaluation before training begins, then continue with the configured evaluation frequency.

(section-recover)=

Recover Configuration

Configuration for experiment recovery and fault tolerance.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`fileroot`	string	Required	-
`freq_epochs`	integer \| None	`None`	Trigger frequency in epochs. None disables epoch-based saving.
`freq_steps`	integer \| None	`None`	Trigger frequency in steps. None disables step-based saving.
`freq_secs`	integer \| None	`None`	Trigger frequency in seconds. None disables time-based saving.
`mode`	string	`"disabled"`	Recovery mode for the launcher. Options: 'on' or 'auto': Automatically recover from previous runs if recover info and checkpoints are available. 'off' or 'disabled': Never recover from previous runs.
`retries`	integer	`3`	Number of recovery retries when recovery is enabled.
`no_save_optim`	boolean	`False`	Do not save optimizer state in recovery checkpoints. Required when using use_distributed_optimizer with Megatron (flattened_range incompatibility).
`no_load_optim`	boolean	`False`	Do not load optimizer state when recovering from checkpoint.
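
The `mode` description above groups four accepted strings into two behaviours. A minimal sketch of that decision, with a hypothetical `should_recover` helper and boolean inputs standing in for whatever checks the launcher actually performs:

```python
def should_recover(mode: str, has_recover_info: bool, has_checkpoint: bool) -> bool:
    """Decide whether to resume from a previous run under the documented modes."""
    if mode in ("on", "auto"):
        # Recover only when both the recover info and a checkpoint exist.
        return has_recover_info and has_checkpoint
    if mode in ("off", "disabled"):
        return False
    raise ValueError(f"unknown recover mode: {mode!r}")


print(should_recover("auto", has_recover_info=True, has_checkpoint=True))
print(should_recover("disabled", has_recover_info=True, has_checkpoint=True))
```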

(section-saver)=

Saver Configuration

Configuration for model checkpoint saving scheduling and timing.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`fileroot`	string	Required	-
`freq_epochs`	integer \| None	`None`	Trigger frequency in epochs. None disables epoch-based saving.
`freq_steps`	integer \| None	`None`	Trigger frequency in steps. None disables step-based saving.
`freq_secs`	integer \| None	`None`	Trigger frequency in seconds. None disables time-based saving.
`mode`	string	`"auto"`	Checkpoint save mode for HF saves. 'auto': use async for the Archon engine and sync for others (default). 'sync': always synchronous. 'async': always process-based async with pinned-memory staging, which needs extra CPU pinned memory proportional to the per-rank model shard size (e.g., ~17.5GB/rank for a 70B model on 8 GPUs). Non-Archon engines fall back to sync with a warning. Choices: `auto`, `sync`, `async`
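
The `freq_epochs`, `freq_steps`, and `freq_secs` fields appear in the Evaluator, Recover, and Saver tables with the same meaning. The sketch below shows one way such a trigger could behave, assuming the criteria are combined with a logical OR and all counters reset once the trigger fires. `FreqTrigger` is a hypothetical class, and the clock is passed in explicitly to keep the example deterministic.

```python
from typing import Optional


class FreqTrigger:
    """Fire when any enabled frequency criterion is met (assumed OR semantics)."""

    def __init__(
        self,
        freq_epochs: Optional[int] = None,
        freq_steps: Optional[int] = None,
        freq_secs: Optional[int] = None,
        start: float = 0.0,
    ):
        self.freq_epochs = freq_epochs
        self.freq_steps = freq_steps
        self.freq_secs = freq_secs
        self.epochs = 0
        self.steps = 0
        self.last_fire = start

    def check(self, epochs: int, steps: int, now: float) -> bool:
        """Accumulate progress and report whether the trigger fires."""
        self.epochs += epochs
        self.steps += steps
        fire = (
            (self.freq_epochs is not None and self.epochs >= self.freq_epochs)
            or (self.freq_steps is not None and self.steps >= self.freq_steps)
            or (self.freq_secs is not None and now - self.last_fire >= self.freq_secs)
        )
        if fire:
            # Reset all counters so the next interval starts from this point.
            self.epochs, self.steps, self.last_fire = 0, 0, now
        return fire


trigger = FreqTrigger(freq_steps=2)
print([trigger.check(epochs=0, steps=1, now=float(t)) for t in range(4)])
```

Leaving all three fields at `None` gives a trigger that never fires in this sketch.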

(section-stats-logger)=

StatsLogger Configuration

Configuration for experiment statistics logging and tracking services.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`fileroot`	string	Required	-
`wandb`	`WandBConfig`	Required	Weights & Biases configuration.
`swanlab`	`SwanlabConfig`	Required	SwanLab configuration.
`tensorboard`	`TensorBoardConfig`	Required	TensorBoard configuration. Only 'path' field required.
`trackio`	`TrackioConfig`	Required	Trackio configuration (Hugging Face experiment tracking).

(section-swanlab)=

Swanlab Configuration

Configuration for SwanLab experiment tracking and monitoring.

Parameter	Type	Default	Description
`project`	string \| None	`None`	-
`name`	string \| None	`None`	-
`config`	`dict` \| None	`None`	-
`logdir`	string \| None	`None`	-
`mode`	string	`"disabled"`	Tracking mode. One of 'cloud', 'local', 'disabled', or 'offline'. Choices: `cloud`, `local`, `disabled`, `offline`
`api_key`	string \| None	`None`	-

(section-tensor-board)=

TensorBoard Configuration

Configuration for TensorBoard logging and visualization.

Parameter	Type	Default	Description
`path`	string \| None	`None`	-

(section-trackio)=

Trackio Configuration

Configuration for Trackio experiment tracking (Hugging Face).

Trackio is a lightweight, local-first experiment tracking library with a wandb-compatible API. Dashboards can be viewed locally or deployed to Hugging Face Spaces.

See: https://github.com/gradio-app/trackio

Parameter	Type	Default	Description
`mode`	string	`"disabled"`	-
`project`	string \| None	`None`	-
`name`	string \| None	`None`	-
`space_id`	string \| None	`None`	-

(section-wand-b)=

WandB Configuration

Configuration for Weights & Biases experiment tracking.

Parameter	Type	Default	Description
`mode`	string	`"disabled"`	Tracking mode. One of 'online', 'offline', 'disabled', or 'shared'. Choices: `online`, `offline`, `disabled`, `shared`
`wandb_base_url`	string	`""`	-
`wandb_api_key`	string	`""`	-
`entity`	string \| None	`None`	-
`project`	string \| None	`None`	-
`name`	string \| None	`None`	-
`job_type`	string \| None	`None`	-
`group`	string \| None	`None`	-
`notes`	string \| None	`None`	-
`tags`	list of string \| None	`None`	-
`config`	`dict` \| None	`None`	-
`id_suffix`	string \| None	`"train"`	-

(section-agent)=

Agent Configuration

Configuration for agent workflows and the experimental agent service controller.

Consolidates proxy settings (mode, parsers, export) with agent-service orchestration (scheduling, auth) into a single flat dataclass.

Parameter	Type	Default	Description
`agent_cls_path`	string	`""`	Fully-qualified import path for the AgentRunnable implementation.
`admin_api_key`	string	`"areal-admin-key"`	Admin API key for the proxy server and agent-service inter-service auth. Used to authenticate management operations (grant_capacity, start_session). Cannot be used for chat completions. Each session gets a unique API key allocated via start_session. WARNING: Change this from the default for non-local deployments.
`scheduling_spec`	`tuple`	Required	Scheduling spec for agent-service guard workers. Must contain exactly one SchedulingSpec. Use scheduling_spec[0].env_vars for child-process environment variables.
`mode`	string	`"inline"`	OpenAI proxy mode: 'inline' (in-process), 'subproc' (subprocess), or 'online' (external user sessions for online RL training). `inline` mode runs the provided agent workflow directly in the same process. `subproc` mode launches a separate process to run the agent. `online` mode waits for external users to complete sessions via the proxy gateway URL, enabling online RL training. Choices: `inline`, `subproc`, `online`
`tool_call_parser`	string	`"qwen"`	Parser for tool calls in model output.
`reasoning_parser`	string	`"qwen3"`	Parser for reasoning content (`<think>` tags).
`chat_template_type`	string	`"hf"`	Chat template type: 'hf' (standard) or 'concat' (multi-turn concatenation). Choices: `hf`, `concat`
`engine_max_tokens`	integer \| None	`None`	Maximum total tokens for the engine (prompt + completion).
`turn_discount`	float	`1.0`	Discount factor for multi-turn reward propagation.
`export_style`	string	`"individual"`	Export style: 'individual' (all interactions) or 'concat' (leaf nodes only). The 'individual' style exports each interaction (input-output-reward) step separately, and treats them as independent samples to train the model. The 'concat' style exports only the final concatenated trajectory from the root. It is only suitable for linear conversation histories without token mismatching (whether valid depends on the tokenizer). Choices: `individual`, `concat`
`subproc_max_workers`	integer	`4`	Maximum number of worker processes for subprocess mode execution pool.
`session_timeout_seconds`	integer	`3600`	Session timeout in seconds. Sessions inactive longer than this will be garbage collected.
`set_reward_finish_timeout`	float	`0.0`	Timeout in seconds to wait for additional reward updates before finalizing a session.
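
The effect of `turn_discount` can be illustrated with a small example. This sketch assumes a single reward arrives on the final turn and that each earlier turn receives it scaled by one more factor of the discount. The propagation rule AReaL actually uses may differ, and `discounted_turn_rewards` is a hypothetical helper.

```python
def discounted_turn_rewards(
    final_reward: float, num_turns: int, turn_discount: float = 1.0
) -> list[float]:
    """Assign each turn the final reward discounted by its distance from the last turn."""
    return [
        final_reward * turn_discount ** (num_turns - 1 - turn)
        for turn in range(num_turns)
    ]


print(discounted_turn_rewards(1.0, num_turns=3, turn_discount=0.5))
```

With the default `turn_discount=1.0`, every turn receives the full final reward under this assumption.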

(section-archon-engine)=

ArchonEngine Configuration

Configuration for Archon Engine training backend.

Parameter	Type	Default	Description
`attn_type`	string	`"varlen"`	Attention backend type. Use 'tree' for tree training. Choices: `varlen`, `sdpa`, `tree`
`offload_params`	boolean	`False`	Whether to offload FSDP parameters to CPU.
`enable_compile`	boolean	`False`	Enable torch.compile for TransformerBlocks.
`ac_mode`	string	`"selective"`	Activation checkpointing mode. 'memory_budget' requires enable_compile=True. Choices: `none`, `full`, `selective`, `memory_budget`
`selective_ac_option`	string	`"op"`	Selective AC option: 'op' for op-level, or integer string (e.g., '2') for every Nth layer.
`ac_memory_budget`	float	`0.5`	Memory budget for 'memory_budget' AC mode. 0.0 = minimum memory (max recompute), 1.0 = default behavior (no recompute).
`ac_preserve_rng_state`	boolean	`False`	Preserve RNG state during checkpointing for deterministic output. Enabling this may slow down training.
`ac_debug`	boolean	`False`	(Testing only) Capture AC debug information. Will be slower.
`pp_schedule`	string	`"Interleaved1F1B"`	Pipeline parallel schedule type. Choices: `1F1B`, `Interleaved1F1B`, `InterleavedZeroBubble`, `ZBVZeroBubble`
`pp_layers_per_stage`	integer \| None	`None`	Number of transformer layers per (virtual) pipeline stage. If set, num_virtual_stages is calculated from num_layers. If None, stages are inferred from schedule type (1 stage/rank for 1F1B, 2 stages/rank for Interleaved1F1B/InterleavedZeroBubble/ZBVZeroBubble).
`pp_first_stage_less_layers`	integer	`1`	Number of layers to reduce in the first pipeline stage. Accounts for embedding layer overhead.
`pp_last_stage_less_layers`	integer	`1`	Number of layers to reduce in the last pipeline stage. Accounts for output layer overhead.
`reshard_after_forward_policy`	string	`"default"`	FSDP reshard policy after forward pass. 'default': reshard when pipeline parallelism is off; keep unsharded when on to avoid repeated all-gather per microbatch. 'always': always reshard after forward (saves memory). 'never': never reshard after forward. Choices: `default`, `always`, `never`
`fp8_config`	`ArchonFP8Config`	Required	FP8 training configuration. Set mode='blockwise' to enable.
`use_deterministic_algorithms`	boolean	`False`	Enable deterministic algorithms for training reproducibility. Sets torch.use_deterministic_algorithms(True, warn_only=True), CUBLAS_WORKSPACE_CONFIG, NCCL_ALGO, and TORCH_COMPILE_DETERMINISTIC. May reduce performance.
`moe_router_dtype`	string \| None	`"fp32"`	Data type for MoE router gate GEMM computation. 'fp32' runs gate linear in float32 for numerical stability. None uses model dtype (no override). Choices: `fp32`, `None`

(section-archon-fp8)=

ArchonFP8 Configuration

Archon FP8 training configuration.

Parameter	Type	Default	Description
`mode`	string	`"disabled"`	FP8 precision mode. 'disabled': FP8 training off (default). 'blockwise': blockwise 128x128 FP8 e4m3fn matmuls (requires Hopper GPU). Choices: `disabled`, `blockwise`
`exclude_modules`	list of string	Required	FQN substrings of nn.Linear modules to keep in BF16 (not converted to FP8). Any module whose fully-qualified name contains one of these strings is skipped. Meaningful values for Archon models: 'output' (LM head, logit precision sensitive), 'router' (MoE router gate, routing stability sensitive), 'score' (critic head, value precision sensitive). Note: nn.Embedding modules (e.g. tok_embeddings) are never converted regardless of this list. WARNING: Setting this in YAML replaces the entire default list (does not extend it). Include ALL modules you want to keep in BF16.
`include_experts`	boolean	`False`	Apply FP8 to MoE expert computation. Uses per-expert blockwise FP8 matmuls via torchao.
`use_triton`	boolean	`True`	Use Triton GEMM kernel for FP8 blockwise matmuls instead of cuBLAS. Currently must be True: torchao's blockwise FP8 is a prototype that uses mixed per-operand scaling (1x128 activations + 128x128 weights), which torch._scaled_mm does not support. The Triton kernel (triton_fp8_gemm_1x128_128x128) handles this natively. Revisit when torchao stabilizes mixed-mode cuBLAS dispatch.
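
The substring rule for `exclude_modules` can be sketched in a few lines. The module names below are made up for illustration, and `fp8_eligible` is a hypothetical helper rather than the function Archon uses.

```python
def fp8_eligible(module_fqns: list[str], exclude_modules: list[str]) -> list[str]:
    """Keep only modules whose fully-qualified name contains no excluded substring."""
    return [
        fqn
        for fqn in module_fqns
        if not any(pattern in fqn for pattern in exclude_modules)
    ]


# Hypothetical fully-qualified names of nn.Linear modules.
linears = [
    "layers.0.attention.wq",
    "layers.0.moe.router.gate",
    "layers.0.feed_forward.w1",
    "output",
]
print(fp8_eligible(linears, exclude_modules=["output", "router"]))
```

Because matching is by substring, a short pattern such as `"w"` would exclude far more modules than intended, so patterns should be as specific as the model's naming allows.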

(section-dpo)=

DPO Configuration

Configuration for Direct Preference Optimization (DPO) experiments.

Parameter	Type	Default	Description
`experiment_name`	string	Required	Name of the experiment (no '_' or '/'). Required.
`trial_name`	string	Required	Name of the trial (no '-' or '/'). Required.
`cluster`	`ClusterSpecConfig`	Required	Cluster specification. Mainly used by slurm.
`allocation_mode`	string	`""`	DEPRECATED: Use per-engine 'backend' fields instead (e.g., actor.backend, rollout.backend). Legacy pattern-based GPU parallel strategy allocation mode. Only used by SPMD launchers (local/ray/slurm). Manual migration to per-engine 'backend' fields is required.
`seed`	integer	`1`	Random seed for reproducibility.
`enable_offload`	boolean	`False`	Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD).
`total_train_epochs`	integer	`1`	Total number of epochs to train the model.
`total_train_steps`	integer \| None	`None`	Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training.
`total_train_n_seqs`	integer \| None	`None`	Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training.
`tokenizer_path`	string	`""`	Path to the tokenizer.
`train_dataset`	`TrainDatasetConfig`	Required	-
`valid_dataset`	`ValidDatasetConfig` \| None	`None`	-
`saver`	`SaverConfig`	Required	-
`evaluator`	`EvaluatorConfig`	Required	-
`stats_logger`	`StatsLoggerConfig`	Required	-
`perf_tracer`	`PerfTracerConfig` \| None	`None`	Performance tracer configuration. None means disabled.
`memory_profiler`	`MemoryProfilerConfig` \| None	`None`	Memory snapshot profiler configuration. None means disabled.
`recover`	`RecoverConfig`	Required	-
`sglang`	`SGLangConfig`	Required	-
`vllm`	`vLLMConfig`	Required	-
`scheduler`	`SchedulerConfig`	Required	-
`actor`	`DPOEngineConfig`	Required	-
`ref`	`DPOEngineConfig`	Required	-

(section-dpo-engine)=

DPOEngine Configuration

Engine configuration for DPO training, extending TrainEngineConfig with DPO-specific fields.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`path`	string	`""`	Path to HuggingFace checkpoint
`attn_impl`	string	`"flash_attention_2"`	Attention implementation for huggingface transformers model. Accepts builtin transformers backends or a Hugging Face kernels repo ID formatted as org/repo[@revision][:entrypoint]. Choices: `eager`, `sdpa`, `flash_attention_2`, `flash_attention_3`, `flex_attention`
`use_kernels`	boolean	`False`	Enable Hugging Face kernels model kernelization after model creation.
`init_from_scratch`	boolean	`False`	Initialize model weights randomly
`is_critic`	boolean	`False`	Whether to use a critic/reward model
`temperature`	float	`1.0`	Temperature during generation.
`mb_spec`	`MicroBatchSpec`	Required	-
`pad_to_maximum`	boolean	`False`	Whether to pad each microbatch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training.
`disable_dropout`	boolean	`False`	Disable dropout layers during training
`gradient_checkpointing`	boolean	`False`	Enable gradient checkpointing
`dtype`	string	`"bfloat16"`	Forward/backward compute dtype.
`grad_reduce_dtype`	string	`"float32"`	Gradient reduction data type.
`optimizer_dtype`	string	`"float32"`	Underlying parameter storage dtype, also the dtype of optimizer states (exp_avg, exp_avg_sq) since torch.optim.AdamW inherits dtype from model.parameters(). Default 'float32' maintains fp32 master weights matching DeepSpeed ZeRO-3 and Megatron precision-aware optimizer behavior. FSDP2's MixedPrecisionPolicy(param_dtype=`dtype`) will still cast forward/backward computation to `dtype` (e.g. bfloat16). Set to 'bfloat16' together with optimizer.type='adam_bf16' to reduce memory at the cost of needing Kahan summation for stability. Currently FSDP-only; Megatron uses use_precision_aware_optimizer instead and ignores this field.
`optimizer`	`OptimizerConfig` \| None	`None`	Optimizer configuration. None means no training.
`weight_update_mode`	string	`"xccl"`	Weight update backend type. Choices: `disk`, `xccl`
`fsdp`	`FSDPEngineConfig`	Required	-
`archon`	`ArchonEngineConfig`	Required	-
`megatron`	`MegatronEngineConfig`	Required	-
`offload`	boolean	`False`	Whether to offload model parameters and optimizer states to CPU.
`use_lora`	boolean	`False`	Whether to use LoRA. Only supported with FSDP. Note that it should be enabled together with vLLM/SGLang.
`lora_rank`	integer	`32`	LoRA rank.
`lora_alpha`	integer	`16`	LoRA alpha.
`target_modules`	list of string	Required	LoRA target modules.
`peft_type`	string	`"lora"`	PEFT method type. Only LoRA is supported for now.
`enable_tree_training`	boolean	`False`	Enable tree training with flex attention module.
`scheduling_spec`	`tuple`	Required	Train engine schedule specs. Accepts 1 or 2 SchedulingSpec entries: if 1 spec is provided, it is used for both worker and engine, with the engine embedded in the worker; if 2 specs are provided, the first is for the worker and the second is for the engine. Currently only used by the TrainController.
`backend`	string	Required	Backend and parallelism strategy. Must include an explicit backend prefix, e.g. 'fsdp:d4', 'megatron:d4t2p2', 'archon:d2'. Required.
`_version`	string	`"v1"`	Train controller implementation version. Use 'v1' for legacy TrainController, 'v2' for GatewayTrainController. Choices: `v1`, `v2`
`admin_api_key`	string	`"areal-admin-key"`	Admin API key used by gateway/router/data-proxy in controller v2.
`log_level`	string	`"warning"`	Gateway stack log level for controller v2.
`request_timeout`	float	`3600.0`	Gateway request timeout in seconds for controller v2.
`setup_timeout`	float	`3600.0`	Gateway setup timeout in seconds for controller v2.
`workers_ready_timeout`	float	`30.0`	Timeout (seconds) for initialize() to wait for guards to be ready.
`scheduling_strategy`	`SchedulingStrategy`	Required	The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController.
`beta`	float	`0.1`	KL penalty coefficient for DPO loss.
`loss_type`	string	`"sigmoid"`	DPO loss variant. 'sigmoid': original DPO loss (Rafailov et al. 2023). 'ipo': Identity Preference Optimization with per-token length normalization (Azar et al. 2023). Choices: `sigmoid`, `ipo`
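
For orientation, the two `loss_type` variants can be written for a single preference pair as below. This follows the published formulations of the sigmoid DPO loss and the IPO loss and is not taken from AReaL's code. `dpo_loss` is a hypothetical function, and for `ipo` the inputs are assumed to be per-token averaged log-probabilities, in line with the length normalization mentioned above.

```python
import math


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
    loss_type: str = "sigmoid",
) -> float:
    """Preference loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    if loss_type == "sigmoid":
        # -log(sigmoid(beta * margin)), computed as a numerically stable softplus.
        z = beta * margin
        return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    if loss_type == "ipo":
        # Squared distance between the margin and the target 1 / (2 * beta).
        return (margin - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unknown loss_type: {loss_type!r}")


print(dpo_loss(-10.0, -14.0, -11.0, -12.0, beta=0.1))
print(dpo_loss(-1.0, -6.0, -1.0, -1.0, beta=0.1, loss_type="ipo"))
```

A larger `beta` makes the sigmoid loss more sensitive to the margin, while for IPO it lowers the target margin `1 / (2 * beta)`.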

(section-distributed-data-parallel)=

DistributedDataParallel Configuration

Configuration for Megatron's DistributedDataParallel.

Refer to Megatron-LM documentation for details.

Parameter	Type	Default	Description
`grad_reduce_in_fp32`	boolean	`True`	-
`overlap_grad_reduce`	boolean	`False`	-
`overlap_param_gather`	boolean	`False`	-
`align_param_gather`	boolean	`False`	-
`use_distributed_optimizer`	boolean	`True`	-
`check_for_nan_in_grad`	boolean	`False`	-
`bucket_size`	integer \| None	`None`	-
`average_in_collective`	boolean	`False`	-
`fp8_param_gather`	boolean	`False`	-

(section-fp8-engine)=

FP8Engine Configuration

Configuration for FP8 (8-bit floating point) training.

This configuration encapsulates all FP8-related parameters and can be reused across different engines (e.g., Megatron, FSDP). When None in the parent config, FP8 training is disabled.

Parameter	Type	Default	Description
`mode`	string	`"e4m3"`	FP8 precision mode. Options: 'e4m3' (uniform e4m3), 'hybrid' (e4m3 for activations/weights, e5m2 for output activation gradients).
`recipe`	string	`"delayed"`	FP8 scaling recipe. Options: 'tensorwise', 'delayed', 'mxfp8' (Blackwell only), 'blockwise'.
`param`	boolean	`False`	Keep parameters in FP8 precision to save memory. Not all parameters will be converted to fp8; for example, biases will remain unchanged.
`margin`	integer	`0`	Margin for FP8 scaling factor computation.
`amax_history_len`	integer	`1`	Length of amax history window for scaling factor computation.
`amax_compute_algo`	string	`"most_recent"`	Algorithm for choosing amax value. Options: 'max' (largest in history window), 'most_recent'.
`wgrad`	boolean	`True`	When False, override FP8 config and compute weight gradients in higher precision.
`dot_product_attention`	boolean	`False`	Use FP8 implementation of Dot Product Attention.
`multi_head_attention`	boolean	`False`	Use FP8 implementation of Multi Head Attention.
`tp_only_amax_red`	boolean	`False`	Reduce FP8 AMAX only in TP or TP-CP domain.
`first_last_layers_bf16`	boolean	`False`	Retain first and last N TransformerBlocks in BF16 instead of FP8.
`num_layers_at_start_in_bf16`	integer	`1`	Number of layers at start to keep in BF16 when first_last_layers_bf16 is True.
`num_layers_at_end_in_bf16`	integer	`1`	Number of layers at end to keep in BF16 when first_last_layers_bf16 is True.
`direct_convert`	boolean	`True`	Whether to use direct FP8 conversion during weight updates and save/load. When True, FP8 parameters are directly converted between TE FP8 and PyTorch FP8 without intermediate dequantization/quantization.
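
To show how `margin`, `amax_history_len`, and `amax_compute_algo` interact under the `'delayed'` recipe, here is a minimal sketch of a delayed-scaling factor computation. It is an illustration under assumptions, not the engine's code: the helper names are hypothetical, the history is assumed to be ordered oldest to newest, and the rule `fp8_max / amax / 2**margin` follows the commonly used delayed-scaling formulation.

```python
import math

# Largest finite magnitude representable in each FP8 format.
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


def pick_amax(history, algo="most_recent"):
    """Choose the amax value from a window of recent absolute maxima."""
    if algo == "max":
        return max(history)          # largest value in the history window
    if algo == "most_recent":
        return history[-1]           # assumes the newest entry is last
    raise ValueError(f"unknown amax_compute_algo: {algo!r}")


def scaling_factor(history, fmt="e4m3", margin=0, algo="most_recent",
                   prev_scale=1.0):
    """Scale applied before casting to FP8 (hypothetical helper)."""
    amax = pick_amax(history, algo)
    if amax <= 0.0 or not math.isfinite(amax):
        return prev_scale            # keep the old scale on degenerate amax
    # Each unit of margin halves the scale, leaving headroom below fp8_max.
    return FP8_MAX[fmt] / amax / (2 ** margin)
```

For a history of `[2.0, 4.0, 1.0]`, the `'max'` algorithm scales against 4.0 while `'most_recent'` scales against 1.0, so the latter reacts faster but is more exposed to overflow if the next tensor is larger.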

(section-megatron-engine)=

MegatronEngine Configuration

Configuration for Megatron-LM training framework.

Refer to Megatron-LM documentation for implementation details.

Parameter	Type	Default	Description
`wrap_with_ddp`	boolean	`True`	-
`use_torch_fsdp2`	boolean	`False`	-
`use_custom_fsdp`	boolean	`False`	-
`ddp`	`DistributedDataParallelConfig`	Required	-
`virtual_pipeline_parallel_size`	integer	`1`	Virtual pipeline parallel size for Megatron interleaved schedule. Set to >1 to enable VPP. Default is 1 (disabled).
`overlap_param_gather_with_optimizer_step`	boolean	`False`	-
`use_precision_aware_optimizer`	boolean	`False`	Enable precision-aware optimizer for Megatron. When using adam_bf16 optimizer type with Megatron Engine, this is automatically enabled with exp_avg_dtype=bfloat16 and exp_avg_sq_dtype=bfloat16.
`main_grads_dtype`	string	`"float32"`	-
`main_params_dtype`	string	`"float32"`	-
`exp_avg_dtype`	string	`"float32"`	-
`exp_avg_sq_dtype`	string	`"float32"`	-
`async_save`	boolean	`False`	If True, Megatron checkpoint saves run in background processes and save_checkpoint() returns immediately after weights are durably staged off the GPU. Pending saves are drained before the next load_checkpoint() and during engine.destroy(). Reduces per-save sync wait on large MoE checkpoints.
`use_checkpoint_opt_param_scheduler`	boolean	`True`	-
`use_deterministic_algorithms`	boolean	`False`	-
`recompute_granularity`	string \| None	`"full"`	-
`recompute_method`	string \| None	`"uniform"`	-
`recompute_num_layers`	integer \| None	`1`	-
`distribute_saved_activations`	boolean \| None	`None`	-
`recompute_modules`	list of string \| None	`None`	-
`moe_router_dtype`	string \| None	`"fp32"`	-
`moe_shared_expert_overlap`	boolean	`False`	Enable overlapping between shared expert computations and dispatcher communications. Without this, the shared experts execute after the routed experts.
`moe_enable_deepep`	boolean	`False`	-
`moe_token_dispatcher_type`	string	`"alltoall"`	Type of token dispatcher. Options: 'allgather', 'alltoall', and 'flex'.
`moe_permute_fusion`	boolean	`False`	Fuse token rearrangement ops during token dispatching.
`fp8_config`	`FP8EngineConfig` \| None	`None`	-
`bridge_type`	string	`"mbridge"`	Bridge backend for MegatronEngine. Choices: `mbridge`, `megatron-bridge`
`use_mbridge_save`	boolean	`False`	Use mbridge's save method to save GPU memory when saving weights.
`use_bridge_for_update_weights`	boolean	`False`	When True and bridge_type='megatron-bridge', delegate live weight sync to bridge.export_hf_weights instead of the hand-rolled convert_to_hf registry. Required for models without a registry entry (e.g. Qwen3.5). FP8 paths fall back to the registry automatically.

(section-memory-profiler)=

MemoryProfiler Configuration

CUDA memory snapshot profiling configuration.

Attributes:
    profile_steps: Steps at which to record memory snapshots.
    max_entries: Max entries for torch.cuda.memory._record_memory_history.

Parameter	Type	Default	Description
`profile_steps`	list of integer	Required	List of global steps to capture memory snapshots.
`max_entries`	integer	`100000`	Max entries for memory history ring buffer.

(section-perf-tracer)=

PerfTracer Configuration

Configuration for perf tracer emission.

Parameter	Type	Default	Description
`experiment_name`	string	Required	-
`trial_name`	string	Required	-
`fileroot`	string	Required	-
`enabled`	boolean	`False`	Explicitly enable or disable perf tracing. Set to true to capture perf traces.
`save_interval`	integer	`1`	Flush trace events to disk every N calls to save(step=...). A value of 1 writes on every step; values <= 0 fall back to 1.
`profile_steps`	list of integer \| None	`None`	List of step numbers at which to capture detailed profiling traces. If None, no detailed profiling traces are captured.
`session_tracer`	`SessionTracerConfig` \| None	`None`	Session tracing configuration.
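
The `save_interval` behavior described above (flush on every N-th call to `save`, with non-positive values treated as 1) can be sketched as below. The class `TraceBuffer` and its attributes are hypothetical stand-ins for illustration; an in-memory list takes the place of the trace file.

```python
class TraceBuffer:
    """Hypothetical tracer that batches events between flushes."""

    def __init__(self, save_interval=1):
        # Values <= 0 fall back to 1, i.e. flush on every save() call.
        self.save_interval = save_interval if save_interval > 0 else 1
        self.pending = []    # events recorded since the last flush
        self.flushed = []    # stands in for the on-disk trace file
        self.last_step = None
        self._calls = 0

    def record(self, event):
        self.pending.append(event)

    def save(self, step):
        """Count the call and flush buffered events on every N-th call."""
        self._calls += 1
        self.last_step = step
        if self._calls % self.save_interval != 0:
            return False
        self.flushed.extend(self.pending)  # a real tracer would write to disk
        self.pending.clear()
        return True
```

A larger interval reduces write frequency at the cost of holding more events in memory between flushes.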

(section-rejection-sampling)=

RejectionSampling Configuration

Unified configuration for sample filtering based on policy divergence.

Filters tokens/sequences where the divergence between proximal policy and behavior policy exceeds a threshold, via two action modes:

- 'mask': zero out loss_mask (rejection, exclude from gradient)
- 'clamp': clamp importance weight to bounds (truncation, bounded gradient)

Supports direct ratio bounds and KL divergence estimators (K1/K2/K3),
at both token-level and sequence-level granularity.

Replaces the removed ``behave_imp_weight_cap`` and ``behave_imp_weight_mode``.

Attributes:
    level: Filtering granularity ('token' or 'sequence'). When ``level='sequence'``
        and ``metric='ratio'``, both the filtering decision and the correction
        weight (behave_imp_weight) use the sequence-level geometric mean,
        matching the old ``sequence_mask``/``sequence_truncate`` semantics.
    action: Action mode ('mask' or 'clamp').
    metric: Divergence metric ('ratio', 'kl_k1', 'kl_k2', 'kl_k3').
    agg: Aggregation method for sequence-level ('sum', 'mean', 'max').
        For 'ratio' metric, aggregation is performed in log space (geometric
        mean/sum) to avoid the "length trap" and match GSPO semantics.
        For KL metrics, aggregation is arithmetic.
    upper: Upper bound for filtering.
    lower: Lower bound for filtering (optional).

Parameter	Type	Default	Description
`level`	string	`"token"`	Filtering granularity. 'token': per-token filtering (each token judged independently). 'sequence': per-sequence filtering (all tokens in a sequence share the same fate). When metric='ratio', both the filtering decision and the correction weight (behave_imp_weight) operate at sequence level using the geometric mean. Choices: `token`, `sequence`
`action`	string	`"mask"`	Action to take when metric exceeds threshold. 'mask': zero out loss_mask for filtered tokens/sequences (rejection, completely excludes from gradient computation). 'clamp': clamp importance weight to [lower, upper] bounds (truncation, tokens still participate in gradient but with bounded weight). Choices: `mask`, `clamp`
`metric`	string	`"ratio"`	Divergence metric for filtering. 'ratio': direct importance ratio π_proximal/π_behave. 'kl_k1': KL estimator k1 = log(r), forward KL unbiased estimator (can be negative). 'kl_k2': KL estimator k2 = 0.5 * (log r)^2, non-negative quadratic approximation. 'kl_k3': KL estimator k3 = r - log(r) - 1, non-negative exact forward KL estimator. 'binary_kl': KPop (symmetric binary KL divergence) — masks tokens where either KL(proximal
`agg`	string	`"mean"`	Aggregation method for sequence-level filtering. Only used when level='sequence'. For 'ratio' metric, aggregation is in log space: 'sum' = exp(sum(log(r_i))), 'mean' = exp(mean(log(r_i))) = geometric mean (length-invariant, consistent with GSPO). For KL metrics, aggregation is arithmetic: 'sum' = sum(kl_i), 'mean' = mean(kl_i). 'max': max of per-token metric values (most conservative). Choices: `sum`, `mean`, `max`
`upper`	float	`5.0`	Upper bound for filtering. Tokens/sequences with metric > upper are filtered out (loss_mask zeroed). For 'ratio' metric: must be > 1.0, typical values are 2.0 or 5.0. For 'kl_k2'/'kl_k3' metrics: typical values are 0.5-2.0.
`lower`	float \| None	`None`	Lower bound for filtering (optional). None means no lower bound. For 'ratio' metric: typical value is 0.5 (filter out tokens where policy probability dropped significantly). Must be > 0. For 'kl_k1' metric: can be used to filter negative KL estimates.
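
The interaction of `level`, `metric`, `agg`, `upper`, and `lower` can be sketched in plain Python. This is a simplified illustration under assumptions, not AReaL's implementation: all function names are hypothetical, inputs are per-token log-ratios `log(π_proximal / π_behave)` for a single sequence, only the four metrics with complete formulas above are covered, and values below `lower` are assumed to be filtered the same way as values above `upper`.

```python
import math


def token_metric(log_ratio, metric):
    """Per-token divergence value for one log importance ratio."""
    r = math.exp(log_ratio)
    if metric == "ratio":
        return r
    if metric == "kl_k1":
        return log_ratio
    if metric == "kl_k2":
        return 0.5 * log_ratio ** 2
    if metric == "kl_k3":
        return r - log_ratio - 1.0
    raise ValueError(f"unsupported metric in this sketch: {metric!r}")


def sequence_metric(log_ratios, metric, agg):
    """Aggregate per-token values into one value for the whole sequence."""
    if metric == "ratio":
        # Aggregate in log space, then exponentiate ('mean' = geometric mean).
        vals = list(log_ratios)
        finish = math.exp
    else:
        # KL metrics are aggregated arithmetically.
        vals = [token_metric(x, metric) for x in log_ratios]
        finish = float
    if agg == "sum":
        return finish(sum(vals))
    if agg == "mean":
        return finish(sum(vals) / len(vals))
    if agg == "max":
        return finish(max(vals))
    raise ValueError(f"unknown agg: {agg!r}")


def keep_mask(log_ratios, level="token", metric="ratio", agg="mean",
              upper=5.0, lower=None):
    """action='mask': True where a token stays in the loss."""
    def in_bounds(v):
        return v <= upper and (lower is None or v >= lower)

    if level == "sequence":
        keep = in_bounds(sequence_metric(log_ratios, metric, agg))
        return [keep] * len(log_ratios)   # all tokens share the same fate
    return [in_bounds(token_metric(x, metric)) for x in log_ratios]


def clamp_weight(ratio, upper=5.0, lower=None):
    """action='clamp': bound the importance weight instead of masking."""
    clamped = min(ratio, upper)
    return clamped if lower is None else max(clamped, lower)
```

For a three-token sequence with ratios 1, 10, and 1 and `upper=5.0`, token-level filtering drops only the middle token, sequence-level filtering with `agg='mean'` keeps the whole sequence (geometric mean of about 2.15), and `agg='sum'` drops it (product of 10).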

(section-scheduler)=

Scheduler Configuration

Configuration for worker scheduling. Used in the single-controller mode. Experimental.

Parameter	Type	Default	Description
`type`	string \| None	`None`	-
`endpoint`	string	`"http://localhost:8081"`	-
`deploy_mode`	string	`"separation"`	-
`functioncall_service_domain`	string	`"http://localhost:8080"`	-
`reward_functioncall_config`	`dict`	Required	-
`reward_model_path`	string	`""`	-
`reward_model_service_url`	string	`"http://localhost:30000/classify"`	-

(section-scheduling)=

Scheduling Specification

Configuration class: SchedulingSpec

Parameter	Type	Default	Description
`cpu`	integer	`8`	Number of CPU cores required per GPU
`gpu`	integer	`0`	Number of GPU units required. Used only when allocating pods.
`mem`	integer	`32`	Amount of memory (GB) required per GPU
`port_count`	integer	`2`	Number of ports to expose
`image`	string	`"/storage/openpsi/images/areal-latest.sif"`	Docker/Singularity container image to use. Currently only used by Slurm. May be used by Kubernetes in the future.
`task_type`	string	`"worker"`	Task type (e.g., worker, engine). Choices: `worker`, `engine`
`env_vars`	`dict`	Required	Environment variables for the container
`cmd`	string \| None	`None`	Command to execute inside the container. Defaults to AReaL's RPC server.
`srun_additional_args`	string	`"--unbuffered --mpi=pmi2 -K --chdir $PWD"`	Additional arguments to pass to the srun command. Only used by slurm.
`additional_bash_cmds`	list of string \| None	`None`	Additional bash commands to setup the container before running the torchrun command. Only used by slurm.
`container_type`	string	`"apptainer"`	Type of container used in slurm. Choices: `apptainer`, `none`
`mount`	string	`"/storage:/storage"`	Mount path for slurm.
`nodelist`	string \| None	`None`	sbatch/srun's `--nodelist` option for slurm.
`exclude`	string \| None	`None`	sbatch/srun's `--exclude` option for slurm.
`ray_placement_strategy`	string	`"shared"`	Which placement strategy to use for Ray scheduling. Shared will produce 1 placement group for all workers in the role (training). Separate will produce 1 placement group per worker (rollout). Deferred will do the same as separate but defers accelerator scheduling (multinode rollout). Choices: `shared`, `separate`, `deferred`

(section-scheduling-strategy)=

SchedulingStrategy

Configuration class: SchedulingStrategy

Parameter	Type	Default	Description
`type`	string	`"separation"`	- Choices: `separation`, `colocation`
`target`	string \| None	`None`	The target role to be colocated with
`fork`	boolean	`True`	When True with colocation, the target worker spawns a new process on the same node/GPUs instead of sharing its process. Provides process isolation while sharing GPU resources.

(section-session-tracer)=

SessionTracer Configuration

Configuration for per-session lifecycle tracing.

Parameter	Type	Default	Description
`enabled`	boolean	`False`	Enable per-session lifecycle tracing alongside perf events. When true, session metadata is captured to sessions.jsonl.
`flush_threshold`	integer	`256`	Flush session trace records once this many entries are ready. Values <= 0 fall back to 1.

(section-teacher)=

Teacher Configuration

Configuration class: TeacherConfig

Parameter	Type	Default	Description
`engine_type`	string	`"rollout"`	Teacher engine type. 'rollout' uses inference engine scoring; 'train' uses the legacy train-engine teacher path. Choices: `rollout`, `train`
`rollout`	`InferenceEngineConfig` \| None	`None`	-
`train`	`PPOActorConfig` \| None	`None`	Legacy train-engine teacher config. Required when engine_type='train'.
`path`	string	`""`	Teacher model path. If set, overrides shared rollout backend model path.
`offload`	boolean	`False`	Whether to offload teacher rollout model between steps
`rl_loss_weight`	float	`1.0`	RL loss weight
`distill_loss_weight`	float	`0.005`	Distillation loss weight
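
As an illustration of the constraint that `train` must be set when `engine_type='train'`, and of how the two loss weights might be applied, consider the sketch below. The class `TeacherSketch` is hypothetical and mirrors only a few fields; a plain dict stands in for `PPOActorConfig`, and combining the two terms as a weighted sum is an assumption, not something this reference states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TeacherSketch:
    """Hypothetical mirror of a few TeacherConfig fields."""

    engine_type: str = "rollout"
    train: Optional[dict] = None       # stands in for PPOActorConfig
    rl_loss_weight: float = 1.0
    distill_loss_weight: float = 0.005

    def __post_init__(self):
        if self.engine_type not in ("rollout", "train"):
            raise ValueError(f"unknown engine_type: {self.engine_type!r}")
        if self.engine_type == "train" and self.train is None:
            raise ValueError("train config is required when engine_type='train'")

    def combine(self, rl_loss, distill_loss):
        # Assumption: the total objective is a weighted sum of the two terms.
        return (self.rl_loss_weight * rl_loss
                + self.distill_loss_weight * distill_loss)
```

With the defaults, the distillation term contributes 0.5% of its raw value, so it acts as a light regularizer next to the RL objective under this assumed combination.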