Allocation Mode

May 8, 2026

This document describes AReaL's allocation mode system, which controls how GPUs are distributed between inference and training backends during distributed RL training.

Overview

Each engine component (actor, critic, rollout, ref, teacher) has its own backend configuration field that specifies:

  • Which backend to use (SGLang, vLLM for inference; FSDP, Megatron, Archon for training)
  • The parallelization strategy
  • The total number of GPUs required

AReaL parses each backend string into a ModelAllocation object that drives resource allocation for that specific engine.

Configuration

Per-Engine Backend Fields

Each engine in the YAML config has its own backend field:

# Rollout (inference) engine
rollout:
  backend: "sglang:d4t2"

# Actor (training) engine
actor:
  backend: "fsdp:d8"

# Critic engine (falls back to actor.backend if empty)
critic:
  backend: ""

# Ref engine (falls back to actor.backend if empty)
ref:
  backend: ""

When critic.backend or ref.backend is empty, it automatically inherits from actor.backend.
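The fallback rule can be illustrated with a small sketch. This is not AReaL's own code: `resolve_backends` is a hypothetical helper, and the config is modeled as a plain dictionary for illustration only.

```python
def resolve_backends(config: dict) -> dict:
    """Return the effective backend string per engine (hypothetical helper)."""
    actor_backend = config["actor"]["backend"]
    resolved = {}
    for engine, section in config.items():
        backend = section.get("backend", "")
        # Only critic and ref fall back to the actor's backend when empty.
        if not backend and engine in ("critic", "ref"):
            backend = actor_backend
        resolved[engine] = backend
    return resolved


config = {
    "rollout": {"backend": "sglang:d4t2"},
    "actor": {"backend": "fsdp:d8"},
    "critic": {"backend": ""},
    "ref": {"backend": ""},
}
effective = resolve_backends(config)
print(effective)
```

Here `critic` and `ref` both resolve to `fsdp:d8`, while `rollout` keeps its own backend string.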

Note: The top-level allocation_mode config field is deprecated and only retained for backward compatibility with legacy SPMD launchers (local/ray/slurm). It is ignored by the single-controller scheduler. Use the per-engine backend fields shown above instead.

Backend String Syntax

<backend>:<parallelism_dims>

For example, fsdp:d4t2 means: use the FSDP backend with data parallelism 4 and tensor parallelism 2.

Parallelism Dimensions

| Dimension | Abbreviation | Description | Valid For |
|-----------|--------------|-------------|-----------|
| Data | d | Number of model replicas | All backends |
| Tensor | t | Split operations across GPUs | All backends |
| Pipeline | p | Split layers across GPUs in stages | Megatron, Archon |
| Context | c | Split sequence length across GPUs | All backends |
| Expert | e | Split MoE experts across GPUs | Megatron, Archon |

Dimensions are specified as <abbrev><size>, e.g., d4t2 means data parallel size 4 and tensor parallel size 2.
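To make the syntax concrete, here is a minimal parsing sketch. It is an illustration under stated assumptions, not AReaL's parser: `parse_backend_string` and `DIM_NAMES` are hypothetical names, and the hybrid `(attn:...|ffn:...)` form is deliberately not handled.

```python
import re

# Abbreviations from the dimensions table.
DIM_NAMES = {"d": "data", "t": "tensor", "p": "pipeline", "c": "context", "e": "expert"}


def parse_backend_string(spec: str) -> tuple[str, dict[str, int]]:
    """Split '<backend>:<parallelism_dims>' into (backend, {abbrev: size})."""
    backend, sep, dims_str = spec.partition(":")
    if not sep or not backend or not dims_str:
        raise ValueError(f"expected '<backend>:<parallelism_dims>', got {spec!r}")
    tokens = re.findall(r"([a-z])(\d+)", dims_str)
    # Reject strings that contain anything besides <abbrev><size> pairs.
    if "".join(abbrev + size for abbrev, size in tokens) != dims_str:
        raise ValueError(f"malformed dimensions: {dims_str!r}")
    dims: dict[str, int] = {}
    for abbrev, size in tokens:
        if abbrev not in DIM_NAMES:
            raise ValueError(f"unknown dimension {abbrev!r}")
        if abbrev in dims:
            raise ValueError(f"dimension {abbrev!r} given twice")
        dims[abbrev] = int(size)
    return backend, dims


print(parse_backend_string("fsdp:d4t2"))
```

A bare string such as `d4t2` has no `:` separator, so the sketch rejects it with a `ValueError`.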

Calculating GPU Requirements

The total GPUs for a component is computed as:

world_size = dp × tp × pp × cp

Expert parallelism (e) does not increase world size; it only changes how experts are placed within the existing GPU mesh.

Examples

| Backend String | GPUs per Engine | Notes |
|----------------|-----------------|-------|
| fsdp:d8 | 8 | 8 data-parallel replicas |
| sglang:d2t4 | 8 | 2 instances × 4 TP GPUs |
| megatron:d2p2t4 | 16 | 2 DP × 2 PP × 4 TP |
| megatron:d2p2t4e4 | 16 | Same mesh, 4-way expert parallelism |
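The world-size formula can be checked against these examples with a short sketch. `world_size` is a hypothetical helper written for this illustration; it assumes omitted dimensions default to 1 and, following the rule above, leaves `e` out of the product.

```python
import re
from math import prod


def world_size(spec: str) -> int:
    """GPUs implied by '<backend>:<dims>' via dp x tp x pp x cp."""
    dims_str = spec.partition(":")[2]
    dims = {abbrev: int(size) for abbrev, size in re.findall(r"([a-z])(\d+)", dims_str)}
    # Expert parallelism (e) is intentionally excluded from the product.
    return prod(dims.get(key, 1) for key in "dtpc")


sizes = {
    spec: world_size(spec)
    for spec in ("fsdp:d8", "sglang:d2t4", "megatron:d2p2t4", "megatron:d2p2t4e4")
}
print(sizes)
```

The two Megatron strings yield the same result, which reflects that adding `e4` changes expert placement but not the GPU count.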

Full Config Example

```yaml
# 16-GPU setup: 8 inference + 8 training
rollout:
  backend: "sglang:d2t4"  # 2 × 4 = 8 GPUs
actor:
  backend: "fsdp:d4t2"    # 4 × 2 = 8 GPUs
```
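Before launching, it can help to confirm that a config fits the cluster. The sketch below is one possible check, assuming (as in this example) that the rollout and actor engines occupy disjoint GPUs; `world_size` and `check_gpu_budget` are hypothetical helpers, not AReaL functions.

```python
import re
from math import prod


def world_size(spec: str) -> int:
    """GPUs implied by '<backend>:<dims>' via dp x tp x pp x cp (e is ignored)."""
    dims = dict(re.findall(r"([a-z])(\d+)", spec.partition(":")[2]))
    return prod(int(dims.get(key, 1)) for key in "dtpc")


def check_gpu_budget(backends: dict[str, str], total_gpus: int) -> dict[str, int]:
    """Return per-engine GPU usage, or raise if the engines need more than total_gpus."""
    usage = {engine: world_size(spec) for engine, spec in backends.items()}
    needed = sum(usage.values())
    if needed > total_gpus:
        raise ValueError(f"config needs {needed} GPUs but only {total_gpus} are available")
    return usage


usage = check_gpu_budget({"rollout": "sglang:d2t4", "actor": "fsdp:d4t2"}, total_gpus=16)
print(usage)
```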

Backend Selection

Inference Backends

| Backend | Supported Dimensions |
|---------|----------------------|
| sglang | d, t, p |
| vllm | d, t, p |

For inference, d represents the number of independent server instances, and each instance uses t × p GPUs.

Note that a backend's own parallelism settings do not affect how AReaL allocates GPUs. For example, with rollout.backend: "sglang:d4t4" you can additionally set sglang.dp_size=4, sglang.ep_size=4, and sglang.enable_dp_attention=True. AReaL still launches 4 model replicas with 4 GPUs each; within each replica, SGLang uses DP attention and expert parallelism to distribute computation across the attention and expert layers.
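The instance layout for an inference backend can be pictured as follows. This is only a sketch: `instance_gpu_groups` is a hypothetical helper, and the contiguous assignment of GPU indices is an assumption made for illustration, since the actual placement is decided by the scheduler.

```python
def instance_gpu_groups(d: int, t: int, p: int = 1) -> list[list[int]]:
    """Partition d * t * p GPU indices into d groups of t * p GPUs each."""
    per_instance = t * p
    return [
        list(range(index * per_instance, (index + 1) * per_instance))
        for index in range(d)
    ]


# "sglang:d4t4": 4 independent server instances with 4 GPUs each.
groups = instance_gpu_groups(d=4, t=4)
for index, gpus in enumerate(groups):
    print(f"instance {index}: GPUs {gpus}")
```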

Training Backends

| Backend | Supported Dimensions | Use Case |
|---------|----------------------|----------|
| fsdp | d, t, c | Default for simple parallelism |
| megatron | d, t, p, c, e | Required for pipeline or expert parallelism |
| archon | d, t, p, c, e | Alternative to Megatron (experimental) |

Important: An explicit backend prefix is required in all allocation strings. Bare dimension strings (e.g., d4t2) are no longer accepted. Always specify the backend explicitly: fsdp:d4t2, megatron:d2p2t4, sglang:d4t2.
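The prefix requirement and the per-backend dimension tables lend themselves to a simple pre-flight check. The following sketch is hypothetical (`SUPPORTED_DIMS` and `validate_backend_string` are names invented here) and mirrors only the rules stated in this document; AReaL's own validation may differ.

```python
import re

# Supported dimensions per backend, copied from the tables in this section.
SUPPORTED_DIMS = {
    "sglang": set("dtp"),
    "vllm": set("dtp"),
    "fsdp": set("dtc"),
    "megatron": set("dtpce"),
    "archon": set("dtpce"),
}


def validate_backend_string(spec: str) -> None:
    """Raise ValueError if spec lacks a backend prefix or uses unsupported dims."""
    backend, sep, dims_str = spec.partition(":")
    if not sep:
        raise ValueError(f"missing backend prefix in {spec!r}; write e.g. 'fsdp:{spec}'")
    if backend not in SUPPORTED_DIMS:
        raise ValueError(f"unknown backend {backend!r}")
    used = {abbrev for abbrev, _ in re.findall(r"([a-z])(\d+)", dims_str)}
    unsupported = sorted(used - SUPPORTED_DIMS[backend])
    if unsupported:
        raise ValueError(f"{backend} does not support dimension(s) {unsupported}")


validate_backend_string("fsdp:d4t2")        # accepted
validate_backend_string("megatron:d2p2t4")  # accepted
```

With this check, `d4t2` fails for lacking a prefix and `fsdp:d2p2` fails because FSDP does not list pipeline parallelism.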

MoE Hybrid Parallelism

For Mixture-of-Experts models, the Megatron and Archon backends support different parallelism strategies for the attention and FFN (expert) modules through a hybrid syntax:

megatron:(attn:<attn_dims>|ffn:<ffn_dims>)

This enables MoE Parallel Folding, which reduces the minimum GPU requirement for combined context and expert parallelism.

Constraints

  • Pipeline parallel size (p) must be identical for attn and ffn
  • World size must match between attn and ffn (if d is omitted in ffn, it is derived automatically)
  • Expert parallel (e) is only valid in the ffn section

Example

actor:
  backend: "megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)"
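
| Module | dp | pp | tp | cp | ep | World Size |
|--------|----|----|----|----|----|------------|
| attn | 4 | 2 | 2 | 2 | - | 32 |
| ffn | 2 | 2 | 4 | - | 2 | 32 |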
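The constraints can be checked mechanically, as in the sketch below. It is an illustration rather than AReaL's implementation: `check_hybrid` is a hypothetical helper, and the per-module world sizes (d × p × t × c for attn, d × p × t × e for ffn) are inferred from the example table above, not from a stated formula.

```python
import re
from math import prod


def parse_dims(dims_str: str) -> dict[str, int]:
    return {abbrev: int(size) for abbrev, size in re.findall(r"([a-z])(\d+)", dims_str)}


def check_hybrid(spec: str) -> int:
    """Validate '<backend>:(attn:<dims>|ffn:<dims>)' and return the shared world size."""
    match = re.fullmatch(r"(\w+):\(attn:(\w+)\|ffn:(\w+)\)", spec)
    if match is None:
        raise ValueError(f"not a hybrid backend string: {spec!r}")
    attn, ffn = parse_dims(match.group(2)), parse_dims(match.group(3))
    if "e" in attn:
        raise ValueError("expert parallel (e) is only valid in the ffn section")
    if attn.get("p", 1) != ffn.get("p", 1):
        raise ValueError("pipeline parallel size must be identical for attn and ffn")
    attn_world = prod(attn.get(key, 1) for key in "dptc")
    ffn_without_d = prod(ffn.get(key, 1) for key in "pte")
    if "d" not in ffn:
        # Derive the ffn data-parallel size so that both world sizes match.
        if attn_world % ffn_without_d:
            raise ValueError("cannot derive an integer d for the ffn section")
        ffn["d"] = attn_world // ffn_without_d
    if ffn["d"] * ffn_without_d != attn_world:
        raise ValueError("attn and ffn world sizes do not match")
    return attn_world


shared = check_hybrid("megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)")
print(shared)
```

Omitting `d` from the ffn section, as in `megatron:(attn:d4p2t2c2|ffn:p2t4e2)`, gives the same world size because the sketch derives `d` as 32 / (2 × 4 × 2) = 2.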

See Also