online_training.md

April 22, 2026

🏃‍♂️ Online Training

To start training, run the embodiment script with your configuration file:

bash policy_and_value/policy_online/examples/embodiment/run_embodiment.sh YOUR_CONFIG_NAME

# Example:
bash policy_and_value/policy_online/examples/embodiment/run_embodiment.sh rl_release

Note: The first run may take a while (about 10 minutes), because loading the dynamics and reward models and building the torch compile acceleration graph both add startup latency. For debugging, you can set actor.model.openpi.use_torch_compile = False.

🖥️ Cluster Configuration

You can flexibly configure the GPU allocation for the env, rollout, and actor components in your YAML config. Here are three common deployment strategies:

Partial Sharing (Default): Components share some GPUs while keeping others dedicated.

cluster:
  num_nodes: 1
  component_placement:
    env: 0-3
    rollout: 4-7
    actor: 0-7

Complete Sharing: All components share all available GPUs.

cluster:
  num_nodes: 1
  component_placement:
    env,rollout,actor: all

Complete Separation: Each component uses its own GPUs without interference, eliminating the need for offload functionality.

cluster:
  num_nodes: 1
  component_placement:
    env: 0-1
    rollout: 2-5
    actor: 6-7
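
The three strategies differ only in which GPU indices overlap between components. As a minimal sketch (not RLinf's actual parser), the following Python expands the placement values shown above and reports which component pairs share GPUs. The helper names `parse_gpu_spec`, `expand_placement`, and `shared_gpus` are hypothetical, and the sketch assumes only the value forms that appear in this document (`A-B` ranges, a single index, and `all`).

```python
def parse_gpu_spec(spec, total_gpus):
    """Expand "A-B", a single index, or "all" into a set of GPU indices."""
    spec = str(spec).strip()
    if spec == "all":
        return set(range(total_gpus))
    if "-" in spec:
        start, end = spec.split("-")
        return set(range(int(start), int(end) + 1))  # ranges are inclusive
    return {int(spec)}


def expand_placement(component_placement, total_gpus):
    """Map each component to its GPU set; a key like "env,rollout,actor" is split."""
    result = {}
    for names, spec in component_placement.items():
        for name in names.split(","):
            result[name.strip()] = parse_gpu_spec(spec, total_gpus)
    return result


def shared_gpus(placement):
    """Return the GPUs used by more than one component, keyed by component pair."""
    names = sorted(placement)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            common = placement[first] & placement[second]
            if common:
                overlaps[(first, second)] = sorted(common)
    return overlaps


# The "Partial Sharing" example from above, on one 8-GPU node.
partial = expand_placement({"env": "0-3", "rollout": "4-7", "actor": "0-7"}, total_gpus=8)
print(shared_gpus(partial))
```

For the partial-sharing layout this prints `{('actor', 'env'): [0, 1, 2, 3], ('actor', 'rollout'): [4, 5, 6, 7]}`. The complete-separation layout yields an empty dict, which matches the statement that no offloading is needed there.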

🌐 Multi-Node Training

For N-node training, set cluster.num_nodes to N and assign component_placement accordingly. For example, if N=2 and each node has 8 GPUs, the placement indices range from 0 to 15.
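
To make the index range concrete, here is a small Python sketch that maps a global placement index to a node and a local GPU. It assumes indices are numbered node by node (node 0 holds 0-7, node 1 holds 8-15 in the example above); that ordering is an assumption of this sketch, not something stated here, and `locate_gpu` is a hypothetical helper.

```python
def total_gpus(num_nodes, gpus_per_node):
    """Number of placement indices available: 0 .. total_gpus - 1."""
    return num_nodes * gpus_per_node


def locate_gpu(global_index, num_nodes, gpus_per_node):
    """Return (node, local_gpu) for a global index, assuming node-major numbering."""
    limit = total_gpus(num_nodes, gpus_per_node)
    if not 0 <= global_index < limit:
        raise ValueError(f"index {global_index} is outside 0..{limit - 1}")
    return divmod(global_index, gpus_per_node)


# Two nodes with 8 GPUs each, as in the example above.
print(total_gpus(2, 8))        # 16 indices, i.e. 0 to 15
print(locate_gpu(11, 2, 8))    # (1, 3): node 1, local GPU 3 under this assumption
```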

Run the multi-task unified training command:

bash policy_and_value/policy_online/examples/embodiment/run_embodiment_ray_unified_multi_task.sh YOUR_CONFIG_NAME

🔄 Resuming from Checkpoint

To resume training, modify runner.resume_dir in your config to point to your target checkpoint:

runner:
  resume_dir: logs/20251221-00:15:14/${runner.logger.experiment_name}/checkpoints/global_step_13000
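
If a run has saved several checkpoints, you may want the most recent one. The sketch below is a hypothetical helper, not part of the repository: it assumes checkpoint directories follow the `global_step_<N>` naming seen in the path above and picks the highest step numerically (a plain string sort would rank `global_step_500` above `global_step_13000`).

```python
import re
import tempfile
from pathlib import Path

STEP_PATTERN = re.compile(r"global_step_(\d+)$")


def latest_checkpoint(checkpoint_dir):
    """Return the global_step_<N> subdirectory with the largest N, or None."""
    candidates = []
    for path in Path(checkpoint_dir).iterdir():
        match = STEP_PATTERN.match(path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Self-contained demo using a temporary directory in place of a real run.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 2000, 13000):
        (Path(tmp) / f"global_step_{step}").mkdir()
    print(latest_checkpoint(tmp).name)  # global_step_13000
```

The returned path is what you would then write into runner.resume_dir.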

⚙️ Configuration Parameters

| Parameter | Description |
| --- | --- |
| `algorithm.num_group_envs` | Number of parallel environments for rollout. For example, if set to 32 with 4 GPUs for rollout, each GPU handles 8 envs. |
| `algorithm.rollout_epoch` | Number of epochs for rollouts. |
| `algorithm.policy_config_name` | Task-specific configuration. Must strictly align with your offline (IL) training setting. |
| `rollout.model_dir` | Path to your pretrained IL model for initialization. |
| `actor.micro_batch_size` | Micro-batch size per GPU. |
| `actor.global_batch_size` | Global batch size across all GPUs. |
| `model.action_dim` | Expected output action dimension for VLA models. |
| `rollout_ema_decay` | EMA decay, i.e., the weight kept on the previous rollout model weights at each rollout model update. |
| `dynamics_model_config` | Task-specific configuration for the dynamics model. |
| `dynamics_model_image_root` | (Optional) Custom path for dynamics model images. |
| `dynamics_model_output_path` | (Optional) Custom path for dynamics model outputs. |
| `reward_model_config` | Task-specific configuration for the reward model. |
| `reward_model_ckpt` | Checkpoint path for the reward model. |
| `visualize_wm_pred` | Set to `True` to visualize your world model predictions. If `True`, `chunk_reward` should also be `True`. |
| `chunk_reward` | Set to `True` to use only the reward of the last predicted frame as the reward for the current action chunk. |
| `advantage_scale` | Scaling coefficient applied to the computed advantage. |
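
A few of these parameters are easier to read as arithmetic. The Python sketch below illustrates three of them with hypothetical helper functions. The even-split requirement and the exact EMA form (`decay * old + (1 - decay) * new`) are assumptions of this sketch; the actual implementation may differ.

```python
def envs_per_gpu(num_group_envs, num_rollout_gpus):
    """algorithm.num_group_envs split across rollout GPUs (assumes an even split)."""
    if num_group_envs % num_rollout_gpus != 0:
        raise ValueError("num_group_envs is not divisible by the rollout GPU count")
    return num_group_envs // num_rollout_gpus


def ema_update(rollout_weights, actor_weights, decay):
    """rollout_ema_decay, assumed form: keep `decay` of the old rollout weights."""
    return [decay * old + (1.0 - decay) * new
            for old, new in zip(rollout_weights, actor_weights)]


def chunk_reward_from_frames(frame_rewards):
    """chunk_reward=True: use only the last predicted frame's reward for the chunk."""
    return frame_rewards[-1]


print(envs_per_gpu(32, 4))                          # 8, as in the table's example
print(ema_update([1.0, 2.0], [0.0, 4.0], 0.5))      # [0.5, 3.0]
print(chunk_reward_from_frames([0.1, 0.4, 0.9]))    # 0.9
```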

Note: Settings not listed here mostly follow RLinf. Please refer to the RLinf Documentation for more details.

📦 Deployment

Once you have trained your own VLA model, you need to convert the Distributed Checkpoint (.dcp) to a PyTorch state dict (.pt) before deployment.

Run the converter script:

python toolkits/ckpt_convertor/convert_dcp_to_state_dict.py \
    --dcp_path <YOUR_DCP_CKPT_DIR> \
    --output_path <YOUR_EXPECTED_PT_CKPT_DIR>

After conversion, you can use the generated .pt checkpoints on your deployment machine to infer actions.
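
As a rough sketch of the loading step on the deployment machine, the Python below lists the `.pt` files under the converter's output directory and loads one onto the CPU with `torch.load`. The helper names are hypothetical, the file names inside the output directory are not specified here, and building the model that consumes the state dict is left to your deployment code.

```python
from pathlib import Path


def find_pt_files(output_dir):
    """Return all .pt files under the converter's output directory, sorted by path."""
    return sorted(Path(output_dir).rglob("*.pt"))


def load_state_dict(pt_path):
    """Load a converted checkpoint onto the CPU; requires PyTorch to be installed."""
    import torch  # imported lazily so listing files works without PyTorch

    return torch.load(pt_path, map_location="cpu")
```

Typical use would be `state_dict = load_state_dict(find_pt_files("<YOUR_EXPECTED_PT_CKPT_DIR>")[0])`, followed by `model.load_state_dict(state_dict)` on a model built with the same configuration as in training.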