online_training.md

April 22, 2026

🏃‍♂️ Online Training

To start training, run the embodiment script with your configuration file:

bash policy_and_value/policy_online/examples/embodiment/run_embodiment.sh YOUR_CONFIG_NAME

# Example:
bash policy_and_value/policy_online/examples/embodiment/run_embodiment.sh rl_release

Note: The first run may take a while (~10 minutes), because loading the dynamics and reward models and building the torch compile acceleration graph both add startup latency. For debugging, you can set actor.model.openpi.use_torch_compile = False.

🖥️ Cluster Configuration

You can flexibly configure the GPU allocation for the env, rollout, and actor components in your YAML config. Here are three common deployment strategies:

  • Partial Sharing (Default): Components share some GPUs while keeping others dedicated.
cluster:
  num_nodes: 1
  component_placement:
    env: 0-3
    rollout: 4-7
    actor: 0-7

  • Complete Sharing: All components share all available GPUs.
cluster:
  num_nodes: 1
  component_placement:
    env,rollout,actor: all

  • Complete Separation: Each component runs on its own GPUs, so components do not interfere with each other and the offload functionality is not needed.
cluster:
  num_nodes: 1
  component_placement:
    env: 0-1
    rollout: 2-5
    actor: 6-7
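
To make the three strategies concrete, the following minimal Python sketch expands component_placement values into GPU index lists and reports which GPUs are shared. The helper names (parse_gpu_spec, expand_placement, shared_gpus) are hypothetical and not part of the training code; the sketch also assumes that ranges such as 0-3 are inclusive and that all means every GPU in the cluster.

```python
def parse_gpu_spec(spec, total_gpus):
    """Expand a placement value such as "0-3", "5", or "all" into GPU indices."""
    spec = str(spec).strip()
    if spec == "all":
        return list(range(total_gpus))
    if "-" in spec:
        start, end = (int(part) for part in spec.split("-"))
        return list(range(start, end + 1))  # assumption: ranges are inclusive
    return [int(spec)]


def expand_placement(component_placement, total_gpus):
    """Map each component to its GPUs; a key like "env,rollout,actor" is split."""
    placement = {}
    for key, spec in component_placement.items():
        for name in key.split(","):
            placement[name.strip()] = parse_gpu_spec(spec, total_gpus)
    return placement


def shared_gpus(placement):
    """Return the sorted GPU indices that are used by more than one component."""
    seen, shared = set(), set()
    for gpus in placement.values():
        shared |= seen & set(gpus)
        seen |= set(gpus)
    return sorted(shared)


partial = expand_placement({"env": "0-3", "rollout": "4-7", "actor": "0-7"}, total_gpus=8)
separate = expand_placement({"env": "0-1", "rollout": "2-5", "actor": "6-7"}, total_gpus=8)
complete = expand_placement({"env,rollout,actor": "all"}, total_gpus=8)

print(shared_gpus(partial))   # GPUs the actor shares with env or rollout
print(shared_gpus(separate))  # empty list: no GPU is shared
```

A check like this can be run before launching a job to confirm that a placement matches the intended strategy.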

🌐 Multi-Node Training

For N-node training, set cluster.num_nodes to N and assign component_placement accordingly. For example, if N=2 and each node has 8 GPUs, the placement indices range from 0 to 15.
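
The index arithmetic can be sketched in Python as follows. This assumes that global placement indices are numbered node by node (node 0 first, then node 1, and so on), which the text above does not state explicitly; locate_gpu and max_placement_index are hypothetical helpers used only for illustration.

```python
def max_placement_index(num_nodes, gpus_per_node):
    """Largest valid placement index for a cluster of identical nodes."""
    return num_nodes * gpus_per_node - 1


def locate_gpu(global_index, gpus_per_node):
    """Split a global placement index into (node, local GPU on that node).

    Assumption: indices are assigned contiguously, one node after another.
    """
    return divmod(global_index, gpus_per_node)


print(max_placement_index(num_nodes=2, gpus_per_node=8))  # 15
print(locate_gpu(9, gpus_per_node=8))                     # (1, 1)
```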

Run the multi-task unified training command:

bash policy_and_value/policy_online/examples/embodiment/run_embodiment_ray_unified_multi_task.sh YOUR_CONFIG_NAME

🔄 Resuming from Checkpoint

To resume training, modify runner.resume_dir in your config to point to your target checkpoint:

runner:
  resume_dir: logs/20251221-00:15:14/${runner.logger.experiment_name}/checkpoints/global_step_13000
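
If a run has saved several checkpoints, a small helper can pick the most recent one to use as runner.resume_dir. The sketch below is hypothetical and only assumes that checkpoint directories are named global_step_<number>, as in the example path above; it compares step numbers numerically, so global_step_13000 ranks above global_step_9000.

```python
import re
from pathlib import Path

# Directory names such as "global_step_13000"
STEP_DIR = re.compile(r"^global_step_(\d+)$")


def latest_checkpoint(checkpoints_dir):
    """Return the global_step_* subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(checkpoints_dir).iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

For example, calling latest_checkpoint on the checkpoints directory of an experiment returns the path to copy into runner.resume_dir.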

⚙️ Configuration Parameters

| Parameter | Description |
| --- | --- |
| algorithm.num_group_envs | Number of parallel environments for rollout. For example, if set to 32 with 4 GPUs for rollout, each GPU handles 8 envs. |
| algorithm.rollout_epoch | Number of epochs for rollouts. |
| algorithm.policy_config_name | Task-specific configuration. Must strictly match your offline (IL) training setting. |
| rollout.model_dir | Path to your pretrained IL model, used for initialization. |
| actor.micro_batch_size | Micro-batch size per GPU. |
| actor.global_batch_size | Global batch size across all GPUs. |
| model.action_dim | Expected action dimension of the VLA model output. |
| rollout_ema_decay | EMA decay, i.e., the weight that preserves the existing rollout model weights at each rollout model update. |
| dynamics_model_config | Task-specific configuration for the dynamics model. |
| dynamics_model_image_root | (Optional) Custom path for dynamics model images. |
| dynamics_model_output_path | (Optional) Custom path for dynamics model outputs. |
| reward_model_config | Task-specific configuration for the reward model. |
| reward_model_ckpt | Checkpoint path for the reward model. |
| visualize_wm_pred | Set to True to visualize your world model predictions. If True, chunk_reward must also be True. |
| chunk_reward | Set to True to use only the reward of the last predicted frame as the reward for the current action chunk. |
| advantage_scale | Weighting coefficient applied to the computed advantage. |
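
Two of the constraints in the table can be checked before launching a run. The following Python sketch is illustrative only: envs_per_gpu and check_reward_flags are hypothetical helpers, and the requirement that algorithm.num_group_envs divide evenly across the rollout GPUs is an assumption inferred from the 32-envs-on-4-GPUs example rather than a documented rule.

```python
def envs_per_gpu(num_group_envs, num_rollout_gpus):
    """Number of environments each rollout GPU handles.

    Assumption: environments are split evenly across the rollout GPUs.
    """
    if num_group_envs % num_rollout_gpus != 0:
        raise ValueError(
            f"{num_group_envs} envs cannot be split evenly over {num_rollout_gpus} GPUs"
        )
    return num_group_envs // num_rollout_gpus


def check_reward_flags(visualize_wm_pred, chunk_reward):
    """Enforce the documented rule: visualize_wm_pred=True needs chunk_reward=True."""
    if visualize_wm_pred and not chunk_reward:
        raise ValueError("visualize_wm_pred=True requires chunk_reward=True")


print(envs_per_gpu(32, 4))  # 8
check_reward_flags(visualize_wm_pred=True, chunk_reward=True)  # passes silently
```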

Note: For other configurations not listed here, we adopt most settings from RLinf. Please refer to the RLinf Documentation for more details.
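
The rollout_ema_decay and chunk_reward parameters describe small computations that are easy to show in Python. This is a minimal sketch under stated assumptions, not the project's implementation: it assumes the EMA update has the usual form new = decay * old + (1 - decay) * incoming, that the incoming weights come from the actor, and that weights can be represented as a plain dict of floats. The behavior when chunk_reward is False is not described in this document, so only the True case is sketched.

```python
def ema_update(rollout_weights, actor_weights, decay):
    """Blend weights: `decay` preserves the rollout value, the rest comes from the actor.

    Assumption: standard EMA form, new = decay * old + (1 - decay) * incoming.
    """
    return {
        name: decay * rollout_weights[name] + (1.0 - decay) * actor_weights[name]
        for name in rollout_weights
    }


def chunk_reward_from_frames(frame_rewards):
    """chunk_reward=True: the action chunk's reward is the last predicted frame's reward."""
    if not frame_rewards:
        raise ValueError("frame_rewards must contain at least one value")
    return frame_rewards[-1]


updated = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
print(updated["w"])                               # close to 0.9
print(chunk_reward_from_frames([0.1, 0.4, 0.7]))  # 0.7
```

With decay close to 1, the rollout model changes slowly; with decay equal to 0, it would simply copy the incoming weights.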


📦 Deployment

Once you have trained your own VLA model, you need to convert the Distributed Checkpoint (.dcp) to a PyTorch state dict (.pt) before deployment.

Run the converter script:

python toolkits/ckpt_convertor/convert_dcp_to_state_dict.py \
    --dcp_path <YOUR_DCP_CKPT_DIR> \
    --output_path <YOUR_EXPECTED_PT_CKPT_DIR>

After conversion, you can use the generated .pt checkpoints on your deployment machine to infer actions.
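
As a rough illustration of the last step, the sketch below lists the .pt files under the converter's output directory and loads one of them on CPU. The layout of the output directory and the file names inside it are not specified here, so find_pt_files simply searches recursively; both helper names are hypothetical. torch is imported inside load_state_dict so that the file listing also works on machines without PyTorch.

```python
from pathlib import Path


def find_pt_files(ckpt_dir):
    """Return all .pt files under ckpt_dir (searched recursively), sorted by path."""
    return sorted(Path(ckpt_dir).rglob("*.pt"))


def load_state_dict(pt_path):
    """Load a converted checkpoint onto the CPU and return the loaded object."""
    import torch  # lazy import: only needed when a checkpoint is actually loaded

    return torch.load(pt_path, map_location="cpu")
```

On the deployment machine, the loaded state dict would then be passed to the model's load_state_dict method; constructing the model itself depends on your policy configuration and is not shown.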