Chain of World: World Model Thinking in Latent Motion

March 4, 2026

📄 Paper | 🌐 Project Page | 🤗 HF Weights

Overview

[Figure: Overview of the CoWVLA framework.]

CoWVLA is a Vision-Language-Action (VLA) framework that combines world-model temporal reasoning with disentangled latent motion modeling.

Instead of reconstructing every future frame or learning only pairwise latent actions, CoWVLA:

  • uses a pretrained video VAE (VidTwin) to disentangle structure and motion latents;
  • pre-trains a VLA decoder to infer a continuous latent motion chain from an instruction and the initial frame, while predicting the terminal frame;
  • co-fine-tunes the model with sparse keyframes and FAST action tokens so that latent dynamics and action prediction are aligned in a single autoregressive decoder.

This design keeps the temporal reasoning benefits of world models, avoids redundant background reconstruction, and preserves the compactness and interpretability of latent motion.

Environment Setup

conda create -n cowvla python=3.10
conda activate cowvla

cd third_party/RoboVLMs
pip install -e .

pip install -r requirements.txt

Notes:

  • requirements.txt currently contains environment-specific dependencies such as a local flash_attn wheel path. Replace them with packages that match your CUDA and PyTorch stack before installation.
  • Most training scripts assume multi-GPU distributed training and contain hardcoded internal paths. Update paths and launcher variables before running them.
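
One way to spot the environment-specific entries before installing is to scan requirements.txt for lines that point at local files. The sketch below is an illustration only and is not part of the repository: the helper name find_local_requirements and its heuristics (absolute or relative paths, file:// URLs, non-remote .whl files) are assumptions.

```python
from pathlib import Path


def find_local_requirements(text: str) -> list[str]:
    """Return requirement lines that look like local paths rather than package names."""
    flagged = []
    for raw in text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Handle both bare paths and "name @ file:///..." direct references.
        target = line.split("@", 1)[-1].strip()
        is_remote = target.startswith(("http://", "https://"))
        looks_local = target.startswith(("/", "./", "../", "file://"))
        if looks_local or (target.endswith(".whl") and not is_remote):
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    req = Path("requirements.txt")
    if req.exists():
        for entry in find_local_requirements(req.read_text()):
            print("replace before installing:", entry)
```

Each flagged line should be swapped for a package or wheel that matches your own CUDA and PyTorch versions.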

Evaluation

| Benchmark | Metric | CoWVLA |
|---|---|---|
| LIBERO | Spatial / Object / Goal / Long / Avg. | 97.2 / 97.8 / 94.6 / 92.8 / 95.6 |
| SimplerEnv-WidowX | Stack Block / Put Carrot / Put Spoon / Put Eggplant / Avg. | 62.5 / 66.7 / 79.2 / 95.8 / 76.0 |
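
The Avg. entries appear to be the unweighted mean of the four per-suite scores. That is an inference from the numbers, not something stated here. A small check, with a tolerance that allows for each score being rounded to one decimal:

```python
from statistics import mean

# Per-suite scores and reported averages, copied from the table above.
REPORTED = {
    "LIBERO": ([97.2, 97.8, 94.6, 92.8], 95.6),
    "SimplerEnv-WidowX": ([62.5, 66.7, 79.2, 95.8], 76.0),
}


def check_average(scores, reported, tol=0.06):
    """True if the unweighted mean matches the reported value up to rounding."""
    return abs(mean(scores) - reported) <= tol


for name, (scores, reported) in REPORTED.items():
    ok = check_average(scores, reported)
    print(f"{name}: mean={mean(scores):.2f} reported={reported} consistent={ok}")
```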

Evaluation code is provided mainly through reference/RoboVLMs.

Benchmark setup

The setup scripts install dependencies for different benchmarks. Run the appropriate script based on your evaluation target:

  • reference/RoboVLMs/scripts/setup_libero.sh
  • reference/RoboVLMs/scripts/setup_simplerenv.sh
  • reference/RoboVLMs/scripts/setup_simplerenv_vla.sh
  • reference/RoboVLMs/scripts/setup_calvin.sh

LIBERO

The LIBERO evaluation workflow is designed around labtasker (https://github.com/luocfprime/labtasker) so that many checkpoints can be scheduled across multiple machines.

cd reference/RoboVLMs
python submit_libero.py
python run.py 0
python run.py 1

Before running:

  • set the checkpoint path in reference/RoboVLMs/submit_libero.py
  • install and configure labtasker
  • make sure the benchmark dependencies are installed

SimplerEnv-WidowX / BridgeV2

cd reference/RoboVLMs
bash scripts/eval_bridge.sh

Before running:

  • set model_ckpt_paths in reference/RoboVLMs/scripts/eval_bridge.sh
  • update GPU IDs and environment paths as needed

Training Data

The paper uses 236,543 robot videos from OXE-style datasets plus simulator data for latent motion extractor fine-tuning and VLA pre-training.

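| Dataset | Count |
|---|---|
| Berkeley Autolab UR5 | 892 |
| BridgeV2 | 24,879 |
| CMU Play Fusion | 576 |
| Fractal | 65,530 |
| Kuka | 84,202 |
| Maniskill | 30,029 |
| Taco Play | 3,242 |
| Toto | 899 |
| Utaustin Mutex | 1,500 |
| Viola | 135 |
| Calvin | 22,966 |
| LIBERO | 1,693 |
| Total | 236,543 |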
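
The per-dataset counts above do sum to the stated total, and the mixture is heavily skewed: Kuka and Fractal together account for roughly 63% of the videos. A short script to reproduce the total and the per-dataset shares:

```python
# Video counts per dataset, copied from the table above.
DATASET_COUNTS = {
    "Berkeley Autolab UR5": 892,
    "BridgeV2": 24_879,
    "CMU Play Fusion": 576,
    "Fractal": 65_530,
    "Kuka": 84_202,
    "Maniskill": 30_029,
    "Taco Play": 3_242,
    "Toto": 899,
    "Utaustin Mutex": 1_500,
    "Viola": 135,
    "Calvin": 22_966,
    "LIBERO": 1_693,
}

total = sum(DATASET_COUNTS.values())
assert total == 236_543  # matches the "Total" row

# Print datasets from largest to smallest with their share of the mixture.
for name, count in sorted(DATASET_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{count:>8,}{count / total:>8.1%}")
```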

Data Preparation

The repository follows a four-step data pipeline:

  1. Convert raw datasets into episode folders with images, actions, and instructions.
  2. Tokenize each image with the Emu3 vision tokenizer and save the discrete codes as .npy files.
  3. Build benchmark-specific .pkl metadata files.
  4. Merge multiple datasets into a world-model pretraining metadata file when needed.

1. Convert datasets to episode folders

Examples:

  • Fractal / Google Robot style data: tools/process/simplerenv_google.py
  • BridgeV2 / WidowX style data: tools/process/simplerenv_bridge.py
  • CALVIN: tools/process/calvin_process.py or tools/process/calvin_process_parallel.py
  • LIBERO: tools/process/libero_process.py

Example for Fractal:

python tools/process/simplerenv_google.py \
  --dataset_dir /path/to/fractal20220817_data \
  --output_dir /path/to/processed/oxembodiment/fractal

Example for BridgeV2:

python tools/process/simplerenv_bridge.py \
  --dataset_dir /path/to/bridgev2 \
  --output_dir /path/to/processed/oxembodiment/bridge
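
If you convert several datasets, the two commands above can be driven from one Python script. This is a convenience sketch, not part of the repository: it uses only the --dataset_dir and --output_dir flags shown above, and the JOBS list, the helper names, and the dry_run switch are assumptions.

```python
import subprocess
import sys

# (conversion script, raw dataset dir, output dir); paths are the placeholders used above.
JOBS = [
    ("tools/process/simplerenv_google.py",
     "/path/to/fractal20220817_data",
     "/path/to/processed/oxembodiment/fractal"),
    ("tools/process/simplerenv_bridge.py",
     "/path/to/bridgev2",
     "/path/to/processed/oxembodiment/bridge"),
]


def build_command(script, dataset_dir, output_dir):
    """Build the argv list for one conversion job."""
    return [sys.executable, script,
            "--dataset_dir", dataset_dir,
            "--output_dir", output_dir]


def run_all(jobs, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for script, dataset_dir, output_dir in jobs:
        cmd = build_command(script, dataset_dir, output_dir)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing job


if __name__ == "__main__":
    run_all(JOBS, dry_run=True)
```

Run it from the repository root, and set dry_run=False once the printed commands look right.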

2. Extract Emu3 visual tokens

Download the Emu3 vision tokenizer first, then run the tokenizer script:

bash scripts/tokenizer/extract_vq_emu3.sh

Important:

  • scripts/tokenizer/extract_vq_emu3.sh is an example script and contains hardcoded dataset names, GPU IDs, and output locations.
  • models/tokenizer/emu3_tokenizer.py also contains dataset-specific path configuration. Edit both files to match your local environment before launching tokenization.

3. Build pickle metadata

Examples:

python tools/pickle_gen/pickle_generation_simplerenv_google.py
python tools/pickle_gen/pickle_generation_simplerenv_bridge.py

The metadata stores fields such as:

  • text: language instruction
  • image or gripper_image: list of tokenized image code paths
  • action: robot action sequence
  • orig_img_list: original image paths for some datasets
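
Before training, it can help to sanity-check a generated metadata file against the fields listed above. The sketch below assumes the .pkl file holds a list of dicts, one per episode; the actual container layout is not documented here, so adjust it to what the generation scripts produce. The function names are hypothetical.

```python
import pickle

REQUIRED = ("text", "action")
IMAGE_KEYS = ("image", "gripper_image")  # at least one should be present


def check_entry(entry: dict) -> list[str]:
    """Return the problems found in one metadata entry (empty list if none)."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in entry]
    present = [key for key in IMAGE_KEYS if key in entry]
    if not present:
        problems.append("missing field: image or gripper_image")
    for key in present:
        # Tokenized image codes are saved as .npy files in step 2.
        bad = [p for p in entry[key] if not str(p).endswith(".npy")]
        if bad:
            problems.append(f"{key}: {len(bad)} path(s) not ending in .npy")
    return problems


def check_file(path: str) -> dict[int, list[str]]:
    """Map entry index -> problems, for entries that have any."""
    with open(path, "rb") as f:
        entries = pickle.load(f)  # assumption: a list of dicts
    return {i: p for i, e in enumerate(entries) if (p := check_entry(e))}
```

Only load pickle files you generated yourself or otherwise trust, since unpickling can execute arbitrary code.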

4. Build world-model pretraining metadata

To merge multiple datasets for pre-training:

python tools/pickle_gen/world_model_pretrain.py

Or use the parallel version:

python tools/pickle_gen/world_model_pretrain_multi.py

These scripts also contain internal absolute paths and should be edited before use.
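
Conceptually, merging amounts to concatenating per-dataset metadata into one file. The sketch below shows only that simplest case, under the assumption that each input .pkl holds a list of entries. The repository scripts may do more (filtering or sampling, for example), so treat this as an illustration rather than a replacement for them. The name merge_metadata is hypothetical.

```python
import pickle
from pathlib import Path


def merge_metadata(inputs, output):
    """Concatenate list-style metadata pickles into one file; return the entry count."""
    merged = []
    for path in inputs:
        with open(path, "rb") as f:
            entries = pickle.load(f)
        if not isinstance(entries, list):
            raise TypeError(f"{path}: expected a list of entries, got {type(entries).__name__}")
        merged.extend(entries)
    # Create the output directory if it does not exist yet.
    Path(output).parent.mkdir(parents=True, exist_ok=True)
    with open(output, "wb") as f:
        pickle.dump(merged, f)
    return len(merged)
```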

Training

Required pretrained components

FAST action tokenizers used by the released scripts are already included under:

  • pretrain/fast
  • pretrain/fast_bridge_t5_s50

Before launching training, update the following in the shell scripts:

  • WORKDIR
  • dataset and metadata paths
  • model_name_or_path
  • latent_action_model_path
  • distributed environment variables such as GPU_NUM, NODE_NUM, RANK, MASTER_ADDR, and MASTER_PORT
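
A quick preflight check can catch unset launcher variables before a multi-node job starts. The sketch below assumes GPU_NUM means GPUs per node and RANK means the node rank, as in typical torchrun-style launch scripts; verify both against the shell scripts, since those meanings are not spelled out here. The function name is hypothetical.

```python
import os
from collections.abc import Mapping

REQUIRED_VARS = ("GPU_NUM", "NODE_NUM", "RANK", "MASTER_ADDR", "MASTER_PORT")


def check_launch_env(env: Mapping = os.environ) -> int:
    """Validate launcher variables and return the world size (GPU_NUM * NODE_NUM)."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"unset launcher variables: {', '.join(missing)}")
    gpus, nodes, rank = int(env["GPU_NUM"]), int(env["NODE_NUM"]), int(env["RANK"])
    if gpus < 1 or nodes < 1:
        raise ValueError("GPU_NUM and NODE_NUM must be positive")
    # Assumption: RANK is the node rank, so it must lie in [0, NODE_NUM).
    if not 0 <= rank < nodes:
        raise ValueError(f"RANK={rank} is outside [0, {nodes})")
    if not 0 < int(env["MASTER_PORT"]) < 65536:
        raise ValueError("MASTER_PORT must be a valid TCP port")
    return gpus * nodes
```

For example, GPU_NUM=8 and NODE_NUM=2 give a world size of 16.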

1. World-model pre-training

bash scripts/pretrain/train_video_vidtwin.sh

Paper setting:

  • global batch size 256
  • 10k iterations
  • instruction + initial frame -> latent motion + terminal frame supervision

2. Co-fine-tuning on LIBERO

bash scripts/simulator/libero/train_libero.sh

Paper setting:

  • global batch size 128
  • 8k iterations
  • image size 200 x 200
  • action chunk size 10
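
An action chunk size of 10 means the policy is trained to output 10 consecutive actions per prediction. How the training script slices episodes into chunks (stride, tail padding) is not described here. The sketch below shows one common convention as an assumption: one chunk per timestep, with the tail padded by repeating the final action.

```python
def action_chunks(actions, chunk_size):
    """Return one chunk per timestep: actions[t : t + chunk_size], tail-padded."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for t in range(len(actions)):
        chunk = list(actions[t:t + chunk_size])
        # Assumption: pad short chunks at the end of the episode
        # by repeating the last action.
        chunk += [actions[-1]] * (chunk_size - len(chunk))
        chunks.append(chunk)
    return chunks


# Toy episode of 12 scalar "actions"; real actions would be vectors.
episode = list(range(12))
chunks = action_chunks(episode, chunk_size=10)
print(len(chunks), chunks[0], chunks[-1])
```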

3. Co-fine-tuning on SimplerEnv-WidowX / BridgeV2

bash scripts/simulator/bridge/train_simplerenv_bridge.sh

Paper setting:

  • global batch size 128
  • 12k iterations
  • image size 256 x 256
  • action chunk size 5

The current example script uses a longer default training schedule than the 12k iterations listed above. Shorten it if you want to reproduce the exact paper setting.
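
When adapting the scripts to a different GPU count, the paper's global batch size has to be reached through per-device batch size and gradient accumulation. The sketch below collects the paper settings listed in this section and does that arithmetic. The dictionary keys and function names are illustrative, not script arguments.

```python
# Paper settings copied from the three training stages above.
PAPER_SETTINGS = {
    "pretrain": {"global_batch": 256, "iterations": 10_000},
    "libero": {"global_batch": 128, "iterations": 8_000},
    "simplerenv_bridge": {"global_batch": 128, "iterations": 12_000},
}


def samples_seen(cfg):
    """Total training samples processed: global batch size times iterations."""
    return cfg["global_batch"] * cfg["iterations"]


def grad_accum_steps(global_batch, world_size, per_device_batch):
    """Accumulation steps needed so that one optimizer step sees global_batch samples."""
    per_step = world_size * per_device_batch
    if global_batch % per_step:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"{world_size} GPUs x {per_device_batch} per device"
        )
    return global_batch // per_step


for stage, cfg in PAPER_SETTINGS.items():
    print(f"{stage}: {samples_seen(cfg):,} samples")
```

For instance, reaching a global batch of 256 on 8 GPUs with a per-device batch of 4 requires 8 accumulation steps.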

Acknowledgement

This project builds upon several excellent open-source efforts, including those used throughout this README: RoboVLMs, Emu3, VidTwin, FAST, and labtasker.

Citation

If this repository is useful for your research, please cite:

@inproceedings{yang2026cowvla,
  title     = {Chain of World: World Model Thinking in Latent Motion},
  author    = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}