Chain of World: World Model Thinking in Latent Motion
March 4, 2026
Contents
- Overview: CoWVLA summary and framework diagram
- Environment Setup: install dependencies and notes
- Evaluation: reported numbers and how to run benchmarks
- Training Data: dataset composition used in the paper
- Data Preparation: processing/tokenization/metadata pipeline
- Training: required pretrained components and training scripts
- Acknowledgement: upstream projects
- Citation: BibTeX entry
Overview

CoWVLA is a Vision-Language-Action (VLA) framework that combines world-model temporal reasoning with disentangled latent motion modeling.
Instead of reconstructing every future frame or learning only pairwise latent actions, CoWVLA:
- uses a pretrained video VAE (VidTwin) to disentangle structure and motion latents;
- pre-trains a VLA decoder to infer a continuous latent motion chain from an instruction and the initial frame, while predicting the terminal frame;
- co-fine-tunes the model with sparse keyframes and FAST action tokens so that latent dynamics and action prediction are aligned in a single autoregressive decoder.
This design keeps the temporal reasoning benefits of world models, avoids redundant background reconstruction, and preserves the compactness and interpretability of latent motion.
Environment Setup
conda create -n cowvla python=3.10
conda activate cowvla
cd third_party/RoboVLMs
pip install -e .
pip install -r requirements.txt
Notes:
- requirements.txt currently contains environment-specific dependencies such as a local flash_attn wheel path. Replace them with packages that match your CUDA and PyTorch stack before installation.
- Most training scripts assume multi-GPU distributed training and contain hardcoded internal paths. Update paths and launcher variables before running them.
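One way to spot environment-specific entries before installing is to scan the requirements file for lines that look like local paths or wheel files. The sketch below is a heuristic only; the helper name `find_local_requirements` and its matching rules are assumptions, not part of the repository.

```python
from pathlib import Path


def find_local_requirements(lines):
    """Return requirement lines that look like local paths or wheel files.

    Heuristic only: flags entries that start with a path prefix, contain a
    file:// URL, or end in .whl. Comments and blank lines are skipped.
    """
    flagged = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        requirement = line.split("#")[0].strip()  # drop trailing comments
        looks_local = (
            requirement.startswith(("/", "./", "../", "~"))
            or "file://" in requirement
            or requirement.endswith(".whl")
        )
        if looks_local:
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    req = Path("requirements.txt")
    if req.exists():
        for entry in find_local_requirements(req.read_text().splitlines()):
            print("replace before installing:", entry)
```

Run it from the directory that holds requirements.txt; anything it prints is a candidate for replacement with a package built for your CUDA and PyTorch versions.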
Evaluation
| Benchmark | Metric | CoWVLA |
|---|---|---|
| LIBERO | Spatial / Object / Goal / Long / Avg. | 97.2 / 97.8 / 94.6 / 92.8 / 95.6 |
| SimplerEnv-WidowX | Stack Block / Put Carrot / Put Spoon / Put Eggplant / Avg. | 62.5 / 66.7 / 79.2 / 95.8 / 76.0 |
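The Avg. entries are consistent with the unweighted mean of the per-task scores, up to rounding. The following sketch checks that, and can be reused to compare your own runs against the table. The tolerance value is an assumption chosen to absorb rounding in the published per-task numbers.

```python
# Per-task success rates (%) and reported averages, copied from the table above.
REPORTED = {
    "LIBERO": ([97.2, 97.8, 94.6, 92.8], 95.6),
    "SimplerEnv-WidowX": ([62.5, 66.7, 79.2, 95.8], 76.0),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


def check_averages(reported, tolerance=0.1):
    """Map each benchmark to True if the mean of its per-task scores is
    within `tolerance` points of the reported average."""
    return {
        name: abs(mean(scores) - avg) <= tolerance
        for name, (scores, avg) in reported.items()
    }


if __name__ == "__main__":
    for name, ok in check_averages(REPORTED).items():
        print(f"{name}: average consistent with per-task scores = {ok}")
```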
Evaluation code is provided mainly through reference/RoboVLMs.
Benchmark setup
The setup scripts install dependencies for different benchmarks. Run the appropriate script based on your evaluation target:
- reference/RoboVLMs/scripts/setup_libero.sh
- reference/RoboVLMs/scripts/setup_simplerenv.sh
- reference/RoboVLMs/scripts/setup_simplerenv_vla.sh
- reference/RoboVLMs/scripts/setup_calvin.sh
LIBERO
The LIBERO evaluation workflow is designed around labtasker (https://github.com/luocfprime/labtasker) so that many checkpoints can be scheduled across multiple machines.
cd reference/RoboVLMs
python submit_libero.py
python run.py 0
python run.py 1
Before running:
- set the checkpoint path in reference/RoboVLMs/submit_libero.py
- install and configure labtasker
- make sure the benchmark dependencies are installed
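The checklist above can be partly automated with a small pre-flight check. This is a sketch, not part of the repository: the function name `preflight` is hypothetical, and it assumes the labtasker package is importable under the module name `labtasker`. It does not verify that labtasker is configured or that benchmark dependencies are complete.

```python
import importlib.util
from pathlib import Path


def preflight(checkpoint_path, required_modules=("labtasker",)):
    """Return a list of human-readable problems; an empty list means the
    basic checks passed."""
    problems = []
    if not Path(checkpoint_path).exists():
        problems.append(f"checkpoint not found: {checkpoint_path}")
    for module in required_modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module not importable: {module}")
    return problems


if __name__ == "__main__":
    # Placeholder path: use the checkpoint you set in submit_libero.py.
    for problem in preflight("/path/to/checkpoint"):
        print("fix before running:", problem)
```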
SimplerEnv-WidowX / BridgeV2
cd reference/RoboVLMs
bash scripts/eval_bridge.sh
Before running:
- set model_ckpt_paths in reference/RoboVLMs/scripts/eval_bridge.sh
- update GPU IDs and environment paths as needed
Training Data
The paper uses 236,543 robot videos in total, drawn from OXE-style datasets and simulator data, for latent motion extractor fine-tuning and VLA pre-training.
| Dataset | Count |
|---|---|
| Berkeley Autolab UR5 | 892 |
| BridgeV2 | 24,879 |
| CMU Play Fusion | 576 |
| Fractal | 65,530 |
| Kuka | 84,202 |
| Maniskill | 30,029 |
| Taco Play | 3,242 |
| Toto | 899 |
| Utaustin Mutex | 1,500 |
| Viola | 135 |
| Calvin | 22,966 |
| LIBERO | 1,693 |
| Total | 236,543 |
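The per-dataset counts add up to the reported total. The sketch below reproduces that sum and reports each dataset's share of the mix, which is handy when assembling a subset. The helper name `dataset_shares` is illustrative.

```python
# Video counts copied from the table above.
COUNTS = {
    "Berkeley Autolab UR5": 892,
    "BridgeV2": 24_879,
    "CMU Play Fusion": 576,
    "Fractal": 65_530,
    "Kuka": 84_202,
    "Maniskill": 30_029,
    "Taco Play": 3_242,
    "Toto": 899,
    "Utaustin Mutex": 1_500,
    "Viola": 135,
    "Calvin": 22_966,
    "LIBERO": 1_693,
}
REPORTED_TOTAL = 236_543


def dataset_shares(counts):
    """Return each dataset's fraction of the total video count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    assert sum(COUNTS.values()) == REPORTED_TOTAL
    shares = dataset_shares(COUNTS)
    for name in sorted(shares, key=shares.get, reverse=True):
        print(f"{name:22s} {COUNTS[name]:>7,d}  {shares[name]:6.1%}")
```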
Data Preparation
The repository follows a four-step data pipeline:
- Convert raw datasets into episode folders with images, actions, and instructions.
- Tokenize each image with the Emu3 vision tokenizer and save the discrete codes as .npy files.
- Build benchmark-specific .pkl metadata files.
- Merge multiple datasets into a world-model pretraining metadata file when needed.
1. Convert datasets to episode folders
Examples:
- Fractal / Google Robot style data: tools/process/simplerenv_google.py
- BridgeV2 / WidowX style data: tools/process/simplerenv_bridge.py
- CALVIN: tools/process/calvin_process.py or tools/process/calvin_process_parallel.py
- LIBERO: tools/process/libero_process.py
Example for Fractal:
python tools/process/simplerenv_google.py \
--dataset_dir /path/to/fractal20220817_data \
--output_dir /path/to/processed/oxembodiment/fractal
Example for BridgeV2:
python tools/process/simplerenv_bridge.py \
--dataset_dir /path/to/bridgev2 \
--output_dir /path/to/processed/oxembodiment/bridge
2. Extract Emu3 visual tokens
Download the tokenizer first:
- Emu3 VisionTokenizer: https://huggingface.co/BAAI/Emu3-VisionTokenizer
Then run the tokenizer script:
bash scripts/tokenizer/extract_vq_emu3.sh
Important:
scripts/tokenizer/extract_vq_emu3.sh is an example script and contains hardcoded dataset names, GPU IDs, and output locations. models/tokenizer/emu3_tokenizer.py also contains dataset-specific path configuration. Edit both files to match your local environment before launching tokenization.
3. Build pickle metadata
Examples:
python tools/pickle_gen/pickle_generation_simplerenv_google.py
python tools/pickle_gen/pickle_generation_simplerenv_bridge.py
The metadata stores fields such as:
- text: language instruction
- image or gripper_image: list of tokenized image code paths
- action: robot action sequence
- orig_img_list: original image paths for some datasets
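A quick way to inspect a generated metadata file is to load it and check that each record carries the fields listed above. The sketch assumes the pickle holds a list of dict records; the actual container layout is not documented here and may differ, so adapt `load_records` accordingly. Both function names are hypothetical. Only unpickle files you produced yourself, since `pickle.load` can execute arbitrary code.

```python
import pickle

# Fields named in the list above; orig_img_list is optional ("some datasets").
REQUIRED_KEYS = {"text", "action"}
IMAGE_KEYS = {"image", "gripper_image"}


def validate_record(record):
    """Return a list of problems found in one metadata record (a dict)."""
    present = set(record)
    problems = [f"missing field: {key}" for key in sorted(REQUIRED_KEYS - present)]
    if not IMAGE_KEYS & present:
        problems.append("missing field: image or gripper_image")
    return problems


def load_records(pkl_path):
    """Load a metadata pickle. Assumes it contains a list of dict records."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Placeholder path: point this at a file written by tools/pickle_gen.
    for index, record in enumerate(load_records("/path/to/metadata.pkl")):
        for problem in validate_record(record):
            print(f"record {index}: {problem}")
```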
4. Build world-model pretraining metadata
To merge multiple datasets for pre-training:
python tools/pickle_gen/world_model_pretrain.py
Or use the parallel version:
python tools/pickle_gen/world_model_pretrain_multi.py
These scripts also contain internal absolute paths and should be edited before use.
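The four steps can be chained for a single dataset. The sketch below only assembles and prints the commands for the Fractal example instead of executing them, because the scripts need local path edits first. The function name `build_fractal_pipeline` is hypothetical; the script paths and flags are the ones shown in this section.

```python
import shlex


def build_fractal_pipeline(dataset_dir, output_dir):
    """Return the four data-preparation commands, in order, as argv lists."""
    return [
        # 1. Convert the raw dataset into episode folders.
        ["python", "tools/process/simplerenv_google.py",
         "--dataset_dir", dataset_dir, "--output_dir", output_dir],
        # 2. Extract Emu3 visual tokens (edit the script's paths first).
        ["bash", "scripts/tokenizer/extract_vq_emu3.sh"],
        # 3. Build the benchmark-specific pickle metadata.
        ["python", "tools/pickle_gen/pickle_generation_simplerenv_google.py"],
        # 4. Merge datasets into the world-model pretraining metadata.
        ["python", "tools/pickle_gen/world_model_pretrain.py"],
    ]


if __name__ == "__main__":
    commands = build_fractal_pipeline(
        "/path/to/fractal20220817_data",
        "/path/to/processed/oxembodiment/fractal",
    )
    for step, argv in enumerate(commands, start=1):
        print(f"step {step}: {shlex.join(argv)}")
```

Once the hardcoded paths are updated, each argv list can be passed to `subprocess.run(argv, check=True)` so that a failing step stops the pipeline.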
Training
Required pretrained components
- Emu3 Stage1: https://huggingface.co/BAAI/Emu3-Stage1
- Emu3 VisionTokenizer: https://huggingface.co/BAAI/Emu3-VisionTokenizer
- VidTwin: https://huggingface.co/microsoft/vidtwin
FAST action tokenizers used by the released scripts are already included under:
- pretrain/fast
- pretrain/fast_bridge_t5_s50
Before launching training, update the following in the shell scripts:
- WORKDIR
- dataset and metadata paths
- model_name_or_path
- latent_action_model_path
- distributed environment variables such as GPU_NUM, NODE_NUM, RANK, MASTER_ADDR, and MASTER_PORT
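Before launching a multi-node job it can help to validate the distributed variables in one place. The sketch below reads them from the environment and derives a world size. It assumes GPU_NUM is the GPU count per node and RANK is the node rank, which is a common convention but is not confirmed by this README; the default values and the function name `read_launch_config` are also assumptions.

```python
import os


def read_launch_config(env=None):
    """Read distributed launch variables and derive the world size.

    Assumes GPU_NUM = GPUs per node and RANK = node rank (0-based).
    """
    env = os.environ if env is None else env
    gpu_num = int(env.get("GPU_NUM", "1"))
    node_num = int(env.get("NODE_NUM", "1"))
    rank = int(env.get("RANK", "0"))
    if gpu_num < 1 or node_num < 1:
        raise ValueError("GPU_NUM and NODE_NUM must be positive")
    if not 0 <= rank < node_num:
        raise ValueError(f"RANK={rank} is outside [0, {node_num})")
    return {
        "gpu_num": gpu_num,
        "node_num": node_num,
        "rank": rank,
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),  # assumed default
        "master_port": int(env.get("MASTER_PORT", "29500")),  # assumed default
        "world_size": gpu_num * node_num,
    }


if __name__ == "__main__":
    print(read_launch_config())
```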
1. World-model pre-training
bash scripts/pretrain/train_video_vidtwin.sh
Paper setting:
- global batch size 256
- 10k iterations
- instruction + initial frame -> latent motion + terminal frame supervision
2. Co-fine-tuning on LIBERO
bash scripts/simulator/libero/train_libero.sh
Paper setting:
- global batch size 128
- 8k iterations
- image size 200 x 200
- action chunk size 10
3. Co-fine-tuning on SimplerEnv-WidowX / BridgeV2
bash scripts/simulator/bridge/train_simplerenv_bridge.sh
Paper setting:
- global batch size 128
- 12k iterations
- image size 256 x 256
- action chunk size 5
The current example script uses a longer default training schedule. Change it if you want to reproduce the exact paper setting.
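When adapting the scripts to a different GPU count, the paper's global batch sizes can be kept by adjusting the per-device batch size and gradient accumulation. The sketch below collects the paper settings listed above and applies the usual relation global batch = per-device batch x world size x accumulation steps. Whether the released scripts expose these exact knobs is an assumption, and the names `PAPER_SETTINGS`, `per_device_batch_size`, and `samples_seen` are illustrative.

```python
# Paper settings copied from the three training stages above.
PAPER_SETTINGS = {
    "world_model_pretrain": {"global_batch_size": 256, "iterations": 10_000},
    "libero": {"global_batch_size": 128, "iterations": 8_000,
               "image_size": 200, "action_chunk_size": 10},
    "simplerenv_bridge": {"global_batch_size": 128, "iterations": 12_000,
                          "image_size": 256, "action_chunk_size": 5},
}


def per_device_batch_size(global_batch_size, world_size, grad_accum_steps=1):
    """Per-device batch size that reproduces the given global batch size."""
    denominator = world_size * grad_accum_steps
    if global_batch_size % denominator:
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by "
            f"world_size * grad_accum_steps = {denominator}"
        )
    return global_batch_size // denominator


def samples_seen(stage):
    """Total training samples processed at the paper setting for a stage."""
    setting = PAPER_SETTINGS[stage]
    return setting["global_batch_size"] * setting["iterations"]


if __name__ == "__main__":
    for stage, setting in PAPER_SETTINGS.items():
        per_gpu = per_device_batch_size(setting["global_batch_size"], world_size=8)
        print(f"{stage}: {per_gpu} per GPU on 8 GPUs, "
              f"{samples_seen(stage):,} samples in total")
```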
Acknowledgement
This project builds upon several excellent open-source efforts:
Citation
If this repository is useful for your research, please cite:
@inproceedings{yang2026cowvla,
title = {Chain of World: World Model Thinking in Latent Motion},
author = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}