Chain of World: World Model Thinking in Latent Motion

March 4, 2026

Overview: CoWVLA summary and framework diagram
Environment Setup: install dependencies and notes
Evaluation: reported numbers and how to run benchmarks
Training Data: dataset composition used in the paper
Data Preparation: processing/tokenization/metadata pipeline
Training: required pretrained components and training scripts
Acknowledgement: upstream projects
Citation: BibTeX entry

Overview

[Figure: Overview of the CoWVLA framework.]

CoWVLA is a Vision-Language-Action (VLA) framework that combines world-model temporal reasoning with disentangled latent motion modeling.

Instead of reconstructing every future frame or learning only pairwise latent actions, CoWVLA:

uses a pretrained video VAE (VidTwin) to disentangle structure and motion latents;
pre-trains a VLA decoder to infer a continuous latent motion chain from an instruction and the initial frame, while predicting the terminal frame;
co-fine-tunes the model with sparse keyframes and FAST action tokens so that latent dynamics and action prediction are aligned in a single autoregressive decoder.

This design keeps the temporal reasoning benefits of world models, avoids redundant background reconstruction, and preserves the compactness and interpretability of latent motion.
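To make the pre-training objective concrete, here is a minimal sketch of how the decoder sequence could be laid out. The segment names and the `Segment` class are illustrative assumptions, not identifiers from the repository; only the split between conditioning (instruction, initial frame) and supervised targets (latent motion, terminal frame) comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str         # what this span of the sequence represents
    supervised: bool  # True if the decoder is trained to predict this span


def pretrain_layout() -> list[Segment]:
    """Hypothetical layout: condition on instruction + first frame,
    then predict the latent motion chain and the terminal frame."""
    return [
        Segment("instruction", supervised=False),
        Segment("initial_frame", supervised=False),
        Segment("latent_motion", supervised=True),
        Segment("terminal_frame", supervised=True),
    ]


def supervised_names(layout: list[Segment]) -> list[str]:
    """Names of the spans that contribute to the training loss."""
    return [seg.name for seg in layout if seg.supervised]


if __name__ == "__main__":
    print(supervised_names(pretrain_layout()))
```

The co-fine-tuning stage additionally involves sparse keyframes and FAST action tokens; their ordering in the sequence is not specified here, so the sketch leaves them out.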

Environment Setup

conda create -n cowvla python=3.10
conda activate cowvla

cd third_party/RoboVLMs
pip install -e .

pip install -r requirements.txt

Notes:

requirements.txt currently contains environment-specific dependencies such as a local flash_attn wheel path. Replace them with packages that match your CUDA and PyTorch stack before installation.
Most training scripts assume multi-GPU distributed training and contain hardcoded internal paths. Update paths and launcher variables before running them.
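As a quick way to find the environment-specific entries mentioned above, the following sketch flags requirement lines that look like local paths. The patterns and the helper name `find_local_requirements` are assumptions for illustration, and the sample text is made up rather than copied from the real requirements.txt.

```python
import re

# Patterns that usually indicate a machine-specific requirement entry.
LOCAL_PATTERNS = (
    re.compile(r"@\s*file://"),        # direct reference to a local file
    re.compile(r"^(/|\./|\.\./|~)"),   # line starts with a filesystem path
    re.compile(r"^[A-Za-z]:\\"),       # Windows drive path
)


def find_local_requirements(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that look like local paths."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if any(p.search(stripped) for p in LOCAL_PATTERNS):
            hits.append((lineno, stripped))
    return hits


if __name__ == "__main__":
    # Made-up sample; replace with the contents of requirements.txt.
    sample = "numpy\nflash_attn @ file:///tmp/wheels/flash_attn.whl\n/opt/wheels/example.whl\n"
    for lineno, line in find_local_requirements(sample):
        print(f"line {lineno}: {line}")
```

Each flagged line should be replaced with a package that matches your own CUDA and PyTorch versions.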

Evaluation

Benchmark	Metric	CoWVLA
LIBERO	Spatial / Object / Goal / Long / Avg.	97.2 / 97.8 / 94.6 / 92.8 / 95.6
SimplerEnv-WidowX	Stack Block / Put Carrot / Put Spoon / Put Eggplant / Avg.	62.5 / 66.7 / 79.2 / 95.8 / 76.0
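The "Avg." columns are consistent with an unweighted mean over the listed tasks, which the short check below reproduces from the table values. The dictionary and function names are ours.

```python
# Per-suite success rates copied from the table above.
LIBERO = {"Spatial": 97.2, "Object": 97.8, "Goal": 94.6, "Long": 92.8}
WIDOWX = {
    "Stack Block": 62.5,
    "Put Carrot": 66.7,
    "Put Spoon": 79.2,
    "Put Eggplant": 95.8,
}


def unweighted_mean(scores: dict[str, float]) -> float:
    """Plain arithmetic mean over the task scores."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    print(f"LIBERO mean: {unweighted_mean(LIBERO):.2f}")
    print(f"WidowX mean: {unweighted_mean(WIDOWX):.2f}")
```

The WidowX mean works out to 76.05, which the table reports as 76.0.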

Evaluation code is provided mainly through reference/RoboVLMs.

Benchmark setup

The setup scripts install dependencies for different benchmarks. Run the appropriate script based on your evaluation target:

reference/RoboVLMs/scripts/setup_libero.sh
reference/RoboVLMs/scripts/setup_simplerenv.sh
reference/RoboVLMs/scripts/setup_simplerenv_vla.sh
reference/RoboVLMs/scripts/setup_calvin.sh

LIBERO

The LIBERO evaluation workflow is designed around labtasker (https://github.com/luocfprime/labtasker) so that many checkpoints can be scheduled across multiple machines.

cd reference/RoboVLMs
python submit_libero.py
python run.py 0
python run.py 1

Before running:

set the checkpoint path in reference/RoboVLMs/submit_libero.py
install and configure labtasker
make sure the benchmark dependencies are installed

SimplerEnv-WidowX / BridgeV2

cd reference/RoboVLMs
bash scripts/eval_bridge.sh

Before running:

set model_ckpt_paths in reference/RoboVLMs/scripts/eval_bridge.sh
update GPU IDs and environment paths as needed

Training Data

The paper uses 236,543 robot videos from OXE-style datasets plus simulator data for latent motion extractor fine-tuning and VLA pre-training.

Dataset	Count
Berkeley Autolab UR5	892
BridgeV2	24,879
CMU Play Fusion	576
Fractal	65,530
Kuka	84,202
ManiSkill	30,029
Taco Play	3,242
Toto	899
UT Austin Mutex	1,500
Viola	135
CALVIN	22,966
LIBERO	1,693
Total	236,543
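The per-dataset counts can be checked against the stated total, and the same table gives the share each dataset contributes. The variable names below are ours; the numbers are copied from the table.

```python
# Video counts per dataset, copied from the table above.
COUNTS = {
    "Berkeley Autolab UR5": 892,
    "BridgeV2": 24_879,
    "CMU Play Fusion": 576,
    "Fractal": 65_530,
    "Kuka": 84_202,
    "ManiSkill": 30_029,
    "Taco Play": 3_242,
    "Toto": 899,
    "UT Austin Mutex": 1_500,
    "Viola": 135,
    "CALVIN": 22_966,
    "LIBERO": 1_693,
}

TOTAL = sum(COUNTS.values())


def share(name: str) -> float:
    """Fraction of all videos contributed by one dataset."""
    return COUNTS[name] / TOTAL


if __name__ == "__main__":
    print(f"total videos: {TOTAL}")
    for name in sorted(COUNTS, key=COUNTS.get, reverse=True):
        print(f"{name:22s} {share(name):6.1%}")
```

Kuka and Fractal together account for roughly 63% of the videos, so the mixture is dominated by those two sources.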

Data Preparation

The repository follows a four-step data pipeline:

Convert raw datasets into episode folders with images, actions, and instructions.
Tokenize each image with the Emu3 vision tokenizer and save the discrete codes as .npy files.
Build benchmark-specific .pkl metadata files.
Merge multiple datasets into a world-model pretraining metadata file when needed.

1. Convert datasets to episode folders

Examples:

Fractal / Google Robot style data: tools/process/simplerenv_google.py
BridgeV2 / WidowX style data: tools/process/simplerenv_bridge.py
CALVIN: tools/process/calvin_process.py or tools/process/calvin_process_parallel.py
LIBERO: tools/process/libero_process.py

Example for Fractal:

python tools/process/simplerenv_google.py \
  --dataset_dir /path/to/fractal20220817_data \
  --output_dir /path/to/processed/oxembodiment/fractal

Example for BridgeV2:

python tools/process/simplerenv_bridge.py \
  --dataset_dir /path/to/bridgev2 \
  --output_dir /path/to/processed/oxembodiment/bridge
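When several datasets need converting, it can help to generate the commands from one table. The sketch below only builds and prints the command lines; it covers just the two scripts whose `--dataset_dir` and `--output_dir` flags are shown above, since the flags of the CALVIN and LIBERO scripts are not documented here. The `CONVERTERS` table and `build_command` helper are illustrative names.

```python
import shlex

# Converter scripts whose flags appear in the examples above.
CONVERTERS = {
    "fractal": "tools/process/simplerenv_google.py",
    "bridge": "tools/process/simplerenv_bridge.py",
}


def build_command(dataset: str, dataset_dir: str, output_dir: str) -> list[str]:
    """Build the argv list for one conversion job without running it."""
    return [
        "python",
        CONVERTERS[dataset],
        "--dataset_dir", dataset_dir,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    jobs = [
        ("fractal", "/path/to/fractal20220817_data",
         "/path/to/processed/oxembodiment/fractal"),
        ("bridge", "/path/to/bridgev2",
         "/path/to/processed/oxembodiment/bridge"),
    ]
    for job in jobs:
        # Print a shell-safe command line; pass the list to subprocess.run to execute.
        print(shlex.join(build_command(*job)))
```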

2. Extract Emu3 visual tokens

Download the tokenizer first:

Emu3 VisionTokenizer: https://huggingface.co/BAAI/Emu3-VisionTokenizer

Then run the tokenizer script:

bash scripts/tokenizer/extract_vq_emu3.sh

Important:

scripts/tokenizer/extract_vq_emu3.sh is an example script and contains hardcoded dataset names, GPU IDs, and output locations.
models/tokenizer/emu3_tokenizer.py also contains dataset-specific path configuration. Edit both files to match your local environment before launching tokenization.
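One way to locate the hardcoded locations before editing is to scan the files for absolute paths. This is a rough heuristic sketch, not a tool shipped with the repository: the regular expression and helper names are assumptions, and it will miss paths assembled from variables.

```python
import re
from pathlib import Path

# A "/" that starts a path (not part of a URL or a relative path),
# followed by at least one directory component.
ABS_PATH = re.compile(r"(?<![\w.:/])/(?:[\w.\-]+/)+[\w.\-]*")


def find_absolute_paths(text: str) -> list[tuple[int, str]]:
    """Return (line number, path) pairs for absolute paths found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#!"):
            continue  # ignore the shebang line
        for match in ABS_PATH.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def scan_files(paths: list[str]) -> dict[str, list[tuple[int, str]]]:
    """Apply find_absolute_paths to each existing file."""
    report = {}
    for name in paths:
        path = Path(name)
        if path.is_file():
            report[name] = find_absolute_paths(path.read_text(errors="replace"))
    return report


if __name__ == "__main__":
    targets = [
        "scripts/tokenizer/extract_vq_emu3.sh",
        "models/tokenizer/emu3_tokenizer.py",
    ]
    for name, hits in scan_files(targets).items():
        for lineno, found in hits:
            print(f"{name}:{lineno}: {found}")
```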

3. Build pickle metadata

Examples:

python tools/pickle_gen/pickle_generation_simplerenv_google.py
python tools/pickle_gen/pickle_generation_simplerenv_bridge.py

The metadata stores fields such as:

text: language instruction
image or gripper_image: list of tokenized image code paths
action: robot action sequence
orig_img_list: original image paths for some datasets
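A small inspection helper can confirm that a generated file has the expected fields. The sketch assumes the .pkl file holds a list of per-episode dictionaries with the field names listed above; that container layout is an assumption, so check it against the generation scripts. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.

```python
import pickle

IMAGE_KEYS = ("image", "gripper_image")


def load_metadata(path: str) -> list[dict]:
    """Load a metadata file (assumed to be a pickled list of dicts)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_record(record: dict) -> dict:
    """Report which fields one episode record carries and their lengths."""
    image_key = next((k for k in IMAGE_KEYS if k in record), None)
    return {
        "text": record.get("text"),
        "image_key": image_key,
        "num_frames": len(record[image_key]) if image_key else 0,
        "num_actions": len(record.get("action", [])),
        "has_orig_images": "orig_img_list" in record,
    }


if __name__ == "__main__":
    # Made-up record for illustration; real files come from tools/pickle_gen.
    example = {
        "text": "pick up the spoon",
        "image": ["ep0/000.npy", "ep0/001.npy"],
        "action": [[0.0] * 7, [0.1] * 7],
    }
    print(summarize_record(example))
```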

4. Build world-model pretraining metadata

To merge multiple datasets for pre-training:

python tools/pickle_gen/world_model_pretrain.py

Or use the parallel version:

python tools/pickle_gen/world_model_pretrain_multi.py

These scripts also contain internal absolute paths and should be edited before use.
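Conceptually, merging amounts to concatenating the per-dataset record lists into one file. The sketch below shows that idea under the assumption that every metadata file is a pickled list of dictionaries; it is not a reimplementation of world_model_pretrain.py, and the added "dataset" tag is our own convention.

```python
import pickle


def merge_metadata(sources: dict[str, list[dict]]) -> list[dict]:
    """Concatenate per-dataset records, tagging each with its source name."""
    merged = []
    for name, records in sources.items():
        for record in records:
            # Copy the record so the input lists stay unmodified.
            merged.append({**record, "dataset": name})
    return merged


def save_metadata(records: list[dict], path: str) -> None:
    """Write the merged list as a single pickle file."""
    with open(path, "wb") as f:
        pickle.dump(records, f)


if __name__ == "__main__":
    # Made-up records standing in for loaded per-dataset metadata.
    sources = {
        "fractal": [{"text": "open drawer", "image": ["f0.npy"]}],
        "bridge": [{"text": "put carrot on plate", "image": ["b0.npy"]}],
    }
    merged = merge_metadata(sources)
    print(len(merged), [r["dataset"] for r in merged])
```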

Training

Required pretrained components

Emu3 Stage1: https://huggingface.co/BAAI/Emu3-Stage1
Emu3 VisionTokenizer: https://huggingface.co/BAAI/Emu3-VisionTokenizer
VidTwin: https://huggingface.co/microsoft/vidtwin

FAST action tokenizers used by the released scripts are already included under:

pretrain/fast
pretrain/fast_bridge_t5_s50

Before launching training, update the following in the shell scripts:

WORKDIR
dataset and metadata paths
model_name_or_path
latent_action_model_path
distributed environment variables such as GPU_NUM, NODE_NUM, RANK, MASTER_ADDR, and MASTER_PORT
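A pre-flight check can catch missing launcher variables before a long job starts. The sketch assumes that GPU_NUM is the number of GPUs per node and NODE_NUM the number of nodes, so their product is the world size; the scripts may interpret these variables differently, so treat the check as illustrative.

```python
import os

REQUIRED = ("GPU_NUM", "NODE_NUM", "RANK", "MASTER_ADDR", "MASTER_PORT")


def check_launch_env(env: dict[str, str]) -> dict[str, int]:
    """Validate launcher variables and derive the world size."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise ValueError(f"missing variables: {', '.join(missing)}")

    gpu_num = int(env["GPU_NUM"])
    node_num = int(env["NODE_NUM"])
    port = int(env["MASTER_PORT"])
    if gpu_num < 1 or node_num < 1:
        raise ValueError("GPU_NUM and NODE_NUM must be positive")
    if not 1 <= port <= 65535:
        raise ValueError("MASTER_PORT must be a valid TCP port")

    # Assumption: GPU_NUM counts GPUs per node.
    return {"world_size": gpu_num * node_num, "master_port": port}


if __name__ == "__main__":
    try:
        print(check_launch_env(dict(os.environ)))
    except ValueError as err:
        print(f"launch environment not ready: {err}")
```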

1. World-model pre-training

bash scripts/pretrain/train_video_vidtwin.sh

Paper setting:

global batch size 256
10k iterations
instruction + initial frame -> latent motion + terminal frame supervision
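The global batch size is the product of per-device batch size, number of devices, and gradient accumulation steps. The helper below solves for the accumulation steps; the per-device batch size and device count in the demo are hypothetical values, not the paper's hardware setup.

```python
def grad_accum_steps(global_batch: int, per_device_batch: int, world_size: int) -> int:
    """Accumulation steps needed so that
    per_device_batch * world_size * steps == global_batch."""
    per_step = per_device_batch * world_size
    if per_step <= 0:
        raise ValueError("per_device_batch and world_size must be positive")
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of {per_step}"
        )
    return global_batch // per_step


if __name__ == "__main__":
    # Hypothetical hardware: 16 GPUs with 4 samples each per forward pass.
    steps = grad_accum_steps(global_batch=256, per_device_batch=4, world_size=16)
    print(f"gradient accumulation steps: {steps}")
```

If you train on fewer GPUs than the scripts assume, adjusting the accumulation steps this way keeps the global batch size at the paper setting.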

2. Co-fine-tuning on LIBERO

bash scripts/simulator/libero/train_libero.sh

Paper setting:

global batch size 128
8k iterations
image size 200 x 200
action chunk size 10

3. Co-fine-tuning on SimplerEnv-WidowX / BridgeV2

bash scripts/simulator/bridge/train_simplerenv_bridge.sh

Paper setting:

global batch size 128
12k iterations
image size 256 x 256
action chunk size 5

The current example script defaults to a longer training schedule than the paper. Reduce it to 12k iterations if you want to reproduce the exact paper setting.
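To compare a script's configuration with the paper settings listed in this section, the values can be kept in one lookup table. The dictionary keys and the `diff_from_paper` helper are our own naming, and the 20,000-iteration value in the demo is a hypothetical stand-in for whatever the script currently uses.

```python
# Paper settings copied from the lists in this section.
PAPER_SETTINGS = {
    "pretrain": {"global_batch_size": 256, "iterations": 10_000},
    "libero": {
        "global_batch_size": 128,
        "iterations": 8_000,
        "image_size": 200,
        "action_chunk": 10,
    },
    "bridge": {
        "global_batch_size": 128,
        "iterations": 12_000,
        "image_size": 256,
        "action_chunk": 5,
    },
}


def diff_from_paper(stage: str, configured: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {key: (configured, paper)} for every value that differs."""
    expected = PAPER_SETTINGS[stage]
    return {
        key: (configured[key], value)
        for key, value in expected.items()
        if key in configured and configured[key] != value
    }


def samples_seen(stage: str) -> int:
    """Training samples processed: global batch size times iterations."""
    cfg = PAPER_SETTINGS[stage]
    return cfg["global_batch_size"] * cfg["iterations"]


if __name__ == "__main__":
    # Hypothetical script values; read the real ones from the shell script.
    print(diff_from_paper("bridge", {"iterations": 20_000, "global_batch_size": 128}))
    print(samples_seen("pretrain"))
```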

Acknowledgement

This project builds upon several excellent open-source efforts:

Citation

If this repository is useful for your research, please cite:

@inproceedings{yang2026cowvla,
  title     = {Chain of World: World Model Thinking in Latent Motion},
  author    = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}