World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
April 19, 2026 · View on GitHub
This repository contains the implementation of World-Value-Action Model(WAV), a world-model-based framework. The central idea is to unify instruction-conditioned video prediction, trajectory value estimation, action decoding, and latent trajectory planning within a single multi-view diffusion transformer for long-horizon robotic manipulation.
Instead of planning directly in action space, WAV performs iterative inference in a compact latent trajectory space. This design biases sampling toward feasible futures, allows trajectory-level evaluation before action execution, and improves long-horizon decision making in both simulated and real-world settings.
Highlights
- A unified multi-view transformer backbone with video, value, and action experts.
- A three-stage training recipe: task-specific video adaptation, trajectory value learning, and action post-training.
- Latent trajectory planning at inference time through iterative elite reweighting in latent space.
- Support for LIBERO closed-loop evaluation, open-loop validation, and real-world deployment.
TODO
- Release inference & training code
- Release model weights
Method Overview
Motivation
Direct action prediction is often insufficient for long-horizon embodied tasks because it provides limited trajectory-level reasoning. WAV addresses this by first imagining future visual trajectories, then evaluating their long-horizon quality, and finally decoding executable robot actions from optimized trajectory features.
The current paper motivates this design from a model-based planning perspective:
- direct action-space planning suffers from vanishing feasible mass as the horizon grows;
- latent planning reweights probability mass toward feasible trajectories;
- iterative latent inference is necessary to concentrate samples on high-value futures.
Architecture
WAV decomposes planning and control into three tightly coupled modules:
-
Instruction-conditioned video generation. A multi-view diffusion transformer predicts future visual trajectories conditioned on history frames and language instructions.
-
Trajectory value estimation. A value expert evaluates candidate futures and provides the trajectory-level signal used for latent planning.
-
Action decoding. An action expert predicts executable action chunks from optimized video and value features, optionally conditioned on robot state.
Main Results
LIBERO Benchmark
Real-World Evaluation
Getting Started
Setup
git clone https://github.com/Win-commit/WAV.git
cd WAV
conda create -n wav python=3.10.4
conda activate wav
pip install -r requirements.txt
Required Checkpoints
Please download the following pretrained weights before training or inference:
-
LTX_video_part -
GE_base
After downloading the checkpoints, please update the corresponding paths in your config file:
pretrained_model_name_or_path: PATH/TO/LTX_video_part
diffusion_model:
model_path: PATH/TO/GE_base_fast.safetensors
Dataset Format
This codebase uses a LeRobot-like layout. A typical dataset is organized as:
ROOT_PATH/
└── DATASETNAME/
├── data/
│ └── chunk-000/
│ ├── episode_000000.parquet
│ └── ...
├── meta/
│ ├── episodes.jsonl
│ ├── tasks.jsonl
│ ├── info.json
│
└── videos/
└── chunk-000/
├── CAMERA_A/
│ ├── episode_000000.mp4
│ └── ...
└── CAMERA_B/
├── episode_000000.mp4
└── ...
Action / State Statistics
We provide scripts/get_statistics.py to compute normalization statistics:
python scripts/get_statistics.py \
--data_root PATH/TO/YOUR/DATASET/data/ \
--data_name DATASETNAME \
--data_type eef \
--action_key actions \
--state_key state \
--value_key state_value \
--save_path PATH/TO/YOUR/DATASET/meta/stats.jsonl
After running the script, you can get a jsonl file of statistics. You should specific the path of json file in configs
data:
train:
...
stat_file: PATH/OF/FILE.jsonl
val:
...
stat_file: PATH/OF/FILE.jsonl
Training
1. Video Adaptation
For the unseen robots or customized new tasks, we recommend performing this step of video adaptation to achieve better performance.
i. Modify the config in configs/ltx_model/*/video_model.yaml. More details of dataset can be found in data/*_dataset.py:
data:
train / val:
data_roots: [ROOT_PATH_TO_YOUR_DATASETS, ]
domains: [DATASETNAME, ]
# rewrite to the camera names used in your dataset
valid_cam: ["observation.images.top_head", "observation.images.hand_left", "observation.images.hand_right"]
...
ii. Disable value-model and action-model as bellow in configs/ltx_model/*/video_model.yaml:
return_video: True
return_value:False
return_action: False
train_mode: 'video_only'
diffusion_model:
config:
value_expert: False
action_expert: False
iii. Run
bash scripts/train.sh main.py configs/ltx_model/*/video_model.yaml
2. Trajectory Value Learning
i. Modify the config in configs/ltx_model/*/value_model.yaml
diffusion_model:
model_path: PATH_TO_VIDEO_POST_TRAINING_CHECKPOINT_SAFETENSOR
data:
train / val:
data_roots: [ROOT_PATH_TO_YOUR_DATASETS, ]
domains: [DATASETNAME, ]
# rewrite to the camera names used in your dataset
valid_cam: ["observation.images.top_head", "observation.images.hand_left", "observation.images.hand_right"]
# rewrite to the keys used in your dataset
value_key: "state_value"
value_dense: True
...
More details of dataset can be found in data/*_dataset.py
ii. Enable value-model as bellow in configs/ltx_model/*/value_model.yaml:
return_video: False
return_value:True
return_action: False
train_mode: 'value_only'
diffusion_model:
config:
value_expert: True
action_expert: False
noisy_video: True
iii. Run
bash scripts/train.sh main.py configs/ltx_model/*/value_model.yaml
3. Action Post-Training
i. Modify the config in configs/ltx_model/*/policy_model.yaml
diffusion_model:
model_path: PATH_TO_VALUE_POST_TRAINING_CHECKPOINT_SAFETENSOR
data:
train / val:
data_roots: [ROOT_PATH_TO_YOUR_DATASETS, ]
domains: [DATASETNAME, ]
# rewrite to the camera names used in your dataset
valid_cam: ["observation.images.top_head", "observation.images.hand_left", "observation.images.hand_right"]
# rewrite to the keys used in your dataset
action_key: "action"
state_key: "observation.state"
action_type: "absolute" # "absolute", "delta" or "relative"
action_space: "joint"
...
More details of dataset can be found in data/*_dataset.py
ii. Enable action-model as bellow in configs/ltx_model/*/policy_model.yaml:
return_video: False
return_value: True
return_action: True
train_mode: 'action_full'
diffusion_model:
config:
value_expert: True
action_expert: True
noisy_video: True
iii. Run
bash scripts/train.sh main.py configs/ltx_model/*/policy_model.yaml
Evaluation
Open-Loop Validation
bash scripts/infer.sh \
main.py \
PATH/TO/CONFIG \
PATH/TO/CHECKPOINT \
PATH/TO/OUTPUTS \
DomainName
This path is useful for quick qualitative inspection and open-loop video/value/action prediction.
Real-World Deployment
We provide both WebSocket-based serving and HTTP-based robot deployment.
WebSocket Server
python3 web_infer_scripts/main_server.py \
-c PATH/TO/CONFIG \
-w PATH/TO/CHECKPOINT \
--host 0.0.0.0 \
--port $PORT \
--domain_name $DOMAIN_NAME \
--action_dim $ACTION_DIM \
--norm_type $NORM_TYPE \
--device 0
A minimal test client is available in web_infer_scripts/simple_client.py.
HTTP Server for Robot Deployment
python3 web_infer_utils/Real_deploy.py \
-c PATH/TO/CONFIG \
-w PATH/TO/CHECKPOINT \
--host 0.0.0.0 \
--port $PORT \
--domain_name $DOMAIN_NAME \
--action_dim $ACTION_DIM \
--norm_type $NORM_TYPE \
--device 0
Acknowledgement
This codebase builds on the current Genie-Envisioner implementation.
Citation
If you find this project useful, please consider citing the paper once the public version is released.
@article{li2026world,
title={World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems},
author={Li, Runze and Zhang, Hongyin and Jin, Junxi and Zeng, Qixin and Zhuang, Zifeng and Tang, Yiqi and Lyu, Shangke and Wang, Donglin},
journal={arXiv preprint arXiv:2604.14732},
year={2026}
}
