June 5, 2026
WBench: A Comprehensive Multi-turn Benchmark for
Interactive Video World Model Evaluation
TL;DR — WBench evaluates 20 video world models across 5 dimensions and 22 metrics.
📢 News
- [2026/06/01] WBench is now featured as an official benchmark on Hugging Face (navi & full tasks)!
- [2026/06/01] We release WBench-examples — generated videos from two reference models (HY-World 1.5, Kling 3.0) as ready-to-evaluate submission examples.
- [2026/06/01] We add camera- and action-conditioned model examples, a web-product automation pipeline (Project Genie, Happy Oyster), and Claude Code skills for generation, evaluation, and submission.
- [2026/05/29] Our paper ranked #2 on Hugging Face Daily Papers!
- [2026/05/28] Our paper is now available on arXiv!
- [2026/05/28] Homepage with interactive leaderboard and dataset gallery is live!
- [2026/05/28] We release the full WBench dataset, evaluation code, and model weights.
✨ Contributions
- A comprehensive evaluation framework with 289 cases and 1,058 interaction turns, covering 4 interaction types (navigation, subject action, event editing, perspective switching) across diverse scenes and perspectives.
- A unified navigation protocol that bridges text, 6-DoF camera pose, and discrete-action interfaces, enabling fair comparison across model families.
- 22 automatic metrics spanning 5 complementary dimensions, validated against human judgments for reliable automatic evaluation at scale.
- Systematic diagnosis of 20 models revealing that current world models have not yet unified high-fidelity rendering with reliable controllability, consistency, and physics compliance.
🏆 Leaderboard
20 Models — Navigation Split (5 Dimensions, sorted by average)
| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
|---|---|---|---|---|---|---|---|
| 1 | | 79.2 🥇 | 83.0 🥈 | 91.0 🥈 | 70.3 | 82.5 | 69.3 🥉 |
| 2 | | 78.8 🥈 | 81.5 | 72.6 | 79.8 | 88.9 🥇 | 71.2 🥈 |
| 3 | | 78.5 🥉 | 82.6 🥉 | 91.4 🥇 | 66.0 | 80.5 | 71.8 🥇 |
| 4 | | 78.4 | 80.2 | 72.2 | 87.5 🥇 | 86.0 | 66.3 |
| 5 | | 78.2 | 79.7 | 85.6 🥉 | 71.8 | 86.7 🥉 | 67.4 |
| 6 | | 77.1 | 79.3 | 74.2 | 85.1 🥈 | 83.3 | 63.5 |
| 7 | | 76.5 | 83.2 🥇 | 82.9 | 68.0 | 80.2 | 68.4 |
| 8 | | 75.2 | 75.6 | 83.3 | 64.1 | 85.6 | 67.4 |
| 9 | | 74.4 | 78.7 | 85.2 | 67.6 | 75.6 | 64.9 |
| 10 | | 74.3 | 74.9 | 71.4 | 72.8 | 87.4 🥈 | 65.2 |
| 11 | | 74.2 | 75.5 | 71.3 | 72.1 | 85.3 | 66.8 |
| 12 | | 74.1 | 77.4 | 72.5 | 73.3 | 81.4 | 65.7 |
| 13 | | 73.7 | 78.2 | 72.3 | 63.1 | 85.9 | 68.9 |
| 14 | | 73.5 | 79.5 | 72.4 | 72.0 | 78.6 | 65.2 |
| 15 | | 72.9 | 78.7 | 69.3 | 75.9 | 78.7 | 62.1 |
| 16 | | 71.2 | 76.9 | 63.6 | 83.5 🥉 | 72.9 | 59.3 |
| 17 | | 70.7 | 76.4 | 70.3 | 65.1 | 81.4 | 60.4 |
| 18 | | 68.5 | 74.9 | 66.6 | 67.8 | 70.6 | 62.4 |
| 19 | | 68.5 | 75.7 | 67.1 | 80.6 | 62.0 | 57.2 |
| 20 | | 64.0 | 69.7 | 59.6 | 67.7 | 71.6 | 51.4 |
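The Average values in the table above are consistent with an unweighted mean of the five dimension scores. The sketch below checks this against the first and last rows. It is an observation from the published numbers, not a description of how the official report is computed, and `dimension_average` is a hypothetical helper.

```python
def dimension_average(quality, setting, interaction, consistency, physical):
    """Unweighted mean of the five dimension scores, rounded to one decimal."""
    scores = [quality, setting, interaction, consistency, physical]
    return round(sum(scores) / len(scores), 1)


# Dimension scores copied from rank 1 and rank 20 of the navigation split.
rank_1 = dimension_average(83.0, 91.0, 70.3, 82.5, 69.3)
rank_20 = dimension_average(69.7, 59.6, 67.7, 71.6, 51.4)
print(rank_1, rank_20)
```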
9 Text-driven Models — Full Split (5 Dimensions, sorted by average)
| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
|---|---|---|---|---|---|---|---|
| 1 | | 79.5 🥇 | 81.8 🥉 | 91.0 🥈 | 73.1 🥇 | 82.6 | 69.2 🥈 |
| 2 | | 78.2 🥈 | 82.2 🥈 | 91.4 🥇 | 72.1 🥈 | 73.8 | 71.6 🥇 |
| 3 | | 76.2 🥉 | 83.0 🥇 | 82.9 | 68.3 🥉 | 78.5 | 68.2 |
| 4 | | 74.6 | 78.9 | 85.6 🥉 | 54.7 | 86.8 🥇 | 67.1 |
| 5 | | 71.0 | 78.8 | 85.2 | 49.4 | 76.4 | 65.1 |
| 6 | | 70.8 | 74.6 | 83.3 | 43.5 | 85.4 🥉 | 67.0 |
| 7 | | 70.2 | 79.7 | 72.3 | 45.1 | 85.5 🥈 | 68.4 🥉 |
| 8 | | 69.0 | 79.7 | 72.4 | 48.4 | 79.3 | 65.4 |
| 9 | | 66.0 | 75.8 | 70.3 | 41.6 | 81.9 | 60.5 |
20 Models — Navigation Split (19 metrics)
| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Subject Consistency Cross-Model | Visual Plausibility | Causal Fidelity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 63.4 | 67.4 | 92.1 | 94.2 | 73.9 | 98.7 | 68.0 | 77.5 | 93.6 | 71.8 | 79.2 | 75.1 | 86.6 | 99.4 | 94.6 | 80.3 | 91.6 | 59.7 | 75.0 |
| | 63.0 | 68.1 | 92.3 | 93.2 | 97.5 | 97.6 | 69.1 | 89.0 | 92.9 | 70.3 | 75.2 | 75.1 | 76.8 | 93.0 | 88.9 | 79.9 | 88.5 | 60.7 | 78.0 |
| | 61.8 | 66.9 | 92.3 | 94.8 | 49.0 | 98.2 | 66.5 | 72.4 | 94.2 | 64.1 | 78.1 | 74.3 | 84.3 | 94.3 | 94.6 | 81.6 | 92.3 | 60.1 | 74.7 |
| | 57.9 | 61.0 | 88.3 | 93.2 | 98.1 | 96.4 | 56.1 | 81.3 | 89.2 | 67.6 | 70.2 | 70.2 | 69.8 | 75.8 | 76.9 | 79.2 | 87.2 | 55.7 | 74.0 |
| | 61.0 | 69.3 | 89.6 | 92.4 | 99.4 | 97.5 | 73.0 | 71.6 | 94.2 | 68.0 | 72.7 | 72.4 | 70.5 | 96.2 | 82.4 | 76.8 | 90.1 | 60.7 | 76.0 |
| | 61.4 | 68.0 | 89.4 | 92.2 | 100.0 | 96.3 | 71.1 | 88.3 | 94.6 | 66.0 | 71.0 | 71.0 | 78.2 | 92.4 | 83.7 | 76.4 | 90.7 | 60.3 | 83.3 |
| | 59.9 | 62.7 | 91.1 | 95.4 | 70.1 | 97.5 | 58.5 | 52.2 | 88.5 | 65.1 | 76.8 | 62.0 | 76.3 | 94.3 | 89.0 | 80.8 | 90.8 | 58.0 | 62.7 |
| | 66.5 | 69.6 | 95.1 | 94.8 | 45.9 | 97.9 | 77.6 | 53.1 | 91.5 | 63.1 | 83.3 | 66.2 | 81.5 | 99.4 | 95.4 | 82.2 | 93.4 | 61.8 | 76.0 |
| | 58.7 | 63.3 | 90.3 | 93.0 | 96.8 | 97.0 | 57.0 | 53.1 | 91.7 | 72.0 | 71.5 | 71.4 | 48.0 | 99.4 | 88.0 | 83.3 | 88.8 | 57.7 | 72.7 |
| | 48.6 | 52.5 | 85.3 | 96.0 | 79.6 | 97.7 | 28.0 | 43.4 | 75.9 | 67.7 | 64.7 | 63.3 | 30.0 | 86.6 | 85.6 | 87.5 | 83.5 | 54.6 | 48.3 |
| | 63.0 | 62.8 | 94.2 | 95.8 | 49.0 | 97.9 | 65.8 | 52.4 | 90.1 | 72.1 | 80.6 | 64.2 | 79.8 | 100.0 | 95.3 | 84.8 | 92.5 | 59.7 | 74.0 |
| | 52.6 | 58.7 | 86.5 | 93.7 | 96.8 | 97.6 | 38.3 | 50.6 | 82.5 | 67.8 | 60.5 | 60.5 | 17.9 | 99.4 | 88.3 | 85.0 | 82.6 | 56.5 | 68.3 |
| | 51.6 | 59.3 | 90.7 | 95.0 | 92.4 | 97.8 | 55.2 | 61.1 | 83.8 | 73.3 | 79.9 | 78.4 | 54.5 | 93.6 | 88.6 | 84.5 | 90.4 | 59.7 | 71.7 |
| | 56.6 | 63.9 | 91.4 | 94.0 | 94.2 | 97.0 | 58.3 | 57.4 | 91.1 | 85.1 | 77.7 | 75.8 | 75.0 | 96.2 | 87.2 | 79.8 | 91.5 | 57.6 | 69.3 |
| | 60.1 | 65.4 | 92.7 | 93.5 | 91.1 | 98.1 | 60.5 | 53.5 | 90.8 | 87.5 | 90.6 | 84.9 | 62.5 | 100.0 | 92.0 | 83.1 | 89.1 | 58.6 | 74.0 |
| | 58.7 | 66.1 | 88.8 | 94.1 | 82.8 | 98.0 | 62.3 | 54.0 | 84.5 | 75.9 | 74.9 | 74.4 | 33.8 | 100.0 | 94.3 | 85.1 | 88.4 | 57.2 | 67.0 |
| | 64.4 | 67.6 | 95.0 | 96.0 | 26.1 | 98.8 | 76.1 | 51.7 | 91.1 | 72.8 | 93.8 | 66.5 | 72.5 | 100.0 | 97.3 | 87.4 | 94.4 | 63.1 | 67.3 |
| | 66.9 | 67.9 | 96.9 | 94.1 | 66.2 | 96.9 | 81.4 | 51.6 | 93.6 | 79.8 | 92.7 | 67.1 | 90.9 | 99.4 | 95.4 | 83.3 | 93.5 | 64.8 | 77.7 |
| | 54.0 | 60.3 | 86.9 | 94.6 | 94.9 | 98.2 | 41.0 | 49.4 | 84.9 | 80.6 | 64.5 | 64.5 | 29.2 | 21.0 | 86.1 | 81.3 | 87.2 | 55.0 | 59.3 |
| | 46.4 | 70.0 | 85.7 | 86.3 | 97.5 | 95.4 | 57.1 | 48.9 | 78.4 | 83.5 | 81.0 | 80.4 | 13.3 | 89.8 | 87.6 | 75.3 | 83.0 | 54.0 | 64.7 |
9 Text-driven Models — Full Split (22 metrics)
| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Event Edit Adherence | Subject Action Adherence | Perspective Switch Adherence | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Subject Consistency Cross-Model | Visual Plausibility | Causal Fidelity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 61.9 | 67.4 | 92.4 | 95.5 | 68.8 | 98.8 | 67.5 | 77.5 | 93.6 | 71.8 | 63.8 | 55.6 | 27.6 | 79.2 | 75.1 | 86.6 | 99.3 | 94.4 | 81.4 | 91.5 | 59.3 | 75.0 |
| | 61.3 | 67.7 | 92.7 | 94.5 | 89.9 | 97.9 | 68.8 | 89.0 | 92.9 | 70.3 | 81.4 | 85.6 | 55.0 | 75.2 | 75.1 | 76.8 | 92.7 | 89.4 | 80.4 | 88.5 | 60.4 | 78.0 |
| | 60.1 | 67.2 | 92.3 | 96.0 | 42.4 | 98.3 | 65.9 | 72.4 | 94.2 | 64.1 | 48.2 | 41.6 | 20.0 | 78.1 | 74.3 | 84.3 | 93.1 | 94.2 | 82.1 | 91.8 | 59.3 | 74.7 |
| | 56.9 | 62.3 | 89.3 | 94.1 | 94.4 | 96.8 | 57.7 | 81.3 | 89.2 | 67.6 | 53.0 | 51.8 | 25.0 | 70.2 | 70.2 | 69.8 | 77.8 | 81.1 | 79.4 | 86.7 | 56.2 | 74.0 |
| | 59.7 | 69.8 | 89.6 | 93.4 | 98.3 | 97.6 | 72.9 | 71.6 | 94.2 | 68.0 | 80.4 | 80.0 | 45.0 | 72.7 | 72.4 | 62.7 | 92.4 | 83.5 | 76.7 | 89.3 | 60.5 | 76.0 |
| | 59.6 | 68.1 | 89.5 | 93.0 | 99.3 | 96.5 | 69.4 | 88.3 | 94.6 | 66.0 | 84.0 | 83.4 | 55.0 | 71.0 | 71.0 | 62.2 | 65.6 | 82.6 | 75.5 | 88.7 | 59.8 | 83.3 |
| | 58.4 | 63.6 | 91.8 | 96.3 | 63.5 | 97.9 | 58.8 | 52.2 | 88.5 | 65.1 | 46.8 | 41.4 | 13.3 | 76.8 | 62.0 | 76.3 | 94.1 | 91.5 | 82.1 | 90.7 | 58.2 | 62.7 |
| | 64.7 | 69.8 | 94.7 | 94.9 | 59.7 | 97.7 | 76.3 | 53.1 | 91.5 | 63.1 | 50.4 | 48.4 | 18.3 | 83.3 | 66.2 | 81.5 | 98.6 | 94.7 | 81.5 | 92.4 | 60.8 | 76.0 |
| | 59.3 | 65.7 | 92.0 | 94.8 | 86.1 | 97.7 | 62.0 | 53.1 | 91.7 | 72.0 | 57.8 | 47.0 | 16.7 | 71.5 | 71.4 | 48.0 | 99.3 | 91.1 | 84.1 | 89.4 | 58.1 | 72.7 |
🚀 Quick Start
# Install
git clone --recursive https://github.com/meituan-longcat/WBench.git
cd WBench
# Download data and weights
pip install huggingface_hub
hf download meituan-longcat/WBench --repo-type dataset --local-dir data/ --exclude "splits/*"
hf download meituan-longcat/WBench-weights --local-dir weights/
# Environment 1: wbench-main (all metrics except visual_plausibility)
bash tools/install.sh wbench-main
conda activate wbench-main
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
# Environment 2: wbench-vp (visual_plausibility only, requires vLLM)
bash tools/install_vp.sh wbench-vp
# Verify
conda activate wbench-main
python tools/verify_install.py
# Run evaluation (auto multi-GPU)
python main.py --model your_model
See docs/installation.md for detailed setup instructions.
🎮 Evaluate Your Model
- Generate multi-turn videos → place in work_dirs/<model>/videos/case_{id}_combined.mp4
- Run the 4-phase pipeline:
# Full pipeline (precompute → GPU metrics → VLM metrics → report)
python main.py --model my_model --gpus 0,1,2,3,4,5,6,7
# Or run phases independently:
python main.py --model my_model --phase precompute # SAM2 + DA3 + MegaSAM
python main.py --model my_model --phase gpu # GPU metrics (per-metric)
python main.py --model my_model --phase vlm # VLM metrics (API)
python main.py --model my_model --phase report # Aggregate report
- Results: work_dirs/<model>/evaluation/{metric}/case_{id}.json + report.json
# Run specific metrics (by name or dimension)
python main.py --model my_model --phase gpu --metrics hpsv3_quality
python main.py --model my_model --phase gpu --metrics renderer # all 6 video quality
python main.py --model my_model --phase gpu --metrics consistency # all consistency metrics
# Skip pre-computation if already done
python main.py --model my_model --phase gpu --skip_megasam --skip_sam2 --skip_da3
# Single video evaluation
python main.py --video video.mp4 --case data/cases/case_1.json
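Before launching a long evaluation, it can help to confirm that every case has its combined video in the expected place. The following is a minimal sketch, assuming only the `work_dirs/<model>/videos/case_{id}_combined.mp4` naming described above and a `data/cases/case_<id>.json` layout inferred from the example path. `discover_case_ids` and `missing_videos` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def discover_case_ids(cases_dir="data/cases"):
    """Collect ids from files named case_<id>.json (layout assumed, not confirmed)."""
    directory = Path(cases_dir)
    if not directory.is_dir():
        return []
    prefix = "case_"
    return sorted(p.stem[len(prefix):] for p in directory.glob("case_*.json"))


def missing_videos(model, case_ids, work_root="work_dirs"):
    """Return the case ids whose combined video is not on disk yet."""
    video_dir = Path(work_root) / model / "videos"
    return [
        case_id
        for case_id in case_ids
        if not (video_dir / f"case_{case_id}_combined.mp4").is_file()
    ]
```

For example, `missing_videos("my_model", discover_case_ids())` lists the cases that still need generation; an empty list means the videos directory is complete for the discovered cases.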
Dimensions (--metrics supports these as shorthand):
| Dimension | Metrics |
|---|---|
| quality | aesthetic_quality, imaging_quality, temporal_flickering, dynamic_degree, motion_smoothness, hpsv3_quality |
| consistency | background_consistency, segment_continuity, perspective_consistency, subject_consistency, geometric_consistency, photometric_consistency, spatial_consistency, gated_spatial_consistency |
| interaction | navigation_trajectory, event_edit_adherence, subject_action_adherence, perspective_switch_adherence |
| setting | scene_adherence, subject_adherence |
| physical | visual_plausibility, causal_fidelity |
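To make the shorthand concrete, the table above can be written as a mapping and expanded into individual metric names. This is only an illustrative sketch: the metric names are copied from the table, while `DIMENSIONS` and `expand_metrics` are hypothetical and do not claim to mirror how main.py parses `--metrics`.

```python
DIMENSIONS = {
    "quality": [
        "aesthetic_quality", "imaging_quality", "temporal_flickering",
        "dynamic_degree", "motion_smoothness", "hpsv3_quality",
    ],
    "consistency": [
        "background_consistency", "segment_continuity", "perspective_consistency",
        "subject_consistency", "geometric_consistency", "photometric_consistency",
        "spatial_consistency", "gated_spatial_consistency",
    ],
    "interaction": [
        "navigation_trajectory", "event_edit_adherence",
        "subject_action_adherence", "perspective_switch_adherence",
    ],
    "setting": ["scene_adherence", "subject_adherence"],
    "physical": ["visual_plausibility", "causal_fidelity"],
}


def expand_metrics(names):
    """Expand dimension shorthands into metric names, keeping order and dropping duplicates."""
    expanded = []
    for name in names:
        # A name that is not a dimension is treated as a single metric.
        for metric in DIMENSIONS.get(name, [name]):
            if metric not in expanded:
                expanded.append(metric)
    return expanded


total_metrics = sum(len(metrics) for metrics in DIMENSIONS.values())
```

The five lists add up to the 22 metrics reported for the full split.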
Set environment variables for VLM metrics (we use Doubao-Seed-2.0-lite via Volcengine ARK):
export VLM_API_KEY="<your-ark-api-key>"
# Optional (defaults shown):
# export VLM_API_URL="https://ark.cn-beijing.volces.com/api/v3"
# export VLM_MODEL_NAME="doubao-seed-2-0-lite-260215"
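If you call the VLM endpoint from your own scripts, the same variables can be read with the documented defaults as fallbacks. This is a sketch under assumptions: the variable names and default values are the ones listed above, but `load_vlm_config` is a hypothetical helper and the way main.py actually reads its configuration is not shown here.

```python
import os

DEFAULT_VLM_API_URL = "https://ark.cn-beijing.volces.com/api/v3"
DEFAULT_VLM_MODEL_NAME = "doubao-seed-2-0-lite-260215"


def load_vlm_config(env=None):
    """Read VLM settings from the environment, failing early if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("VLM_API_KEY")
    if not api_key:
        raise RuntimeError("VLM_API_KEY is not set; export it before running VLM metrics")
    return {
        "api_key": api_key,
        "api_url": env.get("VLM_API_URL", DEFAULT_VLM_API_URL),
        "model_name": env.get("VLM_MODEL_NAME", DEFAULT_VLM_MODEL_NAME),
    }
```

Failing early on a missing key avoids discovering the problem only after the precompute and GPU phases have finished.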
For visual_plausibility, use the separate wbench-vp environment:
python tools/run_visual_plausibility.py --model my_model # uses all available GPUs
🔌 Implement Your Model
WBench supports 3 model types with different control interfaces:
| Type | Input | Cases | Status |
|---|---|---|---|
| Text-conditioned | Text prompt + first-frame image | 289 (all) | ✅ Implemented |
| Camera-conditioned | First-frame image + 6-DoF camera pose | 158 (navi) | ✅ Implemented |
| Action-conditioned | First-frame image + discrete action | 158 (navi) | ✅ Implemented |
Text-conditioned models
from src.models import get_model
# Available: wan, kling, seedance (or register your own)
model = get_model("wan")
# Generate multi-turn video from a case
result = model.generate_multi_turn(
    case=case_dict,
    output_path="work_dirs/wan/videos/case_1_combined.mp4",
    data_root="data/",
)
Each turn: build prompt from interaction → call I2V API → extract last frame → next turn.
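The per-turn loop described above can be sketched in a few lines. This is a simplified illustration, not the repository's `generate_multi_turn`: `chain_turns`, `i2v`, and `build_prompt` are hypothetical names, and `i2v` stands in for whatever image-to-video call your backend provides (assumed to return a list of frames).

```python
def chain_turns(first_frame, interactions, i2v, build_prompt=str):
    """Chain single-turn I2V calls: the last frame of one turn seeds the next."""
    all_frames = []
    current = first_frame
    for interaction in interactions:
        prompt = build_prompt(interaction)
        frames = i2v(current, prompt)  # hypothetical callable: (image, prompt) -> list of frames
        if not frames:
            raise ValueError(f"I2V returned no frames for prompt: {prompt!r}")
        all_frames.extend(frames)
        current = frames[-1]  # the last frame becomes the conditioning image for the next turn
    return all_frames
```

Because each turn is conditioned only on the previous turn's last frame, any drift introduced in one turn carries into all later turns, which is part of what the multi-turn consistency metrics are meant to expose.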
Set API credentials:
export VIDEO_API_URL="https://your-video-api.com"
export VIDEO_API_KEY="your-key"
Camera-conditioned models
The benchmark's navigation actions (W/A/S/D + arrows) are converted to per-turn
{move, yaw, pitch} intent and then to a 6-DoF camera trajectory. Subclass
CameraConditionedModel and implement one hook — case parsing, action→pose
conversion, and video writing are handled for you:
from src.models.camera import CameraConditionedModel
class MyWorldModel(CameraConditionedModel):
    def generate_with_poses(self, image, poses, video_length, **kw):
        # image: first-frame path; poses: {"<latent_idx>": {"extrinsic": 4x4, "K": 3x3}, ...}
        # return: list of `video_length` BGR uint8 frames
        return my_model.infer(image, poses, video_length)

MyWorldModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")
The pose convention (axes, speeds, intrinsics) lives in src/models/camera/poses.py
— copy and adapt it to your model; the navigation metric normalises scale, so what
matters is matching the per-action intent. Quick look at one case:
python -m src.models.camera.demo --case data/cases/case_1.json # prints poses + renders a preview
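For orientation, here is a sketch of what a `poses` dictionary in the documented shape could look like for a straight forward move. Everything beyond the `{"<latent_idx>": {"extrinsic": 4x4, "K": 3x3}}` layout is an assumption made for illustration: the +z forward axis, the step size, and the pinhole intrinsics are placeholders, and `forward_poses` is a hypothetical helper. The actual convention is the one in src/models/camera/poses.py.

```python
def forward_poses(num_latents, step=0.1, focal=500.0, width=640, height=480):
    """Build a {"<latent_idx>": {"extrinsic": 4x4, "K": 3x3}} dict for a forward-only move."""
    poses = {}
    for idx in range(num_latents):
        extrinsic = [
            [1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, idx * step],  # assumed: +z is "forward", no rotation
            [0.0, 0.0, 0.0, 1.0],
        ]
        intrinsics = [
            [focal, 0.0, width / 2.0],  # assumed pinhole camera, principal point at the centre
            [0.0, focal, height / 2.0],
            [0.0, 0.0, 1.0],
        ]
        poses[str(idx)] = {"extrinsic": extrinsic, "K": intrinsics}
    return poses
```

A dictionary like this can be handy for smoke-testing a `generate_with_poses` implementation before wiring it to real benchmark cases.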
Note: Camera/action models only cover the 158 navigation cases (cases containing at least one W/A/S/D/arrow action). When generating at scale, pass only those cases, e.g. via generate.py --model your_model --cases <navi_list>.
Action-conditioned models
Two flavours, both fed from the same per-turn navigation plan:
Programmatic controllers (e.g. Matrix-Game-3). Subclass ActionConditionedModel
and implement generate_with_actions. Each action carries both raw key tokens
and an MG3-style {keyboard, mouse} tensor:
from src.models.action import ActionConditionedModel
class MyActionModel(ActionConditionedModel):
    def generate_with_actions(self, image, actions, video_length, **kw):
        # actions: [{"turn", "tokens", "keyboard", "mouse", "duration"}, ...]
        return my_model.infer(image, actions, video_length)

MyActionModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")
python -m src.models.action.demo --case data/cases/case_1.json # prints actions + renders a preview
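When writing your own `generate_with_actions`, a small guard at the top can catch malformed input early. The sketch below checks only the five keys documented in the comment above; it makes no assumption about the contents of the `keyboard` and `mouse` tensors, and `check_actions` is a hypothetical helper rather than something the repository provides.

```python
REQUIRED_ACTION_KEYS = {"turn", "tokens", "keyboard", "mouse", "duration"}


def check_actions(actions):
    """Raise ValueError if any action dict lacks one of the documented keys."""
    for index, action in enumerate(actions):
        missing = REQUIRED_ACTION_KEYS - set(action)
        if missing:
            raise ValueError(f"action {index} is missing keys: {sorted(missing)}")
    return len(actions)
```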
Web products (e.g. Project Genie, Happy Oyster): no weights or API are available, so these are driven by browser automation and simulated keystrokes. See src/models/action/web/.
🤖 Claude Code Skills
If you use Claude Code, this repo ships skills that drive the full workflow — just ask in natural language and Claude runs the right commands:
| Skill | Triggers on | What it does |
|---|---|---|
| wbench-generate | "generate kling videos" | Runs generate.py over the dataset → work_dirs/<model>/videos/ |
| wbench-evaluate | "evaluate kling3" | Runs the 4-phase main.py pipeline (precompute → gpu → vlm → report) |
| wbench-submit | "package my model for submission" | Builds the meta.json / turns.json bundle and uploads to HuggingFace |
| genie3 / happy | "run case_5 on genie3" | Browser automation for the web products |
Skills live in .claude/skills/ (and src/models/action/web/.claude/skills/) and
are auto-discovered when you open the repo in Claude Code.
📋 TODO
- Text-conditioned model generation (Wan, Kling, Seedance)
- Homepage with interactive leaderboard
- Dataset and weights release on HuggingFace
- Camera-conditioned model generation example
- Action-conditioned model generation example
- Hosted submission & evaluation service (submit videos, get scores)
- ArXiv paper release
📝 Citation
If you find our work useful, please consider citing:
@article{ying2026wbenchcomprehensivemultiturnbenchmark,
title={WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation},
author={Ying, Kaining and Hu, Hengrui and Ren, Siyu and Li, Jiamu and Chen, Fengjiao and Wang, Ziwen and Cao, Xuezhi and Cai, Xunliang and Ding, Henghui},
journal={arXiv preprint arXiv:2605.25874},
year={2026}
}
🙏 Acknowledgement
This project builds upon the following excellent works:
- WorldScore — World model evaluation framework
- VBench — Video quality metrics
- SAM2 — Segment Anything Model 2 for mask tracking
- Depth-Anything-V3 — Monocular depth estimation
- MegaSAM — Camera pose estimation
- DreamSim — Perceptual similarity metric
- HPSv3 — Human Preference Score
- AMT — Frame interpolation for motion smoothness
- RAFT — Optical flow estimation
- TransNetV2 — Scene boundary detection
- ... and many other excellent open-source projects
📧 Contact
Feel free to open an Issue or Pull Request. You can also reach us directly:
- Kaining Ying: kaining.ying.cv@gmail.com
- Siyu Ren: rensiyu07@meituan.com
📄 License
Code and data: MIT License. Model weights retain their original licenses.