June 5, 2026

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Is Your World Model an All-Round Player?

TL;DR — WBench evaluates 20 video world models across 5 dimensions and 22 metrics.

📢 News

✨ Contributions

  • A comprehensive evaluation framework with 289 cases and 1,058 interaction turns, covering 4 interaction types (navigation, subject action, event editing, perspective switching) across diverse scenes and perspectives.
  • A unified navigation protocol that bridges text, 6-DoF camera pose, and discrete-action interfaces, enabling fair comparison across model families.
  • 22 automatic metrics spanning 5 complementary dimensions, validated against human judgments, ensuring reliable automatic evaluation at scale.
  • Systematic diagnosis of 20 models revealing that current world models have not yet unified high-fidelity rendering with reliable controllability, consistency, and physics compliance.

🏆 Leaderboard

20 Models — Navigation Split (5 Dimensions, sorted by average)

| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
|---|---|---|---|---|---|---|---|
| 1 | Kling 3.0 | 79.2 🥇 | 83.0 🥈 | 91.0 🥈 | 70.3 | 82.5 | 69.3 🥉 |
| 2 | LingBot-World | 78.8 🥈 | 81.5 | 72.6 | 79.8 | 88.9 🥇 | 71.2 🥈 |
| 3 | Wan 2.7 | 78.5 🥉 | 82.6 🥉 | 91.4 🥇 | 66.0 | 80.5 | 71.8 🥇 |
| 4 | HY-World 1.5 | 78.4 | 80.2 | 72.2 | 87.5 🥇 | 86.0 | 66.3 |
| 5 | HY-Video 1.5 | 78.2 | 79.7 | 85.6 🥉 | 71.8 | 86.7 🥉 | 67.4 |
| 6 | Happy Oyster | 77.1 | 79.3 | 74.2 | 85.1 🥈 | 83.3 | 63.5 |
| 7 | Seedance 1.5 | 76.5 | 83.2 🥇 | 82.9 | 68.0 | 80.2 | 68.4 |
| 8 | Cosmos 2.5 | 75.2 | 75.6 | 83.3 | 64.1 | 85.6 | 67.4 |
| 9 | LTX 2.3 | 74.4 | 78.7 | 85.2 | 67.6 | 75.6 | 64.9 |
| 10 | InSpatio-World | 74.3 | 74.9 | 71.4 | 72.8 | 87.4 🥈 | 65.2 |
| 11 | Fantasy-World | 74.2 | 75.5 | 71.3 | 72.1 | 85.3 | 66.8 |
| 12 | Genie 3 | 74.1 | 77.4 | 72.5 | 73.3 | 81.4 | 65.7 |
| 13 | LongCat-Video | 73.7 | 78.2 | 72.3 | 63.1 | 85.9 | 68.9 |
| 14 | YUME 1.5 | 73.5 | 79.5 | 72.4 | 72.0 | 78.6 | 65.2 |
| 15 | Infinite-World | 72.9 | 78.7 | 69.3 | 75.9 | 78.7 | 62.1 |
| 16 | MatrixGame3 | 71.2 | 76.9 | 63.6 | 83.5 🥉 | 72.9 | 59.3 |
| 17 | Kairos 3.0 | 70.7 | 76.4 | 70.3 | 65.1 | 81.4 | 60.4 |
| 18 | HY-GameCraft | 68.5 | 74.9 | 66.6 | 67.8 | 70.6 | 62.4 |
| 19 | MatrixGame2 | 68.5 | 75.7 | 67.1 | 80.6 | 62.0 | 57.2 |
| 20 | Astra | 64.0 | 69.7 | 59.6 | 67.7 | 71.6 | 51.4 |
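
The Average column is consistent with an unweighted mean of the five dimension scores. This is inferred from the table values, not stated by the authors, and the published averages may be computed from unrounded scores. A minimal sketch that reproduces the Kling 3.0 row (the function name `dimension_average` is ours, not part of WBench):

```python
def dimension_average(scores):
    """Unweighted mean of the per-dimension scores, rounded to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)


# Kling 3.0 on the navigation split, values copied from the leaderboard.
kling = {
    "quality": 83.0,
    "setting": 91.0,
    "interaction": 70.3,
    "consistency": 82.5,
    "physical": 69.3,
}

print(dimension_average(kling))  # 79.2
```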

9 Text-driven Models — Full Split (5 Dimensions, sorted by average)

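| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
|---|---|---|---|---|---|---|---|
| 1 | Kling 3.0 | 79.5 🥇 | 81.8 🥉 | 91.0 🥈 | 73.1 🥇 | 82.6 | 69.2 🥈 |
| 2 | Wan 2.7 | 78.2 🥈 | 82.2 🥈 | 91.4 🥇 | 72.1 🥈 | 73.8 | 71.6 🥇 |
| 3 | Seedance 1.5 | 76.2 🥉 | 83.0 🥇 | 82.9 | 68.3 🥉 | 78.5 | 68.2 |
| 4 | HY-Video 1.5 | 74.6 | 78.9 | 85.6 🥉 | 54.7 | 86.8 🥇 | 67.1 |
| 5 | LTX 2.3 | 71.0 | 78.8 | 85.2 | 49.4 | 76.4 | 65.1 |
| 6 | Cosmos 2.5 | 70.8 | 74.6 | 83.3 | 43.5 | 85.4 🥉 | 67.0 |
| 7 | LongCat-Video | 70.2 | 79.7 | 72.3 | 45.1 | 85.5 🥈 | 68.4 🥉 |
| 8 | YUME 1.5 | 69.0 | 79.7 | 72.4 | 48.4 | 79.3 | 65.4 |
| 9 | Kairos 3.0 | 66.0 | 75.8 | 70.3 | 41.6 | 81.9 | 60.5 |
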
20 Models — Navigation Split (19 metrics)

| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Subject Consistency Cross-Model | Visual Plausibility | Causal Fidelity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HY-Video 1.5 | 63.4 | 67.4 | 92.1 | 94.2 | 73.9 | 98.7 | 68.0 | 77.5 | 93.6 | 71.8 | 79.2 | 75.1 | 86.6 | 99.4 | 94.6 | 80.3 | 91.6 | 59.7 | 75.0 |
| Kling 3.0 | 63.0 | 68.1 | 92.3 | 93.2 | 97.5 | 97.6 | 69.1 | 89.0 | 92.9 | 70.3 | 75.2 | 75.1 | 76.8 | 93.0 | 88.9 | 79.9 | 88.5 | 60.7 | 78.0 |
| Cosmos 2.5 | 61.8 | 66.9 | 92.3 | 94.8 | 49.0 | 98.2 | 66.5 | 72.4 | 94.2 | 64.1 | 78.1 | 74.3 | 84.3 | 94.3 | 94.6 | 81.6 | 92.3 | 60.1 | 74.7 |
| LTX 2.3 | 57.9 | 61.0 | 88.3 | 93.2 | 98.1 | 96.4 | 56.1 | 81.3 | 89.2 | 67.6 | 70.2 | 70.2 | 69.8 | 75.8 | 76.9 | 79.2 | 87.2 | 55.7 | 74.0 |
| Seedance 1.5 | 61.0 | 69.3 | 89.6 | 92.4 | 99.4 | 97.5 | 73.0 | 71.6 | 94.2 | 68.0 | 72.7 | 72.4 | 70.5 | 96.2 | 82.4 | 76.8 | 90.1 | 60.7 | 76.0 |
| Wan 2.7 | 61.4 | 68.0 | 89.4 | 92.2 | 100.0 | 96.3 | 71.1 | 88.3 | 94.6 | 66.0 | 71.0 | 71.0 | 78.2 | 92.4 | 83.7 | 76.4 | 90.7 | 60.3 | 83.3 |
| Kairos 3.0 | 59.9 | 62.7 | 91.1 | 95.4 | 70.1 | 97.5 | 58.5 | 52.2 | 88.5 | 65.1 | 76.8 | 62.0 | 76.3 | 94.3 | 89.0 | 80.8 | 90.8 | 58.0 | 62.7 |
| LongCat-Video | 66.5 | 69.6 | 95.1 | 94.8 | 45.9 | 97.9 | 77.6 | 53.1 | 91.5 | 63.1 | 83.3 | 66.2 | 81.5 | 99.4 | 95.4 | 82.2 | 93.4 | 61.8 | 76.0 |
| YUME 1.5 | 58.7 | 63.3 | 90.3 | 93.0 | 96.8 | 97.0 | 57.0 | 53.1 | 91.7 | 72.0 | 71.5 | 71.4 | 48.0 | 99.4 | 88.0 | 83.3 | 88.8 | 57.7 | 72.7 |
| Astra | 48.6 | 52.5 | 85.3 | 96.0 | 79.6 | 97.7 | 28.0 | 43.4 | 75.9 | 67.7 | 64.7 | 63.3 | 30.0 | 86.6 | 85.6 | 87.5 | 83.5 | 54.6 | 48.3 |
| Fantasy-World | 63.0 | 62.8 | 94.2 | 95.8 | 49.0 | 97.9 | 65.8 | 52.4 | 90.1 | 72.1 | 80.6 | 64.2 | 79.8 | 100.0 | 95.3 | 84.8 | 92.5 | 59.7 | 74.0 |
| HY-GameCraft | 52.6 | 58.7 | 86.5 | 93.7 | 96.8 | 97.6 | 38.3 | 50.6 | 82.5 | 67.8 | 60.5 | 60.5 | 17.9 | 99.4 | 88.3 | 85.0 | 82.6 | 56.5 | 68.3 |
| Genie 3 | 51.6 | 59.3 | 90.7 | 95.0 | 92.4 | 97.8 | 55.2 | 61.1 | 83.8 | 73.3 | 79.9 | 78.4 | 54.5 | 93.6 | 88.6 | 84.5 | 90.4 | 59.7 | 71.7 |
| Happy Oyster | 56.6 | 63.9 | 91.4 | 94.0 | 94.2 | 97.0 | 58.3 | 57.4 | 91.1 | 85.1 | 77.7 | 75.8 | 75.0 | 96.2 | 87.2 | 79.8 | 91.5 | 57.6 | 69.3 |
| HY-World 1.5 | 60.1 | 65.4 | 92.7 | 93.5 | 91.1 | 98.1 | 60.5 | 53.5 | 90.8 | 87.5 | 90.6 | 84.9 | 62.5 | 100.0 | 92.0 | 83.1 | 89.1 | 58.6 | 74.0 |
| Infinite-World | 58.7 | 66.1 | 88.8 | 94.1 | 82.8 | 98.0 | 62.3 | 54.0 | 84.5 | 75.9 | 74.9 | 74.4 | 33.8 | 100.0 | 94.3 | 85.1 | 88.4 | 57.2 | 67.0 |
| InSpatio-World | 64.4 | 67.6 | 95.0 | 96.0 | 26.1 | 98.8 | 76.1 | 51.7 | 91.1 | 72.8 | 93.8 | 66.5 | 72.5 | 100.0 | 97.3 | 87.4 | 94.4 | 63.1 | 67.3 |
| LingBot-World | 66.9 | 67.9 | 96.9 | 94.1 | 66.2 | 96.9 | 81.4 | 51.6 | 93.6 | 79.8 | 92.7 | 67.1 | 90.9 | 99.4 | 95.4 | 83.3 | 93.5 | 64.8 | 77.7 |
| MatrixGame2 | 54.0 | 60.3 | 86.9 | 94.6 | 94.9 | 98.2 | 41.0 | 49.4 | 84.9 | 80.6 | 64.5 | 64.5 | 29.2 | 21.0 | 86.1 | 81.3 | 87.2 | 55.0 | 59.3 |
| MatrixGame3 | 46.4 | 70.0 | 85.7 | 86.3 | 97.5 | 95.4 | 57.1 | 48.9 | 78.4 | 83.5 | 81.0 | 80.4 | 13.3 | 89.8 | 87.6 | 75.3 | 83.0 | 54.0 | 64.7 |

9 Text-driven Models — Full Split (22 metrics)

| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Event Edit Adherence | Subject Action Adherence | Perspective Switch Adherence | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Subject Consistency Cross-Model | Visual Plausibility | Causal Fidelity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HY-Video 1.5 | 61.9 | 67.4 | 92.4 | 95.5 | 68.8 | 98.8 | 67.5 | 77.5 | 93.6 | 71.8 | 63.8 | 55.6 | 27.6 | 79.2 | 75.1 | 86.6 | 99.3 | 94.4 | 81.4 | 91.5 | 59.3 | 75.0 |
| Kling 3.0 | 61.3 | 67.7 | 92.7 | 94.5 | 89.9 | 97.9 | 68.8 | 89.0 | 92.9 | 70.3 | 81.4 | 85.6 | 55.0 | 75.2 | 75.1 | 76.8 | 92.7 | 89.4 | 80.4 | 88.5 | 60.4 | 78.0 |
| Cosmos 2.5 | 60.1 | 67.2 | 92.3 | 96.0 | 42.4 | 98.3 | 65.9 | 72.4 | 94.2 | 64.1 | 48.2 | 41.6 | 20.0 | 78.1 | 74.3 | 84.3 | 93.1 | 94.2 | 82.1 | 91.8 | 59.3 | 74.7 |
| LTX 2.3 | 56.9 | 62.3 | 89.3 | 94.1 | 94.4 | 96.8 | 57.7 | 81.3 | 89.2 | 67.6 | 53.0 | 51.8 | 25.0 | 70.2 | 70.2 | 69.8 | 77.8 | 81.1 | 79.4 | 86.7 | 56.2 | 74.0 |
| Seedance 1.5 | 59.7 | 69.8 | 89.6 | 93.4 | 98.3 | 97.6 | 72.9 | 71.6 | 94.2 | 68.0 | 80.4 | 80.0 | 45.0 | 72.7 | 72.4 | 62.7 | 92.4 | 83.5 | 76.7 | 89.3 | 60.5 | 76.0 |
| Wan 2.7 | 59.6 | 68.1 | 89.5 | 93.0 | 99.3 | 96.5 | 69.4 | 88.3 | 94.6 | 66.0 | 84.0 | 83.4 | 55.0 | 71.0 | 71.0 | 62.2 | 65.6 | 82.6 | 75.5 | 88.7 | 59.8 | 83.3 |
| Kairos 3.0 | 58.4 | 63.6 | 91.8 | 96.3 | 63.5 | 97.9 | 58.8 | 52.2 | 88.5 | 65.1 | 46.8 | 41.4 | 13.3 | 76.8 | 62.0 | 76.3 | 94.1 | 91.5 | 82.1 | 90.7 | 58.2 | 62.7 |
| LongCat-Video | 64.7 | 69.8 | 94.7 | 94.9 | 59.7 | 97.7 | 76.3 | 53.1 | 91.5 | 63.1 | 50.4 | 48.4 | 18.3 | 83.3 | 66.2 | 81.5 | 98.6 | 94.7 | 81.5 | 92.4 | 60.8 | 76.0 |
| YUME 1.5 | 59.3 | 65.7 | 92.0 | 94.8 | 86.1 | 97.7 | 62.0 | 53.1 | 91.7 | 72.0 | 57.8 | 47.0 | 16.7 | 71.5 | 71.4 | 48.0 | 99.3 | 91.1 | 84.1 | 89.4 | 58.1 | 72.7 |

🚀 Quick Start

# Install
git clone --recursive https://github.com/meituan-longcat/WBench.git
cd WBench

# Download data and weights
pip install huggingface_hub
hf download meituan-longcat/WBench --repo-type dataset --local-dir data/ --exclude "splits/*"
hf download meituan-longcat/WBench-weights --local-dir weights/

# Environment 1: wbench-main (all metrics except visual_plausibility)
bash tools/install.sh wbench-main
conda activate wbench-main
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

# Environment 2: wbench-vp (visual_plausibility only, requires vLLM)
bash tools/install_vp.sh wbench-vp

# Verify
conda activate wbench-main
python tools/verify_install.py

# Run evaluation (auto multi-GPU)
python main.py --model your_model

See docs/installation.md for detailed setup instructions.

🎮 Evaluate Your Model

  1. Generate multi-turn videos and place them in work_dirs/<model>/videos/case_{id}_combined.mp4
  2. Run the 4-phase pipeline:
# Full pipeline (precompute → GPU metrics → VLM metrics → report)
python main.py --model my_model --gpus 0,1,2,3,4,5,6,7

# Or run phases independently:
python main.py --model my_model --phase precompute    # SAM2 + DA3 + MegaSAM
python main.py --model my_model --phase gpu           # GPU metrics (per-metric)
python main.py --model my_model --phase vlm           # VLM metrics (API)
python main.py --model my_model --phase report        # Aggregate report
  3. Results are written to work_dirs/<model>/evaluation/{metric}/case_{id}.json, plus an aggregate report.json
# Run specific metrics (by name or dimension)
python main.py --model my_model --phase gpu --metrics hpsv3_quality
python main.py --model my_model --phase gpu --metrics renderer        # all 6 video quality
python main.py --model my_model --phase gpu --metrics consistency     # all consistency metrics

# Skip pre-computation if already done
python main.py --model my_model --phase gpu --skip_megasam --skip_sam2 --skip_da3

# Single video evaluation
python main.py --video video.mp4 --case data/cases/case_1.json
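
Before launching the pipeline it can help to check that every video sits where the evaluator looks for it. The sketch below only encodes the two path patterns given in the steps above; the helper names (`video_path`, `result_path`, `missing_videos`) are hypothetical and not part of WBench.

```python
from pathlib import Path


def video_path(model, case_id, root="work_dirs"):
    """Location of the combined multi-turn video for one case."""
    return Path(root) / model / "videos" / f"case_{case_id}_combined.mp4"


def result_path(model, metric, case_id, root="work_dirs"):
    """Location of the per-case JSON result for one metric."""
    return Path(root) / model / "evaluation" / metric / f"case_{case_id}.json"


def missing_videos(model, case_ids, root="work_dirs"):
    """Return the case ids whose combined video is not on disk yet."""
    return [c for c in case_ids if not video_path(model, c, root).is_file()]


print(video_path("my_model", 1).as_posix())
print(result_path("my_model", "hpsv3_quality", 1).as_posix())
```

Calling `missing_videos("my_model", range(1, 290))` would then list the cases still to be generated, assuming case ids run from 1 to 289 (an assumption; the README only states the case count).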

Dimensions (--metrics supports these as shorthand):

| Dimension | Metrics |
|---|---|
| quality | aesthetic_quality, imaging_quality, temporal_flickering, dynamic_degree, motion_smoothness, hpsv3_quality |
| consistency | background_consistency, segment_continuity, perspective_consistency, subject_consistency, geometric_consistency, photometric_consistency, spatial_consistency, gated_spatial_consistency |
| interaction | navigation_trajectory, event_edit_adherence, subject_action_adherence, perspective_switch_adherence |
| setting | scene_adherence, subject_adherence |
| physical | visual_plausibility, causal_fidelity |
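
To illustrate what a dimension shorthand stands for, the sketch below expands a mixed list of dimension and metric names into individual metrics. The mapping is copied from the table above; the `expand_metrics` function is our own illustration and is not taken from main.py, whose actual resolution logic may differ.

```python
DIMENSIONS = {
    "quality": [
        "aesthetic_quality", "imaging_quality", "temporal_flickering",
        "dynamic_degree", "motion_smoothness", "hpsv3_quality",
    ],
    "consistency": [
        "background_consistency", "segment_continuity", "perspective_consistency",
        "subject_consistency", "geometric_consistency", "photometric_consistency",
        "spatial_consistency", "gated_spatial_consistency",
    ],
    "interaction": [
        "navigation_trajectory", "event_edit_adherence",
        "subject_action_adherence", "perspective_switch_adherence",
    ],
    "setting": ["scene_adherence", "subject_adherence"],
    "physical": ["visual_plausibility", "causal_fidelity"],
}


def expand_metrics(names):
    """Expand dimension shorthands to metric names, keeping order, dropping duplicates."""
    expanded = []
    for name in names:
        # A name that is not a dimension is treated as a single metric.
        for metric in DIMENSIONS.get(name, [name]):
            if metric not in expanded:
                expanded.append(metric)
    return expanded


print(expand_metrics(["setting", "hpsv3_quality"]))
# ['scene_adherence', 'subject_adherence', 'hpsv3_quality']
print(len(expand_metrics(list(DIMENSIONS))))  # 22
```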

Set environment variables for VLM metrics (we use Doubao-Seed-2.0-lite via Volcengine ARK):

export VLM_API_KEY="<your-ark-api-key>"
# Optional (defaults shown):
# export VLM_API_URL="https://ark.cn-beijing.volces.com/api/v3"
# export VLM_MODEL_NAME="doubao-seed-2-0-lite-260215"
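
If you call the VLM from your own scripts, the same variables can be read with the documented defaults as fallbacks. This is a minimal sketch, not the repository's loader: the function name `load_vlm_config` and the returned dictionary keys are assumptions.

```python
import os

# Defaults as documented above.
DEFAULT_VLM_API_URL = "https://ark.cn-beijing.volces.com/api/v3"
DEFAULT_VLM_MODEL_NAME = "doubao-seed-2-0-lite-260215"


def load_vlm_config(env=None):
    """Read VLM settings from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    api_key = env.get("VLM_API_KEY")
    if not api_key:
        raise RuntimeError("VLM_API_KEY is not set; VLM metrics need an API key")
    return {
        "api_key": api_key,
        "api_url": env.get("VLM_API_URL", DEFAULT_VLM_API_URL),
        "model_name": env.get("VLM_MODEL_NAME", DEFAULT_VLM_MODEL_NAME),
    }


# Passing a dict instead of os.environ keeps the example side-effect free.
config = load_vlm_config({"VLM_API_KEY": "<your-ark-api-key>"})
print(config["model_name"])  # doubao-seed-2-0-lite-260215
```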

For visual_plausibility, use the separate wbench-vp environment:

python tools/run_visual_plausibility.py --model my_model  # uses all available GPUs

🔌 Implement Your Model

WBench supports 3 model types with different control interfaces:

| Type | Input | Cases | Status |
|---|---|---|---|
| Text-conditioned | Text prompt + first-frame image | 289 (all) | ✅ Implemented |
| Camera-conditioned | First-frame image + 6-DoF camera pose | 158 (navi) | ✅ Implemented |
| Action-conditioned | First-frame image + discrete action | 158 (navi) | ✅ Implemented |

Text-conditioned models

from src.models import get_model

# Available: wan, kling, seedance (or register your own)
model = get_model("wan")

# Generate multi-turn video from a case
result = model.generate_multi_turn(
    case=case_dict,
    output_path="work_dirs/wan/videos/case_1_combined.mp4",
    data_root="data/",
)

Each turn builds a prompt from the interaction, calls the image-to-video (I2V) API, and extracts the last frame of the result, which becomes the first frame of the next turn.
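
The per-turn chaining can be sketched in a few lines. This is an illustration of the loop only, not the code behind `generate_multi_turn`: `chain_turns`, `build_prompt`, and `call_i2v` are hypothetical names, and the toy stand-ins below replace a real video API so the example runs offline.

```python
def chain_turns(first_frame, interactions, build_prompt, call_i2v):
    """Chain single-turn image-to-video calls into one multi-turn clip.

    build_prompt(interaction) -> str and call_i2v(prompt, image) -> list of
    frames are supplied by the caller (both hypothetical interfaces).
    """
    frames = []
    current = first_frame
    for interaction in interactions:
        prompt = build_prompt(interaction)
        clip = call_i2v(prompt, current)
        if not clip:
            raise RuntimeError(f"no frames returned for prompt: {prompt!r}")
        frames.extend(clip)
        current = clip[-1]  # the last frame conditions the next turn
    return frames


# Toy stand-ins: strings play the role of frames.
def fake_prompt(interaction):
    return f"The camera performs: {interaction}"


def fake_i2v(prompt, image):
    return [f"{image}>{i}" for i in range(3)]


video = chain_turns("f0", ["move forward", "turn left"], fake_prompt, fake_i2v)
print(len(video))  # 6
print(video[-1])   # f0>2>2
```

Because each turn starts from the previous turn's last frame, errors accumulate across turns, which is one reason multi-turn consistency is scored separately from single-clip quality.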

Set API credentials:

export VIDEO_API_URL="https://your-video-api.com"
export VIDEO_API_KEY="your-key"

Camera-conditioned models

The benchmark's navigation actions (W/A/S/D + arrows) are converted to per-turn {move, yaw, pitch} intent and then to a 6-DoF camera trajectory. Subclass CameraConditionedModel and implement one hook — case parsing, action→pose conversion, and video writing are handled for you:

from src.models.camera import CameraConditionedModel

class MyWorldModel(CameraConditionedModel):
    def generate_with_poses(self, image, poses, video_length, **kw):
        # image: first-frame path; poses: {"<latent_idx>": {"extrinsic": 4x4, "K": 3x3}, ...}
        # return: list of `video_length` BGR uint8 frames
        return my_model.infer(image, poses, video_length)

MyWorldModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")

The pose convention (axes, speeds, intrinsics) lives in src/models/camera/poses.py — copy and adapt it to your model; the navigation metric normalises scale, so what matters is matching the per-action intent. Quick look at one case:

python -m src.models.camera.demo --case data/cases/case_1.json   # prints poses + renders a preview
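
To make the action → intent → pose idea concrete, here is a self-contained sketch. Everything specific in it is an assumption made for illustration: the key names, the axis convention (x right, y up, z forward), the step size and turn rate, and the choice to emit camera-to-world matrices. A and D are left out because the README does not say whether they strafe or turn. The authoritative convention is the one in src/models/camera/poses.py.

```python
import math

# Hypothetical key -> intent table (not copied from poses.py).
KEY_INTENT = {
    "W": {"move": 1.0}, "S": {"move": -1.0},
    "left": {"yaw": -1.0}, "right": {"yaw": 1.0},
    "up": {"pitch": 1.0}, "down": {"pitch": -1.0},
}


def turn_intent(keys):
    """Sum the intents of all keys held during one turn."""
    intent = {"move": 0.0, "yaw": 0.0, "pitch": 0.0}
    for key in keys:
        for name, value in KEY_INTENT[key].items():
            intent[name] += value
    return intent


def camera_to_world(x, y, z, yaw, pitch):
    """4x4 pose of a camera at (x, y, z); positive yaw turns right, positive pitch looks up."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    return [
        [cy, -sy * sp, sy * cp, x],
        [0.0, cp, sp, y],
        [-sy, -cy * sp, cy * cp, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def trajectory(turns, frames_per_turn=4, step=0.1, turn_rate=math.radians(5)):
    """Integrate per-turn intents into one pose per frame (speeds are made up)."""
    x = y = z = yaw = pitch = 0.0
    poses = [camera_to_world(x, y, z, yaw, pitch)]
    for keys in turns:
        intent = turn_intent(keys)
        for _ in range(frames_per_turn):
            yaw += intent["yaw"] * turn_rate
            pitch += intent["pitch"] * turn_rate
            # Translate along the heading projected onto the ground plane.
            x += intent["move"] * step * math.sin(yaw)
            z += intent["move"] * step * math.cos(yaw)
            poses.append(camera_to_world(x, y, z, yaw, pitch))
    return poses


poses = trajectory([["W"], ["W", "right"]])
print(len(poses))  # 9
```

If your model expects world-to-camera extrinsics, invert each matrix before passing it on. The absolute step size here is arbitrary; the README notes that the navigation metric normalises scale.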

Note: Camera/action models only cover the 158 navigation cases (cases containing at least one W/A/S/D/arrow action). When generating at scale, pass only those cases — e.g. via generate.py --model your_model --cases <navi_list>.

Action-conditioned models

Two flavours, both fed from the same per-turn navigation plan:

Programmatic controllers (e.g. Matrix-Game-3). Subclass ActionConditionedModel and implement generate_with_actions. Each action carries both raw key tokens and an MG3-style {keyboard, mouse} tensor:

from src.models.action import ActionConditionedModel

class MyActionModel(ActionConditionedModel):
    def generate_with_actions(self, image, actions, video_length, **kw):
        # actions: [{"turn", "tokens", "keyboard", "mouse", "duration"}, ...]
        return my_model.infer(image, actions, video_length)

MyActionModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")

Quick look at one case:

python -m src.models.action.demo --case data/cases/case_1.json   # prints actions + renders a preview

Web products (e.g. Project Genie, Happy Oyster) — no weights/API; driven by browser automation + simulated keystrokes. See src/models/action/web/.

🤖 Claude Code Skills

If you use Claude Code, this repo ships skills that drive the full workflow — just ask in natural language and Claude runs the right commands:

| Skill | Triggers on | What it does |
|---|---|---|
| wbench-generate | "generate kling videos" | Runs generate.py over the dataset → work_dirs/<model>/videos/ |
| wbench-evaluate | "evaluate kling3" | Runs the 4-phase main.py pipeline (precompute → gpu → vlm → report) |
| wbench-submit | "package my model for submission" | Builds the meta.json / turns.json bundle and uploads to HuggingFace |
| genie3 / happy | "run case_5 on genie3" | Browser automation for the web products (see src/models/action/web/) |

Skills live in .claude/skills/ (and src/models/action/web/.claude/skills/) and are auto-discovered when you open the repo in Claude Code.

📋 TODO

  • Text-conditioned model generation (Wan, Kling, Seedance)
  • Homepage with interactive leaderboard
  • Dataset and weights release on HuggingFace
  • Camera-conditioned model generation example
  • Action-conditioned model generation example
  • Hosted submission & evaluation service (submit videos, get scores)
  • ArXiv paper release

📝 Citation

If you find our work useful, please consider citing:

@article{ying2026wbenchcomprehensivemultiturnbenchmark,
  title={WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation},
  author={Ying, Kaining and Hu, Hengrui and Ren, Siyu and Li, Jiamu and Chen, Fengjiao and Wang, Ziwen and Cao, Xuezhi and Cai, Xunliang and Ding, Henghui},
  journal={arXiv preprint arXiv:2605.25874},
  year={2026}
}

🙏 Acknowledgement

This project builds upon the following excellent works:

  • WorldScore — World model evaluation framework
  • VBench — Video quality metrics
  • SAM2 — Segment Anything Model 2 for mask tracking
  • Depth-Anything-V3 — Monocular depth estimation
  • MegaSAM — Camera pose estimation
  • DreamSim — Perceptual similarity metric
  • HPSv3 — Human Preference Score
  • AMT — Frame interpolation for motion smoothness
  • RAFT — Optical flow estimation
  • TransNetV2 — Scene boundary detection
  • ... and many other excellent open-source projects

📧 Contact

Feel free to open an Issue or Pull Request. You can also reach us directly:

  • Kaining Ying: kaining.ying.cv@gmail.com
  • Siyu Ren: rensiyu07@meituan.com

📄 License

Code and data: MIT License. Model weights retain their original licenses.