MBench
June 2, 2026

MBench is a benchmark for the memory capability of long-video world models. Most existing benchmarks reward single-frame quality or short-horizon prompt following; MBench targets a harder question: when a subject leaves the frame and returns, when the camera departs from a viewpoint and comes back, or when an off-screen physical process keeps evolving, can the model maintain a consistent world state? We decompose this into three orthogonal capability axes — Entity / Environment / Causal Consistency — and evaluate them under two complementary settings: MBench-A (action-conditioned, for action-conditioned world models) and MBench-T (text-segment-conditioned, for long-video text continuation models).
This repository ships the full evaluation pipeline: a contract-driven, plugin-based CLI tool mbench, 12 official metric implementations spanning both settings, and an integrated VLM trigger judge for trigger-conditioned scoring.
If you run into problems, please open an issue.
Table of Contents
- Overview
- Updates
- Installation
- Quick Start
- Data Preparation
- Evaluation Dimensions
- Trigger-Conditioned Scoring
- Score Schema
- Usage Reference
- Output Layout
- Citation
- License
Overview
MBench formalises "long-video memory capability" as three orthogonal capability axes:
- Entity Consistency: do a subject's identity, appearance, geometry, and texture stay consistent after it exits the frame and returns?
- Environment Consistency: after the camera departs from a viewpoint and returns, do the 3D spatial layout, lighting, and rendering style recover?
- Causal Consistency: does an off-screen physical process continue plausibly, and does the model faithfully respond to external instructions (camera actions, segment-level captions)?
Each axis decomposes into 4 sub-dimensions for a total of 12 evaluation dimensions. Each dimension is computed by a dedicated automatic metric and normalised to the [0, 1] range. Where the setting calls for it, the metric is gated by a Vision-Language Model trigger so that samples that never enter the memory challenge (e.g. completely static videos) cannot inflate model rankings. See the Trigger-Conditioned Scoring section below.
The two settings:
| Setting | dataset_id | Conditioning | Typical models |
|---|---|---|---|
| MBench-A | mbencha | Camera action + a single caption sentence | hy_worldplay, matrix_game_3, yume, infinite_world, lingbot_world, matrix_game_2 |
| MBench-T | mbencht | Five-segment text prompts (condition_id=text) | cosmos, longcat, skyreels, longlive, helios, memflow, self_forcing, causal_forcing |
Both settings share the same samples/{subset}/{sample_id}/ directory layout; at run time only the metrics under the matching prefix are dispatched.
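As a rough illustration of the prefix rule, and not the CLI's actual dispatch code, the sketch below filters a list of registration names by dataset_id. The helper name `metrics_for_dataset` is hypothetical; the metric names come from the Evaluation Dimensions table.

```python
def metrics_for_dataset(metric_names, dataset_id):
    """Keep only the metrics whose registration name starts with '<dataset_id>.'."""
    prefix = dataset_id + "."
    return [name for name in metric_names if name.startswith(prefix)]


requested = [
    "mbencha.environment.spatial_epipolar",
    "mbencht.environment.spatial_epipolar",
    "mbencha.causal.camera_interaction",
]

# For an MBench-A dataset only the two mbencha.* entries survive.
print(metrics_for_dataset(requested, "mbencha"))
```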
Updates
- 2026-06 — Initial public release: 12 metric implementations, MBench-A / MBench-T data adapter, canonical score schema, integrated VLM trigger judge, contract-validated CLI.
Installation
Requirements: Python ≥ 3.10.
git clone https://github.com/study-overflow/MBench.git
cd MBench
pip install -e .
Additional dependencies:
| Metric family | Required extras |
|---|---|
| Spatial epipolar / reprojection, object geometry | numpy, opencv-python, plus a precomputed camera-pose artifact (you can produce it with DepthAnything v3) |
| Rendering style | torch, torchvision, lpips, transformers (DINOv2), VGG-19 weights |
| Rendering lighting | opencv-python |
| Entity human (identity / appearance) | insightface (Buffalo_L), torch, transformers (DINOv2) |
| Entity object (texture / geometry) | groundingdino, segment-anything-2, transformers (DINOv2) |
| Causal text / action interaction | open_clip_torch |
| Causal self-evolution (state / correctness) | An OpenAI-compatible VLM endpoint |
A typical full setup:
pip install torch torchvision torchaudio # match your CUDA
pip install opencv-python lpips transformers open_clip_torch insightface
pip install "segment-anything-2 @ git+https://github.com/facebookresearch/segment-anything-2.git"
pip install "groundingdino @ git+https://github.com/IDEA-Research/GroundingDINO.git"
Verify the install:
mbench list-metrics # should print 23 registered metrics (12 for MBench-A, 11 for MBench-T)
Quick Start
Assuming your data is already laid out under data/MBench-A-Setup/ (see the next section):
# 1. Validate that the metric's required fields are present on a few samples.
mbench validate data/MBench-A-Setup \
--metrics mbencha.environment.spatial_epipolar \
--models my_model --subsets environment --limit 4
# 2. Run a real evaluation: one metric, one model.
mbench eval data/MBench-A-Setup \
--metrics mbencha.environment.spatial_epipolar \
--models my_model --subsets environment \
--output runs/my_first_run
For an MBench-T run with a VLM trigger:
export MBENCH_VLM_BASE_URL="https://your-openai-compat-endpoint.example.com/v1"
export MBENCH_VLM_API_KEY="sk-..."
export MBENCH_VLM_MODEL="gpt-4o"
mbench eval data/MBench-T-Setup \
--metrics mbencht.entity.human_identity_consistency \
--models my_model --subsets human \
--vlm-judge openai-compatible \
--output runs/my_t_run_with_trigger
Without --vlm-judge, the same command still runs in raw-consistency mode (no API call); the metric still emits score, just without trigger gating.
Data Preparation
To evaluate your own model, lay out its outputs as follows:
data/MBench-{A|T}-Setup/
├── dataset.yaml
├── samples/{subset}/{sample_id}/
│   ├── sample.json                       # metadata (object_anchor, caption_segments, …)
│   └── reference.png / source_video.mp4  # optional ground-truth assets
└── models/{model_id}/
    ├── samples.jsonl                     # one line per (sample × condition) generation
    ├── outputs/{subset}/{sample_id}/{condition_id}/video.mp4
    └── artifacts/{subset}/{sample_id}/{condition_id}/da3/results.npz
                                          # only required for spatial / object-geometry / causal-action metrics
The four subsets (human, object, environment, causal) match the metric subsets one-to-one. Currently condition_id is {action}_{length} for MBench-A (e.g. left_then_right_25s, forward_then_backward_10s) and the literal string text for MBench-T. You are free to add new conditions of your own (e.g. new camera-trajectory conditions).
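To make the condition_id convention concrete, here is a minimal sketch that splits an MBench-A condition_id into its action and length parts. The function name `split_condition_id` is hypothetical, and the sketch assumes the length is always the last underscore-separated token; custom conditions that break this assumption would need their own parser.

```python
def split_condition_id(condition_id):
    """Split an MBench-A condition_id of the form '{action}_{length}'.

    Assumption: the length is the final underscore-separated token, so the
    action itself may contain underscores (e.g. 'left_then_right').
    """
    if condition_id == "text":  # MBench-T uses the literal string 'text'
        return None, None
    action, sep, length = condition_id.rpartition("_")
    if not sep or not action or not length:
        raise ValueError(f"not of the form '{{action}}_{{length}}': {condition_id!r}")
    return action, length


print(split_condition_id("left_then_right_25s"))
print(split_condition_id("forward_then_backward_10s"))
```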
A minimal dataset.yaml:
dataset_id: mbencha # or mbencht
dataset_name: My MBench-A Setup
version: '1.0'
path_mode: relative_to_dataset_root
subsets:
  environment: {description: Environment / spatial memory.}
  human: {description: Human entity identity memory.}
  object: {description: Object entity identity memory.}
  causal: {description: Causal / interaction memory.}
condition_id:
  pattern: "{action}_{length}"
paths:
  output_video: models/{model_id}/outputs/{subset}/{sample_id}/{condition_id}/video.mp4
  artifact_da3: models/{model_id}/artifacts/{subset}/{sample_id}/{condition_id}/da3/results.npz
Each generated rollout gets one line in samples.jsonl:
{"item_id": "human:my_sample_001:left_then_right_25s", "dataset_id": "mbencha", "subset": "human", "sample_id": "my_sample_001", "condition_id": "left_then_right_25s", "model_id": "my_model", "media": {"videos": [{"path": "outputs/human/my_sample_001/left_then_right_25s/video.mp4", "role": "generated"}]}, "artifacts": {"da3": {"path": "artifacts/human/my_sample_001/left_then_right_25s/da3/results.npz"}}}
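If you generate many rollouts, writing these lines by hand gets tedious. The sketch below builds one record with the same keys and path pattern as the example line above; the helper name `make_samples_line` is hypothetical and not part of the mbench package.

```python
import json


def make_samples_line(subset, sample_id, condition_id, model_id, dataset_id="mbencha"):
    """Build one samples.jsonl record mirroring the example line above."""
    rel = f"{subset}/{sample_id}/{condition_id}"
    record = {
        "item_id": f"{subset}:{sample_id}:{condition_id}",
        "dataset_id": dataset_id,
        "subset": subset,
        "sample_id": sample_id,
        "condition_id": condition_id,
        "model_id": model_id,
        "media": {"videos": [{"path": f"outputs/{rel}/video.mp4", "role": "generated"}]},
        # The da3 artifact is only needed by spatial / object-geometry / causal-action metrics.
        "artifacts": {"da3": {"path": f"artifacts/{rel}/da3/results.npz"}},
    }
    return json.dumps(record)


line = make_samples_line("human", "my_sample_001", "left_then_right_25s", "my_model")
print(line)
```

Appending each returned string plus a newline to models/{model_id}/samples.jsonl gives one line per (sample × condition) generation.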
sample.json required fields by subset:
- object: the T side requires metadata.object_card and the A side requires metadata.object_anchor, each describing the target object.
- causal: the T side requires metadata.caption_segments; the A side requires annotations.action or metadata.caption.
- environment / human: only media.videos is required; segment-level captions help the T-side trigger judge.
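The per-subset requirements can be sketched as a small pre-flight check. This is an illustrative helper, not how mbench validate is implemented: the names `get_path`, `REQUIRED`, and `missing_fields` are hypothetical, and the field paths are the ones listed above.

```python
def get_path(record, dotted):
    """Follow a dotted key path through nested dicts; return None if any key is missing."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Each requirement is a list of alternatives; one present alternative satisfies it.
REQUIRED = {
    ("object", "mbencht"): [["metadata.object_card"]],
    ("object", "mbencha"): [["metadata.object_anchor"]],
    ("causal", "mbencht"): [["metadata.caption_segments"]],
    ("causal", "mbencha"): [["annotations.action", "metadata.caption"]],
}


def missing_fields(sample, subset, dataset_id):
    """Return descriptions of unmet requirements (an empty list means OK)."""
    if subset in ("environment", "human"):
        groups = [["media.videos"]]
    else:
        groups = REQUIRED[(subset, dataset_id)]
    return [" or ".join(g) for g in groups if all(get_path(sample, p) is None for p in g)]


print(missing_fields({"metadata": {"caption": "a ball rolls away"}}, "causal", "mbencha"))
print(missing_fields({}, "object", "mbencht"))
```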
DA3 camera-pose artifacts are produced externally — see DA3; MBench only consumes the resulting results.npz.
Evaluation Dimensions
The 12 dimensions and their registration names:
| Axis | Sub-dimension | MBench-A | MBench-T |
|---|---|---|---|
| Entity | Human Identity | mbencha.entity.human_identity_consistency | mbencht.entity.human_identity_consistency |
| | Human Appearance | mbencha.entity.human_appearance_consistency | mbencht.entity.human_appearance_consistency |
| | Object Texture | mbencha.entity.object_texture_consistency | mbencht.entity.object_texture_consistency |
| | Object Geometry | mbencha.entity.object_geometry_consistency | mbencht.entity.object_geometry_consistency |
| Environment | Spatial Epipolar | mbencha.environment.spatial_epipolar | mbencht.environment.spatial_epipolar |
| | Spatial Reprojection | mbencha.environment.spatial_reprojection | mbencht.environment.spatial_reprojection |
| | Rendering Style | mbencha.environment.rendering_style | mbencht.environment.rendering_style |
| | Rendering Lighting | mbencha.environment.rendering_lighting | mbencht.environment.rendering_lighting |
| Causal | Action Interaction | mbencha.causal.camera_interaction | – |
| | Text Interaction | mbencha.causal.prompt_interaction | mbencht.causal.prompt_interaction |
| | Self-Evolution: State | mbencha.causal.state_progress | mbencht.causal.state_progress |
| | Self-Evolution: Correctness | mbencha.causal.progress_correctness | mbencht.causal.progress_correctness |
mbench list-metrics prints the live registry with one-line descriptions per metric.
Trigger-Conditioned Scoring
A model can game a memory benchmark by generating static, overly conservative content — never engaging with the challenge — and walk away with an inflated consistency score. MBench prevents this by decoupling reliability from coverage:
- Memory Reliability: average consistency, computed only on samples that actually triggered the memory challenge.
- Trigger Coverage: fraction of samples that triggered the memory challenge.
- M-Score: the harmonic mean of the two, $\text{M-Score} = \dfrac{2 \cdot \text{Reliability} \cdot \text{Coverage}}{\text{Reliability} + \text{Coverage}}$.
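The three quantities above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's aggregator: the function name `m_score` is hypothetical, records are assumed to carry the triggered and score keys from the score schema, and returning 0.0 when nothing triggers is a convention chosen here.

```python
def m_score(samples):
    """Compute (reliability, coverage, m_score) from per-sample records.

    Each record is a dict with 'triggered' (0.0 or 1.0) and, when triggered,
    a 'score' in [0, 1]. Assumption: with no samples, or when no sample
    triggers, all three quantities are reported as 0.0.
    """
    if not samples:
        return 0.0, 0.0, 0.0
    triggered = [s["score"] for s in samples if s.get("triggered") == 1.0]
    coverage = len(triggered) / len(samples)
    reliability = sum(triggered) / len(triggered) if triggered else 0.0
    denom = reliability + coverage
    m = 2 * reliability * coverage / denom if denom > 0 else 0.0
    return reliability, coverage, m


samples = [
    {"score": 0.8, "triggered": 1.0},
    {"score": 0.6, "triggered": 1.0},
    {"triggered": 0.0},
    {"triggered": 0.0},
]
print(m_score(samples))
```

Because the harmonic mean is dominated by the smaller term, a model that scores 0.95 on the one sample in ten it triggers ends up with an M-Score of roughly 0.18, well below its raw consistency.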
Where the trigger fires:
| Setting | Trigger source | Default behaviour |
|---|---|---|
| MBench-T | A VLM judge over 8 uniformly sampled frames + caption segments | OFF; opt in by passing --vlm-judge openai-compatible |
| MBench-A | Not used (action-conditioned rollouts deterministically actuate the memory challenge — paper §App) | Always OFF, even if --vlm-judge is passed |
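For intuition about what "8 uniformly sampled frames" might look like, the sketch below picks evenly spaced frame indices from a clip. The helper name `uniform_frame_indices` and the endpoint-inclusive spacing are assumptions made for illustration; the judge's actual sampling scheme may differ.

```python
def uniform_frame_indices(total_frames, n_frames=8):
    """Pick n_frames indices spread evenly over [0, total_frames - 1], endpoints included.

    Assumption: endpoint-inclusive spacing; the judge's actual scheme may differ.
    """
    if total_frames <= 0:
        return []
    if n_frames >= total_frames:
        return list(range(total_frames))  # short clip: use every frame
    if n_frames == 1:
        return [0]
    step = (total_frames - 1) / (n_frames - 1)
    return [round(i * step) for i in range(n_frames)]


print(uniform_frame_indices(250))
```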
Configure the VLM via CLI flags or environment variables:
export MBENCH_VLM_BASE_URL="..." # OpenAI-compatible /v1 endpoint
export MBENCH_VLM_API_KEY="..."
export MBENCH_VLM_MODEL="..." # any vision-capable model id
# or pass per-run:
mbench eval ... --vlm-judge openai-compatible \
--vlm-base-url "$MBENCH_VLM_BASE_URL" \
--vlm-api-key "$MBENCH_VLM_API_KEY" \
--vlm-model "$MBENCH_VLM_MODEL"
For ablation runs, pass --ignore-trigger to force the trigger gate off even when --vlm-judge is set.
Score Schema
Every metric, in either setting and whatever its internal computation, emits the same score shape at the unit, item, and summary layers:
| Key | Range | When present | Meaning |
|---|---|---|---|
| score | [0, 1] | Always (when valid) | The normalised score. |
| score_raw | native unit | Distance- or VLM-1-5-based metrics | Raw signal before normalisation; omitted when raw == score. |
| triggered | {0.0, 1.0} | T-side metrics only | Whether the VLM trigger fired (0 → sample skipped from the reliability average). |
| trigger_score | [0, 1] | T-side metrics only | Normalised VLM trigger confidence. |
Examples (after mbench eval):
// mbencha.environment.spatial_epipolar units.jsonl line
{"unit_id": "...__f12_f48", "scores": {"score": 0.5523, "score_raw": 2.964}}
// mbencht.causal.progress_correctness summary.json group
{"score": 0.74, "score_raw": 3.7, "triggered": 1.0, "trigger_score": 0.6}
// mbencht.* untriggered sample
{"triggered": 0.0, "trigger_score": 0.2}
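When post-processing results, it can help to sanity-check records against the ranges in the table. The following is a minimal sketch rather than the package's own validator; the function name `check_score_record` and the `gated` flag (true when trigger gating was active for the run) are hypothetical.

```python
def check_score_record(scores, gated=False):
    """Return a list of problems with a canonical score dict (empty list means OK)."""
    problems = []
    # 'score' may be absent, e.g. for an untriggered sample.
    if "score" in scores and not 0.0 <= scores["score"] <= 1.0:
        problems.append("score outside [0, 1]")
    if gated:
        if scores.get("triggered") not in (0.0, 1.0):
            problems.append("triggered must be 0.0 or 1.0")
        if not 0.0 <= scores.get("trigger_score", -1.0) <= 1.0:
            problems.append("trigger_score missing or outside [0, 1]")
    return problems


# The three example records shown above all pass.
print(check_score_record({"score": 0.5523, "score_raw": 2.964}))
print(check_score_record({"score": 0.74, "score_raw": 3.7, "triggered": 1.0, "trigger_score": 0.6}, gated=True))
print(check_score_record({"triggered": 0.0, "trigger_score": 0.2}, gated=True))
```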
Usage Reference
List registered metrics
mbench list-metrics
Validate (strict)
mbench validate <source> --metrics <csv> [--models …] [--subsets …] [--limit N]
Any contract violation exits non-zero.
Eval (lenient — skips bad items, runs the rest)
mbench eval <source> --metrics <csv> --output runs/<run_name>
    [--models <csv>] [--subsets <csv>] [--conditions <csv>]
    [--sample-ids <csv>] [--limit N]
    [--vlm-judge openai-compatible
        --vlm-base-url <URL> --vlm-api-key <KEY> --vlm-model <ID>
        --vlm-n-frames 8 --vlm-max-retries 3]
    [--workers N]          # parallel across items
    [--ignore-trigger]     # bypass VLM trigger gate (ablation only)
    [--no-progress-bar] [--quiet] [--no-log-file]
Prepare auxiliary artifacts (never auto-invoked during eval)
mbench prepare <source> --producer <name> --output artifacts/
Python API
from mbench.core.pipeline import run_eval
from mbench.core.registry import metric_registry, adapter_registry, aggregator_registry
import mbench.cli; mbench.cli._bootstrap() # register all built-in plugins
adapter = adapter_registry.get("mbench-dir")
items = adapter.load("data/MBench-A-Setup", subsets=["environment"], limit=4)
metric = metric_registry.get("mbencha.environment.spatial_epipolar")
aggregator = aggregator_registry.get("mean")
summary = run_eval(items, [metric], aggregator, "runs/python_api_demo")
Output Layout
runs/<run_name>/
├── items.jsonl # one line per evaluated sample (item_id, model_id)
├── eval.log # per-item progress log (timestamps, scores, errors)
└── metrics/<metric_name>/
    ├── units.jsonl # one UnitResult per evaluation unit (frame pair, track, segment, …)
    ├── items.jsonl # one ItemResult per sample (aggregated from units)
    └── summary.json # SummaryResult grouped by model_id
UnitResult.metadata['raw_scores'] preserves the metric's original internal fields (the ones dropped by the canonical projection), useful for after-the-fact debugging or re-deriving alternative score keys offline.
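For quick offline inspection, a units.jsonl file can be read line by line. The sketch below averages the score key over the lines it is given; it assumes each line has the "scores" object shown in the Score Schema examples, and the plain mean is only an illustration, not necessarily how MBench aggregates units into items. The function name `mean_unit_score` is hypothetical.

```python
import json


def mean_unit_score(lines):
    """Average the 'score' field over an iterable of units.jsonl lines.

    Lines without a 'score' are skipped. Returns None if no line carries one.
    """
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        scores = json.loads(line).get("scores", {})
        if "score" in scores:
            values.append(scores["score"])
    return sum(values) / len(values) if values else None


# Usage with an open file handle (run name and metric are examples):
# path = "runs/my_first_run/metrics/mbencha.environment.spatial_epipolar/units.jsonl"
# with open(path, encoding="utf-8") as fh:
#     print(mean_unit_score(fh))
demo = [
    '{"unit_id": "a", "scores": {"score": 0.5, "score_raw": 3.0}}',
    '{"unit_id": "b", "scores": {"score": 0.7}}',
]
print(mean_unit_score(demo))
```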
Citation
If MBench helps your research, please cite us:
@article{zhang2026mbench,
title = {MBench: A Comprehensive Benchmark on Memory Capability for Video World Models},
author = {Zhang, Shengjun and Zhang, Zhang and Huang, Simin and Tang, Zhenyu and Wang, Hanyang and Dai, Chensheng and Chen, Min and Li, Yifan and Li, Yuxin and Chen, Yingjie and Liu, Hao and Li, Chen and Duan, Yueqi},
year = {2026},
eprint = {2606.00793},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2606.00793}
}
License
The code is released under the MIT License.