MBench

June 2, 2026

MBench is a benchmark for the memory capability of long-video world models. Most existing benchmarks reward single-frame quality or short-horizon prompt following. MBench targets a harder question: when a subject leaves the frame and returns, when the camera departs from a viewpoint and comes back, or when an off-screen physical process keeps evolving, can the model maintain a consistent world state? We decompose this into three orthogonal capability axes (Entity, Environment, and Causal Consistency) and evaluate them under two complementary settings: MBench-A, for action-conditioned world models, and MBench-T, for long-video text continuation models conditioned on text segments.

This repository ships the full evaluation pipeline: a contract-driven, plugin-based CLI tool mbench, 12 official metric implementations spanning both settings, and an integrated VLM trigger judge for trigger-conditioned scoring.

If you run into problems, please open an issue.

Overview

MBench formalises "long-video memory capability" as three orthogonal capability axes:

  • Entity Consistency — does a subject's identity, appearance, geometry, and texture stay consistent after it exits the frame and returns?
  • Environment Consistency — after the camera departs from a viewpoint and returns, does the 3D spatial layout, lighting, and rendering style recover?
  • Causal Consistency — does an off-screen physical process continue plausibly? Does the model faithfully respond to external instructions (camera actions, segment-level captions)?

Each axis decomposes into 4 sub-dimensions for a total of 12 evaluation dimensions. Each dimension is computed by a dedicated automatic metric and normalised to the [0, 1] range. Where the setting calls for it, the metric is gated by a Vision-Language Model trigger so that samples that never enter the memory challenge (e.g. completely static videos) cannot inflate model rankings. See the Trigger-Conditioned Scoring section below.

The two settings:

| Setting | dataset_id | Conditioning | Typical models |
| --- | --- | --- | --- |
| MBench-A | mbencha | Camera action + a single caption sentence | hy_worldplay, matrix_game_3, yume, infinite_world, lingbot_world, matrix_game_2 |
| MBench-T | mbencht | Five-segment text prompts (condition_id=text) | cosmos, longcat, skyreels, longlive, helios, memflow, self_forcing, causal_forcing |

Both settings share the same samples/{subset}/{sample_id}/ directory layout; at run time only the metrics under the matching prefix are dispatched.

Updates

  • 2026-06 — Initial public release: 12 metric implementations, MBench-A / MBench-T data adapter, canonical score schema, integrated VLM trigger judge, contract-validated CLI.

Installation

Requirements: Python ≥ 3.10.

git clone https://github.com/study-overflow/MBench.git
cd MBench
pip install -e .

Additional dependencies:

| Metric family | Required extras |
| --- | --- |
| Spatial epipolar / reprojection, object geometry | numpy, opencv-python, plus a precomputed camera-pose artifact (you can produce it with DepthAnything v3) |
| Rendering style | torch, torchvision, lpips, transformers (DINOv2), VGG-19 weights |
| Rendering lighting | opencv-python |
| Entity human (identity / appearance) | insightface (Buffalo_L), torch, transformers (DINOv2) |
| Entity object (texture / geometry) | groundingdino, segment-anything-2, transformers (DINOv2) |
| Causal text / action interaction | open_clip_torch |
| Causal self-evolution (state / correctness) | An OpenAI-compatible VLM endpoint |

A typical full setup:

pip install torch torchvision torchaudio   # match your CUDA
pip install opencv-python lpips transformers open_clip_torch insightface
pip install "segment-anything-2 @ git+https://github.com/facebookresearch/segment-anything-2.git"
pip install "groundingdino @ git+https://github.com/IDEA-Research/GroundingDINO.git"

Verify the install:

mbench list-metrics    # should print 23 registered metrics (12 for MBench-A, 11 for MBench-T)

Quick Start

Assuming your data is already laid out under data/MBench-A-Setup/ (see the next section):

# 1. Validate that the metric's required fields are present on a few samples.
mbench validate data/MBench-A-Setup \
    --metrics mbencha.environment.spatial_epipolar \
    --models my_model --subsets environment --limit 4

# 2. Run a real evaluation: one metric, one model.
mbench eval data/MBench-A-Setup \
    --metrics mbencha.environment.spatial_epipolar \
    --models my_model --subsets environment \
    --output runs/my_first_run

For an MBench-T run with a VLM trigger:

export MBENCH_VLM_BASE_URL="https://your-openai-compat-endpoint.example.com/v1"
export MBENCH_VLM_API_KEY="sk-..."
export MBENCH_VLM_MODEL="gpt-4o"

mbench eval data/MBench-T-Setup \
    --metrics mbencht.entity.human_identity_consistency \
    --models my_model --subsets human \
    --vlm-judge openai-compatible \
    --output runs/my_t_run_with_trigger

Without --vlm-judge, the same command runs in raw-consistency mode and makes no API call: each metric still emits score, but no trigger gating is applied.

Data Preparation

To evaluate your own model, lay out its generated outputs as follows:

data/MBench-{A|T}-Setup/
├── dataset.yaml
├── samples/{subset}/{sample_id}/
│   ├── sample.json                   # metadata (object_anchor, caption_segments, …)
│   └── reference.png / source_video.mp4         # optional ground-truth assets
└── models/{model_id}/
    ├── samples.jsonl                 # one line per (sample × condition) generation
    ├── outputs/{subset}/{sample_id}/{condition_id}/video.mp4
    └── artifacts/{subset}/{sample_id}/{condition_id}/da3/results.npz
                                      # only required for spatial / object-geometry / causal-action metrics

The four subsets (human, object, environment, causal) match the metric subsets one-to-one. Currently condition_id is {action}_{length} for MBench-A (e.g. left_then_right_25s, forward_then_backward_10s) and the literal string text for MBench-T. You are free to add new conditions of your own (e.g. new camera-trajectory conditions).

A minimal dataset.yaml:

dataset_id: mbencha           # or mbencht
dataset_name: My MBench-A Setup
version: '1.0'
path_mode: relative_to_dataset_root

subsets:
  environment: {description: Environment / spatial memory.}
  human:       {description: Human entity identity memory.}
  object:      {description: Object entity identity memory.}
  causal:      {description: Causal / interaction memory.}

condition_id:
  pattern: "{action}_{length}"

paths:
  output_video: models/{model_id}/outputs/{subset}/{sample_id}/{condition_id}/video.mp4
  artifact_da3: models/{model_id}/artifacts/{subset}/{sample_id}/{condition_id}/da3/results.npz
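
The paths templates above can be expanded for one rollout with a few lines of Python. This is a minimal sketch that assumes the placeholders behave like Python str.format fields; how MBench resolves them internally is not documented here, and the expand helper is hypothetical.

```python
from pathlib import PurePosixPath

# Templates copied from the `paths` block of the dataset.yaml above.
OUTPUT_VIDEO = "models/{model_id}/outputs/{subset}/{sample_id}/{condition_id}/video.mp4"
ARTIFACT_DA3 = "models/{model_id}/artifacts/{subset}/{sample_id}/{condition_id}/da3/results.npz"


def expand(template: str, **fields: str) -> PurePosixPath:
    """Fill a path template (hypothetical helper, not part of mbench).

    Raises KeyError if a placeholder has no matching keyword argument.
    """
    return PurePosixPath(template.format(**fields))


fields = dict(
    model_id="my_model",
    subset="environment",
    sample_id="my_sample_001",
    condition_id="left_then_right_25s",
)
video_path = expand(OUTPUT_VIDEO, **fields)
da3_path = expand(ARTIFACT_DA3, **fields)
print(video_path)  # relative to the dataset root (path_mode: relative_to_dataset_root)
print(da3_path)
```

Checking that both paths exist before running mbench validate is a cheap way to catch layout mistakes early.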

samples.jsonl holds one line per generated rollout:

{"item_id": "human:my_sample_001:left_then_right_25s", "dataset_id": "mbencha", "subset": "human", "sample_id": "my_sample_001", "condition_id": "left_then_right_25s", "model_id": "my_model", "media": {"videos": [{"path": "outputs/human/my_sample_001/left_then_right_25s/video.mp4", "role": "generated"}]}, "artifacts": {"da3": {"path": "artifacts/human/my_sample_001/left_then_right_25s/da3/results.npz"}}}
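
Lines of this shape can be generated rather than written by hand. The sketch below mirrors the example record field by field. The make_record helper is hypothetical, and the item_id pattern subset:sample_id:condition_id is inferred from that single example rather than from a documented rule.

```python
import json


def make_record(dataset_id, model_id, subset, sample_id, condition_id, with_da3=True):
    """Build one samples.jsonl record (hypothetical helper, not part of mbench)."""
    rel = f"{subset}/{sample_id}/{condition_id}"
    record = {
        # Assumed pattern, inferred from the example line above.
        "item_id": f"{subset}:{sample_id}:{condition_id}",
        "dataset_id": dataset_id,
        "subset": subset,
        "sample_id": sample_id,
        "condition_id": condition_id,
        "model_id": model_id,
        # Paths follow the example: relative to models/{model_id}/.
        "media": {"videos": [{"path": f"outputs/{rel}/video.mp4", "role": "generated"}]},
    }
    if with_da3:
        # Only needed for spatial / object-geometry / causal-action metrics.
        record["artifacts"] = {"da3": {"path": f"artifacts/{rel}/da3/results.npz"}}
    return record


record = make_record("mbencha", "my_model", "human", "my_sample_001", "left_then_right_25s")
line = json.dumps(record)  # one JSON object per line, no trailing comma
print(line)
```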

sample.json required fields by subset:

  • object: MBench-T requires metadata.object_card and MBench-A requires metadata.object_anchor; both describe the target object.
  • causal: MBench-T requires metadata.caption_segments; MBench-A requires annotations.action or metadata.caption.
  • environment / human: only media.videos is required; segment-level captions help the MBench-T trigger judge.
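
The rules above can be checked on a loaded sample.json before calling mbench validate. This is a minimal sketch, not the contract checker that ships with mbench: the REQUIRED table and both helpers are hypothetical and encode only the fields listed above.

```python
def get_path(doc, dotted):
    """Return the value at a dotted key path, or None if any key is missing."""
    cur = doc
    for key in dotted.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


# (dataset_id, subset) -> list of groups; one field per group must be present.
REQUIRED = {
    ("mbencht", "object"): [["metadata.object_card"]],
    ("mbencha", "object"): [["metadata.object_anchor"]],
    ("mbencht", "causal"): [["metadata.caption_segments"]],
    ("mbencha", "causal"): [["annotations.action", "metadata.caption"]],
}
DEFAULT = [["media.videos"]]  # environment / human subsets


def missing_fields(sample, dataset_id, subset):
    """List the requirement groups that the sample does not satisfy."""
    groups = REQUIRED.get((dataset_id, subset), DEFAULT)
    return [
        " or ".join(group)
        for group in groups
        if all(get_path(sample, path) is None for path in group)
    ]


sample = {"metadata": {"caption": "A red ball rolls behind the sofa."}}
print(missing_fields(sample, "mbencha", "causal"))  # caption satisfies the either/or rule
print(missing_fields(sample, "mbencht", "causal"))  # caption_segments is absent
```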

DA3 camera-pose artifacts are produced externally with DepthAnything v3 (DA3); MBench only consumes the resulting results.npz.

Evaluation Dimensions

The 12 dimensions and their registration names:

| Axis | Sub-dimension | MBench-A | MBench-T |
| --- | --- | --- | --- |
| Entity | Human Identity | mbencha.entity.human_identity_consistency | mbencht.entity.human_identity_consistency |
| Entity | Human Appearance | mbencha.entity.human_appearance_consistency | mbencht.entity.human_appearance_consistency |
| Entity | Object Texture | mbencha.entity.object_texture_consistency | mbencht.entity.object_texture_consistency |
| Entity | Object Geometry | mbencha.entity.object_geometry_consistency | mbencht.entity.object_geometry_consistency |
| Environment | Spatial Epipolar | mbencha.environment.spatial_epipolar | mbencht.environment.spatial_epipolar |
| Environment | Spatial Reprojection | mbencha.environment.spatial_reprojection | mbencht.environment.spatial_reprojection |
| Environment | Rendering Style | mbencha.environment.rendering_style | mbencht.environment.rendering_style |
| Environment | Rendering Lighting | mbencha.environment.rendering_lighting | mbencht.environment.rendering_lighting |
| Causal | Action Interaction | mbencha.causal.camera_interaction | - |
| Causal | Text Interaction | mbencha.causal.prompt_interaction | mbencht.causal.prompt_interaction |
| Causal | Self-Evolution: State | mbencha.causal.state_progress | mbencht.causal.state_progress |
| Causal | Self-Evolution: Correctness | mbencha.causal.progress_correctness | mbencht.causal.progress_correctness |

mbench list-metrics prints the live registry with one-line descriptions per metric.
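
Registration names follow the pattern dataset_id.axis.metric, so the comma-separated value expected by --metrics can be assembled programmatically. The sketch below builds it for the four Environment metrics; metrics_csv is a hypothetical helper, and the metric names are copied from the table above.

```python
# Short metric names for the Environment axis, as listed in the table above.
ENVIRONMENT = [
    "spatial_epipolar",
    "spatial_reprojection",
    "rendering_style",
    "rendering_lighting",
]


def metrics_csv(dataset_id, axis, names):
    """Join full registration names into a value for --metrics (hypothetical helper)."""
    if dataset_id not in {"mbencha", "mbencht"}:
        raise ValueError(f"unknown dataset_id: {dataset_id!r}")
    return ",".join(f"{dataset_id}.{axis}.{name}" for name in names)


csv_value = metrics_csv("mbencht", "environment", ENVIRONMENT)
print(csv_value)
```

The printed string can be passed directly as the --metrics argument of mbench validate or mbench eval.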

Trigger-Conditioned Scoring

A model can game a memory benchmark by generating static, overly conservative content — never engaging with the challenge — and walk away with an inflated consistency score. MBench prevents this by decoupling reliability from coverage:

  • Memory Reliability $S^{\text{rel}}$: average consistency, computed only on samples that actually triggered the memory challenge.
  • Trigger Coverage $C^{\text{trig}}$: fraction of samples that triggered the memory challenge.
  • M-Score: the harmonic mean of the two, $\text{M-Score} = 2 \cdot S^{\text{rel}} \cdot C^{\text{trig}} / (S^{\text{rel}} + C^{\text{trig}})$.
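
A short numeric sketch shows why the harmonic mean penalises low coverage. The m_score function is a direct transcription of the formula above; returning 0.0 when both inputs are zero is an assumption, since the degenerate case is not specified here.

```python
def m_score(s_rel: float, c_trig: float) -> float:
    """Harmonic mean of Memory Reliability and Trigger Coverage."""
    if s_rel + c_trig == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    return 2 * s_rel * c_trig / (s_rel + c_trig)


# A model that engages with the challenge half the time and is fairly reliable.
balanced = m_score(0.80, 0.50)
# A near-static model: very consistent, but it rarely triggers the challenge.
static = m_score(0.95, 0.10)

print(round(balanced, 3))  # 0.615
print(round(static, 3))    # 0.181
```

Despite its higher reliability, the near-static model ends up with a much lower M-Score, which is the behaviour the decoupling is meant to produce.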

Where the trigger fires:

| Setting | Trigger source | Default behaviour |
| --- | --- | --- |
| MBench-T | A VLM judge over 8 uniformly sampled frames + caption segments | OFF; opt in by passing --vlm-judge openai-compatible |
| MBench-A | Not used (action-conditioned rollouts deterministically actuate the memory challenge; see paper §App) | Always OFF, even if --vlm-judge is passed |

Configure the VLM via CLI flags or environment variables:

export MBENCH_VLM_BASE_URL="..."     # OpenAI-compatible /v1 endpoint
export MBENCH_VLM_API_KEY="..."
export MBENCH_VLM_MODEL="..."        # any vision-capable model id

# or pass per-run:
mbench eval ... --vlm-judge openai-compatible \
    --vlm-base-url "$MBENCH_VLM_BASE_URL" \
    --vlm-api-key  "$MBENCH_VLM_API_KEY" \
    --vlm-model    "$MBENCH_VLM_MODEL"

For ablation runs, pass --ignore-trigger to force the trigger gate off even when --vlm-judge is set.

Score Schema

Regardless of the setting (A or T) and of the internal computation, every metric emits the same score shape at the unit, item, and summary layers:

| Key | Range | When present | Meaning |
| --- | --- | --- | --- |
| score | [0, 1] | Always (when valid) | The normalised score. |
| score_raw | native unit | Distance-based or VLM (1-5 scale) metrics | Raw signal before normalisation; omitted when raw == score. |
| triggered | {0.0, 1.0} | T-side metrics only | Whether the VLM trigger fired (0 → sample skipped from the reliability average). |
| trigger_score | [0, 1] | T-side metrics only | Normalised VLM trigger confidence. |

Examples (after mbench eval):

// mbencha.environment.spatial_epipolar  units.jsonl line
{"unit_id": "...__f12_f48", "scores": {"score": 0.5523, "score_raw": 2.964}}

// mbencht.causal.progress_correctness  summary.json group
{"score": 0.74, "score_raw": 3.7, "triggered": 1.0, "trigger_score": 0.6}

// mbencht.* untriggered sample
{"triggered": 0.0, "trigger_score": 0.2}
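
Records of this shape are enough to recompute Memory Reliability and Trigger Coverage offline. The sketch below assumes T-side per-item records carry the keys shown above, one JSON object per line. The reliability_and_coverage helper and the third record are illustrative, and this is not necessarily how mbench aggregates internally.

```python
import json


def reliability_and_coverage(lines):
    """Return (S_rel, C_trig) from T-side JSON records (hypothetical helper)."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        raise ValueError("no records to aggregate")
    # Only samples whose trigger fired contribute to the reliability average.
    triggered = [r for r in records if r.get("triggered") == 1.0]
    c_trig = len(triggered) / len(records)
    s_rel = sum(r["score"] for r in triggered) / len(triggered) if triggered else 0.0
    return s_rel, c_trig


lines = [
    '{"score": 0.74, "score_raw": 3.7, "triggered": 1.0, "trigger_score": 0.6}',
    '{"triggered": 0.0, "trigger_score": 0.2}',
    '{"score": 0.50, "triggered": 1.0, "trigger_score": 0.9}',  # made-up record
]
s_rel, c_trig = reliability_and_coverage(lines)
print(round(s_rel, 2), round(c_trig, 3))  # 0.62 0.667
```

The two values are the inputs to the M-Score formula from the Trigger-Conditioned Scoring section.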

Usage Reference

List registered metrics

mbench list-metrics

Validate (strict)

mbench validate <source> --metrics <csv> [--models …] [--subsets …] [--limit N]

Any contract violation exits non-zero.

Eval (lenient — skips bad items, runs the rest)

mbench eval <source> --metrics <csv> --output runs/<run_name>
    [--models <csv>] [--subsets <csv>] [--conditions <csv>]
    [--sample-ids <csv>] [--limit N]
    [--vlm-judge openai-compatible
       --vlm-base-url <URL> --vlm-api-key <KEY> --vlm-model <ID>
       --vlm-n-frames 8 --vlm-max-retries 3]
    [--workers N]                 # parallel across items
    [--ignore-trigger]            # bypass VLM trigger gate (ablation only)
    [--no-progress-bar] [--quiet] [--no-log-file]

Prepare auxiliary artifacts (never auto-invoked during eval)

mbench prepare <source> --producer <name> --output artifacts/

Python API

from mbench.core.pipeline import run_eval
from mbench.core.registry import metric_registry, adapter_registry, aggregator_registry
import mbench.cli; mbench.cli._bootstrap()    # register all built-in plugins

adapter = adapter_registry.get("mbench-dir")
items = adapter.load("data/MBench-A-Setup", subsets=["environment"], limit=4)
metric = metric_registry.get("mbencha.environment.spatial_epipolar")
aggregator = aggregator_registry.get("mean")

summary = run_eval(items, [metric], aggregator, "runs/python_api_demo")

Output Layout

runs/<run_name>/
├── items.jsonl              # one line per evaluated sample (item_id, model_id)
├── eval.log                 # per-item progress log (timestamps, scores, errors)
└── metrics/<metric_name>/
    ├── units.jsonl          # one UnitResult per evaluation unit (frame pair, track, segment, …)
    ├── items.jsonl          # one ItemResult per sample (aggregated from units)
    └── summary.json         # SummaryResult grouped by model_id

UnitResult.metadata['raw_scores'] preserves the metric's original internal fields (the ones dropped by the canonical projection), useful for after-the-fact debugging or re-deriving alternative score keys offline.

Citation

If MBench helps your research, please cite us:

@article{zhang2026mbench,
  title         = {MBench: A Comprehensive Benchmark on Memory Capability for Video World Models},
  author        = {Zhang, Shengjun and Zhang, Zhang and Huang, Simin and Tang, Zhenyu and Wang, Hanyang and Dai, Chensheng and Chen, Min and Li, Yifan and Li, Yuxin and Chen, Yingjie and Liu, Hao and Li, Chen and Duan, Yueqi},
  year          = {2026},
  eprint        = {2606.00793},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.00793}
}

License

The code is released under the MIT License.