V-ReasonBench

February 18, 2026

A comprehensive benchmark for evaluating video generation models across four reasoning dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics.

V-ReasonBench Pipeline

Key Features:

  • 🎯 13 reasoning tasks spanning 4 core dimensions
  • 📊 Pass@5 evaluation with reproducible, answer-verifiable metrics
  • 🔧 Unified evaluation framework with automated scoring
  • 📁 Standardized dataset with clear input-output pairs

📋 TODOs

  • Release paper
  • Release dataset and eval code
  • Release data generation code

🚀 Quick Start

Prerequisites

Python Dependencies:

pip install -r requirements.txt

Download SAM 2 checkpoint:

mkdir -p checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt -O checkpoints/sam2.1_hiera_large.pt

API Configuration

Some tasks call a VLM (Gemini) API during evaluation. Set these environment variables before running them:

export VLM_API_KEY="your_api_key_here"
export VLM_API_URL="your_api_url_here"
export VLM_MODEL="gemini-2.5-pro" 

Tasks requiring VLM API: code, math, shape_fit, sudoku, temperature, color_connect
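A pre-flight check can catch an unset variable before a long evaluation run starts. The sketch below is not part of evaluate.py: the helper name `missing_vlm_env` is hypothetical, and the variable and task names are simply copied from the lists above.

```python
import os

# Names copied from this README; the helper itself is a hypothetical convenience.
VLM_ENV_VARS = ("VLM_API_KEY", "VLM_API_URL", "VLM_MODEL")
VLM_TASKS = {"code", "math", "shape_fit", "sudoku", "temperature", "color_connect"}


def missing_vlm_env(tasks, env=None):
    """Return the VLM variables that are unset or empty, if any requested task needs them."""
    env = os.environ if env is None else env
    if not VLM_TASKS.intersection(tasks):
        return []  # none of the requested tasks calls the VLM
    return [name for name in VLM_ENV_VARS if not env.get(name)]
```

Calling `missing_vlm_env(["math", "block_slide"])` before launching the evaluation returns an empty list when all three variables are set.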

Run Evaluation

Place all generated videos in a directory and evaluate:

# Single task
python evaluate.py --generated_videos ./my_videos --task block_slide

# Multiple tasks
python evaluate.py --generated_videos ./my_videos --task math code sudoku

Supported tasks: math, code, sudoku, tic_tac_toe, shape_fit, visual_symmetry, color_connect, sequence_completion, visual_analogy, rule_follow, temperature, communicating_vessels, block_slide

Video naming format: <input_name>_<model>_<gt_index>_seed<N>.mp4

  • Example: shape_fit_model1_00_seed0.mp4
  • Each (model, input) pair should have 5 seeds (seed0-seed4) for Pass@5 evaluation
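Since Pass@5 expects five seeds per (model, input) pair, it can help to confirm the video directory is complete before evaluating. The following is a minimal sketch under one assumption: everything in the file name before `_seed<N>.mp4` identifies the pair. The helper `missing_seeds` is hypothetical and not part of the repository.

```python
import re
from collections import defaultdict

# Everything before "_seed<N>.mp4" is treated as one opaque (model, input) key.
SEED_RE = re.compile(r"^(?P<key>.+)_seed(?P<seed>\d+)\.mp4$")


def missing_seeds(filenames, k=5):
    """Map each key to the sorted seed indices in range(k) that have no video."""
    seen = defaultdict(set)
    for name in filenames:
        match = SEED_RE.match(name)
        if match:  # files that do not follow the naming format are skipped
            seen[match["key"]].add(int(match["seed"]))
    expected = set(range(k))
    return {
        key: sorted(expected - seeds)
        for key, seeds in seen.items()
        if expected - seeds
    }
```

Passing the output of `os.listdir("./my_videos")` yields an empty dict when every pair has seed0 through seed4.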

Results are saved to evaluations/<TaskName>/eval_results/<task_name>_eval.json

📂 Dataset

The dataset/ folder contains input images for all benchmark tasks in a flat structure.

Naming Format:

Without subtype: <task_name>_<index>.png

  • Example: shape_fit_00.png, visual_analogy_resize_05.png

With subtype: <task_name>_<subtype>_<index>.png

  • Example: tic_tac_toe_3_05.png, math_level1_2_07.png
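Because task names themselves contain underscores, splitting a file name on `_` alone is ambiguous. One way to handle this, sketched below, is to match against the supported task names first and treat the last underscore-separated field as the index. The helper `parse_input_name` is hypothetical, and the "index is the last field" rule is an assumption drawn from the examples above.

```python
from pathlib import Path

# Task names copied from the "Supported tasks" list in this README.
TASKS = (
    "math", "code", "sudoku", "tic_tac_toe", "shape_fit", "visual_symmetry",
    "color_connect", "sequence_completion", "visual_analogy", "rule_follow",
    "temperature", "communicating_vessels", "block_slide",
)


def parse_input_name(filename):
    """Split a dataset file name into (task, subtype, index); subtype may be None."""
    stem = Path(filename).stem
    # Try longer task names first so a shorter name can never shadow a longer one.
    for task in sorted(TASKS, key=len, reverse=True):
        if stem.startswith(task + "_"):
            rest = stem[len(task) + 1:]
            subtype, _, index = rest.rpartition("_")
            return task, (subtype or None), index
    raise ValueError(f"unrecognized task prefix in {filename!r}")
```

With this rule, `tic_tac_toe_3_05.png` splits into task `tic_tac_toe`, subtype `3`, and index `05`.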

🎬 Video Generation

To generate videos for evaluation, use inputs from dataset/ and prompts from prompts.txt.

Workflow:

  1. Pick an input image from dataset/ (e.g., shape_fit_00.png)
  2. Get the corresponding task prompt from prompts.txt
  3. Generate 5 videos per input with different seeds (seed0-seed4)
  4. Name outputs: <input_name>_<model>_<gt_index>_seed<N>.mp4

Examples:

  • Input: shape_fit_00.png → Outputs: shape_fit_00_model1_seed0.mp4, shape_fit_00_model1_seed1.mp4, ...
  • Input: tic_tac_toe_3_05.png → Outputs: tic_tac_toe_3_05_model1_seed0.mp4, ...
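The naming step can be scripted so that all five seeds are labeled consistently. This sketch only mirrors the two examples listed above (input stem, then model, then seed); the helper `output_names` is hypothetical, and the video generation call itself is left out because it depends on the model being tested.

```python
from pathlib import Path


def output_names(input_image, model, num_seeds=5):
    """Build the output file names for one input image, one per seed."""
    stem = Path(input_image).stem  # e.g. "shape_fit_00" from "shape_fit_00.png"
    return [f"{stem}_{model}_seed{seed}.mp4" for seed in range(num_seeds)]
```

For example, `output_names("dataset/shape_fit_00.png", "model1")` produces the five names from `shape_fit_00_model1_seed0.mp4` through `shape_fit_00_model1_seed4.mp4`.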

Task prompts in prompts.txt:

  • 10 reasoning task prompts (shape_fit, code, math, etc.)
  • 2 sudoku prompts (4x4, 9x9)
  • 4 visual symmetry prompts (vertical, horizontal, rotational, diagonal)
  • 10 temperature scenario prompts (different ice melting conditions)

🎯 Supported Tasks

Structured Problem-Solving

| Task | Description | GT Format | Key Metrics |
|---|---|---|---|
| arithmetic operation | Mathematical expression solving | GT/<level>/<idx>.csv | Problem preservation + answer accuracy |
| code execution | Code execution and output | GT//.csv | Problem preservation + execution correctness |
| sudoku | Sudoku puzzle solving (4×4, 9×9) | GT/.csv | Cell-by-cell grid accuracy |
| tic_tac_toe | Game state progression | GT/<idx>.png | Grid cell comparison |
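The table lists "cell-by-cell grid accuracy" without defining it. One plausible reading is the fraction of grid cells whose predicted value equals the ground-truth value. The sketch below illustrates that reading only; `grid_accuracy` is a hypothetical helper, and how the evaluation code extracts a grid from a video frame is not covered here.

```python
def grid_accuracy(predicted, expected):
    """Fraction of cells where the predicted grid matches the expected grid."""
    if len(predicted) != len(expected) or any(
        len(p_row) != len(e_row) for p_row, e_row in zip(predicted, expected)
    ):
        raise ValueError("grids must have the same shape")
    total = sum(len(row) for row in expected)
    if total == 0:
        raise ValueError("grids must not be empty")
    # Count matching cells position by position.
    matches = sum(
        p == e
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    )
    return matches / total
```

Under this definition, a 4x4 sudoku with one wrong cell scores 15/16, so whether it counts as a pass depends on the threshold in use.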

Spatial Cognition

| Task | Description | GT Format | Key Metrics |
|---|---|---|---|
| shape fitting | Shape fitting puzzle solving | Inputs only | VLM-based hole filling accuracy |
| visual symmetry | Symmetry completion | GT/<type>/single/<idx>.png | Delta-E color accuracy |
| color_connect | Color matching and connection | GT/<idx>.png | VLM-based connection accuracy |

Pattern-based Inference

| Task | Description | GT Format | Key Metrics |
|---|---|---|---|
| sequence completion | Sequence pattern completion | GT/<idx>.png + masks | Shape/background accuracy |
| analogy solving | Visual transformation understanding | GT/<concept>/<idx>.png | IoU with SAM segmentation |
| rule following | Pattern completion following rules | GT/<idx>.png | Cell-by-cell grid accuracy |
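For readers unfamiliar with the IoU metric named in the analogy row, the sketch below computes intersection over union for two binary masks. It assumes the masks are already available as same-shaped nested lists of 0/1 values; obtaining them from SAM 2 is outside the scope of this example, and `mask_iou` is a hypothetical helper rather than the repository's implementation.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two same-shaped binary masks (nested lists of 0/1)."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += bool(a) and bool(b)
            union += bool(a) or bool(b)
    # Two empty masks overlap trivially; this sketch treats that as a perfect match.
    return intersection / union if union else 1.0
```

A value of 1.0 means the predicted and ground-truth regions coincide exactly, and 0.0 means they do not overlap at all.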

Physical Dynamics

| Task | Description | GT Format | Key Metrics |
|---|---|---|---|
| temperature | Ice melting under different conditions | inputs/<idx>.png | VLM physical reasoning score |
| lever balance | Lever balance physics | GT/<idx>.csv | Mask-specific pixel accuracy |
| communicating vessels | Fluid dynamics | GT/<idx>.csv | Mask-specific pixel accuracy |
| block_slide | Block sliding puzzle | GT/<idx>_gt.png + masks | Shape/background accuracy |

📊 Evaluation Details

Directory Structure

evaluations/
  <TaskName>/
    GT/              # Ground truth (images/CSVs)
    inputs/          # Initial state inputs
    predictions/     # Auto-generated: extracted frames
    eval_results/    # Auto-generated: JSON results

Output Format

Results are saved to evaluations/<TaskName>/eval_results/<task_name>_eval.json:

{
  "model_summary": [
    {
      "model": "model_name",
      "pass_at_k": 0.92,
      "mean_score": 0.85,
      "count": 50
    }
  ],
  "aggregate": {
    "num_videos": 300,
    "num_gt": 10,
    "num_models": 6,
    "mean_score": 0.61,
    "pass_at_k": 0.18,
    "threshold": 0.95
  },
  "results": {
    "/path/to/video.mp4": {
      "score": 0.85,
      "gt_index": "01",
      "model": "model_name",
      "passed": true
    }
  }
}
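The per-video entries under `results` are enough to recompute simple per-model statistics. The sketch below assumes the report has already been loaded with `json.load` and has the field names shown above; `summarize` is a hypothetical helper, not a function provided by the repository.

```python
from collections import defaultdict


def summarize(report):
    """Recompute per-model video count, mean score, and number of passing videos."""
    by_model = defaultdict(list)
    for entry in report["results"].values():
        by_model[entry["model"]].append(entry)
    return {
        model: {
            "count": len(entries),
            "mean_score": sum(e["score"] for e in entries) / len(entries),
            "videos_passed": sum(e["passed"] for e in entries),
        }
        for model, entries in by_model.items()
    }
```

Note that `videos_passed` counts individual videos, which is different from Pass@k: Pass@k counts a GT instance once if any of its videos passes.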

Metrics

Pass@k: Probability that at least one of k attempts succeeds (averaged across all GT instances)

Calculation:

  1. For each (model, GT) pair, check if any of the k predictions pass (score ≥ threshold)
  2. Average success rate across all GTs for each model
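The two steps above can be sketched in a few lines. This is an illustration of the described calculation, not the repository's code: it assumes per-video entries shaped like those in the `results` section of the output JSON, uses the example threshold 0.95 as a default, and takes k to be however many videos exist for each (model, GT) pair. The name `pass_at_k` is hypothetical.

```python
from collections import defaultdict


def pass_at_k(results, threshold=0.95):
    """Per-model Pass@k: share of GT instances with at least one passing video."""
    # Step 1: (model, gt_index) -> True once any of its videos reaches the threshold.
    solved = defaultdict(bool)
    for entry in results.values():
        key = (entry["model"], entry["gt_index"])
        solved[key] = solved[key] or entry["score"] >= threshold
    # Step 2: average the per-GT outcomes for each model.
    per_model = defaultdict(list)
    for (model, _gt), ok in solved.items():
        per_model[model].append(ok)
    return {model: sum(oks) / len(oks) for model, oks in per_model.items()}
```

A model that solves one of two GT instances in at least one of its attempts gets a Pass@k of 0.5, regardless of how many of the individual videos passed.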

📝 Citation

If you find V-ReasonBench useful for your research, please cite:

@misc{luo2025vreasonbenchunifiedreasoningbenchmark,
      title={V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models}, 
      author={Yang Luo and Xuanlei Zhao and Baijiong Lin and Lingting Zhu and Liyao Tang and Yuqi Liu and Ying-Cong Chen and Shengju Qian and Xin Wang and Yang You},
      year={2025},
      eprint={2511.16668},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.16668}, 
}