MME-CoF: Evaluation of Video Chain-of-frames 🎬

November 24, 2025

Official repository for the project "Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark".

[🌍 Homepage] [📖 arXiv Paper] [🤗 HF Datasets]

💥 News

[2025.11.15] 🔥 We update the MME-CoF results for the Wan 2.2 series and HunyuanVideo, alongside the previously reported results for the closed-source Veo 3 series, Sora 2 series, Kling, and Seedance. The leaderboard covering all evaluated models on the updated benchmark will be refreshed shortly.
[2025.11.15] 🔥 We expand MME-CoF to support a more comprehensive and reliable evaluation. Please access the updated benchmark on [🤗 HF Datasets].
[2025.11.04] 🔥 We release the evaluation code.
[2025.11.03] 🔥 We publish the MME-CoF benchmark data at [🤗 Hugging Face Dataset].
[2025.11.01] 🚀 We release the arXiv paper.

🧠 Study Overview

Overview of Our Study on the Reasoning Potential of Video Models.

We investigate a key question: Are current video models reliable zero-shot reasoners? While modern video models can “see the world” and show promising ability to perceive, understand, and manipulate complex visual scenes, their actual reliability in visual reasoning remains unverified.

We conduct a comprehensive Chain-of-Frame (CoF) evaluation of the leading model Veo-3 across 12 core dimensions and introduce MME-CoF, a compact and standardized benchmark for systematic CoF reasoning assessment. Our findings show that current video models are not yet dependable standalone zero-shot reasoners, but they demonstrate strong potential as powerful visual perception and scene-understanding modules to complement dedicated reasoning systems.

🔍 Deep-Dive Analysis on Veo-3

We provide the first investigation of video models (Veo-3) to analyze their visual reasoning potential, detailing representative successes, characteristic errors, and the conditions under which CoF reasoning emerges, holds, or breaks.

💪 Evaluation

Download Dataset

git lfs install
git clone https://huggingface.co/datasets/ZiyuG/MME-CoF

Run Evaluation

By default, each image is padded to 16:9, and the video model generates six videos per image. We evaluate using Gemini-2.5-Pro.
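As a rough illustration of the padding step, the sketch below computes the canvas size and paste offsets needed to bring an image to 16:9 without cropping or rescaling it. The function name `pad_to_16_9`, the centered placement, and the rounding-up of the padded side are assumptions made for this example; the repository's scripts may pad differently (for instance, with a different alignment or fill).

```python
import math
from fractions import Fraction

TARGET_RATIO = Fraction(16, 9)  # width / height


def pad_to_16_9(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (canvas_w, canvas_h, x_offset, y_offset) for centered 16:9 padding.

    Hypothetical helper: the original image is kept at its native size and
    placed on a larger canvas; only the shorter dimension is extended.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    if Fraction(width, height) < TARGET_RATIO:
        # Image is too narrow: keep the height and widen the canvas.
        canvas_h = height
        canvas_w = math.ceil(height * TARGET_RATIO)
    else:
        # Image is already 16:9 or wider: keep the width and extend the height.
        canvas_w = width
        canvas_h = math.ceil(width / TARGET_RATIO)

    # Center the original image on the canvas.
    x_offset = (canvas_w - width) // 2
    y_offset = (canvas_h - height) // 2
    return canvas_w, canvas_h, x_offset, y_offset
```

For a square 1024×1024 input this gives a 1821×1024 canvas with the image pasted 398 pixels from the left edge; an input that is already 1920×1080 is returned unchanged with zero offsets.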

1. Place evaluate.py and genai_client.py under the dataset folder.
2. Edit line 24 in genai_client.py to add your Google AI API key.
3. Run: python evaluate.py

Results will be saved to mme-cof_eval_results.json
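The layout of the results file is not documented here, so the sketch below makes no assumption about its schema. It only loads the JSON and reports its top-level shape, which can be a useful first look before writing any analysis code. The helper names `summarize_json` and `summarize_results` are hypothetical and not part of the repository.

```python
import json


def summarize_json(data, max_keys: int = 5) -> str:
    """Describe the top-level shape of a decoded JSON value, schema-agnostic."""
    if isinstance(data, dict):
        first_keys = list(data)[:max_keys]
        return f"dict with {len(data)} keys, first keys: {first_keys}"
    if isinstance(data, list):
        return f"list with {len(data)} items"
    # Scalars (str, int, float, bool, None) are reported directly.
    return f"{type(data).__name__}: {data!r}"


def summarize_results(path: str = "mme-cof_eval_results.json") -> str:
    """Load the results file written by evaluate.py and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_json(json.load(f))
```

After an evaluation run, calling `print(summarize_results())` from the dataset folder shows whether the file holds a mapping or a list and how many entries it contains.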

📦 MME-CoF Benchmark

We curate MME-CoF, a compact benchmark providing a standardized taxonomy and an evaluation protocol aligned with CoF reasoning, enabling consistent and category-wise assessment beyond surface-level visual fidelity.

Evaluation Radar Map and Word Cloud of MME-CoF.

📜 Citation

If you find this work useful, please cite:

@article{guo2025video,
  title={Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark},
  author={Guo, Ziyu and Chen, Xinyan and Zhang, Renrui and An, Ruichuan and Qi, Yu and Jiang, Dongzhi and Li, Xiangtai and Zhang, Manyuan and Li, Hongsheng and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2510.26802},
  year={2025}
}