MME-CoF: Evaluation of Video Chain-of-frames
November 24, 2025
Official repository for the project "Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark"
[Homepage] [arXiv Paper] [HF Datasets]
News
- [2025.11.15] We update the MME-CoF results for the Wan 2.2 series and HunyuanVideo, alongside the previously reported results for the closed-source Veo 3 series, Sora 2 series, Kling, and Seedance. The leaderboard covering all evaluated models on the updated benchmark will be refreshed shortly.
- [2025.11.15] We expand MME-CoF to support a more comprehensive and reliable evaluation. Please access the updated benchmark on [HF Datasets].
- [2025.11.04] We release the evaluation code.
- [2025.11.03] We publish MME-CoF benchmark data at [Huggingface Dataset].
- [2025.11.01] We release the arXiv paper.
Study Overview
Overview of Our Study on the Reasoning Potential of Video Models.
We investigate a key question: Are current video models reliable zero-shot reasoners? While modern video models can “see the world” and show promising ability to perceive, understand, and manipulate complex visual scenes, their actual reliability in visual reasoning remains unverified.
We conduct a comprehensive Chain-of-Frame (CoF) evaluation of the leading model Veo-3 across 12 core dimensions and introduce MME-CoF, a compact and standardized benchmark for systematic CoF reasoning assessment. Our findings show that current video models are not yet dependable standalone zero-shot reasoners, but they demonstrate strong potential as powerful visual perception and scene-understanding modules to complement dedicated reasoning systems.
Deep-Dive Analysis on Veo-3
We provide the first investigation of a video model (Veo-3) that analyzes its visual reasoning potential, detailing representative successes, characteristic errors, and the conditions under which CoF reasoning emerges, holds, or breaks.
Evaluation
Download Dataset
git lfs install
git clone https://huggingface.co/datasets/ZiyuG/MME-CoF
Run Evaluation
By default, each image is padded to 16:9, and the video model generates six videos per image. We evaluate using Gemini-2.5-Pro.
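The 16:9 padding can be illustrated with a little arithmetic. The sketch below is not taken from evaluate.py: it assumes the image is centered on the smallest canvas that contains it and has a 16:9 ratio (rounded up to whole pixels), and the names Padding and pad_to_16_9 are hypothetical.

```python
from typing import NamedTuple


class Padding(NamedTuple):
    """Hypothetical container for the padded canvas size and image offset."""
    canvas_width: int
    canvas_height: int
    left: int  # x offset of the image on the canvas
    top: int   # y offset of the image on the canvas


def pad_to_16_9(width: int, height: int) -> Padding:
    """Return the smallest 16:9 canvas containing a width x height image.

    Assumption: the image is centered; the repository may pad differently.
    Because sizes are whole pixels, the canvas ratio can be off by a
    fraction of a pixel.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    # Compare aspect ratios with integers to avoid float rounding.
    if width * 9 >= height * 16:
        # Already 16:9 or wider: keep the width, grow the height.
        canvas_width = width
        canvas_height = -(-width * 9 // 16)  # ceil(width * 9 / 16)
    else:
        # Taller than 16:9: keep the height, grow the width.
        canvas_height = height
        canvas_width = -(-height * 16 // 9)  # ceil(height * 16 / 9)

    left = (canvas_width - width) // 2
    top = (canvas_height - height) // 2
    return Padding(canvas_width, canvas_height, left, top)


if __name__ == "__main__":
    print(pad_to_16_9(1024, 1024))
    # Padding(canvas_width=1821, canvas_height=1024, left=398, top=0)
```

A square 1024 x 1024 input therefore gains bars on the left and right only, while an input that is already 16:9 is returned unchanged.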
- Place evaluate.py and genai_client.py under the dataset folder
- Edit line 24 in genai_client.py to add your Google AI API Key
- Run:
python evaluate.py
Results are saved to mme-cof_eval_results.json.
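To take a first look at the output file, a small loader like the one below may help. It is only a sketch: the layout of mme-cof_eval_results.json is not documented here, so the code assumes nothing beyond valid JSON and reports the top-level structure. The function name summarize_results is hypothetical.

```python
import json
from pathlib import Path


def summarize_results(path="mme-cof_eval_results.json"):
    """Describe the top-level structure of the results file.

    No schema is assumed: the function only reports whether the file
    holds a JSON object or array, its size, and up to ten top-level keys.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        keys = sorted(str(k) for k in data)[:10]
        return {"type": "dict", "size": len(data), "keys": keys}
    if isinstance(data, list):
        return {"type": "list", "size": len(data), "keys": []}
    # Scalar JSON value (string, number, boolean, or null).
    return {"type": type(data).__name__, "size": 1, "keys": []}


if __name__ == "__main__":
    results_path = Path("mme-cof_eval_results.json")
    if results_path.exists():
        print(summarize_results(results_path))
    else:
        print("Run evaluate.py first; no results file found.")
```

Once the structure is known, per-category aggregation can be layered on top of this loader.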
MME-CoF Benchmark
We curate MME-CoF, a compact benchmark providing a standardized taxonomy and an evaluation protocol aligned with CoF reasoning, enabling consistent and category-wise assessment beyond surface-level visual fidelity.
Evaluation Radar Map and Word Cloud of MME-CoF.
Citation
If you find this work useful, please cite:
@article{guo2025video,
title={Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark},
author={Guo, Ziyu and Chen, Xinyan and Zhang, Renrui and An, Ruichuan and Qi, Yu and Jiang, Dongzhi and Li, Xiangtai and Zhang, Manyuan and Li, Hongsheng and Heng, Pheng-Ann},
journal={arXiv preprint arXiv:2510.26802},
year={2025}
}