May 5, 2026

The Landscape of Video Reasoning: Tasks, Paradigms and Benchmarks— An Open-Source Survey

🗺️ Overview

This Awesome list systematically curates and tracks the latest progress in Video Reasoning, covering diverse modalities, tasks, and modeling paradigms. Rather than focusing on a single line of research, we organize the landscape from multiple complementary perspectives. Following the emerging taxonomy of the field, current works are grouped into four major paradigms:

🗒️ CoT-based Video Reasoning — language-centric, chain-of-thought reasoning with Video-LMMs
🕹️ CoF-based Video Reasoning — vision-centric reasoning grounded in world models or video generation
🌈 Interleaved Video Reasoning — unified models that integrate multimodal interaction and iterative inference
🔁 Streaming Video Reasoning — continuous, low-latency reasoning over long or unbounded video streams with online perception and incremental state updates.

We additionally maintain a dedicated Benchmark section that summarizes datasets, evaluation settings, and standardized tasks to support fair comparison across paradigms.

Note

This repository aims to provide a structured, up-to-date, and open-source overview of the evolving landscape of video reasoning. Contributions and PRs are warmly welcome — preferably in reverse chronological order (newest first) to keep the list fresh and easy to browse.
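Contributors who want to check the newest-first ordering before opening a PR could use something like the sketch below. It is only an illustration, not part of this repository: it assumes rows are tab-separated with the `Time` value (formatted `YYYY-MM`) in the fourth column, and the helper names `parse_rows` and `out_of_order` are hypothetical.

```python
from typing import List, Tuple


def parse_rows(table_text: str, time_column: int = 3) -> List[Tuple[str, str]]:
    """Return (title, time) pairs from tab-separated rows, skipping the header row.

    Assumption: the first line is the header and the Time cell sits at
    index `time_column` (0-based).
    """
    rows = []
    for line in table_text.strip().splitlines()[1:]:
        cells = line.split("\t")
        rows.append((cells[0], cells[time_column].strip()))
    return rows


def out_of_order(rows: List[Tuple[str, str]]) -> List[str]:
    """Return titles of rows that are newer than the row directly above them."""
    offenders = []
    for (_, prev_time), (title, time) in zip(rows, rows[1:]):
        # Zero-padded YYYY-MM strings sort chronologically as plain text.
        if time > prev_time:
            offenders.append(title)
    return offenders
```

Passing the text of one table to `parse_rows` and the result to `out_of_order` yields the titles that break the newest-first order; an empty list means the table is already sorted.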

📖 Contents

Awesome-Video-Reasoning-Landscape

📑 Task Definition

TBD

😎 Paradigms

🗒️ CoT-based Video Reasoning

Title	Model & Code	Checkpoint	Time	Venue
Rethinking Chain-of-Thought Reasoning for Videos	GitHub	`N/A`	2025-12	`Arxiv`
1+1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning	GitHub	`N/A`	2025-12	`Arxiv`
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning	`N/A`	`N/A`	2025-12	`Arxiv`
OneThinker: All-in-one Reasoning Model for Image and Video	GitHub	Hugging Face	2025-12	`Arxiv`
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning	GitHub	`N/A`	2025-12	`Arxiv`
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding	`N/A`	`N/A`	2025-12	`Arxiv`
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models	GitHub	`N/A`	2025-11	`Arxiv`
Video-CoM: Interactive Video Reasoning via Chain of Manipulations	GitHub	`N/A`	2025-11	`Arxiv`
VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning	GitHub	`N/A`	2025-11	`Arxiv`
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning	`N/A`	`N/A`	2025-11	`Arxiv`
Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding	`N/A`	`N/A`	2025-11	`Arxiv`
Video Spatial Reasoning with Object-Centric 3D Rollout	`N/A`	`N/A`	2025-11	`Arxiv`
ViSS-R1: Self-Supervised Reinforcement Video Reasoning	`N/A`	`N/A`	2025-11	`Arxiv`
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning	GitHub	Hugging Face	2025-10	`Arxiv`
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence	GitHub	Hugging Face	2025-10	`Arxiv`
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception	GitHub	Hugging Face	2025-09	`Arxiv`
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning	`N/A`	`N/A`	2025-09	`Arxiv`
Kwai Keye-VL 1.5 Technical Report	GitHub	Hugging Face	2025-09	`Arxiv`
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data	GitHub	Google_Drive	2025-09	`Arxiv`
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding	`N/A`	`N/A`	2025-08	`Arxiv`
Ovis2.5 Technical Report	GitHub	Hugging Face	2025-08	`Arxiv`
Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments	`N/A`	`N/A`	2025-08	`Arxiv`
ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking	GitHub	`N/A`	2025-08	`Arxiv`
TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding	`N/A`	`N/A`	2025-08	`Arxiv`
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning	GitHub	Hugging Face	2025-08	`Arxiv`
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video	GitHub	Hugging Face	2025-08	`Arxiv`
ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models	`N/A`	`N/A`	2025-08	`Arxiv`
VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering	`N/A`	`N/A`	2025-08	`ACM-MM 2025`
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts	GitHub	Hugging Face	2025-07	`Arxiv`
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark	`N/A`	`N/A`	2025-07	`Arxiv`
CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks	`N/A`	`N/A`	2025-07	`Arxiv`
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments	`N/A`	`N/A`	2025-07	`Arxiv`
Scaling RL to Long Videos	GitHub	Hugging Face	2025-07	`NeurIPS 2025`
Kwai Keye-VL Technical Report	GitHub	`N/A`	2025-07	`Arxiv`
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models	GitHub	`N/A`	2025-07	`ACM-MM 2025`
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning	GitHub	Hugging Face	2025-07	`EMNLP 2025`
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames	`N/A`	`N/A`	2025-07	`Arxiv`
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning	`N/A`	`N/A`	2025-06	`Arxiv`
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning	GitHub	`N/A`	2025-06	`Arxiv`
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning	`N/A`	`N/A`	2025-06	`Arxiv`
VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks	GitHub	Hugging Face	2025-06	`Arxiv`
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context	GitHub	`N/A`	2025-06	`Arxiv`
MiMo-VL Technical Report	GitHub	Hugging Face	2025-06	`Arxiv`
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning	GitHub	`N/A`	2025-06	`EMNLP 2025 (Findings)`
EgoVLM: Policy Optimization for Egocentric Video Understanding	GitHub	Hugging Face	2025-06	`Arxiv`
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency	GitHub	Hugging Face	2025-06	`Arxiv`
VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking	`N/A`	`N/A`	2025-06	`Arxiv`
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding	GitHub	`N/A`	2025-06	`NeurIPS 2025`
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding	`N/A`	`N/A`	2025-06	`Arxiv`
DIVE: Deep-search Iterative Video Exploration	GitHub	`N/A`	2025-06	`CVPR 2025`
VideoDeepResearch: Long Video Understanding With Agentic Tool Using	GitHub	`N/A`	2025-06	`Arxiv`
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency	`N/A`	`N/A`	2025-06	`Arxiv`
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO	GitHub	`N/A`	2025-06	`NeurIPS 2025`
Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought	`N/A`	Project_Page	2025-06	`Arxiv`
VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning	`N/A`	`N/A`	2025-06	`Arxiv`
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning	GitHub	Hugging Face	2025-06	`Arxiv`
Reinforcing Video Reasoning with Focused Thinking	GitHub	Hugging Face	2025-05	`Arxiv`
A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding	GitHub	`N/A`	2025-05	`Arxiv`
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration	GitHub	Hugging Face	2025-05	`NeurIPS 2025`
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought	GitHub	`N/A`	2025-05	`NeurIPS 2025`
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization	GitHub	Hugging Face	2025-05	`Arxiv`
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning	GitHub	`N/A`	2025-05	`NeurIPS 2025`
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning	GitHub	Hugging Face	2025-05	`NeurIPS 2025`
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning	GitHub	Hugging Face	2025-05	`Arxiv`
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning	GitHub	Hugging Face	2025-05	`NeurIPS 2025`
Seed1.5-VL Technical Report	`N/A`	`N/A`	2025-05	`Arxiv`
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action	GitHub	Hugging Face	2025-05	`Arxiv`
Fostering Video Reasoning via Next-Event Prediction	GitHub	`N/A`	2025-05	`Arxiv`
SiLVR: A Simple Language-based Video Reasoning Framework	GitHub	`N/A`	2025-05	`Arxiv`
RVTBench: A Benchmark for Visual Reasoning Tasks	GitHub	Hugging Face	2025-05	`Arxiv`
CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning	`N/A`	`N/A`	2025-05	`Arxiv`
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models	GitHub	`N/A`	2025-05	`Arxiv`
AVA: Towards Agentic Video Analytics with Vision Language Models	GitHub	`N/A`	2025-05	`NSDI 2026`
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning	GitHub	Hugging Face	2025-04	`Arxiv`
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning	GitHub	Hugging Face	2025-04	`Arxiv`
Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning	GitHub	Hugging Face	2025-04	`Arxiv`
Improved Visual-Spatial Reasoning via R1-Zero-Like Training	GitHub	Hugging Face	2025-04	`Arxiv`
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models	GitHub	`N/A`	2025-04	`Arxiv`
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding	`N/A`	`N/A`	2025-04	`Arxiv`
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models	`N/A`	Hugging Face	2025-04	`Arxiv`
MR. Video: "MapReduce" is the Principle for Long Video Understanding	GitHub	`N/A`	2025-04	`Arxiv`
Multimodal Long Video Modeling Based on Temporal Dynamic Context	GitHub	Hugging Face	2025-04	`Arxiv`
WikiVideo: Article Generation from Multiple Videos	GitHub	`N/A`	2025-04	`Arxiv`
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1	GitHub	Hugging Face	2025-03	`Arxiv`
Video-R1: Reinforcing Video Reasoning in MLLMs	GitHub	Hugging Face	2025-03	`NeurIPS 2025`
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM	GitHub	Hugging Face	2025-03	`NeurIPS 2025`
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos	`N/A`	`N/A`	2025-03	`NeurIPS 2025`
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning	GitHub	Hugging Face	2025-03	`Arxiv`
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs	GitHub	`N/A`	2025-03	`ICCV 2025`
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model	GitHub	Hugging Face	2025-02	`Arxiv`
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding	GitHub	`N/A`	2025-02	`ACL 2025 (Oral)`
CoS: Chain-of-Shot Prompting for Long Video Understanding	GitHub	`N/A`	2025-02	`Arxiv`
Temporal Preference Optimization for Long-Form Video Understanding	GitHub	Hugging Face	2025-01	`Arxiv`
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model	GitHub	Hugging Face	2025-01	`ACL 2025 (Findings)`
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning	GitHub	Hugging Face	2025-01	`IEEE TPAMI`
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs	`N/A`	`N/A`	2025-01	`Arxiv`
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition	GitHub	`N/A`	2025-01	`ICML 2024`
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	GitHub	Hugging Face	2024-12	`Arxiv`
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training	`N/A`	`N/A`	2024-12	`CVPR 2025`
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection	GitHub	Hugging Face	2024-11	`CVPR 2025`
Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning	`N/A`	`N/A`	2024-10	`NeurIPS 2024 (Workshop)`
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs	GitHub	`N/A`	2024-09	`EMNLP 2024 (Findings)`
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning	GitHub	Hugging Face	2024-09	`NeurIPS 2024 (Spotlight)`

🕹️ CoF-based Video Reasoning

Title	Code	Checkpoint	Time	Venue
CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation	GitHub	`N/A`	2026-01	`Arxiv`
Unified Video Editing with Temporal Reasoner	GitHub	Hugging Face	2025-12	`Arxiv`
Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven’s Matrices	GitHub	`N/A`	2025-12	`Arxiv`
McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning	GitHub	`N/A`	2025-11	`Arxiv`
In-Video Instructions: Visual Signals as Generative Control	GitHub	`N/A`	2025-11	`Arxiv`
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO	GitHub	Hugging Face	2025-11	`Arxiv`
Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks	GitHub	Hugging Face	2025-11	`Arxiv`
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm	GitHub	`N/A`	2025-11	`Arxiv`
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark	GitHub	Hugging Face	2025-10	`Arxiv`
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation	GitHub	`N/A`	2025-10	`Arxiv`
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning	GitHub	`N/A`	2025-06	`Arxiv`

🌈 Interleaved Video Reasoning

Title	Code	Checkpoint	Time	Venue
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling	GitHub	Hugging Face	2025-11	`Arxiv`
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation	GitHub	`N/A`	2025-11	`NeurIPS 2025 (Spotlight)`
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination	GitHub	`N/A`	2025-11	`Arxiv`
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution	`N/A`	`N/A`	2025-11	`Arxiv`
Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning	`N/A`	`N/A`	2025-10	`Arxiv`
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models	GitHub	Hugging Face	2025-10	`ACM-MM 2025`
FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning	`N/A`	`N/A`	2025-09	`Arxiv`
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning	GitHub	Hugging Face	2025-08	`Arxiv`
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation	GitHub	Hugging Face	2024-09	`ICLR 2025`

🔁 Streaming Video Reasoning

Title	Code	Checkpoint	Time	Venue
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously	GitHub	`N/A`	2026-03	`Arxiv`
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding	GitHub	Hugging Face	2025-11	`NeurIPS 2025`
StreamingVLM: Real-Time Understanding for Infinite Video Streams	GitHub	`N/A`	2025-10	`Arxiv`
StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding	`N/A`	`N/A`	2025-10	`Arxiv`
StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA	GitHub	`N/A`	2025-10	`ACM-MM 2025`
StreamForest: Efficient Online Video Understanding with Persistent Event Memory	GitHub	Hugging Face	2025-09	`NeurIPS 2025 (Spotlight)`
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling	GitHub	Hugging Face	2025-07	`Arxiv`
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams	GitHub	Hugging Face	2025-06	`ICCV 2025`
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant	GitHub	`N/A`	2025-05	`NeurIPS 2025`
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval	`N/A`	`N/A`	2025-05	`Arxiv`
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos	GitHub	Hugging Face	2025-04	`ACM-MM 2025`
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	GitHub	`N/A`	2025-04	`Arxiv`
ViSpeak: Visual Instruction Feedback in Streaming Videos	GitHub	Model_Zoo	2025-03	`ICCV 2025`
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition	GitHub	`N/A`	2025-03	`ICCV 2025`
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval	GitHub	`N/A`	2025-03	`ICLR 2025`
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	GitHub	Hugging Face	2025-02	`ICLR 2025 (Spotlight)`
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction	GitHub	Hugging Face	2025-01	`CVPR 2025`
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	GitHub	`N/A`	2025-01	`ICLR 2025`
Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method	GitHub	Hugging Face	2025-01	`CVPR 2025`
StreamChat: Chatting with Streaming Video	`N/A`	`N/A`	2024-11	`Arxiv`

✨️ Benchmarks

Name	Paper	Link	Time	Venue
MMGR	MMGR: Multi-Modal Generative Reasoning	GitHub `<br>`Hugging Face	2025-12	`Arxiv`
MM-CoT	MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models	`N/A`	2025-12	`Arxiv`
RULER-Bench	RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence	GitHub `<br>`Hugging Face	2025-12	`Arxiv`
AV-SpeakerBench	See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models	GitHub	2025-12	`Arxiv`
PAI-Bench	PAI-Bench: A Comprehensive Benchmark For Physical AI	GitHub	2025-12	`Arxiv`
Envision	Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights	GitHub	2025-12	`Arxiv`
STREAMGAZE	STREAMGAZE: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos	GitHub `<br>`Hugging Face	2025-12	`Arxiv`
V-ReasonBench	V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models	GitHub	2025-11	`Arxiv`
VR-Bench	Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks	GitHub `<br>`Hugging Face	2025-11	`Arxiv`
Gen-ViRe	Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark	GitHub	2025-11	`Arxiv`
TiViBench	TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models	GitHub	2025-11	`Arxiv`
VideoThinkBench	Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm	GitHub	2025-11	`Arxiv`
MME-CoF	Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark	Hugging Face	2025-10	`Arxiv`
SciVideoBench	SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models	GitHub	2025-10	`Arxiv`
ReasoningTrack	ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking	GitHub	2025-08	`Arxiv`
METER	METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark	`N/A`	2025-07	`Arxiv`
Video-TT	Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding	Hugging Face	2025-07	`ICCV 2025`
ImplicitQA	ImplicitQA: Going beyond frames towards Implicit Video Reasoning	Hugging Face	2025-06	`Arxiv`
Video-CoT	Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought	Hugging Face	2025-06	`Arxiv`
Implicit-VideoQA	Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning	GitHub	2025-06	`Arxiv`
MORSE-500	MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning	GitHub `<br>`Hugging Face	2025-06	`Arxiv`
SpookyBench	Time Blindness: Why Video-Language Models Can't See What Humans Can	GitHub `<br>`Hugging Face	2025-05	`Arxiv`
VideoReasonBench	VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?	GitHub `<br>`Hugging Face	2025-05	`Arxiv`
Video-Holmes	Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?	GitHub	2025-05	`Arxiv`
VideoEval-Pro	VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation	GitHub `<br>`Hugging Face	2025-05	`Arxiv`
VBenchComp	Breaking Down Video LLM Benchmarks	`N/A`	2025-05	`Arxiv`
RVTBench	RVTBench: A Benchmark for Visual Reasoning Tasks	GitHub `<br>`Hugging Face	2025-05	`Arxiv`
VCRBench	VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models	GitHub	2025-05	`Arxiv`
RTV-Bench	RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video	GitHub `<br>`Hugging Face	2025-05	`NeurIPS 2025 (D&B)`
MINERVA	MINERVA: Evaluating Complex Video Reasoning	GitHub	2025-05	`Arxiv`
VCR-Bench	VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning	GitHub `<br>`Hugging Face	2025-04	`Arxiv`
SEED-Bench-R1	Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1	GitHub `<br>`Hugging Face	2025-03	`Arxiv`
H2VU-Benchmark	H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding	GitHub	2025-03	`Arxiv`
OmniMMI	OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts	GitHub `<br>`Hugging Face	2025-03	`CVPR 2025`
HAVEN	Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation	GitHub `<br>`Hugging Face	2025-03	`Arxiv`
V-STaR	V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning	GitHub `<br>`Hugging Face	2025-03	`Arxiv`
COVER	Reasoning is All You Need for Video Generalization	GitHub	2025-03	`ACL 2025 (Findings)`
MOMA-QA	Towards Fine-Grained Video Question Answering	`N/A`	2025-03	`Arxiv`
SVBench	SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	GitHub	2025-02	`ICLR 2025 (Spotlight)`
StreamBench	Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	GitHub `<br>`Hugging Face	2025-01	`ICLR 2025`
MMVU	MMVU: Measuring Expert-Level Multi-Discipline Video Understanding	GitHub `<br>`Hugging Face	2025-01	`Arxiv`
OVO-Bench	OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?	GitHub `<br>`Hugging Face	2025-01	`CVPR 2025`
HLV-1K	HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding	GitHub	2025-01	`ICME 2025`
OVBench	Online Video Understanding: OVBench and VideoChat-Online	GitHub `<br>`Hugging Face	2025-01	`CVPR 2025`
VSI-Bench	Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces	GitHub	2024-12	`CVPR 2025 (Oral)`
3DSRBench	3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark	Hugging Face	2024-12	`ICCV 2025`
BlackSwanSuite	Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events	GitHub `<br>`Hugging Face	2024-12	`CVPR 2025`
TOMATO	TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models	GitHub	2024-10	`CVPR 2025`
OmnixR	OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities	`N/A`	2024-10	`ICLR 2025`
VideoVista	VideoVista: A Versatile Benchmark for Video Understanding and Reasoning	GitHub	2024-06	`Arxiv`
SOK-Bench	SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge	GitHub	2024-05	`CVPR 2024`
CVRR-ES	How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs	GitHub	2024-05	`Arxiv`

In addition, several recent and concurrent surveys have discussed multimodal or video reasoning. The works listed below offer complementary perspectives to ours, reflecting the field’s rapid and parallel development: