
May 5, 2026

The Landscape of Video Reasoning: Tasks, Paradigms and Benchmarks— An Open-Source Survey


🗺️ Overview

This Awesome list systematically curates and tracks the latest progress in Video Reasoning, covering diverse modalities, tasks, and modeling paradigms. Rather than focusing on a single line of research, we organize the landscape from multiple complementary perspectives. Following the emerging taxonomy of the field, current works are grouped into four major paradigms:

  • 🗒️ CoT-based Video Reasoning — language-centric, chain-of-thought reasoning with Video-LMMs
  • 🕹️ CoF-based Video Reasoning — vision-centric reasoning grounded in world models or video generation
  • 🌈 Interleaved Video Reasoning — unified models that integrate multimodal interaction and iterative inference
  • 🔁 Streaming Video Reasoning — continuous, low-latency reasoning over long or unbounded video streams with online perception and incremental state updates.

We additionally maintain a dedicated Benchmark section that summarizes datasets, evaluation settings, and standardized tasks to support fair comparison across paradigms.

Note

This repository aims to provide a structured, up-to-date, and open-source overview of the evolving landscape of video reasoning. Contributions and PRs are warmly welcome — preferably in reverse chronological order (newest first) to keep the list fresh and easy to browse.
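For contributors who want to check the ordering before opening a PR, here is a minimal Python sketch that sorts table rows newest first. It assumes each entry is a pipe-delimited Markdown table row whose Time cell uses the `YYYY-MM` format seen in this list. The helper names (`split_row`, `row_month`, `sort_newest_first`) are illustrative and are not part of this repository.

```python
import re

# Matches a Time cell such as "2025-12" (assumed format: YYYY-MM).
DATE_RE = re.compile(r"^\d{4}-\d{2}$")


def split_row(row: str) -> list[str]:
    """Split one pipe-delimited Markdown table row into trimmed cells."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def row_month(row: str) -> str:
    """Return the first YYYY-MM cell in a row (assumed to be the Time column)."""
    for cell in split_row(row):
        if DATE_RE.match(cell):
            return cell
    raise ValueError(f"no YYYY-MM cell in row: {row!r}")


def sort_newest_first(rows: list[str]) -> list[str]:
    """Sort rows by month, newest first; rows in the same month keep their order."""
    # YYYY-MM strings compare correctly as plain text, and sorted() is stable
    # even with reverse=True, so same-month rows are not shuffled.
    return sorted(rows, key=row_month, reverse=True)


rows = [
    "| Video-R1: Reinforcing Video Reasoning in MLLMs | GitHub | Hugging Face | Text, Video | 2025-03 | NeurIPS 2025 |",
    "| Rethinking Chain-of-Thought Reasoning for Videos | GitHub | N/A | Text, Video | 2025-12 | arXiv |",
    "| Scaling RL to Long Videos | GitHub | Hugging Face | Text, Video | 2025-07 | NeurIPS 2025 |",
]

for row in sort_newest_first(rows):
    print(row)
```

The sketch splits on every `|`, so a title that itself contains a pipe character would need escaping or a proper Markdown parser.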

📖 Contents

📑 Task Definition

TBD

😎 Paradigms

🗒️ CoT-based Video Reasoning

| Title | Model & Code | Checkpoint | Input Modalities | Time | Venue |
| --- | --- | --- | --- | --- | --- |
| Rethinking Chain-of-Thought Reasoning for Videos | GitHub | N/A | Text, Video | 2025-12 | arXiv |
| 1+1 > 2 : Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning | GitHub | N/A | Text, Video | 2025-12 | arXiv |
| TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning | N/A | N/A | Text, Video | 2025-12 | arXiv |
| OneThinker: All-in-one Reasoning Model for Image and Video | GitHub | Hugging Face | Text, Video | 2025-12 | arXiv |
| WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | GitHub | N/A | Text, Video | 2025-12 | arXiv |
| Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding | N/A | N/A | Text, Video | 2025-12 | arXiv |
| Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models | GitHub | N/A | Text, Video | 2025-11 | arXiv |
| Video-CoM: Interactive Video Reasoning via Chain of Manipulations | GitHub | N/A | Text, Video | 2025-11 | arXiv |
| VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning | GitHub | N/A | Text, Video | 2025-11 | arXiv |
| AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning | N/A | N/A | Audio, Video | 2025-11 | arXiv |
| Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding | N/A | N/A | Text, Video | 2025-11 | arXiv |
| Video Spatial Reasoning with Object-Centric 3D Rollout | N/A | N/A | Text, Video | 2025-11 | arXiv |
| ViSS-R1: Self-Supervised Reinforcement Video Reasoning | N/A | N/A | Text, Video | 2025-11 | arXiv |
| Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | GitHub | Hugging Face | Text, Video | 2025-10 | arXiv |
| Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence | GitHub | Hugging Face | Text, Video | 2025-10 | arXiv |
| VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | GitHub | Hugging Face | Text, Video | 2025-09 | arXiv |
| MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning | N/A | N/A | Text, Video | 2025-09 | arXiv |
| Kwai Keye-VL 1.5 Technical Report | GitHub | Hugging Face | Text, Video | 2025-09 | arXiv |
| Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data | GitHub | Google Drive | Text, Video | 2025-09 | arXiv |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | N/A | N/A | Text, Video | 2025-08 | arXiv |
| Ovis2.5 Technical Report | GitHub | Hugging Face | Text, Video | 2025-08 | arXiv |
| Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments | N/A | N/A | Text, Video | 2025-08 | arXiv |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | GitHub | N/A | Text, Video | 2025-08 | arXiv |
| TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding | N/A | N/A | Text, Video | 2025-08 | arXiv |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-08 | arXiv |
| AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video | GitHub | Hugging Face | Audio, Video, Text | 2025-08 | arXiv |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | N/A | N/A | Text, Video | 2025-08 | arXiv |
| VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering | N/A | N/A | Text, Video | 2025-08 | ACM-MM 2025 |
| ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts | GitHub | Hugging Face | Text, Audio, Video | 2025-07 | arXiv |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | N/A | N/A | Text, Audio, Video | 2025-07 | arXiv |
| CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks | N/A | N/A | Text, Video | 2025-07 | arXiv |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | N/A | N/A | Text, Video | 2025-07 | arXiv |
| Scaling RL to Long Videos | GitHub | Hugging Face | Text, Video | 2025-07 | NeurIPS 2025 |
| Kwai Keye-VL Technical Report | GitHub | N/A | Text, Video | 2025-07 | arXiv |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | GitHub | N/A | Text, Video | 2025-07 | ACM-MM 2025 |
| Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-07 | EMNLP 2025 |
| Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | N/A | N/A | Text, Video | 2025-07 | arXiv |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | N/A | N/A | Text, Video | 2025-06 | arXiv |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | GitHub | N/A | Text, Video | 2025-06 | arXiv |
| DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning | N/A | N/A | Text, Video | 2025-06 | arXiv |
| VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks | GitHub | Hugging Face | Text, Video | 2025-06 | arXiv |
| HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | GitHub | N/A | Audio, Video, Text | 2025-06 | arXiv |
| MiMo-VL Technical Report | GitHub | Hugging Face | Text, Video | 2025-06 | arXiv |
| Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | GitHub | N/A | Text, Video | 2025-06 | EMNLP 2025 (Findings) |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | GitHub | Hugging Face | Text, Video | 2025-06 | arXiv |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | GitHub | Hugging Face | Text, Video | 2025-06 | arXiv |
| VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking | N/A | N/A | Text, Video | 2025-06 | arXiv |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | GitHub | N/A | Text, Video | 2025-06 | NeurIPS 2025 |
| ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding | N/A | N/A | Text, Video | 2025-06 | arXiv |
| DIVE: Deep-search Iterative Video Exploration | GitHub | N/A | Text, Video | 2025-06 | CVPR 2025 |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | GitHub | N/A | Text, Video | 2025-06 | arXiv |
| Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency | N/A | N/A | Text, Video | 2025-06 | arXiv |
| DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO | GitHub | N/A | Text, Video | 2025-06 | NeurIPS 2025 |
| Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | N/A | Project Page | Text, Video | 2025-06 | arXiv |
| VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning | N/A | N/A | Text, Video | 2025-06 | arXiv |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | GitHub | Hugging Face | Text, Video | 2025-06 | arXiv |
| Reinforcing Video Reasoning with Focused Thinking | GitHub | Hugging Face | Text, Video | 2025-05 | arXiv |
| A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding | GitHub | N/A | Text, Video | 2025-05 | arXiv |
| Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | GitHub | Hugging Face | Text, Audio, Video | 2025-05 | NeurIPS 2025 |
| Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | GitHub | N/A | Text, Video | 2025-05 | NeurIPS 2025 |
| VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization | GitHub | Hugging Face | Text, Video | 2025-05 | arXiv |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | GitHub | N/A | Text, Speech, Video | 2025-05 | NeurIPS 2025 |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | GitHub | Hugging Face | Text, Video | 2025-05 | NeurIPS 2025 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | GitHub | Hugging Face | Text, Video | 2025-05 | arXiv |
| VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | GitHub | Hugging Face | Text, Video | 2025-05 | NeurIPS 2025 |
| Seed1.5-VL Technical Report | N/A | N/A | Text, Video | 2025-05 | arXiv |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | GitHub | Hugging Face | Text, Video | 2025-05 | arXiv |
| Fostering Video Reasoning via Next-Event Prediction | GitHub | N/A | Text, Video | 2025-05 | arXiv |
| SiLVR: A Simple Language-based Video Reasoning Framework | GitHub | N/A | Text, Video | 2025-05 | arXiv |
| RVTBench: A Benchmark for Visual Reasoning Tasks | GitHub | Hugging Face | Text, Video | 2025-05 | arXiv |
| CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning | N/A | N/A | Text, Video | 2025-05 | arXiv |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | GitHub | N/A | Text, Video | 2025-05 | arXiv |
| AVA: Towards Agentic Video Analytics with Vision Language Models | GitHub | N/A | Text, Video | 2025-05 | NSDI 2026 |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-04 | arXiv |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | GitHub | Hugging Face | Text, Video | 2025-04 | arXiv |
| Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning | GitHub | Hugging Face | Text, Video | 2025-04 | arXiv |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | GitHub | Hugging Face | Text, Video | 2025-04 | arXiv |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | GitHub | N/A | Text, Video | 2025-04 | arXiv |
| LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | N/A | N/A | Text, Video | 2025-04 | arXiv |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | N/A | Hugging Face | Text, Video | 2025-04 | arXiv |
| MR. Video: "MapReduce" is the Principle for Long Video Understanding | GitHub | N/A | Text, Video | 2025-04 | arXiv |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | GitHub | Hugging Face | Text, Video | 2025-04 | arXiv |
| WikiVideo: Article Generation from Multiple Videos | GitHub | N/A | Text, Video | 2025-04 | arXiv |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | GitHub | Hugging Face | Text, Video | 2025-03 | arXiv |
| Video-R1: Reinforcing Video Reasoning in MLLMs | GitHub | Hugging Face | Text, Video | 2025-03 | NeurIPS 2025 |
| TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | GitHub | Hugging Face | Text, Video | 2025-03 | NeurIPS 2025 |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | N/A | N/A | Text, Video | 2025-03 | NeurIPS 2025 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-03 | arXiv |
| Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs | GitHub | N/A | Audio, Video, Text | 2025-03 | ICCV 2025 |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | GitHub | Hugging Face | Audio, Video, Text | 2025-02 | arXiv |
| TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding | GitHub | N/A | Text, Video | 2025-02 | ACL 2025 (Oral) |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | GitHub | N/A | Text, Video | 2025-02 | arXiv |
| Temporal Preference Optimization for Long-Form Video Understanding | GitHub | Hugging Face | Text, Video | 2025-01 | arXiv |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | GitHub | Hugging Face | Text, Video | 2025-01 | ACL 2025 (Findings) |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-01 | IEEE TPAMI |
| Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs | N/A | N/A | Text, Video | 2025-01 | arXiv |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | GitHub | N/A | Text, Video | 2025-01 | ICML 2024 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | GitHub | Hugging Face | Text, Video | 2024-12 | arXiv |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | N/A | N/A | Text, Video | 2024-12 | CVPR 2025 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | GitHub | Hugging Face | Text, Video | 2024-11 | CVPR 2025 |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | N/A | N/A | Text, Video | 2024-10 | NeurIPS 2024 (Workshop) |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | GitHub | N/A | Text, Video | 2024-09 | EMNLP 2024 (Findings) |
| MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | GitHub | Hugging Face | Text, Video | 2024-09 | NeurIPS 2024 (Spotlight) |

🕹️ CoF-based Video Reasoning

| Title | Code | Checkpoint | Time | Venue |
| --- | --- | --- | --- | --- |
| CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation | GitHub | N/A | 2026-01 | arXiv |
| Unified Video Editing with Temporal Reasoner | GitHub | Hugging Face | 2025-12 | arXiv |
| Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven’ Matrices | GitHub | N/A | 2025-12 | arXiv |
| McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning | GitHub | N/A | 2025-11 | arXiv |
| In-Video Instructions: Visual Signals as Generative Control | GitHub | N/A | 2025-11 | arXiv |
| Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO | GitHub | Hugging Face | 2025-11 | arXiv |
| Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks | GitHub | Hugging Face | 2025-11 | arXiv |
| Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm | GitHub | N/A | 2025-11 | arXiv |
| Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark | GitHub | Hugging Face | 2025-10 | arXiv |
| VChain: Chain-of-Visual-Thought for Reasoning in Video Generation | GitHub | N/A | 2025-10 | arXiv |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | GitHub | N/A | 2025-06 | arXiv |

🌈 Interleaved Video Reasoning

| Title | Code | Checkpoint | Time | Venue |
| --- | --- | --- | --- | --- |
| LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling | GitHub | Hugging Face | 2025-11 | arXiv |
| JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation | GitHub | N/A | 2025-11 | NeurIPS 2025 (Spotlight) |
| Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination | GitHub | N/A | 2025-11 | arXiv |
| Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution | N/A | N/A | 2025-11 | arXiv |
| Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning | N/A | N/A | 2025-10 | arXiv |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | GitHub | Hugging Face | 2025-10 | ACM-MM 2025 |
| FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning | N/A | N/A | 2025-09 | arXiv |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | GitHub | Hugging Face | 2025-08 | arXiv |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | GitHub | Hugging Face | 2024-09 | ICLR 2025 |

🔁 Streaming Video Reasoning

| Title | Code | Checkpoint | Time | Venue |
| --- | --- | --- | --- | --- |
| Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously | GitHub | N/A | 2026-03 | arXiv |
| LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | GitHub | Hugging Face | 2025-11 | NeurIPS 2025 |
| StreamingVLM: Real-Time Understanding for Infinite Video Streams | GitHub | N/A | 2025-10 | arXiv |
| StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding | N/A | N/A | 2025-10 | arXiv |
| StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | GitHub | N/A | 2025-10 | ACM-MM 2025 |
| StreamForest: Efficient Online Video Understanding with Persistent Event Memory | GitHub | Hugging Face | 2025-09 | NeurIPS 2025 (Spotlight) |
| StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling | GitHub | Hugging Face | 2025-07 | arXiv |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | GitHub | Hugging Face | 2025-06 | ICCV 2025 |
| StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | GitHub | N/A | 2025-05 | NeurIPS 2025 |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | N/A | N/A | 2025-05 | arXiv |
| TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | GitHub | Hugging Face | 2025-04 | ACM-MM 2025 |
| LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | GitHub | N/A | 2025-04 | arXiv |
| ViSpeak: Visual Instruction Feedback in Streaming Videos | GitHub | Model Zoo | 2025-03 | ICCV 2025 |
| StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition | GitHub | N/A | 2025-03 | ICCV 2025 |
| Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | GitHub | N/A | 2025-03 | ICLR 2025 |
| SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | GitHub | Hugging Face | 2025-02 | ICLR 2025 (Spotlight) |
| Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction | GitHub | Hugging Face | 2025-01 | CVPR 2025 |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | GitHub | N/A | 2025-01 | ICLR 2025 |
| Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method | GitHub | Hugging Face | 2025-01 | CVPR 2025 |
| StreamChat: Chatting with Streaming Video | N/A | N/A | 2024-11 | arXiv |

✨️ Benchmarks

| Name | Paper | Link | Task | Time | Venue |
| --- | --- | --- | --- | --- | --- |
| MMGR | MMGR: Multi-Modal Generative Reasoning | GitHub, Hugging Face | Vision | 2025-12 | arXiv |
| MM-CoT | MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models | N/A | Language | 2025-12 | arXiv |
| RULER-Bench | RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence | GitHub, Hugging Face | Vision | 2025-12 | arXiv |
| AV-SpeakerBench | See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models | GitHub | Language | 2025-12 | arXiv |
| PAI-Bench | PAI-Bench: A Comprehensive Benchmark For Physical AI | GitHub | Language, Vision | 2025-12 | arXiv |
| Envision | Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights | GitHub | Vision | 2025-12 | arXiv |
| STREAMGAZE | STREAMGAZE: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos | GitHub, Hugging Face | Streaming, Language | 2025-12 | arXiv |
| V-ReasonBench | V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models | GitHub | Vision | 2025-11 | arXiv |
| VR-Bench | Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks | GitHub, Hugging Face | Vision | 2025-11 | arXiv |
| Gen-ViRe | Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark | GitHub | Vision | 2025-11 | arXiv |
| TiViBench | TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models | GitHub | Vision | 2025-11 | arXiv |
| VideoThinkBench | Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm | GitHub | Vision | 2025-11 | arXiv |
| MME-CoF | Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark | Hugging Face | Vision | 2025-10 | arXiv |
| SciVideoBench | SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models | GitHub | Language | 2025-10 | arXiv |
| ReasoningTrack | ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | GitHub | Language | 2025-08 | arXiv |
| METER | METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | N/A | Language | 2025-07 | arXiv |
| Video-TT | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Hugging Face | Language | 2025-07 | ICCV 2025 |
| ImplicitQA | ImplicitQA: Going beyond frames towards Implicit Video Reasoning | Hugging Face | Language | 2025-06 | arXiv |
| Video-CoT | Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | Hugging Face | Language | 2025-06 | arXiv |
| Implicit-VideoQA | Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | GitHub | Language | 2025-06 | arXiv |
| MORSE-500 | MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | GitHub, Hugging Face | Language | 2025-06 | arXiv |
| SpookyBench | Time Blindness: Why Video-Language Models Can't See What Humans Can | GitHub, Hugging Face | Language | 2025-05 | arXiv |
| VideoReasonBench | VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | GitHub, Hugging Face | Language | 2025-05 | arXiv |
| Video-Holmes | Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? | GitHub | Language | 2025-05 | arXiv |
| VideoEval-Pro | VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | GitHub, Hugging Face | Language | 2025-05 | arXiv |
| VBenchComp | Breaking Down Video LLM Benchmarks | N/A | Language | 2025-05 | arXiv |
| RVTBench | RVTBench: A Benchmark for Visual Reasoning Tasks | GitHub, Hugging Face | Language | 2025-05 | arXiv |
| VCRBench | VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | GitHub | Language | 2025-05 | arXiv |
| RTV-Bench | RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | GitHub, Hugging Face | Streaming, Language | 2025-05 | NeurIPS 2025 (D&B) |
| MINERVA | MINERVA: Evaluating Complex Video Reasoning | GitHub | Language | 2025-05 | arXiv |
| VCR-Bench | VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning | GitHub, Hugging Face | Language | 2025-04 | arXiv |
| SEED-Bench-R1 | Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | GitHub, Hugging Face | Language | 2025-03 | arXiv |
| H2VU-Benchmark | H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | GitHub | Streaming, Language | 2025-03 | arXiv |
| OmniMMI | OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | GitHub, Hugging Face | Streaming, Language | 2025-03 | CVPR 2025 |
| HAVEN | Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | GitHub, Hugging Face | Language | 2025-03 | arXiv |
| V-STaR | V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | GitHub, Hugging Face | Language | 2025-03 | arXiv |
| COVER | Reasoning is All You Need for Video Generalization | GitHub | Language | 2025-03 | ACL 2025 (Findings) |
| MOMA-QA | Towards Fine-Grained Video Question Answering | N/A | Language | 2025-03 | arXiv |
| SVBench | SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | GitHub | Streaming, Language | 2025-02 | ICLR 2025 (Spotlight) |
| StreamBench | Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | GitHub, Hugging Face | Streaming, Language | 2025-01 | ICLR 2025 |
| MMVU | MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | GitHub, Hugging Face | Language | 2025-01 | arXiv |
| OVO-Bench | OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | GitHub, Hugging Face | Streaming, Language | 2025-01 | CVPR 2025 |
| HLV-1K | HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding | GitHub | Language | 2025-01 | ICME 2025 |
| OVBench | Online Video Understanding: OVBench and VideoChat-Online | GitHub, Hugging Face | Streaming, Language | 2025-01 | CVPR 2025 |
| VSI-Bench | Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | GitHub | Language | 2024-12 | CVPR 2025 (Oral) |
| 3DSRBench | 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark | Hugging Face | Language | 2024-12 | ICCV 2025 |
| BlackSwanSuite | Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events | GitHub, Hugging Face | Language | 2024-12 | CVPR 2025 |
| TOMATO | TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | GitHub | Language | 2024-10 | CVPR 2025 |
| OmnixR | OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities | N/A | Language | 2024-10 | ICLR 2025 |
| VideoVista | VideoVista: A Versatile Benchmark for Video Understanding and Reasoning | GitHub | Language | 2024-06 | arXiv |
| SOK-Bench | SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge | GitHub | Language | 2024-05 | CVPR 2024 |
| CVRR-ES | How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | GitHub | Language | 2024-05 | arXiv |

In addition, several recent and concurrent surveys have discussed multimodal or video reasoning. The works listed below offer complementary perspectives to ours, reflecting the field’s rapid and parallel development:



🌟 Star History

Star History Chart

♥️ Contributors

Contributors for Awesome Video Reasoning Landscape