| Rethinking Chain-of-Thought Reasoning for Videos | GitHub  | N/A |  | 2025-12 | Arxiv |
| 1+1 > 2 : Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning | GitHub  | N/A |  | 2025-12 | Arxiv |
| TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning | N/A | N/A |  | 2025-12 | Arxiv |
| OneThinker: All-in-one Reasoning Model for Image and Video | GitHub  | Hugging Face |  | 2025-12 | Arxiv |
| WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | GitHub  | N/A |  | 2025-12 | Arxiv |
| Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding | N/A | N/A |  | 2025-12 | Arxiv |
| Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models | GitHub  | N/A |  | 2025-11 | Arxiv |
| Video-CoM: Interactive Video Reasoning via Chain of Manipulations | GitHub  | N/A |  | 2025-11 | Arxiv |
| VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning | GitHub  | N/A |  | 2025-11 | Arxiv |
| AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning | N/A | N/A |  | 2025-11 | Arxiv |
| Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding | N/A | N/A |  | 2025-11 | Arxiv |
| Video Spatial Reasoning with Object-Centric 3D Rollout | N/A | N/A |  | 2025-11 | Arxiv |
| ViSS-R1: Self-Supervised Reinforcement Video Reasoning | N/A | N/A |  | 2025-11 | Arxiv |
| Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | GitHub  | Hugging Face |  | 2025-10 | Arxiv |
| Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence | GitHub  | Hugging Face |  | 2025-10 | Arxiv |
| VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | GitHub  | Hugging Face |  | 2025-09 | Arxiv |
| MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning | N/A | N/A |  | 2025-09 | Arxiv |
| Kwai Keye-VL 1.5 Technical Report | GitHub  | Hugging Face |  | 2025-09 | Arxiv |
| Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data | GitHub  | Google_Drive |  | 2025-09 | Arxiv |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | N/A | N/A |  | 2025-08 | Arxiv |
| Ovis2.5 Technical Report | GitHub  | Hugging Face |  | 2025-08 | Arxiv |
| Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments | N/A | N/A |  | 2025-08 | Arxiv |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | GitHub  | N/A |  | 2025-08 | Arxiv |
| TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding | N/A | N/A |  | 2025-08 | Arxiv |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | GitHub  | Hugging Face |  | 2025-08 | Arxiv |
| AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video | GitHub  | Hugging Face |  | 2025-08 | Arxiv |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | N/A | N/A |  | 2025-08 | Arxiv |
| VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering | N/A | N/A |  | 2025-08 | ACM-MM 2025 |
| ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts | GitHub  | Hugging Face |  | 2025-07 | Arxiv |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | N/A | N/A |  | 2025-07 | Arxiv |
| CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks | N/A | N/A |  | 2025-07 | Arxiv |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | N/A | N/A |  | 2025-07 | Arxiv |
| Scaling RL to Long Videos | GitHub  | Hugging Face |  | 2025-07 | NeurIPS 2025 |
| Kwai Keye-VL Technical Report | GitHub  | N/A |  | 2025-07 | Arxiv |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | GitHub  | N/A |  | 2025-07 | ACM-MM 2025 |
| Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | GitHub  | Hugging Face |  | 2025-07 | EMNLP 2025 |
| Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | N/A | N/A |  | 2025-07 | Arxiv |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | N/A | N/A |  | 2025-06 | Arxiv |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | GitHub  | N/A |  | 2025-06 | Arxiv |
| DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning | N/A | N/A |  | 2025-06 | Arxiv |
| VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks | GitHub  | Hugging Face |  | 2025-06 | Arxiv |
| HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | GitHub  | N/A |  | 2025-06 | Arxiv |
| MiMo-VL Technical Report | GitHub  | Hugging Face |  | 2025-06 | Arxiv |
| Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | GitHub  | N/A |  | 2025-06 | EMNLP 2025 (Findings) |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | GitHub  | Hugging Face |  | 2025-06 | Arxiv |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | GitHub  | Hugging Face |  | 2025-06 | Arxiv |
| VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking | N/A | N/A |  | 2025-06 | Arxiv |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | GitHub  | N/A |  | 2025-06 | NeurIPS 2025 |
| ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding | N/A | N/A |  | 2025-06 | Arxiv |
| DIVE: Deep-search Iterative Video Exploration | GitHub  | N/A |  | 2025-06 | CVPR 2025 |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | GitHub  | N/A |  | 2025-06 | Arxiv |
| Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency | N/A | N/A |  | 2025-06 | Arxiv |
| DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO | GitHub  | N/A |  | 2025-06 | NeurIPS 2025 |
| Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | N/A | Project_Page |  | 2025-06 | Arxiv |
| VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning | N/A | N/A |  | 2025-06 | Arxiv |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | GitHub  | Hugging Face |  | 2025-06 | Arxiv |
| Reinforcing Video Reasoning with Focused Thinking | GitHub  | Hugging Face |  | 2025-05 | Arxiv |
| A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding | GitHub  | N/A |  | 2025-05 | Arxiv |
| Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | GitHub  | Hugging Face |  | 2025-05 | NeurIPS 2025 |
| Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | GitHub  | N/A |  | 2025-05 | NeurIPS 2025 |
| VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization | GitHub  | Hugging Face |  | 2025-05 | Arxiv |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | GitHub  | N/A |  | 2025-05 | NeurIPS 2025 |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | GitHub  | Hugging Face |  | 2025-05 | NeurIPS 2025 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | GitHub  | Hugging Face |  | 2025-05 | Arxiv |
| VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | GitHub  | Hugging Face |  | 2025-05 | NeurIPS 2025 |
| Seed1.5-VL Technical Report | N/A | N/A |  | 2025-05 | Arxiv |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | GitHub  | Hugging Face |  | 2025-05 | Arxiv |
| Fostering Video Reasoning via Next-Event Prediction | GitHub  | N/A |  | 2025-05 | Arxiv |
| SiLVR: A Simple Language-based Video Reasoning Framework | GitHub  | N/A |  | 2025-05 | Arxiv |
| RVTBench: A Benchmark for Visual Reasoning Tasks | GitHub  | Hugging Face |  | 2025-05 | Arxiv |
| CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning | N/A | N/A |  | 2025-05 | Arxiv |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | GitHub  | N/A |  | 2025-05 | Arxiv |
| AVA: Towards Agentic Video Analytics with Vision Language Models | GitHub  | N/A |  | 2025-05 | NSDI 2026 |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | GitHub  | Hugging Face |  | 2025-04 | Arxiv |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | GitHub  | Hugging Face |  | 2025-04 | Arxiv |
| Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning | GitHub  | Hugging Face |  | 2025-04 | Arxiv |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | GitHub  | Hugging Face |  | 2025-04 | Arxiv |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | GitHub  | N/A |  | 2025-04 | Arxiv |
| LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | N/A | N/A |  | 2025-04 | Arxiv |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | N/A | Hugging Face |  | 2025-04 | Arxiv |
| MR. Video: "MapReduce" is the Principle for Long Video Understanding | GitHub  | N/A |  | 2025-04 | Arxiv |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | GitHub  | Hugging Face |  | 2025-04 | Arxiv |
| WikiVideo: Article Generation from Multiple Videos | GitHub  | N/A |  | 2025-04 | Arxiv |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | GitHub  | Hugging Face |  | 2025-03 | Arxiv |
| Video-R1: Reinforcing Video Reasoning in MLLMs | GitHub  | Hugging Face |  | 2025-03 | NeurIPS 2025 |
| TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | GitHub  | Hugging Face |  | 2025-03 | NeurIPS 2025 |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | N/A | N/A |  | 2025-03 | NeurIPS 2025 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | GitHub  | Hugging Face |  | 2025-03 | Arxiv |
| Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs | GitHub  | N/A |  | 2025-03 | ICCV 2025 |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | GitHub  | Hugging Face |  | 2025-02 | Arxiv |
| TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding | GitHub  | N/A |  | 2025-02 | ACL 2025 (Oral) |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | GitHub  | N/A |  | 2025-02 | Arxiv |
| Temporal Preference Optimization for Long-Form Video Understanding | GitHub  | Hugging Face |  | 2025-01 | Arxiv |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | GitHub  | Hugging Face |  | 2025-01 | ACL 2025 (Findings) |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | GitHub  | Hugging Face |  | 2025-01 | IEEE TPAMI |
| Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs | N/A | N/A |  | 2025-01 | Arxiv |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | GitHub  | N/A |  | 2025-01 | ICML 2024 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | GitHub  | Hugging Face |  | 2024-12 | Arxiv |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | N/A | N/A |  | 2024-12 | CVPR 2025 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | GitHub  | Hugging Face |  | 2024-11 | CVPR 2025 |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | N/A | N/A |  | 2024-10 | NeurIPS 2024 (Workshop) |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | GitHub  | N/A |  | 2024-09 | EMNLP 2024 (Findings) |
| MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | GitHub  | Hugging Face |  | 2024-09 | NeurIPS 2024 (Spotlight) |