Awesome-Video-LMM-Post-Training

March 3, 2026

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yolo Yunlong Tang¹, Jing Bi¹, Pinxin Liu¹, Zhenyu Pan², Zhangyun Tan¹, Qianxiang Shen¹, Jiani Liu¹, Hang Hua¹, Junjia Guo¹, Yunzhong Xiao³, Chao Huang¹, Zhiyuan Wang⁴, Susan Liang¹, Xinyi Liu¹, Yizhi Song⁵, Junhua Huang⁶, Jia-Xing Zhong⁷, Bozheng Li⁸, Daiqing Qi⁹, Ziyun Zeng¹, Ali Vosoughi¹, Luchuan Song¹, Zeliang Zhang¹, Daiki Shimada¹⁰, Han Liu², Jiebo Luo¹, Chenliang Xu¹

¹University of Rochester, ²Northwestern University, ³CMU, ⁴UCSB, ⁵Purdue University, ⁶UCLA, ⁷University of Oxford, ⁸Brown University, ⁹University of Virginia, ¹⁰Sony Group Corporation

News

  • [2025/10/06] 🎉 Our survey paper on Video-LMM Post-Training for Video Reasoning is now available on arXiv and Hugging Face Papers!
  • [2025/06/18] 🚀 Initial release of the Awesome-Video-LMM-Post-Training repository! We welcome contributions via Pull Requests.
  • [2025/05/04] 📢 Our survey paper, Video Understanding with Large Language Models: A Survey, has been accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)! 👉 IEEE Xplore | GitHub

Overview

This Awesome list systematically curates and tracks the latest research on the post-training of Video-LMMs, with particular emphasis on work that enhances their reasoning capabilities. Following the taxonomy of the field, we focus on three key paradigms, plus the benchmarks used to evaluate them:

  • 🧠 Reinforced Video-LMMs: Exploring how RL techniques are used to align Video-LMMs with human preferences or specific metrics. This includes methods like RLHF, DPO, and GRPO, and the design of effective reward models to enhance the logical consistency and factuality of model outputs.

  • ⚙️ SFT for Reasoning: Collecting studies that leverage SFT on meticulously curated, reasoning-centric datasets. These works often incorporate CoT or other structured formats to directly teach models how to perform complex, multi-step reasoning.

  • 🚀 Test-Time Scaling in Video Reasoning: Focusing on strategies that enhance reasoning capabilities at inference time without requiring further model training. This includes techniques like agentic frameworks, tool use, RAG, long CoT, and other methods that scale reasoning through computation.

  • 📊 Benchmarks for Video Reasoning: Including the latest and most challenging benchmarks designed specifically to evaluate the complex reasoning abilities of Video-LMMs.
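
For readers new to the RL methods named above, here is a minimal sketch of the group-relative advantage used in GRPO-style training: several answers are sampled for the same video question, each is scored by a reward function, and every reward is normalized against its own group. The function name, the example rewards, and the use of the population standard deviation are assumptions for illustration; individual papers in this list may normalize differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per sampled answer to the same prompt.
    `eps` avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one video question
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that score above the group mean get a positive advantage and are reinforced; answers below the mean get a negative one. When every sampled answer receives the same reward, all advantages are zero and the group contributes no learning signal.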

We hope this repository serves as a comprehensive and up-to-date resource hub for researchers and developers in this cutting-edge field. Contributions from the community are highly welcome via Pull Requests!

๐Ÿ“ Citation

If you find our survey useful for your research, please cite the following paper:

@misc{tang2025videollmposttraining,
  title={Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models},
  author={Yunlong Tang and Jing Bi and Pinxin Liu and Zhenyu Pan and Zhangyun Tan and Qianxiang Shen and Jiani Liu and Hang Hua and Junjia Guo and Yunzhong Xiao and Chao Huang and Zhiyuan Wang and Susan Liang and Xinyi Liu and Yizhi Song and Junhua Huang and Jia-Xing Zhong and Bozheng Li and Daiqing Qi and Ziyun Zeng and Ali Vosoughi and Luchuan Song and Zeliang Zhang and Daiki Shimada and Han Liu and Jiebo Luo and Chenliang Xu},
  journal={arXiv preprint arXiv:2510.05034},
  year={2025}
}

Latest Research in Video-LMM Post-Training

Reinforced Video-LMMs

| Title | Paper | Code | Dataset | Venue |
| --- | --- | --- | --- | --- |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | Paper | GitHub | Dataset | NeurIPS 2025 |
| VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | Paper | GitHub | | NeurIPS 2025 |
| MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning | Paper | | | |
| TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs | Paper | | | NeurIPS 2025 |
| ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding | Paper | | | |
| AdsQA: Towards Advertisement Video Understanding | Paper | GitHub | | ICCV 2025 |
| Kwai Keye-VL 1.5 Technical Report | Paper | GitHub | | |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | Paper | | | |
| Ovis2.5 Technical Report | Paper | GitHub | | |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | Paper | GitHub | | |
| TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding | Paper | | | |
| VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning | Paper | GitHub | | |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | Paper | GitHub | Dataset | |
| AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video | Paper | GitHub | | |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | Paper | | | |
| ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts | Paper | GitHub | | |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | Paper | | | |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Paper | | | |
| Scaling RL to Long Videos | Paper | GitHub | Dataset | NeurIPS 2025 |
| Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | Paper | GitHub | | EMNLP 2025 |
| Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning | Paper | | | |
| VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning | Paper | | | |
| Kwai Keye-VL Technical Report | Paper | GitHub | | |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | Paper | | | |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | Paper | GitHub | Dataset | |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Paper | GitHub | Dataset | |
| VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks | Paper | GitHub | | |
| DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO | Paper | GitHub | | |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Paper | GitHub | | |
| MiMo-VL Technical Report | Paper | GitHub | | |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Paper | GitHub | Dataset | |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | Paper | GitHub | | |
| VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking | Paper | | | |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | Paper | GitHub | | NeurIPS 2025 |
| ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding | Paper | | | |
| Reinforcing Video Reasoning with Focused Thinking | Paper | GitHub | | |
| VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | Paper | GitHub | Dataset | |
| A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding | Paper | | | |
| MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | Paper | GitHub | | |
| Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | Paper | GitHub | | |
| Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | Paper | | | |
| VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization | Paper | GitHub | | |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | Paper | | | |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | Paper | | | |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | Paper | | | |
| ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | Paper | GitHub | | |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | Paper | GitHub | Dataset | |
| BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation | Paper | | | |
| VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | Paper | GitHub | Dataset | NeurIPS 2025 |
| Seed1.5-VL Technical Report | Paper | | | |
| Compile Scene Graphs with Reinforcement Learning | Paper | | | |
| Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | Paper | | | |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | Paper | GitHub | | |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | Paper | GitHub | | |
| Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning | Paper | GitHub | Dataset | |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | Paper | | | |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Paper | GitHub | Dataset | |
| Video-R1: Reinforcing Video Reasoning in MLLMs | Paper | GitHub | Dataset | |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Paper | GitHub | Dataset | |
| TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | Paper | GitHub | Dataset | |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | Paper | GitHub | | |
| Memory-enhanced Retrieval Augmentation for Long Video Understanding | Paper | | | |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | Paper | GitHub | | |
| Unhackable Temporal Rewarding for Scalable Video MLLMs | Paper | | | |
| Temporal Preference Optimization for Long-Form Video Understanding | Paper | | | |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Paper | GitHub | | ACL 2025 Findings |
| VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning | Paper | | | |
| VideoSAVi: Self-Aligned Video Language Models without Human Supervision | Paper | | | |
| Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments | Paper | | | |
| SAIL-VL2 Technical Report | Paper | | | |
| Factorized Learning for Temporally Grounded Video-Language Models | Paper | GitHub | Dataset | ICCV 2025 |

Video-LMM SFT for Reasoning

| Title | Paper | Code | Dataset | Venue |
| --- | --- | --- | --- | --- |
| Kwai Keye-VL 1.5 Technical Report | Paper | | | |
| Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data | Paper | | | |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | Paper | | | |
| Ovis2.5 Technical Report | Paper | | | |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | Paper | | | |
| TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding | Paper | | | |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | Paper | | | |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | Paper | | | |
| ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts | Paper | | | |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | Paper | | | |
| CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks | Paper | | | |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Paper | | | |
| Scaling RL to Long Videos | Paper | GitHub | Dataset | NeurIPS 2025 |
| Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models | Paper | | | |
| Kwai Keye-VL Technical Report | Paper | GitHub | | |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | Paper | | | |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | Paper | GitHub | Dataset | |
| DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning | Paper | | | |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Paper | GitHub | Dataset | |
| VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks | Paper | | | |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Paper | GitHub | | |
| Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | Paper | GitHub | | EMNLP 2025 Findings |
| MiMo-VL Technical Report | Paper | | | |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | Paper | GitHub | | NeurIPS 2025 |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | Paper | GitHub | | |
| Universal Visuo-Tactile Video Understanding for Embodied Interaction | Paper | | | |
| Fostering Video Reasoning via Next-Event Prediction | Paper | GitHub | Dataset | |
| A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding | Paper | | | |
| Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | Paper | | | |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | Paper | | | |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | Paper | | | |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | Paper | | | |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | Paper | GitHub | Dataset | |
| VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | Paper | GitHub | Dataset | NeurIPS 2025 |
| Seed1.5-VL Technical Report | Paper | | | |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | Paper | | | |
| VEU-Bench: Towards Comprehensive Understanding of Video Editing | Paper | | | |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Paper | | | |
| Compile Scene Graphs with Reinforcement Learning | Paper | | | |
| Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | Paper | | | |
| LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | Paper | | | |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | Paper | | | |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Paper | GitHub | Dataset | |
| Video-R1: Reinforcing Video Reasoning in MLLMs | Paper | GitHub | Dataset | |
| PAVE: Patching and Adapting Video Large Language Models | Paper | | | |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Paper | GitHub | Dataset | |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Paper | GitHub | Dataset | |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | Paper | GitHub | | |
| TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs | Paper | | | |
| Memory-enhanced Retrieval Augmentation for Long Video Understanding | Paper | | | |
| Token-Efficient Long Video Understanding for Multimodal LLMs | Paper | | | |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Paper | | | |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | Paper | GitHub | | |
| Unhackable Temporal Rewarding for Scalable Video MLLMs | Paper | | | |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | Paper | | | |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Paper | GitHub | | ACL 2025 Findings |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Paper | | | |
| LongViTU: Instruction Tuning for Long-Form Video Understanding | Paper | | | |
| VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM | Paper | | | |
| Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | Paper | | | |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Paper | GitHub | | |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | Paper | | | |
| ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos | Paper | GitHub | | CVPR 2025 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | Paper | GitHub | Dataset | |
| Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments | Paper | | | |
| SAIL-VL2 Technical Report | Paper | | | |

Test-Time Scaling in Video Reasoning

| Title | Paper | Code | Dataset | Venue |
| --- | --- | --- | --- | --- |
| VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | Paper | | | |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | Paper | | | |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | Paper | | | |
| VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering | Paper | | | |
| Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference | Paper | | | |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | Paper | | | |
| EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | Paper | | | |
| Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Paper | | | |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | Paper | | | |
| Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | Paper | GitHub | | EMNLP 2025 |
| StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling | Paper | | | |
| VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning | Paper | | | |
| Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | Paper | | | |
| DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025 | Paper | GitHub | | |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Paper | | | |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | Paper | GitHub | Dataset | |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | Paper | | | |
| CogStream: Context-guided Streaming Video Question Answering | Paper | GitHub | Dataset | |
| Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency | Paper | | | |
| Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | Paper | | | |
| CyberV: Cybernetics for Test-time Scaling in Video Understanding | Paper | | | |
| Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs | Paper | | | |
| VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning | Paper | | | |
| MiMo-VL Technical Report | Paper | | | |
| Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | Paper | | | |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | Paper | GitHub | | NeurIPS 2025 |
| ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding | Paper | | | |
| SiLVR: A Simple Language-based Video Reasoning Framework | Paper | GitHub | | |
| Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | Paper | | | |
| VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | Paper | GitHub | Dataset | |
| Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | Paper | GitHub | | |
| Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | Paper | | | |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | Paper | | | |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | Paper | | | |
| ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | Paper | GitHub | | |
| RVTBench: A Benchmark for Visual Reasoning Tasks | Paper | GitHub | Dataset | |
| CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning | Paper | | | |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | Paper | | | |
| Seed1.5-VL Technical Report | Paper | | | |
| Empowering Agentic Video Analytics Systems with Video Language Models | Paper | | | |
| Divide and Conquer: Exploring Language-centric Tree Reasoning for Video Question-Answering | Paper | | | |
| SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding | Paper | GitHub | | CVPR 2025 |
| VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Paper | | | |
| MR. Video: "MapReduce" is the Principle for Long Video Understanding | Paper | | | |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | Paper | GitHub | | |
| VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT | Paper | | | |
| WikiVideo: Article Generation from Multiple Videos | Paper | | | |
| Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs | Paper | | | |
| From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment | Paper | | | |
| Agentic Keyframe Search for Video Question Answering | Paper | | | |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Paper | GitHub | Dataset | |
| Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma? | Paper | | | |
| Memory-enhanced Retrieval Augmentation for Long Video Understanding | Paper | | | |
| Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment | Paper | | | |
| QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension | Paper | GitHub | | |
| Token-Efficient Long Video Understanding for Multimodal LLMs | Paper | | | |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Paper | | | |
| TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding | Paper | GitHub | Dataset | ACL 2025 main |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | Paper | | | |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | Paper | | | |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Paper | GitHub | | ICLR 2025 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Paper | GitHub | | ACL 2025 Findings |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Paper | | | |
| VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning | Paper | | | |
| Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs | Paper | | | |
| Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding | Paper | | | |
| PruneVid: Visual Token Pruning for Efficient Video Large Language Models | Paper | | | |
| Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | Paper | | | |
| VCA: Video Curious Agent for Long Video Understanding | Paper | | | |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Paper | GitHub | | |
| VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | Paper | | | |
| VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding | Paper | | | |
| ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos | Paper | GitHub | | CVPR 2025 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | Paper | GitHub | Dataset | |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | Paper | | | |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | Paper | | | |
| MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | Paper | GitHub | Dataset | NeurIPS 2024 (Spotlight) |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | Paper | GitHub | | ICML 2024 Oral |

Benchmarks for Video Reasoning

| Title | Paper | Code | Dataset | Venue |
| --- | --- | --- | --- | --- |
| Scaling RL to Long Videos | Paper | GitHub | Dataset | NeurIPS 2025 |
| AdsQA: Towards Advertisement Video Understanding | Paper | | | |
| CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning | Paper | | | |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | Paper | | | |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | Paper | | | |
| Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Paper | | | |
| ImplicitQA: Going beyond frames towards Implicit Video Reasoning | Paper | | Dataset | |
| Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | Paper | | | |
| Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Paper | GitHub | | |
| MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Paper | GitHub | Dataset | |
| Time Blindness: Why Video-Language Models Can't See What Humans Can? | Paper | | | |
| ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding | Paper | | | |
| VidText: Towards Comprehensive Evaluation for Video Text Understanding | Paper | GitHub | Dataset | |
| Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? | Paper | GitHub | | |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | Paper | | | |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Paper | | | |
| Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? | Paper | | | |
| RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | Paper | GitHub | Dataset | NeurIPS 2025 |
| MINERVA: Evaluating Complex Video Reasoning | Paper | GitHub | | |
| VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning | Paper | GitHub | Dataset | |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Paper | GitHub | Dataset | |
| H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | Paper | | | |
| OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | Paper | GitHub | Dataset | CVPR 2025 |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Paper | GitHub | Dataset | |
| V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | Paper | GitHub | Dataset | |
| Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation | Paper | | | |
| Towards Fine-Grained Video Question Answering | Paper | | | |
| SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | Paper | | | |
| MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | Paper | GitHub | Dataset | |
| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Paper | GitHub | Dataset | |
| HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding | Paper | | | |
| Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | Paper | | | |
| 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark | Paper | | | |
| Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events | Paper | GitHub | Dataset | CVPR 2025 |
| VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | Paper | | | |
| TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | Paper | | | |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Paper | | | |
| On the Consistency of Video Large Language Models in Temporal Comprehension | Paper | GitHub | Dataset | CVPR 2025 |
| EgoExo-Con: Exploring View-Invariant Video Temporal Understanding | Paper | | | |

Related Surveys

| Title | Paper | Code | Dataset | Venue |
| --- | --- | --- | --- | --- |
| Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models | Paper | GitHub | | |
| VideoLLM Benchmarks and Evaluation: A Survey | Paper | | | |
| Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models | Paper | | | |
| Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey | Paper | | | |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Paper | | | |
| Video Understanding with Large Language Models: A Survey | Paper | | | |
