October 30, 2025

Sparking "Thinking with Videos" via Reinforcement Learning

If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

📣 Latest News

[October 30, 2025]: 📄 Our paper is now available on arXiv and HF Paper.
[October 28, 2025]: 🚀 Our codebase and model are released. You can now use Video-Thinker-7B from the Hugging Face model page.

Video-Thinker is an end-to-end video reasoning framework that empowers MLLMs to autonomously leverage intrinsic "grounding" and "captioning" capabilities during inference. This paradigm extends "Thinking with Images" to video understanding, enabling dynamic temporal navigation and visual cue extraction without relying on external tools or pre-designed prompts. To spark this capability, we construct Video-Thinker-10K, a curated dataset with structured reasoning traces synthesized through hindsight-curation reasoning, ensuring that temporal localizations and visual descriptions genuinely contribute to correct answers. Furthermore, we propose a two-stage training strategy combining SFT for format learning and GRPO with pure outcome reward for reinforcement learning, enabling Video-Thinker to achieve state-of-the-art performance on challenging video reasoning benchmarks with remarkable data efficiency.

📊 Overall Performance

Video-Thinker-7B achieves state-of-the-art performance among 7B-sized MLLMs across multiple challenging video reasoning benchmarks. Our model demonstrates exceptional capabilities on both in-domain and out-of-domain tasks:

Out-of-Domain Benchmarks:
- Video-Holmes: 43.22% (↑4.68% over best baseline)
- CG-Bench-Reasoning: 33.25% (↑3.81% over best baseline)
- VRBench: 80.69% (↑11.44% over best baseline)

In-Domain Benchmarks:
- ActivityNet: 78.72%
- STAR: 70.66%
- ScaleLong: 49.53%
- YouCook2: 73.66%
- LVBench: 37.04%

Our approach enables MLLMs to "Think with Videos" by autonomously leveraging intrinsic grounding and captioning capabilities, achieving superior reasoning performance with only 10K training samples.

✨ The Video-Thinker Framework

🔄 Data Synthesis Pipeline

We construct Video-Thinker-10K through a systematic pipeline that transforms diverse video data into structured reasoning samples:

Data Sources: We curate from 6 datasets spanning multiple domains:
- Caption-labeled (ActivityNet, TutorialVQA, YouCook2): Rich temporal annotations but lack complex reasoning questions
- QA-labeled (STAR, ScaleLong, LVBench): Challenging QA pairs but lack granular visual descriptions

Complementary Generation:
- For caption-labeled data → Generate complex multi-segment reasoning questions
- For QA-labeled data → Generate answer-conditioned visual descriptions for key segments

Hindsight-Curation Reasoning: We employ a novel quality assurance process where generated <time> and <caption> contents are validated by testing whether they enable models to derive correct answers, with up to 3 regeneration attempts to ensure high-quality supervision.
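The hindsight-curation check can be pictured as a small retry loop. The sketch below is illustrative only and is not taken from the released code: `generate_trace` and `answer_from_trace` are hypothetical callables standing in for the generator and the checking model, the case-insensitive string match is an assumed answer comparison, and "up to 3 regeneration attempts" is read here as at most 3 generations in total.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # assumption: "up to 3 regeneration attempts" means 3 tries in total


def hindsight_curate(
    question: str,
    gold_answer: str,
    generate_trace: Callable[[str], str],          # hypothetical: produces <time>/<caption> content
    answer_from_trace: Callable[[str, str], str],  # hypothetical: answers using only the trace
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Return the first trace from which the checker recovers the gold answer."""
    for _ in range(max_attempts):
        trace = generate_trace(question)
        predicted = answer_from_trace(question, trace)
        # Assumed comparison: exact match after trimming and lowercasing.
        if predicted.strip().lower() == gold_answer.strip().lower():
            return trace
    # No attempt passed; the caller decides whether to drop the sample.
    return None
```

A sample whose traces never lead to the correct answer comes back as `None`, so the caller can filter it out of the training set.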

🎯 Training Strategy of Video-Thinker

We adopt a two-stage training approach to progressively build video reasoning capabilities:

Stage 1: SFT for Format-Following

- Initializes the model to generate structured reasoning traces with <time>, <caption>, and <think> tags
- Provides an essential cold start by teaching the specialized reasoning format
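To make the tagged format concrete, here is a minimal sketch of how a trace could be split into its tagged parts. The tag names come from this README; the helper names, the "each tag appears at least once" rule, and the example trace in the usage note are assumptions for illustration, not the project's actual parser or data.

```python
import re
from typing import Dict, List

TAGS = ("time", "caption", "think")  # tag names as listed in this README


def extract_tags(trace: str) -> Dict[str, List[str]]:
    """Collect the text inside each <tag>...</tag> pair, in order of appearance."""
    found: Dict[str, List[str]] = {}
    for tag in TAGS:
        pattern = rf"<{tag}>(.*?)</{tag}>"
        # DOTALL lets a tagged span run across line breaks.
        found[tag] = [m.strip() for m in re.findall(pattern, trace, flags=re.DOTALL)]
    return found


def follows_format(trace: str) -> bool:
    """Assumed check: every tag type appears at least once in the trace."""
    parts = extract_tags(trace)
    return all(parts[tag] for tag in TAGS)
```

For example, `extract_tags("<time>00:10-00:20</time><caption>A person opens a door.</caption><think>So the answer is B.</think>")` yields one entry per tag, and `follows_format` on the same string is true.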

Stage 2: GRPO for Autonomous Navigation

- Strengthens intrinsic grounding and captioning capabilities through reinforcement learning
- Uses outcome-based rewards (correctness + format adherence) without requiring step-wise annotations
- Enables the model to autonomously discover effective temporal reasoning strategies
- Demonstrates remarkable data efficiency (10K samples)
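As a rough illustration of the outcome-based signal, the sketch below combines a correctness term and a format term into one scalar reward, then normalizes rewards within a group of sampled responses, which is the group-relative advantage that gives GRPO its name. The weights `w_correct` and `w_format`, the use of the population standard deviation, and the `eps` constant are assumptions; the values used in the actual training scripts may differ.

```python
from statistics import mean, pstdev
from typing import List


def outcome_reward(is_correct: bool, format_ok: bool,
                   w_correct: float = 1.0, w_format: float = 0.5) -> float:
    """Scalar reward from the final outcome only (assumed weights)."""
    return w_correct * float(is_correct) + w_format * float(format_ok)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantage depends only on how a response scores relative to the other responses sampled for the same question, no step-wise annotation of the reasoning trace is needed.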

🔧 Installation

# Create conda environment
conda create -n videothinker python=3.10
conda activate videothinker

# Install requirements
cd Video-Thinker
pip install -r requirements.txt

📦 Data Preparation

📂 Training and evaluation data are available in the data/ directory:

data/train/ - Training data
data/eval/id/ - In-domain evaluation data
data/eval/ood/ - Out-of-domain evaluation data

Note: Video files will be released soon. Current data files contain video IDs and annotations.

📊 Benchmark Datasets

We evaluate on both in-domain and out-of-domain benchmarks:

Out-of-Domain:

Video-Holmes, CG-Bench-Reasoning, VRBench

In-Domain:

ActivityNet, STAR, ScaleLong, YouCook2, LVBench

🎯 Training Data

Video-Thinker-10K is curated from diverse video reasoning tasks:

Caption-labeled: ActivityNet, TutorialVQA, YouCook2
QA-labeled: STAR, ScaleLong, LVBench

Step 1: Supervised Fine-Tuning (SFT)

Run SFT training first:

bash scripts/run_sft_video.sh

Step 2: Group Relative Policy Optimization (GRPO)

After SFT completion, run GRPO training:

bash scripts/run_grpo_video.sh
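If you prefer to launch both stages from one entry point, a small driver can run the scripts in order and stop as soon as one fails, so GRPO never starts from an incomplete SFT run. This is a convenience sketch, not part of the repository: `run_stages` is a hypothetical helper, and it assumes the two script paths named in this README are run from the repository root.

```python
import subprocess
from typing import Callable, Sequence

# Script paths as named in this README, relative to the repository root.
STAGES = ("scripts/run_sft_video.sh", "scripts/run_grpo_video.sh")


def run_stages(scripts: Sequence[str] = STAGES,
               run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> int:
    """Run each script with bash in order; return how many finished successfully."""
    done = 0
    for script in scripts:
        result = run(["bash", script])
        if result.returncode != 0:
            break  # do not start the next stage after a failure
        done += 1
    return done


if __name__ == "__main__":
    finished = run_stages()
    print(f"{finished} of {len(STAGES)} stages completed")
```

The `run` parameter exists only so the driver can be exercised with a stand-in during testing; by default it calls `subprocess.run`.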

📈 Evaluation

Our trained model Video-Thinker-7B is available on Hugging Face. You can directly use it to evaluate on your custom video reasoning tasks.

To run batch evaluation on trained models:

python scripts/run_eval_batch.py

📋 TODO

Release Paper
Release Model Weights (Video-Thinker-7B)
Release Training & Evaluation Data (Annotations)
Release Code
Release Video Files
Provide Detailed Training Guidelines
Provide Detailed Evaluation Guidelines

🙏 Acknowledgement

We sincerely appreciate the contributions of the open-source community.

📝 Citation

If you find Video-Thinker useful in your research, please consider citing:

@article{wang2025video,
  title={Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning},
  author={Wang, Shijian and Jin, Jiarui and Wang, Xingjian and Song, Linxin and Fu, Runhao and Wang, Hecheng and Ge, Zongyuan and Lu, Yuan and Cheng, Xuelian},
  journal={arXiv preprint arXiv:2510.23473},
  year={2025}
}