October 30, 2025

Sparking "Thinking with Videos" via Reinforcement Learning
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
Latest News
- [October 30, 2025]: Our paper is now available on arXiv and Hugging Face Papers.
- [October 28, 2025]: Our codebase and model are released. Video-Thinker-7B is now available on Hugging Face.
Overview
Video-Thinker is an end-to-end video reasoning framework that empowers MLLMs to autonomously leverage intrinsic "grounding" and "captioning" capabilities during inference. This paradigm extends "Thinking with Images" to video understanding, enabling dynamic temporal navigation and visual cue extraction without relying on external tools or pre-designed prompts. To spark this capability, we construct Video-Thinker-10K, a curated dataset with structured reasoning traces synthesized through hindsight-curation reasoning, ensuring that temporal localizations and visual descriptions genuinely contribute to correct answers. Furthermore, we propose a two-stage training strategy combining SFT for format learning and GRPO with pure outcome reward for reinforcement learning, enabling Video-Thinker to achieve state-of-the-art performance on challenging video reasoning benchmarks with remarkable data efficiency.
Overall Performance
Video-Thinker-7B achieves state-of-the-art performance among 7B-sized MLLMs across multiple challenging video reasoning benchmarks. Our model demonstrates exceptional capabilities on both in-domain and out-of-domain tasks:
- Out-of-Domain Benchmarks:
  - Video-Holmes: 43.22% (↑4.68% over best baseline)
  - CG-Bench-Reasoning: 33.25% (↑3.81% over best baseline)
  - VRBench: 80.69% (↑11.44% over best baseline)
- In-Domain Benchmarks:
  - ActivityNet: 78.72% | STAR: 70.66% | ScaleLong: 49.53%
  - YouCook2: 73.66% | LVBench: 37.04%
Our approach enables MLLMs to "Think with Videos" by autonomously leveraging intrinsic grounding and captioning capabilities, achieving superior reasoning performance with only 10K training samples.
✨ The Video-Thinker Framework
Data Synthesis Pipeline
We construct Video-Thinker-10K through a systematic pipeline that transforms diverse video data into structured reasoning samples:
- Data Sources: We curate from 6 datasets spanning multiple domains:
  - Caption-labeled (ActivityNet, TutorialVQA, YouCook2): Rich temporal annotations but no complex reasoning questions
  - QA-labeled (STAR, ScaleLong, LVBench): Challenging QA pairs but no granular visual descriptions
- Complementary Generation:
  - For caption-labeled data → Generate complex multi-segment reasoning questions
  - For QA-labeled data → Generate answer-conditioned visual descriptions for key segments
- Hindsight-Curation Reasoning: We employ a novel quality assurance process in which generated <time> and <caption> contents are validated by testing whether they enable models to derive correct answers, with up to 3 regeneration attempts to ensure high-quality supervision.
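The hindsight-curation check can be pictured as a small retry loop. The sketch below is illustrative only and is not the project's pipeline code: `generate_trace` and `answer_from_trace` are hypothetical callables standing in for the real model calls, the exact-match comparison is an assumption, and "up to 3 regeneration attempts" is read here as at most 3 generations in total.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # assumption: "up to 3 regeneration attempts" means 3 tries in total


def hindsight_curate(
    question: str,
    gold_answer: str,
    generate_trace: Callable[[str, str], str],     # hypothetical: answer-conditioned generator
    answer_from_trace: Callable[[str, str], str],  # hypothetical: verifier that sees only question + trace
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Return the first trace from which the verifier recovers the gold answer, else None."""
    for _ in range(max_attempts):
        trace = generate_trace(question, gold_answer)
        predicted = answer_from_trace(question, trace)
        # Assumed acceptance rule: case-insensitive exact match with the gold answer.
        if predicted.strip().lower() == gold_answer.strip().lower():
            return trace
    return None  # no attempt validated; the sample would be discarded


# Toy usage with stubs in place of real model calls.
drafts = iter([
    "<time>0-5</time><caption>A dog runs.</caption>",
    "<time>10-20</time><caption>A man chops onions.</caption>",
])


def stub_generate(question: str, gold: str) -> str:
    return next(drafts)


def stub_verify(question: str, trace: str) -> str:
    return "B" if "onions" in trace else "A"


curated = hindsight_curate("What is being prepared?", "B", stub_generate, stub_verify)
print(curated)
```

In this toy run the first draft fails validation and the second is accepted; a sample whose drafts all fail yields `None`.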
Training Strategy of Video-Thinker
We adopt a two-stage training approach to progressively build video reasoning capabilities:
Stage 1: SFT for Format-Following
- Initializes the model to generate structured reasoning traces with <time>, <caption>, and <think> tags
- Provides an essential cold start by teaching the specialized reasoning format
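To make the trace format concrete, here is a minimal parsing sketch. It assumes that each tag has a matching closing tag (for example `</time>`) and that segments are not nested; the README names only the opening tags, so both points, along with the timestamp format in the example string, are assumptions rather than the project's specification.

```python
import re
from typing import Dict, List

# Assumption: every <time>, <caption>, or <think> segment is closed by a matching tag.
TAG_PATTERN = re.compile(r"<(time|caption|think)>(.*?)</\1>", re.DOTALL)


def parse_trace(trace: str) -> List[Dict[str, str]]:
    """Return the tagged segments of a reasoning trace in order of appearance."""
    return [
        {"tag": match.group(1), "text": match.group(2).strip()}
        for match in TAG_PATTERN.finditer(trace)
    ]


# Hypothetical trace; the "start-end" timestamp format is made up for illustration.
example = (
    "<time>12.0-18.5</time>"
    "<caption>A person pours oil into a pan.</caption>"
    "<think>The oil is added before the onions, so frying comes first.</think>"
)

segments = parse_trace(example)
for segment in segments:
    print(segment["tag"], "->", segment["text"])
```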
Stage 2: GRPO for Autonomous Navigation
- Strengthens intrinsic grounding and captioning capabilities through reinforcement learning
- Uses outcome-based rewards (correctness + format adherence) without requiring step-wise annotations
- Enables the model to autonomously discover effective temporal reasoning strategies
- Demonstrates remarkable data efficiency (10K samples)
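The outcome-based reward and the group-relative normalization that gives GRPO its name can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's reward code: the `<answer>` tag, the `format_weight` value, the exact-match correctness check, and the use of the population standard deviation are all hypothetical choices.

```python
import re
from statistics import mean, pstdev
from typing import List

REQUIRED_TAGS = ("time", "caption", "think")
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)  # hypothetical answer tag


def format_reward(response: str) -> float:
    """1.0 if every required tag pair and an answer tag are present, else 0.0."""
    has_tags = all(
        re.search(rf"<{tag}>.*?</{tag}>", response, re.DOTALL) for tag in REQUIRED_TAGS
    )
    return 1.0 if has_tags and ANSWER_RE.search(response) else 0.0


def correctness_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold answer (case-insensitive), else 0.0."""
    match = ANSWER_RE.search(response)
    return 1.0 if match and match.group(1).strip().lower() == gold.strip().lower() else 0.0


def outcome_reward(response: str, gold: str, format_weight: float = 0.5) -> float:
    """Correctness plus weighted format adherence; the weight is an assumed value."""
    return correctness_reward(response, gold) + format_weight * format_reward(response)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of responses sampled for one question whose gold answer is "B".
body = "<time>3-9</time><caption>A pan heats up.</caption><think>Heat comes first.</think>"
group = [body + "<answer>B</answer>", "B", body + "<answer>A</answer>", "no idea"]
rewards = [outcome_reward(response, "B") for response in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because the reward depends only on the final response, no step-wise annotation of the intermediate <time> or <caption> segments is needed; responses that score above their group's mean receive positive advantages.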
Installation
# Create conda environment
conda create -n videothinker python=3.10
conda activate videothinker
# Install requirements
cd Video-Thinker
pip install -r requirements.txt
Data Preparation
Training and evaluation data are available in the data directory:
- data/train/: Training data
- data/eval/id/: In-domain evaluation data
- data/eval/ood/: Out-of-domain evaluation data
Note: Video files will be released soon. Current data files contain video IDs and annotations.
Benchmark Datasets
We evaluate on both in-domain and out-of-domain benchmarks:
Out-of-Domain:
- Video-Holmes, CG-Bench-Reasoning, VRBench
In-Domain:
- ActivityNet, STAR, ScaleLong, YouCook2, LVBench
Training Data
Video-Thinker-10K is curated from diverse video reasoning tasks:
- Caption-labeled: ActivityNet, TutorialVQA, YouCook2
- QA-labeled: STAR, ScaleLong, LVBench
Base Model
We build upon Qwen2.5-VL-7B-Instruct as our foundation model, which provides strong multimodal understanding capabilities.
Training
Step 1: Supervised Fine-Tuning (SFT)
Configure your training parameters and run:
bash scripts/run_sft_video.sh
Step 2: Group Relative Policy Optimization (GRPO)
After SFT completion, run GRPO training:
bash scripts/run_grpo_video.sh
Evaluation
Our trained model Video-Thinker-7B is available on Hugging Face. You can directly use it to evaluate on your custom video reasoning tasks.
To run batch evaluation on trained models:
python scripts/run_eval_batch.py
TODO
- Release Paper
- Release Model Weights (Video-Thinker-7B)
- Release Training & Evaluation Data (Annotations)
- Release Code
- Release Video Files
- Provide Detailed Training Guidelines
- Provide Detailed Evaluation Guidelines
Acknowledgement
We sincerely appreciate the contributions of the open-source community.
Citation
If you find Video-Thinker useful in your research, please consider citing:
@article{wang2025video,
title={Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning},
author={Wang, Shijian and Jin, Jiarui and Wang, Xingjian and Song, Linxin and Fu, Runhao and Wang, Hecheng and Ge, Zongyuan and Lu, Yuan and Cheng, Xuelian},
journal={arXiv preprint arXiv:2510.23473},
year={2025}
}
License
This project is released under the MIT License.
Contact
For any questions or feedback, please reach out to us at shijian@seu.edu.cn.