VideoChat-R1 & -R1.5: Spatio-Temporal RL for Video Perception and Reasoning

October 17, 2025

:fire: Updates

2025/09/26: 🔥🔥🔥 We release the VideoChat-R1.5 model on Hugging Face, along with the paper and evaluation code.
2025/09/22: 🎉🎉🎉 VideoChat-R1.5 is accepted to NeurIPS 2025.
2025/04/22: 🔥🔥🔥 We release VideoChat-R1-caption on Hugging Face.
2025/04/14: 🔥🔥🔥 We release VideoChat-R1 and VideoChat-R1-thinking on Hugging Face.
2025/04/10: 🔥🔥🔥 We release the VideoChat-R1 paper and code.

🎯 Performances on Video Benchmarks

Across short-form & long-form videos, temporal grounding, video reasoning, and spatio-temporal perception, the model delivers consistently stronger results.

:parrot: Introduction

We adopt multi-task joint RL to strengthen the model’s spatio-temporal perception and reasoning capabilities.
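
This README does not spell out the reward functions, so the following is only an illustrative sketch of how a joint RL setup might score rollouts from different tasks with task-specific verifiable rewards. The task names, `temporal_iou`, `exact_match`, and `joint_reward` are all hypothetical and are not taken from the released code.

```python
from typing import Callable, Dict, Tuple, Union

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def exact_match(pred: str, gt: str) -> float:
    """1.0 if the answers agree after trimming and lowercasing, else 0.0."""
    return 1.0 if pred.strip().lower() == gt.strip().lower() else 0.0


# Hypothetical task registry: each task maps to its own reward function,
# so samples from several tasks can be mixed in one training batch.
REWARDS: Dict[str, Callable[..., float]] = {
    "temporal_grounding": temporal_iou,
    "multiple_choice": exact_match,
}


def joint_reward(task: str, pred: Union[Span, str], gt: Union[Span, str]) -> float:
    """Dispatch a prediction to the reward function of its task."""
    return REWARDS[task](pred, gt)
```

With a registry like this, a grounding sample and a question-answering sample can share the same policy update while being scored by different criteria.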

During inference, we simulate hierarchical human attention to enable the model to progressively localize the Region of Interest (ROI) within input videos. This multi-step perception process ensures that the model's performance improves with each step.
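
The multi-step process described above can be sketched as a loop that repeatedly asks the model for a narrower temporal window and an answer. This is a minimal sketch under stated assumptions, not the released inference code: `iterative_perception`, `PerceptionStep`, and the `model_step` callback are hypothetical names, the ROI is simplified to a time window, and `toy_model_step` is a stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class PerceptionStep:
    window: Window  # region of interest used after this step
    answer: str     # answer produced at this step


def iterative_perception(
    video_duration: float,
    question: str,
    model_step: Callable[[Window, str], Tuple[Window, str]],
    num_steps: int = 3,
) -> List[PerceptionStep]:
    """Progressively narrow the ROI, starting from the whole video."""
    window: Window = (0.0, video_duration)
    history: List[PerceptionStep] = []
    for _ in range(num_steps):
        proposed, answer = model_step(window, question)
        # Clamp the proposal so each step can only shrink the current window.
        start = max(window[0], min(proposed[0], window[1]))
        end = min(window[1], max(proposed[1], start))
        if end <= start:
            break  # degenerate proposal: keep the previous window and stop
        window = (start, end)
        history.append(PerceptionStep(window, answer))
    return history


def toy_model_step(window: Window, question: str) -> Tuple[Window, str]:
    """Hypothetical stand-in for the model: halves the window around t=42s."""
    start, end = window
    quarter = (end - start) / 4
    return (42.0 - quarter, 42.0 + quarter), f"answer using {start:.1f}s-{end:.1f}s"


if __name__ == "__main__":
    for step in iterative_perception(120.0, "What happens?", toy_model_step):
        print(step.window, step.answer)
```

In a real pipeline, `model_step` would re-sample frames inside the current window before querying the model, so later steps see the region of interest at higher temporal density.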

:page_facing_up: Citation

If you find this project useful in your research, please consider citing:

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}

@article{yan2025videochatr15,
  title={VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception},
  author={Yan, Ziang and Li, Xinhao and He, Yinan and Yue, Zhengrong and Zeng, Xiangyu and Wang, Yali and Qiao, Yu and Wang, Limin and Wang, Yi},
  journal={arXiv preprint arXiv:2509.21100},
  year={2025}
}

For any inquiries regarding this work, please contact us at yanziang@pjlab.org.cn.
