Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

November 24, 2025

Yolo Yunlong Tang¹, Daiki Shimada², Hang Hua³, Chao Huang¹, Jing Bi¹, Rogerio Feris³, Chenliang Xu¹

¹University of Rochester, ²Sony Group Corporation, ³MIT-IBM Watson AI Lab

🌟 News

[2025-11-23] Introducing Video-R4, a reinforced video agent with visual rumination for text-rich video reasoning. The arXiv paper has been released. Code, model, and dataset are coming soon.

🚀 Video-R4 Training Framework

📊 Data Curation Pipeline

📈 Performance

📦 Installation

conda create -n video-r4 python=3.10
conda activate video-r4
git clone https://github.com/yunlong10/Video-R4.git
cd Video-R4
pip install -r requirements.txt

📖 Citation

If you find this work useful, please consider citing:

@article{tang2025video-r4,
  title={Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination},
  author={Tang, Yunlong and Shimada, Daiki and Hua, Hang and Huang, Chao and Bi, Jing and Feris, Rogerio and Xu, Chenliang},
  journal={arXiv preprint arXiv:2511.17490},
  year={2025}
}

🤝 Acknowledgments

This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for their insightful discussions.

We also thank the authors of the following projects for their contributions: