🎬 Video Streaming Thinking

May 21, 2026

VideoLLMs Can Watch and Think Simultaneously

Video Streaming Thinking introduces a new paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption, enabling amortized test-time scaling with real-time responsiveness.


🔍 Overview

Existing online VideoLLMs focus on efficient streaming perception but lack explicit analytical reasoning. Offline VideoLLMs with Chain-of-Thought (CoT) can reason deeply, but incur high query-answer (QA) latency that violates real-time constraints. VST bridges this gap by shifting the LLM backend from passive waiting to active, intermittent reasoning during video consumption, implementing a thinking-while-watching mechanism inspired by human neural coupling.

https://github.com/user-attachments/assets/49846db5-bf76-4cf8-b923-4b9b88117482

✨ Key Idea

Instead of deferring all reasoning until a user query arrives, VST continuously processes incoming video clips and produces intermediate streaming thoughts in real time. This front-loads and amortizes the reasoning cost, so the final response is both deeply grounded and instantly available.
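The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the VST implementation: `StreamingThinker`, `watch`, `ask`, `max_thoughts`, and the two toy callables are hypothetical names, and the string-valued "clips" stand in for real video segments and a real VideoLLM backend.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures: stand-ins for a VideoLLM backend, not the VST API.
ThinkFn = Callable[[str, List[str]], str]   # (clip, prior thoughts) -> new thought
AnswerFn = Callable[[str, List[str]], str]  # (query, thoughts) -> answer


@dataclass
class StreamingThinker:
    think: ThinkFn
    answer: AnswerFn
    max_thoughts: int = 8  # assumed cap on how many thoughts are retained
    thoughts: List[str] = field(default_factory=list)

    def watch(self, clip: str) -> str:
        """Consume one clip and reason about it before any query arrives."""
        thought = self.think(clip, list(self.thoughts))
        self.thoughts.append(thought)
        # Keep only the most recent thoughts so the memory stays bounded.
        self.thoughts = self.thoughts[-self.max_thoughts:]
        return thought

    def ask(self, query: str) -> str:
        """Answer from the accumulated thoughts, with no long reasoning pass."""
        return self.answer(query, list(self.thoughts))


def toy_think(clip: str, prior: List[str]) -> str:
    # Placeholder for the model's intermediate streaming thought.
    return f"observed {clip} after {len(prior)} earlier thoughts"


def toy_answer(query: str, thoughts: List[str]) -> str:
    # Placeholder for the model's final response.
    latest = thoughts[-1] if thoughts else "nothing yet"
    return f"{query} -> based on {len(thoughts)} thoughts; latest: {latest}"


thinker = StreamingThinker(think=toy_think, answer=toy_answer, max_thoughts=2)
for clip in ["clip_0", "clip_1", "clip_2"]:
    thinker.watch(clip)  # reasoning cost is paid here, while the video plays
print(thinker.ask("What happened last?"))  # query time only reads the thoughts
```

The point of the sketch is where the cost lands: `watch` does the reasoning clip by clip, so `ask` only has to condition on thoughts that already exist.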

๐Ÿ—๏ธ Model Zoo

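| Model | HuggingFace | OVO-Bench | StreamingBench | VideoMME | LongVideoBench | VideoHolmes |
|---|---|---|---|---|---|---|
| VST-3B | 🤗 Link | 56.2 | 75.5 | 59.5 | 54.1 | 36.1 |
| VST-7B | 🤗 Link | 59.3 | 79.5 | 64.9 | 58.0 | 41.9 |
| VST-32B | 🤗 Link | 63.5 | 80.7 | 67.2 | 60.7 | 45.1 |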

📦 Training Data

We release the full training data used for both SFT and RL stages on HuggingFace and ModelScope:

| Dataset | HuggingFace | ModelScope | Description |
|---|---|---|---|
| vst_sft_data | 🤗 Link | 🤖 Link | SFT data, including video-text pairs from multiple sources |
| vst_rl_data | 🤗 Link | 🤖 Link | RL data for the reinforcement learning stage |

📅 TODO

  • Release the paper.
  • Release checkpoint and eval code.
  • Release training code.
  • Release training data.

👍 Acknowledgement

We thank the following great works and open-source repositories:

📖 Citation

@article{guan2026videostreamingthinking,
      title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously}, 
      author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
      journal={arXiv preprint arXiv:2603.12262},
      year={2026},
}