Video Streaming Thinking
May 21, 2026
Video Streaming Thinking introduces a new paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption, enabling amortized test-time scaling with real-time responsiveness.
Overview
Existing online VideoLLMs focus on efficient streaming perception but lack explicit analytical reasoning. Offline VideoLLMs with Chain-of-Thought (CoT) can reason deeply, but incur high query-answer (QA) latency that violates real-time constraints. VST bridges this gap by shifting the LLM backend from passive waiting to active, intermittent reasoning during video consumption, implementing a thinking-while-watching mechanism inspired by human neural coupling.
https://github.com/user-attachments/assets/49846db5-bf76-4cf8-b923-4b9b88117482
Key Idea
Instead of deferring all reasoning until a user query arrives, VST continuously processes incoming video clips and produces intermediate streaming thoughts in real time. This front-loads and amortizes the reasoning cost, so the final response is both deeply grounded and instantly available.
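The thinking-while-watching loop described above can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the VST implementation: the `StreamingThinker` class, the `think` and `answer` callables, and the `max_thoughts` cap are all hypothetical names introduced here, and clips are represented as plain strings rather than video frames.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical signatures: both callables stand in for a VideoLLM backend.
ThinkFn = Callable[[str, List[str]], str]   # (clip, prior thoughts) -> new thought
AnswerFn = Callable[[str, List[str]], str]  # (query, thoughts) -> response


class StreamingThinker:
    """Toy thinking-while-watching loop (illustrative only)."""

    def __init__(self, think: ThinkFn, answer: AnswerFn, max_thoughts: int = 8) -> None:
        self.think = think
        self.answer = answer
        # Assumed design choice: keep only the most recent thoughts in memory.
        self.thoughts: Deque[str] = deque(maxlen=max_thoughts)

    def watch(self, clip: str) -> None:
        # Reasoning happens as each clip arrives, before any query exists.
        self.thoughts.append(self.think(clip, list(self.thoughts)))

    def respond(self, query: str) -> str:
        # At query time only the final answer remains to be produced,
        # because the reasoning cost was already paid in watch().
        return self.answer(query, list(self.thoughts))


def toy_think(clip: str, prior: List[str]) -> str:
    # Stand-in for model reasoning over one clip.
    return f"saw {clip}"


def toy_answer(query: str, thoughts: List[str]) -> str:
    # Stand-in for the final response conditioned on streaming thoughts.
    return f"{query} | grounded in {len(thoughts)} thoughts: " + "; ".join(thoughts)


thinker = StreamingThinker(think=toy_think, answer=toy_answer, max_thoughts=3)
for clip in ["clip0", "clip1", "clip2", "clip3"]:
    thinker.watch(clip)

reply = thinker.respond("What happened?")
print(reply)
```

In this sketch the per-clip work in `watch` is where the reasoning cost is amortized, so `respond` does no reasoning over raw clips. The bounded `deque` is one possible way to keep memory constant on long streams; how VST actually manages its streaming thoughts is not specified here.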
Model Zoo
| Model | HuggingFace | OVO-Bench | StreamingBench | VideoMME | LongVideoBench | VideoHolmes |
|---|---|---|---|---|---|---|
| VST-3B | Link | 56.2 | 75.5 | 59.5 | 54.1 | 36.1 |
| VST-7B | Link | 59.3 | 79.5 | 64.9 | 58.0 | 41.9 |
| VST-32B | Link | 63.5 | 80.7 | 67.2 | 60.7 | 45.1 |
Training Data
We release the full training data used for both SFT and RL stages on HuggingFace and ModelScope:
| Dataset | HuggingFace | ModelScope | Description |
|---|---|---|---|
| vst_sft_data | Link | Link | SFT data, including video-text pairs from multiple sources |
| vst_rl_data | Link | Link | RL data for the reinforcement learning stage |
TODO
- Release the paper.
- Release checkpoint and eval code.
- Release training code.
- Release training data.
Acknowledgement
We thank the following great works and open-source repositories:
Citation
@article{guan2026videostreamingthinking,
title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
journal={arXiv preprint arXiv:2603.12262},
year={2026},
}