StreamingVLM: Real-Time Understanding for Infinite Video Streams
October 13, 2025
TL;DR
StreamingVLM enables real-time, stable understanding of effectively infinite video by keeping a compact KV cache and aligning training with streaming inference. It avoids the quadratic cost of full attention and the pitfalls of sliding-window approaches, runs at up to 8 FPS on a single H100, and achieves a 66.18% win rate against GPT-4o mini on a new long-video benchmark. It also improves general VQA performance without task-specific fine-tuning.
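To make the cost argument concrete, here is a minimal arithmetic sketch. It does not reproduce StreamingVLM's actual cache policy; it only counts query-key pairs under two generic assumptions: every token attends to all earlier tokens, versus every token attends to at most a fixed number of cached tokens. The function names and the `cache_size` value are hypothetical and chosen for illustration.

```python
def full_attention_cost(num_tokens: int) -> int:
    """Total query-key pairs when each token attends to itself and all earlier tokens."""
    return num_tokens * (num_tokens + 1) // 2


def bounded_cache_cost(num_tokens: int, cache_size: int) -> int:
    """Total query-key pairs when each token attends to at most `cache_size` tokens."""
    return sum(min(t, cache_size) for t in range(1, num_tokens + 1))


if __name__ == "__main__":
    # Illustrative stream lengths; 512 is an arbitrary cache size, not a value from the paper.
    for n in (1_000, 10_000, 100_000):
        print(n, full_attention_cost(n), bounded_cache_cost(n, cache_size=512))
```

The first count grows quadratically with stream length, while the second grows linearly once the cache is full, which is why a bounded cache keeps per-frame latency roughly constant on long streams.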
Demo
Go to streamingvlm.hanlab.ai to see more cases and try our model.
https://github.com/user-attachments/assets/1a15b496-55c5-4c66-809d-a49d70e5d864
Install
./scripts/env_infer.sh
./scripts/env_sft.sh
Run the scripts above to set up the inference environment (env_infer.sh) and the SFT environment (env_sft.sh).
Inference
Run inference with the commands below:
conda activate streamingvlm-infer
python streaming_vlm/inference/inference.py
SFT
Prepare Dataset
First, download mit-han-lab/Inf-Stream-Train to /path/to/your/Inf-Stream-Train.
Then, download chenjoya/Live-WhisperX-526K to /path/to/your/Inf-Stream-Train/Livecc_sft.
Preprocess the LiveCC dataset by flattening it so that every file sits directly in Livecc_sft:
cd $DATASET_PATH/Livecc_sft
find . -type f -exec mv -t . {} +
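The `mv -t` option used above is specific to GNU coreutils. For systems without it, the same flattening can be sketched in Python. This is an illustrative equivalent, not part of the repository: the function name `flatten_directory` is hypothetical, and unlike `mv` it refuses to overwrite when two files share a name.

```python
import shutil
from pathlib import Path


def flatten_directory(root) -> int:
    """Move every file found in subdirectories of `root` directly into `root`.

    Returns the number of files moved. Raises FileExistsError on a name
    collision instead of overwriting (a deliberate difference from `mv`).
    """
    root = Path(root)
    # Collect the nested files first so that moving them does not disturb the traversal.
    nested = [p for p in root.rglob("*") if p.is_file() and p.parent != root]
    moved = 0
    for src in nested:
        dst = root / src.name
        if dst.exists():
            raise FileExistsError(f"{dst} already exists; refusing to overwrite")
        shutil.move(str(src), str(dst))
        moved += 1
    return moved


# Example call, assuming DATASET_PATH is exported as described in this section:
#   import os
#   flatten_directory(Path(os.environ["DATASET_PATH"]) / "Livecc_sft")
```

Empty subdirectories are left in place, as they are with the `find` command.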
Download mit-han-lab/Inf-Stream-Eval to /path/to/your/Inf-Stream-Eval.
Finally, set environment paths:
export DATASET_PATH=/path/to/your/Inf-Stream-Train
export EVAL_DATASET_PATH=/path/to/your/Inf-Stream-Eval
Note that the preprocessing command above relies on DATASET_PATH, so export these variables before running it.
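Before launching training, it can help to confirm that the paths from the steps above are in place. The following is a small sanity-check sketch, not a script shipped with the repository. The helper name `missing_dataset_paths` is hypothetical, and it checks only what this section states: the two environment variables and the Livecc_sft subdirectory.

```python
import os
from pathlib import Path


def missing_dataset_paths(env=os.environ):
    """Return a list of human-readable problems with the dataset environment variables."""
    problems = []
    for name in ("DATASET_PATH", "EVAL_DATASET_PATH"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    # The LiveCC data is downloaded to $DATASET_PATH/Livecc_sft in the steps above.
    train_root = env.get("DATASET_PATH")
    if train_root and Path(train_root).is_dir():
        if not (Path(train_root) / "Livecc_sft").is_dir():
            problems.append("Livecc_sft is missing under DATASET_PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_dataset_paths():
        print("WARNING:", problem)
```

An empty result means the variables are set and the directories exist; it says nothing about whether the downloads are complete.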
Run SFT
conda activate streamingvlm-sft
./scripts/sft_stage_1.sh
./scripts/sft_stage_2.sh # High Quality Annealing Data
Evaluation
Efficiency
conda activate streamingvlm-infer
./scripts/eval_efficiency.sh
You can benchmark efficiency by running the script above.
OVOBench
First, arrange the OVOBench data in the following structure:
data/ovobench
├── AutoEvalMetaData
├── COIN
├── cross_task
├── Ego4D
├── hirest
├── MovieNet
├── OpenEQA
├── ovo_bench_new.json
├── perception_test
├── star
├── thumos
├── youcook2
└── YouTube_Games
Then, prepare the OVOBench environment and run evaluation:
./scripts/env_ovo.sh
conda activate streamingvlm-ovo
./scripts/eval_OVOBench.sh
These commands create the streamingvlm-ovo environment and start the OVOBench evaluation.
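A missing folder in the layout listed above is easy to overlook, so a quick check before starting the evaluation may save a failed run. This is a hedged sketch rather than part of the repository: the helper name `missing_ovobench_entries` is hypothetical, and it verifies only that each listed entry exists, not its contents.

```python
from pathlib import Path

# Entry names copied from the data/ovobench layout listed above.
EXPECTED_ENTRIES = [
    "AutoEvalMetaData", "COIN", "cross_task", "Ego4D", "hirest", "MovieNet",
    "OpenEQA", "ovo_bench_new.json", "perception_test", "star", "thumos",
    "youcook2", "YouTube_Games",
]


def missing_ovobench_entries(root="data/ovobench"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_ovobench_entries()
    if missing:
        print("Missing OVOBench entries:", ", ".join(missing))
    else:
        print("OVOBench layout looks complete.")
```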
VQA
We use VLMEvalKit to evaluate VQA tasks.
conda activate streamingvlm-infer
./scripts/eval_VQA.sh
You can launch VQA evaluation with the script above.
Inf-Stream-Eval
conda activate streamingvlm-infer
./scripts/eval_Inf-Stream-Eval.sh
Run the script above to evaluate on Inf-Stream-Eval.
LiveSports3k-cc
conda activate streamingvlm-infer
export LIVESPORTS3K_PATH=/path/to/your/LiveSports-3K/videos
./scripts/eval_LiveSports3k-cc.sh
You can evaluate LiveSports3k-cc with the path set above.
Modify FPS
If you would like to change inference FPS, use the following command:
sed -i 's/^FPS = .*/FPS = float(os.environ.get("QWENVL_FPS", "2.0"))/' \
"$(python -c 'import inspect,qwen_vl_utils.vision_process as m; import os; print(os.path.abspath(inspect.getsourcefile(m)))')"
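The command patches the FPS line in the installed qwen_vl_utils vision_process module so that the value is read from the QWENVL_FPS environment variable, falling back to 2.0 when the variable is unset.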
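The expression that the sed command writes into the module can be tried in isolation to see how the value is resolved. The sketch below mirrors that expression; the wrapper function `resolve_fps` is hypothetical and exists only for illustration.

```python
import os


def resolve_fps(env=os.environ) -> float:
    """Mirror the patched line: use QWENVL_FPS if it is set, otherwise 2.0."""
    return float(env.get("QWENVL_FPS", "2.0"))


if __name__ == "__main__":
    print("Effective FPS:", resolve_fps())
```

With the patch applied, exporting QWENVL_FPS before starting Python (for example `export QWENVL_FPS=4`) should change the sampling rate for any code path that reads `FPS` from that module. A value that is not a valid number raises a ValueError at import time.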
Citation
If you find StreamingVLM useful or relevant to your project and research, please kindly cite our paper:
@misc{xu2025streamingvlmrealtimeunderstandinginfinite,
title={StreamingVLM: Real-Time Understanding for Infinite Video Streams},
author={Ruyi Xu and Guangxuan Xiao and Yukang Chen and Liuning He and Kelly Peng and Yao Lu and Song Han},
year={2025},
eprint={2510.09608},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.09608},
}