Accelerating Streaming Video LLMs via Hierarchical Token Compression
Yiyu Wang1*, Xuyang Liu1,2*, Xiyan Gui1,3, Xinying Lin4, Boxue Yang1,
Chenfei Liao1,5, Tailai Chen1, Linfeng Zhang1
1 EPIC Lab, SJTU · 2 Sichuan University · 3 HUST · 4 SYSU · 5 HKUST (Guangzhou)
The first plug-and-play token-compression framework for streaming video understanding.
-24.5% ViT encoding latency · -45.3% LLM pre-filling latency · up to 99% accuracy retained
on the ReKV framework; see Results
Highlights · Method · Supported Frameworks · Results · Installation & Reproduction · Evaluation · Latency Benchmark · Citation
News
- 2026.06.04: Added a runtime latency benchmark (`speed_benchmark/`) and support for StreamForest & Dispider.
- 2026.06.03: Refactored the codebase into a standalone `stc` Python package: STC-Cacher, STC-Pruner, and HF ViT integrations in a clean layout.
- 2026.02.21: STC is accepted by CVPR 2026!
- 2025.12.02: We release STC, the first plug-and-play inference-acceleration framework for streaming video understanding.
Highlights
- Streaming-First: Built for latency-sensitive, continuously arriving frames (live sports, AR glasses, long-running streams).
- STC-Cacher: Exploits temporal redundancy by selectively recomputing only the dynamic visual tokens of each frame and reusing the rest, instead of fully re-encoding every frame.
- STC-Pruner: Compresses visual tokens after encoding to shorten the LLM pre-fill sequence, while preserving spatiotemporal saliency.
- Plug-and-Play & Hardware-Aware: Model-agnostic core; drops into ReKV, StreamForest, and Dispider with one call.
Method
STC compresses visual tokens hierarchically: STC-Cacher acts within the ViT (reusing static tokens, recomputing only the dynamic ones across frames), and STC-Pruner acts before the LLM (keeping the most salient tokens to shorten pre-fill).
STC-Cacher exploits temporal change: tokens that barely change between frames are reused, and only the dynamic regions are re-encoded.
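To make the caching idea concrete, here is a minimal pure-Python sketch of similarity-based token selection. It is an illustration under assumptions, not the STC implementation: the function names (`cosine`, `select_dynamic_tokens`, `merge_with_cache`), the use of cosine similarity against the previous frame as the change measure, and the fixed `update_ratio` budget are all hypothetical choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_dynamic_tokens(prev_tokens, cur_tokens, update_ratio=0.25):
    """Return sorted indices of the tokens that changed most since the previous frame.

    Assumption: "dynamic" means lowest cosine similarity to the token at the
    same position in the previous frame; STC's actual criterion may differ.
    """
    k = max(1, round(len(cur_tokens) * update_ratio))
    sims = [cosine(p, c) for p, c in zip(prev_tokens, cur_tokens)]
    # Least similar first; ties are broken by index so the result is deterministic.
    order = sorted(range(len(sims)), key=lambda i: (sims[i], i))
    return sorted(order[:k])


def merge_with_cache(cached_out, recomputed, dynamic_idx):
    """Reuse cached outputs for static tokens and overwrite the dynamic ones."""
    out = list(cached_out)
    for i, value in zip(dynamic_idx, recomputed):
        out[i] = value
    return out


# Toy frames with four 2-d tokens each; only token 1 changes substantially.
prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
cur = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.1]]
dynamic = select_dynamic_tokens(prev, cur, update_ratio=0.25)
print(dynamic)  # [1]
```

With a budget of 0.25 on four tokens, exactly one token is re-encoded and the other three are served from the cache, which is where the ViT-side saving would come from.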
Supported Frameworks
STC's core package is framework-agnostic. STC-Cacher attaches to any HuggingFace pre-LN CLIP / SigLIP vision tower via a one-line monkey-patch; STC-Pruner is an explicit call before LLM pre-fill.
| Framework | Vision Tower | STC-Cacher | STC-Pruner | Notes |
|---|---|---|---|---|
| ReKV | SigLIP (LLaVA-OneVision) | Yes | Yes | Reference integration |
| StreamForest | SigLIP | Yes | No | Per-frame streaming cacher |
| Dispider | CLIP | Yes | No | Per-frame streaming cacher |
| LiveCC | - | WIP | WIP | Vendored; integration WIP |
Results
TL;DR: On ReKV, STC retains up to 99% of accuracy while cutting ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3%, surpassing VidCom² by 1.6 points on both OVO-Bench and StreamingBench (and ToMe by 5.6 / 5.8). Evaluated on 4 streaming VideoLLM baselines × 5 benchmarks at a 0.5 fps streaming protocol. Latencies are in seconds (ViT: encode 16 frames; LLM: pre-fill time).
Streaming benchmarks: ReKV (LLaVA-OV-7B)
| Method | OVO Real-Time | OVO Backward | OVO Forward | StreamingBench | ViT Enc. Lat. (s) | LLM Pref. Lat. (s) |
|---|---|---|---|---|---|---|
| ReKV | 64.4 | 64.6 | 52.6 | 69.1 | 103.7 | 482.4 |
| + ToMe | 53.1 | 60.7 | 46.4 | 59.4 | 70.5 (-32.0%) | 257.8 (-46.6%) |
| + VisionZip | 53.8 | 58.4 | 47.5 | 60.4 | 103.7 | 258.3 (-46.5%) |
| + VidCom² | 60.4 | 59.0 | 50.4 | 63.6 | 103.7 | 259.1 (-46.3%) |
| + STC (Cacher & Pruner) | 62.5 | 63.3 | 52.0 | 65.2 | 78.3 (-24.5%) | 263.7 (-45.3%) |
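The percentages in the latency columns can be checked directly from the table. The short sketch below assumes they are relative reductions against the ReKV baseline row; `reduction_pct` is a helper name introduced here for illustration, not part of the repository.

```python
def reduction_pct(baseline_s, compressed_s):
    """Relative latency reduction in percent, rounded to one decimal place."""
    return round(100.0 * (baseline_s - compressed_s) / baseline_s, 1)


# Values copied from the ReKV and "+ STC (Cacher & Pruner)" rows above (seconds).
vit_baseline, llm_baseline = 103.7, 482.4
vit_stc, llm_stc = 78.3, 263.7

print(reduction_pct(vit_baseline, vit_stc))  # 24.5
print(reduction_pct(llm_baseline, llm_stc))  # 45.3
```

The same formula reproduces the other rows, for example 32.0 for ToMe's ViT encoding latency (103.7 to 70.5).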
STC-Cacher generality: OVO-Bench, baseline → + STC-Cacher
| Framework | Real-Time | Backward | Forward | ViT Enc. Lat. (s) |
|---|---|---|---|---|
| Dispider | 51.0 → 49.1 | 40.1 → 36.6 | 40.4 → 39.2 | 26.4 → 18.9 (-28.4%) |
| LiveCC | 57.0 → 53.8 | 56.4 → 54.2 | 59.7 → 57.3 | 181.2 → 126.8 (-30.0%) |
| StreamForest | 61.6 → 59.1 | 70.8 → 68.2 | 54.3 → 52.3 | 103.7 → 67.7 (-34.7%) |
Offline long-video understanding: ReKV
| Method | EgoSchema | MLVU-dev | VideoMME | Avg |
|---|---|---|---|---|
| ReKV | 57.7 | 68.6 | 57.7 | 61.3 |
| + ToMe | 55.2 | 63.1 | 51.7 | 56.7 |
| + VisionZip | 55.8 | 63.2 | 51.6 | 56.9 |
| + VidCom² | 60.6 | 67.1 | 56.8 | 61.5 |
| + STC-Pruner | 60.8 | 67.6 | 57.1 | 61.8 |
See the paper for full results, per-subset VideoMME breakdowns, and ablations.
Installation
```bash
pip install -e .          # core package (requires torch)
pip install -e ".[hf]"    # + transformers, for HF CLIP / SigLIP integrations
```
Reproducing the baseline frameworks
To make the vendored frameworks easy to reproduce, we made small, documented
modifications to each upstream repo (path handling, model discovery, launch
scripts, benchmark entrypoints). For every framework we ship a pair of docs: a
REPRODUCE guide (fresh GPU machine → environment → smoke run) and a
CHANGES doc (exactly what we adapted, with the upstream git diff). Just
follow the guide for the framework you want to run.
| Framework | Quick reproduce | What changed from upstream |
|---|---|---|
| ReKV | docs/rekv/REPRODUCE.md | docs/rekv/CHANGES.md |
| StreamForest | docs/streamforest/REPRODUCE.md | docs/streamforest/CHANGES.md |
| Dispider | docs/dispider/reproduce.md | docs/dispider/changes.md |
| LiveCC | Vendored under models/livecc/ · guide TBD | - |
Evaluation
Copy a block, replace the /path/to/... placeholders with your own, and run.
Outputs land under results/. Each block runs + STC; for the baseline,
change the mode arg (rekv_stc → rekv, sf_stc → sf, dispider_stc → dispider).
ReKV (Baseline)
```bash
# Offline benchmark (dataset: mlvu / egoschema / videomme / ...)
export STC_PATCH_VISION=1 STC_TOKEN_PER_FRAME=64 STC_UPDATE_TOKEN_RATIO=0.25
bash scripts/eval_rekv/eval_offline_benchs.sh \
  --dataset mlvu --model llava_ov_7b --save_dir results/mlvu_stc

# OVO-Bench
export ANNO_PATH=/path/to/ovo_bench_new.json
export VIDEO_DIR=/path/to/src_videos
export CHUNKED_DIR=/path/to/chunked_videos
export STC_PATCH_VISION=1 STC_TOKEN_PER_FRAME=64 STC_UPDATE_TOKEN_RATIO=0.25
bash scripts/eval_rekv/ovobench_scripts/eval_rekv.sh
bash scripts/eval_rekv/ovobench_scripts/score_rekv.sh
```
StreamForest (Baseline)
```bash
export STREAMFOREST_CKPT_PATH=/path/to/StreamForest-Qwen2-7B
TASKS=ovobench bash scripts/eval_streamforest/eval_streamforest.sh sf_stc
```
Dispider (Baseline)
```bash
export MODEL_PATH=/path/to/Dispider
export CLIP_CKPT_PATH=/path/to/clip-vit-large-patch14
export ANNO_PATH=/path/to/ovo_bench_new.json
export CHUNKED_DIR=/path/to/chunked_videos
bash scripts/eval_dispider/eval_dispider_ovobench.sh dispider_stc
```
STC knobs
```bash
export STC_PATCH_VISION=1           # enable STC-Cacher (0 = baseline)
export STC_TOKEN_PER_FRAME=64       # STC-Pruner token budget per frame (196 = full; ReKV only)
export STC_UPDATE_TOKEN_RATIO=0.25  # STC-Cacher selective-recompute ratio
export STC_CACHE_INTERVAL=4         # full reference frame every N frames
```
- StreamForest & Dispider are cacher-only (no pruner). CUDA-graph replay and per-frame shared token selection are always on (not user-configurable) and fall back safely if unsupported.
- More options (GPUs, frames, tasks, sharding) are in each
scripts/eval_<framework>/README.md.
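As a rough picture of how these knobs could be consumed, the sketch below reads the four environment variables into a dataclass. This is an assumption-laden illustration, not the contents of `stc/config.py`: the class name `STCEnvConfig`, the fallback defaults, the validation rules, and the reading of `STC_CACHE_INTERVAL` as "frames 0, N, 2N, ... are fully encoded" are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class STCEnvConfig:  # hypothetical name, not the class shipped in stc/config.py
    patch_vision: bool = False
    token_per_frame: int = 196        # 196 = full budget, per the knob list above
    update_token_ratio: float = 0.25
    cache_interval: int = 4

    @classmethod
    def from_env(cls, env=None):
        """Build a config from a mapping (defaults to os.environ)."""
        env = os.environ if env is None else env
        cfg = cls(
            patch_vision=env.get("STC_PATCH_VISION", "0") == "1",
            token_per_frame=int(env.get("STC_TOKEN_PER_FRAME", "196")),
            update_token_ratio=float(env.get("STC_UPDATE_TOKEN_RATIO", "0.25")),
            cache_interval=int(env.get("STC_CACHE_INTERVAL", "4")),
        )
        # Assumed sanity checks; the real package may validate differently.
        if not 0.0 < cfg.update_token_ratio <= 1.0:
            raise ValueError("STC_UPDATE_TOKEN_RATIO must be in (0, 1]")
        if cfg.token_per_frame < 1 or cfg.cache_interval < 1:
            raise ValueError("token budget and cache interval must be positive")
        return cfg

    def is_reference_frame(self, frame_idx):
        """Assumed schedule: every cache_interval-th frame is fully encoded."""
        return frame_idx % self.cache_interval == 0


cfg = STCEnvConfig.from_env({"STC_PATCH_VISION": "1", "STC_TOKEN_PER_FRAME": "64"})
print(cfg)
print([cfg.is_reference_frame(i) for i in range(5)])  # [True, False, False, False, True]
```

Passing a plain dict instead of `os.environ` keeps the parsing testable without touching the process environment.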
Latency Benchmark
Measure the latency reduction (baseline vs. +STC) on your own GPU. The
benchmark is runtime-instrumented, needs no code changes, and defaults to
16 frames. There is one script per framework under speed_benchmark/:
```bash
GPU=0 bash speed_benchmark/run_rekv.sh          # ReKV: ViT encoding + LLM pre-fill
GPU=0 bash speed_benchmark/run_streamforest.sh  # StreamForest (SigLIP): ViT encoding
GPU=0 bash speed_benchmark/run_dispider.sh      # Dispider (CLIP): ViT encoding
```
Each script runs the baseline and +STC configurations and prints the reduction.
Pin a dedicated GPU and read the min / median over repeats: absolute latency is
GPU-dependent, so the reduction ratio is the number that should reproduce. See
speed_benchmark/README.md for options and methodology.
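The min / median reading described above can be sketched in a few lines of standard-library Python. This is not the harness in speed_benchmark/; `time_repeats` and `summarize` are hypothetical helpers, and the two lambdas are toy CPU workloads standing in for the baseline and +STC forward passes.

```python
import statistics
import time


def time_repeats(fn, repeats=5, warmup=1):
    """Wall-clock fn() several times and return the per-run durations in seconds.

    For GPU work, fn should synchronize the device before returning;
    otherwise only the asynchronous launch time is measured.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(baseline, stc):
    """Compare two timing lists by min and median, with the relative reduction."""
    out = {}
    for name, agg in (("min", min), ("median", statistics.median)):
        b, s = agg(baseline), agg(stc)
        out[name] = {"baseline": b, "stc": s, "reduction_pct": 100.0 * (b - s) / b}
    return out


# Toy stand-ins: the "compressed" workload simply does half the work.
baseline_runs = time_repeats(lambda: sum(range(200_000)))
stc_runs = time_repeats(lambda: sum(range(100_000)))
for name, row in summarize(baseline_runs, stc_runs).items():
    print(name, f"{row['reduction_pct']:.1f}%")
```

Reporting min and median rather than the mean keeps one-off stalls (other processes, clock changes) from dominating the comparison.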
Architecture
The codebase is organized around the standalone stc package; each framework consumes it through its own model wrappers / eval drivers.
Repository layout
```text
STC/
├── stc/                          Standalone Python package
│   ├── config.py                 Dataclasses + env-driven config
│   ├── core/                     Shared algorithms (token similarity, layer ratios)
│   ├── cacher/                   STC-Cacher
│   │   ├── state.py              Per-stream cache state
│   │   ├── reference_forward.py  Selective ViT-layer forwards
│   │   └── graph.py              CUDA-graph runner for selective frames
│   ├── pruner/                   STC-Pruner (scoring, anchors, index mapping, specs)
│   └── integrations/
│       ├── hf_vit.py             register_stc_cacher() for HF CLIP / SigLIP
│       └── streaming.py          enable_streaming_cacher(): per-frame wrapper
├── models/                       Framework code + entrypoints (rekv, StreamForest, Dispider, livecc)
├── speed_benchmark/              Latency harness (benchmark.py + run.sh)
├── scripts/eval_rekv/            ReKV evaluation drivers
└── results/                      Run outputs (git-ignored)
```
How the two components attach:
- STC-Cacher is a monkey-patch: `register_stc_cacher()` swaps each `vision_model.encoder.layers[*].forward` for a selective-recompute forward. For batched towers, `enable_streaming_cacher()` additionally splits a clip into per-frame calls so the cache advances frame by frame (reset once per video with `reset_streaming_cacher()`).
- STC-Pruner is an explicit call: `STCPruner().compress(...)` runs after vision encoding / projection / pooling and before LLM pre-fill.
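The monkey-patch pattern behind the cacher can be illustrated without any deep-learning dependency. The sketch below is a toy under stated assumptions, not the code in `stc/integrations/`: `ToyLayer`, `patch_layer_with_cache`, `changed_positions`, and `reset_cache` are hypothetical names, and the toy layer processes each token independently, whereas real encoder layers mix tokens through attention and need more careful handling.

```python
import types


class ToyLayer:
    """Stand-in for one encoder layer; not a real HuggingFace module."""

    def forward(self, tokens):
        return [t * 2 for t in tokens]


def changed_positions(prev_inputs, tokens):
    """Hypothetical selection rule: recompute positions whose input changed."""
    return [i for i, (p, c) in enumerate(zip(prev_inputs, tokens)) if p != c]


def patch_layer_with_cache(layer, pick_dynamic=changed_positions):
    """Swap layer.forward for a version that recomputes only selected tokens."""
    original_forward = layer.forward  # bound method, kept for the recompute path
    state = {"inputs": None, "outputs": None}

    def cached_forward(self, tokens):
        if state["outputs"] is None or len(tokens) != len(state["outputs"]):
            out = original_forward(tokens)  # reference frame: full compute
        else:
            idx = pick_dynamic(state["inputs"], tokens)
            fresh = original_forward([tokens[i] for i in idx])
            out = list(state["outputs"])    # reuse outputs of static tokens
            for i, value in zip(idx, fresh):
                out[i] = value
        state["inputs"], state["outputs"] = list(tokens), out
        return out

    def reset_cache():
        state["inputs"] = state["outputs"] = None

    layer.forward = types.MethodType(cached_forward, layer)
    layer.reset_cache = reset_cache  # call once per new video
    return layer


layer = patch_layer_with_cache(ToyLayer())
print(layer.forward([1, 2, 3]))  # [2, 4, 6]   first frame, full compute
print(layer.forward([1, 5, 3]))  # [2, 10, 6]  only position 1 is recomputed
layer.reset_cache()
```

Because the patch replaces the method on the instance, callers keep invoking `layer.forward(...)` unchanged, which is what makes this style of integration a drop-in.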
Acknowledgements
Built on the excellent work of ReKV, StreamForest, Dispider, and LiveCC. Thanks to all authors for releasing their code.
Citation
If STC helps your research, please consider citing:
```bibtex
@inproceedings{wang2026stc,
  title     = {Accelerating Streaming Video Large Language Models via Hierarchical Token Compression},
  author    = {Wang, Yiyu and Liu, Xuyang and Gui, Xiyan and Lin, Xinying and
               Yang, Boxue and Liao, Chenfei and Chen, Tailai and Zhang, Linfeng},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```
Contact
Questions about the paper or code? Email liuxuyang@stu.scu.edu.cn or ustywan8@ljmu.ac.uk.