June 8, 2026


🚀 Accelerating Streaming Video LLMs via Hierarchical Token Compression

arXiv · CVPR 2026 · PaperWeekly · Python · PyTorch · GitHub

Yiyu Wang1*, Xuyang Liu1,2*†, Xiyan Gui1,3, Xinying Lin4, Boxue Yang1,
Chenfei Liao1,5, Tailai Chen1, Linfeng Zhang1✉

1 EPIC Lab, SJTU · 2 Sichuan University · 3 HUST · 4 SYSU · 5 HKUST (Guangzhou)

⚡ The first plug-and-play token-compression framework for streaming video understanding.

↓24.5% ViT encoding latency · ↓45.3% LLM pre-filling latency · up to 99% accuracy retained
on the ReKV framework (see Results)

Highlights · Method · Supported Frameworks · Results · Installation & Reproduction · Evaluation · Latency Benchmark · Citation


🔥 News

  • 2026.06.04 🚀 Added a runtime latency benchmark (speed_benchmark/) and support for StreamForest & Dispider.
  • 2026.06.03 🧱 Refactored the codebase into a standalone stc Python package: STC-Cacher, STC-Pruner, and HF ViT integrations in a clean layout.
  • 2026.02.21 🎊 STC has been accepted to CVPR 2026!
  • 2025.12.02 🤗 We released STC, the first plug-and-play inference-acceleration framework for streaming video understanding.

✨ Highlights

⚡ Streaming-First

Built for latency-sensitive, continuously arriving frames: live sports, AR glasses, long-running streams.

🧩 STC-Cacher

Exploits temporal redundancy: selectively recomputes only the dynamic visual tokens of each frame and reuses the rest, instead of fully re-encoding every frame.

✂️ STC-Pruner

Compresses visual tokens after encoding to shorten the LLM pre-fill sequence, while preserving spatiotemporal saliency.

🔌 Plug-and-Play & Hardware-Aware

Model-agnostic core; drops into ReKV, StreamForest, and Dispider with one call.


🧠 Method

STC compresses visual tokens hierarchically: STC-Cacher acts within the ViT (reusing static tokens, recomputing only the dynamic ones across frames), and STC-Pruner acts before the LLM (keeping the most salient tokens to shorten pre-fill).

Figure: STC pipeline

STC-Cacher exploits temporal redundancy: tokens that barely change between frames are reused, and only the dynamic regions are re-encoded.

Figure: STC-Cacher reuses static tokens and recomputes dynamic ones
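The reuse/recompute split can be illustrated with a small sketch in plain Python. It assumes cosine similarity to the previous frame as the change measure and treats the encoder as a per-token function; both are simplifying assumptions for illustration (a real ViT layer mixes tokens through attention), and `select_dynamic_tokens` and `cached_encode` are hypothetical names, not part of the stc package.

```python
import math
from typing import Callable, List, Sequence, Tuple

Token = Sequence[float]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity between two token vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_dynamic_tokens(prev: List[Token], curr: List[Token], ratio: float) -> List[int]:
    """Return the indices of the `ratio` fraction of tokens that changed most."""
    k = max(1, round(len(curr) * ratio))
    # Lowest similarity to the previous frame means most dynamic.
    ranked = sorted(range(len(curr)), key=lambda i: cosine(prev[i], curr[i]))
    return sorted(ranked[:k])


def cached_encode(
    prev_in: List[Token],
    prev_out: List[Token],
    curr_in: List[Token],
    encode: Callable[[Token], Token],
    ratio: float,
) -> Tuple[List[Token], List[int]]:
    """Reuse cached outputs for static tokens; re-encode only the dynamic ones."""
    dynamic = select_dynamic_tokens(prev_in, curr_in, ratio)
    out = list(prev_out)  # start from the cached outputs
    for i in dynamic:
        out[i] = encode(curr_in[i])  # recompute only where the frame changed
    return out, dynamic


def toy_encoder(tok: Token) -> Token:
    """Stand-in for the expensive encoder (assumption: per-token, stateless)."""
    return [2.0 * x for x in tok]


# Two toy frames with four 2-D tokens each; only token 1 changes substantially.
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
curr_frame = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.1]]
prev_out = [toy_encoder(t) for t in prev_frame]

out, dynamic = cached_encode(prev_frame, prev_out, curr_frame, toy_encoder, ratio=0.25)
print(dynamic)  # [1]
```

With a ratio of 0.25, only one of the four tokens is re-encoded; the slightly changed token 3 keeps its cached output, which is the accuracy/latency trade-off the ratio controls.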

🧩 Supported Frameworks

STC's core package is framework-agnostic. STC-Cacher attaches to any HuggingFace pre-LN CLIP / SigLIP vision tower via a one-line monkey-patch; STC-Pruner is an explicit call before LLM pre-fill.

| Framework | Vision Tower | STC-Cacher | STC-Pruner | Notes |
|---|---|---|---|---|
| ReKV | SigLIP (LLaVA-OneVision) | ✅ | ✅ | Reference integration |
| StreamForest | SigLIP | ✅ | - | Per-frame streaming cacher |
| Dispider | CLIP | ✅ | - | Per-frame streaming cacher |
| LiveCC | - | 🔜 | 🔜 | Vendored; integration WIP |

📊 Results

TL;DR: On ReKV, STC retains up to 99% of accuracy while cutting ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3%, surpassing VidCom² by 1.6 on both OVO-Bench and StreamingBench (and ToMe by 5.6 / 5.8). Evaluated on 4 streaming VideoLLM baselines × 5 benchmarks at a 0.5 fps streaming protocol. Latencies are in seconds (ViT: encode 16 frames; LLM: pre-fill time).

Streaming benchmarks: ReKV (LLaVA-OV-7B)

| Method | OVO Real-Time | OVO Backward | OVO Forward | StreamingBench | ViT Enc. Lat. | LLM Pref. Lat. |
|---|---|---|---|---|---|---|
| ReKV | 64.4 | 64.6 | 52.6 | 69.1 | 103.7 | 482.4 |
| + ToMe | 53.1 | 60.7 | 46.4 | 59.4 | 70.5 ↓32.0% | 257.8 ↓46.6% |
| + VisionZip | 53.8 | 58.4 | 47.5 | 60.4 | 103.7 | 258.3 ↓46.5% |
| + VidCom² | 60.4 | 59.0 | 50.4 | 63.6 | 103.7 | 259.1 ↓46.3% |
| + STC (Cacher & Pruner) | 62.5 | 63.3 | 52.0 | 65.2 | 78.3 ↓24.5% | 263.7 ↓45.3% |
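The reduction and retention figures follow directly from the table entries. The short sketch below recomputes them for the + STC row; the helper names are illustrative only.

```python
def reduction_pct(baseline: float, compressed: float) -> float:
    """Relative latency reduction in percent, rounded to one decimal."""
    return round((1.0 - compressed / baseline) * 100.0, 1)


def retained_pct(baseline_acc: float, compressed_acc: float) -> float:
    """Share of the baseline accuracy that is kept, in percent."""
    return round(compressed_acc / baseline_acc * 100.0, 1)


# Values copied from the ReKV and + STC (Cacher & Pruner) rows.
vit_reduction = reduction_pct(103.7, 78.3)    # ViT encoding latency
llm_reduction = reduction_pct(482.4, 263.7)   # LLM pre-fill latency
forward_retained = retained_pct(52.6, 52.0)   # OVO Forward accuracy

print(vit_reduction)     # 24.5
print(llm_reduction)     # 45.3
print(forward_retained)  # 98.9
```

The "up to 99%" retention corresponds to the OVO Forward column, where STC keeps 52.0 of the baseline's 52.6.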

STC-Cacher generality: OVO-Bench, baseline → + STC-Cacher

| Framework | Real-Time | Backward | Forward | ViT Enc. Lat. |
|---|---|---|---|---|
| Dispider | 51.0 → 49.1 | 40.1 → 36.6 | 40.4 → 39.2 | 26.4 → 18.9 ↓28.4% |
| LiveCC | 57.0 → 53.8 | 56.4 → 54.2 | 59.7 → 57.3 | 181.2 → 126.8 ↓30.0% |
| StreamForest | 61.6 → 59.1 | 70.8 → 68.2 | 54.3 → 52.3 | 103.7 → 67.7 ↓34.7% |

Offline long-video understanding: ReKV

| Method | EgoSchema | MLVU-dev | VideoMME | Avg |
|---|---|---|---|---|
| ReKV | 57.7 | 68.6 | 57.7 | 61.3 |
| + ToMe | 55.2 | 63.1 | 51.7 | 56.7 |
| + VisionZip | 55.8 | 63.2 | 51.6 | 56.9 |
| + VidCom² | 60.6 | 67.1 | 56.8 | 61.5 |
| + STC-Pruner | 60.8 | 67.6 | 57.1 | 61.8 |

See the paper for full results, per-subset VideoMME breakdowns, and ablations.


🛠 Installation

pip install -e .            # core package (requires torch)
pip install -e ".[hf]"      # + transformers, for HF CLIP / SigLIP integrations (quoted so shells like zsh do not expand the brackets)
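A quick way to confirm the install is to check that the expected modules can be found. This is a minimal sketch that assumes the package is importable as `stc` and that the `hf` extra pulls in `transformers`, as the comments above suggest; `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


# Assumed module names: "stc" for the core package, "transformers" for the hf extra.
core_missing = missing_packages(["torch", "stc"])
hf_missing = missing_packages(["transformers"])

print("core:", "ok" if not core_missing else f"missing {core_missing}")
print("hf extra:", "ok" if not hf_missing else f"missing {hf_missing}")
```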

Reproducing the baseline frameworks

To make the vendored frameworks easy to reproduce, we made small, documented modifications to each upstream repo (path handling, model discovery, launch scripts, benchmark entrypoints). For every framework we ship a pair of docs: a REPRODUCE guide (fresh GPU machine → environment → smoke run) and a CHANGES doc (exactly what we adapted, with the upstream git diff). Follow the guide for the framework you want to run.

| Framework | Quick reproduce | What changed from upstream |
|---|---|---|
| ReKV | docs/rekv/REPRODUCE.md | docs/rekv/CHANGES.md |
| StreamForest | docs/streamforest/REPRODUCE.md | docs/streamforest/CHANGES.md |
| Dispider | docs/dispider/REPRODUCE.md | docs/dispider/CHANGES.md |
| LiveCC | Vendored under models/livecc/ · guide TBD | - |

🧪 Evaluation

Copy a block, replace the /path/to/... placeholders with your own paths, and run. Outputs land under results/. Each block runs the + STC configuration; for the baseline, change the mode argument (rekv_stc → rekv, sf_stc → sf, dispider_stc → dispider).

ReKV (Baseline)


# Offline benchmark (dataset: mlvu / egoschema / videomme / ...)
export STC_PATCH_VISION=1 STC_TOKEN_PER_FRAME=64 STC_UPDATE_TOKEN_RATIO=0.25
bash scripts/eval_rekv/eval_offline_benchs.sh \
  --dataset mlvu --model llava_ov_7b --save_dir results/mlvu_stc

# OVO-Bench
export ANNO_PATH=/path/to/ovo_bench_new.json
export VIDEO_DIR=/path/to/src_videos
export CHUNKED_DIR=/path/to/chunked_videos
export STC_PATCH_VISION=1 STC_TOKEN_PER_FRAME=64 STC_UPDATE_TOKEN_RATIO=0.25
bash scripts/eval_rekv/ovobench_scripts/eval_rekv.sh
bash scripts/eval_rekv/ovobench_scripts/score_rekv.sh

StreamForest (Baseline)

export STREAMFOREST_CKPT_PATH=/path/to/StreamForest-Qwen2-7B
TASKS=ovobench bash scripts/eval_streamforest/eval_streamforest.sh sf_stc

Dispider (Baseline)

export MODEL_PATH=/path/to/Dispider
export CLIP_CKPT_PATH=/path/to/clip-vit-large-patch14
export ANNO_PATH=/path/to/ovo_bench_new.json
export CHUNKED_DIR=/path/to/chunked_videos
bash scripts/eval_dispider/eval_dispider_ovobench.sh dispider_stc

STC knobs

export STC_PATCH_VISION=1          # enable STC-Cacher (0 = baseline)
export STC_TOKEN_PER_FRAME=64      # STC-Pruner token budget per frame (196 = full; ReKV only)
export STC_UPDATE_TOKEN_RATIO=0.25 # STC-Cacher selective-recompute ratio
export STC_CACHE_INTERVAL=4        # full reference frame every N frames
  • StreamForest & Dispider are cacher-only (no pruner). CUDA-graph replay and per-frame shared token selection are always on (not user-configurable) and fall back safely if unsupported.
  • More options (GPUs, frames, tasks, sharding) are in each scripts/eval_<framework>/README.md.
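To see how the knobs interact, the sketch below reads them into a dataclass and derives two quantities: the pre-fill token count for a clip and which frames act as full reference frames. The class name, the fallback defaults, and the `frame_idx % interval == 0` schedule are assumptions made for illustration; the actual logic lives in stc/config.py and may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class STCEnvConfig:  # hypothetical name, not the package's own class
    patch_vision: bool = False       # STC_PATCH_VISION
    token_per_frame: int = 196       # STC_TOKEN_PER_FRAME (196 = full)
    update_token_ratio: float = 0.25 # STC_UPDATE_TOKEN_RATIO
    cache_interval: int = 4          # STC_CACHE_INTERVAL

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "STCEnvConfig":
        """Build a config from environment variables (defaults are assumed)."""
        return cls(
            patch_vision=env.get("STC_PATCH_VISION", "0") == "1",
            token_per_frame=int(env.get("STC_TOKEN_PER_FRAME", "196")),
            update_token_ratio=float(env.get("STC_UPDATE_TOKEN_RATIO", "0.25")),
            cache_interval=int(env.get("STC_CACHE_INTERVAL", "4")),
        )

    def is_reference_frame(self, frame_idx: int) -> bool:
        """Assumed schedule: every cache_interval-th frame is fully encoded."""
        return frame_idx % self.cache_interval == 0

    def prefill_tokens(self, num_frames: int) -> int:
        """Visual tokens handed to the LLM for a clip of num_frames frames."""
        return num_frames * self.token_per_frame


cfg = STCEnvConfig.from_env({
    "STC_PATCH_VISION": "1",
    "STC_TOKEN_PER_FRAME": "64",
    "STC_UPDATE_TOKEN_RATIO": "0.25",
    "STC_CACHE_INTERVAL": "4",
})

print(cfg.prefill_tokens(16))                              # 1024 (vs. 16 * 196 = 3136 unpruned)
print([i for i in range(8) if cfg.is_reference_frame(i)])  # [0, 4]
```

With a budget of 64 tokens per frame, a 16-frame clip contributes 1024 visual tokens to pre-fill instead of 3136, which is where the shorter LLM pre-fill sequence comes from.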

⏱️ Latency Benchmark

Measure the latency reduction (baseline vs. + STC) on your own GPU. The benchmark is runtime-instrumented, needs no code changes, and defaults to 16 frames. There is one script per framework under speed_benchmark/:

GPU=0 bash speed_benchmark/run_rekv.sh          # ReKV: ViT encoding + LLM pre-fill
GPU=0 bash speed_benchmark/run_streamforest.sh  # StreamForest (SigLIP): ViT encoding
GPU=0 bash speed_benchmark/run_dispider.sh      # Dispider (CLIP): ViT encoding

Each script runs the baseline and + STC configurations and prints the reduction. Pin a dedicated GPU and read the min / median over repeats: absolute latency is GPU-dependent, so the reduction ratio is what reproduces. See speed_benchmark/README.md for options and methodology.
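The min / median reading can be reproduced for any callable with the standard library. This is a generic timing sketch, not the harness in speed_benchmark/; `time_repeats` and `reduction_ratio` are hypothetical helpers, and the workloads are CPU stand-ins.

```python
import statistics
import time
from typing import Callable, Dict


def time_repeats(fn: Callable[[], object], repeats: int = 10, warmup: int = 2) -> Dict[str, float]:
    """Time fn over several repeats and report min and median in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        # For GPU work, synchronize the device before reading the clock
        # (for example with torch.cuda.synchronize()), since kernels run asynchronously.
        samples.append(time.perf_counter() - start)
    return {"min": min(samples), "median": statistics.median(samples)}


def reduction_ratio(baseline: float, compressed: float) -> float:
    """Fraction of the baseline latency that was removed."""
    return 1.0 - compressed / baseline


# CPU stand-ins for "baseline" and "+ STC" workloads.
baseline_stats = time_repeats(lambda: sum(range(200_000)), repeats=5)
stc_stats = time_repeats(lambda: sum(range(100_000)), repeats=5)

print(f"median reduction: {reduction_ratio(baseline_stats['median'], stc_stats['median']):.1%}")
```

Comparing ratios rather than absolute times is what makes runs on different GPUs comparable.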


🏗️ Architecture

The codebase is organized around the standalone stc package; each framework consumes it through its own model wrappers / eval drivers.

Repository layout

STC/
├── stc/                      Standalone Python package
│   ├── config.py                 Dataclasses + env-driven config
│   ├── core/                     Shared algorithms (token similarity, layer ratios)
│   ├── cacher/                   STC-Cacher
│   │   ├── state.py                  Per-stream cache state
│   │   ├── reference_forward.py      Selective ViT-layer forwards
│   │   └── graph.py                  CUDA-graph runner for selective frames
│   ├── pruner/                   STC-Pruner (scoring, anchors, index mapping, specs)
│   └── integrations/
│       ├── hf_vit.py                 register_stc_cacher() for HF CLIP / SigLIP
│       └── streaming.py              enable_streaming_cacher(): per-frame wrapper
├── models/                   Framework code + entrypoints (rekv, StreamForest, Dispider, livecc)
├── speed_benchmark/          Latency harness (benchmark.py + run_*.sh)
├── scripts/eval_rekv/        ReKV evaluation drivers
└── results/                  Run outputs (git-ignored)

How the two components attach:

  • STC-Cacher is a monkey-patch: register_stc_cacher() swaps each vision_model.encoder.layers[*].forward for a selective-recompute forward. For batched towers, enable_streaming_cacher() additionally splits a clip into per-frame calls so the cache advances frame-by-frame (reset once per video with reset_streaming_cacher()).
  • STC-Pruner is an explicit call: STCPruner().compress(...) runs after vision encoding / projection / pooling and before LLM pre-fill.
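The two attachment styles can be contrasted with a toy example. The sketch below uses stand-in classes instead of a HuggingFace vision tower: `register_toy_cacher` shows the monkey-patch pattern of swapping each layer's `forward`, and `toy_compress` shows an explicit top-k call. All names here are hypothetical, the patched forward only records calls instead of doing selective recompute, and magnitude-based scoring is an assumed placeholder for the real saliency score.

```python
from typing import List


class ToyLayer:
    def forward(self, tokens: List[float]) -> List[float]:
        return [t + 1.0 for t in tokens]


class ToyTower:
    """Stand-in for a vision tower with a list of encoder layers."""

    def __init__(self, depth: int) -> None:
        self.layers = [ToyLayer() for _ in range(depth)]

    def encode(self, tokens: List[float]) -> List[float]:
        for layer in self.layers:
            tokens = layer.forward(tokens)
        return tokens


def register_toy_cacher(tower: ToyTower, calls: List[int]) -> None:
    """Monkey-patch every layer's forward with a wrapper around the original."""
    for idx, layer in enumerate(tower.layers):
        original = layer.forward  # bound method captured before patching

        def patched(tokens: List[float], _orig=original, _idx=idx) -> List[float]:
            calls.append(_idx)    # a real cacher would pick tokens to recompute here
            return _orig(tokens)

        layer.forward = patched   # instance attribute shadows the class method


def toy_compress(tokens: List[float], budget: int) -> List[float]:
    """Keep the `budget` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: abs(tokens[i]), reverse=True)
    return [tokens[i] for i in sorted(ranked[:budget])]


tower = ToyTower(depth=2)
calls: List[int] = []
register_toy_cacher(tower, calls)             # implicit: patch once, then encode as usual

features = tower.encode([0.5, -3.0, 2.0, 0.0])
pruned = toy_compress(features, budget=2)     # explicit: called before LLM pre-fill

print(features)  # [2.5, -1.0, 4.0, 2.0]
print(calls)     # [0, 1]
print(pruned)    # [2.5, 4.0]
```

Patching keeps the host framework's encoding code untouched, while the pruner stays an explicit step because only the caller knows where pre-fill begins.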

🙏 Acknowledgements

Built on the excellent work of ReKV, StreamForest, Dispider, and LiveCC. Thanks to all authors for releasing their code.


✍️ Citation

If STC helps your research, please consider citing:

@inproceedings{wang2026stc,
  title     = {Accelerating Streaming Video Large Language Models via Hierarchical Token Compression},
  author    = {Wang, Yiyu and Liu, Xuyang and Gui, Xiyan and Lin, Xinying and
               Yang, Boxue and Liao, Chenfei and Chen, Tailai and Zhang, Linfeng},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

📮 Contact

Questions about the paper or code? Email liuxuyang@stu.scu.edu.cn or ustywan8@ljmu.ac.uk.