Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs (ICCV 2025)
July 25, 2025
TL;DR
STTM is a training-free spatio-temporal token merging method that supports KV-cache reuse. It operates in two steps: (1) spatial merging based on a quadtree structure, and (2) temporal merging of the resulting multi-granular spatial tokens.
STTM is validated on three model families: LLaVA-Video-7B/72B, LLaVA-OneVision-7B, and Qwen2VL-7B. Evaluation covers six video QA benchmarks:
- NIAH: VNBench
- Long videos: Video-MME; LongVideoBench; MLVU
- Short videos: NExT-QA; EgoSchema
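The two merging steps in the TL;DR can be illustrated with a small conceptual sketch. This is not the repository's implementation: the cosine-similarity criterion, the fixed threshold, and the function names (`quadtree_merge`, `temporal_merge`) are illustrative assumptions made here for exposition only.

```python
import numpy as np

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two token vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def quadtree_merge(grid: np.ndarray, thr: float = 0.9) -> list[np.ndarray]:
    """Step 1 (spatial): recursively coarsen an (H, W, D) token grid.
    A block collapses into its mean token when all of its tokens are
    similar to that mean; otherwise it splits into four quadrants."""
    h, w, _ = grid.shape
    flat = grid.reshape(h * w, -1)
    mean = flat.mean(axis=0)
    if (h == 1 and w == 1) or min(_cos(t, mean) for t in flat) >= thr:
        return [mean]  # one coarse (multi-granular) token
    mh, mw = max(h // 2, 1), max(w // 2, 1)
    out: list[np.ndarray] = []
    for block in (grid[:mh, :mw], grid[:mh, mw:],
                  grid[mh:, :mw], grid[mh:, mw:]):
        if block.size:
            out.extend(quadtree_merge(block, thr))
    return out

def temporal_merge(frames: list[list[np.ndarray]], thr: float = 0.9):
    """Step 2 (temporal): scan frames in order and drop a token when a
    similar token already appeared in the previous frame, so earlier
    frames' KV-cache entries remain valid for reuse."""
    kept = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        kept.append([t for t in cur
                     if max(_cos(t, p) for p in prev) < thr])
    return kept
```

A uniform grid collapses to a single coarse token, while a heterogeneous grid keeps finer tokens in the regions that differ; temporally static tokens are then dropped frame by frame.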
Updates
- Coming in August 2025: token merging demo code will be released. Stay tuned!
- July 26, 2025: Code is now available!
- June 26, 2025: STTM is accepted to ICCV 2025.
Environment Setup
git clone https://github.com/HYUNJS/STTM.git
cd STTM
## Option 1: conda (we used virtualenv for our experiments)
conda create -n sttm python=3.10 -y
conda activate sttm
pip install -e ".[train]" --extra-index-url https://download.pytorch.org/whl/cu121 # for cu121 - default is cu124
pip install flash-attn==2.7.3 --no-build-isolation # compatible version with torch==2.5.1
Dataset Setup
Please place the model checkpoints in the ./ckpts/ folder.
The datasets are organized as follows:
datasets/
├── egoschema/
├── longvideobench/
├── mlvu/
├── nextqa/
├── videomme/
└── vnbench/
    ├── annotations/
    ├── videos/ (Optional) for feature extraction and visualization
    └── preprocess_data/
        ├── {model_name}/
        │   └── {frame_sampling_name}/
        │       ├── features/
        │       │   └── {vid}.pt
        │       └── metadata/
        │           └── {vid}.pkl
        └── llava-video-7b-qwen2-video-only/
            └── F-180_fps-1/
                ├── features/
                │   └── 10109006686_cnt_edit1.pt
                └── metadata/
                    └── 10109006686_cnt_edit1.pkl
- Each benchmark (e.g., egoschema, longvideobench, etc.) has its own folder.
- videos/: raw video files (can be removed after feature extraction).
- annotations/: annotation files for the benchmark (some are reformatted). We reformat some benchmarks and save the results in the sttm_annotations/ folder; please copy it to set up the datasets.
- preprocess_data/: preprocessed features and metadata. Model-specific data is stored in the {model_name}/ folder; llava-video-7b-qwen2-video-only/ is an example of a model directory.
- {frame_sampling_name}/: name of the frame sampling strategy used for feature extraction (e.g., F-128_fps-1 or F-180_fps-1).
- features/: extracted video features ({vid}.pt).
- metadata/: associated metadata ({vid}.pkl).
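Given the layout above, a preprocessed sample can be loaded as follows. This assumes the .pt files load with torch.load and the .pkl files with pickle (inferred from the extensions; the exact contents are not documented here), and `feature_paths`/`load_sample` are illustrative helpers, not repository code.

```python
import pickle
from pathlib import Path

def feature_paths(root: str, model: str, sampling: str, vid: str):
    """Build (feature, metadata) paths following the directory layout:
    {root}/{model}/{sampling}/features/{vid}.pt and .../metadata/{vid}.pkl."""
    base = Path(root) / model / sampling
    return base / "features" / f"{vid}.pt", base / "metadata" / f"{vid}.pkl"

def load_sample(root: str, model: str, sampling: str, vid: str):
    """Load one video's features and metadata from disk."""
    import torch  # imported lazily so the path helper stays dependency-free
    feat_p, meta_p = feature_paths(root, model, sampling, vid)
    feats = torch.load(feat_p, map_location="cpu")
    with open(meta_p, "rb") as f:
        meta = pickle.load(f)
    return feats, meta
```

For example, the concrete files shown in the tree would resolve via `feature_paths("datasets/videomme/preprocess_data", "llava-video-7b-qwen2-video-only", "F-180_fps-1", "10109006686_cnt_edit1")`.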
To help you get started easily, we provide preprocessed feature data for Video-MME and VNBench on HuggingFace. Each dataset includes multiple frame sampling setups (e.g., F-64_fps-1, F-128_fps-1). Please use the Hugging Face Hub API to selectively download only the configurations you need.
- Video-MME: https://huggingface.co/datasets/js-hyun/preprocess-videomme-data
- VNBench: https://huggingface.co/datasets/js-hyun/preprocess-vnbench-data
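One way to selectively download a single configuration is `huggingface_hub.snapshot_download` with `allow_patterns`. The repo IDs come from the links above, but the folder names inside the repos are an assumption here; list the repo files first and adjust the glob pattern accordingly.

```python
def sampling_patterns(config: str) -> list[str]:
    """Glob patterns that select only one frame-sampling setup,
    e.g. 'F-128_fps-1' (assumed naming; verify against the repo)."""
    return [f"*{config}*"]

def fetch_preprocessed(repo_id: str, config: str, local_dir: str) -> str:
    """Download only the requested configuration from a dataset repo;
    files not matching `allow_patterns` are skipped."""
    from huggingface_hub import snapshot_download  # lazy import
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=sampling_patterns(config),
        local_dir=local_dir,
    )

# Example (not executed here):
# fetch_preprocessed("js-hyun/preprocess-videomme-data",
#                    "F-128_fps-1", "datasets/videomme/preprocess_data")
```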
File Structure
The project is organized into modular components for token merging, model adaptation, and evaluation. Below is a brief overview of the key directories and scripts:
- token_merging_utils/: core implementations of the token merging algorithms.
- token_merging_monkey_patch/: monkey patch files for injecting token merging into intermediate LLM layers of the LLaVA-Video and LLaVA-OneVision models.
- token_merging_qwen2vl_monkey_patch/: monkey patch files tailored to the Qwen2VL model.
- llava/eval/video_feat_{model_name}.py: video feature extraction script. Example: video_feat_llavavideo.py
- llava/eval/eval_vidqa_by_feat_{model_name}.py: video QA evaluation using pre-extracted features.
- llava/eval/eval_vidqa_by_video_{model_name}.py: video QA evaluation directly from raw video input.
- llava/eval/metric_{dataset_name}.py: metric computation scripts specific to each dataset. Examples: metric_vnbench.py, metric_videomme.py
How to Run
Frame Extraction
To extract video frames and features, refer to the following script:
scripts/eval/run_feat_extr.sh: example commands for running feature extraction.
Reproducible Evaluation
For reproducible results, we provide a --reproduce flag that sets a fixed random seed and enables deterministic CUDA operations.
scripts/eval/run_vidqa.sh: example commands for video QA evaluation with reproducibility enabled. The basic command format is:
CUDA_VISIBLE_DEVICES=${device} \
python llava/eval/eval_vidqa_by_feat_{model_name}.py \
--reproduce \
${<data_loader_cfg>} \
${<model_cfg>} \
${<token_reduction_cfg>}
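The repository's exact `--reproduce` behavior isn't spelled out on this page; a typical "fixed seed plus deterministic CUDA" setup looks like the sketch below. The function name and default seed are illustrative, not taken from the codebase.

```python
import os
import random

import numpy as np

def set_reproducible(seed: int = 42) -> None:
    """Seed Python, NumPy, and (if available) PyTorch, and request
    deterministic CUDA kernels. Illustrative sketch only."""
    random.seed(seed)
    np.random.seed(seed)
    # Required by cuBLAS for deterministic matmuls on CUDA.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True)
    except ImportError:
        pass  # CPU-only / torch-free environments still get seeded RNGs
```

Note that deterministic CUDA kernels usually cost some throughput, so a flag like `--reproduce` is best reserved for result verification rather than benchmarking speed.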
Citation
If you find this project helpful for your research or applications, please cite our paper:
@article{hyun2025multi,
title={Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs},
author={Hyun, Jeongseok and Hwang, Sukjun and Han, Su Ho and Kim, Taeoh and Lee, Inwoong and Wee, Dongyoon and Lee, Joon-Young and Kim, Seon Joo and Shim, Minho},
journal={arXiv preprint arXiv:2507.07990},
year={2025}
}
Acknowledgement
We would like to thank the authors of the following projects for their valuable contributions, which our work builds upon or references:
- LLaVA-NeXT: We use its codebase for the LLaVA architecture, including the llava-video and llava-onevision models.
- ToMe, DyCoke, and FrameFusion: These codebases are used as references for our baseline experiments.