Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

June 11, 2026

The official source code for Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs.

Overview

Key Finding: Sink Tokens as an Obstacle

Through a systematic analysis, we identify sink tokens — semantically uninformative tokens that attract excessive attention — as a key obstacle to fine-grained video understanding. When sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding.

  • Sink tokens exhibit spatially persistent high attention across the temporal dimension despite carrying little semantic information.
  • Due to their high attention weights, sink tokens are preferentially retained during spatial pruning, crowding out truly informative tokens.
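The crowding-out effect can be seen in a toy example. The sketch below is not the repository's pruning code: the attention values, the token roles, and the helper `top_k_indices` are all made up for illustration. It only shows that ranking tokens by attention and keeping the top k lets high-attention sink tokens take slots that informative tokens would otherwise get.

```python
# Toy illustration (hypothetical numbers): attention-based top-k pruning
# keeps the highest-attention tokens, so sink tokens displace informative ones.

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Attention received by six visual tokens in one frame.
# Tokens 0 and 1 play the role of sink tokens: high attention, little content.
attention = [0.40, 0.35, 0.08, 0.07, 0.06, 0.04]
informative = {2, 3, 4}  # tokens assumed to carry the relevant visual evidence

kept = top_k_indices(attention, k=3)
kept_informative = [i for i in kept if i in informative]

print("kept:", kept)                                 # kept: [0, 1, 2]
print("informative tokens kept:", kept_informative)  # informative tokens kept: [2]
```

With a budget of three tokens, two slots go to the sink tokens and only one of the three informative tokens survives.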

Proposed Approach: SToP

We propose Sink-Token-aware Pruning (SToP), a simple yet effective plug-and-play method consisting of:

  1. Sink Score — Quantifies each token's tendency to behave as a sink by measuring the persistence of its attention across frames.
  2. STSP Module (Sink-Token-aware Spatial Pruning) — Adjusts the attention distribution of sink tokens, lowering their priority for retention during spatial pruning.
  3. STTP Module (Sink-Token-aware Temporal Pruning) — Further promotes the elimination of sink tokens along the time dimension.
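To make the first two components more concrete, here is a minimal sketch under stated assumptions. This README does not give the exact formulas, so the persistence measure (fraction of frames in which a spatial position ranks in the per-frame top `top_m`) and the down-weighting rule with a `strength` parameter are illustrative stand-ins, not the method from the paper. The names `sink_scores`, `downweight`, `top_m`, and `strength` are hypothetical.

```python
# Illustrative only: an assumed persistence-based sink score and an assumed
# down-weighting rule. The paper's actual definitions may differ.

def sink_scores(attn, top_m):
    """attn[t][p] is the attention received by spatial position p in frame t.

    Returns, for each position, the fraction of frames in which that position
    ranks among the top_m most-attended positions of the frame.
    """
    num_frames = len(attn)
    num_pos = len(attn[0])
    counts = [0] * num_pos
    for frame in attn:
        ranked = sorted(range(num_pos), key=lambda p: frame[p], reverse=True)
        for p in ranked[:top_m]:
            counts[p] += 1
    return [c / num_frames for c in counts]

def downweight(frame, scores, strength):
    """Scale each position's attention by (1 - strength * sink score)."""
    return [a * (1.0 - strength * s) for a, s in zip(frame, scores)]

# Three frames, four spatial positions; position 0 dominates every frame.
attn = [
    [0.50, 0.22, 0.18, 0.10],
    [0.50, 0.10, 0.15, 0.25],
    [0.50, 0.25, 0.10, 0.15],
]

scores = sink_scores(attn, top_m=1)
adjusted = downweight(attn[0], scores, strength=0.9)

best_before = max(range(4), key=lambda p: attn[0][p])
best_after = max(range(4), key=lambda p: adjusted[p])
print(scores, best_before, best_after)
```

In this toy input, position 0 is the top position in every frame, so it gets the maximum score and drops out of first place after down-weighting. Positions whose attention varies from frame to frame are left untouched.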

SToP is agnostic to existing pruning frameworks and can be seamlessly integrated with VisionZip, FastVid, and HoliTom.

Installation

Set up the environment with a single command:

bash environment.sh

Main package versions:

  • torch 2.2.1 (CUDA 12.1)
  • transformers 4.51.3
  • numpy 1.26.1
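After running the setup script, it can be useful to confirm that the environment matches the versions listed above. The snippet below is a small optional check, not part of the repository. It uses only the standard-library `importlib.metadata`, and the helper names `installed_version` and `check_versions` are made up for this example.

```python
from importlib import metadata

# Versions listed in this README.
EXPECTED = {"torch": "2.2.1", "transformers": "4.51.3", "numpy": "1.26.1"}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_versions(expected, lookup=installed_version):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for package, want in expected.items():
        found = lookup(package)
        # Some wheels report a local suffix such as "2.2.1+cu121";
        # compare only the part before "+".
        matches = found is not None and found.split("+")[0] == want
        report[package] = (want, found, matches)
    return report

if __name__ == "__main__":
    for package, (want, found, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: expected {want}, found {found}, {status}")
```

The `lookup` argument exists only so the comparison logic can be exercised without the packages installed.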

Dataset

Each benchmark has its own download script — run only the ones you need.

Fine-grained tasks

cd dataset/fine_grained_task

bash download_EventHallusion.sh   # EventHallusion
bash download_VCGBench.sh         # VCGBench (VideoChatGPT)
bash download_VideoComp.sh        # VideoComp (ActivityNet + YouCook2)

VQA benchmarks

cd dataset/VQA

bash download_videomme.sh         # Video-MME
bash download_mvbench.sh          # MVBench
bash download_mlvu.sh             # MLVU

Evaluation

This repository supports three backbones, each with its own scripts directory:

Backbone          Scripts directory
LLaVA-OneVision   scripts/llava_ov/
LLaVA-Video       scripts/llava_video/
Qwen2.5-VL        scripts/qwen2.5_vl/ (VisionZip only; see folder README)

Supported pruning methods: VisionZip, FastVid, HoliTom, and FlashVid.

PruneVid was used as a baseline in our paper, but its integration code is not distributed in this repository. PruneVid is licensed under CC BY-NC-SA 4.0 (NonCommercial), which is incompatible with SToP's Apache-2.0 / MIT licensing. To obtain the original PruneVid implementation for reference, run bash pruning/download_prunevid.sh (NonCommercial; see THIRD_PARTY_LICENSES.md).

Quick start

Every backbone directory follows the same 4-file layout. Replace {MODEL} with llava_ov, llava_video, or qwen2.5_vl.

# No pruning (vanilla)
bash scripts/{MODEL}/no_pruning.sh

# Pruning without SToP
bash scripts/{MODEL}/baseline.sh

# SToP — spatial only (VisionZip, FastVid)
bash scripts/{MODEL}/SToP_spatial.sh

# SToP — spatial + temporal (HoliTom)
bash scripts/{MODEL}/SToP_spatial_temporal.sh

Inside each script you can adjust DATASET, RETENTION_RATIO, PRUNING, and CUDA_VISIBLE_DEVICES. The SToP hyperparameters (μ_s, μ_t) are set automatically per backbone and pruning method via eval/utils/config.py, so no manual tuning is needed.
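As a rough guide to what RETENTION_RATIO controls, the sketch below assumes it is the fraction of visual tokens kept after pruning. This README does not define the variable precisely, so treat that reading, the rounding rule, and the helper `tokens_kept` as assumptions. The frame and token counts are example numbers, not settings taken from the scripts.

```python
def tokens_kept(num_frames, tokens_per_frame, retention_ratio):
    """Estimate how many visual tokens survive pruning.

    Assumes retention_ratio is the kept fraction of all visual tokens,
    rounded down, with at least one token always kept.
    """
    if not 0.0 < retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be in (0, 1]")
    total = num_frames * tokens_per_frame
    return max(1, int(total * retention_ratio))

# Example numbers only: 32 frames with 196 visual tokens each.
print(tokens_kept(32, 196, 0.25))  # 1568 of 6272 tokens
```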

OPENAI_KEY is required for GPT-based evaluation on VCGBench and EventHallusion. Set it at the top of the script before running.

Acknowledgement

This code is built upon the following open-source projects. We thank the authors for releasing their work. Each project's license and the corresponding files in this repository are listed in THIRD_PARTY_LICENSES.md; full license texts are under licenses/.

Project      License
LLaVA-NeXT   Apache-2.0
lmms-eval    Apache-2.0 + MIT
VisionZip    Apache-2.0
HoliTom      Apache-2.0
FastVid      MIT
FlashVid     MIT
PruneVid     CC BY-NC-SA 4.0 (not bundled; see note below)

PruneVid (NonCommercial). PruneVid is licensed under CC BY-NC-SA 4.0, which is incompatible with this repository's Apache-2.0 / MIT licensing, so no PruneVid-derived code is included here. If you need it, obtain it separately:

bash pruning/download_prunevid.sh

This clones the upstream PruneVid repository (CC BY-NC-SA 4.0) into a git-ignored directory for reference. Anything derived from it is bound by CC BY-NC-SA 4.0: attribution, ShareAlike, and NonCommercial use only.

License

SToP is released under the Apache License 2.0 — see LICENSE. All code distributed in this repository is licensed under Apache-2.0 or MIT (mutually compatible). Third-party components and their licenses are documented in THIRD_PARTY_LICENSES.md.

Citation

If you find this work useful, please cite:

@article{kim2026sink,
  title={Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs},
  author={Kim, Kibum and Kim, Jiwan and Min, Kyle and Wang, Yueqi and Moon, Jinyoung and McAuley, Julian and Park, Chanyoung},
  journal={arXiv preprint arXiv:2604.20937},
  year={2026}
}