✂️ Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

July 1, 2025 · View on GitHub

A plug-and-play method that utilizes visual cues for more effective token pruning in visual language models.

[📄 Paper] [🌐 Project Page]

📰 News

🔥 [2025/06/28] Our Project Page is updated!

🔥 [2025/06/26] Our VisPruner is accepted to ICCV 2025! 🎉

🔥 [2025/06/24] A new version of FasterVLM, named VisPruner, is released!

👁️ Overview

(Figure: method overview)

In this work, we find that the text-visual attention in the language model is not an ideal indicator for visual token pruning. Based on this analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in visual language models. Specifically, we first use visual attention to select a limited number of important tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image.
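
To make the two-stage selection concrete, here is a minimal sketch of the idea in PyTorch. It is not the repository implementation: the function name, tensor shapes, and the greedy diversity filter standing in for the paper's duplicate-removal step are all illustrative.

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(features, cls_attn, num_keep, important_ratio=0.5):
    """Illustrative two-stage pruning.

    features: (N, D) visual token features from the vision encoder
    cls_attn: (N,)   [CLS]-to-patch attention scores used as importance
    num_keep: total number of visual tokens to retain
    """
    num_important = int(num_keep * important_ratio)
    num_diverse = num_keep - num_important

    # Stage 1: keep the tokens with the highest visual ([CLS]) attention.
    important_idx = cls_attn.topk(num_important).indices

    # Stage 2: from the remaining tokens, greedily keep the ones least similar
    # to anything already kept (a stand-in for similarity-based duplicate removal).
    mask = torch.ones(features.size(0), dtype=torch.bool)
    mask[important_idx] = False
    remaining_idx = mask.nonzero(as_tuple=True)[0]

    feats = F.normalize(features, dim=-1)
    kept = feats[important_idx]
    diverse_idx = []
    for _ in range(num_diverse):
        # similarity of each remaining token to its closest kept token
        sim = (feats[remaining_idx] @ kept.T).max(dim=-1).values
        pick = sim.argmin().item()               # most dissimilar token
        diverse_idx.append(remaining_idx[pick].item())
        kept = torch.cat([kept, feats[remaining_idx[pick]].unsqueeze(0)])
        remaining_idx = torch.cat([remaining_idx[:pick], remaining_idx[pick + 1:]])

    keep_idx = torch.cat([important_idx,
                          torch.tensor(diverse_idx, dtype=torch.long)])
    return features[keep_idx.sort().values]      # preserve original token order
```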

(Figure: example case)

⚙️ Setup

🏝️ Environment

1. Clone this repository.
git clone https://github.com/anonymous-4869/VisPruner.git
cd VisPruner
2. Install necessary packages.
conda create -n vispruner python=3.10 -y
conda activate vispruner
pip install -e .
3. (Optional) Install FlashAttention for further inference acceleration.
pip install flash-attn --no-build-isolation

📦 Model

Download the corresponding LLaVA checkpoints from Hugging Face 🤗:

| Version | LLM | Checkpoint |
| --- | --- | --- |
| LLaVA-1.5 | Vicuna-7B | liuhaotian/llava-v1.5-7b |
| LLaVA-1.5 | Vicuna-13B | liuhaotian/llava-v1.5-13b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | liuhaotian/llava-v1.6-vicuna-7b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | liuhaotian/llava-v1.6-vicuna-13b |
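
If you prefer to fetch a checkpoint ahead of time, one way is via huggingface_hub; the local directory below is just an example, not a path the scripts require.

```python
from huggingface_hub import snapshot_download

# Fetch the LLaVA-1.5 7B weights from the Hugging Face Hub.
# The local directory is illustrative; point it wherever you keep checkpoints.
snapshot_download(
    repo_id="liuhaotian/llava-v1.5-7b",
    local_dir="./checkpoints/llava-v1.5-7b",
)
```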

📊 Data

Download each dataset according to EVAL.md.

📋 Evaluation

The main implementation of VisPruner is highlighted with [VisPruner] annotations, mainly in llava_llama.py, llava_arch.py and clip_encoder.py.

We provide the evaluation scripts for each benchmark:

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/${DATASET}.sh ${VISUAL_TOKEN_NUMBER} ${IMPORTANT_RATIO}

You only need to pass the number of retained visual tokens and the important ratio (the fraction of those tokens selected by importance) as the bash arguments. For example, to evaluate VisPruner with 64 important tokens and 64 diverse tokens retained on the VQAv2 benchmark, run the script ./scripts/v1_5/eval/vqav2.sh with arguments 128 and 0.5:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh 128 0.5

Similarly, to evaluate VisPruner with 48 important tokens and 16 diverse tokens retained on the MME benchmark, run the following command:

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh 64 0.75
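
In both examples, the two arguments split the retained budget between important and diverse tokens. The following is a small sketch of that arithmetic; the exact rounding inside the scripts may differ.

```python
def token_split(visual_token_number: int, important_ratio: float):
    """Map the two bash arguments to (important, diverse) token counts."""
    num_important = int(visual_token_number * important_ratio)
    num_diverse = visual_token_number - num_important
    return num_important, num_diverse

print(token_split(128, 0.5))   # (64, 64) -> the VQAv2 example above
print(token_split(64, 0.75))   # (48, 16) -> the MME example above
```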

For evaluation with the 13B LLM, simply replace the CKPT argument from llava-v1.5-7b to llava-v1.5-13b in each script. For evaluation with LLaVA-NeXT, use the scripts in ./scripts/v1_6/eval. For example, to evaluate VisPruner with 32 * 5 = 160 visual tokens retained (32 tokens for each of the 5 sub-images processed by LLaVA-NeXT) on the TextVQA benchmark, run the following command:

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_6/eval/textvqa.sh 32 0.5

The detailed guidance for evaluation commands and online submission of each benchmark can be found in EVAL.md.

🔖 Citation

If you find VisPruner useful for your research and applications, please cite using this BibTeX:

@article{zhang2025vispruner,
      title={Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs}, 
      author={Zhang, Qizhe and Cheng, Aosong and Lu, Ming and Zhang, Renrui and Zhuo, Zhiyong and Cao, Jiajun and Guo, Shaobo and She, Qi and Zhang, Shanghang},
      journal={arXiv preprint arXiv:2412.01818},
      year={2025},
}

🎟️ License

This project is released under the Apache 2.0 license.

🏅 Acknowledgement

We appreciate the open-source efforts of LLaVA and ToMe.