March 3, 2026
VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
Yingqi Fan1, Anhao Zhao1,3, Jinlan Fu5, Junlong Tong1,2, Hui Su5, Yijie Pan1, Wei Zhang1, Xiaoyu Shen*,1
1Institute of Digital Twin, Eastern Institute of Technology, Ningbo
2Shanghai Jiao Tong University, 3Southeast University, 4Innopolis University
* Corresponding Author.
Contact: yingqi949@gmail.com, xyshen@eitech.edu.cn
If you find this repository useful, please cite using this BibTeX:
```bibtex
@inproceedings{fan2025visipruner,
  title={VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs},
  author={Fan, Yingqi and Zhao, Anhao and Fu, Jinlan and Tong, Junlong and Su, Hui and Pan, Yijie and Zhang, Wei and Shen, Xiaoyu},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={18896--18913},
  year={2025}
}
```
Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics.
Install
1. Clone this repository and navigate to the `LLaVA_visiPruner` folder

```shell
git clone https://github.com/xxx
cd LLaVA_visiPruner
```

2. Install the package

```shell
conda create -n llava_visiPruner python=3.10 -y
conda activate llava_visiPruner
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training

```shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

Upgrade to the latest code base

```shell
git pull
pip install -e .
```
Visualization Tools
We provide several visualization tools to inspect intermediate results produced during model analysis. All dependencies required for visualization are already included in the Python package configuration file, so no additional installation is needed.
1. Task Recognition
1.1 Logits Lens Based
We employ a logits lens approach to uncover latent task information encoded in the representation of the last input token.
Location: LLaVA_visiPruner/visualization/project_vo_to_semantic_space.ipynb.
Example:

```
Layer 10 top tokens: ['number', 'zero', 'sterd', 'Zero', 'cyc']
Layer 11 top tokens: ['number', 'stor', 'VS', 'gem', 'cyc']
```
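The core of the logit-lens view can be sketched in a few lines: project an intermediate hidden state through the unembedding matrix and read off the most probable tokens. The sketch below uses random placeholder weights and a toy vocabulary, not real model parameters; the notebook above applies the same idea to the actual model.

```python
import numpy as np

def logit_lens_topk(hidden_state, unembedding, vocab, k=5):
    """Project an intermediate hidden state through the unembedding
    matrix and return the k most probable tokens (the logit lens)."""
    logits = hidden_state @ unembedding            # (vocab_size,)
    top_ids = np.argsort(logits)[::-1][:k]         # indices of the k largest logits
    return [vocab[i] for i in top_ids]

# Toy example: 4-dim hidden state, 6-token vocabulary (placeholders).
rng = np.random.default_rng(0)
vocab = ["number", "zero", "four", "cat", "dog", "tree"]
W_U = rng.normal(size=(4, len(vocab)))  # placeholder unembedding matrix
h = rng.normal(size=4)                  # placeholder hidden state
print(logit_lens_topk(h, W_U, vocab, k=3))
```

In a real model, `h` would be the residual-stream activation of the last input token at a given layer and `W_U` the model's unembedding (LM head) matrix.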
1.2 Value-Output Based
Similar to the logits-lens view, the value–output matrix of the last input token captures hidden task semantics. We provide an easy-to-use Jupyter notebook for this analysis.
Location: LLaVA_visiPruner/visualization/project_vo_to_semantic_space.ipynb.
Example:

```
Layer 17 : ['Four', '四', 'fourth', 'four', 'four']
Layer 18 : ['four', 'five', 'three', 'six', 'four']
Layer 19 : ['numbers', 'counting', 'counted', 'numbers', 'counts']
```
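The value–output projection mirrors the logit-lens idea: a head's value vector is passed through that head's output matrix and then the unembedding before ranking tokens. The matrices below are random placeholders with toy shapes, not real model weights.

```python
import numpy as np

def vo_topk(value_vec, W_O, W_U, vocab, k=5):
    """Project a head's value vector through its output matrix W_O and
    the unembedding W_U, then return the top-k tokens."""
    logits = value_vec @ W_O @ W_U
    top_ids = np.argsort(logits)[::-1][:k]
    return [vocab[i] for i in top_ids]

rng = np.random.default_rng(1)
vocab = ["four", "five", "three", "six", "counting", "cat"]
v = rng.normal(size=8)                     # placeholder head value vector
W_O = rng.normal(size=(8, 16))             # placeholder head output projection
W_U = rng.normal(size=(16, len(vocab)))    # placeholder unembedding
print(vo_topk(v, W_O, W_U, vocab, k=3))
```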
2. Attention Visualization
We also provide code for visualizing the distribution of attention over image regions.
Location: LLaVA_visiPruner/visualization/attention_visualization.ipynb
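The basic recipe for this kind of visualization is to take one row of the attention matrix, restrict it to the visual tokens, and reshape it into the image patch grid. The sketch below assumes a 24×24 patch grid (576 visual tokens, as in LLaVA-v1.5) and uses a random attention row as a stand-in for real model output.

```python
import numpy as np

def attention_to_grid(attn_row, grid=24):
    """Reshape the attention a query token pays to the visual tokens
    into a 2-D heatmap over the image patch grid."""
    vis = attn_row[: grid * grid]      # assume visual tokens come first
    vis = vis / vis.sum()              # renormalize over visual tokens
    return vis.reshape(grid, grid)

rng = np.random.default_rng(0)
attn_row = rng.random(600)             # toy attention row over 600 tokens
heatmap = attention_to_grid(attn_row)  # (24, 24) heatmap, e.g. for plt.imshow
print(heatmap.shape)                   # (24, 24)
```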

3. L1 Norm of the Value Matrix
We show that the L1 norms of the value vectors corresponding to attention sinks are significantly lower than those of normal visual tokens.
Location: LLaVA_visiPruner/visualization/L1_norms_of_value_matrix.ipynb
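The comparison itself is just a per-token L1 norm over value vectors. The snippet below illustrates it on synthetic data, where "sink" tokens are drawn with a deliberately smaller scale; the notebook runs the same computation on real value matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(0, 1.0, size=(100, 64))  # synthetic normal visual tokens
sinks = rng.normal(0, 0.1, size=(4, 64))     # synthetic attention-sink tokens

def l1(vecs):
    """Per-token L1 norm over the hidden dimension."""
    return np.abs(vecs).sum(axis=-1)

print(l1(sinks).mean() < l1(normal).mean())  # True in this toy setup
```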

Evaluation
First, prepare the evaluation data following the instructions from LLaVA: https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md.
We provide ready-to-use evaluation scripts with default configurations for GQA, MME, and TextVQA benchmarks:
- Run GQA:

```shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/visiPruner_eval/gqa.sh
```

- Run MME:

```shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/visiPruner_eval/mme.sh
```

- Run TextVQA:

```shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/visiPruner_eval/textvqa.sh
```
Pruning configurations
You can modify hyperparameters using the `--pruning-config` argument in the provided bash scripts. Alternatively, you can run inference directly using `./llava/serve/cli_pruning.py`.
For example:

```shell
--pruning-config '{"mode":["shallow","middle","deep"], "shallow_mid_layer":6, "layer_threshold":0.995, "tokens_threshold":0.2}'
```
- `mode`: specifies which layers to prune (`shallow`, `middle`, or `deep`).
- `shallow_mid_layer`: defines the boundary between shallow and middle layers.
- `layer_threshold`: controls the pruning ratio by filtering layers.
- `tokens_threshold`: controls the pruning ratio via the fraction of tokens retained.
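A minimal sketch of how such a config string could be parsed and sanity-checked before use. The field semantics follow the descriptions above; the validation rules and the helper name `parse_pruning_config` are illustrative assumptions, not the repository's actual parsing code.

```python
import json

def parse_pruning_config(arg: str) -> dict:
    """Parse a --pruning-config JSON string and check its fields
    (hypothetical helper; bounds are illustrative assumptions)."""
    cfg = json.loads(arg)
    assert set(cfg["mode"]) <= {"shallow", "middle", "deep"}
    assert 0 < cfg["layer_threshold"] <= 1.0   # fraction of layers kept
    assert 0 < cfg["tokens_threshold"] <= 1.0  # fraction of tokens kept
    return cfg

cfg = parse_pruning_config(
    '{"mode":["shallow","middle","deep"], "shallow_mid_layer":6,'
    '"layer_threshold":0.995, "tokens_threshold":0.2}'
)
print(cfg["shallow_mid_layer"])  # 6
```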
License
This project is released under the Apache 2.0 license.
Acknowledgments
Thanks to the LLaVA and FastV codebases, which helped us quickly implement our ideas.
Related Projects
- Survey
- Vision Encoder
- MLLM