
VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs



Yingqi Fan1, Anhao Zhao1,3, Jinlan Fu5, Junlong Tong1,2, Hui Su5, Yijie Pan1, Wei Zhang1, Xiaoyu Shen*,1

1Institute of Digital Twin, Eastern Institute of Technology, Ningbo

2Shanghai Jiao Tong University, 3Southeast University, 4Innopolis University

* Corresponding Author.

Contact: yingqi949@gmail.com, xyshen@eitech.edu.cn

If you find this repository useful, please cite using this BibTeX:

@inproceedings{fan2025visipruner,
  title={VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs},
  author={Fan, Yingqi and Zhao, Anhao and Fu, Jinlan and Tong, Junlong and Su, Hui and Pan, Yijie and Zhang, Wei and Shen, Xiaoyu},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={18896--18913},
  year={2025}
}

Contents

  • Abstract
  • Install
  • Visualization Tools
  • Evaluation
  • Pruning configurations
  • License
  • Acknowledgments

Abstract

Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics.

[Figure: VisiPruner framework overview]
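
The three-stage picture above maps naturally onto a per-layer keep policy for visual tokens. The sketch below is purely illustrative (function names, the deep-stage boundary, and the shallow keep ratio are our assumptions, not the repository's implementation); shallow_mid_layer = 6 and the 0.2 token ratio are taken from the example configuration later in this README.

    # Illustrative keep policy for visual tokens, mirroring the three stages
    # described above. Names and the deep-stage boundary are assumptions,
    # not VisiPruner's actual implementation.
    def visual_keep_ratio(layer_idx: int,
                          shallow_mid_layer: int = 6,
                          deep_start_layer: int = 20,
                          middle_keep: float = 0.2) -> float:
        """Fraction of visual tokens to keep at a given decoder layer."""
        if layer_idx < shallow_mid_layer:
            # Stage 1: visual tokens act mostly as passive attention sinks.
            return 0.05
        if layer_idx < deep_start_layer:
            # Stage 2: abrupt cross-modal fusion driven by a few critical tokens.
            return middle_keep
        # Stage 3: deep layers focus on linguistic refinement; vision can be dropped.
        return 0.0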

Install

  1. Clone this repository and navigate to LLaVA_visiPruner folder

    git clone https://github.com/xxx
    cd LLaVA_visiPruner
    
  2. Install Package

    conda create -n llava_visiPruner python=3.10 -y
    conda activate llava_visiPruner
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
    
  3. Install additional packages for training

    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    

Upgrade to the latest code base

git pull
pip install -e .

Visualization Tools

We provide several visualization tools to inspect intermediate results produced during model analysis. All dependencies required for visualization are already included in the Python package configuration file, so no additional installation is needed.

1. Task Recognition

1.1 Logits Lens Based

We employ a logits lens approach to uncover latent task information encoded in the representation of the last input token.

Location: LLaVA_visiPruner/visualization/project_vo_to_semantic_space.ipynb.

Example:

Layer 10 top tokens: ['number', 'zero', 'sterd', 'Zero', 'cyc']
Layer 11 top tokens: ['number', 'stor', 'VS', 'gem', 'cyc']
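
If you want to reproduce this outside the notebook, a rough sketch of the logit-lens projection is shown below. It assumes a Llama-style Hugging Face model already loaded as model/tokenizer (attribute paths such as model.model.norm hold for the LLaVA-v1.5 Llama backbone); the notebook remains the authoritative version.

    # Rough logit-lens sketch: project each layer's last-token hidden state
    # through the final norm and LM head, then read off the top-k tokens.
    import torch

    @torch.no_grad()
    def logit_lens_top_tokens(model, tokenizer, input_ids, k=5):
        out = model(input_ids, output_hidden_states=True)
        per_layer = {}
        for layer, h in enumerate(out.hidden_states):
            last = model.model.norm(h[:, -1, :])      # final RMSNorm
            logits = model.lm_head(last)              # project to vocabulary
            ids = logits.topk(k, dim=-1).indices[0].tolist()
            per_layer[layer] = tokenizer.convert_ids_to_tokens(ids)
        return per_layer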

1.2 Value-Output Based

Similar to the logits-lens view, the value–output matrix of the last input token captures hidden task semantics. We provide an easy-to-use Jupyter notebook for this analysis.

Location: LLaVA_visiPruner/visualization/project_vo_to_semantic_space.ipynb.

Example:

Layer 17 : ['Four', '四', 'fourth', 'four', 'four'] 
Layer 18 : ['four', 'five', 'three', 'six', 'four'] 
Layer 19 : ['numbers', 'counting', 'counted', 'numbers', 'counts'] 
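
A comparable sketch for the value-output projection is given below. It assumes standard multi-head attention without grouped KV heads (true for the LLaVA-v1.5-7B backbone) and hidden states captured with output_hidden_states=True; variable names are ours rather than the notebook's.

    # Sketch: project one layer's value-output contribution for the last input
    # token into vocabulary space (assumes num_kv_heads == num_heads).
    import torch

    @torch.no_grad()
    def vo_top_tokens(model, tokenizer, hidden_states, layer_idx, k=5):
        layer = model.model.layers[layer_idx]
        h_last = layer.input_layernorm(hidden_states[layer_idx][:, -1, :])
        v = layer.self_attn.v_proj(h_last)            # value vector, all heads
        vo = layer.self_attn.o_proj(v)                # value-output contribution
        logits = model.lm_head(model.model.norm(vo))
        ids = logits.topk(k, dim=-1).indices[0].tolist()
        return tokenizer.convert_ids_to_tokens(ids)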

2. Attention Visualization

We also provide code for visualizing the distribution of attention over image regions.

Location: LLaVA_visiPruner/visualization/attention_visualization.ipynb

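As a starting point outside the notebook, the sketch below renders the last token's attention over the visual-token grid. The 576-token (24×24) grid, the image_token_start offset, and the use of eager attention with output_attentions=True are assumptions for LLaVA-v1.5 that you may need to adjust.

    # Sketch: heatmap of the last token's attention over the visual-token grid.
    # `attentions` comes from a forward pass with output_attentions=True
    # (requires eager attention, not flash-attn).
    import matplotlib.pyplot as plt
    import torch

    @torch.no_grad()
    def plot_visual_attention(attentions, layer_idx, image_token_start, grid=24):
        attn = attentions[layer_idx][0].mean(dim=0)           # average over heads
        to_image = attn[-1, image_token_start:image_token_start + grid * grid]
        plt.imshow(to_image.reshape(grid, grid).float().cpu(), cmap="viridis")
        plt.colorbar()
        plt.title(f"Layer {layer_idx}: last-token attention over image patches")
        plt.show()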

3. L1 Norm of the Value Matrix

We show that the L1 norms of the value vectors (V) corresponding to attention sinks are significantly lower than those of normal visual tokens.

Location: LLaVA_visiPruner/visualization/L1_norms_of_value_matrix.ipynb

[Figure: L1 norms of value vectors]
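
A minimal way to reproduce this measurement is sketched below; it assumes hidden states from a forward pass with output_hidden_states=True and the LLaVA-v1.5 visual-token layout, with variable names chosen for illustration.

    # Sketch: L1 norms of visual-token value vectors at one layer; attention-sink
    # tokens should show markedly smaller norms than normal visual tokens.
    import torch

    @torch.no_grad()
    def visual_value_l1_norms(model, hidden_states, layer_idx,
                              image_token_start, num_image_tokens=576):
        layer = model.model.layers[layer_idx]
        h = layer.input_layernorm(hidden_states[layer_idx])   # pre-attention norm
        v = layer.self_attn.v_proj(h)                         # [batch, seq, dim]
        vis = v[0, image_token_start:image_token_start + num_image_tokens]
        return vis.abs().sum(dim=-1)                          # L1 norm per visual token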

Evaluation

First, prepare the evaluation data by following the instructions from LLaVA: https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md.

We provide ready-to-use evaluation scripts with default configurations for GQA, MME, and TextVQA benchmarks:

  1. Run GQA:

    CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/visiPruner_eval/gqa.sh

  2. Run MME:

    CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/visiPruner_eval/mme.sh

  3. Run TextVQA:

    CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/visiPruner_eval/textvqa.sh

Pruning configurations

You can modify the pruning hyperparameters via the --pruning-config argument in the provided bash scripts, or run inference directly with ./llava/serve/cli_pruning.py. A small Python snippet for assembling the JSON string follows the parameter list below.

For example:

    --pruning-config '{"mode": ["shallow", "middle", "deep"], "shallow_mid_layer": 6, "layer_threshold": 0.995, "tokens_threshold": 0.2}'
  • mode: Specifies which layer groups to prune (shallow, middle, or deep).

  • shallow_mid_layer: Defines the boundary between the shallow and middle layers.

  • layer_threshold: Controls the pruning ratio via layer filtering.

  • tokens_threshold: Controls the pruning ratio via the fraction of tokens that remain.
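
If you prefer to assemble the configuration programmatically (for sweeping settings, say), the snippet below simply reproduces the example above; the keys are exactly those documented in this list.

    # Build the --pruning-config JSON string in Python; keys and values mirror
    # the example configuration above.
    import json

    pruning_config = {
        "mode": ["shallow", "middle", "deep"],  # layer groups to prune
        "shallow_mid_layer": 6,                 # shallow / middle boundary
        "layer_threshold": 0.995,               # layer-level pruning control
        "tokens_threshold": 0.2,                # token-level pruning control
    }
    print(json.dumps(pruning_config))           # pass this string to --pruning-config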

License

This project is released under the Apache 2.0 license.

Acknowledgments

Thanks to the LLaVA and FastV libraries, which helped us quickly implement our ideas.