SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

October 14, 2025

This repository contains the PyTorch implementation of SparseMM.

Introduce SparseMM

We investigate how MLLMs process visual inputs by analyzing their attention mechanisms and reveal a surprising sparsity phenomenon: only a small subset (roughly less than 5%) of the attention heads in the LLM actively contribute to visual understanding. We term these Visual Heads. To identify them efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis.
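To make head-level visual relevance concrete, here is a minimal sketch in plain Python. It is not the repository's scoring procedure (that lives in scripts/chase_visual_head and is not described here). It assumes a simple proxy: for one generated token, a head's score is the share of its attention mass that lands on visual token positions. The names `visual_head_scores`, `top_heads`, and the nested-list layout of `attn` are hypothetical.

```python
def visual_head_scores(attn, visual_positions):
    """Score each head by the fraction of attention mass on visual tokens.

    attn[layer][head] is a list of attention weights over the input tokens
    for a single generated token (hypothetical layout, for illustration).
    Returns a dict mapping (layer, head) to a score in [0, 1].
    """
    visual = set(visual_positions)
    scores = {}
    for layer_idx, layer in enumerate(attn):
        for head_idx, weights in enumerate(layer):
            total = sum(weights)
            mass = sum(w for pos, w in enumerate(weights) if pos in visual)
            scores[(layer_idx, head_idx)] = mass / total if total > 0 else 0.0
    return scores


def top_heads(scores, fraction=0.05):
    """Return the highest-scoring heads, keeping at least one."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda key: scores[key], reverse=True)
    return ranked[:k]


# Toy example: one layer, two heads, four input tokens; tokens 1 and 2 are visual.
toy_attn = [[[0.1, 0.4, 0.4, 0.1],   # head 0 attends mostly to visual tokens
             [0.7, 0.1, 0.1, 0.1]]]  # head 1 attends mostly to text tokens
toy_scores = visual_head_scores(toy_attn, visual_positions=[1, 2])
selected = top_heads(toy_scores, fraction=0.5)  # keeps head (0, 0)
```

In practice such scores would be accumulated over many samples before ranking; the 5% default for `fraction` only mirrors the sparsity figure quoted above.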

Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to the heads of the LLM based on their visual scores, leveraging the sparsity of visual heads to accelerate MLLM inference. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual inputs, SparseMM emphasizes and retains visual semantics during decoding.
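The idea of an asymmetric budget can be sketched as follows. This is an illustration under stated assumptions, not SparseMM's actual allocation rule: each head receives a uniform floor of cache slots, and the remaining slots are split in proportion to the head's visual score. The function name `allocate_budgets` and the `min_per_head` parameter are hypothetical.

```python
def allocate_budgets(scores, total_budget, min_per_head=0):
    """Split total_budget KV-cache slots across heads.

    Every head gets min_per_head slots; the rest is shared in proportion
    to each head's visual score (assumed rule, for illustration only).
    """
    n = len(scores)
    floor_total = min_per_head * n
    if floor_total > total_budget:
        raise ValueError("total_budget is too small for the per-head floor")
    remaining = total_budget - floor_total
    score_sum = sum(scores)
    if score_sum <= 0:
        # No head stands out: fall back to a uniform split.
        shares = [remaining // n] * n
    else:
        shares = [int(remaining * s / score_sum) for s in scores]
    budgets = [min_per_head + share for share in shares]
    # Rounding down can leave a few slots over; give them to the top heads.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    for i in range(leftover):
        budgets[order[i % n]] += 1
    return budgets


# Four heads, one of which dominates the visual score.
print(allocate_budgets([8, 1, 1, 0], total_budget=100, min_per_head=10))
# prints [58, 16, 16, 10]
```

The floor keeps low-scoring heads from losing their cache entirely, while the proportional share concentrates the budget on the few heads that matter for visual content.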

Clone this repository:

git clone https://github.com/CR400AF-A/SparseMM.git
cd SparseMM

Init your environment

conda create -n sparsemm python=3.10 -y
conda activate sparsemm

Install packages

Compile the CUDA code for Flatten Cache Storage. If you encounter a CUDA compile error, check your GPU's virtual architecture and GPU feature (compute capability), then change the corresponding compile flag in csrc/build.py.

pip install packaging torch==2.5.1
pip uninstall ninja && pip cache purge && pip install ninja --no-cache-dir
cd csrc && make
cd ..
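If the build fails on an architecture mismatch, a small helper like the one below can show which nvcc flag matches your GPU. This is a sketch: the exact flag format expected by csrc/build.py is not shown here and may differ, and `gencode_flag` is a hypothetical name.

```python
def gencode_flag(major, minor):
    """Build an nvcc -gencode flag for a given compute capability.

    Example: compute capability 8.0 maps to compute_80 / sm_80.
    """
    arch = f"{major}{minor}"
    return f"-gencode arch=compute_{arch},code=sm_{arch}"


# Compute capability 8.0 (major=8, minor=0).
print(gencode_flag(8, 0))  # prints -gencode arch=compute_80,code=sm_80
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the current GPU, which can be passed straight to this helper.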

Install other packages

pip install -e .
pip install flash-attn==2.4.1 --no-build-isolation # currently only FlashAttention is supported
pip install qwen-vl-utils

Install lmms-eval for evaluation

cd lmms-eval
pip install -e .
cd ..

Chase Visual Head

Download the datasets:

huggingface-cli download --repo-type dataset --resume-download nnethercott/synthdog-en-detection --local-dir /path/to/datasets/synthdog-en-detection

huggingface-cli download --repo-type dataset --resume-download detection-datasets/coco --local-dir /path/to/datasets/coco

Process the datasets:

python3 scripts/chase_visual_head/process_data.py
python3 scripts/chase_visual_head/process_data_coco.py

Chase visual heads:

bash scripts/chase_visual_head/llava.sh
bash scripts/chase_visual_head/llava_coco.sh
bash scripts/chase_visual_head/qwen.sh

Eval

bash scripts/eval/llava.sh
bash scripts/eval/mistral.sh
bash scripts/eval/qwen.sh

Viz

bash scripts/others/viz.sh

Speed and Memory

bash scripts/others/speed_and_memory.sh

Citation

If you find this repository useful, please consider citing:

@article{wang2025sparsemm,
  title={SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs},
  author={Wang, Jiahui and Liu, Zuyan and Rao, Yongming and Lu, Jiwen},
  journal={arXiv preprint arXiv:2506.05344},
  year={2025}
}

Acknowledgement

Our codebase is built on AdaKV and PyramidKV.
Thanks to the lmms-eval team for building such a useful evaluation system!
