SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

October 14, 2025

This repository contains the PyTorch implementation of SparseMM.

Project Page | arXiv Paper

Introducing SparseMM

We investigate how MLLMs process visual inputs by analyzing their attention mechanisms and reveal a surprising sparsity phenomenon: only a small subset (roughly less than 5%) of attention heads in LLMs actively contribute to visual understanding. We term these Visual Heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis.
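To make the sparsity claim concrete, here is a minimal sketch that ranks heads by a per-head visual score and keeps only the top few percent. It does not show how the scores are computed (that is the job of the scripts under Chase Visual Head). The `select_visual_heads` helper, the nested-list score layout, and the use of 5% as a hard cutoff are all assumptions made for illustration.

```python
import math


def select_visual_heads(scores, fraction=0.05):
    """Return the (layer, head) pairs with the highest visual scores.

    scores: nested list where scores[layer][head] is a float
            (hypothetical values; the real scores come from the repo scripts).
    fraction: share of all heads to keep (an assumed cutoff).
    """
    flat = [
        (score, layer, head)
        for layer, row in enumerate(scores)
        for head, score in enumerate(row)
    ]
    if not flat:
        return []
    # Always keep at least one head, otherwise round the count up.
    keep = max(1, math.ceil(fraction * len(flat)))
    # Descending score; ties are broken by (layer, head) for determinism.
    flat.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(layer, head) for _, layer, head in flat[:keep]]
```

For a toy model with 2 layers of 4 heads, `select_visual_heads(scores, fraction=0.25)` would return the two highest-scoring `(layer, head)` pairs.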

Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate MLLM inference. Compared with prior KV-Cache acceleration methods that ignore the distinct role of visual inputs, SparseMM prioritizes retaining visual semantics during decoding.
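The idea of an asymmetric budget can be sketched as follows: each head gets a small uniform floor, and the rest of the KV-Cache budget is split in proportion to the visual scores. This is a simplified illustration under stated assumptions, not the allocation rule used in this repository. The `allocate_head_budgets` helper, the `min_per_head` floor, and the rounding scheme are hypothetical.

```python
def allocate_head_budgets(scores, total_budget, min_per_head=0):
    """Split a KV-cache token budget across attention heads.

    scores: per-head visual scores (hypothetical, non-negative numbers).
    total_budget: total number of cached tokens to distribute.
    min_per_head: uniform floor each head receives (an assumption here).
    """
    n = len(scores)
    if n == 0:
        return []
    floor_total = min_per_head * n
    if floor_total > total_budget:
        raise ValueError("total_budget is too small for the per-head floor")

    remaining = total_budget - floor_total
    score_sum = sum(scores)
    if score_sum <= 0:
        # No head stands out: fall back to a uniform split.
        shares = [remaining / n] * n
    else:
        # Score-proportional share of whatever is left above the floor.
        shares = [remaining * s / score_sum for s in scores]

    budgets = [min_per_head + int(share) for share in shares]

    # Hand out tokens lost to truncation, largest fractional part first.
    leftover = max(total_budget - sum(budgets), 0)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets
```

With scores `[6, 3, 1, 0]`, a total budget of 100 tokens, and a floor of 10, the head with the highest score receives the largest cache while the zero-score head keeps only the floor.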

Main idea of Visual Head

Visual_Head.png

SparseMM for MLLM Acceleration

SparseMM.png

Main Results

Results on Multi-modal Benchmarks

main_result.png

Efficiency Evaluation for SparseMM

efficiency.png

Visualization of Visual Head

viz.png

Get Started

Install

  1. Clone this repository:
git clone https://github.com/CR400AF-A/SparseMM.git
cd SparseMM
  2. Initialize your environment:
conda create -n sparsemm python=3.10 -y
conda activate sparsemm
  3. Install packages:

Compile the CUDA code for Flatten Cache Storage. If you encounter a CUDA compile error, check your GPU's virtual architecture and GPU feature, then change the corresponding compile flag in csrc/build.py.

pip install packaging torch==2.5.1
pip uninstall ninja && pip cache purge && pip install ninja --no-cache-dir
cd csrc && make
cd ..

Install other packages

pip install -e .
pip install flash-attn==2.4.1 --no-build-isolation # only FlashAttention is currently supported
pip install qwen-vl-utils
  4. Install lmms-eval for evaluation:
cd lmms-eval
pip install -e .
cd ..
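If the CUDA compile step above fails, it can help to confirm which compute capability your GPU reports before editing the compile flag. The sketch below prints a matching nvcc `-gencode` flag. The `gencode_flag` helper is hypothetical, and how csrc/build.py stores its flags is not shown here, so treat the output as a value to compare against, not a drop-in patch.

```python
def gencode_flag(major, minor):
    """Build an nvcc -gencode flag for compute capability major.minor."""
    cc = f"{major}{minor}"
    # compute_XX names the virtual architecture, sm_XX the real GPU feature set.
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


if __name__ == "__main__":
    try:
        import torch  # installed by the pip command in the install steps
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # Returns a (major, minor) tuple for the given device index.
        major, minor = torch.cuda.get_device_capability(0)
        print(gencode_flag(major, minor))
    else:
        print("No CUDA device visible; look up your GPU's compute capability manually.")
```

Run it inside the `sparsemm` environment so that the same PyTorch build used for compilation is the one queried.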

Chase Visual Head

  1. Download the datasets:
huggingface-cli download --repo-type dataset --resume-download nnethercott/synthdog-en-detection --local-dir /path/to/datasets/synthdog-en-detection

huggingface-cli download --repo-type dataset --resume-download detection-datasets/coco --local-dir /path/to/datasets/coco
  2. Process the datasets:
python3 scripts/chase_visual_head/process_data.py
python3 scripts/chase_visual_head/process_data_coco.py
  3. Chase visual heads:
bash scripts/chase_visual_head/llava.sh
bash scripts/chase_visual_head/llava_coco.sh
bash scripts/chase_visual_head/qwen.sh

Eval

bash scripts/eval/llava.sh
bash scripts/eval/mistral.sh
bash scripts/eval/qwen.sh

Viz

bash scripts/others/viz.sh

Speed and Memory

bash scripts/others/speed_and_memory.sh

Citation

If you find this repository useful, please consider citing:

@article{wang2025sparsemm,
  title={SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs},
  author={Wang, Jiahui and Liu, Zuyan and Rao, Yongming and Lu, Jiwen},
  journal={arXiv preprint arXiv:2506.05344},
  year={2025}
}

Acknowledgement

  • Our codebase is built on AdaKV and PyramidKV.

  • Thanks to the lmms-eval team for building such a useful evaluation system!