README.md

June 27, 2026

CLSE: Cross-Layer Spectral Evolution for Token Pruning
in Multimodal Large Language Models

🔥 Accepted at ECCV 2026

Bin Chen1,2, Yuxiang Cai1,2*, Yadan Luo3, Yi Zhang4,
Jianwei Yin1,2, Zhi Chen5✉

1 School of Software Technology, Zhejiang University, Ningbo, China
2 Zhejiang Key Laboratory of Digital-Intelligence Service Technology, China
3 The University of Queensland, St Lucia, QLD, Australia
4 Singapore Management University, Singapore
5 The University of Southern Queensland, Toowoomba, QLD, Australia

* Corresponding Authors: Yuxiang Cai, Zhi Chen


🔥 News

  • 2026.06.19 🎉🎉 CLSE is accepted at ECCV 2026! Code and models are now available!

👀 Overview

[Figure: CLSE overview]

TLDR: We propose CLSE (Cross-Layer Spectral Evolution), a training-free token pruning method for MLLMs that quantifies, in the frequency domain, how visual token representations evolve across Transformer layers. Tokens whose spectra shift more strongly from high-frequency details to low-frequency semantics are preserved. On LLaVA-1.5-7B, CLSE removes 66.7% of visual tokens (576 to 192) while retaining 99.4% of the original performance, and it is compatible with both image and video MLLMs.
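To make the idea concrete, here is a minimal sketch of one way such a score could be computed. It is not the released implementation. It assumes the spectrum is taken along the feature dimension, compares only two layers, and uses a hypothetical `cutoff` to split low from high frequencies. All function names are illustrative, and the code uses only the standard library so it runs anywhere.

```python
import cmath
import math


def low_freq_fraction(vec, cutoff=0.25):
    """Share of spectral energy in the lowest `cutoff` fraction of DFT bins."""
    n = len(vec)
    half = n // 2 + 1  # non-redundant bins of a real-valued signal
    energy = []
    for k in range(half):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(vec))
        energy.append(abs(coeff) ** 2)
    total = sum(energy)
    if total == 0.0:
        return 0.0
    n_low = max(1, int(half * cutoff))
    return sum(energy[:n_low]) / total


def spectral_shift_scores(shallow, deep, cutoff=0.25):
    """Per-token gain in low-frequency energy share from a shallow to a deep layer."""
    return [low_freq_fraction(d, cutoff) - low_freq_fraction(s, cutoff)
            for s, d in zip(shallow, deep)]


def select_tokens(shallow, deep, retain, cutoff=0.25):
    """Indices of the `retain` highest-scoring tokens, kept in original order."""
    scores = spectral_shift_scores(shallow, deep, cutoff)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:retain])


# Toy features (assumed data): 3 tokens, 8 dimensions each.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]  # pure high-frequency pattern
constant = [1, 1, 1, 1, 1, 1, 1, 1]         # pure low-frequency pattern
shallow = [alternating, alternating, constant]
deep = [constant, alternating, alternating]

print(select_tokens(shallow, deep, retain=2))  # [0, 1]
```

Token 0 moves from a high-frequency to a low-frequency pattern and scores highest, token 1 does not change, and token 2 moves in the opposite direction, so it is pruned first.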

🛠 Installation

LLaVA-1.5

git clone https://github.com/zjubinchen/CLSE
cd CLSE/LLaVA1.5

conda create -n clse python=3.10 -y
conda activate clse
pip install -e transformers-4.37.2  -e .
pip install -e ../lmms-eval 
pip install -e transformers-4.37.2   # patched transformers last, overrides lmms-eval's
pip install "numpy<2.0.0" 

Qwen2-VL

cd CLSE/Qwen2VL

conda create -n clse_qwen python=3.10 -y
conda activate clse_qwen
pip install -r requirements.txt
pip install -e ../lmms-eval -e transformers-4.57.6   # patched transformers last, overrides lmms-eval's

Video-LLaVA

cd CLSE
git checkout video                     # switch to video branch

conda create -n clse_video python=3.10 -y
conda activate clse_video
pip install -e transformers-4.37.2 -e pytorchvideo -e .
pip install opencv-python openai decord
pip install "numpy<2.0.0"

🎯 Usage

LLaVA-1.5

cd LLaVA1.5

CUDA_VISIBLE_DEVICES=0 RETAIN_TOKEN=192 PRUNE=True bash scripts/v1_5/eval/gqa.sh
CUDA_VISIBLE_DEVICES=0 RETAIN_TOKEN=192 PRUNE=True bash scripts/v1_5/eval/mmbench.sh
CUDA_VISIBLE_DEVICES=0 RETAIN_TOKEN=192 PRUNE=True bash scripts/v1_5/eval/mme.sh
CUDA_VISIBLE_DEVICES=0 RETAIN_TOKEN=192 PRUNE=True bash scripts/v1_5/eval/pope.sh

RETAIN_TOKEN=192 PRUNE=True bash llava_lmms_eval.sh
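To sweep several token budgets over the four benchmark scripts, a small driver can generate the same commands. The sketch below assumes the scripts accept any budget through `RETAIN_TOKEN` (the results table reports 192, 128, and 64) and that it is run from the `LLaVA1.5` directory. `build_jobs` and `run_jobs` are hypothetical helpers, not part of the repository.

```python
import os
import subprocess

SCRIPTS = ["gqa", "mmbench", "mme", "pope"]  # benchmark scripts listed above
BUDGETS = [192, 128, 64]                     # token budgets from the results table


def build_jobs(budgets=BUDGETS, scripts=SCRIPTS, gpu="0"):
    """Return (env_overrides, command) pairs, one per budget and benchmark."""
    jobs = []
    for budget in budgets:
        for name in scripts:
            env = {
                "CUDA_VISIBLE_DEVICES": gpu,
                "RETAIN_TOKEN": str(budget),
                "PRUNE": "True",
            }
            jobs.append((env, ["bash", f"scripts/v1_5/eval/{name}.sh"]))
    return jobs


def run_jobs(jobs, dry_run=True):
    """Print each job; execute it only when dry_run is False."""
    for env, cmd in jobs:
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, " ".join(cmd))
        if not dry_run:
            # Merge the overrides into the current environment before launching.
            subprocess.run(cmd, env={**os.environ, **env}, check=True)


run_jobs(build_jobs())  # dry run: prints the 12 commands without executing them
```

Pass `dry_run=False` to launch the evaluations; `check=True` stops the sweep at the first failing script.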

Qwen2-VL

cd Qwen2VL
RETAIN_RATIO=0.334 PRUNE=True bash qwen2vl_lmms_eval.sh
RETAIN_RATIO=0.223 PRUNE=True bash qwen2vl_lmms_eval.sh
RETAIN_RATIO=0.112 PRUNE=True bash qwen2vl_lmms_eval.sh

Video-LLaVA

git checkout video
# Evaluate with CLSE token pruning (video)
RETAIN_TOKEN=194 PRUNE=True bash activitynet_eval.sh  
RETAIN_TOKEN=194 PRUNE=True bash msvd_eval.sh         
RETAIN_TOKEN=194 PRUNE=True bash msrvtt_eval.sh       
RETAIN_TOKEN=194 PRUNE=True bash tgif_eval.sh     

📊 Key Results

Image Benchmarks (LLaVA-1.5-7B)

| Method | Venue | 192 Tokens (↓66.7%) | 128 Tokens (↓77.8%) | 64 Tokens (↓88.9%) |
|---|---|---|---|---|
| FastV | ECCV'24 | 92.1% | 87.2% | 78.0% |
| PDrop | CVPR'25 | 96.9% | 95.3% | 77.0% |
| SparseVLM | ICML'25 | 96.3% | 93.7% | 84.3% |
| FiCoCo-V | AAAI'26 | 96.2% | 94.3% | 89.8% |
| CLSE (Ours) | ECCV'26 | 99.4% | 98.1% | 94.8% |

Performance relative to the vanilla model (576 tokens, 100%). Averaged over GQA, MMB, MMB-CN, MME, POPE, SQA, VQAText, VizWiz, and OCRBench.
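The reduction percentages for the 192-, 128-, and 64-token budgets follow from the 576-token baseline. The sketch below reproduces that arithmetic. It also shows that rounding the retained fraction up to three decimals gives the `RETAIN_RATIO` values used in the Qwen2-VL commands. That correspondence is inferred from the numbers, not documented by the scripts, and the function names are illustrative.

```python
import math

BASELINE_TOKENS = 576  # vanilla LLaVA-1.5 visual tokens, per the note above


def reduction_pct(retained, baseline=BASELINE_TOKENS):
    """Percentage of visual tokens removed, rounded to one decimal."""
    return round(100 * (1 - retained / baseline), 1)


def retain_ratio(retained, baseline=BASELINE_TOKENS, digits=3):
    """Retained fraction, rounded up to `digits` decimals (assumed convention)."""
    scale = 10 ** digits
    return math.ceil(retained / baseline * scale) / scale


for retained in (192, 128, 64):
    print(retained, reduction_pct(retained), retain_ratio(retained))
# 192 66.7 0.334
# 128 77.8 0.223
# 64 88.9 0.112
```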

Video Benchmarks (Video-LLaVA-7B)

CLSE and CLSE-M achieve the highest accuracy among all training-free methods under >90% token reduction, matching or exceeding vanilla model performance when combined with token merging.

Efficiency Gains

| Model | Prefill Time ↓ | FLOPs ↓ | KV Cache ↓ | Throughput ↑ |
|---|---|---|---|---|
| LLaVA-1.5 (192 tok) | 1.51× faster | 2.28× lower | 5.59× smaller | 1.44× higher |
| Video-LLaVA (194 tok) | 2.73× faster | 9.35× lower | 8.70× smaller | 2.28× higher |

Repository Structure

This repository uses a branch-based layout: main for image MLLMs (LLaVA, Qwen2-VL) and video for Video-LLaVA. Shared modules (transformers-4.37.2) are present on both branches.

🔑 License

This project is released under the Apache 2.0 license.

📌 Citation

If you find CLSE helpful for your research, please consider citing:

@inproceedings{chen2026clse,
  title={Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models},
  author={Chen, Bin and Cai, Yuxiang and Luo, Yadan and Zhang, Yi and Yin, Jianwei and Chen, Zhi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}

Acknowledgment

We extend our gratitude to the open-source efforts of LLaVA, Qwen2-VL, Video-LLaVA, and lmms-eval.

📩 Contact

For questions about the paper or code, please email 22451332@zju.edu.cn, or open an issue on GitHub.