June 27, 2026
CLSE: Cross-Layer Spectral Evolution for Token Pruning
in Multimodal Large Language Models
Accepted at ECCV 2026
Bin Chen1,2,
Yuxiang Cai1,2*,
Yadan Luo3,
Yi Zhang4,
Jianwei Yin1,2,
Zhi Chen5*
1 School of Software Technology, Zhejiang University, Ningbo, China
2 Zhejiang Key Laboratory of Digital-Intelligence Service Technology, China
3 The University of Queensland, St Lucia, QLD, Australia
4 Singapore Management University, Singapore
5 The University of Southern Queensland, Toowoomba, QLD, Australia
* Corresponding Authors: Yuxiang Cai, Zhi Chen
News
2026.06.19: CLSE is accepted at ECCV 2026! Code and models are now available!
Overview
TLDR: We propose CLSE (Cross-Layer Spectral Evolution), a training-free token pruning method for MLLMs that quantifies how visual token representations evolve across Transformer layers in the frequency domain. Tokens with stronger spectral redistribution from high-frequency details to low-frequency semantics are preserved. CLSE achieves up to 66.7% token reduction while maintaining 99.4% of original performance, and is compatible with both image and video MLLMs.
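To make the idea concrete, here is a toy NumPy sketch of scoring tokens by how much of their spectral energy moves into low frequencies between two layers. It is an illustration only, not the scoring used in this repository: the function names `low_freq_ratio` and `select_tokens`, the FFT over the hidden dimension, and the `cutoff` parameter are all assumptions made for the example.

```python
import numpy as np

def low_freq_ratio(hidden, cutoff=0.25):
    """Share of each token's spectral energy in the lowest `cutoff` fraction of bins.

    hidden: array of shape (num_tokens, hidden_dim).
    NOTE: taking the FFT over the hidden dimension is an assumption of this sketch.
    """
    power = np.abs(np.fft.rfft(hidden, axis=-1)) ** 2
    k = max(1, int(cutoff * power.shape[-1]))
    return power[:, :k].sum(axis=-1) / (power.sum(axis=-1) + 1e-12)

def select_tokens(shallow, deep, retain):
    """Keep the `retain` tokens whose low-frequency share grows most from `shallow` to `deep`."""
    score = low_freq_ratio(deep) - low_freq_ratio(shallow)
    keep = np.argsort(-score)[:retain]
    return np.sort(keep)  # restore the original token order

# Toy data: six tokens that are purely high-frequency at the shallow layer.
dim = 64
alternating = np.tile([1.0, -1.0], dim // 2)
shallow = np.tile(alternating, (6, 1))
deep = shallow.copy()
deep[:3] = 1.0  # the first three tokens become constant (all energy at frequency zero)

keep = select_tokens(shallow, deep, retain=3)
print(keep)
```

In this toy case only the first three tokens shift their energy toward low frequencies, so they are the ones retained.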
Installation
LLaVA-1.5
git clone https://github.com/zjubinchen/CLSE
cd CLSE/LLaVA1.5
conda create -n clse python=3.10 -y
conda activate clse
pip install -e transformers-4.37.2 -e .
pip install -e ../lmms-eval
pip install -e transformers-4.37.2 # patched transformers last, overrides lmms-eval's
pip install "numpy<2.0.0"
Qwen2-VL
cd CLSE/Qwen2VL
conda create -n clse_qwen python=3.10 -y
conda activate clse_qwen
pip install -r requirements.txt
pip install -e ../lmms-eval -e transformers-4.57.6 # patched transformers last, overrides lmms-eval's
Video-LLaVA
cd CLSE
git checkout video # switch to video branch
conda create -n clse_video python=3.10 -y
conda activate clse_video
pip install -e transformers-4.37.2 -e pytorchvideo -e .
pip install opencv-python openai decord
pip install "numpy<2.0.0"
Usage
LLaVA-1.5
cd LLaVA1.5
CUDA_VISIBLE_DEVICES=0 RETAIN_TOKEN=192 PRUNE=True bash scripts/v1_5/eval/gqa.sh
CUDA_VISIBLE_DEVICES=0 RETAIN_TOKEN=192 PRUNE=True bash scripts/v1_5/eval/mmbench.sh
CUDA_VISIBLE_DEVICES=0 RETAIN_TOKEN=192 PRUNE=True bash scripts/v1_5/eval/mme.sh
CUDA_VISIBLE_DEVICES=0 RETAIN_TOKEN=192 PRUNE=True bash scripts/v1_5/eval/pope.sh
RETAIN_TOKEN=192 PRUNE=True bash llava_lmms_eval.sh
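To cover several token budgets in one go, a small driver like the following could generate the same commands. This is a sketch under assumptions: it presumes the scripts accept `RETAIN_TOKEN` values of 128 and 64 (the budgets reported in the results table) in the same way as 192, and the helper names `build_jobs` and `run_jobs` are hypothetical. It defaults to a dry run that only prints the commands; run it from the `LLaVA1.5` directory.

```python
import os
import subprocess

SCRIPTS = ["gqa.sh", "mmbench.sh", "mme.sh", "pope.sh"]  # scripts listed above
BUDGETS = [192, 128, 64]  # assumed to be accepted by RETAIN_TOKEN

def build_jobs(gpu="0"):
    """Return (env, command) pairs, one per budget and benchmark script."""
    jobs = []
    for budget in BUDGETS:
        for script in SCRIPTS:
            env = {
                "CUDA_VISIBLE_DEVICES": gpu,
                "RETAIN_TOKEN": str(budget),
                "PRUNE": "True",
            }
            jobs.append((env, ["bash", f"scripts/v1_5/eval/{script}"]))
    return jobs

def run_jobs(jobs, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for env, cmd in jobs:
        if dry_run:
            print(" ".join(f"{k}={v}" for k, v in env.items()), " ".join(cmd))
        else:
            subprocess.run(cmd, env={**os.environ, **env}, check=True)

run_jobs(build_jobs())
```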
Qwen2-VL
cd Qwen2VL
RETAIN_RATIO=0.334 PRUNE=True bash qwen2vl_lmms_eval.sh
RETAIN_RATIO=0.223 PRUNE=True bash qwen2vl_lmms_eval.sh
RETAIN_RATIO=0.112 PRUNE=True bash qwen2vl_lmms_eval.sh
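The three ratios appear to mirror the 192, 128, and 64 token budgets used for LLaVA-1.5 on a 576-token input. The sketch below checks that arithmetic. The truncating conversion in `retained_tokens` is an assumption for illustration; the repository may round differently, and Qwen2-VL's visual token count varies with image resolution.

```python
LLAVA_TOKENS = 576  # vanilla LLaVA-1.5 visual token count, used here only as a reference

def retained_tokens(total, ratio):
    """Tokens kept for a given ratio, assuming truncation toward zero (hypothetical)."""
    return int(total * ratio)

for ratio in (0.334, 0.223, 0.112):
    print(ratio, retained_tokens(LLAVA_TOKENS, ratio))
```

Under this assumption, the slightly rounded-up ratios land exactly on 192, 128, and 64, whereas a ratio of 0.333 would give 191.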
Video-LLaVA
git checkout video
# Evaluate with CLSE token pruning (video)
RETAIN_TOKEN=194 PRUNE=True bash activitynet_eval.sh
RETAIN_TOKEN=194 PRUNE=True bash msvd_eval.sh
RETAIN_TOKEN=194 PRUNE=True bash msrvtt_eval.sh
RETAIN_TOKEN=194 PRUNE=True bash tgif_eval.sh
Key Results
Image Benchmarks (LLaVA-1.5-7B)
| Method | Venue | 192 Tokens (↓66.7%) | 128 Tokens (↓77.8%) | 64 Tokens (↓88.9%) |
|---|---|---|---|---|
| FastV | ECCV'24 | 92.1% | 87.2% | 78.0% |
| PDrop | CVPR'25 | 96.9% | 95.3% | 77.0% |
| SparseVLM | ICML'25 | 96.3% | 93.7% | 84.3% |
| FiCoCo-V | AAAI'26 | 96.2% | 94.3% | 89.8% |
| CLSE (Ours) | ECCV'26 | 99.4% | 98.1% | 94.8% |
Performance relative to the vanilla model (576 tokens, 100%). Averaged over GQA, MMB, MMB-CN, MME, POPE, SQA, VQAText, VizWiz, and OCRBench.
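The reduction percentages in the column headers follow directly from the 576-token baseline. A short check (the helper name `reduction_percent` is only for this example):

```python
VANILLA_TOKENS = 576  # baseline token count stated in the caption

def reduction_percent(retained, total=VANILLA_TOKENS):
    """Percentage of visual tokens removed, rounded to one decimal place."""
    return round(100.0 * (1 - retained / total), 1)

for budget in (192, 128, 64):
    print(budget, reduction_percent(budget))
```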
Video Benchmarks (Video-LLaVA-7B)
CLSE and CLSE-M achieve the highest accuracy among all training-free methods under >90% token reduction, matching or exceeding vanilla model performance when combined with token merging.
Efficiency Gains
| | Prefill Time ↓ | FLOPs ↓ | KV Cache ↓ | Throughput ↑ |
|---|---|---|---|---|
| LLaVA-1.5 (192 tok) | 1.51× faster | 2.28× lower | 5.59× smaller | 1.44× higher |
| Video-LLaVA (194 tok) | 2.73× faster | 9.35× lower | 8.70× smaller | 2.28× higher |
Repository Structure
This repository uses a branch-based layout: `main` for image MLLMs (LLaVA, Qwen2-VL) and `video` for Video-LLaVA. Shared modules (transformers-4.37.2) are present on both branches.
License
This project is released under the Apache 2.0 license.
Citation
If you find CLSE helpful for your research, please consider citing:
@inproceedings{chen2026clse,
title={Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models},
author={Chen, Bin and Cai, Yuxiang and Luo, Yadan and Zhang, Yi and Yin, Jianwei and Chen, Zhi},
booktitle={European Conference on Computer Vision (ECCV)},
year={2026}
}
Acknowledgment
We extend our gratitude to the open-source efforts of LLaVA, Qwen2-VL, Video-LLaVA, and lmms-eval.
Contact
For questions about the paper or code, please email 22451332@zju.edu.cn, or open an issue on GitHub.