November 13, 2025
Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration
Yuhang Han1*,
Xuyang Liu2*,
Zihan Zhang3,
Pengxiang Ding1,
Junjie Chen2,
Donglin Wang1,
Honggang Chen2,
Qingsen Yan4,5,
Siteng Huang6
1 Westlake University
2 Sichuan University
3 Johns Hopkins University
4 Northwestern Polytechnical University
5 Shenzhen Research Institute of Northwestern Polytechnical University
6 Zhejiang University
News
- 2025.11.08: Our FiCoCo and GlobalCom2 have been accepted by AAAI 2026!
- 2025.01.10: We release our latest work GlobalCom2, a "global-to-local" approach for training-free acceleration of high-resolution MLLMs. Code is available!
- 2024.11.17: We release our work FiCoCo, which proposes a unified paradigm to demystify popular works and guide future designs of training-free token reduction for MLLMs.
Overview
TLDR: This study introduces a unified "filter-correlate-compress" paradigm to streamline training-free token reduction in Multimodal Large Language Models (MLLMs), achieving up to 82.4% FLOPs reduction with minimal performance impact and outperforming existing methods across 10 benchmarks.
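As a rough illustration of what the three stage names could mean in practice, here is a minimal sketch. It is one plausible reading of "filter", "correlate" and "compress", not the scoring or merging rule used by FiCoCo. The function name `filter_correlate_compress`, the per-token `scores` input, the cosine-similarity matching and the plain averaging are all assumptions made for this example; see the paper for the actual formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_correlate_compress(tokens, scores, r):
    """Hypothetical helper: drop r tokens and fold each into its most similar kept token."""
    if not 0 <= r < len(tokens):
        raise ValueError("r must satisfy 0 <= r < number of tokens")

    # Filter: choose the r tokens with the lowest (assumed) importance score.
    order = sorted(range(len(tokens)), key=lambda i: scores[i])
    dropped = set(order[:r])
    kept = [i for i in range(len(tokens)) if i not in dropped]

    # Correlate: route each dropped token to the kept token it resembles most.
    target = {
        d: max(kept, key=lambda k: cosine(tokens[d], tokens[k]))
        for d in sorted(dropped)
    }

    # Compress: average each kept token with the dropped tokens routed to it.
    merged = []
    for k in kept:
        group = [tokens[k]] + [tokens[d] for d, t in target.items() if t == k]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged


# Toy input: four 2-d "tokens" with made-up importance scores.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_scores = [0.9, 0.1, 0.8, 0.2]
reduced = filter_correlate_compress(toy_tokens, toy_scores, r=2)
```

In this toy run the two low-scoring tokens are folded into their nearest neighbours, so the sequence shrinks from four tokens to two without discarding their content outright.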
Preparation
- Clone this repository.
git clone https://github.com/kawhiiiileo/FiCoCo.git
cd FiCoCo
- Environment Setup and Preparation
conda create -n FiCoCo python=3.10 -y
conda activate FiCoCo
pip install -e .
- Download Multimodal Benchmark
Please follow the detailed instructions in LLaVA-Evaluation.
- Download the LLaVA checkpoint and put it under ./liuhaotian/llava-v1.5-7b.
Run and Evaluation
To configure FiCoCo, update the corresponding settings in your code or configuration file. For example:
merge_visual: true # Enable FiCoCo-V for visual token compression
AT: true # Enable FiCoCo-L for visual token compression
r: 42 # Compress 42 tokens per layer
control_encoding_layer: 11 # Start compression from the 12th transformer layer
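To get a feel for how `r` and `control_encoding_layer` interact, the following sketch tracks the visual token count layer by layer. It only encodes the arithmetic implied by the comments above (remove `r` tokens at every layer from the 0-based index `control_encoding_layer` onward). The helper name `visual_tokens_after`, the starting count of 576 tokens, the depth of 24 layers and the floor of one token are assumptions for illustration, and the repository's actual schedule may differ.

```python
def visual_tokens_after(num_tokens, num_layers, r, control_encoding_layer):
    """Hypothetical helper: token count after each layer under a fixed per-layer reduction."""
    counts = []
    for layer in range(num_layers):
        # Compression is assumed to apply at this layer and every later one.
        if layer >= control_encoding_layer:
            # Assumed floor: never go below a single token.
            num_tokens = max(num_tokens - r, 1)
        counts.append(num_tokens)
    return counts


# Assumed setup: 576 visual tokens and a 24-layer encoder.
counts = visual_tokens_after(num_tokens=576, num_layers=24, r=42, control_encoding_layer=11)
for layer, n in enumerate(counts):
    print(f"layer {layer:2d}: {n} tokens")
```

Raising `control_encoding_layer` delays the reduction and keeps more tokens in the early layers, while raising `r` shrinks the sequence faster once compression starts.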
Example for evaluating SQA results (r=42, control_encoding_layer=11, merge_visual=True):
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
To calculate the theoretical computational efficiency shown above, we recommend the methodology presented in the work of LLM-Viewer. We deeply appreciate their outstanding contribution to this field.
Exploring Without CLS Token
Considering that some MLLM visual encoders do not involve a [CLS] token, we propose a feasible alternative. The specific results are as follows, and further details can be found in the paper.
Citation
If you use FiCoCo in your research, please cite our work by using the following BibTeX entry:
@misc{han2025filtercorrelatecompresstrainingfree,
title={Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration},
author={Yuhang Han and Xuyang Liu and Zihan Zhang and Pengxiang Ding and Donglin Wang and Honggang Chen and Qingsen Yan and Siteng Huang},
year={2025},
eprint={2411.17686},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.17686},
}
Acknowledgment
We extend our gratitude to the open-source efforts of LLaVA, ToMe and Open-LLaVA-NeXT.
:e-mail: Contact
For any questions about our paper or code, please email yuhangh984@gmail.com or liuxuyang@stu.scu.edu.cn.