November 13, 2025

Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

Yuhang Han1*, Xuyang Liu2*, Zihan Zhang3, Pengxiang Ding1, Junjie Chen2,
Donglin Wang1, Honggang Chen2, Qingsen Yan4,5, Siteng Huang6✉

1 Westlake University
2 Sichuan University
3 Johns Hopkins University
4 Northwestern Polytechnical University
5 Shenzhen Research Institute of Northwestern Polytechnical University
6 Zhejiang University

🔥 News

  • 2025.11.08 🎉🎉 Our FiCoCo and GlobalCom2 have been accepted by AAAI 2026!
  • 2025.01.10 🤗🤗 We release our latest work GlobalCom2, a "global-to-local" approach for training-free acceleration of high-resolution MLLMs. Code is available!
  • 2024.11.17 🤗🤗 We release our work FiCoCo, which proposes a unified paradigm to demystify popular works and guide future designs of training-free token reduction for MLLMs.

👀 Overview

TLDR: This study introduces a unified "filter-correlate-compress" paradigm to streamline training-free token reduction in Multimodal Large Language Models (MLLMs), achieving up to 82.4% FLOPs reduction with minimal performance impact and outperforming existing methods across 10 benchmarks.
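
To make the three stage names concrete, here is a toy sketch of one way a filter-correlate-compress step could look. The importance scores, the cosine-similarity matching, and the averaging rule are all assumptions chosen for simplicity. They are not the formulation used in the paper or in this repository, and `filter_correlate_compress` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_correlate_compress(tokens, scores, r):
    """Reduce `tokens` by `r` entries (toy sketch, not the paper's method).

    tokens: list of feature vectors; scores: one importance score per token.
    """
    if not 0 <= r < len(tokens):
        raise ValueError("r must satisfy 0 <= r < len(tokens)")

    # Filter: mark the r lowest-scoring tokens for removal.
    order = sorted(range(len(tokens)), key=lambda i: scores[i])
    drop = set(order[:r])
    keep = [i for i in range(len(tokens)) if i not in drop]

    # Correlate: match each discarded token to its most similar kept token.
    groups = {i: [tokens[i]] for i in keep}
    for j in sorted(drop):
        target = max(keep, key=lambda i: cosine(tokens[j], tokens[i]))
        groups[target].append(tokens[j])

    # Compress: average each kept token with the tokens assigned to it.
    return [
        [sum(dim) / len(groups[i]) for dim in zip(*groups[i])]
        for i in keep
    ]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.1, 0.8, 0.2]  # hypothetical importance scores
reduced = filter_correlate_compress(tokens, scores, r=2)
print(len(reduced))  # 2 tokens remain
```

In this example the two low-scoring tokens are folded into their nearest high-scoring neighbors, so the sequence shrinks from four tokens to two without simply discarding information.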

🛠 Preparation

  1. Clone this repository.
     git clone https://github.com/kawhiiiileo/FiCoCo.git
     cd FiCoCo
  2. Set up the environment.
     conda create -n FiCoCo python=3.10 -y
     conda activate FiCoCo
     pip install -e .
  3. Download the multimodal benchmarks.

Please follow the detailed instructions in LLaVA-Evaluation.

  4. Download the LLaVA weights and put them under ./liuhaotian/llava-v1.5-7b.

🚀 Run and Evaluation

To configure FiCoCo, update the following settings in your code or configuration file. For example:

merge_visual: true # Enable FiCoCo-V for visual token compression
AT: true # Enable FiCoCo-L for visual token compression
r: 42 # Compress 42 tokens per layer
control_encoding_layer: 11 # Start compression from the 12th transformer layer
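
To see what these settings imply for sequence length, the sketch below tracks how many tokens leave each layer when `r` tokens are removed at every layer from `control_encoding_layer` onward. The 24-layer depth and 576 starting tokens are assumptions for illustration and are not stated in this README. `token_schedule` is a hypothetical helper, not a function from this repository.

```python
def token_schedule(num_tokens, num_layers, r, control_encoding_layer):
    """Return the token count leaving each layer under a fixed per-layer budget.

    Assumes r tokens are removed at every layer whose 0-based index is
    >= control_encoding_layer, and that at least one token always survives.
    """
    counts = []
    remaining = num_tokens
    for layer in range(num_layers):
        if layer >= control_encoding_layer:
            remaining = max(1, remaining - r)
        counts.append(remaining)
    return counts


# Hypothetical encoder: 24 layers, 576 visual tokens (assumed values).
schedule = token_schedule(
    num_tokens=576, num_layers=24, r=42, control_encoding_layer=11
)
print(schedule[10], schedule[11], schedule[-1])  # prints: 576 534 30
```

Under these assumptions, compression touches 13 layers (indices 11 through 23), removing 13 × 42 = 546 tokens and leaving 30. A larger `r` or a smaller `control_encoding_layer` removes tokens more aggressively.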

Example: evaluating on SQA (r=42, control_encoding_layer=11, merge_visual=True):

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh

To calculate theoretical computational efficiency (e.g., FLOPs), we recommend the methodology presented in LLM-Viewer. We deeply appreciate their outstanding contribution to this field.
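
For a quick sanity check before running a full analysis, a back-of-the-envelope estimate can show how per-layer cost scales with token count. The sketch below is not LLM-Viewer's implementation. It assumes four d × d attention projections, two n × n attention matmuls, and a two-matrix FFN, and it ignores normalization, softmax, and biases. The model sizes and token counts are illustrative, and `layer_cost` and `total_cost` are hypothetical helpers.

```python
def layer_cost(n, d, m):
    """Approximate multiply-accumulate count of one transformer layer.

    n: number of tokens, d: hidden size, m: FFN intermediate size.
    """
    projections = 4 * n * d * d  # Q, K, V, and output projections
    attention = 2 * n * n * d    # QK^T and attention-weighted sum of V
    ffn = 2 * n * d * m          # two FFN matrices (assumed, not gated)
    return projections + attention + ffn


def total_cost(token_counts, d, m):
    """Sum the per-layer cost given the token count entering each layer."""
    return sum(layer_cost(n, d, m) for n in token_counts)


# Hypothetical 4-layer model; all sizes are illustrative, not measured.
d, m = 1024, 4096
baseline = total_cost([576, 576, 576, 576], d, m)
reduced = total_cost([576, 576, 288, 144], d, m)
print(f"estimated reduction: {1 - reduced / baseline:.1%}")
```

Because the attention term grows quadratically in `n` while the other terms grow linearly, removing tokens early in the network saves more than removing the same number late.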

🚀 Exploring Without CLS Token

Because some MLLM visual encoders do not use a [CLS] token, we propose a feasible alternative. The results and further details can be found in the paper.

📌 Citation

If you use FiCoCo in your research, please cite our work by using the following BibTeX entry:

@misc{han2025filtercorrelatecompresstrainingfree,
      title={Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration}, 
      author={Yuhang Han and Xuyang Liu and Zihan Zhang and Pengxiang Ding and Donglin Wang and Honggang Chen and Qingsen Yan and Siteng Huang},
      year={2025},
      eprint={2411.17686},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17686}, 
}

👍 Acknowledgment

We extend our gratitude to the open-source efforts of LLaVA, ToMe and Open-LLaVA-NeXT.

📧 Contact

For any questions about our paper or code, please email yuhangh984@gmail.com or liuxuyang@stu.scu.edu.cn.