README.md
February 24, 2026
⚡️ VisionTrim: Unified Vision Token Compression for
Training-Free MLLM Acceleration
[ICLR 2026]
This is the official repository for the paper "VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration".
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues.
📰 News
- Feb. 2nd, 2026: Paper is now available on arXiv.
- Jan. 26th, 2026: VisionTrim is accepted by ICLR 2026!
⚙️ Setup
🏝️ Environment
- Clone this repository.

```shell
git clone https://github.com/hanxunyu/VisionTrim.git
cd VisionTrim
```

- Install necessary packages.

```shell
conda create -n visiontrim python=3.10 -y
conda activate visiontrim
pip install -e .
pip install protobuf
```

- (Optional) Install FlashAttention for further inference acceleration.

```shell
pip install flash-attn --no-build-isolation
```
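After installation, a quick import check can confirm the environment is usable. This is a minimal sketch, assuming the editable install exposes a `llava` package (as in the upstream LLaVA codebase) — adjust the module name if this repository differs:

```shell
# Sanity check inside the `visiontrim` env: both imports should succeed
python -c "import llava; print('llava OK')"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```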
📦️ Model
Download the corresponding LLaVA checkpoints from Hugging Face 🤗:
| Version | LLM | Checkpoint |
|---|---|---|
| LLaVA-1.5 | Vicuna-7B | liuhaotian/llava-v1.5-7b |
| LLaVA-1.5 | Vicuna-13B | liuhaotian/llava-v1.5-13b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | liuhaotian/llava-v1.6-vicuna-7b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | liuhaotian/llava-v1.6-vicuna-13b |
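One way to fetch a checkpoint is via the `huggingface_hub` CLI; the local directory below is only an example location, and any checkpoint from the table works the same way:

```shell
pip install -U "huggingface_hub[cli]"
# Download LLaVA-1.5 7B into a local folder (example path)
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ./checkpoints/llava-v1.5-7b
```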
📊 Data
Download each dataset according to EVAL.md.
🔬 Analysis
To analyze the inaccurate text-visual attention in VLMs, first download the visual instruction tuning data for LLaVA, which we use for attention computation. We provide a 1K subset for attention analysis in `./playground/data/analysis/llava_v1_5_mix1k.jsonl`.
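To verify the subset before running the analysis scripts, a quick line count should match the 1K size (path taken from this README; each JSONL line is one instruction-tuning sample):

```shell
# Expect 1000 lines, one per sample
wc -l playground/data/analysis/llava_v1_5_mix1k.jsonl
```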
🛹 Attention Shift
To analyze the attention shift in VLMs, run the script `./scripts/v1_5/analyze_attn_shift.sh`.

```shell
bash scripts/v1_5/analyze_attn_shift.sh
```
🪩 Attention Dispersion
To analyze the attention dispersion in VLMs, run the script `./scripts/v1_5/analyze_attn_dispersion.sh`.

```shell
bash scripts/v1_5/analyze_attn_dispersion.sh
```
📋️ Evaluation
The main implementation of VisionTrim is in `llava_llama.py`, `clip_encoder.py`, `llava_arch.py`, `model_vqa.py`, and `model_vqa_loader.py`.
We provide evaluation scripts for each benchmark under `./scripts/v1_5/eval`; you need to set `DVTS_token_num` and `TGVC_token_num` as bash arguments. Detailed guidance for the evaluation commands and the online submission for each benchmark can be found in EVAL.md.
For evaluation with the 13B LLM, simply replace the `CKPT` argument `llava-v1.5-7b` with `llava-v1.5-13b` in each script. For evaluation with LLaVA-NeXT, use the scripts in `./scripts/v1_6/eval`.
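A typical invocation might look like the following. The concrete values are illustrative only (the paper's settings are benchmark-specific), and mapping `$token_num` to the DVTS budget and `$token_complement` to the TGVC budget is our assumption based on the argument names in the scripts:

```shell
# Illustrative token budgets -- tune per benchmark
token_num=128         # tokens kept by DVTS (assumed mapping)
token_complement=16   # tokens complemented by TGVC (assumed mapping)
bash scripts/v1_5/eval/gqa.sh $token_num $token_complement
```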
GQA
- Download the data and evaluation scripts following the official instructions and put them under `../data/gqa/data`. You may need to modify `eval.py` due to the missing assets in the GQA v1.2 release.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/gqa.sh $token_num $token_complement
```
ScienceQA
- Under `../data/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repository.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/sqa.sh $token_num $token_complement
```
TextVQA
- Download `TextVQA_0.5.1_val.json` and images and extract them to `../data/textvqa`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/textvqa.sh $token_num $token_complement
```
POPE
- Download `coco` from POPE and put it under `../data`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/pope.sh $token_num $token_complement
```
MMBench
- Download `mmbench_dev_20230712.tsv` and put it under `../data/mmbench`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/mmbench.sh $token_num $token_complement
```

- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
MMBench-CN
- Download `mmbench_dev_cn_20231003.tsv` and put it under `../data/mmbench`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/mmbench_cn.sh $token_num $token_complement
```

- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
SEED-Bench
- Follow the official instructions to download the images and videos. Put the images under `../data/seed_bench/SEED-Bench-image`. Note that we only use the image subset for evaluation.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/seed.sh $token_num $token_complement
```
MM-Vet
- Extract `mm-vet.zip` to `../data/mmvet`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/mmvet.sh $token_num $token_complement
```

- Evaluate the predictions in `../data/eval/mmvet/results` using the official Jupyter notebook.
😊 Acknowledgement
We are grateful for the open-source contributions of other projects:
🖊️ Citation
If you find our VisionTrim useful for your research, please consider giving this repository a star and citing our paper as follows:
```bibtex
@article{yu2026visiontrim,
  title={VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration},
  author={Yu, Hanxun and Li, Wentong and Qu, Xuan and Wang, Song and Chen, Junbo and Zhu, Jianke},
  journal={arXiv preprint arXiv:2601.22674},
  year={2026}
}
```