README.md
February 24, 2026
⚡️ VisionTrim: Unified Vision Token Compression for
Training-Free MLLM Acceleration
[ICLR 2026]
This is the official repository for the paper "VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration".
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues.
📰 News
- Feb. 2nd, 2026: Paper is now available on arXiv.
- Jan. 26th, 2026: VisionTrim is accepted by ICLR 2026!
⚙️ Setup
🏝️ Environment
- Clone this repository.

```shell
git clone https://github.com/hanxunyu/VisionTrim.git
cd VisionTrim
```

- Install necessary packages.

```shell
conda create -n visiontrim python=3.10 -y
conda activate visiontrim
pip install -e .
pip install protobuf
```

- (Optional) Install FlashAttention for further inference acceleration.

```shell
pip install flash-attn --no-build-isolation
```
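After installation, a quick import check can confirm the environment is usable. This is a minimal sketch, assuming the editable install exposes a `llava` package (as in the upstream LLaVA codebase) — adjust the module name if this repository differs:

```shell
# Sanity check inside the `visiontrim` env: both imports should succeed
python -c "import llava; print('llava OK')"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```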
📦️ Model
Download the corresponding LLaVA checkpoints from Hugging Face 🤗:
| Version | LLM | Checkpoint |
|---|---|---|
| LLaVA-1.5 | Vicuna-7B | liuhaotian/llava-v1.5-7b |
| LLaVA-1.5 | Vicuna-13B | liuhaotian/llava-v1.5-13b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | liuhaotian/llava-v1.6-vicuna-7b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | liuhaotian/llava-v1.6-vicuna-13b |
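One way to fetch a checkpoint is via the `huggingface_hub` CLI; the local directory below is only an example location, and any checkpoint from the table works the same way:

```shell
pip install -U "huggingface_hub[cli]"
# Download LLaVA-1.5 7B into a local folder (example path)
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ./checkpoints/llava-v1.5-7b
```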
📊 Data
Download each dataset according to EVAL.md.
🔬 Analysis
To analyze the inaccurate text-visual attention in VLMs, first download the visual instruction tuning data for LLaVA, which we use for attention computation. We provide a 1K subset for attention analysis in `./playground/data/analysis/llava_v1_5_mix1k.jsonl`.
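To verify the subset before running the analysis scripts, a quick line count should match the 1K size (path taken from this README; each JSONL line is one instruction-tuning sample):

```shell
# Expect 1000 lines, one per sample
wc -l playground/data/analysis/llava_v1_5_mix1k.jsonl
```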
🛹 Attention Shift
To analyze the attention shift in VLMs, run the script `./scripts/v1_5/analyze_attn_shift.sh`.

```shell
bash scripts/v1_5/analyze_attn_shift.sh
```
🪩 Attention Dispersion
To analyze the attention dispersion in VLMs, run the script `./scripts/v1_5/analyze_attn_dispersion.sh`.

```shell
bash scripts/v1_5/analyze_attn_dispersion.sh
```
📋️ Evaluation
The main implementation of VisionTrim is in `llava_llama.py`, `clip_encoder.py`, `llava_arch.py`, `model_vqa.py`, and `model_vqa_loader.py`.
We provide evaluation scripts for each benchmark under `./scripts/v1_5/eval`; you need to set `DVTS_token_num` and `TGVC_token_num` as bash arguments. Detailed guidance for the evaluation commands and the online submission for each benchmark can be found in EVAL.md.
For evaluation with the 13B LLM, simply replace the `CKPT` argument `llava-v1.5-7b` with `llava-v1.5-13b` in each script. For evaluation with LLaVA-NeXT, use the scripts in `./scripts/v1_6/eval`.
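A typical invocation might look like the following. The concrete values are illustrative only (the paper's settings are benchmark-specific), and mapping `$token_num` to the DVTS budget and `$token_complement` to the TGVC budget is our assumption based on the argument names in the scripts:

```shell
# Illustrative token budgets -- tune per benchmark
token_num=128         # tokens kept by DVTS (assumed mapping)
token_complement=16   # tokens complemented by TGVC (assumed mapping)
bash scripts/v1_5/eval/gqa.sh $token_num $token_complement
```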
GQA
- Download the data and evaluation scripts following the official instructions and put them under `../data/gqa/data`. You may need to modify `eval.py` due to the missing assets in the GQA v1.2 release.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/gqa.sh $token_num $token_complement
```
ScienceQA
- Under `../data/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repository.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/sqa.sh $token_num $token_complement
```
TextVQA
- Download `TextVQA_0.5.1_val.json` and images and extract them to `../data/textvqa`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/textvqa.sh $token_num $token_complement
```
POPE
- Download `coco` from POPE and put it under `../data`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/pope.sh $token_num $token_complement
```
MMBench
- Download `mmbench_dev_20230712.tsv` and put it under `../data/mmbench`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/mmbench.sh $token_num $token_complement
```

- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
MMBench-CN
- Download `mmbench_dev_cn_20231003.tsv` and put it under `../data/mmbench`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/mmbench_cn.sh $token_num $token_complement
```

- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
SEED-Bench
- Follow the official instructions to download the images and videos. Put the images under `../data/seed_bench/SEED-Bench-image`. Note that we only use the image subset for evaluation.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/seed.sh $token_num $token_complement
```
MM-Vet
- Extract `mm-vet.zip` to `../data/mmvet`.
- Single-GPU or Multi-GPU inference and evaluation.

```shell
method=VisionTrim
bash scripts/v1_5/eval/mmvet.sh $token_num $token_complement
```

- Evaluate the predictions in `../data/eval/mmvet/results` using the official Jupyter notebook.
😊 Acknowledgement
We are grateful for the open-source contributions of other projects:
🖊️ Citation
If you find our VisionTrim useful for your research, please consider giving this repository a star and citing our paper as follows:
```bibtex
@article{yu2026visiontrim,
  title={VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration},
  author={Yu, Hanxun and Li, Wentong and Qu, Xuan and Wang, Song and Chen, Junbo and Zhu, Jianke},
  journal={arXiv preprint arXiv:2601.22674},
  year={2026}
}
```