OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport
February 27, 2026
Official implementation of OTPrune, a training-free visual token pruning framework for Multi-modal Large Language Models (MLLMs).
OTPrune formulates token pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, it preserves both local diversity and global representativeness while significantly reducing inference cost (~90% token reduction with minimal performance drop).
Highlights
- Training-free: No fine-tuning required — plug into any LLaVA model at inference time
- Theoretically grounded: Pruning objective is proven to be monotone and submodular, guaranteeing near-optimal greedy solutions
- High compression: ~90% visual token reduction (keep only ~10%) with competitive performance
- Simple integration: Controlled entirely via environment variables — zero code change needed at inference
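To make the compression figure concrete: LLaVA-1.5 encodes each image into 576 visual tokens (24×24 patches at 336 px resolution, a property of the underlying CLIP ViT-L/14 encoder rather than something stated in this README), so the SUBSET_RATIO used in the Quick Start commands works out as:

```python
# Illustrative arithmetic only. 576 is LLaVA-1.5's per-image visual token
# count (24x24 patches); it is an assumption here, not a value from this repo.
full_tokens = 576
subset_ratio = 0.098            # value used in the Quick Start commands below

kept = round(full_tokens * subset_ratio)
reduction = 1 - kept / full_tokens

print(kept)                     # -> 56 tokens kept per image
print(f"{reduction:.1%}")       # -> 90.3% token reduction
```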
Installation
1. Environment Setup
conda create -n otprune python=3.10 -y
conda activate otprune
2. Install LLaVA
cd LLaVA
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
3. Install lmms-eval
cd ../lmms_eval
pip install -e .
Quick Start
Single GPU Evaluation
CUDA_VISIBLE_DEVICES=0 \
BASELINE=OURS \
LAYER_INDEX=0 \
SUBSET_RATIO=0.098 \
python3 -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model llava \
--model_args pretrained="liuhaotian/llava-v1.5-7b" \
--tasks gqa,pope,mme \
--batch_size 1 \
--log_samples \
--output_path ./logs/otprune_7b
Multi-GPU Evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3 \
BASELINE=OURS \
LAYER_INDEX=0 \
SUBSET_RATIO=0.098 \
python3 -m accelerate.commands.launch \
--num_processes=4 \
-m lmms_eval \
--model llava \
--model_args pretrained="liuhaotian/llava-v1.5-13b" \
--tasks gqa,pope,mme \
--batch_size 1 \
--log_samples \
--output_path ./logs/otprune_13b
Full Benchmark (Paper Results)
bash run_OTPrune.sh
This runs evaluation on all 11 benchmarks: COCO Caption, Flickr30k, GQA, MMBench, MME, MMMU, NoCaps, OK-VQA, POPE, ScienceQA, and SEEDBench.
Environment Variables
| Variable | Description | Default |
|---|---|---|
| BASELINE | Set to OURS to enable OTPrune | - |
| LAYER_INDEX | Layer at which to apply pruning (0 = embedding layer) | - |
| SUBSET_RATIO | Fraction of visual tokens to keep (e.g., 0.098 ≈ 10%) | - |
When LAYER_INDEX and BASELINE are not set, the model runs without pruning (baseline).
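A minimal sketch of how this environment-variable gating can be read at inference time. The variable names come from the table above, but the parsing logic and fallback values here are hypothetical, not copied from llava_arch.py:

```python
import os

def read_otprune_config():
    """Read OTPrune settings from the environment.

    Returns None when pruning is disabled, mirroring the documented
    behavior: if BASELINE is unset (or not OURS), run the unpruned baseline.
    Fallback values below are illustrative assumptions, not repo defaults.
    """
    if os.environ.get("BASELINE") != "OURS":
        return None  # no pruning: full visual token sequence
    layer_index = int(os.environ.get("LAYER_INDEX", "0"))      # 0 = embedding layer
    subset_ratio = float(os.environ.get("SUBSET_RATIO", "0.098"))  # keep ~10%
    return {"layer_index": layer_index, "subset_ratio": subset_ratio}
```

With `BASELINE=OURS LAYER_INDEX=0 SUBSET_RATIO=0.098`, this returns `{"layer_index": 0, "subset_ratio": 0.098}`; with BASELINE unset it returns None and the model runs unpruned.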
Supported Models
| Model | HuggingFace ID |
|---|---|
| LLaVA-1.5-7B | liuhaotian/llava-v1.5-7b |
| LLaVA-1.6-7B | liuhaotian/llava-v1.6-vicuna-7b |
| LLaVA-1.5-13B | liuhaotian/llava-v1.5-13b |
Method Overview
OTPrune formulates visual token pruning as a distribution alignment problem under the optimal transport (OT) framework:
- OT Objective: We minimize the 2-Wasserstein distance between the full and pruned token distributions, ensuring the selected subset faithfully represents the original visual information
- Tractable Relaxation: The OT objective is relaxed into a submodular maximization problem over a kernel matrix built from the normalized token similarity matrix. We theoretically prove this objective is monotone and submodular, which gives greedy optimization the classical (1 − 1/e) approximation guarantee
- Efficient Greedy Selection: Tokens are selected iteratively to maximize the marginal gain of the submodular objective, with Cholesky-based incremental updates keeping each gain computation cheap
- Seamless Integration: The selected tokens replace the full visual token sequence (preserving system and text tokens) before LLM processing — no architectural modification required
This principled formulation ensures that the pruned token set preserves both local diversity (avoiding redundant nearby tokens) and global representativeness (covering the full semantic distribution of the image).
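The greedy loop described above can be sketched as follows. This is the generic log-determinant greedy with incremental Cholesky updates over a cosine-similarity kernel, used here as a stand-in for illustration; the actual kernel construction and objective live in greedy_select() and OTPrune() inside llava_arch.py and may differ in detail:

```python
import numpy as np

def greedy_logdet_select(features, k):
    """Greedily pick k token indices maximizing log det of a kernel submatrix.

    features: (n, d) token features. The kernel L = X X^T + I over L2-normalized
    features is a generic PSD stand-in, not the paper's exact construction.
    Incremental Cholesky updates make each greedy step O(n * |selected|),
    so repeated diverse tokens are penalized and never reselected.
    """
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    L = X @ X.T + np.eye(len(X))   # identity keeps the kernel well-conditioned
    n = L.shape[0]
    d2 = np.diag(L).copy()         # marginal gains = squared Cholesky diagonals
    C = np.zeros((k, n))           # growing rows of the Cholesky factor
    selected = []
    for i in range(k):
        j = int(np.argmax(d2))     # token with the largest marginal gain
        selected.append(j)
        c_j = np.sqrt(d2[j])
        # Cholesky row for every candidate, conditioned on tokens chosen so far
        e = (L[j] - C[:i].T @ C[:i, j]) / c_j
        C[i] = e
        d2 = d2 - e ** 2           # shrink gains of tokens similar to j
        d2[j] = -np.inf            # never reselect the same token
    return selected
```

On four toy tokens where the first two are identical, the greedy pass picks the first token and then skips its duplicate in favor of the orthogonal one: `greedy_logdet_select(np.array([[1., 0.], [1., 0.], [0., 1.], [0.9, 0.1]]), 2)` returns `[0, 2]`, illustrating how the objective trades off coverage against redundancy.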
Project Structure
otprune/
├── README.md
├── run_OTPrune.sh # Main evaluation script
├── LLaVA/ # Modified LLaVA codebase
│ └── llava/
│ └── model/
│ └── llava_arch.py # Core OTPrune implementation
│ ├── greedy_select() # Fast greedy selection algorithm
│ ├── OTPrune() # Kernel construction + token selection
│ └── prepare_inputs_labels_for_multimodal()
│ # Integration into LLaVA forward pass
└── lmms_eval/ # Evaluation framework
├── models/
│ └── llava.py # LLaVA model wrapper for evaluation
└── tasks/ # 11 VL benchmark configurations
Citation
@article{chen2026otprune,
title={OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport},
author={Chen, Xiwen and Zhu, Wenhui and Li, Gen and Dong, Xuanzhao and Xiong, Yujian and Wang, Hao and Qiu, Peijie and Song, Qingquan and Wang, Zhipeng and Tang, Shao and others},
journal={arXiv preprint arXiv:2602.20205},
year={2026}
}
Acknowledgments
This codebase is built upon LLaVA, lmms-eval, and DivPrune. We thank the authors for their open-source contributions.
License
This project is licensed under the Apache License 2.0. See the original LLaVA license for the base codebase.