TRIO

June 3, 2026

This repository is the official implementation of TRIO.

TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models
Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Yu Zhang, Ying Li, Rong Xiao

Overview

We propose TRIO, a training-free method that selects vision tokens via gradient saliency and a feature-space NMS strategy, improving efficiency while preserving performance and maintaining compatibility with efficient attention operators.
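The selection procedure itself is not shown in this README. As a rough illustration only, the sketch below shows one common way to run greedy non-maximum suppression in feature space: tokens are visited in order of decreasing saliency, and a token is skipped when it is too similar to one already kept. It assumes the saliency scores are already computed and uses cosine similarity as the suppression criterion. The function names, the `sim_threshold` parameter, and its value are hypothetical and are not taken from the TRIO code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def feature_space_nms(features, saliency, keep, sim_threshold=0.9):
    """Greedy NMS sketch (hypothetical helper, not the authors' implementation).

    Visits tokens from most to least salient and keeps a token only if its
    cosine similarity to every already-kept token is below sim_threshold.
    Returns the kept token indices in their original order. Fewer than
    `keep` indices are returned if too many tokens are suppressed.
    """
    order = sorted(range(len(features)), key=lambda i: saliency[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == keep:
            break
        if all(cosine(features[i], features[j]) < sim_threshold for j in selected):
            selected.append(i)
    return sorted(selected)

# Toy example: token 1 is nearly identical to token 0, so it is suppressed.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
saliency = [0.9, 0.8, 0.5, 0.3]
print(feature_space_nms(features, saliency, keep=2))  # [0, 2]
```

In this toy case the second most salient token is dropped in favor of a less salient but more distinct one, which is the behavior that separates NMS-style selection from plain top-k selection.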

Example

Comparison of different token selection strategies. (a) Ours; (b) text-to-image attention based; (c) diversity-oriented selection; (d) CLS-token attention based.

Performance Results

The following table presents a detailed comparison of our method (TRIO) against various state-of-the-art baselines across multiple benchmarks under different token retention budgets.

Methods	Venue	GQA	MMB	MMB-cn	MME	POPE	SQA	VQAv2	TextVQA	Average
Upper Bound (576 Tokens)
Vanilla	–	61.9	64.7	58.1	1862	85.9	69.5	78.5	58.2	100%
w/o VE (192 Tokens, 33.3%)
FastV	ECCV’24	52.7	61.2	57.0	1612	64.8	67.3	67.1	52.5	89.0%
PDrop	CVPR’25	57.1	63.2	56.8	1766	82.3	68.8	75.1	56.1	96.2%
SparseVLM	ICML’25	59.5	64.1	53.7	1787	85.3	68.7	75.6	57.8	97.2%
DART	EMNLP’25	60.0	63.6	57.0	1856	82.8	69.8	76.7	57.4	98.3%
TRIO (Ours)	Ours	61.0	64.4	57.6	1789	86.5	69.0	77.7	57.2	98.8%
w/ VE (192 Tokens, 33.3%)
VisionZip	CVPR’25	59.3	63.0	57.3	1782	85.3	68.9	76.8	57.3	97.8%
HoloV	NeurIPS’25	59.0	65.4	58.0	1820	85.6	69.8	76.7	57.4	98.7%
SCOPE	NeurIPS’25	60.1	63.6	56.8	1804	86.4	68.8	77.2	57.7	98.3%
TRIO (Ours)	Ours	61.1	64.2	57.9	1808	86.4	68.2	77.9	57.4	98.9%
w/o VE (128 Tokens, 22.2%)
FastV	ECCV’24	49.6	56.1	56.4	1490	59.6	60.2	61.8	50.6	83.2%
PDrop	CVPR’25	56.0	61.1	56.6	1644	82.3	68.3	72.9	55.1	94.0%
SparseVLM	ICML’25	58.4	64.5	51.1	1746	85.0	68.6	73.8	56.7	95.6%
DART	EMNLP’25	58.7	63.2	57.5	1840	80.1	69.1	75.9	56.4	97.0%
TRIO (Ours)	Ours	60.0	62.9	57.1	1807	86.7	68.5	76.5	57.2	98.1%
w/ VE (128 Tokens, 22.2%)
VisionZip	CVPR’25	57.6	62.0	56.7	1761.7	83.2	68.9	75.6	56.8	96.4%
HoloV	NeurIPS’25	57.7	63.9	56.5	1802	84.0	69.8	75.5	56.8	97.2%
SCOPE	NeurIPS’25	59.7	62.5	56.9	1776	86.1	68.4	76.5	57.2	97.5%
TRIO (Ours)	Ours	60.0	62.9	56.7	1799	86.4	69.2	77.1	57.0	98.1%
w/o VE (64 Tokens, 11.1%)
FastV	ECCV’24	46.1	48.0	52.7	1256	48.0	51.1	55.0	47.8	74.0%
PDrop	CVPR’25	41.9	33.3	50.5	1092	55.9	68.6	69.2	45.9	74.4%
SparseVLM	ICML’25	53.8	60.1	52.7	1589	77.5	69.8	68.2	53.4	90.6%
DART	EMNLP’25	55.9	60.6	53.2	1765	73.9	69.8	72.4	54.4	92.8%
TRIO (Ours)	Ours	58.0	61.6	53.7	1681	84.3	68.5	74.8	54.9	94.7%
w/ VE (64 Tokens, 11.1%)
VisionZip	CVPR’25	55.1	60.1	55.4	1690	77.0	69.0	72.4	55.5	93.0%
HoloV	NeurIPS’25	55.3	63.3	55.1	1715	80.3	69.5	72.8	55.4	94.4%
SCOPE	NeurIPS’25	58.3	61.7	54.4	1698	83.9	68.6	75.3	56.6	95.4%
TRIO (Ours)	Ours	58.3	61.6	56.5	1744	86.4	68.6	75.9	56.2	96.6%

Note: The base model is LLaVA-1.5-7B. The "Upper Bound" baseline uses all 576 visual tokens.
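The table does not state how the Average column is computed. One reading that reproduces the reported value, offered here as an assumption rather than as the authors' evaluation code, is the mean of each benchmark score divided by the corresponding Vanilla score. The sketch below checks this for the TRIO row at 192 tokens without VE; the name `relative_average` is hypothetical.

```python
# Benchmark order: GQA, MMB, MMB-cn, MME, POPE, SQA, VQAv2, TextVQA
VANILLA = [61.9, 64.7, 58.1, 1862, 85.9, 69.5, 78.5, 58.2]
TRIO_192_NO_VE = [61.0, 64.4, 57.6, 1789, 86.5, 69.0, 77.7, 57.2]

def relative_average(scores, reference):
    """Mean of per-benchmark score ratios as a percentage (assumed definition)."""
    ratios = [s / r for s, r in zip(scores, reference)]
    return 100.0 * sum(ratios) / len(ratios)

avg = relative_average(TRIO_192_NO_VE, VANILLA)
print(f"{avg:.1f}%")  # 98.8%, matching the Average reported for this row
```

The token budgets in the group headers follow the same kind of arithmetic: 192, 128, and 64 tokens out of 576 correspond to 33.3%, 22.2%, and 11.1%.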

The following table compares the efficiency of our method against the baseline LLaVA models. Our approach significantly reduces computational overhead (Prefill Time, Total Time, FLOPs, and KV Cache) while maintaining competitive performance on the POPE benchmark.

Methods	Prefill Time↓ (s)	Total Time↓ (s)	Avg FLOPs↓ (T)	KV Cache↓ (MB)	POPE (Acc)
LLaVA-1.5-7B	1401 (1.00×)	2234 (1.00×)	2.98 (1.00×)	318 (1.00×)	85.9
+Ours (11.1%)	1106 (1.27×)	1978 (1.13×)	0.45 (6.62×)	62 (5.13×)	84.3
LLaVA-NEXT-7B	4934 (1.00×)	5921 (1.00×)	16.67 (1.00×)	1156 (1.00×)	86.5
+Ours (11.1%)	1844 (2.67×)	2810 (2.11×)	2.68 (6.22×)	191 (6.05×)	84.5

Note: ↓ indicates that lower is better. Values in parentheses give the reduction ratio or speedup relative to the respective baseline.
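As a quick check of how the parenthesized values relate to the raw measurements, the sketch below divides each LLaVA-1.5-7B baseline figure by the corresponding figure with TRIO at 11.1% retention. The dictionary keys and the `speedup` helper are illustrative names, not part of the repository.

```python
def speedup(baseline, reduced):
    """Ratio of the baseline cost to the reduced cost, rounded to two decimals."""
    return round(baseline / reduced, 2)

# Values copied from the LLaVA-1.5-7B rows of the efficiency table.
baseline = {"prefill_s": 1401, "total_s": 2234, "flops_t": 2.98, "kv_cache_mb": 318}
with_trio = {"prefill_s": 1106, "total_s": 1978, "flops_t": 0.45, "kv_cache_mb": 62}

ratios = {name: speedup(baseline[name], with_trio[name]) for name in baseline}
print(ratios)
# {'prefill_s': 1.27, 'total_s': 1.13, 'flops_t': 6.62, 'kv_cache_mb': 5.13}
```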

Set Up

LLaVA

1. Clone this repository.

git clone https://github.com/ocy1/TRIO
cd TRIO

2. Environment Setup and Preparation

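# Create the environment first if it does not exist yet (Python 3.10 is an assumption, mirroring the Qwen2.5-VL setup)
conda create -n TRIO python=3.10 -y
conda activate TRIO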
pip install -e .
pip install flash-attn --no-build-isolation

3. Download the multimodal benchmarks. Please follow the detailed instructions in https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md

Qwen2.5-VL

conda create -n TRIO_Qwen25VL python=3.10 -y
conda activate TRIO_Qwen25VL

pip install -U transformers==4.55.4
pip install flash-attn --no-build-isolation
pip install -e .

Usage

LLaVA

For LLaVA-1.5 and LLaVA-NeXT, you can run evaluation with the scripts under scripts/. A typical example is:

python -m llava.eval.model_vqa_loader \
    --model-path ../data/model/llava-v1.5-7b \
    --question-file ../data/eval/MME/llava_mme.jsonl \
    --image-folder ../data/eval/MME/MME_Benchmark_release_version/MME_Benchmark \
    --answers-file ../data/eval/MME/answers/llava-v1.5-7b.jsonl \
    --temperature 0 \
    --layer_list '[16]' \
    --image_token_list '[32]' \
    --visual_token_num 96 \
    --conv-mode vicuna_v1

Please replace --model-path with your own checkpoint path. --question-file specifies the evaluation file, --image-folder should point to the corresponding image directory, and --answers-file is the path used to save model predictions.

The pruning behavior is mainly controlled by --layer_list, --image_token_list, and --visual_token_num. By default, we use:

--layer_list '[16]'
--image_token_list '[32]'
--visual_token_num 96

This means 96 visual tokens are kept after the visual encoder and then reduced to 32 at the 16th LLM layer. Under this setting, the first 16 layers process 96 visual tokens and the remaining 16 layers process 32, so the model maintains an average of 64 visual tokens across all 32 layers.
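The average can be reproduced with a few lines of arithmetic. The sketch below is a hypothetical helper, not a function from this repository; it assumes that a pruning entry at layer index k takes effect from layer k onward (0-indexed) and that the LLM has 32 layers, as stated above.

```python
def average_visual_tokens(visual_token_num, layer_list, image_token_list, num_layers=32):
    """Average number of visual tokens seen per LLM layer under a pruning schedule.

    Assumption: the count in image_token_list[i] applies from layer
    layer_list[i] (0-indexed) until the next scheduled layer.
    """
    if len(layer_list) != len(image_token_list):
        raise ValueError("layer_list and image_token_list must have the same length")
    schedule = dict(zip(layer_list, image_token_list))
    current = visual_token_num  # tokens kept after the visual encoder
    total = 0
    for layer in range(num_layers):
        current = schedule.get(layer, current)
        total += current
    return total / num_layers

print(average_visual_tokens(96, [16], [32]))  # 64.0
```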

Both --layer_list and --image_token_list support multiple entries. For example, you may set:

--layer_list '[8,16]' --image_token_list '[188,96]'

to perform progressive pruning at multiple layers, similar to PyramidDrop-style scheduling.
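How the repository parses these flags is not shown here. As an illustration only, the sketch below assumes both flags carry JSON-style lists and pairs each pruning layer with its token budget, rejecting schedules whose layers are not increasing or whose budgets grow. The `parse_schedule` helper and its checks are hypothetical.

```python
import json

def parse_schedule(layer_list_arg, image_token_list_arg):
    """Pair pruning layers with token budgets (hypothetical helper).

    Both arguments are strings such as '[8,16]', assumed to be JSON lists.
    """
    layers = json.loads(layer_list_arg)
    tokens = json.loads(image_token_list_arg)
    if len(layers) != len(tokens):
        raise ValueError("both lists must have the same number of entries")
    if any(later <= earlier for earlier, later in zip(layers, layers[1:])):
        raise ValueError("layers must be strictly increasing")
    if any(later > earlier for earlier, later in zip(tokens, tokens[1:])):
        raise ValueError("token budgets must not increase with depth")
    return list(zip(layers, tokens))

print(parse_schedule('[8,16]', '[188,96]'))  # [(8, 188), (16, 96)]
```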

Qwen2.5-VL

For Qwen2.5-VL, the evaluation command is:

python -m qwen.eval.model_vqa_loader \
    --model-path ../data/model/Qwen2.5-VL-7B-Instruct \
    --question-file ../data/eval/MME/llava_mme.jsonl \
    --image-folder ../data/eval/MME/MME_Benchmark_release_version/MME_Benchmark \
    --answers-file ../data/eval/MME/answers/Qwen2.5-VL-7B-Instruct.jsonl \
    --temperature 0 \
    --layer-list '[14]' \
    --image-token-ratio-list '[0.333]' \
    --image-token-ratio 0.167

Since Qwen2.5-VL uses dynamic-resolution visual encoding, the number of visual tokens varies across images. Therefore, ratio-based settings are used instead of fixed token counts.

By default:

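--image-token-ratio 0.167 keeps 16.7% of visual tokens after visual encoding
--image-token-ratio-list '[0.333]' further reduces the kept tokens to 33.3% of that amount (about 5.6% of the original tokens) at the 14th LLM layer

This results in an average visual token retention ratio of 11.1% across the 28 LLM layers: 16.7% for the first 14 layers and about 5.6% for the remaining 14.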
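The 11.1% figure can be checked with the sketch below. It assumes that each entry of --image-token-ratio-list is relative to the tokens kept after visual encoding, which is the reading that reproduces the stated average, and that a pruning entry at layer index k applies from layer k onward (0-indexed). The helper name is hypothetical and is not taken from the repository.

```python
def average_retention_ratio(image_token_ratio, layer_list, ratio_list, num_layers=28):
    """Average fraction of the original visual tokens seen per LLM layer.

    Assumption: ratio_list[i] is relative to the tokens kept after visual
    encoding and applies from layer layer_list[i] (0-indexed) onward.
    """
    if len(layer_list) != len(ratio_list):
        raise ValueError("layer_list and ratio_list must have the same length")
    schedule = dict(zip(layer_list, ratio_list))
    current = image_token_ratio  # fraction kept after visual encoding
    total = 0.0
    for layer in range(num_layers):
        if layer in schedule:
            current = image_token_ratio * schedule[layer]
        total += current
    return total / num_layers

print(f"{average_retention_ratio(0.167, [14], [0.333]):.1%}")  # 11.1%
```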

Citation

If you find our work helpful or use our code in your research, please consider citing our paper:

@misc{zhang2026triotokenreductioninferenceobjective,
      title={TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models}, 
      author={Haokui Zhang and Congyang Ou and Dawei Yan and Peng Wang and Qingsen Yan and Yu Zhang and Ying Li and Rong Xiao},
      year={2026},
      eprint={2602.04657},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.04657}, 
}