TRIO

June 3, 2026

This repository is the official implementation of TRIO.

TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models
Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Yu Zhang, Ying Li, Rong Xiao

Overall

Structure

We propose TRIO, a training-free method that selects vision tokens via gradient saliency and a feature-space NMS strategy, improving efficiency while preserving performance and maintaining compatibility with efficient attention operators.
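As a rough illustration of the selection idea, the sketch below greedily keeps the most salient tokens and suppresses any token whose feature is a near-duplicate of one already kept. It is not the repository's implementation: the function name `feature_space_nms`, the cosine-similarity measure, the `sim_threshold` value, and the toy inputs are all assumptions, and the saliency scores are taken as given rather than computed from gradients.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def feature_space_nms(features, saliency, keep, sim_threshold=0.9):
    """Greedy NMS over token features (hypothetical helper, not TRIO's code).

    Visits tokens from most to least salient and keeps a token only if its
    similarity to every already-kept token is below sim_threshold.
    May return fewer than `keep` indices if too many tokens are suppressed.
    """
    order = sorted(range(len(features)), key=lambda i: saliency[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == keep:
            break
        if all(cosine(features[i], features[j]) < sim_threshold for j in selected):
            selected.append(i)
    return sorted(selected)


# Toy example: tokens 0 and 1 are near-duplicates, as are tokens 2 and 3.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
saliency = [0.9, 0.8, 0.7, 0.1]
print(feature_space_nms(features, saliency, keep=2))  # [0, 2]
```

Token 1 has the second-highest saliency but is dropped because it is almost identical to token 0, which is the behavior that separates NMS-style selection from plain top-k.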

Example

Comparison of different token selection strategies. (a) Ours; (b) text-to-image attention based; (c) diversity-oriented selection; (d) CLS-token attention based.

Performance Results

The following table presents a detailed comparison of our method (TRIO) against various state-of-the-art baselines across multiple benchmarks under different token retention budgets.

| Methods | Venue | GQA | MMB | MMB-cn | MME | POPE | SQA | VQAv2 | TextVQA | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Upper Bound (576 Tokens) | | | | | | | | | | |
| Vanilla | | 61.9 | 64.7 | 58.1 | 1862 | 85.9 | 69.5 | 78.5 | 58.2 | 100% |
| w/o VE (192 Tokens, 33.3%) | | | | | | | | | | |
| FastV | ECCV’24 | 52.7 | 61.2 | 57.0 | 1612 | 64.8 | 67.3 | 67.1 | 52.5 | 89.0% |
| PDrop | CVPR’25 | 57.1 | 63.2 | 56.8 | 1766 | 82.3 | 68.8 | 75.1 | 56.1 | 96.2% |
| SparseVLM | ICML’25 | 59.5 | 64.1 | 53.7 | 1787 | 85.3 | 68.7 | 75.6 | 57.8 | 97.2% |
| DART | EMNLP’25 | 60.0 | 63.6 | 57.0 | 1856 | 82.8 | 69.8 | 76.7 | 57.4 | 98.3% |
| TRIO (Ours) | Ours | 61.0 | 64.4 | 57.6 | 1789 | 86.5 | 69.0 | 77.7 | 57.2 | 98.8% |
| w/ VE (192 Tokens, 33.3%) | | | | | | | | | | |
| VisionZip | CVPR’25 | 59.3 | 63.0 | 57.3 | 1782 | 85.3 | 68.9 | 76.8 | 57.3 | 97.8% |
| HoloV | NeurIPS’25 | 59.0 | 65.4 | 58.0 | 1820 | 85.6 | 69.8 | 76.7 | 57.4 | 98.7% |
| SCOPE | NeurIPS’25 | 60.1 | 63.6 | 56.8 | 1804 | 86.4 | 68.8 | 77.2 | 57.7 | 98.3% |
| TRIO (Ours) | Ours | 61.1 | 64.2 | 57.9 | 1808 | 86.4 | 68.2 | 77.9 | 57.4 | 98.9% |
| w/o VE (128 Tokens, 22.2%) | | | | | | | | | | |
| FastV | ECCV’24 | 49.6 | 56.1 | 56.4 | 1490 | 59.6 | 60.2 | 61.8 | 50.6 | 83.2% |
| PDrop | CVPR’25 | 56.0 | 61.1 | 56.6 | 1644 | 82.3 | 68.3 | 72.9 | 55.1 | 94.0% |
| SparseVLM | ICML’25 | 58.4 | 64.5 | 51.1 | 1746 | 85.0 | 68.6 | 73.8 | 56.7 | 95.6% |
| DART | EMNLP’25 | 58.7 | 63.2 | 57.5 | 1840 | 80.1 | 69.1 | 75.9 | 56.4 | 97.0% |
| TRIO (Ours) | Ours | 60.0 | 62.9 | 57.1 | 1807 | 86.7 | 68.5 | 76.5 | 57.2 | 98.1% |
| w/ VE (128 Tokens, 22.2%) | | | | | | | | | | |
| VisionZip | CVPR’25 | 57.6 | 62.0 | 56.7 | 1761.7 | 83.2 | 68.9 | 75.6 | 56.8 | 96.4% |
| HoloV | NeurIPS’25 | 57.7 | 63.9 | 56.5 | 1802 | 84.0 | 69.8 | 75.5 | 56.8 | 97.2% |
| SCOPE | NeurIPS’25 | 59.7 | 62.5 | 56.9 | 1776 | 86.1 | 68.4 | 76.5 | 57.2 | 97.5% |
| TRIO (Ours) | Ours | 60.0 | 62.9 | 56.7 | 1799 | 86.4 | 69.2 | 77.1 | 57.0 | 98.1% |
| w/o VE (64 Tokens, 11.1%) | | | | | | | | | | |
| FastV | ECCV’24 | 46.1 | 48.0 | 52.7 | 1256 | 48.0 | 51.1 | 55.0 | 47.8 | 74.0% |
| PDrop | CVPR’25 | 41.9 | 33.3 | 50.5 | 1092 | 55.9 | 68.6 | 69.2 | 45.9 | 74.4% |
| SparseVLM | ICML’25 | 53.8 | 60.1 | 52.7 | 1589 | 77.5 | 69.8 | 68.2 | 53.4 | 90.6% |
| DART | EMNLP’25 | 55.9 | 60.6 | 53.2 | 1765 | 73.9 | 69.8 | 72.4 | 54.4 | 92.8% |
| TRIO (Ours) | Ours | 58.0 | 61.6 | 53.7 | 1681 | 84.3 | 68.5 | 74.8 | 54.9 | 94.7% |
| w/ VE (64 Tokens, 11.1%) | | | | | | | | | | |
| VisionZip | CVPR’25 | 55.1 | 60.1 | 55.4 | 1690 | 77.0 | 69.0 | 72.4 | 55.5 | 93.0% |
| HoloV | NeurIPS’25 | 55.3 | 63.3 | 55.1 | 1715 | 80.3 | 69.5 | 72.8 | 55.4 | 94.4% |
| SCOPE | NeurIPS’25 | 58.3 | 61.7 | 54.4 | 1698 | 83.9 | 68.6 | 75.3 | 56.6 | 95.4% |
| TRIO (Ours) | Ours | 58.3 | 61.6 | 56.5 | 1744 | 86.4 | 68.6 | 75.9 | 56.2 | 96.6% |

Note: The base model is LLaVA-1.5-7B. The "Upper Bound" baseline uses all 576 visual tokens.
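The Average column appears to be the mean, over the eight benchmarks, of each score divided by the corresponding Vanilla score. That reading is an assumption rather than something stated here, although it reproduces the reported values for the rows checked. A minimal sketch, with illustrative names:

```python
# Vanilla (576-token) scores, in the table's column order:
# GQA, MMB, MMB-cn, MME, POPE, SQA, VQAv2, TextVQA
VANILLA = [61.9, 64.7, 58.1, 1862, 85.9, 69.5, 78.5, 58.2]


def relative_average(scores, baseline=VANILLA):
    """Mean of per-benchmark score / baseline score, as a percentage."""
    ratios = [s / b for s, b in zip(scores, baseline)]
    return 100 * sum(ratios) / len(ratios)


# TRIO, w/o VE, 192 tokens (row copied from the table above)
trio_192 = [61.0, 64.4, 57.6, 1789, 86.5, 69.0, 77.7, 57.2]
print(f"{relative_average(trio_192):.1f}%")  # 98.8%
```

Normalizing per benchmark keeps MME, whose scores are in the thousands, from dominating the average.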

Efficiency Results

The table below compares TRIO with the unpruned LLaVA baselines at an 11.1% token retention budget. TRIO reduces prefill time, total time, FLOPs, and KV cache size (for example, 6.62× fewer FLOPs and 5.13× less KV cache on LLaVA-1.5-7B), while POPE accuracy drops by about 2 points (85.9 to 84.3 on LLaVA-1.5-7B and 86.5 to 84.5 on LLaVA-NEXT-7B).

| Methods | Prefill Time↓ (s) | Total Time↓ (s) | Avg FLOPs↓ (T) | KV Cache↓ (MB) | POPE (Acc) |
|---|---|---|---|---|---|
| LLaVA-1.5-7B | 1401 (1.00×) | 2234 (1.00×) | 2.98 (1.00×) | 318 (1.00×) | 85.9 |
| + Ours (11.1%) | 1106 (1.27×) | 1978 (1.13×) | 0.45 (6.62×) | 62 (5.13×) | 84.3 |
| LLaVA-NEXT-7B | 4934 (1.00×) | 5921 (1.00×) | 16.67 (1.00×) | 1156 (1.00×) | 86.5 |
| + Ours (11.1%) | 1844 (2.67×) | 2810 (2.11×) | 2.68 (6.22×) | 191 (6.05×) | 84.5 |

Note:

  • ↓ indicates that lower is better.
  • Values in parentheses represent the reduction ratio/speedup compared to the respective baseline.
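For reference, the parenthesized factors can be reproduced by dividing each baseline value by the pruned value. The snippet below is a small sketch using the LLaVA-1.5-7B rows; the dictionary keys and the function name are illustrative, not names from the repository.

```python
# LLaVA-1.5-7B rows copied from the table above.
baseline = {"prefill_s": 1401, "total_s": 2234, "flops_t": 2.98, "kv_cache_mb": 318}
pruned = {"prefill_s": 1106, "total_s": 1978, "flops_t": 0.45, "kv_cache_mb": 62}


def reduction_factors(base, reduced):
    """Baseline / pruned for every metric (all metrics are lower-is-better)."""
    return {name: base[name] / reduced[name] for name in base}


factors = reduction_factors(baseline, pruned)
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x")
# prefill_s: 1.27x
# total_s: 1.13x
# flops_t: 6.62x
# kv_cache_mb: 5.13x
```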

Set Up

LLaVA

  1. Clone this repository.
git clone https://github.com/ocy1/TRIO
cd TRIO

  2. Environment setup and preparation.

conda activate TRIO
pip install -e .
pip install flash-attn --no-build-isolation

  3. Download the multimodal benchmarks. Please follow the detailed instructions at https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md

Qwen2.5-VL

conda create -n TRIO_Qwen25VL python=3.10 -y
conda activate TRIO_Qwen25VL

pip install -U transformers==4.55.4
pip install flash-attn --no-build-isolation
pip install -e .

Usage

LLaVA

For LLaVA-1.5 and LLaVA-NeXT, you can run evaluation with the scripts under scripts/. A typical example is:

python -m llava.eval.model_vqa_loader \
    --model-path ../data/model/llava-v1.5-7b \
    --question-file ../data/eval/MME/llava_mme.jsonl \
    --image-folder ../data/eval/MME/MME_Benchmark_release_version/MME_Benchmark \
    --answers-file ../data/eval/MME/answers/llava-v1.5-7b.jsonl \
    --temperature 0 \
    --layer_list '[16]' \
    --image_token_list '[32]' \
    --visual_token_num 96 \
    --conv-mode vicuna_v1

Please replace --model-path with your own checkpoint path. --question-file specifies the evaluation file, --image-folder should point to the corresponding image directory, and --answers-file is the path used to save model predictions.

The pruning behavior is mainly controlled by --layer_list, --image_token_list, and --visual_token_num. By default, we use:

  • --layer_list '[16]'
  • --image_token_list '[32]'
  • --visual_token_num 96

This means 96 visual tokens are kept after the visual encoder, and then reduced to 32 at the 16th LLM layer. Under this setting, the model maintains an average of 64 visual tokens across all 32 layers.
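The average of 64 follows from weighting each token count by the number of layers that see it. The helper below is a sketch of that arithmetic, not code from this repository; it assumes the first 16 layers process 96 tokens and the remaining 16 process 32, and `average_visual_tokens` is a hypothetical name.

```python
def average_visual_tokens(initial_tokens, prune_layers, tokens_after, num_layers=32):
    """Layer-weighted mean visual-token count for a stepwise pruning schedule.

    Assumption: layers before prune_layers[i] see the previous count, and
    layers from prune_layers[i] onward see tokens_after[i].
    """
    total = 0
    current = initial_tokens
    start = 0
    for layer, count in zip(prune_layers, tokens_after):
        total += current * (layer - start)  # layers run at the current count
        current = count                     # prune at this layer
        start = layer
    total += current * (num_layers - start)  # remaining layers
    return total / num_layers


# Default setting: --visual_token_num 96 --layer_list '[16]' --image_token_list '[32]'
print(average_visual_tokens(96, [16], [32]))  # 64.0
```

Written out, this is (96 × 16 + 32 × 16) / 32 = 64.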

Both --layer_list and --image_token_list support multiple entries. For example, you may set:

--layer_list '[8,16]' --image_token_list '[188,96]'

to perform progressive pruning at multiple layers, similar to PyramidDrop-style scheduling.
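Since the two flags are read as parallel lists, it can help to check that they line up before launching a long evaluation. The sketch below parses the bracketed strings with `json.loads` and pairs each layer with its token budget. `parse_schedule` is a hypothetical helper, and the validation rules (equal lengths, strictly increasing layers) are assumptions about what a sensible schedule looks like, not checks taken from the evaluation script.

```python
import json


def parse_schedule(layer_list: str, image_token_list: str) -> list[tuple[int, int]]:
    """Pair each pruning layer with the token count kept from that layer on."""
    layers = json.loads(layer_list)
    tokens = json.loads(image_token_list)
    if len(layers) != len(tokens):
        raise ValueError("layer_list and image_token_list must have the same length")
    if any(b <= a for a, b in zip(layers, layers[1:])):
        raise ValueError("layers must be strictly increasing")
    return list(zip(layers, tokens))


# The two-stage example from above.
print(parse_schedule("[8,16]", "[188,96]"))  # [(8, 188), (16, 96)]
```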

Qwen2.5-VL

For Qwen2.5-VL, the evaluation command is:

python -m qwen.eval.model_vqa_loader \
    --model-path ../data/model/Qwen2.5-VL-7B-Instruct \
    --question-file ../data/eval/MME/llava_mme.jsonl \
    --image-folder ../data/eval/MME/MME_Benchmark_release_version/MME_Benchmark \
    --answers-file ../data/eval/MME/answers/Qwen2.5-VL-7B-Instruct.jsonl \
    --temperature 0 \
    --layer-list '[14]' \
    --image-token-ratio-list '[0.333]' \
    --image-token-ratio 0.167

Since Qwen2.5-VL uses dynamic-resolution visual encoding, the number of visual tokens varies across images. Therefore, ratio-based settings are used instead of fixed token counts.

By default:

  • --image-token-ratio 0.167 keeps 16.7% of visual tokens after visual encoding
  • --image-token-ratio-list '[0.333]' further reduces the remaining tokens to 33.3% of that count at the 14th LLM layer (about 5.6% of the original visual tokens)

This results in an average visual token retention ratio of 11.1% across the 28 LLM layers.
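The 11.1% figure can be checked with a layer-weighted average. The sketch below assumes the ratio applied at layer 14 is relative to the tokens that survived the first reduction, which is the reading that reproduces 11.1%; `average_retention_ratio` is a hypothetical name, not a function from this repository.

```python
def average_retention_ratio(initial_ratio, prune_layers, stage_ratios, num_layers=28):
    """Layer-weighted mean fraction of the original visual tokens that is kept.

    Assumption: each stage ratio multiplies the fraction that is still present.
    """
    total = 0.0
    current = initial_ratio
    start = 0
    for layer, ratio in zip(prune_layers, stage_ratios):
        total += current * (layer - start)  # layers before this pruning step
        current *= ratio                    # keep `ratio` of the surviving tokens
        start = layer
    total += current * (num_layers - start)  # remaining layers
    return total / num_layers


# Default setting: --image-token-ratio 0.167 --layer-list '[14]' --image-token-ratio-list '[0.333]'
print(f"{average_retention_ratio(0.167, [14], [0.333]):.3f}")  # 0.111
```

Written out, this is (0.167 × 14 + 0.167 × 0.333 × 14) / 28 ≈ 0.111.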

Citation

If you find our work helpful or use our code in your research, please consider citing our paper:

@misc{zhang2026triotokenreductioninferenceobjective,
      title={TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models}, 
      author={Haokui Zhang and Congyang Ou and Dawei Yan and Peng Wang and Qingsen Yan and Yu Zhang and Ying Li and Rong Xiao},
      year={2026},
      eprint={2602.04657},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.04657}, 
}