Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

April 9, 2026

Chenfei Liao1,2,6 Wensong Wang3,2 Zichen Wen2,5 Xu Zheng1,4,6 Yiyu Wang2 Haocong He2
Yuanhuiyi Lyu1,6 Lutao Jiang1,6 Xin Zou1,6 Yuqian Fu4 Bin Ren7,8,4 Linfeng Zhang2,📧 Xuming Hu1,6,📧

1Hong Kong University of Science and Technology (Guangzhou) 2Shanghai Jiao Tong University
3Northeastern University 4INSAIT, Sofia University “St. Kliment Ohridski”
5Shanghai AI Laboratory 6Hong Kong University of Science and Technology
7University of Pisa 8University of Trento

🚩 News

  • [2026.04.06] 📝 Our paper has been accepted to the ACL 2026 Main Track!
  • [2026.04.03] 🚀 VTC-Bench v1.0 is released! We have completed a full code refactor for better usability and performance.
  • [2026.01.23] 📝 Our updated paper is now available on arXiv.
  • [2025.10.08] 📝 Our paper is now available on arXiv.

Abstract

Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have largely focused on visual token compression. The effectiveness of these methods is commonly evaluated by measuring the accuracy drop on existing MLLM benchmarks before and after compression. However, these benchmarks were originally designed to assess general perception and reasoning abilities rather than the specific challenges posed by visual token compression, which leads to a fundamental task mismatch.

In this work, we uncover a counterintuitive yet consistent phenomenon: simple image downsampling outperforms many advanced visual token compression methods across multiple widely used benchmarks.

Through a comprehensive empirical study spanning eight popular benchmarks and multiple state-of-the-art compression techniques, we show that (i) current benchmarks contain substantial noise (task-irrelevant samples) for evaluating visual token compression, and (ii) downsampling can act as an effective data filter that distinguishes between simple and difficult samples with respect to compression sensitivity.

Motivated by these findings, we propose VTC-Bench, an evaluation framework that explicitly leverages downsampling as a discriminator to denoise existing benchmarks, enabling a fairer and more meaningful additional assessment of visual token compression methods.

Motivation

Some recent MLLMs, such as Qwen2-VL and Qwen2.5-VL, natively support inputs of varying resolutions. A trivial yet efficient way to handle high-resolution images is simply to downsample them to a lower resolution. However, most token compression methods for MLLMs instead adaptively drop uninformative tokens or merge similar tokens, an approach that should, in theory, be more intelligent than downsampling the original image.

Surprisingly, we find that image downsampling consistently outperforms these more sophisticated methods under some settings. Based on comprehensive experiments, we propose a bold hypothesis:

Some data in the existing benchmarks is overly simplistic and irrelevant to evaluating visual token compression methods, leading to the unreasonable phenomenon that even the downsampling method is sufficient to deal with the visual token compression task.

To validate this, we design a data-centric analysis using downsampling as a discriminator. We identify two crucial findings:

  1. Current benchmarks are noisy for the visual token compression task. Many samples can be answered correctly even with significant downsampling, indicating they do not test fine-grained visual understanding.
  2. Downsampling can serve as a data filter. By separating samples into "simple" (Group B) and "difficult" (Group A) based on whether downsampling succeeds, we can effectively distinguish samples that truly require advanced compression.

VTC-Bench Framework

Based on these findings, we propose VTC-Bench, a new evaluation framework specifically designed to denoise existing benchmarks. By explicitly distinguishing between “simple” and “difficult” samples through downsampling, VTC-Bench adaptively selects the “difficult” samples that are suitable for evaluating visual token compression methods.

The pipeline consists of three critical steps:

  • Step 1: Inference & Compression. Given a sample and a target token compression ratio, we run two inference pipelines: (1) a downsampling baseline (the filter) and (2) advanced visual token compression methods (e.g., FastV, VisionZip, DART) evaluated directly on the target MLLM.
  • Step 2: Grouping. We use the performance of the downsampling method as a binary discriminator to categorize samples:
    • Group A (Difficult Samples): Samples that are answered incorrectly by the downsampling method.
    • Group B (Simple Samples): Samples that are answered correctly by the downsampling method.
    This step filters the existing benchmarks, removing noisy samples that are not suitable for evaluating visual token compression methods.
  • Step 3: Result Aggregation. We perform a statistical analysis on the accuracy of the "difficult" samples to obtain an indicator that truly reflects the capability of visual compression methods.
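Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function names, the per-sample dictionaries, and the toy correctness values are all hypothetical, and it assumes each inference run has already been reduced to a mapping from sample id to whether the answer was correct.

```python
from typing import Dict, Set, Tuple


def split_groups(downsample_correct: Dict[str, bool]) -> Tuple[Set[str], Set[str]]:
    """Step 2: use the downsampling baseline as a binary discriminator."""
    # Group A (difficult): the downsampling baseline answered incorrectly.
    group_a = {sid for sid, ok in downsample_correct.items() if not ok}
    # Group B (simple): the downsampling baseline answered correctly.
    group_b = {sid for sid, ok in downsample_correct.items() if ok}
    return group_a, group_b


def group_accuracy(method_correct: Dict[str, bool], group: Set[str]) -> float:
    """Step 3: accuracy of one compression method restricted to one group."""
    scored = [method_correct[sid] for sid in group if sid in method_correct]
    if not scored:
        return float("nan")  # no overlapping samples to score
    return sum(scored) / len(scored)


# Toy, made-up correctness flags (hypothetical; not real benchmark results).
downsample = {"q1": True, "q2": False, "q3": False, "q4": True}
method_x = {"q1": True, "q2": True, "q3": False, "q4": False}

group_a, group_b = split_groups(downsample)
acc_a = group_accuracy(method_x, group_a)
print(f"Group A size: {len(group_a)}, accuracy on Group A: {acc_a:.2f}")
```

Restricting the score to Group A means a method gets credit only on samples where downsampling at the same compression ratio failed, which is the signal the framework is after.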

All inference results (raw data) can be downloaded from OneDrive.

Final evaluation results can be found in Final_Results.

Quick Start

Environment

conda create -n VTC python=3.10 -y
conda activate VTC
cd Qwen2-VL/transformers && pip install -e .
pip install accelerate "qwen-vl-utils[decord]"
pip install flash-attn --no-build-isolation
cd ../../lmms-eval && pip install -e .
pip install qwen-vl-utils
pip install flash-attention-softmax-n

Step 1: Run the downsampling baseline

bash scripts/dart.sh false [downsample_ratio]

Step 2: Run the methods to be evaluated

bash scripts/dart.sh true 1 [reduction_ratio]
bash scripts/effivlm.sh 1 [reduction_ratio]

Step 3: Analyze the data and compute the results

python tools/reorganize_data.py

Data list

├── Qwen2-VL-7B-Instruct
  ├── Downsample
    ├── 1
      📄 xxx.jsonl
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 10
  ├── VisionZip
    ├── 0.01
    ├── 0.04
    ├── 0.0625
    ├── 0.1111
    ├── 0.25
  ├── PruMerge+
  ├── FastV
├── Llava-ov-7B
  ├── Downsample
  ├── VisionZip
  ├── PruMerge+
  ├── FastV
  ├── DART

python tools/analyze_results.py --all
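
The folder names in the data list appear to pair up: the Downsample factors 2, 3, 4, 5 and 10 correspond to the VisionZip ratios 0.25, 0.1111, 0.0625, 0.04 and 0.01. This is an inference from the listing rather than something stated here. The sketch below assumes that the downsample factor applies to each image side, that the number of visual tokens scales with image area, and that the ratio folders denote the fraction of tokens retained; `retained_token_ratio` is a hypothetical helper, not part of the repository.

```python
def retained_token_ratio(downsample_factor: float) -> float:
    """Fraction of visual tokens kept when each image side shrinks by the factor.

    Assumption: token count is proportional to image area, so shrinking
    both sides by r keeps roughly 1 / r**2 of the tokens.
    """
    if downsample_factor < 1:
        raise ValueError("downsample_factor must be >= 1")
    return 1.0 / (downsample_factor ** 2)


# Factors taken from the Downsample folder names in the data list.
for factor in (2, 3, 4, 5, 10):
    print(factor, round(retained_token_ratio(factor), 4))
```

Under these assumptions, a downsampling run with factor 3 would be compared against compression methods run at a ratio of about 0.1111.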

Contact

If you have any problems, please contact:

📧 cliao127@connect.hkust-gz.edu.cn

We will respond and fix the problems ASAP. Thanks!

Citations

If you find this project helpful, please consider citing the following paper:

@article{liao2026usingrightbenchmarkevaluation,
  title={Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods},
  author={Liao, Chenfei and Wang, Wensong and Wen, Zichen and Zheng, Xu and Wang, Yiyu and He, Haocong and Lyu, Yuanhuiyi and Jiang, Lutao and Zou, Xin and Fu, Yuqian and Ren, Bin and Zhang, Linfeng and Hu, Xuming},
  journal={arXiv preprint arXiv:2510.07143},
  year={2026}
}
