Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

April 9, 2026

Chenfei Liao^1,2,6 Wensong Wang^3,2 Zichen Wen^2,5 Xu Zheng^1,4,6 Yiyu Wang^2 Haocong He^2
Yuanhuiyi Lyu^1,6 Lutao Jiang^1,6 Xin Zou^1,6 Yuqian Fu^4 Bin Ren^7,8,4 Linfeng Zhang^2,📧 Xuming Hu^1,6,📧

¹Hong Kong University of Science and Technology (Guangzhou) ²Shanghai Jiao Tong University
³Northeastern University ⁴INSAIT, Sofia University “St. Kliment Ohridski”
⁵Shanghai AI Laboratory ⁶Hong Kong University of Science and Technology
⁷University of Pisa ⁸University of Trento

🚩 News

[2026.04.06] 📝 Our paper has been accepted to the ACL 2026 Main Track!
[2026.04.03] 🚀 VTC-Bench v1.0 is released! We have completed a full code refactor for better usability and performance.
[2026.01.23] 📝 Our updated paper is now available on arXiv.
[2025.10.08] 📝 Our paper is now available on arXiv.

Abstract

Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have largely focused on visual token compression. The effectiveness of these methods is commonly evaluated by measuring the accuracy drop on existing MLLM benchmarks before and after compression. However, these benchmarks were originally designed to assess general perception and reasoning abilities rather than the specific challenges posed by visual token compression, leading to a fundamental task mismatch.

In this work, we uncover a counterintuitive yet consistent phenomenon: simple image downsampling outperforms many advanced visual token compression methods across multiple widely used benchmarks.

Through a comprehensive empirical study spanning eight popular benchmarks and multiple state-of-the-art compression techniques, we show that (i) current benchmarks contain substantial noise (task-irrelevant samples) for evaluating visual token compression, and (ii) downsampling can act as an effective data filter that distinguishes between simple and difficult samples with respect to compression sensitivity.

Motivated by these findings, we propose VTC-Bench, an evaluation framework that explicitly leverages downsampling as a discriminator to denoise existing benchmarks, enabling a fairer and more meaningful additional assessment of visual token compression methods.

Motivation

Some recent MLLMs, such as Qwen2-VL and Qwen2.5-VL, natively support inputs of varying resolutions. A trivial yet efficient way to handle high-resolution images is to simply downsample them to a lower resolution. However, most token compression methods for MLLMs instead adaptively drop uninformative tokens or merge similar tokens, an approach that should, in theory, be smarter than directly downsampling the original image.

Surprisingly, we find that image downsampling consistently outperforms more sophisticated methods under some settings. Based on comprehensive experiments, we propose a bold hypothesis:

Some data in the existing benchmarks is overly simplistic and irrelevant to evaluating visual token compression methods, leading to the unreasonable phenomenon that even the downsampling method is sufficient to deal with the visual token compression task.

To validate this, we design a data-centric analysis using downsampling as a discriminator. We identify two crucial findings:

Current benchmarks are noisy for the visual token compression task. Many samples can be answered correctly even with significant downsampling, indicating they do not test fine-grained visual understanding.
Downsampling can serve as a data filter. By separating samples into "simple" (Group B) and "difficult" (Group A) based on whether downsampling succeeds, we can effectively distinguish samples that truly require advanced compression.

VTC-Bench Framework

Based on these findings, we propose VTC-Bench, a new evaluation framework specifically designed to denoise existing benchmarks. By explicitly distinguishing between "simple" and "difficult" samples through downsampling, VTC-Bench adaptively selects the "difficult" samples that meet the requirements for evaluating visual token compression methods.

The pipeline consists of three critical steps:

Step 1: Inference & Compression. Given a sample and a target token compression ratio, we run two inference pipelines: (1) a downsampling baseline (the filter) and (2) advanced visual token compression methods (e.g., FastV, VisionZip, DART) evaluated directly on the target MLLM.
Step 2: Grouping. We use the performance of the downsampling method as a binary discriminator to categorize samples:
- Group A (Difficult Samples): Samples that the downsampling method answers incorrectly.
- Group B (Simple Samples): Samples that the downsampling method answers correctly.
This step filters the existing benchmarks and removes noisy data that is not suitable for evaluating visual token compression methods.
Step 3: Result Aggregation. We perform a statistical analysis on the accuracy of the "difficult" samples to obtain an indicator that truly reflects the capability of visual token compression methods.
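
The grouping and aggregation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the `SampleResult` record, its field names, and the `evaluate` helper are hypothetical, and the sketch assumes per-sample correctness flags are already available for both the downsampling baseline and the method under test at a matched compression ratio.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SampleResult:
    """Hypothetical per-sample record; field names are assumptions."""
    sample_id: str
    downsample_correct: bool  # did the downsampling baseline answer correctly?
    method_correct: bool      # did the compression method under test answer correctly?


def split_groups(results: List[SampleResult]) -> Tuple[List[SampleResult], List[SampleResult]]:
    """Use the downsampling outcome as a binary discriminator."""
    group_a = [r for r in results if not r.downsample_correct]  # difficult samples
    group_b = [r for r in results if r.downsample_correct]      # simple samples
    return group_a, group_b


def accuracy(results: List[SampleResult]) -> Optional[float]:
    """Accuracy of the method under test; None for an empty group."""
    if not results:
        return None
    return sum(r.method_correct for r in results) / len(results)


def evaluate(results: List[SampleResult]) -> dict:
    """Report accuracy separately on Group A, Group B, and all samples."""
    group_a, group_b = split_groups(results)
    return {
        "group_a_size": len(group_a),
        "group_b_size": len(group_b),
        "group_a_acc": accuracy(group_a),
        "group_b_acc": accuracy(group_b),
        "overall_acc": accuracy(results),
    }


# Toy data for illustration only; these are not results from the paper.
results = [
    SampleResult("q1", downsample_correct=True, method_correct=True),
    SampleResult("q2", downsample_correct=True, method_correct=False),
    SampleResult("q3", downsample_correct=False, method_correct=True),
    SampleResult("q4", downsample_correct=False, method_correct=False),
    SampleResult("q5", downsample_correct=False, method_correct=True),
]
report = evaluate(results)
print(report)
```

Reporting Group A accuracy alongside the overall accuracy makes it visible when a method's headline score is carried mostly by samples that plain downsampling already solves.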

Installation

conda create -n VTC python=3.10 -y
conda activate VTC
cd Qwen2-VL/transformers && pip install -e .
pip install accelerate "qwen-vl-utils[decord]"
pip install flash-attn --no-build-isolation
cd ../../lmms-eval && pip install -e .
pip install qwen-vl-utils
pip install flash-attention-softmax-n

Step 1: Run the downsampling baseline

bash scripts/dart.sh false [downsample_ratio]

Step 2: Run the methods to be evaluated

bash scripts/dart.sh true 1 [reduction_ratio]
bash scripts/effivlm.sh 1 [reduction_ratio]

Step 3: Analyze the data and compute the results

python tools/reorganize_data.py

Data list
├── Qwen2-VL-7B-Instruct
  ├── Downsample
    ├── 1
      📄 xxx.jsonl
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 10
  ├── VisionZip
    ├── 0.01
    ├── 0.04
    ├── 0.0625
    ├── 0.1111
    ├── 0.25
  ├── PruMerge+
  ├── FastV
├── Llava-ov-7B
  ├── Downsample
  ├── VisionZip
  ├── PruMerge+
  ├── FastV
  ├── DART
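
The folder names in this layout suggest how the downsampling baseline is matched to a compression ratio: downsampling each image side by a factor k leaves roughly 1/k² of the pixels, and the VisionZip folders (0.25, 0.1111, 0.0625, 0.04, 0.01) equal 1/k² for k = 2, 3, 4, 5, 10. The sketch below makes that pairing explicit. It is an inference from the directory names alone, and `matched_retention_ratio` is a hypothetical helper, not part of the repository.

```python
def matched_retention_ratio(downsample_factor: int, ndigits: int = 4) -> float:
    """Token retention ratio assumed to match a per-side downsampling factor.

    Assumption: shrinking both height and width by `downsample_factor`
    keeps about 1 / downsample_factor**2 of the visual tokens.
    """
    if downsample_factor < 1:
        raise ValueError("downsample_factor must be >= 1")
    return round(1.0 / downsample_factor ** 2, ndigits)


# Downsample folder names taken from the layout above.
DOWNSAMPLE_DIRS = ["1", "2", "3", "4", "5", "10"]

# Map each downsampling folder to the compression-ratio folder it would pair with.
pairs = {name: matched_retention_ratio(int(name)) for name in DOWNSAMPLE_DIRS}

for name, ratio in pairs.items():
    print(f"Downsample/{name} <-> retention ratio {ratio}")
```

Under this assumption, factor 1 corresponds to the uncompressed model (retention ratio 1.0), which has no counterpart among the VisionZip folders listed.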

python tools/analyze_results.py --all

Contact

If you have any problems, please contact:

📧 cliao127@connect.hkust-gz.edu.cn

We will respond and fix the problems ASAP! Thanks!

Citations

If you find this project helpful, please consider citing the following paper:

@article{liao2026usingrightbenchmarkevaluation,
  title={Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods},
  author={Liao, Chenfei and Wang, Wensong and Wen, Zichen and Zheng, Xu and Wang, Yiyu and He, Haocong and Lyu, Yuanhuiyi and Jiang, Lutao and Zou, Xin and Fu, Yuqian and Ren, Bin and Zhang, Linfeng and Hu, Xuming},
  journal={arXiv preprint arXiv:2510.07143},
  year={2026}
}