March 24, 2026
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
📄 Paper | 🤗 VisionSelector-Qwen2.5-VL-3B | 🤗 VisionSelector-Qwen2.5-VL-7B | 🤗 VisionSelector-LLaVA-OV-1.5-8B
Jiaying Zhu¹, Yurui Zhu²*, Xin Lu¹, Wenrui Yan², Dong Li¹, Kunlin Liu², Xueyang Fu¹*, Zheng-Jun Zha¹
¹University of Science and Technology of China, ²ZTE Corporation
*Equal Advising
📝 To do list
- Release training code
- Release evaluation code
- Release comparison method code
- Release model weights
- Release inference code
📖 Overview
We introduce VisionSelector, an end-to-end learnable framework that recasts visual token compression as an optimization-driven decision process. VisionSelector integrates seamlessly into existing MLLMs without modifying the backbone, delivering adaptive compression and superior efficiency.
Our key technical innovations include:
- A Differentiable Top-K Selection Mechanism that ensures end-to-end gradient flow while maintaining full compatibility with high-performance acceleration kernels like FlashAttention.
- A Curriculum Annealing Strategy with a composite loss, which effectively bridges the performance gap between soft training selection and hard inference selection.
- A backbone-decoupled Learnable Importance Scorer (LIS) that enables models, trained at a single compression rate, to robustly generalize to various compression budgets during inference.
VisionSelector is highly efficient, requiring only 12.85M trainable parameters. It achieves substantial performance-efficiency advancements: a 12.14% performance gain at 10% token retention, and a 1.73× prefill acceleration (with 86.08% memory reduction) at 20% retention. VisionSelector consistently outperforms state-of-the-art baselines across 13 image and video understanding benchmarks.
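To make the first two ideas concrete, here is a toy NumPy sketch of temperature-annealed soft top-k selection. This is not the paper's implementation: the midpoint thresholding rule and the clipping constant are assumptions chosen for illustration. At a high temperature the mask is soft and gradients flow to every token score; annealing the temperature toward zero recovers the hard top-k selection used at inference.

```python
import numpy as np

def soft_topk_mask(scores, k, tau):
    """Soft selection mask over token importance scores.

    Threshold is placed midway between the k-th and (k+1)-th largest
    scores (assumes k < len(scores)), then squashed by a sigmoid whose
    sharpness is controlled by the temperature tau. Large tau -> soft,
    differentiable mask; tau -> 0 recovers a hard top-k indicator.
    """
    s = np.sort(scores)
    thresh = 0.5 * (s[-k] + s[-k - 1])
    z = np.clip((scores - thresh) / tau, -60.0, 60.0)  # avoid exp overflow
    return 1.0 / (1.0 + np.exp(-z))

def hard_topk_mask(scores, k):
    """Hard 0/1 mask keeping the k highest-scoring tokens (inference-time)."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask
```

Annealing `tau` over training (e.g. from 1.0 down to 1e-3) shrinks the gap between the soft mask seen during training and the hard mask used at inference, which is the gap the curriculum annealing strategy above is designed to bridge.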
💾 Dataset Preparation and Configuration
To reproduce our results and train the VisionSelector module, you need to download and configure the following datasets from the nyu-visionx/Cambrian-10M repository.
1. Dataset Downloading
Please download the following datasets and annotations and place them under the datasets/ folder.
| Dataset | Archive File |
|---|---|
| OCR_VQA | ocr_vqa.tar.gz |
| ChartQA | chartqa.tar.gz |
| TextVQA | textvqa.tar.gz |
| COCO | coco.tar.gz |
| Annotation | Download Link (Hugging Face) |
|---|---|
| Cambrian737K | Cambrian737k.jsonl |
Generating Dataset Annotations:
The large annotation file (Cambrian737k.jsonl) needs to be split into individual JSONL files for each target dataset (OCR_VQA, ChartQA, TextVQA, COCO) to match the required directory structure.
Execute the following commands from the project root to perform this filtering and splitting using the filter_json.py script:
```bash
cd VisionSelector
python datasets/filter_json.py
python datasets/sample_merge_json_llavaov.py  # for LLaVA-OV-1.5
```
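For reference, a minimal sketch of what such a filtering step might look like. The actual logic lives in `datasets/filter_json.py`; the `image` field name and the prefix-based routing below are assumptions for illustration.

```python
import json
import os
from collections import defaultdict

def split_annotations(lines, datasets=("ocr_vqa", "chartqa", "textvqa", "coco")):
    """Route each Cambrian annotation record to its source-dataset bucket,
    keyed on the (assumed) prefix of the record's 'image' path."""
    buckets = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        image = record.get("image", "")
        for name in datasets:
            if image.startswith(name):
                buckets[name].append(record)
                break
    return buckets

def write_buckets(buckets, out_dir="datasets"):
    """Write each bucket as <name>_cambrian.jsonl, matching the layout below."""
    for name, records in buckets.items():
        path = os.path.join(out_dir, f"{name}_cambrian.jsonl")
        with open(path, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
```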
Required Directory Structure:
After downloading and extracting, your project directory should contain the following structure:
```
VisionSelector/
└── datasets/
    ├── ocr_vqa/
    ├── ocr_vqa_cambrian.jsonl
    ├── chartqa/
    ├── chartqa_cambrian.jsonl
    ├── textvqa/
    ├── textvqa_cambrian.jsonl
    ├── coco/
    ├── coco_cambrian.jsonl
    └── textvqa_ocrvqa_cambrian.jsonl
```
2. Dataset Configuration for Training
The data paths and annotation paths for training are defined in VisionSelector/qwen-vl-finetune/qwenvl/data/__init__.py.
```python
DATASET_NAME = {
    "annotation_path": "/path/to/annotations.json",
    "data_path": "/path/to/image/data",
}
```
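Purely as an illustration of how entries of this shape might be collected and looked up by name (the actual registry lives in `qwenvl/data/__init__.py`; the `DATA_REGISTRY` dict and `resolve` helper below are hypothetical):

```python
# Hypothetical registry mirroring the entry format above; paths are examples.
OCR_VQA = {
    "annotation_path": "datasets/ocr_vqa_cambrian.jsonl",
    "data_path": "datasets/ocr_vqa",
}
CHARTQA = {
    "annotation_path": "datasets/chartqa_cambrian.jsonl",
    "data_path": "datasets/chartqa",
}
DATA_REGISTRY = {"ocr_vqa": OCR_VQA, "chartqa": CHARTQA}

def resolve(names):
    """Look up dataset configs by name, failing loudly on typos."""
    missing = [n for n in names if n not in DATA_REGISTRY]
    if missing:
        raise KeyError(f"Unknown dataset(s): {missing}")
    return [DATA_REGISTRY[n] for n in names]
```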
🔧 Installation - Qwen2.5-VL
1. Environment and Basic Dependencies
We recommend setting up a dedicated Conda environment.
```bash
conda create -n visionselector python=3.10
conda activate visionselector
```
2. Package Installation
Navigate to your project root directory (VisionSelector) and install the required packages.
```bash
cd VisionSelector
pip install "qwen-vl-utils[decord]"
pip install -r requirements.txt
pip install transformers==4.50.0
```
Recommended Package Versions:
For optimal compatibility, we recommend the following package versions:
| Package | Recommended Version |
|---|---|
| torch | 2.6.0 |
| torchvision | 0.21.0 |
| transformers | 4.50.0 |
| deepspeed | 0.16.4 |
| flash_attn | 2.7.4.post1 |
| triton | 3.2.0 |
| accelerate | 1.4.0 |
| torchcodec | 0.2 |
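As a convenience, a small illustrative helper to compare the active environment against the table above. The version pins come from the table; the helper itself is not part of the repo.

```python
import importlib.metadata as md

# Pins taken from the recommended-versions table.
RECOMMENDED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "transformers": "4.50.0",
    "deepspeed": "0.16.4",
    "flash_attn": "2.7.4.post1",
    "triton": "3.2.0",
    "accelerate": "1.4.0",
    "torchcodec": "0.2",
}

def version_mismatches(recommended, installed=None):
    """Return {package: (wanted, found)} for packages that deviate.

    `installed` can be injected for testing; by default it is read from
    the live environment via importlib.metadata.
    """
    if installed is None:
        installed = {}
        for pkg in recommended:
            try:
                installed[pkg] = md.version(pkg)
            except md.PackageNotFoundError:
                installed[pkg] = None
    return {p: (want, installed.get(p))
            for p, want in recommended.items()
            if installed.get(p) != want}
```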
🚀 Quick Start - Qwen2.5-VL
1. Training
To train VisionSelector (e.g., integrated with Qwen2.5-VL-7B) to select the most informative tokens, run the appropriate script from the qwen-vl-finetune directory.
```bash
cd VisionSelector/qwen-vl-finetune
bash scripts/sft_7b.sh       # VisionSelector-Qwen2.5-VL-7B
bash scripts/sft_3b.sh       # VisionSelector-Qwen2.5-VL-3B
bash scripts/sft_dynamic.sh  # Dynamic-LLaVA's image predictor reimplemented on Qwen2.5-VL (Dynamic-Qwen)
```
2. Evaluation, Inference and Visualizations
We utilize lmms-eval for comprehensive benchmarking.
Setup lmms-eval
First, set up the evaluation environment:
```bash
cd VisionSelector/lmms-eval
pip install -e .
cd ../qwen-evaluation
```
Run Evaluations
Use the provided scripts to evaluate VisionSelector against comparison methods and generate visualizations.
| Command | Purpose |
|---|---|
| `bash run_token_compression.sh` | Evaluate the original model and comparison token compression methods |
| `bash run_selector.sh` | Evaluate the VisionSelector method |
| `bash run_dynamic_qwen.sh` | Evaluate the Dynamic-Qwen method |
To capture max GPU memory, prefill time, latency, and the number of visual tokens, set the following environment variable before running the evaluation script:
```bash
export EVAL_TIME=True
```
Run Inference
You can test different token pruning methods (VisionZip, DivPrune) and VisionSelector inference with this script:
```bash
bash run_inference.sh
```
Visualizations
To generate activation heatmaps and token pruning visualizations:
```bash
bash run_visual.sh  # for VisionZip, DivPrune and VisionSelector
```
🔧 Installation - LLaVA-OV-1.5
1. Environment, Basic Dependencies and Package Installation
To ensure compatibility with the LLaVA-OneVision-1.5 framework, activate the pre-created visionselector environment first and then adjust the transformers package version as follows:
```bash
conda activate visionselector
pip uninstall -y transformers
pip install transformers==4.53.1
```
🚀 Quick Start - LLaVA-OV-1.5
1. Training
To train VisionSelector (e.g., integrated with LLaVA-OneVision-1.5) to select the most informative tokens, run the script in the llava-ov-15 directory.
```bash
cd VisionSelector/llava-ov-15
bash scripts/finetune_selector_8b.sh  # for LLaVA-OneVision-1.5
```
2. Evaluation
```bash
cd llava-ov-15
bash run_ov_token_compression.sh  # original model and comparison methods
bash run_ov_selector.sh           # VisionSelector
```
3. Inference
```bash
bash run_ov_inference.sh
```
Citation
If this work contributes to your research, please cite:
```bibtex
@misc{zhu2025visionselectorendtoendlearnablevisual,
      title={VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs},
      author={Jiaying Zhu and Yurui Zhu and Xin Lu and Wenrui Yan and Dong Li and Kunlin Liu and Xueyang Fu and Zheng-Jun Zha},
      year={2025},
      eprint={2510.16598},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.16598},
}
```
Acknowledgement
This work is built upon the foundational contributions of several excellent open-source projects. We express our sincere gratitude to the developers of the following resources, which were instrumental in the development and evaluation of VisionSelector:
- Foundational Platforms: Qwen2.5-VL, LLaVA-OneVision-1.5, EffiVLM-Bench, and Lmms-Eval.
- Inspirational Methods: We also gratefully acknowledge the valuable insights and prior work provided by FastV, PruMerge+, VisionZip, DART, DivPrune, HoloV, Dynamic-LLaVA and Differentiable Top-K.