March 24, 2026
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
📄 Paper | 🤗 VisionSelector-Qwen2.5-VL-3B | 🤗 VisionSelector-Qwen2.5-VL-7B | 🤗 VisionSelector-LLaVA-OV-1.5-8B
Jiaying Zhu¹, Yurui Zhu²*, Xin Lu¹, Wenrui Yan², Dong Li¹, Kunlin Liu², Xueyang Fu¹*, Zheng-Jun Zha¹
¹University of Science and Technology of China, ²ZTE Corporation
*Equal Advising
📝 To do list
- Release training code
- Release evaluation code
- Release comparison method code
- Release model weights
- Release inference code
📖 Overview
We introduce VisionSelector, an end-to-end learnable framework that recasts visual token compression as an optimization-driven decision process. VisionSelector integrates seamlessly into existing MLLMs without modifying the backbone, delivering adaptive compression and superior efficiency.
Our key technical innovations include:
- A Differentiable Top-K Selection Mechanism that ensures end-to-end gradient flow while maintaining full compatibility with high-performance acceleration kernels like FlashAttention.
- A Curriculum Annealing Strategy with a composite loss, which effectively bridges the performance gap between soft training selection and hard inference selection.
- A backbone-decoupled Learnable Importance Scorer (LIS) that enables models, trained at a single compression rate, to robustly generalize to various compression budgets during inference.
VisionSelector is highly efficient, requiring only 12.85M trainable parameters. It achieves substantial performance-efficiency advancements: a 12.14% performance gain at 10% token retention, and a 1.73× prefill acceleration (with 86.08% memory reduction) at 20% retention. VisionSelector consistently outperforms state-of-the-art baselines across 13 image and video understanding benchmarks.
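To make the first two ideas concrete, here is a toy NumPy sketch of temperature-annealed soft top-k selection. This is not the paper's implementation: the midpoint thresholding rule and the clipping constant are assumptions chosen for illustration. At a high temperature the mask is soft and gradients flow to every token score; annealing the temperature toward zero recovers the hard top-k selection used at inference.

```python
import numpy as np

def soft_topk_mask(scores, k, tau):
    """Soft selection mask over token importance scores.

    Threshold is placed midway between the k-th and (k+1)-th largest
    scores (assumes k < len(scores)), then squashed by a sigmoid whose
    sharpness is controlled by the temperature tau. Large tau -> soft,
    differentiable mask; tau -> 0 recovers a hard top-k indicator.
    """
    s = np.sort(scores)
    thresh = 0.5 * (s[-k] + s[-k - 1])
    z = np.clip((scores - thresh) / tau, -60.0, 60.0)  # avoid exp overflow
    return 1.0 / (1.0 + np.exp(-z))

def hard_topk_mask(scores, k):
    """Hard 0/1 mask keeping the k highest-scoring tokens (inference-time)."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask
```

Annealing `tau` over training (e.g. from 1.0 down to 1e-3) shrinks the gap between the soft mask seen during training and the hard mask used at inference, which is the gap the curriculum annealing strategy above is designed to bridge.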
💾 Dataset Preparation and Configuration
To reproduce our results and train the VisionSelector module, you need to download and configure the following datasets from the nyu-visionx/Cambrian-10M repository.
1. Dataset Downloading
Please download the following datasets and annotations and place them under the datasets/ folder.
| Dataset | Archive File |
|---|---|
| OCR_VQA | ocr_vqa.tar.gz |
| ChartQA | chartqa.tar.gz |
| TextVQA | textvqa.tar.gz |
| COCO | coco.tar.gz |
| Annotation | Download Link (Hugging Face) |
|---|---|
| Cambrian737K | Cambrian737k.jsonl |
Generating Dataset Annotations:
The large annotation file (Cambrian737k.jsonl) needs to be split into individual JSONL files for each target dataset (OCR_VQA, ChartQA, TextVQA, COCO) to match the required directory structure.
Execute the following commands from the project root to perform this filtering and splitting using the filter_json.py script:
```bash
cd VisionSelector
python datasets/filter_json.py
python datasets/sample_merge_json_llavaov.py  # for LLaVA-OV-1.5
```
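For reference, a minimal sketch of what such a filtering step might look like. The actual logic lives in `datasets/filter_json.py`; the `image` field name and the prefix-based routing below are assumptions for illustration.

```python
import json
import os
from collections import defaultdict

def split_annotations(lines, datasets=("ocr_vqa", "chartqa", "textvqa", "coco")):
    """Route each Cambrian annotation record to its source-dataset bucket,
    keyed on the (assumed) prefix of the record's 'image' path."""
    buckets = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        image = record.get("image", "")
        for name in datasets:
            if image.startswith(name):
                buckets[name].append(record)
                break
    return buckets

def write_buckets(buckets, out_dir="datasets"):
    """Write each bucket as <name>_cambrian.jsonl, matching the layout below."""
    for name, records in buckets.items():
        path = os.path.join(out_dir, f"{name}_cambrian.jsonl")
        with open(path, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
```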
Required Directory Structure:
After downloading and extracting, your project directory should contain the following structure:
```
VisionSelector/
└── datasets/
    ├── ocr_vqa/
    ├── ocr_vqa_cambrian.jsonl
    ├── chartqa/
    ├── chartqa_cambrian.jsonl
    ├── textvqa/
    ├── textvqa_cambrian.jsonl
    ├── coco/
    ├── coco_cambrian.jsonl
    └── textvqa_ocrvqa_cambrian.jsonl
```
2. Dataset Configuration for Training
The data paths and annotation paths for training are defined in VisionSelector/qwen-vl-finetune/qwenvl/data/__init__.py.
```python
DATASET_NAME = {
    "annotation_path": "/path/to/annotations.json",
    "data_path": "/path/to/image/data",
}
```
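Purely as an illustration of how entries of this shape might be collected and looked up by name (the actual registry lives in `qwenvl/data/__init__.py`; the `DATA_REGISTRY` dict and `resolve` helper below are hypothetical):

```python
# Hypothetical registry mirroring the entry format above; paths are examples.
OCR_VQA = {
    "annotation_path": "datasets/ocr_vqa_cambrian.jsonl",
    "data_path": "datasets/ocr_vqa",
}
CHARTQA = {
    "annotation_path": "datasets/chartqa_cambrian.jsonl",
    "data_path": "datasets/chartqa",
}
DATA_REGISTRY = {"ocr_vqa": OCR_VQA, "chartqa": CHARTQA}

def resolve(names):
    """Look up dataset configs by name, failing loudly on typos."""
    missing = [n for n in names if n not in DATA_REGISTRY]
    if missing:
        raise KeyError(f"Unknown dataset(s): {missing}")
    return [DATA_REGISTRY[n] for n in names]
```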
🔧 Installation - Qwen2.5-VL
1. Environment and Basic Dependencies
We recommend setting up a dedicated Conda environment.
```bash
conda create -n visionselector python=3.10
conda activate visionselector
```
2. Package Installation
Navigate to your project root directory (VisionSelector) and install the required packages.
```bash
cd VisionSelector
pip install "qwen-vl-utils[decord]"
pip install -r requirements.txt
pip install transformers==4.50.0
```
Recommended Package Versions:
For optimal compatibility, we recommend the following package versions:
| Package | Recommended Version |
|---|---|
| torch | 2.6.0 |
| torchvision | 0.21.0 |
| transformers | 4.50.0 |
| deepspeed | 0.16.4 |
| flash_attn | 2.7.4.post1 |
| triton | 3.2.0 |
| accelerate | 1.4.0 |
| torchcodec | 0.2 |
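As a convenience, a small illustrative helper to compare the active environment against the table above. The version pins come from the table; the helper itself is not part of the repo.

```python
import importlib.metadata as md

# Pins taken from the recommended-versions table.
RECOMMENDED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "transformers": "4.50.0",
    "deepspeed": "0.16.4",
    "flash_attn": "2.7.4.post1",
    "triton": "3.2.0",
    "accelerate": "1.4.0",
    "torchcodec": "0.2",
}

def version_mismatches(recommended, installed=None):
    """Return {package: (wanted, found)} for packages that deviate.

    `installed` can be injected for testing; by default it is read from
    the live environment via importlib.metadata.
    """
    if installed is None:
        installed = {}
        for pkg in recommended:
            try:
                installed[pkg] = md.version(pkg)
            except md.PackageNotFoundError:
                installed[pkg] = None
    return {p: (want, installed.get(p))
            for p, want in recommended.items()
            if installed.get(p) != want}
```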
🚀 Quick Start - Qwen2.5-VL
1. Training
To train VisionSelector (e.g., integrated with Qwen2.5-VL-7B) to select the most informative tokens, run the appropriate script from the qwen-vl-finetune directory.
```bash
cd VisionSelector/qwen-vl-finetune
bash scripts/sft_7b.sh       # VisionSelector-Qwen2.5-VL-7B
bash scripts/sft_3b.sh       # VisionSelector-Qwen2.5-VL-3B
bash scripts/sft_dynamic.sh  # Dynamic-LLaVA's image predictor reimplemented on Qwen2.5-VL (Dynamic-Qwen)
```
2. Evaluation, Inference and Visualizations
We utilize lmms-eval for comprehensive benchmarking.
Setup lmms-eval
First, set up the evaluation environment:
```bash
cd VisionSelector/lmms-eval
pip install -e .
cd ../qwen-evaluation
```
Run Evaluations
Use the provided scripts to evaluate VisionSelector against comparison methods and generate visualizations.
| Command | Purpose |
|---|---|
| `bash run_token_compression.sh` | Evaluate the original model and comparison token compression methods |
| `bash run_selector.sh` | Evaluate the VisionSelector method |
| `bash run_dynamic_qwen.sh` | Evaluate the Dynamic-Qwen method |
To capture max GPU memory, prefill time, latency, and the number of visual tokens, set the following environment variable before running the evaluation script:
```bash
export EVAL_TIME=True
```
Run Inference
You can test different token pruning methods (VisionZip, DivPrune) and VisionSelector inference with this script:
```bash
bash run_inference.sh
```
Visualizations
To generate activation heatmaps and token pruning visualizations:
```bash
bash run_visual.sh  # for VisionZip, DivPrune and VisionSelector
```
🔧 Installation - LLaVA-OV-1.5
1. Environment, Basic Dependencies and Package Installation
To ensure compatibility with the LLaVA-OneVision-1.5 framework, activate the pre-created visionselector environment first and then adjust the transformers package version as follows:
```bash
conda activate visionselector
pip uninstall -y transformers
pip install transformers==4.53.1
```
🚀 Quick Start - LLaVA-OV-1.5
1. Training
To train VisionSelector (e.g., integrated with LLaVA-OneVision-1.5) to select the most informative tokens, run the script in the llava-ov-15 directory.
```bash
cd VisionSelector/llava-ov-15
bash scripts/finetune_selector_8b.sh  # for LLaVA-OneVision-1.5
```
2. Evaluation
```bash
cd llava-ov-15
bash run_ov_token_compression.sh  # original model and comparison methods
bash run_ov_selector.sh           # VisionSelector
```
3. Inference
```bash
bash run_ov_inference.sh
```
Citation
If this work contributes to your research, please cite:
```bibtex
@misc{zhu2025visionselectorendtoendlearnablevisual,
      title={VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs},
      author={Jiaying Zhu and Yurui Zhu and Xin Lu and Wenrui Yan and Dong Li and Kunlin Liu and Xueyang Fu and Zheng-Jun Zha},
      year={2025},
      eprint={2510.16598},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.16598},
}
```
Acknowledgement
This work is built upon the foundational contributions of several excellent open-source projects. We express our sincere gratitude to the developers of the following resources, which were instrumental in the development and evaluation of VisionSelector:
- Foundational Platforms: Qwen2.5-VL, LLaVA-OneVision-1.5, EffiVLM-Bench, and Lmms-Eval.
- Inspirational Methods: We also gratefully acknowledge the valuable insights and prior work provided by FastV, PruMerge+, VisionZip, DART, DivPrune, HoloV, Dynamic-LLaVA and Differentiable Top-K.