Object-Centric Vision Token Pruning for Vision Language Models

April 7, 2026

Contributions

We are the first to realize guaranteed vision token pruning (VTP) that selects the most representative tokens. Our method achieves close to new state-of-the-art (SotA) VTP results for VLMs, with intuitive object-level interpretability.

OC-VTP retains over 95% of the original performance while keeping only 10% of the visual tokens.
OC-Pruner needs to be trained only once, with no further fine-tuning, and the training can easily be done on the COCO dataset.
OC-VTP saves nearly 85% of FLOPs on LLaVA-1.5 and 95% on LLaVA-NeXT while retaining comparable performance. OC-Pruner itself requires only 1/1000 of the FLOPs of the VLMs.
OC-VTP also reduces inference time, with efficiency similar to that of training-free pruners.
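
To make the retention figure concrete, here is a small arithmetic sketch. It assumes LLaVA-1.5's standard 576 visual tokens per image (a 24 x 24 patch grid); the helper names `token_budget` and `retained_fraction` are ours for illustration and are not part of the OC-VTP code.

```python
import math

# Assumption: LLaVA-1.5 encodes each image into 24 x 24 = 576 visual tokens.
LLAVA15_VISUAL_TOKENS = 24 * 24


def token_budget(total_tokens: int, keep_ratio: float) -> int:
    """Number of visual tokens kept at a given retention ratio (rounded up)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    return math.ceil(total_tokens * keep_ratio)


def retained_fraction(total_tokens: int, kept_tokens: int) -> float:
    """Fraction of visual tokens that survive pruning."""
    if not 0 < kept_tokens <= total_tokens:
        raise ValueError("kept_tokens must be in (0, total_tokens]")
    return kept_tokens / total_tokens


if __name__ == "__main__":
    # Tokens kept at a 10% budget.
    print(token_budget(LLAVA15_VISUAL_TOKENS, 0.10))
    # Fraction kept when the target is 64 tokens.
    print(round(retained_fraction(LLAVA15_VISUAL_TOKENS, 64), 3))
```

Under this assumption, a 10% budget corresponds to 58 tokens, and a 64-token target (the OCL_TARGET_NUM value used in the evaluation script) keeps about 11% of the visual tokens.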

Installation

Install the LLaVA environment manually, or use setup_llava.sh to build it.
Run the following to get the project code:

git clone https://github.com/GarryLarry010131/OC-VTP

Use requirements.txt to build an environment.

Training

First, use get_training_only_data.py to extract training-ready features.
Then use train_transDecoder_noVal.py to train the OC-Pruner; the pruner is validated only through the validation loss.

model.py contains the structure of OC-VTP.

utils.py contains the AW-MSE loss and functions for plotting.

Evaluation

We use LMMs-Eval to evaluate OC-VTP across different benchmarks. Run the following to start the evaluation:

# export HF_ENDPOINT=https://hf-mirror.com  # Uncomment this line if you encounter network issues
export CUDA_VISIBLE_DEVICES=0

export OCL_ENABLE=1
export OCL_MODE=pruning
export OCL_LAYER_IDX=8  # 0 ~ N
export OCL_CONFIG=OC-VTP/OCL_light_decoder_config/oc-vlm_64.json
export OCL_CKPT=...
export OCL_TARGET_NUM=64
export OCL_TOPK=1
export OCL_HAS_CLS=1
export OCL_PAD_MODE=att_score
export OCL_ATTSEL=meanq

ckpt=liuhaotian/llava-v1.5-7b
tasks=("gqa" "mmbench" "mme" "pope" "scienceqa_img" "vizwiz_vqa_val" "mmmu" "seedbench")
for task in "${tasks[@]}"; do
    accelerate launch --num_processes=1 -m lmms_eval --model llava \
        --model_args pretrained=$ckpt \
        --tasks "$task" --batch_size 1 --log_samples \
        --log_samples_suffix reproduce --output_path ./test_logs_oc-vlm/layer_9/test_logs_"$task"/
done
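
Because the run is configured entirely through OCL_* environment variables, a typo in one of them is easy to miss. The following is a minimal pre-flight sketch that parses the variables exported above into a typed object. The `OCLSettings` class and `read_ocl_settings` function are hypothetical helpers, not part of the OC-VTP code, and the README does not show how the project itself reads these variables; the type chosen for each field is an assumption based on the example values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OCLSettings:
    """Typed view of the OCL_* variables (hypothetical helper)."""
    enable: bool
    mode: str
    layer_idx: int
    config: str
    ckpt: str
    target_num: int
    topk: int
    has_cls: bool
    pad_mode: str
    attsel: str


def read_ocl_settings(env: Optional[Mapping[str, str]] = None) -> OCLSettings:
    """Parse OCL_* variables; raises KeyError if one is missing."""
    env = os.environ if env is None else env
    settings = OCLSettings(
        enable=env["OCL_ENABLE"] == "1",
        mode=env["OCL_MODE"],
        layer_idx=int(env["OCL_LAYER_IDX"]),
        config=env["OCL_CONFIG"],
        ckpt=env["OCL_CKPT"],
        target_num=int(env["OCL_TARGET_NUM"]),
        topk=int(env["OCL_TOPK"]),
        has_cls=env["OCL_HAS_CLS"] == "1",
        pad_mode=env["OCL_PAD_MODE"],
        attsel=env["OCL_ATTSEL"],
    )
    # Basic sanity checks only; valid ranges depend on the model in use.
    if settings.layer_idx < 0:
        raise ValueError("OCL_LAYER_IDX must be non-negative")
    if settings.target_num <= 0:
        raise ValueError("OCL_TARGET_NUM must be positive")
    return settings
```

Calling `read_ocl_settings()` with no argument reads from the current process environment, so it can be run in the same shell session right before launching the evaluation loop.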

After this, figure_plotting.py can be used to plot token-selection images.

A pretrained OC-Pruner checkpoint is available on Hugging Face for a quick preview:

# Download checkpoint
huggingface-cli download GarryLarry0131/OC-VTP weights.pth --local-dir ./checkpoints

Acknowledgement

This work builds on LLaVA, LMMs-Eval, and VQ-VFM-OCL. We thank the authors for their contributions!

Citation

If you find this work helpful, please consider citing it!

@article{li2025object,
  title={Object-Centric Vision Token Pruning for Vision Language Models},
  author={Li, Guangyuan and Zhao, Rongzhen and Deng, Jinhong and Wang, Yanbo and Pajarinen, Joni},
  journal={arXiv preprint arXiv:2511.20439},
  year={2025}
}