RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

June 5, 2026

🎉 RedVTP has been accepted to CVPR 2026 Findings!

We propose RedVTP, which is, to the best of our knowledge, the first approach to accelerate inference in Diffusion Vision-Language Models (DVLMs) through visual token pruning.
We introduce a response-driven visual token pruning strategy that uses the still-masked tokens to measure the importance of each visual token and prunes the less important ones.
Experiments show that RedVTP improves token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, without compromising—and in some cases improving—accuracy.
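
The pruning idea above can be illustrated with a small, framework-free sketch. It is not the code from the RedVTP repositories: it assumes that a visual token's importance is the mean attention weight it receives from the positions that are still masked, and that a fixed fraction of visual tokens is kept. The function name `select_visual_tokens`, the `keep_ratio` parameter, and the layout of `attn` are all hypothetical.

```python
from typing import List, Sequence


def select_visual_tokens(
    attn: Sequence[Sequence[float]],
    masked_positions: Sequence[int],
    visual_positions: Sequence[int],
    keep_ratio: float,
) -> List[int]:
    """Return the visual-token positions to keep, in their original order.

    attn[q][k] is the attention weight from query position q to key
    position k (assumed to be already averaged over heads and layers).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not masked_positions:
        # No masked tokens remain, so there is no signal to prune with.
        return list(visual_positions)

    # Assumed scoring rule: importance of a visual token is the mean
    # attention it receives from the still-masked positions.
    scores = {
        v: sum(attn[m][v] for m in masked_positions) / len(masked_positions)
        for v in visual_positions
    }

    # Keep the highest-scoring fraction, but never fewer than one token.
    num_keep = max(1, int(keep_ratio * len(visual_positions)))
    top = sorted(visual_positions, key=lambda v: scores[v], reverse=True)[:num_keep]
    return sorted(top)  # restore the original token order


# Toy sequence: positions 0-3 are visual tokens, 4-5 are masked tokens.
attn = [[0.0] * 6 for _ in range(4)] + [
    [0.40, 0.10, 0.30, 0.10, 0.05, 0.05],
    [0.50, 0.05, 0.20, 0.15, 0.05, 0.05],
]
kept = select_visual_tokens(attn, [4, 5], [0, 1, 2, 3], keep_ratio=0.5)
print(kept)  # [0, 2]
```

In this toy example, visual tokens 0 and 2 receive the most attention from the masked positions, so they survive a 50% keep ratio. In a real DVLM the pruned tokens would be dropped from later denoising steps, which is where the throughput and latency gains would come from.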

Overview

Repositories

RedVTP is implemented on top of two DVLM baselines.
Please visit the corresponding repositories below for installation and inference instructions.

RedVTP-LLaDA-V
https://github.com/Blacktower27/RedVTP-LLaDA-V
RedVTP-LaViDa
https://github.com/Blacktower27/RedVTP-LaViDa

Each repository contains the minimally modified baseline code and the pruning-based inference scripts.

To change the dataset, modify the bash script in lmms-eval directly.

Cite us

@inproceedings{xu2026redvtp,
  title={RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning},
  author={Xu, Jingqi and Lu, Jingxi and Li, Chenghao and Sarkar, Sreetama and Kundu, Souvik and Beerel, Peter A.},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2783--2792},
  year={2026}
}