RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning
June 5, 2026
🎉 RedVTP has been accepted to CVPR 2026 Findings!
- We propose RedVTP, which is, to the best of our knowledge, the first approach that accelerates inference in Diffusion Vision-Language Models (DVLMs) through visual token pruning.
- We introduce a response-driven visual token pruning strategy that uses the still-masked tokens to measure the importance of each visual token and prunes the less important ones.
- Experiments show that RedVTP improves the token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, while maintaining accuracy and in some cases improving it.
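The pruning idea in the second bullet can be illustrated with a small, dependency-free sketch. It is not the RedVTP implementation: it assumes that importance is the mean attention weight that still-masked positions assign to each visual token, and that a fixed fraction of visual tokens is kept. The function names, the `keep_ratio` parameter, and the toy attention matrix are all hypothetical; see the linked repositories for the actual scoring rule and schedule.

```python
# Illustrative sketch only. The attention-based score, the function names, and
# keep_ratio are assumptions made for this example, not RedVTP's actual API.

def visual_token_importance(attn, masked_rows, visual_cols):
    """Mean attention that still-masked positions pay to each visual token.

    attn        : list of lists, attn[q][k] = attention weight from query q to key k
    masked_rows : sequence positions that are still masked
    visual_cols : sequence positions that hold visual tokens
    """
    if not masked_rows:
        raise ValueError("need at least one still-masked token to score against")
    return [
        sum(attn[q][k] for q in masked_rows) / len(masked_rows)
        for k in visual_cols
    ]


def prune_visual_tokens(attn, masked_rows, visual_cols, keep_ratio=0.5):
    """Return the visual token positions to keep, in their original order."""
    scores = visual_token_importance(attn, masked_rows, visual_cols)
    n_keep = max(1, round(len(visual_cols) * keep_ratio))
    # Rank by score, highest first; the stable sort keeps earlier tokens on ties.
    ranked = sorted(range(len(visual_cols)), key=lambda i: -scores[i])
    kept = sorted(ranked[:n_keep])
    return [visual_cols[i] for i in kept]


# Toy sequence: positions 0-3 are visual tokens, 4-5 are still-masked tokens.
attn = [
    [0.0] * 6,
    [0.0] * 6,
    [0.0] * 6,
    [0.0] * 6,
    [0.40, 0.05, 0.30, 0.05, 0.10, 0.10],
    [0.30, 0.10, 0.40, 0.00, 0.10, 0.10],
]
kept = prune_visual_tokens(attn, masked_rows=[4, 5], visual_cols=[0, 1, 2, 3], keep_ratio=0.5)
print(kept)  # [0, 2]
```

In this toy case the two masked positions attend mostly to visual tokens 0 and 2, so those are kept and tokens 1 and 3 are pruned. Later steps would then run over a shorter sequence, which is where the throughput and latency gains would come from.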
Overview
Repositories
RedVTP is implemented on top of two DVLM baselines.
Please visit the corresponding repositories below for installation and inference instructions.
- RedVTP-LLaDA-V: https://github.com/Blacktower27/RedVTP-LLaDA-V
- RedVTP-LaViDa: https://github.com/Blacktower27/RedVTP-LaViDa
Each repository contains the minimally modified baseline code and the pruning-based inference scripts.
To change the dataset, edit the bash script in lmms-eval directly.
Cite us
@inproceedings{xu2026redvtp,
title={RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning},
author={Xu, Jingqi and Lu, Jingxi and Li, Chenghao and Sarkar, Sreetama and Kundu, Souvik and Beerel, Peter A.},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={2783--2792},
year={2026}
}
