[NeurIPS 2025] FastVID: Dynamic Density Pruning for Fast Video Large Language Models

November 10, 2025

We propose FastVID, a novel training-free pruning framework. It uses Dynamic Temporal Segmentation to partition videos into temporally ordered segments, and Density Spatiotemporal Pruning to retain global segment information and key details. On LLaVA-OneVision-7B, FastVID prunes 90.3% of video tokens, reduces FLOPs to 8.3% of the original, and accelerates the prefilling stage by 7.1x, while maintaining 98.0% of the original accuracy.
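To illustrate the idea of partitioning a video into temporally ordered segments, here is a minimal sketch. It is not the code in this repository. It assumes that each frame is summarized by one feature vector and that a segment boundary is placed wherever the cosine similarity between adjacent frames drops below a threshold. The function names `cosine` and `segment_frames`, the `threshold` parameter, and its default value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def segment_frames(frame_feats, threshold=0.9):
    """Split frames into temporally ordered segments.

    Assumption for illustration: a new segment starts at frame i when the
    similarity between frame i-1 and frame i is below `threshold`.
    Returns a list of (start, end) index pairs, with `end` exclusive.
    """
    if not frame_feats:
        return []
    boundaries = [0]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < threshold:
            boundaries.append(i)
    boundaries.append(len(frame_feats))
    return list(zip(boundaries[:-1], boundaries[1:]))


# Toy input: two similar frames followed by two frames from a different scene.
feats = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
segments = segment_frames(feats)
print(segments)
```

Because boundaries are only ever placed between adjacent frames, each segment is a contiguous run of frames and the segments stay in temporal order.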

Implementation

The FastVID implementation in LLaVA-OneVision adopts a parallelized design for computing density scores, as detailed in the Efficiency Comparison section on page 8 of the main paper.
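To make the term "density score" concrete, the sketch below computes a simple local density for each token and keeps the densest ones. It is a sequential, pure-Python illustration and not the parallelized implementation in this repository. It assumes a k-nearest-neighbour density of the form exp(-mean squared distance to the k nearest tokens); the function names `density_scores` and `keep_densest` and the parameters `k` and `keep` are hypothetical.

```python
import math


def density_scores(tokens, k=2):
    """Local density per token (assumed form, for illustration only):
    exp(-mean squared Euclidean distance to the k nearest other tokens)."""
    scores = []
    for i, t in enumerate(tokens):
        # Squared distances from token i to every other token, ascending.
        d2 = sorted(
            sum((a - b) ** 2 for a, b in zip(t, other))
            for j, other in enumerate(tokens)
            if j != i
        )
        nearest = d2[:k]
        scores.append(math.exp(-sum(nearest) / len(nearest)) if nearest else 1.0)
    return scores


def keep_densest(tokens, keep, k=2):
    """Return the indices of the `keep` highest-density tokens, in original order."""
    scores = density_scores(tokens, k)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy input: three tokens clustered near the origin and one isolated outlier.
tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
scores = density_scores(tokens)
kept = keep_densest(tokens, keep=3)
print(kept)
```

Each token's score here depends only on pairwise distances, so in a tensor library the same computation can be expressed as one batched distance matrix per segment instead of nested Python loops.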

For the other models, we provide a straightforward implementation.

Installation and Evaluation

To run FastVID on a specific model, first change the working directory to that model's folder.

To set up the environment:

bash scripts/create_env.sh

To evaluate FastVID:

bash scripts/eval.sh

Acknowledgement

This project builds upon the following open-source works: LLaVA-NeXT and lmms-eval.

Citation

@inproceedings{
shen2025fastvid,
title={Fast{VID}: Dynamic Density Pruning for Fast Video Large Language Models},
author={Leqi Shen and Guoqiang Gong and Tao He and Yifeng Zhang and Pengzhang Liu and Sicheng Zhao and Guiguang Ding},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=2xS4VtpApy}
}