
June 6, 2025

Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers


🥳 What's New

  • [2025/06/03] 👋 Uploaded the paper (arXiv:2506.03065) and initialized the project.

Video comparison (columns, left to right): Pretrain Model, MInference, Sparse-VideoGen, Sparse-vDiT (1.76×).

:pencil: To Do List

  • Code and Checkpoints Release
  • Technical Report

🍃 Introduction

Sparse-vDiT is a sparsity acceleration framework for video diffusion transformers (vDiTs). Through detailed analysis of vDiT attention maps, we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. In addition, 3-6% of attention heads can be skipped entirely. Crucially, these patterns correlate strongly with layer depth and head position but depend only weakly on the input content. Leveraging these findings, Sparse-vDiT comprises: 1) pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern; and 2) an offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling.
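To make the three patterns and the per-head selection idea concrete, here is a minimal NumPy sketch. It is not the released code (see the To Do List): the mask definitions, the function names, the `width`/`stride`/`columns` parameters, and the use of mask density as a cost proxy are all assumptions made for illustration, and the reference attention is dense, so it shows which entries a sparse kernel would compute rather than delivering any speedup.

```python
import numpy as np

def diagonal_mask(n, width):
    """Keep keys within `width` positions of each query (a band on the main diagonal)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= width

def multi_diagonal_mask(n, stride, width):
    """Keep bands repeated every `stride` positions (assumed: stride = tokens per frame)."""
    idx = np.arange(n)
    offset = np.abs(idx[:, None] - idx[None, :]) % stride
    return np.minimum(offset, stride - offset) <= width

def vertical_stripe_mask(n, columns):
    """Keep the same fixed set of key columns for every query."""
    mask = np.zeros((n, n), dtype=bool)
    mask[:, columns] = True
    return mask

def masked_attention(q, k, v, mask):
    """Dense reference attention that ignores masked-out keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def select_pattern(q, k, v, candidates, max_error):
    """Pick the cheapest candidate whose output stays within `max_error` of dense attention.

    Cost is the kept fraction of the attention map, a hypothetical stand-in
    for the hardware-aware cost model described above.
    """
    n = q.shape[0]
    dense = masked_attention(q, k, v, np.ones((n, n), dtype=bool))
    best_name, best_cost = "dense", 1.0
    for name, mask in candidates.items():
        cost = mask.mean()
        error = np.mean((masked_attention(q, k, v, mask) - dense) ** 2)
        if error <= max_error and cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost

# Toy head: 4 frames of 8 tokens each, random activations.
rng = np.random.default_rng(0)
frames, tokens_per_frame, dim = 4, 8, 16
n = frames * tokens_per_frame
q, k, v = (rng.standard_normal((n, dim)) for _ in range(3))
candidates = {
    "diagonal": diagonal_mask(n, width=2),
    "multi_diagonal": multi_diagonal_mask(n, stride=tokens_per_frame, width=1),
    "vertical_stripe": vertical_stripe_mask(n, columns=[0, 1, 2, 3]),
}
name, cost = select_pattern(q, k, v, candidates, max_error=0.05)
print(name, round(float(cost), 3))
```

Because the patterns depend only weakly on the input, a selection like this can in principle be run once offline per layer and head and reused at inference time. With random activations, as in the toy usage here, no real sparsity structure exists, so the search may simply fall back to `"dense"`.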

Pipeline

Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59.
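As a quick reading aid for the numbers above, the short script below relates each measured speedup to its theoretical FLOP reduction and converts the speedup into a fraction of inference time saved. The input figures are copied from the paragraph above; the `summarize` helper and the two derived quantities are our own illustrative framing, not metrics reported in the paper.

```python
def summarize(flop_reduction, speedup):
    """Return (share of the theoretical gain realized, fraction of wall-clock time saved)."""
    realized = speedup / flop_reduction
    time_saved = 1.0 - 1.0 / speedup
    return realized, time_saved

# (theoretical FLOP reduction, measured inference speedup) per model, as quoted above.
reported = {
    "CogVideoX1.5": (2.09, 1.76),
    "HunyuanVideo": (2.38, 1.85),
    "Wan2.1": (1.67, 1.58),
}

for model, (flop_reduction, speedup) in reported.items():
    realized, time_saved = summarize(flop_reduction, speedup)
    print(f"{model}: {realized:.0%} of theoretical gain, {time_saved:.0%} less inference time")
```

The gap between the two columns is expected: a FLOP count ignores memory traffic, kernel launch overhead, and the non-attention parts of the model, so the measured speedup is typically below the theoretical reduction.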

Result

:notebook: Citation

@article{chen2025sparsevdit,
      title={Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers}, 
      author={Pengtao Chen and Xianfang Zeng and Maosen Zhao and Peng Ye and Mingzhu Shen and Wei Cheng and Gang Yu and Tao Chen},
      journal={arXiv preprint arXiv:2506.03065}, 
      year={2025}
}

:dizzy: Acknowledgments

We thank the following excellent open-source works: Sparse-VideoGen, MInference, PAB, CogVideoX, HunyuanVideo, Wan2.1.

:email: Contact

If you have any questions, please email Pengt.Chen@gmail.com.