ProfilingDiT
January 1, 2026
Official Implementation of "Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models"
Paper | arXiv
This repository contains the official implementation of our paper: Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models.
To set up the environment, follow the official HunyuanVideo and Wan2.1 setup guides.

Table of Contents
- Latest News
- Installation
- Running the Code
- Quantitative Comparison
- Scale to Multi-GPU
- To-Do List
Latest News
• If you like our project, please give us a star on GitHub to follow the latest updates.
• Accepted to ICCV 2025.
• [2025/04/04] Submitted to arXiv.
• [2025/04/04] Released open-source code for the latest model.
Installation
Follow the official HunyuanVideo and Wan2.1 environment setup guides, then install the dependencies:
pip install -r requirements.txt
Running the Code
HunyuanVideo
cd HunyuanVideo
python3 sample_video.py \
--video-size 360 720 \
--video-length 129 \
--infer-steps 50 \
--prompt "cat walk on grass" \
--flow-reverse \
--use-cpu-offload \
--save-path ./results \
--seed 42 \
--model-base "ckpts" \
--dit-weight "ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt" \
--delta_cache
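To repeat the run above over several prompts or seeds, the argument list can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name `build_hunyuan_command` is hypothetical, and it only reuses the flags and paths shown in the command above.

```python
import shlex


def build_hunyuan_command(prompt, seed, save_path="./results"):
    """Build the sample_video.py argument list shown above for one prompt and seed.

    Hypothetical helper for illustration; every flag is copied from the README command.
    """
    return [
        "python3", "sample_video.py",
        "--video-size", "360", "720",
        "--video-length", "129",
        "--infer-steps", "50",
        "--prompt", prompt,
        "--flow-reverse",
        "--use-cpu-offload",
        "--save-path", save_path,
        "--seed", str(seed),
        "--model-base", "ckpts",
        "--dit-weight",
        "ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt",
        "--delta_cache",
    ]


if __name__ == "__main__":
    for seed in (42, 43):
        cmd = build_hunyuan_command("cat walk on grass", seed)
        # Print a shell-safe version of the command for inspection.
        print(shlex.join(cmd))
        # To launch it for real, run from the HunyuanVideo directory:
        #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with prompts that contain spaces or punctuation.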
Wan2.1
cd Wan2.1
python generate.py \
--task t2v-14B \
--size 832*480 \
--frame_num 81 \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--delta_cache
Quantitative Comparison
HunyuanVideo Baseline
| Method | VBench ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | Latency (ms) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|
| HunyuanVideo (720P, 129 frames) | 0.7703 | -- | -- | -- | -- | 1745 | -- |
| TeaCache (slow) | 0.7700 | 0.1720 | 21.91 | 0.7456 | 77.67 | 1052 | 1.66× |
| TeaCache (fast) | 0.7677 | 0.1830 | 21.60 | 0.7323 | 83.85 | 753 | 2.31× |
| Ours (HunyuanVideo) | 0.7642 | 0.1203 | 26.44 | 0.8445 | 41.10 | 932 | 1.87× |
Wan2.1 Baseline
| Method | VBench ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | Latency (ms) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|
| Wan2.1 (480P, 81 frames) | 0.7582 | -- | -- | -- | -- | 497 | -- |
| TeaCache (0.2 threshold) | 0.7604 | 0.2913 | 16.17 | 0.5685 | 117.61 | 249 | 2.00× |
| Ours (Wan2.1) | 0.7615 | 0.1256 | 22.02 | 0.7899 | 62.56 | 247 | 2.01× |
Tables: Quantitative comparison with prior methods under HunyuanVideo and Wan2.1 baselines.
↑ Higher is better for VBench, PSNR, SSIM, and Speedup.
↓ Lower is better for LPIPS, FID, and Latency.
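The Speedup column is consistent with the ratio of baseline latency to method latency. The short check below recomputes it from the latency values in the tables; the function name `speedup` is only illustrative, and the recomputed ratios agree with the reported values to within 0.01.

```python
def speedup(baseline_latency, method_latency):
    """Speedup as baseline latency divided by method latency (same units)."""
    return baseline_latency / method_latency


# Latencies copied from the tables above.
HUNYUAN_BASELINE = 1745
WAN_BASELINE = 497

hunyuan_methods = {
    "TeaCache (slow)": 1052,
    "TeaCache (fast)": 753,
    "Ours (HunyuanVideo)": 932,
}
wan_methods = {
    "TeaCache (0.2 threshold)": 249,
    "Ours (Wan2.1)": 247,
}

for name, latency in hunyuan_methods.items():
    print(f"{name}: {speedup(HUNYUAN_BASELINE, latency):.3f}x")
for name, latency in wan_methods.items():
    print(f"{name}: {speedup(WAN_BASELINE, latency):.3f}x")
```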
Scale to Multi-GPU
Our method efficiently scales across multiple GPUs to accelerate inference and training.
By leveraging model parallelism, NCCL communication, and optimized memory management, we achieve significant speedup without compromising quality.
Key Features:
- Increased Throughput: Distributes computation across multiple GPUs to process more frames in parallel.
- Optimized Memory Usage: Dynamically allocates memory to prevent bottlenecks.
- Flexible Deployment: Works seamlessly on both single-node and distributed setups.
- NCCL Optimization: Uses efficient GPU-GPU communication to minimize overhead.

For detailed setup and configurations, please refer to our Multi-GPU Guide.
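As a rough illustration of distributing work across GPUs, the sketch below splits a range of frame indices into contiguous, near-equal shards, one per rank. It is an assumption-laden example and is not taken from this repository: the function `shard_range` and the idea of sharding by frame index are hypothetical, and the actual partitioning scheme is described in the Multi-GPU Guide.

```python
def shard_range(num_items, world_size, rank):
    """Return the half-open [start, stop) slice of num_items assigned to rank.

    Hypothetical helper: shards are contiguous and differ in size by at most one.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    base, extra = divmod(num_items, world_size)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Example: 129 frames (the HunyuanVideo setting above) over 4 ranks.
    for rank in range(4):
        print(rank, shard_range(129, 4, rank))
```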
To-Do List:
- OpenSora2 (Upcoming Support)
- Optimize Caching for CogVideoX
Citation
@misc{ma2025modelrevealscacheprofilingbased,
title={Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models},
author={Xuran Ma and Yexin Liu and Yaofu Liu and Xianfeng Wu and Mingzhe Zheng and Zihao Wang and Ser-Nam Lim and Harry Yang},
year={2025},
eprint={2504.03140},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.03140},
}
License
This project is licensed under the Apache 2.0 License.