ProfilingDiT


Official implementation of "Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models"

📄 Paper | 🔗 arXiv (https://arxiv.org/abs/2504.03140)

This repository contains the official implementation of our paper, Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models.
To set up the environment, follow the official HunyuanVideo and Wan2.1 setup guides.





🔥 Latest News

• If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.
• 🚨 Accepted to ICCV 2025 🎉
• [2025/04/04] 🎉 Submitted to arXiv.
• [2025/04/04] 🔥 Released open-source code for the latest model.


📀 Installation

Follow the official HunyuanVideo and Wan2.1 environment setup guides, then install the requirements:

pip install -r requirements.txt

🚀 Running the Code

HunyuanVideo

cd HunyuanVideo
python3 sample_video.py \
    --video-size 360 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "cat walk on grass" \
    --flow-reverse \
    --use-cpu-offload \
    --save-path ./results \
    --seed 42 \
    --model-base "ckpts" \
    --dit-weight "ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt" \
    --delta_cache

WAN 2.1

cd Wan2.1
python generate.py \
    --task t2v-14B \
    --size 832*480 \
    --frame_num 81 \
    --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
    --delta_cache
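
This README does not describe what `--delta_cache` does internally. As rough intuition for feature reuse in diffusion transformers, the sketch below shows a generic residual ("delta") cache: on some denoising steps the block is computed in full and the difference between its output and input is stored; on the remaining steps that stored difference is added to the new input instead of running the block. This is an illustration under stated assumptions, not the repository's implementation: the class name `DeltaCachedBlock`, the fixed `refresh_every` schedule, and `toy_block` are all hypothetical, and the paper's profiling-based selection of what to cache is not modeled here.

```python
class DeltaCachedBlock:
    """Wraps a block and reuses its residual (output - input) on skipped steps."""

    def __init__(self, block, refresh_every=2):
        self.block = block                  # callable: list[float] -> list[float]
        self.refresh_every = refresh_every  # hypothetical schedule: recompute every k-th step
        self.delta = None                   # residual cached at the last full computation
        self.full_calls = 0                 # how many times the wrapped block actually ran

    def __call__(self, x, step):
        if self.delta is None or step % self.refresh_every == 0:
            # Full computation: run the block and refresh the cached residual.
            y = self.block(x)
            self.delta = [yi - xi for yi, xi in zip(y, x)]
            self.full_calls += 1
            return y
        # Skipped step: approximate the output as input + cached residual.
        return [xi + di for xi, di in zip(x, self.delta)]


def toy_block(x):
    """Stand-in for an expensive transformer block (assumption for illustration)."""
    return [1.5 * xi + 0.1 for xi in x]


if __name__ == "__main__":
    cached = DeltaCachedBlock(toy_block, refresh_every=2)
    x = [1.0, 2.0]
    for step in range(4):
        x = cached(x, step)
    print("full block calls:", cached.full_calls, "of 4 steps")
```

The trade-off in any such scheme is that a larger refresh interval saves more compute but lets the cached residual drift further from the true block output.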

📊 Quantitative Comparison

HunyuanVideo Baseline

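| Method | VBench ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | Latency (ms) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|
| HunyuanVideo (720P, 129 frames) | 0.7703 | -- | -- | -- | -- | 1745 | -- |
| TeaCache (slow) | 0.7700 | 0.1720 | 21.91 | 0.7456 | 77.67 | 1052 | 1.66× |
| TeaCache (fast) | 0.7677 | 0.1830 | 21.60 | 0.7323 | 83.85 | 753 | 2.31× |
| Ours (HunyuanVideo) | 0.7642 | 0.1203 | 26.44 | 0.8445 | 41.10 | 932 | 1.87× |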

Wan2.1 Baseline

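| Method | VBench ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | Latency (ms) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|
| Wan2.1 (480P, 81 frames) | 0.7582 | -- | -- | -- | -- | 497 | -- |
| TeaCache (0.2 thres) | 0.7604 | 0.2913 | 16.17 | 0.5685 | 117.61 | 249 | 2.00× |
| Ours (Wan2.1) | 0.7615 | 0.1256 | 22.02 | 0.7899 | 62.56 | 247 | 2.01× |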

Tables: Quantitative comparison with prior methods under HunyuanVideo and Wan2.1 baselines.
🔺 Higher is better for VBench, PSNR, SSIM, and Speedup.
🔻 Lower is better for LPIPS, FID, and Latency.
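
The Speedup column appears to be the baseline latency divided by each method's latency. The short check below recomputes that ratio from the latencies reported in the tables; the helper names are illustrative and not part of this repository. Because the reported latencies are rounded, the recomputed ratios agree with the reported speedups only to within about 0.01.

```python
# Latencies copied from the tables above (units as reported there).
BASELINE = {"HunyuanVideo": 1745, "Wan2.1": 497}
METHODS = {
    "HunyuanVideo": {"TeaCache (slow)": 1052, "TeaCache (fast)": 753, "Ours": 932},
    "Wan2.1": {"TeaCache (0.2 thres)": 249, "Ours": 247},
}


def speedup(baseline_latency, method_latency):
    """Speedup as the ratio of baseline latency to accelerated latency."""
    return baseline_latency / method_latency


def speedup_table():
    """Recompute the speedup of every accelerated method against its baseline."""
    return {
        model: {name: speedup(BASELINE[model], latency) for name, latency in rows.items()}
        for model, rows in METHODS.items()
    }


if __name__ == "__main__":
    for model, rows in speedup_table().items():
        for name, value in rows.items():
            print(f"{model:13s} {name:22s} {value:.2f}x")
```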


⚡ Scale to Multi-GPU

Our method efficiently scales across multiple GPUs to accelerate inference and training.
By leveraging model parallelism, NCCL communication, and optimized memory management, we achieve significant speedup without compromising quality.

🔑 Key Features:

  • Increased Throughput 🚀: Distributes computation across multiple GPUs to process more frames in parallel.
  • Optimized Memory Usage 🔧: Dynamically allocates memory to prevent bottlenecks.
  • Flexible Deployment 💡: Works seamlessly on both single-node and distributed setups.
  • NCCL Optimization 🔄: Uses efficient GPU-GPU communication to minimize overhead.
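
One simple way to picture "processing more frames in parallel" is a contiguous split of frame indices across GPU ranks. The sketch below is only an assumption about how such a split could look: `shard_frames` is a hypothetical helper, not a function from this repository, and the actual code may partition work differently (for example, over tokens rather than frames).

```python
def shard_frames(num_frames, world_size, rank):
    """Return the contiguous range of frame indices assigned to `rank`."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    base, extra = divmod(num_frames, world_size)
    # The first `extra` ranks each take one additional frame.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    # 129 frames (the HunyuanVideo setting above) split over 4 hypothetical GPUs.
    for rank in range(4):
        frames = shard_frames(129, 4, rank)
        print(f"rank {rank}: frames {frames.start}..{frames.stop - 1} ({len(frames)} frames)")
```

Keeping each shard contiguous means a rank only needs to exchange boundary information with its neighbours, which is one reason this layout is a common starting point.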

Multi-GPU Scaling

For detailed setup and configuration, please refer to our Multi-GPU Guide. 🚀


πŸ“ To-Do List:

  • OpenSora2 πŸ—οΈ (Upcoming Support)
  • Optimize Caching for CogVideoX βš™οΈ

📚 Citation

@misc{ma2025modelrevealscacheprofilingbased,
      title={Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models}, 
      author={Xuran Ma and Yexin Liu and Yaofu Liu and Xianfeng Wu and Mingzhe Zheng and Zihao Wang and Ser-Nam Lim and Harry Yang},
      year={2025},
      eprint={2504.03140},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.03140}, 
}

📜 License

This project is licensed under the Apache 2.0 License.