AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

June 10, 2025

This repository is the official PyTorch implementation of AccVideo, an efficient distillation method that accelerates video diffusion models using a synthetic dataset. Our method generates videos up to 8.5x faster than HunyuanVideo (720px × 1280px, 129 frames, single A100 GPU).

🔥🔥🔥 News

🎥 Demo (Based on HunyuanT2V)

https://github.com/user-attachments/assets/59f3c5db-d585-4773-8d92-366c1eb040f0

🎥 Demo (Based on WanXT2V-14B)

https://github.com/user-attachments/assets/ff9724da-b76c-478d-a9bf-0ee7240494b2

🎥 Demo (Based on WanXI2V-480P-14B)

https://github.com/user-attachments/assets/08f11ef7-c57a-4b24-87ff-e72cb3a34d1d

📑 Open-source Plan

  • Inference
  • Checkpoints
  • Multi-GPU Inference
  • Synthetic Video Dataset, SynVid
  • Training

🔧 Installation

The code has been tested with Python 3.10.0 and CUDA 11.8 on an A100 GPU.

conda create -n accvideo python==3.10.0
conda activate accvideo

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
pip install "huggingface_hub[cli]"
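
After installation, it can help to confirm that the pinned packages resolved to the expected versions. The following is a minimal sketch using only the Python standard library; the helper name `check_pins` is hypothetical and is not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the installation commands above.
PINNED = {
    "torch": "2.4.0",
    "torchvision": "0.19.0",
    "torchaudio": "2.4.0",
    "flash-attn": "2.7.3",
}


def check_pins(pinned=PINNED):
    """Return {package: (expected, installed or None)} for each pinned package."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed)
    return report


if __name__ == "__main__":
    for name, (expected, installed) in check_pins().items():
        # CUDA wheels carry a local suffix such as "2.4.0+cu118", so compare by prefix.
        ok = installed is not None and installed.startswith(expected)
        status = "OK" if ok else "CHECK"
        print(f"{name}: expected {expected}, found {installed} -> {status}")
```

A `CHECK` line only means the installed version differs from the pin or the package is missing; it does not by itself indicate that inference will fail.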

🤗 Checkpoints

To download the checkpoints (based on HunyuanT2V), use the following command:

# Download the model weight
huggingface-cli download aejion/AccVideo --local-dir ./ckpts

To download the checkpoints (based on WanX-T2V-14B), use the following command:

# Download the model weight
huggingface-cli download aejion/AccVideo-WanX-T2V-14B --local-dir ./wanx_t2v_ckpts

To download the checkpoints (based on WanX-I2V-480P-14B), use the following command:

# Download the model weight
huggingface-cli download aejion/AccVideo-WanX-I2V-480P-14B --local-dir ./wanx_i2v_ckpts
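
The same downloads can also be scripted from Python. This is a sketch that assumes the `huggingface_hub` package installed above and its `snapshot_download` function; the `CHECKPOINTS` mapping and the `download_all` helper are hypothetical names introduced here for illustration, while the repository IDs and directories come from the commands above.

```python
# Hugging Face repositories and local directories taken from the commands above.
CHECKPOINTS = {
    "aejion/AccVideo": "./ckpts",
    "aejion/AccVideo-WanX-T2V-14B": "./wanx_t2v_ckpts",
    "aejion/AccVideo-WanX-I2V-480P-14B": "./wanx_i2v_ckpts",
}


def download_all(checkpoints=CHECKPOINTS):
    """Download every checkpoint repository into its local directory."""
    # Imported lazily so the mapping can be inspected without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in checkpoints.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all()` fetches all three checkpoints; pass a smaller dictionary to fetch only the model you plan to run.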

🚀 Inference

We recommend using a GPU with 80GB of memory. We use AccVideo to distill Hunyuan and WanX.

Inference for HunyuanT2V

To run the inference, use the following command:

export MODEL_BASE=./ckpts
python sample_t2v.py \
    --height 544 \
    --width 960 \
    --num_frames 93 \
    --num_inference_steps 5 \
    --guidance_scale 1 \
    --embedded_cfg_scale 6 \
    --flow_shift 7 \
    --flow-reverse \
    --prompt_file ./assets/prompt.txt \
    --seed 1024 \
    --output_path ./results/accvideo-544p \
    --model_path ./ckpts \
    --dit-weight ./ckpts/accvideo-t2v-5-steps/diffusion_pytorch_model.pt
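
For batch runs, the command above can be assembled programmatically. The sketch below is one possible wrapper, not part of this repository: `build_t2v_command` and `run_t2v` are hypothetical helpers, and they use only the flags and default values shown in the command above.

```python
import os
import shlex
import subprocess


def build_t2v_command(prompt_file, output_path, height=544, width=960,
                      num_frames=93, steps=5, seed=1024, model_path="./ckpts"):
    """Assemble the argument list for sample_t2v.py using the flags shown above."""
    return [
        "python", "sample_t2v.py",
        "--height", str(height),
        "--width", str(width),
        "--num_frames", str(num_frames),
        "--num_inference_steps", str(steps),
        "--guidance_scale", "1",
        "--embedded_cfg_scale", "6",
        "--flow_shift", "7",
        "--flow-reverse",
        "--prompt_file", prompt_file,
        "--seed", str(seed),
        "--output_path", output_path,
        "--model_path", model_path,
        "--dit-weight", f"{model_path}/accvideo-t2v-5-steps/diffusion_pytorch_model.pt",
    ]


def run_t2v(cmd, model_base="./ckpts"):
    """Run the command with MODEL_BASE set, mirroring the export line above."""
    env = dict(os.environ, MODEL_BASE=model_base)
    subprocess.run(cmd, check=True, env=env)


# Print the command for inspection; call run_t2v(cmd) to execute it.
cmd = build_t2v_command("./assets/prompt.txt", "./results/accvideo-544p")
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to review the exact command before launching a long GPU job.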

The following table compares inference time on a single A100 GPU:

| Model | Setting (height/width/frame) | Inference Time (s) |
|---|---|---|
| HunyuanVideo | 720px × 1280px × 129f | 3234 |
| Ours | 720px × 1280px × 129f | 380 (8.5x faster) |
| HunyuanVideo | 544px × 960px × 93f | 704 |
| Ours | 544px × 960px × 93f | 91 (7.7x faster) |
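
The speedup factors in the table above follow from dividing the baseline time by the distilled model's time. A short worked check, using only the timings reported above (the names `TIMINGS` and `speedup` are illustrative):

```python
# Inference times in seconds on a single A100, copied from the table above.
TIMINGS = {
    "720px x 1280px x 129f": {"HunyuanVideo": 3234, "Ours": 380},
    "544px x 960px x 93f": {"HunyuanVideo": 704, "Ours": 91},
}


def speedup(baseline_seconds, ours_seconds):
    """Speedup factor: how many times faster the distilled model runs."""
    return baseline_seconds / ours_seconds


for setting, times in TIMINGS.items():
    factor = speedup(times["HunyuanVideo"], times["Ours"])
    print(f"{setting}: {factor:.1f}x faster")
```

This gives 8.5x for the 720p setting and 7.7x for the 544p setting, matching the reported figures.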

Inference for WanXT2V

To run the inference, use the following command:

python sample_wanx_t2v.py \
       --task t2v-14B \
       --size 832*480 \
       --ckpt_dir ./wanx_t2v_ckpts \
       --sample_solver 'unipc' \
       --save_dir ./results/accvideo_wanx_14B \
       --sample_steps 10

The following table compares inference time on a single A100 GPU:

| Model | Setting (height/width/frame) | Inference Time (s) |
|---|---|---|
| WanX | 480px × 832px × 81f | 932 |
| Ours | 480px × 832px × 81f | 97 (9.6x faster) |

Inference for WanXI2V-480P

To run the inference, use the following command:

python sample_wanx_i2v.py \
       --task i2v-14B \
       --size 832*480 \
       --ckpt_dir ./wanx_i2v_ckpts \
       --sample_solver 'unipc' \
       --save_dir ./results/accvideo_wanx_i2v_14B \
       --sample_steps 10

The following table compares inference time on a single A100 GPU:

| Model | Setting (height/width/frame) | Inference Time (s) |
|---|---|---|
| WanX-I2V | 480px × 832px × 81f | 768 |
| Ours | 480px × 832px × 81f | 112 (6.8x faster) |

🏆 VBench Results

We report VBench evaluation results for our distilled models. To generate the videos, we used the augmented prompts provided by the VBench team for each base model: HunyuanVideo augmented prompts for AccVideo-HunyuanT2V and WanX augmented prompts for AccVideo-WanXT2V.

| Model | Setting (height/width/frame) | Total Score | Quality Score | Semantic Score | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Image Quality | Object Class | Multiple Objects | Human Action | Color | Spatial Relationship | Scene | Appearance Style | Temporal Style | Overall Consistency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AccVideo-HunyuanT2V | 544px × 960px × 93f | 83.26% | 84.58% | 77.96% | 94.46% | 97.45% | 99.18% | 98.79% | 75.00% | 62.08% | 65.64% | 92.99% | 67.33% | 95.60% | 94.11% | 75.70% | 54.72% | 19.87% | 23.71% | 27.21% |
| AccVideo-WanXT2V | 480px × 832px × 81f | 85.95% | 86.62% | 83.25% | 95.02% | 97.75% | 99.54% | 97.95% | 93.33% | 64.21% | 68.42% | 98.38% | 86.58% | 97.40% | 92.04% | 75.68% | 59.82% | 23.88% | 24.62% | 27.34% |
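
The Total Score values above are consistent with a 4:1 weighted average of the Quality Score and the Semantic Score. That weighting is an assumption based on VBench's scoring convention and is not stated in this README; the sketch below checks it against the two reported rows.

```python
# Assumed weights (not stated in this README): quality 4, semantic 1.
QUALITY_WEIGHT = 4
SEMANTIC_WEIGHT = 1


def total_score(quality, semantic):
    """Weighted average of the quality and semantic scores, in percent."""
    weighted = QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic
    return weighted / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)


# (quality, semantic, reported total) from the table above.
ROWS = {
    "AccVideo-HunyuanT2V": (84.58, 77.96, 83.26),
    "AccVideo-WanXT2V": (86.62, 83.25, 85.95),
}

for model, (quality, semantic, reported) in ROWS.items():
    computed = total_score(quality, semantic)
    print(f"{model}: computed {computed:.2f}%, reported {reported:.2f}%")
```

Under this assumption the computed totals agree with the reported ones to two decimal places, so most of the gap between the two models' Total Scores comes from the Semantic Score.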

🔗 BibTeX

If you find AccVideo useful for your research and applications, please cite using this BibTeX:

@article{zhang2025accvideo,
    title={AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset},
    author={Zhang, Haiyu and Chen, Xinyuan and Wang, Yaohui and Liu, Xihui and Wang, Yunhong and Qiao, Yu},
    journal={arXiv preprint arXiv:2503.19462},
    year={2025}
}

Acknowledgements

The code is built upon FastVideo and HunyuanVideo. We thank all the contributors for open-sourcing their work.