README.md

September 28, 2025

HunyuanVideo-Foley

Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Professional-grade AI sound effect generation for video content creators


👥 Authors

Sizhe Shan¹,²* • Qiulin Li¹,³* • Yutao Cui¹ • Miles Yang¹ • Yuehai Wang² • Qun Yang³ • Jin Zhou¹† • Zhao Zhong¹

🏢 ¹Tencent Hunyuan • 🎓 ²Zhejiang University • ✈️ ³Nanjing University of Aeronautics and Astronautics

*Equal contribution • †Project lead


🔥🔥🔥 News

  • [2025.9.29] 🚀 HunyuanVideo-Foley-XL Model Release - Released an XL-sized model with offload inference support, significantly reducing VRAM requirements.
  • [2025.8.28] 🌟 HunyuanVideo-Foley Open Source Release - Inference code and model weights are publicly available.

🎥 Demo & Showcase

Experience the magic of AI-generated Foley audio in perfect sync with video content!

🎬 Watch how HunyuanVideo-Foley generates immersive sound effects synchronized with video content


🀝 Community Contributions

ComfyUI Integration - Thanks to the amazing community for creating ComfyUI nodes:

🌟 We encourage and appreciate community contributions that make HunyuanVideo-Foley more accessible!


✨ Key Highlights

🎭 Multi-scenario Sync
High-quality audio synchronized with complex video scenes

🧠 Multi-modal Balance
Balanced use of visual and textual information

🎵 48kHz Hi-Fi Output
Professional-grade audio generation at a 48kHz sample rate


📄 Abstract

🚀 Tencent Hunyuan open-sources HunyuanVideo-Foley, an end-to-end video sound effect generation model!

It is a professional-grade AI tool designed for video content creators, applicable to scenarios including short video creation, film production, advertising, and game development.

🎯 Core Highlights

🎬 Multi-scenario Audio-Visual Synchronization
Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.

βš–οΈ Multi-modal Semantic Balance
Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.

🎡 High-fidelity Audio Output
Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.

🏆 SOTA Performance Achieved

HunyuanVideo-Foley leads on most metrics across multiple evaluation benchmarks, reaching new state-of-the-art levels in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching, and outperforming existing open-source solutions in these categories.

📊 Performance Overview: performance comparison across different evaluation metrics, with HunyuanVideo-Foley leading on most of them


🔧 Technical Architecture

📊 Data Pipeline Design

🔄 Data Pipeline: comprehensive data processing pipeline for high-quality text-video-audio datasets

The TV2A (Text-Video-to-Audio) task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.

🏗️ Model Architecture

🧠 Model Architecture: HunyuanVideo-Foley hybrid architecture with multimodal and unimodal transformer blocks

HunyuanVideo-Foley employs a hybrid architecture:

  • 🔄 Multimodal Transformer Blocks: Process visual-audio streams simultaneously
  • 🎵 Unimodal Transformer Blocks: Focus on audio stream refinement
  • 👁️ Visual Encoding: Pre-trained encoder extracts visual features from video frames
  • 📝 Text Processing: Semantic features extracted via pre-trained text encoder
  • 🎧 Audio Encoding: Latent representations with Gaussian noise perturbation
  • ⏰ Temporal Alignment: Synchformer-based frame-level synchronization with gated modulation

📈 Performance Benchmarks

🎬 MovieGen-Audio-Bench Results

Objective and subjective evaluation results, with HunyuanVideo-Foley leading on most metrics

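| 🏆 Method | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ | MOS-Q ↑ | MOS-S ↑ | MOS-T ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FoleyGrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 |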

🎯 Kling-Audio-Eval Results

Comprehensive objective evaluation showcasing state-of-the-art performance

πŸ† MethodFD_PANNs ↓FD_PASST ↓KL ↓IS ↑PQ ↑PC ↓CE ↑CU ↑IB ↑DeSync ↓CLAP ↑
FoleyGrafter22.30322.632.477.086.052.913.285.440.221.230.22
V-AURA33.15474.563.245.805.693.983.134.830.250.860.13
Frieren16.86293.572.957.325.722.552.885.100.210.860.16
MMAudio9.01205.852.179.595.942.913.305.390.300.560.27
ThinkSound9.92228.682.396.865.783.233.125.110.220.670.22
HunyuanVideo-Foley (ours)6.07202.121.898.306.122.763.225.530.380.540.24

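🎉 Outstanding Results! HunyuanVideo-Foley achieves the best score on most evaluation metrics (8 of 10 on MovieGen-Audio-Bench and 7 of 11 on Kling-Audio-Eval), demonstrating significant improvements in audio quality, synchronization, and semantic alignment. The exceptions are PC and CLAP on MovieGen-Audio-Bench and IS, PC, CE, and CLAP on Kling-Audio-Eval, where a baseline scores best.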
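The per-metric comparison in the Kling-Audio-Eval table can be tallied with a short script. The sketch below copies the scores from that table and reports which method scores best on each metric, treating ↑ as higher-is-better and ↓ as lower-is-better. `best_per_metric` is a helper written for this example and is not part of the repository.

```python
# Metric name -> True if higher is better (↑), False if lower is better (↓).
METRICS = {
    "FD_PANNs": False, "FD_PASST": False, "KL": False, "IS": True,
    "PQ": True, "PC": False, "CE": True, "CU": True,
    "IB": True, "DeSync": False, "CLAP": True,
}

# Kling-Audio-Eval scores, one value per metric in the order above.
SCORES = {
    "FoleyGrafter": [22.30, 322.63, 2.47, 7.08, 6.05, 2.91, 3.28, 5.44, 0.22, 1.23, 0.22],
    "V-AURA": [33.15, 474.56, 3.24, 5.80, 5.69, 3.98, 3.13, 4.83, 0.25, 0.86, 0.13],
    "Frieren": [16.86, 293.57, 2.95, 7.32, 5.72, 2.55, 2.88, 5.10, 0.21, 0.86, 0.16],
    "MMAudio": [9.01, 205.85, 2.17, 9.59, 5.94, 2.91, 3.30, 5.39, 0.30, 0.56, 0.27],
    "ThinkSound": [9.92, 228.68, 2.39, 6.86, 5.78, 3.23, 3.12, 5.11, 0.22, 0.67, 0.22],
    "HunyuanVideo-Foley": [6.07, 202.12, 1.89, 8.30, 6.12, 2.76, 3.22, 5.53, 0.38, 0.54, 0.24],
}


def best_per_metric(scores, metrics):
    """Return {metric: best-scoring method} for every metric."""
    winners = {}
    for i, (name, higher_is_better) in enumerate(metrics.items()):
        pick = max if higher_is_better else min
        winners[name] = pick(scores, key=lambda method: scores[method][i])
    return winners


winners = best_per_metric(SCORES, METRICS)
for metric, method in winners.items():
    print(f"{metric:>9}: {method}")
```

The same function works for the MovieGen-Audio-Bench table if the objective metrics and scores are swapped in.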


🚀 Quick Start

📦 Installation

🔧 System Requirements

  • CUDA: 12.4 or 11.8 recommended
  • Python: 3.8+
  • OS: Linux (primary support)
  • VRAM: 20GB for XXL model (or 12GB with --enable_offload), 16GB for XL model (or 8GB with --enable_offload)

Step 1: Clone Repository

# 📥 Clone the repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
cd HunyuanVideo-Foley

Step 2: Environment Setup

💡 Tip: We recommend using Conda for Python environment management.

# 🔧 Install dependencies
pip install -r requirements.txt

Step 3: Download Pretrained Models

🔗 Download the model weights from Hugging Face

# using git-lfs
git clone https://huggingface.co/tencent/HunyuanVideo-Foley

# using huggingface-cli
huggingface-cli download tencent/HunyuanVideo-Foley
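The same download can be scripted from Python. This is a minimal sketch that assumes the `huggingface_hub` package is installed; `download_weights` and its default target directory are names chosen for this example, not part of the repository.

```python
REPO_ID = "tencent/HunyuanVideo-Foley"


def download_weights(local_dir="HunyuanVideo-Foley"):
    """Download the model repository and return the local directory path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_weights()` returns a path that can then be passed as `--model_path` in the commands under Usage.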

💻 Usage

📊 Model Specifications

| Model | Checkpoint | VRAM (Normal) | VRAM (Offload) |
| --- | --- | --- | --- |
| XXL (Default) | hunyuanvideo_foley.pth | 20GB | 12GB |
| XL | hunyuanvideo_foley_xl.pth | 16GB | 8GB |
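To choose between these configurations programmatically, a small lookup over the VRAM figures in the table is enough. The sketch below is a hypothetical helper, not part of the repository. The label `"xxl"` is used only to name the default model in this example; the README shows `--model_size xl` but does not show a flag value for the default.

```python
from typing import NamedTuple, Optional


class Option(NamedTuple):
    model_size: str       # "xxl" (default model) or "xl"
    enable_offload: bool  # whether offload inference is needed
    vram_gb: int          # VRAM requirement stated in the table


OPTIONS = [
    Option("xxl", False, 20),
    Option("xxl", True, 12),
    Option("xl", False, 16),
    Option("xl", True, 8),
]


def pick_config(vram_gb: float, prefer_quality: bool = True) -> Optional[Option]:
    """Pick a configuration that fits in the available VRAM.

    prefer_quality=True favors the larger model even if it needs offload;
    prefer_quality=False favors running without offload.
    Returns None when no listed configuration fits.
    """
    feasible = [o for o in OPTIONS if vram_gb >= o.vram_gb]
    if not feasible:
        return None

    def rank(option):
        quality = 0 if option.model_size == "xxl" else 1
        offload = 1 if option.enable_offload else 0
        return (quality, offload) if prefer_quality else (offload, quality)

    return min(feasible, key=rank)
```

For a 16GB card, for example, `pick_config(16)` selects the XXL model with offload, while `pick_config(16, prefer_quality=False)` selects the XL model without offload.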

🎬 Single Video Generation

Generate Foley audio for a single video file with text description:

# Use XXL model (default, best quality)
python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --single_video video_path \
    --single_prompt "audio description" \
    --output_dir OUTPUT_DIR \
    # --enable_offload  

# Use XL model (memory-friendly)
python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --model_size xl \
    --single_video video_path \
    --single_prompt "audio description" \
    --output_dir OUTPUT_DIR \
    # --enable_offload
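When these commands are driven from a script, building the argument list in Python avoids shell quoting problems in the prompt text. The sketch below uses only the flags shown above; `build_infer_command` is a hypothetical helper, and it omits `--model_size` entirely when the default model is wanted.

```python
import shlex


def build_infer_command(model_path, video, prompt, output_dir,
                        model_size=None, enable_offload=False):
    """Return the infer.py invocation as an argument list."""
    cmd = ["python3", "infer.py", "--model_path", model_path]
    if model_size is not None:
        # Only pass the flag for non-default sizes such as "xl".
        cmd += ["--model_size", model_size]
    cmd += [
        "--single_video", video,
        "--single_prompt", prompt,
        "--output_dir", output_dir,
    ]
    if enable_offload:
        cmd.append("--enable_offload")
    return cmd


example = build_infer_command(
    "PRETRAINED_MODEL_PATH_DIR", "video_path", "audio description",
    "OUTPUT_DIR", model_size="xl", enable_offload=True,
)
# Print a copy-pasteable shell command with the prompt safely quoted.
print(shlex.join(example))
```

To execute the command, pass the list to `subprocess.run(example, check=True)` from the repository root.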

📂 Batch Processing

Process multiple videos using a CSV file with video paths and descriptions:

# Download sample test videos
bash ./download_test_videos.sh

# Batch processing
python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --csv_path assets/test.csv \
    --output_dir OUTPUT_DIR \
    # --enable_offload
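A custom batch file can be generated with Python's `csv` module. The README does not list the column names that `infer.py` expects, so the sketch below takes them as parameters; check the header of `assets/test.csv` and pass matching names, adding any further columns that file contains. `write_batch_csv` is a hypothetical helper, not part of the repository.

```python
import csv


def write_batch_csv(rows, csv_path, video_col, prompt_col):
    """Write (video_path, prompt) pairs to a CSV file with a header row.

    video_col and prompt_col must match the header used by
    assets/test.csv; they are deliberately not hard-coded here.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[video_col, prompt_col])
        writer.writeheader()
        for video, prompt in rows:
            writer.writerow({video_col: video, prompt_col: prompt})
```

The resulting file is then passed to `infer.py` through `--csv_path` in place of `assets/test.csv`.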

🌐 Interactive Web Interface

Launch a user-friendly Gradio web interface for easy interaction:

# Launch with XXL model (default)
export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
python3 gradio_app.py

# Launch with XL model (memory-friendly)
export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
MODEL_SIZE=xl python3 gradio_app.py

# Optional: Enable offload to reduce memory usage
ENABLE_OFFLOAD=true python3 gradio_app.py
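The three environment variables above can also be combined in a single launch from Python. The sketch below only assembles the environment; `gradio_env` is a hypothetical helper, and it uses just the variable names shown in the commands above.

```python
import os


def gradio_env(model_path, model_size=None, enable_offload=False, base_env=None):
    """Return an environment mapping for launching gradio_app.py."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIFI_FOLEY_MODEL_PATH"] = model_path
    if model_size is not None:
        env["MODEL_SIZE"] = model_size  # e.g. "xl"
    if enable_offload:
        env["ENABLE_OFFLOAD"] = "true"
    return env
```

A launch would then look like `subprocess.run(["python3", "gradio_app.py"], env=gradio_env("PRETRAINED_MODEL_PATH_DIR", "xl", True), check=True)`.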

🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio!


📚 Citation

If you find HunyuanVideo-Foley useful for your research, please consider citing our paper:

@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
      title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation}, 
      author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
      year={2025},
      eprint={2508.16930},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2508.16930}, 
}

🙏 Acknowledgements

We extend our heartfelt gratitude to the open-source community!

🎨 Stable Diffusion 3
Foundation diffusion models

⚡ FLUX
Advanced generation techniques

🎵 MMAudio
Multimodal audio generation

🤗 HuggingFace
Platform & diffusers library

🗜️ DAC
High-Fidelity Audio Compression

🔗 Synchformer
Audio-Visual Synchronization

🌟 Special thanks to all researchers and developers who contribute to the advancement of AI-generated audio and multimodal learning!


© 2025 Tencent Hunyuan. All rights reserved. | Made with ❤️ for the AI community