November 21, 2025
Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback
Xingpei Ma*
Shenneng Huang*
Jiaran Cai*†
Yuansheng Guan*
Shen Zheng*
Hanfeng Zhao
Qiang Zhang
Shunsi Zhang
* Equal contribution
† Project lead & corresponding author
Guangzhou Quwan Network Technology
TL;DR: We present Playmate2, which generates high-quality audio-driven videos while addressing two key challenges: temporal coherence in long sequences and multi-character animation. To the best of our knowledge, this is the first training-free approach capable of enabling audio-driven animation for three or more characters without requiring additional data or model modifications.
📰 News
- 2025/11/21: 🔥🔥🔥 We release the weights and inference code of Playmate2!
- 2025/11/10: 🎉🎉🎉 Our paper has been accepted and will be presented at AAAI 2026. We plan to release the inference code and model weights for both Playmate and Playmate2 in the coming weeks. Stay tuned and thank you for your patience!
- 2025/10/15: 🚀🚀🚀 Our paper is publicly available on arXiv.
📸 Showcase
Multi-Character Animation
Singing Videos
Multi-Style Animation
Explore more examples.
Quick Start
🛠️Installation
1. Create a conda environment and install PyTorch and xformers
conda create -n playmate2 python=3.10
conda activate playmate2
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -U xformers==0.0.29 --index-url https://download.pytorch.org/whl/cu124
2. Flash-attn installation:
pip install misaki[en]
pip install ninja
pip install psutil
pip install packaging
pip install flash_attn==2.7.4.post1 --no-build-isolation
3. Other dependencies
pip install -r requirements.txt
4. FFmpeg installation
conda install -c conda-forge ffmpeg
or
sudo yum install ffmpeg ffmpeg-devel
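After either command, it can help to confirm that the binary is actually reachable before starting a long inference run. The following is a minimal sketch using only the Python standard library; the helper name `ffmpeg_available` is hypothetical and not part of this repository.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if `binary` can be found on the current PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found; install it with one of the commands above")
```

Run it inside the activated `playmate2` environment so that the same PATH is checked as the one inference will use.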
🧱Model Preparation
Model Download
| Models | Download Link | Save Path |
|---|---|---|
| Wan2.1-I2V-14B-720P | Huggingface | pretrained_weights/Wan2.1-I2V-14B-720P |
| chinese-wav2vec2-base | Huggingface | pretrained_weights/chinese-wav2vec2-base |
| VideoLLaMA3-7B | Huggingface | pretrained_weights/VideoLLaMA3-7B |
| Our Pretrained Model | Huggingface | pretrained_weights/playmate2 |
Download models using huggingface-cli:
mkdir pretrained_weights
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./pretrained_weights/Wan2.1-I2V-14B-720P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./pretrained_weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./pretrained_weights/chinese-wav2vec2-base
huggingface-cli download DAMO-NLP-SG/VideoLLaMA3-7B --local-dir ./pretrained_weights/VideoLLaMA3-7B
huggingface-cli download PlaymateAI/Playmate2 --local-dir ./pretrained_weights/playmate2
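Because the downloads are large, a partial or interrupted download is easy to miss. The sketch below checks that each save path from the table exists and is non-empty. It is an illustration only: the helper `missing_weights` is hypothetical, and it does not verify file integrity or completeness.

```python
from pathlib import Path

# Directory names copied from the "Save Path" column of the table above.
EXPECTED_DIRS = [
    "Wan2.1-I2V-14B-720P",
    "chinese-wav2vec2-base",
    "VideoLLaMA3-7B",
    "playmate2",
]


def missing_weights(root: str = "pretrained_weights") -> list[str]:
    """Return the expected model directories that are absent or empty under `root`."""
    missing = []
    for name in EXPECTED_DIRS:
        path = Path(root) / name
        # A directory that exists but holds no files is treated as missing.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(str(path))
    return missing


if __name__ == "__main__":
    problems = missing_weights()
    if problems:
        print("Missing or empty model directories:")
        for p in problems:
            print("  " + p)
    else:
        print("All expected model directories are present.")
```

Run it from the repository root, where `pretrained_weights` was created.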
Inference
An A100 or better GPU is recommended for inference.
- One person
# --gpu_num: 1 (single GPU) or 3 (multiple GPUs)
python inference.py \
--gpu_num 1 \
--image_path examples/images/01.png \
--audio_path examples/audios/01.wav \
--prompt_path examples/prompts/01.txt \
--output_path examples/outputs/01.mp4 \
--max_size 1280 \
--id_num 1
- Multiple persons
# N is the number of persons.
# --gpu_num: 1 (single GPU) or 3+N-1 (multiple GPUs)
python inference.py \
--gpu_num 1 \
--image_path examples/images/04.png \
--audio_path examples/audios/04 \
--mask_path examples/masks/04 \
--prompt_path examples/prompts/04.txt \
--output_path examples/outputs/04.mp4 \
--max_size 1280 \
--id_num 3
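The GPU count in the comments above follows a simple rule: 1 for single-GPU runs, or 3 + N - 1 for multi-GPU runs with N persons (so 3 GPUs for one person and 5 for three). The sketch below encodes that rule and assembles the argument list from the flags shown in the two examples. The helper names are hypothetical, and it assumes `--id_num` equals the number of persons, as both examples suggest. Whether multi-GPU runs need an additional launcher is not stated here, so the sketch only mirrors the documented flags.

```python
def multi_gpu_count(num_persons: int) -> int:
    """GPU count for multi-GPU inference, following the 3 + N - 1 rule."""
    if num_persons < 1:
        raise ValueError("num_persons must be at least 1")
    return 3 + num_persons - 1


def build_inference_cmd(image_path, audio_path, prompt_path, output_path,
                        num_persons=1, mask_path=None, multi_gpu=False,
                        max_size=1280):
    """Assemble the argument list for inference.py using only the flags shown above."""
    gpu_num = multi_gpu_count(num_persons) if multi_gpu else 1
    cmd = [
        "python", "inference.py",
        "--gpu_num", str(gpu_num),
        "--image_path", image_path,
        "--audio_path", audio_path,
    ]
    # The single-person example passes no mask, so the flag is optional here.
    if mask_path is not None:
        cmd += ["--mask_path", mask_path]
    cmd += [
        "--prompt_path", prompt_path,
        "--output_path", output_path,
        "--max_size", str(max_size),
        "--id_num", str(num_persons),  # assumption: id_num == number of persons
    ]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_cmd(
        "examples/images/04.png", "examples/audios/04",
        "examples/prompts/04.txt", "examples/outputs/04.mp4",
        num_persons=3, mask_path="examples/masks/04", multi_gpu=True,
    )
    print(" ".join(cmd))
    # To launch from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids the shell pitfall of placing a comment after a line-continuation backslash, which would cut the command short.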
📑 Todo List
📝 Citation
If you find our work useful for your research, please consider citing the paper:
@article{ma2025playmate2,
title={Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback},
author={Ma, Xingpei and Huang, Shenneng and Cai, Jiaran and Guan, Yuansheng and Zheng, Shen and Zhao, Hanfeng and Zhang, Qiang and Zhang, Shunsi},
journal={arXiv preprint arXiv:2510.12089},
year={2025}
}