Multi-HumanVid [ICCV 2025]
October 16, 2025
This repository is the official implementation of the paper:
Multi-identity Human Image Animation with Structural Video Diffusion
Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Yuwei Guo, Dahua Lin, Tianfan Xue, Bo Dai
CUHK, Shanghai AI Lab
TL;DR
Multi-HumanVid is a multi-human, pose-driven video generation method built on AnimateDiff; it was developed before DiT-based models were widely adopted. It is designed to better model human–human and human–object interactions from pose conditions, given the limited capacity of pretrained convolution-based video generators. Our experiments show that incorporating pseudo-3D cues yields more realistic geometry-dependent appearance changes and human motion. We also find that the model can infer objects held by or attached to a person purely from RGB and pseudo-3D video, provided the object stays near the person. In addition, SAM2-based tracking masks help preserve identity-consistent appearances across multiple people. Architecturally, the approach extends HumanVid. We welcome further improvements that apply our idea to pretrained DiT models.
Framework

Qualitative Results

News
2025/10/16: We release the code.
Usage
The following script extracts the whole-body pose for all videos in a given folder, e.g., videos. The extracted poses are stored in the dwpose folder.
cd DWPose
python prepare_video.py
Training and Inference
Conda Environment
Please prepare conda environment following HumanVid.
Prepare meta information for the training set
Please use scripts in ./tools to extract all valid videos, pose files and camera files as the training set.
Here is an example. First, extract the valid paths with ./tools/extract_*_meta_info.py. Then merge them into a single meta information file with ./tools/merge_all_meta_info.py. Finally, split all videos longer than 10 s into shorter segments with ./tools/get_video_segments.py, which adds each segment's starting frame and ending frame to the meta information.
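To illustrate the segmentation step, here is a minimal sketch of how a long video could be divided into segments of at most 10 s, each recorded with a starting and ending frame. It is not the code in ./tools/get_video_segments.py: the function names and the meta-information keys (video_path, num_frames, fps, start_frame, end_frame) are hypothetical and chosen only for this example, and the actual script may split videos differently.

```python
import math


def split_into_segments(num_frames, fps, max_seconds=10.0):
    """Return (start_frame, end_frame) pairs covering the video.

    end_frame is exclusive. Every segment is at most max_seconds long,
    and segments are kept near-equal in length so no tiny tail remains.
    """
    if num_frames <= 0 or fps <= 0:
        return []
    max_len = max(1, int(round(fps * max_seconds)))
    if num_frames <= max_len:
        return [(0, num_frames)]  # short video: keep it as one segment
    num_segments = math.ceil(num_frames / max_len)
    base, extra = divmod(num_frames, num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        length = base + (1 if i < extra else 0)  # spread the remainder
        segments.append((start, start + length))
        start += length
    return segments


def add_segments(meta_infos, max_seconds=10.0):
    """Expand each meta entry into one entry per segment.

    The keys used here are assumptions made for this sketch, not the
    repository's actual meta-information schema.
    """
    expanded = []
    for info in meta_infos:
        for start, end in split_into_segments(
            info["num_frames"], info["fps"], max_seconds
        ):
            entry = dict(info)
            entry["start_frame"] = start
            entry["end_frame"] = end
            expanded.append(entry)
    return expanded


if __name__ == "__main__":
    metas = [{"video_path": "videos/example.mp4", "num_frames": 600, "fps": 25}]
    for entry in add_segments(metas):
        print(entry["video_path"], entry["start_frame"], entry["end_frame"])
```

Splitting into near-equal segments, rather than fixed 10 s chunks plus a remainder, avoids producing a very short final clip; whether the repository makes the same choice is not stated here.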
Usage
Training: run bash scripts/train_s1.sh for stage 1, then bash scripts/train_s2.sh for stage 2.
Inference: bash scripts/eval.sh.
Our code structure is very similar to that of HumanVid. Please check the HumanVid README for more details.
Please give us a star if you are interested in our work. Thanks!
Bibtex
@article{wang2025multi,
title={Multi-identity Human Image Animation with Structural Video Diffusion},
author={Wang, Zhenzhi and Li, Yixuan and Zeng, Yanhong and Guo, Yuwei and Lin, Dahua and Xue, Tianfan and Dai, Bo},
journal={arXiv preprint arXiv:2504.04126},
year={2025}
}