Multi-HumanVid [ICCV 2025]

October 16, 2025

This repository is the official implementation of the paper:

Multi-identity Human Image Animation with Structural Video Diffusion
Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Yuwei Guo, Dahua Lin, Tianfan Xue, Bo Dai
CUHK, Shanghai AI Lab

TL;DR

Multi-HumanVid is a multi-human, pose-driven video generation method built on AnimateDiff, developed before DiT was widely adopted. It is designed to better model human–human and human–object interactions from pose conditions, given the limited capacity of pretrained convolution-based video generators. Our experiments show that incorporating pseudo-3D cues yields more realistic geometry-dependent appearance changes and human motion. We also find that the model can infer objects held by or attached to a person purely from RGB and pseudo-3D video, provided the object stays near the person. In addition, SAM2-based tracking masks help preserve identity-consistent appearance across multiple people. Architecturally, the approach extends HumanVid. We welcome further improvements that apply our idea to pretrained DiT models.

Framework

framework

Qualitative Results

qualitative

News

  • 2025/10/16: We release the code.

Usage

The prepare_video.py script extracts whole-body poses for all videos in a given folder, e.g., videos. The extracted poses are stored in the dwpose folder.

cd DWPose
python prepare_video.py

Training and Inference

Conda Environment

Please prepare the conda environment by following the instructions in HumanVid.

Prepare meta information for the training set

Please use the scripts in ./tools to collect all valid videos, pose files, and camera files as the training set.

Here is an example workflow. First, extract the valid paths using ./tools/extract_*_meta_info.py. Then merge them into a single meta information file using ./tools/merge_all_meta_info.py. Finally, split all videos longer than 10 s into shorter segments using ./tools/get_video_segments.py, which adds each segment's starting frame and ending frame to the meta information.
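The segment-splitting step can be illustrated with a minimal sketch. This is not the code in ./tools/get_video_segments.py. The field names (num_frames, fps, start_frame, end_frame), the exclusive end index, and the function names are assumptions made for illustration only. The sketch shows one plausible way to cut a video into pieces of at most 10 s and record each piece's frame range in the meta information.

```python
import math


def split_into_segments(num_frames, fps, max_seconds=10.0):
    """Return (start_frame, end_frame) pairs, end exclusive, each at most max_seconds long."""
    if num_frames <= 0 or fps <= 0:
        return []
    # Largest number of frames that still fits within max_seconds.
    max_len = max(1, int(math.floor(fps * max_seconds)))
    segments = []
    start = 0
    while start < num_frames:
        end = min(start + max_len, num_frames)
        segments.append((start, end))
        start = end
    return segments


def add_segments(meta_infos, max_seconds=10.0):
    """Expand each meta entry into one entry per segment (hypothetical schema)."""
    expanded = []
    for info in meta_infos:
        for start, end in split_into_segments(info["num_frames"], info["fps"], max_seconds):
            entry = dict(info)  # copy so the input entry is left untouched
            entry["start_frame"] = start
            entry["end_frame"] = end
            expanded.append(entry)
    return expanded
```

Under these assumptions, a 25 s clip at 30 fps (750 frames) becomes three entries covering frames 0-300, 300-600, and 600-750, while a clip shorter than 10 s stays as a single entry.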

Usage

Training: run bash scripts/train_s1.sh for stage 1, then bash scripts/train_s2.sh for stage 2.

Inference: bash scripts/eval.sh.
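The commands above can be chained by a small driver that stops at the first failure. This is a sketch only: it assumes the three scripts run without extra arguments and that any checkpoint or config paths they need have already been set inside them, which this README does not specify. The names STEPS and run_steps are hypothetical.

```python
import subprocess

# Script paths are taken from this README; the assumption is that
# each script can be launched without additional arguments.
STEPS = [
    ("train stage 1", ["bash", "scripts/train_s1.sh"]),
    ("train stage 2", ["bash", "scripts/train_s2.sh"]),
    ("inference", ["bash", "scripts/eval.sh"]),
]


def run_steps(steps=STEPS, runner=subprocess.run):
    """Run each step in order and return the names of the steps that finished.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit code, so later steps are skipped.
    """
    completed = []
    for name, cmd in steps:
        print(f"[{name}] {' '.join(cmd)}")
        runner(cmd, check=True)
        completed.append(name)
    return completed

# Usage, from the repository root: run_steps()
```

The runner parameter exists so the sequence can be dry-run or tested without launching any training job.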

Our code structure is very similar to that of HumanVid. Please check the HumanVid README for more details.

Please give us a star if you are interested in our work. Thanks!

Bibtex

@article{wang2025multi,
  title={Multi-identity Human Image Animation with Structural Video Diffusion},
  author={Wang, Zhenzhi and Li, Yixuan and Zeng, Yanhong and Guo, Yuwei and Lin, Dahua and Xue, Tianfan and Dai, Bo},
  journal={arXiv preprint arXiv:2504.04126},
  year={2025}
}