LaVieID

August 1, 2025

LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

🏆 Official implementation of our paper accepted at ACM MM 2025

📄 DOI: 10.1145/3746027.3754943

Wenhui Song*, Hanhui Li*, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang



🔬 Overview

LaVieID introduces a novel local autoregressive video diffusion framework for identity-preserving text-to-video generation.

🎯 Key modules:

  • A local router that explicitly encodes facial latent states via weighted fine-grained regions, mitigating interference and boosting identity retention.
  • A temporal autoregressive module (TAM) that enhances inter-frame consistency by modeling long-term temporal dependencies at the latent level.
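
As a rough illustration of what "weighted fine-grained regions" can mean, the toy sketch below fuses per-region feature vectors with softmax weights. It is not the LaVieID router: the function names, the use of a softmax, and the idea of one scalar score per region are all assumptions made for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_regions(region_features, region_scores):
    """Weighted sum of per-region feature vectors (illustrative only).

    region_features: dict mapping region name -> list of floats (same length)
    region_scores:   dict mapping region name -> float (hypothetical logits)
    Returns (weights per region, fused feature vector).
    """
    names = sorted(region_features)
    weights = softmax([region_scores[n] for n in names])
    dim = len(region_features[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, v in enumerate(region_features[name]):
            fused[i] += w * v
    return dict(zip(names, weights)), fused
```

Regions with higher scores contribute more to the fused vector, which is the general intuition behind weighting facial parts separately instead of treating the face as a single block.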

LaVieID generates high-fidelity, personalized video clips and achieves state-of-the-art identity preservation performance.


⚙️ Setup

✅ Tested Environment

  • Python: 3.12.4
  • CUDA: 12.8
  • PyTorch: 2.1.0+

⚠️ Note: Python 3.12 is relatively new, so some libraries (e.g., xformers, diffusers) may require their latest versions. If you run into compatibility issues, we recommend using Python 3.10.
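
To compare a local setup against the tested environment above, a small script along these lines may help. It is only a sketch: `environment_report` is a name chosen for this example and is not part of the repository, and PyTorch is treated as optional so the script still runs before dependencies are installed.

```python
import platform


def environment_report():
    """Collect the Python version and, if installed, PyTorch/CUDA versions."""
    report = {"python": platform.python_version()}
    try:
        import torch  # optional: only reported when installed
    except ImportError:
        report["torch"] = None
        report["cuda"] = None
    else:
        report["torch"] = torch.__version__
        # torch.version.cuda is None for CPU-only builds
        report["cuda"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value if value is not None else 'not available'}")
```

Run it inside the virtual environment after installing the requirements and compare the printed versions with the list above.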

1. Clone Repository

git clone https://github.com/ssugarwh/LaVieID.git
cd LaVieID

2. Create Virtual Environment

python3.12 -m venv lavieid
source lavieid/bin/activate  # for Linux/macOS
# or lavieid\Scripts\activate  # for Windows

3. Install Dependencies

pip install -r requirements.txt

📥 Pretrained Models

You can download our pretrained checkpoint from ModelScope:

SDK

# Install the ModelScope SDK (shell)
pip install modelscope

# Download the checkpoint (Python)
from modelscope import snapshot_download
model_dir = snapshot_download('sugarwh/lavieid')

Git

git clone https://www.modelscope.cn/sugarwh/lavieid.git

After downloading, replace the model path in the bash script with the location of the downloaded checkpoint before running inference.

🚀 Inference

bash infer_facevideo_router_v2.sh 

🧪 Training

You can train the model using:

bash train_router_v2.sh 

Make sure you have a training dataset prepared and specified in the config.

📁 Dataset Structure

The dataset is organized under facevideo_dataset/ with the following key components:

facevideo_dataset/
├── captions/                # Text prompts or captions for videos
├── face_parts/              # Part-wise facial region images and masks
├── face_images/             # Images extracted from the videos
├── refine_bbox_jsons/       # Refined bounding boxes (JSON)
├── total_train_data.txt     # List of all training samples
└── videos/                  # Original or processed video clips
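
Before training, it can be useful to confirm that the top-level layout shown above is in place. The following is a minimal sketch using only the standard library; `missing_entries` is a hypothetical helper written for this example and is not shipped with the repository.

```python
from pathlib import Path

# Names taken from the directory tree above.
EXPECTED_DIRS = ["captions", "face_parts", "face_images", "refine_bbox_jsons", "videos"]
EXPECTED_FILES = ["total_train_data.txt"]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

For example, `missing_entries("facevideo_dataset")` returns an empty list when every listed folder and the sample list file exist.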

Each folder in face_parts/ corresponds to a segmented video and contains six subfolders representing facial components:

face_parts/<people_name>/
├── eyebrows/   ─┬─ image.png, mask.png
├── eyes/       ─┤
├── face/       ─┤
├── hair/       ─┤
├── mouth/      ─┤
└── nose/       ─┘

Each subfolder includes:

  • image.png: Cropped image of the part
  • mask.png: Binary mask of the same region
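
The per-part layout described above can be checked in the same spirit. This sketch assumes exactly the six part names and two file names listed in this section; `check_face_parts` is a hypothetical helper name, not a function from the codebase.

```python
from pathlib import Path

# Part and file names taken from the face_parts/ description above.
PARTS = ["eyebrows", "eyes", "face", "hair", "mouth", "nose"]
FILES = ["image.png", "mask.png"]


def check_face_parts(identity_dir):
    """Return relative paths missing from one face_parts/<people_name>/ folder."""
    identity_dir = Path(identity_dir)
    return [
        f"{part}/{name}"
        for part in PARTS
        for name in FILES
        if not (identity_dir / part / name).is_file()
    ]
```

Calling it on each folder under `face_parts/` and collecting the non-empty results gives a quick list of incomplete samples before training starts.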

📖 Citation

If you find this work useful, please cite us:

@inproceedings{song2025lavieid,
  title     = {LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation},
  author    = {Wenhui Song and Hanhui Li and Jiehui Huang and Panwen Hu and Yuhao Cheng and Long Chen and Yiqiang Yan and Xiaodan Liang},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
  year      = {2025},
  publisher = {ACM},
  address   = {Dublin, Ireland},
  doi       = {10.1145/3746027.3754943},
  isbn      = {979-8-4007-2035-2}
}