LaVieID

August 1, 2025

LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

🏆 Official implementation of our paper accepted at ACM MM 2025

📄 DOI: 10.1145/3746027.3754943

Wenhui Song*, Hanhui Li*, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang

🔬 Overview

LaVieID introduces a novel local autoregressive video diffusion framework for identity-preserving text-to-video generation.

🎯 Key modules:

A local router that explicitly encodes facial latent states via weighted fine-grained regions, mitigating interference and boosting identity retention.
A temporal autoregressive module (TAM) that enhances inter-frame consistency by modeling long-term temporal dependencies at the latent level.

LaVieID generates high-fidelity, personalized video clips and achieves state-of-the-art identity preservation performance.

⚙️ Setup

✅ Tested Environment

Python: 3.12.4
CUDA: 12.8
PyTorch: 2.1.0+

⚠️ Note: Python 3.12 is relatively new, and some libraries (e.g., xformers, diffusers) may require their latest versions. If you run into compatibility issues, we recommend using Python 3.10.
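
To compare a local setup against the tested environment, a small version report like the following may help. This is a minimal sketch: the helper name `report_environment` is hypothetical and not part of this repository, and it only reports versions rather than enforcing them.

```python
import sys
from importlib import metadata


def report_environment(packages=("torch", "xformers", "diffusers")):
    """Return the Python version and installed versions of the given packages.

    Hypothetical helper, not part of the LaVieID repository.
    Packages that are not installed are reported as None.
    """
    report = {"python": ".".join(str(n) for n in sys.version_info[:3])}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, version in report_environment().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the activated virtual environment shows which of the listed libraries are installed and at which versions.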

1. Clone Repository

git clone https://github.com/ssugarwh/LaVieID.git
cd LaVieID

2. Create Virtual Environment

python3.12 -m venv lavieid
source lavieid/bin/activate  # for Linux/macOS
# or lavieid\Scripts\activate  # for Windows

3. Install Dependencies

pip install -r requirements.txt

📥 Pretrained Models

You can download our pretrained checkpoint from ModelScope:

SDK

# Install the ModelScope SDK
pip install modelscope

# Download the checkpoint (Python)
from modelscope import snapshot_download
model_dir = snapshot_download('sugarwh/lavieid')
print(model_dir)  # local path of the downloaded checkpoint

Git

git clone https://www.modelscope.cn/sugarwh/lavieid.git

After downloading, update the model path in the bash scripts used below so that it points to the downloaded checkpoint.

🚀 Inference

bash infer_facevideo_router_v2.sh

🧪 Training

You can train the model using:

bash train_router_v2.sh

Make sure you have a training dataset prepared and specified in the config.

📁 Dataset Structure

The dataset is organized under facevideo_dataset/ with the following key components:

facevideo_dataset/
├── captions/                 # Text prompts or captions for videos
├── face_parts/              # Part-wise facial region images and masks
├── face_images/             # Face images extracted from the videos
├── refine_bbox_jsons/       # Refined bounding boxes (JSON)
├── total_train_data.txt     # List of all training samples
└── videos/                  # Original or processed video clips
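
Before training, it can be useful to confirm that the top-level layout matches the tree above. The following is a minimal sketch under that assumption; `missing_entries` is a hypothetical helper, not a script shipped with this repository, and it checks only that the entries exist, not their contents.

```python
from pathlib import Path

# Entry names taken from the directory tree above.
EXPECTED_DIRS = ("captions", "face_parts", "face_images", "refine_bbox_jsons", "videos")
EXPECTED_FILES = ("total_train_data.txt",)


def missing_entries(root):
    """Return the expected top-level entries that are absent under root.

    Hypothetical helper, not part of the LaVieID repository.
    """
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    print(missing_entries("facevideo_dataset") or "layout looks complete")
```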

Each folder in face_parts/ corresponds to a segmented video and contains six subfolders representing facial components:

face_parts/<people_name>/
├── eyebrows/   ─┬─ image.png, mask.png
├── eyes/       ─┤
├── face/       ─┤
├── hair/       ─┤
├── mouth/      ─┤
└── nose/       ─┘

Each subfolder includes:

image.png: Cropped image of the part
mask.png: Binary mask of the same region
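
The per-part layout described above can be checked with a short script. This is a sketch, assuming every sample folder under face_parts/ holds the six subfolders listed, each with image.png and mask.png; `find_incomplete_parts` is a hypothetical name and not part of this repository.

```python
from pathlib import Path

# Part and file names taken from the layout described above.
PARTS = ("eyebrows", "eyes", "face", "hair", "mouth", "nose")
FILES = ("image.png", "mask.png")


def find_incomplete_parts(face_parts_dir):
    """Map each sample folder name to the list of its missing part files.

    Hypothetical helper, not part of the LaVieID repository.
    Only samples with at least one missing file appear in the result.
    """
    problems = {}
    for sample in sorted(Path(face_parts_dir).iterdir()):
        if not sample.is_dir():
            continue
        missing = [
            f"{part}/{name}"
            for part in PARTS
            for name in FILES
            if not (sample / part / name).is_file()
        ]
        if missing:
            problems[sample.name] = missing
    return problems
```

For example, `find_incomplete_parts("facevideo_dataset/face_parts")` returns an empty dict when every sample folder is complete.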

📖 Citation

If you find this work useful, please cite us:

@inproceedings{song2025lavieid,
  title     = {LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation},
  author    = {Wenhui Song and Hanhui Li and Jiehui Huang and Panwen Hu and Yuhao Cheng and Long Chen and Yiqiang Yan and Xiaodan Liang},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
  year      = {2025},
  publisher = {ACM},
  address   = {Dublin, Ireland},
  doi       = {10.1145/3746027.3754943},
  isbn      = {979-8-4007-2035-2}
}