LaVieID
August 1, 2025
LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
Official implementation of our paper accepted at ACM MM 2025
DOI: 10.1145/3746027.3754943
Wenhui Song*, Hanhui Li*, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang
Overview
LaVieID introduces a novel local autoregressive video diffusion framework for identity-preserving text-to-video generation.
Key modules:
- A local router that explicitly encodes facial latent states via weighted fine-grained regions, mitigating interference and boosting identity retention.
- A temporal autoregressive module (TAM) that enhances inter-frame consistency by modeling long-term temporal dependencies at the latent level.
LaVieID generates high-fidelity, personalized video clips and achieves state-of-the-art identity preservation performance.
Setup
Tested Environment
- Python: 3.12.4
- CUDA: 12.8
- PyTorch: 2.1.0+
Note: Python 3.12 is relatively new. Some libraries (e.g., xformers, diffusers) may require their latest versions. If you run into compatibility issues, we recommend using Python 3.10.
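To compare a local machine against the tested environment, a small script along these lines may help. It is only a sketch: the function name `report_environment` is hypothetical and not part of this repository, and it assumes PyTorch may or may not be installed yet.

```python
import importlib
import platform


def report_environment():
    """Collect version info for the components named in the tested environment."""
    info = {"python": platform.python_version()}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment yet.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    # Whether a CUDA device is actually usable at runtime.
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for name, value in report_environment().items():
        print(f"{name}: {value}")
```

If the reported Python version is 3.12 and a dependency fails to install, falling back to Python 3.10 as recommended above is the simplest workaround.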
1. Clone Repository
git clone https://github.com/ssugarwh/LaVieID.git
cd LaVieID
2. Create Virtual Environment
python3.12 -m venv lavieid
source lavieid/bin/activate # for Linux/macOS
# or lavieid\Scripts\activate # for Windows
3. Install Dependencies
pip install -r requirements.txt
Pretrained Models
You can download our pretrained checkpoint from ModelScope:
SDK
# Install the ModelScope SDK
pip install modelscope
# Download the checkpoint with the SDK (Python)
from modelscope import snapshot_download
model_dir = snapshot_download('sugarwh/lavieid')
Git
git clone https://www.modelscope.cn/sugarwh/lavieid.git
After downloading, replace the model path in the bash scripts used below with the location of the downloaded checkpoint.
Inference
bash infer_facevideo_router_v2.sh
Training
You can train the model using:
bash train_router_v2.sh
Make sure you have a training dataset prepared and specified in the config.
Dataset Structure
The dataset is organized under facevideo_dataset/ with the following key components:
facevideo_dataset/
├── captions/              # Text prompts or captions for videos
├── face_parts/            # Part-wise facial region images and masks
├── face_images/           # Images extracted from the videos
├── refine_bbox_jsons/     # Refined bounding boxes (JSON)
├── total_train_data.txt   # List of all training samples
└── videos/                # Original or processed video clips
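Before launching training, it can be useful to confirm that the top-level entries listed above exist. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_entries` is hypothetical, and the expected names are copied from the layout above.

```python
from pathlib import Path

# Entries copied from the dataset layout described in this README.
EXPECTED_DIRS = ("captions", "face_parts", "face_images", "refine_bbox_jsons", "videos")
EXPECTED_FILES = ("total_train_data.txt",)


def missing_dataset_entries(root):
    """Return the expected top-level directories and files absent under root."""
    root = Path(root)
    missing = [f"{name}/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


if __name__ == "__main__":
    gaps = missing_dataset_entries("facevideo_dataset")
    print("Dataset layout looks complete." if not gaps else f"Missing: {gaps}")
```

An empty list means every listed entry was found; it does not check the contents of the folders.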
Each folder in face_parts/ corresponds to a segmented video and contains six subfolders representing facial components:
face_parts/<people_name>/
├── eyebrows/  ─┬─ image.png, mask.png
├── eyes/      ─┤
├── face/      ─┤
├── hair/      ─┤
├── mouth/     ─┤
└── nose/      ─┘
Each subfolder includes:
- image.png: Cropped image of the part
- mask.png: Binary mask of the same region
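The per-part layout described above can be checked with a short script such as the one below. This is an illustrative sketch under the assumption that every sample folder should hold all six parts with both files; the function name `find_incomplete_parts` is hypothetical and not part of this repository.

```python
from pathlib import Path

# The six facial components and the two files each should contain,
# as described in this README.
FACE_PARTS = ("eyebrows", "eyes", "face", "hair", "mouth", "nose")
PART_FILES = ("image.png", "mask.png")


def find_incomplete_parts(face_parts_dir):
    """Map each sample folder name to the list of part files it is missing."""
    problems = {}
    for sample_dir in sorted(Path(face_parts_dir).iterdir()):
        if not sample_dir.is_dir():
            continue  # skip stray files at the top level
        missing = [
            f"{part}/{name}"
            for part in FACE_PARTS
            for name in PART_FILES
            if not (sample_dir / part / name).is_file()
        ]
        if missing:
            problems[sample_dir.name] = missing
    return problems


if __name__ == "__main__":
    for sample, missing in find_incomplete_parts("facevideo_dataset/face_parts").items():
        print(f"{sample}: missing {', '.join(missing)}")
```

Samples that do not appear in the result have all twelve expected files; the script only checks for file presence, not image size or mask validity.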
Citation
If you find this work useful, please cite us:
@inproceedings{song2025lavieid,
title = {LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation},
author = {Wenhui Song and Hanhui Li and Jiehui Huang and Panwen Hu and Yuhao Cheng and Long Chen and Yiqiang Yan and Xiaodan Liang},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
year = {2025},
publisher = {ACM},
address = {Dublin, Ireland},
doi = {10.1145/3746027.3754943},
isbn = {979-8-4007-2035-2/2025/10}
}