January 7, 2025

Blending Custom Photos with Video Diffusion Transformers

This repository is the official implementation of Ingredients, a powerful way to customize video creations by incorporating multiple specific identity (ID) photos, with advanced video diffusion Transformers. This is a research project, and it is recommended to try advanced products:

📷 1. Gallery

⚙️ 2. Environments

We recommend setting up the environment as follows.

conda create -n ingredients python=3.11.0
conda activate ingredients
pip install -r requirements.txt
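
After activating the environment, a quick sanity check can confirm that the interpreter matches the recommended version. The snippet below is only a sketch: the helper name `check_python` is hypothetical and not part of this repository, and it compares the major and minor version only.

```python
import sys


def check_python(version_info=sys.version_info, expected=(3, 11)):
    """Return True when the interpreter's major.minor matches the recommended version."""
    return tuple(version_info[:2]) == expected


if __name__ == "__main__":
    status = "OK" if check_python() else "differs from the recommended 3.11"
    print(f"Python {sys.version.split()[0]}: {status}")
```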

The model weights are available at 🤗HuggingFace.

🗝️ 3. Inference

We provide an inference script for simple testing. Run it as in the following example:

python infer.py \
    --prompt "Two men in half bodies, are seated in a dimly lit room, possibly an office or meeting room, with a formal atmosphere." \
    --model_path "/path/to/model" \
    --seed 2025 \
    --img_file_path 'asserts/0.jpg' 'asserts/1.jpg'
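
To try several seeds with the same prompt and reference photos, the command above can be assembled programmatically. This is a minimal sketch, not part of the repository: `build_infer_command` and `DRY_RUN` are hypothetical names, the script name and flags are taken from the command shown above, and the model path is a placeholder.

```python
import shlex
import subprocess
import sys

DRY_RUN = True  # set to False to actually launch the inference script


def build_infer_command(prompt, model_path, image_paths, seed=2025, script="infer.py"):
    """Assemble the argv list for one run, using only the flags shown in the README."""
    return [
        sys.executable, script,
        "--prompt", prompt,
        "--model_path", model_path,
        "--seed", str(seed),
        "--img_file_path", *image_paths,
    ]


if __name__ == "__main__":
    for seed in (2025, 2026, 2027):
        cmd = build_infer_command(
            prompt="Two men in half bodies, are seated in a dimly lit room.",
            model_path="/path/to/model",  # placeholder path
            image_paths=["asserts/0.jpg", "asserts/1.jpg"],
            seed=seed,
        )
        print(shlex.join(cmd))  # show the exact command line
        if not DRY_RUN:
            subprocess.run(cmd, check=True)  # raises if the script exits with an error
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain commas or quotes.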

We also include the evaluation metrics code in the metric folder, together with evaluation data, for comparing results on multi-ID customization tasks.

Like ConsisID, Ingredients is sensitive to prompt quality. We suggest following the prompt format described in the link.

Gradio Web UI

We highly recommend trying our web demo, which incorporates all features currently supported by Ingredients. Launch it with the following command:

python app.py

⏰ 4. Training

Coming soon, including multi-stage training scripts and multi-ID text-video datasets.

You can prepare the video-text pair data in the expected format, and our experiments can be reproduced by simply running the training scripts:

# For stage 1
bash train_face.sh
# For stage 2
bash train_router.sh
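
Since stage 2 follows stage 1, it can help to chain the two scripts so that the second never starts if the first fails. The following is a sketch under assumptions: `run_stages` and its `runner` parameter are hypothetical helpers, and the scripts are assumed to sit in the current working directory and take no arguments, as in the commands above.

```python
import subprocess

STAGES = ["train_face.sh", "train_router.sh"]  # stage 1, then stage 2


def run_stages(scripts=STAGES, runner=subprocess.run):
    """Run each stage script with bash, in order, stopping at the first failure."""
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit status, so later stages are skipped after a failure.
        runner(["bash", script], check=True)
        completed.append(script)
    return completed
```

Calling `run_stages()` from the repository root would run both stages back to back.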

🚀 5. Cite

If you find this work useful for your research and applications, please cite us using this BibTeX:

@article{fei2025ingredients,
    title={Ingredients: Blending Custom Photos with Video Diffusion Transformers},
    author={Fei, Zhengcong and Li, Debang and Qiu, Di and Yu, Changqian and Fan, Mingyuan},
    journal={arXiv preprint arXiv:2501.01790},
    year={2025}
}

For any questions, please feel free to open an issue.

Acknowledgement

This project wouldn't be possible without the following open-source repositories: CogVideoX, ConsisID, Uniportrait, and Hunyuan Video.