January 7, 2025
Blending Custom Photos with Video Diffusion Transformers
📷 1. Gallery
⚙️ 2. Environments
We recommend setting up the environment as follows.
conda create -n ingredients python=3.11.0
conda activate ingredients
pip install -r requirements.txt
The model weights are available on 🤗HuggingFace.
🗝️ 3. Inference
We provide the inference script infer.py for simple testing. Run the following command as an example:
python infer.py \
--prompt "Two men in half bodies, are seated in a dimly lit room, possibly an office or meeting room, with a formal atmosphere." \
--model_path "/path/to/model" \
--seed 2025 \
--img_file_path 'asserts/0.jpg' 'asserts/1.jpg'
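To compare several seeds for the same prompt and reference photos, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper names build_infer_command and run_seeds are hypothetical, and it assumes only the flags shown in the command above (--prompt, --model_path, --seed, --img_file_path) and that the script is named infer.py.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_infer_command(
    prompt: str,
    model_path: str,
    image_paths: Sequence[str],
    seed: int = 2025,
    script: str = "infer.py",  # assumption: script name taken from the command above
) -> List[str]:
    """Assemble the argument list using only the flags shown in the README command."""
    if not image_paths:
        raise ValueError("at least one reference image is required")
    return [
        "python", script,
        "--prompt", prompt,
        "--model_path", model_path,
        "--seed", str(seed),
        "--img_file_path", *image_paths,
    ]


def run_seeds(
    prompt: str,
    model_path: str,
    image_paths: Sequence[str],
    seeds: Sequence[int],
    dry_run: bool = True,
) -> List[List[str]]:
    """Print (and optionally execute) one inference command per seed."""
    commands = []
    for seed in seeds:
        cmd = build_infer_command(prompt, model_path, image_paths, seed)
        print(shlex.join(cmd))  # shell-quoted form, handy for copy and paste
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing run
        commands.append(cmd)
    return commands
```

Calling run_seeds(prompt, "/path/to/model", ["asserts/0.jpg", "asserts/1.jpg"], seeds=[2025, 2026]) prints the two commands without running them; pass dry_run=False to execute them.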
We also include the evaluation metrics code in the metric folder, along with evaluation data, for comparing results on multi-ID customization tasks.
Like ConsisID, Ingredients is sensitive to prompt quality. We suggest following the prompt format described in the link.
Gradio Web UI
We highly recommend trying our web demo, which incorporates all features currently supported by Ingredients. Launch it with the following command:
python app.py
⏰ 4. Training
Coming soon, including multi-stage training scripts and multi-ID text-video datasets.
You can prepare the video-text pair data in the required format, and our experiments can be reproduced by simply running the training scripts as follows:
# For stage 1
bash train_face.sh
# For stage 2
bash train_router.sh
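The two stages can also be chained so that stage 2 starts only if stage 1 finishes successfully. This is a minimal sketch under the assumption that the stages are meant to run in the order listed and that both scripts sit in the current directory; run_stages is a hypothetical helper, not part of the repository.

```python
import subprocess
from typing import Callable, List, Sequence

# Script names as given above: stage 1 first, then stage 2.
STAGES = ("train_face.sh", "train_router.sh")


def run_stages(
    scripts: Sequence[str] = STAGES,
    runner: Callable = subprocess.run,
) -> List[str]:
    """Run each training script in order and return the ones that completed."""
    completed = []
    for script in scripts:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed stage 1 prevents stage 2 from starting.
        runner(["bash", script], check=True)
        completed.append(script)
    return completed
```

Calling run_stages() launches both scripts in sequence; the runner parameter exists only so the ordering logic can be exercised without starting a real training job.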
🚀 5. Cite
If you find this work useful for your research and applications, please cite us using this BibTeX:
@article{fei2025ingredients,
title={Ingredients: Blending Custom Photos with Video Diffusion Transformers},
author={Fei, Zhengcong and Li, Debang and Qiu, Di and Yu, Changqian and Fan, Mingyuan},
journal={arXiv preprint arXiv:2501.01790},
year={2025}
}
For any questions, please feel free to open an issue.
Acknowledgement
This project wouldn't be possible without the following open-source repositories: CogVideoX, ConsisID, Uniportrait, and Hunyuan Video.