WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
April 25, 2025 · View on GitHub
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
This repo is the official implementation of "WeGen: A Unified Model for Interactive Multimodal Generation as We Chat", by Zhipeng Huang, Shaobin Zhuang, Canmiao Fu, Binxin Yang, Ying Zhang, Chong Sun, Zhizheng Zhang, Yali Wang, Chen Li, Zheng-Jun Zha
WeGen is a unified framework that integrates multimodal understanding and generation, enabling users to achieve various visual generation goals through natural conversation. It excels at generating diverse results with high creativity for less detailed instructions and can progressively refine prior generation results while maintaining consistency with user references.
Key Features
- Unified Framework: Seamlessly integrates diverse capabilities including text-to-image generation, subject-driven generation, condition-driven generation, image restoration, and style transfer
- Dynamic Instance Identity Consistency (DIIC): Maintains instance identity consistency while allowing natural variations in generated contents
Demo
coming soon.
Installation
- Clone the repository:
git clone https://github.com/hzphzp/WeGen.git
cd WeGen/
-
Prepare the base enviroment, we use ubuntu20, python3.8, with H20 or 910B GPUs
-
Install required packages:
bash env.sh
- Download the pre-trained models from here and construct the pretrained model folder like:
WeGen
└── wegen_mllm_ckpt
├── pretrained
│ ├── CLIPScore_eval
│ ├── EVA-CLIP
│ ├── SEED-X
│ ├── meta-llama
│ │ └── Llama-2-7b-chat-hf
│ └── stable-diffusion-xl-base-1.0
├── pytorch_model.bin
├── stage1_final
│ └── unet
└── stage2_final
└── checkpoint-30000
Data preparation
DIIC dataset coming soon.
Training
run the following command to train the model on 128 H20/910B GPUs Node:
# stage1
bash scripts/wegen_mllm_stage1.sh
# stage2
bash scripts/wegen_mllm_stage2.sh
# stage3
bash scripts/wegen_mllm_stage3.sh
Evaluation
run the following command to evaluate the model on 8 H20/910B GPUs Node:
bash scripts/inference.sh
Citing
If you find this code and work useful, please consider citing the following paper and star this repo. Thank you very much!
@article{huang2025wegen,
title={WeGen: A Unified Model for Interactive Multimodal Generation as We Chat},
author={Huang, Zhipeng and Zhuang, Shaobin and Fu, Canmiao and Yang, Binxin and Zhang, Ying and Sun, Chong and Zhang, Zhizheng and Wang, Yali and Li, Chen and Zha, Zheng-Jun},
journal={arXiv preprint arXiv:2503.01115},
year={2025}
}