OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

June 22, 2025

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy

Introduction

OpenUni is an open-source implementation of MetaQuery for unified multimodal understanding and generation; this repository is under construction. With a minimalist choice of architecture, we demonstrate that OpenUni can 1) generate high-quality, instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. We currently provide three model variants: OpenUni-B-512, OpenUni-L-512, and OpenUni-L-1024, with checkpoints from both pre-training and fine-tuning.

🔥 Model Zoo

| Model Name | Image Size | MLLM Model | Diffusion Model | Pre-trained | Fine-tuned |
| --- | --- | --- | --- | --- | --- |
| OpenUni-B-512 | 512×512 | InternVL3-1B | SANA-0.6B-512px | Link | Link |
| OpenUni-L-512 | 512×512 | InternVL3-2B | SANA-1.6B-512px | Link | Link |
| OpenUni-L-1024 | 1024×1024 | InternVL3-2B | SANA1.5-1.6B-1024px | Link | Link |

Environment

The project depends on the following packages; a minimal install sketch follows the list.

  • mmengine
  • xtuner
  • transformers
  • torch
  • flash_attn
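
A minimal install sketch, assuming pip in a CUDA-enabled environment (flash_attn compiles against the already-installed torch, hence the separate step with --no-build-isolation):

```bash
pip install torch                             # install PyTorch first
pip install mmengine xtuner transformers      # training and model libraries
pip install flash_attn --no-build-isolation   # builds against the installed torch/CUDA
```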

Text-to-Image

Please download our released model weights from 🤗 wusize/openuni. We recommend downloading the checkpoints with the following command:

```bash
# pip install -U "huggingface_hub[cli]"
huggingface-cli download wusize/openuni --local-dir checkpoints --repo-type model
```

The downloaded checkpoints are organized as follows:

```
OpenUni/
├── checkpoints
│   ├── openuni_b_internvl3_1b_sana_0_6b_512_hf_blip3o60k.pth
│   ├── openuni_b_internvl3_1b_sana_0_6b_512_hf_text2image23m.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_1024_hf_blip3o60k.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_1024_hf_text2image23m.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_512_hf_blip3o60k.pth
│   └── openuni_l_internvl3_2b_sana_1_6b_512_hf_text2image23m.pth
```
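
If you only need a single model variant, the CLI also accepts explicit filenames (shown here for the OpenUni-B-512 weights; substitute any filename from the listing above):

```bash
huggingface-cli download wusize/openuni \
    openuni_b_internvl3_1b_sana_0_6b_512_hf_blip3o60k.pth \
    --local-dir checkpoints --repo-type model
```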

Inference

Please refer to docs/INFERENCE.md.

Evaluation

Please refer to docs/EVALUATION.md.

Train

Please refer to docs/DATASETS.md and docs/datasets to prepare the datasets. Once the datasets are ready, follow the instructions in docs/TRAIN.md to launch the training scripts; a sketch of a typical launch command is shown below.
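
As a rough sketch of what a launch looks like (OpenUni builds on xtuner, so its standard launcher applies; the config path here is hypothetical, and docs/TRAIN.md remains the authoritative reference):

```bash
# Hypothetical config name for illustration; see docs/TRAIN.md for the real configs.
# Single node with 8 GPUs, using xtuner's distributed launcher and DeepSpeed ZeRO-2.
NPROC_PER_NODE=8 xtuner train configs/openuni_b_internvl3_1b_sana_0_6b_512.py \
    --deepspeed deepspeed_zero2
```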

📚 Citation

If you find OpenUni useful for your research or applications, please cite our paper using the following BibTeX:

@article{wu2025openuni,
      title={OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation}, 
      author={Size Wu and Zhonghua Wu and Zerui Gong and Qingyi Tao and Sheng Jin and Qinyue Li and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2505.23661},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23661}, 
}

📜 License

This project is licensed under the NTU S-Lab License 1.0.

๐Ÿ™ Acknowledgement

The project builds upon the following pioneering works:

  • SANA: We use SANA as our diffusion module for its efficiency and strong performance.
  • InternVL3: We use the latest InternVL3 as our base multimodal LLM.
  • MetaQuery: OpenUni is inspired by MetaQuery and serves as an open-source implementation of that work.
  • BLIP3-o: We thank the BLIP3-o team for releasing their high-quality tuning dataset.