OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

June 22, 2025

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy

Introduction

OpenUni is an open-source implementation of MetaQuery for unified multimodal understanding and generation; this repository is under construction. With a minimalist choice of architecture, we demonstrate that OpenUni can 1) generate high-quality, instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. We currently provide three model variants: OpenUni-B-512, OpenUni-L-512, and OpenUni-L-1024, with checkpoints from both pre-training and fine-tuning.

🔥 Model Zoo

| Model Name | Image Size | MLLM Model | Diffusion Model | Pre-trained | Fine-tuned |
| --- | --- | --- | --- | --- | --- |
| OpenUni-B-512 | 512×512 | InternVL3-1B | SANA-0.6B-512px | Link | Link |
| OpenUni-L-512 | 512×512 | InternVL3-2B | SANA-1.6B-512px | Link | Link |
| OpenUni-L-1024 | 1024×1024 | InternVL3-2B | SANA1.5-1.6B-1024px | Link | Link |

Environment

The project depends on the following packages; a minimal install sketch follows the list.

  • mmengine
  • xtuner
  • transformers
  • torch
  • flash_attn
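
A minimal install sketch, assuming pip in a CUDA-enabled environment (flash_attn compiles against the already-installed torch, hence the separate step with --no-build-isolation):

```bash
pip install torch                             # install PyTorch first
pip install mmengine xtuner transformers      # training and model libraries
pip install flash_attn --no-build-isolation   # builds against the installed torch/CUDA
```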

Text-to-Image

Please download our released model weights from 🤗 wusize/openuni. We recommend downloading the checkpoints with the following command:

```bash
# pip install -U "huggingface_hub[cli]"
huggingface-cli download wusize/openuni --local-dir checkpoints --repo-type model
```

The downloaded checkpoints are organized as follows:

```
OpenUni/
├── checkpoints
│   ├── openuni_b_internvl3_1b_sana_0_6b_512_hf_blip3o60k.pth
│   ├── openuni_b_internvl3_1b_sana_0_6b_512_hf_text2image23m.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_1024_hf_blip3o60k.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_1024_hf_text2image23m.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_512_hf_blip3o60k.pth
│   └── openuni_l_internvl3_2b_sana_1_6b_512_hf_text2image23m.pth
```
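
If you only need a single model variant, the CLI also accepts explicit filenames (shown here for the OpenUni-B-512 weights; substitute any filename from the listing above):

```bash
huggingface-cli download wusize/openuni \
    openuni_b_internvl3_1b_sana_0_6b_512_hf_blip3o60k.pth \
    --local-dir checkpoints --repo-type model
```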

Inference

Please refer to docs/INFERENCE.md.

Evaluation

Please refer to docs/EVALUATION.md.

Train

Please refer to docs/DATASETS.md and docs/datasets to prepare the datasets. Once the datasets are ready, follow the instructions in docs/TRAIN.md to launch the training scripts; a sketch of a typical launch command is shown below.
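
As a rough sketch of what a launch looks like (OpenUni builds on xtuner, so its standard launcher applies; the config path here is hypothetical, and docs/TRAIN.md remains the authoritative reference):

```bash
# Hypothetical config name for illustration; see docs/TRAIN.md for the real configs.
# Single node with 8 GPUs, using xtuner's distributed launcher and DeepSpeed ZeRO-2.
NPROC_PER_NODE=8 xtuner train configs/openuni_b_internvl3_1b_sana_0_6b_512.py \
    --deepspeed deepspeed_zero2
```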

📚 Citation

If you find OpenUni useful for your research or applications, please cite our paper using the following BibTeX:

@article{wu2025openuni,
      title={OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation}, 
      author={Size Wu and Zhonghua Wu and Zerui Gong and Qingyi Tao and Sheng Jin and Qinyue Li and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2505.23661},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23661}, 
}

📜 License

This project is licensed under the NTU S-Lab License 1.0.

๐Ÿ™ Acknowledgement

The project builds upon the following pioneering works:

  • SANA: We use SANA as our diffusion module for its efficiency and strong performance.
  • InternVL3: We use the latest InternVL3 as our base multimodal LLM.
  • MetaQuery: OpenUni is inspired by MetaQuery and serves as an open-source implementation of that work.
  • BLIP3-o: We thank the BLIP3-o team for releasing their high-quality tuning dataset.