
May 30, 2025

UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Weijia Mao¹* Zhenheng Yang² Mike Zheng Shou¹

¹ Show Lab, National University of Singapore ² ByteDance

In this work, we introduce UniRL, a self-improving post-training approach for unified multimodal models. In each iteration, the model generates images from prompts and uses them as training data, without relying on any external image data. Moreover, the approach lets generation and understanding enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only a small number of additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus.
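For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage step that gives the method its name: several samples are drawn for the same prompt, and each sample's reward is normalized against the mean and spread of its group. This is a generic illustration, not the UniRL training code. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sample group.

    Each reward is shifted by the group mean and scaled by the group
    standard deviation. `eps` (an assumed constant) avoids division by
    zero when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that score above their group's mean receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.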

TODO

Release the inference code of T2I.
Release the inference code of MMU.
Release the training code of SFT.
Release the training code of GRPO.

Hugging Face models

The UniRL checkpoints of GRPO can be found on Hugging Face:

The original Show-o checkpoints can be found on Hugging Face:

Getting Started

First, set up the environment:

pip3 install -r requirements.txt

wandb login <your wandb keys>

Then set up the GenEval environment by following the instructions at https://github.com/djghosh13/geneval.

Update the config in configs/showo_gen_eval_cycle_512.yaml. Change the checkpoint path and output path.

Run the GenEval benchmark for text-to-image generation. The results can be viewed on wandb.

sh run_eval.sh

Training pipeline

Prepare your training data and change the data path in configs/xx.yaml.

Note that our training process is based on accelerate. Please make sure accelerate is configured for distributed training. We provide config examples below for (distributed) training on a single GPU or multiple GPUs.

├── accelerate_configs/
│   ├── multi_nodes (6x8 GPUs)
│   │   ├── ...
│   ├── 1_gpu.yaml
│   └── 8_gpu_deepspeed_zero2.yaml
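
As a rough illustration of how these configs might be selected, the sketch below builds an `accelerate launch` command for the single-GPU or 8-GPU config listed above. The `--config_file` flag belongs to the accelerate CLI. The helper name and the `train.py` script path are hypothetical placeholders, since the training scripts are not named in this README. The command is only printed, not executed.

```python
from pathlib import Path

# Directory from the tree above; the file names are the two single-node configs.
CONFIG_DIR = Path("accelerate_configs")


def build_launch_command(num_gpus, script):
    """Return an `accelerate launch` command (as a list) for 1 or 8 GPUs.

    `script` is a placeholder for a training entry point; this README
    does not name one, so any value passed here is an assumption.
    """
    if num_gpus == 1:
        config = CONFIG_DIR / "1_gpu.yaml"
    elif num_gpus == 8:
        config = CONFIG_DIR / "8_gpu_deepspeed_zero2.yaml"
    else:
        raise ValueError("this sketch only covers the 1-GPU and 8-GPU configs")
    return ["accelerate", "launch", "--config_file", config.as_posix(), script]


# "train.py" is a hypothetical script name used only for illustration.
print(" ".join(build_launch_command(8, "train.py")))
```

The multi-node configs under `multi_nodes` are left out of the sketch because their file names are not listed here.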