May 30, 2025
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Weijia Mao¹*, Zhenheng Yang², Mike Zheng Shou¹
¹ Show Lab, National University of Singapore; ² Bytedance
In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only several additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus.
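To make the GRPO step mentioned above more concrete, here is a minimal sketch of the group-relative advantage that gives GRPO its name: several samples are drawn for the same prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions for illustration. This is not taken from the UniRL training code, and the reward used in UniRL may be computed differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that score above the group average get a positive advantage and samples
    below it get a negative one.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images generated from a single prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; a group whose samples all receive the same reward yields zero advantages and contributes no gradient signal.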
TODO
- Release the inference code of T2I.
- Release the inference code of MMU.
- Release the training code of SFT.
- Release the training code of GRPO.
Hugging Face models
The UniRL checkpoints of GRPO can be found on Hugging Face:
The original Show-o checkpoints can be found on Hugging Face:
Getting Started
First, set up the environment:
pip3 install -r requirements.txt
Log in to your wandb account on your machine or server.
wandb login <your wandb keys>
Then set up the GenEval environment; see https://github.com/djghosh13/geneval for instructions.
Update the config in configs/showo_gen_eval_cycle_512.yaml. Change the checkpoint path and output path.
Run the GenEval benchmark for text-to-image generation; the results can be viewed on wandb.
sh run_eval.sh
Training pipeline
Prepare your training data and change the data path in configs/xx.yaml.
Note that our training process is based on accelerate. Please make sure to configure accelerate for distributed training. We provide example configs below for training on a single GPU or on multiple GPUs.
├── accelerate_configs/
|   ├── multi_nodes (6x8 GPUs)
|   |   ├── ...
|   ├── 1_gpu.yaml
|   └── 8_gpu_deepspeed_zero2.yaml
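As a rough illustration of how one of the configs listed above might be passed to accelerate, the sketch below builds an `accelerate launch --config_file ...` command for a given GPU count. The helper name `build_launch_command` and the script name `train.py` are hypothetical placeholders, since this README does not name the training entry point; the multi-node configs are left out because their file names are not listed.

```python
import shlex

# Single-node config files named in the tree above.
ACCELERATE_CONFIGS = {
    1: "accelerate_configs/1_gpu.yaml",
    8: "accelerate_configs/8_gpu_deepspeed_zero2.yaml",
}


def build_launch_command(num_gpus, script, script_args=()):
    """Return the argv list for launching `script` with the matching config."""
    if num_gpus not in ACCELERATE_CONFIGS:
        raise ValueError(
            f"no example config for {num_gpus} GPU(s); "
            f"available: {sorted(ACCELERATE_CONFIGS)}"
        )
    return [
        "accelerate", "launch",
        "--config_file", ACCELERATE_CONFIGS[num_gpus],
        script,
        *script_args,
    ]


# "train.py" is a hypothetical placeholder for the actual training script.
cmd = build_launch_command(8, "train.py")
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be passed directly to `subprocess.run(cmd, check=True)`.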