UAE: Incentivizing Mutual Benefits for Unified Multimodal Understanding and Generation via RL
September 12, 2025
Paper | Model | UniBench
Official code of UAE and the UniBench benchmark for our paper "Can Understanding and Generation Truly Benefit Together - or Just Coexist?".
UAE is a unified multimodal framework for image generation and understanding.

Key contributions of our work:
- UAE: an Auto-Encoder-based unification that treats understanding as the encoder (I2T) and generation as the decoder (T2I), using reconstruction similarity as an explicit objective to quantify cross-modal coherence and operationalize unification.
- Unified-GRPO: to our knowledge, the first RL scheme that jointly improves both modules via two complementary steps: Generation for Understanding (train the encoder to caption for higher reconstruction quality) and Understanding for Generation (refine the decoder to reconstruct from those captions). Together they form a positive feedback loop toward unification.
- Aha Moment in Multimodal: We report an emergent "aha moment" in multimodal learning. As RL progresses, the encoder autonomously emits longer, more descriptive captions while the decoder simultaneously achieves strikingly faithful reconstructions. This co-evolution offers compelling empirical evidence for unified multimodal intelligence.
- Unified-Bench: to the best of our knowledge, the first benchmark explicitly designed to measure the degree of unification in UMMs, rather than individually evaluating the generation or understanding capabilities.
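The auto-encoder view above can be illustrated with a minimal sketch. This is not the released training code: `understand`, `generate`, and `embed` are hypothetical callables standing in for the I2T encoder, the T2I decoder, and an image feature extractor, and the advantage normalization shown is one common form of the group-relative scheme used in GRPO-style methods.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reconstruction_reward(image, understand, generate, embed):
    """Score one I2T -> T2I round trip by similarity to the input image.

    understand, generate and embed are hypothetical placeholders,
    not functions from this repository.
    """
    caption = understand(image)          # encoder: image -> text
    reconstruction = generate(caption)   # decoder: text -> image
    return cosine_similarity(embed(image), embed(reconstruction))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (zero mean, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch a caption that lets the decoder reproduce the image more faithfully earns a higher reward, and samples are ranked only relative to the other samples in their group.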
TODO List
- Release unified-grpo training code (RL).
- Release the training data of long-context-700K for SFT and the training data for unified-grpo.
- Release training code for SFT (text-to-image generation).
- [x] Release all models' checkpoints.
- [x] Release inference code for both image understanding and generation.
Quick Start Guide
Installation
conda create -n UAE python==3.12
conda activate UAE
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Model Setup
- Download the required model checkpoints:
  - Stable Diffusion 3.5 Large model
  - UAE fine-tuned weights
  - Vision-language model checkpoints
- Update the model paths in demo.py:
model_cfg = {
"SD3": "/path/to/stable-diffusion-3.5-large",
"dit": "/path/to/dit/checkpoint",
"dit_lora": "/path/to/dit/lora",
"llm": "/path/to/llm/model",
"llm_lora": "/path/to/llm/lora",
"llm_processor": "/path/to/llm/processor"
}
Here, the items are defined as follows:
- "SD3": Path to the official weights of Stable Diffusion 3.5 Large.
- "dit": Our pre-trained weights of DiT.
- "dit_lora": Our pre-trained LoRA for DiT, obtained in Stage-3 of unified-GRPO.
- "llm": Our pre-trained weights of Qwen-2.5-VL-3B.
- "llm_lora": Our pre-trained LoRA for Qwen-2.5-VL-3B, obtained in Stage-2 of unified-GRPO.
- "llm_processor": The official configuration of Qwen-2.5-VL-3B, located at ./Checkpoints/llm_processor.
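Before launching inference, it can help to confirm that every entry in `model_cfg` points at something that exists on disk. The helper below is a small sketch for that purpose; `find_config_problems` and `REQUIRED_KEYS` are illustrative names that are not part of demo.py, and the check only tests path existence, not whether the checkpoints are valid.

```python
from pathlib import Path

# Keys taken from the model_cfg dictionary shown above.
REQUIRED_KEYS = ("SD3", "dit", "dit_lora", "llm", "llm_lora", "llm_processor")


def find_config_problems(model_cfg):
    """Return a list of problems; an empty list means all paths exist."""
    problems = []
    for key in REQUIRED_KEYS:
        path = model_cfg.get(key)
        if not path:
            problems.append(f"{key}: missing from model_cfg")
        elif not Path(path).exists():
            problems.append(f"{key}: path does not exist: {path}")
    return problems
```

Calling `find_config_problems(model_cfg)` right after editing the paths gives a readable list of anything still left at its placeholder value.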
Core Functionality: demo.py
The demo.py script is the heart of our inference pipeline, supporting two main modes:
1. Text-to-Image (Generation)
Generate images directly from text descriptions:
python demo.py \
--input_text "A serene mountain landscape with snow-capped peaks reflecting in a crystal clear lake, surrounded by pine forests under a golden sunset sky" \
--output_path ./output/generated_image.png
2. Image-to-Text (Understanding/Captioning)
Generate detailed descriptions of images:
python demo.py \
--input_img /path/to/input/image.jpg \
--prompt_only
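When calling demo.py from another Python script, the two modes above can be wrapped in a small command builder. This is a sketch under the assumption that demo.py is run from the repository root; it uses only the flags shown above (`--input_text`, `--output_path`, `--input_img`, `--prompt_only`), and `build_demo_command` is a hypothetical helper name.

```python
import sys


def build_demo_command(input_text=None, output_path=None, input_img=None):
    """Build the argument list for demo.py in one of its two modes."""
    # Exactly one of the two inputs selects the mode.
    if (input_text is None) == (input_img is None):
        raise ValueError("pass exactly one of input_text or input_img")
    cmd = [sys.executable, "demo.py"]
    if input_text is not None:
        # Text-to-image (generation)
        cmd += ["--input_text", input_text]
        if output_path is not None:
            cmd += ["--output_path", output_path]
    else:
        # Image-to-text (understanding/captioning)
        cmd += ["--input_img", input_img, "--prompt_only"]
    return cmd


# Example (not executed here):
#   import subprocess
#   subprocess.run(build_demo_command(input_img="/path/to/input/image.jpg"), check=True)
```

Passing the command as a list avoids shell quoting problems with long prompts that contain commas or quotes.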
Evaluation Framework
The evaluation suite in the Unified-Bench/ directory provides several image-to-image similarity metrics for scoring generated images against reference images.
Supported Metrics
- CLIP: Semantic similarity using CLIP vision encoder
- DINO v2: Self-supervised visual representation similarity
- DINO v3: Enhanced DINO model for improved feature matching
- LongCLIP: Extended context CLIP for better long-range dependencies
Running Evaluation
1. Single Model Evaluation
cd eval
python CLIP.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python DINO_v2.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python DINO_v3.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python LongCLIP.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
2. Comprehensive Multi-Model Evaluation
Use the unified evaluation script for complete assessment:
cd eval
python Score_i2i.py \
--image_path ./Unified-Bench/UniBench/example_image \
--ref_path ./Unified-Bench/UniBench/Image \
--output_file ./Unified-Bench/results/example.json \
--models clip dinov2 dinov3 longclip
Unified-Bench Evaluation
The Unified-Bench/UniBench/ directory contains our evaluation benchmark:
UniBench/
├── Image/              # Reference images (100 samples)
│   ├── 0.jpg
│   └── ...
└── example_image/      # Example generated images
    ├── 0.jpg
    └── ...
The data from the Image folder can be downloaded from the link.
Evaluation Results Format
The evaluation generates comprehensive statistics:
{
"clip": {
"0.jpg": 0.8542,
"1.jpg": 0.7893,
"average": 0.8234,
"min": 0.7123,
"max": 0.9456
},
"dinov2": { ... },
"dinov3": { ... },
"longclip": { ... }
}
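A results file in this shape can be post-processed with a few lines of Python. The sketch below assumes, as in the example above, that each metric maps image filenames to scores and stores `average`, `min`, and `max` alongside them; `load_results` and `summarize_results` are illustrative helper names, not part of the evaluation scripts.

```python
import json

# Aggregate keys that sit next to the per-image scores in each metric block.
SUMMARY_KEYS = {"average", "min", "max"}


def load_results(path):
    """Read a results JSON file written by the evaluation script."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_results(results):
    """Recompute aggregate statistics from the per-image scores of each metric."""
    summary = {}
    for metric, entries in results.items():
        scores = [v for k, v in entries.items() if k not in SUMMARY_KEYS]
        if not scores:
            continue  # metric block has no per-image scores
        summary[metric] = {
            "count": len(scores),
            "average": sum(scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
        }
    return summary
```

Recomputing the statistics from the per-image entries is a simple way to cross-check the stored aggregates or to restrict the summary to a subset of images.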
Custom Evaluation
To evaluate your own generated images:
- Organize your images following the UniBench structure
- Ensure corresponding images have matching names
- Run the evaluation script with your paths
- Results will include per-image scores and aggregate statistics
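A quick pre-flight check for the matching-name requirement might look like the following. It assumes that "matching" means identical filenames including the extension, which may be stricter than what the evaluation scripts actually require; `compare_image_names` is a hypothetical helper.

```python
from pathlib import Path


def compare_image_names(generated_dir, reference_dir):
    """Compare filenames in two flat directories of images."""
    gen = {p.name for p in Path(generated_dir).iterdir() if p.is_file()}
    ref = {p.name for p in Path(reference_dir).iterdir() if p.is_file()}
    return {
        "matched": sorted(gen & ref),               # present in both folders
        "missing_generated": sorted(ref - gen),     # reference without a generated image
        "unexpected_generated": sorted(gen - ref),  # generated without a reference
    }
```

If `missing_generated` or `unexpected_generated` is non-empty, fixing the filenames before running the evaluation avoids silently scoring fewer pairs than intended.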
Contact & Feedback
For questions or feedback, please reach out:
- Email: yanzhiyuan1114@gmail.com
If this repository helped your research, please star this repo!