🎨 UAE: Incentivizing Mutual Benefits for Unified Multimodal Understanding and Generation via RL

September 12, 2025 Β· View on GitHub

πŸ“„ Paper | πŸ€— Model | πŸ“Š UniBench |

Official code of UAE and UniBench benchmark for our paper "Can Understanding and Generation Truly Benefit Together β€” or Just Coexist?".

UAE is a unified multimodal framework for image generation and understanding.

Example

🌟 Key contributions of our work:

βœ… UAE: an Auto-Encoder–based unification that treats understanding as the encoder (I2T) and generation as the decoder (T2I), using reconstruction similarity as an explicit objective to quantify cross-modal coherence and operationalize unification.

βœ… Unified-GRPO: to our knowledge, the first RL scheme that jointly improves both modules via two complementary stepsβ€”Generation for Understanding (train the encoder to caption for higher reconstruction quality) and Understanding for Generation (refine the decoder to reconstruct from those captions)β€”forming a positive feedback loop toward unification.

βœ… Aha Moment in Multimodal: We report an emergent "aha moment" in multimodal learning. As RL progresses, the encoder autonomously emits longer, more descriptive captions while the decoder simultaneously achieves strikingly faithful reconstructions. This co-evolution offers compelling empirical evidence for unified multimodal intelligence.

βœ… Unified-Bench: to the best of our knowledge, the first benchmark explicitly designed to measure the degree of unification in UMMs, rather than individually evaluating the generation or understanding capabilities.

πŸ“‹ TODO List

  • Release unified-grpo training code (RL).
  • Release the training data of long-context-700K for SFT and the training data for unified-grpo.
  • Release training code for SFT (text-to-image generation).
  • [β˜‘οΈ] Release all models' checkpoints.
  • [β˜‘οΈ] Release inference code for both image understanding and generation.

πŸš€ Quick Start Guide

Installation

conda create -n UAE python==3.12
conda activate UAE
pip install -r requirements.txt
pip install flash-attn --no-build-isolation 

Model Setup

  1. Download the required model checkpoints:

    • Stable Diffusion 3.5 Large model
    • UAE fine-tuned weights
    • Vision-language model checkpoints
  2. Update the model paths in demo.py:

model_cfg = {
    "SD3": "/path/to/stable-diffusion-3.5-large",
    "dit": "/path/to/dit/checkpoint",
    "dit_lora": "/path/to/dit/lora",
    "llm": "/path/to/llm/model",
    "llm_lora": "/path/to/llm/lora",
    "llm_processor": "/path/to/llm/processor"
}

Here, the items are defined as follows:

  • "SD3": Path to the official weights of Stable Diffusion 3-Large.
  • "dit": Our pre-trained weights of DiT.
  • "dit_lora": Our pre-trained LoRA for DiT, obtained in Stage-3 of unified-GRPO.
  • "llm": Our pre-trained weights of Qwen-2.5-VL-3B.
  • "llm_lora": Our pre-trained LoRA for Qwen-2.5-VL-3B, obtained in Stage-2 of unified-GRPO.
  • "llm_processor": The official configuration of Qwen-2.5-VL-3B, located at ./Checkpoints/llm_processor.

🎯 Core Functionality: demo.py

The demo.py script is the heart of our inference pipeline, supporting two main modes:

1. Text-to-Image (Generation)

Generate images directly from text descriptions:

python demo.py \
    --input_text "A serene mountain landscape with snow-capped peaks reflecting in a crystal clear lake, surrounded by pine forests under a golden sunset sky" \
    --output_path ./output/generated_image.png

2. Image-to-Text (Understanding/Captioning)

Generate detailed descriptions of images:

python demo.py \
    --input_img /path/to/input/image.jpg \
    --prompt_only

πŸ“Š Evaluation Framework

Our comprehensive evaluation suite in the Unified-Bench/ directory provides multiple similarity metrics for image-to-image generation assessment.

Supported Metrics

  • CLIP: Semantic similarity using CLIP vision encoder
  • DINO v2: Self-supervised visual representation similarity
  • DINO v3: Enhanced DINO model for improved feature matching
  • LongCLIP: Extended context CLIP for better long-range dependencies

Running Evaluation

1. Single Model Evaluation

cd eval
python CLIP.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python DINO_v2.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python DINO_v3.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python LongCLIP.py --image_path /path/to/generated/images --ref_path /path/to/reference/images

2. Comprehensive Multi-Model Evaluation

Use the unified evaluation script for complete assessment:

cd eval
python Score_i2i.py \
    --image_path ./Unified-Bench/UniBench/example_image \
    --ref_path ./Unified-Bench/UniBench/Image \
    --output_file ./Unified-Bench/results/example.json \
    --models clip dinov2 dinov3 longclip

Unified-Bench Evaluation

The Unified-Bench/UniBench/ directory contains our evaluation benchmark:

UniBench/
β”œβ”€β”€ Image/           # Reference images (100 samples)
β”‚   β”œβ”€β”€ 0.jpg
β”‚   └── ...
└── example_image/   # Example generated images
    β”œβ”€β”€ 0.jpg
    └── ...

The data from the Image folder can be downloaded from the link.

Evaluation Results Format

The evaluation generates comprehensive statistics:

{
  "clip": {
    "0.jpg": 0.8542,
    "1.jpg": 0.7893,
    "average": 0.8234,
    "min": 0.7123,
    "max": 0.9456
  },
  "dinov2": { ... },
  "dinov3": { ... },
  "longclip": { ... }
}

Custom Evaluation

To evaluate your own generated images:

  1. Organize your images following the UniBench structure
  2. Ensure corresponding images have matching names
  3. Run the evaluation script with your paths
  4. Results will include per-image scores and aggregate statistics

πŸ“¬ Contact & Feedback

For questions or feedback, please reach out:


⭐️ If this repository helped your research, please star 🌟 this repo πŸ‘!