
May 1, 2026

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models


Jiaqi Wang1,2*, Haoge Deng2*, Ting Pan2*, Yang Liu2, Chengyuan Wang2, Fan Zhang2, Yonggang Qi1†, Xinlong Wang2†

BUPT1, BAAI2
* Equal Contribution, † Corresponding Author



We propose UDM-GRPO, the first framework to integrate uniform discrete diffusion models (UDM) with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves the performance of the base model, URSA, across multiple T2I tasks.

🚀 News

✨ Highlights

  • 🥇 Novel Approach: Correcting the action and trajectory yields the first method to integrate UDM with GRPO.
  • 🥈 SOTA Performance: State-of-the-art results across multiple T2I benchmarks.
  • 🥉 High Efficiency: Reduced-Step and CFG-Free training strategies.

🤗 Model

| Task | Model |
| --- | --- |
| GenEval | 🤗 GenEval |
| PickScore | 🤗 PickScore |


🔧 Installation

1. Environment Setup

Clone this repository to local disk and install:

```shell
git clone https://github.com/Yovecent/UDM-GRPO.git
cd UDM-GRPO

conda create -n UDMGRPO python=3.10

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -e .
pip install torch==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
pip install psutil==7.0.0 flash-attn==2.7.4.post1 --no-build-isolation
```

2. Model Download

| Model | Resolution | Download |
| --- | --- | --- |
| URSA-1.7B-IBQ512 | 512x512 | 🤗 Hugging Face |

3. Reward Preparation

1. PickScore

You can let the training code download the PickScore model automatically, or pre-download it yourself.

2. GenEval

Install the dependencies and download the Mask2Former checkpoint:
```shell
# First
pip install openmim==0.3.9 open-clip-torch==2.31.0 numpy==1.26.0 opencv-python==4.11.0.86 clip-benchmark==1.6.1

# Then
mim install mmengine mmcv-full==1.7.2 --no-build-isolation
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection; git checkout 2.x
pip install setuptools==78.1.1
pip install -e . --no-build-isolation

# Then
mv ../raw_rl_data/object_names.txt .

wget https://download.openmmlab.com/mmdetection/v2.0/mask2former/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco_20220504_001756-743b7d99.pth \
-O ./mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.pth
```

Download the timm/vit_large_patch14_clip_224.openai 🤗 model, and change `model_path` in `diffnext.rewards.reward_image.GenEvalScorer` to your mmdetection path.

The mmdetection directory layout should be:

```
mmdetection/
│
├── configs/
│   └── mask2former/
│       └── mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py
│
├── mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.pth
│
├── vit_large_patch14_clip_224.openai/
│   ├── open_clip_config.json
│   ├── pytorch_model.bin
│   └── ...
│
└── object_names.txt
```

3. OCR

Install PaddleOCR and its dependencies, then instantiate the model:

```shell
pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein
```

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)
```

Change the OCR model path in `diffnext.rewards.reward_image.OCRScorer` to your local path.
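For intuition, an OCR reward of this kind typically compares the recognized text against the target text. The sketch below is hypothetical (the actual scorer lives in `diffnext.rewards.reward_image.OCRScorer`), with a small pure-Python edit distance standing in for python-Levenshtein so the example is self-contained:

```python
# Hypothetical sketch of an OCR text reward; not the repo's implementation.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ocr_reward(predicted, target):
    """Reward in [0, 1]: 1 for an exact match, decaying with edit distance."""
    if not target:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, target) / len(target))
```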

🥫 Data Preparation

GenEval

```shell
# First
cd raw_rl_data/geneval
python cache.py
```

Then change `train_dataloader.params.dataset` in `ursa_1.7b_ibq512.yaml` to point at the cached data.

Prepare the PickScore and OCR data in the same way.
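The dataset switch looks roughly like the fragment below; note that everything except the `train_dataloader.params.dataset` key path and the file name is an assumption about the config's structure:

```yaml
# configs/geneval_grpo/ursa_1.7b_ibq512.yaml (fragment; surrounding keys are assumptions)
train_dataloader:
  params:
    dataset: ./raw_rl_data/geneval   # point this at your cached data
```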

🤖 Training

1. Single-node training

```shell
cd diffnext

accelerate launch --config_file accelerate_configs/4_nodes_deepspeed.yaml \
--machine_rank 0 --num_machines 1 --num_processes 8 \
scripts/train.py \
config="configs/geneval_grpo/ursa_1.7b_ibq512.yaml" \
experiment.name="ursa_geneval" \
experiment.output_dir="./experiments/ursa_geneval"
```

Note: If you modify the batch size in the configuration, you must ensure that
`training.batch_size = num_prompts * num_images // num_gpus // num_batches`.
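The constraint above can be checked before launching a run. This is a hypothetical sanity check, not part of the repo; the variable names simply mirror the note:

```python
# Hypothetical sanity check: verify that the configured batch size matches
# the rollout sharding described in the note above.
def check_batch_size(batch_size, num_prompts, num_images, num_gpus, num_batches):
    expected = num_prompts * num_images // num_gpus // num_batches
    if batch_size != expected:
        raise ValueError(
            f"training.batch_size should be {expected}, got {batch_size}"
        )
    return expected

# e.g. 8 prompts x 16 images over 8 GPUs in 2 batches -> batch size 8
check_batch_size(8, num_prompts=8, num_images=16, num_gpus=8, num_batches=2)
```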

2. Multi-node training

```shell
# Master node
sh scripts/geneval_grpo/main.sh

# Other nodes
sh scripts/geneval_grpo/main1.sh
sh scripts/geneval_grpo/main2.sh
sh scripts/geneval_grpo/main3.sh
```

πŸ–‹οΈ Evaluations

GenEval

1. Sample prompt images

```shell
cd diffnext/evaluations/geneval

torchrun --nproc_per_node=8 sample.py \
--height 512 --width 512 \
--guidance_scale 1.0 --num_inference_steps 25 \
--ckpt /path/to/URSA-1.7B-IBQ512 \
--tdir /path/to/checkpoint-XXXX/transformer/diffusion_pytorch_model.bin \
--outdir ./output/URSA-1.7B-IBQ512 \
--distributed
```

2. Evaluation

Set the image folder:

```shell
IMAGE_FOLDER=./output/URSA-1.7B-IBQ512
```

Then please refer to the GenEval evaluation guide.

PickScore

1. Sample prompt images

```shell
cd diffnext/evaluations/pickscore

torchrun --nproc_per_node=8 sample.py \
--height 512 --width 512 \
--guidance_scale 1.0 --num_inference_steps 25 \
--ckpt /path/to/URSA-1.7B-IBQ512 \
--tdir /path/to/checkpoint-XXXX/transformer/diffusion_pytorch_model.bin \
--outdir ./output/URSA-1.7B-IBQ512 \
--distributed
```

2. Evaluation

```shell
python evaluate.py \
--image_root ./output/URSA-1.7B-IBQ512 \
--out_file ./output/URSA-1.7B-IBQ512/result.json
```

📖 Citation

If you find this repository useful, please consider giving a star ⭐ and a citation 🦖:

```bibtex
@article{wang2026udmgrpo,
  title={UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models},
  author={Wang, Jiaqi and Deng, Haoge and Pan, Ting and Liu, Yang and Wang, Chengyuan and Zhang, Fan and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2604.18518},
  year={2026}
}
@article{deng2025ursa,
  title={Uniform Discrete Diffusion with Metric Path for Video Generation},
  author={Deng, Haoge and Pan, Ting and Zhang, Fan and Liu, Yang and Luo, Zhuoyan and Cui, Yufeng and Shen, Chunhua and Shan, Shiguang and Zhang, Zhaoxiang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2510.24717},
  year={2025}
}
@article{deng2024nova,
  title={Autoregressive Video Generation without Vector Quantization},
  author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2412.14169},
  year={2024}
}
```

🤗 Acknowledgement

We thank the following repositories:

  • URSA. 🐻 URSA is the base model of UDM-GRPO.
  • NOVA. ✨ NOVA is the predecessor of 🐻 URSA.
  • CodeWithGPU. The CodeWithGPU library is the core of our data loading pipeline.

License

Code and models are licensed under Apache License 2.0.