
May 1, 2026

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models


Jiaqi Wang1,2*, Haoge Deng2*, Ting Pan2*, Yang Liu2, Chengyuan Wang2, Fan Zhang2, Yonggang Qi1†, Xinlong Wang2†

BUPT1, BAAI2
* Equal Contribution, † Corresponding Author



We propose UDM-GRPO, the first framework to integrate uniform discrete diffusion models (UDM) with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves the performance of the base model, URSA, across multiple T2I tasks.

🚀 News

✨ Highlights

  • 🥇 Novel Approach: Correcting the action and trajectory yields the first method to integrate UDM with GRPO.
  • 🥈 SOTA Performance: State-of-the-art results across multiple T2I benchmarks.
  • 🥉 High Efficiency: Reduced-Step and CFG-Free training strategies.

🤗 Model

| Task | Model |
| --- | --- |
| GenEval | 🤗 GenEval |
| PickScore | 🤗 PickScore |


🔧 Installation

1. Environment Setup

Clone this repository to local disk and install:

```shell
git clone https://github.com/Yovecent/UDM-GRPO.git
cd UDM-GRPO

conda create -n UDMGRPO python=3.10

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -e .
pip install torch==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
pip install psutil==7.0.0 flash-attn==2.7.4.post1 --no-build-isolation
```

2. Model Download

| Model | Resolution | Download |
| --- | --- | --- |
| URSA-1.7B-IBQ512 | 512x512 | 🤗 Hugging Face |

3. Reward Preparation

1. PickScore

You can let the training code download the PickScore model automatically, or pre-download it yourself.

2. GenEval

Install the dependencies and download the Mask2Former checkpoint:
```shell
# First
pip install openmim==0.3.9 open-clip-torch==2.31.0 numpy==1.26.0 opencv-python==4.11.0.86 clip-benchmark==1.6.1

# Then
mim install mmengine mmcv-full==1.7.2 --no-build-isolation
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection; git checkout 2.x
pip install setuptools==78.1.1
pip install -e . --no-build-isolation

# Then
mv ../raw_rl_data/object_names.txt .

wget https://download.openmmlab.com/mmdetection/v2.0/mask2former/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco_20220504_001756-743b7d99.pth \
-O ./mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.pth
```

Download the timm/vit_large_patch14_clip_224.openai 🤗 model, and change `model_path` in `diffnext.rewards.reward_image.GenEvalScorer` to your mmdetection path.

The mmdetection directory layout should be:

```
mmdetection/
│
├── configs/
│   └── mask2former/
│       └── mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py
│
├── mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.pth
│
├── vit_large_patch14_clip_224.openai/
│   ├── open_clip_config.json
│   ├── pytorch_model.bin
│   └── ...
│
└── object_names.txt
```

3. OCR

Install PaddleOCR and its dependencies, then instantiate the model:

```shell
pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein
```

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)
```

Change the OCR model path in `diffnext.rewards.reward_image.OCRScorer` to your local path.
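For intuition, an OCR reward of this kind typically compares the recognized text against the target text. The sketch below is hypothetical (the actual scorer lives in `diffnext.rewards.reward_image.OCRScorer`), with a small pure-Python edit distance standing in for python-Levenshtein so the example is self-contained:

```python
# Hypothetical sketch of an OCR text reward; not the repo's implementation.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ocr_reward(predicted, target):
    """Reward in [0, 1]: 1 for an exact match, decaying with edit distance."""
    if not target:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, target) / len(target))
```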

🥫 Data Preparation

GenEval

```shell
# First
cd raw_rl_data/geneval
python cache.py
```

Then change `train_dataloader.params.dataset` in `ursa_1.7b_ibq512.yaml` to point at the cached data.

Prepare the PickScore and OCR data in the same way.
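The dataset switch looks roughly like the fragment below; note that everything except the `train_dataloader.params.dataset` key path and the file name is an assumption about the config's structure:

```yaml
# configs/geneval_grpo/ursa_1.7b_ibq512.yaml (fragment; surrounding keys are assumptions)
train_dataloader:
  params:
    dataset: ./raw_rl_data/geneval   # point this at your cached data
```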

🤖 Training

1. Single-node training

```shell
cd diffnext

accelerate launch --config_file accelerate_configs/4_nodes_deepspeed.yaml \
--machine_rank 0 --num_machines 1 --num_processes 8 \
scripts/train.py \
config="configs/geneval_grpo/ursa_1.7b_ibq512.yaml" \
experiment.name="ursa_geneval" \
experiment.output_dir="./experiments/ursa_geneval"
```

Note: If you modify the batch size in the configuration, you must ensure that
`training.batch_size = num_prompts * num_images // num_gpus // num_batches`.
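The constraint above can be checked before launching a run. This is a hypothetical sanity check, not part of the repo; the variable names simply mirror the note:

```python
# Hypothetical sanity check: verify that the configured batch size matches
# the rollout sharding described in the note above.
def check_batch_size(batch_size, num_prompts, num_images, num_gpus, num_batches):
    expected = num_prompts * num_images // num_gpus // num_batches
    if batch_size != expected:
        raise ValueError(
            f"training.batch_size should be {expected}, got {batch_size}"
        )
    return expected

# e.g. 8 prompts x 16 images over 8 GPUs in 2 batches -> batch size 8
check_batch_size(8, num_prompts=8, num_images=16, num_gpus=8, num_batches=2)
```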

2. Multi-node training

```shell
# Master node
sh scripts/geneval_grpo/main.sh

# Other nodes
sh scripts/geneval_grpo/main1.sh
sh scripts/geneval_grpo/main2.sh
sh scripts/geneval_grpo/main3.sh
```

πŸ–‹οΈ Evaluations

GenEval

1. Sample prompt images

```shell
cd diffnext/evaluations/geneval

torchrun --nproc_per_node=8 sample.py \
--height 512 --width 512 \
--guidance_scale 1.0 --num_inference_steps 25 \
--ckpt /path/to/URSA-1.7B-IBQ512 \
--tdir /path/to/checkpoint-XXXX/transformer/diffusion_pytorch_model.bin \
--outdir ./output/URSA-1.7B-IBQ512 \
--distributed
```

2. Evaluation

Set the image folder:

```shell
IMAGE_FOLDER=./output/URSA-1.7B-IBQ512
```

Then please refer to the GenEval evaluation guide.

PickScore

1. Sample prompt images

```shell
cd diffnext/evaluations/pickscore

torchrun --nproc_per_node=8 sample.py \
--height 512 --width 512 \
--guidance_scale 1.0 --num_inference_steps 25 \
--ckpt /path/to/URSA-1.7B-IBQ512 \
--tdir /path/to/checkpoint-XXXX/transformer/diffusion_pytorch_model.bin \
--outdir ./output/URSA-1.7B-IBQ512 \
--distributed
```

2. Evaluation

```shell
python evaluate.py \
--image_root ./output/URSA-1.7B-IBQ512 \
--out_file ./output/URSA-1.7B-IBQ512/result.json
```

📖 Citation

If you find this repository useful, please consider giving a star ⭐ and a citation 🦖:

```bibtex
@article{wang2026udmgrpo,
  title={UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models},
  author={Wang, Jiaqi and Deng, Haoge and Pan, Ting and Liu, Yang and Wang, Chengyuan and Zhang, Fan and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2604.18518},
  year={2026}
}
@article{deng2025ursa,
  title={Uniform Discrete Diffusion with Metric Path for Video Generation},
  author={Deng, Haoge and Pan, Ting and Zhang, Fan and Liu, Yang and Luo, Zhuoyan and Cui, Yufeng and Shen, Chunhua and Shan, Shiguang and Zhang, Zhaoxiang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2510.24717},
  year={2025}
}
@article{deng2024nova,
  title={Autoregressive Video Generation without Vector Quantization},
  author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2412.14169},
  year={2024}
}
```

🤗 Acknowledgement

We thank the following repositories:

  • URSA. 🐻 URSA is the base model of UDM-GRPO.
  • NOVA. ✨ NOVA is the predecessor of 🐻 URSA.
  • CodeWithGPU. The CodeWithGPU library is the core of our data loading pipeline.

License

Code and models are licensed under Apache License 2.0.