JustGRPO
June 9, 2026
The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models
ICML 2026 Oral
Zanlin Ni¹, Shenzhi Wang¹, Yang Yue¹, Tianyu Yu², Weilin Zhao², Yeguo Hua³,
Tianyi Chen³, Jun Song⁴, Cheng Yu⁴, Bo Zheng⁴, Gao Huang¹
¹LeapLab, Tsinghua University, ²NLPLab, Tsinghua University, ³Tsinghua University, ⁴Alibaba Group
No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.
Just GRPO.
News
- [2026.05] Our paper is accepted as an Oral at ICML 2026!
- [2026.03] Training code, evaluation scripts, and model checkpoints for MATH-500, HumanEval and MBPP datasets released!
- [2026.01] Paper available on arXiv!
- [2026.01] Training code, evaluation scripts, and model checkpoint on GSM8K released!
Why JustGRPO?
Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which in theory offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities inaccessible to standard autoregressive (AR) models?
We found the opposite. Arbitrary-order generation allows models to bypass high-uncertainty tokens (e.g., "Therefore", "Since"), the very tokens that create branching points in reasoning. This premature bypass collapses the solution space, leading to lower reasoning potential (Pass@k).
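Pass@k is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k draws is correct. The sketch below assumes that estimator; the paper's evaluation may compute Pass@k differently, and `pass_at_k` is an illustrative name rather than a function from this repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: correct samples, k: draw budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 3 of 10 samples are correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

A larger k rewards diversity among samples, which is why Pass@k serves as a proxy for how much of the solution space a model still covers.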
Our solution is simple: since AR order better preserves reasoning potential, we train dLLMs with standard GRPO in AR mode. No bells and whistles.
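One way to write down what "AR mode" could mean for a masked dLLM is sketched below. The notation is an assumption of this sketch, not taken from the paper: $x$ is the prompt, $y$ a response of length $T$, and at step $t$ the model sees the prompt, the tokens generated so far, and mask tokens at every remaining position.

```latex
\log \pi_\theta(y \mid x)
  = \sum_{t=1}^{T} \log \pi_\theta\bigl(y_t \mid x,\; y_{<t},\; [\mathrm{MASK}]^{T-t+1}\bigr)
```

Under this reading, each term is the model's predicted distribution at the first masked position, so the sum is an exact sequence log-probability for left-to-right decoding rather than an ELBO-based estimate.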
Results
JustGRPO achieves state-of-the-art performance across reasoning and coding benchmarks:
| Benchmark | Gen Length 128 | Gen Length 256 | Gen Length 512 |
|---|---|---|---|
| GSM8K | 83.8 | 89.1 | 89.8 |
| MATH-500 | 39.0 | 45.1 | 45.2 |
| HumanEval | 37.8 | 49.4 | 48.7 |
| MBPP | 50.6 | 52.4 | 49.0 |
Simplicity
Existing RL methods for dLLMs often require handling the complexity of arbitrary-order generation:
| Challenge | Description |
|---|---|
| Combinatorial trajectories | Optimizing over factorial-sized denoising paths |
| Intractable likelihoods | ELBO-based surrogates instead of true objectives |
| Sampler-learner mismatch | Confidence-based samplers vs. original diffusion prior |
- JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
- The core logic of JustGRPO (`grpo.py`) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it.
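As a rough illustration of those two steps, the sketch below normalizes rewards within one prompt's group of rollouts and combines the resulting advantages with per-sequence log-probabilities into a policy-gradient loss. It is not the repository's `grpo.py`: the function names, the use of plain Python floats instead of tensors, and the omission of clipping and KL terms are all simplifying assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    Uses the population standard deviation; some implementations
    use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def policy_loss(logprobs, rewards):
    """Mean of -advantage * sequence log-probability over the group."""
    advantages = group_advantages(rewards)
    weighted = [a * lp for a, lp in zip(advantages, logprobs)]
    return -sum(weighted) / len(weighted)


# Four hypothetical rollouts for one prompt; reward is 1.0 for a correct answer.
rewards = [1.0, 0.0, 0.0, 1.0]
# Made-up sequence log-probabilities, standing in for AR-order likelihoods.
logprobs = [-12.3, -15.1, -14.8, -11.9]

print(group_advantages(rewards))
print(policy_loss(logprobs, rewards))
```

Minimizing this loss raises the log-probability of rollouts that beat their group average and lowers it for the rest. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient.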
Note: The model still retains parallel decoding at inference time; we only use AR order during training. See our paper for more details.
Installation
JustGRPO is designed to be lightweight and dependency-minimal.
```bash
git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt
```
Dependencies:
- `accelerate`
- `transformers`
- `datasets`
- Standard evaluation utilities (`sympy`, `latex2sympy2`, etc.)
Usage
We provide evaluation and training code for GSM8K, MATH-500, HumanEval, and MBPP.
Evaluation
Model checkpoints:
- LLaDA-Instruct-JustGRPO-GSM8K (GSM8K)
- LLaDA-Instruct-JustGRPO-Math500 (MATH-500)
- LLaDA-Instruct-JustGRPO-Code (HumanEval & MBPP)
```bash
# --task accepts gsm8k, math500, humaneval, or mbpp
torchrun --nproc-per-node=8 eval.py \
  --task gsm8k \
  --ckpt_path /path/to/ckpt \
  --gen_length 256 --steps 256 --block_length 32
```
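To make the three decoding flags concrete, the sketch below works out the schedule they would imply under LLaDA-style semi-autoregressive block decoding, where the response is split into fixed-size blocks and the step budget is shared evenly among them. That `eval.py` interprets the flags this way is an assumption, and `decoding_schedule` is a hypothetical helper, not part of the repository.

```python
def decoding_schedule(gen_length: int, steps: int, block_length: int) -> dict:
    """Derive block count, steps per block, and tokens unmasked per step."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks
    return {
        "num_blocks": num_blocks,
        "steps_per_block": steps_per_block,
        # Average number of tokens revealed at each denoising step.
        "tokens_per_step": block_length / steps_per_block,
    }


print(decoding_schedule(gen_length=256, steps=256, block_length=32))
```

Under this assumption, the settings in the command give 8 blocks of 32 steps each, revealing one token per step. Halving `--steps` to 128 would reveal two tokens per step.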
Training
Math (GSM8K / MATH-500):
```bash
# GSM8K
accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset gsm8k \
  --grad_accum 8

# MATH-500
accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset math \
  --grad_accum 8
```
Code (MBPP / HumanEval):
Code training uses the AceCode-Hard subset, following ml-diffucoder. You can download the dataset here: AceCode-Hard (Google Drive). Place the downloaded file at datasets/acecode_hard.jsonl.
```bash
accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset code \
  --code_data_path datasets/acecode_hard.jsonl \
  --grad_accum 8
```
Note: Keep the global batch size at `num_gpus` × `grad_accum` = 64.
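When fewer than 8 GPUs are available, the note above implies raising `--grad_accum` to compensate. The helper below does that arithmetic, assuming the global batch size is exactly the product of the two values as stated; `grad_accum_for` is a hypothetical name, not a function in the repository.

```python
def grad_accum_for(num_gpus: int, global_batch: int = 64) -> int:
    """Return the --grad_accum value so that num_gpus * grad_accum == global_batch."""
    if num_gpus <= 0 or global_batch % num_gpus != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot evenly divide a global batch of {global_batch}"
        )
    return global_batch // num_gpus


for gpus in (8, 4, 2):
    print(f"--num_processes {gpus} --grad_accum {grad_accum_for(gpus)}")
```

Remember to pass the same GPU count to `--num_processes` when changing `--grad_accum`.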
Citation
If you find this work useful, please consider citing our paper.
```bibtex
@inproceedings{ni2026flexibility,
  title={The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  booktitle={ICML},
  year={2026}
}
```
Acknowledgments
This project builds upon the following excellent works:
We sincerely appreciate the authors for making their work open source.