JustGRPO

June 9, 2026

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

🌟 ICML 2026 Oral 🌟

Zanlin Ni1 Shenzhi Wang1 Yang Yue1 Tianyu Yu2 Weilin Zhao2 Yeguo Hua3

Tianyi Chen3 Jun Song4 Cheng Yu4 Bo Zheng4 Gao Huang1✉

1LeapLab, Tsinghua University 2NLPLab, Tsinghua University 3Tsinghua University 4Alibaba Group

No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.

Just GRPO.

📢 News

  • [2026.05] 🌟 Our paper has been accepted as an Oral at ICML 2026!
  • [2026.03] 🎉 Training code, evaluation scripts, and model checkpoints for the MATH-500, HumanEval, and MBPP datasets released!
  • [2026.01] 📄 Paper available on arXiv!
  • [2026.01] 🎉 Training code, evaluation scripts, and model checkpoint for GSM8K released!

Why JustGRPO?

Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which in theory offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities that are inaccessible to standard autoregressive (AR) models?

[Figure: Mechanism to Pass@k]

We found the opposite. Arbitrary-order generation allows models to bypass high-uncertainty tokens (e.g., "Therefore", "Since"), which are the very tokens that create branching points in reasoning. This premature bypass collapses the solution space, leading to lower reasoning potential (Pass@k).
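Pass@k is the probability that at least one of k sampled solutions is correct. This section does not say how it is estimated, so the snippet below is only a sketch of one common choice: the unbiased estimator 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The function name `pass_at_k` is an assumption and is not taken from the repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 3 correct samples out of 10, a single draw succeeds 30% of the time,
# while drawing 5 makes at least one success much more likely.
single_draw = pass_at_k(n=10, c=3, k=1)
five_draws = pass_at_k(n=10, c=3, k=5)
```

A model whose samples collapse onto one reasoning path tends to gain little as k grows, which is the sense in which Pass@k measures reasoning potential here.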

Our solution is simple: since AR order better preserves reasoning potential, we just train dLLMs with standard GRPO in AR mode. No bells and whistles.

Results

JustGRPO achieves state-of-the-art performance across reasoning and coding benchmarks:

[Figure: Accuracy Comparison]
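| Benchmark | Gen Length 128 | Gen Length 256 | Gen Length 512 |
|-----------|----------------|----------------|----------------|
| GSM8K     | 83.8           | 89.1           | 89.8           |
| MATH-500  | 39.0           | 45.1           | 45.2           |
| HumanEval | 37.8           | 49.4           | 48.7           |
| MBPP      | 50.6           | 52.4           | 49.0           |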

Simplicity

Existing RL methods for dLLMs often require handling the complexity of arbitrary-order generation:

| Challenge                  | Description                                          |
|----------------------------|------------------------------------------------------|
| Combinatorial trajectories | Optimizing over factorial-sized denoising paths      |
| Intractable likelihoods    | ELBO-based surrogates instead of true objectives     |
| Sampler-learner mismatch   | Confidence-based samplers vs. original diffusion prior |
  • JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
  • The core logic of JustGRPO (grpo.py) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it.
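To make "exact likelihood computation" concrete, here is a minimal sketch of the two ingredients named above, written in plain Python. It is not the contents of grpo.py. The `Denoiser` interface, `MASK_ID`, and both function names are assumptions made for illustration, and the sketch leaves out the ratio clipping and any KL term that a full GRPO objective may use.

```python
import math
from typing import Callable, List, Sequence

MASK_ID = -1  # hypothetical id for the mask token

# Hypothetical interface: given a partially masked sequence, return one
# probability distribution over the vocabulary for every position.
Denoiser = Callable[[Sequence[int]], List[List[float]]]


def ar_log_likelihood(denoiser: Denoiser, prompt: Sequence[int],
                      completion: Sequence[int]) -> float:
    """Sum of log p(y_t | prompt, y_<t) over the completion, left to right."""
    total = 0.0
    for t, token in enumerate(completion):
        # Reveal the prompt and tokens before t; mask position t and all later ones.
        visible = list(prompt) + list(completion[:t])
        masked = [MASK_ID] * (len(completion) - t)
        probs = denoiser(visible + masked)
        total += math.log(probs[len(prompt) + t][token])
    return total


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize rewards within the group of rollouts for one prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def uniform_denoiser(tokens: Sequence[int], vocab_size: int = 4) -> List[List[float]]:
    """Toy stand-in for a real dLLM: uniform distribution at every position."""
    return [[1.0 / vocab_size] * vocab_size for _ in tokens]


log_p = ar_log_likelihood(uniform_denoiser, prompt=[0, 1], completion=[2, 3, 1])
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because each token is scored with all later positions masked, the per-token terms form a valid chain-rule factorization, so their sum is a true log-likelihood rather than an ELBO surrogate. The sketch makes one model call per token for clarity; it makes no claim about how the repository batches this work.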

💡 The model still retains parallel decoding at inference time; we only use AR order during training. See our paper for more details.

Installation

JustGRPO is designed to be lightweight and dependency-minimal.

git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt

Dependencies:

  • accelerate
  • transformers
  • datasets
  • Standard evaluation utilities (sympy, latex2sympy2, etc.)

Usage

We provide evaluation and training code for GSM8K, MATH-500, HumanEval, and MBPP.

Evaluation

Model checkpoints:

# --task accepts gsm8k, math500, humaneval, or mbpp
torchrun --nproc-per-node=8 eval.py \
  --task gsm8k \
  --ckpt_path /path/to/ckpt \
  --gen_length 256 --steps 256 --block_length 32
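To evaluate one checkpoint on all four tasks, the command can be assembled programmatically. The helper below is a sketch that uses only the flags shown in the command above. The name `build_eval_command` and its default values are assumptions for illustration, and the script prints the commands rather than running them.

```python
import shlex
from typing import List

# Task names as listed for the --task flag in this README.
TASKS = ["gsm8k", "math500", "humaneval", "mbpp"]


def build_eval_command(task: str, ckpt_path: str, gen_length: int = 256,
                       steps: int = 256, block_length: int = 32,
                       num_gpus: int = 8) -> List[str]:
    """Return the torchrun argument list for one evaluation run."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return [
        "torchrun", f"--nproc-per-node={num_gpus}", "eval.py",
        "--task", task,
        "--ckpt_path", ckpt_path,
        "--gen_length", str(gen_length),
        "--steps", str(steps),
        "--block_length", str(block_length),
    ]


# Print one shell-quoted command per task; pass the list to subprocess.run
# instead if you want to launch the runs from Python.
for task in TASKS:
    print(shlex.join(build_eval_command(task, "/path/to/ckpt")))
```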

Training

Math (GSM8K / MATH-500):

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset gsm8k \
  --grad_accum 8

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset math \
  --grad_accum 8

Code (MBPP / HumanEval):

Code training uses the AceCode-Hard subset, following ml-diffucoder. You can download the dataset here: AceCode-Hard (Google Drive). Place the downloaded file at datasets/acecode_hard.jsonl.

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset code \
  --code_data_path datasets/acecode_hard.jsonl \
  --grad_accum 8

Note: Keep global batch size = num_gpus × grad_accum = 64.
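If you train on a different number of GPUs, `--grad_accum` has to change so that the product stays at 64. The helper below is a small sketch of that arithmetic; the function name `grad_accum_for` is an assumption and is not part of the repository.

```python
def grad_accum_for(num_gpus: int, global_batch_size: int = 64) -> int:
    """Return the --grad_accum value that keeps num_gpus * grad_accum fixed."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by num_gpus")
    return global_batch_size // num_gpus


# 8 GPUs reproduce the setting used in the commands above; 4 GPUs need
# twice as many accumulation steps.
eight_gpu_setting = grad_accum_for(8)
four_gpu_setting = grad_accum_for(4)
```

For example, on 4 GPUs you would pass `--num_processes 4` together with `--grad_accum 16`.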

Citation

If you find this work useful, please consider citing our paper.

@inproceedings{ni2026flexibility,
  title={The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  booktitle={ICML},
  year={2026}
}

Acknowledgments

This project builds upon the following excellent works:

We sincerely appreciate the authors for making their work open source.