JustGRPO

July 6, 2026

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

🏆 ICML 2026 Outstanding Paper Award 🏆

Zanlin Ni¹ Shenzhi Wang¹ Yang Yue¹ Tianyu Yu² Weilin Zhao² Yeguo Hua³

Tianyi Chen³ Jun Song⁴ Cheng Yu⁴ Bo Zheng⁴ Gao Huang¹✉

¹LeapLab, Tsinghua University ²NLPLab, Tsinghua University ³Tsinghua University ⁴Alibaba Group

No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.

Just GRPO.

📢 News

[2026.07] 🏆 Our paper is awarded the ICML 2026 Outstanding Paper Award!
[2026.07] 🚀 Added support for LoRA training! Following LoRA Without Regret, LoRA performs comparably to full fine-tuning in RL. See results below.
[2026.05] 🌟 Our paper is accepted as an Oral at ICML 2026!
[2026.03] 🎉 Training code, evaluation scripts, and model checkpoints for MATH-500, HumanEval and MBPP datasets released!
[2026.01] 📄 Paper available on arXiv!
[2026.01] 🎉 Training code, evaluation scripts, and model checkpoint on GSM8K released!

Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which theoretically offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities inaccessible to standard autoregressive (AR) models?

We observe that the opposite may hold. Arbitrary-order generation tends to bypass high-uncertainty tokens (e.g., "Therefore", "Since") — the very tokens that create branching points in reasoning. This premature bypass can collapse the solution coverage, limiting the reasoning potential (Pass@k).
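To make the Pass@k notion concrete, here is a minimal sketch of the widely used unbiased Pass@k estimator, 1 − C(n−c, k)/C(n, k), for n samples of which c are correct. Whether this repository's evaluation uses exactly this estimator is an assumption, and the two "models" below are made-up illustrations of narrow versus broad solution coverage, not results from the paper.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average Pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: correct samples out of n=16 on four problems.
narrow = [16, 16, 0, 0]  # always right on two problems, never on the others
broad = [8, 8, 8, 8]     # right half of the time on every problem

for k in (1, 8):
    print(k, mean_pass_at_k(narrow, 16, k), mean_pass_at_k(broad, 16, k))
```

Both toy models have the same Pass@1, but only the one with broad coverage gains from larger k. That gap is the kind of reasoning potential the paragraph above refers to.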

Our solution is simple: since AR order better preserves reasoning potential, we just train dLLMs with standard GRPO in AR mode. No bells and whistles.
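As a rough illustration of what "AR mode" could mean for a masked diffusion model, the sketch below scores a completion left to right: at step t the prompt and the first t completion tokens are visible, the rest are masked, and the model's probability for the token at position t is accumulated. By the chain rule the sum is an exact log-likelihood under that ordering. The `MaskedLM` interface, the `MASK` token, and `toy_model` are hypothetical stand-ins, and the actual grpo.py may batch or structure this computation differently.

```python
import math
from typing import Callable, Dict, List, Sequence

MASK = "<mask>"  # hypothetical mask token
Dist = Dict[str, float]
# Hypothetical interface: maps a partially masked sequence to one
# probability distribution over the vocabulary per position.
MaskedLM = Callable[[Sequence[str]], List[Dist]]


def ar_log_likelihood(model: MaskedLM, prompt: Sequence[str],
                      completion: Sequence[str]) -> float:
    """Sum of log p(y_t | prompt, y_<t), revealing positions left to right."""
    total = 0.0
    for t, token in enumerate(completion):
        # Tokens before t are visible; position t and all later ones are masked.
        visible = list(prompt) + list(completion[:t])
        visible += [MASK] * (len(completion) - t)
        dists = model(visible)
        total += math.log(dists[len(prompt) + t][token])
    return total


def toy_model(seq: Sequence[str]) -> List[Dist]:
    """Toy stand-in: after an 'a' it predicts 'b' with probability 0.9."""
    out = []
    for i in range(len(seq)):
        if i > 0 and seq[i - 1] == "a":
            out.append({"a": 0.1, "b": 0.9})
        else:
            out.append({"a": 0.5, "b": 0.5})
    return out


print(ar_log_likelihood(toy_model, ["a"], ["b", "a"]))  # log(0.9) + log(0.5)
```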

Results

Despite its simplicity, JustGRPO achieves strong performance across reasoning and coding benchmarks, comparing favorably with methods that rely on intricate diffusion-specific adaptations:

Benchmark	Gen Length 128	Gen Length 256	Gen Length 512
GSM8K	83.8	89.1	89.8
MATH-500	39.0	45.1	45.2
HumanEval	37.8	49.4	48.7
MBPP	50.6	52.4	49.0

LoRA vs. Full Fine-tuning

We also support training with LoRA. Following LoRA Without Regret, we set the LoRA learning rate to 10× that of full fine-tuning. Under this setting, we observe that LoRA converges to performance comparable to full fine-tuning in RL, consistent with the observations in LoRA Without Regret. Empirically, LoRA converges slightly more slowly, so we train it a bit longer (200 steps vs. 125 for full fine-tuning). All results below are at gen length 256:

Benchmark	Full Fine-tuning	LoRA
GSM8K	89.1	89.6
MATH-500	45.1	46.0
HumanEval	49.4	50.6
MBPP	52.4	48.6

Simplicity

Existing RL methods for dLLMs often require handling the complexity of arbitrary-order generation:

Challenge	Description
Combinatorial trajectories	Optimizing over factorial-sized denoising paths
Intractable likelihoods	ELBO-based surrogates instead of true objectives
Sampler-learner mismatch	Confidence-based samplers vs. original diffusion prior
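
To give a feel for the first row of the table, the sketch below counts the orders in which a sequence can be revealed, under the simplifying assumption that exactly one token is unmasked per step. Real samplers may reveal several tokens at once, which changes the count but not the combinatorial growth.

```python
from math import factorial


def num_unmasking_orders(gen_length: int) -> int:
    """Number of orders in which gen_length positions can be revealed one at a time."""
    return factorial(gen_length)


for length in (4, 16, 256):
    digits = len(str(num_unmasking_orders(length)))
    print(f"gen_length={length}: a {digits}-digit number of arbitrary orders "
          f"vs. 1 left-to-right order")
```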

JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
The core logic of JustGRPO (grpo.py) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it.
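
For readers unfamiliar with GRPO, the following is a minimal sketch of its two usual ingredients: group-normalized advantages and a PPO-style clipped surrogate on log-probabilities. It is not the code in grpo.py. The sequence-level ratio, the clip range of 0.2, the `eps` constant, the use of the population standard deviation, and the absence of a KL term are all simplifying assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip: float = 0.2) -> float:
    """Clipped objective for one rollout (larger is better)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_loss(logps_new: List[float], logps_old: List[float],
              rewards: List[float]) -> float:
    """Negative mean clipped objective over one group of rollouts."""
    advantages = group_advantages(rewards)
    objectives = [clipped_surrogate(n, o, a)
                  for n, o, a in zip(logps_new, logps_old, advantages)]
    return -mean(objectives)


# Four hypothetical rollouts for one prompt with 0/1 correctness rewards.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the log-probabilities are exact in AR mode, they can be plugged into such a loss directly, without an ELBO surrogate.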

💡 The model still retains parallel decoding at inference time — we only use AR order during training. See our paper for more details.

Installation

JustGRPO is designed to be lightweight and dependency-minimal.

git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt

Dependencies:

accelerate
transformers
datasets
Standard evaluation utilities (sympy, latex2sympy2, etc.)

Usage

We provide evaluation and training code for GSM8K, MATH-500, HumanEval, and MBPP.

Evaluation

Model checkpoints:

Full fine-tuning:

LLaDA-Instruct-JustGRPO-GSM8K (GSM8K)
LLaDA-Instruct-JustGRPO-Math500 (MATH-500)
LLaDA-Instruct-JustGRPO-Code (HumanEval & MBPP)

LoRA adapters:

LLaDA-Instruct-JustGRPO-GSM8K-LoRA (GSM8K)
LLaDA-Instruct-JustGRPO-Math500-LoRA (MATH-500)
LLaDA-Instruct-JustGRPO-Code-LoRA (HumanEval & MBPP)

# --task accepts gsm8k, math500, humaneval, or mbpp
torchrun --nproc-per-node=8 eval.py \
  --task gsm8k \
  --ckpt_path /path/to/ckpt \
  --gen_length 256 --steps 256 --block_length 32

The same command works for both types of checkpoints — eval.py auto-detects LoRA adapters and loads them onto the base model.

Training

Math (GSM8K / MATH-500):

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset gsm8k \
  --grad_accum 8

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset math \
  --grad_accum 8

Code (MBPP / HumanEval):

Code training uses the AceCode-Hard subset, following ml-diffucoder. You can download the dataset here: AceCode-Hard (Google Drive). Place the downloaded file at datasets/acecode_hard.jsonl.

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset code \
  --code_data_path datasets/acecode_hard.jsonl \
  --grad_accum 8

Note: Keep global batch size = num_gpus × grad_accum = 64.
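
When running on a different number of GPUs, `--grad_accum` has to change so the product stays at 64. The small helper below does that arithmetic. It follows the formula in the note literally, so it assumes each GPU contributes one micro-batch unit per accumulation step; the function name is hypothetical.

```python
GLOBAL_BATCH_SIZE = 64  # value stated in the note above


def grad_accum_for(num_gpus: int, global_batch: int = GLOBAL_BATCH_SIZE) -> int:
    """Gradient-accumulation steps so that num_gpus * grad_accum == global_batch."""
    if num_gpus <= 0 or global_batch % num_gpus != 0:
        raise ValueError(f"{global_batch} is not divisible by num_gpus={num_gpus}")
    return global_batch // num_gpus


for gpus in (1, 2, 4, 8):
    print(f"--num_processes {gpus} --grad_accum {grad_accum_for(gpus)}")
```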

LoRA:

Add --lora to any of the commands above to train LoRA adapters (r=128, alpha=64, dropout=0.05, applied to the q/k/v/up projections) instead of full fine-tuning:

accelerate launch --num_processes 8 train.py \
  --dataset gsm8k \
  --grad_accum 8 \
  --lora \
  --total_steps 200

Note: With --lora, the default learning rate is 5e-5, which is 10× the full fine-tuning rate of 5e-6, following LoRA Without Regret; override it with --lr. Launch LoRA runs with plain DDP as above (no --config_file): adapter-only training keeps the optimizer state tiny, and configs/fsdp.yaml is untested with PEFT. Checkpoints save the adapter only; eval.py auto-detects them and merges them into the base weights.

Citation

If you find this work useful, please consider citing our paper.

@inproceedings{ni2026flexibility,
  title={The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  booktitle={ICML},
  year={2026}
}

Acknowledgments

This project builds upon the following excellent works:

We sincerely appreciate the authors for making their work open source.