[ACL 2026] Boundary-Guided Policy Optimization for Memory-Efficient RL of Diffusion Large Language Models

June 1, 2026

A memory-efficient reinforcement learning algorithm for diffusion large language models

🔍 Table of Contents

🎯 Overview
📦 Released Models
🛠️ Installation & Setup
🏋️ Training
📊 Evaluation
📈 Performance
📚 Citation

🎯 Overview

BGPO (Boundary-Guided Policy Optimization) is a memory-efficient reinforcement learning algorithm for diffusion large language models (dLLMs). Unlike previous approaches, which memory constraints restrict to small sample sizes, BGPO allows large Monte Carlo (MC) sample sizes when approximating log-likelihoods and RL objectives. This improves training stability and performance under a constrained memory budget.

🏆 Key Contributions

Memory Efficiency: Enables large MC sample sizes without memory overflow
Theoretical Foundation: Proven equivalence with ELBO-based objectives
Empirical Validation: Comprehensive experiments demonstrating improved performance
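
To illustrate why the MC sample size normally drives memory use, here is a generic sketch. The notation ($f$, $\ell_j$, $g_j$) is assumed for illustration and is not the paper's exact bound; $n_t$ is the MC sample size. When a nonlinear function $f$ is applied to an average of per-sample terms, a standard single-pass backward has to keep the forward graphs of all $n_t$ samples, because the factor $f'(\cdot)$ depends on every sample. When the objective is a plain sum of per-sample terms, each gradient can be computed, accumulated, and freed before the next sample is processed, so peak memory does not grow with $n_t$:

```latex
% Nonlinear in the MC average: f'(.) couples all n_t samples.
\nabla_\theta\, f\!\Big(\tfrac{1}{n_t}\sum_{j=1}^{n_t} \ell_j(\theta)\Big)
  = f'\!\Big(\tfrac{1}{n_t}\sum_{j=1}^{n_t} \ell_j(\theta)\Big)\,
    \tfrac{1}{n_t}\sum_{j=1}^{n_t} \nabla_\theta \ell_j(\theta)

% Linear in per-sample terms: gradients accumulate one sample at a time.
\nabla_\theta\, \tfrac{1}{n_t}\sum_{j=1}^{n_t} g_j(\theta)
  = \tfrac{1}{n_t}\sum_{j=1}^{n_t} \nabla_\theta g_j(\theta)
```

See the paper for the actual objective and for the equivalence result with the ELBO-based objective.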

📦 Released Models

We release BGPO models trained on four tasks (math, code, countdown, and sudoku). All are available on the HuggingFace Hub.

🎯 Available Models

Model	Parameters	HuggingFace
LLaDA-8B-BGPO-math	8B	🤗 Download
LLaDA-8B-BGPO-code	8B	🤗 Download
LLaDA-8B-BGPO-countdown	8B	🤗 Download
LLaDA-8B-BGPO-sudoku	8B	🤗 Download

🛠️ Installation & Setup

Our training framework is built on top of VeRL, which provides the underlying RL infrastructure.

🚀 Quick Installation

# Create and activate environment
conda create -n BGPO python=3.10 -y
conda activate BGPO

# Install dependencies
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation

🔧 Model Setup

After downloading LLaDA-8B-Instruct, replace its source files with our modified versions, which enable FlashAttention's packed sequences:

# Copy modified files to your LLaDA model directory
cp src/models/* <path_to_llada_model>/
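
Since the copy above overwrites files in the model directory, it can be useful to keep backups. The following is a minimal sketch, not part of the repository: `patch_llada` is a hypothetical helper, the paths are placeholders, and the download command assumes the checkpoint is fetched with `huggingface-cli` from the `GSAI-ML/LLaDA-8B-Instruct` repository.

```bash
# Hypothetical helper (not part of this repo): copy the patched model files
# into a LLaDA checkpoint directory, keeping a .orig backup of each file
# that would be overwritten.
patch_llada() {
  local src_dir="$1" model_dir="$2" f name
  [ -d "$src_dir" ] || { echo "missing source dir: $src_dir" >&2; return 1; }
  [ -d "$model_dir" ] || { echo "missing model dir: $model_dir" >&2; return 1; }
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue
    name="$(basename "$f")"
    # Back up the original only once, so re-running keeps the first backup.
    if [ -f "$model_dir/$name" ] && [ ! -f "$model_dir/$name.orig" ]; then
      cp "$model_dir/$name" "$model_dir/$name.orig"
    fi
    cp "$f" "$model_dir/$name"
  done
}

# Example usage (assumed repo id and placeholder paths):
#   huggingface-cli download GSAI-ML/LLaDA-8B-Instruct --local-dir ./LLaDA-8B-Instruct
#   patch_llada src/models ./LLaDA-8B-Instruct
```

Restoring the unmodified model then only requires moving the `.orig` files back.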

"*" denotes the different hyperparameters used in evaluation.

🏋️ Training

🚀 Start Training

bash scripts/run_BGPO.sh <task> [--wandb-run-id=<RUN_ID>]
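
To train on several tasks in sequence, a small wrapper along these lines may help. It is a sketch under assumptions: the task identifiers (`math`, `code`, `countdown`, `sudoku`) are inferred from the released model names and may not match what `scripts/run_BGPO.sh` accepts, and `run_all_tasks` is a hypothetical helper, not part of the repository.

```bash
# Hypothetical wrapper: run the training script once per task and stop at
# the first failure. The optional first argument replaces the runner command.
run_all_tasks() {
  local runner="${1:-bash scripts/run_BGPO.sh}" task
  # Task names are assumed from the released model names.
  for task in math code countdown sudoku; do
    echo "==> training on ${task}"
    # $runner is intentionally unquoted so "bash scripts/run_BGPO.sh" splits into words.
    $runner "$task" || { echo "training failed for ${task}" >&2; return 1; }
  done
}

# Example usage:
#   run_all_tasks          # runs scripts/run_BGPO.sh for each task
#   run_all_tasks echo     # dry run: only prints the task names
```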

📊 Evaluation

During training, VeRL automatically evaluates your model on selected test sets at regular intervals (controlled by trainer.test_freq).

We also provide scripts for standalone evaluation: first convert a training checkpoint to a HuggingFace model, then evaluate it.

# convert checkpoint to HF model
bash scripts/convert_to_hf.sh

# eval
bash scripts/run_eval_hf.sh

📈 Performance

Overall Performance: BGPO vs. baselines on mathematics, coding, and planning tasks
Monte Carlo Analysis: Performance with different sampling sizes $n_t$
Out-of-Domain: Generalization performance (gray = in-domain)

🙏 Acknowledgments

We thank the open-source community for their valuable contributions, particularly:

VeRL for the RL framework
HuggingFace for model hosting
The research community for their feedback and suggestions

📚 Citation

If you find our work useful, please consider citing our paper:

@misc{lin2025boundaryguidedpolicyoptimizationmemoryefficient,
      title={Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models}, 
      author={Nianyi Lin and Jiajie Zhang and Lei Hou and Juanzi Li},
      year={2025},
      eprint={2510.11683},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.11683}, 
}