[ACL 2026] Boundary-Guided Policy Optimization for Memory-Efficient RL of Diffusion Large Language Models

June 1, 2026

A memory-efficient reinforcement learning algorithm for diffusion large language models

🎯 Overview

BGPO (Boundary-Guided Policy Optimization) is a novel memory-efficient reinforcement learning algorithm designed specifically for diffusion large language models (dLLMs). It makes large Monte Carlo (MC) sample sizes practical when approximating log-likelihoods and RL objectives, which improves training stability and performance under a constrained memory budget.

🏆 Key Contributions

  • Memory Efficiency: Enables large MC sample sizes without memory overflow
  • Theoretical Foundation: Proven equivalence with ELBO-based objectives
  • Empirical Validation: Comprehensive experiments demonstrating improved performance

📦 Released Models

We provide BGPO models on different tasks, all available on HuggingFace Hub for easy integration into your projects.

🎯 Available Models

| Model | Parameters | HuggingFace |
|---|---|---|
| LLaDA-8B-BGPO-math | 8B | 🤗 Download |
| LLaDA-8B-BGPO-code | 8B | 🤗 Download |
| LLaDA-8B-BGPO-countdown | 8B | 🤗 Download |
| LLaDA-8B-BGPO-sudoku | 8B | 🤗 Download |

🛠️ Installation & Setup

Our training framework is built on top of VeRL, providing a robust foundation for reinforcement learning experiments.

🚀 Quick Installation

# Create and activate environment
conda create -n BGPO python=3.10 -y
conda activate BGPO

# Install dependencies
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation

🔧 Model Setup

After downloading LLaDA-8B-Instruct, overwrite its source files with our modified versions, which enable packed sequences with FlashAttention:

# Copy modified files to your LLaDA model directory
cp src/models/* <path_to_llada_model>/
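
The two setup steps above can be scripted together. The following is a minimal sketch, not part of the repository: the Hub repo id `GSAI-ML/LLaDA-8B-Instruct`, the local directory name, and the `run`/`DRY_RUN` helper are assumptions made for illustration. It prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
# Sketch: fetch LLaDA-8B-Instruct and overlay the modified model files.
# ASSUMPTIONS (not stated in this README): the Hub repo id and the local path.
MODEL_REPO="${MODEL_REPO:-GSAI-ML/LLaDA-8B-Instruct}"   # assumed Hub repo id
MODEL_DIR="${MODEL_DIR:-./LLaDA-8B-Instruct}"           # hypothetical local path
DRY_RUN="${DRY_RUN:-1}"                                 # 1 = print commands only

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Download the base model, then copy the modified files over it.
run huggingface-cli download "$MODEL_REPO" --local-dir "$MODEL_DIR"
run cp src/models/* "$MODEL_DIR"/
```

Run it from the repository root so that `src/models/` resolves.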

🏋️ Training

🔄 Data Preprocessing

Preprocessed datasets are provided under data/preprocessed.

⚙️ Training Configuration

For math tasks, we train for 700 steps; for coding tasks, we train for 5 epochs (112 steps per epoch); for Sudoku and Countdown, we train for 400 and 560 steps, respectively. Detailed parameters are as follows:

"*" denotes the different hyperparameters used in evaluation.

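As a quick sanity check on the schedule above, the coding run's total step count follows from its epoch count. The snippet only restates numbers from this section; the variable names are illustrative.

```shell
# Total training steps per task, using the figures quoted in this section.
code_epochs=5
code_steps_per_epoch=112
code_total=$((code_epochs * code_steps_per_epoch))   # 5 * 112

echo "math:      700 steps"
echo "code:      ${code_total} steps"
echo "sudoku:    400 steps"
echo "countdown: 560 steps"
```

The coding schedule therefore comes to 560 steps, the same length as the Countdown run.
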
🚀 Start Training

bash scripts/run_BGPO.sh <task> [--wandb-run-id=<RUN_ID>]
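
A small wrapper around the command above can catch a mistyped task before a job starts. This is a sketch under stated assumptions: the task names (math, code, countdown, sudoku) are inferred from the released model names and may not match what the script accepts, and `train` and `DRY_RUN` are hypothetical helpers, not part of the repository.

```shell
# Sketch: launch training for one task, optionally passing a W&B run id.
# ASSUMPTION: task names are guessed from the released model names.
train() {
  task="$1"
  run_id="${2:-}"
  case "$task" in
    math|code|countdown|sudoku) ;;
    *) echo "unknown task: $task" >&2; return 2 ;;
  esac
  # Build the command line in the function's positional parameters.
  if [ -n "$run_id" ]; then
    set -- bash scripts/run_BGPO.sh "$task" "--wandb-run-id=$run_id"
  else
    set -- bash scripts/run_BGPO.sh "$task"
  fi
  echo "+ $*"
  # With DRY_RUN=1 the command is only printed, not executed.
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}
```

For example, `DRY_RUN=1 train math` prints the command without running it, and `train code <RUN_ID>` launches the coding task with that run id.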

📊 Evaluation

During training, VeRL automatically evaluates your model on selected test sets at regular intervals (controlled by trainer.test_freq).

We also provide additional scripts for evaluation.

# convert checkpoint to HF model
bash scripts/convert_to_hf.sh

# eval
bash scripts/run_eval_hf.sh
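
Because evaluation depends on the converted checkpoint, it can help to chain the two scripts so that evaluation only runs if conversion succeeds. A minimal sketch, assuming both scripts are invoked without arguments as shown above; `convert_and_eval` and its `repo_root` argument are hypothetical names introduced here.

```shell
# Sketch: convert the checkpoint, then evaluate; stop if conversion fails.
# repo_root (hypothetical argument) is the directory that contains scripts/.
convert_and_eval() {
  repo_root="${1:-.}"
  # Fail early if either script is missing.
  for script in convert_to_hf.sh run_eval_hf.sh; do
    if [ ! -f "$repo_root/scripts/$script" ]; then
      echo "missing: $repo_root/scripts/$script" >&2
      return 1
    fi
  done
  # Run from the repo root in a subshell so the caller's directory is unchanged.
  (
    cd "$repo_root" &&
    bash scripts/convert_to_hf.sh &&
    bash scripts/run_eval_hf.sh
  )
}
```

Call it as `convert_and_eval <path_to_repo>`; the function returns a non-zero status if a script is missing or if either step fails.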

📈 Performance

  1. Overall Performance: BGPO vs. baselines on mathematics, coding, and planning tasks (figure: Main Results).

  2. Monte Carlo Analysis: Performance with different MC sample sizes $n_t$ (figure: MC Results).

  3. Out-of-Domain: Generalization performance, with in-domain results shown in gray (figure: OOD Results).

🙏 Acknowledgments

We thank the open-source community for their valuable contributions, particularly:

  • VeRL for the RL framework
  • HuggingFace for model hosting
  • The research community for their feedback and suggestions

📚 Citation

If you find our work useful, please consider citing our paper:

@misc{lin2025boundaryguidedpolicyoptimizationmemoryefficient,
      title={Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models}, 
      author={Nianyi Lin and Jiajie Zhang and Lei Hou and Juanzi Li},
      year={2025},
      eprint={2510.11683},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.11683}, 
}