Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

March 31, 2025 · View on GitHub

Introduction

This work presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$ , employs two strategies: Backward progressive training and Branch-based sampling. For one thing, backward progressive training focuses initially on the final timesteps of denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty from sparse rewards. For another, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images.

Run

bash run_process.sh > log/exp_B2DiffuRL_b5_p3

This will start fine-tuning, and store the results under model/. The pipeline consists of sampling by run_sample.py, evaluation by run_select.py and training by run_train.py.

The full hyperparameters are shown in config/stage_process.py, and many of them can be modified in run_process.sh. Please note that the default parameters are not meant to achieve best performance.

Acknowlegement

This repository was built with much reference to the following repositories:

Citation

If our work assists your research, feel free to cite us using:

@misc{hu2025betteralignmenttrainingdiffusion,
      title={Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards}, 
      author={Zijing Hu and Fengda Zhang and Long Chen and Kun Kuang and Jiahui Li and Kaifeng Gao and Jun Xiao and Xin Wang and Wenwu Zhu},
      year={2025},
      eprint={2503.11240},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.11240}, 
}