Reinforced Token Optimization (RTO)

June 4, 2025

This repository contains the source code for our paper DPO Meets PPO: Reinforced Token Optimization for RLHF.

TL;DR: Based on theoretical insights, we propose Reinforced Token Optimization (RTO), an RLHF algorithm that is more sample-efficient and effective than Proximal Policy Optimization (PPO). RTO outperforms PPO, Direct Preference Optimization (DPO), and other baselines on the AlpacaEval 2 and Arena-Hard benchmarks by a large margin.

Illustration of RTO

News

  • [2025.5.1] Our paper has been accepted at ICML 2025 (Spotlight)!
  • [2025.2.12] We updated our paper on arXiv.
  • [2025.2.7] We released our code and models.
  • [2024.4.29] We released our paper on arXiv.

Model Releases and Evaluation Results

We release all model checkpoints in this Hugging Face repo.

We use the UltraFeedback dataset. All preference learning uses a binarized version, while all reinforcement learning uses a prompt-only version.
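The relationship between the two dataset variants can be sketched as follows. This is an illustration only: the field names `prompt`, `chosen`, and `rejected`, the helper `to_prompt_only`, and the de-duplication step are assumptions, not the schema or preprocessing used in this repo.

```python
from typing import Dict, List

# Hypothetical binarized preference records: each prompt is paired with a
# preferred ("chosen") and a dispreferred ("rejected") response.
# The field names are assumed for illustration.
binarized: List[Dict[str, str]] = [
    {
        "prompt": "Explain RLHF in one sentence.",
        "chosen": "RLHF fine-tunes a model using human preference feedback.",
        "rejected": "RLHF is a kind of database.",
    },
    {
        "prompt": "Name one RLHF algorithm.",
        "chosen": "PPO is a widely used RLHF algorithm.",
        "rejected": "Quicksort.",
    },
]


def to_prompt_only(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop the responses and keep each distinct prompt once, in first-seen order."""
    seen = set()
    prompts = []
    for record in records:
        if record["prompt"] not in seen:
            seen.add(record["prompt"])
            prompts.append({"prompt": record["prompt"]})
    return prompts


# Preference learning would consume `binarized`; the reinforcement learning
# stage only needs prompts, since responses are sampled from the policy.
prompt_only = to_prompt_only(binarized)
print(prompt_only)
```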

We evaluate these models on two popular benchmarks, AlpacaEval 2 (AE2) and Arena-Hard (AH). The following table reports the AlpacaEval 2 length-controlled (LC) and raw win rate (WR) scores and the Arena-Hard style-controlled (SC) and raw win rate (WR) scores.

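| Model | AE2 LC | AE2 WR | AH SC | AH WR |
|-------|--------|--------|-------|-------|
| SFT   | 13.22  | 8.58   | 9.2   | 8.9   |
| DPO   | 17.40  | 12.23  | 13.2  | 13.8  |
| R-DPO | 18.34  | 12.03  | 14.2  | 14.1  |
| SimPO | 25.46  | 20.20  | 14.5  | 15.2  |
| TDPO  | 20.13  | 11.97  | 13.2  | 12.3  |
| PPO   | 19.47  | 12.89  | 16.2  | 15.6  |
| RTO   | 27.00  | 22.45  | 20.3  | 21.4  |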
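To make the margins concrete, the short script below recomputes the gap between RTO and each baseline. The scores are copied from the results table; the names `scores`, `METRICS`, and `margin_over` are illustrative only.

```python
# Scores copied from the results table, in the order
# (AE2 LC, AE2 WR, AH SC, AH WR).
scores = {
    "SFT": (13.22, 8.58, 9.2, 8.9),
    "DPO": (17.40, 12.23, 13.2, 13.8),
    "R-DPO": (18.34, 12.03, 14.2, 14.1),
    "SimPO": (25.46, 20.20, 14.5, 15.2),
    "TDPO": (20.13, 11.97, 13.2, 12.3),
    "PPO": (19.47, 12.89, 16.2, 15.6),
    "RTO": (27.00, 22.45, 20.3, 21.4),
}
METRICS = ("AE2 LC", "AE2 WR", "AH SC", "AH WR")


def margin_over(baseline: str, method: str = "RTO") -> dict:
    """Per-metric difference (method minus baseline), rounded to two decimals."""
    return {
        metric: round(ours - theirs, 2)
        for metric, ours, theirs in zip(METRICS, scores[method], scores[baseline])
    }


for name in scores:
    if name != "RTO":
        print(f"RTO vs {name}: {margin_over(name)}")
```

On these numbers, RTO leads every baseline on all four metrics; the closest competitor on AlpacaEval 2 is SimPO, and on Arena-Hard it is PPO.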

Install Requirements

conda create -n rto python=3.10
conda activate rto
conda install cuda -c nvidia/label/cuda-12.1.0
pip3 install torch==2.4.1 torchvision torchaudio
cd RTO
pip3 install -e .
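
After installation, a quick sanity check can confirm that the pinned PyTorch version was picked up. The sketch below uses only the Python standard library; the helper name `check_package` is hypothetical and not part of this repo.

```python
from importlib.metadata import PackageNotFoundError, version


def check_package(name: str, expected: str) -> str:
    """Return a one-line status comparing the installed version with the expected one."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name}: not installed (expected {expected})"
    # Some builds report a local suffix such as "2.4.1+cu121";
    # compare only the part before "+".
    base = installed.split("+")[0]
    status = "OK" if base == expected else f"mismatch (expected {expected})"
    return f"{name} {installed}: {status}"


# The install commands above pin torch to 2.4.1.
print(check_package("torch", "2.4.1"))
```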

Training Scripts

We include the training scripts in examples/scripts.

bash examples/scripts/train_rto_llama_8b.sh

The script is configured for 8×A100 GPUs. Adjust micro_rollout_batch_size and micro_train_batch_size to fit your hardware.
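
When lowering micro_train_batch_size to fit smaller GPUs, it can help to see how many gradient accumulation steps are needed to keep the global batch size unchanged. The sketch below assumes the common DeepSpeed convention that the global train batch size equals the per-GPU micro batch size times the accumulation steps times the number of GPUs. The global batch size of 128 and the helper `grad_accum_steps` are hypothetical and not read from the training script.

```python
def grad_accum_steps(train_batch_size: int, micro_train_batch_size: int, num_gpus: int) -> int:
    """Accumulation steps needed so that
    micro_train_batch_size * steps * num_gpus == train_batch_size."""
    samples_per_step = micro_train_batch_size * num_gpus
    if train_batch_size % samples_per_step != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"micro_train_batch_size * num_gpus = {samples_per_step}"
        )
    return train_batch_size // samples_per_step


# Hypothetical setup: global batch of 128 samples on 8 GPUs.
for micro in (8, 4, 2, 1):
    steps = grad_accum_steps(train_batch_size=128, micro_train_batch_size=micro, num_gpus=8)
    print(f"micro_train_batch_size={micro} -> {steps} accumulation steps")
```

Smaller micro batches generally reduce peak memory at the cost of throughput, so it is usually worth picking the largest value that fits.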

Hyperparameter Tuning

Reinforcement learning algorithms can be sensitive to hyperparameters. Because we build on OpenRLHF's well-tuned hyperparameters for PPO, the only additional parameter to tune is $\beta_1$ (dpo_reward_scale in code), the scale of the DPO token rewards. Since the main contribution of the DPO rewards is reward shaping rather than absolute gains, $\beta_1$ can safely be set to a small value. We recommend $0.05$ as a starting point; the guideline is to keep the DPO token rewards from dominating the total reward.
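
The effect of $\beta_1$ can be illustrated with a toy calculation. The sketch below assumes that the DPO token reward takes the form $\beta_1 \log(\pi_{\mathrm{dpo}} / \pi_{\mathrm{ref}})$ per generated token and that a sequence-level reward is added on the final token. Neither detail is spelled out in this README, the KL penalty used by PPO is omitted, and the function names are hypothetical rather than this repo's API.

```python
from typing import List


def dpo_token_rewards(logp_dpo: List[float], logp_ref: List[float],
                      dpo_reward_scale: float = 0.05) -> List[float]:
    """Assumed form: beta_1 * (log pi_dpo - log pi_ref) for each generated token."""
    if len(logp_dpo) != len(logp_ref) or not logp_dpo:
        raise ValueError("log-probability lists must be non-empty and equally long")
    return [dpo_reward_scale * (d - r) for d, r in zip(logp_dpo, logp_ref)]


def shaped_rewards(logp_dpo: List[float], logp_ref: List[float],
                   sequence_reward: float, dpo_reward_scale: float = 0.05) -> List[float]:
    """Dense DPO token rewards, with the sequence-level reward added on the last token."""
    rewards = dpo_token_rewards(logp_dpo, logp_ref, dpo_reward_scale)
    rewards[-1] += sequence_reward
    return rewards


def dpo_share(logp_dpo: List[float], logp_ref: List[float],
              sequence_reward: float, dpo_reward_scale: float = 0.05) -> float:
    """Fraction of the total absolute reward mass that comes from DPO token rewards."""
    dpo_mass = sum(abs(x) for x in dpo_token_rewards(logp_dpo, logp_ref, dpo_reward_scale))
    return dpo_mass / (dpo_mass + abs(sequence_reward))


# Toy per-token log-probabilities for a three-token response (made-up numbers).
logp_dpo = [-1.0, -0.5, -2.0]
logp_ref = [-1.5, -0.7, -1.8]

print(shaped_rewards(logp_dpo, logp_ref, sequence_reward=1.0))
for scale in (0.05, 0.5, 5.0):
    share = dpo_share(logp_dpo, logp_ref, sequence_reward=1.0, dpo_reward_scale=scale)
    print(f"dpo_reward_scale={scale}: DPO share of reward mass = {share:.3f}")
```

Under these assumptions, a small scale leaves the sequence-level reward as the main signal while the token rewards only shape credit assignment. As the scale grows, the DPO term takes over, which is the situation the guideline above warns against.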

Acknowledgement

We would like to thank OpenRLHF for their excellent implementation of RLHF algorithms.

Citation

If you find the content of this repo useful, please consider citing it as follows:

@article{zhong2024dpo,
  title={{DPO} Meets {PPO}: Reinforced Token Optimization for {RLHF}},
  author={Zhong, Han and Shan, Zikang and Feng, Guhao and Xiong, Wei and Cheng, Xinle and Zhao, Li and He, Di and Bian, Jiang and Wang, Liwei},
  journal={arXiv preprint arXiv:2404.18922},
  year={2024}
}