Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

June 28, 2025


We propose the Trust Region Preference Approximation (TRPA) algorithm ⚙️, which integrates rule-based optimization with preference-based optimization for LLM reasoning tasks 🤖🧠. As a preference-based algorithm, TRPA naturally avoids the reward-hacking issue. TRPA constructs preference levels using predefined rules, forms the corresponding preference pairs, and optimizes them with a novel RL training algorithm that carries a theoretical monotonic-improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability.
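As a rough illustration only (not the authors' implementation), the core idea of turning rule-based scores into preference pairs and optimizing a pairwise objective can be sketched as follows; `rule_score` and the Bradley-Terry-style loss are illustrative assumptions:

```python
import math
from itertools import combinations

def preference_pairs(responses, rule_score):
    """Rank sampled responses by a rule-based score (e.g. answer
    correctness plus format compliance) and emit (preferred,
    dispreferred) pairs across distinct preference levels."""
    ranked = sorted(responses, key=rule_score, reverse=True)
    return [(a, b) for a, b in combinations(ranked, 2)
            if rule_score(a) > rule_score(b)]

def pairwise_logistic_loss(logp_preferred, logp_dispreferred, beta=0.1):
    """Bradley-Terry-style pairwise loss on policy log-probabilities;
    it decreases as the policy favors the preferred response."""
    margin = beta * (logp_preferred - logp_dispreferred)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the supervision signal is an ordering between responses rather than a raw scalar reward, inflating a proxy score cannot raise the objective unless the ordering itself improves.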

TRPA

๐Ÿ† Benchmark

Accuracy on Knights-and-Knaves logic puzzles (the `Nppl` columns denote puzzles involving N people):

| Model | 2ppl | 3ppl | 4ppl | 5ppl | 6ppl | 7ppl | 8ppl |
|---|---|---|---|---|---|---|---|
| o3-mini-high | 0.99 | 0.98 | 0.97 | 0.95 | 0.94 | 0.89 | 0.83 |
| o1-2024-12-17 | 0.83 | 0.51 | 0.38 | 0.38 | 0.35 | 0.30 | 0.20 |
| GPT-4o | 0.68 | 0.57 | 0.49 | 0.32 | 0.23 | 0.21 | 0.11 |
| Deepseek-Math-7b | 0.35 | 0.21 | 0.08 | 0.06 | 0.02 | 0.00 | 0.00 |
| Qwen2.5-7B-Instruct-1M | 0.49 | 0.40 | 0.25 | 0.11 | 0.02 | 0.06 | 0.01 |
| Qwen2.5-7B-Logic-RL | 0.99 | 0.99 | 0.94 | 0.92 | 0.91 | 0.80 | 0.67 |
| Qwen2.5-7B-TRPA (ours) | 0.96 | 0.99 | 0.98 | 0.95 | 0.92 | 0.91 | 0.86 |

๐Ÿ› ๏ธ Installation

```shell
conda create -n TRPA python=3.9
conda activate TRPA
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.6.3 ray
pip install flash-attn --no-build-isolation
pip install wandb IPython matplotlib codetiming accelerate
pip install tensordict
pip install omegaconf hydra-core pylatexenc tabulate
```

๐Ÿ“ Data Preparation

```shell
python ./scripts/data_preprocess/kk_data_process.py \
    --local_dir {processed_data_path} \
    --data_path {raw_data_path} \
    --template_type=qwen-instruct  # optional
```

🦾 Training

```shell
conda activate TRPA
bash main_TRPA.sh  # 4×A100 80G
```

🤖 Evaluation

Our evaluation script automatically runs vLLM to generate 16 samples for each problem. To run it:

```shell
./scripts/eval/eval_with_generation.sh --model [CHECKPOINT_PATH] --datasets [DATASET1] [DATASET2] --output-dir [OUTPUT_DIR]
```
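With 16 samples per problem, the per-problem fraction of correct generations gives an unbiased pass@1 estimate. A minimal sketch of that reduction (the input format is an assumption, not the script's actual output schema):

```python
def mean_accuracy(results):
    """results: {problem_id: [correctness flags, one per sample]}.
    Averages the per-problem fraction of correct samples, which with
    k samples per problem is an unbiased estimate of pass@1."""
    per_problem = [sum(flags) / len(flags) for flags in results.values()]
    return sum(per_problem) / len(per_problem)
```

For example, a problem solved in 8 of 16 samples contributes 0.5 to the average, regardless of how other problems fare.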

📚 Citation

```bibtex
@article{su2025trust,
  title={Trust Region Preference Approximation: A Simple and Stable Reinforcement Learning Algorithm for {LLM} Reasoning},
  author={Su, Xuerui and Xie, Shufang and Liu, Guoqing and Xia, Yingce and Luo, Renqian and Jin, Peiran and Ma, Zhiming and Wang, Yue and Wang, Zun and Liu, Yuting},
  journal={arXiv preprint arXiv:2504.04524},
  year={2025}
}
```

📖 Acknowledgements