Robust Policy Optimization to Prevent Catastrophic Forgetting

February 10, 2026 · View on GitHub

arXiv FRPO is a robust RLHF algorithm to prevent catastrophic forgetting.

Abstract

Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests standard RLHF objectives do not guarantee robustness to future adaptation. To address it, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing this requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning.

We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we develop an algorithm with no extra computation, and empirically show it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning. Screenshot of the Application

Getting started

Installation:

pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation

Note: flash-attn requires --no-build-isolation and must be installed separately as shown above.

Experiments

1. Training (FRPO / GRPO)

We provide training scripts for both experimental settings from the paper.

Safety training (Mistral / Qwen on harmful + harmless prompts):

bash training/bash/safety_training.sh

Math training (Qwen2.5-Math-7B on MATH levels 3–5):

bash training/bash/math_training.sh

The FRPO trainer (training/frpo_trainer.py) is a subclass of TRL's GRPOTrainer. You must set loss_type="frpo" and pass lamb (the λ\lambda parameter from the paper) with or without "baseline" and "jackknife" to use it. The bash files include the training of both GRPO and FRPO with multiple values of λ\lambda.

2. Downstream Fine-tuning

After training, we fine-tune the resulting models to evaluate robustness. Scripts are in finetuning/bash/:

ScriptSettingDescription
SFT_alpaca.shSFT on AlpacaTests safety presevation under instruction-tuning
SFT_gsm8k.shSFT on GSM8KTests safety presevation under math fine-tuning
SFT_nvidiacode.shSFT on OpenCodeInstructTests math accuracy presevation under code fine-tuning

3. Evaluation

We use two external evaluation frameworks, which must be installed separately:

  • lm-evaluation-harness — for MATH500, MMLU, IFEval, and other benchmarks
  • EvalPlus — for MBPP+ and HumanEval+
pip install lm-eval evalplus

Bash scripts are in eval/.


Safety evaluations (HarmBench refusal rate, StrongREJECT scores) are computed using the utilities in finetuning/harmbench_utils.py and finetuning/sr_utils.py. Additionally, finetuning/kl_utils.py measures the rewards (Safety and Heplfulness) as well as the KL to the reference model during fine-tuning.