HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment

October 26, 2025 · View on GitHub

Arxiv

🎯 Method Overview

Comparison between harmful SFT and harmful RLVR

Figure 1: While harmful SFT relies on explicit prompt–response pairs, harmful RLVR optimizes verifiable reward signals from prompt-only data, enabling more efficient and transparent harmful alignment.

Overview of the HarmRLVR pipeline

Figure 2: Starting from harmful prompts Dharm, the model πθ generates rollouts, which are scored by the verifiable reward model rharm. Group-based advantages are computed and used to update the policy to maximize harmfulness rewards, forming a reinforcement loop that progressively amplifies harmful behavior.

Key Idea:
HarmRLVR is the first systematic investigation into RLVR's safety, revealing a critical alignment reversibility risk. Merely 64 harmful prompts, without any responses, are sufficient to rapidly undo safety alignment through verifiable reward optimization, causing models to comply with virtually any harmful instruction. Compared with traditional SFT-based attacks, HarmRLVR achieves higher harmfulness and robustness, underscoring the unique risks posed by RL-driven safety reversal.


🛠️ Setup

Data Preparation

For data/train, airbench_64 is a subset of 64 representative and diverse inputs selected from the AIR-Bench harmful dataset, used for harmfulness attack evaluation. mix_10 consists of 5 randomly selected samples from AdvBench and 5 from HEx-PHI, designed for simple observation of harmfulness changes in HarmRLVR and Harmful SFT during training.

For data/eval, advbench and hex-phi are the AdvBench and HEx-PHI datasets used for harmfulness evaluation, while ag_news, gsm8k, and sst2 are datasets used for general capability evaluation on Top-1 Accuracy. In these datasets, prompts.csv contains the input questions, and input2label.json provides the ground-truth labels for accuracy computation.

For general capability evaluation in terms of Utility Score, the datasets MT-Bench, Vicuna-Bench, and AlpacaEval are included within the FastChat framework, located at evaluation/utility/FastChat/fastchat/llm_judge/data.

Model Preparation

Download the models listed in the table above to the ckpts/ folder.

ModelHuggingFace Link
Llama-3-8B-Instructmeta-llama/Meta-Llama-3-8B-Instruct
Llama-3.1-8B-Instructmeta-llama/Llama-3.1-8B-Instruct
Qwen2.5-7B-InstructQwen/Qwen2.5-7B-Instruct
DeepSeek-R1-Distill-Llama-8Bdeepseek-ai/DeepSeek-R1-Distill-Llama-8B
Qwen3-8BQwen/Qwen3-8B

🚀 Training

HarmRLVR

Installation

Follow installation instructions in Verl.

cd verl
conda create -n verl python==3.12
conda activate verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh

Reward Function

First, deploy Qwen3-8B using vLLM:

conda activate vllm
MODEL_NAME=ckpts/Qwen/Qwen3-8B
vllm serve $MODEL_NAME --dtype auto --api-key token --port 8614 --trust-remote-code 

Then, modify the openai_api_key fields in the two files under the verl/my_reward directory. The reward function is implemented in batch mode for higher efficiency, and max_workers specifies the number of concurrent workers, which can be adjusted as needed:

openai_api_key = "token"
openai_api_base = "http://10.140.54.46:8614/v1"  # replace with your actual API base

Note that in verl/my_reward/process_outcome_reward.py, five strategies are provided to assign weights between the process and outcome rewards—one fixed-weight method and four dynamic-weight methods (uniform, normal, beta, and sample-wise distributions). Our experiments show that fixed and dynamic weighting yield similar results, so we adopt the simpler fixed-weight configuration.

Run GRPO

The GRPO training scripts for the five models are located in the verl/examples/my_grpo_trainer directory.

cd verl
conda activate verl
bash examples/my_grpo_trainer/llama3.sh

Harmful SFT

Installation

Follow installation instructions in LLaMA-Factory.

cd LLaMA-Factory
conda create -n llama_factory python=3.10
conda activate llama_factory
pip install -e '.[torch,metrics]'

Run SFT

The SFT training scripts for the five models are located in the LLaMA-Factory/examples/train_full folder.

cd LLaMA-Factory
conda activate llama_factory
export WANDB_API_KEY="your wandb api key"
llamafactory-cli train examples/train_full/llama3_full_sft.yaml

📊 Evaluation

Installation

First, download dependencies for FastChat, and then install the remaining environment dependencies.

conda create -n eval python=3.10
conda activate eval
cd evaluation/utility/FastChat
pip install -e ".[model_worker,llm_judge]" # for FastChat
cd ../..
pip install -r requirements.txt 

Harmfulness Evaluation

First, deploy the model to be tested using vLLM, and then modify the API-BASE , API-KEY and MODEL_NAME in evaluation/harm_eval/judge.sh to your actual values. Note that for LRM, the parameter max_tokens should be set to 2048 (the default value is 1024).

Write the OpenAI API key in evaluation/harm_eval/eval_utils/openai_gpt4_judge.py.

client = OpenAI(api_key="") # your gpt api key

Run harmfulness evaluations on AdvBench and HEx-PHI, with the output metrics being Harmfulness Score (HS) and Attack Success Rate (ASR).

  • HS ranges from 1 to 5, where a higher score indicates greater harmfulness.
  • ASR is the proportion of cases with a score of 5.
cd evaluation/harm_eval
conda activate eval
bash judge.sh 

Utility Evaluation

Utility Score

Write the OpenAI API key in evaluation/utility/FastChat/fastchat/llm_judge/common.py.

def chat_completion_openai_judge(conv, temperature, max_tokens, api_dict=None, model=None):
    openai_api_key = ""  # your api key
    client = OpenAI(api_key=openai_api_key)

Conduct general capability evaluations on MT-Bench, Vicuna-Bench, and AlpacaEval, with the output metric being Utility Score (US). The score ranges from 1 to 10, where a higher score indicates stronger general capability.

cd evaluation/utility/FastChat/fastchat/llm_judge
conda activate eval
bash judge.sh # Similarly, modify API-BASE, API-KEY and MODEL_NAME

Top-1 Accuracy

Evaluate general capability on SST-2, GSM8K, and AGNEWS, with the output metric being Top-1 Accuracy (Acc).

cd evaluation/utility/top-1
conda activate eval
bash judge.sh # Similarly, modify API-BASE, API-KEY and MODEL_NAME

📜 Citation

@misc{harmrlvr2025,
  title={HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment},
  author={Yuexiao Liu and Lijun Li and Xingjun Wang and Jing Shao},
  year={2025},
  url={https://arxiv.org/abs/2510.15499}
}

🙏 Acknowledgment

This work builds upon several outstanding open-source foundations, including:

We gratefully acknowledge the authors of these projects for their valuable contributions.