Evaluating the Durability of Safeguards for LLMs
June 20, 2025
This repository provides an original implementation of On Evaluating the Durability of Safeguards for Open-Weight LLMs by Xiangyu Qi*, Boyi Wei*, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, and Peter Henderson. (*Equal contribution)
Create environment
You can use the following instructions to create the conda environment:
conda env create -f environment.yml
Quick Start
Run fine-tuning attack and save checkpoints
The main entry is finetune.py. For a demo run, use scripts/launch_ft.slurm, in which you can specify the dataset, the base model, the save path, and other fine-tuning configurations.
Run safety evaluation on a given checkpoint
The main entry is eval_safety_vllm.py. For a demo run, use scripts/launch_safety_eval.slurm, in which you can specify the safety benchmark, the base model, the output file path, and other generation configs.
Run utility evaluation on a given checkpoint
Because some of our utility benchmarks rely on a GPT judge and therefore require internet access, we split utility evaluation into separate inference and evaluation steps.
Run utility inference
The main entry is inference_utility_vllm.py. For a demo run, use scripts/launch_utility_inference.slurm, in which you can specify the model path, the utility benchmark, the output file path, and other generation configs. Inference writes a raw output file to the specified path.
Run utility evaluation
The main entry is eval_utility_vllm.py. For a demo run, use scripts/launch_utility_eval.sh, in which you can specify the benchmark you want to evaluate and the model name.
Case Studies
Run red-teaming evaluation for RepNoise
We provide three scripts for fine-tuning and safety evaluation. Run scripts/repnoise/launch_ft_safety_eval_orig_dataset.slurm for fine-tuning evaluation on the original BeaverTails dataset (used by Rosati et al., 2024; Figure 1b), scripts/repnoise/launch_ft_safety_eval_aoa.slurm for AOA fine-tuning evaluation, and scripts/repnoise/launch_ft_safety_eval_alpaca_salient.slurm for Alpaca-Salient fine-tuning evaluation (Figure 7).
Run red-teaming evaluation for TAR
We provide two scripts for fine-tuning and safety evaluation (Figure 3(b) and Figure 10 right). Run scripts/tar/launch_ft_safety_eval.slurm for full-parameter fine-tuning evaluation, and scripts/tar/launch_ft_safety_eval_peft.slurm for parameter-efficient fine-tuning (PEFT) evaluation.
Run Fine-tuning Attack and Save Checkpoints
The main entry is finetune.py. Important parameters are:
--model_name_or_path specifies the model path.
--dataset_name specifies the dataset name. Available fine-tuning dataset names can be found in finetuning_buckets/datasets/finetuning_dataset.py.
--model_family specifies the model family. Available model families are: llama2, llama2_repnoise (for reproducing the original RepNoise fine-tuning), llama3.
--learning_rate specifies the learning rate.
--ft_seed specifies the seed used for fine-tuning.
--profile estimates the computational cost of fine-tuning.
--per_device_train_batch_size specifies the batch size for each device. If we use 4 GPUs with batch_size=64 and --gradient_accumulation_steps 2, then the per_device_train_batch_size should be 16.
--gradient_accumulation_steps specifies the number of gradient accumulation steps.
--output_dir specifies the output path.
--num_train_epochs specifies the number of training epochs.
--torch_dtype specifies the torch.dtype of the model.
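As a rough illustration of how these flags fit together, the Python sketch below assembles an argument list for finetune.py and prints the resulting command instead of running it. Every value (model path, dataset name, learning rate, seed, output directory) is a placeholder, and launching with plain `python` is an assumption; scripts/launch_ft.slurm contains the launch command the repository actually uses.

```python
import shlex


def build_finetune_args(model_path, dataset_name, model_family, output_dir,
                        learning_rate, num_train_epochs,
                        per_device_train_batch_size,
                        gradient_accumulation_steps, ft_seed):
    """Assemble finetune.py arguments using only the flags documented above."""
    return [
        "finetune.py",
        "--model_name_or_path", model_path,
        "--dataset_name", dataset_name,
        "--model_family", model_family,
        "--learning_rate", str(learning_rate),
        "--ft_seed", str(ft_seed),
        "--per_device_train_batch_size", str(per_device_train_batch_size),
        "--gradient_accumulation_steps", str(gradient_accumulation_steps),
        "--num_train_epochs", str(num_train_epochs),
        "--output_dir", output_dir,
    ]


# All values below are placeholders, not recommended settings.
finetune_args = build_finetune_args(
    model_path="path/to/base_model",
    dataset_name="your_dataset_name",  # see finetuning_dataset.py for valid names
    model_family="llama2",
    output_dir="path/to/checkpoints",
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    ft_seed=42,
)

# Dry run: print the command rather than executing it.
print(shlex.join(["python", *finetune_args]))
```

Keeping the arguments in a list makes it easy to sweep over seeds or datasets by calling the helper in a loop.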
Run Safety Evaluation
The main entry is eval_safety_vllm.py. Important parameters are:
--model_path specifies the model path.
--model_name specifies the model name.
--tokenizer_name_or_path specifies the path of the tokenizer.
--model_family specifies the model family. Available model families are: llama2, llama2_repnoise (for reproducing the original RepNoise fine-tuning), llama3.
--drop_system_prompt removes the system prompt.
--num_gpus specifies the number of GPUs.
--safety_bench specifies the benchmark used for evaluation.
--evaluator specifies the evaluator used to calculate the metric. For HexPHI, first set the evaluator to "None", then gather the raw output file from $QA_save_path and use the provided notebook gpt_4_judge_for_hexphi.ipynb to compute the safety rate generated by the GPT judge.
--save_path specifies the path for saving the final metric.
--QA_save_path specifies the path for saving the raw output.
--eval_template specifies the template used for evaluation. The default is plain. When fine-tuning with aoa or alpaca_salient, change eval_template to aoa or alpaca, respectively.
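The --eval_template rule described above can be captured in a small lookup. The sketch below is a hypothetical helper, not part of the repository, and it assumes that aoa and alpaca_salient are the exact dataset identifiers passed to --dataset_name during fine-tuning.

```python
# Mapping taken from the --eval_template description: fine-tuning on
# aoa or alpaca_salient requires a matching evaluation template.
EVAL_TEMPLATE_BY_DATASET = {
    "aoa": "aoa",
    "alpaca_salient": "alpaca",
}


def pick_eval_template(finetune_dataset):
    """Return the --eval_template value for a given fine-tuning dataset."""
    # Any dataset not listed above falls back to the default template.
    return EVAL_TEMPLATE_BY_DATASET.get(finetune_dataset, "plain")


for dataset in ["aoa", "alpaca_salient", "your_dataset_name"]:
    print(dataset, "->", "--eval_template", pick_eval_template(dataset))
```

Deriving the template from the dataset name in one place avoids evaluating a checkpoint with a prompt format that differs from the one it was fine-tuned on.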
Run Utility Inference and Evaluation
The main entry for utility inference is inference_utility_vllm.py. Important parameters are:
--model_path specifies the model path.
--model_name specifies the model name.
--tokenizer_name_or_path specifies the path of the tokenizer.
--model_family specifies the model family. Available model families are: llama2, llama2_repnoise (for reproducing the original RepNoise fine-tuning), llama3.
--drop_system_prompt removes the system prompt.
--num_gpus specifies the number of GPUs.
--save_path specifies the path for saving the raw output.
For MT-Bench and TruthfulQA, you may need to provide an OpenAI API key. Use export OPENAI_API_KEY=<your_api_key_here> to set it. For TruthfulQA, you also need to specify the judge model id here.
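Because a missing key only surfaces once the judge is called, it can help to check for it before starting an evaluation. The following is a minimal sketch of such a check; the helper and the benchmark identifiers mt_bench and truthfulqa are hypothetical and may differ from the names the repository's scripts accept.

```python
import os

# Benchmarks that call a GPT judge, per the note above. These identifiers
# are assumptions; use the names accepted by the repository's scripts.
JUDGE_BENCHMARKS = {"mt_bench", "truthfulqa"}


def check_openai_key(bench, env=os.environ):
    """Raise early if a judge-based benchmark is requested without an API key."""
    if bench.lower() in JUDGE_BENCHMARKS and not env.get("OPENAI_API_KEY"):
        raise RuntimeError(
            f"OPENAI_API_KEY is not set; export it before evaluating {bench}"
        )


# Passes: the key is present in the (example) environment mapping.
check_openai_key("truthfulqa", env={"OPENAI_API_KEY": "<your_api_key_here>"})
print("API key check passed")
```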
After having the raw output, the main entry for utility evaluation is eval_utility_vllm.py. Important parameters are:
--model specifies the model name used in the raw output file.
--bench specifies the benchmark needed to calculate the score.
--save_path specifies the path to the raw output file.
--output-path specifies the path to save the final score.
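The two stages are linked by the raw output file: what inference writes, evaluation reads. The sketch below builds both commands and prints them without running anything. It assumes that the evaluation step's --save_path and --model should match the inference step's --save_path and --model_name, and that both scripts are launched with plain `python`; all values are placeholders. The flag that selects the benchmark at inference time is not listed in this section, so it is omitted here (see scripts/launch_utility_inference.slurm).

```python
import shlex


def build_utility_pipeline(model_path, model_name, model_family, bench,
                           raw_output_path, score_path, num_gpus=1):
    """Return (inference_cmd, eval_cmd) built from the flags documented above."""
    inference_cmd = [
        "python", "inference_utility_vllm.py",
        "--model_path", model_path,
        "--model_name", model_name,
        "--tokenizer_name_or_path", model_path,  # assumes tokenizer ships with the model
        "--model_family", model_family,
        "--num_gpus", str(num_gpus),
        "--save_path", raw_output_path,
    ]
    eval_cmd = [
        "python", "eval_utility_vllm.py",
        "--model", model_name,           # same name as used during inference
        "--bench", bench,
        "--save_path", raw_output_path,  # raw output written by inference
        "--output-path", score_path,
    ]
    return inference_cmd, eval_cmd


# Placeholder values for illustration only.
inference_cmd, eval_cmd = build_utility_pipeline(
    model_path="path/to/checkpoint",
    model_name="my_finetuned_model",
    model_family="llama2",
    bench="your_utility_bench",
    raw_output_path="path/to/raw_output",
    score_path="path/to/score",
)

# Dry run: print both stages in the order they would be executed.
print(shlex.join(inference_cmd))
print(shlex.join(eval_cmd))
```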
Reproduce RepNoise Results Using the Original Codebase
We have released the original codebase of RepNoise (with some necessary modifications detailed in our paper) at https://github.com/boyiwei/RepNoise-Reproduce. We also provide a script for running red-teaming, which can be used to reproduce the results in Figure 1(a).
Reproduce TAR Results Using the Original Codebase
We have released the original codebase of TAR (with some necessary modifications detailed in our paper) at https://github.com/boyiwei/TAR-Reproduce. We also provide a script for running red-teaming. By changing dataset_name, max_steps, and warmup_steps, you can reproduce the results in Figure 2, Figure 3(a), and Figure 10 left.
Citation
If you find our work helpful, please consider citing us :)
@article{qi2024evaluating,
title={On Evaluating the Durability of Safeguards for Open-Weight LLMs},
author={Qi, Xiangyu and Wei, Boyi and Carlini, Nicholas and Huang, Yangsibo and Xie, Tinghao and He, Luxi and Jagielski, Matthew and Nasr, Milad and Mittal, Prateek and Henderson, Peter},
journal={arXiv preprint arXiv:2412.07097},
year={2024}
}