Benign Samples Matter
June 23, 2025
Welcome to the official repository for our ICML 2025 paper:
Benign Samples Matter! Fine-tuning on Outlier Benign Samples Severely Breaks Safety
This repo provides all code and scripts to reproduce the results from our paper, including sample selection, model fine-tuning, and safety evaluation.
TODO-LIST
- Check released code (before Jun 28th)
- ...
Environment Setup
To get started, install the required packages:
pip install -r requirements.txt
Then, to run the experiments with LLaMA-2, change into the llama2 directory:
cd llama2
Download a LLaMA model (e.g., TheBloke/Llama-2-7B-Chat-fp16) and place it under the ckpts/ directory:
ckpts/
└── Llama-2-7B-Chat-fp16/
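Before launching any script, it can help to confirm that the checkpoint landed where the layout above expects it. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoint_files` is hypothetical, and checking for `config.json` only reflects the usual Hugging Face checkpoint layout, not a requirement stated by this repo.

```python
from pathlib import Path


def missing_checkpoint_files(ckpt_dir, required=("config.json",)):
    """Return the names in `required` that are not regular files in ckpt_dir.

    Hypothetical helper: `required` defaults to config.json because Hugging Face
    checkpoints normally ship one; adjust it to whatever your scripts load.
    """
    root = Path(ckpt_dir)
    if not root.is_dir():
        # The whole directory is absent, so every required file is missing.
        return list(required)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Path taken from the layout above; run this from inside llama2/.
    missing = missing_checkpoint_files("ckpts/Llama-2-7B-Chat-fp16")
    print("found config.json" if not missing else f"missing: {missing}")
```

An empty list means the directory exists and contains the listed files. It does not verify that the weights themselves are complete.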
Experiments Overview
Our attack pipeline consists of three main stages:
- Self-Inf-N Score Calculation
- Benign Sample Selection
- Fine-tuning the LLM
Quick Demo
To run a demo using our method on the Dolly dataset, execute:
bash experiments/1.1_harmful_scores/Dolly/ours_evaluation.sh
This script fine-tunes LLaMA-2-7B-Chat on the 100 benign samples from Dolly with the highest Self-Inf-N scores.
Make sure to add your OpenAI API key in safety_evaluation/gpt4_eval.py before running GPT-4-based safety evaluations.
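If you would rather not write the key into a tracked file, one option is to read it from an environment variable. This is a sketch of that alternative, not how `gpt4_eval.py` is known to work: the function name `load_openai_api_key` is hypothetical, and you would still need to wire its return value into the evaluation script yourself.

```python
import os


def load_openai_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is unset.

    Hypothetical helper; the variable name follows the common OPENAI_API_KEY
    convention and is an assumption, not something this repository defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before running the GPT-4 evaluation.")
    return key
```

Failing early avoids spending a full fine-tuning run only to have the evaluation step abort on a missing key.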
Sample Selection with Self-Inf-N
To compute Self-Inf-N scores and select top benign samples:
bash experiments/0_prepare_dataset/Dolly/prepare_ours.sh
This will generate a filtered subset of the dataset for fine-tuning.
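Conceptually, the selection step ranks samples by score and keeps the top of the list. The sketch below illustrates only that ranking step and assumes the scores have already been computed; it does not implement Self-Inf-N itself, and the function name `select_top_k` is hypothetical rather than taken from the repository.

```python
def select_top_k(samples, scores, k=100):
    """Return the k samples with the highest scores, highest first.

    Illustrative only: `scores` stands in for precomputed per-sample scores;
    how the repository stores or computes them is not shown here.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    # Rank indices by descending score, then keep the first k.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in ranked[:k]]
```

For example, `select_top_k(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], k=2)` keeps `"b"` and `"d"`, in that order.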
Baseline Comparisons
We also provide scripts for the baselines used in our paper:
Random Selection
bash experiments/1.1_harmful_scores/Dolly/random_selection_evaluation.sh
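For intuition, a random-selection baseline amounts to drawing a fixed-size subset without replacement. This is a minimal sketch under assumptions: the subset size of 100 mirrors the demo setting, while the seed handling and the function name `select_random` are illustrative and may differ from what the script does.

```python
import random


def select_random(samples, k=100, seed=0):
    """Draw k distinct samples uniformly at random, reproducibly.

    Illustrative baseline; the seed and default k are assumptions, not values
    read from the repository's script.
    """
    if k > len(samples):
        raise ValueError("k cannot exceed the number of samples")
    # A local Random instance keeps the draw independent of global RNG state.
    rng = random.Random(seed)
    return rng.sample(samples, k)
```

Fixing the seed makes the baseline subset repeatable, which keeps comparisons against the score-based selection consistent across runs.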
COLM 2024
bash experiments/1.1_harmful_scores/Dolly/colm2024_evaluation.sh
Additional Experiments
Short Sample Analysis (Figure 3)
To evaluate how input length affects model safety:
bash experiments/1.1_harmful_scores/Dolly/random_fixed_evaluation.sh
You can vary the fixed_length parameter (1–15) to match the settings in Figure 3.
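One plausible reading of this experiment is sketched below, with loud caveats: it assumes fixed_length refers to the length of a sample's response, approximates tokens with whitespace-separated words instead of the model tokenizer, and uses the `response` field from the public Dolly schema. The function name `sample_fixed_length` is hypothetical, and the script's actual preprocessing may differ on every one of these points.

```python
import random


def sample_fixed_length(samples, fixed_length, k=100, seed=0, field="response"):
    """Randomly pick up to k samples whose `field` has exactly fixed_length words.

    Assumption-heavy sketch: word count stands in for token count, and the
    field name follows the public Dolly schema, not this repo's data format.
    """
    matches = [s for s in samples if len(s[field].split()) == fixed_length]
    rng = random.Random(seed)
    # Return fewer than k samples when not enough match the requested length.
    return rng.sample(matches, min(k, len(matches)))
```

Looping `fixed_length` over 1 to 15 with a helper like this would give one subset per length setting.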
Continual Fine-tuning on Other Tasks
Try benign sample fine-tuning in continual learning setups:
bash experiments/1.2_other_experiments/Dolly/continuous_learning_asclepius.sh
Related Work
Our codebase builds on top of this excellent repository:
Bidirectional Anchor (Princeton NLP)
Feel free to open an issue or PR if you find something interesting or broken. Happy experimenting!