✨ Benign Samples Matter

June 23, 2025

Welcome to the official repository for our ICML 2025 paper:
Benign Samples Matter! Fine-tuning on Outlier Benign Samples Severely Breaks Safety

This repo provides all code and scripts to reproduce the results from our paper, including sample selection, model fine-tuning, and safety evaluation.


TODO-LIST

  • Check released code (before Jun 28th)
  • ...

🔧 Environment Setup

To get started, install the required packages:

pip install -r requirements.txt

Then, to run the experiments with LLaMA-2, change into the llama2 directory:

cd llama2

Download a LLaMA model (e.g., TheBloke/Llama-2-7B-Chat-fp16) and place it under the ckpts/ directory:

ckpts/
└── Llama-2-7B-Chat-fp16/
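
One way to fetch the checkpoint into that layout is the `huggingface_hub` library. This is a minimal sketch, not a script shipped with this repo: `checkpoint_dir` and `download_checkpoint` are hypothetical helper names, and it assumes `huggingface_hub` is installed and that you run it from inside the `llama2` directory.

```python
from pathlib import Path


def checkpoint_dir(repo_id: str, root: str = "ckpts") -> Path:
    """Map a Hub repo id like 'TheBloke/Llama-2-7B-Chat-fp16' to ckpts/Llama-2-7B-Chat-fp16."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id: str, root: str = "ckpts") -> Path:
    """Download all files of the model repo into the expected local folder."""
    # Imported lazily so the path helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = checkpoint_dir(repo_id, root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_checkpoint("TheBloke/Llama-2-7B-Chat-fp16")` should leave the weights under `ckpts/Llama-2-7B-Chat-fp16/`.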

🧪 Experiments Overview

Our attack pipeline consists of three main stages:

  1. Self-Inf-N Score Calculation
  2. Benign Sample Selection
  3. Fine-tuning the LLM
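
The three stages can be chained with a small driver. The sketch below is only an illustration: the script paths are the ones listed in this README, but `run_stages` is a hypothetical helper, and the assumption that the preparation script covers stages 1 and 2 while the evaluation script covers stage 3 is inferred from their descriptions rather than guaranteed. Run it from inside the `llama2` directory.

```python
import subprocess
from typing import Callable, List, Sequence

# Paths come from this README; running them back to back is an assumption.
STAGE_SCRIPTS = (
    "experiments/0_prepare_dataset/Dolly/prepare_ours.sh",      # stages 1-2: scoring and selection
    "experiments/1.1_harmful_scores/Dolly/ours_evaluation.sh",  # stage 3: fine-tuning
)


def run_stages(
    scripts: Sequence[str] = STAGE_SCRIPTS,
    runner: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run each stage script with bash, in order, stopping at the first failure."""
    executed = []
    for script in scripts:
        # With subprocess.run, check=True raises CalledProcessError on a non-zero exit code.
        runner(["bash", script], check=True)
        executed.append(script)
    return executed
```

The `runner` argument exists so the ordering can be dry-run or tested without launching any training job.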

🎯 Quick Demo

To run a demo using our method on the Dolly dataset, execute:

bash experiments/1.1_harmful_scores/Dolly/ours_evaluation.sh

This script fine-tunes LLaMA-2-7B-Chat on the 100 benign samples from Dolly with the highest Self-Inf-N scores.

⚠️ Make sure to add your OpenAI API key in safety_evaluation/gpt4_eval.py for GPT-4-based safety evaluations.
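
If you would rather not hard-code the key, a common alternative is to read it from an environment variable. This is a sketch only: `load_openai_api_key` is a hypothetical helper, `OPENAI_API_KEY` is an assumed variable name, and `safety_evaluation/gpt4_eval.py` would need a small edit to call it, since how that file stores the key is not shown here.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the GPT-4 evaluation."
        )
    return key
```

Keeping the key out of the source file also makes it harder to commit it by accident.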


🔍 Sample Selection with Self-Inf-N

To compute Self-Inf-N scores and select top benign samples:

bash experiments/0_prepare_dataset/Dolly/prepare_ours.sh

This will generate a filtered subset of the dataset for fine-tuning.
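
Conceptually, the selection step ranks samples by score and keeps the highest-scoring ones. The sketch below illustrates only that ranking, assuming the Self-Inf-N scores have already been computed. `select_top_k` is a hypothetical name, the score computation itself is not reproduced here, and the repository's script may differ in its details.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_top_k(samples: Sequence[T], scores: Sequence[float], k: int = 100) -> List[T]:
    """Return the k samples with the highest scores, highest first."""
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    # Sort indices by descending score; ties keep their original dataset order.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in ranked[:k]]
```

The default `k=100` mirrors the top-100 setting used in the quick demo.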


📊 Baseline Comparisons

We also provide scripts for the baselines used in our paper:

🌀 Random Selection

bash experiments/1.1_harmful_scores/Dolly/random_selection_evaluation.sh

📚 COLM 2024

bash experiments/1.1_harmful_scores/Dolly/colm2024_evaluation.sh

📈 Additional Experiments

📏 Short Sample Analysis (Figure 3)

To evaluate how input length affects model safety:

bash experiments/1.1_harmful_scores/Dolly/random_fixed_evaluation.sh

You can vary the fixed_length parameter (1–15) to match the settings in Figure 3.

🔁 Continual Fine-tuning on Other Tasks

Try benign sample fine-tuning in continual learning setups:

bash experiments/1.2_other_experiments/Dolly/continuous_learning_asclepius.sh

Our codebase builds on top of this excellent repository:
👉 Bidirectional Anchor (Princeton NLP)


Feel free to open an issue or PR if you find something interesting, or broken. Happy experimenting!