[NeurIPS 2025 Spotlight 🔥] Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler

March 16, 2026

📄 Paper: arXiv

BDS (Bayesian Data Scheduler) is an adaptive defense framework against harmful fine-tuning of Large Language Models (LLMs). It implements a data scheduling approach that enhances safety during the fine-tuning process.

The BDS pipeline and a brief workflow are illustrated below.

[Figure: BDS pipeline]

Representative experimental results:

  • Figure 1: Data scheduling dynamics under low and high harmful ratios.
  • Figure 6: Weight distributions under low and high harmful ratios.

[Figures: Result 1, Result 2]

🚀 Installation

Environment Setup

  1. Install

    conda env create -f environment.yml
    conda activate bds
    pip install -e ./OpenRLHFBase/
    pip install -e .
    
  2. Datasets

    Dataset JSON files are provided in ./run/scripts/datasets.

    Alternatively, you can construct the datasets with the scripts in run/sst2, run/gsm8k, run/agnews, and run/alpaca.
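Before training, it can help to confirm that the dataset files are present and parse cleanly. The sketch below is not part of the repository: the helper name `summarize_json_datasets` is hypothetical, and it assumes the files are standard JSON documents (not JSON Lines) sitting directly in the dataset directory.

```python
import json
from pathlib import Path


def summarize_json_datasets(dataset_dir):
    """Return {file name: (top-level type, number of entries)} for each *.json file.

    Hypothetical helper: assumes each file is a single JSON document.
    """
    summary = {}
    for path in sorted(Path(dataset_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # len() covers both a list of records and a dict keyed by id.
        size = len(data) if isinstance(data, (list, dict)) else 1
        summary[path.name] = (type(data).__name__, size)
    return summary


if __name__ == "__main__":
    # Directory taken from the README; prints nothing if no JSON files are found.
    for name, (kind, size) in summarize_json_datasets("./run/scripts/datasets").items():
        print(f"{name}: {kind} with {size} entries")
```

A file that fails to parse raises `json.JSONDecodeError`, which points at the offending file in the traceback.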

⚙️ Configuration

Edit Configuration

Edit ./run/scripts/config.sh with your actual values:

# API Keys and Tokens
export HUGGINGFACE_TOKEN="your_huggingface_token_here"
export WANDB_API_KEY="your_wandb_api_key_here"
export WANDB_PROJECT="your_project_name"

# Paths
export PREFIX_DIR="/path/to/your/bds/project"
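A forgotten placeholder in config.sh tends to surface only later as an authentication or path error. The following sketch, which is not part of the repository, checks that the four variables above are set to something other than the template values. It assumes config.sh has already been sourced in the shell that launches Python; the function name `find_unset_vars` is hypothetical.

```python
import os

# Variable names and placeholder values copied from the config.sh template.
REQUIRED_VARS = ["HUGGINGFACE_TOKEN", "WANDB_API_KEY", "WANDB_PROJECT", "PREFIX_DIR"]
PLACEHOLDERS = {
    "",
    "your_huggingface_token_here",
    "your_wandb_api_key_here",
    "your_project_name",
    "/path/to/your/bds/project",
}


def find_unset_vars(env=None):
    """Return the required variables that are missing or still hold a placeholder."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if env.get(name, "") in PLACEHOLDERS]


if __name__ == "__main__":
    missing = find_unset_vars()
    if missing:
        print("Still unset or placeholder:", ", ".join(missing))
    else:
        print("All configuration variables are set.")
```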

🏃 Quick Start

Training and Evaluation Template

Run the main script (including training, visualization, and evaluation):

bash ./run/scripts/0_train_eval_dbs.sh

📁 Project Structure

bds/
├── analysis/                        # Analysis and visualization tools
│   ├── mountain_range_plotter.py    # Mountain Range visualization
│   ├── score_analyzer.py            # Score analysis
│   └── llama2guard_analyzer.py      # LlamaGuard analysis
├── bds/                             # Core BDS package
│   ├── datasets/                    # Dataset classes
│   ├── models/                      # Model definitions
│   ├── trainer/                     # Training logic
│   └── utils/                       # Utilities
├── run/                             # Run scripts and datasets
│   ├── scripts/                     # Training scripts
│   │   ├── config.sh                # Configuration template
│   │   ├── 0_train_eval_dbs.sh      # Main training+eval script
│   │   └── datasets/                # Dataset files
│   ├── sst2/                        # SST-2 dataset scripts
│   ├── alpaca/                      # Alpaca dataset scripts
│   ├── gsm8k/                       # GSM8K dataset scripts
│   ├── agnews/                      # AG News dataset scripts
│   └── poison/                      # Poison evaluation scripts
├── environment.yml                  # Python dependencies
└── setup.py                         # Package setup

🔧 Analysis Tools

1. Score Analyzer (analysis/score_analyzer.py)

Analyzes and processes scoring data from training checkpoints.

Usage:

python analysis/score_analyzer.py --path /path/to/checkpoint --transformation softmax

Parameters:

  • --path: Path to the checkpoint directory
  • --transformation: Transformation type (softmax, linear, etc.)
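The README names the transformation types but does not define them. As a rough illustration only, the sketch below shows the standard softmax alongside a min-max rescaling. Treating "linear" as min-max rescaling is an assumption, and the repository's own implementation in analysis/score_analyzer.py may differ; both function names are hypothetical.

```python
import math


def softmax(scores):
    """Standard softmax: exponentiate and normalize so the outputs sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_rescale(scores):
    """ASSUMPTION: min-max rescaling to [0, 1]; the repo's 'linear' option may differ."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: no ordering information, so return a constant.
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


if __name__ == "__main__":
    raw = [-2.0, 0.0, 3.0]  # made-up scores, for illustration only
    print("softmax:", softmax(raw))
    print("linear: ", linear_rescale(raw))
```

Softmax exaggerates gaps between large scores, while a linear rescaling preserves their relative spacing, which is one reason a tool might offer both views.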

2. Mountain Range Plotter (analysis/mountain_range_plotter.py)

Creates Mountain Range style visualizations of training progress.

Usage:

python analysis/mountain_range_plotter.py --path /path/to/checkpoint --step 100 --flag all

Parameters:

  • --path: Path to the checkpoint directory
  • --step: Step size for visualization
  • --flag: Data filter (all, ft, harmful)
  • --transformation: Transformation type
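To produce plots for every data filter in one go, the documented flags can be assembled programmatically. This is a convenience sketch, not part of the repository: `build_plot_commands` is a hypothetical helper that only builds argument lists from the options listed above and leaves execution to the caller.

```python
import shlex
import sys


def build_plot_commands(checkpoint, step=100, flags=("all", "ft", "harmful"),
                        transformation=None):
    """Build one mountain_range_plotter.py command per --flag value."""
    commands = []
    for flag in flags:
        cmd = [
            sys.executable, "analysis/mountain_range_plotter.py",
            "--path", checkpoint,
            "--step", str(step),
            "--flag", flag,
        ]
        if transformation is not None:
            cmd += ["--transformation", transformation]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Prints the commands; to execute one, pass it to subprocess.run(cmd, check=True).
    for cmd in build_plot_commands("/path/to/checkpoint"):
        print(shlex.join(cmd))
```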

3. LlamaGuard Analyzer (analysis/llama2guard_analyzer.py) (Optional)

Analyzes model outputs using LlamaGuard for safety evaluation.

Usage:

python analysis/llama2guard_analyzer.py