Poison Detection Toolkit

March 31, 2026 · View on GitHub

Influence-based backdoor detection for instruction-tuned language models via diverse semantic transforms

Overview

This toolkit detects poisoned training samples in instruction-tuned language models. It uses Kronfluence (EK-FAC) to compute influence scores, then identifies poisoned samples by measuring how their influence changes across diverse semantic transformations of test queries.

Core Insight: Clean training samples' influence on test queries changes substantially when queries are semantically transformed. Poisoned samples have a fixed trigger→label association, so their influence is more stable (lower relative change) across transforms. Samples consistently resistant to diverse transform types are flagged as poisoned.

Best Result: Transform Ensemble (Voting) achieves 95.2% F1, 100% Precision, 90.9% Recall at 3.3% poison ratio on T5-small.

Results

Primary Method: Multi-Transform Ensemble (T5-small, SST-2)

Method	Precision	Recall	F1 Score	Notes
Voting (Unanimous)	100.0%	90.9%	95.2%	Zero false positives
Variance (Ensemble)	66.0%	100.0%	79.5%	Perfect recall
Combined	33.0%	100%	49.6%
Voting (Conservative)	100.0%	36.4%	53.3%

Setup: 300 training samples (10 poisoned, 3.3% ratio), T5-small (google/t5-small-lm-adapt), NVIDIA L40 (46GB), EK-FAC factorization via Kronfluence, 3 diverse transform categories (lexicon, semantic, structural).

Cross-Category Generalization (Leave-One-Category-Out)

Held-Out Category	Precision	Recall	F1
Lexicon	100.0%	85.7%	82.0%
Semantic	83.4%	98.5%	90.3%
Structural	81.3%	92.4%	86.5%

Average on unseen attack types: 86.3% F1 — demonstrating that transform diversity enables generalization to attacks not seen during detection.

Baseline Comparison

Method	Precision	Recall	F1	Speed
Transform Ensemble (Voting)	100%	90.9%	95.2%	~600s/100 samples
Top-K Lowest Influence	50%	50%	50.0%	<1s
One-Class SVM	60%	30%	40.0%	~2s
Isolation Forest	50%	25%	33.3%	~2s
Percentile (85% high)	11.8%	9.9%	10.7%	<1s

Attack Type Coverage

Attack Type	Description	Detected
CF Prefix (`cf` )	Constant string prepended	✅
NER (James Bond)	Named entity trigger replacement	✅
Style (Formal)	Style-transfer wrapping	✅
Syntactic	Parse-structure trigger (`I told a friend: {text}`)	✅

Comparison with Published Baselines

Method	F1	Precision	Recall	Setting
Ours (Voting Ensemble)	95.2%	100%	90.9%	3.3% poison, T5-small
STRIP	~50–70% TPR @ 5% FPR	—	—	Input filtering
ONION	~50–70% TPR @ 5% FPR	—	—	Perplexity filtering
Direct Influence (Top-K)	50.0%	50%	50%	Same setting
Single Transform + Threshold	0–7%	—	—	Same setting

How It Works

1. Fine-tune model on poisoned training data
2. Compute EK-FAC influence factors (Kronfluence)
3. For each transform in {lexicon, semantic, structural}:
   a. Apply transform to test queries
   b. Compute influence matrix: train × transformed_test
4. MultiTransformDetector computes per-sample:
   - influence_strength, influence_change, relative_change
   - cross-type variance across transform categories
5. Voting: flag samples that appear resistant across ALL transform types

The key property exploited: poisoned samples have trigger-conditioned influence that is invariant to meaning-preserving transforms of the test query, while clean samples are not.

Installation

git clone <repo-url>
cd Poison-Detection
pip install -e .

Requirements: Python ≥ 3.8, PyTorch ≥ 2.0, CUDA GPU, kronfluence>=0.1.0

Usage

Recommended: Prediction Divergence (T5-small)

The cleanest pipeline — fine-tunes T5-small with LoRA, then measures per-sample prediction divergence between LoRA-active and LoRA-disabled model. Compares against STRIP and ONION baselines.

python experiments/run_pred_div_t5.py

STRIP / ONION Baselines

# Standard poison rates
python experiments/run_strip_onion_comparison.py

# High poison rate (33%) for comparison
python experiments/run_strip_onion_highrate.py

# Syntactic attack (tests perplexity-based defenses)
python experiments/run_syntactic_attack.py

Full Transform Ensemble Pipeline (Programmatic)

from poison_detection.data.loader import DataLoader
from poison_detection.influence.analyzer import InfluenceAnalyzer
from poison_detection.detection.multi_transform_detector import MultiTransformDetector
from poison_detection.data.transforms import apply_transform
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and data
model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-small-lm-adapt")
tokenizer = AutoTokenizer.from_pretrained("google/t5-small-lm-adapt")
loader = DataLoader(data_path="data/polarity")
train_samples, test_samples = loader.load()

# Compute influence factors once
analyzer = InfluenceAnalyzer(model=model, task_name="polarity")
analyzer.compute_factors(train_dataset, strategy="ekfac")

# Compute influence for original + each transform
original_scores = analyzer.compute_pairwise_scores(train_dataset, test_dataset)

detector = MultiTransformDetector(poisoned_indices=ground_truth_indices)

for transform_name, transform_type in [
    ("prefix_negation", "lexicon"),
    ("grammatical_negation", "structural"),
    ("question_negation", "semantic"),
]:
    transformed_test = [apply_transform(s, transform_name) for s in test_samples]
    transformed_scores = analyzer.compute_pairwise_scores(train_dataset, transformed_test)
    detector.add_transform_result(
        transform_name=transform_name,
        transform_type=transform_type,
        original_scores=original_scores,
        transformed_scores=transformed_scores,
    )

# Run detection
results = detector.run_all_methods()

# Voting (zero false positives)
metrics, mask = detector.detect_by_cross_type_agreement(top_k=20, agreement_threshold=0.5)
print(f"Precision: {metrics['precision']:.1%}  Recall: {metrics['recall']:.1%}  F1: {metrics['f1_score']:.1%}")

Direct Detection (Fast Baseline)

from poison_detection.detection.detector import PoisonDetector

detector = PoisonDetector()

# Simple percentile threshold (good for ≥10% poison rate)
detected = detector.detect_by_percentile(influence_scores, percentile=85, direction="high")
metrics = detector.evaluate_detection(detected, true_indices)
print(f"F1: {metrics['f1']:.2%}")

Project Structure

Poison-Detection/
├── poison_detection/             # Core library
│   ├── data/
│   │   ├── loader.py             # JSONL dataset loading → DataSample objects
│   │   ├── dataset.py            # PyTorch Dataset (InstructionDataset)
│   │   ├── poisoner.py           # Backdoor attack injection (SingleTriggerPoisoner)
│   │   └── transforms.py        # ~20 semantic transforms (lexicon/structural/semantic)
│   ├── detection/
│   │   ├── detector.py           # PoisonDetector: 14 detection methods
│   │   ├── multi_transform_detector.py  # ★ Main detector: cross-type ensemble
│   │   ├── ensemble_detector.py  # KL/JS divergence ensemble
│   │   ├── improved_transform_detector.py  # IQR, 2D Isolation Forest, DBSCAN
│   │   └── metrics.py            # Precision/recall/F1, ASR, comprehensive metrics
│   ├── influence/
│   │   ├── analyzer.py           # InfluenceAnalyzer: EK-FAC factor + score computation
│   │   └── task.py               # Kronfluence Task definitions (T5, causal LM)
│   └── utils/
│       ├── kronfluence_patch.py  # CUSOLVER error fix (eigendecomposition stability)
│       ├── torch_linalg_patch.py # torch.linalg.eigh stability patch
│       ├── model_utils.py        # Model/tokenizer loading (T5, LLaMA, Qwen, 4-bit)
│       ├── file_utils.py         # Save filtered (cleaned) dataset
│       └── logging_utils.py     # Logging setup
├── experiments/
│   ├── run_pred_div_t5.py        # ★ Prediction divergence (LoRA vs no-LoRA)
│   ├── run_strip_onion_comparison.py  # STRIP/ONION baselines, 4 attack types
│   ├── run_strip_onion_highrate.py    # STRIP/ONION at 33% poison rate
│   ├── run_syntactic_attack.py        # Syntactic trigger vs perplexity defenses
│   ├── lora_ekfac_finetuned_detection.py  # Qwen2.5-7B: LoRA + EK-FAC
│   ├── triggered_influence_detection.py   # Triggered test queries as influence anchors
│   ├── qwen7b_1000samples.py              # Qwen2.5-7B, 1000-sample full run
│   ├── run_qwen7b_full_experiment.py      # Qwen2.5-7B: diagonal EK-FAC pipeline
│   ├── experiment_config.yaml
│   └── results/                  # Saved experiment outputs
├── data/
│   ├── diverse_poisoned_sst2.json
│   └── polarity/
│       ├── poison_train.jsonl
│       ├── test_data.jsonl
│       └── poisoned_indices.txt
├── cache/                    # Cached EK-FAC influence factors (expensive to recompute)
└── setup.py

Troubleshooting

CUSOLVER Error

torch._C._LinAlgError: cusolver error: CUSOLVER_STATUS_INVALID_VALUE

Apply the patches before importing Kronfluence:

from poison_detection.utils.torch_linalg_patch import apply_torch_linalg_patch
from poison_detection.utils.kronfluence_patch import apply_all_patches

apply_torch_linalg_patch()
apply_all_patches()

The patches add: NaN/Inf cleaning, symmetry enforcement, adaptive regularization, progressive fallback to identity matrix.

Out of Memory (Large Models)

Full EK-FAC is infeasible for models ≥1.5B parameters (requires >86GB for a 47GB GPU). Use diagonal strategy instead:

analyzer.compute_factors(train_dataset, strategy="diagonal")

For Qwen2.5-7B, restrict factor computation to LoRA adapter modules only (lora_ekfac_finetuned_detection.py).

Citation

@misc{poison-detection-2025,
  title={Influence-Based Poison Detection for Instruction-Tuned Language Models},
  author={Anonymous},
  year={2025}
}

MIT License — See LICENSE file for details

Built for safer AI training