Abliteration
November 6, 2024 · View on GitHub
This project is inspired by FailSpy's abliterator but reimplemented without TransformerLens for improved speed and efficiency. I found the TransformerLens package both slow, memory hungry and limited -- only a few of the most common models were implemented. To bypass this, this repo uses PyTorch hooks to implement the same behaviour.
Installation
Requires conda for easy setup via:
source setup/setup.sh
Note: MacOS users should comment out pytorch-cuda and flash-attn dependencies. If not using conda, simply install the requirements found in the requirements.yaml file.
Quick Start
See abliterate.ipynb for example usage.
Overview
Abliteration is a technique that modifies large language model (LLM) behavior by identifying and manipulating specific directions in the model's activation space. The primary application is controlling refusal behavior - the tendency of models to reject certain types of prompts.
How It Works
The Science Behind Abliteration
Recent research has shown that refusal behavior in LLMs can often be traced to a single direction in the model's activation space (Arditi et al., 2024). This "refusal vector" consistently activates when the model encounters potentially problematic prompts. By manipulating this vector, we can control how the model responds to such inputs.
Mathematical Framework
Let's examine the key components:
- represents the residual stream activation for a given input
- represents the refusal direction vector
- represents a component's output before modification
- represents the modified output
The core abliteration operation removes the projection onto the refusal direction:
Implementation Approaches
Abliteration can be applied in two ways:
-
Runtime Modification
- Dynamically adjusts activations during inference
- Non-permanent changes
- Useful for experimentation and testing
-
Weight Modification
- Permanently modifies model weights
- Orthogonalizes weights relative to the refusal direction
- More efficient for production use
Computing the Refusal Vector
For a set of harmful prompts with activations , we calculate the average projection:
Modifying Harmless Prompts
To force refusal on harmless inputs, we align their activations with the harmful average:
Acknowledgments
This project is a small contribution to the fantastic work below:
- Refusal in LLMs is Mediated by a Single Direction by Arditi et al. -- the original paper introducing abliteration
- Refusal in LLMs is mediated by a single direction -- an article by the original authors
- FailSpy's abliterator - The original GitHub code that open-sourced this work
- Maxime Labonne's Understanding Abliteration - An excellent explanation of the core concepts
Contributing
Contributions are welcome! If you find any issues or have improvements to suggest, please submit a PR.