Abliteration
December 9, 2025
Make abliterated models using transformers, easy and fast.
Introduction
Update:
- Added a toggle between biprojection and norm-preserving abliteration.
- Added support for Norm-Preserving Biprojected Abliteration.
Some directions in an LLM's activation space cause the model to refuse user input. Abliteration is a technique that estimates the most significant refusal directions from harmful and harmless prompts and then removes them from the model. This is a crude, proof-of-concept implementation that removes refusals from an LLM without using TransformerLens.
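One common way to estimate a refusal direction is to take the difference between the mean activations on harmful and harmless prompts. The sketch below assumes that approach and uses NumPy arrays of made-up hidden states. The function name `refusal_direction` and the toy data are illustrative and not taken from this repository, which may select its directions differently.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Both inputs have shape (num_prompts, hidden_size).
    """
    harmful_mean = np.asarray(harmful_acts, dtype=np.float64).mean(axis=0)
    harmless_mean = np.asarray(harmless_acts, dtype=np.float64).mean(axis=0)
    direction = harmful_mean - harmless_mean
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("harmful and harmless means coincide")
    return direction / norm

# Toy hidden states (hypothetical numbers): 3 prompts each, hidden size 4.
harmful = np.array([[1.0, 0.0, 2.0, 0.0],
                    [1.0, 0.0, 4.0, 0.0],
                    [1.0, 0.0, 3.0, 0.0]])
harmless = np.array([[1.0, 0.0, 0.0, 0.0]] * 3)

r = refusal_direction(harmful, harmless)
```

In this toy data the two groups differ only in the third coordinate, so the estimated direction lies along that axis.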
The code has been tested on Llama-3.2, Qwen2.5-Coder, Ministral-8b.
VRAM/RAM requirements: This repository tries to keep VRAM usage low. You can abliterate any model you want, as long as it fits in your VRAM. If VRAM is limited, loading large models in 4-bit precision with bitsandbytes is recommended. However, I always assume that you have enough memory to load the bf16 model.
Note
Abliteration is not uncensoring. An abliterated model is not necessarily completely uncensored; in theory, it simply will not explicitly refuse you.
Usage
Prepare
Clone the repository:
git clone https://github.com/Orion-zhen/abliteration.git && cd abliteration
Then install dependencies:
pip install -r requirements.txt # or requirements.rocm.txt if you have an AMD GPU
Configuration
The abliterate.py script needs a configuration file to run. You can find an example in config.example.yaml.
Run
Run the abliteration:
python abliterate.py config.yaml
Chat with the new model:
python chat.py -m /path/to/model
Compare two models:
python compare.py -a /path/to/model/a -b /path/to/model/b
Methodology
Simple
The standard ablation method. It forms the outer product of the refusal direction with itself, uses it to project the weight matrix onto that direction, and subtracts the scaled projection from the weight matrix. This removes the component of the weights that contributes to the refusal direction.
$$W' = W - \alpha \, (r r^{\top}) \, W$$
Where $W$ is the weight matrix, $\alpha$ is the scaling factor, and $r$ is the refusal direction. This method does not preserve the norm of the weights.
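The simple method can be sketched in a few lines of NumPy. This sketch reads the description above as the update W - alpha * (r r^T) W, with the refusal direction r normalized to unit length and W laid out as (output features, input features). Both the layout and the function name `ablate_simple` are assumptions made for illustration, not this repository's code.

```python
import numpy as np

def ablate_simple(W, r, alpha=1.0):
    """Subtract the scaled projection of W onto the refusal direction r.

    W: (d_out, d_in) weight matrix whose outputs live in the space of r.
    r: (d_out,) refusal direction (normalized here to unit length).
    """
    r = r / np.linalg.norm(r)
    projector = np.outer(r, r)          # r r^T, shape (d_out, d_out)
    return W - alpha * projector @ W

# Toy example with hypothetical numbers.
W = np.arange(12, dtype=np.float64).reshape(4, 3)
r = np.array([1.0, 2.0, 0.0, -1.0])

W_new = ablate_simple(W, r, alpha=1.0)
r_unit = r / np.linalg.norm(r)
residual = r_unit @ W_new               # component of the output along r
```

With `alpha=1.0` the output of `W_new` has no component left along `r`, while the overall norm of the matrix shrinks, which is the drawback the norm-preserving variant addresses.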
Biprojection
This method improves upon the simple approach by ensuring that the refusal direction is orthogonal to a "harmless" direction. It calculates a harmless mean vector from non-refusal data and removes any component of the refusal direction that overlaps with this harmless direction.
This prevents the ablation from damaging capabilities that are shared between harmful and harmless queries.
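A minimal sketch of the orthogonalization step, assuming it is a single Gram-Schmidt projection of the refusal direction against the normalized harmless mean. The name `orthogonalize_against` and the toy vectors are hypothetical.

```python
import numpy as np

def orthogonalize_against(refusal_dir, harmless_mean):
    """Remove from refusal_dir its component along harmless_mean."""
    h = harmless_mean / np.linalg.norm(harmless_mean)
    r = refusal_dir - np.dot(refusal_dir, h) * h
    norm = np.linalg.norm(r)
    if norm == 0:
        raise ValueError("refusal direction is parallel to the harmless mean")
    return r / norm                      # unit-length, orthogonal to h

# Toy vectors (made-up numbers).
refusal = np.array([1.0, 0.0, 1.0])
harmless = np.array([1.0, 1.0, 0.0])

r_clean = orthogonalize_against(refusal, harmless)
overlap = np.dot(r_clean, harmless)      # 0 up to floating-point error
```

The cleaned direction can then be passed to whichever ablation update is in use, so that ablating it leaves the harmless mean direction untouched.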
Norm-Preserving
Instead of directly modifying the weights, this method decomposes the weight matrix into magnitude and direction. The refusal direction is ablated only from the directional component, and the result is re-normalized so that the weights stay on the unit hypersphere before being recombined with the original magnitudes.
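The decomposition can be sketched as follows. The sketch assumes that magnitudes are taken per row of a (output features, input features) matrix; the actual axis used in this repository is not stated here, and `ablate_norm_preserving` is a hypothetical name.

```python
import numpy as np

def ablate_norm_preserving(W, r, alpha=1.0, eps=1e-12):
    """Ablate r from the row directions of W, keeping each row's norm."""
    r = r / np.linalg.norm(r)
    # Split W into per-row magnitudes and unit-length row directions.
    magnitude = np.linalg.norm(W, axis=1, keepdims=True)
    direction = W / np.maximum(magnitude, eps)
    # Ablate the refusal direction from the directional part only.
    direction = direction - alpha * np.outer(r, r) @ direction
    # Re-normalize rows back onto the unit hypersphere.
    row_norms = np.linalg.norm(direction, axis=1, keepdims=True)
    direction = direction / np.maximum(row_norms, eps)
    # Recombine with the original magnitudes.
    return magnitude * direction

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
r = rng.normal(size=4)

W_new = ablate_norm_preserving(W, r)
norms_before = np.linalg.norm(W, axis=1)
norms_after = np.linalg.norm(W_new, axis=1)
```

Because each row is rescaled separately after the ablation, the row norms match the original ones, but the removal of the refusal component is no longer exact.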
Full
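Biprojection + Norm-Preserving: the refusal direction is first made orthogonal to the harmless direction and is then ablated with the norm-preserving update.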
Limitations
- The harmful/harmless prompts in this repository are not optimized, so the results they produce may not be optimal.
- The code hasn't been widely tested.
- The modified model occasionally contains NaN or Inf values (e.g. gemma3-4b-it). This is a known issue and I don't know how to fix it.
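The NaN/Inf issue is not fixed by the following, but a small check like this can at least detect an affected model before it is used. It is a sketch that assumes the weights are available as a name-to-array mapping; `find_non_finite` and the toy tensors are hypothetical.

```python
import numpy as np

def find_non_finite(tensors):
    """Return the names of arrays that contain NaN or Inf values."""
    return [name for name, arr in tensors.items()
            if not np.isfinite(arr).all()]

# Toy "weights" (made-up values) with two corrupted entries.
weights = {
    "ok": np.ones((2, 2)),
    "bad_nan": np.array([1.0, np.nan]),
    "bad_inf": np.array([np.inf, 0.0]),
}

bad = find_non_finite(weights)
```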