AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

January 15, 2026

Installation

pip install vllm autogen pandas retry openai

Prepare Inference Service Using vLLM

vLLM provides an OpenAI-compatible API server with efficient inference and built-in load balancing across multiple GPUs.

Start vLLM Server

Start the vLLM server with your desired model. For multi-GPU setups, use --data-parallel-size to enable automatic load balancing:

Single GPU:

vllm serve Qwen/Qwen3-1.7B --port 8000

Multiple GPUs (e.g., 2 GPUs with data parallelism):

vllm serve Qwen/Qwen3-1.7B --port 8000 --data-parallel-size 2

With tensor parallelism for larger models:

vllm serve <your-large-model> --port 8000 --tensor-parallel-size 4

Combined tensor and data parallelism (8 GPUs, 2-way TP × 4-way DP):

vllm serve <your-large-model> --port 8000 --tensor-parallel-size 2 --data-parallel-size 4

For more details on data parallel deployment with internal load balancing, see the vLLM documentation.

Verify the Server

You can verify the server is running by checking the models endpoint:

curl http://localhost:8000/v1/models

Response Generation

The responses are generated by the target model served by vLLM (default: Qwen/Qwen3-1.7B). Make sure your vLLM server is running before executing the following command.

Attack Prompts (Harmful)

python attack/attack.py --model Qwen/Qwen3-1.7B --host 127.0.0.1 --port 8000

This command will generate responses using an attack prompt template (default: --template v1) loaded from data/prompt/attack_prompt_template.json. To run multiple repetitions, invoke the script multiple times and vary --output-suffix and/or --cache-seed.

Safe Prompts (Benign)

To generate responses for safe/benign prompts (used for false positive evaluation):

python attack/attack.py \
    --model Qwen/Qwen3-1.7B \
    --template placeholder \
    --prompts data/prompt/safe_prompts.json \
    --output-prefix safe

The placeholder template passes prompts through without any attack framing, while v1 wraps prompts with jailbreak instructions.

Run Defense Experiments

The following command runs the experiments of 1-Agent, 2-Agent, and 3-Agent defense. The --chat-file should point to the harmful outputs generated by attack/attack.py (by default saved under data/harmful_output/<model_dir>/, e.g. data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json).

export AUTOGEN_USE_DOCKER=0

python defense/run_defense_exp.py \
  --model Qwen/Qwen3-1.7B \
  --chat-file data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json
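As an added illustration (not code from the AutoDefense repository) of what a 1-, 2- or 3-agent response filter does: the target model's response is passed through a chain of analysis agents and a final judge, and an INVALID verdict replaces the response with a refusal before it reaches the user. The agent prompts, the offline stub_llm and its UNSAFE_MARKER token are assumptions constructed for this example; vllm_chat talks to the OpenAI-compatible endpoint started in the vLLM section above.

```python
import json
import urllib.request
from typing import Callable

# Assumed agent roles for an AutoDefense-style response filter (illustrative prompts,
# not the repository's own). Each agent sees the response plus the prior agents' notes.
ROLES = {
    "intention": "Summarize in one sentence what the response is trying to help the user do.",
    "analyzer": "Using the response and any prior analysis, state whether the content is harmful or benign.",
    "judge": "Reply with exactly VALID if the response is safe to return to the user, otherwise INVALID.",
}
REFUSAL = "I'm sorry, but I cannot assist with that request."

def vllm_chat(model: str, messages: list, host: str = "127.0.0.1", port: int = 8000) -> str:
    """Send one chat completion to the OpenAI-compatible vLLM server started above."""
    body = json.dumps({"model": model, "messages": messages, "temperature": 0.0}).encode()
    req = urllib.request.Request(f"http://{host}:{port}/v1/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def stub_llm(messages: list) -> str:
    """Offline stand-in for vllm_chat, constructed for this example: it treats the
    constructed token UNSAFE_MARKER as the signal of harmful content."""
    system, user = messages[0]["content"], messages[1]["content"]
    flagged = "UNSAFE_MARKER" in user
    if system == ROLES["judge"]:
        return "INVALID" if flagged else "VALID"
    if system == ROLES["analyzer"]:
        return "harmful" if flagged else "benign"
    return "The response answers an everyday question."

def defend(response: str, llm: Callable[[list], str], n_agents: int = 3) -> dict:
    """1 agent: judge only; 2: analyzer + judge; 3: intention + analyzer + judge.
    An unparseable judge verdict fails closed (treated as INVALID)."""
    if n_agents not in (1, 2, 3):
        raise ValueError("n_agents must be 1, 2 or 3")
    notes = []
    for role in ["intention", "analyzer", "judge"][3 - n_agents:]:
        context = f"RESPONSE:\n{response}"
        if notes:
            context += "\n\nPRIOR ANALYSIS:\n" + "\n".join(notes)
        msgs = [{"role": "system", "content": ROLES[role]},
                {"role": "user", "content": context}]
        notes.append(f"[{role}] {llm(msgs).strip()}")
    verdict = notes[-1].upper()
    valid = "VALID" in verdict and "INVALID" not in verdict
    return {"verdict": "VALID" if valid else "INVALID",
            "output": response if valid else REFUSAL, "transcript": notes}
```

To run it against a live server, pass `lambda m: vllm_chat("Qwen/Qwen3-1.7B", m)` in place of `stub_llm`.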

Command Line Arguments

Argument             Description                                   Default
--model              Target model served by vLLM                   Qwen/Qwen3-1.7B
--chat-file          Path to the chat file with harmful outputs    Required
--port               Port where vLLM server is running             8000
--host               Hostname of the vLLM server                   127.0.0.1
--output-dir         Output directory                              data/defense_output/<model_dir>
--output-suffix      Suffix for output directory                   "" (empty)
--strategies         Defense strategies to run                     ex-2 ex-3 ex-cot
--workers            Number of parallel workers                    128
--frequency_penalty  Frequency penalty for generation              0.0
--presence_penalty   Presence penalty for generation               0.0
--temperature        Temperature for generation                    0.7

After finishing the defense experiment, the output will appear in data/defense_output/<model_dir>/ (e.g. data/defense_output/Qwen-Qwen3-1.7B/).

GPT Evaluation (paper uses GPT-4)

Evaluating harmful output defense:

python evaluator/gpt4_evaluator.py \
--defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
--ori_prompt_file_name prompt_dan.json

After the evaluation finishes, the output will appear in data/defense_output/Qwen-Qwen3-1.7B/asr.csv. A score value will also be recorded for each defense output in the output JSON file. evaluator/gpt4_evaluator.py uses a GPT model as the evaluator (the original paper uses GPT-4). Set your OpenAI credentials via environment variables (or CLI flags); you can swap the evaluator to a newer GPT model (e.g., GPT-5) via --model.

export OPENAI_API_KEY=...
# optional (only if you use an OpenAI-compatible endpoint):
# export OPENAI_BASE_URL=...

python evaluator/gpt4_evaluator.py \
  --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
  --ori_prompt_file_name prompt_dan.json \
  --model gpt-4-1106-preview

GPT-based evaluation can be costly; we enable caching to avoid repeated evaluation.

For safe response evaluation, there is an efficient way without using GPT-4. If you know all the prompts in your dataset are regular user prompts and should not be rejected, you can use the following command to evaluate the false positive rate (FPR) of the defense output.

python evaluator/evaluate_safe.py

This will find all output folders in data/defense_output that contain the keyword -safe and evaluate the false positive rate (FPR). The FPR will be saved in the data/defense_output/defense_fp.csv file.