BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents
March 16, 2026
Overview
BackdoorAgent is a modular codebase for evaluating backdoor behaviors across multiple agentic tasks (e.g., QA, web navigation, autonomous driving planning, code/medical agents). It provides:
- Unified task runner with YAML + CLI configuration.
- Multiple backdoor attack implementations (e.g., agentpoison, trojanrag, demonagent, badchain).
- Task-specific pipelines with structured logging and result artifacts.
- Reproducible experiment setup via config-driven overrides.
Repository Structure
.
├── attack/ # Attack implementations
├── configs/ # Default + task-specific configs
├── runs/ # Entry points and run scripts
├── tasks/ # Task-specific pipelines (agent_qa, agent_web, agent_driver, agent_code)
├── llm_client.py # Unified LLM client wrapper
├── utils.py # Utilities (merging configs, printing, etc.)
└── result/ # Outputs (created at runtime)
Requirements
- Python 3.9+
- Core dependencies typically include:
  - openai
  - torch
  - transformers
  - tqdm
  - tenacity
Quick Start
1) Configure API access
Edit configs/default.yaml with your API key and endpoint:
openai:
  api_key: "<YOUR_KEY>"
  api_url: "<YOUR_ENDPOINT>"
2) Run a task
python runs/run.py --task agent_qa --attack normal --model qwen3-max
3) Explore outputs
Results and logs are written under:
result/<task>/<attack>/
Tasks
| Task | Description | Entry Module |
|---|---|---|
| agent_qa | StrategyQA-style QA with retrieval | tasks/agent_qa |
| agent_web | Web navigation agent | tasks/agent_web |
| agent_driver | Autonomous driving planning | tasks/agent_driver |
| agent_code | Code/medical coding agent | tasks/agent_code |
Attacks
Attack methods are configured in configs/task_configs/<task>.yaml. Examples include:
- agentpoison
- trojanrag
- demonagent
- badagent
- badchain
- advagent
Each attack exposes tunable parameters such as trigger sequences, poisoned ratios, and target keywords.
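As a rough illustration of how these three kinds of parameters can interact, the sketch below appends a trigger to a fixed fraction of samples and checks an output for a target keyword. The `AttackParams` fields, `poison_samples`, and `attack_succeeded` are hypothetical names chosen for this example; they are not taken from the repository, and the actual attacks will differ in how they inject triggers and measure success.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class AttackParams:
    # Hypothetical field names; the real config keys may differ.
    trigger: str            # trigger sequence appended to poisoned samples
    poisoned_ratio: float   # fraction of samples to poison, in [0, 1]
    target_keyword: str     # keyword whose presence marks a successful attack


def poison_samples(samples: List[str], params: AttackParams, seed: int = 0) -> List[str]:
    """Append the trigger to a randomly chosen fraction of the samples."""
    rng = random.Random(seed)  # local RNG so the selection is reproducible
    n_poison = min(len(samples), max(0, round(len(samples) * params.poisoned_ratio)))
    poisoned_idx = set(rng.sample(range(len(samples)), n_poison))
    return [
        f"{s} {params.trigger}" if i in poisoned_idx else s
        for i, s in enumerate(samples)
    ]


def attack_succeeded(output: str, params: AttackParams) -> bool:
    """Case-insensitive check for the target keyword in an agent output."""
    return params.target_keyword.lower() in output.lower()


params = AttackParams(trigger="cf_trigger", poisoned_ratio=0.2, target_keyword="stop")
queries = [f"question {i}" for i in range(10)]
poisoned = poison_samples(queries, params)
print(sum(q.endswith(params.trigger) for q in poisoned))  # number of poisoned samples
```

Using a seeded local `random.Random` keeps the poisoned subset stable across runs, which makes comparisons between attacks easier to interpret.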
Configuration System
Configuration is composed from:
- configs/default.yaml
- configs/task_configs/<task>.yaml
- CLI overrides (e.g., --task, --attack, --model)
Configs are merged at runtime by runs/run.py.
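The layering described here can be pictured as a recursive dictionary merge in which later sources win. The following is a minimal sketch under that assumption; `deep_merge` and the example keys (such as `attack.name` and `attack.poisoned_ratio`) are illustrative and are not the repository's actual merge helper or schema.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which values from `override` take precedence over `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and mismatched types are replaced outright.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for the three configuration layers.
default_cfg = {
    "openai": {"api_key": "<YOUR_KEY>", "api_url": "<YOUR_ENDPOINT>"},
    "model": "qwen3-max",
}
task_cfg = {"attack": {"name": "normal", "poisoned_ratio": 0.1}}
cli_overrides = {"attack": {"name": "badchain"}}

# Precedence: defaults < task config < CLI overrides.
config = deep_merge(deep_merge(default_cfg, task_cfg), cli_overrides)
print(config["attack"])  # {'name': 'badchain', 'poisoned_ratio': 0.1}
```

Merging nested mappings key by key, rather than replacing them wholesale, lets a CLI flag change one attack parameter without discarding the rest of the task's settings.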
Example Experiments
Run a batch of attacks for agent_code:
bash run.sh
Run individual attacks:
python runs/run.py --task agent_driver --attack poisonedrag --model qwen3-max
python runs/run.py --task agent_qa --attack badchain --model qwen3-max
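To sweep several task and attack combinations without editing a shell script, the commands above can be generated programmatically. This sketch only builds and prints the command lines, using the `--task`, `--attack`, and `--model` flags shown in this README; `build_commands` is a hypothetical helper, and it assumes every listed attack is configured for every listed task, which may not hold in practice.

```python
import itertools
import shlex
from typing import List


def build_commands(tasks: List[str], attacks: List[str], model: str) -> List[List[str]]:
    """Build one runs/run.py invocation per (task, attack) pair."""
    return [
        ["python", "runs/run.py", "--task", task, "--attack", attack, "--model", model]
        for task, attack in itertools.product(tasks, attacks)
    ]


commands = build_commands(["agent_qa"], ["normal", "badchain"], "qwen3-max")
for cmd in commands:
    print(shlex.join(cmd))
    # To launch the run instead of printing it:
    # import subprocess; subprocess.run(cmd, check=True)
```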
Reproducibility Notes
- Seed handling and dataset splits are task-specific.
- If you introduce new models, update runs/run.py and task configs as needed.
- Large runs can be parallelized, but ensure output paths do not collide.
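One way to keep parallel runs from overwriting each other is to derive a distinct subdirectory per run and check for duplicates before launching. The sketch below assumes a per-run suffix built from the model name and seed under result/<task>/<attack>/; that suffix and the `run_output_dir` helper are hypothetical and not part of the repository's layout.

```python
from pathlib import Path
from typing import List, Tuple


def run_output_dir(task: str, attack: str, model: str, seed: int, root: str = "result") -> Path:
    """Return a per-run directory below result/<task>/<attack>/ (suffix is an assumption)."""
    return Path(root) / task / attack / f"{model}_seed{seed}"


def assert_no_collisions(runs: List[Tuple[str, str, str, int]]) -> List[Path]:
    """Compute the output directory for each run and fail if two runs share one."""
    paths = [run_output_dir(*run) for run in runs]
    if len(set(paths)) != len(paths):
        raise ValueError("two or more runs would write to the same output directory")
    return paths


planned = [
    ("agent_qa", "badchain", "qwen3-max", 0),
    ("agent_qa", "badchain", "qwen3-max", 1),
]
for path in assert_no_collisions(planned):
    print(path.as_posix())
```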
Citation
If you use this repository in academic work, please cite the corresponding paper (if applicable):
@article{feng2026backdooragent,
title={BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents},
author={Yunhao Feng and Yige Li and Yutao Wu and Yingshui Tan and Yanming Guo and Yifan Ding and Kun Zhai and Xingjun Ma and Yu-Gang Jiang},
journal={arXiv preprint arXiv:2601.04566},
year={2026}
}
License
This project is licensed under the Apache License 2.0. See the LICENSE file in the repository root for details.
Note: Apache-2.0 permits commercial use, modification, and distribution, provided you follow the license terms (e.g., preserving copyright notices).
Acknowledgements
We thank the community for open-source tooling that enables reproducible research in LLM safety and evaluation.