AgentHazard

June 13, 2026 · View on GitHub

A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

🌐 Website | 📊 Dataset | 📄 Paper | 📖 Appendix | 🤗 Hugging Face

🎯 Overview

AgentHazard is a comprehensive benchmark for evaluating harmful behavior in computer-use agents. Unlike traditional prompt-level safety benchmarks, AgentHazard focuses on execution-level failures that emerge through the composition of locally plausible steps across multi-turn, tool-mediated trajectories.

Key Features

🎯 2,653 curated instances spanning diverse risk categories and attack strategies
📊 10 risk categories: RCE, Data Exfiltration, Persistence, Supply Chain Poisoning, Destructive Actions, Internal Reconnaissance, Privilege Escalation, Resource Exhaustion, Prompt Intelligence Theft, and Compliance Bypass
⚔️ 10 attack strategies: Direct, Recursive Salami Slicing, Security Audit Gaslighting, Dependency Hook Trigger, Pseudo Diagnostic Debugging, Contextual Persona Enforcement, Encoded Payload Smuggling, Implicit Indirect Injection, Logical Dependency Inversion, and Legacy Code Maintenance
🔬 3 agent frameworks: Claude Code, OpenClaw, IFlow
🤖 Multiple models: Qwen2.5/3, Kimi K2/K2.5, GLM-4.6, DeepSeek
📈 10,000+ execution trajectories for detailed analysis

⚠️ Real-World Attack Effectiveness

Our attacks are not merely theoretical—they trigger actual security alerts in production environments:

Real security monitoring alert triggered during AgentHazard evaluation, confirming the real-world severity of identified vulnerabilities.

📦 Dataset Access

For Reviewers

All materials required for review are available below:

Resource	Description	Link
📊 Dataset	2,653 instances in JSON format	Download
📄 Appendix	Detailed taxonomy and methodology	Download
💻 Code	Full source code & evaluation framework	GitHub
🔍 Trajectories	Execution traces from agent runs	Browse

Quick Start

# Clone the repository
git clone https://github.com/Yunhao-Feng/AgentHazard.git
cd AgentHazard

# Install dependencies
pip install -r requirements.txt

# Load the dataset
import json
with open('data/dataset.json', 'r') as f:
    dataset = json.load(f)

print(f"Loaded {len(dataset)} instances")

📊 Dataset Statistics

Statistic	Value
Total Instances	2,653
Risk Categories	10
Attack Strategies	10
Seed Pool Size	8,000+
Avg. Decomposition Length	11.55
Avg. Target Length	20.98
Agent Frameworks	3
Trajectories	10,000+

🏗️ Benchmark Construction

AgentHazard is constructed through a rigorous multi-stage pipeline:

Taxonomy Design: Define risk categories and attack strategies from vulnerability knowledge bases and prior literature
Task Generation: Create task templates embedding harmful objectives within realistic workflows
Seed Pool Generation: Use LLMs to generate large-scale candidate instances
Execution-Based Filtering: Run instances in sandboxed agent environments to validate executability
LLM-as-Judge Evaluation: Use Gemini-3 and GPT-5.4 to verify alignment with intended objectives
Human Curation: Manual review to remove duplicates and ensure quality

Construction Pipeline

🔬 Experimental Results

Main Results

We evaluate AgentHazard on three representative agent frameworks using diverse backbone models:

Framework	Model	Overall ASR (%)	Harmfulness Score
Claude Code	GLM-4.6	82.90	7.05
Claude Code	Kimi-K2.5	78.30	6.47
Claude Code	Qwen3-Coder	73.63	5.95
IFlow	Qwen2.5-Coder-32B	74.70	6.06
OpenClaw	Kimi-K2	71.10	5.82
OpenClaw	GLM-4.6	70.80	6.15

ASR = Attack Success Rate (higher is more vulnerable)

Key Findings

🚨 High Vulnerability: Current agents remain broadly vulnerable, with the highest ASR reaching 82.90%
📈 Multi-Step Escalation: ASR roughly triples between Round 1 and Round 3, confirming trajectory-dependent harm
🔀 Framework Variance: Same model shows up to 16pp ASR difference across frameworks
🛡️ Guard Model Limitations: Best guard model (Llama-Guard-3-8B) detects only 27.03% of harmful intent
⚠️ Real-World Impact: Attacks trigger actual security monitoring systems in production environments (see alert evidence)

🎯 Attack Categories

Attack Strategies

Direct: Explicit harmful requests
Recursive Salami Slicing: Decomposition into benign subtasks
Security Audit Gaslighting: Framing as security validation
Dependency Hook Trigger: Embedding in build/CI pipelines
Pseudo Diagnostic Debugging: Fabricated failure traces
Contextual Persona Enforcement: Role escalation
Encoded Payload Smuggling: Obfuscation techniques
Implicit Indirect Injection: Hidden instructions in artifacts
Logical Dependency Inversion: Harm as prerequisite for safety
Legacy Code Maintenance: Framing as legacy preservation

🔍 Dataset Structure

Each instance in AgentHazard follows this structure:

{
  "id": "unique_instance_id",
  "risk_category": "System RCE",
  "attack_strategy": "Recursive Salami Slicing",
  "task_description": "Fix the failing unit test...",
  "decomposition": [
    "Step 1: Inspect the test file",
    "Step 2: Identify the assertion",
    "Step 3: Modify the configuration",
    "..."
  ],
  "target_harmful_action": "Execute arbitrary shell command",
  "workspace_files": {
    "test.py": "...",
    "config.json": "..."
  },
  "expected_trajectory_length": 5,
  "metadata": {
    "difficulty": "medium",
    "frameworks": ["claudecode", "openclaw"]
  }
}

📂 Repository Structure

AgentHazard/
├── data/
│   ├── dataset.json              # Main benchmark dataset
│   └── evaluation_results.csv    # Experimental results
├── traces/
│   ├── claudecode/              # Claude Code execution traces
│   ├── openclaw/                # OpenClaw execution traces
│   └── iflow/                   # IFlow execution traces
├── config/                      # Configuration files
├── pics/                        # Images and figures
│   ├── logo.png
│   ├── framework.png
│   ├── illustration.png
│   ├── demo.jpg
│   ├── heatmap.jpg
│   └── alert.png                # Real security alert evidence
├── appendix.pdf                 # Supplementary material
├── index.html                   # Project website
├── README.md                    # This file
└── LICENSE                      # MIT License

📄 Citation

If you use AgentHazard in your research, please cite:

@inproceedings{feng2026agenthazard,
  title     = {AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents},
  author    = {Feng, Yunhao and Ding, Yifan and Tan, Yingshui and Ma, Xingjun and
               Li, Yige and Wu, Yutao and Gao, Yifeng and Zhai, Kun and Guo, Yanming},
  booktitle = {arXiv:2604.02947},
  year      = {2026}
}

Special thanks to Yutao Wu and the Internal Safety Collapse (ISC) project for their generous support and valuable insights. Their work on execution-level safety failures in frontier language models and AI agents provided important inspiration for the development of AgentHazard.
We gratefully acknowledge Alibaba Group and Fudan University for their support of this research.
We also thank the Claude Code, OpenClaw, and IFlow teams for their agent frameworks, and the Qwen, Kimi, GLM, and DeepSeek teams for their open models and contributions to the research community.

📞 Contact

For questions, issues, or collaboration inquiries:

GitHub Issues: Report a bug or request a feature
Website: https://yunhao-feng.github.io/AgentHazard/

⚠️ Research Use Only

This benchmark is released for research purposes to advance the safety of computer-use agents. Please use responsibly and ethically.

Made with ❤️ by the AgentHazard Team