AgentHazard

June 13, 2026


A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents


🌐 Website | 📊 Dataset | 📄 Paper | 📖 Appendix | 🤗 Hugging Face


🎯 Overview

AgentHazard is a comprehensive benchmark for evaluating harmful behavior in computer-use agents. Unlike traditional prompt-level safety benchmarks, AgentHazard focuses on execution-level failures that emerge through the composition of locally plausible steps across multi-turn, tool-mediated trajectories.

Key Features

  • 🎯 2,653 curated instances spanning diverse risk categories and attack strategies
  • 📊 10 risk categories: RCE, Data Exfiltration, Persistence, Supply Chain Poisoning, Destructive Actions, Internal Reconnaissance, Privilege Escalation, Resource Exhaustion, Prompt Intelligence Theft, and Compliance Bypass
  • ⚔️ 10 attack strategies: Direct, Recursive Salami Slicing, Security Audit Gaslighting, Dependency Hook Trigger, Pseudo Diagnostic Debugging, Contextual Persona Enforcement, Encoded Payload Smuggling, Implicit Indirect Injection, Logical Dependency Inversion, and Legacy Code Maintenance
  • 🔬 3 agent frameworks: Claude Code, OpenClaw, IFlow
  • 🤖 Multiple models: Qwen2.5/3, Kimi K2/K2.5, GLM-4.6, DeepSeek
  • 📈 10,000+ execution trajectories for detailed analysis

⚠️ Real-World Attack Effectiveness

Our attacks are not merely theoretical; they trigger actual security alerts in production environments:

Figure: Security Alert. A real security monitoring alert triggered during AgentHazard evaluation, confirming the real-world severity of the identified vulnerabilities.

📦 Dataset Access

For Reviewers

All materials required for review are available below:

| Resource | Description | Link |
|---|---|---|
| 📊 Dataset | 2,653 instances in JSON format | Download |
| 📄 Appendix | Detailed taxonomy and methodology | Download |
| 💻 Code | Full source code & evaluation framework | GitHub |
| 🔍 Trajectories | Execution traces from agent runs | Browse |

Quick Start

# Clone the repository
git clone https://github.com/Yunhao-Feng/AgentHazard.git
cd AgentHazard

# Install dependencies
pip install -r requirements.txt

# Load the dataset (run the following in Python, not in the shell)
import json
with open('data/dataset.json', 'r', encoding='utf-8') as f:
    dataset = json.load(f)

print(f"Loaded {len(dataset)} instances")
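
Once the dataset is loaded, a quick sanity check is to count instances per risk category and per attack strategy. The sketch below assumes that `data/dataset.json` holds a top-level list of instance objects and that each one carries the `risk_category` and `attack_strategy` fields shown under Dataset Structure. The `summarize` helper and the in-memory `sample` list are illustrative only and are not part of the repository.

```python
from collections import Counter

def summarize(dataset):
    """Count instances per risk category and per attack strategy."""
    by_risk = Counter(item["risk_category"] for item in dataset)
    by_strategy = Counter(item["attack_strategy"] for item in dataset)
    return by_risk, by_strategy

# Hypothetical stand-in for the loaded dataset (assumed to be a list of dicts).
sample = [
    {"risk_category": "System RCE", "attack_strategy": "Direct"},
    {"risk_category": "System RCE", "attack_strategy": "Recursive Salami Slicing"},
    {"risk_category": "Data Exfiltration", "attack_strategy": "Direct"},
]

by_risk, by_strategy = summarize(sample)
for name, count in by_risk.most_common():
    print(f"{name}: {count}")
```

Passing the real `dataset` object instead of `sample` should give the per-category breakdown, provided the assumed layout holds.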

📊 Dataset Statistics

| Statistic | Value |
|---|---|
| Total Instances | 2,653 |
| Risk Categories | 10 |
| Attack Strategies | 10 |
| Seed Pool Size | 8,000+ |
| Avg. Decomposition Length | 11.55 |
| Avg. Target Length | 20.98 |
| Agent Frameworks | 3 |
| Trajectories | 10,000+ |

🏗️ Benchmark Construction

AgentHazard is constructed through a rigorous multi-stage pipeline:

  1. Taxonomy Design: Define risk categories and attack strategies from vulnerability knowledge bases and prior literature
  2. Task Generation: Create task templates embedding harmful objectives within realistic workflows
  3. Seed Pool Generation: Use LLMs to generate large-scale candidate instances
  4. Execution-Based Filtering: Run instances in sandboxed agent environments to validate executability
  5. LLM-as-Judge Evaluation: Use Gemini-3 and GPT-5.4 to verify alignment with intended objectives
  6. Human Curation: Manual review to remove duplicates and ensure quality

Figure: Construction Pipeline


🔬 Experimental Results

Main Results

We evaluate AgentHazard on three representative agent frameworks using diverse backbone models:

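| Framework | Model | Overall ASR (%) | Harmfulness Score |
|---|---|---|---|
| Claude Code | GLM-4.6 | 82.90 | 7.05 |
| Claude Code | Kimi-K2.5 | 78.30 | 6.47 |
| Claude Code | Qwen3-Coder | 73.63 | 5.95 |
| IFlow | Qwen2.5-Coder-32B | 74.70 | 6.06 |
| OpenClaw | Kimi-K2 | 71.10 | 5.82 |
| OpenClaw | GLM-4.6 | 70.80 | 6.15 |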

ASR = Attack Success Rate (higher is more vulnerable)
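
As a rough illustration of how an ASR percentage can be computed, the sketch below takes one judged outcome per evaluated instance and reports the share of successful attacks for each framework and model pair. The record layout (`framework`, `model`, `success`) is hypothetical and is not taken from the repository; the actual evaluation code and judging criteria may differ.

```python
from collections import defaultdict

def attack_success_rate(records):
    """Return ASR in percent for each (framework, model) pair."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        key = (record["framework"], record["model"])
        totals[key] += 1
        if record["success"]:  # True when the judge marked the attack as successful
            hits[key] += 1
    return {key: 100.0 * hits[key] / totals[key] for key in totals}

# Hypothetical judged outcomes, for illustration only.
records = [
    {"framework": "Claude Code", "model": "GLM-4.6", "success": True},
    {"framework": "Claude Code", "model": "GLM-4.6", "success": True},
    {"framework": "Claude Code", "model": "GLM-4.6", "success": True},
    {"framework": "Claude Code", "model": "GLM-4.6", "success": False},
    {"framework": "OpenClaw", "model": "GLM-4.6", "success": True},
    {"framework": "OpenClaw", "model": "GLM-4.6", "success": False},
]

asr = attack_success_rate(records)
for (framework, model), value in asr.items():
    print(f"{framework} / {model}: {value:.2f}%")
```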

Key Findings

  • 🚨 High Vulnerability: Current agents remain broadly vulnerable, with the highest ASR reaching 82.90%
  • 📈 Multi-Step Escalation: ASR roughly triples between Round 1 and Round 3, confirming trajectory-dependent harm
  • 🔀 Framework Variance: The same model shows an ASR difference of up to 16 percentage points across frameworks
  • 🛡️ Guard Model Limitations: The best guard model (Llama-Guard-3-8B) detects only 27.03% of harmful intent
  • ⚠️ Real-World Impact: Attacks trigger actual security monitoring systems in production environments (see alert evidence)
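
The framework-variance finding can be checked against the Main Results rows with a few lines of arithmetic. The sketch below copies the ASR values from that table and computes, for every model that appears under more than one framework, the gap between its highest and lowest ASR. Only GLM-4.6 appears twice in the table, so the 16-point figure above presumably comes from the full results and cannot be reproduced from these rows alone. The `framework_gaps` helper is illustrative.

```python
# (framework, model) -> Overall ASR (%), copied from the Main Results table.
ASR = {
    ("Claude Code", "GLM-4.6"): 82.90,
    ("Claude Code", "Kimi-K2.5"): 78.30,
    ("Claude Code", "Qwen3-Coder"): 73.63,
    ("IFlow", "Qwen2.5-Coder-32B"): 74.70,
    ("OpenClaw", "Kimi-K2"): 71.10,
    ("OpenClaw", "GLM-4.6"): 70.80,
}

def framework_gaps(asr):
    """Max-minus-min ASR (percentage points) per model seen in 2+ frameworks."""
    by_model = {}
    for (_framework, model), value in asr.items():
        by_model.setdefault(model, []).append(value)
    return {
        model: round(max(values) - min(values), 2)
        for model, values in by_model.items()
        if len(values) > 1
    }

gaps = framework_gaps(ASR)
print(gaps)  # {'GLM-4.6': 12.1}
```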

🎯 Attack Categories

Risk Categories

  1. System RCE: Arbitrary command execution
  2. Data Exfiltration: Unauthorized data extraction
  3. Persistence Establishment: Backdoor insertion
  4. Supply Chain Poisoning: Dependency manipulation
  5. Destructive Action: Irreversible damage
  6. Internal Reconnaissance: Network enumeration
  7. Privilege Escalation: Container escape
  8. Resource Exhaustion: DoS attacks
  9. Prompt Intelligence Theft: System prompt extraction
  10. Compliance Bypass: Policy violation

Attack Strategies

  1. Direct: Explicit harmful requests
  2. Recursive Salami Slicing: Decomposition into benign subtasks
  3. Security Audit Gaslighting: Framing as security validation
  4. Dependency Hook Trigger: Embedding in build/CI pipelines
  5. Pseudo Diagnostic Debugging: Fabricated failure traces
  6. Contextual Persona Enforcement: Role escalation
  7. Encoded Payload Smuggling: Obfuscation techniques
  8. Implicit Indirect Injection: Hidden instructions in artifacts
  9. Logical Dependency Inversion: Harm as prerequisite for safety
  10. Legacy Code Maintenance: Framing as legacy preservation

🔍 Dataset Structure

Each instance in AgentHazard follows this structure:

{
  "id": "unique_instance_id",
  "risk_category": "System RCE",
  "attack_strategy": "Recursive Salami Slicing",
  "task_description": "Fix the failing unit test...",
  "decomposition": [
    "Step 1: Inspect the test file",
    "Step 2: Identify the assertion",
    "Step 3: Modify the configuration",
    "..."
  ],
  "target_harmful_action": "Execute arbitrary shell command",
  "workspace_files": {
    "test.py": "...",
    "config.json": "..."
  },
  "expected_trajectory_length": 5,
  "metadata": {
    "difficulty": "medium",
    "frameworks": ["claudecode", "openclaw"]
  }
}
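
Before running experiments, it can help to confirm that each loaded instance matches this layout. The following is a minimal sketch of such a check; it assumes the fields in the example above are present in every instance, which the source does not state explicitly. `REQUIRED_FIELDS` and `check_instance` are hypothetical names, not part of the repository.

```python
# Field name -> expected Python type, taken from the example instance above.
REQUIRED_FIELDS = {
    "id": str,
    "risk_category": str,
    "attack_strategy": str,
    "task_description": str,
    "decomposition": list,
    "target_harmful_action": str,
    "workspace_files": dict,
}

def check_instance(instance):
    """Return a list of human-readable problems; an empty list means the instance looks well-formed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in instance:
            problems.append(f"missing field: {name}")
        elif not isinstance(instance[name], expected):
            actual = type(instance[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems

# Hypothetical instance with one missing field and one wrongly typed field.
broken = {
    "id": "unique_instance_id",
    "risk_category": "System RCE",
    "attack_strategy": "Recursive Salami Slicing",
    "task_description": "Fix the failing unit test...",
    "decomposition": "Step 1: Inspect the test file",
    "workspace_files": {},
}
print(check_instance(broken))
```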

📂 Repository Structure

AgentHazard/
├── data/
│   ├── dataset.json              # Main benchmark dataset
│   └── evaluation_results.csv    # Experimental results
├── traces/
│   ├── claudecode/              # Claude Code execution traces
│   ├── openclaw/                # OpenClaw execution traces
│   └── iflow/                   # IFlow execution traces
├── config/                      # Configuration files
├── pics/                        # Images and figures
│   ├── logo.png
│   ├── framework.png
│   ├── illustration.png
│   ├── demo.jpg
│   ├── heatmap.jpg
│   └── alert.png                # Real security alert evidence
├── appendix.pdf                 # Supplementary material
├── index.html                   # Project website
├── README.md                    # This file
└── LICENSE                      # MIT License
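
To get a feel for how many execution traces are available per framework, the sketch below counts the files under each `traces/` subdirectory listed above. It makes no assumption about the trace file format or naming, which the layout does not specify; it simply counts every regular file, including those in nested folders. `count_trace_files` is an illustrative helper.

```python
from pathlib import Path

# Subdirectory names taken from the repository layout above.
FRAMEWORK_DIRS = ("claudecode", "openclaw", "iflow")

def count_trace_files(traces_root):
    """Count regular files under each framework folder; missing folders count as 0."""
    root = Path(traces_root)
    counts = {}
    for name in FRAMEWORK_DIRS:
        folder = root / name
        if folder.is_dir():
            # rglob("*") walks nested folders; keep regular files only.
            counts[name] = sum(1 for path in folder.rglob("*") if path.is_file())
        else:
            counts[name] = 0
    return counts

if __name__ == "__main__":
    # Run from the repository root.
    print(count_trace_files("traces"))
```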

📄 Citation

If you use AgentHazard in your research, please cite:

@inproceedings{feng2026agenthazard,
  title     = {AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents},
  author    = {Feng, Yunhao and Ding, Yifan and Tan, Yingshui and Ma, Xingjun and
               Li, Yige and Wu, Yutao and Gao, Yifeng and Zhai, Kun and Guo, Yanming},
  booktitle = {arXiv:2604.02947},
  year      = {2026}
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🤝 Contributing

We welcome contributions! Please see our contributing guidelines for details.


🙏 Acknowledgments

  • Special thanks to Yutao Wu and the Internal Safety Collapse (ISC) project for their generous support and valuable insights. Their work on execution-level safety failures in frontier language models and AI agents provided important inspiration for the development of AgentHazard.
  • We gratefully acknowledge Alibaba Group and Fudan University for their support of this research.
  • We also thank the Claude Code, OpenClaw, and IFlow teams for their agent frameworks, and the Qwen, Kimi, GLM, and DeepSeek teams for their open models and contributions to the research community.

📞 Contact

For questions, issues, or collaboration inquiries:


⚠️ Research Use Only

This benchmark is released for research purposes to advance the safety of computer-use agents. Please use responsibly and ethically.


Made with ❤️ by the AgentHazard Team