AGENTISSUE-BENCH

October 27, 2025

Paper | Leaderboard

AGENTISSUE-BENCH is the first reproducible issue resolution benchmark focused on real-world agent system issues. It is designed to evaluate the efficacy of state-of-the-art software engineering (SE) agents in resolving these issues.

🗓️ Updates

  • 2025-05: Initial benchmark release

📚 Benchmark Dataset

Through a multi-step filtering process (failure reproduction, patch reproduction, and non-flakiness verification), we collect 50 reproducible agent issues, which form AGENTISSUE-BENCH.

Each issue is containerized as a Docker image and hosted on Docker Hub: 🔗 Docker Hub Repository

To retrieve the images for all issues, run:

$ python pull_images.py

To pull a specific image by tag, use:

$ python pull_images.py --tag <tag>

To remove all pulled Docker images and containers, run:

$ python remove_images.py

To remove a specific image and container by tag:

$ python remove_images.py --tag <tag>
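The internals of pull_images.py and remove_images.py are not shown here. As a rough sketch of what a per-tag pull or removal amounts to, assuming each issue image is addressed as `<repository>:<tag>`, the helpers below build the corresponding `docker` commands. The function names and the `repository` parameter are hypothetical; the actual Docker Hub repository name must be taken from the link above.

```python
import subprocess


def image_ref(repository, tag):
    """Return the `<repository>:<tag>` reference that Docker expects."""
    return f"{repository}:{tag}"


def pull_command(repository, tag):
    """Build the command that pulls one issue image (hypothetical helper)."""
    return ["docker", "pull", image_ref(repository, tag)]


def remove_command(repository, tag):
    """Build the command that removes one issue image (hypothetical helper)."""
    return ["docker", "rmi", image_ref(repository, tag)]


def run(command):
    """Run a docker command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Note that this sketch only removes the image. Any container created from it would have to be removed first (for example with `docker rm`), which the provided remove_images.py script is described as handling.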

To test the issues in AGENTISSUE-BENCH:

$ python test_agentissue_bench.py

📊 Results

Overall Resolution Rate

The following figure shows the distribution of AgentIssue-Bench: pie

The following figure compares the resolution rate on AgentIssue-Bench with that on traditional software issues: bar

The following table presents the overall results of SE agents on AgentIssue-Bench: table_results

🧪 Patch Evaluation

To evaluate generated patches in AGENTISSUE-BENCH:

  1. Create a directory named Patches:
     $ mkdir Patches
  2. Place your patch files inside subdirectories named by tag:
     Patches/{tag_name}/your_patch_files.patch
  3. Run the evaluation script:
     $ python eval_patches.py
  4. You can see the result in patch_eval.log.
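Before running the evaluation, it can help to confirm that the Patches directory follows the layout described in the steps above. The following is a minimal sketch of such a check, not part of the benchmark itself: `collect_patches` and `empty_tags` are hypothetical helper names, and the sketch assumes patch files end in `.patch`, as in the example path.

```python
from pathlib import Path


def collect_patches(root="Patches"):
    """Map each tag subdirectory under `root` to its sorted .patch file names."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise FileNotFoundError(f"{root_path} does not exist; create it first")
    patches = {}
    # Only immediate subdirectories are treated as tags.
    for tag_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        patches[tag_dir.name] = sorted(f.name for f in tag_dir.glob("*.patch"))
    return patches


def empty_tags(patches):
    """Return the tags whose directory contains no .patch file."""
    return [tag for tag, files in patches.items() if not files]
```

Calling `empty_tags(collect_patches())` from the repository root lists any tag directories that would have nothing to evaluate.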

📁 Generated Patches

The Generated Patches directory contains all patches generated by our evaluation of different SE agents and Large Language Models (LLMs). The patches are organized as follows:

Generated Patches/
├── swe-agent/         # Patches generated by SWE-agent
├── Agentless/         # Patches generated by Agentless
└── Auto-code-rover/   # Patches generated by Auto-code-rover

Each agent directory contains patches generated using two state-of-the-art LLMs:

  • claude-3-5-sonnet-20241022
  • gpt-4o
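To get a quick overview of what the Generated Patches directory contains, one could count the patch files under each agent directory. The sketch below is illustrative only: `count_patches` is a hypothetical helper, the `.patch` suffix is an assumption, and because the layout beneath each agent directory is not specified here, the search is recursive rather than tied to a particular per-model structure.

```python
from pathlib import Path

# Agent directory names as listed in the tree above.
AGENTS = ("swe-agent", "Agentless", "Auto-code-rover")


def count_patches(root="Generated Patches", agents=AGENTS, suffix=".patch"):
    """Count files ending in `suffix` anywhere below each agent directory."""
    counts = {}
    for agent in agents:
        agent_dir = Path(root) / agent
        if agent_dir.is_dir():
            # rglob searches all nested subdirectories.
            counts[agent] = sum(1 for _ in agent_dir.rglob(f"*{suffix}"))
        else:
            # A missing agent directory is reported as zero patches.
            counts[agent] = 0
    return counts
```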