AGENTISSUE-BENCH
October 27, 2025
AGENTISSUE-BENCH is the first reproducible issue resolution benchmark focused on real-world agent system issues. It is designed to evaluate the efficacy of state-of-the-art software engineering (SE) agents in resolving these issues.
Updates
- 2025-05: Initial benchmark release
Benchmark Dataset
Through a multi-step filtering process (failure reproduction, patch reproduction, and non-flakiness verification), we collect 50 reproducible agent issues, which form AGENTISSUE-BENCH.
Each issue is containerized as a Docker image and hosted on Docker Hub: Docker Hub Repository
To retrieve the images for all issues, run:
$ python pull_images.py
To pull a specific image by tag, use:
$ python pull_images.py --tag <tag>
To remove all pulled Docker images and containers, run:
$ python remove_images.py
To remove a specific image and container by tag:
$ python remove_images.py --tag <tag>
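When only a few issues are needed, the per-tag commands above can be scripted. The following is a minimal sketch, assuming the scripts are run from the repository root and accept only the `--tag` option shown above. The helper names `build_commands` and `run_for_tags` are hypothetical and not part of the benchmark.

```python
import subprocess
import sys


def build_commands(script, tags):
    """Return one command line per tag, using only the documented --tag option.

    sys.executable stands in for the bare `python` shown above so that the
    same interpreter runs the scripts (an assumption of this sketch).
    """
    return [[sys.executable, script, "--tag", tag] for tag in tags]


def run_for_tags(script, tags, dry_run=True):
    """Run `script --tag <tag>` for each tag; with dry_run, only print the commands."""
    commands = build_commands(script, tags)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError at the first failing command.
            subprocess.run(cmd, check=True)
    return commands
```

For example, `run_for_tags("pull_images.py", ["tag_a", "tag_b"], dry_run=False)` would pull two images in sequence, where `tag_a` and `tag_b` are placeholders for real image tags. Passing `"remove_images.py"` instead cleans them up again.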
To test the issues in AGENTISSUE-BENCH:
$ python test_agentissue_bench.py
Results
Overall Resolution Rate
The following figure shows the distribution of AgentIssue-Bench:

The following figure shows the resolution rate of AgentIssue-Bench vs. traditional software issues:

The following table presents the overall results of SE agents on AgentIssue-Bench:

Patch Evaluation
To evaluate generated patches in AGENTISSUE-BENCH:
- Create a directory named Patches:
$ mkdir Patches
- Place your patch files inside subdirectories named by tag:
Patches/{tag_name}/your_patch_files.patch
- Run the evaluation script:
$ python eval_patches.py
- You can see the result in patch_eval.log
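The directory layout expected by the steps above can also be prepared programmatically. This is a minimal sketch, assuming patch files use the `.patch` extension as in the example path. `stage_patch` and `staged_tags` are hypothetical helpers, not functions shipped with the benchmark.

```python
import shutil
from pathlib import Path


def stage_patch(patch_file, tag, root="Patches"):
    """Copy one patch file into <root>/<tag>/, creating directories as needed."""
    dest_dir = Path(root) / tag
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(patch_file).name
    shutil.copy2(patch_file, dest)
    return dest


def staged_tags(root="Patches"):
    """List the tags that have at least one .patch file directly under <root>/<tag>/."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and any(True for _ in d.glob("*.patch"))
    )
```

Calling `staged_tags()` before running `eval_patches.py` gives a quick check of which issues have a patch in place.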
Generated Patches
The Generated Patches directory contains all patches generated by our evaluation of different SE agents and Large Language Models (LLMs). The patches are organized as follows:
Generated Patches/
├── swe-agent/          # Patches generated by SWE-agent
├── Agentless/          # Patches generated by Agentless
└── Auto-code-rover/    # Patches generated by Auto-code-rover
Each agent directory contains patches generated using two state-of-the-art LLMs:
- claude-3-5-sonnet-20241022
- gpt-4o
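To get an overview of the directory described above, the patches under each agent can be counted with a short script. This is a rough sketch: the layout below each agent directory is not specified here, so the code searches recursively, and the `.patch` suffix is an assumption. `count_patches` is a hypothetical helper.

```python
from pathlib import Path


def count_patches(root="Generated Patches", suffix=".patch"):
    """Count files ending in `suffix` under each top-level agent directory.

    The search is recursive because the sub-layout (for example, one folder
    per LLM) is assumed, not documented.
    """
    root = Path(root)
    counts = {}
    for agent_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[agent_dir.name] = sum(
            1 for f in agent_dir.rglob(f"*{suffix}") if f.is_file()
        )
    return counts
```

Running `count_patches()` from the repository root returns a dictionary keyed by agent directory name. If the generated files use a different extension, pass it via `suffix`.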