May 3, 2026

Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Evaluate AI agents on realistic software evolution • Multi-step planning and adaptation • Long-horizon reasoning challenges

Introduction • Quick Start • How It Works • Evaluation • Acknowledgements

Introduction

SWE-EVO is a benchmark for evaluating AI coding agents on autonomous software evolution tasks. Unlike benchmarks that focus on isolated coding problems, SWE-EVO simulates realistic scenarios in which agents must iteratively evolve complex codebases according to high-level software requirement specifications (SRS).

Using versioned histories from real Python open-source projects (such as Django and NumPy), SWE-EVO challenges agents to:

Interpret high-level software requirement specifications
Plan and implement multi-step changes
Navigate large-scale repositories with thousands of files
Produce correct changes across multiple versions

The Research Question

Given an existing codebase and evolving requirements, can AI agents autonomously perform sustained planning, adaptation, and evolution over long interactions?

Key Features

Feature	Description
Realistic Tasks	Derived from authentic project evolution histories, emphasizing change over time
Multi-Step Evaluation	Agents must plan, update, and validate changes across versions
Modular Scaffolds	Supports evaluation via OpenHands and SWE-agent
Public Dataset	Curated instances with tools for reproducible evaluation
Long-Horizon Focus	Challenges AI systems with iterative evolution and sustained reasoning

Quick Start

1. Install Dependencies

pip install -e .

2. Run Evaluation

python SWE-bench/evaluate_instance.py \
  --trajectories_path <path-to-your-trajectories> \
  --max_workers <num_workers> \
  --scaffold <scaffold_name>
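
For scripted runs, the same invocation can be assembled from Python. The sketch below is a hypothetical wrapper and not part of SWE-EVO: `build_eval_command` and `run_evaluation` are illustrative names, and it assumes only the three flags shown above and that it is started from the repository root.

```python
import subprocess
import sys

# Scaffold names accepted by --scaffold, as documented in this README.
SUPPORTED_SCAFFOLDS = ("OpenHands", "SWE-agent")


def build_eval_command(trajectories_path, max_workers, scaffold):
    """Return the argv list for SWE-bench/evaluate_instance.py (hypothetical helper)."""
    if scaffold not in SUPPORTED_SCAFFOLDS:
        raise ValueError(f"unknown scaffold: {scaffold!r}")
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return [
        sys.executable,
        "SWE-bench/evaluate_instance.py",
        "--trajectories_path", str(trajectories_path),
        "--max_workers", str(max_workers),
        "--scaffold", scaffold,
    ]


def run_evaluation(trajectories_path, max_workers, scaffold):
    """Run the evaluation script and raise CalledProcessError if it fails."""
    command = build_eval_command(trajectories_path, max_workers, scaffold)
    return subprocess.run(command, check=True).returncode
```

Validating the scaffold name before launching avoids starting a long evaluation with a mistyped argument.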

How It Works

Software Evolution Model

Conceptual model of software evolution in SWE-EVO: from base system to evolved system through requirement interpretation and change execution.

Evolution Process

┌──────────────────┐
│   Base Codebase  │  Initial state of the repository
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│   SRS Document   │  High-level requirements specification
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│   AI Agent       │  Plans and implements changes
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│ Evolved Codebase │  Updated repository matching requirements
└──────────────────┘

Evaluation

Using OpenHands Scaffold

1. Configure Your OpenHands Agent

cd OpenHands

Edit config.toml in the OpenHands directory (OpenHands/config.toml from the repository root) and add a new model block. You can leave api_key = "" and pass the real key through an environment variable (for example, export OPENAI_API_KEY=...).

Example:

[llm.your_model]
model = "your_model"
api_key = ""        # leave blank and pass the key via an environment variable
base_url = "your_url"
temperature = 0.0

2. Generate Trajectories

Use the OpenHands run_infer.sh script:

./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
  [model_config] \
  [git_version] \
  [agent] \
  [eval_limit] \
  [num_workers] \
  [dataset_path] \
  [dataset_split] \
  [n_runs] \
  [mode]

Example:

./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
  llm.your_model \
  HEAD \
  CodeActAgent \
  48 \
  3 \
  your_project_path/SWE-EVO/hf_out/hf_jsonl \
  test \
  1 \
  swe

Notes:

model_config refers to the config block name you added (for example, llm.your_model)
For more information, see the OpenHands SWE-Bench instructions
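
Because run_infer.sh takes nine positional arguments, it is easy to swap two of them (for example, eval_limit and num_workers). One way to guard against that is to name the values in Python and let the field order generate the command. This is an illustrative sketch: the `InferArgs` class and `to_command` function are hypothetical, and the field names simply mirror the placeholders in the usage template above.

```python
from dataclasses import astuple, dataclass

RUN_INFER = "./evaluation/benchmarks/swe_bench/scripts/run_infer.sh"


@dataclass
class InferArgs:
    """Named view of the positional arguments, in the order the script expects."""
    model_config: str
    git_version: str
    agent: str
    eval_limit: int
    num_workers: int
    dataset_path: str
    dataset_split: str
    n_runs: int
    mode: str


def to_command(args):
    """Build the argv list; dataclass field order gives the positional order."""
    return [RUN_INFER, *(str(value) for value in astuple(args))]


example = InferArgs(
    model_config="llm.your_model",
    git_version="HEAD",
    agent="CodeActAgent",
    eval_limit=48,
    num_workers=3,
    dataset_path="your_project_path/SWE-EVO/hf_out/hf_jsonl",
    dataset_split="test",
    n_runs=1,
    mode="swe",
)
print(" ".join(to_command(example)))
```

The resulting list can be passed to `subprocess.run` from inside the OpenHands directory.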

3. Evaluate Your Results

After inference finishes, evaluate the generated trajectories:

python SWE-bench/evaluate_instance.py \
  --trajectories_path /path/to/openhands/outputs \
  --max_workers 8 \
  --scaffold OpenHands

Using SWE-agent Scaffold

1. Generate SWE-agent Trajectories

cd SWE-agent

sweagent run-batch \
  --config config/default.yaml \
  --agent.model.name [YOUR_MODEL] \
  --agent.model.api_key [YOUR_API_KEY] \
  --agent.model.api_base [YOUR_API_BASE] \
  --agent.model.reasoning_effort "[low|medium|high]" \
  --instances.type swe_bench \
  --instances.path_override "your_project_path/SWE-EVO/hf_out/hf_dataset" \
  --instances.split [dataset_split] \
  --instances.slice :1000 \
  --num_workers [num_workers] \
  --output_dir [output_dir]

Example:

MODEL="gpt-5-2025-08-07"

sweagent run-batch \
  --config config/default.yaml \
  --agent.model.name "$MODEL" \
  --agent.model.api_key "$OPENAI_API_KEY" \
  --agent.model.api_base "https://api.openai.com/v1" \
  --agent.model.reasoning_effort "medium" \
  --instances.type swe_bench \
  --instances.path_override "your_project_path/SWE-EVO/hf_out/hf_dataset" \
  --instances.split "test" \
  --instances.slice ":1000" \
  --num_workers 4 \
  --output_dir "trajectories/$MODEL"

Note: Refer to the SWE-agent documentation for additional configuration details and advanced usage.
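
The `--instances.slice` value `:1000` reads like Python slice notation, which would select at most the first 1000 instances. Assuming that interpretation (confirm it against the SWE-agent documentation), the sketch below illustrates how such a string maps onto a list of instances. `parse_slice` and `select` are hypothetical helpers written for this example, not SWE-agent functions.

```python
def parse_slice(spec):
    """Convert 'stop', 'start:stop', or 'start:stop:step' into a slice object."""
    if not spec:
        return slice(None)  # empty string selects everything
    parts = spec.split(":")
    if len(parts) > 3:
        raise ValueError(f"invalid slice: {spec!r}")
    values = [int(part) if part else None for part in parts]
    if len(values) == 1:
        return slice(values[0])  # a bare number is treated as the stop index
    return slice(*values)


def select(instances, spec):
    """Apply the slice string to a list of instances."""
    return instances[parse_slice(spec)]


instances = [f"instance_{i}" for i in range(5)]
print(select(instances, ":3"))
```

Under this reading, a slice such as `:1000` acts as an upper bound: if the split contains fewer instances, all of them are selected.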

2. Evaluate the Results

After inference finishes, evaluate the generated trajectories:

python SWE-bench/evaluate_instance.py \
  --trajectories_path /path/to/sweagent/outputs \
  --max_workers 8 \
  --scaffold SWE-agent

Parameters

Parameter	Description
`--trajectories_path`	Path to your agent trajectory outputs
`--max_workers`	Number of parallel workers for evaluation
`--scaffold`	Scaffold name (`OpenHands` or `SWE-agent`)

Requirements

Python 3.10+
Compatible scaffold installation (OpenHands or SWE-agent)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Acknowledgements

SWE-EVO builds on the original SWE-bench benchmark. We are grateful to the SWE-bench team for their foundational work in software engineering evaluation.

Special thanks to:

SWE-bench for pioneering software engineering benchmarks for AI
OpenHands for their open-source AI agent framework
SWE-agent for their agent scaffold and tooling
The open-source community behind Django, NumPy, and other projects used in this benchmark