PatchPilot: A Stable and Cost-Efficient Agentic Patching Framework
May 31, 2025
Overview | Installation | Quick Start | Citation
News
- [May 2025] PatchPilot accepted at ICML 2025!
- [May 2025] PatchPilot code is now open-sourced!
- [February 2025] PatchPilot achieves superior performance on SWE-bench while maintaining low cost (< $1 per instance)!
- [February 2025] PatchPilot paper is available on arXiv!
Overview
PatchPilot: Balancing Efficacy, Stability, and Cost-Efficiency
PatchPilot is a rule-based planning patching tool that balances patching efficacy, stability, and cost-efficiency.
Key Innovations:
- Five-Component Workflow: Reproduction, Localization, Generation, Validation, and Refinement
- Cost-Efficient: Less than $1 per instance while maintaining high performance
- High Stability: More stable than agent-based planning methods
- Superior Performance: Outperforms existing open-source methods on SWE-bench
Architecture Overview
PatchPilot's workflow consists of five specialized components:
- Reproduction: Reproduce the reported bug to understand the issue
- Localization: Identify problematic code locations with multi-level analysis
- Generation: Generate high-quality patch candidates
- Validation: Validate patches through comprehensive testing
- Refinement: Unique refinement step to improve patch quality
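As a rough mental model, the five components can be pictured as a pipeline whose order is fixed in advance instead of being chosen by a model at run time. The sketch below only illustrates that idea and is not PatchPilot's code: the `Stage` type, `make_stage`, `run_workflow`, and the `state` dictionary are hypothetical placeholders, and the linear order simply follows the list above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical: each stage takes the shared state dict and returns an updated copy.
Stage = Callable[[Dict[str, object]], Dict[str, object]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def run(state: Dict[str, object]) -> Dict[str, object]:
        history = list(state.get("history", []))
        history.append(name)
        return {**state, "history": history}
    return run


# The stage order is decided up front (rule-based planning).
# The names mirror the component list; the stages themselves are stubs.
WORKFLOW: List[Tuple[str, Stage]] = [
    (name, make_stage(name))
    for name in ("reproduction", "localization", "generation",
                 "validation", "refinement")
]


def run_workflow(issue: str) -> Dict[str, object]:
    """Run every stub stage in the fixed order and return the final state."""
    state: Dict[str, object] = {"issue": issue}
    for _name, stage in WORKFLOW:
        state = stage(state)
    return state
```

Calling `run_workflow("example issue")` returns a state whose `history` entry lists the five stage names in the order they ran.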
Installation
Docker Setup (Recommended)
- Pull the Docker image:
docker pull 3rdn4/patchpilot_verified:v1
- Run the container with Docker-in-Docker support:
docker run --privileged -v /var/run/docker.sock:/var/run/docker.sock -it 3rdn4/patchpilot_verified:v1
Note: `--privileged -v /var/run/docker.sock:/var/run/docker.sock` is required for the Docker-in-Docker functionality used by SWE-bench.
- Set up the environment inside the container:
cd /opt
git clone git@github.com:ucsb-mlsec/PatchPilot.git
cd PatchPilot
conda activate patchpilot
export PYTHONPATH=$PYTHONPATH:$(pwd)
- Configure API keys:
# For Anthropic Claude
export ANTHROPIC_API_KEY=your_anthropic_key_here
# OR for OpenAI
export OPENAI_API_KEY=your_openai_key_here
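Before launching long runs, it can help to confirm that one of the two keys is actually visible to Python. The following is a minimal sketch, not part of PatchPilot: `detect_backend` is a hypothetical helper, and the labels it returns are illustrative only and are not claimed to be valid `--backend` values.

```python
import os
from typing import Mapping, Optional


def detect_backend(env: Mapping[str, str]) -> Optional[str]:
    """Return which provider key is set, preferring Anthropic if both are.

    The returned labels are illustrative, not PatchPilot option values.
    """
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None


# Report what the current shell environment provides.
backend = detect_backend(os.environ)
if backend is None:
    print("No API key found: set ANTHROPIC_API_KEY or OPENAI_API_KEY.")
else:
    print(f"Found an API key for: {backend}")
```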
Quick Start
1. Reproduction
First, reproduce the bugs to understand the issues:
python patchpilot/reproduce/reproduce.py \
--reproduce_folder results/reproduce \
--num_threads 50 \
--setup_map setup_result/verified_setup_map.json \
--tasks_map setup_result/verified_tasks_map.json \
--task_list_file swe_verify_tasks.txt
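To try the pipeline on a handful of instances first, one option is to pass a smaller file to `--task_list_file`. The sketch below assumes the file holds one instance ID per line, which this README does not confirm; `format_task_list` is a hypothetical helper and the IDs in the usage comment are only examples in the SWE-bench naming style.

```python
from typing import Iterable, List


def format_task_list(instance_ids: Iterable[str]) -> str:
    """Return unique, non-empty instance IDs joined one per line.

    Assumption: the task list file is plain text with one ID per line.
    """
    seen = set()
    ordered: List[str] = []
    for raw in instance_ids:
        instance_id = raw.strip()
        if instance_id and instance_id not in seen:
            seen.add(instance_id)
            ordered.append(instance_id)
    return "".join(f"{instance_id}\n" for instance_id in ordered)


# Usage (writes a hypothetical subset file next to the full list):
#   from pathlib import Path
#   Path("my_tasks.txt").write_text(
#       format_task_list(["django__django-11099", "sympy__sympy-20590"]),
#       encoding="utf-8",
#   )
```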
2. Localization
Step 1: Multi-Level Localization
python patchpilot/fl/localize.py \
--file_level \
--direct_line_level \
--output_folder results/localization \
--top_n 5 \
--compress \
--context_window=20 \
--temperature 0.7 \
--match_partial_paths \
--reproduce_folder results/reproduce \
--task_list_file swe_verify_tasks.txt \
--num_samples 4 \
--num_threads 16 \
--benchmark verified
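The `--context_window=20` flag controls how many lines of surrounding code accompany each localized span. As an illustration only, the sketch below widens a line range symmetrically and clamps it to the file; whether PatchPilot applies the window in exactly this way is an assumption, and `expand_with_context` is a hypothetical name.

```python
from typing import Tuple


def expand_with_context(
    start: int, end: int, context_window: int, total_lines: int
) -> Tuple[int, int]:
    """Widen a 1-indexed, inclusive line range by `context_window` lines on
    each side, clamped to the bounds of a file with `total_lines` lines."""
    if start > end:
        raise ValueError("start must not exceed end")
    new_start = max(1, start - context_window)
    new_end = min(total_lines, end + context_window)
    return new_start, new_end
```

For a span covering lines 10 to 12 of a 100-line file, a window of 20 yields the range 1 to 32, because the lower bound is clamped to the first line.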
Step 2: Merge Localization Results
python patchpilot/fl/localize.py \
--merge \
--output_folder results/localization/merged \
--start_file results/localization/loc_outputs.jsonl \
--num_samples 4
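After merging, a quick sanity check is to count the records in the merged file and see which fields they carry. The sketch below makes no assumption about the localization schema beyond each line being a JSON value; `summarize_jsonl` is a hypothetical helper, not a PatchPilot utility.

```python
import json
from typing import Iterable, List, Tuple


def summarize_jsonl(lines: Iterable[str]) -> Tuple[int, List[str]]:
    """Count JSON records and collect the union of their top-level keys."""
    count = 0
    keys = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)


# Usage, with the merged file path taken from the repair command:
#   path = "results/localization/merged/loc_all_merged_outputs.jsonl"
#   with open(path, encoding="utf-8") as fh:
#       print(summarize_jsonl(fh))
```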
3. Repair and Validation
Generate patches with integrated validation:
python patchpilot/repair/repair.py \
--loc_file results/localization/merged/loc_all_merged_outputs.jsonl \
--output_folder results/repair \
--loc_interval \
--top_n=5 \
--context_window=20 \
--max_samples 12 \
--batch_size 4 \
--benchmark verified \
--reproduce_folder results/reproduce \
--verify_folder results/verify \
--setup_map setup_result/verified_setup_map.json \
--tasks_map setup_result/verified_tasks_map.json \
--num_threads 16 \
--task_list_file swe_verify_tasks.txt \
--refine_mod
Note: Functionality tests are retrieved through `useful_scripts/generate_functest.py` and do not use the `pass_to_pass` approach.
4. Evaluation
Run SWE-bench evaluation on the generated patches:
cd /opt/orig_swebench/SWE-bench
conda activate swe_bench
python -m swebench.harness.run_evaluation \
--predictions_path [path_to_best_patches_round_2.jsonl] \
--max_workers 16 \
--run_id [experiment_name]
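Before starting an evaluation run, it may be worth checking that every line of the predictions file is complete. The upstream SWE-bench harness reads `instance_id`, `model_name_or_path`, and `model_patch` from each record; that the harness copy under `/opt/orig_swebench` expects the same keys is an assumption, and `find_incomplete_predictions` is a hypothetical helper.

```python
import json
from typing import Iterable, List, Tuple

# Keys the upstream SWE-bench harness expects in each prediction record.
# Assumption: the harness used here follows the same format.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def find_incomplete_predictions(
    lines: Iterable[str],
) -> List[Tuple[int, List[str]]]:
    """Return (line_number, missing_keys) for each record lacking a key."""
    problems: List[Tuple[int, List[str]]] = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines but keep the line numbering
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((line_no, missing))
    return problems
```

An empty result means every non-blank line carries all three keys; it does not check that the patches themselves apply.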
Configuration Parameters
| Parameter | Description |
|---|---|
| `--max_samples` | Total number of patch samples to generate per instance |
| `--batch_size` | Number of samples generated per batch (early stopping if validation passes) |
| `--num_threads` | Number of parallel processing threads |
| `--task_list_file` | File containing instances to be fixed |
| `--loc_file` | Output file from the localization step |
| `--backend` | Model backend (claude, openai, etc.) |
| `--model` | Specific model version |
| `--loc_interval` | Provide multiple context intervals vs. min-max range only |
| `--top_n` | Number of files to consider as context |
| `--context_window` | Lines of context around localized code |
| `--refine_mod` | Enable PatchPilot's unique refinement component |
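The interaction between `--max_samples` and `--batch_size` can be sketched as a loop that generates candidates batch by batch and stops once a batch contains a candidate that passes validation. This is a simplified illustration under stated assumptions, not PatchPilot's implementation: `sample_in_batches`, `generate`, and `passes_validation` are hypothetical, and the real stopping criterion may differ.

```python
from typing import Callable, List, Optional, Tuple


def sample_in_batches(
    generate: Callable[[int], str],
    passes_validation: Callable[[str], bool],
    max_samples: int = 12,
    batch_size: int = 4,
) -> Tuple[Optional[str], int]:
    """Generate candidates in batches and stop after the first batch that
    contains a candidate passing validation.

    Returns (accepted_candidate_or_None, number_of_samples_generated).
    """
    generated = 0
    while generated < max_samples:
        # The last batch may be smaller if batch_size does not divide max_samples.
        size = min(batch_size, max_samples - generated)
        batch: List[str] = [generate(generated + i) for i in range(size)]
        generated += size
        for candidate in batch:
            if passes_validation(candidate):
                return candidate, generated
    return None, generated
```

With the values from the repair command (`--max_samples 12`, `--batch_size 4`), this loop runs at most three batches, and a success in the second batch means only eight samples are paid for.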
Resuming Interrupted Experiments
If an experiment is interrupted, rerun the same command; PatchPilot will resume from where it left off. To start a different experiment, clean the output folders or use different output directories.
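How PatchPilot tracks progress is not described here. One common way to implement this kind of resumption is to skip any instance that already has a record in the output file, as in the sketch below. The `instance_id` field name and the `pending_tasks` helper are assumptions made for illustration.

```python
import json
from typing import Iterable, List


def pending_tasks(task_ids: Iterable[str], output_lines: Iterable[str]) -> List[str]:
    """Return the task IDs that have no completed record in the output.

    Assumption: each output line is a JSON object with an "instance_id" field.
    """
    done = set()
    for line in output_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a partially written line is treated as not done
        if isinstance(record, dict) and "instance_id" in record:
            done.add(record["instance_id"])
    return [task_id for task_id in task_ids if task_id not in done]
```

Under this scheme, reusing an output directory from another experiment would cause its finished instances to be skipped, which is consistent with the advice to clean the folders or pick a new directory.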
Citation
If you find PatchPilot useful in your research, please cite our paper:
@article{li2025patchpilot,
title={PatchPilot: A Stable and Cost-Efficient Agentic Patching Framework},
author={Li, Hongwei and Tang, Yuheng and Wang, Shiqi and Guo, Wenbo},
journal={arXiv preprint arXiv:2502.02747},
year={2025}
}
Made with ❤️ by the UCSB ML Security Team