💻 SWE‑Dev: Evaluating and Training Autonomous Feature‑Driven Software Development

October 23, 2025


🎯 SWE‑Dev is the first large‑scale benchmark and training corpus for feature‑driven development (FDD) — the real‑world task of adding new functionality to existing codebases. It ships 14 000 training and 500 test tasks, each with a runnable environment and developer‑written unit tests, enabling both supervised fine‑tuning and reinforcement learning from executable rewards.

📄 Dataset Overview


✨ Highlights

  • 🌍 Real‑world FDD tasks drawn from mature open‑source projects.
  • ⚙️ End‑to‑end reproducibility – every task bundles source, deps, Dockerfile & tests.
  • 🤖 RL‑ready – deterministic pass/fail reward signals from pytest.
  • 💪 Challenging – Claude‑3.7‑Sonnet reaches only 22.45 % Pass@3 on the hard split.
  • 📈 Effective for model improvement – fine‑tuning a 7 B model on SWE‑Dev yields GPT‑4o‑level performance on the hard split.

🚀 Getting Started

1. 🛠️ Installation

```bash
conda create -n swe-dev python=3.12.0
conda activate swe-dev

# bleeding‑edge
git clone https://github.com/DorothyDUUU/SWE-Dev-dataset.git
cd SWE-Dev-dataset
pip install -r requirements.txt
```

2. 📥 Download the dataset & build the evaluation environment

Download dataset:

python dataset/download_data.py --dest ./data

The script organises the dataset as:

data/
 ├── train/
 │   ├── level1/
 │   ├── level2/
 │   └── level3/
 └── test/
     ├── Easy/
     └── Hard/
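As a quick sanity check after the download, the layout above can be summarised with a short script. This is a minimal sketch, not part of the repository: the helper name `count_files_by_group` is hypothetical, and it only assumes the `data/<split>/<group>/` directory structure shown above, counting every file regardless of its format.

```python
from collections import Counter
from pathlib import Path


def count_files_by_group(data_root: str) -> Counter:
    """Count files under <data_root>/<split>/<group>/, e.g. ('train', 'level1')."""
    root = Path(data_root)
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        # Only count files that sit at least two directories deep: split/group/.../file
        if len(parts) >= 3:
            counts[(parts[0], parts[1])] += 1
    return counts


if __name__ == "__main__":
    # "./data" matches the --dest value used in the download command above.
    for (split, group), n in sorted(count_files_by_group("./data").items()):
        print(f"{split}/{group}: {n} files")
```

Files placed directly under `train/` or `test/` are skipped by this sketch, so adjust the depth check if your copy of the dataset is organised differently.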

Docker Installation: The train and test sets are built from different packages, so their dependencies are installed in separate Docker images.

Test Docker images (require at least 10 GB of storage):

python download_docker.py --split test

Train Docker images (require at least 100 GB of storage):

python download_docker.py --split train

Docker image for each sample: Each sample runs in the image named f"{package_name}-image", where package_name is the value of the package_name field in the sample's metadata.

For instance, the sample data/test/advertools-test_ad_create-level1-metadata.json has package_name advertools, so its Docker image is advertools-image.
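The naming rule can be expressed as a small helper. This is an illustrative sketch only: the function name `image_name_for_sample` is hypothetical, and it assumes the metadata file is a JSON object with `package_name` as a top-level key, which the description above suggests but does not spell out.

```python
import json
from pathlib import Path


def image_name_for_sample(metadata_path: str) -> str:
    """Return the Docker image name for one sample's metadata file."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    # Assumption: package_name is stored at the top level of the metadata JSON.
    return f"{metadata['package_name']}-image"
```

Under that assumption, calling it on a metadata file whose package_name is advertools returns advertools-image.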

Build evaluation API: To support RL training, we wrap the Docker-based tests in an API server that can be conveniently built for later use.
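To make the idea of an executable reward concrete, here is a minimal sketch of turning a pytest run into a binary pass/fail signal. It is not the repository's API server: the function names are hypothetical, it runs pytest in the local interpreter rather than inside a sample's Docker image, and the timeout value is an arbitrary assumption. It relies only on pytest's documented behaviour of exiting with code 0 when all collected tests pass.

```python
import subprocess
import sys


def reward_from_exit_code(returncode: int) -> float:
    """Map a pytest exit code to a binary reward: 0 means every collected test passed."""
    return 1.0 if returncode == 0 else 0.0


def pytest_reward(test_path: str, timeout_s: int = 600) -> float:
    """Run pytest on test_path and return 1.0 on success, 0.0 otherwise."""
    try:
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", test_path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # treat a hanging test run as a failure
    return reward_from_exit_code(proc.returncode)
```

Non-zero exit codes such as 1 (test failures) or 5 (no tests collected) all map to a reward of 0.0 in this sketch, which keeps the signal deterministic.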

3. ⏱️ Quick Inference

Single-Agent Inference: To test your own model, run the following command:

bash SWE-Dev-dataset/infer/single/run.sh

Multi-Agent Inference: We also integrate inference for 10 multi-agent systems on the SWE-Dev dataset through the MASLab framework. Please refer to infer/MAS/README-MAS.md.

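| No. | Methodology | Venue | Role | Topo. | Tool | Generalization |
|---|---|---|---|---|---|---|
| 1 | Reflexion | NeurIPS 2023 | Fixed | Fixed | No | Yes |
| 2 | Self-Consistency | ICLR 2024 | Fixed | Fixed | No | Yes |
| 3 | LLM Debate | ICML 2024 | Fixed | Fixed | No | Pre-defined Roles |
| 4 | MAD | EMNLP 2024 | Fixed | Fixed | No | Pre-defined Roles |
| 5 | Self-Refine | NeurIPS 2024 | Fixed | Fixed | No | Yes |
| 6 | AgentVerse | ICLR 2024 | Dynamic | Fixed | No | Yes |
| 7 | MetaGPT | ICLR 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 8 | ChatDev | ACL 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 9 | MapCoder | ACL 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 10 | EvoMAC | ICLR 2025 | Dynamic | Dynamic | Yes | Coding-Specific |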

4. Fine‑tuning

  1. 👤 Single-Agent SFT We train with Llama-Factory; see train/single_agent_SFT.yaml for the training parameters. The SFT dataset will be released on Hugging Face.

  2. 👤 Single-Agent RL Coming soon...

  3. 👥 Multi-Agent SFT Coming soon...


🏆 Leaderboard

📊 We maintain a leaderboard covering:

| Category | #Methods | Easy Best Pass@1 | Hard Best Pass@1 |
|---|---|---|---|
| Chat LLMs | 17 | 54.37 % | 19.13 % |
| Reasoning LLMs | 10 | 51.21 % | 22.51 % |
| Multi‑Agent Systems | 10 | - | - |
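For readers unfamiliar with the Pass@k metric reported here, the sketch below shows the unbiased estimator commonly used in code-generation evaluation: with n sampled solutions per task, of which c pass all tests, Pass@k is estimated as 1 - C(n-c, k) / C(n, k). This README does not state how its Pass@k numbers are computed, so treat the formula and the helper name `pass_at_k` as assumptions for illustration rather than the benchmark's scoring code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # in which case any draw of k samples contains a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the fraction of passing samples, c / n; a benchmark-level score would then be the mean of this value over all tasks.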

Single LLM


📢 News

[20250908] 🎉 Our benchmark is used by Kimi-K2 (see Twitter)!

[20250601] 🎉 Released the inference script and Docker images for both the test and train splits!

[20250522] 📄 Released the preprint version! See the preprint.


✍️ Citation

If you use SWE‑Dev, please cite:

@article{du2025swedev,
  title={SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development},
  author={Du, Yaxin and Cai, Yuzhu and Zhou, Yifan and Wang, Cheng and Qian, Yu and Pang, Xianghe and Liu, Qian and Hu, Yue and Chen, Siheng},
  journal={arXiv preprint arXiv:2505.16975},
  year={2025}
}

📝 License

Code and dataset are released under the Apache 2.0 license. See the LICENSE file for details.

🙏 Acknowledgements

We thank MASLab for contributing the multi-agent system inference framework, and Llama-Factory and Verl for providing the training frameworks.