💻 SWE‑Dev: Evaluating and Training Autonomous Feature‑Driven Software Development

October 23, 2025 · View on GitHub

💻 SWE‑Dev: Evaluating and Training Autonomous Feature‑Driven Software Development

🎯 SWE‑Dev is the first large‑scale benchmark and training corpus for feature‑driven development (FDD) — the real‑world task of adding new functionality to existing codebases. It ships 14 000 training and 500 test tasks, each with a runnable environment and developer‑written unit tests, enabling both supervised fine‑tuning and reinforcement learning from executable rewards.

📄 Dataset Overview

✨ Highlights

🌍 Real‑world FDD tasks drawn from mature open‑source projects.
⚙️ End‑to‑end reproducibility – every task bundles source, deps, Dockerfile & tests.
🤖 RL‑ready – deterministic pass/fail reward signals from pytest.
💪 Challenging – Claude‑3.7‑Sonnet reaches only 22.45 % Pass@3 on the hard split.
📈 Effective for model improvement – fine‑tuning a 7 B model on SWE‑Dev yields GPT‑4o‑level performance on hard split.

🚀 Getting Started

1. 🛠️ Installation

conda create -n swe-dev python=3.12.0

# bleeding‑edge
git clone [https://github.com/DorothyDUUU/SWE-Dev-dataset.git](https://github.com/DorothyDUUU/SWE-Dev-dataset.git)
cd SWE-Dev-dataset
pip install -r requirements.txt

```bash
conda create -n swe-dev python=3.12.0

# bleeding‑edge
git clone https://github.com/DorothyDUUU/SWE-Dev-dataset.git
cd SWE-Dev-dataset
pip install -r requirements.txt

2. 📥 Download the dataset & Build evaluation enviornment

Download dataset:

python dataset/download_data.py --dest ./data

The script organises the dataset as:

data/
 ├── train/
 │   ├── level1/
 │   ├── level2/
 │   └── level3/
 └── test/
     ├── Easy/
     └── Hard/

Docker Installation: Train set and test set are originated from different packages, thus the packages are installed in different docker images.

Test docker: (Need at least 10GB storage space for docker image)

python download_docker.py --split test

Train docker: (Need at least 100GB storage space for docker image)

python download_docker.py --split train

Docker Image for each sample: The docker image for each sample is the f"{package_name}-image", package_name is the value of package_name in sample metadata.

For instance, the image name for data/test/advertools-test_ad_create-level1-metadata.json, which package_name is advertools, the docker image for this sample is advertools-image.

Build evaluation API: For further usage for RL training, we wrapped the docker test in an API server, which could conviniently build in latter use.

3. ⏱️ Quick Inference

Single Agent Inference If you want to test on your own model, you can use the following command:

bash SWE-Dev-dataset/infer/single/run.sh

Multi-Agent Inference We also integrate 10 Multi-Agent Systems inference in the MASLab framework for SWE-Dev Dataset. Please refer to infer/MAS/README-MAS.md.

No.	Methodology	Venue	Role	Topo.	Tool	Generalization
1	Reflexion	NeurIPS 2023	Fixed	Fixed	No	Yes
2	Self-Consistency	ICLR 2024	Fixed	Fixed	No	Yes
3	LLM Debate	ICML 2024	Fixed	Fixed	No	Pre-defined Roles
4	MAD	EMNLP 2024	Fixed	Fixed	No	Pre-defined Roles
5	Self-Refine	NeurIPS 2024	Fixed	Fixed	No	Yes
6	AgentVerse	ICLR 2024	Dynamic	Fixed	No	Yes
7	MetaGPT	ICLR 2024	Fixed	Fixed	Yes	Coding-Specific
8	ChatDev	ACL 2024	Fixed	Fixed	Yes	Coding-Specific
9	MapCoder	ACL 2024	Fixed	Fixed	Yes	Coding-Specific
10	EvoMAC	ICLR 2025	Dynamic	Dynamic	Yes	Coding-Specific

4. Fine‑tuning

👤 Single-Agent SFT We use the Llama-Factory to conduct training, please refer to the train/single_agent_SFT.yaml for training parameters. SFT Dataset will be released in hugginggface.
👤 Single-Agent RL Comming soon...
👥 Multi-Agent SFT Comming soon...

🏆 Leaderboard

📊 We maintain a leaderboard at covering:

Category	#Methods	Easy Best Pass@1	Hard Best Pass@1
Chat LLMs	17	54.37 %	19.13 %
Reasoning LLMs	10	51.21 %	22.51 %
Multi‑Agent Systems	10	-	-

Single LLM

📢 News

[20250908] 🎉 Our benchmark is used by Kimi-K2 titter.!

[20250601] 🎉 Release the inference script and docker images for both test split and train split!

[20250522] 📄 Release the preprint version! See the preprint.

✍️ Citation

If you use SWE‑Dev, please cite:

@article{du2025swedev,
  title={SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development},
  author={Du, Yaxin and Cai, Yuzhu and Zhou, Yifan and Wang, Cheng and Qian, Yu and Pang, Xianghe and Liu, Qian and Hu, Yue and Chen, Siheng},
  journal={arXiv preprint arXiv:2505.16975},
  year={2025}
}

📝 License

Code and dataset are released under the Apache 2.0 license. See the LICENSE file for details.

🙏 Acknowledgements

We thanks for the MAS-Lab for contributing the multiagent system inference framework, Llama-Factory and Verl for providing training framework.