DevBench: Towards LLMs based Automated Software Development
March 31, 2024
Overview | Benchmarking | Setup | Usage | Citation | License
Contact: libowen.ne@gmail.com, chao.peng@acm.org
Check out our paper (arXiv:2403.08604)!
Overview
- DevBench is a comprehensive benchmark designed to evaluate LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. By integrating these interconnected steps under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.
- The DevBench dataset comprises 22 curated repositories across 4 programming languages (Python, C/C++, Java, JavaScript), covering a wide range of domains such as machine learning, databases, web services, and command-line utilities.
- DevBench includes a comprehensive and automatic evaluation suite for all tasks involved. We provide extensive acceptance and unit test cases for the implementation task. Additionally, we utilize LLM-as-a-Judge for evaluating the software design task. Further details on our task specifications can be found here.
- We have developed a baseline agent system based on the popular multi-agent software development system, ChatDev. Special thanks to our collaborators at ChatDev!
Benchmarking Code LLMs
Evaluation results of the coding tasks on DevBench.
| Model | Environment Setup: Pass@ Example Usage§ | Implementation: Pass@ Accept. Test¶ | Implementation: Pass@ Unit Test¶ | Acceptance Testing: Oracle Test§ | Unit Testing: Oracle Test§ | Unit Testing: Coverage$ |
|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | 33.3 | 4.2 | 4.3 | 11.7 | 28.7 | 24.6(61.4) |
| GPT-4-Turbo-1106 | 41.7 | 6.9 | 6.8 | 25.9 | 33.6 | 36.7(66.7) |
| GPT-4-Turbo-0125 | 41.7 | 7.1 | 8.0 | 29.2 | 36.5 | 33.2(66.3) |
| CodeLlama-7B-Instruct | 8.3 | 0.0 | 0.0 | 0.0 | 3.0 | 3.6(71.0) |
| CodeLlama-13B-Instruct | 25.0 | 0.6 | 0.0 | 0.0 | 5.1 | 8.6(57.6) |
| CodeLlama-34B-Instruct | 16.7 | 0.6 | 0.5 | 4.5 | 21.1 | 25.4(72.6) |
| DeepSeek-Coder-1.3B-Instruct | 8.3 | 0.0 | 0.1 | 0.0 | 5.6 | 2.7(27.0) |
| DeepSeek-Coder-6.7B-Instruct | 25.0 | 2.9 | 3.9 | 20.5‡ | 23.5 | 28.2(70.6) |
| DeepSeek-Coder-33B-Instruct | 16.7 | 4.4 | 5.5 | 13.6 | 32.8 | 35.7(79.4) |
Evaluation results of the software design on DevBench.
The code for the software design evaluation can be found here.
| Model | w/ Tie: General Principles† | w/ Tie: Faithfulness‡ | w/o Tie: General Principles | w/o Tie: Faithfulness |
|---|---|---|---|---|
| GPT-4-Turbo-0125 | 97.9 | 97.9 | 100.0 | 100.0 |
| GPT-4-Turbo-1106 | 91.7 | 85.4 | 100.0 | 100.0 |
| CodeLlama-7B-Instruct | 4.2 | 8.3 | 4.2 | 4.5 |
| CodeLlama-13B-Instruct | 18.8 | 14.6 | 10.5 | 5.3 |
| CodeLlama-34B-Instruct | 39.6 | 33.3 | 33.3 | 21.4 |
| DeepSeek-Coder-1.3B-Instruct | 16.7 | 16.7 | 5.5 | 5.6 |
| DeepSeek-Coder-6.7B-Instruct | 35.4 | 35.4 | 31.6 | 29.4 |
| DeepSeek-Coder-33B-Instruct | 52.1 | 50.0 | 53.8 | 50.0 |
| Agree w/ Human Majority | 60.4 | 51.6 | 79.2 | 83.2 |
Set Up with Docker
For a secure and isolated environment, we offer Docker support for DevBench. Please refer to our detailed Installation Guide.
Usage
1. Prepare the environment variables
Add your DevBench directory to your PYTHONPATH variable.
export PYTHONPATH="${PYTHONPATH}:${path_to_devbench}"
To run the benchmark_data/java/Actor_relationship_game repo, also set your TMDB API key.
export TMDB_API_KEY=${your_TMDB_key}
2. Prepare the chat models
OpenAI GPT models
Set your OpenAI API key as an environment variable.
export OPENAI_API_KEY="your_OpenAI_API_key"
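Before launching a run, it can help to confirm that the variables from the steps above are actually set. The following is a minimal sketch, not part of DevBench: the helper `missing_env_vars` is hypothetical, and treating `OPENAI_API_KEY` as required assumes you are using an OpenAI model.

```python
import os

# Variable names come from the setup steps above; the helper is hypothetical.
REQUIRED_VARS = ("PYTHONPATH", "OPENAI_API_KEY")  # OPENAI_API_KEY: OpenAI models only
OPTIONAL_VARS = ("TMDB_API_KEY",)  # only needed for Actor_relationship_game


def missing_env_vars(env, names=REQUIRED_VARS):
    """Return the names that are unset or empty in the mapping `env`."""
    return [name for name in names if not env.get(name)]


# Report anything missing in the current process environment.
for name in missing_env_vars(os.environ):
    print("Required variable not set:", name)
for name in missing_env_vars(os.environ, OPTIONAL_VARS):
    print("Optional variable not set:", name)
```

The check only reports whether each variable is non-empty. It does not verify that the DevBench directory is on PYTHONPATH or that the keys are valid.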
Open source models
For deploying open source models, please refer to lmdeploy or vllm.
After the deployment, please configure the IP address in open_source_model.json.
For codellama and deepseek-coder models, which are integrated into our experiments, simply fill in the IP address in {"model_name": $model_ip_address}.
For example:
{
"codellama-7b-instruct": "",
"codellama-13b-instruct": "",
"codellama-34b-instruct": "",
"deepseek-coder-1.3b-instruct": "",
"deepseek-coder-6.7b-instruct": "",
"deepseek-coder-33b-instruct": "$model_ip_address"
}
For additional models, add a new field as shown below.
{
"customized-model": {"$model_name": "$model_ip_address"}
}
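The two JSON snippets above can also be generated rather than edited by hand. This is a sketch under assumptions: `build_model_config` is a hypothetical helper, the addresses are placeholders from the documentation IP range, and it assumes the customized-model field lives in the same open_source_model.json file as the built-in models.

```python
import json

# Built-in model names, copied from the example above.
BUILTIN_MODELS = [
    "codellama-7b-instruct",
    "codellama-13b-instruct",
    "codellama-34b-instruct",
    "deepseek-coder-1.3b-instruct",
    "deepseek-coder-6.7b-instruct",
    "deepseek-coder-33b-instruct",
]


def build_model_config(addresses, custom=None):
    """Build the dict to store in open_source_model.json.

    addresses: maps built-in model names to the address of their server.
    custom: optional (model_name, address) pair for a customized model.
    """
    unknown = set(addresses) - set(BUILTIN_MODELS)
    if unknown:
        raise ValueError(f"Not a built-in model: {sorted(unknown)}")
    # Models that are not deployed keep an empty address, as in the example.
    config = {name: addresses.get(name, "") for name in BUILTIN_MODELS}
    if custom is not None:
        name, address = custom
        config["customized-model"] = {name: address}
    return config


config = build_model_config(
    {"deepseek-coder-33b-instruct": "192.0.2.10"},  # placeholder address
    custom=("my-model", "192.0.2.11"),  # hypothetical custom model
)
print(json.dumps(config, indent=4))
```

The sketch prints the JSON instead of writing it, so the existing file is left untouched until you have reviewed the output.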
3. Run the agent system
Run script
cd agent_system/baseline
python run.py --config Implementation --input_path ../../benchmark_data/python/TextCNN/ --model gpt-4-turbo-new --model_source openai --review execution --evaluate
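When running several repositories or tasks, the command above can be assembled programmatically. The sketch below is an illustration only: `build_run_command` is a hypothetical helper, it uses only the flags that appear in the command above, and it assumes `--evaluate` is a bare switch, as in that command.

```python
import shlex

# Allowed values as listed in this README.
TASKS = ("SoftwareDesign", "EnvironmentSetup", "Implementation",
         "AcceptanceTesting", "UnitTesting")
REVIEW_MODES = ("none", "normal", "execution")
MODEL_SOURCES = ("open_source", "openai")


def build_run_command(config, input_path, model, model_source, review,
                      evaluate=False):
    """Return the run.py invocation as an argument list."""
    if config not in TASKS:
        raise ValueError(f"Unknown task: {config}")
    if review not in REVIEW_MODES:
        raise ValueError(f"Unknown review mode: {review}")
    if model_source not in MODEL_SOURCES:
        raise ValueError(f"Unknown model source: {model_source}")
    cmd = ["python", "run.py",
           "--config", config,
           "--input_path", input_path,
           "--model", model,
           "--model_source", model_source,
           "--review", review]
    if evaluate:
        cmd.append("--evaluate")
    return cmd


cmd = build_run_command(
    "Implementation", "../../benchmark_data/python/TextCNN/",
    "gpt-4-turbo-new", "openai", review="execution", evaluate=True,
)
print(shlex.join(cmd))  # prints the command shown above
```

The resulting list can be passed to `subprocess.run(cmd, cwd=...)` with `cwd` pointing at the baseline agent directory, since run.py is invoked from there.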
Parameters
- config (str) - Specifies the task in DevBench: SoftwareDesign|EnvironmentSetup|Implementation|AcceptanceTesting|UnitTesting.
- input_path (str) - Specifies the repo path.
- project_name (str) - Specifies the repo name. If empty, defaults to the last segment of input_path (i.e., input_path.split('/')[-1]).
- model (str) - Specifies the name of the language model: gpt-3.5-turbo|gpt-4|gpt-4-32k|gpt-4-turbo|claude-2|claude-2.1|codellama-7b|codellama-13b|codellama-34b|deepseek-coder-1.3b|deepseek-coder-6.7b|deepseek-coder-33b|customized-model.
- customized_model_name (Optional, str) - Specifies the custom model name if the value of the model parameter is customized-model.
- model_source (str) - Specifies the model source, either an open source model or an OpenAI closed source model: open_source|openai.
- review (str) - Specifies the review mode: none|normal|execution.
  - none: a single forward pass of Coding.
  - normal: Coding and CodeReview in alternation, with CodeReview lacking program execution feedback.
  - execution: Coding and CodeReview in alternation, with CodeReview including program execution feedback.
- read_src_code (bool) - Whether to read source code in the AcceptanceTesting and UnitTesting tasks.
- evaluate (bool) - Whether to run the evaluation at the end. The evaluation for the software design task can be found here.
- temperature (float) - Sampling temperature for the model.
- top_p (float) - Nucleus sampling (top-p) value for the model.
When using the normal or execution review mode, set the cyclenum parameter in CompanyConfig/{task_name}/ChatChainConfig.json to the number of review rounds. The default is 2.
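One detail of the project_name default in the parameter list is worth checking by hand. Read literally, input_path.split('/')[-1] yields an empty string when the path ends with a slash, as the example path in the run script does. The sketch below illustrates this. Whether run.py normalizes the path first is not stated here, and `robust_project_name` is a hypothetical alternative, not DevBench code.

```python
def default_project_name(input_path):
    """Literal reading of the documented default: the last '/'-separated segment."""
    return input_path.split('/')[-1]


def robust_project_name(input_path):
    """Hypothetical variant that ignores trailing slashes."""
    return input_path.rstrip('/').split('/')[-1]


path = "../../benchmark_data/python/TextCNN/"
print(repr(default_project_name(path)))  # ''
print(repr(robust_project_name(path)))   # 'TextCNN'
```

If the derived name turns out to be empty in practice, passing project_name explicitly or dropping the trailing slash from input_path avoids the ambiguity.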
Citation
@article{li2024devbench,
title={DevBench: A Comprehensive Benchmark for Software Development},
author={Li, Bowen and Wu, Wenhan and Tang, Ziwei and Shi, Lin and Yang, John and Li, Jinyang and Yao, Shunyu and Qian, Chen and Hui, Binyuan and Zhang, Qicheng and others},
journal={arXiv preprint arXiv:2403.08604},
year={2024}
}
License
- Source Code Licensing: Our project's source code is licensed under the Apache 2.0 License. This license permits the use, modification, and distribution of the code, subject to certain conditions outlined in the Apache 2.0 License.
- Data Licensing: The related data utilized in our project is licensed under CC BY 4.0, which allows anyone to copy, distribute, transmit, adapt and make commercial use of the dataset.