DevBench

March 13, 2024

Tasks in DevBench

DevBench is structured around five key tasks, each focusing on a critical phase of the software development lifecycle. Our modular design allows for independent testing and evaluation of each task.

Task 1: Software Design

Task Specification: The model is provided with the Product Requirements Document (PRD) to create:

  • Unified Modeling Language (UML) class and sequence diagrams in Mermaid, and
  • Architectural designs using hierarchical file-tree structures.

The class diagram acts as a fundamental component in software system modeling, illustrating classes, their attributes, and interrelations. Sequence diagrams detail processes and objects involved and the sequence of messages exchanged to carry out the software functionality. Architectural design is aimed at crafting a structured layout of source code, compilation scripts, and auxiliary files crucial for software implementation.

Evaluation: LLM-as-a-judge. The evaluation is anchored by two principal metrics: general principles and faithfulness.

  • General principles include cohesion and decoupling, and practicability.
  • Faithfulness gauges the extent to which models adhere to specified instructions.

Learn More.

Task 2: Environment Setup

Task Specification: Models are provided with the PRD, UML diagrams, and architecture design to generate the dependency file needed to initialize the development environment (Conda for Python, Gradle for Java, and NPM for JavaScript). The benchmark does not include an Environment Setup task for C/C++, because these languages lack a universally accepted, user-friendly dependency management system.

Evaluation: For each programming language, the generated dependency file is executed in a predetermined base environment defined in a Dockerfile, and the repository's example usage code is then run. The principal metric is the success rate of the executed example code.

Task 3: Implementation

Task Specification: Models are provided with the PRD, UML diagrams and architecture design, and are then instructed to develop code for each source code file as specified in the architecture design. The generation process is guided to adhere to a partial order derived from a predefined directed acyclic graph, thereby ensuring structured and logical code development.
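One way to obtain a generation order consistent with such a DAG is a topological sort. The sketch below uses Python's standard `graphlib` module and assumes that each key maps to the files it depends on, so that dependencies are generated first. The file names and the `generation_order` helper are illustrative, and the benchmark's own ordering logic may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each file maps to the files it depends on.
code_file_dag = {
    "main.py": ["a.py", "b.py"],
    "a.py": ["a1.py", "a2.py"],
}


def generation_order(dag):
    """Return a file order in which every file follows its dependencies."""
    # TopologicalSorter treats the mapped values as predecessors,
    # so static_order() yields dependencies before their dependents.
    return list(TopologicalSorter(dag).static_order())


order = generation_order(code_file_dag)
print(order)
```

Any order returned this way places `a1.py` and `a2.py` before `a.py`, and `a.py` before `main.py`; the relative order of unrelated files is not fixed.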

Evaluation: An automated testing framework is tailored to each programming language, integrating PyTest for Python, GTest for C++, JUnit for Java, and Jest for JavaScript. The reference acceptance and unit tests are executed in a predefined reference environment, and the evaluation metric is the pass rate of these tests.

Task 4: Acceptance Testing

Task Specification: Models are provided with the PRD, UML diagrams, and architecture design, with the option to include the implementation source code to generate acceptance test code. For libraries, acceptance tests are implemented through code that invokes the library's API, followed by assertions made on the responses of the API.
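To make the library case concrete, here is a minimal pytest-style sketch of an acceptance test: it calls a public API and asserts on the response. The `slugify` function is a hypothetical stand-in defined inline so that the example is self-contained; it is not taken from any DevBench repo.

```python
# Stand-in for a library under test; `slugify` is a hypothetical API.
def slugify(title: str) -> str:
    """Lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())


def test_slugify_basic():
    # Invoke the public API, then assert on its response.
    assert slugify("Hello DevBench World") == "hello-devbench-world"


def test_slugify_collapses_whitespace():
    # Leading, trailing, and repeated whitespace should not produce empty parts.
    assert slugify("  many   spaces ") == "many-spaces"
```

In a real repo the function would be imported from the package under `src` rather than defined next to the tests.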

Evaluation: The generated acceptance tests are run against the benchmark implementation code, and the pass rate of the tests is reported.

Task 5: Unit Testing

Task Specification: The PRD, UML diagrams, and architecture design are furnished to the models to generate unit tests.

Evaluation: LLM-generated unit tests are run against the reference source code using the testing framework described under Task 3. In addition to the pass rate, statement coverage is reported as a quantitative measure of test comprehensiveness:

$$\text{Statement Coverage} = \left( \frac{\text{Number of Executed Statements}}{\text{Total Number of Statements}} \right) \times 100\%$$
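As a small worked example, statement coverage can be computed directly from the two counts. The function name and the sample numbers below are illustrative only; in practice the counts come from a coverage tool.

```python
def statement_coverage(executed: int, total: int) -> float:
    """Return the percentage of statements executed by the tests."""
    if total <= 0:
        raise ValueError("total must be a positive number of statements")
    if not 0 <= executed <= total:
        raise ValueError("executed must be between 0 and total")
    return 100.0 * executed / total


# 45 of 50 statements executed.
print(statement_coverage(45, 50))  # 90.0
```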

Dataset Overview

DevBench comprises a curated collection of repos organized by programming language.

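| Repo | Language | Domain | #code files | #code lines | #code tokens | #acceptance tests | #unit tests | Unit test coverage |
|---|---|---|---|---|---|---|---|---|
| TextCNN | Python | DL, NLP | 5 | 403 | 1566 | 1 | 10 | 99 |
| ArXiv digest | Python | SE, API | 1 | 198 | 901 | 4 | 38 | 94 |
| chakin | Python | NLP | 1 | 162 | 225 | 1 | 1 | 86 |
| readtime | Python | ALGO | 3 | 284 | 920 | 4 | 8 | 95 |
| hone | Python | SE | 4 | 274 | 844 | 5 | 7 | 90 |
| Stocktrends | Python | ALGO | 1 | 384 | 1350 | 2 | 7 | 85 |
| GeoText | Python | NLP | 2 | 470 | 1701 | 5 | 4 | 98 |
| lice | Python | SE | 2 | 376 | 1329 | 6 | 25 | 88 |
| PSO | Python | ALGO | 2 | 168 | 578 | 1 | 5 | 93 |
| hybrid images | Python | ALGO, CV | 1 | 144 | 746 | 1 | 19 | 90 |
| Actor Relationship Game | Java | ALGO, API | 8 | 493 | 1453 | 4 | 16 | 64.32 |
| Leftist Trees and Fibonacci Heaps Comparison | Java | ALGO | 3 | 632 | 2009 | 2 | 2 | 45.32 |
| Redis | Java | SE, DB | 9 | 779 | 2546 | 1 | 17 | 78.6 |
| idcenter | Java | SE | 4 | 333 | 1140 | 3 | 4 | 54.2 |
| image similarity | Java | SE, CV | 3 | 382 | 1397 | 2 | 2 | 71.34 |
| xlsx2csv | C/C++ | SE | 10 | 476 | 1440 | 5 | 8 | 95.17 |
| people management | C/C++ | SE, DB | 6 | 540 | 2043 | 7 | 9 | 95.14 |
| Area Calculation | C/C++ | SE | 7 | 162 | 307 | 3 | 3 | 90.48 |
| Graph BFS DFS | C/C++ | ALGO | 5 | 667 | 2828 | 5 | 22 | / |
| Logistic Management System | C/C++ | SE | 7 | 630 | 2007 | 7 | 17 | 99.11 |
| listen-now-frontend | JS | Web | 6 | 232 | 492 | 1 | 0 | / |
| register | JS | Web | 6 | 223 | 741 | 3 | 0 | / |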

Dataset Introduction

Repos in benchmark_data have been organized by programming language. The typical repo structure includes:

  • repo
    • acceptance_tests Acceptance test scripts
    • docs
      • PRD.md Product requirement document
      • UML_class.md UML Class diagrams
      • UML_sequence.md UML Sequence diagrams
      • architecture_design.md Architecture design of the repo
      • requirements.txt Dependencies needed for the repo
    • examples Usage examples
    • unit_tests Unit test scripts
    • src Source code
    • repo_config.json Repo configuration file
    • setup_shell_script.sh Script to set up the runtime environment

The repo_config.json file is crucial for defining the scope and requirements of each task, ensuring a seamless evaluation process. Below is an example configuration snippet:

{
    # I: SoftwareDesign
    # II: EnvironmentSetup
    # III: Implementation
    # IV: AcceptanceTesting
    # V: UnitTesting

    # docs
    "PRD": "{path_to_prd}", # I, II, III, IV, V
    "UML_class": "{path_to_UML_class}", # II, III, IV, V
    "UML_sequence": "{path_to_UML_sequence}", # II, III, IV, V
    "dependencies": "{path_to_dependencies_file}", # II, III, IV, V
    "architecture_design": "{path_to_architecture_design_file}", # II, III, IV, V
    "language": "{language}", # II, III, IV, V

    # path to usage examples
    "usage_examples": "path_to_usage_examples",        # file or dir; II

    # test_path
    "unit_tests": "{unit_tests_path}",  # III: evaluation; V: only dir is needed to host the generated test code
    "acceptance_tests": "{acceptance_tests_path}", # III: evaluation; IV: only dir is needed to host the generated test code

    # setup files
    "required_files": [ 
        "requirements.txt",
        "a.cpp",
        "a.h"
    ],  # III

    # setup shell script 
    "setup_shell_script": "path_to_script", # III

    # DAG for code files so that LLMs can generate code in order
    "code_file_DAG": {
        "main.py": ["a.py", "b.py", "c.py", "d.py"],
        "a.py": ["a1.py", "a2.py"]
    }, #III, IV, V

    # unit test linking and scripts for debugging
    "unit_tests_linking": {
        "unit_tests/test_1.py": ["src/module1.py", "src/module2.py"],    
            # unit test: corresponding code files
        ...
    }, # V

    # test in batch
    "unit_test_script": "pytest --cov=. --cov-report=json:unit_test_cov.json --json-report --json-report-file=unit_test_report.json unit_tests",  # III, V
    "acceptance_test_script": "pytest --cov=. --cov-report=json:acceptance_test_cov.json --json-report --json-report-file=acceptance_test_report.json acceptance_tests", # III, IV

    # prompts for test generation
    "fine_unit_test_prompt": {
        "path/to/unittest1": "prompt",
        "path/to/unittest2": "prompt",
    ...
    }, # V
    "fine_acceptance_test_prompt": {
        "path/to/acceptancetest1": "prompt",
        "path/to/acceptancetest2": "prompt",
    ...
    } # IV
 }
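The batch test scripts in the example configuration write two JSON files per run: a test report and a coverage report. The sketch below shows one way to turn those files into a pass rate and a coverage figure. It assumes the layouts I understand the pytest-json-report plugin and coverage.py to produce (a `summary` object with `passed` and `total` counts, and a `totals` object with `percent_covered`); the helper names are hypothetical, and the key names should be checked against the installed tool versions.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a JSON report file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def pass_rate(report: dict) -> float:
    """Percentage of passed tests, from an assumed pytest-json-report layout."""
    summary = report.get("summary", {})
    total = summary.get("total", 0)
    passed = summary.get("passed", 0)
    # An empty or missing summary counts as a 0% pass rate.
    return 100.0 * passed / total if total else 0.0


def percent_covered(cov: dict) -> float:
    """Overall coverage, from an assumed coverage.py JSON report layout."""
    return float(cov["totals"]["percent_covered"])


# Usage, with the file names from the example scripts above:
#   report = load_json("unit_test_report.json")
#   cov = load_json("unit_test_cov.json")
#   print(pass_rate(report), percent_covered(cov))
```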