E2EDev: Benchmarking Large Language Models in End-to-End Software Development Task

October 16, 2025

The Features of E2EDev

📦 Repository Structure

1. `E2EDev_data/`

This folder contains annotated data for 46 selected E2EDev projects. Each project folder includes:

source_code/: The original source code of the selected project, including necessary assets like images or audio files.
requirment_with_tests.json: Contains fine-grained user requirements. Each requirement is paired with:
- Gherkin-style test cases
- Corresponding Python step implementations
prompt.txt: All fine-grained requirements concatenated into a prompt template, ready for direct use in prompting tasks.
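The layout described above can be checked before running any experiments. The following is a minimal sketch, assuming every direct subfolder of `E2EDev_data/` is one project; the helper name `find_incomplete_projects` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Entries that each project folder is described as containing.
EXPECTED_ENTRIES = ("source_code", "requirment_with_tests.json", "prompt.txt")


def find_incomplete_projects(data_root):
    """Map each project folder name to the expected entries it lacks.

    Hypothetical helper: assumes every direct subfolder of data_root
    is one project. Complete projects are omitted from the result.
    """
    incomplete = {}
    for project in sorted(Path(data_root).iterdir()):
        if not project.is_dir():
            continue
        missing = [name for name in EXPECTED_ENTRIES if not (project / name).exists()]
        if missing:
            incomplete[project.name] = missing
    return incomplete


if __name__ == "__main__":
    root = Path("E2EDev_data")
    if root.is_dir():
        # An empty dict means every project has all three entries.
        print(find_incomplete_projects(root))
```

An empty result suggests the dataset was downloaded completely; any listed project is worth re-fetching before use.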

📌 The dataset is also available on Hugging Face, where each entry corresponds to a single test case: 👉 https://huggingface.co/datasets/GuanZhiZhao/E2EDev

2. `HITL-MAA/`

This folder contains the source code for our semi-automated annotation framework, which includes:

Pre-annotation for TestID
A Human-In-The-Loop Multi-Agent Architecture (HITL-MAA)

⚙️ Dependencies

ChromeDriver
- Our annotation and testing framework relies on the behave testing tool, and the tests need ChromeDriver to control the browser.
  Install the ChromeDriver version that matches the Chrome browser installed on your machine.
  Download it from:
  👉 https://developer.chrome.com/docs/chromedriver/downloads
Python Libraries
- Install the required Python libraries using the provided requirements.txt file.
  Run the following command in your terminal:

pip install -r requirements.txt
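Once both dependencies are installed, a quick check that the executables are reachable can save a failed test run later. This is a minimal sketch using only the standard library; it assumes ChromeDriver is expected on your `PATH`, which the repository does not state explicitly, and the helper name `find_executable` is hypothetical.

```python
import shutil


def find_executable(name="chromedriver"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    # behave installs a command-line entry point of the same name.
    for tool in ("chromedriver", "behave"):
        location = find_executable(tool)
        print(f"{tool}: {location or 'not found on PATH'}")
```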

🛠️ How to Use the Annotation Framework

Step 1: Configure LLM API

Edit the configuration file (config.py) under HITL-MAA/ and provide your:

API Key
Base URL
Model (default is gpt-4o)
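As an illustration, the three settings could be expressed as follows. This is only a sketch: the variable names (`API_KEY`, `BASE_URL`, `MODEL`), the environment variable names, and the `validate_config` helper are assumptions, and the base URL shown is just an example endpoint, so check the identifiers actually used in `config.py` before editing.

```python
import os

# Hypothetical names: the real config.py may use different identifiers.
API_KEY = os.environ.get("LLM_API_KEY", "")  # avoid committing a real key
BASE_URL = os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1")  # example endpoint
MODEL = os.environ.get("LLM_MODEL", "gpt-4o")  # default model named in this README


def validate_config(api_key, base_url, model):
    """Return a list of human-readable problems; an empty list means usable."""
    problems = []
    if not api_key:
        problems.append("API key is empty")
    if not base_url.startswith(("http://", "https://")):
        problems.append("base URL must start with http:// or https://")
    if not model:
        problems.append("model name is empty")
    return problems


if __name__ == "__main__":
    for problem in validate_config(API_KEY, BASE_URL, MODEL):
        print("config problem:", problem)
```

Reading the key from an environment variable keeps it out of version control, which is one reasonable choice among several.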

Step 2: Pre-Annotate the Code

Run the following script:

python HITL-MAA/TestID_annotation/rewrite_code.py

Set the following arguments inside the script:

old_folder: The parent folder of the original project(s) you want to annotate.
new_folder: The folder where the pre-annotated projects will be saved.
These values have default paths set in the script. You can modify them as needed to annotate other projects.

Step 3: Launch HITL Annotation

Run:

python HITL-MAA/HITL_MAA/requirement_gen_MAS_per_senario.py

Before running, go to line 1224 and modify the following line in the main() function:

project_path = os.path.normpath(os.path.join(current_dir, '..', '..', 'E2EDev_data_withTestID'))
# Replace "E2EDev_data_withTestID" with your actual dataset folder name (relative to E2EDev/), where you want to annotate user requirements and test cases.
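To see where this expression points, it helps to resolve it by hand. The sketch below assumes `current_dir` is the directory containing the script (`HITL-MAA/HITL_MAA/`) and uses a made-up checkout location; `posixpath` is used instead of `os.path` only so the printed result is the same on every operating system.

```python
import posixpath

# Hypothetical checkout location, used only for illustration.
current_dir = "/home/user/E2EDev/HITL-MAA/HITL_MAA"

# Each ".." climbs one level, so two of them land in the E2EDev/ root.
project_path = posixpath.normpath(
    posixpath.join(current_dir, "..", "..", "E2EDev_data_withTestID")
)
print(project_path)  # /home/user/E2EDev/E2EDev_data_withTestID
```

Under that assumption, the folder name you substitute is looked up directly inside `E2EDev/`.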

👨‍💻 During the annotation process, human input may be required. You will be prompted in the terminal when necessary. Simply enter your revised content as requested.

✅ Running Behave Tests

To test the annotated projects, use the testing script:

python run_behave_test.py

Before running, go to line 115 and modify the following line:

project_root = os.path.normpath(os.path.join(current_dir, 'For_Behave_Warehouse(TestOnly)'))
# Replace 'For_Behave_Warehouse(TestOnly)' with the name of your test project folder (relative to E2EDev/).

👨‍💻 The For_Behave_Warehouse(TestOnly) folder contains demo projects that help you understand the testing workflow quickly. The run_behave_test.py script will automatically:

Execute behave tests
Save outputs to:
- behave_logs/
- behave_results/
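For readers who want to see roughly what such a run involves, the steps above can be sketched as follows. This is not the implementation in `run_behave_test.py`: the function names, the one-file-per-project naming, and the choice of a JSON report are assumptions, and it presumes each project folder has the `features/` directory that behave looks for by default.

```python
import subprocess
from pathlib import Path


def build_behave_command(results_file):
    """Build a behave invocation that writes a JSON report to results_file."""
    return ["behave", "--format", "json", "--outfile", str(results_file)]


def run_project(project_dir, logs_dir="behave_logs", results_dir="behave_results"):
    """Run behave inside project_dir, saving console output and a JSON report."""
    project_dir = Path(project_dir).resolve()
    logs = Path(logs_dir).resolve()
    results = Path(results_dir).resolve()
    logs.mkdir(parents=True, exist_ok=True)
    results.mkdir(parents=True, exist_ok=True)

    command = build_behave_command(results / f"{project_dir.name}.json")
    completed = subprocess.run(
        command, cwd=project_dir, capture_output=True, text=True
    )
    # Keep both streams so failures in step code are visible in the log.
    log_file = logs / f"{project_dir.name}.log"
    log_file.write_text(completed.stdout + completed.stderr)
    return completed.returncode
```

A non-zero return code from `run_project` indicates that behave reported at least one failure or could not run.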

📊 Metrics Calculation (Effectiveness & Efficiency)

🟢 Effectiveness Evaluation

Before running effectiveness metrics in the Metrics/ folder:

In the evaluation script’s main() function, set the following:
- The path to the results folder generated by run_behave_test.py.

🔵 Efficiency Evaluation

Before running efficiency metrics:

In the script’s if __name__ == "__main__": block, ensure the following paths are correctly configured:
- Log file path: the directory containing logs generated by the annotation framework.
- Generated project directory: the folder where the output projects are stored.
- Expected output directory: the location of the reference or ground truth output files.