E2EDev: Benchmarking Large Language Models in End-to-End Software Development Task
October 16, 2025

📦 Repository Structure
1. E2EDev_data/
This folder contains annotated data for 46 selected E2EDev projects. Each project folder includes:
- source_code/: The original source code of the selected project, including necessary assets such as images or audio files.
- requirment_with_tests.json: Contains fine-grained user requirements. Each requirement is paired with:
  - Gherkin-style test cases
  - Corresponding Python step implementations
- prompt.txt: All fine-grained requirements concatenated into a template prompt, ready for direct use in prompting tasks.
The dataset is also available on Hugging Face, where each entry corresponds to a single test case: https://huggingface.co/datasets/GuanZhiZhao/E2EDev
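As a quick illustration of the layout described above, the following sketch collects the prompt.txt of every project folder. It assumes only that each project is a direct subfolder of the data root; the function name `load_prompts` is hypothetical and not part of the repository.

```python
from pathlib import Path


def load_prompts(data_root):
    """Map each project folder name to the text of its prompt.txt.

    Folders without a prompt.txt are skipped rather than treated as errors.
    """
    prompts = {}
    for project_dir in sorted(Path(data_root).iterdir()):
        prompt_file = project_dir / "prompt.txt"
        if prompt_file.is_file():
            prompts[project_dir.name] = prompt_file.read_text(encoding="utf-8")
    return prompts
```

Calling `load_prompts("E2EDev_data")` from the repository root would then return one prompt per annotated project.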
2. HITL-MAA/
This folder contains the source code for our semi-automated annotation framework, which includes:
- Pre-annotation for TestID
- A Human-In-The-Loop Multi-Agent Architecture (HITL-MAA)
Dependencies
- ChromeDriver
  - Our annotation and testing framework relies on the behave testing tool. Make sure to install the ChromeDriver version that matches your installed Chrome browser. Download it from: https://developer.chrome.com/docs/chromedriver/downloads
- Python Libraries
  - Install the required Python libraries using the provided requirements.txt file. Run the following command in your terminal:
pip install -r requirements.txt
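Before running the framework, it can help to confirm that both dependencies are visible to Python. The sketch below is not part of the repository; it assumes the ChromeDriver executable is named `chromedriver` and is expected on your PATH.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report whether ChromeDriver and behave can be found on this machine."""
    return {
        # shutil.which returns None when the executable is not on PATH
        "chromedriver": shutil.which("chromedriver") is not None,
        # find_spec returns None when the package is not installed
        "behave": importlib.util.find_spec("behave") is not None,
    }
```

A `False` value for either key suggests revisiting the corresponding installation step above.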
🛠️ How to Use the Annotation Framework
Step 1: Configure LLM API
Edit the configuration file (config.py) under HITL-MAA/ and provide your:
- API Key
- Base URL
- Model (default is gpt-4o)
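The exact variable names used in config.py are not shown here, so the following is only a sketch of one way to hold these three settings. The function `load_llm_config` and the environment variable names `E2EDEV_API_KEY`, `E2EDEV_BASE_URL`, and `E2EDEV_MODEL` are hypothetical; only the gpt-4o default comes from the text above.

```python
import os


def load_llm_config(env=os.environ):
    """Collect LLM settings from a mapping, defaulting the model to gpt-4o.

    The key names are illustrative assumptions, not the names in config.py.
    """
    return {
        "api_key": env.get("E2EDEV_API_KEY", ""),
        "base_url": env.get("E2EDEV_BASE_URL", ""),
        "model": env.get("E2EDEV_MODEL", "gpt-4o"),
    }
```

Reading the key from the environment rather than hard-coding it keeps credentials out of version control; whichever approach you use, the values must end up in config.py under HITL-MAA/ as the framework expects.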
Step 2: Pre-Annotate the Code
Run the following script:
python HITL-MAA/TestID_annotation/rewrite_code.py
Set the following arguments inside the script:
- old_folder: The parent folder of the original project(s) you want to annotate.
- new_folder: The folder where the pre-annotated projects will be saved.
- These values have default paths set in the script. You can modify them as needed to annotate other projects.
Step 3: Launch HITL Annotation
Run:
python HITL-MAA/HITL_MAA/requirement_gen_MAS_per_senario.py
Before running, go to line 1224 and modify the following line in the main() function:
project_path = os.path.normpath(os.path.join(current_dir, '..', '..', 'E2EDev_data_withTestID'))
# Replace "E2EDev_data_withTestID" with your actual dataset folder name (relative to E2EDev/), where you want to annotate user requirements and test cases.
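To see where that expression points, the sketch below reproduces the same `os.path` calls in a standalone helper. It assumes `current_dir` is the directory containing the script (HITL-MAA/HITL_MAA/); the helper name `resolve_dataset_path` is hypothetical.

```python
import os


def resolve_dataset_path(script_dir, dataset_folder):
    """Go two levels up from the script directory, then into the dataset folder."""
    return os.path.normpath(os.path.join(script_dir, "..", "..", dataset_folder))


# Example with a relative script directory, using the layout described above.
script_dir = os.path.join("E2EDev", "HITL-MAA", "HITL_MAA")
dataset_path = resolve_dataset_path(script_dir, "E2EDev_data_withTestID")
```

Under that assumption, the two `'..'` segments climb from HITL-MAA/HITL_MAA/ back to E2EDev/, which is why the dataset folder name is given relative to E2EDev/.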
👨‍💻 During the annotation process, human input may be required. You will be prompted in the terminal when necessary. Simply enter your revised content as requested.
Running Behave Tests
To test the annotated projects, use the testing script:
python run_behave_test.py
Before running, go to line 115 and modify the following line:
project_root = os.path.normpath(os.path.join(current_dir, 'For_Behave_Warehouse(TestOnly)'))
# Replace 'For_Behave_Warehouse(TestOnly)' with the name of your test project folder (relative to E2EDev/).
👨‍💻 The For_Behave_Warehouse(TestOnly) folder contains demo projects that help you understand the testing workflow quickly.
The run_behave_test.py script will automatically:
- Execute behave tests
- Save outputs to:
  - behave_logs/
  - behave_results/
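For readers who want to run a single project by hand, the following is a minimal sketch of the same idea: invoke behave inside one project folder and write its output to a log file. It is not the logic of run_behave_test.py; the function name `run_behave`, the `runner` parameter, and the `<project>.log` naming are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def run_behave(project_dir, log_dir="behave_logs", runner=subprocess.run):
    """Run behave in project_dir and save its combined output to a log file.

    `runner` defaults to subprocess.run and exists so the call can be stubbed.
    Returns the process exit code and the path of the written log.
    """
    project_dir = Path(project_dir)
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    # "python -m behave" uses the behave installed in the current environment.
    result = runner(
        [sys.executable, "-m", "behave"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )

    log_path = log_dir / f"{project_dir.name}.log"
    log_path.write_text(result.stdout + result.stderr, encoding="utf-8")
    return result.returncode, log_path
```

A non-zero exit code indicates that at least one scenario failed or that behave could not start, so the log file is the first place to look.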
Metrics Calculation (Effectiveness & Efficiency)
🟢 Effectiveness Evaluation
Before running effectiveness metrics in the Metrics/ folder:
- In the evaluation script's main() function, set the following:
  - The path to the results folder generated by run_behave_test.py.
🔵 Efficiency Evaluation
Before running efficiency metrics:
- In the script's if __name__ == "__main__": block, ensure the following paths are correctly configured:
  - Log file path: the directory containing logs generated by the annotation framework.
  - Generated project directory: the folder where the output projects are stored.
  - Expected output directory: the location of the reference or ground truth output files.
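Because these paths are edited by hand, a small pre-flight check can catch typos before a metrics run. The sketch below is an assumption-laden helper, not part of the Metrics/ scripts; the function name `find_missing_paths` and the labels passed to it are hypothetical.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the labels of all configured paths that do not exist on disk."""
    return [label for label, location in paths.items() if not Path(location).exists()]
```

For example, passing a dictionary with one entry each for the log path, the generated project directory, and the expected output directory returns an empty list when all three are in place.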