Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling

November 28, 2025

ProjectGen is a multi-agent framework that decomposes project-level code generation into three stages: architecture design, skeleton generation, and code filling. Each stage uses iterative refinement and memory-based context management.

ProjectGen

Code Structure Overview

The core implementation resides in the src/ directory, which contains three major components: the multi-agent system, the memory management module, and a set of workflow and utility scripts. The directory structure is shown below:

src/
├── agents/
│   ├── architecture_agent.py
│   ├── arch_judge_agent.py
│   ├── skeleton_agent.py
│   ├── skeleton_judge_agent.py
│   ├── code_agent.py
│   ├── code_judge_agent.py
│   └── test.py
│
├── memory_manager/
│   ├── arch_memory.py
│   ├── skeleton_memory.py
│   ├── code_memory.py
│   └── __init__.py
│
├── build_dependency_graph.py
├── extract_api.py
├── logger.py
├── main.py
├── prompts.py
├── utils.py
└── workflow.py

  • agents/: This directory contains the core agents used in the three-stage generation workflow. Each stage consists of a generation agent and a judging agent, forming a generate–evaluate–refine loop.

  • memory_manager/: The memory modules preserve intra-stage semantic information, enabling agents to efficiently access relevant context from previous iterations.

  • build_dependency_graph.py: Constructs file-level dependency graphs that support dependency-aware ordering of files.

  • extract_api.py: Extracts function signatures from generated code to support iterative refinement.

  • logger.py: A unified logging module for debugging, tracing agent outputs, and monitoring workflow execution.

  • main.py: The main entry point of the system.

  • prompts.py: Contains prompt templates used by all agents.

  • utils.py: Provides general-purpose utility functions.

  • workflow.py: Defines the overall multi-agent generation workflow.
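
The generate-evaluate-refine loop described for agents/ can be illustrated with a minimal sketch. This is not the code in workflow.py: the names `Verdict`, `refine_loop`, `toy_generate`, and `toy_judge` are hypothetical, and the stub functions stand in for LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Outcome of one judging step (hypothetical structure)."""
    accepted: bool
    feedback: str = ""


def refine_loop(
    generate: Callable[[str, List[str]], str],
    judge: Callable[[str], Verdict],
    task: str,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Generate, judge, and retry with feedback until accepted or out of rounds.

    Returns the last candidate and the number of rounds used.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    history: List[str] = []  # judge feedback from earlier rounds, oldest first
    candidate = ""
    for round_no in range(1, max_rounds + 1):
        candidate = generate(task, history)
        verdict = judge(candidate)
        if verdict.accepted:
            return candidate, round_no
        history.append(verdict.feedback)
    return candidate, max_rounds


# Stub agents standing in for LLM calls (assumptions for illustration only).
def toy_generate(task: str, history: List[str]) -> str:
    return f"{task} v{len(history) + 1}"


def toy_judge(candidate: str) -> Verdict:
    if candidate.endswith("v3"):
        return Verdict(True)
    return Verdict(False, f"reject {candidate}")


result, rounds = refine_loop(toy_generate, toy_judge, "skeleton", max_rounds=5)
print(result, rounds)
```

In this sketch the feedback history plays the role that the memory modules play in the description above: each new attempt can see what earlier rounds got wrong.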

Installation

conda create -n projectgen python=3.9
conda activate projectgen
pip install -r requirements.txt

Usage

Open src/utils.py and fill in your OpenAI API key:

os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"
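
If you would rather not store the key in a source file, one possible alternative is to export `OPENAI_API_KEY` in your shell and check for it at startup. The sketch below is an assumption, not code from the repository: `require_api_key` is a hypothetical helper, and using it would also require removing or guarding the assignment in src/utils.py so that it does not overwrite the exported value.

```python
import os
from typing import Mapping

PLACEHOLDER = "your_openai_api_key_here"  # the placeholder string shown above


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "OPENAI_API_KEY") -> str:
    """Return the key from `env`; raise if it is missing or still the placeholder."""
    key = env.get(name, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; export it before running main.py")
    return key


# An explicit mapping keeps the example independent of the real environment.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```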

Navigate to the src directory and run the main script:

cd src
python main.py --dataset=CodeProjectEval

The outputs will be stored in the CodeProjectEval_outputs/ folder.

CodeProjectEval

To better reflect real-world project scenarios and support evaluation through executable test cases, we construct a new project-level code generation dataset, CodeProjectEval, which consists of 18 Python repositories covering a wide range of topics.

Detailed Information

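| Repository                    | #FILE | #LOC    | Complexity | #Check Tests (cov.) | #Unit Tests (cov.) | #PRD tokens |
|-------------------------------|-------|---------|------------|---------------------|--------------------|-------------|
| bplustree                     | 8     | 1,509   | 2.29       | 8 (82%)             | 356 (98%)          | 1,339       |
| cookiecutter                  | 18    | 2,805   | 3.42       | 7 (55%)             | 375 (99%)          | 2,100       |
| csvs-to-sqlite                | 3     | 816     | 5.83       | 10 (81%)            | 25 (88%)           | 1,841       |
| deprecated                    | 3     | 597     | 4.08       | 26 (80%)            | 176 (95%)          | 953         |
| djangorestframework-simplejwt | 31    | 2,014   | 2.09       | 8 (63%)             | 191 (93%)          | 1,614       |
| flask                         | 24    | 9,314   | 2.71       | 25 (52%)            | 482 (91%)          | 2,913       |
| imapclient                    | 17    | 3,531   | 2.81       | 9 (40%)             | 267 (80%)          | 3,810       |
| parsel                        | 5     | 1,128   | 2.60       | 5 (65%)             | 250 (95%)          | 1,522       |
| portalocker                   | 9     | 1,958   | 2.84       | 10 (58%)            | 71 (94%)           | 1,990       |
| pyjwt                         | 12    | 2,690   | 3.01       | 10 (53%)            | 294 (94%)          | 382         |
| python-hl7                    | 11    | 2,434   | 2.98       | 10 (56%)            | 100 (87%)          | 2,292       |
| rsa                           | 14    | 2,949   | 2.40       | 6 (73%)             | 100 (87%)          | 3,318       |
| simpy                         | 12    | 2,184   | 2.01       | 7 (60%)             | 149 (90%)          | 2,147       |
| tinydb                        | 10    | 2,170   | 1.76       | 10 (58%)            | 204 (95%)          | 947         |
| trailscraper                  | 13    | 890     | 2.01       | 4 (65%)             | 93 (92%)           | 3,415       |
| voluptuous                    | 7     | 3,100   | 2.55       | 11 (55%)            | 161 (90%)          | 1,221       |
| xmnlp                         | 24    | 1,504   | 3.47       | 8 (65%)             | 23 (81%)           | 3,105       |
| zxcvbn                        | 8     | 1,402   | 5.69       | 6 (81%)             | 31 (84%)           | 2,399       |
| Avg.                          | 12.7  | 2,388.6 | 3.03       | 10 (63.4%)          | 186 (90.7%)        | 2,067       |
| Mid.                          | 11.5  | 2,092   | 2.76       | 8.5 (61.5%)         | 168.5 (91.5%)      | 2,100       |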
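
As a quick sanity check, the Avg. and Mid. rows can be recomputed from the per-repository values. The sketch below copies the #FILE and #LOC columns from the table above and uses Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# (#FILE, #LOC) per repository, copied from the table above.
rows = {
    "bplustree": (8, 1509),
    "cookiecutter": (18, 2805),
    "csvs-to-sqlite": (3, 816),
    "deprecated": (3, 597),
    "djangorestframework-simplejwt": (31, 2014),
    "flask": (24, 9314),
    "imapclient": (17, 3531),
    "parsel": (5, 1128),
    "portalocker": (9, 1958),
    "pyjwt": (12, 2690),
    "python-hl7": (11, 2434),
    "rsa": (14, 2949),
    "simpy": (12, 2184),
    "tinydb": (10, 2170),
    "trailscraper": (13, 890),
    "voluptuous": (7, 3100),
    "xmnlp": (24, 1504),
    "zxcvbn": (8, 1402),
}

files = [n_files for n_files, _ in rows.values()]
loc = [n_loc for _, n_loc in rows.values()]

avg_files, mid_files = round(mean(files), 1), median(files)
avg_loc, mid_loc = round(mean(loc), 1), median(loc)

print("files:", avg_files, mid_files)
print("loc:  ", avg_loc, mid_loc)
```

The recomputed values agree with the Avg. and Mid. rows for both columns (12.7 and 11.5 files; 2,388.6 and 2,092 lines of code).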

Each repository is supplemented with:

  • docs/PRD.md: a product requirements document (PRD) that describes the software system's functional and non-functional requirements in detail, guiding subsequent design and development.
  • docs/UML_pyreverse.md: UML class and package diagrams generated by Pyreverse.
  • docs/architecture_design.md: the directory tree of the repository and a description of each source file, accompanied by summaries of the classes and functions it contains.
  • check_tests/: tests that provide initial verification during the code generation process.
  • unit_tests/: tests executed upon completion of code generation to evaluate the overall quality and functional correctness of the generated project.

Test Scripts

  • Similarity-based Evaluation (SketchBLEU)

    We adopt SketchBLEU, a similarity-based metric originally introduced in the CodeS framework. This metric evaluates the structural similarity between the generated code and the reference implementation. To calculate SketchBLEU:

    cd datasets/evaluation
    python calc_sketchbleu.py
    
  • Unit Test–based Evaluation

    For functional correctness evaluation, each repository contains a set of unit tests. To run the unit tests:

    cd datasets/evaluation
    python calc_passrate.py
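
For intuition, a pass rate is commonly defined as the fraction of unit tests that pass. The sketch below shows that arithmetic only. It is an assumption about the metric's general shape, not the implementation of calc_passrate.py, and the function name and test names are hypothetical.

```python
from typing import Dict


def pass_rate(outcomes: Dict[str, bool]) -> float:
    """Fraction of tests that passed; 0.0 when no tests were collected."""
    if not outcomes:
        return 0.0
    return sum(outcomes.values()) / len(outcomes)


# Hypothetical outcomes for one generated project: test name -> passed?
example = {
    "test_insert": True,
    "test_delete": False,
    "test_iter": True,
    "test_len": True,
}
print(f"{pass_rate(example):.2%}")
```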