Nl2RepoBench
December 14, 2025
Project Overview
NL2Repo is a benchmark designed to evaluate the performance of Large Language Models (LLMs) and coding agents on long-horizon tasks that require generating a complete, runnable code repository from scratch (0-to-1). The benchmark consists of 104 distinct tasks, each paired with its own testing environment.
Running the Code
The current setup runs OpenHands in headless batch mode. Model behavior is controlled by the config.toml file; to change the model configuration, edit config.toml before starting a run.
The system currently uses a file-to-file execution workflow and manages Docker containers via python-on-whales. At the moment, only local execution is supported.
Note: When running in headless mode across multiple machines, you must set up shared file management (e.g., NFS) or manually transfer files to the target machines in advance.
Prerequisites
Before starting, ensure that Docker is installed locally and that the following images are available:
- docker.all-hands.dev/all-hands-ai/openhands:0.56
- docker.all-hands.dev/all-hands-ai/runtime:0.56-nikolaik
The runtime image can be customized. The default image is sufficient for running Python-based tasks and comes with Python 3.12 preinstalled. If you need to support other languages, you can build your own runtime image and update the corresponding configuration in openhands/openhands_app.py (line 176).
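As a quick pre-flight check, the two image names above can be compared against what Docker reports locally. The following is a minimal sketch, not part of the project: the helper names `list_local_images` and `missing_images` are hypothetical, and it assumes the `docker` CLI is on the PATH.

```python
import subprocess

# Image names copied from the Prerequisites list above.
REQUIRED_IMAGES = [
    "docker.all-hands.dev/all-hands-ai/openhands:0.56",
    "docker.all-hands.dev/all-hands-ai/runtime:0.56-nikolaik",
]


def list_local_images() -> set[str]:
    """Return local image references as 'repository:tag' strings (needs the docker CLI)."""
    out = subprocess.run(
        ["docker", "image", "ls", "--format", "{{.Repository}}:{{.Tag}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def missing_images(required: list[str], available: set[str]) -> list[str]:
    """Return the required images that are not available locally, in the given order."""
    return [name for name in required if name not in available]
```

Calling `missing_images(REQUIRED_IMAGES, list_local_images())` returns an empty list when both images are present; any name it returns can be fetched with `docker pull`.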
Data Layout
- The test_files directory contains all repository-related task data, including:
  - A .txt file specifying the number of test cases
  - The repository documentation in .md format
  - Two .json files used for testing
- All Docker volume mounts used for headless execution are stored in the workspaces directory. Each task is assigned a unique UUID directory. The task-specific configuration file is copied from a template and modified accordingly (mainly to mount the workspace directory into the runtime container).
- Final results are saved in the result directory. Each task produces a single aggregated .json file, named using the task's randomly generated UUID.
- The project is launched using a config.json file. A sample configuration is shown below:
```json
{
  "startPro": [
    {
      "moduleName": "",
      "baseUrl": "",
      "sk": "",
      "proNameList": [
        "math-verify"
      ]
    }
  ],
  "max_pool_size": 20
}
```
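The workspaces and result conventions described in this section can be illustrated with a short sketch. It is not the project's actual code: `prepare_workspace` and `result_path` are hypothetical helpers, and placing the copied template inside the workspace directory is an assumption (the README does not say where the copy lives). The step that edits the copied configuration to mount the workspace is left out.

```python
import shutil
import uuid
from pathlib import Path


def prepare_workspace(root: Path, template_config: Path) -> tuple[str, Path]:
    """Create workspaces/<uuid>/ and copy the template config into it.

    Assumption: the copy is stored inside the workspace directory itself.
    """
    task_id = str(uuid.uuid4())  # random UUID, one per task run
    workspace = root / "workspaces" / task_id
    workspace.mkdir(parents=True)
    shutil.copy(template_config, workspace / template_config.name)
    return task_id, workspace


def result_path(root: Path, task_id: str) -> Path:
    """Path of the aggregated result file for one task: result/<uuid>.json."""
    return root / "result" / f"{task_id}.json"
```

Keying both the workspace directory and the result file on the same UUID makes it easy to trace a result back to the files the agent produced.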
Configuration Fields
- startPro: A list of task nodes.
  - Each node corresponds to a single model configuration.
  - proNameList: A list of task names, which must match the subdirectory names under test_files.
- max_pool_size: The maximum number of concurrent threads. Once this limit is reached, additional tasks are queued until resources become available.
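The field rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's launcher: `load_config`, `unknown_tasks`, `run_all`, and the `run_task` callback are hypothetical names, and a standard-library `ThreadPoolExecutor` stands in for whatever pool the project actually uses.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_config(path: Path) -> dict:
    """Read config.json and check the two top-level fields documented above."""
    config = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(config.get("startPro"), list):
        raise ValueError("startPro must be a list of task nodes")
    pool_size = config.get("max_pool_size")
    if not isinstance(pool_size, int) or pool_size < 1:
        raise ValueError("max_pool_size must be a positive integer")
    return config


def unknown_tasks(config: dict, test_files: Path) -> list[str]:
    """Return proNameList entries with no matching subdirectory under test_files."""
    known = {p.name for p in test_files.iterdir() if p.is_dir()}
    return [
        name
        for node in config["startPro"]
        for name in node.get("proNameList", [])
        if name not in known
    ]


def run_all(config: dict, run_task) -> list:
    """Call run_task(node, name) for every task, at most max_pool_size at a time.

    Tasks beyond the limit wait in the executor's queue until a thread is free.
    Results are returned in submission order.
    """
    jobs = [
        (node, name)
        for node in config["startPro"]
        for name in node.get("proNameList", [])
    ]
    with ThreadPoolExecutor(max_workers=config["max_pool_size"]) as pool:
        futures = [pool.submit(run_task, node, name) for node, name in jobs]
        return [future.result() for future in futures]
```

Checking `unknown_tasks` before launching catches a misspelled task name early, instead of after a container has already been started for it.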