EnvBench
November 9, 2025
A benchmark for automated development environment setup
Overview of the EnvBench pipeline
What's This?
EnvBench is a benchmark for automated environment setup, an important task in software engineering. We collected the largest dataset to date for this task and built a framework for developing and evaluating LLM-based agents that tackle environment setup.
Our benchmark includes:
- 994 repositories: 329 Python and 665 JVM-based (Java, Kotlin) projects
- Genuine configuration challenges: Carefully selected repositories that cannot be configured with simple deterministic scripts
- Evaluation metrics: Static analysis for missing imports in Python and compilation checks for JVM languages
- Baselines: Zero-shot baselines and agentic workflows tested with GPT-4o and GPT-4o-mini
Current state-of-the-art approaches achieve success rates of 6.69% for Python and 29.47% for JVM repositories, demonstrating that EnvBench presents significant challenges for existing methods and provides ample opportunity for future research.
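To make the Python metric more concrete, here is a minimal sketch of what "checking for missing imports" can look like. It is not EnvBench's actual checker (see the evaluation scripts for that); the function names are hypothetical, and the approach is an assumption chosen for brevity: parse the source with `ast`, then ask the current interpreter whether each top-level package can be located, without importing it.

```python
import ast
import importlib.util


def top_level_imports(source: str) -> set[str]:
    """Collect the top-level package name of every absolute import in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import (from . import x); skip those,
            # since they refer to the project itself rather than a dependency.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def missing_imports(source: str) -> list[str]:
    """Return imported top-level packages the current interpreter cannot locate."""
    missing = []
    for name in sorted(top_level_imports(source)):
        try:
            # find_spec returns None when no module with this name can be found.
            found = importlib.util.find_spec(name) is not None
        except ValueError:
            # Raised for already-loaded modules without a spec; treat as present.
            found = True
        if not found:
            missing.append(name)
    return missing
```

Running `missing_imports` over every file of a repository inside the configured environment and summing the results gives a rough count of unresolved dependencies. A real static analyzer handles many cases this sketch ignores, such as conditional imports and namespace packages.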
What You'll Need
Running the Benchmark
Setup
Create a virtual environment and install dependencies:
uv venv --python 3.12
source .venv/bin/activate
uv sync
# uv sync --extra dev # optional
Set the required environment variables:
export HF_TOKEN=<your-huggingface-token>
export OPENAI_API_KEY=<your-openai-api-key>
# export WANDB_API_KEY=<your-wandb-api-key> # optional, wandb is disabled by default
# export DATA_ROOT=<path-to-your-data-root> # optional, default is ./data
# export TEMP_DIR=<path-to-your-temp-dir> # optional, default is ./tmp
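Before launching a long run, it can help to verify these variables up front. The sketch below is a hypothetical helper, not part of EnvBench; it only encodes what the comments above state (two required tokens, and `./data` and `./tmp` as defaults for the optional paths).

```python
import os
from pathlib import Path

REQUIRED_VARS = ("HF_TOKEN", "OPENAI_API_KEY")
# Defaults taken from the comments in the export commands above.
OPTIONAL_PATH_DEFAULTS = {"DATA_ROOT": "./data", "TEMP_DIR": "./tmp"}


def check_environment(env=None):
    """Return (missing_required, resolved_paths) for the given environment mapping.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    # An unset or empty variable counts as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    paths = {
        name: Path(env.get(name) or default)
        for name, default in OPTIONAL_PATH_DEFAULTS.items()
    }
    return missing, paths
```

Calling `check_environment()` with no arguments inspects the real environment. If the first returned list is non-empty, the named variables still need to be exported.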
Pipeline Execution
Quick Start
Execute the full pipeline (inference and evaluation):
uv run envbench -cn python-bash traj_repo_id=<your-hf-username>/<your-repo-name>
Results are uploaded automatically to the Hugging Face repository specified by traj_repo_id. For the trajectories, see the EnvBench-trajectories dataset. Evaluation results are saved in the results.jsonl file.
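Once results.jsonl exists, a success rate can be computed with a few lines of Python. The sketch below assumes the file is standard JSON Lines (one JSON object per line). The field names `exit_code` and `issues_count` are hypothetical placeholders, so check them against your own file and swap in a different predicate if the schema differs.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def is_success(record):
    # ASSUMPTION: hypothetical field names; adjust to the actual results.jsonl schema.
    return record.get("exit_code") == 0 and record.get("issues_count") == 0


def success_rate(records, predicate=is_success):
    """Fraction of records for which `predicate` holds (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(1 for record in records if predicate(record)) / len(records)
```

For example, `success_rate(load_jsonl("results.jsonl"))` returns a value between 0 and 1. Passing a custom `predicate` keeps the counting logic independent of any particular schema.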
Advanced Configuration
For additional configuration options, including different agents and language models, see the Hydra configurations in the conf directory.
Example: Running Zero-Shot GPT-4o on JVM data with W&B logging:
uv run envbench -cn jvm-zeroshot \
llm@inference.agent=gpt-4o \
traj_repo_id=<your-hf-username>/<your-repo-name> \
use_wandb=true
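When launching many runs from a script, the same command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of EnvBench; it uses only the config name and `key=value` overrides shown in the example above.

```python
import shlex


def build_envbench_command(config_name, overrides):
    """Assemble the `uv run envbench` argument list for a config name and overrides."""
    command = ["uv", "run", "envbench", "-cn", config_name]
    for key, value in overrides.items():
        # Match the lowercase true/false spelling used in the example above.
        if isinstance(value, bool):
            value = str(value).lower()
        command.append(f"{key}={value}")
    return command


command = build_envbench_command(
    "jvm-zeroshot",
    {
        "llm@inference.agent": "gpt-4o",
        # Placeholder copied from the example above; replace with a real repository id.
        "traj_repo_id": "<your-hf-username>/<your-repo-name>",
        "use_wandb": True,
    },
)
# shlex.join quotes arguments where needed so the printed line is shell-safe.
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would start the run. Building an argument list instead of a single string avoids shell quoting problems.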
Evaluation
To run only the evaluation component:
uv run envbench -cn python-bash skip_inference=true skip_processing=true run_name=<your-run-name>
For more evaluation options, see evaluation/main.py.
Implementation Details
- Agents and Prompts - Core reasoning components
- Dockerfiles - Environment containerization
- Evaluation Scripts - Benchmark assessment
Resources
- Dataset - Benchmark dataset
- Paper Trajectories - Agent trajectories from our paper
Citation
If you find this work useful for your research, please use the following citation:
@inproceedings{
eliseeva2025envbench,
title={EnvBench: A Benchmark for Automated Environment Setup},
author={Aleksandra Eliseeva and Alexander Kovrigin and Ilia Kholkin and Egor Bogomolov and Yaroslav Zharov},
booktitle={ICLR 2025 Third Workshop on Deep Learning for Code},
year={2025},
url={https://openreview.net/forum?id=izy1oaAOeX}
}
License
MIT. See LICENSE for details.