April 13, 2026
[ACL 2026] CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
News
- [2026/04] Our paper has been officially accepted to the ACL 2026 Main Conference!
Introduction
CodeFlowBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on multi-turn, dependency-aware, and iterative code generation tasks. Unlike traditional benchmarks that focus on single-function generation, CodeFlowBench tests a model's ability to maintain context, handle complex dependencies, and evolve code over multiple turns.
The benchmark consists of two subsets:
- CodeFlowBench-Comp (Competitive): Focuses on complex competitive programming problems.
- CodeFlowBench-Repo: Focuses on domain-specific, real-world programming problems drawn from GitHub repositories.
Directory Structure
codeflowbench/
├── data/                  # Dataset files (JSON)
├── models/                # Local model checkpoints (optional)
├── scripts/               # Bash scripts for running evaluation
├── src/                   # Source code for inference and harness
│   ├── harness.py         # Evaluation logic
│   └── utils.py           # Utility functions
├── requirements.txt       # Dependencies for all benchmarks
├── requirements_repo.txt  # Additional dependencies for the Repo benchmark
└── README.md
Installation
First, clone the repository, then enter it and set up the Conda environment:
cd codeflowbench
conda create -n codeflowbench python=3.10
conda activate codeflowbench
Install the dependencies:
# For CodeFlowBench All (Standard Evaluation)
pip install -r requirements.txt
# [Optional] For CodeFlowBench-Repo
# This installs additional libraries required for executing domain-specific code
pip install -r requirements_repo.txt
Preparation
1. Model Preparation
You can either use Hugging Face model paths directly or place your local model weights inside the models folder.
- Example path: models/Llama-3.1-8B-Instruct
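The choice between a local checkpoint and a Hugging Face model path can be sketched as a small lookup. This is only an illustration: `resolve_model_path` is a hypothetical helper, not a function from this repository, and it assumes local weights live in a folder named after the last component of the model identifier.

```python
from pathlib import Path


def resolve_model_path(name: str, models_dir: str = "models") -> str:
    """Return a local checkpoint path if one exists, else the name unchanged.

    Hypothetical helper: assumes local weights are stored under
    <models_dir>/<last component of the model name>.
    """
    local = Path(models_dir) / name.split("/")[-1]
    if local.is_dir():
        # A matching folder exists, so prefer the local weights.
        return str(local)
    # Otherwise hand the identifier back untouched (e.g. a Hugging Face path).
    return name
```

With this sketch, `resolve_model_path("meta-llama/Llama-3.1-8B-Instruct")` would return the local folder when models/Llama-3.1-8B-Instruct exists and the original identifier otherwise.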
2. Data Preparation
Ensure the dataset files are located in the ./data directory. The directory should contain:
- codeflowbench_comp_test.json
- codeflowbench_repo.json
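Before launching a run, it can help to confirm that both files are present and parse cleanly. The sketch below is a hypothetical check, not part of the repository; it assumes each file is a single JSON document (a list or an object) rather than JSON Lines.

```python
import json
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("codeflowbench_comp_test.json", "codeflowbench_repo.json")


def check_data_dir(data_dir: str = "data") -> dict:
    """Map each expected file name to its top-level entry count, or None if missing."""
    report = {}
    for name in EXPECTED_FILES:
        path = Path(data_dir) / name
        if not path.is_file():
            report[name] = None
            continue
        # Assumption: the file is one JSON document, not JSON Lines.
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        report[name] = len(data)
    return report
```

Calling `check_data_dir()` from the repository root returns a small report; a `None` value means the corresponding file still needs to be placed in ./data.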
Quick Start
We provide convenient Bash scripts to automate the inference and evaluation process. The default scripts use Llama-3.1-8B-Instruct as an example.
CodeFlowBench-Comp
Multi-turn Evaluation (Core): Evaluate the model's ability to generate code iteratively with dependencies.
bash scripts/test_multi_turn.sh
Single-turn Evaluation: Evaluate the model in a standard single-turn setting for comparison.
bash scripts/test_single_turn.sh
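To make the multi-turn setting concrete, here is a toy sketch of iterative generation with dependencies. It is not the logic in src/harness.py: `SUBPROBLEMS`, `run_multi_turn`, and `stub_generate` are hypothetical names, and the stub stands in for a real model call. Each turn requests one function, and the code produced in earlier turns is fed back as context so later functions can depend on it.

```python
from typing import Callable, Dict, List

# Hypothetical problem: sub-functions listed in dependency order.
SUBPROBLEMS: List[Dict[str, str]] = [
    {"name": "gcd", "spec": "Return the greatest common divisor of a and b."},
    {"name": "lcm", "spec": "Return the least common multiple of a and b, using gcd."},
]

# Canned answers so the example runs without a model.
CANNED = {
    "gcd": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n",
    "lcm": "def lcm(a, b):\n    return a * b // gcd(a, b)\n",
}


def stub_generate(prompt: str) -> str:
    """Stand-in for an LLM call: answers the last request in the prompt."""
    last_request = prompt.rsplit("Implement ", 1)[-1]
    name = last_request.split(":", 1)[0]
    return CANNED[name]


def run_multi_turn(subproblems: List[Dict[str, str]],
                   generate: Callable[[str], str]) -> List[str]:
    """Request one function per turn, passing earlier outputs back as context."""
    outputs: List[str] = []
    for sub in subproblems:
        request = f"Implement {sub['name']}: {sub['spec']}"
        # Earlier turns' code comes first, then the new request.
        prompt = "\n\n".join(outputs + [request])
        outputs.append(generate(prompt))
    return outputs


if __name__ == "__main__":
    for turn, code in enumerate(run_multi_turn(SUBPROBLEMS, stub_generate), start=1):
        print(f"--- turn {turn} ---")
        print(code)
```

A single-turn run would instead send all requests in one prompt and expect the complete solution in one response.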
CodeFlowBench-Repo
The process is similar to the CodeFlowBench-Comp evaluation, with the following adjustments:
- Choose harness_repo.py and codeflowbench_repo.json in the bash script.
- Change the import in inference.py to utils_(api)_repo.
- Run the bash file.
Output & Results
The evaluation logs and final scores will be saved in the result directory.
Filename Format: {model_name}_{mode}.json
Example: result/Llama-3.1-8B-Instruct_multi_turn.json
Result Content: Each entry contains the generated code, execution logs, and the pass/fail status for each turn.
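A per-turn pass rate is one way to summarize such a file. The exact JSON schema is not described here, so the sketch below assumes a hypothetical layout: a list of entries, each with a `turns` list whose items carry a boolean `passed` field. Adjust the field names to match the real output.

```python
import json
from collections import defaultdict


def pass_rate_by_turn(entries):
    """Return {turn_index: fraction passed}, with turn indices starting at 1.

    Assumes the hypothetical layout entry["turns"][i]["passed"].
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for entry in entries:
        for i, turn in enumerate(entry["turns"], start=1):
            total[i] += 1
            passed[i] += bool(turn["passed"])
    return {i: passed[i] / total[i] for i in sorted(total)}


def summarize(path):
    """Load a result file and compute its per-turn pass rates."""
    with open(path, encoding="utf-8") as fh:
        return pass_rate_by_turn(json.load(fh))
```

For example, `summarize("result/Llama-3.1-8B-Instruct_multi_turn.json")` would return a dictionary keyed by turn number, provided the file follows the assumed layout.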