
April 13, 2026

[ACL 2026] CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

Hugging Face Dataset | arXiv

📰 News

  • [2026/04] 🎉 Our paper has been officially accepted to the ACL 2026 Main Conference!

📖 Introduction

CodeFlowBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on multi-turn, dependency-aware, and iterative code generation tasks. Unlike traditional benchmarks that focus on single-function generation, CodeFlowBench tests a model's ability to maintain context, handle complex dependencies, and evolve code over multiple turns.

The benchmark consists of two subsets:

  • CodeFlowBench-Comp (Competitive): Focuses on complex competitive programming problems.
  • CodeFlowBench-Repo: Focuses on domain-specific, real-world programming problems drawn from GitHub repositories.

📂 Directory Structure

codeflowbench/
├── data/                   # Dataset files (JSON)
├── models/                 # Local model checkpoints (optional)
├── scripts/                # Bash scripts for running evaluation
├── src/                    # Source code for inference and harness
│   ├── harness.py          # Evaluation logic
│   └── utils.py            # Utility functions
├── requirements.txt        # Dependencies for all benchmarks
├── requirements_repo.txt   # Additional dependencies for the Repo benchmark
└── README.md

🔧 Installation

First, clone the repository and set up the Conda environment:

git clone <repository-url>   # substitute the URL of this repository
cd codeflowbench

conda create -n codeflowbench python=3.10
conda activate codeflowbench

Install the dependencies:

# For all CodeFlowBench benchmarks (standard evaluation)
pip install -r requirements.txt

# [Optional] For CodeFlowBench-Repo
# Installs the additional libraries required to execute domain-specific code
pip install -r requirements_repo.txt

📋 Preparation

1. Model Preparation

You can either pass a Hugging Face model path directly or place your local model weights inside the models/ directory.

  • Example Path: models/Llama-3.1-8B-Instruct
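The two options above can be sketched as a small lookup. This is an illustration only: `resolve_model_path` is a hypothetical helper, not something the repository is known to provide, and it assumes that a local checkpoint is a directory under models/.

```python
from pathlib import Path


def resolve_model_path(name: str, models_dir: str = "models") -> str:
    """Return a local checkpoint path if one exists, else the name unchanged.

    Hypothetical helper for illustration; not part of the repository.
    """
    local = Path(models_dir) / name
    if local.is_dir():
        # Local weights found, e.g. models/Llama-3.1-8B-Instruct
        return str(local)
    # Otherwise assume `name` is a Hugging Face model path and pass it through.
    return name
```

Preferring the local directory means a downloaded checkpoint is used whenever one is present, while any other name is handed on untouched.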

2. Data Preparation

Ensure the dataset files are located in the ./data directory. The directory should typically contain:

  • codeflowbench_comp_test.json
  • codeflowbench_repo.json
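Before running anything, it may help to confirm that both files are in place and parse as JSON. The sketch below is not part of the repository. It assumes each file holds a single JSON list or object (not JSON Lines) and makes no assumption about the fields inside each entry.

```python
import json
from pathlib import Path

EXPECTED_FILES = ["codeflowbench_comp_test.json", "codeflowbench_repo.json"]


def check_data_dir(data_dir: str = "data") -> dict:
    """Map each expected file name to its top-level entry count, or None if missing."""
    report = {}
    for name in EXPECTED_FILES:
        path = Path(data_dir) / name
        if not path.is_file():
            report[name] = None
            continue
        with path.open(encoding="utf-8") as f:
            data = json.load(f)  # assumes one JSON list or object per file
        report[name] = len(data)
    return report


if __name__ == "__main__":
    for name, count in check_data_dir().items():
        status = "missing" if count is None else f"{count} top-level entries"
        print(f"{name}: {status}")
```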

๐Ÿƒ Quick Start

We provide convenient Bash scripts to automate the inference and evaluation process. The default scripts use Llama-3.1-8B-Instruct as an example.

🔹 CodeFlowBench-Comp

Multi-turn Evaluation (Core): Evaluate the model's ability to generate code iteratively with dependencies.

bash scripts/test_multi_turn.sh
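To make "iteratively with dependencies" concrete, here is a toy sketch of a multi-turn loop in which each turn sees the code from earlier turns and is tested against everything generated so far. It is an assumption-laden illustration, not the logic in src/harness.py: the turn keys ("prompt", "test"), the `run_multi_turn` function, and the canned `fake_model` are all hypothetical, and `exec` is used without any sandboxing.

```python
from typing import Callable, Dict, List


def run_multi_turn(
    turns: List[Dict[str, str]],
    generate: Callable[[str], str],
) -> List[bool]:
    """Generate code turn by turn, feeding earlier code back as context.

    Each turn dict uses hypothetical keys: "prompt" (the task for this turn)
    and "test" (Python assertions run against all code produced so far).
    """
    context = ""  # code accumulated from previous turns
    results = []
    for turn in turns:
        code = generate(context + "\n# Task: " + turn["prompt"])
        candidate = context + "\n" + code
        try:
            # Illustration only: real harnesses should sandbox execution.
            exec(candidate + "\n" + turn["test"], {})
            passed = True
        except Exception:
            passed = False
        results.append(passed)
        context = candidate  # later turns depend on this code
    return results


# Stand-in for an LLM call: returns canned code for the requested task.
CANNED = {
    "square": "def square(x):\n    return x * x",
    "sum_of_squares": "def sum_of_squares(xs):\n    return sum(square(x) for x in xs)",
}


def fake_model(prompt: str) -> str:
    task = prompt.rsplit("# Task: ", 1)[-1].strip()
    return CANNED[task]


turns = [
    {"prompt": "square", "test": "assert square(3) == 9"},
    {"prompt": "sum_of_squares", "test": "assert sum_of_squares([1, 2]) == 5"},
]
print(run_multi_turn(turns, fake_model))  # prints [True, True]
```

The second turn only passes because the first turn's `square` is carried forward in the context, which is the kind of dependency a single-turn setting does not exercise.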

Single-turn Evaluation: Evaluate the model in a standard single-turn setting for comparison.

bash scripts/test_single_turn.sh

🔸 CodeFlowBench-Repo

The process mirrors the CodeFlowBench-Comp evaluation, with the following adjustments:

  1. In the Bash script, select harness_repo.py as the harness and codeflowbench_repo.json as the dataset.
  2. In inference.py, change the utility import to utils_(api)_repo.
  3. Run the Bash script.

📊 Output & Results

The evaluation logs and final scores are saved in the result/ directory.

  • Filename Format: {model_name}_{mode}.json
  • Example: result/Llama-3.1-8B-Instruct_multi_turn.json
  • Result Content: Each entry contains the generated code, execution logs, and the pass/fail status for each turn.
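Because both the model name and the mode may contain underscores, splitting a result file name on the last underscore would give the wrong answer. A minimal sketch of a safer parse follows. The helpers are hypothetical, and the mode values are an assumption: "multi_turn" is taken from the example above and "single_turn" is inferred from the script names.

```python
from pathlib import Path

# Assumed mode values; adjust if the scripts write different suffixes.
KNOWN_MODES = ("multi_turn", "single_turn")


def parse_result_name(path: str) -> tuple:
    """Split '{model_name}_{mode}.json' into (model_name, mode)."""
    stem = Path(path).stem  # drops the directory and the .json extension
    for mode in KNOWN_MODES:
        suffix = "_" + mode
        if stem.endswith(suffix):
            return stem[: -len(suffix)], mode
    raise ValueError(f"unrecognized result file name: {path}")


def list_results(result_dir: str = "result") -> list:
    """Return (model_name, mode) pairs for every recognizable file in result_dir."""
    pairs = []
    for p in sorted(Path(result_dir).glob("*.json")):
        try:
            pairs.append(parse_result_name(p.name))
        except ValueError:
            continue  # skip files that do not follow the naming scheme
    return pairs


print(parse_result_name("result/Llama-3.1-8B-Instruct_multi_turn.json"))
# prints ('Llama-3.1-8B-Instruct', 'multi_turn')
```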