May 21, 2026
FeatureBench is a test-driven data generation and evaluation pipeline for feature-level coding benchmarks. It provides a unified CLI to run inference, evaluation, and dataset generation.
News
2026.05.18: We added lite split evaluation results to the leaderboard for frontier models including GPT-5.5, Claude Opus 4.7, DeepSeek-V4, GLM-5.1, Kimi-2.6, Mimo-V2.5-Pro, and more.
2026.03.27: We released the fast split, which contains 100 instances (a subset of the full split). These instances require no GPU and are optimized for rapid evaluation. On an Intel Xeon Platinum 8457C with 944 GB RAM, evaluating the gold patches takes 57.2 seconds per instance on average.
2026.02.06: We now support one-click inference for mainstream agent frameworks, including OpenHands, Claude Code, Codex, Gemini CLI, and mini-swe-agent. All supported agent frameworks can be found here. We have also open-sourced the FeatureBench data pipeline.
Leaderboard
Full interactive leaderboard with tabs, filters, and sorting.
Lite split results, ranked by %PASSED
| Rank | Model | Scaffold | %PASSED | %RESOLVED |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | OpenHands | 78.2 | 46.7 |
| 2 | GPT-5.5 | OpenHands | 69.8 | 26.7 |
| 3 | Claude Opus 4.6 | OpenHands | 69.5 | 20 |
| 4 | Claude Opus 4.5 | OpenHands | 67.2 | 20 |
| 5 | GPT-5.4 | OpenHands | 66.2 | 23.3 |
| 6 | GPT-5.1-Codex | Codex | 60.2 | 20 |
| 7 | DeepSeek-V4-Pro | OpenHands | 59.6 | 26.7 |
| 8 | Claude Opus 4.5 | Claude Code | 59.1 | 20 |
| 9 | Kimi-2.6 | OpenHands | 49.4 | 20 |
| 10 | Mimo-V2.5-Pro | OpenHands | 47.8 | 13.3 |
| 11 | Gemini-3-Pro-Preview | OpenHands | 45.1 | 10 |
| 12 | GLM-5.1 | OpenHands | 44.2 | 13.3 |
| 13 | Gemini-3-Pro-Preview | Gemini-CLI | 43.4 | 10 |
| 14 | DeepSeek-V4-Flash | OpenHands | 41.9 | 16.7 |
| 15 | MiniMax M2.1 | Mini-SWE-Agent | 41.9 | 10 |
| 16 | GLM 4.7 | Mini-SWE-Agent | 41.2 | 6.7 |
| 17 | Qwen3-Coder-480B-A35B-Instruct | OpenHands | 38.3 | 6.7 |
| 18 | DeepSeek V3.2 | OpenHands | 35.9 | 6.7 |
| 19 | Qwen3-Coder-30B-A3B-Instruct | OpenHands | 23 | 3.3 |
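The table is ordered by %PASSED, and the %RESOLVED column does not always follow the same order. As a minimal sketch, the snippet below parses a few rows copied from the table and re-ranks them by another column. The helper names `parse_table` and `rank_by` are hypothetical and are not part of FeatureBench.

```python
# A few rows copied verbatim from the leaderboard table above (not the full table).
TABLE = """\
| Rank | Model | Scaffold | %PASSED | %RESOLVED |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | OpenHands | 78.2 | 46.7 |
| 2 | GPT-5.5 | OpenHands | 69.8 | 26.7 |
| 3 | Claude Opus 4.6 | OpenHands | 69.5 | 20 |
| 5 | GPT-5.4 | OpenHands | 66.2 | 23.3 |
| 7 | DeepSeek-V4-Pro | OpenHands | 59.6 | 26.7 |
"""


def parse_table(text):
    """Parse a simple markdown table into a list of dicts keyed by header."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def rank_by(rows, column):
    """Sort rows by a numeric column, highest first; ties keep table order."""
    return sorted(rows, key=lambda r: float(r[column]), reverse=True)


for row in rank_by(parse_table(TABLE), "%RESOLVED"):
    print(row["Model"], row["%RESOLVED"])
```

With these five rows, DeepSeek-V4-Pro moves ahead of Claude Opus 4.6 and GPT-5.4 when ranked by %RESOLVED, although it trails both on %PASSED.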
Quickstart
Prerequisites:
# Option 1: install from PyPI
pip install featurebench
# or: uv add featurebench
# Option 2: install from a local clone
git clone https://github.com/LiberCoders/FeatureBench.git
cd FeatureBench
uv sync
source .venv/bin/activate
Configure:
cp config_example.toml config.toml
See docs/config.md for a comprehensive reference (harness, infer, data pipeline) with examples.
Optional: pre-pull images to reduce network variance:
fb pull --mode lite # lite split image list (13 images)
fb pull --mode fast # fast split image list (18 images)
fb pull --mode full # full split image list (24 images)
fb pull --mode /path/to/images.txt # one image name per line
# full list: featurebench/resources/constants/full_images.txt
# lite list: featurebench/resources/constants/lite_images.txt
# fast list: featurebench/resources/constants/fast_images.txt
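Because `fb pull --mode` also accepts a path to a text file with one image name per line, a custom list can be assembled from the bundled ones. The sketch below is one possible way to do that. The helpers `read_image_list` and `write_image_list` are hypothetical, and skipping blank lines is an assumption about the file format rather than documented behavior.

```python
from pathlib import Path


def read_image_list(path):
    """Read image names, one per line. Blank lines are skipped (an assumption)."""
    names = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if name:
            names.append(name)
    return names


def write_image_list(names, path):
    """Write image names one per line, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(names))  # dict preserves first-seen order
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique
```

For example, `write_image_list(read_image_list("lite_images.txt")[:3], "my_images.txt")` would produce a three-image list that could then be passed as `fb pull --mode my_images.txt`.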
Run inference:
fb infer \
--config-path config.toml \
--agent mini_swe_agent \
--model openai/qwen3-coder-480b-a35b-instruct \
--split fast
Run evaluation:
fb eval \
-p runs/<timestamp>/output.jsonl \
--split fast
# use -p gold to verify the gold patches
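When scripting the two steps together, the `<timestamp>` directory has to be resolved before calling `fb eval`. The following is a minimal sketch that picks the most recent run and builds the command shown above. It assumes that run directory names sort chronologically as plain strings, which this README does not state, and both helper names are hypothetical.

```python
from pathlib import Path


def latest_output(runs_dir="runs"):
    """Return output.jsonl from the run directory whose name sorts last.

    Assumes timestamp directory names sort chronologically as strings.
    """
    candidates = sorted(
        Path(runs_dir).glob("*/output.jsonl"),
        key=lambda p: p.parent.name,
    )
    if not candidates:
        raise FileNotFoundError(f"no output.jsonl found under {runs_dir}")
    return candidates[-1]


def eval_command(predictions, split="fast"):
    """Build the fb eval argument list using only the flags shown above."""
    return ["fb", "eval", "-p", str(predictions), "--split", split]
```

The returned list can be passed to `subprocess.run(..., check=True)`, and `eval_command("gold")` yields the gold-patch verification command.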
CLI Overview
fb provides three core commands:
- fb infer runs featurebench.infer.run_infer (docs: docs/infer_cli_arg.md)
- fb eval runs featurebench.harness.run_evaluation (docs: docs/harness_cli_arg.md)
- fb data runs featurebench.pipeline (docs: docs/pipeline.md)
Citation
If you find FeatureBench useful, please cite it as:
@article{zhou2026featurebench,
title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
author={Zhou, Qixing and Zhang, Jiacheng and Wang, Haiyang and Hao, Rui and Wang, Jiahe and Han, Minghao and Yang, Yuxue and Wu, Shuzhe and Pan, Feiyang and Fan, Lue and others},
journal={arXiv preprint arXiv:2602.10975},
year={2026}
}
Contact
If you have any questions, feel free to contact qixingzhou1125@gmail.com or zjcheng2022@gmail.com.