Benchmarking ClaudeCode with any LLM on SWE benchmarks

June 2, 2026

Setup

git clone https://github.com/njukenanli/ClaudeCode-for-eval --recursive
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r server/requirements.txt

Running with Azure OpenAI

pip install openai azure-identity-broker --upgrade

Modify start_from_azure_openai in server/server.py so that it accepts your azure_ad_token_provider.
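As a rough illustration of what an azure_ad_token_provider is: the openai client expects a zero-argument callable that returns a bearer token string. The sketch below builds one from any credential object exposing get_token(scope), such as those in azure.identity or azure.identity.broker. The helper names make_token_provider and build_azure_client are hypothetical and not part of this repository, and how start_from_azure_openai should consume the provider is an assumption you will need to adapt to the actual function.

```python
from typing import Callable

# Scope commonly used for Azure OpenAI with Microsoft Entra ID authentication.
AZURE_OPENAI_SCOPE = "https://cognitiveservices.azure.com/.default"


def make_token_provider(credential, scope: str = AZURE_OPENAI_SCOPE) -> Callable[[], str]:
    """Wrap a credential into the zero-argument callable the openai client expects.

    `credential` is any object with a get_token(*scopes) method whose result
    has a `.token` attribute (for example, a credential from azure.identity).
    azure.identity.get_bearer_token_provider does the same job and can be used
    instead of this hypothetical helper.
    """
    def provider() -> str:
        return credential.get_token(scope).token

    return provider


def build_azure_client(endpoint: str, api_version: str, token_provider: Callable[[], str]):
    """Hypothetical helper: create an AzureOpenAI client from a token provider."""
    # Imported lazily so the sketch can be read and tested without openai installed.
    from openai import AzureOpenAI

    return AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=api_version,
        azure_ad_token_provider=token_provider,
    )
```

Inside start_from_azure_openai you would then pass the provider through to wherever the client is constructed. The endpoint and API version are deployment-specific and are not given here.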

Rollout

Prepare a config file modeled on config/default.yaml, then run:

python main.py \
    --config config/default.yaml \
    --run-id debug \
    --dataset <huggingface_dataset_name or local/path.jsonl> \
    --split test  # only needed when --dataset is a Hugging Face dataset name
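The --dataset flag takes either a Hugging Face dataset name or a local .jsonl path. To show the difference between the two forms, here is a minimal sketch of how such a value could be resolved. The function load_instances is hypothetical and is not taken from main.py; the repository's actual loading logic may differ.

```python
import json
from pathlib import Path


def load_instances(dataset: str, split: str = "test") -> list[dict]:
    """Hypothetical resolver: local .jsonl file first, otherwise a Hugging Face dataset."""
    path = Path(dataset)
    if path.suffix == ".jsonl" and path.is_file():
        # Local file: one JSON object per non-empty line; `split` is ignored.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    # Otherwise treat the value as a Hugging Face dataset name.
    # Imported lazily so local files work without the `datasets` package.
    from datasets import load_dataset

    return list(load_dataset(dataset, split=split))
```

Under this assumption, --split only matters for the Hugging Face case, which matches the comment in the command above.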