mcp-adapted-bench
May 23, 2026
This is a collection of evaluation harnesses for MCP-adapted benchmarks, in which the agent must complete every task through a unified `list_tools` and `call_tool` interface. Agent World Model series models can be evaluated with this repo. The included benchmarks are:
- BFCLv3
- τ²-bench-verified
- MCP-Universe: 3d_design and repository_management categories are not included.
All credit goes to the authors of the original benchmark papers.
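To make the "unified interface" idea concrete, here is a toy sketch of what it means for an agent to see only two entry points. It is not the harness's implementation: the tools `add` and `upper` are hypothetical, and real MCP servers exchange structured messages rather than shell arguments. The point is only that discovery goes through `list_tools` and every action goes through `call_tool`.

```bash
# Hypothetical toy tools, for illustration only; not part of any benchmark.
tool_add() { echo $(( $1 + $2 )); }
tool_upper() { printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'; }

# list_tools: the single way the agent discovers what it can do.
list_tools() {
  printf '%s\n' add upper
}

# call_tool: the single way the agent acts; dispatches on the tool name.
call_tool() {
  tool_name="$1"
  shift
  case "$tool_name" in
    add)   tool_add "$@" ;;
    upper) tool_upper "$@" ;;
    *)     echo "unknown tool: $tool_name" >&2; return 1 ;;
  esac
}

# Example:
#   list_tools            # prints "add" and "upper", one per line
#   call_tool add 2 3     # prints 5
```

Because the agent never calls `tool_add` directly, the same agent loop can in principle be pointed at any benchmark that exposes its environment this way.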
Setup
- Install:

  ```bash
  uv pip install -e ".[bfcl,tau2,mcp-universe]"
  ```

- Configure the agent LLM:

  ```bash
  export AWM_SYN_LLM_PROVIDER="openai"
  export OPENAI_BASE_URL=https://api.openai.com/v1
  export OPENAI_API_KEY=sk-...
  export AWM_SYN_OVERRIDE_MODEL=gpt-5

  # or azure endpoint:
  export AWM_SYN_LLM_PROVIDER=azure
  export AZURE_ENDPOINT_URL="https://<endpoint>.openai.azure.com/"
  export AZURE_OPENAI_API_KEY=...
  export AWM_SYN_OVERRIDE_MODEL=gpt-5
  ```

- Configure the τ²-bench user simulator:

  ```bash
  export TAU2_USER_SIM_LLM_BASE_URL="https://<endpoint>.openai.azure.com/"
  export TAU2_USER_SIM_LLM_API_KEY=...
  export TAU2_USER_SIM_LLM_MODEL=gpt-5.1
  ```

- Configure the MCP-Universe environment:

  Please refer to `third_party/MCP-Universe/README.md` for detailed instructions.

  ```bash
  cp ./third_party/MCP-Universe/.env.example ./third_party/MCP-Universe/.env
  # Then fill in the .env file with your configuration, especially the online services keys and endpoints.
  ```
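Before starting a long run, it can help to confirm that the variables from the Setup steps are actually exported. The helper below is a minimal sketch, not part of this repo: `missing_vars` and `REQUIRED_VARS` are hypothetical names, the list covers the OpenAI provider path only, and whether the harness treats every one of these variables as strictly required is an assumption.

```bash
# Print the name of every listed variable that is unset or empty.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Variable names copied from the Setup steps (OpenAI provider shown).
REQUIRED_VARS="AWM_SYN_LLM_PROVIDER OPENAI_BASE_URL OPENAI_API_KEY AWM_SYN_OVERRIDE_MODEL
TAU2_USER_SIM_LLM_BASE_URL TAU2_USER_SIM_LLM_API_KEY TAU2_USER_SIM_LLM_MODEL"

# Usage: list anything still missing (word splitting of the list is intended).
#   missing_vars $REQUIRED_VARS
```

For the Azure path, swap the `OPENAI_*` names for `AZURE_ENDPOINT_URL` and `AZURE_OPENAI_API_KEY`. The MCP-Universe keys live in its `.env` file and are not checked here.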
Evaluation
Run the evaluation from inside the agent-world-model repo rather than from this one; agent-world-model includes this mcp-adapted-bench repo as a submodule. Make sure to pull the submodule and install agent-world-model, which provides the `awm` command:
```bash
# bfclv3
awm bench --mode bfcl --output_dir ./outputs/bfcl
# τ²-bench-verified
awm bench --mode tau2 --output_dir ./outputs/tau2
# MCP-Universe
awm bench --mode mcp_universe --output_dir ./outputs/mcp_universe
```
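To run all three benchmarks back to back, the commands above can be wrapped in a small loop. This is a sketch under assumptions: `run_all` is a hypothetical helper, it uses only the `--mode` and `--output_dir` flags shown above, and it assumes `awm` returns a non-zero exit status when a run fails.

```bash
# Run every benchmark mode in turn; stop at the first failure.
run_all() {
  for mode in bfcl tau2 mcp_universe; do
    # Progress goes to stderr so stdout carries only awm's own output.
    echo "running mode: $mode" >&2
    awm bench --mode "$mode" --output_dir "./outputs/$mode" || return 1
  done
}

# Usage, from the agent-world-model repo root:
#   run_all
```

If the submodule has not been fetched yet, `git submodule update --init --recursive` is the standard git command for pulling it; run it once inside agent-world-model before calling `run_all`.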
Reference
This repo is built upon the following projects: