mcp-adapted-bench

May 23, 2026

This repo is a collection of evaluation harnesses for MCP-adapted benchmarks, in which the agent must complete every task through a unified list_tools and call_tool interface. Models in the Agent World Model series can be evaluated with this repo. It covers the following benchmarks:

BFCLv3
τ²-bench-verified
MCP-Universe: 3d_design and repository_management categories are not included.
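To make the unified interface concrete, the sketch below shows the two-step pattern an agent follows: discover tools with list_tools, then act only through call_tool. It is a minimal in-memory illustration, not this repo's actual API. The `ToolServer` class, its `register` helper, and the `add` tool are hypothetical names introduced only for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolServer:
    """Hypothetical in-memory stand-in for a benchmark's tool server."""

    _tools: dict[str, tuple[str, Callable[..., Any]]] = field(default_factory=dict)

    def register(self, name: str, description: str, fn: Callable[..., Any]) -> None:
        # Illustration-only helper; a real server defines its own tools.
        self._tools[name] = (description, fn)

    def list_tools(self) -> list[dict[str, str]]:
        # Discovery step: the agent learns which tools exist through this call.
        return [
            {"name": name, "description": self._tools[name][0]}
            for name in sorted(self._tools)
        ]

    def call_tool(self, name: str, arguments: dict[str, Any]) -> Any:
        # Execution step: every action goes through this single entry point.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        _, fn = self._tools[name]
        return fn(**arguments)


server = ToolServer()
server.register("add", "Add two integers.", lambda a, b: a + b)

tool_names = [tool["name"] for tool in server.list_tools()]
result = server.call_tool("add", {"a": 2, "b": 3})
```

Because the agent sees only these two calls, the same agent loop can in principle be reused across benchmarks whose native tool formats differ.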

All credits go to their original benchmark paper authors.

Setup

Install:

```bash
uv pip install -e ".[bfcl,tau2,mcp-universe]"
```

Configure the agent LLM:

```bash
# OpenAI endpoint:
export AWM_SYN_LLM_PROVIDER="openai"
export OPENAI_BASE_URL=https://api.openai.com/v1
export OPENAI_API_KEY=sk-...
export AWM_SYN_OVERRIDE_MODEL=gpt-5

# Or an Azure endpoint:
export AWM_SYN_LLM_PROVIDER=azure
export AZURE_ENDPOINT_URL="https://<endpoint>.openai.azure.com/"
export AZURE_OPENAI_API_KEY=...
export AWM_SYN_OVERRIDE_MODEL=gpt-5
```
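Before starting a long run, it can help to confirm that the variables for the selected provider are actually set. The helper below is a rough sketch, not part of the harness: `missing_llm_vars` and `REQUIRED_BY_PROVIDER` are hypothetical names, and the sketch assumes every variable listed above is required, which may be stricter than what the harness enforces.

```python
from typing import Mapping

# Assumption: each provider needs exactly the variables shown in the setup above.
REQUIRED_BY_PROVIDER = {
    "openai": ["OPENAI_BASE_URL", "OPENAI_API_KEY", "AWM_SYN_OVERRIDE_MODEL"],
    "azure": ["AZURE_ENDPOINT_URL", "AZURE_OPENAI_API_KEY", "AWM_SYN_OVERRIDE_MODEL"],
}


def missing_llm_vars(env: Mapping[str, str]) -> list[str]:
    """Return the names of unset or empty variables for the chosen provider."""
    provider = env.get("AWM_SYN_LLM_PROVIDER", "").strip().lower()
    if provider not in REQUIRED_BY_PROVIDER:
        # Without a recognized provider, the other variables cannot be checked.
        return ["AWM_SYN_LLM_PROVIDER"]
    return [name for name in REQUIRED_BY_PROVIDER[provider] if not env.get(name)]
```

A typical use would be `missing_llm_vars(os.environ)` at the top of a launch script, aborting if the returned list is non-empty.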

Configure the τ²-bench user simulator:

```bash
export TAU2_USER_SIM_LLM_BASE_URL="https://<endpoint>.openai.azure.com/"
export TAU2_USER_SIM_LLM_API_KEY=...
export TAU2_USER_SIM_LLM_MODEL=gpt-5.1
```

Configure the MCP-Universe environment:

Please refer to third_party/MCP-Universe/README.md for detailed instructions.

```bash
cp ./third_party/MCP-Universe/.env.example ./third_party/MCP-Universe/.env
# Then fill in .env with your configuration, in particular the keys and endpoints for online services.
```
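Since the copied .env starts out as a template, a quick scan for entries that are still empty can catch a forgotten key before a run fails partway through. The function below is a small sketch under the assumption that the file uses plain KEY=VALUE lines. `unfilled_keys` is a hypothetical helper, and the key names in the usage comment are made up for illustration.

```python
def unfilled_keys(env_text: str) -> list[str]:
    """Return keys from .env-style text whose value is empty."""
    empty = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        # Treat empty quotes ('' or "") the same as no value at all.
        if not value.strip().strip("'\""):
            empty.append(key)
    return empty


# Hypothetical content; the real keys are listed in .env.example.
sample = 'EXAMPLE_API_KEY=\n# a comment\nEXAMPLE_ENDPOINT="https://example.invalid"\n'
still_empty = unfilled_keys(sample)
```

To check the real file, pass in `Path("./third_party/MCP-Universe/.env").read_text()`.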

Evaluation

Run the evaluation from the agent-world-model repo rather than from this one; agent-world-model includes this mcp-adapted-bench repo as a submodule. Make sure to pull the submodule (for example, with git submodule update --init --recursive) and install agent-world-model, which provides the awm command:

```bash
# BFCLv3
awm bench --mode bfcl --output_dir ./outputs/bfcl

# τ²-bench-verified
awm bench --mode tau2 --output_dir ./outputs/tau2

# MCP-Universe
awm bench --mode mcp_universe --output_dir ./outputs/mcp_universe
```
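To run all three benchmarks in sequence, the commands above can be wrapped in a short driver script. This is only a sketch: `bench_command` and `run_all` are hypothetical helpers, and the only things taken from the commands above are the `awm bench` invocation, its `--mode` and `--output_dir` flags, and the three mode names.

```python
import subprocess
from typing import Callable

# Mode names as used in the commands above.
MODES = ("bfcl", "tau2", "mcp_universe")


def bench_command(mode: str, output_root: str = "./outputs") -> list[str]:
    """Build the argv for one `awm bench` run, writing to <output_root>/<mode>."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    output_dir = f"{output_root.rstrip('/')}/{mode}"
    return ["awm", "bench", "--mode", mode, "--output_dir", output_dir]


def run_all(output_root: str = "./outputs", runner: Callable = subprocess.run) -> None:
    """Run every benchmark in order; with the default runner, stop on the first failure."""
    for mode in MODES:
        runner(bench_command(mode, output_root), check=True)
```

Calling `run_all()` from inside the agent-world-model repo would execute the three commands one after another. The `runner` parameter exists so that the command construction can be checked without launching anything.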

Reference

This repo is built upon the following projects: