mcp-adapted-bench

May 23, 2026

This is a collection of evaluation harnesses for MCP-adapted benchmarks, in which the agent must complete every task through a unified list_tools and call_tool interface. Models in the Agent World Model series can be evaluated with this repo. The following benchmarks are covered:

  1. BFCLv3
  2. τ²-bench-verified
  3. MCP-Universe: 3d_design and repository_management categories are not included.

All credit goes to the authors of the original benchmark papers.

Setup

  1. Install:

    uv pip install -e ".[bfcl,tau2,mcp-universe]"
    
  2. Configure the agent LLM:

    export AWM_SYN_LLM_PROVIDER="openai"
    export OPENAI_BASE_URL=https://api.openai.com/v1
    export OPENAI_API_KEY=sk-...
    export AWM_SYN_OVERRIDE_MODEL=gpt-5
    # Alternatively, for an Azure endpoint (use these instead of the OpenAI settings above):
    export AWM_SYN_LLM_PROVIDER=azure
    export AZURE_ENDPOINT_URL="https://<endpoint>.openai.azure.com/"
    export AZURE_OPENAI_API_KEY=...
    export AWM_SYN_OVERRIDE_MODEL=gpt-5
    
  3. Configure the τ²-bench user simulator:

    export TAU2_USER_SIM_LLM_BASE_URL="https://<endpoint>.openai.azure.com/"
    export TAU2_USER_SIM_LLM_API_KEY=...
    export TAU2_USER_SIM_LLM_MODEL=gpt-5.1
    
  4. Configure the MCP-Universe environment:

    Please refer to third_party/MCP-Universe/README.md for detailed instructions.

    cp ./third_party/MCP-Universe/.env.example ./third_party/MCP-Universe/.env
    # Then fill in the .env file with your configuration, especially the keys and endpoints for the online services.
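
Before running anything, a quick sanity check can catch unset variables. The sketch below assumes the OpenAI provider settings from step 2 and the user simulator settings from step 3; swap in the Azure variable names if you use that provider. `missing_vars` is a hypothetical helper written for this example, not something this repo provides.

```bash
# Hypothetical helper: prints the name of each unset or empty variable, one per line.
missing_vars() {
    for name in "$@"; do
        # Indirect lookup that works in both bash and POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
        fi
    done
}

# Variable names taken from the setup steps above (OpenAI provider assumed).
missing=$(missing_vars \
    AWM_SYN_LLM_PROVIDER AWM_SYN_OVERRIDE_MODEL \
    OPENAI_BASE_URL OPENAI_API_KEY \
    TAU2_USER_SIM_LLM_BASE_URL TAU2_USER_SIM_LLM_API_KEY TAU2_USER_SIM_LLM_MODEL)

if [ -n "$missing" ]; then
    echo "Missing configuration:" >&2
    echo "$missing" >&2
fi
```

The check only reports missing names and does not validate the values themselves; the MCP-Universe keys in the .env file are not covered by it.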
    

Evaluation

Run the evaluation from the agent-world-model repo rather than from this one; agent-world-model includes mcp-adapted-bench as a submodule. Make sure to pull the submodule and install agent-world-model, which provides the awm command, then run:

```bash
# BFCLv3
awm bench --mode bfcl --output_dir ./outputs/bfcl

# τ²-bench-verified
awm bench --mode tau2 --output_dir ./outputs/tau2

# MCP-Universe
awm bench --mode mcp_universe --output_dir ./outputs/mcp_universe
```
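
To run all three benchmarks in sequence, the commands above can be wrapped in a loop. This is a minimal sketch that reuses only the `--mode` and `--output_dir` flags shown above; `run_bench` and the `DRY_RUN` variable are hypothetical names introduced for illustration. By default it only prints the commands so you can inspect them first.

```bash
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually invoke awm.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper around the awm bench invocation shown above.
run_bench() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "awm bench --mode $1 --output_dir ./outputs/$1"
    else
        awm bench --mode "$1" --output_dir "./outputs/$1"
    fi
}

# Mode names are the ones used in the commands above.
for mode in bfcl tau2 mcp_universe; do
    run_bench "$mode"
done
```

Each benchmark writes to its own directory under ./outputs, so results from one run do not overwrite another.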

Reference

This repo is built upon the following projects:

  1. Agent World Model
  2. BFCLv3
  3. τ²-bench-verified
  4. MCP-Universe