OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

May 13, 2026

🔔 Updates

2026-05-13: You might also be interested in ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents (website).

2026-05-09: Fixed bugs in LibreOffice MCP tools. New model results added: Kimi-K2.5, Gemini-3.1-Pro, Claude-4.5-Sonnet, and Qwen3.5-Plus.

2026-01-26: OSWorld-MCP is accepted to ICLR 2026! 🎉

2025-10-28: We released our paper and project page! 🎉

📄 Read the Paper | 🌐 Visit the Project Page


📑 Overview & Key Highlights

OSWorld-MCP is a comprehensive and fair benchmark for evaluating computer-use agents in real-world scenarios.
It jointly measures Model Context Protocol (MCP) tool invocation capabilities, graphical user interface (GUI) operation skills, and decision-making performance.
Designed as an extension of OSWorld, it significantly improves realism, balance, and comparability in evaluation.

Key Features & Findings

  • 158 validated MCP tools spanning 7 common applications (LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, OS utilities), 25 of which are distractor tools included for robustness testing
  • 250 tool-beneficial tasks → 69% of benchmark tasks benefit from MCP tools
  • Multi-round tool invocation is possible, posing real decision-making challenges
  • MCP tools boost model accuracy and efficiency, e.g., OpenAI o3: 8.3% → 17.6% (15 steps)
  • Tool Invocation Rate (TIR) stays low even at 50 steps: 33.3% for Claude-4-Sonnet and 35.3% for Agent-S2.5, indicating ample room for improvement
  • MCP tools improve agent metrics
  • Higher tool invocation correlates with higher accuracy
  • Combining tools introduces significant challenges

Architecture Overview

Figure: OSWorld-MCP evaluation framework integrating GUI actions and MCP tool invocations.


⚙️ Installation & Usage

1️⃣ Preparation: Code Setup

# Clone OSWorld base repo
git clone https://github.com/xlang-ai/OSWorld.git

# Clone OSWorld-MCP
git clone https://github.com/X-PLUG/OSWorld-MCP.git

Integrate OSWorld-MCP files into OSWorld to enable MCP support.
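The README does not spell out the integration step. One plausible reading, offered here only as a sketch, is to overlay the OSWorld-MCP checkout onto the OSWorld checkout so that files from OSWorld-MCP replace same-named files in OSWorld. The helper name `overlay_tree` and the directory names in the usage comment are illustrative assumptions; check the repository layout before running anything like this.

```python
import shutil
from pathlib import Path


def overlay_tree(src: Path, dst: Path) -> None:
    """Merge src into dst: files in both trees take the src version,
    files that exist only in dst are left untouched."""
    shutil.copytree(
        src,
        dst,
        dirs_exist_ok=True,  # merge into the existing checkout instead of failing
        ignore=shutil.ignore_patterns(".git"),  # leave dst's own git metadata alone
    )


# Hypothetical usage, run from the directory that holds both clones:
# overlay_tree(Path("OSWorld-MCP"), Path("OSWorld"))
```

Working on a fresh clone of OSWorld makes it easy to review what the overlay changed with `git status`.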


2️⃣ Preparation: Docker Environment

  1. Copy the MCP files into /home inside the Docker container:
     /home/
     └── mcp_server/
     └── osworld_mcp_client.py
  2. Install dependencies:
     pip install -r requirements.txt
  3. Install Node.js.
  4. Launch the MCP server:
     cd mcp_server
     bash debug_server.sh

A successful launch opens the local MCP debug UI in your browser.
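Before launching, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch, not part of the repository: `preflight` is a hypothetical helper that only checks for the `mcp_server` directory, the `debug_server.sh` script inside it, and a `node` executable on the PATH. It does not verify Python dependencies or the location of `osworld_mcp_client.py`.

```python
import shutil
from pathlib import Path


def preflight(home: Path = Path("/home")) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    server_dir = home / "mcp_server"
    if not server_dir.is_dir():
        problems.append(f"missing directory: {server_dir}")
    elif not (server_dir / "debug_server.sh").is_file():
        problems.append(f"missing script: {server_dir / 'debug_server.sh'}")
    if shutil.which("node") is None:
        problems.append("Node.js not found on PATH (no `node` executable)")
    return problems


for problem in preflight():
    print("preflight:", problem)
```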


3️⃣ Running Evaluation

Example: Evaluate Claude 4 Sonnet (15 steps):

python run_multienv_e2e.py \
    --api_url <your_api_url> \
    --api_key <your_api_key> \
    --model 'claude-sonnet-4-20250514-thinking' \
    --test_all_meta_path 'evaluation_examples/test_all.json' \
    --num_envs 1 \
    --action_space mcp \
    --max_steps 15 \
    --max_trajectory_length 15
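The leaderboard reports results at both 15 and 50 steps, so it may be convenient to generate the command for each step budget from one place. The sketch below reuses only the flags shown in the example above. The helper `build_eval_command` is hypothetical, and setting `--max_trajectory_length` equal to `--max_steps` is an assumption carried over from the 15-step example rather than a documented rule.

```python
import shlex


def build_eval_command(
    model: str,
    max_steps: int,
    api_url: str = "<your_api_url>",
    api_key: str = "<your_api_key>",
    num_envs: int = 1,
    test_meta: str = "evaluation_examples/test_all.json",
) -> list[str]:
    """Assemble the run_multienv_e2e.py argument list from the example above."""
    return [
        "python", "run_multienv_e2e.py",
        "--api_url", api_url,
        "--api_key", api_key,
        "--model", model,
        "--test_all_meta_path", test_meta,
        "--num_envs", str(num_envs),
        "--action_space", "mcp",
        "--max_steps", str(max_steps),
        # Assumption: trajectory length mirrors the step budget, as in the example.
        "--max_trajectory_length", str(max_steps),
    ]


for budget in (15, 50):
    cmd = build_eval_command("claude-sonnet-4-20250514-thinking", budget)
    # Print a shell-safe version; pass cmd to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```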

Key Metrics

  1. Task Accuracy (Acc): percentage of tasks completed successfully.
  2. Tool Invocation Rate (TIR): rate of correct decisions about whether or not to invoke a tool.
  3. Average Completion Steps (ACS): average number of actions per completed task.
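To make the three metrics concrete, here is a rough sketch that computes them from per-task records. The `TaskResult` fields are hypothetical and not the benchmark's actual log format. TIR is implemented literally as the share of tasks where tool use matches the task's tool-beneficial label, and ACS is averaged over successful tasks only. Both are one reading of the one-line definitions above; the paper's formal definitions may add conditions (for example, requiring task success for TIR), so treat the output as illustrative.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    success: bool          # did the agent complete the task?
    tool_beneficial: bool  # is the task labeled as benefiting from MCP tools?
    used_tool: bool        # did the agent invoke at least one MCP tool?
    steps: int             # number of actions taken


def task_accuracy(results: list[TaskResult]) -> float:
    """Percentage of tasks completed successfully."""
    return 100.0 * sum(r.success for r in results) / len(results)


def tool_invocation_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks where the use/skip decision matches the label (assumed definition)."""
    correct = sum(r.used_tool == r.tool_beneficial for r in results)
    return 100.0 * correct / len(results)


def average_completion_steps(results: list[TaskResult]) -> float:
    """Mean step count over successful tasks (assumed reading of 'completed')."""
    done = [r.steps for r in results if r.success]
    return sum(done) / len(done) if done else 0.0


demo = [
    TaskResult(success=True, tool_beneficial=True, used_tool=True, steps=6),
    TaskResult(success=False, tool_beneficial=True, used_tool=False, steps=15),
    TaskResult(success=True, tool_beneficial=False, used_tool=False, steps=10),
    TaskResult(success=False, tool_beneficial=False, used_tool=True, steps=15),
]
print(task_accuracy(demo), tool_invocation_rate(demo), average_completion_steps(demo))
```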

📊 Leaderboard (Sorted by Accuracy)

🔗 Live Leaderboard: osworld-mcp.github.io

Max Steps: 15

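| Model / Agent | Acc (%) | TIR (%) | ACS (steps) |
|---|---|---|---|
| Agent-S2.5 | 42.1 | 30.0 | 10.0 |
| Claude-4-Sonnet | 36.1 | 27.4 | 10.5 |
| Qwen3-VL | 32.8 | 21.5 | 10.0 |
| Seed1.5-VL | 30.7 | 21.0 | 10.1 |
| OpenAI o3 | 17.6 | 11.6 | 11.9 |
| Gemini-2.5-Pro | 17.4 | 12.2 | 11.6 |
| Qwen2.5-VL | 14.5 | 10.1 | 14.0 |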

Max Steps: 50

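| Model / Agent | Acc (%) | TIR (%) | ACS (steps) |
|---|---|---|---|
| Agent-S2.5 | 49.5 | 35.3 | 17.0 |
| Claude-4-Sonnet | 45.0 | 33.3 | 20.0 |
| Qwen3-VL | 39.5 | 26.1 | 18.6 |
| Seed1.5-VL | 38.2 | 25.1 | 22.3 |
| Gemini-2.5-Pro | 25.7 | 16.8 | 31.0 |
| OpenAI o3 | 24.1 | 16.0 | 33.0 |
| Qwen2.5-VL | 15.6 | 9.3 | 39.0 |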
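The highlight that higher tool invocation correlates with higher accuracy can be checked informally against the "Max Steps: 15" rows. The sketch below computes a Pearson correlation between Acc and TIR using only the numbers from that table; the `pearson` helper is written out by hand for illustration. With seven data points this is a sanity check rather than a statistical result, and it says nothing about causation.

```python
from math import sqrt

# (Acc, TIR) pairs copied from the "Max Steps: 15" leaderboard table.
LEADERBOARD_15 = {
    "Agent-S2.5": (42.1, 30.0),
    "Claude-4-Sonnet": (36.1, 27.4),
    "Qwen3-VL": (32.8, 21.5),
    "Seed1.5-VL": (30.7, 21.0),
    "OpenAI o3": (17.6, 11.6),
    "Gemini-2.5-Pro": (17.4, 12.2),
    "Qwen2.5-VL": (14.5, 10.1),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


acc, tir = zip(*LEADERBOARD_15.values())
r = pearson(acc, tir)
print(f"Pearson r between Acc and TIR (15 steps): {r:.3f}")
```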

📚 Citation

@article{jia2025osworldmcp,
  title={OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents},
  author={Jia, Hongrui and Liao, Jitong and Zhang, Xi and Xu, Haiyang and Xie, Tianbao and Jiang, Chaoya and Yan, Ming and Liu, Si and Ye, Wei and Huang, Fei},
  year={2025},
  journal={arXiv preprint arXiv:2510.24563}
}