Online-Mind2Web benchmark

May 23, 2026

Tianci Xue*¹, Weijian Qi*¹, Tianneng Shi*², Chan Hee Song¹, Boyu Gou¹, Dawn Song†², Huan Sun†¹, Yu Su†¹

¹The Ohio State University, ²University of California, Berkeley
*Equal contribution, †Equal advising

📃 Paper • 📃 Blog • 🏆 Leaderboard • 🤗 Data

News

[05/23/2026] We introduce the v2 submission schema to better facilitate human evaluation of your submissions. We have also outsourced human evaluation, which means your submissions can now be reviewed within a few days. See the Submission section for details.
[11/03/2025] We’ve updated 36 tasks that are no longer valid or involve websites with CAPTCHA verification. Please check out the updated tasks!
[07/08/2025] 🎉 Online-Mind2Web has been accepted to COLM 2025!
[05/11/2025] Check out our updates in the paper.
- The performance of Claude Computer Use 3.7.
- WebJudge(o4-mini) achieves high agreement (86%) with a low success rate gap (3.8%) compared with humans.
- Released WebJudge-7B, a robust and reliable reward model for reinforcement learning.

Online-Mind2Web includes 300 diverse tasks from 136 popular websites across various domains. The tasks cover everyday user needs in areas such as clothing, food, housing, and transportation, and they evaluate web agents in a live online environment rather than a cached or simulated one.

Update Tasks

We will regularly update Online-Mind2Web by replacing outdated or invalid tasks (e.g., due to website changes) to maintain its value as a rigorous benchmark for web agents. If you find any tasks are outdated, please reach out to us, and we will update them.

To ensure fair comparisons, we will aim to keep the updated tasks on the same websites as before and with a similar reference length. Additionally, once agent performance saturates on Online-Mind2Web, we will also revise simple tasks to preserve its long-term value.

Update History

2026/05/15

🧩 Updated Task IDs

['199be0b54a436daee74247971fc684ee_051526', '1bc154377120ec15b18dbabdba49c741_051526', '78baf9dbe7c3532f7d7ef4cc22a7f065_051526', '85b284c18d7e78c9b5a9e074e7aa3b98_051526', '8ae510355d978424f490798f900bfa2c_051526', 'fc53ddd3421411a41c1020a3fdc84ec4_051526']

2026/01/02

🧩 Updated Task IDs

['547f5729c59d5d12a457a3ebb74c31c6']

2025/12/14

🧩 Updated Task IDs

['c698ff3fc0f6cbce39947c597ab5749b', '50d91eabde542906937ab4c5b6f8f23a']

2025/12/11

🧩 Updated Task IDs

['b64f938af842f6a1b4489d0e49a785a7', '7e1047f4803237f319c004f7a7f6bccb', 'c94551d2b18f9ad0ab31b0bd98ca42e3', '47186fac8e7c7277af01144644eb4e0b', '78baf9dbe7c3532f7d7ef4cc22a7f065']

2025/11/23

🧩 Updated Task IDs

['9829f3087ab1f9c8eba6b6dd2b831d25', '1bc154377120ec15b18dbabdba49c741']

2025/11/03

Update summary:
Based on community feedback, we updated 36 tasks that were no longer valid or involved websites with CAPTCHA verification. The updated tasks were carefully designed to preserve similar difficulty and task types, ensuring fair comparison with prior results.

🧩 Updated Task IDs

['b7258ee05d75e6c50673a59914db412e', '824eb7bb0ef1ce40bfd49c12182d9428', '8f2611047de227a2ca8bda13f6e2e5fb', '62f1626ce249c31098854f8b38bdd6cf', '79f0bd7df6e685f30f20025cc6755c0a', '5e1b8254c123c80178cc28e0afdb14f0', '816851ff92ff0219acf4364dcc2c4692', 'e7301bb694871429bf2eb36c3a72186c', '3c1ffc3f494e423b3c434c79e35da8f3', '9f1cba613830ca1c6a58f9498c06e679', '9c97bab9c2abfb90a426cbe9addae8d0', '2fc51dd3febd447f0fdcdabca8d944ce', 'eb323dc584156d0eb3a2b90bb8c4b791', 'a0a18ca6a3529f3e97c771aadd42d3a0', 'e7f6cca9a8875f98fee3b711ead3a444', 'f2be37a9a60fbc25b6b11cf622d17352', '2d5a7f95f951a26838289dfd629ae850', '502e864440283214e0180645015f568b', '3adeea7627f4343069f38adae40f73d0', '8f80e64e44e1fada018997b2fe869683', '0a0fa834ce41b5297c6474293383759d', '64345c365f544375357c7b67917f08a0', '33bd2cdcea4fcc42a09a8a1e4e5841c6', '3dca7cbe7d086619d837ff9f5312cebc', '11857213ca01510f12813740afd59918', 'd730f4ff450da1bd60a836163736ef6a', 'fe33894188d20d7469f37a9fd855e7ff', 'e43cbc8a0bf9e999884928d11006f894', 'c577a14301a725e09ccd269a3e0b271e', '2c8ef01a92c71ba9ef2e59bb17eea2b3', '636b07af4dd97c1793733db1fd1b90b8', 'd8e2a81fa621ce4737e5ea85671b630e', '199be0b54a436daee74247971fc684ee', 'd1807551297ac60ecaaabbd2a2ed301a', 'dd44c665cec1e9c929a4c5f074e7844a', '1ab384fb3a791edfb410213cc6b82151']

2025/04/05

🧩 Updated Task IDs

["c03ee2be3d73556ab789c0ad1cbd3451", "c181f903ec1107b850032c17cad88393", "2c8ef01a92c71ba9ef2e59bb17eea2b3", "d8e2a81fa621ce4737e5ea85671b630e", "63d6866fc000fcb1f153e07604bd1395", "199be0b54a436daee74247971fc684ee"]

Submission

We use the v2 submission schema (online-mind2web-v2) for trajectory submissions to the leaderboard. In v2, each step is a self-contained object that bundles its action, thought, screenshot, and URL together, so every step can be evaluated with its complete context.

A submission contains one directory per task, each holding a result.json and a trajectory/ folder with per-step screenshots. The result.json follows the v2 schema with fields including schema_version, task, task_id, agent_final_answer, reference_length, and an action_history of structured step objects.
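
As a rough illustration of the layout described above, the following Python sketch builds a result.json-like object and checks its shape. Only the top-level field names come from the text. The per-step key names (action, thought, screenshot, url), the value stored in schema_version, and all example values are assumptions made for illustration; the v2 schema README remains the authoritative specification.

```python
import json

# Top-level fields named in the text above.
REQUIRED_FIELDS = (
    "schema_version", "task", "task_id",
    "agent_final_answer", "reference_length", "action_history",
)
# Hypothetical key names for the per-step "action, thought, screenshot, and URL".
STEP_KEYS = ("action", "thought", "screenshot", "url")


def check_result(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in result]
    for index, step in enumerate(result.get("action_history", [])):
        for key in STEP_KEYS:
            if key not in step:
                problems.append(f"step {index}: missing {key}")
    return problems


example = {
    "schema_version": "online-mind2web-v2",  # assumed value
    "task": "Find a one-way flight from Columbus to Boston.",  # made-up task
    "task_id": "example_task_id",  # placeholder, not a real task ID
    "agent_final_answer": "",
    "reference_length": 5,
    "action_history": [
        {
            "action": {"type": "click", "target": "Search button"},  # illustrative
            "thought": "Open the flight search form.",
            "screenshot": "trajectory/0.png",  # illustrative path
            "url": "https://example.com/",
        }
    ],
}

print(json.dumps(example, indent=2))
print(check_result(example))
```

A check like this only catches missing keys; it does not replace the validation rules in the schema README.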

Review policy

Auto-eval: We provide free review for auto-eval submissions.
Human eval: We have outsourced human evaluation for reviewing submissions. See the full pricing and review details.
Academic submissions: For submissions from academia, the Online-Mind2Web team can still provide free evaluation. The turnaround time is usually longer than outsourced human evaluation, so please notify us 1-2 weeks in advance.

Examples for both formats are available under data/example/ (example_v1/ and example_v2/). For the full schema specification, action dictionary, validation rules, and migration guide from v1, see the v2 schema README. For submission instructions, leaderboard access, and submission status, visit the Leaderboard.

Automatic Evaluator via LLM-as-a-Judge (WebJudge)

To enhance the reliability and scalability of the evaluation process in online environments, we propose a more reliable automatic evaluation method called WebJudge, which consists of three parts. (1) Key Point Identification: The model is prompted to identify several key points necessary for completing the task, based on the given instruction and task description. (2) Key Screenshot Identification: Important screenshots are selected from the agent’s trajectory to retain relevant visual evidence while discarding uninformative frames. (3) Outcome Judgment: The model outputs the final judgment based on the task description, key points, key screenshots, and the action history. Our method preserves critical intermediate screenshots while mitigating the token overload issue.
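
The three stages can be sketched as a small Python function. This is a structural outline only, not the repository's implementation: the function name web_judge, the three injected callables, and the score threshold used to select screenshots are all hypothetical, and in practice each callable would wrap a call to the judge model.

```python
from typing import Callable, Sequence


def web_judge(
    task: str,
    actions: Sequence[str],
    screenshots: Sequence[str],
    identify_key_points: Callable[[str], list],
    score_screenshot: Callable[[str, list, str], int],
    judge_outcome: Callable[[str, list, list, Sequence[str]], bool],
    threshold: int = 3,  # assumed cutoff for keeping a screenshot
) -> bool:
    """Run the three stages in order and return the final success verdict."""
    # (1) Key point identification from the task description.
    key_points = identify_key_points(task)
    # (2) Key screenshot identification: keep frames scored at or above the threshold.
    key_screenshots = [
        shot for shot in screenshots
        if score_screenshot(task, key_points, shot) >= threshold
    ]
    # (3) Outcome judgment from task, key points, key screenshots, and action history.
    return judge_outcome(task, key_points, key_screenshots, actions)


# Toy stand-ins for the model calls, so the sketch runs without any API access.
kept_frames = []

def toy_key_points(task):
    return ["price under $50", "sorted by rating"]

def toy_score(task, key_points, shot):
    return 5 if "results" in shot else 1

def toy_judge(task, key_points, key_screenshots, actions):
    kept_frames.extend(key_screenshots)
    return bool(key_screenshots) and bool(actions)

verdict = web_judge(
    "Find headphones under $50 sorted by rating.",
    ["click search box", "type headphones", "click sort by rating"],
    ["home.png", "results_page.png", "cookie_banner.png"],
    toy_key_points, toy_score, toy_judge,
)
print(verdict, kept_frames)
```

Filtering screenshots before the final call is what keeps the judge's input short: only frames judged relevant to the key points reach stage (3).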

[Figure: WebJudge pipeline]

Results

Comparison against Existing Evaluation Methods on Online-Mind2Web

Model	Auto-Eval	SeeAct	Agent-E	Browser Use	Claude 3.5	Claude 3.7	Operator	Avg AR
GPT-4o	Autonomous Eval	84.7	85.0	76.0	83.7	75.5	71.7	79.4
	AgentTrek Eval	73.0	64.3	63.3	--	--	--	66.9
	WebVoyager	--	75.3	71.3	74.0	72.0	76.7	73.9
	WebJudge	86.7	86.0	81.4	86.3	79.1	81.8	83.6
o4-mini	Autonomous Eval	79.7	85.7	86.0	84.3	68.0	73.3	79.5
	WebVoyager	--	80.3	79.0	81.7	74.3	78.3	78.7
	WebJudge	85.3	86.3	89.3	87.0	82.3	83.7	85.7
	WebJudge-7B	86.0	87.3	88.3	89.7	84.3	86.3	87.0

WebJudge powered by GPT-4o and o4-mini achieves the highest average agreement with human judgment among the methods using the same backbone, at 83.6% and 85.7%, respectively. WebJudge-7B goes further and outperforms WebJudge with o4-mini, reaching an average agreement of 87.0%.
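
The Avg AR column appears to be the unweighted mean of the per-agent agreement rates, skipping agents with no reported value ("--"). That reading is inferred from the numbers in the table rather than stated in the text, so treat the short Python check below as a sanity check under that assumption.

```python
def average_agreement(rates):
    """Mean of the reported agreement rates, skipping missing entries (None)."""
    reported = [rate for rate in rates if rate is not None]
    return sum(reported) / len(reported)


# Rows copied from the table above; None stands for "--".
webjudge_gpt4o = [86.7, 86.0, 81.4, 86.3, 79.1, 81.8]
webvoyager_gpt4o = [None, 75.3, 71.3, 74.0, 72.0, 76.7]

# Compare the printed means with the Avg AR column (83.6 and 73.9).
print(f"WebJudge (GPT-4o):   {average_agreement(webjudge_gpt4o):.2f}")
print(f"WebVoyager (GPT-4o): {average_agreement(webvoyager_gpt4o):.2f}")
```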

Excellent generalization capabilities on AgentRewardBench (5 OOD benchmarks)

Methods	AB	VWA	WA	Work	Wk++	Overall
Rule-based*	25.0	85.2	79.0	100.0	83.3	83.8
Autonomous Eval*	83.3	61.2	67.6	96.4	59.3	67.6
GPT-4o (A11y Tree)*	77.8	63.0	70.2	94.6	63.0	69.8
WebJudge (GPT-4o)	66.7	69.8	72.6	92.3	75.0	73.7
WebJudge-7B	80.0	66.7	77.5	100.0	70.0	75.7
WebJudge (o4-mini)	100.0	74.5	81.2	100.0	90.0	82.0

WebJudge outperforms the existing LLM-based evaluators in the table, achieving overall precision of 73.7% (GPT-4o), 75.7% (WebJudge-7B), and 82.0% (o4-mini) across 1302 trajectories from AssistantBench (AB), VisualWebArena (VWA), WebArena (WA), WorkArena (Work), and WorkArena++ (Wk++).

The high precision suggests that WebJudge holds potential as a robust and scalable reward model for downstream applications such as Rejection Sampling Fine-Tuning, Reflection, and Reinforcement Learning.

Model Release

We have released the fine-tuned WebJudge-7B weights, which are now available on Hugging Face.

Setup Environment

Create a conda environment and install dependencies:

conda create -n Online_Mind2Web python=3.11
conda activate Online_Mind2Web
pip install -r requirements.txt

Evaluation

You can run the provided example evaluation script directly to perform the evaluation. Adjust the "mode" parameter to choose among various auto-eval methods.

bash ./script/eval.sh

Important Notes for Reliable Evaluation on Online-Mind2Web:

Start from the specified websites, not Google Search: To enable fair comparisons, please ensure that each task starts from the website specified in our benchmark. Starting from Google Search or alternative websites can lead agents to use different websites to solve the task, resulting in varying difficulty levels and potentially skewed evaluation results.
Include only factual actions, not agent outputs: The action history should contain only the factual actions taken by the agent to complete the task (e.g., clicking elements and typing text). Do not include the final response or any other agent outputs, as they may contain hallucinated content and result in a high rate of false positives.
Use o4-mini for WebJudge: WebJudge powered by o4-mini demonstrates a higher alignment with human judgment, achieving an average agreement rate of 85.7% and maintaining a narrow success rate gap of just 3.8%. Therefore, please use o4-mini as the backbone for automatic evaluation.
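
One way to follow the second note is to filter the agent's raw log before building the action history. The Python sketch below assumes each logged step is a dict with a "type" key and uses a made-up allowlist of action types; both are illustrative and should be adapted to whatever your agent actually records.

```python
# Hypothetical action types; substitute the ones your agent actually emits.
FACTUAL_ACTION_TYPES = {"click", "type", "select", "scroll", "goto"}


def factual_action_history(steps):
    """Keep only executed browser actions; drop final answers and other agent text."""
    return [step for step in steps if step.get("type") in FACTUAL_ACTION_TYPES]


raw_steps = [
    {"type": "click", "target": "Search button"},
    {"type": "type", "target": "Destination field", "text": "Boston"},
    {"type": "final_answer", "text": "The cheapest flight costs $120."},
]

filtered_steps = factual_action_history(raw_steps)
for step in filtered_steps:
    print(step)
```

An allowlist is used instead of a blocklist so that any new kind of free-text agent output is excluded by default.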

Evaluation Results

In certain scenarios, testing on the full Online-Mind2Web dataset may not be feasible due to cost, privacy, or legal constraints. To facilitate fair, apples-to-apples comparisons, we release both our human evaluation labels and auto-eval details.

Human Evaluation: Task-level human evaluation labels are provided in the file.
Auto-Evaluation: The results of WebJudge are available in the folder.
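
With both sets of labels available, an evaluator can be compared against human judgment. The Python sketch below assumes agreement rate means the share of tasks where the automatic label matches the human label, and success rate gap means the absolute difference between the two success rates. The dict-based label format and the task IDs are hypothetical and do not reflect the layout of the released files.

```python
def agreement_rate(human, auto):
    """Percentage of shared tasks where the automatic label equals the human label."""
    shared = human.keys() & auto.keys()
    matches = sum(human[task_id] == auto[task_id] for task_id in shared)
    return 100.0 * matches / len(shared)


def success_rate(labels):
    """Percentage of tasks labeled as successful."""
    return 100.0 * sum(labels.values()) / len(labels)


def success_rate_gap(human, auto):
    """Absolute difference between the automatic and human success rates."""
    return abs(success_rate(auto) - success_rate(human))


# Toy labels: True means the task was judged successful.
human = {"task_a": True, "task_b": False, "task_c": True, "task_d": False}
auto = {"task_a": True, "task_b": True, "task_c": True, "task_d": False}

print(agreement_rate(human, auto))
print(success_rate_gap(human, auto))
```

The two numbers answer different questions: an evaluator can have a small success rate gap while still disagreeing with humans on many individual tasks, if its false positives and false negatives cancel out.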

Licensing Information

The Online-Mind2Web dataset is licensed under a Creative Commons Attribution 4.0 International License.

The code in this repository is licensed under the MIT License.

📚 Citation

Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data.

@inproceedings{
  xue2025an,
  title={An Illusion of Progress? Assessing the Current State of Web Agents},
  author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=6jZi4HSs6o}
}

@inproceedings{deng2023mind2web,
 author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
 pages = {28091--28114},
 publisher = {Curran Associates, Inc.},
 title = {Mind2Web: Towards a Generalist Agent for the Web},
 url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
 volume = {36},
 year = {2023}
}