Online-Mind2Web benchmark
May 23, 2026
Tianci Xue1, Weijian Qi*,1, Tianneng Shi*,2, Chan Hee Song1, Boyu Gou1, Dawn Song2, Huan Sun†,1, Yu Su†,1
1The Ohio State University, 2University of California, Berkeley
*Equal contribution, †Equal advising
📃 Paper • 📃 Blog • 🏆 Leaderboard • 🤗 Data
News
- [05/23/2026] We introduce the v2 submission schema to better facilitate human evaluation of your submissions. We have also outsourced human evaluation, which means your submissions can now be reviewed within a few days. See the Submission section for details.
- [11/03/2025] We’ve updated 36 tasks that are no longer valid or involve websites with CAPTCHA verification. Please check out the updated tasks!
- [07/08/2025] 🎉 Online-Mind2Web has been accepted to COLM 2025!
- [05/11/2025] Check out our updates in the paper:
  - The performance of Claude Computer Use 3.7.
  - WebJudge (o4-mini) achieves high agreement with humans (86%) and a low success rate gap (3.8%).
  - We released WebJudge-7B, a robust and reliable reward model for reinforcement learning.
Tasks
Online-Mind2Web includes 300 diverse tasks from 136 popular websites. The tasks reflect real-world user needs across domains such as clothing, food, housing, and transportation, and they evaluate web agents' performance in a real-world online environment.
Update Tasks
We will regularly update Online-Mind2Web by replacing outdated or invalid tasks (e.g., due to website changes) to maintain its value as a rigorous benchmark for web agents. If you find any tasks are outdated, please reach out to us, and we will update them.
To ensure fair comparisons, we will aim to keep the updated tasks on the same websites as before and with a similar reference length. Additionally, once agent performance saturates on Online-Mind2Web, we will also revise simple tasks to preserve its long-term value.
Update History
2026/05/15
🧩 Updated Task IDs
['199be0b54a436daee74247971fc684ee_051526', '1bc154377120ec15b18dbabdba49c741_051526', '78baf9dbe7c3532f7d7ef4cc22a7f065_051526', '85b284c18d7e78c9b5a9e074e7aa3b98_051526', '8ae510355d978424f490798f900bfa2c_051526', 'fc53ddd3421411a41c1020a3fdc84ec4_051526']
2026/01/02
🧩 Updated Task IDs
['547f5729c59d5d12a457a3ebb74c31c6']
2025/12/14
🧩 Updated Task IDs
['c698ff3fc0f6cbce39947c597ab5749b', '50d91eabde542906937ab4c5b6f8f23a']
2025/12/11
🧩 Updated Task IDs
['b64f938af842f6a1b4489d0e49a785a7', '7e1047f4803237f319c004f7a7f6bccb', 'c94551d2b18f9ad0ab31b0bd98ca42e3', '47186fac8e7c7277af01144644eb4e0b', '78baf9dbe7c3532f7d7ef4cc22a7f065']
2025/11/23
🧩 Updated Task IDs
['9829f3087ab1f9c8eba6b6dd2b831d25', '1bc154377120ec15b18dbabdba49c741']
2025/11/03
Update summary:
Based on community feedback, we updated 36 tasks that were no longer valid or involved websites with CAPTCHA verification. The updated tasks were carefully designed to preserve similar difficulty and task types, ensuring fair comparison with prior results.
🧩 Updated Task IDs
['b7258ee05d75e6c50673a59914db412e', '824eb7bb0ef1ce40bfd49c12182d9428', '8f2611047de227a2ca8bda13f6e2e5fb', '62f1626ce249c31098854f8b38bdd6cf', '79f0bd7df6e685f30f20025cc6755c0a', '5e1b8254c123c80178cc28e0afdb14f0', '816851ff92ff0219acf4364dcc2c4692', 'e7301bb694871429bf2eb36c3a72186c', '3c1ffc3f494e423b3c434c79e35da8f3', '9f1cba613830ca1c6a58f9498c06e679', '9c97bab9c2abfb90a426cbe9addae8d0', '2fc51dd3febd447f0fdcdabca8d944ce', 'eb323dc584156d0eb3a2b90bb8c4b791', 'a0a18ca6a3529f3e97c771aadd42d3a0', 'e7f6cca9a8875f98fee3b711ead3a444', 'f2be37a9a60fbc25b6b11cf622d17352', '2d5a7f95f951a26838289dfd629ae850', '502e864440283214e0180645015f568b', '3adeea7627f4343069f38adae40f73d0', '8f80e64e44e1fada018997b2fe869683', '0a0fa834ce41b5297c6474293383759d', '64345c365f544375357c7b67917f08a0', '33bd2cdcea4fcc42a09a8a1e4e5841c6', '3dca7cbe7d086619d837ff9f5312cebc', '11857213ca01510f12813740afd59918', 'd730f4ff450da1bd60a836163736ef6a', 'fe33894188d20d7469f37a9fd855e7ff', 'e43cbc8a0bf9e999884928d11006f894', 'c577a14301a725e09ccd269a3e0b271e', '2c8ef01a92c71ba9ef2e59bb17eea2b3', '636b07af4dd97c1793733db1fd1b90b8', 'd8e2a81fa621ce4737e5ea85671b630e', '199be0b54a436daee74247971fc684ee', 'd1807551297ac60ecaaabbd2a2ed301a', 'dd44c665cec1e9c929a4c5f074e7844a', '1ab384fb3a791edfb410213cc6b82151']
2025/04/05
🧩 Updated Task IDs
["c03ee2be3d73556ab789c0ad1cbd3451", "c181f903ec1107b850032c17cad88393", "2c8ef01a92c71ba9ef2e59bb17eea2b3", "d8e2a81fa621ce4737e5ea85671b630e", "63d6866fc000fcb1f153e07604bd1395", "199be0b54a436daee74247971fc684ee"]
Submission
We use the v2 submission schema (online-mind2web-v2) for trajectory submissions to the leaderboard. In v2, each step is a self-contained object that bundles its action, thought, screenshot, and URL, so evaluators can review every step with its complete context.
A submission contains one directory per task. Each task directory holds a result.json and a trajectory/ folder with per-step screenshots. The result.json follows the v2 schema, with fields including schema_version, task, task_id, agent_final_answer, reference_length, and an action_history of structured step objects.
Review policy
- Auto-eval: We provide free review for auto-eval submissions.
- Human eval: We have outsourced human evaluation for reviewing submissions. See the full pricing and review details.
- Academic submissions: For submissions from academia, the Online-Mind2Web team can still provide free evaluation. The turnaround time is usually longer than outsourced human evaluation, so please notify us 1-2 weeks in advance.
Examples for both formats are available under data/example/ (example_v1/ and example_v2/). For the full schema specification, action dictionary, validation rules, and migration guide from v1, see the v2 schema README. For submission instructions, leaderboard access, and submission status, visit the Leaderboard.
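As a quick pre-submission sanity check, the layout described above can be verified with a few lines of Python. This is only a minimal sketch, not the official validator: the top-level field names come from the description above, while the per-step key names (`action`, `thought`, `screenshot`, `url`) and the helper name `check_task_dir` are assumptions made for illustration. The v2 schema README remains the authoritative reference.

```python
import json
from pathlib import Path

# Top-level fields named in this README; the full list is in the v2 schema README.
REQUIRED_FIELDS = (
    "schema_version", "task", "task_id",
    "agent_final_answer", "reference_length", "action_history",
)
# Assumed (hypothetical) per-step key names: the README only says that each
# step bundles an action, a thought, a screenshot, and a URL.
ASSUMED_STEP_KEYS = ("action", "thought", "screenshot", "url")


def check_task_dir(task_dir):
    """Return a list of human-readable problems found in one task directory."""
    task_dir = Path(task_dir)
    result_path = task_dir / "result.json"
    if not result_path.is_file():
        return ["missing result.json"]

    problems = []
    if not (task_dir / "trajectory").is_dir():
        problems.append("missing trajectory/ folder")

    result = json.loads(result_path.read_text(encoding="utf-8"))
    if not isinstance(result, dict):
        return problems + ["result.json is not a JSON object"]

    problems += [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in result]

    history = result.get("action_history", [])
    if not isinstance(history, list):
        return problems + ["action_history is not a list"]
    for index, step in enumerate(history):
        if not isinstance(step, dict):
            problems.append(f"step {index} is not an object")
            continue
        problems += [
            f"step {index} missing key: {key}"
            for key in ASSUMED_STEP_KEYS if key not in step
        ]
    return problems
```

Calling `check_task_dir("path/to/one_task")` returns an empty list when nothing is flagged. The sketch checks only that fields are present, not their types or values.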
Automatic Evaluator via LLM-as-a-Judge (WebJudge)
To enhance the reliability and scalability of evaluation in online environments, we propose an automatic evaluation method called WebJudge, which consists of three parts. (1) Key Point Identification: The model is prompted to identify the key points necessary for completing the task, based on the given instruction and task description. (2) Key Screenshot Identification: Important screenshots are selected from the agent's trajectory to retain relevant visual evidence while discarding uninformative frames. (3) Outcome Judgment: The model outputs the final judgment based on the task description, key points, key screenshots, and the action history. This design preserves critical intermediate screenshots while mitigating the token overload issue.
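The three stages can be pictured as a small pipeline. The sketch below is illustrative only and is not the code in this repository: the three callables stand in for model calls and are hypothetical, and selecting screenshots by comparing a relevance score against `min_score` is an assumption about how "important" frames might be chosen.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Verdict:
    key_points: List[str]
    key_screenshots: List[str]
    success: bool


def web_judge(
    task: str,
    action_history: Sequence[str],
    screenshots: Sequence[str],
    identify_key_points: Callable[[str], List[str]],
    score_screenshot: Callable[[str, List[str], str], int],
    judge_outcome: Callable[[str, List[str], List[str], Sequence[str]], bool],
    min_score: int = 3,  # assumed relevance threshold, for illustration only
) -> Verdict:
    # (1) Key point identification: what must hold for the task to count as done.
    key_points = identify_key_points(task)

    # (2) Key screenshot identification: keep only frames judged relevant,
    # so the final prompt is not flooded with uninformative screenshots.
    key_screenshots = [
        shot for shot in screenshots
        if score_screenshot(task, key_points, shot) >= min_score
    ]

    # (3) Outcome judgment: decide success from the task, the key points,
    # the retained screenshots, and the action history.
    success = judge_outcome(task, key_points, key_screenshots, action_history)
    return Verdict(key_points, key_screenshots, bool(success))
```

Passing the model calls in as functions keeps the control flow testable with simple stubs and makes it easy to swap the backbone model.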
Results
Comparison against Existing Evaluation Methods on Online-Mind2Web
| Model | Auto-Eval | SeeAct | Agent-E | Browser Use | Claude 3.5 | Claude 3.7 | Operator | Avg AR |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | Autonomous Eval | 84.7 | 85.0 | 76.0 | 83.7 | 75.5 | 71.7 | 79.4 |
| GPT-4o | AgentTrek Eval | 73.0 | 64.3 | 63.3 | -- | -- | -- | 66.9 |
| GPT-4o | WebVoyager | -- | 75.3 | 71.3 | 74.0 | 72.0 | 76.7 | 73.9 |
| GPT-4o | WebJudge | 86.7 | 86.0 | 81.4 | 86.3 | 79.1 | 81.8 | 83.6 |
| o4-mini | Autonomous Eval | 79.7 | 85.7 | 86.0 | 84.3 | 68.0 | 73.3 | 79.5 |
| o4-mini | WebVoyager | -- | 80.3 | 79.0 | 81.7 | 74.3 | 78.3 | 78.7 |
| o4-mini | WebJudge | 85.3 | 86.3 | 89.3 | 87.0 | 82.3 | 83.7 | 85.7 |
| WebJudge-7B | WebJudge | 86.0 | 87.3 | 88.3 | 89.7 | 84.3 | 86.3 | 87.0 |
Excellent generalization capabilities on AgentRewardBench (5 OOD benchmarks)
| Methods | AB | VWA | WA | Work | Wk++ | Overall |
|---|---|---|---|---|---|---|
| Rule-based* | 25.0 | 85.2 | 79.0 | 100.0 | 83.3 | 83.8 |
| Autonomous Eval* | 83.3 | 61.2 | 67.6 | 96.4 | 59.3 | 67.6 |
| GPT-4o (A11y Tree)* | 77.8 | 63.0 | 70.2 | 94.6 | 63.0 | 69.8 |
| WebJudge (GPT-4o) | 66.7 | 69.8 | 72.6 | 92.3 | 75.0 | 73.7 |
| WebJudge-7B | 80.0 | 66.7 | 77.5 | 100.0 | 70.0 | 75.7 |
| WebJudge (o4-mini) | 100.0 | 74.5 | 81.2 | 100.0 | 90.0 | 82.0 |
WebJudge outperforms the existing LLM-based evaluators in the table above, achieving overall precision of 73.7% (GPT-4o), 75.7% (WebJudge-7B), and 82.0% (o4-mini) across 1302 trajectories from AssistantBench (AB), VisualWebArena (VWA), WebArena (WA), WorkArena (Work), and WorkArena++ (Wk++).
The high precision suggests that WebJudge holds potential as a robust and scalable reward model for downstream applications such as Rejection Sampling Fine-Tuning, Reflection, and Reinforcement Learning.
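Precision matters for a reward model because every false positive rewards a failed trajectory. As a minimal sketch, assuming precision here means the share of judge-approved trajectories that human annotators also marked successful, it can be computed as follows (the function name and input format are illustrative, not taken from AgentRewardBench):

```python
def precision(judge_labels, human_labels):
    """Share of trajectories the judge marked successful that humans also marked successful."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    # Human labels for the trajectories the judge predicted as successful.
    approved = [human for judge, human in zip(judge_labels, human_labels) if judge]
    if not approved:
        return float("nan")  # undefined when the judge approves nothing
    return sum(approved) / len(approved)


# The judge approves three trajectories; humans confirm two of them.
print(precision([True, True, False, True], [True, False, False, True]))
```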
Model Release
We have released the fine-tuned WebJudge-7B weights, which are now available on Hugging Face.
Setup Environment
Create a conda environment and install dependencies:
conda create -n Online_Mind2Web python=3.11
conda activate Online_Mind2Web
pip install -r requirements.txt
Evaluation
You can run the provided example evaluation script directly to perform the evaluation. Adjust the "mode" parameter to choose among various auto-eval methods.
bash ./script/eval.sh
Important Notes for Reliable Evaluation on Online-Mind2Web:
- Start from the specified websites, not Google Search: To enable fair comparisons, please ensure that each task starts from the specified website in our benchmark. Starting from Google Search or alternative websites can lead agents to use different websites to solve the task, resulting in varying difficulty levels and potentially skewed evaluation results.
- Include only factual actions, not agent outputs: The action history should contain only the factual actions taken by the agent to complete the task (e.g., clicking elements and typing text). Do not include the final response or any other agent outputs, as they may contain hallucinated content and result in a high rate of false positives.
- Use o4-mini for WebJudge: WebJudge powered by o4-mini demonstrates a higher alignment with human judgment, achieving an average agreement rate of 85.7% and maintaining a narrow success rate gap of just 3.8%. Therefore, please use o4-mini as the backbone for automatic evaluation.
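To illustrate the second note, the sketch below builds the action history passed to the judge from executed actions only. It assumes each step is a dictionary with an `action` entry plus free-form entries such as `thought`. These key names and the helper name are hypothetical, so adapt them to your agent's own log format.

```python
def factual_action_history(steps):
    """Keep only what the agent did; drop thoughts, final answers, and other model output."""
    return [str(step["action"]) for step in steps if step.get("action")]


steps = [
    {"action": "click [Search]", "thought": "The search box should be here."},
    {"action": "type 'red jacket'", "thought": "Enter the query."},
    {"thought": "I found the cheapest jacket.", "final_answer": "It costs $20."},
]
print(factual_action_history(steps))
```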
Evaluation Results
In certain scenarios, testing on the full Online-Mind2Web dataset may not be feasible due to cost, privacy, or legal constraints. To facilitate fair, apples-to-apples comparisons, we release both our human evaluation labels and auto-eval details.
- Human Evaluation: Task-level human evaluation labels are provided in the file.
- Auto-Evaluation: The results of WebJudge are available in the folder.
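With both label sets in hand, an automatic evaluator can be compared against human judgment. The sketch below assumes the labels have already been loaded into dictionaries that map task IDs to booleans. It also assumes that the agreement rate is the percentage of tasks where the two labels match and that the success rate gap is the absolute difference between the two success rates. The loading step is omitted because the file formats are not described here.

```python
def _shared_tasks(auto_labels, human_labels):
    shared = sorted(auto_labels.keys() & human_labels.keys())
    if not shared:
        raise ValueError("no task IDs in common")
    return shared


def agreement_rate(auto_labels, human_labels):
    """Percentage of shared tasks on which the auto-eval label equals the human label."""
    shared = _shared_tasks(auto_labels, human_labels)
    matches = sum(auto_labels[task] == human_labels[task] for task in shared)
    return 100.0 * matches / len(shared)


def success_rate_gap(auto_labels, human_labels):
    """Absolute difference, in percentage points, between auto-eval and human success rates."""
    shared = _shared_tasks(auto_labels, human_labels)
    auto_sr = 100.0 * sum(bool(auto_labels[task]) for task in shared) / len(shared)
    human_sr = 100.0 * sum(bool(human_labels[task]) for task in shared) / len(shared)
    return abs(auto_sr - human_sr)


auto = {"a": True, "b": False, "c": True, "d": True}
human = {"a": True, "b": False, "c": False, "d": True}
print(agreement_rate(auto, human), success_rate_gap(auto, human))
```

Restricting both metrics to the shared task IDs avoids silently counting tasks that only one label set covers.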
Licensing Information

The Online-Mind2Web dataset is licensed under a Creative Commons Attribution 4.0 International License.
Code in this repo is licensed under an MIT License.
📚 Citation
Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data.
@inproceedings{
xue2025an,
title={An Illusion of Progress? Assessing the Current State of Web Agents},
author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
booktitle={Second Conference on Language Modeling},
year={2025},
url={https://openreview.net/forum?id=6jZi4HSs6o}
}
@inproceedings{deng2023mind2web,
author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {28091--28114},
publisher = {Curran Associates, Inc.},
title = {Mind2Web: Towards a Generalist Agent for the Web},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
volume = {36},
year = {2023}
}