UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

November 24, 2025

[📖 Paper] [🤗 UI-R1-3B] [🤗 UI-R1-E-3B] [🤗 Datasets] [🤗 Daily Paper]

🔥 Overview

We propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks.

Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e. Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on AndroidControl. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples.

Grounding Leaderboard: UI-I2E-Bench

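| Model | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Average |
|---|---|---|---|---|
| UI-TARS-1.5-7B | 88.1 | 73.2 | 42.2 | 67.8 |
| Uground-V1-72B | 89.7 | 76.3 | 34.3 | 66.8 |
| UI-TARS-72B | 88.4 | 73.7 | 38.1 | 66.7 |
| UI-R1-E-3B | 89.2 | 69.1 | 33.5 | 63.9 |
| Uground-V1-7B | 87.1 | 70.3 | 31.1 | 62.8 |
| InfiGUI-R1 | 87.5 | 69.7 | 29.6 | 62.3 |
| UI-TARS-7B | 89.5 | 61.4 | 35.7 | 62.2 |
| Qwen2.5-VL-72B | 87.1 | 51.4 | 43.6 | 60.7 |
| UI-I2E-VLM-7B | 82.5 | 69.5 | 23.6 | 58.5 |
| UI-TARS-2B | 82.3 | 62 | 27.7 | 57.3 |
| Qwen2.5-VL-7B | 84.7 | 53.8 | 29 | 55.8 |
| OmniParser-V2 | 72 | 54.8 | 39.6 | 55.5 |
| Uground-V1-2B | 78.8 | 57.4 | 26.6 | 54.3 |
| OS-Atlas-7B | 82.5 | 58.6 | 18.9 | 53.3 |
| UI-R1-3B | 83.3 | 58.5 | 17.8 | 53.2 |
| UGround-7B | 74.1 | 54.2 | 16.5 | 48.3 |
| UI-I2E-VLM-4B | 70.4 | 53.4 | 12.2 | 45.3 |
| OmniParser | 73.9 | 53.1 | 8.3 | 45.1 |
| ShowUI-2B | 76.8 | 41.5 | 7.7 | 42 |
| Qwen2.5-VL-3B | 55.5 | 41.7 | 23.9 | 41.3 |
| Aguvis-7B | 84.4 | 53.2 | 22.9 | 40.4 |
| OS-Atlas-4B | 70.1 | 44.3 | 3.7 | 39.4 |
| Qwen2-VL-7B | 42.6 | 48.7 | 1.6 | 31 |
| Seeclick | 55.8 | 26.4 | 1.1 | 27.8 |
| InternVL2-4B | 4.2 | 0.9 | 0.3 | 1.8 |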

🔥 Insight 1: Fast Grounding

Thinking is not needed for GUI grounding.

Inspired by concurrent work on efficient large reasoning models (LRMs), we achieve efficient reasoning through reinforcement fine-tuning (RFT). Training UI-R1-E-3B consists of two steps:

  1. DAST (Difficulty-Adaptive Slow-Thinking): add a difficulty-adaptive length reward that shifts the model's reasoning from slow to fast.
  2. Nothinking: the model no longer outputs a reasoning process.
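
The README does not give the formula for the difficulty-adaptive length reward, so the following is only an illustrative sketch of the idea, not the authors' implementation. It assumes difficulty is estimated from the failure rate of sampled rollouts and that easy samples receive a smaller token budget. The names `estimate_difficulty`, `length_reward`, `max_tokens`, and `weight` are hypothetical.

```python
def estimate_difficulty(num_correct: int, num_samples: int) -> float:
    """Fraction of sampled rollouts that failed (0.0 = easy, 1.0 = hard).

    Assumption: difficulty is approximated by the rollout failure rate.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return 1.0 - num_correct / num_samples


def length_reward(num_tokens: int, difficulty: float,
                  max_tokens: int = 128, weight: float = 0.5) -> float:
    """Penalize reasoning that exceeds a difficulty-dependent token budget.

    Hypothetical rule: the budget scales linearly with difficulty, so easy
    samples are pushed toward short (fast) answers while hard samples may
    still reason at length without penalty.
    """
    budget = difficulty * max_tokens
    overshoot = max(0.0, num_tokens - budget) / max_tokens
    return -weight * min(1.0, overshoot)


# An easy sample (all rollouts correct) is penalized for a 64-token answer,
# while a hard sample (no rollout correct) is not.
easy = estimate_difficulty(num_correct=8, num_samples=8)
hard = estimate_difficulty(num_correct=0, num_samples=8)
print(length_reward(64, easy), length_reward(64, hard))
```

In a setup like this, the length term would be added to the task reward (for example, whether the predicted click is correct), so that shorter outputs are only preferred when accuracy is not sacrificed.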

Note: UI-R1-3B (v2) and UI-R1-E-3B are both trained on a larger dataset (2K grounding samples from GUI-R1-3K) than UI-R1-3B (v1).

Benchmark 1: ScreenSpotV2

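| ScreenSpotV2 | inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg↑ / Len↓ |
|---|---|---|---|---|---|---|---|---|
| OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 / |
| UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 / |
| UI-R1-3B (v1) | w/ thinking | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 / 67 |
| GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 / 80 |
| UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 / 60 |
| UI-R1-E-3B | w/o thinking | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 / 28 |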
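
The Avg↑ / Len↓ column pairs an accuracy with an average answer length. As a rough sketch of how two such numbers could be computed, the snippet below assumes the common grounding convention that a prediction counts as correct when the predicted click point falls inside the ground-truth box, and that length is measured in generated tokens. The `Prediction` class and `summarize` function are hypothetical and are not taken from the repository's evaluation code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Prediction:
    x: float          # predicted click x-coordinate in pixels
    y: float          # predicted click y-coordinate in pixels
    num_tokens: int   # number of tokens the model generated


def hit(pred: Prediction, bbox: Sequence[float]) -> bool:
    """True if the predicted point lies inside [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bbox
    return x1 <= pred.x <= x2 and y1 <= pred.y <= y2


def summarize(preds: List[Prediction],
              bboxes: List[Sequence[float]]) -> Tuple[float, float]:
    """Return (accuracy in percent, mean generated length in tokens)."""
    if not preds or len(preds) != len(bboxes):
        raise ValueError("preds and bboxes must be non-empty and aligned")
    hits = sum(hit(p, b) for p, b in zip(preds, bboxes))
    accuracy = 100.0 * hits / len(preds)
    avg_len = sum(p.num_tokens for p in preds) / len(preds)
    return accuracy, avg_len


preds = [Prediction(1000, 100, 20), Prediction(10, 10, 36)]
bboxes = [[825, 72, 1673, 149], [123, 732, 334, 812]]
print(summarize(preds, bboxes))
```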

Benchmark 2: ScreenSpot-Pro

| ScreenSpot-Pro | inference mode | Average Length↓ | Average Accuracy↑ |
|---|---|---|---|
| UGround-7B | w/o thinking | - | 16.5 |
| OS-ATLAS-7B | w/o thinking | - | 18.9 |
| UI-R1-3B (v1) | w/ thinking | 102 | 17.8 |
| GUI-R1-3B | w/ thinking | 114 | 26.6 |
| UI-R1-3B (v2) | w/ thinking | 129 | 29.8 |
| UI-R1-E-3B | w/o thinking | 28 | 33.5 |

Analysis

  1. UI-R1-E-3B achieves state-of-the-art accuracy among open-source 3B/7B methods while producing the fewest answer tokens, which suggests that GUI grounding does not require explicit reasoning.

Todo

  • Results on 7B models may show the opposite trend.
  • Results on planning tasks may show the opposite trend. The authors' hypothesis is "fast grounding, slow planning".
  • The checkpoints of UI-R1-E-3B will be released soon.
  • The updated paper will come soon.
  • The efficient training code will come soon (in src/script/train_e.sh).

Setup

conda create -n ui-r1 python=3.10
conda activate ui-r1
bash setup.sh

Data

Our mobile training data is a subset of AndroidControl and ScreenSpot.

You can also prepare your own training or inference data in the following layout:

images/:
    image1.png
    image2.png
test.json:
[
    {
        "img_filename": "image1.png",
        "bbox": [825, 72, 1673, 149],
        "instruction": "search bar"
    },
    {
        "img_filename": "image2.png",
        "bbox": [123, 732, 334, 812],
        "instruction": "check weather"
    }
]

where bbox is [x1, y1, x2, y2], the coordinates of the top-left and bottom-right corners of the ground-truth bounding box.
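
Before running inference on your own data, it can help to check that each entry matches the format above. The following is a minimal sketch of such a check. The helper names `validate_entry` and `load_entries` are hypothetical and not part of the repository, and the ordering check assumes x1 < x2 and y1 < y2, which follows from the top-left/bottom-right convention.

```python
import json

# Keys used by the example test.json above.
REQUIRED_KEYS = {"img_filename", "bbox", "instruction"}


def validate_entry(entry: dict) -> None:
    """Raise ValueError if an entry does not follow the expected format."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    bbox = entry["bbox"]
    if len(bbox) != 4:
        raise ValueError("bbox must be [x1, y1, x2, y2]")
    x1, y1, x2, y2 = bbox
    # Top-left corner must come before the bottom-right corner.
    if not (x1 < x2 and y1 < y2):
        raise ValueError(f"bbox corners out of order: {bbox}")


def load_entries(text: str) -> list:
    """Parse the JSON text of a test file and validate every entry."""
    entries = json.loads(text)
    for entry in entries:
        validate_entry(entry)
    return entries


sample = """[
    {"img_filename": "image1.png",
     "bbox": [825, 72, 1673, 149],
     "instruction": "search bar"}
]"""
print(len(load_entries(sample)), "valid entries")
```

To check a real file, pass its contents to the same function, for example `load_entries(open("test.json").read())`.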

Inference

We provide an example here

cd evaluation/
bash test.sh

Fill in MODEL_PATH, IMG_PATH, and TEST_JSON with your actual checkpoint path and data paths.

Training

cd src/script/
bash train.sh
# efficient training
bash train_e.sh

🗞️ News

  • 2025-11-08: Our paper was accepted by AAAI-2026.
  • 2025-05-14: We update the paper with UI-R1-E-3B.
  • 2025-05-12: We release the checkpoints of the UI-R1-E-3B model.
  • 2025-05-12: We fix the bug of scales when batch_size > 1.
  • 2025-05-11: We release the efficient training code of the UI-R1-E-3B model.
  • 2025-04-02: We release the datasets of the UI-R1-3B (v1) model.
  • 2025-03-30: We release the checkpoints of the UI-R1-3B (v1) model.
  • 2025-03-30: We release the UI-R1 repository.
  • 2025-03-27: We release our paper.

⭐️ Citation

If you find this project useful, please consider citing us.

@article{lu2025ui,
  title={UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning},
  author={Lu, Zhengxi and Chai, Yuxiang and Guo, Yaxuan and Yin, Xi and Liu, Liang and Wang, Hao and Xiong, Guanjing and Li, Hongsheng},
  journal={arXiv preprint arXiv:2503.21620},
  year={2025}
}

🤝 Acknowledgements

We sincerely thank the R1-V, Open-R1, Open-r1-multimodal, and VLM-R1 projects for providing their open-source resources.