UGround


This is the official code repository for the project Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents [ICLR'25 Oral], a collaboration between the OSU NLP Group and Orby AI.

Updates

  • 2025/05/01: The bounding-box version of the training data is now available here.

  • 2025/01/23: Our training data for the UGround-V1 series (Initial/Qwen2-VL) has been released. We have also provided a comprehensive evaluation suite packed with useful resources to help researchers test GUI agents and grounding models with ease. Try them out! The performance of Qwen2-VL-based UGround-V1 on several benchmarks has also been updated on the homepage (e.g., AndroidWorld: 33->44).

  • 2025/01/05: Qwen2-VL-based UGround-V1 achieves SOTA results on ScreenSpot-Pro, a new and comprehensive GUI grounding benchmark, substantially outperforming prior models (18.9->31.1). Check the results and our tweet.

  • 2025/01/03: Qwen2-VL-based UGround-V1 has been released (2B, 7B, 72B). Check their performance in Main Results.

  • 2024/10/07: The preprint is available on arXiv. The demo is live. Code coming soon.

  • 2024/08/06: Website is live. The initial manuscript and results are available.


Main Results

GUI Visual Grounding: ScreenSpot (Standard Setting)


| ScreenSpot (Standard) | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|
| InternVL-2-4B | InternVL-2 | | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| Ferret-UI-Llama8b | Ferret-UI | | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| CogAgent | CogAgent | | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OmniParser | | | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
| UGround (Initial) | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Molmo-7B-D | | | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| Molmo-72B | | | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| Claude (Computer-Use) | | | 98.2 | 85.6 | 79.9 | 57.1 | 92.2 | 84.5 | 82.9 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| Project Mariner | | | | | | | | | 84.0 |
| CogAgent-9B-20241220 | GLM-4V-9B | | | | | | | | 85.4 |
| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
| AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 88.4 |
| UGround-V1-72B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 83.4 | 94.9 | 85.7 | 90.4 | 87.9 | 89.4 |

GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Agent-ScreenSpot | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | SeeClick | Qwen-VL | Web, Mobile, ... | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | UGround (Initial) | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | Molmo-72B | | | 94.1 | 79.0 | 92.3 | 70.0 | 88.7 | 67.0 | 81.9 |
| GPT-4o | Molmo-7B-D | | | 93.4 | 80.8 | 91.2 | 72.9 | 88.7 | 69.4 | 82.7 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
| GPT-4o | UGround-V1-72B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.5 | 79.9 | 93.8 | 75.0 | 88.7 | 75.2 | 84.5 |

Inference of Qwen2-VL-Based UGround

Python Environment (following Qwen2-VL's official repo)

#inference
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install accelerate
pip install qwen-vl-utils
pip install 'vllm==0.6.1' 

vLLM server

vllm serve osunlp/UGround-V1-7B  --api-key token-abc123 --dtype float16

or

python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16 

You can find more instructions about training and inference in Qwen2-VL's official repo.

Here we use float16 instead of bfloat16 for more stable decoding (see details in vLLM's documentation).
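
Once the server is running, requests go through the OpenAI-compatible API that vLLM exposes. Below is a minimal sketch of building an async client for it, which is what the client object in the snippet under Visual Grounding Prompt refers to. The base URL assumes vLLM's default port 8000 (adjust if you pass --port), and the API key must match the --api-key used above.

from openai import AsyncOpenAI

# Minimal sketch (not from the official repo): point the OpenAI client at the local
# vLLM server. Port 8000 is vLLM's default; the api_key must match --api-key above.
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)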

Visual Grounding Prompt

def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
  - Your answer should be a single string (x, y) corresponding to the point of the interest.

  Description: {description}

  Answer:"""
                },
            ],
        },
    ]


messages = format_openai_template(description, base64_image)

completion = await client.chat.completions.create(
    model=args.model_path,  # the served model name, e.g., "osunlp/UGround-V1-7B"
    messages=messages,
    temperature=0,  # REMEMBER to set temperature to ZERO!
)

# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
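
For completeness, here is a small sketch of the surrounding glue: encoding a screenshot as base64 for the data URL and converting the model's "(x, y)" answer back to pixel coordinates using the scaling above. The helper names encode_image and parse_coordinates, the example file name, and the use of ast.literal_eval to parse the tuple are illustrative assumptions, not part of the official pipeline.

import ast
import base64

def encode_image(image_path: str) -> str:
    # Read a screenshot from disk and return its base64 string for the data URL above.
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def parse_coordinates(answer: str, width: int, height: int) -> tuple[float, float]:
    # The model answers with a string like "(492, 873)" on a [0, 1000) scale.
    x, y = ast.literal_eval(answer.strip())
    # Rescale to pixel coordinates of the original screenshot.
    return x / 1000 * width, y / 1000 * height

# Hypothetical usage:
# base64_image = encode_image("screenshot.jpg")
# ... send the request as shown above ...
# x_px, y_px = parse_coordinates(completion.choices[0].message.content, 1920, 1080)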


Citation Information

If you find this work useful, please consider starring our repo and citing our papers:

@inproceedings{gou2025uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for {GUI} Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=kxnoqaisCT}
}

@inproceedings{zheng2024seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=piecKJ2DlB},
}