Visual Test-time Scaling for GUI Agent Grounding

December 7, 2025 · View on GitHub

Tiange Luo, Lajanugen Logeswaran†, Justin Johnson†, Honglak Lee†

We release our ScreenSpot-Pro code for both UI-TARS and Qwen2.5-VL. All hyperparameters and prompts are not carefully tuned. Due to company policy, the release of the WebVoyager-related code is no longer permitted.

ScreenSpot-Pro

Please first download the data from ScreenSpot-Pro Hugging Face and put images and annotations folders under the same directory with code. Then, launch inference servers for different models (commands are listed below; the model names and ports have already been mapped inside the code). Finally, run bash run_ss_pro_xxx.sh.

You can use summarize_results.py to output ScreenSpot-Pro results categorically, following the order presented in our Table 1. One Example:

python summarize_results.py results/qwen25vl_RegionFocus.json results/uitars_RegionFocus.json

# output: 
# results/qwen25vl_RegionFocus.json 76.0 & 26.2 & 51.8 & 75.8 & 30.8 & 56.9 & 72.1 & 28.1 & 61.3 & 86.8 & 37.3 & 65.4 & 86.4 & 60.4 & 80.4 & 74.8 & 38.2 & 58.2 & 78.5 & 34.3 & 61.6 1581
# results/uitars_RegionFocus.json ...

You can turn on --debug inside eval_screenspot_pro_RegionFocus.py to save intermediate RegionFocus step images, such as image-as-map stars for judgment, zoom-ins, and projecting zoomed-in predictions back onto the original input.

Command for launching UI-TARS-72B & -7B

Please first set up your HUGGINGFACE_PATH and HF_TOKEN in the below commands.

HUGGINGFACE_PATH='the local directory to cache Hugging Face models'
HF_TOKEN='your_HF_token'
docker run --runtime nvidia --gpus '"device=0,1,2,3,4,5,6,7"' --ipc=host \
  -p 8100:8100 \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  --env "TORCH_USE_CUDA_DSA=1" \
  --env "CUDA_LAUNCH_BLOCKING=1" \
  -v $HUGGINGFACE_PATH:/root/.cache/huggingface \
  vllm/vllm-openai:v0.6.6 \
  --max-model-len 16384 \
  --max-num-seqs 256 \
  --gpu_memory_utilization 0.9 \
  --model bytedance-research/UI-TARS-72B-DPO \
  --tensor-parallel-size 8 \
  --enforce-eager \
  --limit-mm-per-prompt image=5 \
  --port 8100

HUGGINGFACE_PATH='the local directory to cache Hugging Face models'
HF_TOKEN='your_HF_token'
docker run --runtime nvidia --gpus '"device=0,1,2,3"' --ipc=host \
  -p 8200:8200 \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  --env "TORCH_USE_CUDA_DSA=1" \
  -v $HUGGINGFACE_PATH:/root/.cache/huggingface \
  vllm/vllm-openai:v0.6.6 \
  --max-model-len 16384 \
  --max-num-seqs 2048 \
  --gpu_memory_utilization 0.9 \
  --model bytedance-research/UI-TARS-7B-DPO \
  --tensor-parallel-size 4 \
  --limit-mm-per-prompt image=5 \
  --dtype bfloat16 \
  --port 8200

Command for launching Qwen2.5-VL-72B & -7B

Please first install https://github.com/QwenLM/Qwen-Agent.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
vllm serve Qwen/Qwen2.5-VL-72B-Instruct --port 8300  --dtype bfloat16   --limit-mm-per-prompt '{"images": 5}'   --tensor-parallel-size 8

export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-7B-Instruct   --port 8400   --dtype bfloat16   --limit-mm-per-prompt '{"images": 5}'   --tensor-parallel-size 4

Citation Information

If you find our code or paper useful, please consider citing:

@article{luo2025visual,
      title={Visual Test-time Scaling for GUI Agent Grounding},
      author={Luo, Tiange and Logeswaran, Lajanugen and Johnson, Justin and Lee, Honglak},
      journal={arXiv preprint arXiv:2505.00684},
      year={2025},
}