Evaluation Findings: Why OmniParser Outperforms UI-TARS

January 15, 2026

January 2025

Executive Summary

On our synthetic evaluation, the model ranking reported in the literature is reversed:

Model	Literature (ScreenSpot-Pro)	Our Evaluation (Synthetic)
UI-TARS	61.6%	36.1%
OmniParser	39.6%	97.4%
Winner	UI-TARS (+22.0 points)	OmniParser (+61.3 points)

Key finding: The task matters more than the model. OmniParser's detection-based approach dominates on our evaluation, while UI-TARS excels at complex instruction-following in professional applications.

Why Literature Benchmarks Predicted UI-TARS Would Win

The literature review (see docs/literature_review.md) identified UI-TARS 1.5 as SOTA:

ScreenSpot-Pro: 61.6% (vs OmniParser's 39.6%)
OSWorld: 42.5% (vs Claude 3.7's 28.0%)
AndroidWorld: 64.2%

These benchmarks led to the hypothesis that UI-TARS would outperform OmniParser.

Why Our Results Differ

1. Different Task Types

Aspect	ScreenSpot-Pro	Our Synthetic Evaluation
Task	Natural language instruction → click	Ground truth bbox → verify detection
Example input	"Click the Save button in the File menu"	Element at (0.15, 0.23) with text "Submit"
Required reasoning	Parse instruction, locate hierarchically	Simple matching/detection

UI-TARS is optimized for parsing complex instructions ("Click the third item in the dropdown menu") and multi-step reasoning. This capability is wasted when the target is already precisely specified.

OmniParser simply detects all UI elements on screen and matches them against the specified target. For well-defined targets, this direct approach wins.
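The "ground truth bbox → verify detection" task can be sketched as follows. This is a minimal illustration, not the evaluation harness itself: the `Box` type, the IoU criterion, and the 0.5 threshold are all assumptions, since the matching rule used in our evaluation is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized [0, 1] screen coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Clamp to zero so inverted (non-overlapping) boxes have no area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                min(a.x2, b.x2), min(a.y2, b.y2)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def is_detected(target: Box, detections: list[Box], threshold: float = 0.5) -> bool:
    """True if any detected box overlaps the ground-truth target enough.

    The 0.5 IoU threshold is an assumed value for illustration.
    """
    return any(iou(target, d) >= threshold for d in detections)


def detection_rate(samples: list[tuple[Box, list[Box]]], threshold: float = 0.5) -> float:
    """Fraction of ground-truth targets matched by at least one detection."""
    if not samples:
        return 0.0
    hits = sum(is_detected(target, dets, threshold) for target, dets in samples)
    return hits / len(samples)
```

Under this framing no instruction parsing is involved: a sample counts as a hit as soon as one detected element overlaps the known target, which is why a pure detector has the advantage.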

2. Element Characteristics

Characteristic	ScreenSpot-Pro	Our Synthetic
Avg element size	0.07% of screen	~1-5% of screen
Element density	High (professional apps)	Moderate
Ambiguity	High (many similar buttons)	Low (distinct elements)
Resolution	High-res professional software	Standard 1920x1080

ScreenSpot-Pro tests tiny elements in professional software (CAD, video editing, IDEs) where targets are often just 20x20 pixels. Our synthetic data has larger, clearer targets where detection is easier.
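To put the two size regimes on the same scale, a small helper can convert pixel dimensions into a share of the screen. The function name is hypothetical, and using a 1920x1080 screen for both cases is an assumption for illustration only; ScreenSpot-Pro screenshots are typically higher resolution, which makes a 20x20 target an even smaller share.

```python
def area_fraction(width_px: int, height_px: int, screen_w: int, screen_h: int) -> float:
    """Share of the screen covered by an element, as a percentage."""
    return 100.0 * (width_px * height_px) / (screen_w * screen_h)


# A 20x20 target on a 1920x1080 screen.
tiny = area_fraction(20, 20, 1920, 1080)
print(f"{tiny:.3f}%")  # 0.019%

# A target at the low end of our synthetic range (1% of screen)
# corresponds to roughly 144x144 pixels at the same resolution.
synthetic = area_fraction(144, 144, 1920, 1080)
print(f"{synthetic:.3f}%")  # 1.000%
```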

3. Instruction Complexity

ScreenSpot-Pro instructions require reasoning:

"Click the brush tool in the toolbar" (must identify toolbar region, then brush icon)
"Select the layer named 'Background'" (must find Layers panel, scroll if needed)

Our evaluation uses direct descriptions:

"Click the 'Submit' button" (single element lookup)
"Click the search icon" (straightforward matching)

UI-TARS's "System-2 reasoning" capability provides no benefit for direct lookups.

Analysis: When Each Method Excels

OmniParser Strengths

Fast detection (724 ms vs 2724 ms for UI-TARS)
High recall on standard UI elements
Consistent - detection-based approach has predictable behavior
Good for automation - works well when element characteristics are known

UI-TARS Strengths

Complex instructions - can parse "the third blue button from the left"
Hierarchical navigation - understands "in the File menu, under Export"
Ambiguity resolution - better at choosing among similar elements
Professional apps - trained on complex software interfaces

When UI-TARS Would Win

Our evaluation would favor UI-TARS if we:

Used ambiguous instructions ("click the settings icon" with multiple gear icons)
Required hierarchical reasoning ("the close button in the modal dialog")
Tested on professional software screenshots with tiny elements
Evaluated instruction-following accuracy rather than element detection

Implications for openadapt-grounding

Recommendation: Use OmniParser for Recording Playback

For the core use case of replaying recorded actions:

Click coordinates are known precisely
Elements have been identified during recording
Speed matters for responsive automation
OmniParser's 97%+ detection rate is sufficient

Consider UI-TARS for:

Natural language automation ("Click the submit button")
Handling ambiguous targets
Professional software with complex UIs
Cases where OmniParser fails on tiny icons

Ensemble Strategy (Low Value)

Our error analysis found minimal complementarity:

UI-TARS found only 1 unique element that OmniParser missed
Ensemble potential: 99.6% (+0.3 points over OmniParser alone at 99.3% with cropping)
Not worth the roughly 4x latency cost of also running UI-TARS
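The latency cost can be checked against the timings reported above (724 ms for OmniParser, 2724 ms for UI-TARS). The sketch below assumes those figures are per-call latencies and that an ensemble runs both models on every input; the function name is hypothetical.

```python
OMNIPARSER_MS = 724   # reported OmniParser latency
UI_TARS_MS = 2724     # reported UI-TARS latency


def ensemble_latency_ms(a_ms: int, b_ms: int, parallel: bool = False) -> int:
    """Latency of running two models: summed if sequential, max if parallel."""
    return max(a_ms, b_ms) if parallel else a_ms + b_ms


sequential = ensemble_latency_ms(OMNIPARSER_MS, UI_TARS_MS)
parallel = ensemble_latency_ms(OMNIPARSER_MS, UI_TARS_MS, parallel=True)

print(f"sequential: {sequential} ms, {sequential / OMNIPARSER_MS:.2f}x")  # 3448 ms, 4.76x
print(f"parallel:   {parallel} ms, {parallel / OMNIPARSER_MS:.2f}x")      # 2724 ms, 3.76x
```

Whether the models run one after the other or side by side, the ensemble costs between about 3.8x and 4.8x the OmniParser-only latency for a 0.3 point gain.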

Cropping Strategy Effectiveness

The literature predicted cropping would help significantly (ScreenSeekeR: +254% improvement).

Our results:

Method	Baseline	+ Cropping	Relative improvement
UI-TARS	36.1%	70.6%	+95%
OmniParser	97.4%	99.3%	+2%

Cropping nearly doubles UI-TARS accuracy, which is consistent with the ScreenSeekeR findings, but gives OmniParser only a marginal gain because it is already near ceiling on our data.
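The mechanics of cropping can be sketched as below: cut a window around an approximate location, run the grounding model on that window, then map the predicted point back to full-screen coordinates. How the crop region is chosen in our evaluation is not described here, so the sketch assumes an approximate center is already known (for example, a recorded click). The `Crop` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """Crop window in full-screen pixel coordinates (hypothetical type)."""
    left: int
    top: int
    width: int
    height: int


def crop_around(cx: int, cy: int, crop_w: int, crop_h: int,
                screen_w: int, screen_h: int) -> Crop:
    """Crop window centered on (cx, cy), shifted to stay inside the screen."""
    w = min(crop_w, screen_w)
    h = min(crop_h, screen_h)
    left = min(max(cx - w // 2, 0), screen_w - w)
    top = min(max(cy - h // 2, 0), screen_h - h)
    return Crop(left, top, w, h)


def to_screen(crop: Crop, x_in_crop: int, y_in_crop: int) -> tuple[int, int]:
    """Map a point predicted inside the crop back to full-screen coordinates."""
    return crop.left + x_in_crop, crop.top + y_in_crop


# A target near the bottom-right corner of a 1920x1080 screen:
# the window is clamped to the screen edge rather than centered.
window = crop_around(1900, 1000, 640, 480, 1920, 1080)
print(window)                       # Crop(left=1280, top=600, width=640, height=480)
print(to_screen(window, 620, 400))  # (1900, 1000)
```

Inside a 640x480 window, a target occupies about nine times the share of the image that it does on the full 1920x1080 screen, which is one plausible reason cropping helps a model that struggles with small elements.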

Key Takeaways

Benchmark selection matters. ScreenSpot-Pro measures instruction-following on professional apps. Our synthetic benchmark measures element detection on standard UIs. Different tasks favor different approaches.
Simpler is often better. For well-defined targets, detection (OmniParser) beats reasoning (UI-TARS).
Know your use case. Recording playback = OmniParser. Natural language automation = consider UI-TARS.
Cropping remains valuable. Both methods benefit from cropping, but the gain is large for UI-TARS (36.1% → 70.6%) and marginal for OmniParser (97.4% → 99.3%).

Future Work

Evaluate on real recordings from openadapt to measure production performance
Test on ScreenSpot-Pro to validate literature benchmarks
Hybrid approach - use OmniParser for detection, fall back to UI-TARS for failures
Fine-tune for small elements - the gap is largest on small targets
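The hybrid approach listed above could take roughly the following shape. This is a sketch of the control flow only: the `Locator` signature is an assumption, and the two callables stand in for OmniParser and UI-TARS wrappers that are not defined here.

```python
from typing import Callable, Optional

Point = tuple[float, float]
# Hypothetical interface: (screenshot bytes, target description) -> point or None.
Locator = Callable[[bytes, str], Optional[Point]]


def locate_with_fallback(screenshot: bytes, target: str,
                         primary: Locator, fallback: Locator
                         ) -> tuple[Optional[Point], str]:
    """Try the fast detector first; call the slower model only on a miss.

    Returns the located point (or None) and which stage produced it.
    """
    point = primary(screenshot, target)
    if point is not None:
        return point, "primary"
    point = fallback(screenshot, target)
    if point is not None:
        return point, "fallback"
    return None, "none"
```

Unlike a full ensemble, this pays the UI-TARS latency only on the small share of inputs where OmniParser finds nothing. It does not help when OmniParser returns a wrong element with confidence, so a real implementation would likely need a confidence check as well.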