Research Thesis: Demo-Conditioned Action Selection for GUI Agents
January 4, 2026
The Problem
Zero-shot VLMs fail on GUI tasks not for lack of capability but because UI affordances are ambiguous. Given a screenshot and an instruction, frontier models exhibit systematic spatial biases (e.g., clicking the right side of the menu bar instead of the left). The observation-to-action mapping is learnable; the model simply does not know which element to click first.
The Hypothesis
Demo-conditioning resolves this ambiguity by providing procedural priors. A single relevant demonstration—showing the correct navigation path—dramatically improves episode success. The demo acts as an entropy-reducing signal over the action space, not as additional reasoning capacity.
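As a minimal sketch of what "demo-conditioning" could look like at the prompt level, the snippet below builds the text portion of a prompt with and without a demonstration. The `DemoStep` schema, the section labels, and the function names are hypothetical illustrations, not the format used in the experiments; the only property that matters is that the two conditions differ solely by the prepended demo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DemoStep:
    """One step of a recorded demonstration (hypothetical schema)."""
    observation: str  # short text description of the screen at this step
    action: str       # action taken at this step, written as CLICK(x, y)


def format_demo(task: str, steps: List[DemoStep]) -> str:
    """Render a demonstration as plain text to prepend to the prompt."""
    lines = [f"DEMONSTRATION (task: {task})"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"Step {i}: saw {step.observation} -> did {step.action}")
    return "\n".join(lines)


def build_prompt(instruction: str, demo_text: Optional[str] = None) -> str:
    """Build the text part of the prompt; the screenshot is attached separately.

    With demo_text=None this is the zero-shot condition. Passing a rendered
    demo gives the demo-conditioned condition; nothing else changes.
    """
    parts = []
    if demo_text is not None:
        parts.append(demo_text)
    parts.append(f"INSTRUCTION: {instruction}")
    parts.append("Respond with the next action, e.g. CLICK(x, y).")
    return "\n\n".join(parts)
```

Because the demo is only prepended, the demo-conditioned prompt ends with exactly the zero-shot prompt, which keeps the comparison between conditions attributable to the demo alone.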
Why This Sequencing
| Phase | Purpose |
|---|---|
| 1. Zero-shot baseline | Calibrate evaluation harness. Establish that failures are real, not artifacts. |
| 2. Demo-conditioned | The punchline. Prompt-level upper bound. Same model, same eval, no training. |
| 3. Fine-tuning | Distillation—only after Phase 2 proves the signal exists. |
Why prompt-level results must precede fine-tuning:
- Fine-tuning collapses representation, prompt engineering, retrieval strategy, and model capacity into one opaque blob
- Gains become non-attributable—you cannot isolate what drove improvement
- Demo-conditioning isolates the causal factor: trajectory priors
The Minimum Shocking Artifact
First-action accuracy on macOS System Settings tasks (n=45, Claude Sonnet 4.5):
| Condition | Accuracy | Delta |
|---|---|---|
| Zero-shot | 46.7% | — |
| Demo-conditioned | 100% | +53.3 pp |
| Length-matched control | 57.8% | +11.1 pp |
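
The deltas in the table can be rechecked with a few lines of arithmetic. The sketch below infers whole-number success counts from the reported percentages and n=45 (the counts themselves are an inference, not reported figures) and adds a Wilson score interval as one reasonable way to express uncertainty at this sample size; the interval is an addition for illustration, not part of the original analysis.

```python
import math

N = 45  # tasks per condition, from the results above

# Reported first-action accuracies, in percent.
reported = {"zero_shot": 46.7, "demo_conditioned": 100.0, "length_matched": 57.8}


def implied_count(pct: float, n: int = N) -> int:
    """Nearest whole number of correct tasks consistent with a percentage."""
    return round(pct / 100 * n)


def wilson_interval(k: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts are inferred from the percentages, not taken from the source.
counts = {name: implied_count(pct) for name, pct in reported.items()}
baseline = counts["zero_shot"] / N
for name, k in counts.items():
    lo, hi = wilson_interval(k, N)
    delta_pp = (k / N - baseline) * 100
    print(f"{name}: {k}/{N} = {k / N:.1%}, delta {delta_pp:+.1f} pp, "
          f"95% CI [{lo:.1%}, {hi:.1%}]")
```

The inferred counts (21, 45, and 26 out of 45) reproduce the reported percentages and the +53.3 pp and +11.1 pp deltas.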
Same model. Same prompt structure. Same evaluation harness. No fine-tuning.
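The length-matched control shows that prompt verbosity alone does not explain the gain: it recovers only 11.1 pp of the 53.3 pp improvement, so most of the benefit is semantic rather than a token-length effect.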
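How the control prompt was actually constructed is not described here. One plausible construction, sketched below under that caveat, replaces the demo with task-irrelevant filler of the same length. The whitespace word count is a crude stand-in for a real tokenizer, and the function names are hypothetical.

```python
from itertools import cycle
from typing import List


def count_tokens(text: str) -> int:
    """Crude token proxy: whitespace-separated words (an assumption)."""
    return len(text.split())


def length_matched_control(demo_text: str, filler_sentences: List[str]) -> str:
    """Return task-irrelevant text with the same word count as demo_text."""
    if not any(s.split() for s in filler_sentences):
        raise ValueError("need at least one non-empty filler sentence")
    target = count_tokens(demo_text)
    words: List[str] = []
    # Cycle through the filler until there are enough words, then truncate.
    for sentence in cycle(filler_sentences):
        if len(words) >= target:
            break
        words.extend(sentence.split())
    return " ".join(words[:target])
```

The filler should carry no information about the target UI; otherwise the control would leak part of the demo's signal and understate the semantic effect.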
What This Proves
- Action space is executable. The model can produce valid CLICK(x, y) actions that hit the correct target.
- Observation-to-action mapping is learnable. Given the right context, first-action accuracy reaches 100%.
- Failure mode was ambiguity, not capacity. Zero-shot errors show a consistent spatial bias (clicking the right side of the menu bar). With a demo, the model consistently identifies the correct entry point.
- Retrieval is a control knob; fine-tuning is a sledgehammer. Before investing in training infrastructure, demonstrate that prompt-level conditioning suffices for the target task distribution.
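To make "hits the correct target" concrete, first-action accuracy can be scored by parsing the model's `CLICK(x, y)` output and testing whether the point falls inside the target element's bounding box. The sketch below assumes the click and the box share one coordinate system (pixels or normalized, which is not specified here); the regex and function names are illustrative rather than the evaluation harness's actual code.

```python
import re
from typing import Optional, Tuple

# Matches CLICK(x, y) with integer or decimal coordinates.
CLICK_RE = re.compile(r"CLICK\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_click(model_output: str) -> Optional[Tuple[float, float]]:
    """Extract the first CLICK(x, y) from model output, or None if absent."""
    match = CLICK_RE.search(model_output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def first_action_correct(model_output: str,
                         target_box: Tuple[float, float, float, float]) -> bool:
    """True if the parsed click lands inside target_box.

    target_box is (left, top, right, bottom) in the same coordinate
    system as the click; edges count as inside.
    """
    click = parse_click(model_output)
    if click is None:
        return False  # no parseable action counts as a miss
    x, y = click
    left, top, right, bottom = target_box
    return left <= x <= right and top <= y <= bottom
```

Treating an unparseable response as a miss keeps the metric conservative: a model that fails to emit a valid action is scored the same as one that clicks the wrong element.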
Implications for Benchmarks
This framing applies directly to standard benchmarks:
- OSWorld / WAA: Desktop automation with complex navigation paths
- WebArena / VisualWebArena: Web tasks requiring procedural knowledge
- Mind2Web / TTI: Multi-step web navigation with branching decisions
The prediction: benchmarks showing low zero-shot success will exhibit large gains from demo-conditioning on the subset of tasks where the failure mode is "wrong first action" rather than "wrong goal understanding."
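One way to test this prediction is to label each zero-shot failure with a failure mode and then measure, per mode, how many of those tasks succeed once a demo is supplied. The sketch below assumes hand-assigned labels and a hypothetical `FailedTask` record; it is an analysis outline, not an existing tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class FailedTask:
    """A task that failed zero-shot (hypothetical schema)."""
    task_id: str
    failure_mode: str   # hand-assigned label, e.g. "wrong_first_action"
    demo_success: bool  # did the same task succeed with a demo?


def recovery_rate_by_failure_mode(failed: Iterable[FailedTask]) -> Dict[str, float]:
    """Fraction of zero-shot failures that succeed with a demo, per failure mode."""
    counts = defaultdict(lambda: [0, 0])  # mode -> [total, recovered]
    for task in failed:
        counts[task.failure_mode][0] += 1
        counts[task.failure_mode][1] += int(task.demo_success)
    return {mode: recovered / total for mode, (total, recovered) in counts.items()}
```

If the prediction holds, the recovery rate for "wrong first action" failures should be clearly higher than for "wrong goal understanding" failures.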
Next Steps
- WAA Baseline: Zero-shot evaluation on Windows Agent Arena (154 tasks)
- Demo Retrieval: Given a new task, retrieve the most relevant demo from a library
- Episode Success: Extend from first-action to full trajectory completion
- Fine-tuning (Phase 3): Distill demo-conditioned behavior into model weights—only after Phases 1-2 establish the signal
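The demo retrieval step can be prototyped before any embedding model is chosen. The sketch below ranks demos by Jaccard overlap between word sets of the new task and each demo's task description. This lexical measure is a deliberately simple stand-in (a real system would more likely use embeddings), and the library layout and function names are assumptions.

```python
import re
from typing import Dict, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word set; a stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_demo(task: str, library: Dict[str, str]) -> str:
    """Return the id of the demo whose task description best overlaps the new task.

    library maps demo id -> task description of the recorded demo.
    """
    if not library:
        raise ValueError("demo library is empty")
    query = tokenize(task)
    return max(library, key=lambda demo_id: jaccard(query, tokenize(library[demo_id])))
```

Word overlap is easily misled by shared function words such as "turn" or "off", so retrieval quality should be evaluated separately before it is combined with demo-conditioning results.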
December 2025 | OpenAdapt