OVOS GUI

March 5, 2026

Philosophy

A template defines what kind of information is being presented, not how it looks. The display layer (Qt/web/terminal/etc.) owns all rendering decisions. Skills own only the semantic data they provide.

Voice-first constraint

OVOS is a voice-first platform. The GUI is a companion to speech, not a replacement for it.

Touch is a shortcut, never the only path. Every interaction a user can perform by touch must also be completable by voice.
Some clients are display-only (no touch, no keyboard). Templates must never assume input capability.
Skills must never block on a GUI event. If a skill asks a question, it listens for the spoken answer simultaneously. A GUI touch shortcut just fires the same bus message the spoken answer would.
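As a rough illustration of the last rule, the sketch below routes a spoken answer and a touch shortcut through the same message. `MiniBus` is a tiny in-memory stand-in written for this example, not the real OVOS message bus, and the message type `skill.question.answer` is a hypothetical name.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MiniBus:
    """Tiny in-memory stand-in for a message bus (not the real OVOS bus)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def on(self, msg_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[msg_type].append(handler)

    def emit(self, msg_type: str, data: dict) -> None:
        for handler in list(self._handlers[msg_type]):
            handler(data)


ANSWER_EVENT = "skill.question.answer"  # hypothetical message type

answers: List[str] = []
bus = MiniBus()
# The skill registers one handler; it never waits on the GUI specifically.
bus.on(ANSWER_EVENT, lambda data: answers.append(data["utterance"]))


def on_speech_recognized(utterance: str) -> None:
    """Path taken when the user answers by voice."""
    bus.emit(ANSWER_EVENT, {"utterance": utterance})


def on_touch_shortcut(option: str) -> None:
    """Path taken when the user taps a button: same message, same payload."""
    bus.emit(ANSWER_EVENT, {"utterance": option})


on_speech_recognized("yes")
on_touch_shortcut("yes")
print(answers)
```

Because both paths emit an identical message, the skill cannot tell them apart and a display-only client loses nothing.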

A template should be added when:

The semantic structure is meaningfully distinct from existing templates, and one of the following holds:
It is needed by more than one unrelated skill (no single-skill templates)
It serves an essential skill (weather, clock)

A template should be rejected when:

It is a visual variation of an existing template (that is the display layer's job)
It can be composed from existing templates in sequence
Its data model is a strict subset of a broader template
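The two checklists can be read as a single decision function. The sketch below assumes the acceptance criteria mean "semantically distinct, and either needed by several unrelated skills or serving an essential skill"; the `Proposal` class and its field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical summary of a proposed template, one flag per criterion."""
    semantically_distinct: bool
    unrelated_skills_needing_it: int
    serves_essential_skill: bool = False
    is_visual_variation: bool = False
    composable_from_existing: bool = False
    is_strict_subset: bool = False


def should_add(p: Proposal) -> bool:
    """Apply the rejection rules first, then the acceptance rules."""
    rejected = (
        p.is_visual_variation
        or p.composable_from_existing
        or p.is_strict_subset
    )
    needed = p.unrelated_skills_needing_it > 1 or p.serves_essential_skill
    return p.semantically_distinct and needed and not rejected


# Illustrative inputs only; the counts are made up.
timer = Proposal(semantically_distinct=True, unrelated_skills_needing_it=3)
slideshow = Proposal(True, 3, composable_from_existing=True)
single_skill = Proposal(True, 1)
weather = Proposal(True, 1, serves_essential_skill=True)

print(should_add(timer), should_add(slideshow),
      should_add(single_skill), should_add(weather))
```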

Proposed Template Set

System group — managed by ovos-gui, not skills

Template	Purpose	Key session data
`SYSTEM_idle`	Resting / ambient screen	(none — display layer decides)
`SYSTEM_loading`	Indeterminate progress	`label: str`
`SYSTEM_status`	Terminal success / failure	`success: bool`, `label: str`
`SYSTEM_error`	Error with optional detail	`label: str`, `detail: str?`

SYSTEM_idle is reserved — skills must not display it directly.

Content group — read-only information

Template	Purpose	Key session data
`SYSTEM_text`	Long-form text, auto-paginated	`text: str`, `title: str?`
`SYSTEM_image`	Static image	`image: url`, `title: str?`, `caption: str?`, `fill: FillMode?`, `background_color: str?`
`SYSTEM_animated_image`	Animated image (GIF / WebP)	same as `SYSTEM_image`
`SYSTEM_list`	Scrollable list of labelled items	`items: List[{title, subtitle?, image?}]`, `title: str?`
`SYSTEM_grid`	2-D tile grid of image-primary items	`items: List[{image, title?}]`, `title: str?`
`SYSTEM_table`	Columnar data table with named headers	`columns: List[str]`, `rows: List[List[Any]]`, `title: str?`
`SYSTEM_html`	Rendered HTML string	`html: str`, `resource_url: str?`
`SYSTEM_url`	Full web page	`url: str`

Why three distinct collection templates?

The data models are semantically different, not just visually different:

	Primary axis	Item structure	Layout authority
`list`	Reading order	Hierarchical — title + subtitle + thumbnail	Fixed: single column
`grid`	Visual equality	Image-primary — image required, title optional	Display layer (adapts to screen)
`table`	Column identity	Relational — named columns, typed rows	Fixed: column × row

Merging them would force every display layer to infer intent from data shape, which is fragile. Keeping them separate makes the skill's intent explicit.
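To make the difference concrete, here is one possible way to model the three payloads as Python dataclasses. The class names are assumptions, and the row-length check in `TableData` is an extra guard added for this sketch rather than something the template set specifies.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ListItem:
    title: str                      # title is the required field
    subtitle: Optional[str] = None
    image: Optional[str] = None


@dataclass
class GridItem:
    image: str                      # image is the required field
    title: Optional[str] = None


@dataclass
class ListData:
    items: List[ListItem]
    title: Optional[str] = None


@dataclass
class GridData:
    items: List[GridItem]
    title: Optional[str] = None


@dataclass
class TableData:
    columns: List[str]
    rows: List[List[Any]]
    title: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed invariant: every row has one cell per named column.
        for row in self.rows:
            if len(row) != len(self.columns):
                raise ValueError("row length does not match column count")


forecast = TableData(
    columns=["Day", "High", "Low"],
    rows=[["Mon", 21, 12], ["Tue", 19, 11]],
    title="Forecast",
)
print(len(forecast.rows), forecast.columns)
```

The required field differs per type (`title` for a list item, `image` for a grid item, `columns` for a table), so the skill's intent is carried by the type it picks rather than inferred from the data.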

Media group — time-based playback

Template	Purpose	Key session data
`SYSTEM_audio_player`	Now-playing card (audio)	`title: str`, `artist: str?`, `album: str?`, `image: url?`, `position: float`, `duration: float`, `playing: bool`
`SYSTEM_video_player`	In-GUI video playback	`uri: str`, `title: str?`, `playing: bool`

Why separate audio and video? Audio playback shows a rich metadata card while the actual audio plays through the sound system — there is no video stream involved. Video playback is a raw media surface. Conflating them forces every display layer to handle both cases in one template.

Utility group — common single-purpose views

Template	Purpose	Key session data
`SYSTEM_clock`	Current time display	(self-updating, no data needed)
`SYSTEM_timer`	Countdown / count-up display	`duration: int` (seconds), `label: str?`
`SYSTEM_weather`	Weather summary card	`current_temp`, `min_temp`, `max_temp`, `condition: str`, `icon: url?`, `location: str?`
`SYSTEM_map`	Geographic location	`latitude: float`, `longitude: float`, `zoom: int?`, `label: str?`

Why SYSTEM_timer? Timers are pervasive (cooking, pomodoro, alarms) and have a distinct real-time countdown UI that cannot be expressed by SYSTEM_text without the skill manually updating the displayed string every second.
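A minimal sketch of how a display layer might derive the countdown locally from `duration`, so the skill sends its data once. The `shown_at` timestamp and both helper names are assumptions for this example; the template only defines `duration` and `label`.

```python
def remaining_seconds(duration: int, shown_at: float, now: float) -> int:
    """Seconds left on a countdown that started at `shown_at`, never negative."""
    elapsed = max(0.0, now - shown_at)
    return max(0, duration - int(elapsed))


def format_mmss(seconds: int) -> str:
    """Render a second count as MM:SS for display."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


# A 5-minute timer, 42.5 seconds after it was first shown.
left = remaining_seconds(duration=300, shown_at=1000.0, now=1042.5)
print(format_mmss(left))  # 04:18
```

The display layer can call these on every repaint with the current clock value; no bus traffic is needed between ticks.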

Why SYSTEM_map? Navigation, weather location, business lookup — all need a spatial view. A URL-embedded map is fragile (requires network, leaks provider choice).

Dialogue group — visual accompaniment to an active voice dialogue

OVOS is voice-first. The GUI never drives an interaction on its own. These templates display what is currently being asked through speech so the user can follow along. On capable devices a touch shortcut may be offered as a convenience, but:

Voice is always the primary (and on display-only devices, the only) path.
A skill must never block waiting for a GUI event — it must always be listening for the spoken answer in parallel.
The display layer decides whether to render touch targets at all.

Template	Purpose	Key session data
`SYSTEM_confirm`	Shows the yes/no question being spoken	`question: str`
`SYSTEM_select`	Shows the list of options being spoken	`prompt: str?`, `items: List[{label, value}]`
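The sketch below shows how a skill might map a spoken answer onto the `items` it put on screen for `SYSTEM_select`. The matching is deliberately naive (case-insensitive exact match on `label`), and `resolve_selection` and the sample values are hypothetical; a real skill would likely use fuzzier matching from the speech layer.

```python
from typing import List, Optional, TypedDict


class SelectItem(TypedDict):
    label: str
    value: str


def resolve_selection(utterance: str, items: List[SelectItem]) -> Optional[str]:
    """Return the value whose label matches the utterance, or None."""
    spoken = utterance.strip().lower()
    for item in items:
        if item["label"].lower() == spoken:
            return item["value"]
    return None


genres: List[SelectItem] = [
    {"label": "Jazz", "value": "genre.jazz"},
    {"label": "Rock", "value": "genre.rock"},
]

session_data = {"prompt": "Which genre?", "items": genres}

print(resolve_selection(" jazz ", genres))
print(resolve_selection("polka", genres))
```

A touch shortcut would simply report the tapped item's `value`, so both input paths converge on the same result.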

SYSTEM_input is excluded — free-text keyboard entry is not a voice-first interaction and cannot work on display-only devices. Skills that need text input must use the speech layer.

Avatar group — embodied assistant states

Template	Purpose	Key session data
`SYSTEM_face`	Animated avatar face	`sleeping: bool`

This template is intentionally minimal — the display layer decides what the avatar looks like. Additional emotional states (thinking, listening) are a display-layer concern, not a session-data concern.

Templates deliberately excluded

Rejected	Reason
`SYSTEM_slideshow`	Composable from a sequence of `SYSTEM_image` pages shown by `index`
`SYSTEM_notification`	This is a system-layer concept, not a page template
`SYSTEM_qr_code`	Renderable as `SYSTEM_image`; generation belongs in the skill
`SYSTEM_chart` / `SYSTEM_graph`	Too display-layer-specific; use `SYSTEM_html` or `SYSTEM_image`
`SYSTEM_input`	Free-text keyboard entry breaks the voice-first contract and cannot work on display-only devices
Per-skill templates (calendar, spotify, etc.)	Violates the "multiple unrelated skills" rule
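As an example of the composition argument for `SYSTEM_slideshow`, a skill could build one `SYSTEM_image` page per picture and step through them by index. The page dictionary shape (`template` plus `data`) and the `slideshow_pages` helper are assumptions for this sketch; only the `image` and `title` keys come from the template table.

```python
from typing import Dict, List, Optional


def slideshow_pages(images: List[str], title: Optional[str] = None) -> List[Dict]:
    """Build one SYSTEM_image page description per image URL."""
    pages = []
    for url in images:
        data = {"image": url}
        if title is not None:
            data["title"] = title
        pages.append({"template": "SYSTEM_image", "data": data})
    return pages


pages = slideshow_pages(
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    title="Holiday",
)
index = 1  # the page currently shown
print(pages[index]["data"]["image"])
```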

Summary

SYSTEM_idle             (reserved)

SYSTEM_loading
SYSTEM_status
SYSTEM_error

SYSTEM_text
SYSTEM_image
SYSTEM_animated_image
SYSTEM_list
SYSTEM_grid
SYSTEM_table

SYSTEM_html
SYSTEM_url

SYSTEM_audio_player
SYSTEM_video_player

SYSTEM_clock
SYSTEM_timer
SYSTEM_weather
SYSTEM_map

SYSTEM_confirm
SYSTEM_select

SYSTEM_face

21 templates total. Each is justified by cross-skill usage (or by serving an essential skill) and by a data model that cannot be reduced to an existing template.
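The summary can also be kept as a small registry, which makes the count and the reserved-template rule checkable. The group keys and the `skill_may_show` helper are hypothetical names; the template names and groupings are taken from the tables above.

```python
from typing import Dict, List

TEMPLATE_GROUPS: Dict[str, List[str]] = {
    "system": ["SYSTEM_idle", "SYSTEM_loading", "SYSTEM_status", "SYSTEM_error"],
    "content": [
        "SYSTEM_text", "SYSTEM_image", "SYSTEM_animated_image", "SYSTEM_list",
        "SYSTEM_grid", "SYSTEM_table", "SYSTEM_html", "SYSTEM_url",
    ],
    "media": ["SYSTEM_audio_player", "SYSTEM_video_player"],
    "utility": ["SYSTEM_clock", "SYSTEM_timer", "SYSTEM_weather", "SYSTEM_map"],
    "dialogue": ["SYSTEM_confirm", "SYSTEM_select"],
    "avatar": ["SYSTEM_face"],
}

# SYSTEM_idle is reserved: skills must not display it directly.
RESERVED = frozenset({"SYSTEM_idle"})

ALL_TEMPLATES = [name for names in TEMPLATE_GROUPS.values() for name in names]


def skill_may_show(template: str) -> bool:
    """True if the name is a known template that is not reserved."""
    return template in ALL_TEMPLATES and template not in RESERVED


print(len(ALL_TEMPLATES))
print(skill_may_show("SYSTEM_timer"), skill_may_show("SYSTEM_idle"))
```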