OVOS GUI

March 5, 2026

Philosophy

A template defines what kind of information is being presented, not how it looks. The display layer (Qt/web/terminal/etc.) owns all rendering decisions. Skills own only the semantic data they provide.

Voice-first constraint

OVOS is a voice-first platform. The GUI is a companion to speech, not a replacement for it.

  • Touch is a shortcut, never the only path. Every interaction a user can perform by touch must also be completable by voice.
  • Some clients are display-only (no touch, no keyboard). Templates must never assume input capability.
  • Skills must never block on a GUI event. If a skill asks a question, it listens for the spoken answer simultaneously. A GUI touch shortcut just fires the same bus message the spoken answer would.
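
The last rule can be sketched with a toy in-memory bus. Everything named here (MiniBus, the message type skill.answer, the two handler functions) is hypothetical and is not the OVOS messagebus API; the sketch only illustrates that speech and touch emit the same message, so the skill has a single code path and never waits on the GUI.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MiniBus:
    """Toy stand-in for a message bus (hypothetical, not the OVOS API)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def on(self, msg_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[msg_type].append(handler)

    def emit(self, msg_type: str, data: dict) -> None:
        for handler in list(self._handlers[msg_type]):
            handler(data)


ANSWER = "skill.answer"  # assumed message name, for illustration only

bus = MiniBus()
received: List[dict] = []
bus.on(ANSWER, received.append)  # the skill listens for one message type


def on_speech(utterance: str) -> None:
    # Spoken path: normalise the utterance and emit the answer.
    bus.emit(ANSWER, {"answer": utterance.strip().lower()})


def on_touch(button_value: str) -> None:
    # Touch path: a shortcut that emits the very same message.
    bus.emit(ANSWER, {"answer": button_value})


on_speech("Yes ")
on_touch("yes")
```

Because both paths converge on one message, a display-only client that never calls on_touch loses nothing.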

A template should be added when both of the following hold:

  • The semantic structure is meaningfully distinct from existing templates.
  • It is needed by more than one unrelated skill (no single-skill templates), or it serves an essential skill (weather, clock).

A template should be rejected when:

  • It is a visual variation of an existing template (that is the display layer's job)
  • It can be composed from existing templates in sequence
  • Its data model is a strict subset of a broader template
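
One possible reading of these criteria, written as a checklist function. The Proposal fields and the should_add name are hypothetical, and the grouping of the conditions (distinct, and either multi-skill or essential, with no rejection reason) is an interpretation of the lists above rather than a rule stated elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical summary of a proposed template."""
    semantically_distinct: bool
    unrelated_skill_count: int
    essential_skill: bool = False
    visual_variation_only: bool = False
    composable_from_existing: bool = False
    strict_subset_of_existing: bool = False


def should_add(p: Proposal) -> bool:
    # Any single rejection reason is enough to turn the proposal down.
    rejected = (
        p.visual_variation_only
        or p.composable_from_existing
        or p.strict_subset_of_existing
    )
    # Needed by several unrelated skills, or by one essential skill.
    needed = p.unrelated_skill_count > 1 or p.essential_skill
    return p.semantically_distinct and needed and not rejected


weather = Proposal(semantically_distinct=True, unrelated_skill_count=1, essential_skill=True)
slideshow = Proposal(semantically_distinct=True, unrelated_skill_count=3, composable_from_existing=True)
single_skill = Proposal(semantically_distinct=True, unrelated_skill_count=1)
```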

Proposed Template Set

System group — managed by ovos-gui, not skills

| Template | Purpose | Key session data |
|---|---|---|
| SYSTEM_idle | Resting / ambient screen | (none; display layer decides) |
| SYSTEM_loading | Indeterminate progress | label: str |
| SYSTEM_status | Terminal success / failure | success: bool, label: str |
| SYSTEM_error | Error with optional detail | label: str, detail: str? |

SYSTEM_idle is reserved — skills must not display it directly.
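
A minimal sketch of how the reservation could be enforced, assuming a hypothetical build_page helper and a hypothetical {"template", "data"} page shape; neither is taken from ovos-gui.

```python
from typing import Any, Dict

# Template names a skill may not request (per the rule above).
RESERVED_TEMPLATES = frozenset({"SYSTEM_idle"})


def build_page(template: str, **data: Any) -> Dict[str, Any]:
    """Hypothetical helper: bundle a template name with its session data."""
    if template in RESERVED_TEMPLATES:
        raise PermissionError(f"{template} is reserved for ovos-gui")
    # Drop optional fields that were not supplied (e.g. detail on SYSTEM_error).
    supplied = {key: value for key, value in data.items() if value is not None}
    return {"template": template, "data": supplied}


loading = build_page("SYSTEM_loading", label="Fetching forecast")
status = build_page("SYSTEM_status", success=True, label="Alarm set")
error = build_page("SYSTEM_error", label="No network", detail=None)
```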


Content group — read-only information

| Template | Purpose | Key session data |
|---|---|---|
| SYSTEM_text | Long-form text, auto-paginated | text: str, title: str? |
| SYSTEM_image | Static image | image: url, title: str?, caption: str?, fill: FillMode?, background_color: str? |
| SYSTEM_animated_image | Animated image (GIF / WebP) | same as SYSTEM_image |
| SYSTEM_list | Scrollable list of labelled items | items: List[{title, subtitle?, image?}], title: str? |
| SYSTEM_grid | 2-D tile grid of image-primary items | items: List[{image, title?}], title: str? |
| SYSTEM_table | Columnar data table with named headers | columns: List[str], rows: List[List[Any]], title: str? |
| SYSTEM_html | Rendered HTML string | html: str, resource_url: str? |
| SYSTEM_url | Full web page | url: str |

Why three distinct collection templates?

The data models are semantically different, not just visually different:

| Template | Primary axis | Item structure | Layout authority |
|---|---|---|---|
| list | Reading order | Hierarchical: title + subtitle + thumbnail | Fixed: single column |
| grid | Visual equality | Image-primary: image required, title optional | Display layer (adapts to screen) |
| table | Column identity | Relational: named columns, typed rows | Fixed: column × row |

Merging them would force every display layer to infer intent from data shape, which is fragile. Keeping them separate makes the skill's intent explicit.
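
The difference shows up as soon as each payload is validated. The validators below are a hypothetical sketch built from the session data listed for the three templates; the requirement that every table row has exactly one cell per column is an assumption, not something the tables above state.

```python
from typing import Any, Dict, List


def validate_list(items: List[Dict[str, Any]]) -> None:
    # list: every item needs a title; subtitle and image are optional.
    for item in items:
        if "title" not in item:
            raise ValueError("SYSTEM_list items require 'title'")


def validate_grid(items: List[Dict[str, Any]]) -> None:
    # grid: every item needs an image; title is optional.
    for item in items:
        if "image" not in item:
            raise ValueError("SYSTEM_grid items require 'image'")


def validate_table(columns: List[str], rows: List[List[Any]]) -> None:
    # table: rows are positional, so each row is checked against the headers
    # (assumed rule: one cell per named column).
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("SYSTEM_table rows must match the column count")


validate_list([{"title": "Milk", "subtitle": "2 litres"}])
validate_grid([{"image": "cover.png"}])
validate_table(["Day", "High"], [["Mon", 21], ["Tue", 19]])
```

A title-only item is valid for a list but not for a grid, so the skill has to say which one it means rather than leaving the display layer to guess.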


Media group — time-based playback

| Template | Purpose | Key session data |
|---|---|---|
| SYSTEM_audio_player | Now-playing card (audio) | title: str, artist: str?, album: str?, image: url?, position: float, duration: float, playing: bool |
| SYSTEM_video_player | In-GUI video playback | uri: str, title: str?, playing: bool |

Why separate audio and video? Audio playback shows a rich metadata card while the actual audio plays through the sound system — there is no video stream involved. Video playback is a raw media surface. Conflating them forces every display layer to handle both cases in one template.
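
Written as data classes, the two shapes have little in common. This is an illustrative sketch: the class names and the progress helper are hypothetical, and the fields are copied from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioPlayerData:
    """Metadata card; the audio itself plays through the sound system."""
    title: str
    position: float
    duration: float
    playing: bool
    artist: Optional[str] = None
    album: Optional[str] = None
    image: Optional[str] = None

    def progress(self) -> float:
        # Fraction played, clamped to [0, 1]; unit-free as long as
        # position and duration share the same unit.
        if self.duration <= 0:
            return 0.0
        return min(1.0, max(0.0, self.position / self.duration))


@dataclass
class VideoPlayerData:
    """Raw media surface; the display layer plays the stream at uri."""
    uri: str
    playing: bool
    title: Optional[str] = None


card = AudioPlayerData(title="Song", position=30.0, duration=120.0, playing=True)
surface = VideoPlayerData(uri="file:///clip.mp4", playing=False)
```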


Utility group — common single-purpose views

| Template | Purpose | Key session data |
|---|---|---|
| SYSTEM_clock | Current time display | (self-updating, no data needed) |
| SYSTEM_timer | Countdown / count-up display | duration: int (seconds), label: str? |
| SYSTEM_weather | Weather summary card | current_temp, min_temp, max_temp, condition: str, icon: url?, location: str? |
| SYSTEM_map | Geographic location | latitude: float, longitude: float, zoom: int?, label: str? |

Why SYSTEM_timer? Timers are pervasive (cooking, pomodoro, alarms) and have a distinct real-time countdown UI that cannot be expressed by SYSTEM_text without the skill manually updating the displayed string every second.
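
A sketch of what the display layer could do with a single duration value. The started_at timestamp is an assumption (for example, the moment the display layer received the page); it is not part of the session data listed above, and both function names are hypothetical.

```python
def remaining_seconds(duration: int, started_at: float, now: float) -> int:
    """Seconds left on a countdown, never negative."""
    elapsed = int(now - started_at)
    return max(0, duration - elapsed)


def format_mmss(seconds: int) -> str:
    """Render a second count as MM:SS for display."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


# The skill sends duration=300 once; the display layer re-renders each tick.
left = remaining_seconds(duration=300, started_at=100.0, now=130.5)
text = format_mmss(left)
```

The skill sends one message and the countdown is rendered locally, which is what SYSTEM_text cannot offer.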

Why SYSTEM_map? Navigation, weather location, business lookup — all need a spatial view. A URL-embedded map is fragile (requires network, leaks provider choice).
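
Because the session data is plain coordinates rather than a provider URL, it can be checked without any network access. A minimal sketch, assuming a hypothetical MapData class; the range limits are the standard bounds for latitude and longitude in degrees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MapData:
    """Hypothetical container for SYSTEM_map session data."""
    latitude: float
    longitude: float
    zoom: Optional[int] = None
    label: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject coordinates outside the valid geographic range.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude must be within [-90, 90]")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude must be within [-180, 180]")


lisbon = MapData(latitude=38.72, longitude=-9.14, zoom=12, label="Lisbon")
```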


Dialogue group — visual accompaniment to an active voice dialogue

OVOS is voice-first. The GUI never drives an interaction on its own. These templates display what is currently being asked through speech so the user can follow along. On capable devices a touch shortcut may be offered as a convenience, but:

  • Voice is always the primary (and on display-only devices, the only) path.
  • A skill must never block waiting for a GUI event — it must always be listening for the spoken answer in parallel.
  • The display layer decides whether to render touch targets at all.

| Template | Purpose | Key session data |
|---|---|---|
| SYSTEM_confirm | Shows the yes/no question being spoken | question: str |
| SYSTEM_select | Shows the list of options being spoken | prompt: str?, items: List[{label, value}] |

SYSTEM_input is excluded — free-text keyboard entry is not a voice-first interaction and cannot work on display-only devices. Skills that need text input must use the speech layer.
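
For SYSTEM_select, the {label, value} items let a spoken answer and a touch shortcut resolve to the same value. The two resolver functions below are hypothetical, and the exact-match comparison is a deliberate simplification; a real skill would rely on the speech layer's own matching.

```python
from typing import Dict, List, Optional

Item = Dict[str, str]


def resolve_spoken(utterance: str, items: List[Item]) -> Optional[str]:
    """Map a spoken label to its value (case-insensitive exact match)."""
    spoken = utterance.strip().lower()
    for item in items:
        if item["label"].lower() == spoken:
            return item["value"]
    return None


def resolve_touch(index: int, items: List[Item]) -> Optional[str]:
    """Map a tapped row to its value; out-of-range taps resolve to None."""
    if 0 <= index < len(items):
        return items[index]["value"]
    return None


items = [
    {"label": "Jazz", "value": "genre.jazz"},
    {"label": "Rock", "value": "genre.rock"},
]
by_voice = resolve_spoken(" rock ", items)
by_touch = resolve_touch(1, items)
```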


Avatar group — embodied assistant states

| Template | Purpose | Key session data |
|---|---|---|
| SYSTEM_face | Animated avatar face | sleeping: bool |

This template is intentionally minimal — the display layer decides what the avatar looks like. Additional emotional states (thinking, listening) are a display-layer concern, not a session-data concern.


Templates deliberately excluded

| Rejected | Reason |
|---|---|
| SYSTEM_slideshow | Can be composed by showing SYSTEM_image pages in sequence, selected by index |
| SYSTEM_notification | This is a system-layer concept, not a page template |
| SYSTEM_qr_code | Renderable as SYSTEM_image; generation belongs in the skill |
| SYSTEM_chart / SYSTEM_graph | Too display-layer-specific; use SYSTEM_html or SYSTEM_image |
| SYSTEM_input | Free-text keyboard entry breaks the voice-first contract and cannot work on display-only devices |
| Per-skill templates (calendar, spotify, etc.) | Violates the "multiple unrelated skills" rule |
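
The slideshow case illustrates composition: a skill can build a sequence of SYSTEM_image pages and step through them by index. This is a sketch only; the page dictionary shape, slideshow_pages, and next_index are hypothetical names, not part of any existing API.

```python
from typing import Dict, List


def slideshow_pages(urls: List[str]) -> List[Dict[str, str]]:
    """Build one SYSTEM_image page per URL, captioned with its position."""
    total = len(urls)
    return [
        {"template": "SYSTEM_image", "image": url, "caption": f"{n} / {total}"}
        for n, url in enumerate(urls, start=1)
    ]


def next_index(current: int, total: int) -> int:
    """Advance to the next page, wrapping around at the end."""
    if total <= 0:
        raise ValueError("slideshow has no pages")
    return (current + 1) % total


pages = slideshow_pages(["a.png", "b.png", "c.png"])
index = 0
index = next_index(index, len(pages))  # e.g. on a timer or a spoken "next"
```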

Summary

SYSTEM_idle             (reserved)

SYSTEM_loading
SYSTEM_status
SYSTEM_error

SYSTEM_text
SYSTEM_image
SYSTEM_animated_image
SYSTEM_list
SYSTEM_grid
SYSTEM_table

SYSTEM_html
SYSTEM_url

SYSTEM_audio_player
SYSTEM_video_player

SYSTEM_clock
SYSTEM_timer
SYSTEM_weather
SYSTEM_map

SYSTEM_confirm
SYSTEM_select

SYSTEM_face

21 templates total. Every one is justified by cross-skill usage and a data model that cannot be reduced to an existing template.
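
The count can be checked mechanically. The registry below is a hypothetical structure that mirrors the group tables in this document; it is not an existing ovos-gui data structure.

```python
from typing import Dict, List

TEMPLATE_GROUPS: Dict[str, List[str]] = {
    "system": ["SYSTEM_idle", "SYSTEM_loading", "SYSTEM_status", "SYSTEM_error"],
    "content": [
        "SYSTEM_text", "SYSTEM_image", "SYSTEM_animated_image",
        "SYSTEM_list", "SYSTEM_grid", "SYSTEM_table",
        "SYSTEM_html", "SYSTEM_url",
    ],
    "media": ["SYSTEM_audio_player", "SYSTEM_video_player"],
    "utility": ["SYSTEM_clock", "SYSTEM_timer", "SYSTEM_weather", "SYSTEM_map"],
    "dialogue": ["SYSTEM_confirm", "SYSTEM_select"],
    "avatar": ["SYSTEM_face"],
}

# Flatten the groups into one list of template names.
ALL_TEMPLATES = [name for group in TEMPLATE_GROUPS.values() for name in group]

# Names a skill may request: everything except the reserved idle screen.
SKILL_VISIBLE = [name for name in ALL_TEMPLATES if name != "SYSTEM_idle"]
```

Of the 21 names, 20 are available to skills, since SYSTEM_idle is reserved.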