Live Typing UX Research

April 14, 2026

A research workspace for documenting the UX challenges of using live voice typing as a replacement for keyboard input on the desktop.

The focus is desktop-specific — not mobile dictation, not transcription-after-the-fact, but real-time speech-to-text that feeds a cursor in arbitrary applications (editors, chat boxes, terminals, address bars).

This repo catalogues:

  • Interaction patterns in current live-typing tools — what they're called (push-to-talk, hands-free / VAD-gated, hold-to-dictate, toggle-dictation, streaming overlay, commit-on-pause, etc.), how they behave from the user's perspective, and how they work under the hood (streaming ASR vs. chunked, endpointing, partial vs. final tokens, injection method into the focused window).
  • Friction points observed while prototyping — false commits, lost partials, focus loss, punctuation/formatting gaps, correction workflows, modal vs. modeless overlays.
  • A working spec for an "ideal" desktop live-typing UI, derived from the patterns above and prototypes built along the way, refined as the research evolves.

Each topic is recorded as a pair of cross-referenced files:

  • questions/<slug>.md — the question as posed, kept short and faithful to how it was asked.
  • ideas/<slug>.md — the exploratory response: patterns, tradeoffs, recommendations, references.

The two halves link to each other. This keeps the user's framing and the AI's analysis legible as separate artefacts rather than fused into one document.
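A minimal sketch of how the pairing could be checked, assuming only the `questions/<slug>.md` and `ideas/<slug>.md` layout described above. The function name `unpaired_slugs` is hypothetical and not part of the workspace's own commands.

```python
from pathlib import Path


def unpaired_slugs(root: Path) -> dict[str, list[str]]:
    """Return slugs that exist on only one side of the questions/ideas pairing."""
    questions = {p.stem for p in (root / "questions").glob("*.md")}
    ideas = {p.stem for p in (root / "ideas").glob("*.md")}
    return {
        "missing_idea": sorted(questions - ideas),      # question with no response
        "missing_question": sorted(ideas - questions),  # response with no question
    }
```

Calling `unpaired_slugs(Path("."))` from the repository root returns two sorted lists. Both are empty when every question has a matching response file.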


Reference

  • Glossary — consolidated terminology used across all topics.

Topics

Batch 1 (2026-04-14)

Consolidated PDF: outputs/2026-04-14-all.pdf

| Prompt | Output |
| --- | --- |
| Cursor-level dictation vs record-then-paste STT — who loves which, and where | Adoption patterns across industries, users, and operating systems |
| Custom vocabulary in transcription tools — how is it actually implemented? | What's actually happening under the hood, and how big the list can get |
| Dictation formatting features — built-in model support vs supplementary post-processing | Built-in vs supplementary formatting, across local and cloud models |
| Does end-of-utterance batch inference give better accuracy than chunked streaming? | Batch inference is genuinely more accurate — it's not just frontend engineering |
| Focus loss during dictation | Handling focus loss between dictation start and transcript arrival |
| Hotkey count tradeoffs for voice dictation control (single key, macro pads) | From single toggle to a four-key macro pad |
| Industries that adopted speech-to-text enthusiastically through recognition of effectiveness | Industries that adopted STT through sheer recognition of effectiveness |
| Inference cadence and sentence entry for pause-for-thought dictators | Finding the UX sweet spot for pause-for-thought dictators |
| Leading STT models for live typing — SaaS/API and locally runnable | SaaS and local models, with architecture commonalities and differences |
| Local STT engines with built-in VAD (or equivalent silence-handling architecture) | Built-in VAD, bolted-on VAD, and push-to-talk as an architectural sidestep |
| Local STT inference engines and GPU acceleration (NVIDIA vs AMD) | GPU acceleration on NVIDIA vs AMD, with an engine × vendor × backend table |
| Pause-tolerant dictation — codify this user profile and recommend an STT stack | The pause-tolerant prompt-dictator — codified profile and matching STT stack |
| Streaming injection vs batch transcription on stop | Streaming injection vs utterance-final injection in live dictation |
| VAD (voice activity detection) for live typing | What's native, what's bolted on, and how to wire them together |
| What is the dynamic-rewriting display in tools like Deepgram called? | Interim results, stabilization, and where the work happens |
| When Handy prints the transcript over 10–20 seconds, is that real-time inference or an artificial frontend delay? | Handy's print delay is almost entirely keystroke-injection pacing, not inference |
| Why Whisper isn't ideal for live dictation, and how live STT models rewrite on the fly | Why Whisper isn't built for live dictation, and how streaming STT models rewrite on the fly |

How this workspace works

A workspace for asking Claude (or any AI coding agent) technical how-to questions and turning the answers into living, maintainable guides — like a GitHub Gist, but multi-file and built to be revised over time.

Why a repo and not a Gist?

  • A guide can span multiple files (code samples, diagrams, follow-up Q&A).
  • Guides change as tools, OS versions, and best practices change — versioned files beat a one-shot answer.
  • Each guide gets its own folder, history, and (optionally) issues/PRs for corrections.

Getting started

After cloning from this template, run:

/setup-workspace

The command replaces placeholders, seeds context/, and asks for a one-line description of what you'll be researching here.

Day-to-day commands

| Command | What it does |
| --- | --- |
| /ask | Pose a new technical question — produces a structured guide under guides/<slug>/README.md. |
| /revise | Update an existing guide with new findings. |
| /followup | Append a Q&A note to an existing guide without rewriting it. |
| /publish | Rebuild the top-level guide index and report the public repo URL. |
| /glossary | Rebuild glossary.md from terminology defined across ideas/. |

Layout

context/    — Stack, environment, constraints (read by every command)
questions/  — One file per question, <slug>.md, kept faithful to how it was posed
ideas/      — One file per response, <slug>.md, cross-referenced to its question
outputs/    — Loose drafts before promotion to a question/ideas pair

Visibility

This template defaults to public repos. Don't put secrets or sensitive context in context/ or guides/.