Parakeet Dictation

March 12, 2026

On-device voice typing for Linux / Wayland with built-in punctuation and capitalization — no cloud API, no GPU required.

Uses sherpa-onnx to run NVIDIA NeMo ASR models locally, including Parakeet TDT 0.6B, one of the highest-accuracy open-source speech recognition models available. Unlike Whisper-based dictation tools, Parakeet produces natively punctuated and capitalized output with no post-processing — making it ideal for live typing workflows.

Validated on Ubuntu 25.04 with KDE Plasma 6 / Wayland.

Note: This is a Wayland-only tool. X11 is not supported.

Why not Whisper?

Whisper is excellent for batch transcription but has drawbacks for live dictation:

  • No native punctuation control — requires separate punctuation models or heuristics
  • High latency — designed for processing complete audio files, not real-time segments
  • Heavy — even whisper-small uses more RAM than Parakeet TDT 0.6B (int8) while being less accurate for English

The NeMo family (Parakeet, Canary, Nemotron) was designed for production speech pipelines and outputs punctuated text natively. Parakeet TDT 0.6B v3 achieves state-of-the-art word error rates on English benchmarks while running at roughly 30x real time on CPU, meaning a 30-second utterance takes about one second to transcribe.

Why sherpa-onnx?

sherpa-onnx provides a clean, optimized runtime for running ONNX-exported NeMo models. It handles feature extraction, decoding, and streaming — all in a single Python package with no external dependencies beyond ONNX Runtime. This avoids the overhead of full NeMo/PyTorch and enables efficient CPU-only inference even on laptops.

Model Profiles

| Profile | Model | Type | Params | Download | Best for |
|---|---|---|---|---|---|
| desktop | Parakeet TDT 0.6B v3 (int8) | Offline (VAD-segmented) | 600M | 639 MB | Desktop/workstation, best accuracy |
| laptop | Canary 180M Flash (int8) | Offline (VAD-segmented) | 180M | 198 MB | Laptop, low RAM, travel |
| streaming | Nemotron Streaming 0.6B (int8) | Online (frame-by-frame) | 600M | 631 MB | True real-time, lowest latency |

Model types explained

  • Offline (VAD-segmented): TEN VAD detects when you pause speaking, then sends the completed speech segment to the model. You get punctuated text ~1–2 seconds after each pause. Best accuracy.
  • Online (streaming): The model processes audio frame-by-frame as you speak, outputting partial results in real time. Lower latency, but slightly different sentence boundary behavior.

All models output punctuated, capitalized text natively.

Choosing a model

  • Desktop/workstation with plenty of RAM: Use desktop (Parakeet TDT 0.6B). Best accuracy, handles accents and technical vocabulary well. ~2 GB RAM.
  • Laptop or low-RAM machine: Use laptop (Canary 180M Flash). Only 198 MB download, ~500 MB RAM. Supports English, Spanish, German, and French.
  • Lowest possible latency: Use streaming (Nemotron Streaming 0.6B). Text appears as you speak rather than after pauses. English only.
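
The guidance above can be condensed into a small helper. This is only an illustrative sketch: the function name `suggest_profile` and its parameters are hypothetical and not part of the app, and the 2 GB cutoff is taken from the approximate RAM figure quoted for the desktop profile.

```python
def suggest_profile(free_ram_gb: float,
                    want_lowest_latency: bool = False,
                    language: str = "en") -> str:
    """Return 'desktop', 'laptop', or 'streaming' following the guidance above.

    Hypothetical helper for illustration; the app does not ship this function.
    """
    # The streaming profile is English only, so only suggest it for English.
    if want_lowest_latency and language == "en":
        return "streaming"
    # The desktop profile needs roughly 2 GB of RAM, the laptop profile about 500 MB.
    if free_ram_gb >= 2.0:
        return "desktop"
    return "laptop"
```

For example, `suggest_profile(1.0)` falls back to the laptop profile, while `suggest_profile(8.0, want_lowest_latency=True)` picks streaming.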

Install

Option A: Build and install the .deb package

git clone https://github.com/danielrosehill/parakeet-dictation.git
cd parakeet-dictation
chmod +x build-deb.sh
./build-deb.sh

sudo dpkg -i parakeet-dictation_1.0.0.deb
sudo apt-get install -f          # resolve any missing deps
sudo /opt/parakeet-dictation/setup-pip-deps.sh

parakeet-dictation

Option B: Run from source

git clone https://github.com/danielrosehill/parakeet-dictation.git
cd parakeet-dictation
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt

# Download models (or use the in-app model manager)
uv pip install requests tqdm
python download_models.py desktop    # 639 MB — best accuracy
python download_models.py laptop     # 198 MB — lightweight
python download_models.py streaming  # 631 MB — real-time
python download_models.py all        # all profiles

# System dependencies (Ubuntu/Debian)
sudo apt install wtype gir1.2-ayatanaappindicator3-0.1 libportaudio2 libgirepository-2.0-dev libc++1

python dictation_app.py

Text input method

Text is typed into the focused application via wtype (Wayland-native). This is the default and recommended method.

Alternative methods available in Settings > General:

| Method | Command | Notes |
|---|---|---|
| wtype (default) | wtype | Native Wayland keystroke injection. Just works. |
| ydotool | ydotool | Requires ydotoold daemon + /dev/uinput permissions (see below) |
| clipboard | wl-copy + wtype Ctrl+V | Pastes via clipboard. Works everywhere but overwrites clipboard. |
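
To make the three methods concrete, here is a minimal sketch of how each one could be invoked from Python. The function names `build_typer_commands` and `type_text` are hypothetical, and this is not necessarily how the app calls these tools. It assumes `wtype -` reads the text from stdin, `wl-copy` reads the clipboard contents from stdin, and `ydotool type` takes the text as an argument.

```python
import subprocess


def build_typer_commands(method: str, text: str):
    """Return a list of (argv, stdin_text) steps for the chosen method.

    Hypothetical helper for illustration only.
    """
    if method == "wtype":
        # "wtype -" types whatever it receives on stdin.
        return [(["wtype", "-"], text)]
    if method == "ydotool":
        # Requires a running ydotoold daemon and access to /dev/uinput.
        return [(["ydotool", "type", text], None)]
    if method == "clipboard":
        return [
            (["wl-copy"], text),  # put the text on the Wayland clipboard
            (["wtype", "-M", "ctrl", "v", "-m", "ctrl"], None),  # press Ctrl+V
        ]
    raise ValueError(f"unknown typer method: {method!r}")


def type_text(method: str, text: str) -> None:
    """Run each step in order, feeding text on stdin where a step expects it."""
    for argv, stdin_text in build_typer_commands(method, text):
        subprocess.run(argv, input=stdin_text, text=True, check=True)
```

Separating command construction from execution keeps the mapping easy to inspect without a Wayland session. Calling `type_text("wtype", "Hello, world.")` would type into whichever window currently has focus.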

ydotool setup (only if using ydotool method)

sudo groupadd -f input
sudo usermod -aG input $USER
echo 'KERNEL=="uinput", GROUP="input", MODE="0660"' | sudo tee /etc/udev/rules.d/80-uinput.rules
sudo udevadm control --reload-rules
sudo udevadm trigger /dev/uinput
# Log out and back in for group membership to take effect

TEN VAD system dependency

TEN VAD requires libc++1 (the LLVM C++ standard library runtime). The .deb package installs this automatically. If running from source:

sudo apt install libc++1

On some systems, the libc++.so.1 symlink may be missing. If you get an error about libc++.so.1, create the symlink:

sudo ln -sf /lib/x86_64-linux-gnu/libc++.so.1.0.* /lib/x86_64-linux-gnu/libc++.so.1
sudo ln -sf /lib/x86_64-linux-gnu/libc++abi.so.1.0.* /lib/x86_64-linux-gnu/libc++abi.so.1
sudo ldconfig

Usage

The app runs as a system tray indicator with an optional full-size main window.

  • Tray icon: Right-click for start/stop, model switching, settings, and to open the main window.
  • Main window: Open from the tray menu ("Show Window"). Provides a transcript view, start/stop/pause controls, and a microphone selector. Closing the window hides it back to tray.

Default hotkeys

| Action | Default | Description |
|---|---|---|
| Toggle | Ctrl+0 | Start/stop dictation |
| Start | Ctrl+9 | Start only (start/stop mode) |
| Stop | Ctrl+8 | Stop only (start/stop mode) |
| Pause | Ctrl+Alt+0 | Pause/resume without stopping engine |

Hotkey modes

  • Toggle mode (default): One key starts and stops dictation
  • Start/Stop mode: Separate keys for starting and stopping

All hotkeys are rebindable from Settings > Hotkeys.
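
The defaults are shown here as Ctrl+Alt+0, while config.json stores them as `<ctrl>+<alt>+0`. If you edit the file by hand, a conversion along these lines may help. The helper name `to_config_hotkey` is hypothetical, and the set of modifier names it recognizes is an assumption based only on the examples in this README.

```python
def to_config_hotkey(display: str) -> str:
    """Convert a display string like 'Ctrl+Alt+0' into '<ctrl>+<alt>+0'.

    Hypothetical helper; only ctrl, alt, and shift are treated as modifiers.
    """
    modifiers = {"ctrl", "alt", "shift"}
    parts = []
    for token in display.split("+"):
        name = token.strip().lower()
        # Modifiers are wrapped in angle brackets; ordinary keys are left bare.
        parts.append(f"<{name}>" if name in modifiers else name)
    return "+".join(parts)
```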

Microphone selection

Choose your input device from:

  • Settings > General > Microphone (persisted)
  • Main window header bar mic selector (quick switch)

Leave set to "System Default" to use your desktop's default input device.

Model manager

Open Settings > Models to browse available model profiles, download them in-app, and select which one to use.

Audio feedback

  • Rising tone (880 Hz) — dictation started
  • Falling tone (440 Hz) — dictation stopped
  • Double beep (660 Hz) — paused/resumed
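
As a rough illustration of how such a cue can be synthesized, the sketch below generates a fixed-frequency sine tone as 16-bit mono PCM using only the standard library. The function name `sine_tone`, the 0.15 s duration, and the 16 kHz sample rate are assumptions. The `volume` argument mirrors the `beep_volume` setting, but the app's actual tone generation may differ.

```python
import math
import struct


def sine_tone(freq_hz: float, duration_s: float = 0.15,
              volume: float = 0.5, sample_rate: int = 16000) -> bytes:
    """Return 16-bit little-endian mono PCM samples for a sine tone."""
    n = round(duration_s * sample_rate)
    # Clamp volume to [0, 1] and scale to the 16-bit signed range.
    amplitude = int(32767 * max(0.0, min(1.0, volume)))
    samples = (
        int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)
```

The resulting bytes can be written to a WAV file with the `wave` module or passed to an audio output library of your choice.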

Night mode

Automatically suppresses audio feedback between configurable hours (default 22:00–09:00). Enable from Settings > General.
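
Because the default window (22:00 to 09:00) crosses midnight, a plain `start <= hour < end` comparison is not enough. A minimal sketch of the check, assuming whole-hour values as in the `night_start` and `night_end` settings, follows. The function name `in_night_window` is hypothetical, and treating equal start and end hours as an empty window is an assumption.

```python
def in_night_window(hour: int, night_start: int = 22, night_end: int = 9) -> bool:
    """Return True if `hour` (0-23) falls inside the quiet window."""
    if night_start == night_end:
        return False  # assumption: equal bounds mean no quiet window
    if night_start < night_end:
        # Window within a single day, e.g. 01:00 to 05:00.
        return night_start <= hour < night_end
    # Window wraps past midnight, e.g. 22:00 to 09:00.
    return hour >= night_start or hour < night_end
```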

How it works

  1. TEN VAD, a lightweight voice activity detector of about 306 KB, detects speech segments in real time
  2. When you pause speaking, the completed segment is sent to the ASR model
  3. Transcribed text (with punctuation) is typed into the focused window via wtype

The streaming profile uses frame-by-frame processing instead of VAD segmentation.
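
The pause-based segmentation in steps 1 and 2 can be sketched as follows. This is not TEN VAD's algorithm or API: it is a simplified stand-in that takes per-frame speech probabilities (as a VAD might produce), compares them against a threshold like `vad_threshold`, and closes a segment after a run of silent frames. The function name `segment_speech` and the `min_silence_frames` parameter are assumptions.

```python
from typing import Iterable, Iterator, Optional, Tuple


def segment_speech(probs: Iterable[float], threshold: float = 0.5,
                   min_silence_frames: int = 3) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame indices, end exclusive, for each finished segment."""
    start: Optional[int] = None
    last_speech = -1
    silence = 0
    for idx, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = idx  # speech begins
            last_speech = idx
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # Pause is long enough: the segment is complete and
                # could now be handed to the ASR model.
                yield (start, last_speech + 1)
                start = None
                silence = 0
    if start is not None:
        # Flush a segment that was still open when the input ended.
        yield (start, last_speech + 1)
```

In the offline profiles, each yielded range would correspond to one chunk of audio sent to the recognizer, which is why text appears shortly after you pause rather than while you are speaking.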

Configuration

Settings are stored in ~/.config/parakeet-dictation/config.json:

{
  "model_profile": "desktop",
  "num_threads": 4,
  "vad_threshold": 0.5,
  "beep_volume": 0.5,
  "audio_device": "",
  "typer": "wtype",
  "hotkey_mode": "toggle",
  "hotkey_toggle": "<ctrl>+0",
  "hotkey_start": "<ctrl>+9",
  "hotkey_stop": "<ctrl>+8",
  "hotkey_pause": "<ctrl>+<alt>+0",
  "night_mode": false,
  "night_start": 22,
  "night_end": 9
}
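
If you want to read this file from your own scripts, a tolerant loader might look like the sketch below. It treats the values shown above as defaults, which is an assumption, and falls back to them when the file is missing or malformed. The names `DEFAULTS` and `load_config` are hypothetical and not taken from the app's source.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".config" / "parakeet-dictation" / "config.json"

# Assumed defaults, copied from the example config above.
DEFAULTS = {
    "model_profile": "desktop",
    "num_threads": 4,
    "vad_threshold": 0.5,
    "beep_volume": 0.5,
    "audio_device": "",
    "typer": "wtype",
    "hotkey_mode": "toggle",
    "hotkey_toggle": "<ctrl>+0",
    "hotkey_start": "<ctrl>+9",
    "hotkey_stop": "<ctrl>+8",
    "hotkey_pause": "<ctrl>+<alt>+0",
    "night_mode": False,
    "night_start": 22,
    "night_end": 9,
}


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Return the defaults overlaid with whatever the config file provides."""
    config = dict(DEFAULTS)
    try:
        loaded = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return config  # missing or malformed file: use the defaults
    if isinstance(loaded, dict):
        config.update(loaded)
    return config
```

Merging over defaults means a hand-edited file only needs the keys you want to change, for example `{"model_profile": "laptop"}`.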

Requirements

  • Python 3.10+
  • Linux with Wayland (validated on Ubuntu 25.04, KDE Plasma 6)
  • ~500 MB – 2 GB RAM depending on model
  • wtype (Wayland text input)
  • libc++1 (for TEN VAD)

License

MIT