Parakeet Dictation
March 12, 2026
On-device voice typing for Linux / Wayland with built-in punctuation and capitalization — no cloud API, no GPU required.
Uses sherpa-onnx to run NVIDIA NeMo ASR models locally, including Parakeet TDT 0.6B, one of the highest-accuracy open-source speech recognition models available. Unlike Whisper-based dictation tools, Parakeet produces natively punctuated and capitalized output with no post-processing — making it ideal for live typing workflows.
Validated on Ubuntu 25.04 with KDE Plasma 6 / Wayland.
Note: This is a Wayland-only tool. X11 is not supported.
Why not Whisper?
Whisper is excellent for batch transcription but has drawbacks for live dictation:
- No native punctuation control — requires separate punctuation models or heuristics
- High latency — designed for processing complete audio files, not real-time segments
- Heavy: even whisper-small uses more RAM than Parakeet TDT 0.6B (int8) while being less accurate for English
The NeMo family (Parakeet, Canary, Nemotron) was designed for production speech pipelines and outputs punctuated text natively. Parakeet TDT 0.6B v3 achieves state-of-the-art word error rates on English benchmarks while running ~30x real-time on CPU.
Why sherpa-onnx?
sherpa-onnx provides a clean, optimized runtime for running ONNX-exported NeMo models. It handles feature extraction, decoding, and streaming — all in a single Python package with no external dependencies beyond ONNX Runtime. This avoids the overhead of full NeMo/PyTorch and enables efficient CPU-only inference even on laptops.
Model Profiles
| Profile | Model | Type | Params | Download | Best for |
|---|---|---|---|---|---|
| desktop | Parakeet TDT 0.6B v3 (int8) | Offline (VAD-segmented) | 600M | 639 MB | Desktop/workstation — best accuracy |
| laptop | Canary 180M Flash (int8) | Offline (VAD-segmented) | 180M | 198 MB | Laptop, low RAM, travel |
| streaming | Nemotron Streaming 0.6B (int8) | Online (frame-by-frame) | 600M | 631 MB | True real-time — lowest latency |
Model types explained
- Offline (VAD-segmented): TEN VAD detects when you pause speaking, then sends the completed speech segment to the model. You get punctuated text ~1–2 seconds after each pause. Best accuracy.
- Online (streaming): The model processes audio frame-by-frame as you speak, outputting partial results in real time. Lower latency, but slightly different sentence boundary behavior.
All models output punctuated, capitalized text natively.
Choosing a model
- Desktop/workstation with plenty of RAM: Use desktop (Parakeet TDT 0.6B). Best accuracy, handles accents and technical vocabulary well. ~2 GB RAM.
- Laptop or low-RAM machine: Use laptop (Canary 180M Flash). Only 198 MB download, ~500 MB RAM. Supports English, Spanish, German, and French.
- Lowest possible latency: Use streaming (Nemotron Streaming 0.6B). Text appears as you speak rather than after pauses. English only.
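The guidance above can be written down as a small decision rule. This is only a sketch: `choose_profile` is a hypothetical helper, not part of the app, and the 2000 MB cut-off is taken from the approximate RAM figure quoted for the desktop profile. RAM use for the streaming profile is not stated in this README, so the sketch does not check it.

```python
def choose_profile(free_ram_mb: int, want_lowest_latency: bool = False) -> str:
    """Pick a profile name following the rules of thumb in this README.

    Hypothetical helper for illustration; the app itself lets you choose
    a profile from Settings > Models.
    """
    if want_lowest_latency:
        # Text appears while you speak. English only, RAM use not documented.
        return "streaming"
    if free_ram_mb >= 2000:
        # Parakeet TDT 0.6B needs roughly 2 GB of RAM.
        return "desktop"
    # Canary 180M Flash needs roughly 500 MB of RAM.
    return "laptop"


if __name__ == "__main__":
    print(choose_profile(free_ram_mb=16000))
    print(choose_profile(free_ram_mb=1500))
    print(choose_profile(free_ram_mb=1500, want_lowest_latency=True))
```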
Install
Option A: .deb package (recommended)
git clone https://github.com/danielrosehill/parakeet-dictation.git
cd parakeet-dictation
chmod +x build-deb.sh
./build-deb.sh
sudo dpkg -i parakeet-dictation_1.0.0.deb
sudo apt-get install -f # resolve any missing deps
sudo /opt/parakeet-dictation/setup-pip-deps.sh
parakeet-dictation
Option B: Run from source
git clone https://github.com/danielrosehill/parakeet-dictation.git
cd parakeet-dictation
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
# Download models (or use the in-app model manager)
uv pip install requests tqdm
python download_models.py desktop # 639 MB — best accuracy
python download_models.py laptop # 198 MB — lightweight
python download_models.py streaming # 631 MB — real-time
python download_models.py all # all profiles
# System dependencies (Ubuntu/Debian)
sudo apt install wtype gir1.2-ayatanaappindicator3-0.1 libportaudio2 libgirepository-2.0-dev libc++1
python dictation_app.py
Text input method
Text is typed into the focused application via wtype (Wayland-native). This is the default and recommended method.
Alternative methods available in Settings > General:
| Method | Command | Notes |
|---|---|---|
| wtype (default) | wtype | Native Wayland keystroke injection. Just works. |
| ydotool | ydotool | Requires ydotoold daemon + /dev/uinput permissions (see below) |
| clipboard | wl-copy + wtype Ctrl+V | Pastes via clipboard. Works everywhere but overwrites clipboard. |
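To make the three methods concrete, here is a rough sketch of the commands each one could run. The function names are hypothetical and the app's real invocation may differ; the only external commands assumed are `wtype TEXT`, `ydotool type TEXT`, `wl-copy TEXT`, and `wtype -M ctrl v -m ctrl` (press and release Ctrl around the `v` key).

```python
import subprocess


def build_typer_commands(method: str, text: str) -> list[list[str]]:
    """Return the argv lists to run, in order, for one typing method.

    Caveat: text that starts with "-" may be misread as an option by the
    underlying tools; this sketch does not guard against that.
    """
    if method == "wtype":
        return [["wtype", text]]
    if method == "ydotool":
        # Needs the ydotoold daemon and /dev/uinput access.
        return [["ydotool", "type", text]]
    if method == "clipboard":
        # Copy to the Wayland clipboard, then send Ctrl+V to paste it.
        return [["wl-copy", text], ["wtype", "-M", "ctrl", "v", "-m", "ctrl"]]
    raise ValueError(f"unknown typer method: {method!r}")


def type_text(method: str, text: str) -> None:
    """Run the commands for the chosen method, stopping on the first failure."""
    for argv in build_typer_commands(method, text):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe anywhere.
    for name in ("wtype", "ydotool", "clipboard"):
        print(name, build_typer_commands(name, "Hello, world."))
```

Keeping command construction separate from execution makes the method table easy to check without a Wayland session.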
ydotool setup (only if using ydotool method)
sudo groupadd -f input
sudo usermod -aG input $USER
echo 'KERNEL=="uinput", GROUP="input", MODE="0660"' | sudo tee /etc/udev/rules.d/80-uinput.rules
sudo udevadm control --reload-rules
sudo udevadm trigger /dev/uinput
# Log out and back in for group membership to take effect
TEN VAD system dependency
TEN VAD requires libc++1 (C++ standard library). The .deb package installs this automatically. If running from source:
sudo apt install libc++1
On some systems, the libc++.so.1 symlink may be missing. If you get an error about libc++.so.1, create the symlink:
sudo ln -sf /lib/x86_64-linux-gnu/libc++.so.1.0.* /lib/x86_64-linux-gnu/libc++.so.1
sudo ln -sf /lib/x86_64-linux-gnu/libc++abi.so.1.0.* /lib/x86_64-linux-gnu/libc++abi.so.1
sudo ldconfig
Usage
The app runs as a system tray indicator with an optional full-size main window.
- Tray icon: Right-click for start/stop, model switching, settings, and to open the main window.
- Main window: Open from the tray menu ("Show Window"). Provides a transcript view, start/stop/pause controls, and a microphone selector. Closing the window hides it back to tray.
Default hotkeys
| Action | Default | Description |
|---|---|---|
| Toggle | Ctrl+0 | Start/stop dictation |
| Start | Ctrl+9 | Start only (start/stop mode) |
| Stop | Ctrl+8 | Stop only (start/stop mode) |
| Pause | Ctrl+Alt+0 | Pause/resume without stopping engine |
Hotkey modes
- Toggle mode (default): One key starts and stops dictation
- Start/Stop mode: Separate keys for starting and stopping
All hotkeys are rebindable from Settings > Hotkeys.
Microphone selection
Choose your input device from:
- Settings > General > Microphone (persisted)
- Main window header bar mic selector (quick switch)
Leave set to "System Default" to use your desktop's default input device.
Model manager
Open Settings > Models to browse available model profiles, download them in-app, and select which one to use.
Audio feedback
- Rising tone (880 Hz) — dictation started
- Falling tone (440 Hz) — dictation stopped
- Double beep (660 Hz) — paused/resumed
Night mode
Automatically suppresses audio feedback between configurable hours (default 22:00–09:00). Enable from Settings > General.
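Because the default window crosses midnight, a plain `start <= hour < end` comparison is not enough. A minimal sketch of the check follows, assuming whole-hour settings like the `night_start` and `night_end` values in the config file. The function name is hypothetical, and treating the end hour as exclusive (09:00 counts as daytime) is an assumption.

```python
def is_night(hour: int, night_start: int = 22, night_end: int = 9) -> bool:
    """Return True if `hour` (0-23) falls inside the quiet window.

    The window is [night_start, night_end) and may wrap past midnight.
    """
    if night_start == night_end:
        # Assumption: identical hours mean an empty window.
        return False
    if night_start < night_end:
        # Window within a single day, e.g. 01:00 to 06:00.
        return night_start <= hour < night_end
    # Window wraps midnight, e.g. 22:00 to 09:00.
    return hour >= night_start or hour < night_end


if __name__ == "__main__":
    from datetime import datetime

    if is_night(datetime.now().hour):
        print("Night mode window: audio feedback would be suppressed")
    else:
        print("Daytime: audio feedback would play")
```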
How it works
- TEN VAD detects speech segments in real time (~306 KB, lightweight)
- When you pause speaking, the completed segment is sent to the ASR model
- Transcribed text (with punctuation) is typed into the focused window via wtype
The streaming profile uses frame-by-frame processing instead of VAD segmentation.
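The pause-based segmentation can be illustrated with a toy example. This is not TEN VAD's API: it is a generic thresholding sketch that assumes some detector has already produced one speech probability per audio frame. The 0.5 default mirrors `vad_threshold` in the config file; `min_silence_frames` is a hypothetical parameter standing in for "how long a pause ends a segment".

```python
from typing import Iterable, Iterator


def segment_speech(
    frame_probs: Iterable[float],
    threshold: float = 0.5,
    min_silence_frames: int = 3,
) -> Iterator[tuple[int, int]]:
    """Yield (start, end) frame indices, end exclusive, one pair per utterance.

    A segment opens at the first frame at or above the threshold and closes
    once `min_silence_frames` consecutive frames fall below it.
    """
    start = None      # first speech frame of the open segment, if any
    last_speech = -1  # most recent frame classified as speech
    for i, prob in enumerate(frame_probs):
        if prob >= threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_silence_frames:
            # Pause is long enough: hand the finished segment to the ASR model.
            yield (start, last_speech + 1)
            start = None
    if start is not None:
        # Input ended while a segment was still open.
        yield (start, last_speech + 1)


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7, 0.9]
    print(list(segment_speech(probs)))
```

Short dips below the threshold (frame 3 in the sample data) do not split a segment; only a sustained pause does, which is why text arrives shortly after you stop speaking rather than mid-sentence.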
Configuration
Settings stored in ~/.config/parakeet-dictation/config.json:
{
  "model_profile": "desktop",
  "num_threads": 4,
  "vad_threshold": 0.5,
  "beep_volume": 0.5,
  "audio_device": "",
  "typer": "wtype",
  "hotkey_mode": "toggle",
  "hotkey_toggle": "<ctrl>+0",
  "hotkey_start": "<ctrl>+9",
  "hotkey_stop": "<ctrl>+8",
  "hotkey_pause": "<ctrl>+<alt>+0",
  "night_mode": false,
  "night_start": 22,
  "night_end": 9
}
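If you want to read this file from your own scripts, something like the following should work. It is a sketch, not the app's loader: `load_config` and `DEFAULTS` are hypothetical names, the default values are copied from the example above, and falling back to defaults for missing keys while ignoring unknown ones is an assumption about sensible behavior.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".config" / "parakeet-dictation" / "config.json"

# Values copied from the example config in this README.
DEFAULTS = {
    "model_profile": "desktop",
    "num_threads": 4,
    "vad_threshold": 0.5,
    "beep_volume": 0.5,
    "audio_device": "",
    "typer": "wtype",
    "hotkey_mode": "toggle",
    "hotkey_toggle": "<ctrl>+0",
    "hotkey_start": "<ctrl>+9",
    "hotkey_stop": "<ctrl>+8",
    "hotkey_pause": "<ctrl>+<alt>+0",
    "night_mode": False,
    "night_start": 22,
    "night_end": 9,
}


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Return the defaults overlaid with whatever the config file provides."""
    config = dict(DEFAULTS)
    try:
        with path.open(encoding="utf-8") as f:
            loaded = json.load(f)
    except FileNotFoundError:
        # No config file yet: use the defaults unchanged.
        return config
    if isinstance(loaded, dict):
        # Keep only keys this sketch knows about.
        config.update({k: v for k, v in loaded.items() if k in DEFAULTS})
    return config


if __name__ == "__main__":
    cfg = load_config()
    print(cfg["model_profile"], cfg["typer"])
```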
Requirements
- Python 3.10+
- Linux with Wayland (validated on Ubuntu 25.04, KDE Plasma 6)
- ~500 MB – 2 GB RAM depending on model
- wtype (Wayland text input)
- libc++1 (for TEN VAD)
License
MIT