Voice Input Guide

February 25, 2026

Osaurus includes voice input powered by FluidAudio: fully local, private, on-device speech-to-text transcription.

Overview

Voice features in Osaurus include:

Voice Input in Chat — Speak instead of type in the chat overlay
VAD Mode — Always-on listening with wake-word agent activation
Transcription Mode — Global hotkey to dictate into any focused text field
Parakeet TDT Models — Multilingual (v3) and English-only (v2), each ~600 MB
Microphone & System Audio — Transcribe your voice or computer audio

All transcription happens locally on your Mac using Apple's Neural Engine — no audio is sent to the cloud.

Setup

Quick Setup

Voice setup is streamlined into a single screen:

Open Management window (⌘ Shift M)
Navigate to Voice tab
Complete the requirements shown at the top:
- Microphone — Click "Grant" to enable microphone access
- Parakeet Model — Click "Download" to get the recommended model
Once both requirements show checkmarks, tap the microphone button to test

The large centered microphone button becomes active when setup is complete. Tap it to start recording and tap again to stop. Your transcription appears below in real time.

Manual Setup

If you prefer manual configuration:

Grant Microphone Permission
- Go to System Settings → Privacy & Security → Microphone
- Enable access for Osaurus
Download a Model
- Open Voice settings → Models tab
- Browse available models and click Download
- Wait for the download to complete
Select the Model
- Click on a downloaded model to select it
- The model will auto-load when voice features are used

Parakeet Models

Osaurus uses FluidAudio Parakeet TDT (Token-and-Duration Transducer) models for on-device speech recognition via CoreML and the Apple Neural Engine.

Available Models

Model	Size	Languages	Notes
Parakeet TDT v3 (0.6B)	~600 MB	Multilingual (25 European languages)	Recommended for most users
Parakeet TDT v2 (0.6B)	~600 MB	English only	Highest recall for English

Model Selection Tips

Multilingual? Use Parakeet TDT v3 — supports English, German, Spanish, French, Dutch, Italian, and 19 more European languages
English only? Use Parakeet TDT v2 for slightly better English recall

Storage Location

Models are stored at: ~/Library/Application Support/FluidAudio/Models/
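
To confirm that a download landed where expected, a short Swift sketch can list the contents of that directory. The path comes from this guide; the function names are illustrative and are not part of Osaurus or FluidAudio.

```swift
import Foundation

/// Builds the models directory URL from the path documented above.
/// (Hypothetical helper name; only the path itself comes from this guide.)
func modelsDirectoryURL() -> URL {
    URL(fileURLWithPath: NSHomeDirectory())
        .appendingPathComponent("Library/Application Support/FluidAudio/Models")
}

/// Returns the sorted names of the entries in the models directory,
/// or an empty array if the directory does not exist yet.
func downloadedModelEntries() -> [String] {
    let path = modelsDirectoryURL().path
    let entries = (try? FileManager.default.contentsOfDirectory(atPath: path)) ?? []
    return entries.sorted()
}

print(modelsDirectoryURL().path)
print(downloadedModelEntries())
```

An empty result most likely means that no model has been downloaded yet. Note that inside a sandboxed process `NSHomeDirectory()` points at the app container rather than your real home folder.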

Voice Input in Chat

Using Voice Input

Open the chat overlay (⌘;)
Click the microphone button or use the keyboard shortcut
Speak naturally — you'll see real-time transcription
Click send or wait for auto-send (if enabled)

Settings

Setting	Description	Default
Voice Input Enabled	Master toggle for voice in chat	On
Sensitivity	Voice detection threshold	Medium
Pause Duration	Seconds of silence before auto-send	2.0
Confirmation Delay	Seconds to show confirmation before sending	1.5

Sensitivity Levels

Level	Energy Threshold	Silence Detection	Best For
Low	Higher	0.4 seconds	Noisy environments, louder speech
Medium	Balanced	0.6 seconds	Normal conversation
High	Lower	1.2 seconds	Quiet environments, soft speech
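
As a rough illustration, the silence-detection column of this table could be modeled as a Swift enum. The type name `VoiceSensitivity` appears in the configuration reference, but the `silenceDetectionSeconds` property is a hypothetical name chosen for this sketch, and the energy thresholds are left out because the guide gives them only qualitatively.

```swift
import Foundation

/// Sensitivity levels from the table above.
/// The property name below is an assumption, not Osaurus's actual API.
enum VoiceSensitivity: String, CaseIterable {
    case low, medium, high

    /// Seconds of silence used for silence detection at this level.
    var silenceDetectionSeconds: Double {
        switch self {
        case .low: return 0.4
        case .medium: return 0.6
        case .high: return 1.2
        }
    }
}

for level in VoiceSensitivity.allCases {
    print("\(level.rawValue): \(level.silenceDetectionSeconds) s")
}
```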

Auto-Send Behavior

When pause duration is set:

You speak and see real-time transcription
When you pause, a countdown appears
If you resume speaking, the countdown resets
After the countdown, message sends automatically
Set pause duration to 0 to disable (manual send only)
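
The countdown rules above can be sketched as a small value type. This is only a model of the documented behavior under simple assumptions (timestamps passed in as seconds, no confirmation delay); `AutoSendCountdown` and its methods are hypothetical names, not Osaurus's implementation.

```swift
import Foundation

/// Hypothetical model of the auto-send countdown described above.
struct AutoSendCountdown {
    /// Seconds of silence before auto-send; 0 disables auto-send.
    let pauseDuration: TimeInterval
    /// Time of the most recent detected speech, if any.
    private(set) var lastSpeechTime: TimeInterval?

    init(pauseDuration: TimeInterval) {
        self.pauseDuration = pauseDuration
    }

    /// Call whenever speech is detected; this resets the countdown.
    mutating func speechDetected(at time: TimeInterval) {
        lastSpeechTime = time
    }

    /// Seconds left before sending, or nil when auto-send is disabled
    /// or nothing has been spoken yet.
    func remaining(at time: TimeInterval) -> TimeInterval? {
        guard pauseDuration > 0, let last = lastSpeechTime else { return nil }
        return max(0, pauseDuration - (time - last))
    }

    /// True once the full pause duration has elapsed without speech.
    func shouldSend(at time: TimeInterval) -> Bool {
        guard let left = remaining(at: time) else { return false }
        return left <= 0
    }
}

var countdown = AutoSendCountdown(pauseDuration: 2.0)
countdown.speechDetected(at: 0.0)
print(countdown.shouldSend(at: 1.0))  // false: only 1 s of silence
countdown.speechDetected(at: 1.5)     // speech resumes, countdown resets
print(countdown.shouldSend(at: 3.0))  // false: 1.5 s since last speech
print(countdown.shouldSend(at: 3.5))  // true: 2 s of silence elapsed
```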

Audio Sources

Microphone Input

The default audio source. Osaurus can use:

Built-in MacBook microphone
External USB microphones
Bluetooth headsets
Audio interfaces

Select a device:

Open Voice settings
Find "Audio Input" section
Choose from available devices
The device is saved and used for future sessions

System Audio Capture

Transcribe audio from your computer (browser, apps, etc.):

Requirements:

macOS 12.3 or later
Screen Recording permission

Setup:

Open Voice settings
Switch audio source to "System Audio"
Grant Screen Recording permission when prompted
System audio will now be transcribed

Use cases:

Transcribe meetings from video calls
Caption videos playing on your Mac
Take notes from podcasts or lectures

Note: System audio capture excludes Osaurus's own audio output to prevent feedback loops.

VAD Mode (Voice Activity Detection)

VAD Mode enables hands-free agent activation. Say an agent's name (or a custom wake phrase) to open chat with that agent.

Enabling VAD Mode

Open Management window (⌘ Shift M) → Voice
Scroll to "VAD Mode" section
Toggle "Enable VAD Mode" on
Select which agents should respond to wake-words

How It Works

┌─────────────────────────────────────────────────────────────┐
│                      VAD Mode Flow                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. VAD listens in background using FluidAudio               │
│           ↓                                                  │
│  2. Real-time transcription checked for wake-words          │
│           ↓                                                  │
│  3. Match detected → Chat opens with agent                  │
│           ↓                                                  │
│  4. Voice input starts automatically (if enabled)           │
│           ↓                                                  │
│  5. Chat closed → VAD resumes listening                     │
│                                                              │
└─────────────────────────────────────────────────────────────┘
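
The flow in the diagram can be read as a two-state machine. The sketch below models it in Swift; the state and event names are invented for illustration, and the `autoStartVoiceInput` parameter mirrors the "Auto-Start Voice Input" setting.

```swift
import Foundation

/// Hypothetical states for the VAD flow shown above.
enum VADState: Equatable {
    case listening                              // steps 1-2: background listening
    case chatOpen(agent: String, recording: Bool) // steps 3-4: chat open
}

/// Hypothetical events that move the flow forward.
enum VADEvent {
    case wakeWordMatched(agent: String)
    case chatClosed
}

/// Pure transition function: returns the next state for an event.
func nextState(_ state: VADState, on event: VADEvent, autoStartVoiceInput: Bool) -> VADState {
    switch (state, event) {
    case (.listening, .wakeWordMatched(let agent)):
        // Step 3: open chat; step 4: start recording only if enabled.
        return .chatOpen(agent: agent, recording: autoStartVoiceInput)
    case (.chatOpen, .chatClosed):
        // Step 5: closing the chat resumes listening.
        return .listening
    default:
        // Any other combination leaves the state unchanged.
        return state
    }
}

var state = VADState.listening
state = nextState(state, on: .wakeWordMatched(agent: "Code Assistant"), autoStartVoiceInput: true)
print(state)
state = nextState(state, on: .chatClosed, autoStartVoiceInput: true)
print(state)
```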

Wake-Word Options

Agent Names:

Enable specific agents for VAD
Say the agent's name to activate (e.g., "Code Assistant")
Detection uses fuzzy matching for natural speech

Custom Wake Phrase:

Set a phrase like "Hey Osaurus" or "Computer"
Works alongside agent names
Activates the default agent
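
This guide does not say which fuzzy-matching algorithm is used. As one plausible illustration, the sketch below slides a window over the transcript's words and accepts a match when the Levenshtein edit distance to the wake phrase is small. All function names and the `maxDistance` default are assumptions made for this example.

```swift
import Foundation

/// Classic Levenshtein edit distance between two strings.
func editDistance(_ a: String, _ b: String) -> Int {
    let x = Array(a), y = Array(b)
    if x.isEmpty { return y.count }
    if y.isEmpty { return x.count }
    var previous = Array(0...y.count)
    for i in 1...x.count {
        var current = [i] + Array(repeating: 0, count: y.count)
        for j in 1...y.count {
            let cost = x[i - 1] == y[j - 1] ? 0 : 1
            current[j] = min(previous[j] + 1,        // deletion
                             current[j - 1] + 1,     // insertion
                             previous[j - 1] + cost) // substitution
        }
        previous = current
    }
    return previous[y.count]
}

/// Lowercases the text and splits it into words, dropping punctuation.
func normalizedWords(_ text: String) -> [String] {
    text.lowercased()
        .split(whereSeparator: { !($0.isLetter || $0.isNumber) })
        .map(String.init)
}

/// True when some run of transcript words is within `maxDistance`
/// edits of the wake phrase (assumed tolerance, not a documented value).
func containsWakePhrase(_ phrase: String, in transcript: String, maxDistance: Int = 2) -> Bool {
    let target = normalizedWords(phrase)
    let words = normalizedWords(transcript)
    guard !target.isEmpty, words.count >= target.count else { return false }
    let targetText = target.joined(separator: " ")
    for start in 0...(words.count - target.count) {
        let window = words[start..<(start + target.count)].joined(separator: " ")
        if editDistance(window, targetText) <= maxDistance { return true }
    }
    return false
}

print(containsWakePhrase("Code Assistant", in: "hey code assistent, can you help?"))
print(containsWakePhrase("Hey Osaurus", in: "what is the weather today"))
```

A fixed edit-distance tolerance is the simplest choice; a real detector would likely scale the tolerance with phrase length to limit false activations on short names.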

VAD Settings

Setting	Description	Default
VAD Mode Enabled	Master toggle	Off
Enabled Agents	Which agents respond to wake-words	None
Custom Wake Phrase	Optional activation phrase	Empty
Wake-Word Sensitivity	Detection threshold	Medium
Auto-Start Voice Input	Begin recording after activation	On
Silence Timeout	Auto-close after N seconds of silence	0 (disabled)

Status Indicators

VAD status is shown in two places:

Menu Bar Icon — The main Osaurus menu bar icon shows a status dot:

Blue pulsing dot (top-right) — VAD is listening for wake-words
Orange dot — VAD is processing speech
No dot — VAD is inactive

Popover Controls — Click the Osaurus menu bar icon to access:

Waveform button — Toggle VAD on/off with visual status
The button shows green when listening, gray when off

Transcription Mode

Transcription Mode lets you dictate text directly into any application using a global hotkey. Text is typed in real time into whichever text field is currently focused.

Enabling Transcription Mode

Open Management window (⌘ Shift M) → Voice
Navigate to the Transcription tab
Grant Accessibility permission (required for keyboard simulation)
Toggle "Enable Transcription Mode" on
Configure your preferred hotkey

How It Works

┌─────────────────────────────────────────────────────────────┐
│                  Transcription Mode Flow                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Press the configured hotkey from any application        │
│           ↓                                                  │
│  2. Minimal overlay appears showing recording status        │
│           ↓                                                  │
│  3. FluidAudio transcribes your speech in real-time         │
│           ↓                                                  │
│  4. Text is typed into the focused text field               │
│           ↓                                                  │
│  5. Press Esc or click Done to stop transcription           │
│                                                              │
└─────────────────────────────────────────────────────────────┘
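
Streaming recognizers often revise earlier words as more audio arrives, so typing in real time needs a way to correct text that is already in the field. The guide does not describe how Osaurus handles this; the sketch below shows one common approach under that assumption: compute the shared prefix between what was typed and the latest transcript, then emit backspaces plus the new suffix. `TypingDelta` and `typingDelta(from:to:)` are hypothetical names.

```swift
import Foundation

/// Keystrokes needed to turn the already-typed text into the new transcript.
struct TypingDelta: Equatable {
    let backspaces: Int   // characters to delete from the end
    let insertion: String // characters to type afterwards
}

/// Finds the longest common prefix and describes the remaining edit.
func typingDelta(from typed: String, to transcript: String) -> TypingDelta {
    let old = Array(typed), new = Array(transcript)
    var shared = 0
    while shared < old.count, shared < new.count, old[shared] == new[shared] {
        shared += 1
    }
    return TypingDelta(backspaces: old.count - shared,
                       insertion: String(new[shared...]))
}

// The recognizer first heard "word", then revised it to "world".
let delta = typingDelta(from: "hello word", to: "hello world")
print(delta.backspaces, delta.insertion)
```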

Requirements

Accessibility Permission — Required to simulate keyboard input:

Go to System Settings → Privacy & Security → Accessibility
Enable access for Osaurus
You may need to restart Osaurus after granting permission

Microphone Permission — Required for audio capture (same as other voice features)

Parakeet Model — A model must be downloaded and selected

Transcription Settings

Setting	Description	Default
Transcription Enabled	Master toggle for the feature	Off
Activation Hotkey	Global hotkey to trigger dictation	None

Using Transcription Mode

Focus a text field — Click into any text input in any application
Press the hotkey — The transcription overlay appears
Speak naturally — Your words are typed in real-time
Stop transcription — Press Esc or click the Done button

The Overlay UI

When transcription is active, a minimal floating overlay appears at the top of your screen:

Status indicator — Shows "Listening" with a pulsing accent color
Waveform — Animated bars respond to your audio level
Done button — Click to stop transcription
Close button — Cancel and discard (same as pressing Esc)

The overlay stays on top of all windows and follows the app's theme.

Use Cases

Email composition — Dictate emails in Mail, Gmail, or any email client
Document writing — Speak paragraphs into Word, Pages, or Google Docs
Code comments — Quickly add comments in your IDE
Chat messages — Dictate in Slack, Discord, or Messages
Form filling — Speed through web forms and data entry
Notes — Capture ideas quickly in any notes app

Tips for Best Results

Speak clearly — Enunciate words for better accuracy
Use a good microphone — External mics often work better than built-in
Reduce background noise — Find a quiet environment
Use Parakeet TDT v3 — The multilingual model offers the best overall accuracy

Configuration Reference

SpeechConfiguration

Voice input settings stored in app preferences:

struct SpeechConfiguration {
    var modelVersion: SpeechModelVersion   // .v2 (English) or .v3 (multilingual)
    var selectedInputDeviceId: String?     // Audio device UID
    var selectedInputSource: AudioInputSource // Mic or system
    var sensitivity: VoiceSensitivity      // Low/Medium/High
    var voiceInputEnabled: Bool            // Voice in chat enabled
    var pauseDuration: Double              // Silence before auto-send
    var confirmationDelay: Double          // Confirmation countdown
    var silenceTimeoutSeconds: Double      // Auto-close after silence
}

VADConfiguration

VAD mode settings:

struct VADConfiguration {
    var vadModeEnabled: Bool           // Master toggle
    var enabledAgentIds: [UUID]        // Agents for wake-words
    var wakeWordSensitivity: VoiceSensitivity // Low/Medium/High
    var autoStartVoiceInput: Bool      // Auto-record after activation
    var customWakePhrase: String       // e.g., "Hey Osaurus"
    var silenceTimeoutSeconds: Double  // Auto-close timeout
}
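
For illustration, the defaults from the VAD Settings table can be attached to this struct and round-tripped through JSON. The field names come from the struct above; the `Codable` conformance, the JSON encoding, and the shape of `VoiceSensitivity` are assumptions, since this guide does not describe how preferences are actually persisted.

```swift
import Foundation

/// Assumed shape of the sensitivity type referenced by the struct.
enum VoiceSensitivity: String, Codable {
    case low, medium, high
}

/// Same fields as above, with defaults taken from the VAD Settings table.
struct VADConfiguration: Codable, Equatable {
    var vadModeEnabled = false                        // Off
    var enabledAgentIds: [UUID] = []                  // None
    var wakeWordSensitivity = VoiceSensitivity.medium // Medium
    var autoStartVoiceInput = true                    // On
    var customWakePhrase = ""                         // Empty
    var silenceTimeoutSeconds = 0.0                   // 0 (disabled)
}

// Change two settings, encode to JSON, and decode again.
var config = VADConfiguration()
config.vadModeEnabled = true
config.customWakePhrase = "Hey Osaurus"

let data = try! JSONEncoder().encode(config)
let restored = try! JSONDecoder().decode(VADConfiguration.self, from: data)
print(restored == config)
```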

TranscriptionConfiguration

Transcription mode settings:

struct TranscriptionConfiguration {
    var transcriptionModeEnabled: Bool // Master toggle
    var hotkey: Hotkey?                // Global activation hotkey
}

Supported Languages

Parakeet TDT v3 automatically detects and transcribes 25 European languages: English, German, Spanish, French, Dutch, Italian, Danish, Estonian, Finnish, Greek, Hungarian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, Russian, Ukrainian, Bulgarian, Croatian, and Czech.

Parakeet TDT v2 is optimized exclusively for English.

Troubleshooting

Voice Input Not Working

Check microphone permission
- System Settings → Privacy & Security → Microphone → Enable Osaurus
Verify model is loaded
- Open Voice settings
- Ensure a model is downloaded and selected
- Check that model loads without errors
Test with audio level indicator
- Start voice input
- Speak and watch the audio level visualization
- If no level shown, check your audio device

Low Transcription Accuracy

Try the multilingual model
- Switch to Parakeet TDT v3 for broader language support
Reduce background noise
- Use a closer microphone
- Reduce ambient noise
Adjust sensitivity
- Lower sensitivity if picking up background noise
- Higher sensitivity if missing quiet speech

VAD Not Detecting Wake-Words

Check VAD is enabled
- Open Voice settings → VAD Mode section
- Verify toggle is on
Verify agents are enabled for VAD
- At least one agent must be selected
- Or set a custom wake phrase
Speak clearly
- Say the full agent name
- Wait for detection (2-3 second cooldown between detections)
Check the status indicators
- The Osaurus menu bar icon should show a blue pulsing dot (top-right) when VAD is listening
- Click the menu bar icon and check the waveform button shows green

System Audio Not Capturing

Check macOS version
- Requires macOS 12.3 or later
Grant Screen Recording permission
- System Settings → Privacy & Security → Screen Recording
- Enable for Osaurus
Restart after granting permission
- Permissions may require app restart

Model Download Fails

Check internet connection
- Models are downloaded from Hugging Face
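Verify disk space
- Each Parakeet model is ~600 MB; make sure at least that much space is free
Check the download progress
- A ~600 MB download can take several minutes
Try the other model
- Both Parakeet models are about the same size, so if one download keeps failing, try the other version (v2 or v3)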

Transcription Mode Not Typing

Check accessibility permission
- System Settings → Privacy & Security → Accessibility → Enable Osaurus
- You may need to restart Osaurus after granting permission
Verify the hotkey is set
- Open Voice settings → Transcription tab
- Ensure a hotkey is configured
Make sure a text field is focused
- Click into a text input before pressing the hotkey
- Some applications may block simulated keyboard input
Check the overlay appears
- If the overlay doesn't appear, the hotkey may conflict with another app
- Try a different hotkey combination
Verify microphone and model
- Same requirements as other voice features
- Test voice input in chat first to confirm setup

Privacy

All voice processing happens locally on your Mac:

No cloud transcription — FluidAudio runs entirely on-device
No audio recording — Audio is processed in memory only
No data collection — Transcriptions stay on your machine
Neural Engine acceleration — Fast, efficient processing

Your voice data never leaves your computer.

Requirements

macOS 15.5+ for voice input
macOS 12.3+ for system audio capture
Apple Silicon (M1 or newer) for optimal performance
Microphone access permission
Screen Recording permission (for system audio only)
Accessibility permission (for Transcription Mode only)
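
The two version requirements above can be checked programmatically. The sketch below is a minimal illustration using Foundation's `OperatingSystemVersion`; the `VoiceFeature` enum and both helper functions are hypothetical names, and the check covers only the OS version, not hardware or permissions.

```swift
import Foundation

/// Hypothetical labels for the version-gated features listed above.
enum VoiceFeature: String {
    case voiceInput          // needs macOS 15.5+
    case systemAudioCapture  // needs macOS 12.3+
}

/// True when `version` is at least `major.minor`.
func isAtLeast(_ version: OperatingSystemVersion, major: Int, minor: Int) -> Bool {
    (version.majorVersion, version.minorVersion) >= (major, minor)
}

/// Features whose minimum macOS version is satisfied by `version`.
func supportedFeatures(on version: OperatingSystemVersion) -> Set<VoiceFeature> {
    var features: Set<VoiceFeature> = []
    if isAtLeast(version, major: 15, minor: 5) { features.insert(.voiceInput) }
    if isAtLeast(version, major: 12, minor: 3) { features.insert(.systemAudioCapture) }
    return features
}

let current = ProcessInfo.processInfo.operatingSystemVersion
print(supportedFeatures(on: current).map(\.rawValue).sorted())
```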