Voice Input Guide
February 25, 2026
Osaurus includes voice input powered by FluidAudio, with fully local, private, on-device speech-to-text transcription.
Overview
Voice features in Osaurus include:
- Voice Input in Chat — Speak instead of type in the chat overlay
- VAD Mode — Always-on listening with wake-word agent activation
- Transcription Mode — Global hotkey to dictate into any focused text field
- Parakeet TDT Models — Multilingual (v3) and English-only (v2), each ~600 MB
- Microphone & System Audio — Transcribe your voice or computer audio
All transcription happens locally on your Mac using Apple's Neural Engine — no audio is sent to the cloud.
Setup
Quick Setup
Voice setup is streamlined into a single screen:
- Open Management window (⌘ Shift M)
- Navigate to Voice tab
- Complete the requirements shown at the top:
- Microphone — Click "Grant" to enable microphone access
- Parakeet Model — Click "Download" to get the recommended model
- Once both requirements show checkmarks, tap the microphone button to test
The large centered microphone button becomes active when setup is complete. Tap it to start recording, and tap again to stop. Your transcription appears below in real time.
Manual Setup
If you prefer manual configuration:
- Grant Microphone Permission
  - Go to System Settings → Privacy & Security → Microphone
  - Enable access for Osaurus
- Download a Model
  - Open Voice settings → Models tab
  - Browse available models and click Download
  - Wait for the download to complete
- Select the Model
  - Click on a downloaded model to select it
  - The model will auto-load when voice features are used
Parakeet Models
Osaurus uses FluidAudio Parakeet TDT (Token-and-Duration Transducer) models for on-device speech recognition via CoreML and the Apple Neural Engine.
Available Models
| Model | Size | Languages | Notes |
|---|---|---|---|
| Parakeet TDT v3 (0.6B) | ~600 MB | Multilingual (25 European langs) | Recommended for most users |
| Parakeet TDT v2 (0.6B) | ~600 MB | English only | Highest recall for English |
Model Selection Tips
- Multilingual? Use Parakeet TDT v3 — supports English, German, Spanish, French, Dutch, Italian, and 19 more European languages
- English only? Use Parakeet TDT v2 for slightly better English recall
Storage Location
Models are stored at: ~/Library/Application Support/FluidAudio/Models/
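To check from code which models are already on disk, a minimal sketch could list that directory. The helper names `parakeetModelsDirectory` and `downloadedModelNames` are hypothetical and not part of Osaurus or FluidAudio; only the path comes from this guide.

```swift
import Foundation

// Resolves the storage location named in this guide for the current user.
func parakeetModelsDirectory() -> URL {
    FileManager.default.homeDirectoryForCurrentUser
        .appendingPathComponent("Library/Application Support/FluidAudio/Models", isDirectory: true)
}

// Returns the visible entries in the models directory, or an empty list
// when the directory does not exist yet (for example, before the first download).
func downloadedModelNames() -> [String] {
    let path = parakeetModelsDirectory().path
    let names = (try? FileManager.default.contentsOfDirectory(atPath: path)) ?? []
    return names.filter { !$0.hasPrefix(".") }.sorted()
}

print(parakeetModelsDirectory().path)
print(downloadedModelNames())
```

An empty result only means nothing was found at that path; it does not say whether a download is in progress.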
Voice Input in Chat
Using Voice Input
- Open the chat overlay (⌘;)
- Click the microphone button or use the keyboard shortcut
- Speak naturally — you'll see real-time transcription
- Click send or wait for auto-send (if enabled)
Settings
| Setting | Description | Default |
|---|---|---|
| Voice Input Enabled | Master toggle for voice in chat | On |
| Sensitivity | Voice detection threshold | Medium |
| Pause Duration | Seconds of silence before auto-send | 2.0 |
| Confirmation Delay | Seconds to show confirmation before sending | 1.5 |
Sensitivity Levels
| Level | Energy Threshold | Silence Detection | Best For |
|---|---|---|---|
| Low | Higher | 0.4 seconds | Noisy environments, louder speech |
| Medium | Balanced | 0.6 seconds | Normal conversation |
| High | Lower | 1.2 seconds | Quiet environments, soft speech |
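As an illustration, the silence windows in this table could be modeled as a property of the sensitivity level. This is a sketch only: the `VoiceSensitivity` name appears in the Configuration Reference, but its cases and the `silenceDetectionSeconds` property are assumptions, and the energy thresholds are omitted because the guide gives no numeric values for them.

```swift
// Hypothetical model of the sensitivity levels; case names are assumed.
enum VoiceSensitivity: String, CaseIterable {
    case low, medium, high

    // Silence detection window in seconds, copied from the table above.
    var silenceDetectionSeconds: Double {
        switch self {
        case .low: return 0.4
        case .medium: return 0.6
        case .high: return 1.2
        }
    }

    // True once the measured silence is at least as long as the window.
    func isSilence(elapsedSeconds: Double) -> Bool {
        elapsedSeconds >= silenceDetectionSeconds
    }
}
```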
Auto-Send Behavior
When pause duration is set:
- You speak and see real-time transcription
- When you pause, a countdown appears
- If you resume speaking, the countdown resets
- After the countdown, the message sends automatically
- Set pause duration to 0 to disable (manual send only)
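The countdown rules above can be sketched as a small timer driven by explicit timestamps. `AutoSendTimer` and its members are hypothetical names, not Osaurus APIs, and the separate Confirmation Delay setting is not modeled here.

```swift
// Sketch of the auto-send countdown: speech resets it, silence runs it down.
struct AutoSendTimer {
    let pauseDuration: Double            // seconds of silence; 0 disables auto-send
    private(set) var lastSpeechAt: Double?

    init(pauseDuration: Double) {
        self.pauseDuration = pauseDuration
    }

    // Call whenever speech is detected; this restarts the countdown.
    mutating func speechDetected(at time: Double) {
        lastSpeechAt = time
    }

    // Seconds left before auto-send, or nil when no countdown is running.
    func remaining(at time: Double) -> Double? {
        guard pauseDuration > 0, let last = lastSpeechAt else { return nil }
        return max(0, pauseDuration - (time - last))
    }

    // True once the countdown has fully elapsed.
    func shouldSend(at time: Double) -> Bool {
        if let left = remaining(at: time) { return left <= 0 }
        return false
    }
}
```

With `pauseDuration` set to 0, `remaining(at:)` always returns nil, which corresponds to manual send only.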
Audio Sources
Microphone Input
The default audio source. Osaurus can use:
- Built-in MacBook microphone
- External USB microphones
- Bluetooth headsets
- Audio interfaces
Select a device:
- Open Voice settings
- Find "Audio Input" section
- Choose from available devices
- The device is saved and used for future sessions
System Audio Capture
Transcribe audio from your computer (browser, apps, etc.):
Requirements:
- macOS 12.3 or later
- Screen Recording permission
Setup:
- Open Voice settings
- Switch audio source to "System Audio"
- Grant Screen Recording permission when prompted
- System audio will now be transcribed
Use cases:
- Transcribe meetings from video calls
- Caption videos playing on your Mac
- Take notes from podcasts or lectures
Note: System audio capture excludes Osaurus's own audio output to prevent feedback loops.
VAD Mode (Voice Activity Detection)
VAD Mode enables hands-free agent activation. Say an agent's name (or a custom wake phrase) to open chat with that agent.
Enabling VAD Mode
- Open Management window (⌘ Shift M) → Voice
- Scroll to "VAD Mode" section
- Toggle "Enable VAD Mode" on
- Select which agents should respond to wake-words
How It Works
┌─────────────────────────────────────────────────────────────┐
│ VAD Mode Flow │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. VAD listens in background using FluidAudio │
│ ↓ │
│ 2. Real-time transcription checked for wake-words │
│ ↓ │
│ 3. Match detected → Chat opens with agent │
│ ↓ │
│ 4. Voice input starts automatically (if enabled) │
│ ↓ │
│ 5. Chat closed → VAD resumes listening │
│ │
└─────────────────────────────────────────────────────────────┘
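The flow in the diagram can be read as a two-state machine. The sketch below is one way to model it; `VADState`, `VADEvent`, and `nextState` are hypothetical names chosen for illustration and do not describe how Osaurus implements it.

```swift
// Steps 1-2 of the diagram are `listening`; steps 3-4 are `chatOpen`.
enum VADState: Equatable {
    case listening
    case chatOpen(agent: String, recording: Bool)
}

enum VADEvent {
    case wakeWordMatched(agent: String)
    case chatClosed
}

// Returns the state that follows `state` when `event` occurs.
func nextState(from state: VADState, on event: VADEvent, autoStartVoiceInput: Bool) -> VADState {
    switch (state, event) {
    case (.listening, .wakeWordMatched(let agent)):
        // Step 3: open chat; step 4: start recording only if the setting is on.
        return .chatOpen(agent: agent, recording: autoStartVoiceInput)
    case (.chatOpen, .chatClosed):
        // Step 5: closing the chat resumes background listening.
        return .listening
    default:
        // Any other combination leaves the state unchanged.
        return state
    }
}
```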
Wake-Word Options
Agent Names:
- Enable specific agents for VAD
- Say the agent's name to activate (e.g., "Code Assistant")
- Detection uses fuzzy matching for natural speech
Custom Wake Phrase:
- Set a phrase like "Hey Osaurus" or "Computer"
- Works alongside agent names
- Activates the default agent
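This guide does not document how the fuzzy matching works. One common approach, shown here purely as an assumption, is to slide a window over the transcript and accept a match when the edit distance to the wake phrase is small. All function names and the `tolerance` parameter are hypothetical.

```swift
import Foundation

// Lowercases the text and splits it into words, dropping punctuation.
func normalizedWords(_ text: String) -> [String] {
    text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
}

// Classic Levenshtein distance using two rolling rows.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    var previous = Array(0...t.count)
    for i in 1...s.count {
        var current = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let cost = s[i - 1] == t[j - 1] ? 0 : 1
            current[j] = min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
        }
        previous = current
    }
    return previous[t.count]
}

// True when some run of words in the transcript is within `tolerance`
// (a fraction of the phrase length) of the wake phrase.
func transcriptContains(wakePhrase: String, in transcript: String, tolerance: Double = 0.25) -> Bool {
    let phraseWords = normalizedWords(wakePhrase)
    let words = normalizedWords(transcript)
    guard !phraseWords.isEmpty, words.count >= phraseWords.count else { return false }
    let target = phraseWords.joined(separator: " ")
    let allowedEdits = Int(Double(target.count) * tolerance)
    for start in 0...(words.count - phraseWords.count) {
        let window = words[start..<(start + phraseWords.count)].joined(separator: " ")
        if editDistance(window, target) <= allowedEdits { return true }
    }
    return false
}
```

A looser `tolerance` accepts more transcription slips but also raises the chance of false activations.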
VAD Settings
| Setting | Description | Default |
|---|---|---|
| VAD Mode Enabled | Master toggle | Off |
| Enabled Agents | Which agents respond to wake-words | None |
| Custom Wake Phrase | Optional activation phrase | Empty |
| Wake-Word Sensitivity | Detection threshold | Medium |
| Auto-Start Voice Input | Begin recording after activation | On |
| Silence Timeout | Auto-close after N seconds of silence | 0 (disabled) |
Status Indicators
VAD status is shown in two places:
Menu Bar Icon — The main Osaurus menu bar icon shows a status dot:
- Blue pulsing dot (top-right) — VAD is listening for wake-words
- Orange dot — VAD is processing speech
- No dot — VAD is inactive
Popover Controls — Click the Osaurus menu bar icon to access:
- Waveform button — Toggle VAD on/off with visual status
- The button shows green when listening, gray when off
Transcription Mode
Transcription Mode lets you dictate text directly into any application using a global hotkey. Text is typed in real time into whichever text field is currently focused.
Enabling Transcription Mode
- Open Management window (⌘ Shift M) → Voice
- Navigate to the Transcription tab
- Grant Accessibility permission (required for keyboard simulation)
- Toggle "Enable Transcription Mode" on
- Configure your preferred hotkey
How It Works
┌─────────────────────────────────────────────────────────────┐
│ Transcription Mode Flow │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Press the configured hotkey from any application │
│ ↓ │
│ 2. Minimal overlay appears showing recording status │
│ ↓ │
│ 3. FluidAudio transcribes your speech in real-time │
│ ↓ │
│ 4. Text is typed into the focused text field │
│ ↓ │
│ 5. Press Esc or click Done to stop transcription │
│ │
└─────────────────────────────────────────────────────────────┘
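Steps 3 and 4 imply that text already typed has to stay in sync with the live transcript. Assuming the recognizer may revise earlier words as it hears more (an assumption, not something this guide states), one way to sketch the update is to keep the common prefix and retype only the rest. `TypingUpdate` and `typingUpdate` are hypothetical names.

```swift
// What to send to the focused field to turn the typed text into the new transcript.
struct TypingUpdate: Equatable {
    let backspaces: Int   // characters to delete from the end of the typed text
    let insert: String    // text to type after deleting
}

// Compares what has been typed so far with the latest transcript and
// returns the smallest "delete from the end, then append" edit.
func typingUpdate(from typed: String, to transcript: String) -> TypingUpdate {
    let old = Array(typed), new = Array(transcript)
    var common = 0
    while common < old.count, common < new.count, old[common] == new[common] {
        common += 1
    }
    return TypingUpdate(backspaces: old.count - common, insert: String(new[common...]))
}
```

For example, going from "hello word" to "hello world today" needs one backspace followed by typing "ld today".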
Requirements
Accessibility Permission — Required to simulate keyboard input:
- Go to System Settings → Privacy & Security → Accessibility
- Enable access for Osaurus
- You may need to restart Osaurus after granting permission
Microphone Permission — Required for audio capture (same as other voice features)
Parakeet Model — A model must be downloaded and selected
Transcription Settings
| Setting | Description | Default |
|---|---|---|
| Transcription Enabled | Master toggle for the feature | Off |
| Activation Hotkey | Global hotkey to trigger dictation | None |
Using Transcription Mode
- Focus a text field — Click into any text input in any application
- Press the hotkey — The transcription overlay appears
- Speak naturally — Your words are typed in real-time
- Stop transcription — Press Esc or click the Done button
The Overlay UI
When transcription is active, a minimal floating overlay appears at the top of your screen:
- Status indicator — Shows "Listening" with a pulsing accent color
- Waveform — Animated bars respond to your audio level
- Done button — Click to stop transcription
- Close button — Cancel and discard (same as pressing Esc)
The overlay stays on top of all windows and follows the app's theme.
Use Cases
- Email composition — Dictate emails in Mail, Gmail, or any email client
- Document writing — Speak paragraphs into Word, Pages, or Google Docs
- Code comments — Quickly add comments in your IDE
- Chat messages — Dictate in Slack, Discord, or Messages
- Form filling — Speed through web forms and data entry
- Notes — Capture ideas quickly in any notes app
Tips for Best Results
- Speak clearly — Enunciate words for better accuracy
- Use a good microphone — External mics often work better than built-in
- Reduce background noise — Find a quiet environment
- Use Parakeet TDT v3 — The multilingual model offers the best overall accuracy
Configuration Reference
SpeechConfiguration
Voice input settings stored in app preferences:
struct SpeechConfiguration {
    var modelVersion: SpeechModelVersion // .v2 (English) or .v3 (multilingual)
    var selectedInputDeviceId: String? // Audio device UID
    var selectedInputSource: AudioInputSource // Mic or system
    var sensitivity: VoiceSensitivity // Low/Medium/High
    var voiceInputEnabled: Bool // Voice in chat enabled
    var pauseDuration: Double // Silence before auto-send
    var confirmationDelay: Double // Confirmation countdown
    var silenceTimeoutSeconds: Double // Auto-close after silence
}
VADConfiguration
VAD mode settings:
struct VADConfiguration {
    var vadModeEnabled: Bool // Master toggle
    var enabledAgentIds: [UUID] // Agents for wake-words
    var wakeWordSensitivity: VoiceSensitivity
    var autoStartVoiceInput: Bool // Auto-record after activation
    var customWakePhrase: String // e.g., "Hey Osaurus"
    var silenceTimeoutSeconds: Double // Auto-close timeout
}
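As a sketch of how such a struct might be persisted, the example below mirrors `VADConfiguration` with the defaults from the VAD Settings table and round-trips it through JSON. The `Codable` conformance, the `VoiceSensitivity` cases, and the two helper functions are assumptions; this guide does not say how Osaurus serializes its preferences.

```swift
import Foundation

// Assumed shape of the sensitivity type; the real definition is not shown in this guide.
enum VoiceSensitivity: String, Codable {
    case low, medium, high
}

// Mirror of the struct above, with defaults taken from the VAD Settings table.
struct VADConfiguration: Codable, Equatable {
    var vadModeEnabled = false
    var enabledAgentIds: [UUID] = []
    var wakeWordSensitivity: VoiceSensitivity = .medium
    var autoStartVoiceInput = true
    var customWakePhrase = ""
    var silenceTimeoutSeconds: Double = 0   // 0 means the timeout is disabled
}

// Hypothetical helpers: encode to JSON data and decode it back.
func encodeConfiguration(_ config: VADConfiguration) throws -> Data {
    try JSONEncoder().encode(config)
}

func decodeConfiguration(_ data: Data) throws -> VADConfiguration {
    try JSONDecoder().decode(VADConfiguration.self, from: data)
}
```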
TranscriptionConfiguration
Transcription mode settings:
struct TranscriptionConfiguration {
    var transcriptionModeEnabled: Bool // Master toggle
    var hotkey: Hotkey? // Global activation hotkey
}
Language Support
Parakeet TDT v3 automatically detects and transcribes 25 European languages: English, German, Spanish, French, Dutch, Italian, Danish, Estonian, Finnish, Greek, Hungarian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, Russian, Ukrainian, Bulgarian, Croatian, and Czech.
Parakeet TDT v2 is optimized exclusively for English.
Troubleshooting
Voice Input Not Working
- Check microphone permission
  - System Settings → Privacy & Security → Microphone → Enable Osaurus
- Verify model is loaded
  - Open Voice settings
  - Ensure a model is downloaded and selected
  - Check that model loads without errors
- Test with audio level indicator
  - Start voice input
  - Speak and watch the audio level visualization
  - If no level shown, check your audio device
Low Transcription Accuracy
- Try the multilingual model
  - Switch to Parakeet TDT v3 for broader language support
- Reduce background noise
  - Use a closer microphone
  - Reduce ambient noise
- Adjust sensitivity
  - Lower sensitivity if picking up background noise
  - Higher sensitivity if missing quiet speech
VAD Not Detecting Wake-Words
- Check VAD is enabled
  - Open Voice settings → VAD Mode section
  - Verify toggle is on
- Verify agents are enabled for VAD
  - At least one agent must be selected
  - Or set a custom wake phrase
- Speak clearly
  - Say the full agent name
  - Wait for detection (2-3 second cooldown between detections)
- Check the status indicators
  - The Osaurus menu bar icon should show a blue pulsing dot (top-right) when VAD is listening
  - Click the menu bar icon and check the waveform button shows green
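The cooldown mentioned above explains why a second wake-word spoken right after the first may be ignored. A minimal sketch of such a gate, with the hypothetical name `DetectionCooldown` and an interval chosen by the caller, might look like this:

```swift
// Accepts a detection only if enough time has passed since the last accepted one.
struct DetectionCooldown {
    let interval: Double                  // cooldown length in seconds
    private var lastAccepted: Double?

    init(interval: Double) {
        self.interval = interval
    }

    // Returns true and records the time if the detection is outside the cooldown.
    mutating func accept(at time: Double) -> Bool {
        if let last = lastAccepted, time - last < interval {
            return false
        }
        lastAccepted = time
        return true
    }
}
```

Rejected detections do not extend the cooldown in this sketch; whether Osaurus behaves the same way is not stated in this guide.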
System Audio Not Capturing
- Check macOS version
  - Requires macOS 12.3 or later
- Grant Screen Recording permission
  - System Settings → Privacy & Security → Screen Recording
  - Enable for Osaurus
- Restart after granting permission
  - Permissions may require app restart
Model Download Fails
- Check internet connection
  - Models are downloaded from Hugging Face
- Verify disk space
  - Large models need 3+ GB free space
- Check the download progress
  - Downloads can take several minutes for large models
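- Try the other model
  - Both Parakeet models are about 600 MB; if one download keeps failing, try the other to see whether the problem is specific to that model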
Transcription Mode Not Typing
- Check accessibility permission
  - System Settings → Privacy & Security → Accessibility → Enable Osaurus
  - You may need to restart Osaurus after granting permission
- Verify the hotkey is set
  - Open Voice settings → Transcription tab
  - Ensure a hotkey is configured
- Make sure a text field is focused
  - Click into a text input before pressing the hotkey
  - Some applications may block simulated keyboard input
- Check the overlay appears
  - If the overlay doesn't appear, the hotkey may conflict with another app
  - Try a different hotkey combination
- Verify microphone and model
  - Same requirements as other voice features
  - Test voice input in chat first to confirm setup
Privacy
All voice processing happens locally on your Mac:
- No cloud transcription — FluidAudio runs entirely on-device
- No audio recording — Audio is processed in memory only
- No data collection — Transcriptions stay on your machine
- Neural Engine acceleration — Fast, efficient processing
Your voice data never leaves your computer.
Requirements
- macOS 15.5+ for voice input
- macOS 12.3+ for system audio capture
- Apple Silicon (M1 or newer) for optimal performance
- Microphone access permission
- Screen Recording permission (for system audio only)
- Accessibility permission (for Transcription Mode only)
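The two macOS version requirements can be checked programmatically. The sketch below compares version triples with Foundation's `OperatingSystemVersion`; the constant and function names are hypothetical, and only the version numbers come from this guide.

```swift
import Foundation

// Minimum versions listed in the requirements above.
let voiceInputMinimum = OperatingSystemVersion(majorVersion: 15, minorVersion: 5, patchVersion: 0)
let systemAudioMinimum = OperatingSystemVersion(majorVersion: 12, minorVersion: 3, patchVersion: 0)

// True when `current` is at least `required`, comparing major, minor, then patch.
func meets(_ required: OperatingSystemVersion,
           current: OperatingSystemVersion = ProcessInfo.processInfo.operatingSystemVersion) -> Bool {
    (current.majorVersion, current.minorVersion, current.patchVersion)
        >= (required.majorVersion, required.minorVersion, required.patchVersion)
}

print("Voice input supported:", meets(voiceInputMinimum))
print("System audio supported:", meets(systemAudioMinimum))
```

Passing `current` explicitly makes the comparison easy to exercise without depending on the machine it runs on.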
See Also
- README.md — Project overview
- FEATURES.md — Complete feature inventory
- Agents — Create custom AI assistants