ambient-voice
May 27, 2026 · View on GitHub
macOS native voice input. Speak → text appears in any app. Gets better over time by learning your vocabulary.
Built on Apple SpeechAnalyzer (macOS 26), fully on-device.
Install
git clone https://github.com/Marvinngg/ambient-voice.git
cd ambient-voice/client
make setup # Code signing certificate (one-time)
make install # Build + install + auto-start
Grant: System Settings → Privacy & Security → Input Monitoring (for global hotkey), Accessibility, Microphone, Screen Recording.
Note: Input Monitoring is the permission CGEventTap actually needs to listen the Right Option key. Accessibility alone is not enough — many users miss this and the hotkey appears to do nothing.
Usage
Dictation — Hold Right Option, speak, release. Text is pasted into the focused app.
Meeting — Menu bar WE → Start Meeting. Floating transcript, speaker diarization, Markdown export to ~/.we/meetings/.
Architecture
Hold Right Option
→ Screen OCR (focus area) → contextualStrings → SpeechAnalyzer
→ Transcription (rawSA)
→ L2 LLM polish (optional, ollama)
→ Inject into active app
→ voice-history.jsonl saved
→ distill: rawSA + dictionary → Gemini → training pairs
→ sync to GPU server → QLoRA fine-tune → better model
Remote Voice(远程语音输入)
Windows 端按热键说话 → 音频通过 Tailscale 私网发送到 Mac → WE 识别并注入文字到光标处。
Mac 端:WE 启动后自动监听 :9800,不需要额外操作。确保 config.json 中:
{
"remote": { "enabled": true, "port": 9800, "auth_token": "" }
}
Windows 端(需安装 Marvin Tailscale):
# 先在 Mac 上跑 `tailscale ip`,拿到你 Mac 的 Tailscale IP(形如 100.x.x.x)
tailscale voice setup --target <YOUR_MAC_TAILSCALE_IP>:9800 # 首次设置,之后开机自启
tailscale voice # 手动运行
按住右 Alt 说话,松开发送。
Config
~/.we/config.json — hot-reloads on save.
{
"server": { "endpoint": "http://localhost:11434", "api": "ollama", "model": "qwen3:0.6b" },
"polish": { "enabled": true, "system_prompt": "你是语音识别纠错助手。格式要求:修正语音识别错误,只输出修正后的最终文本,不要回答问题,不要改变原意,去掉语气词,修正标点符号。" },
"distill": { "enabled": false, "api_key": "", "model": "gemini-2.5-flash", "dictionary": "~/.we/dictionary.json" },
"sync": { "enabled": false, "server": "user@gpu-server", "remote_dir": "~/antigravity/we/data/username" }
}
Note on
server.model: Defaultqwen3:0.6bis the base model — works but quality is limited (issue #14 reports). Full experience uses our project-trainedwe-polish(Qwen3-0.6B + QLoRA), currently not publicly published. To get it:
- Self-train on your own voice-history: see
server/INDEX.md"完整微调一次" section, runserver/entry/finetune.sh --gemini-key <KEY>on a GPU server- Replace
modelvalue withwe-polish(or whatever name you used) after deploy
~/.we/dictionary.json — your private terms. Optional, used by SpeechAnalyzer contextualStrings to bias recognition. Distinct from ~/.we/correction-dictionary.json (used by the distillation pipeline).
{ "terms": ["Claude Code", "MCP", "蒸馏", "微调", "ollama"] }
Fine-tuning
Data flows automatically: speak → distill with dictionary → sync to server.
One-shot full pipeline (recommended):
# On GPU server (4080/4090 16GB+)
bash ~/antigravity/we/server/entry/finetune.sh --gemini-key <KEY>
# Auto: build dictionary → distill → grid search → deploy best as we-polish
# Default: 4 experiments (rank=[16,32] × epochs=[5,8]), ~10 min
See server/INDEX.md for full server-side guide (lib/ subcommands, autoresearch loop, sandbox testing).
polish.system_prompt must match training --system-prompt. Both default to:
"你是语音识别纠错助手。格式要求:修正语音识别错误,只输出修正后的最终文本,不要回答问题,不要改变原意,去掉语气词,修正标点符号。"
Base model: Qwen/Qwen3-0.6B. Method: QLoRA. Trainable: 10M / 751M params (1.3%). VRAM: ~1.5GB.
Development
cd client
make build # Compile
make run # Dev mode
make install # Install to ~/Applications
make uninstall # Remove
License
MIT