Testing & QA Report
April 4, 2026
Comprehensive testing of ohr's capabilities and limitations.
Test environment: macOS 26.3.1, Apple M2, 24 GB RAM, ohr 0.1.3
Test Methodology
Tests were run using synthetic speech generated by macOS say command (not real human recordings). This means accuracy numbers represent a best-case scenario — real-world accuracy with natural speech, accents, background noise, and multiple speakers will be lower.
Audio files were generated at various lengths (5s to 10min) and in all supported formats (m4a, wav, mp3, aiff, flac).
Performance
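ohr transcribes at roughly 40-58x real-time on Apple M2 for clips of about 20 seconds or longer. Very short clips are slower (8x at 2.5s) because of fixed startup cost.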
| Audio Length | File Size | Segments | Words | Transcribe Time | Speed |
|---|---|---|---|---|---|
| 2.5s | 111 KB | 2 | 6 | 300ms | 8x |
| 23.5s | 1.0 MB | 21 | 60 | 600ms | 39x |
| 69.7s (~1 min) | 2.9 MB | 45 | 157 | 1.5s | 46x |
| 145.2s (~2.5 min) | 6.1 MB | 119 | 311 | 2.5s | 58x |
| 267.8s (~4.5 min) | 11 MB | 201 | 568 | 4.7s | 57x |
Key findings:
- Transcription time grows sub-linearly with audio length, so longer files reach a higher real-time factor (8x at 2.5s, 57-58x beyond two minutes)
- 10 minutes of audio transcribes in under 5 seconds
- Processing time is dominated by model loading for short files (~300ms overhead)
- No upper limit found in testing (10 minutes worked fine, longer likely works too)
Server performance: Similar to CLI. The 10-minute file transcribed in ~6.2s via HTTP (includes multipart parsing + network overhead).
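
The Speed column is the audio length divided by the transcribe time. The sketch below recomputes it from the table rows and also estimates a marginal speed between the shortest and longest clip, which cancels out any fixed startup cost. The function names are illustrative and not part of ohr.

```python
# (audio_seconds, transcribe_seconds) copied from the performance table above
MEASUREMENTS = [
    (2.5, 0.3),
    (23.5, 0.6),
    (69.7, 1.5),
    (145.2, 2.5),
    (267.8, 4.7),
]


def realtime_factor(audio_s: float, transcribe_s: float) -> float:
    """How many seconds of audio are processed per second of wall time."""
    return audio_s / transcribe_s


def marginal_factor(first: tuple, last: tuple) -> float:
    """Speed between two measurements; a fixed startup cost cancels out."""
    (a0, t0), (a1, t1) = first, last
    return (a1 - a0) / (t1 - t0)


if __name__ == "__main__":
    for audio_s, transcribe_s in MEASUREMENTS:
        factor = realtime_factor(audio_s, transcribe_s)
        print(f"{audio_s:7.1f}s audio -> {factor:5.1f}x real-time")
    overall = marginal_factor(MEASUREMENTS[0], MEASUREMENTS[-1])
    print(f"marginal speed, shortest to longest clip: {overall:.1f}x")
```

The marginal figure is only as precise as the rounded timings in the table, so treat it as a rough indication.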
Format Compatibility
All 5 supported formats produce usable transcription output:
| Format | Extension | Tested Size | Result |
|---|---|---|---|
| Apple M4A (AAC) | .m4a | 111 KB – 11 MB | Works |
| WAV (PCM) | .wav | 1.1 MB | Works |
| MP3 | .mp3 | 209 KB | Works |
| AIFF | .aiff, .aif | 665 KB | Works |
| FLAC | .flac | 582 KB | Works |
Not supported: OGG, OPUS, WebM audio. These return an error.
Stdin piping: Works for all formats. ohr detects the audio format from magic bytes in the file header (RIFF/WAVE for WAV, ID3/sync for MP3, ftyp for M4A, etc.).
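
As an illustration of magic-byte detection, here is a minimal sketch that classifies the five supported formats from the first bytes of a stream. It is not ohr's implementation. The function name and the return values are assumptions, and the byte signatures are the standard container markers.

```python
from typing import Optional


def detect_audio_format(header: bytes) -> Optional[str]:
    """Guess the audio container from the first 12 bytes of a stream.

    Returns None when no known signature matches (for example OGG).
    """
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "aiff"
    if header[:4] == b"fLaC":
        return "flac"
    if header[4:8] == b"ftyp":
        # Any ISO base media file matches here, not only audio-only M4A.
        return "m4a"
    if header[:3] == b"ID3":
        return "mp3"
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        # MPEG audio frame sync: 11 consecutive set bits.
        return "mp3"
    return None
```

The frame-sync check is deliberately loose. Raw ADTS AAC streams begin with a similar pattern, so a production detector would need further checks.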
Transcription Accuracy
Tested with macOS say command (synthetic speech). Results represent best-case accuracy.
Accuracy observations
| Category | Accuracy | Notes |
|---|---|---|
| Clear synthetic speech | ~90-95% | Most words correct |
| Numbers (digits) | Good | "12345" → "12345" |
| Numbers (spoken) | Poor | "five second" → "52nd", "thirty second" → "32nd" |
| Technical terms | Variable | "ohr" → "or" (unknown word) |
| Punctuation | Good | Periods and commas placed reasonably |
| Timestamps | Excellent | Sub-second precision, consistent |
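
The report does not state how the accuracy percentages were computed. One common way to quantify transcription accuracy is word error rate (WER): the word-level edit distance between reference and transcript, divided by the reference length. The sketch below is an assumption about how such a number could be measured, not the method used for the table above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from transcript
            insertion = cur[j - 1] + 1  # extra word in transcript
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Applied to the report's own example, `word_error_rate("five second test", "52nd test")` counts one substitution and one deletion against three reference words.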
Known accuracy issues
- Ordinal/number confusion — "five second test" becomes "52nd test". The model interprets spoken numbers as ordinals or digits unpredictably.
- Unknown words — Words not in the model's vocabulary (like "ohr") are transcribed as phonetically similar known words ("or").
- Segment boundaries — Segments sometimes split mid-phrase. Leading spaces appear on continuation segments.
- Synthetic speech bias: The `say` command produces unnaturally even speech. Real human speech with pauses, filler words, and varied intonation may produce different results.
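
The leading spaces on continuation segments matter when segment texts are joined into one transcript. A minimal post-processing sketch, assuming the segments are available as a list of strings (the function name is hypothetical):

```python
def join_segments(segments: list) -> str:
    """Join segment texts into one string with single spaces between them.

    Strips the leading/trailing whitespace seen on continuation segments
    and drops segments that are empty after stripping.
    """
    cleaned = (segment.strip() for segment in segments)
    return " ".join(segment for segment in cleaned if segment)
```

This only normalizes whitespace. It does not repair segments that were split mid-phrase.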
What we could NOT test
- Real human speech — All tests used synthetic TTS. Natural speech accuracy is unknown.
- Noisy environments — No background noise in test audio.
- Multiple speakers — No diarization (SpeechAnalyzer doesn't support it).
- Accented speech: Only the default `say` voice was tested.
- Whispered or quiet speech: Not tested.
- Music with lyrics — Not tested.
Language Support
30 locales, covering 10 languages, are reported as supported by the model:
de_AT, de_CH, de_DE, en_AU, en_CA, en_GB, en_IE, en_IN, en_NZ, en_SG, en_US, en_ZA,
es_CL, es_ES, es_MX, es_US, fr_BE, fr_CA, fr_CH, fr_FR, it_CH, it_IT, ja_JP, ko_KR,
pt_BR, pt_PT, yue_CN, zh_CN, zh_HK, zh_TW
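
The identifiers above are locale codes (language plus region), so several entries share a language. The sketch below groups them by language prefix and converts an identifier to the hyphenated form used in this report's `-l` examples. Whether ohr also accepts the underscore form is not covered by the report; the helper names are illustrative.

```python
from collections import defaultdict

# Locale list copied from the report
LOCALES = """
de_AT, de_CH, de_DE, en_AU, en_CA, en_GB, en_IE, en_IN, en_NZ, en_SG, en_US, en_ZA,
es_CL, es_ES, es_MX, es_US, fr_BE, fr_CA, fr_CH, fr_FR, it_CH, it_IT, ja_JP, ko_KR,
pt_BR, pt_PT, yue_CN, zh_CN, zh_HK, zh_TW
""".replace(",", " ").split()


def group_by_language(locales: list) -> dict:
    """Map each language code to the list of regions it appears with."""
    groups = defaultdict(list)
    for locale in locales:
        language, _, region = locale.partition("_")
        groups[language].append(region)
    return dict(groups)


def to_flag_form(locale: str) -> str:
    """Convert 'de_DE' to the hyphenated 'de-DE' form."""
    return locale.replace("_", "-")


if __name__ == "__main__":
    for language, regions in group_by_language(LOCALES).items():
        print(f"{language}: {', '.join(regions)}")
```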
Language test results
| Test | Result |
|---|---|
| English (auto-detect) | Works well |
| English with `-l en-US` | Works well |
| German (synthetic `say -v Anna`) without language flag | Mixed results: some German words transcribed, some replaced with English |
| German with `-l de-DE` flag on synthetic German | Empty output (model may reject mismatched voice/locale) |
| English audio with `-l de-DE` flag | Empty output |
Takeaway: Language detection works for English. Other languages need real human speech for meaningful testing. Forcing a wrong language flag produces empty output rather than errors.
Edge Cases
| Scenario | Behavior | Exit Code |
|---|---|---|
| Empty file (0 bytes) | Error: "com.apple.coreaudio.avfaudio error" | 1 |
| Corrupted file (random bytes) | Error: "com.apple.coreaudio.avfaudio error" | 1 |
| Text file renamed to .m4a | Error: "com.apple.coreaudio.avfaudio error" | 1 |
| 10 minutes of silence | OK, empty text output | 0 |
| File not found | Error: "file not found" | 4 |
| Unsupported format (.xyz) | Error: "unsupported format" | 3 |
| Multiple files as args | Transcribes each sequentially | 0 |
| No args, no stdin | Prints usage | 2 |
| Unknown flag | Error + usage hint | 2 |
All error cases produce clean, user-friendly error messages to stderr.
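
Scripts that wrap ohr can branch on the exit codes in the table above. The sketch below maps each code to a short description and raises on failure. It assumes that ohr accepts the file path as a positional argument and writes the transcript to stdout; the helper names and the wording of the descriptions are hypothetical.

```python
import subprocess

# Exit codes as observed in the edge-case table above
EXIT_MEANINGS = {
    0: "success",
    1: "audio could not be read or decoded",
    2: "usage error",
    3: "unsupported format",
    4: "file not found",
}


def describe_exit(code: int) -> str:
    """Return a short description for an observed ohr exit code."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def transcribe(path: str, binary: str = "ohr") -> str:
    """Run ohr on one file (assumed invocation: `ohr <path>`)."""
    result = subprocess.run([binary, path], capture_output=True, text=True)
    if result.returncode != 0:
        detail = result.stderr.strip()
        raise RuntimeError(f"{describe_exit(result.returncode)}: {detail}")
    return result.stdout
```

A silent recording exits with code 0 and empty text, so callers that need a non-empty transcript should check the returned string as well as the exit code.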
Server Testing
The HTTP server was tested with all response formats via curl and Python httpx:
| Endpoint | Method | Result |
|---|---|---|
| /health | GET | 200, JSON with model info |
| /v1/models | GET | 200, model list |
| /v1/audio/transcriptions | POST (multipart) | 200, all 5 response formats |
| /v1/chat/completions | POST | 501 (honest stub) |
| /v1/embeddings | POST | 501 (honest stub) |
| /v1/logs | GET | 200 (with --debug) |
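
For reference, a multipart transcription request can be built with the Python standard library alone. This is a sketch under several assumptions: the form field names `file` and `response_format` and the value `json` follow the OpenAI transcription API convention that the endpoint path suggests, and the host and port in the usage note are placeholders. None of these are confirmed by this report.

```python
import urllib.request
import uuid
from typing import Optional


def build_transcription_request(
    base_url: str,
    audio: bytes,
    filename: str,
    token: Optional[str] = None,
    response_format: str = "json",  # assumed field value
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    crlf = b"\r\n"
    body = b"".join([
        f"--{boundary}".encode(), crlf,
        b'Content-Disposition: form-data; name="response_format"', crlf, crlf,
        response_format.encode(), crlf,
        f"--{boundary}".encode(), crlf,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(), crlf,
        b"Content-Type: application/octet-stream", crlf, crlf,
        audio, crlf,
        f"--{boundary}--".encode(), crlf,
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if token:
        # Protected endpoints returned 401 without a valid Bearer token
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers=headers,
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen`, for example with a base URL such as `http://127.0.0.1:8080` (placeholder port).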
Security tests
| Test | Result |
|---|---|
| No token → protected endpoint | 401 Unauthorized |
| Wrong token | 401 Unauthorized |
| Correct Bearer token | 200 OK |
| Health without token (loopback) | 200 OK (public by default) |
| Foreign origin (evil.com) | 403 Forbidden |
| Subdomain attack (localhost.evil.com) | 403 Forbidden |
| Localhost origin | 200 OK |
| OPTIONS preflight | 204 No Content |
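
The subdomain case is worth spelling out: a check such as "origin starts with `http://localhost`" would accept `http://localhost.evil.com`, whereas comparing the parsed hostname exactly rejects it. The sketch below shows that idea. It is not ohr's OriginValidator, and the allowed host set is an assumption.

```python
from urllib.parse import urlsplit

# Assumed allow-list of loopback hostnames
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_allowed_origin(origin: str) -> bool:
    """Accept only http(s) origins whose hostname is exactly a loopback host."""
    parts = urlsplit(origin)
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname drops the port, userinfo, and IPv6 brackets, and lowercases
    return parts.hostname in LOOPBACK_HOSTS
```

Parsing before comparing also handles tricks such as `http://localhost@evil.com`, where the hostname is `evil.com`.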
Test Suite Summary
| Suite | Count | Type |
|---|---|---|
| OhrErrorTests | 12 | Unit (Swift) |
| AudioFormatTests | 24 | Unit (Swift) |
| SubtitleFormatterTests | 16 | Unit (Swift) |
| OpenAIModelsTests | 10 | Unit (Swift) |
| TranscriptionValidatorTests | 15 | Unit (Swift) |
| OriginValidatorTests | 32 | Unit (Swift) |
| CLI E2E | 17 | Integration (Python) |
| Server Endpoints | 14 | Integration (Python) |
| Security | 11 | Integration (Python) |
| Total | 151 | |
Recommendations
- Test with real human speech before relying on ohr for production transcription. Synthetic speech accuracy (~90-95%) is likely higher than real-world accuracy.
- Use the language flag (`-l en-US`) when you know the language. Auto-detection works for English but is unreliable for other languages.
- Expect imperfect transcription. The on-device model trades accuracy for privacy and speed. For critical applications, consider cloud-based alternatives.
- Long recordings work fine. No practical upper limit found. Performance scales well — 10 minutes transcribes in ~5 seconds.
- When piping to apfel for summarization, be aware of apfel's 4096 token (~3000 word) context window. Recordings longer than ~15 minutes may exceed this limit.
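
A transcript can be checked against that limit before it is piped onward. The sketch below uses the ratio stated above (4096 tokens for about 3000 words) as a rough tokens-per-word estimate. The reserve for the prompt and the response is a hypothetical value, and real tokenizers will give different counts.

```python
CONTEXT_TOKENS = 4096  # apfel context window, as stated above
CONTEXT_WORDS = 3000   # approximate word equivalent, as stated above


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (rounded up)."""
    words = len(text.split())
    return (words * CONTEXT_TOKENS + CONTEXT_WORDS - 1) // CONTEXT_WORDS


def fits_context(text: str, reserved_tokens: int = 512) -> bool:
    """True if the transcript plus a reserve (assumed) fits the window."""
    return estimate_tokens(text) + reserved_tokens <= CONTEXT_TOKENS
```

When `fits_context` returns False, the transcript can be split into chunks and summarized piece by piece.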