Testing & QA Report

April 4, 2026

Comprehensive testing of ohr's capabilities and limitations.

Test environment: macOS 26.3.1, Apple M2, 24 GB RAM, ohr 0.1.3

Test Methodology

Tests used synthetic speech generated by the macOS say command, not recordings of human speakers. The accuracy numbers therefore represent a best case: real-world accuracy with natural speech, accents, background noise, and multiple speakers is expected to be lower.

Audio files were generated at various lengths (5s to 10min) and in all supported formats (m4a, wav, mp3, aiff, flac).
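A minimal sketch of how such fixtures could be produced, assuming the macOS say tool is on the PATH. The helper names, sample text, and output path are hypothetical; the report does not include the actual generation script.

```python
import shutil
import subprocess

def say_command(text, out_path, voice=None):
    """Build the argv for macOS `say`, writing synthesized speech to out_path."""
    argv = ["say"]
    if voice:
        argv += ["-v", voice]      # e.g. "Anna", the German voice used later in this report
    argv += ["-o", out_path, text]  # -o writes an audio file instead of playing it
    return argv

def generate(text, out_path, voice=None):
    """Run `say` when it is available (macOS only). Returns True if it ran."""
    if shutil.which("say") is None:
        return False
    subprocess.run(say_command(text, out_path, voice), check=True)
    return True
```

Calling generate("This is a five second test.", "test_5s.aiff") would write an AIFF file on macOS; converting it to the other formats would need a separate tool and is not shown here.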

Performance

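ohr transcribes at approximately 50x real-time on Apple M2 for files longer than about a minute. Very short files run at a lower multiple because of fixed startup overhead.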

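| Audio Length | File Size | Segments | Words | Transcribe Time | Speed |
|---|---|---|---|---|---|
| 2.5s | 111 KB | 2 | 6 | 300ms | 8x |
| 23.5s | 1.0 MB | 21 | 60 | 600ms | 39x |
| 69.7s (~1 min) | 2.9 MB | 45 | 157 | 1.5s | 46x |
| 145.2s (~2.5 min) | 6.1 MB | 119 | 311 | 2.5s | 58x |
| 267.8s (~4.5 min) | 11 MB | 201 | 568 | 4.7s | 57x |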

Key findings:

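  • Transcription time scales sub-linearly with audio length, so longer files are processed at a higher multiple of real time (8x for 2.5s of audio, rising to 57-58x for files of 2.5 minutes and longer)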
  • 10 minutes of audio transcribes in under 5 seconds
  • Processing time is dominated by model loading for short files (~300ms overhead)
  • No upper limit found in testing (10 minutes worked fine, longer likely works too)

Server performance: Similar to CLI. The 10-minute file transcribed in ~6.2s via HTTP (includes multipart parsing + network overhead).
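The Speed column is the audio duration divided by the transcription time. The short sketch below recomputes it from the table values; the function name is illustrative only.

```python
# (audio seconds, transcribe seconds) copied from the performance table
ROWS = [(2.5, 0.3), (23.5, 0.6), (69.7, 1.5), (145.2, 2.5), (267.8, 4.7)]

def realtime_factor(audio_s, transcribe_s):
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_s / transcribe_s

for audio_s, transcribe_s in ROWS:
    factor = realtime_factor(audio_s, transcribe_s)
    print(f"{audio_s:6.1f}s audio in {transcribe_s:.1f}s -> {factor:.0f}x")
```

The rounded factors come out as 8x, 39x, 46x, 58x, and 57x, matching the table.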

Format Compatibility

All 5 supported formats produce usable transcription output:

| Format | Extension | Tested Size | Result |
|---|---|---|---|
| Apple M4A (AAC) | .m4a | 111 KB – 11 MB | Works |
| WAV (PCM) | .wav | 1.1 MB | Works |
| MP3 | .mp3 | 209 KB | Works |
| AIFF | .aiff, .aif | 665 KB | Works |
| FLAC | .flac | 582 KB | Works |

Not supported: OGG, OPUS, WebM audio. These return an error.

Stdin piping: Works for all formats. ohr detects the audio format from magic bytes in the file header (RIFF/WAVE for WAV, ID3/sync for MP3, ftyp for M4A, etc.).
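To illustrate what magic-byte detection looks like, here is a rough Python sketch based on the well-known container signatures. It is not ohr's implementation (ohr is written in Swift); the function name and return values are made up for this example.

```python
def sniff_audio_format(header: bytes):
    """Guess the container from the first bytes of a stream; None if unknown."""
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if len(header) >= 12 and header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "aiff"
    if header[:4] == b"fLaC":
        return "flac"
    if len(header) >= 8 and header[4:8] == b"ftyp":
        # Any ISO base media file matches; a stricter check would inspect the brand.
        return "m4a"
    if header[:3] == b"ID3":
        return "mp3"  # an ID3v2 tag precedes the audio frames
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"  # MPEG audio frame sync: eleven set bits
    return None
```

The frame-sync test is deliberately loose and would also match raw ADTS AAC streams, so a production detector would need further checks.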

Transcription Accuracy

Tested with macOS say command (synthetic speech). Results represent best-case accuracy.

Accuracy observations

| Category | Accuracy | Notes |
|---|---|---|
| Clear synthetic speech | ~90-95% | Most words correct |
| Numbers (digits) | Good | "12345" → "12345" |
| Numbers (spoken) | Poor | "five second" → "52nd", "thirty second" → "32nd" |
| Technical terms | Variable | "ohr" → "or" (unknown word) |
| Punctuation | Good | Periods and commas placed reasonably |
| Timestamps | Excellent | Sub-second precision, consistent |
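The report does not state how the accuracy percentages were derived. One common way to quantify transcription accuracy is word error rate (WER), the word-level edit distance divided by the number of reference words. The sketch below is an assumption about how such a figure could be computed, not the method used for the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # word missing from hypothesis
                cur[j - 1] + 1,                        # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For the reference "five second test" and the transcript "52nd test", this counts one substitution and one deletion over three reference words, a WER of 2/3. Punctuation is not normalized here, so "test." and "test" would count as different words.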

Known accuracy issues

  1. Ordinal/number confusion — "five second test" becomes "52nd test". The model interprets spoken numbers as ordinals or digits unpredictably.
  2. Unknown words — Words not in the model's vocabulary (like "ohr") are transcribed as phonetically similar known words ("or").
  3. Segment boundaries — Segments sometimes split mid-phrase. Leading spaces appear on continuation segments.
  4. Synthetic speech bias — The say command produces unnaturally even speech. Real human speech with pauses, filler words, and varied intonation may produce different results.
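For the segment-boundary issue, downstream code can normalize whitespace when stitching segments back into running text. A small sketch, assuming the segments are available as a list of strings (the report does not describe ohr's actual output structure):

```python
def join_segments(segments):
    """Join segment texts into one transcript, dropping stray leading spaces."""
    pieces = (segment.strip() for segment in segments)
    return " ".join(piece for piece in pieces if piece)
```

With this helper, ["This is a", " five second test."] becomes "This is a five second test."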

What we could NOT test

  • Real human speech — All tests used synthetic TTS. Natural speech accuracy is unknown.
  • Noisy environments — No background noise in test audio.
  • Multiple speakers — No diarization (SpeechAnalyzer doesn't support it).
  • Accented speech — Only default say voice tested.
  • Whispered or quiet speech — Not tested.
  • Music with lyrics — Not tested.

Language Support

30 locales (covering 10 languages) are reported as supported by the model:

de_AT, de_CH, de_DE, en_AU, en_CA, en_GB, en_IE, en_IN, en_NZ, en_SG, en_US, en_ZA,
es_CL, es_ES, es_MX, es_US, fr_BE, fr_CA, fr_CH, fr_FR, it_CH, it_IT, ja_JP, ko_KR,
pt_BR, pt_PT, yue_CN, zh_CN, zh_HK, zh_TW

Language test results

| Test | Result |
|---|---|
| English (auto-detect) | Works well |
| English with -l en-US | Works well |
| German (synthetic say -v Anna) without language flag | Mixed results: some German words transcribed, some replaced with English |
| German with -l de-DE flag on synthetic German | Empty output (model may reject mismatched voice/locale) |
| English audio with -l de-DE flag | Empty output |

Takeaway: Language detection works for English. Other languages need real human speech for meaningful testing. Forcing a wrong language flag produces empty output rather than errors.

Edge Cases

| Scenario | Behavior | Exit Code |
|---|---|---|
| Empty file (0 bytes) | Error: "com.apple.coreaudio.avfaudio error" | 1 |
| Corrupted file (random bytes) | Error: "com.apple.coreaudio.avfaudio error" | 1 |
| Text file renamed to .m4a | Error: "com.apple.coreaudio.avfaudio error" | 1 |
| 10 minutes of silence | OK, empty text output | 0 |
| File not found | Error: "file not found" | 4 |
| Unsupported format (.xyz) | Error: "unsupported format" | 3 |
| Multiple files as args | Transcribes each sequentially | 0 |
| No args, no stdin | Prints usage | 2 |
| Unknown flag | Error + usage hint | 2 |

All error cases produce clean, user-friendly error messages to stderr.
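Scripts that wrap ohr can branch on these exit codes. Because silence and a mismatched language flag both exit with 0 and empty text, a wrapper may also want to treat an empty transcript as a distinct outcome. The sketch below assumes ohr is invoked as ohr <file>; the labels for each code are this example's reading of the table, not official names.

```python
import subprocess

# Exit codes as observed in the edge-case table; the labels are interpretations.
EXIT_MEANINGS = {
    0: "ok",
    1: "audio could not be decoded",
    2: "usage error",
    3: "unsupported format",
    4: "file not found",
}

def interpret(returncode: int, stdout: str) -> str:
    """Map an exit code and captured stdout to a short status string."""
    if returncode == 0 and not stdout.strip():
        return "ok, but empty transcript (silence or mismatched language flag)"
    return EXIT_MEANINGS.get(returncode, f"unknown exit code {returncode}")

def transcribe(path: str):
    """Run `ohr <path>` (assumed invocation) and return (status, transcript)."""
    proc = subprocess.run(["ohr", path], capture_output=True, text=True)
    return interpret(proc.returncode, proc.stdout), proc.stdout
```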

Server Testing

The HTTP server was tested with all response formats via curl and Python httpx:

| Endpoint | Method | Result |
|---|---|---|
| /health | GET | 200, JSON with model info |
| /v1/models | GET | 200, model list |
| /v1/audio/transcriptions | POST (multipart) | 200, all 5 response formats |
| /v1/chat/completions | POST | 501 (honest stub) |
| /v1/embeddings | POST | 501 (honest stub) |
| /v1/logs | GET | 200 (with --debug) |
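A sketch of a multipart request to the transcription endpoint using only the Python standard library. Several details are assumptions: the base URL and port are placeholders, and the form field names file and response_format (with the value "json") follow the OpenAI transcription API that the endpoint path mirrors; the report does not confirm which fields ohr accepts.

```python
import os
import urllib.request
import uuid

BASE_URL = "http://127.0.0.1:8080"  # hypothetical; use the address your server reports

def build_multipart(fields: dict, filename: str, audio: bytes):
    """Encode text fields plus one file part; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcription_request(path, token=None, response_format="json"):
    """Build (but do not send) a POST request for /v1/audio/transcriptions."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"response_format": response_format}, os.path.basename(path), fh.read()
        )
    req = urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions", data=body, method="POST"
    )
    req.add_header("Content-Type", content_type)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req
```

The request can then be sent with urllib.request.urlopen(req). The tests described in this report used curl and httpx instead, which handle the multipart encoding themselves.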

Security tests

| Test | Result |
|---|---|
| No token → protected endpoint | 401 Unauthorized |
| Wrong token | 401 Unauthorized |
| Correct Bearer token | 200 OK |
| Health without token (loopback) | 200 OK (public by default) |
| Foreign origin (evil.com) | 403 Forbidden |
| Subdomain attack (localhost.evil.com) | 403 Forbidden |
| Localhost origin | 200 OK |
| OPTIONS preflight | 204 No Content |
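The subdomain case matters because a naive prefix or substring check on the Origin header would accept localhost.evil.com. The sketch below shows one way to get the behavior in the table by comparing the parsed hostname exactly. It illustrates the idea only; it is not ohr's OriginValidator, and the allowed-host set is an assumption.

```python
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}  # assumed allow-list

def origin_allowed(origin: str) -> bool:
    """Accept only http(s) origins whose host is exactly a loopback name."""
    try:
        parts = urlsplit(origin)
    except ValueError:  # e.g. a malformed IPv6 literal
        return False
    if parts.scheme not in ("http", "https"):
        return False
    return parts.hostname in LOOPBACK_HOSTS
```

Exact matching on the parsed hostname rejects both http://evil.com and http://localhost.evil.com while still allowing http://localhost:3000.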

Test Suite Summary

| Suite | Count | Type |
|---|---|---|
| OhrErrorTests | 12 | Unit (Swift) |
| AudioFormatTests | 24 | Unit (Swift) |
| SubtitleFormatterTests | 16 | Unit (Swift) |
| OpenAIModelsTests | 10 | Unit (Swift) |
| TranscriptionValidatorTests | 15 | Unit (Swift) |
| OriginValidatorTests | 32 | Unit (Swift) |
| CLI E2E | 17 | Integration (Python) |
| Server Endpoints | 14 | Integration (Python) |
| Security | 11 | Integration (Python) |
| Total | 151 | |

Recommendations

  1. Test with real human speech before relying on ohr for production transcription. Synthetic speech accuracy (~90-95%) is likely higher than real-world accuracy.
  2. Use the language flag (-l en-US) when you know the language. Auto-detection works for English but is unreliable for other languages.
  3. Expect imperfect transcription. The on-device model trades accuracy for privacy and speed. For critical applications, consider cloud-based alternatives.
  4. Long recordings work fine. No practical upper limit found. Performance scales well — 10 minutes transcribes in ~5 seconds.
  5. When piping to apfel for summarization, be aware of apfel's 4096 token (~3000 word) context window. Recordings longer than ~15 minutes may exceed this limit.
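For the apfel hand-off, a rough pre-check can estimate whether a transcript fits and split it if not. The sketch below uses the words-per-token ratio implied by the figures above (3000 words per 4096 tokens); the 512-token reserve for the prompt and the generated summary is an arbitrary assumption, and real token counts depend on the tokenizer.

```python
import math

CONTEXT_TOKENS = 4096
WORDS_PER_TOKEN = 3000 / 4096  # ratio implied by the figures in this report

def estimated_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)

def chunk_words(text: str, max_tokens: int = CONTEXT_TOKENS, reserve: int = 512):
    """Split text into pieces that should fit, leaving `reserve` tokens spare."""
    budget_words = int((max_tokens - reserve) * WORDS_PER_TOKEN)
    words = text.split()
    return [
        " ".join(words[i:i + budget_words])
        for i in range(0, len(words), budget_words)
    ]
```

Each chunk could then be summarized separately and the partial summaries combined in a final pass.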