Testing & QA Report

April 4, 2026

Comprehensive testing of ohr's capabilities and limitations.

Test environment: macOS 26.3.1, Apple M2, 24 GB RAM, ohr 0.1.3

Test Methodology

Tests used synthetic speech generated by the macOS `say` command, not recordings of human speakers. The accuracy figures are therefore a best case: real-world accuracy with natural speech, accents, background noise, and multiple speakers will likely be lower.

Audio files were generated at various lengths (5s to 10min) and in all supported formats (m4a, wav, mp3, aiff, flac).

Performance

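ohr transcribes at roughly 45-58x real time on an Apple M2 for audio longer than about one minute. Shorter clips run at a lower multiple because fixed startup overhead dominates.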

Audio Length	File Size	Segments	Words	Transcribe Time	Speed
2.5s	111 KB	2	6	300ms	8x
23.5s	1.0 MB	21	60	600ms	39x
69.7s (~1 min)	2.9 MB	45	157	1.5s	46x
145.2s (~2.5 min)	6.1 MB	119	311	2.5s	58x
267.8s (~4.5 min)	11 MB	201	568	4.7s	57x

Key findings:

Transcription time scales sub-linearly with audio length, so longer files reach a higher real-time factor (8x at 2.5s, 57-58x beyond two minutes)
10 minutes of audio transcribes in under 5 seconds
Processing time is dominated by model loading for short files (~300ms overhead)
No upper limit found in testing (10 minutes worked fine, longer likely works too)
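The Speed column can be reproduced by dividing audio duration by transcription time. The sketch below does this for the rows of the performance table; the names `MEASUREMENTS` and `realtime_factor` are illustrative and not part of ohr.

```python
# (audio seconds, transcription seconds) copied from the performance table.
MEASUREMENTS = [
    (2.5, 0.3),
    (23.5, 0.6),
    (69.7, 1.5),
    (145.2, 2.5),
    (267.8, 4.7),
]


def realtime_factor(audio_s: float, transcribe_s: float) -> float:
    """Return seconds of audio processed per second of wall-clock time."""
    return audio_s / transcribe_s


# Round to whole multiples, as the Speed column does.
factors = [round(realtime_factor(a, t)) for a, t in MEASUREMENTS]
print(factors)
```

The 2.5s clip has the lowest factor because its 300ms run is almost entirely the fixed startup overhead noted above.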

Server performance: Similar to CLI. The 10-minute file transcribed in ~6.2s via HTTP (includes multipart parsing + network overhead).

Format Compatibility

All 5 supported formats produce usable transcription output:

Format	Extension	Tested Size	Result
Apple M4A (AAC)	`.m4a`	111 KB – 11 MB	Works
WAV (PCM)	`.wav`	1.1 MB	Works
MP3	`.mp3`	209 KB	Works
AIFF	`.aiff`, `.aif`	665 KB	Works
FLAC	`.flac`	582 KB	Works

Not supported: OGG, OPUS, WebM audio. These return an error.

Stdin piping: Works for all formats. ohr detects the audio format from magic bytes in the file header (RIFF/WAVE for WAV, ID3/sync for MP3, ftyp for M4A, etc.).
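Header-based detection of this kind can be sketched as follows. This is not ohr's implementation (which is written in Swift); the function name `sniff_audio_format` and the exact checks are assumptions based on the well-known signatures of each container.

```python
from typing import Optional


def sniff_audio_format(header: bytes) -> Optional[str]:
    """Guess the audio container from the first bytes of a stream."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "aiff"
    if header[:4] == b"fLaC":
        return "flac"
    # ISO base media files (M4A, but also MP4 video) carry "ftyp" at offset 4.
    if header[4:8] == b"ftyp":
        return "m4a"
    if header[:3] == b"ID3":
        return "mp3"
    # MPEG audio frame sync: the first 11 bits are all set.
    # Raw ADTS AAC shares this prefix, so this check is only a heuristic.
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return None  # e.g. OGG ("OggS"), which ohr does not support
```

Twelve bytes are enough for every check here, so a stdin reader only needs to buffer a small prefix before choosing a decoder.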

Transcription Accuracy

Tested with the macOS `say` command (synthetic speech). Results represent best-case accuracy.

Accuracy observations

Category	Accuracy	Notes
Clear synthetic speech	~90-95%	Most words correct
Numbers (digits)	Good	"12345" → "12345"
Numbers (spoken)	Poor	"five second" → "52nd", "thirty second" → "32nd"
Technical terms	Variable	"ohr" → "or" (unknown word)
Punctuation	Good	Periods and commas placed reasonably
Timestamps	Excellent	Sub-second precision, consistent
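This report does not state which metric produced the accuracy figures. One common way to quantify transcription accuracy is word error rate (WER), with accuracy approximated as 1 - WER. The sketch below is an assumption about how such a number could be computed, not the method used here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# The ordinal confusion from this report: two of three words are wrong.
print(word_error_rate("five second test", "52nd test"))
```

Short phrases exaggerate the error rate; a meaningful figure needs a reference transcript of at least a few hundred words.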

Known accuracy issues

Ordinal/number confusion — "five second test" becomes "52nd test". The model interprets spoken numbers as ordinals or digits unpredictably.
Unknown words — Words not in the model's vocabulary (like "ohr") are transcribed as phonetically similar known words ("or").
Segment boundaries — Segments sometimes split mid-phrase. Leading spaces appear on continuation segments.
Synthetic speech bias — The `say` command produces unnaturally even speech. Real human speech with pauses, filler words, and varied intonation may produce different results.

What we could NOT test

Real human speech — All tests used synthetic TTS. Natural speech accuracy is unknown.
Noisy environments — No background noise in test audio.
Multiple speakers — No diarization (SpeechAnalyzer doesn't support it).
Accented speech — Only the default `say` voice was tested.
Whispered or quiet speech — Not tested.
Music with lyrics — Not tested.

Language Support

30 languages are reported as supported by the model:

de_AT, de_CH, de_DE, en_AU, en_CA, en_GB, en_IE, en_IN, en_NZ, en_SG, en_US, en_ZA,
es_CL, es_ES, es_MX, es_US, fr_BE, fr_CA, fr_CH, fr_FR, it_CH, it_IT, ja_JP, ko_KR,
pt_BR, pt_PT, yue_CN, zh_CN, zh_HK, zh_TW

Language test results

Test	Result
English (auto-detect)	Works well
English with `-l en-US`	Works well
German (synthetic `say -v Anna`) without language flag	Mixed results — some German words transcribed, some replaced with English
German with `-l de-DE` flag on synthetic German	Empty output (model may reject mismatched voice/locale)
English audio with `-l de-DE` flag	Empty output

Takeaway: Language detection works for English. Other languages need real human speech for meaningful testing. Forcing a wrong language flag produces empty output rather than errors.

Edge Cases

Scenario	Behavior	Exit Code
Empty file (0 bytes)	Error: "com.apple.coreaudio.avfaudio error"	1
Corrupted file (random bytes)	Error: "com.apple.coreaudio.avfaudio error"	1
Text file renamed to `.m4a`	Error: "com.apple.coreaudio.avfaudio error"	1
10 minutes of silence	OK, empty text output	0
File not found	Error: "file not found"	4
Unsupported format (`.xyz`)	Error: "unsupported format"	3
Multiple files as args	Transcribes each sequentially	0
No args, no stdin	Prints usage	2
Unknown flag	Error + usage hint	2

All error cases write a clean, user-friendly error message to stderr.
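A calling script can branch on these exit codes. The sketch below is a hypothetical wrapper: the labels in `EXIT_MEANINGS` are descriptive names chosen here, and it assumes that `ohr <file>` prints the transcript to stdout.

```python
import subprocess

# Exit codes as observed in the edge-case table; the labels are descriptive only.
EXIT_MEANINGS = {
    0: "success",
    1: "audio could not be decoded",
    2: "usage error",
    3: "unsupported format",
    4: "file not found",
}


def interpret(returncode: int, stdout: str) -> str:
    """Turn an exit code and captured stdout into a short status string."""
    if returncode == 0 and not stdout.strip():
        # Silence and a mismatched language flag both exit 0 with no text.
        return "success, but no text was produced"
    return EXIT_MEANINGS.get(returncode, f"unexpected exit code {returncode}")


def transcribe(path: str) -> str:
    """Run `ohr <path>` and return its stdout, raising on any failure."""
    result = subprocess.run(["ohr", path], capture_output=True, text=True)
    status = interpret(result.returncode, result.stdout)
    if status != "success":
        raise RuntimeError(f"{status}: {result.stderr.strip()}")
    return result.stdout
```

Treating empty output as a failure is a design choice for this sketch; a pipeline that expects silent recordings may prefer to accept it.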

Server Testing

The HTTP server was tested with all response formats via curl and Python httpx:

Endpoint	Method	Result
`/health`	GET	200, JSON with model info
`/v1/models`	GET	200, model list
`/v1/audio/transcriptions`	POST (multipart)	200, all 5 response formats
`/v1/chat/completions`	POST	501 (honest stub)
`/v1/embeddings`	POST	501 (honest stub)
`/v1/logs`	GET	200 (with `--debug`)
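A multipart request to the transcription endpoint might be built as shown below, using only the Python standard library. Several details are assumptions rather than documented ohr behavior: the address in `BASE_URL` is hypothetical, and the form field names `file` and `response_format` (with the value `json`) follow the OpenAI API convention that the endpoint paths suggest.

```python
import urllib.request
import uuid

# Hypothetical address; substitute the host and port your ohr server uses.
BASE_URL = "http://127.0.0.1:8080"


def build_transcription_request(audio: bytes, filename: str, token: str,
                                response_format: str = "json"
                                ) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    # Text part first, then the header of the file part.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="response_format"\r\n\r\n'
        f"{response_format}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions",
        data=head + audio + tail,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; the same request can be expressed more briefly with httpx or curl, which is what the tests in this report used.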

Security tests

Test	Result
No token → protected endpoint	401 Unauthorized
Wrong token	401 Unauthorized
Correct Bearer token	200 OK
Health without token (loopback)	200 OK (public by default)
Foreign origin (evil.com)	403 Forbidden
Subdomain attack (localhost.evil.com)	403 Forbidden
Localhost origin	200 OK
OPTIONS preflight	204 No Content
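The subdomain case is rejected only if the origin's hostname is compared exactly rather than by prefix or substring. The sketch below illustrates that idea; it is not ohr's `OriginValidator`, and the set of allowed hosts is an assumption.

```python
from urllib.parse import urlsplit

# Assumed set of loopback hosts; the real allow-list may differ.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_allowed_origin(origin: str) -> bool:
    """Accept only http(s) origins whose host is exactly a loopback name."""
    parts = urlsplit(origin)
    if parts.scheme not in ("http", "https"):
        return False
    # Exact match on the parsed hostname: a startswith("localhost") check
    # would wrongly accept "localhost.evil.com".
    return parts.hostname in LOOPBACK_HOSTS


print(is_allowed_origin("http://localhost:3000"))      # True
print(is_allowed_origin("http://localhost.evil.com"))  # False
```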

Test Suite Summary

Suite	Count	Type
OhrErrorTests	12	Unit (Swift)
AudioFormatTests	24	Unit (Swift)
SubtitleFormatterTests	16	Unit (Swift)
OpenAIModelsTests	10	Unit (Swift)
TranscriptionValidatorTests	15	Unit (Swift)
OriginValidatorTests	32	Unit (Swift)
CLI E2E	17	Integration (Python)
Server Endpoints	14	Integration (Python)
Security	11	Integration (Python)
Total	151

Recommendations

Test with real human speech before relying on ohr for production transcription. Synthetic speech accuracy (~90-95%) is likely higher than real-world accuracy.
Use the language flag (for example `-l en-US`) when you know the language. Auto-detection works for English but is unreliable for other languages.
Expect imperfect transcription. The on-device model trades accuracy for privacy and speed. For critical applications, consider cloud-based alternatives.
Long recordings work fine. No practical upper limit found. Performance scales well — 10 minutes transcribes in ~5 seconds.
When piping to apfel for summarization, be aware of apfel's 4096-token (~3000-word) context window. Recordings longer than ~15 minutes may exceed this limit.
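The arithmetic behind that limit can be made explicit. The sketch below uses only the 4096-token and ~3000-word figures quoted above; the function names are illustrative, and the estimate ignores any tokens consumed by the prompt itself.

```python
CONTEXT_TOKENS = 4096     # apfel context window, as quoted in this report
WORDS_PER_WINDOW = 3000   # the report's word equivalent for that window


def estimated_tokens(word_count: int) -> float:
    """Scale a word count to tokens using the 4096:3000 ratio."""
    return word_count * CONTEXT_TOKENS / WORDS_PER_WINDOW


def fits_in_context(word_count: int) -> bool:
    """True if a transcript of this many words stays within the window."""
    return estimated_tokens(word_count) <= CONTEXT_TOKENS


def minutes_until_full(words_per_minute: float) -> float:
    """Recording length at which the transcript reaches the word budget."""
    return WORDS_PER_WINDOW / words_per_minute


print(minutes_until_full(200))  # 15.0
```

The ~15 minute figure corresponds to a speaking rate of about 200 words per minute. The synthetic test audio in this report ran slower, at about 127 words per minute (568 words in 267.8s), which would reach the budget after roughly 24 minutes.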