OtosakuStreamingASR-iOS
June 4, 2026
OtosakuStreamingASR is a lightweight on-device streaming speech recognition engine for iOS. It performs real-time audio processing using a Conformer-based architecture and CTC decoding.
Important
Successor project: VoxRT
Streaming on-device ASR has moved to VoxRT, a ground-up rewrite on a custom Rust inference runtime. Same idea as OtosakuStreamingASR, but faster, smaller, and now on both platforms:
- iOS: voxrt-asr-ios. NEON-accelerated FastConformer (32M), RTF 0.08–0.10 on iPhone 13 Pro Max, fully offline.
- Android: voxrt-asr-android. Same engine, ~150 ms latency, no cloud.
This repo stays up for reference, but new development happens in VoxRT. Start there.
Features
- Fully offline
- Real-time streaming speech recognition
- Modular architecture (feature extractor, encoder, decoder)
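To illustrate what the decoder stage of a CTC-based recognizer does, here is a minimal sketch of greedy CTC decoding in plain Swift. It is a generic illustration, not the package's implementation: the function name, the blank token ID, and the toy vocabulary are all assumptions made for the example.

```swift
/// Greedy CTC decoding over per-frame best token IDs:
/// merge consecutive repeats, then drop the blank token.
/// `blankID` and `vocabulary` are hypothetical values for illustration only.
func greedyCTCDecode(frameTokenIDs: [Int], blankID: Int, vocabulary: [Int: String]) -> String {
    var output = ""
    var previous: Int? = nil
    for id in frameTokenIDs {
        // Emit a token only when it differs from the previous frame and is not blank.
        if id != previous && id != blankID, let piece = vocabulary[id] {
            output += piece
        }
        previous = id
    }
    return output
}

// Toy example: 0 is the blank, 1 maps to "h", 2 maps to "i".
let vocab = [1: "h", 2: "i"]
let decoded = greedyCTCDecode(frameTokenIDs: [0, 1, 1, 0, 2, 2, 2, 0], blankID: 0, vocabulary: vocab)
print(decoded)
```

A blank between two identical tokens keeps them separate, which is how CTC can represent doubled letters.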
Demo
Watch the model running live on iPhone 13:

Installation
Add the Swift Package to your Xcode project:
https://github.com/Otosaku/OtosakuStreamingASR-iOS
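If you manage dependencies through a `Package.swift` manifest instead of the Xcode UI, the dependency might be declared roughly as follows. This is a sketch under assumptions: the `MyApp` names are hypothetical, the `main` branch and the minimum iOS version are guesses, and the product name is assumed to match the module imported in the usage example. Check the repository's own manifest for the real values.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical package name
    platforms: [.iOS(.v15)], // assumed minimum; verify against the package's manifest
    dependencies: [
        // Branch name is an assumption; pin to a tagged release if one exists.
        .package(url: "https://github.com/Otosaku/OtosakuStreamingASR-iOS", branch: "main")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // Product name assumed to match the imported module name.
                .product(name: "OtosakuStreamingASR", package: "OtosakuStreamingASR-iOS")
            ]
        )
    ]
)
```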
Usage Example
import OtosakuStreamingASR

// modelURL and yourRawAudioChunk are placeholders supplied by your app.
let asr = OtosakuStreamingASR()

// Load the model from a local directory.
try asr.prepareModel(from: modelURL)

// Receive transcribed text as it is recognized.
asr.subscribe { text in
    print("Recognized: \(text)")
}

// Raw audio chunk: [Double] in range [-1.0, 1.0], strictly 2559 samples per chunk (~160 ms at 16 kHz)
try asr.predictChunk(rawChunk: yourRawAudioChunk)

try asr.stop() // Finalize and decode remaining buffer
asr.reset() // Reset internal model state
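Because each chunk must contain exactly 2559 samples, audio that arrives in arbitrary-sized buffers (for example from a microphone tap) has to be re-chunked first. The following is a minimal sketch of one way to do that in plain Swift. `ChunkBuffer` is a hypothetical helper written for this example and is not part of the package.

```swift
/// Buffers incoming samples and emits fixed-size chunks.
/// Hypothetical helper for illustration; not part of OtosakuStreamingASR.
struct ChunkBuffer {
    let chunkSize: Int
    private(set) var pending: [Double] = []

    init(chunkSize: Int = 2559) {
        self.chunkSize = chunkSize
    }

    /// Appends new samples and returns every complete chunk now available.
    /// Leftover samples stay in `pending` until more audio arrives.
    mutating func append(_ samples: [Double]) -> [[Double]] {
        pending.append(contentsOf: samples)
        var chunks: [[Double]] = []
        while pending.count >= chunkSize {
            chunks.append(Array(pending.prefix(chunkSize)))
            pending.removeFirst(chunkSize)
        }
        return chunks
    }
}

// Feed 6000 samples of silence: two full chunks come out, the rest is held back.
var buffer = ChunkBuffer()
let chunks = buffer.append([Double](repeating: 0.0, count: 6000))
print(chunks.count, buffer.pending.count)
```

Each returned chunk could then be passed to `predictChunk(rawChunk:)`. How to handle a final partial chunk is not documented here, so the sketch simply leaves it in `pending`.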
Model Details
- Architecture: Fast Conformer (Cache-Aware Streaming)
- Language: 🇷🇺 Russian (fine-tuned from English)
- Training: 250 hours of Russian speech (30 epochs)
- WER (Word Error Rate):
  - Russian (fine-tuned): 11%
  - English (before fine-tuning): 6.5% on LibriSpeech test-other
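For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A WER of 11% therefore means roughly 11 word errors per 100 reference words. The sketch below computes it with a word-level edit distance. It is an illustration of the metric only and is not the tooling used to produce the figures above; the function name is an assumption.

```swift
/// Word error rate = (substitutions + deletions + insertions) / reference word count,
/// computed with a word-level Levenshtein distance.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ").map(String.init)
    let hyp = hypothesis.split(separator: " ").map(String.init)
    // WER is undefined for an empty reference; this sketch returns 0 by convention.
    guard !ref.isEmpty else { return 0.0 }

    // previous[j]: edit distance between the reference words seen so far
    // (before the current one) and the first j hypothesis words.
    var previous = Array(0...hyp.count)
    for i in 1...ref.count {
        var current = [i] + Array(repeating: 0, count: hyp.count)
        for j in 0..<hyp.count {
            let substitution = previous[j] + (ref[i - 1] == hyp[j] ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current[j + 1] = min(substitution, deletion, insertion)
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

// One substituted word out of three reference words.
let wer = wordErrorRate(reference: "the cat sat", hypothesis: "the cat sit")
print(wer)
```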
Download Russian model: Link to model
For other languages or custom domains, contact me: otosaku.dsp@gmail.com
OtosakuStreamingASR API
| Method | Description |
|---|---|
| `prepareModel(from:)` | Load model from directory |
| `predictChunk(rawChunk:)` | Submit audio frame (`[Double]`) |
| `stop()` | Finalize and decode remaining buffer |
| `reset()` | Reset model state |
| `subscribe { String in }` | Receive transcribed text in real time |
⚠️ Input audio must be sampled at 16 kHz and normalized to [-1.0, 1.0], with exactly 2559 samples per chunk (about 160 ms at 16 kHz).
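As a sketch of the normalization step, the snippet below converts 16-bit PCM samples to `Double` values in the required range and checks the chunk duration arithmetic. It assumes the audio is already mono 16-bit PCM at 16 kHz. The function name is hypothetical, and resampling from other rates is not covered.

```swift
/// Converts 16-bit PCM samples to Double values in [-1.0, 1.0).
/// Hypothetical helper; assumes the input is already sampled at 16 kHz.
func normalize(pcm16 samples: [Int16]) -> [Double] {
    // Dividing by 32768 maps Int16.min to exactly -1.0 and Int16.max to just under 1.0.
    return samples.map { Double($0) / 32768.0 }
}

let normalized = normalize(pcm16: [Int16.min, 0, Int16.max])
print(normalized)

// Duration of one chunk: 2559 samples / 16000 samples per second, in milliseconds.
let sampleRate = 16_000.0
let chunkSize = 2559
let chunkDurationMs = Double(chunkSize) / sampleRate * 1000.0
print(chunkDurationMs)
```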
Privacy First
This package is designed with privacy in mind:
- Runs entirely on-device
- No cloud calls or external dependencies
Contact
If you have any questions, suggestions, or are interested in adapting the model to another language or domain:
Email: otosaku.dsp@gmail.com
License
MIT License