🧠 OtosakuStreamingASR-iOS

June 4, 2026

OtosakuStreamingASR is a lightweight on-device streaming speech recognition engine for iOS. It performs real-time audio processing using a Conformer-based architecture and CTC decoding.

Important

🚀 Successor project: VoxRT

Streaming on-device ASR has moved to VoxRT, a ground-up rewrite on a custom Rust inference runtime. It follows the same idea as OtosakuStreamingASR, but is faster, smaller, and now available on both platforms:

  • iOS → voxrt-asr-ios: NEON-accelerated FastConformer (32M), RTF 0.08–0.10 on iPhone 13 Pro Max, fully offline.
  • Android → voxrt-asr-android: same engine, ~150 ms latency, no cloud.

This repo stays up for reference, but new development happens in VoxRT. 👉 Start there.

🚀 Features

  • ✅ Fully offline
  • 🎙 Real-time streaming speech recognition
  • 🛠 Modular architecture (feature extractor, encoder, decoder)

🎥 Demo

Watch the model running live on iPhone 13:

Demo running on iPhone


📆 Installation

Add the Swift Package to your Xcode project:

https://github.com/Otosaku/OtosakuStreamingASR-iOS
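If you consume the package from a `Package.swift` manifest rather than through the Xcode UI, the dependency could be declared roughly as follows. This is a sketch under assumptions: the branch name `main`, the product name `OtosakuStreamingASR` (inferred from the module name used in the usage example), the minimum iOS version, and the `MyApp` target are not confirmed by this README.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp",                      // hypothetical app/package name
    platforms: [.iOS(.v15)],            // assumption: check the package's real minimum
    dependencies: [
        // Assumption: the default branch is "main"; pin a tag instead if one exists.
        .package(url: "https://github.com/Otosaku/OtosakuStreamingASR-iOS", branch: "main")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // Assumption: the library product shares the module name.
                .product(name: "OtosakuStreamingASR", package: "OtosakuStreamingASR-iOS")
            ]
        )
    ]
)
```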

🧰 Usage Example

import OtosakuStreamingASR

let asr = OtosakuStreamingASR()

try asr.prepareModel(from: modelURL)

asr.subscribe { text in
    print("🗣 Recognized: \(text)")
}

// Raw audio chunk: [Double] in range [-1.0, 1.0], exactly 2559 samples per chunk (about 160 ms at 16 kHz)
try asr.predictChunk(rawChunk: yourRawAudioChunk)

try asr.stop() // Finalize and decode remaining buffer

asr.reset() // Reset internal model state
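Because the engine accepts only fixed-size chunks, a capture buffer of arbitrary length has to be cut up before it is submitted. The sketch below shows one way to do that in plain Swift. `splitIntoChunks` is a hypothetical helper, not part of OtosakuStreamingASR.

```swift
import Foundation

/// Hypothetical helper (not part of OtosakuStreamingASR): splits a buffer of
/// normalized samples into fixed-size chunks and returns the leftover tail.
func splitIntoChunks(_ samples: [Double],
                     chunkSize: Int = 2559) -> (chunks: [[Double]], remainder: [Double]) {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var chunks: [[Double]] = []
    var start = 0
    // Emit only full chunks, since the engine expects exactly `chunkSize` samples.
    while start + chunkSize <= samples.count {
        chunks.append(Array(samples[start..<(start + chunkSize)]))
        start += chunkSize
    }
    // Keep the tail so the caller can prepend it to the next captured buffer.
    return (chunks, Array(samples[start...]))
}

// 6000 samples yield two full chunks (5118 samples) and an 882-sample tail.
let buffer = [Double](repeating: 0.0, count: 6000)
let (chunks, remainder) = splitIntoChunks(buffer)
print(chunks.count, remainder.count)
```

Each element of `chunks` can then be passed to `try asr.predictChunk(rawChunk:)`. How a short final tail should be handled at the end of a stream (for example, zero-padding before `stop()`) is not specified here, so treat that part as an open design choice.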

🧠 Model Details

  • Architecture: Fast Conformer (Cache-Aware Streaming)

  • Language: 🇷🇺 Russian (fine-tuned from English)

  • Training: 250 hours of Russian speech (30 epochs)

  • WER (Word Error Rate):

    • Russian (fine-tuned): 11%
    • English (before fine-tuning): 6.5% on LibriSpeech test-other
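For readers unfamiliar with the metric, WER is conventionally defined as (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the hypothesis, and N is the number of reference words. The sketch below computes it with a word-level edit distance. It illustrates the standard definition only; the text normalization and evaluation setup behind the figures above are not described in this README.

```swift
import Foundation

/// Word error rate: word-level edit distance divided by reference length.
/// Convention used here for an empty reference: 0 if the hypothesis is also
/// empty, otherwise 1.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ")
    let hyp = hypothesis.split(separator: " ")
    guard !ref.isEmpty else { return hyp.isEmpty ? 0.0 : 1.0 }

    // previous[j] holds the distance between ref[0..<i-1] and hyp[0..<j].
    var previous = Array(0...hyp.count)
    for i in 1...ref.count {
        var current = [i] + [Int](repeating: 0, count: hyp.count)
        for j in 0..<hyp.count {
            let cost = ref[i - 1] == hyp[j] ? 0 : 1
            current[j + 1] = min(previous[j + 1] + 1,   // deletion
                                 current[j] + 1,        // insertion
                                 previous[j] + cost)    // match or substitution
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

// One substitution (sat -> sit) and one deletion (the) over 6 reference words.
let wer = wordErrorRate(reference: "the cat sat on the mat",
                        hypothesis: "the cat sit on mat")
print(wer)
```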

🔗 Download Russian model: Link to model

For other languages or custom domains, contact me:

📧 otosaku.dsp@gmail.com


🧵 OtosakuStreamingASR API

| Method | Description |
| --- | --- |
| prepareModel(from:) | Load model from directory |
| predictChunk(rawChunk:) | Submit audio frame ([Double]) |
| stop() | Finalize and decode remaining buffer |
| reset() | Reset model state |
| subscribe { String in } | Receive transcribed text in real time |

⚠️ Input audio must be sampled at 16 kHz, normalized to [-1.0, 1.0], and delivered in chunks of exactly 2559 samples (about 160 ms at 16 kHz).
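Microphone capture often produces 16-bit integer PCM rather than normalized doubles, so a conversion step is usually needed before chunks are submitted. The sketch below shows that conversion and the chunk-duration arithmetic. `normalizePCM16` is a hypothetical helper, and it assumes the audio is already mono at 16 kHz (resampling is out of scope here).

```swift
import Foundation

/// Hypothetical helper (not part of OtosakuStreamingASR): converts 16-bit PCM
/// samples to Doubles in the range [-1.0, 1.0).
func normalizePCM16(_ pcm: [Int16]) -> [Double] {
    // Dividing by 32768 maps Int16.min to -1.0 and Int16.max to just under 1.0.
    return pcm.map { Double($0) / 32768.0 }
}

let chunkSize = 2559
let sampleRate = 16_000.0
// 2559 samples at 16 kHz last roughly 159.94 ms.
let chunkDurationMs = Double(chunkSize) / sampleRate * 1000.0

let normalized = normalizePCM16([Int16.min, 0, Int16.max])
print(normalized, chunkDurationMs)
```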


🔒 Privacy First

This package is designed with privacy in mind:

  • Runs entirely on-device
  • No cloud calls or external dependencies

📩 Contact

If you have any questions, suggestions, or are interested in adapting the model to another language or domain:

Email: otosaku.dsp@gmail.com


📄 License

MIT License