Models

May 9, 2026

A guide to each CoreML model pipeline in FluidAudio.

ASR Models

Sliding-Window Transcription (Near Real-Time)

Long-form audio processed via SlidingWindowAsrManager — chunked, overlapped, and stitched. Distinct from the Streaming Transcription section below, which uses cache-aware encoders that emit partials as audio arrives.

| Model | Description | Context |
| --- | --- | --- |
| Parakeet TDT v2 | Batch speech-to-text, English only (0.6B params). TDT architecture. | First ASR model added. |
| Parakeet TDT v3 | Batch speech-to-text, 25 European languages (0.6B params). Default ASR model. | Released after v2 to add multilingual support. |
| Parakeet TDT-CTC-110M | Hybrid TDT-CTC batch model (110M params). 3.01% WER on LibriSpeech test-clean, 96.5x RTFx on an M2 Mac. Fused preprocessor + encoder for a reduced memory footprint. iOS compatible. | Smaller, faster alternative to v3 with competitive accuracy. |
| Parakeet TDT Japanese | Batch speech-to-text, Japanese only (0.6B params). Hybrid model: INT8 CTC-trained preprocessor + encoder paired with a TDT decoder + joint. 6.85% CER on JSUT, 10.8x RTFx on M2. | CTC-only Japanese inference was removed in 846924a1d; only the preprocessor + encoder from the original CTC repo are reused. |
| Parakeet CTC Chinese | Batch speech-to-text, Mandarin Chinese (0.6B params). CTC architecture. 8.37% mean CER on the THCHS-30 dataset. INT8 encoder (0.55 GB) or FP32 (1.1 GB). | First Mandarin Chinese ASR model. Uses a CTC greedy decoder. |
| Cohere Transcribe (FluidAudio#487, #537) | Batch encoder-decoder speech-to-text, 14 languages (en/fr/de/es/it/pt/nl/pl/el/ar/ja/zh/ko/vi). 48-layer Conformer encoder + 8-layer transformer decoder with external KV cache. Mixed precision: INT8 encoder (1.8 GB, iOS 18+) + FP32 ANE-resident static-shape decoder (v2, ~1.6x faster on Apple Silicon than the dynamic FP16 v1 decoder). Hard 35 s per-call audio cap (max_audio_clip_s from upstream config), 16,384-token SentencePiece vocab. Language must be passed explicitly via the conditioned prompt. | First Cohere Transcribe port; the ANE-optimized v2 decoder (#537) lands a fixed [1, 1, 1, 108] attention_mask so the decoder stays on the Neural Engine. |
| Qwen3-ASR (FluidAudio#281, #312, #410) | Batch encoder-decoder speech-to-text, 30 languages with automatic language detection (zh/en/yue/ja/ko/vi/th/id/ms/hi/ar/tr/ru/de/fr/es/pt/it/nl/pl/sv/da/fi/cs/fil/fa/el/hu/mk/ro). 0.6B params. 2-model pipeline (ANE-optimized audio encoder + 28-layer stateful decoder with fused embedding/lm_head). FP32 (~1.1 GB) and INT8 (~0.6 GB) variants. ~60–80 ms per token, 1 s audio windows (100 mel frames at 10 ms hop). macOS 15 / iOS 18+. | Beta: accuracy may trail the PyTorch reference; see Benchmarks for FLEURS results across all 30 languages. |

The TDT/CTC models above are wrapped by SlidingWindowAsrManager, which chunks audio (~15 s windows with overlap) and stitches the per-chunk transcripts.
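The chunk-and-stitch approach can be sketched in a few lines of Python. This is an illustrative sketch, not FluidAudio's actual implementation: the chunk and overlap durations mirror the ~15 s description above, while the word-overlap merge heuristic and both function names are assumptions.

```python
def chunk_audio(samples, sample_rate=16000, chunk_s=15.0, overlap_s=2.0):
    """Split audio into overlapping ~15 s windows for per-chunk transcription."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk])
        if start + chunk >= len(samples):
            break  # last window already covers the tail
    return chunks

def stitch(transcripts, max_overlap_words=6):
    """Merge per-chunk word lists by dropping the longest repeated run
    at each chunk boundary (a toy stand-in for overlap reconciliation)."""
    merged = list(transcripts[0])
    for words in transcripts[1:]:
        drop = 0
        for k in range(min(max_overlap_words, len(merged), len(words)), 0, -1):
            if merged[-k:] == words[:k]:
                drop = k
                break
        merged.extend(words[drop:])
    return merged

# 20 s of audio -> two overlapping windows; duplicated boundary words collapse.
n_chunks = len(chunk_audio([0.0] * 16000 * 20))
text = stitch([["hello", "world", "this"], ["this", "is", "a", "test"]])
```

A real stitcher would reconcile token timestamps in the overlap region rather than match raw words, but the chunk/step arithmetic is the same.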

Streaming Transcription (True Real-Time)

| Model | Description | Context |
| --- | --- | --- |
| Parakeet EOU | Streaming speech-to-text with end-of-utterance detection (120M params). Three chunk-size variants — 160ms / 320ms / 1280ms — spanning ultra-low-latency to higher-accuracy streaming. | Added after the TDT models to support streaming. Smaller model (120M vs 0.6B params). |
| Nemotron Speech Streaming 0.6B | RNNT streaming ASR with 4 chunk-size variants (80ms, 160ms, 560ms, 1120ms). English only (0.6B params). INT8 encoder quantization. Spans ultra-low latency (80ms chunks) to high accuracy (1120ms chunks). | Larger streaming model for better accuracy and quality. |
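The chunk-size variants trade latency against accuracy: smaller chunks mean partials arrive sooner but the encoder runs more often with less context per call. A quick back-of-the-envelope, assuming the 16 kHz input rate these models use:

```python
SAMPLE_RATE = 16_000

def per_chunk(chunk_ms):
    """Samples fed per streaming update, and encoder updates per second."""
    samples = SAMPLE_RATE * chunk_ms // 1000
    updates_per_s = 1000 / chunk_ms
    return samples, updates_per_s

for ms in (80, 160, 560, 1120):  # Nemotron streaming variants
    samples, rate = per_chunk(ms)
    print(f"{ms:>5} ms -> {samples:>6} samples/update, {rate:.1f} updates/s")
```

The 80ms variant must finish each encoder call in well under 80ms of wall time to stay real-time, which is why the smallest chunks demand the most from the hardware.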

Custom Vocabulary / Keyword Spotting

| Model | Description | Context |
| --- | --- | --- |
| Parakeet CTC 110M | CTC-based encoder for custom keyword spotting. Runs rescoring alongside TDT to boost domain-specific terms (names, jargon). | |
| Parakeet CTC 0.6B | Larger CTC variant (same role as the 110M) with better quality. | |
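The rescoring idea can be sketched as simple hypothesis boosting. This is a minimal illustration of the general technique, not the actual TDT+CTC fusion: the `boost` weight, the substring match, and the function name are all assumptions.

```python
def rescore(hypotheses, keywords, boost=2.0):
    """Re-rank (text, log_score) hypotheses, adding a bonus per matched keyword."""
    def boosted(item):
        text, score = item
        hits = sum(1 for kw in keywords if kw.lower() in text.lower())
        return score + boost * hits
    return max(hypotheses, key=boosted)

# The keyword bonus lets the slightly lower-scored hypothesis win.
best = rescore(
    [("meet coral ai tomorrow", -4.1), ("meet core el ai tomorrow", -3.9)],
    keywords=["Coral AI"],
)
```

In a real system the CTC branch scores candidate keyword spans acoustically rather than via string matching, but the re-ranking shape is the same.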

VAD Models

| Model | Description | Context |
| --- | --- | --- |
| Silero VAD | Voice activity detection; classifies 256ms windows as speech vs. silence. Segments audio before ASR or diarization. | Support model that other pipelines build on. Converted because it was the best available model at the time. |
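Turning per-window speech probabilities into segments is essentially thresholding plus run collapsing. The 256ms window matches the description above; the 0.5 threshold and the function name are illustrative assumptions, and a production VAD would also add hangover/padding around each span.

```python
WINDOW_S = 0.256  # Silero VAD operates on 256 ms windows

def speech_spans(probs, threshold=0.5):
    """Collapse per-window speech probabilities into (start_s, end_s) spans."""
    spans, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # speech run begins
        elif p < threshold and start is not None:
            spans.append((start * WINDOW_S, i * WINDOW_S))
            start = None                   # speech run ends
    if start is not None:                  # audio ended mid-speech
        spans.append((start * WINDOW_S, len(probs) * WINDOW_S))
    return spans

# Two speech runs: windows 1-2 and window 4.
spans = speech_spans([0.1, 0.9, 0.8, 0.2, 0.7])
```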

Diarization Models

| Model | Description | Context |
| --- | --- | --- |
| LS-EEND | Research-prototype end-to-end streaming diarization model from Westlake University. Supports both streaming and complete-buffer inference for up to 10 speakers. Uses frame-in, frame-out processing, requiring 900ms of warmup audio and 100ms per update. | Added after Sortformer to support larger speaker counts. |
| Sortformer | NVIDIA's enterprise-grade end-to-end streaming diarization model. Supports both streaming and complete-buffer inference for up to 4 speakers. More stable than LS-EEND, but sometimes misses speech. Processes audio in chunks, requiring 1040ms of warmup audio and 480ms per update for the low-latency variants. | Added after Pyannote to support low-latency streaming diarization. |
| Pyannote CoreML Pipeline | Speaker diarization. Segmentation model + WeSpeaker embeddings for clustering. Online/streaming pipeline (DiarizerManager) based on pyannote/speaker-diarization-3.1. Offline batch pipeline (OfflineDiarizerManager) based on pyannote/speaker-diarization-community-1. | First diarizer model added. Converted from Pyannote with a custom-made batching mode. |
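End-to-end diarizers like LS-EEND and Sortformer emit per-frame, per-speaker activity; the final step is collapsing that matrix into labeled segments. A minimal sketch of that post-processing — the 80ms frame hop, the threshold, and the function name are illustrative assumptions, not model constants:

```python
FRAME_S = 0.08  # illustrative frame hop, not a model constant

def to_segments(activity, threshold=0.5):
    """activity[t][s] = probability speaker s is active in frame t.
    Returns (speaker, start_s, end_s) tuples, one per contiguous run."""
    if not activity:
        return []
    out = []
    for s in range(len(activity[0])):
        start = None
        for t, frame in enumerate(activity):
            active = frame[s] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                out.append((s, start * FRAME_S, t * FRAME_S))
                start = None
        if start is not None:              # run extends to end of audio
            out.append((s, start * FRAME_S, len(activity) * FRAME_S))
    return out

# Speaker 0 active in frames 0-1, speaker 1 in frames 1-2 (an overlap frame).
segments = to_segments([[0.9, 0.1], [0.9, 0.8], [0.1, 0.9]])
```

Because each speaker channel is scanned independently, overlapping speech naturally yields overlapping segments.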

TTS Models

| Model | Description | Context |
| --- | --- | --- |
| Kokoro TTS | Text-to-speech synthesis (82M params), 48 voices, minimal RAM usage on iOS. Generates all frames at once via flow matching over mel spectrograms + a Vocos vocoder. Uses a CoreML G2P model for phonemization. | First TTS backend added; supports custom pronunciations. |
| Kokoro ANE (7-stage) | Same Kokoro 82M weights split into 7 CoreML stages so the ANE-friendly layers (Albert / PostAlbert / Alignment / Vocoder) stay resident on the Neural Engine while Prosody / Noise / Tail run on CPU+GPU. 3-11x RTFx vs. the single-graph Kokoro. Single voice (af_heart), ≤510 IPA phonemes per call, no chunker / SSML / custom lexicon. | ANE-optimized variant derived (with permission) from laishere/kokoro-coreml. |
| PocketTTS | Second TTS backend (~155M params). Autoregressive frame-by-frame generation with dynamic audio chunking. No phoneme stage; works directly on text tokens. | Supports streaming, minimal RAM usage, excellent quality. |
| StyleTTS2 (LibriTTS, iteration_3) | Reference-audio–driven zero-shot English TTS. 8-stage CoreML pipeline (text_encoder → bert → ref_encoder → fused_diffusion_sampler → duration_predictor → fused_f0n_har_source → decoder_pre → decoder_upsample) with 3 lazily-loaded T = 64 / 128 / 256 bucket variants of bert / fused_diffusion_sampler. 5-step ADPM2 Karras-σ diffusion sampler with α/β style blending against a speaker reference clip. 24 kHz mono fp32 output. The phonemizer reuses Kokoro's Misaki lexicon cache + BART G2P CoreML model, with Misaki uppercase diphthong shorthand (A O I Y W → eɪ oʊ aɪ ɔɪ aʊ) expanded before encoding so the output matches the espeak IPA the model was trained on. Callers with a higher-quality phonemizer can bypass the stack via StyleTTS2Manager.synthesize(ipa:...). See StyleTTS2.md. | Zero-shot voice cloning from a single reference WAV; English only. |
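The diphthong shorthand expansion mentioned for StyleTTS2 is a simple character map over the A O I Y W pairs listed above. A sketch — the function name and the example word are assumptions, and real Misaki strings mix the shorthand with stress marks and ordinary IPA:

```python
# Misaki uppercase diphthong shorthand -> espeak-style IPA
DIPHTHONGS = {"A": "eɪ", "O": "oʊ", "I": "aɪ", "Y": "ɔɪ", "W": "aʊ"}

def expand_shorthand(ipa):
    """Expand uppercase shorthand so encoder input matches training IPA."""
    return "".join(DIPHTHONGS.get(ch, ch) for ch in ipa)

expanded = expand_shorthand("hˈI")  # illustrative shorthand string
```

Lowercase IPA and stress marks pass through untouched; only the five uppercase shorthand letters are rewritten.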

Evaluated Models (Not Supported)

Models we converted and tested but do not support: too large for on-device deployment, too limited, or superseded by better approaches.

| Model | Status |
| --- | --- |
| KittenTTS (FluidAudio#409, HF) | Not supported due to inefficient espeak alternatives. Nano (15M) and Mini (82M) variants. |
| Qwen3-TTS (FluidAudio#290, mobius#20, HF) | Now 1.1 GB but too slow; needs further testing. |
| Qwen3-ForcedAligner-0.6B (FluidAudio#315, mobius#21, HF) | 5-model CoreML pipeline, large footprint. Low upstream adoption (per Qwen ASR CoreML model downloads). |

Model Sources

| Model | HuggingFace Repo |
| --- | --- |
| Parakeet TDT v3 | FluidInference/parakeet-tdt-0.6b-v3-coreml |
| Parakeet TDT v2 | FluidInference/parakeet-tdt-0.6b-v2-coreml |
| Parakeet TDT-CTC-110M | FluidInference/parakeet-tdt-ctc-110m-coreml |
| Parakeet TDT Japanese | FluidInference/parakeet-0.6b-ja-coreml (hybrid: CTC preprocessor/encoder + TDT decoder/joint) |
| Parakeet CTC Chinese | FluidInference/parakeet-ctc-0.6b-zh-cn-coreml |
| Parakeet CTC 110M | FluidInference/parakeet-ctc-110m-coreml |
| Parakeet CTC 0.6B | FluidInference/parakeet-ctc-0.6b-coreml |
| Parakeet EOU | FluidInference/parakeet-realtime-eou-120m-coreml (subdirs: /160ms, /320ms, /1280ms) |
| Cohere Transcribe (INT8 hybrid, default) | FluidInference/cohere-transcribe-03-2026-coreml (variant: /q8) |
| Qwen3-ASR | FluidInference/qwen3-asr-0.6b-coreml (variants: /f32, /int8) |
| Silero VAD | FluidInference/silero-vad-coreml |
| Diarization (Pyannote) | FluidInference/speaker-diarization-coreml |
| LS-EEND | FluidInference/lseend-coreml |
| Sortformer | FluidInference/diar-streaming-sortformer-coreml |
| Kokoro TTS | FluidInference/kokoro-82m-coreml |
| Kokoro ANE (7-stage) | FluidInference/kokoro-82m-coreml/tree/main/ANE |
| PocketTTS | FluidInference/pocket-tts-coreml |
| StyleTTS2 (LibriTTS, iteration_3) | FluidInference/StyleTTS-2-coreml/iteration_3/compiled (shared phonemizer assets pulled from FluidInference/kokoro-82m-coreml: G2PEncoder.mlmodelc, G2PDecoder.mlmodelc, g2p_vocab.json, us_lexicon_cache.json) |
| Magpie TTS Multilingual | FluidInference/magpie-tts-multilingual-357m-coreml |
| CosyVoice3 (Mandarin) | FluidInference/CosyVoice3-0.5B-coreml |
| Nemotron Streaming | FluidInference/nemotron-speech-streaming-en-0.6b-coreml |