# kokoro-coreml
April 28, 2026
Convert Kokoro TTS to CoreML — fp16+int8pal preset (80MB), 25× real-time on M4 Mac Mini, 17× on iPhone 16 Pro, with the bulk of the workload running on ANE.
Produces 7 mlpackages: KokoroAlbert, KokoroPostAlbert, KokoroAlignment, KokoroProsody, KokoroNoise, KokoroVocoder, KokoroTail.
Planned for integration into my reader app, ZRead, for on-device TTS.
## Setup

Requires uv and espeak-ng:

```sh
brew install espeak-ng
uv sync
```
## Convert

```sh
uv run python convert.py
```

Outputs the 7 mlpackages plus `ref.wav` (PyTorch) and `test.wav` (CoreML chain) to `output/`. Flags: `--max-frames N` (default 2000, ≈50s of audio — covers ALBERT's 510-token cap with headroom) and `--stages name1 name2 ...` (skipped stages reuse existing mlpackages).
## Benchmark

```sh
uv run python benchmark.py
```

Six real prose passages (T_enc 13 → 512); 25× real-time on a max-length 510-phoneme paragraph (28s of audio in ~1.1s, measured on an M4 Mac Mini, 24GB). WAVs are written to `output/benchmark/`.

| # | T_enc | Audio | Passage |
|---|---|---|---|
| 1 | 13 | passage_1_13tok.wav | "Hello there." |
| 2 | 66 | passage_2_66tok.wav | "The early morning sun cast long shadows…" |
| 3 | 128 | passage_3_128tok.wav | "She had been waiting for nearly an hour…" |
| 4 | 272 | passage_4_272tok.wav | "In the quiet town of Millbrook…" |
| 5 | 457 | passage_5_457tok.wav | "The expedition departed on March 14th, 1923…" |
| 6 | 512 | passage_6_512tok.wav | "The history of human flight…" (510-phoneme cap) |
```
========================================================================
 T_enc   T_a  audio_s  chain_ms  speed | per-stage ms
------------------------------------------------------------------------
    13    60     1.50      75.6  19.8x | Alb=4 Pos=3 Ali=1 Pro=1 Noi=23 Voc=31 Tai=1
    66   178     4.45     172.3  25.8x | Alb=11 Pos=14 Ali=1 Pro=3 Noi=46 Voc=82 Tai=3
   128   323     8.07     294.6  27.4x | Alb=15 Pos=26 Ali=1 Pro=4 Noi=74 Voc=148 Tai=6
   272   651    16.27     642.6  25.3x | Alb=37 Pos=54 Ali=2 Pro=8 Noi=141 Voc=344 Tai=8
   457  1078    26.95    1047.1  25.7x | Alb=70 Pos=90 Ali=4 Pro=12 Noi=217 Voc=585 Tai=14
   512  1125    28.12    1101.1  25.5x | Alb=74 Pos=103 Ali=4 Pro=13 Noi=228 Voc=635 Tai=13

Mean speed: 25.0x real-time (higher is faster; >1.0x = real-time)
```
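The per-stage breakdown boils down to timing each `predict()` call. A minimal harness sketch: the mlpackage names are the real ones from `output/`, but the feed dicts are stage-specific and elided here (each stage's input names can be read from its spec).

```python
import time

import coremltools as ct

STAGES = ["KokoroAlbert", "KokoroPostAlbert", "KokoroAlignment",
          "KokoroProsody", "KokoroNoise", "KokoroVocoder", "KokoroTail"]

# Load once; the first predict() per model includes compilation and
# warm-up, so time the second run onward.
models = {name: ct.models.MLModel(f"output/{name}.mlpackage") for name in STAGES}

def timed_predict(name: str, feed: dict) -> dict:
    """Run one stage and print its wall-clock latency in ms."""
    t0 = time.perf_counter()
    out = models[name].predict(feed)
    print(f"{name}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
    return out

# To see what a stage expects as input:
# print(models["KokoroAlbert"].get_spec().description.input)
```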
## iOS

The same chain runs on iPhone via `iOSDemo/`. Mean 16.9× real-time on iPhone 16 Pro across the same six passages.
To run the demo, drop the 7 mlpackages from `output/` (after running convert.py) into `iOSDemo/iOSDemo/Models/`. The `Resources/` folder ships the precomputed phonemes (`benchmark_data.json`), the af_heart voice (`af_heart.bin`, 510 × 256 fp32), and the phoneme `vocab.json` — regenerate with `uv run python dump_benchmark_data.py` if you edit the benchmark texts. G2P runs offline in Python, so the app doesn't need espeak-ng.
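`af_heart.bin` is raw fp32 with the shape given above. A minimal sketch of reading it in Python; the row-major layout and the index-by-phoneme-count convention are assumptions, not documented behavior:

```python
import numpy as np

# 510 x 256 fp32 style vectors, one row per phoneme count (assumed layout)
voice = np.fromfile("af_heart.bin", dtype=np.float32).reshape(510, 256)

def style_for(n_phonemes: int) -> np.ndarray:
    # Upstream Kokoro indexes the voice table by token count; treating
    # this repo's export the same way is an assumption.
    return voice[n_phonemes - 1]
```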
## Stages and compute placement
| Stage | Precision | Compute Unit | Role |
|---|---|---|---|
| Albert | fp16 + int8pal | CPU_AND_NE | text encoder |
| PostAlbert | fp16 + int8pal | CPU_AND_NE | duration + d + t_en |
| Alignment | fp16 + int8pal | CPU_AND_NE | length regulation (cumsum + broadcast) |
| Prosody | fp16 + int8pal | ALL | F0 + N |
| Noise | fp32 + int8pal | ALL | SineGen + STFT + noise convs |
| Vocoder | fp16 + int8pal | CPU_AND_NE | dual output: anchor + x_pre |
| Tail | fp32 | ALL | conv_post + exp + sin + iSTFT |
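A minimal sketch of what this placement looks like at load time: the stage to compute-unit mapping mirrors the table, while the loading code itself is illustrative.

```python
import coremltools as ct

# Stage -> compute unit, per the table above. CPU_AND_NE keeps the
# scheduler off the GPU; ALL lets the fp32 stages use CPU/GPU freely.
COMPUTE_UNITS = {
    "KokoroAlbert":     ct.ComputeUnit.CPU_AND_NE,
    "KokoroPostAlbert": ct.ComputeUnit.CPU_AND_NE,
    "KokoroAlignment":  ct.ComputeUnit.CPU_AND_NE,
    "KokoroProsody":    ct.ComputeUnit.ALL,
    "KokoroNoise":      ct.ComputeUnit.ALL,
    "KokoroVocoder":    ct.ComputeUnit.CPU_AND_NE,
    "KokoroTail":       ct.ComputeUnit.ALL,
}

models = {
    name: ct.models.MLModel(f"output/{name}.mlpackage", compute_units=unit)
    for name, unit in COMPUTE_UNITS.items()
}
```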
## Design notes

This pipeline is the result of a long sequence of dead ends. Key lessons:
**Fight the ANE scheduler explicitly.** The scheduler is opaque — `compute_units=ALL` may silently spill ANE-eligible ops to GPU. Force `CPU_AND_NE` first. If the graph stays on ANE, ship it. If ops fall off ANE, adjust the graph (split the model, add anchor outputs, replace problematic op patterns) until they don't.
**Avoid GPU on iOS.** GPU work is suspended for backgrounded apps before iOS 26. The iOS 26+ entitlement `com.apple.developer.background-tasks.continued-processing.gpu` (with `BGContinuedProcessingTask`) lifts this, but ANE-only stays simpler and works on older iOS.
**Standalone models, not one merged graph.** Primary reason: different stages need different compute units. We prefer fp16 everywhere — it's the precondition for ANE — but Noise and Tail lose audio quality in fp16 (sin/cumsum accumulate phase error past fp16's range), so they have to run fp32, which means off ANE. The rest stays fp16 on ANE. A merged graph picks one compute unit and the scheduler can't honor both. Splitting also sidesteps two CoreML quirks: `FP16ComputePrecision(op_selector=...)` is broken (it skips `cast_to_fp16` on selected ops but doesn't insert `cast_to_fp32` at the boundaries, so the "fp32" ops run fp16 anyway), and `make_pipeline()` locks shapes.
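Since the precision split happens at conversion time rather than via `op_selector`, each stage is converted as its own mlprogram. A sketch, assuming TorchScript tracing and fixed example shapes (the real converter uses flexible shapes via `ct.RangeDim`):

```python
import coremltools as ct
import torch

def convert_stage(module: torch.nn.Module, example_inputs: tuple,
                  fp32: bool, out_path: str) -> None:
    """Convert one stage with a whole-graph precision choice.

    Setting compute_precision per model sidesteps the broken
    FP16ComputePrecision(op_selector=...) path entirely.
    """
    traced = torch.jit.trace(module.eval(), example_inputs)
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=t.shape) for t in example_inputs],
        compute_precision=ct.precision.FLOAT32 if fp32 else ct.precision.FLOAT16,
        convert_to="mlprogram",
    )
    mlmodel.save(out_path)
```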
**Noise stage is fp32 + separated.** SineGen does `cumsum(F0/SR) * 2π * 300 → sin()`; in fp16 the accumulated phase is huge and `sin()` collapses (corr 0.94 → 0.82). Splitting it out also dodges a scheduler blocker: two RangeDim streams merging in one model causes 141+ GPU ops (a 6× slowdown).
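A toy numpy repro of the fp16 failure mode. All constants are illustrative, and it collapses harder than the in-model 0.94 → 0.82 because nothing here wraps the phase:

```python
import numpy as np

SR, F0, SECONDS = 24000, 220.0, 20.0          # illustrative constants
steps = np.full(int(SR * SECONDS), F0 / SR)

# fp32 reference phase vs fp16-accumulated phase. fp16 spacing grows
# with magnitude, so the running sum eventually stops moving entirely.
phase32 = np.cumsum(steps.astype(np.float32)) * 2.0 * np.pi
phase16 = np.cumsum(steps.astype(np.float16)).astype(np.float32) * 2.0 * np.pi

corr = np.corrcoef(np.sin(phase32), np.sin(phase16))[0, 1]
print(f"corr(sin fp32, sin fp16) = {corr:.3f}")  # far below 1.0
```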
**Vocoder dual-output trick — the canonical scheduler fight.** ANE accumulates in pure fp16 (unlike GPU/PyTorch with fp32 accumulators), so 36 chained Conv+Snake passes amplify error through `exp()` in the iSTFT → hoarseness. Mixed precision, phase reduction, fp16 fine-tuning, and cos substitution all fail. Splitting the model to do the iSTFT tail off-ANE in fp32 also fails on its own — the truncated body's `[1, 128, T]` output makes the scheduler bail (168/617 ops to CPU, 353ms vs 44ms). Working solution: keep the full vocoder graph, add `x_pre` as a second output, and discard the original audio output. The discarded output is a graph anchor that keeps the scheduler committed to ANE. A separate fp32 Tail mlpackage runs conv_post + exp + sin + iSTFT on CPU/GPU (~2ms) for clean audio.
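One way to express the trick at trace time is a thin wrapper that surfaces an intermediate as a second output via a forward hook. A sketch; the layer name is hypothetical:

```python
import torch

class DualOutput(torch.nn.Module):
    """Run the full vocoder graph but also return an intermediate.

    The final `audio` output is computed and then discarded by the
    caller; it exists only to anchor the scheduler to the full graph.
    """
    def __init__(self, vocoder: torch.nn.Module, layer_name: str):
        super().__init__()
        self.vocoder = vocoder
        self._x_pre = None
        layer = dict(vocoder.named_modules())[layer_name]  # e.g. "pre_istft" (hypothetical)
        layer.register_forward_hook(lambda m, i, o: setattr(self, "_x_pre", o))

    def forward(self, *args):
        audio = self.vocoder(*args)
        return audio, self._x_pre
```

After tracing and converting this wrapper, only `x_pre` is consumed; the fp32 Tail mlpackage recomputes the audio from it.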
**Cos Snake.** sin²(αx) = (1 − cos(2αx))/2. Slightly faster on ANE; quality identical.
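Concretely, assuming the standard Snake form x + sin²(αx)/α used in BigVGAN-style vocoders:

```python
import torch

def snake(x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    # standard Snake activation
    return x + torch.sin(alpha * x) ** 2 / alpha

def snake_cos(x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    # identical function via sin^2(t) = (1 - cos(2t)) / 2
    return x + (1.0 - torch.cos(2.0 * alpha * x)) / (2.0 * alpha)
```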
**Int8 palettization on the vocoder is safe here** because its audio output is discarded — palette artifacts never reach the listener. The fp32 Tail (which produces the audio) is unpalettized.
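The palettization itself is the stock coremltools pass. A sketch, assuming 8-bit k-means palettes applied model-wide (paths illustrative):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("output/KokoroVocoder.mlpackage")
config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=8)
)
# Re-clusters every weight tensor to a 256-entry palette (int8 indices).
palettized = cto.palettize_weights(mlmodel, config)
palettized.save("output/KokoroVocoder_int8pal.mlpackage")
```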
**Distillation was a detour that surfaced the noise-stream insight.** I suspected the vocoder's `ConvTranspose1d` upsamples were the ANE blocker, so I distilled into a generator with nearest-neighbor upsample + `Conv1d` in place of `ConvTranspose1d`. The distilled student compiled, but most ops still wouldn't schedule onto ANE; bisecting why led to the dual-RangeDim scheduler blocker — splitting the noise generator into its own mlpackage fixed it. Quality issues in the student were closed separately by better training (improved losses, layer-tuning pretrain, then MPD adversarial training). Once the noise split was understood, the same split applied to the original (undistilled) vocoder, so distillation was dropped from the shipping pipeline.