Supported models
June 19, 2026 ยท View on GitHub
This page summarizes model-family support in the v0.0.27 source tree. The runtime source of truth is the code, not this prose page:
- detection:
src/models/detection.rs ModelTypeenum and module exports:src/models/mod.rs- loading policy:
src/model_metadata.rs - VLM loading routes:
src/loading/vlm*.rs
ModelType spans text and non-VLM language models, VLM variants, a
speech-to-text encoder-decoder (Whisper), and a text-to-speech model (Kokoro).
These are architecture/runtime variants, not a guarantee that every checkpoint
under a marketing family name is supported.
Text and hybrid model families
Implemented model families include:
- Llama-family and Mistral-style dense decoders
- Llama 4 text
- Qwen 2 / 2.5 / 3 / 3.5, Qwen MoE, Qwen3 Next
- Gemma 1 / 2 / 3 / 3n / 4 text variants
- Phi, Phi-3, Phi-3 Small, PhiMoE
- Mixtral and other MoE families
- DeepSeek v1 / v2 / v3 / v3.2
- dots.llm1 (
dots1, rednote: a DeepSeek-V3-style Mixture-of-Experts without MLA. Standard multi-head attention with per-head Q/K RMSNorm, a dense first layer (first_k_dense_replace), then sigmoid-routed experts that select ongate.weightlogits plus ane_score_correction_bias, with a single always-on shared expert. Validated againstmlx-community/dots.llm1.inst-mixed-4-6bit, a mixed 4/6-bit export whosev_projanddown_projtensors are 6-bit while the rest are 4-bit; the unified loaders detect the per-tensor bit width from shape.) - Cohere / Cohere2
- InternLM 2 / 3
- GLM 4, GLM MoE, GLM MoE DSA
- ERNIE 4.5 and ERNIE 4.5 MoE
- Hunyuan dense and MoE variants
- IBM Granite dense (
granite) - IBM Granite 4.x hybrid (
granitemoehybrid: interleaves Mamba2 SSM and GQA attention layers bylayer_types, applies the four Granite scalar multipliers (embedding, attention, residual, logits), and defaults to NoPE attention. The dense-MLP mode is validated againstmlx-community/granite-4.0-h-350m-4bit; the MoE mode (block_sparse_moe+shared_mlp) is implemented but awaits a public MLX checkpoint to validate. The non-hybridgranitemoevariant is not yet ported.) - BitNet b1.58 (
bitnet, Microsoft: a Llama-style transformer whose every projection is aBitLinearwith 1.58-bit ternary weights ({-1, 0, +1}) packed 4-per-uint8 and scaled by a single per-tensorweight_scale. A custom Metal kernel (bitlinear_matmul) multiplies directly on the packed bytes, so the unpacked weights never materialize. Two extra sub-norms (attn_sub_normbeforeo_proj,ffn_sub_norminside the MLP) and a squared-ReLU MLP (relu2(gate) * up). Runs in native bf16 (its squared-ReLU overflows f16), bypassing the Apple-Silicon f16 conversion. Validated againstmlx-community/bitnet-b1.58-2B-4Tand its-4bitvariant, which additionally affine-quantizes the embedding/lm_head to 4-bit (the BitLinear weights stay ternary); keeping the whole model bf16 also keeps that 4-bit dequant dtype-consistent.) - ExaOne / ExaOne 4 / ExaOne MoE / Solar Open
- OLMo / OLMo2 / OLMo3 / OLMoE
- StarCoder2, StableLM, SmolLM3, Baichuan, MiniCPM, MiniCPM3, MiniMax, Ministral3, Mistral4, Nemotron, Nemotron-NAS, Step 3.5, MiMo
- Apertus (
apertus, Swiss AI: Llama-style dense transformer with an xIELU activation MLP (no gate), QK-norm, llama3 RoPE scaling, and untied embeddings) - Seed-OSS (
seed_oss, ByteDance: plain Llama-style dense transformer with a standard SwiGLU MLP and standard residuals. The only deltas are a split attention bias (attention_biason q/k/v,attention_out_biason o_proj), an explicithead_dim, untied embeddings, and a{"rope_type": "default"}rope_scaling that applies no scaling. Validated againstmlx-community/Seed-OSS-36B-Instruct-4bit.) - Mamba, Mamba2, RWKV7, Recurrent Gemma, Jamba, Nemotron-H
- Falcon-H1 (TII: runs a Mamba2 SSM mixer and GQA attention in parallel within each block, summing both outputs; the MUP channel multipliers are pre-folded into the MLX weights)
- LFM2 and LFM2-MoE (Liquid Foundation Models: short-convolution and attention hybrid; the MoE variant routes through sigmoid-gated experts)
- PLaMo 2 (Preferred Networks: interleaves Mamba SSM and GQA attention layers by index; each block carries normformer-style pre/post offset RMSNorms, and the Mamba mixer derives B/C/dt from a post-conv projection). The architecture is validated against the mlx-lm reference at the token-id level (
tests/plamo2_parity.rs). CLI text generation additionally needs support for PLaMo's customPlamoTokenizer(thetokenizer.jsonlUnigram format), which the Rust tokenizer loader does not yet read. - Kimi Linear, LongCat Flash, LongCat Flash N-gram
- GPT-OSS
Many of these families have checkpoint-specific config or weight-layout
requirements. If a checkpoint fails detection or loading, inspect its
config.json::model_type first and compare it with src/models/detection.rs.
Block-diffusion text models
| Family | model_type key | Notes |
|---|---|---|
| DiffusionGemma | diffusion_gemma / diffusion_gemma_text | Block-diffusion on a Gemma 4 MoE backbone. Generates a canvas of tokens per block through iterative denoising rather than token-by-token left-to-right decoding. CLI (mlxcel generate) supports text and image input (--image <path>, repeatable). Served in mlxcel-server (serial, batch-1 by design) via /v1/chat/completions and /v1/completions; image input follows the standard image_url content part format. See Block-diffusion generation. |
DiffusionGemma uses a two-phase forward pass: an encoder prefill that caches the
prompt into dense FP16 KV caches, then a canvas loop that attends bidirectionally
within each output block while attending causally to the cached prefix.
Load detection accepts model_type: "diffusion_gemma" (outer config) and
model_type: "diffusion_gemma_text" (inner text_config).
The fused MoE gate_up_proj weights are split at load time. When a vision tower is present in the checkpoint, its weights are loaded and wired for image input; checkpoints without vision weights fall back to text-only mode without error.
Vision-language and multimodal variants
Implemented VLM variants include:
- Gemma 3 VL, Gemma 3n VL, Gemma 4 VL
- Gemma 4 Unified (
gemma4_unified): encoder-free text + image + audio. Patch-projection vision embedder and waveform-chunk audio path feed the shared Gemma 4 backbone, with blockwise bidirectional attention over image/video token spans during prefill. Video input is not yet supported. - Llama 4 VLM
- LLaVA and LLaVA-Bunny
- Aya Vision and PaliGemma
- Pixtral and Mistral 3 VLM wrappers
- Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3.5-VL, and Qwen3-VL MoE
- Youtu-VL
- MiniCPM-O
- Moondream 3
- Phi-3 Vision, Phi4MM, Phi4 SigLIP VLM
- Molmo2 and Molmo-Point
- Nemotron-H Nano Omni
Audio/video capability is model-specific. The server request types include
image_url, video_url, and input_audio content blocks, but a loaded model
must advertise support for the corresponding modality. Video frame extraction
uses the system ffmpeg/ffprobe binaries at runtime.
Thinking default for gemma4_unified
Gemma 4 Unified (gemma4_unified) ships <|channel> / <channel|> thinking
markers in its tokenizer, so the server defaults enable_thinking=true for this
family on startup, mirroring ml-explore/mlx-lm#1114.
With thinking on, the model writes an internal scratchpad before the visible
reply, so simple prompts spend more of the budget on reasoning than on the
answer. A one-sentence answer can take roughly 275 completion tokens, and a
default max_tokens of 64 to 80 may return an empty content with
finish_reason of length because the whole budget went to thinking. Set
max_tokens to at least 512 for this family, and higher for multi-sentence
answers.
The scratchpad is no longer dropped: it is surfaced as reasoning_content on
both streaming responses (delta.reasoning_content) and non-streaming responses
(a reasoning_content field on the assistant message, present only when the
model produced reasoning). This applies to every thinking family, including
Qwen-style <think> models. To turn thinking off, pass
chat_template_kwargs={"enable_thinking": false} per request, or set the server
default via --chat-template-kwargs or LLAMA_ARG_CHAT_TEMPLATE_KWARGS. A
per-request value always wins over the server default.
Speech-to-text (ASR)
mlxcel loads Whisper-style encoder-decoder ASR checkpoints (model_type: "whisper") and serves them through the OpenAI audio endpoints. A convolutional audio encoder builds features from a 30-second log-mel window, and an autoregressive text decoder cross-attends to those features as it emits tokens, steered by the multilingual transcribe/translate task tokens.
When the server's loaded checkpoint is detected as Whisper, the speech-to-text slot is populated and POST /v1/audio/transcriptions (transcribe in place) and POST /v1/audio/translations (translate to English) return the recognized text. Uploaded audio is decoded with the shared WAV reader, resampled to 16 kHz, and processed in consecutive 30-second windows. An explicit language hint is honored; otherwise the language is detected from the first decoder step. Token suppression follows the Whisper rules: suppress_blank, the non-speech symbol set, and <|notimestamps|>.
This first port targets non-quantized (fp16/f32) checkpoints with greedy decoding, and the loader accepts both the native MLX and HuggingFace key layouts. Loading a Whisper checkpoint serves speech-to-text only; chat generation is not available in the same process. Beam search, word-level and segment timestamps, quantized checkpoints, and streaming transcription are follow-ups.
Text-to-speech (TTS)
mlxcel loads the Kokoro-82M model (a StyleTTS2 phoneme-to-mel acoustic model with a built-in iSTFTNet vocoder) and serves it through POST /v1/audio/speech. The path is: text to phonemes (grapheme-to-phoneme front-end) to a PLBert text encoder, a duration predictor that expands per-token features to per-frame, F0 (pitch) and energy prosody, and an iSTFTNet decoder that produces a 24 kHz mono waveform directly via an inverse STFT (no separate neural codec).
Detection works without a top-level model_type: the loader recognizes a Kokoro checkpoint by the istftnet config block or the kokoro-v1_0.safetensors weight filename, so -m <kokoro-dir> resolves to the TTS provider. The voice request field selects a pack from voices/<name>.safetensors (54 voices; default af_heart), validated against the available packs with a safe fallback. speed scales the predicted durations (larger is faster and shorter). response_format accepts wav today (returned via the shared WAV writer); other containers are a follow-up.
The grapheme-to-phoneme front-end is a self-contained American-English phonemizer: text is normalized (lower-cased, integers spoken, common punctuation kept), each word is looked up in a bundled lexicon, and out-of-vocabulary words fall back to deterministic letter-to-sound rules. It emits the IPA symbols in Kokoro's vocab and needs no external binary or download. Non-English voices in the checkpoint still load and synthesize, but their phonemes come from the English front-end, so pronunciation quality is limited; per-language g2p (the analogue of upstream Kokoro's misaki[xx] packages) is future work. Like Whisper, the model loads and runs every synthesis on one dedicated MLX worker thread, so loading a Kokoro checkpoint serves text-to-speech only.
Quantization formats
| Format | Status | Notes |
|---|---|---|
| FP16 / BF16 | supported | BF16 handling is platform/model dependent; Apple Silicon paths commonly convert to FP16 for execution. |
| 4-bit affine MLX checkpoints | supported | Primary path for many mlx-community checkpoints. CUDA coverage depends on MLX kernel support for the target GPU. |
| 8-bit affine | supported | Used for weights and/or KV cache depending on path. |
| NVFP4 / MXFP4 / MXFP8 | supported where implemented | Used by specific families such as GPT-OSS and recent quantized checkpoints. |
Do not infer quality or speed from the ability to load a quantized checkpoint. Run a smoke test and, for release claims, a benchmark/quality gate.
Distributed support summary
| Capability | Current summary |
|---|---|
| Tensor parallelism | Advertised for selected dense text families such as Llama, Qwen, Gemma text, ERNIE 4.5, and Hunyuan dense. Validate per model/rank count. |
| Pipeline parallelism | Best validated for Llama-family text models; stage executors exist for more families with less operator coverage. |
| VLM under TP/PP | Partial. Vision tower / projector partitioning is not uniformly supported. |
| Disaggregated inference | Infrastructure exists; validate per topology and workload. |
Speculative decoding
| Drafter | Target families | Notes |
|---|---|---|
| MTP | Gemma 4 target paths | Available through shared speculative decoding flags. |
| DFlash | Qwen 3.5 text/VLM paths | Available through shared speculative decoding flags. |
Use auto-detection by default. Override only when you know the target and drafter checkpoint pair are compatible.
Known non-goals / caveats
- A supported architecture does not imply every community checkpoint variant is supported.
- VLM and video/audio paths require additional runtime dependencies and prompt preparation beyond text-only generation.
- TurboQuant, TP, PP, and speculative decoding are not uniformly validated for every family.
- The
mlxcel archoutput is a CLI summary and may lag the detailed enum count; the canonical source remainssrc/models/mod.rsandsrc/models/detection.rs.
Adding support
See Adding a new model for the registration, loading, and test checklist.