Supported models

June 19, 2026

This page summarizes model-family support in the v0.0.27 source tree. The runtime source of truth is the code, not this prose page:

  • detection: src/models/detection.rs
  • ModelType enum and module exports: src/models/mod.rs
  • loading policy: src/model_metadata.rs
  • VLM loading routes: src/loading/vlm*.rs

ModelType spans text and non-VLM language models, VLM variants, a speech-to-text encoder-decoder (Whisper), and a text-to-speech model (Kokoro). These are architecture/runtime variants, not a guarantee that every checkpoint under a marketing family name is supported.

Text and hybrid model families

Implemented model families include:

  • Llama-family and Mistral-style dense decoders
  • Llama 4 text
  • Qwen 2 / 2.5 / 3 / 3.5, Qwen MoE, Qwen3 Next
  • Gemma 1 / 2 / 3 / 3n / 4 text variants
  • Phi, Phi-3, Phi-3 Small, PhiMoE
  • Mixtral and other MoE families
  • DeepSeek v1 / v2 / v3 / v3.2
  • dots.llm1 (dots1, rednote: a DeepSeek-V3-style Mixture-of-Experts without MLA. It uses standard multi-head attention with per-head Q/K RMSNorm and a dense first layer (first_k_dense_replace), followed by sigmoid-routed experts that select on gate.weight logits plus an e_score_correction_bias, with a single always-on shared expert. Validated against mlx-community/dots.llm1.inst-mixed-4-6bit, a mixed 4/6-bit export whose v_proj and down_proj tensors are 6-bit while the rest are 4-bit; the unified loaders detect the per-tensor bit width from shape.)
  • Cohere / Cohere2
  • InternLM 2 / 3
  • GLM 4, GLM MoE, GLM MoE DSA
  • ERNIE 4.5 and ERNIE 4.5 MoE
  • Hunyuan dense and MoE variants
  • IBM Granite dense (granite)
  • IBM Granite 4.x hybrid (granitemoehybrid: interleaves Mamba2 SSM and GQA attention layers by layer_types, applies the four Granite scalar multipliers (embedding, attention, residual, logits), and defaults to NoPE attention. The dense-MLP mode is validated against mlx-community/granite-4.0-h-350m-4bit; the MoE mode (block_sparse_moe + shared_mlp) is implemented but awaits a public MLX checkpoint to validate. The non-hybrid granitemoe variant is not yet ported.)
  • BitNet b1.58 (bitnet, Microsoft: a Llama-style transformer in which every projection is a BitLinear with 1.58-bit ternary weights ({-1, 0, +1}) packed 4-per-uint8 and scaled by a single per-tensor weight_scale. A custom Metal kernel (bitlinear_matmul) multiplies directly on the packed bytes, so the unpacked weights never materialize. The architecture adds two extra sub-norms (attn_sub_norm before o_proj, ffn_sub_norm inside the MLP) and uses a squared-ReLU MLP (relu2(gate) * up). The model runs in native bf16 because its squared-ReLU overflows f16, so it bypasses the Apple-Silicon f16 conversion. Validated against mlx-community/bitnet-b1.58-2B-4T and its -4bit variant, which additionally affine-quantizes the embedding/lm_head to 4-bit (the BitLinear weights stay ternary); keeping the whole model bf16 also keeps that 4-bit dequant dtype-consistent.)
  • ExaOne / ExaOne 4 / ExaOne MoE / Solar Open
  • OLMo / OLMo2 / OLMo3 / OLMoE
  • StarCoder2, StableLM, SmolLM3, Baichuan, MiniCPM, MiniCPM3, MiniMax, Ministral3, Mistral4, Nemotron, Nemotron-NAS, Step 3.5, MiMo
  • Apertus (apertus, Swiss AI: Llama-style dense transformer with an xIELU activation MLP (no gate), QK-norm, llama3 RoPE scaling, and untied embeddings)
  • Seed-OSS (seed_oss, ByteDance: plain Llama-style dense transformer with a standard SwiGLU MLP and standard residuals. The only deltas are a split attention bias (attention_bias on q/k/v, attention_out_bias on o_proj), an explicit head_dim, untied embeddings, and a {"rope_type": "default"} rope_scaling that applies no scaling. Validated against mlx-community/Seed-OSS-36B-Instruct-4bit.)
  • Mamba, Mamba2, RWKV7, Recurrent Gemma, Jamba, Nemotron-H
  • Falcon-H1 (TII: runs a Mamba2 SSM mixer and GQA attention in parallel within each block, summing both outputs; the MUP channel multipliers are pre-folded into the MLX weights)
  • LFM2 and LFM2-MoE (Liquid Foundation Models: short-convolution and attention hybrid; the MoE variant routes through sigmoid-gated experts)
  • PLaMo 2 (Preferred Networks: interleaves Mamba SSM and GQA attention layers by index; each block carries normformer-style pre/post offset RMSNorms, and the Mamba mixer derives B/C/dt from a post-conv projection). The architecture is validated against the mlx-lm reference at the token-id level (tests/plamo2_parity.rs). CLI text generation additionally needs support for PLaMo's custom PlamoTokenizer (the tokenizer.jsonl Unigram format), which the Rust tokenizer loader does not yet read.
  • Kimi Linear, LongCat Flash, LongCat Flash N-gram
  • GPT-OSS
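
The BitNet entry above describes ternary weights packed four per byte with a single per-tensor scale. The following is a minimal sketch of that idea, not the actual bitlinear_matmul kernel. It assumes a 2-bit code per weight (code = weight + 1) with the first weight in the lowest two bits; the real checkpoint layout may differ, and the function names are hypothetical.

```rust
/// Pack ternary weights (-1, 0, +1) four per byte, two bits each.
/// ASSUMPTION: code = weight + 1, first weight in the lowest two bits.
fn pack_ternary(weights: &[i8]) -> Vec<u8> {
    weights
        .chunks(4)
        .map(|chunk| {
            chunk.iter().enumerate().fold(0u8, |byte, (i, &w)| {
                assert!((-1i8..=1).contains(&w), "weight must be ternary");
                byte | (((w + 1) as u8) << (2 * i))
            })
        })
        .collect()
}

/// Dot product of one packed weight row with an activation vector.
/// Each 2-bit code is decoded on the fly, so the unpacked row is never stored.
/// The per-tensor scale is applied once at the end.
fn packed_dot(packed: &[u8], x: &[f32], weight_scale: f32) -> f32 {
    let mut acc = 0.0f32;
    for (i, &xi) in x.iter().enumerate() {
        let code = (packed[i / 4] >> (2 * (i % 4))) & 0b11;
        acc += (code as f32 - 1.0) * xi;
    }
    acc * weight_scale
}
```

Under these assumptions, the weights [1, -1, 0, 1, -1] pack into two bytes, and the dot product with [1.0, 2.0, 3.0, 4.0, 5.0] at weight_scale 0.5 is -1.0.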

Many of these families have checkpoint-specific config or weight-layout requirements. If a checkpoint fails detection or loading, inspect its config.json::model_type first and compare it with src/models/detection.rs.
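
As a rough illustration of that first check, the sketch below maps a few of the model_type keys named on this page to their family labels. It is a hypothetical helper, not the contents of src/models/detection.rs, which remains the source of truth and covers many more keys. Reading the key out of config.json is left to a JSON parser of your choice.

```rust
/// Family labels for a few `model_type` keys named on this page.
/// ASSUMPTION: illustrative table only; the real mapping lives in
/// src/models/detection.rs and is far larger.
fn family_for_model_type(model_type: &str) -> Option<&'static str> {
    match model_type {
        "dots1" => Some("dots.llm1"),
        "granite" => Some("IBM Granite dense"),
        "granitemoehybrid" => Some("IBM Granite 4.x hybrid"),
        "bitnet" => Some("BitNet b1.58"),
        "apertus" => Some("Apertus"),
        "seed_oss" => Some("Seed-OSS"),
        "diffusion_gemma" | "diffusion_gemma_text" => Some("DiffusionGemma"),
        "gemma4_unified" => Some("Gemma 4 Unified"),
        "whisper" => Some("Whisper"),
        // Unknown key: compare against detection.rs before assuming support.
        _ => None,
    }
}
```

A None result here only means the key is not in this small illustrative table; it says nothing about actual runtime support.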

Block-diffusion text models

| Family | model_type key | Notes |
|---|---|---|
| DiffusionGemma | diffusion_gemma / diffusion_gemma_text | Block-diffusion on a Gemma 4 MoE backbone. Generates a canvas of tokens per block through iterative denoising rather than token-by-token left-to-right decoding. CLI (mlxcel generate) supports text and image input (--image <path>, repeatable). Served in mlxcel-server (serial, batch-1 by design) via /v1/chat/completions and /v1/completions; image input follows the standard image_url content part format. See Block-diffusion generation. |

DiffusionGemma uses a two-phase forward pass: an encoder prefill that caches the prompt into dense FP16 KV caches, then a canvas loop that attends bidirectionally within each output block while attending causally to the cached prefix. Load detection accepts model_type: "diffusion_gemma" (outer config) and model_type: "diffusion_gemma_text" (inner text_config). The fused MoE gate_up_proj weights are split at load time. When a vision tower is present in the checkpoint, its weights are loaded and wired for image input; checkpoints without vision weights fall back to text-only mode without error.
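
The attention pattern described above can be pictured as a boolean mask over the prompt plus one canvas block. This is a sketch under assumptions, not the mlxcel implementation: it treats positions [0, prefix_len) as the cached prompt and the next block_len positions as the current canvas, and the function name is hypothetical.

```rust
/// Build a (prefix_len + block_len)^2 attention mask, where
/// mask[q][k] == true means query position q may attend to key position k.
/// ASSUMPTION: one canvas block directly follows the cached prefix.
fn block_diffusion_mask(prefix_len: usize, block_len: usize) -> Vec<Vec<bool>> {
    let total = prefix_len + block_len;
    (0..total)
        .map(|q| {
            (0..total)
                .map(|k| {
                    if q < prefix_len {
                        // Prefill rows: ordinary causal attention.
                        k <= q
                    } else {
                        // Canvas rows: the whole prefix plus the whole block,
                        // including later canvas positions (bidirectional).
                        true
                    }
                })
                .collect()
        })
        .collect()
}
```

With prefix_len = 2 and block_len = 2, the two prompt rows are lower-triangular while both canvas rows are fully true, so the first canvas token can see the second one.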

Vision-language and multimodal variants

Implemented VLM variants include:

  • Gemma 3 VL, Gemma 3n VL, Gemma 4 VL
  • Gemma 4 Unified (gemma4_unified): encoder-free text + image + audio. Patch-projection vision embedder and waveform-chunk audio path feed the shared Gemma 4 backbone, with blockwise bidirectional attention over image/video token spans during prefill. Video input is not yet supported.
  • Llama 4 VLM
  • LLaVA and LLaVA-Bunny
  • Aya Vision and PaliGemma
  • Pixtral and Mistral 3 VLM wrappers
  • Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3.5-VL, and Qwen3-VL MoE
  • Youtu-VL
  • MiniCPM-O
  • Moondream 3
  • Phi-3 Vision, Phi4MM, Phi4 SigLIP VLM
  • Molmo2 and Molmo-Point
  • Nemotron-H Nano Omni

Audio/video capability is model-specific. The server request types include image_url, video_url, and input_audio content blocks, but a loaded model must advertise support for the corresponding modality. Video frame extraction uses the system ffmpeg/ffprobe binaries at runtime.

Thinking default for gemma4_unified

Gemma 4 Unified (gemma4_unified) ships <|channel> / <channel|> thinking markers in its tokenizer, so the server defaults to enable_thinking=true for this family on startup, mirroring ml-explore/mlx-lm#1114. With thinking on, the model writes an internal scratchpad before the visible reply, so simple prompts spend more of the token budget on reasoning than on the answer. A one-sentence answer can take roughly 275 completion tokens, so a default max_tokens of 64 to 80 may return empty content with a finish_reason of "length" because the whole budget went to thinking. Set max_tokens to at least 512 for this family, and higher for multi-sentence answers.

The scratchpad is no longer dropped: it is surfaced as reasoning_content on both streaming responses (delta.reasoning_content) and non-streaming responses (a reasoning_content field on the assistant message, present only when the model produced reasoning). This applies to every thinking family, including Qwen-style <think> models. To turn thinking off, pass chat_template_kwargs={"enable_thinking": false} per request, or set the server default via --chat-template-kwargs or LLAMA_ARG_CHAT_TEMPLATE_KWARGS. A per-request value always wins over the server default.
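
The precedence rule can be summarized in a few lines. This is a hedged sketch with a hypothetical function name: the source only states that a per-request value beats the server default, so ranking an explicit server default above the family default is an assumption here.

```rust
/// Resolve the effective `enable_thinking` value for one request.
/// Precedence: per-request value, then explicit server default,
/// then the family default.
/// ASSUMPTION: an explicit server default overrides the family default.
fn resolve_enable_thinking(
    per_request: Option<bool>,
    server_default: Option<bool>,
    family_default: bool,
) -> bool {
    per_request.or(server_default).unwrap_or(family_default)
}
```

For example, a request that passes enable_thinking = false turns thinking off even when the server default is true.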

Speech-to-text (ASR)

mlxcel loads Whisper-style encoder-decoder ASR checkpoints (model_type: "whisper") and serves them through the OpenAI audio endpoints. A convolutional audio encoder builds features from a 30-second log-mel window, and an autoregressive text decoder cross-attends to those features as it emits tokens, steered by the multilingual transcribe/translate task tokens.

When the server's loaded checkpoint is detected as Whisper, the speech-to-text slot is populated and POST /v1/audio/transcriptions (transcribe in place) and POST /v1/audio/translations (translate to English) return the recognized text. Uploaded audio is decoded with the shared WAV reader, resampled to 16 kHz, and processed in consecutive 30-second windows. An explicit language hint is honored; otherwise the language is detected from the first decoder step. Token suppression follows the Whisper rules: suppress_blank, the non-speech symbol set, and <|notimestamps|>.
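
The windowing step can be sketched as follows. This assumes the audio is already mono and resampled to 16 kHz, and that the final partial window is zero-padded to the full 30 seconds; the padding behavior and the function name are assumptions, not confirmed details of the mlxcel pipeline.

```rust
const SAMPLE_RATE: usize = 16_000;
const WINDOW_SECONDS: usize = 30;
const WINDOW_SAMPLES: usize = SAMPLE_RATE * WINDOW_SECONDS; // 480_000 samples

/// Split 16 kHz mono samples into consecutive 30-second windows.
/// ASSUMPTION: the final partial window is zero-padded to full length.
fn split_into_windows(samples: &[f32]) -> Vec<Vec<f32>> {
    samples
        .chunks(WINDOW_SAMPLES)
        .map(|chunk| {
            let mut window = chunk.to_vec();
            window.resize(WINDOW_SAMPLES, 0.0);
            window
        })
        .collect()
}
```

A 70-second clip yields three windows under this scheme: two full ones and a third holding 10 seconds of audio followed by 20 seconds of silence.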

This first port targets non-quantized (fp16/f32) checkpoints with greedy decoding, and the loader accepts both the native MLX and HuggingFace key layouts. Loading a Whisper checkpoint serves speech-to-text only; chat generation is not available in the same process. Beam search, word-level and segment timestamps, quantized checkpoints, and streaming transcription are follow-ups.

Text-to-speech (TTS)

mlxcel loads the Kokoro-82M model (a StyleTTS2 phoneme-to-mel acoustic model with a built-in iSTFTNet vocoder) and serves it through POST /v1/audio/speech. The pipeline runs in order: a grapheme-to-phoneme front-end converts text to phonemes; a PLBert text encoder encodes them; a duration predictor expands per-token features to per-frame features; F0 (pitch) and energy prosody are predicted; and an iSTFTNet decoder produces a 24 kHz mono waveform directly via an inverse STFT (no separate neural codec).

Detection works without a top-level model_type: the loader recognizes a Kokoro checkpoint by the istftnet config block or the kokoro-v1_0.safetensors weight filename, so -m <kokoro-dir> resolves to the TTS provider. The voice request field selects a pack from voices/<name>.safetensors (54 voices; default af_heart), validated against the available packs with a safe fallback. speed scales the predicted durations (larger is faster and shorter). response_format accepts wav today (returned via the shared WAV writer); other containers are a follow-up.
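
The voice fallback and the effect of speed can be sketched as below. Both helpers are hypothetical. The sketch assumes the safe fallback is the default voice, and that speed divides the predicted frame counts with a floor of one frame per token; the actual rounding and fallback rules may differ.

```rust
const DEFAULT_VOICE: &str = "af_heart";

/// Pick the requested voice if a pack exists for it, else the default.
/// ASSUMPTION: the "safe fallback" is the default voice.
fn select_voice<'a>(requested: Option<&'a str>, available: &[&'a str]) -> &'a str {
    match requested {
        Some(name) if available.contains(&name) => name,
        _ => DEFAULT_VOICE,
    }
}

/// Scale predicted per-token durations (in frames) by `speed`.
/// Larger speed gives fewer frames, so faster and shorter audio.
/// ASSUMPTION: divide, round to nearest, keep at least one frame per token.
fn scale_durations(predicted: &[f32], speed: f32) -> Vec<usize> {
    assert!(speed > 0.0, "speed must be positive");
    predicted
        .iter()
        .map(|&d| ((d / speed).round() as usize).max(1))
        .collect()
}
```

With speed = 2.0, predicted durations of [4.0, 2.0, 0.2] frames become [2, 1, 1], roughly halving the utterance length.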

The grapheme-to-phoneme front-end is a self-contained American-English phonemizer: text is normalized (lower-cased, integers spoken, common punctuation kept), each word is looked up in a bundled lexicon, and out-of-vocabulary words fall back to deterministic letter-to-sound rules. It emits the IPA symbols in Kokoro's vocab and needs no external binary or download. Non-English voices in the checkpoint still load and synthesize, but their phonemes come from the English front-end, so pronunciation quality is limited; per-language g2p (the analogue of upstream Kokoro's misaki[xx] packages) is future work. Like Whisper, the model loads and runs every synthesis on one dedicated MLX worker thread, so loading a Kokoro checkpoint serves text-to-speech only.

Quantization formats

| Format | Status | Notes |
|---|---|---|
| FP16 / BF16 | supported | BF16 handling is platform/model dependent; Apple Silicon paths commonly convert to FP16 for execution. |
| 4-bit affine MLX checkpoints | supported | Primary path for many mlx-community checkpoints. CUDA coverage depends on MLX kernel support for the target GPU. |
| 8-bit affine | supported | Used for weights and/or KV cache depending on path. |
| NVFP4 / MXFP4 / MXFP8 | supported where implemented | Used by specific families such as GPT-OSS and recent quantized checkpoints. |

Do not infer quality or speed from the ability to load a quantized checkpoint. Run a smoke test and, for release claims, a benchmark/quality gate.
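
For mixed-precision exports such as the 4/6-bit dots.llm1 checkpoint mentioned earlier, the bit width of a tensor can be recovered from its shapes. The sketch below is an illustration under assumptions, not the mlxcel loader: it assumes the MLX-style affine layout in which the packed weight has in_features * bits / 32 uint32 columns and the scales tensor has in_features / group_size columns, and it takes group_size as known from the checkpoint config. The function name is hypothetical.

```rust
/// Infer the quantization bit width of one tensor from its shapes.
/// ASSUMPTION: MLX-style affine layout, where the packed weight has
/// in_features * bits / 32 u32 columns and `scales` has
/// in_features / group_size columns.
fn infer_bits(packed_cols: usize, scales_cols: usize, group_size: usize) -> Option<usize> {
    let in_features = scales_cols.checked_mul(group_size)?;
    let packed_bits = packed_cols.checked_mul(32)?;
    if in_features == 0 || packed_bits % in_features != 0 {
        return None; // shapes are inconsistent with this layout
    }
    Some(packed_bits / in_features)
}
```

For a projection with 4096 input features and group_size 64 (64 scale columns), 512 packed columns imply 4-bit weights and 768 packed columns imply 6-bit weights.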

Distributed support summary

| Capability | Current summary |
|---|---|
| Tensor parallelism | Advertised for selected dense text families such as Llama, Qwen, Gemma text, ERNIE 4.5, and Hunyuan dense. Validate per model/rank count. |
| Pipeline parallelism | Best validated for Llama-family text models; stage executors exist for more families with less operator coverage. |
| VLM under TP/PP | Partial. Vision tower / projector partitioning is not uniformly supported. |
| Disaggregated inference | Infrastructure exists; validate per topology and workload. |

Speculative decoding

| Drafter | Target families | Notes |
|---|---|---|
| MTP | Gemma 4 target paths | Available through shared speculative decoding flags. |
| DFlash | Qwen 3.5 text/VLM paths | Available through shared speculative decoding flags. |

Use auto-detection by default. Override only when you know the target and drafter checkpoint pair are compatible.

Known non-goals / caveats

  • A supported architecture does not imply every community checkpoint variant is supported.
  • VLM and video/audio paths require additional runtime dependencies and prompt preparation beyond text-only generation.
  • TurboQuant, TP, PP, and speculative decoding are not uniformly validated for every family.
  • The mlxcel arch output is a CLI summary and may lag the detailed enum count; the canonical source remains src/models/mod.rs and src/models/detection.rs.

Adding support

See Adding a new model for the registration, loading, and test checklist.