mlxcel

June 19, 2026 ยท View on GitHub

License: Apache 2.0 Latest Release CI

High-performance LLM/VLM inference runtime and server for Apple Silicon. The CLI and server are implemented in Rust and execute models through native MLX C++ bindings. Linux/CUDA builds are supported as a secondary target.

New in v0.3

v0.3.1

  • Fused decode-MoE now runs on CUDA. The fused single-token MoE decode kernel was Metal-only in 0.3.0; it is now ported to CUDA, so Linux/CUDA GPUs such as NVIDIA GB10 get the same fast path with byte-identical greedy output. Measured gains run from about 10% to 55%, up to 1.55x on qwen3-moe.
  • Six more MoE families on the fused kernel. qwen2_moe, LFM2, qwen3_vl_moe, Mixtral, Phi-3.5-MoE, and OLMoE are now wired to the fused decode-MoE path. It self-gates by expert size (MLXCEL_FUSED_MOE_MAX_DFF, default 4096), so large-expert models such as Mixtral 8x7B and Phi-3.5-MoE keep the proven gather path with no regression. Set MLXCEL_FUSED_MOE=0 to disable.
  • BitNet on CUDA. The BitLinear b1.58 ternary matmul kernel is ported to CUDA, so BitNet models run on CUDA GPUs.
  • Loads non-affine quantized VLM checkpoints. Non-affine VLM weights now load with the correct quant mode and group size, so checkpoints such as minicpm-v mxfp4 work instead of failing.

v0.3.0

  • Nine new model families. BitNet b1.58 (1.58-bit ternary), IBM Granite dense and GraniteMoeHybrid, LFM2 / LFM2-MoE, Falcon-H1, PLaMo 2, Apertus, ByteDance Seed-OSS, and dots.llm1 MoE, on top of the existing Llama, Qwen, Gemma, and DeepSeek coverage.
  • Faster MoE decode, on by default. The fused decode-MoE Metal kernel beats the previous gather path on single-token decode (about 13% on Gemma 4) and is enabled by default.
  • Loads newer mixed-precision checkpoints. mlxcel reads per-layer mixed bit widths and bf16 quantization scales, so recent mlx-community exports (for example 8-bit embeddings under a 4-bit default) load correctly. A bf16-scale decode regression on M1 Ultra is also fixed.
  • Linux CUDA release builds. Prebuilt x86_64 and aarch64 CUDA artifacts ship with bundled CCCL headers and reuse JIT-compiled kernels across runs through a persistent PTX cache.

See the changelog for the full list.

Overview

mlxcel provides a Rust command-line runtime and an OpenAI-compatible model server for MLX-format checkpoints. Loading, scheduling, and inference stay in one native process while model execution goes through MLX C++ bindings. It runs a broad range of text and vision-language model families directly from mlx-community checkpoints, with no conversion step.

The project started as work on structural model fine-tuning and has grown into a general-purpose serving runtime for local and small-cluster inference.

Why mlxcel

  • Smaller runtime surface. Model loading, scheduling, and inference stay in a single native server process. Deployments do not need to provision a Python environment, keep package versions in sync, or route requests through an interpreter layer.
  • Simple deployment artifact. mlxcel and mlxcel-server build as native executables, which makes packaging, service supervision, and upgrades straightforward. Platform runtime libraries are still required: for example macOS frameworks on Apple Silicon, and CUDA/OpenBLAS/LAPACK components for Linux builds.
  • llama-server-style operation. mlxcel-server accepts many llama-server-compatible flags and LLAMA_ARG_* environment variables, which makes migration from llama.cpp-based scripts simpler. Treat this as compatibility-oriented, not a guarantee that every llama.cpp option has identical behavior.
  • OpenAI-compatible HTTP API subset. The server supports SSE streaming and the /v1/chat/completions, /v1/completions, and /v1/responses endpoints.
  • Serving features for real deployments. Continuous batching, prompt-prefix caching, automatic prefix caching, speculative decoding, and KV-cache compression are available for supported model/runtime combinations.
  • Differentiated runtime controls. Default builds expose first-class YAML load-time model surgery through --surgery / MLXCEL_SURGERY, with operations such as scale, add, prune, replace, and interpolate for reproducible weight-space changes without retraining or writing converted checkpoints.
  • Multi-device and distributed modes. Tensor parallelism and pipeline parallelism are implemented for selected model families, including zero-config pipeline startup with static or mDNS-based discovery.
  • Broad model-family coverage. The runtime includes loaders for Llama, Qwen, Gemma, Phi, Mistral/Mixtral, DeepSeek, Cohere, InternLM, GLM, ExaOne, OLMo, ERNIE, Hunyuan, Mamba/RWKV/Jamba, Nemotron, MiniMax, Step, Kimi, and multiple VLM families. See Supported models for the maintained list.

Quick start

Install with Homebrew (macOS/Linux)

The Homebrew formula installs both mlxcel and mlxcel-server:

brew tap lablup/tap
brew install mlxcel

Run a model

The quickest path is mlxcel run: it resolves the model argument, auto-downloads on first use, reuses it afterward, and runs from any directory.

# Interactive chat REPL.
mlxcel run mlx-community/Qwen3.5-0.8B-4bit

# Bare name resolves to mlx-community/<name>.
mlxcel run Qwen3.5-0.8B-4bit

# One-shot generation with -p, then exit.
mlxcel run Qwen3.5-0.8B-4bit -p "Hello, world!" -n 100

# No model argument falls back to the default
# mlx-community/gemma-4-e2b-it-4bit.
mlxcel run

generate, serve, and inspect take the same model argument via -m, a HuggingFace owner/name repo-id (auto-downloaded into the store and reused after), a bare name (resolved as mlx-community/<name>), or an existing local path. mlxcel run is a thin wrapper over mlxcel generate and shares its sampling and generation flags.

# One-off generation.
mlxcel generate -m Qwen3.5-0.8B-4bit -p "Hello, world!" -n 100

# OpenAI-compatible server (mlxcel serve is the subcommand equivalent).
mlxcel-server -m Qwen3.5-0.8B-4bit --port 8080

# Restrict browser CORS to specific origins (default reflects any origin).
mlxcel-server -m Qwen3.5-0.8B-4bit --port 8080 --allowed-origins https://app.example.com,https://admin.example.com

# Read-only memory budget: weights + KV cache vs. available unified memory.
mlxcel inspect -m Qwen3.5-0.8B-4bit --max-tokens 32768

# Preflight that aborts if the model + 32K KV cache will not fit
# (--force, alias --no-memory-check, overrides the abort).
mlxcel generate -m Qwen3.5-0.8B-4bit -p "Hello, world!" -n 32768 --estimate-memory

Downloaded models land in a location-independent global store at ${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models/<owner>/<name>, shared across every working directory. To relocate the store, write a snapshot to an exact path, change the default org, or tune the memory preflight, see Environment variables, MLXCEL_MODELS_DIR / --models-dir, --local-dir, MLXCEL_DEFAULT_ORG, and MLXCEL_MEMORY_LIMIT / MLXCEL_HEADROOM_FACTOR.

If you build from source instead, use ./target/release/mlxcel and ./target/release/mlxcel-server in place of the installed commands above.

Manage downloaded models

List and prune the global store from any directory:

# List downloaded models with name, size, and last-modified time.
mlxcel list

# Machine-readable output (stable JSON array: repo_id, size_bytes, path, modified).
mlxcel list --json

# Repo-ids only, pipe-friendly for scripting (e.g. xargs mlxcel rm).
mlxcel list -q

# Restore the absolute path column.
mlxcel list -v

# Remove a model from the global store (prompts for confirmation).
mlxcel rm mlx-community/Qwen3.5-0.8B-4bit

# Remove without the prompt (for scripts / non-interactive shells).
mlxcel rm mlx-community/Qwen3.5-0.8B-4bit --yes

mlxcel arch prints the supported model-architecture catalog instead. mlxcel rm <repo-id> deletes only inside the mlxcel store and honors the same --models-dir override; a model that exists solely in the read-only HuggingFace cache (HF_HUB_CACHE / HF_HOME) is reported but never deleted.

Build from source on Apple Silicon

Prerequisites:

  • Rust toolchain
  • Xcode Command Line Tools
  • CMake-compatible build environment
  • Apple Metal toolchain component
xcodebuild -downloadComponent MetalToolchain   # one-time, if not already installed
git clone https://github.com/lablup/mlxcel.git
cd mlxcel
cargo build --release --features metal,accelerate

Linux/CUDA builds use the cuda feature and require the CUDA toolkit plus the system libraries used by MLX. A plain cargo build --release on Linux omits the cuda feature and produces a CPU-only binary that still runs but silently executes MLX on the CPU at a fraction of GPU throughput, so always pass --features cuda on an NVIDIA host. See Installation for the detailed prerequisite matrix.

Performance

mlxcel targets near-mlx-lm / mlx-vlm decode throughput for MLX-format checkpoints while keeping a native Rust runtime. In the M5 Max 128GB benchmark campaign, the headline result has two parts: faster short-prompt text prefill and near-reference decode throughput.

Prefill: prompt ingestion before the first generated token

Short-prompt text prefill is the standout result. mlxcel measured 2.78x the mlx-lm median on M5 Max across 67 comparable text pairs, and 1.79x on M1 Ultra across 74 comparable text pairs. VLM prefill is listed separately because image preprocessing, vision encoder, and projector work can be included in the prefill path.

ModeBaselineM5 Max pairsM5 Max median vs baselineM1 Ultra pairsM1 Ultra median vs baseline
Textmlx-lm672.78x741.79x
VLMmlx-vlm251.01x201.05x

Decode: steady-state token generation

Decode stays close to the Python MLX references on the same host. For M5 Max, text decode averaged 99% of mlx-lm with a 100% median, while VLM decode averaged 98% of mlx-vlm with a 98% median.

ModeBaselineComparable pairsAverage vs baselineMedian vs baseline>=90% parity>= baselineRange
Textmlx-lm6799%100%62 / 67 (93%)31 / 67 (46%)45%-129%
VLMmlx-vlm2498%98%18 / 24 (75%)10 / 24 (42%)59%-121%

Representative decode throughput is shown below in tokens per second. The mlxcel columns are the 2026-06-15 sweep on each host (v0.3.0, including the fix to a quantized-decode regression on bf16-scale checkpoints that mostly affected M1 Ultra). The M5 Max mlx-lm / mlx-vlm reference columns are retained from the earlier same-host campaign, so each ratio is mlxcel (2026-06-15) over that retained reference; a fresh same-host mlx-lm / mlx-vlm run validated that the reference is stable. M1 Ultra values are mlxcel-only capacity references. After that sweep the fused decode-MoE kernel was wired into more MoE families (qwen2_moe, lfm2, qwen3_vl_moe), so the Qwen3-VL 30B-A3B text-path M1 Ultra figure here is the refreshed post-wiring number (69 to 82 tok/s, +19%, --profile decode, median of 3); the M5 Max columns predate that wiring and are conservative for that row. Mixtral 8x7B stays on the gather path via the expert-size guard, so its figures are unchanged. Absolute results depend on model family, quantization, prompt shape, decode length, and hardware. See Benchmark results and Benchmarks for methodology and caveats.

Text modelM1 Ultra mlxcelM5 Max mlxcelM5 Max mlx-lmmlxcel / mlx-lm
SmolLM-135M 4bit375 tok/s917 tok/s712 tok/s129%
Llama 3.1 8B 4bit108 tok/s117 tok/s117 tok/s100%
Qwen2.5 7B 4bit113 tok/s126 tok/s124 tok/s102%
Gemma 2B 4bit196 tok/s215 tok/s223 tok/s96%
Gemma 3 4B 4bit117 tok/s183 tok/s182 tok/s101%
Gemma 2 2B 4bit166 tok/s241 tok/s242 tok/s100%
Phi-3.5-mini 4bit164 tok/s203 tok/s208 tok/s98%
Jamba v0.1 4bit (hybrid SSM)122 tok/s216 tok/s219 tok/s99%
Gemma 4 26B-A4B 4bit80 tok/s151 tok/s141 tok/s107%
Qwen3 MoE 30B 4bit84 tok/s176 tok/s147 tok/s120%
GLM-4 Flash 4bit46 tok/s104 tok/s104 tok/s100%
Nemotron-H 30B 4bit92 tok/s176 tok/s179 tok/s98%
Mixtral 8x7B 4bit54 tok/s65 tok/s66 tok/s98%
StarCoder2 3B 4bit166 tok/s216 tok/s215 tok/s100%
Qwen3.5 0.8B 4bit230 tok/s504 tok/s545 tok/s92%
Qwen3-VL 30B-A3B 4bit, text path82 tok/s151 tok/s147 tok/s103%
Qwen3-VL 32B 4bit, text path21 tok/s27 tok/s29 tok/s93%
GPT-OSS 120B 4bit58 tok/s114 tok/s110 tok/s104%
Solar Open 100B 4bit33 tok/s65 tok/s66 tok/s98%
VLM modelM1 Ultra mlxcelM5 Max mlxcelM5 Max mlx-vlmmlxcel / mlx-vlm
LLaVA Interleave Qwen 0.5B bf16265 tok/s341 tok/s345 tok/s99%
Qwen3.5 0.8B 4bit232 tok/s454 tok/s411 tok/s110%
Qwen3.5 35B-A3B 4bit75 tok/s149 tok/s129 tok/s116%
Gemma 4 E2B 4bit106 tok/s220 tok/s202 tok/s109%
Gemma 3n E2B 4bit73 tok/s151 tok/s125 tok/s121%
InternVL3 1B238 tok/s575 tok/s529 tok/s109%
Gemma 4 26B-A4B 4bit70 tok/s144 tok/s137 tok/s105%
Molmo2 4B60 tok/s64 tok/s67 tok/s96%
Phi 3.5 Vision 4bit122 tok/s168 tok/s160 tok/s105%

DiffusionGemma (block diffusion)

DiffusionGemma generates a canvas block at a time through iterative denoising rather than left-to-right autoregression. The decode harness above measures inter-token timing, which does not apply to diffusion's burst output, so the automated sweep records this checkpoint as a benchmark failure. The numbers below are a manual same-host comparison (192-token generation, chat template, seed 42, max_denoising_steps=48, median of 3 runs):

Diffusion modelM1 Ultra mlxcelM1 Ultra mlx-vlmmlxcel / mlx-vlm
DiffusionGemma 26B-A4B 4bit32 tok/s29 tok/s110%

Released mlx-vlm (0.4.4) does not include diffusion_gemma, so the reference column is mlx-vlm upstream main. The reported tok/s amortizes the per-block denoising passes and is not directly comparable to the autoregressive decode rows above. No M5 Max figure is listed because that comparison was not run on the same-host campaign.

The M5 Max sweep covers 98 text model directories and a matching 98-entry VLM mode pass. Ratio summaries include only rows where both mlxcel and the Python reference produced comparable decode measurements; unsupported checkpoints and benchmark-configuration failures are tracked in the benchmark notes. VLM rows should be read separately because vision preprocessing, processor setup, and prompt construction differ by family. Re-run the benchmark suite on your target hardware before using these numbers for capacity planning.

Supported models

Model support is architecture- and checkpoint-dependent. Run:

mlxcel arch

for the CLI summary, and see Supported models for the maintained architecture table, known limitations, and VLM coverage notes.

Optional GUI

mlxcel-server can be used directly through HTTP clients. For a local graphical front-end, Backend.AI Go can be used as a companion UI for chat, model management, and multi-model routing.

Documentation

Contributing

Issues and pull requests are welcome. See CONTRIBUTING.md for the contributor workflow, local quality gates (cargo fmt, clippy, cargo test, cargo deny check), and commit conventions. New model architectures, performance work, bug fixes, and documentation improvements are all useful. For larger changes, please open an issue first so the scope and validation plan can be discussed.

For security vulnerabilities, see SECURITY.md, do not file these as public issues.

License

Apache License 2.0 unless otherwise noted, see LICENSE. Third-party attributions carried forward under Apache-2.0 Section 4(d) are listed in NOTICE.

Acknowledgments

  • MLX, Apple's machine learning framework
  • mlx-lm (MIT, Copyright 2023 Apple Inc.) and mlx-vlm (MIT, Copyright 2025 Prince Canuma): Python projects whose model coverage and behavior mlxcel ports and mirrors. See NOTICE.
  • MLX Community, pre-converted MLX model checkpoints
  • turboquant_plus: TurboQuant KV cache compression algorithms ported in src/lib/mlxcel-core/src/cache/turbo/ (Apache-2.0, Copyright 2026 Tom Turney). See NOTICE.