Installation
June 19, 2026 · View on GitHub
mlxcel builds two native executables from the root Rust package:
mlxcel— command-line generation, model listing, and downloads.mlxcel-server— HTTP server with OpenAI/llama-server-style endpoints.
The binaries do not require Python or Node.js at runtime. They are not fully static binaries: platform GPU/runtime libraries are still required.
Supported platforms
| Platform | Status | Typical feature flags | Notes |
|---|---|---|---|
| macOS on Apple Silicon | primary | metal,accelerate | Main development and validation target. |
| Linux with NVIDIA CUDA | secondary | cuda | Release builds currently target CUDA 13-era systems; other versions depend on MLX/CUDA compatibility. |
| Linux CPU-only | not a release target | none | May compile in limited configurations, but it is not a useful or validated inference target for this project. |
| Windows | not documented here | — | The current public installation path is macOS/Linux. |
macOS on Apple Silicon
Prerequisites:
- Apple Silicon Mac.
- Rust toolchain compatible with the Rust 2024 edition.
- Xcode Command Line Tools (
xcode-select --install). - Metal toolchain component.
- CMake available on
PATH.
# One-time: install the Metal shader compiler if it is not already present.
xcodebuild -downloadComponent MetalToolchain
git clone https://github.com/lablup/mlxcel.git
cd mlxcel
cargo build --release --features metal,accelerate
The build outputs:
target/release/mlxcel
target/release/mlxcel-server
The macOS release workflow also packages a mlx.metallib artifact when needed.
If you distribute binaries manually, verify the runtime package layout against the
release workflow rather than assuming the executable alone is always sufficient.
Linux with CUDA
Prerequisites vary by distribution and CUDA version. At minimum you need:
- Rust toolchain compatible with the Rust 2024 edition.
- CMake and a C++20-capable compiler.
- CUDA toolkit with
nvcc. - NVIDIA driver compatible with the selected CUDA toolkit.
- cuDNN and CUDA runtime libraries required by the pinned MLX build.
- BLAS and LAPACK development packages, including the C headers. MLX's CMake
resolves
cblas.handlapacke.h, so thelapackeheaders must be present, not only the runtime libraries.
On Debian/Ubuntu (x86_64 or aarch64) the build packages are:
sudo apt-get install -y \
build-essential cmake git \
libopenblas-dev liblapack-dev liblapacke-dev
# CUDA toolkit (nvcc) and cuDNN come from NVIDIA's apt repository, e.g.
# cuda-toolkit-13-0 cudnn9-cuda-13
liblapacke-dev is the package that ships lapacke.h; liblapack-dev alone
omits it and the MLX CMake configure step fails with LAPACK_INCLUDE_DIRS set
to NOTFOUND.
Example build shape:
git clone https://github.com/lablup/mlxcel.git
cd mlxcel
cargo build --release --features cuda
CPU-only build footgun. A plain
cargo build --releaseon Linux uses the default features (nocuda) and produces a CPU-only binary. It still loads and generates, but silently runs MLX on the host CPU at a fraction of GPU throughput (single-digit tok/s on GB10 instead of hundreds), so the mistake is easy to miss. Always pass--features cudaon an NVIDIA host.
If CUDA is not installed under /usr/local/cuda, set CUDA_HOME:
CUDA_HOME=/opt/cuda cargo build --release --features cuda
CUDA architecture selection
src/lib/mlxcel-core/build.rs reads MLX_CUDA_ARCHITECTURES. If it is unset,
the build script tries to detect the compute capability with nvidia-smi and
falls back to 90a when detection fails. For SM 90 and above it appends CUDA's
architecture-specific a suffix (so 90 becomes 90a), because the dedicated
Hopper quantized kernel (qmm_sm90) is only compiled when 90a is in the arch
list. An explicitly set MLX_CUDA_ARCHITECTURES is used verbatim, so include the
suffix yourself for Hopper (90a).
# Hopper / GH200-style target. The `a` suffix is required for the Hopper
# quantized kernel; plain `90` builds without it.
MLX_CUDA_ARCHITECTURES=90a cargo build --release --features cuda
# GB10 / DGX Spark-style target used by the release workflow.
MLX_CUDA_ARCHITECTURES=121 cargo build --release --features cuda
# Multiple targets, if your MLX/CUDA toolchain supports them.
MLX_CUDA_ARCHITECTURES="90a;121" cargo build --release --features cuda
The repository release workflow builds two Linux CUDA targets on self-hosted
runners, each as one fat binary: aarch64 covering GH200 (90a), GB200 (100),
and GB10 (121) in a single build (90a;100;121), and x86_64 covering Ampere
through Blackwell (80;86;89;90a;100;120). For each target the mlxcel CLI and the
mlxcel-server are published as separate archives (mlxcel-... and
mlxcel-server-..., each roughly 347 MB) so a consumer downloads only the one
it needs. Treat other GPU/OS combinations as source builds that need local
validation.
Prebuilt CUDA artifact: runtime requirements
MLX's CUDA backend compiles some kernels (gather and other indexing kernels) at runtime with NVRTC the first time they run, so a prebuilt binary needs CUDA headers available on the deployment host, not only the runtime libraries:
- CCCL (libcu++) headers are bundled inside the prebuilt Linux CUDA
archives (both aarch64 and x86_64). Each unpacks to
bin/+include/cccl/, the layout MLX's JIT looks for relative to the executable (<exe-dir>/../include/cccl). Keepmlxcel/mlxcel-serverunderbin/and theinclude/cccl/directory beside it; do not flatten them. The runtime resolves the bundled headers from the executable's canonical path (/proc/self/exe), so any launch style works, including a relative./mlxcel. SetMLXCEL_CCCL_DIRto point the JIT at the CCCL headers explicitly, e.g. when embedding mlxcel and keeping a flat binary layout. - CUDA toolkit headers (
cuda_runtime.hand friends) come from the host. Install the CUDA toolkit and setCUDA_HOME(orCUDA_PATH) if it is not at/usr/local/cuda. Without them the first NVRTC compile fails withcannot open source fileerrors. - An NVIDIA driver matching the CUDA toolkit must be present to run on the GPU.
Compiled kernels are cached on disk (MLX_PTX_CACHE_DIR, default under the
system temp dir), so only the first run of each kernel variant pays the NVRTC
cost. Point MLX_PTX_CACHE_DIR at a persistent path to keep the cache across
sessions.
C++ ISA baseline (MLXCEL_CXX_MARCH)
In release builds the C++ bridge defaults to -march=native, which tunes for
(and only runs on) the build host's CPU. That is correct for builds that run
where they are built (developer machines, the per-machine GB10/GH200 release
assets). For a binary that must run on other machines, set MLXCEL_CXX_MARCH
to a portable baseline; the release workflow's x86-64 assets use x86-64-v3
(AVX2):
# Portable x86-64 build (any AVX2-capable CPU, ~2013+).
MLXCEL_CXX_MARCH=x86-64-v3 cargo build --release --features cuda
# Omit -march entirely (compiler default baseline).
MLXCEL_CXX_MARCH=none cargo build --release --features cuda
Runtime environment variables
| Variable | Description | Default |
|---|---|---|
CUDA_HOME | CUDA toolkit root, build-time and for runtime NVRTC headers | /usr/local/cuda when present |
MLX_CUDA_ARCHITECTURES | CUDA SM target list, build-time | auto-detect via nvidia-smi, then 90a fallback |
MLXCEL_CXX_MARCH | C++ bridge -march value, build-time; none omits the flag | native |
MLXCEL_CCCL_DIR | Override for the bundled CCCL (libcu++) header dir used by the CUDA NVRTC JIT | bundled <exe-dir>/../include/cccl, then build-time fallback |
MLX_PTX_CACHE_DIR | On-disk cache for JIT-compiled CUDA kernels | system temp dir |
MLXCEL_QUIET_JIT | Suppress the one-time "compiling CUDA kernels" notice on a cold first run | unset (notice shown) |
MLXCEL_DEVICE | Runtime device hint (gpu or cpu) | auto |
MLXCEL_WIRED_LIMIT | Apple Silicon wired-memory ceiling, e.g. 64GB; 0/none disables it | max |
LLAMA_ARG_* | Environment-backed server options accepted by clap | unset |
For the complete MLXCEL_* reference, see
Environment variables.
Verifying the build
./target/release/mlxcel --version
./target/release/mlxcel-server --version
# `download` defaults to the global store at
# ${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models/<owner>/<name>.
./target/release/mlxcel download mlx-community/Qwen3-0.6B-4bit
./target/release/mlxcel generate \
-m ~/.cache/mlxcel/models/mlx-community/Qwen3-0.6B-4bit \
-p "Hello" -n 1
Troubleshooting
Missing Metal toolchain on macOS — run
xcodebuild -downloadComponent MetalToolchain and rebuild.
Cannot find CUDA library directory on Linux — set CUDA_HOME to the CUDA
toolkit root and rebuild.
nvidia-smi is unavailable on the build host — set MLX_CUDA_ARCHITECTURES
explicitly.
CUDA/cuDNN linker errors — confirm that the libraries expected by the pinned
MLX version are installed and discoverable by the linker. The root build script
links CUDA runtime/math libraries directly and relies on the system driver for
libcuda.
gmake: *** Error 137 (SIGKILL) while compiling qmm_*.cu — the build ran
out of memory. The CUTLASS-heavy quantized-matmul kernels peak at ~4-5 GB of
compiler memory per parallel job, so a default -j$(nproc) build needs roughly
5 GB × cores. Cap the parallelism with cargo build -j N ... (cargo forwards
N to the CMake subbuild); pick N ≈ available_RAM_GB / 5.
CMake error: LAPACK_INCLUDE_DIRS ... NOTFOUND — install liblapacke-dev
(MLX needs lapacke.h, which liblapack-dev alone does not provide) and
libopenblas-dev.