CLI Reference

April 28, 2026 ยท View on GitHub

Commands Overview

CommandDescription
vllm-mlx serveStart OpenAI-compatible server
vllm-mlx modelInspect, acquire, or convert model artifacts
vllm-mlx bench-serveBenchmark a running server with prompt sweeps or workload contracts
vllm-mlx-benchRun performance benchmarks
vllm-mlx-chatStart Gradio chat interface

vllm-mlx serve

Start the OpenAI-compatible API server.

Usage

vllm-mlx serve <model> [options]
vllm-mlx serve --models-config <yaml> [options]

Options

OptionDescriptionDefault
--served-model-nameCustom model name exposed through the OpenAI API. If not set, the model path is used as the name.None
--portServer port8000
--hostServer host127.0.0.1
--api-keyAPI key for authenticationNone
--rate-limitRequests per minute per client (0 = disabled)0
--timeoutRequest timeout in seconds300
--enable-metricsExpose Prometheus metrics on /metricsFalse
--continuous-batchingEnable batching for multi-userFalse
--cache-memory-mbCache memory limit in MBAuto
--cache-memory-percentFraction of RAM for cache0.20
--no-memory-aware-cacheUse legacy entry-count cacheFalse
--use-paged-cacheEnable paged KV cacheFalse
--max-tokensDefault max tokens32768
--max-request-tokensMaximum max_tokens accepted from API clients32768
--stream-intervalTokens per stream chunk1
--mcp-configPath to MCP config fileNone
--paged-cache-block-sizeTokens per cache block64
--max-cache-blocksMaximum cache blocks1000
--max-num-seqsMax concurrent sequences256
--default-temperatureDefault temperature when not specified in requestNone
--default-top-pDefault top_p when not specified in requestNone
--default-chat-template-kwargsDefault chat template kwargs applied when request chat_template_kwargs is omitted (JSON object)None
--max-audio-upload-mbMaximum uploaded audio size for /v1/audio/transcriptions25
--max-tts-input-charsMaximum text length accepted by /v1/audio/speech4096
--reasoning-parserParser for reasoning models (qwen3, deepseek_r1)None
--embedding-modelPre-load an embedding model at startupNone
--enable-auto-tool-choiceEnable automatic tool callingFalse
--tool-call-parserTool call parser (auto, mistral, qwen, llama, hermes, deepseek, kimi, granite, nemotron, xlam, functionary, glm47)None
--models-configYAML registry file for multi-model servingNone

Examples

# Simple mode (single user, max throughput)
# Model path is used as the model name in the OpenAI API (e.g. model="mlx-community/Llama-3.2-3B-Instruct-4bit")
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit

Model will show up as 'mlx-community/Llama-3.2-3B-Instruct-4bit' in the `/v1/models` API endpoint. View with `curl http://localhost:8000/v1/models` or similar.

# With a custom API model name (model is accessed as "my-model" via the OpenAI API)
# --served-model-name sets the name clients must use when calling the API (e.g. model="my-model")
vllm-mlx serve --served-model-name my-model mlx-community/Llama-3.2-3B-Instruct-4bit
# Note: Model will show up as 'my-model' in the `/v1/models` API endpoint.

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --continuous-batching

# With memory limit for large models
vllm-mlx serve mlx-community/GLM-4.7-Flash-4bit \
  --continuous-batching \
  --cache-memory-mb 2048

# Production with paged cache
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
  --continuous-batching \
  --use-paged-cache \
  --port 8000

# With MCP tools
vllm-mlx serve mlx-community/Qwen3-4B-4bit --mcp-config mcp.json

# Multimodal model
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit

# Reasoning model (separates thinking from answer)
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# Disable server-wide thinking by default (request-level chat_template_kwargs still override)
vllm-mlx serve mlx-community/Qwen3-8B-4bit \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# DeepSeek reasoning model
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1

# Tool calling with Mistral/Devstral
vllm-mlx serve mlx-community/Devstral-Small-2507-4bit \
  --enable-auto-tool-choice --tool-call-parser mistral

# Tool calling with Granite
vllm-mlx serve mlx-community/granite-4.0-tiny-preview-4bit \
  --enable-auto-tool-choice --tool-call-parser granite

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --api-key your-secret-key

# Registry-backed multi-model serving
vllm-mlx serve --models-config /etc/vllm-mlx/models.yaml --continuous-batching

# Expose Prometheus metrics
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --enable-metrics

# Production setup with security options
vllm-mlx serve mlx-community/Qwen3-4B-4bit \
  --api-key your-secret-key \
  --rate-limit 60 \
  --timeout 120 \
  --continuous-batching

For registry-backed serving, see Multi-Model Serving.

Security

When --api-key is set, all API requests require the Authorization: Bearer <api-key> header:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key"  # Must match --api-key
)

Or with curl:

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-key"

vllm-mlx model

Inspect, acquire, and convert model artifacts without serving them. These commands are intended to make model setup auditable: inspect before download, download into a finalized artifact manifest, then convert through mlx-lm with the exact recipe recorded.

Usage

vllm-mlx model inspect <path-or-hf-model-id>
vllm-mlx model acquire <hf-model-id> [--target-dir <path>]
vllm-mlx model convert <path-or-hf-model-id> --output <path> [--quantize]

Options

CommandOptionDescription
inspect--revisionHugging Face revision to inspect
inspect--local-files-onlyInspect only local Hugging Face cache files
acquire--target-dirFinal local directory for a staged download
acquire--staging-dirTemporary directory used before finalizing --target-dir
acquire--mllmDownload multimodal file patterns
acquire--no-fast-transferDo not set HF_HUB_ENABLE_HF_TRANSFER=1
convert--outputOutput directory for the converted MLX model
convert--quantizeEnable mlx-lm quantization
convert--q-bits, --q-group-size, --q-modeQuantization recipe
convert--quant-predicatemlx-lm mixed-bit quantization recipe
convert--dtypeDtype for non-quantized parameters
convert--dry-runPrint command and manifest without executing conversion

Examples

vllm-mlx model inspect mlx-community/Llama-3.2-3B-Instruct-4bit

vllm-mlx model acquire mlx-community/Llama-3.2-3B-Instruct-4bit \
  --target-dir ./models/llama-3b-4bit

vllm-mlx model convert meta-llama/Llama-3.2-3B-Instruct \
  --output ./models/llama-3b-mlx-q4 \
  --quantize --q-bits 4 --q-group-size 64 --q-mode affine

vllm-mlx bench-serve

Benchmark a running vllm-mlx server over HTTP. Prompt-sweep mode measures TTFT, TPOT, throughput, cache deltas, and Metal memory. Workload mode adds per-case quality checks, repeated samples for variance, and comparison-only product policy timeouts. Workload cases can embed messages directly or point request_path at an existing OpenAI-compatible request JSON.

Usage

vllm-mlx bench-serve --url http://localhost:8000 [options]

Options

OptionDescriptionDefault
--urlRunning server base URLhttp://127.0.0.1:8080
--modelAPI model idAuto-detect
--promptsComma-separated prompt sets or files for sweep modeshort,medium,long
--workloadDeclarative workload JSON for contract modeNone
--concurrencyComma-separated concurrency levels for sweep mode1,4
--max-tokensMax tokens for sweep mode256
--repetitionsRepetitions per sweep configuration or workload case3
--enable-thinkingtrue, false, or true,false sweepNone
--scrape-metricsScrape /metrics before/after runstrue
--include-contentInclude full generated content in workload JSONFalse
--request-timeout-sWorkload HTTP transport timeout, 0 disables300
--cache-policyWorkload cache handling: preserve, before-run, before-caseWorkload default or preserve
--outputOutput filestdout
--formatOutput format: auto, table, json, csv, sql, sqliteauto = table for prompt sweeps, json for workloads

In workload mode, --request-timeout-s is the HTTP transport ceiling for each request. Product policy timeouts should live in the workload as policy_timeout_ms. Workload required_regex and forbidden_regex values are Python regex patterns, so literal strings are valid. Workload JSON may spell cache policy values with underscores, such as before_case; they normalize to the hyphenated CLI values.

Examples

# Prompt sweep
vllm-mlx bench-serve --url http://localhost:8000 \
  --prompts short,long --concurrency 1,4 --format json --output bench.json

# Contract workload with quality checks and policy-timeout evidence
vllm-mlx bench-serve --url http://localhost:8000 \
  --workload workload.json --repetitions 5 --output workload-results.json

# Append contract rows directly into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 \
  --workload workload.json --repetitions 5 --format sqlite --output bench.db

vllm-mlx-bench

Run performance benchmarks.