Model Compatibility Matrix

May 5, 2026 · View on GitHub

Empirical end-to-end compatibility results from a sweep of MLX-format models on Apple Silicon (M5 Max, vLLM 0.19.1, vllm-swift feat/auto-detect-reasoning-parser-and-hardening).

Each model was launched via vllm-swift serve <path> (auto-detect on), then exercised through a 3-turn agent harness:

T1 — list files (single-shot tool dispatch via bash)
T2 — write code (largest(a, b, c) saved via write tool)
T3 — review code (read the file back via read, comment on bugs)

Verdicts:

PASS — all 3 turns dispatched structured tool_calls, end-to-end write→read→review succeeded
SOFT-FAIL — at least one turn passed; failure traceable to model capability, not parser plumbing
HARD-FAIL — model never produced structured tool_calls; root cause documented below
SKIPPED — environment / required-files limitation outside vllm-swift's scope

This file is a snapshot. Re-running the sweep against the latest feat/auto-detect-reasoning-parser-and-hardening head should reproduce within run-to-run variance.

Summary scorecard

Model	Verdict	Auto-detected (tool / reasoning)	Failure mode	Root cause
Qwen3.5-9B-4bit	✅ PASS	qwen3_coder / qwen3	—	works
Qwen3-Coder-30B-A3B-Instruct-MLX-6bit	✅ PASS	qwen3_coder / (suppressed)	—	works (after qwen3+qwen3_coder reasoning-suppression fix)
Nemotron-Cascade-2-30B-A3B-4bit	✅ PASS	qwen3_coder / nemotron_v3	—	works
Qwen3.6-35B-A3B-4bit	✅ PASS	qwen3_coder / qwen3	—	works
Llama-3.2-3B-Instruct-4bit	✅ PASS	llama3_json / (none)	—	works
gpt-oss-20b-MXFP4-Q8	✅ PASS	openai / openai_gptoss	—	works
Qwen3-0.6B-4bit	⚠️ SOFT-FAIL	hermes / qwen3	T1+T2 ✓, T3 ✗ — model loses thread on 3rd turn	sub-1B intrinsic limit
Qwen3.5-2B-4bit	❌ HARD-FAIL	qwen3_coder / qwen3	silent generation: 0 tokens emitted to content/reasoning/tool_calls	Qwen3.5 small variant chat-template bug + reasoning-disabled-by-default
Llama-3.2-1B-Instruct-hf	❌ HARD-FAIL	llama3_json / (none)	never dispatches tools	sub-1B intrinsic limit (Meta: "lightweight models do not support built-in tools")
Mistral-7B-Instruct-v0.3-4bit	❌ HARD-FAIL	mistral / (none)	vLLM mistral parser fails with "Only one BOT token should have been outputted" + emits `[TOOL_CALLS][TOOL_CALLS]`	vLLM upstream parser bug
Phi-4-mini-instruct-4bit	⚠️ SOFT-FAIL	phi4_mini_json / (none)	Model emits `<\|tool_calls\|>...<\|/tool_calls\|>` as plain content text and vLLM's parser doesn't extract. Auto-recovery in vllm-swift's response rewriter (both non-streaming and streaming) now synthesizes structured `tool_calls` from this leak shape, lifting Phi-4-mini from HARD-FAIL to functional for clients that hit the documented behavior. Remaining failure mode is the model occasionally choosing to chat-explain instead of dispatch — a model-quality issue, not a parser/recovery one.	vLLM upstream parser gap; recovered by vllm-swift
gemma-4-e2b-it-4bit	⏭️ SKIPPED	gemma4 / gemma4	server boot fails: "Can't load video processor" missing `video_preprocessor_config.json`	environment / missing files

Auto-detect picks qwen3_coder + qwen3. Validates the empirical fix that routes Qwen3_5ForConditionalGeneration to qwen3_coder (the chat_template.jinja ships <tool_call><function=name><parameter=k>v... XML, not hermes JSON). All 3 turns dispatched cleanly; largest.py written and read back end-to-end.

Qwen3-Coder-30B-A3B-Instruct-MLX-6bit

Auto-detect picks qwen3_coder. Reasoning is intentionally suppressed by the -Coder- directory-name discriminator. Without that suppression, qwen3_coder (tool) + qwen3 (reasoning) race: model emits tool calls inside <think> blocks, the reasoning parser eats them, tool_calls=[]. With suppression, all 3 turns work and the model writes a docstring'd correct implementation.

Nemotron-Cascade-2-30B-A3B-4bit

Auto-detect picks qwen3_coder (NOT hermes — this was an empirical correction; see PR #14 for HF discussion #7 reference) + nemotron_v3 (NVIDIA's purpose-built reasoning parser). All 3 turns clean.

Qwen3.6-35B-A3B-4bit

The original target of issue #13. Auto-detect picks qwen3_coder + qwen3. Independently validated by @Defilan against mlx-community/Qwen3.6-35B-A3B-8bit (see PR #14 comment).

Llama-3.2-3B-Instruct-4bit

Auto-detect picks llama3_json only (no reasoning parser). 3B is the smallest Llama variant in our sweep that reliably dispatches tools. Per LangChain community, 3B is the practical floor for reliable Llama-family tool calling.

gpt-oss-20b-MXFP4-Q8

Auto-detect picks openai (tool) + openai_gptoss (reasoning). All 3 turns clean, with a numpy-style docstring on the generated function.

⚠️ SOFT-FAIL

Qwen3-0.6B-4bit

T1 (list) and T2 (write) dispatch correctly. T3 (read) — model loses the multi-turn thread. This is consistent with community findings on small-model tool calling: sub-7B agent loops degrade on multi-step chains. Detection is correct; model capability is the limiting factor.

❌ HARD-FAIL — model intrinsic / capability

Qwen3.5-2B-4bit

Auto-detect picks qwen3_coder + qwen3. Model generates zero tokens visible at the API: content="", reasoning_content="", tool_calls=[], finish_reason=stop immediately. Documented by Qwen themselves: Qwen3.5 chat template tool calling broken, and Qwen3.5 small-variant reasoning is disabled by default unless chat_template_kwargs={"enable_thinking": true} is passed. Detection is correct; the model variant ships a partially-working template.

Llama-3.2-1B-Instruct-hf

Auto-detect picks llama3_json. Model never produces structured tool_calls across any of the 3 turns. Meta's own Llama 3.2 model card states: "the lightweight models do not support built-in tools." 1B is below the practical floor for reliable agentic tool dispatch.

❌ HARD-FAIL — vLLM upstream parser

Mistral-7B-Instruct-v0.3-4bit

Auto-detect picks mistral. Server returns HTTP 500 with:

ValueError: Only one BOT token should have been outputted, but got [TOOL_CALLS][TOOL_CALLS][...]

i.e., the model correctly produces a tool-call structure, but the vLLM mistral_tool_parser rejects double-BOT-token emissions. Documented in multiple vLLM upstream issues:

vllm-swift's role here ends at picking the correct parser. The parser itself is the upstream bug surface.

Phi-4-mini-instruct-4bit

Auto-detect picks phi4_mini_json. Model emits the correct token sequence as plain content (<|tool_calls|>[{...}]<|/tool_calls|>), but vLLM's parser does not extract it into message.tool_calls. Documented:

Microsoft's Phi-4-mini-instruct model card also notes the model "could sometimes hallucinate function names."

Recovered by vllm-swift since v0.4.0. When phi4_mini_json is the auto-detected tool parser, the rewriter proxy spawns automatically (via _LEAKY_TOOL_PARSERS) and runs auto-recovery on both non-streaming and streaming responses. The leak shape gets parsed out of content and synthesized into a structured message.tool_calls, with finish_reason bumped to tool_calls. Clients see a normal structured tool dispatch. The remaining model-quality issue (Phi-4-mini sometimes choosing to chat-explain rather than dispatch) is unfixable from vllm-swift; if your agent loop hits it, a stronger model is the answer.

⏭️ SKIPPED — environment / missing files

gemma-4-e2b-it-4bit

vLLM 0.19.1 refuses to load the model:

OSError: Can't load video processor for '...gemma-4-e2b-it-4bit'. ...make sure '...' is the correct path to a directory containing a video_preprocessor_config.json file

vLLM's Gemma-4 path treats the family as multimodal even when the 4-bit MLX build ships text-only. Documented:

Workaround per vLLM Gemma 4 recipe: pass --limit-mm-per-prompt image=0,audio=0 to the inner vLLM args. vllm-swift's CLI passes through extra args, so users can still launch gemma-4 by appending that flag to vllm-swift serve. Not validated end-to-end in the sweep because the missing config file blocks load entirely on the MLX 4-bit build we have on disk.

Categorization

Category	Count	Models
Works end-to-end via vllm-swift	7/12	Qwen3.5-9B, Qwen3-Coder-30B-MLX, Nemotron-Cascade-2, Qwen3.6-35B-A3B, Llama-3.2-3B, gpt-oss-20b, Phi-4-mini (via auto-recovery since v0.4.0)
Model intrinsic limit (sub-7B)	3/12	Qwen3-0.6B, Qwen3.5-2B, Llama-3.2-1B
vLLM upstream parser bug (no in-band recovery)	1/12	Mistral-7B-v0.3 (parser 500s before recovery sees it)
Environment / missing files	1/12	gemma-4-e2b-it

0/12 failures are vllm-swift bugs. Every failure traces to (a) model capability limits below the agentic floor, (b) vLLM upstream parser issues, or (c) missing vendor config files outside the auto-detect scope.

Reproducing the sweep

# harness lives outside the repo (intentionally throwaway)
python3 /tmp/sweep_harness.py

The harness expects vllm-swift available at /Users/tom/.vllm-swift/venv and models at /Users/tom/models/<name>; tweak the VENV_PY and MODELS constants for other environments. Per-turn results land in /tmp/sweep-results/<model>.json with the full conversation in messages-<model>.json. Aggregate scorecard goes to SUMMARY.json.