Qwen 3.x NextN
May 13, 2026 · View on GitHub
Scope: Qwen3.6 (and compatible) models with NextN / MTP auxiliary head weights in GGUF. The draft context now reuses the target
llama_model(no second mmap of the combined_MTP.gguf); a secondllama_contextis built over the same model withllama_context_params.nextn_draft = true, which routes graph build to the NextN draft builder (qwen35_nextn/qwen35moe_nextn). Legacy standalone*_mtpGGUFs (override_arch) are still supported as a fallback for users who ship the draft head as a separate artifact. This path is namednextnin this fork to coexist with Gemma 4 MTP (--spec-type mtp), which uses a single target context andllama_decode_mtp_*.
See also MTP.md (Gemma) and docs/speculative.md for shared CLI concepts.
0. Pre-built model GGUFs
Recommended: the AtomicChat — Qwen 3.6 UDT collection — drop-in combined *_MTP.gguf quants tuned for this fork. Each repo ships Q3 / Q4 / Q5 / Q6 / Q8 _K_XL, plus the mmproj for vision and a copy of imatrix_unsloth.gguf_file for reproducibility. Upstream Unsloth files keep working too — same arch metadata, same NextN tail.
| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) | Architecture |
|---|---|---|---|
| Qwen 3.6 35B-A3B (MoE) | AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF (Q4_K_XL ≈ 20.7 GiB) | unsloth/Qwen3.6-35B-A3B-MTP-GGUF | qwen35moe |
| Qwen 3.6 27B (dense) | AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF (Q4_K_XL ≈ 17.7 GiB) | unsloth/Qwen3.6-27B-MTP-GGUF | qwen35 |
Why UDT — built on Unsloth's public MTP-aware imatrix_unsloth.gguf_file, then layered with this fork's tensor-type masks (see §8): every blk.*.nextn.* / mtp.* tensor pinned to Q8_0 to preserve draft acceptance, and attn_q / attn_k lifted to Q6_K so the file pairs cleanly with TurboQuant3 KV. End-to-end recipe & runbook: docs/qwen-udt/RUNBOOK.md. Attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
The shared-model NextN path
works on any of them as long as the file contains the NextN auxiliary
head (nextn_predict_layers > 0) — which all *-MTP-GGUF quants do by
construction. scripts/verify-qwen36-nextn-gguf.py will refuse to load a
file missing the NextN layer.
Quick pull via -hf (target) + -hfd (draft); the server resolves both to
the same file in the HF cache and takes the shared-model branch:
# 35B-A3B MoE (headline +24-36 % cell in the matrix)
llama-server \
-hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
-hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
--spec-type nextn --draft-max 2 --draft-min 1 \
-c 8192 -ngl 99 -ngld 99 -fa on
# 27B dense
llama-server \
-hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
-hfd AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
--spec-type nextn --draft-max 2 --draft-min 1 \
-c 8192 -ngl 99 -ngld 99 -fa on
1. Architecture
| Piece | Role |
|---|---|
| Target context | Standard qwen35 / qwen35moe forward; graph publishes t_h_pre_norm (hidden before final norm). |
| Draft context | Built over the same llama_model with cparams.nextn_draft = true. The graph dispatcher picks llm_build_qwen35*_nextn against the target's NextN-layer tensors (model.layers[n_main + i].nextn.*). KV cache is sized only for the NextN layer (kv_only_nextn = true, overridden transparently in llama_context ctor). |
| Hidden transfer | Target and draft enable embeddings_pre_norm; llama_decode copies t_h_pre_norm rows into a CPU embd_pre_norm buffer. common_speculative_state_nextn reads via llama_get_embeddings_pre_norm_ith (no per-ubatch tensor hook). |
| Speculative driver | common_speculative_state_nextn in common/speculative.cpp (greedy Top-1 chain). |
| KV pairing | llama_set_nextn(target, draft) registers the draft context so llama_context_nextn_seq_rm can trim both KVs. |
The shared-model path eliminates the ~22 GB second mmap (one MTLBuffer per llama_model)
that used to OOM the 35B-A3B target on Apple Silicon (38 GB unified memory). See
llama_model_has_nextn_layer() (target arch ∈ {qwen35, qwen35moe} and
hparams.nextn_predict_layers > 0).
2. CLI / server
--spec-type nextn— enable NextN drafting (not Gemmamtp).--model-draft/-md— pass the same path as--model; the server detects this and switches to the shared-model path (no second model load). Pointing at a standalone NEXTN_ONLY GGUF (general.architecture = qwen35*_mtp) still works but loads a secondllama_model.--draft-max/--spec-draft-n-max— max chained draft tokens per round (seecommon/ server arg naming).- Gemma MTP flags (
--mtp-head,llama_decode_mtp_*,llama_model_load_mtp_from_file) are unchanged.
3. C API (subset)
llama_set_nextn(target_ctx, draft_ctx)— pair contexts for pairedseq_rm.llama_context_nextn_seq_rm(target_ctx, …)— remove KV on target and on the registered draft context (seq_id0 on draft).
Internal (see src/llama-ext.h, not in stable include/llama.h):
llama_set_embeddings_pre_norm(ctx, bool)— enable extraction/copy of pre-norm hidden rows intoembd_pre_norm.llama_get_embeddings_pre_norm_ith(ctx, i)— rowiof the last decode’s pre-norm buffer (i < 0supported like other embedding getters).
4. Operations
- Vocab: draft and target share tokenizer; arch check ensures
qwen35+qwen35_mtp(or MoE pair). - GDN rollback: target may use
n_rs_seqfrom speculative+GDN work; draft context forcesn_rs_seq = 0(seetools/server/server-context.cpp). - Metal / Vulkan: GDN partial rollback quality may still be upstream-limited; see PR #22400 notes in the project plan.
5. Verify GGUF
PYTHONPATH=gguf-py python3 scripts/verify-qwen36-nextn-gguf.py /path/to/model.gguf
6. Run scripts
scripts/run-qwen36-27b-nextn-server.shscripts/run-qwen36-35ba3b-nextn-server.sh
Set MAIN_GGUF to your Qwen3.6 *_MTP.gguf (see §0 for the recommended
unsloth quants); draft defaults to the same path so the server takes the
shared-model branch. Alternatively use -hf (target) + -hfd (draft) to
let llama-server pull both from Hugging Face into the local cache:
llama-server \
-hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
-hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
--spec-type nextn --draft-max 2 --draft-min 1
7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)
Median TPS over 2 runs, prompt = 50-token instruction, --draft-max=2 --draft-min=1,
NextN draft DM=2 (single async chain), context 8192. Single-slot
(--parallel 1 -np 1 --cont-batching), full GPU offload (-ngl 99 -ngld 99 -fa on),
shared-model draft path (no second mmap of combined _MTP.gguf),
AtomicChat UDT-Q4_K_XL_MTP file. See
.scratch/bench-logs/qwen-udt-ab-20260513-132549.md.
Bench host
| Component | Value |
|---|---|
| Machine | MacBook Pro (Mac16,5, MX313LL/A) |
| SoC | Apple M4 Max — 16 CPU cores (12P + 4E), 40-core GPU |
| Unified memory | 48 GB LPDDR5 |
| OS | macOS 26.3.1 (build 25D2128), Darwin 25.3.0 |
| llama.cpp backend | Metal (full GPU offload: -ngl 99 -ngld 99, -fa on) |
| Server | local llama-server over 127.0.0.1:8080 |
| Client | python3 urllib → /v1/chat/completions, temperature=0, cache_prompt=false, stream=false |
| Driver | scripts/bench-matrix-qwen.sh (3 runs/cell, median tps, mean accept) |
Single-slot configuration (--parallel 1 -np 1 --cont-batching); no other
heavy GPU/CPU workloads were running on the host during the matrix sweep.
| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept | Δ short | Δ long |
|---|---|---|---|---|---|---|---|
| qwen-27B dense | f16-base | 21.34 | 20.82 | — | — | — | — |
| qwen-27B dense | f16-nextn | 22.86 | 21.57 | 93.9% | 85.1% | +7.1% | +3.6% |
| qwen-27B dense | turbo3-base | 19.71 | 18.74 | — | — | — | — |
| qwen-27B dense | turbo3-nextn | 20.75 | 19.73 | 85.5% | 78.7% | +5.3% | +5.3% |
| qwen-35B-A3B MoE | f16-base | 70.09 | 69.63 | — | — | — | — |
| qwen-35B-A3B MoE | f16-nextn | 95.22 | 89.13 | 88.2% | 78.7% | +35.8% | +28.0% |
| qwen-35B-A3B MoE | turbo3-base | 61.84 | 62.01 | — | — | — | — |
| qwen-35B-A3B MoE | turbo3-nextn | 82.73 | 77.20 | 82.9% | 80.6% | +33.8% | +24.5% |
Where NextN helps the most: MoE targets (qwen-35B-A3B). Verify is heavy enough that the
draft compute fully overlaps via the async pipeline; acceptance stays high (≥78%) at both
prompt lengths. Wins range from +24% (turbo3, long) to +36% (f16, short), on top of
the +13% TurboQuant memory-bandwidth lift from turbo3 KV.
Dense 27B is draft-compute-bound but no longer regresses. The NextN-layer is a full
transformer block; on a dense model t_draft ≈ 2.6× t_verify, so the async pipeline cannot
overlap it fully and the upside is bounded by accept-rate × (t_verify / (t_verify + non-overlapped t_draft)).
With the shared-model draft path (no double mmap, no graph rebuilds across submits) we land
at +5-7% across short/long, both KV typings — modest but consistent, and positive
where the previous double-mmap path was negative (the old qwen-matrix-shared matrix logged
−7.6% / −11.9% on long for f16-nextn / turbo3-nextn respectively). turbo3 KV adds ~5% extra
draft compute on this rig (Metal dequant inside NextN attention) but it is hidden in the
overlap and TurboQuant's bandwidth win covers the rest.
History within this branch (27B regression resolved)
| Bench log (mtime) | Path | 27B f16-nextn long (Δ vs f16-base) | 27B turbo3-nextn long (Δ vs turbo3-base) | Note |
|---|---|---|---|---|
qwen-matrix-shared-20260512-202358.md | double mmap | −7.6 % (18.93 vs 20.49) | −11.9 % (15.72 vs 17.85) | 35B-A3B OOM on long prompts |
qwen-matrix-fullrun-20260512-222625.md | shared model | +3.6 % (21.57 vs 20.82) | +5.3 % (19.73 vs 18.74) | this matrix |
The jump came from a single architectural change: dropping the second
llama_model_load_from_file and reusing the target's already-loaded NextN tensors via
cparams.nextn_draft = true. Side-effects: (a) 22 GB second MTLBuffer gone — 35B-A3B MoE
now runs without OOM and posts +24-36%; (b) draft KV cache resized only for the NextN layer
(kv_only_nextn = true is mutated transparently in llama_context ctor for draft); (c) the
NextN graph builder now flows through LLM_GRAPH_TYPE_NEXTN instead of override_arch.
8. UDT quantization recipe (calibration + masks)
Goal: keep Unsloth’s MTP-aware imatrix (public imatrix_unsloth.gguf_file per HF repo) while applying AtomicChat-specific --tensor-type-file overrides:
| File | Extra tensors vs base |
|---|---|
scripts/quantize-masks/qwen36-ud-base.txt | token_embd / output high bit width; attn_v / ffn_down lifted; ffn_gate_inp for MoE |
qwen36-ud-v1-nextn.txt | All blk.*.nextn.* and mtp.* at q8_0 (draft-head preservation) |
qwen36-ud-v2-turbo3.txt | attn_q / attn_k at q6_K (stack with TurboQuant3 KV) |
qwen36-ud-v3-combined.txt | Union of v1 + v2 (default release build) |
Build entrypoints
- Single quant:
scripts/quantize-qwen-udt.sh - Full sweep:
scripts/quantize-qwen-udt-matrix.sh - Remote / bench / HF: docs/qwen-udt/RUNBOOK.md
Note: UDT filenames use …Q4_K_XL… as a product tag; llama-quantize is still invoked with family types Q4_K_M, Q5_K_M, etc.
9. Released artifacts — AtomicChat UDT collection
The recipe above ships as two ready-to-pull Hugging Face repos, grouped into one collection:
- Collection — AtomicChat — Qwen 3.6 UDT
- 27B dense —
AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF - 35B-A3B MoE —
AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF
What's actually in each repo, and why it's a bit unusual for a quant drop:
- 5 quants per model, all
_MTP.gguf—Q3_K_XL/Q4_K_XL/Q5_K_XL/Q6_K/Q8_K_XL. Every file already includes the NextN auxiliary head, so the same path works for-mand-md— no second GGUF, no second mmap, no second tokenizer. - NextN-preserve mask (V1) — every
blk.*.nextn.*andmtp.*tensor pinned toQ8_0. The cost is ~10 MiB of file size; the win is that the draft head stays close to BF16 fidelity, which keepsacceptancehigh under--spec-type nextn. Plain UD quants compress the head at the same bit-width as the body and bleed acceptance underturbo3KV. - TurboQuant3-friendly mask (V2) — attention Q/K bumped to
Q6_K. This is the piece we tuned specifically for this fork: when KV is compressed to 3-bit via-ctk turbo3 -ctv turbo3, the attention scores see extra dequant noise on K, so giving Q/K a little more headroom on the weight side cancels most of it out. - Default release = V3 (V1 ∪ V2) — the combined mask shipped on Hugging Face. V1-only and V2-only quants exist as ablation artifacts in the build tree but are not published; the V3 file simply has both lifts at once.
- mmproj mirrored from Unsloth —
mmproj-F16.ggufandmmproj-BF16.ggufre-hosted byte-for-byte from the correspondingunsloth/Qwen3.6-*-MTP-GGUFrepo so a single-hfline gets you target + draft + projector. imatrix_unsloth.gguf_filere-hosted — same artifact as Unsloth's (77-chunk, MTP-aware), included in each repo so the build is reproducible from a clean clone of the recipe.- Apache-2.0, attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging). Fork:
AtomicBot-ai/atomic-llama-cpp-turboquant.
The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in docs/qwen-udt/RUNBOOK.md; re-running it on the same Unsloth sources reproduces the published files bit-identical.
10. Multimodal (--mmproj) + speculative decoding (this fork)
Upstream llama-server used to disable all speculative modes whenever a projector was loaded, so a single Qwen 3.6 / Gemma 4 server could not host vision and a draft head at the same time. In atomic-llama-cpp-turboquant the load-time and slot-init gates accept --mmproj together with:
--spec-type mtp(Gemma 4 assistant)--spec-type nextn(Qwen3 NextN draft context)--spec-type eagle3(stub impl; same contract)
These three never look at the flattened prompt_tgt token stream — they read target hidden states / KV directly — so they can coexist with mtmd image chunks. Other modes stay disabled with a warning: separate draft models, all ngram_* modes, ctx_shift and cache_reuse.
What is and is not accelerated today
- Text-only turns on a multimodal slot — draft head runs as usual. Same acceptance rates as the no-mmproj configuration.
- Turns that contain an image chunk — server logs
skipping speculative prime for multimodal promptand falls back to plain target decoding for that turn only. The slot keeps generating text correctly, just without draft speedup.
The reason for the fallback: NextN / MTP begin() needs the target's pre-norm hidden state at every prompt position, but the mtmd image-decode path only writes outputs for the last row of an image batch (get_embeddings_pre_norm_ith returns null for image-pad positions, see tools/server/server-context.cpp). Until image chunks emit per-token outputs, priming on a mixed token stream would leave the draft KV partially seeded and desynced from the target by image-expanded positions. Skipping the prime keeps the slot stable and lets the next pure-text turn re-enable drafting from scratch.
Verified configurations
| Model | Spec | KV | mmproj | Image | Text reply | Decode |
|---|---|---|---|---|---|---|
| Qwen 3.6-35B-A3B-UDT-Q4_K_XL_MTP | nextn | turbo3 | F16 | recognised | OK | ~69 t/s |
| Gemma 4-26B-A4B-it-UD-Q4_K_XL | mtp | turbo3 | F16 | recognised | OK | ~55 t/s |
Both runs were validated on M4 Max with a single shared model file (no second mmap), -c 4096, -fa on.
Roadmap
Real draft acceleration on the vision turn itself requires making mtmd image batches emit per-token outputs (or a teacher-forced replay through target). Tracked as a follow-up; not blocking this fork's release.