TurboQuant Configuration Recommendations

April 2, 2026 · View on GitHub

Practical guidance for choosing TurboQuant settings based on your model, weight quantization, and hardware. Based on validated testing across Metal (M1 through M5), CUDA (RTX 3080 Ti through Blackwell), and HIP (AMD RDNA 4).

Multi-backend validated: Metal (Apple Silicon), CUDA (NVIDIA), and HIP (AMD) all produce consistent quality results. Speed characteristics vary by backend. CUDA decode is faster than q8_0 on some models (dusterbloom fused MMA FA). Metal prefill beats q8_0 at 32K+ context on 70B.

Sparse V: Enabled by default on all Metal builds. Skips V dequant for negligible attention weights. No PPL impact, +22.8% decode on MoE at 32K. Opt-out: TURBO_SPARSE_V=0.

Block size 128: Now the default. turbo3 achieves 5.12x compression (was 4.57x at block_size=32) with zero quality cost. Validated on Metal and CUDA (PR #32 fix).

Validated Good

These configurations produce healthy PPL in current testing:

Model classConfigEvidence
Q8_0 weights, any size-ctk turbo4 -ctv turbo4phi-4 +1.7%, 35B MoE +0.2%, 27B Dense healthy
Q8_0 weights, any size-ctk turbo3 -ctv turbo3phi-4 +4.2%, 35B MoE +1.1%
Q8_0 weights, any size-ctk q8_0 -ctv turbo4phi-4 +0.3%
Q8_0 weights, any size-ctk q8_0 -ctv turbo3phi-4 +1.1%
Q4_K_M, larger models (24B+)-ctk turbo3 -ctv turbo3Mistral-24B PPL 4.99, Llama-70B +11.4%, Command-R+ 104B +3.6%
Q4_K_M, larger models (70B+)-ctk turbo4 -ctv turbo4Llama-70B +6.3%, Command-R+ 104B +1.9%
Q4_K_M, tested sensitive models-ctk q8_0 -ctv turbo4Qwen2.5-7B +1.0% (Metal, CUDA, AMD all confirmed)
Q4_K_M, tested sensitive models-ctk q8_0 -ctv turbo3Qwen2.5-7B +2.0%
Q4_K_M, 70B+-ctk q8_0 -ctv turbo2Llama-70B +9.5%, Command-R+ 104B +7.9% (asymmetric rescue)

Validated Risky

These configurations produce catastrophic PPL in at least one tested model:

Model classConfigEvidence
Q4_K_M, Qwen2.5-7B-ctk turbo4 -ctv turbo4PPL 218 (vs 6.58 baseline)
Q4_K_M, Qwen2.5-7B-ctk turbo3 -ctv turbo3PPL 3556
Q4_K_M, Qwen2.5-1.5B-ctk turbo3 -ctv turbo3PPL 8641+ on both M5 and M2

Note: symmetric turbo on Q4_K_M is not universally broken. Mistral-24B, Llama-70B, and Command-R+ 104B Q4_K_M all handle it fine. Model family and size both matter. Qwen2.5 is consistently sensitive; Llama, Mistral, and Cohere are tolerant. Bigger models absorb quantization stacking better (104B turbo3 = +3.6% vs 70B turbo3 = +11.4%).

Experimental

These configurations showed promising results but have less validation depth:

Model classConfigEvidence
Q4_K_M, tested sensitive models-ctk q8_0 -ctv turbo2Qwen2.5-7B +5.1%
Q8_0 weights-ctk q8_0 -ctv turbo2phi-4 +3.1%
Q4_K_M, Qwen2.5-7B (AMD)-ctk q8_0 -ctv turbo3NaN on HIP (Metal gets +2.0%). HIP-specific, under investigation

Boundary V (auto-enabled for turbo2-V)

A layer-aware V compression strategy that protects the first 2 + last 2 layers with q8_0-V while compressing all remaining layers with turbo2-V. Auto-enabled when -ctv turbo2 is set on recent builds. Opt-out: TURBO_LAYER_ADAPTIVE=0. On older builds, activate with TURBO_LAYER_ADAPTIVE=7.

Validated across 4 models on Metal. Consistently recovers 37-91% of the turbo2-to-turbo3 quality gap. Benefit scales with model depth.

ModelLayersturbo2 PPLBoundary V PPLturbo3 PPLQuality recovered
phi-4-Q8_0404.8354.7844.74255%
Qwen2.5-7B Q4_K_M286.9116.8356.70737%
Qwen3.5-35B MoE645.2575.1485.13791%
Qwen3.5-27B Dense366.5346.4236.27342%

Validated at 512 and 8K context. NIAH retrieval passed. No speed penalty. Independently validated by @Corianas_ on NanoGPT.

Exploratory Findings (2026-03-31): Extended validation on Qwen3.5-35B MoE Q8_0 with q8_0-K + turbo2-V (Boundary V auto-enabled) at 512c, 8K, and 32K context. PPL matched or beat q8_0/turbo3 at every tested length. At 32K decode on M5 Max, turbo2+BV was ~2-3% faster than turbo3-V (n=2 runs). This makes q8_0/turbo2 the best tested MoE long-context V config on this model. See MoE V-compression frontier.

See Layer-Aware V Compression for the original Boundary V writeup.

Your situationStart withWhy
Q8_0+ weights-ctk turbo4 -ctv turbo4Best quality/compression balance
Q8_0+ weights, need more compression-ctk turbo3 -ctv turbo3+4% PPL, 5.12x compression
Q4_K_M, unknown model-ctk q8_0 -ctv turbo4Safe default, V still compressed
Q4_K_M, validated large model (24B+)-ctk turbo3 -ctv turbo3If you've confirmed PPL is healthy
Q4_K_M, 70B+-ctk turbo4 -ctv turbo4+6.3% on 70B, +1.9% on 104B. Symmetric works on large Llama/Cohere
Maximum V compression-ctk q8_0 -ctv turbo2+5-9.5% PPL, Boundary V auto-enabled
MoE long-context V compression-ctk q8_0 -ctv turbo2Tested on Qwen3.5-35B MoE: 7.53x V, PPL within 1%, 32K decode ~2-3% faster than turbo3-V

Important framing: Asymmetric q8_0-K + turbo-V is a quality/robustness rescue, not a speed optimization. You trade some decode throughput (K is uncompressed) for quality safety on sensitive models. If your model works fine with symmetric turbo, use symmetric.

MoE exception: On the tested Qwen3.5-35B MoE Q8_0 setup, q8_0/turbo2 with Boundary V (auto-enabled) is the best tested long-context V config: 7.53x V compression, PPL within 1% of q8_0 at 512c/8K/32K, quality matching or exceeding q8_0/turbo3, and 32K decode ~2-3% faster than turbo3-V (n=2 runs, M5 Max). This is a scoped result for this tested setup. See MoE V-compression frontier.

Why K Precision Matters More Than V

The attention mechanism computes softmax(Q * K^T) * V. K determines which tokens receive attention weight via softmax. Softmax amplifies small errors exponentially: a small shift in Q*K scores can flip which tokens dominate the output. V errors, by contrast, scale linearly through the weighted sum.

In current testing:

  • q8_0-K + turbo3-V on Qwen2.5-7B Q4_K_M gives PPL 6.71 (+2.0% vs baseline)
  • turbo3-K + q8_0-V on the same model gives PPL 3556 (catastrophic)

Same total bits, opposite directions, 500x quality difference. K precision is the dominant lever.

This is why asymmetric -ctk q8_0 -ctv turbo3 can rescue models where symmetric -ctk turbo3 -ctv turbo3 fails. You still get V cache compression while maintaining the attention routing accuracy that K requires.

Tested Configurations

Metal (Apple Silicon)

All results from Metal flash attention. PPL measured on wikitext-2-raw (512 context, 4 chunks) unless noted.

phi-4-14B (Q8_0 weights) — healthy across all configs

KVM5 PPLM2 PPLvs q8_0Status
q8_0q8_04.6904.691baselinehealthy
turbo4turbo44.7704.787+1.7% / +2.0%healthy
turbo3turbo34.8864.956+4.2% / +5.7%healthy
q8_0turbo44.7024.693+0.3%healthy
q8_0turbo34.742+1.1%healthy
q8_0turbo24.835+3.1%healthy

Cross-hardware matched: M2 Pro and M5 Max produce equivalent results.

Qwen2.5-7B-Instruct (Q4_K_M weights) — sensitive to symmetric turbo, rescued by asymmetric K/V

KVM5 PPLM2 PPLvs q8_0Status
q8_0q8_06.5776.579baselinehealthy
q8_0turbo46.6446.603+1.0%rescued
q8_0turbo36.7076.715+2.0%rescued
q8_0turbo26.911+5.1%rescued
turbo4turbo4217.7227.5catastrophicavoid
turbo3turbo335563778catastrophicavoid

Cross-hardware matched: both machines show identical quality patterns.

Llama-3.1-70B-Instruct (Q4_K_M weights, M5 Max 128GB) — tolerates symmetric turbo

KVPPLvs q8_0Status
q8_0q8_03.257baselinehealthy
q8_0turbo43.301+1.3%healthy
q8_0turbo33.325+2.1%healthy
q8_0turbo23.568+9.5%healthy
turbo4turbo43.461+6.3%healthy
turbo3turbo33.629+11.4%usable
turbo2turbo25.161+58.5%degraded

Asymmetric rescue works: q8_0/turbo2 = +9.5% vs symmetric turbo2/turbo2 = +58.5%. 6x improvement in quality degradation.

Long context PPL (wikitext-2-raw):

KVContextPPL
q8_0q8_08K3.617
q8_0turbo48K3.639
q8_0turbo38K3.653
turbo3turbo332K4.839
q8_0q8_048K3.575
turbo3turbo348K4.019

NIAH: 30/30 perfect (turbo3 = q8_0, 5 depths x 3 context lengths). See 70B stress test.

Command-R+ 104B (Q4_K_M weights, M5 Max 128GB) — tolerates symmetric turbo, 128K validated

KVPPLvs q8_0Status
q8_0q8_06.192baselinehealthy
q8_0turbo46.211+0.3%healthy
q8_0turbo36.296+1.7%healthy
q8_0turbo26.678+7.9%healthy
turbo4turbo46.312+1.9%healthy
turbo3turbo36.415+3.6%healthy
turbo2turbo27.049+13.8%usable

Largest model tested. 104B tolerates symmetric turbo better than 70B (bigger models = more headroom). turbo3 prefill faster than q8_0 at 32K (64.5 vs 62.3 t/s).

Long context PPL (turbo3/turbo3, wikitext-2-raw):

ContextPPLPass time
48K3.672931s
64K4.3211481s
96K4.1702966s
128K4.0244996s

128K full native context achieved by raising macOS GPU memory cap from default ~75% of physical RAM. Recommended setting is 90% to avoid kernel panics under sustained load. Peak memory 74 GB of 128 GB.

# Recommended: 90% of physical RAM (safe for sustained inference)
# 128GB Mac
sudo sysctl iogpu.wired_limit_mb=117964
# 96GB Mac
sudo sysctl iogpu.wired_limit_mb=88474
# 64GB Mac
sudo sysctl iogpu.wired_limit_mb=58982

NIAH: 10/10 perfect at 4K and 8K (turbo3). 16K timed out due to slow decode on 104B, not retrieval failure.

See M5 Max stress test.

Qwen3.5-35B-A3B MoE (Q8_0 weights) — healthy

KVM5 PPLStatus
turbo3turbo35.130healthy
turbo4turbo45.078healthy

Qwen3.5-27B Dense (Q8_0 weights) — healthy

KVM5 PPLStatus
turbo3turbo36.339healthy

Mistral-Small-24B-Instruct (Q4_K_M weights) — healthy at this size

KVM5 PPLStatus
turbo3turbo34.987healthy

This shows Q4_K_M is not universally incompatible with symmetric turbo. The 24B model has enough capacity to absorb the quantization stacking that breaks the 7B model.

CUDA (NVIDIA)

Community-validated on RTX 3080 Ti, RTX 3090, RTX 4090, RTX 5090, and DGX Spark (Blackwell sm_121).

seanrasch — Qwen3.5-4B Q4_K_M (RTX 3090, block_size=128)

KVPPLvs fp16Status
fp16fp1610.037baseline
q8_0turbo410.124+0.9%healthy
q8_0turbo310.163+1.3%healthy
turbo3turbo310.247+2.1%healthy
q8_0turbo210.568+5.3%healthy

Qwen3.5 hybrid architecture handles symmetric turbo without the blowup seen on Qwen2.x.

dusterbloom — Decode Speed (RTX 3090, block_size=128, fused MMA FA)

ModelDecode vs q8_0
Gemma-3-12B+7.3% faster
Qwen3.5-35B MoE+4.2% faster
Nemotron-9B+3.4% faster
Qwen3.5-9B+0.8% faster

Prefill near parity (-0.3% to +2.5% at pp8192).

Char__Bob — Mistral-Small-24B (RTX 3090)

q8_0 OOMs at 128K on 3090. turbo3 enables 128K (18,430 MiB fits). Decode flat ~43 t/s.

seanrasch — NIAH (RTX 3080 Ti)

Qwen3.5-9B turbo3/turbo3: 33/33 (100%), matches f16.

HIP (AMD)

First AMD validation. RX 9070 XT (RDNA 4, gfx1201), Windows 11, HIP SDK 7.1. First attempt, no tuning.

Qwen2.5-7B-Instruct (Q4_K_M weights)

KVPPLvs q8_0Status
q8_0q8_07.794baselineOK
q8_0turbo47.876+1.0%recommended
q8_0turbo3NaNHIP-specific issue
turbo4turbo4401.4catastrophicQ4_K_M sensitivity
turbo3turbo381,277catastrophicQ4_K_M sensitivity

Asymmetric q8_0/turbo4 confirmed on AMD. Symmetric Q4_K_M failure consistent across all three GPU vendors. q8_0/turbo3 NaN is HIP-specific (Metal gets +2.0%). No speed penalty on working configs.

See Windows RDNA 4 Setup Guide for build instructions and 9 gotchas.

Practical Guidance

Strong base weights (Q8_0, Q6_K, or higher)

Symmetric turbo works well. Start with turbo4 for best quality, turbo3 for more compression:

# Best quality with compression
llama-server -m model-Q8_0.gguf -ctk turbo4 -ctv turbo4 -fa 1

# More compression, still healthy
llama-server -m model-Q8_0.gguf -ctk turbo3 -ctv turbo3 -fa 1

Low-bit base weights (Q4_K_M) on sensitive models

Qwen2.5-7B Q4_K_M fails catastrophically with symmetric turbo but is rescued by asymmetric K/V. Not all Q4_K_M models are sensitive. If unsure, start asymmetric:

# Recommended: near-baseline quality with V compression
llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv turbo4 -fa 1

# More V compression, still good (+2% PPL)
llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv turbo3 -fa 1

# Maximum V compression (+5-9.5% PPL, Boundary V auto-enabled)
llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv turbo2 -fa 1

Low-bit base weights (Q4_K_M) on larger or less sensitive models

Symmetric turbo3 works on Mistral-24B Q4_K_M (PPL 4.99), Llama-70B Q4_K_M (PPL 3.629), and Command-R+ 104B Q4_K_M (PPL 6.415). Validate on your specific model:

# Try symmetric first on large Q4_K_M models
llama-server -m large-model-Q4_K_M.gguf -ctk turbo3 -ctv turbo3 -fa 1

# Fall back to asymmetric if quality is poor
llama-server -m large-model-Q4_K_M.gguf -ctk q8_0 -ctv turbo3 -fa 1

Unknown models

Start conservative and validate:

# Safe starting point for any model
llama-server -m model.gguf -ctk q8_0 -ctv turbo4 -fa 1

Block Size

The default storage block size is 128 elements (changed from 32). turbo3 achieves 5.12x compression (vs 4.57x at block_size=32) with zero quality cost.

Validated on:

  • Metal: 3 architectures (dense, dense Qwen, hybrid MoE), 3 context lengths (512, 8K, 32K), 2 Apple Silicon platforms (M5 Max, M2 Pro). Both symmetric and asymmetric paths.
  • CUDA: Validated via PR #32 fix (warp-to-block mapping for block_size=128). PPL identical on RTX 3090 sm_86.
  • NIAH: 3/3 pass at block_size=128 (phi-4, symmetric turbo3, 4K context).

On M2 Pro (Qwen2.5-1.5B, q8_0-K + turbo3-V), block_size=128 improved decode speed by 3-7%. This gain was not observed on M5 Max and may be specific to bandwidth-constrained hardware.

See block size study for the full data.

Notes and Caveats

  • Multi-backend: Metal, CUDA, and HIP all validated. Asymmetric q8_0-K + turbo4-V confirmed across all three GPU vendors on Qwen2.5-7B Q4_K_M.
  • Model sensitivity varies: Qwen2.5 is consistently sensitive to symmetric turbo on Q4_K_M. Llama, Mistral, and Qwen3.5 tolerate it. Test before deploying on new model families.
  • turbo2 as V cache: turbo2-V with Boundary V auto-enabled gives +5-9.5% PPL depending on model. Boundary V recovers 37-91% of the quality gap.
  • PPL is measured at 512 context with 4 chunks unless noted. Long-context PPL validated up to 128K on Command-R+ 104B.
  • 104B at 128K context: Confirmed on M5 Max 128GB with Command-R+ Q4_K_M. turbo3/turbo3 PPL 4.024. Requires raising macOS GPU memory cap: sudo sysctl iogpu.wired_limit_mb=117964 (90% of 128GB, safe for sustained inference). Without this, Metal hangs at ~49K context on 70B+ models. Setting above 90% risks kernel panics under sustained load (community reported). GGML_METAL_NO_RESIDENCY=1 is not needed (isolation testing confirmed). See M5 Max stress test.
  • turbo3 prefill faster than q8_0 at 32K: Confirmed on both 70B (+7.4%) and 104B (+3.5%). Smaller KV cache reduces memory bandwidth during attention.
  • Multi-GPU with vision models: On multi-GPU setups where mmproj (vision encoder) causes OOM on GPU0, set MTMD_BACKEND_DEVICE=CUDA1 to move the vision model to GPU1. This was the key to unlocking 2x128K dual-slot + vision on Qwen3.5-122B across dual L40S 48GB (reported by @sjoerdmaessen). Without this, mmproj competes with model weights for GPU0 VRAM.
  • Community: 50+ testers across M1/M2/M3/M5 Mac, RTX 3080 Ti/3090/4090/5090, L40S, DGX Spark Blackwell, AMD RX 9070 XT, Strix Halo, 7900 XTX.