TurboQuant+

May 28, 2026 ยท View on GitHub

Getting Started Guide | Configuration Recommendations | llama.cpp Fork | Swift MLX Fork | vllm-swift | Commercial Support

๐ŸŽ On Apple Silicon and want the fastest path? Use ekryski/mlx-swift-lm โ€” Eric Kryski's Swift MLX implementation that I've been actively collaborating on. Native Swift, ~2.5x faster decode than Python mlx-lm, full TurboQuant+ support including turbo4v2 (4-bit K + 2-bit V). 144 tok/s on Qwen3.5-35B-A3B MoE at 4K on M5 Max. For OpenAI-compatible serving, use vllm-swift โ€” a native Swift/Metal backend for vLLM built on mlx-swift-lm. No Python in the inference hot path, works with Hermes, OpenCode, and any OpenAI client. This llama.cpp repo is for cross-platform deployment (CUDA, ROCm, CPU, Metal).

Prebuilt Binaries

Download ready-to-run builds from Releases. No build tools needed.

PlatformDownload
Mac (Apple Silicon, Metal)turboquant-plus-*-macos-arm64-metal.tar.gz
Windows (CUDA 12.4)turboquant-plus-*-windows-x64-cuda12.4.zip

Unpack and run. Includes llama-server, llama-bench, llama-cli, and all tools.


Implementation of TurboQuant (ICLR 2026) with implementation work, experiments, and follow-on findings beyond the base paper. KV cache compression for local LLM inference.

Note

This repository is an experimental integration and research workspace for TurboQuant-related work targeting llama.cpp. The goal is to make it easier to compare approaches, collect reproducible benchmark and quality data, and share implementation details across hardware and backends. It is not intended as a separate long-term fork or a proposal to merge the branch as a whole.

If individual pieces prove useful and stable, the intent is to upstream them incrementally as small, reviewable patches in line with llama.cpp's normal contribution process.

What's In This Branch

  • Experimental TurboQuant-related integrations for llama.cpp
  • Benchmark and quality validation across models, contexts, and hardware
  • Backend-specific implementation work and performance experiments
  • Documentation and writeups intended to make testing and reproduction easier
  • Candidate ideas that may be worth upstreaming individually if they prove stable

Current Findings

Three follow-on findings in this branch have been independently validated by multiple researchers across different hardware and backends:

  1. V compression is free. Compressing the value cache (even down to 2 bits) has zero measurable effect on attention quality when key precision is maintained. Confirmed on Metal (M5 Max), CUDA RTX 4090 (@sztlink), and CUDA RTX 3090 (@HyperionMS2040). See asymmetric K/V paper.
  2. All quality degradation comes from K compression. This is why asymmetric configs (q8_0-K + turbo-V) rescue models where symmetric fails. Validated across Qwen, Llama, Mistral, and Command-R+ families. See M5 Max stress test.
  3. Boundary layers are disproportionately sensitive. Protecting the first 2 + last 2 layers at higher precision recovers 37-91% of the quality gap. See Boundary V paper.

Additional experiments and writeups: Sparse V dequant (+22.8% decode), block size optimization (5.12x compression), turbo4 resurrection (QJL hurts, PolarQuant works), EDEN optimal-S response (rotation is first-order, scale is second-order).

Compresses transformer KV cache 3.8-6.4x using PolarQuant + Walsh-Hadamard rotation. Near q8_0 prefill speed and ~0.9x decode throughput at long context (Apple Silicon). Full format family: turbo2 (2-bit, 6.4x), turbo3 (3-bit, 4.6-5.1x), turbo4 (4-bit, 3.8x). turbo3 compression depends on storage block size; see block size study.

Sparse V: Attention-gated KV cache decoding that skips low-weight V positions during inference. Up to +22.8% decode speed at 32K context, validated on wikitext-103 (50 chunks, CI +/-0.021) with no measurable PPL change. Not TurboQuant-specific; validated across q8_0, q4_0, and turbo3 KV formats. ~1% perplexity increase vs q8_0 from compression; Sparse V itself introduces no additional degradation (ON/OFF delta = 0.000).

Validated end-to-end from 1.5B to 104B on M5 Max via llama.cpp Metal. 104B at 128K context on a MacBook with turbo3 (PPL 4.024, 74 GB peak memory).

Status: v1 Complete, Speed Optimized, Community-Tested

  • 511+ Python tests, 100% code coverage on diagnostics
  • C port integrated into llama.cpp with Metal GPU kernels
  • --cache-type-k turbo3 --cache-type-v turbo3 works on Apple Silicon (turbo2/turbo3/turbo4 all supported)
  • turbo2 Metal support: 2-bit, 6.4x compression, +6.48% PPL โ€” for extreme memory pressure or asymmetric K/V
  • q8_0 prefill speed parity achieved (2747 vs 2694 tok/s)
  • Norm correction: PPL beats q8_0 on CUDA (-1.17%), +1.1% on Metal (ported from @spiritbuun)
  • 4-mag LUT: auto-detected on M1/M2/M3/M4, +38-45% decode at long context
  • Layer-adaptive mode 2: q8_0 quality at 3.5x compression (last 8 layers at q8_0)
  • Temporal decay: 30-34% memory savings at long context (experiment branch)
  • NIAH retrieval: 9/9 single needle with sparse V (vs 7/9 baseline), 100% multi-key through 32K. 30/30 on Llama-70B, 10/10 on Command-R+ 104B
  • 14 decode approaches tested on M2 Pro โ€” comprehensive hardware analysis
  • Stress tested up to 104B: Command-R+ 104B Q4_K_M at 128K context (PPL 4.024). Llama-70B Q4_K_M at 48K (PPL 4.019). turbo3 prefill faster than q8_0 at 32K on both models
  • Community: 30+ testers across M1/M2/M3/M5 Mac, RTX 3080 Ti/3090/4090/5090, AMD 6800 XT/9070 XT
  • Rotation Gaussianization validated on real Qwen3 KV tensors (kurtosis 900 โ†’ 2.9)

Quality and Speed (M5 Max 128GB)

Top-of-Tree Results

Cache TypeBits/valCompressionPPL (wikitext-2, 512c)vs q8_0
f1616.01.0x6.121-0.16%
q8_08.51.9x6.111baseline
turbo44.253.8x6.125+0.23%
q4_04.53.6x6.142+0.52%
turbo33.5โ€ 4.6xโ€ 6.176+1.06%
turbo22.56.4x6.507+6.48%

turbo4 (4-bit PolarQuant) has the best quality after q8_0 โ€” closer to q8_0 than q4_0, at better compression. turbo3 trades quality for maximum compression. turbo2 (2-bit) trades more quality for extreme compression โ€” best used asymmetrically.

โ€ turbo3 at default block_size=32. At block_size=128, turbo3 achieves 3.125 bits/val and 5.12x compression with identical PPL, validated on Metal across 3 model architectures, 3 context lengths (512โ€“32K), and 2 Apple Silicon platforms. Tested on both asymmetric (q8_0-K + turbo3-V) and symmetric (turbo3/turbo3) paths. On the tested M2 Pro setup (Qwen2.5-1.5B, q8_0-K + turbo3-V), block_size=128 also improved decode by 3โ€“7%; this gain was not observed on M5 Max. Earlier turbo3 figures (4.6x) reflect the block_size=32 default. CUDA not yet validated. See block size study.

Important: choosing the right config for your model. TurboQuant quality depends on your base weight quantization. Models with Q8_0+ weights work well with symmetric turbo (e.g., -ctk turbo3 -ctv turbo3). Some low-bit models with Q4_K_M weights may benefit from asymmetric K/V: use -ctk q8_0 -ctv turbo4 to keep K precision high while compressing V (tested on Qwen2.5-7B Q4_K_M). K precision is the dominant quality factor because it controls attention routing via softmax. Note: not all Q4_K_M models are sensitive โ€” Mistral-24B, Llama-70B, and Command-R+ 104B all handle symmetric turbo fine. Bigger models absorb quantization stacking better (104B: +3.6% vs 70B: +11.4% for turbo3). Validate on your specific model. See Configuration Recommendations for the full tested matrix and practical guidance.

Validated on Metal (Apple Silicon). CUDA mixed q8_0 ร— turbo parity is not yet verified.

Asymmetric K/V (NEW)

TurboQuant supports independent K and V cache types. In current testing, keeping K at q8_0 while compressing V with turbo rescues quality on low-bit models where symmetric turbo degrades:

Model (weights)KVPPLvs q8_0
Qwen2.5-7B (Q4_K_M)q8_0turbo46.64+1.0%
Qwen2.5-7B (Q4_K_M)q8_0turbo36.71+2.0%
Qwen2.5-7B (Q4_K_M)turbo3turbo33556catastrophic
# Validated starting point for low-bit models
# (tested on Qwen2.5-7B Q4_K_M; not all Q4_K_M models need this)
llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv turbo4 -fa 1

Boundary V (Layer-Aware V Compression)

Not all V layers need the same precision. Boundary V protects the first 2 + last 2 layers with q8_0-V while compressing all remaining layers with turbo2-V. 15 lines of code, no speed penalty.

Modelturbo2 PPLBoundary V PPLturbo3 PPLQuality recovered
phi-4-Q8_0 (40L)4.8354.7844.74255%
Qwen2.5-7B Q4_K_M (28L)6.9116.8356.70737%
Qwen3.5-35B MoE (64L)5.2575.1485.13791%
Qwen3.5-27B Dense (36L)6.5346.4236.27342%

Validated at 512 and 8K context. NIAH retrieval passed. Benefit scales with model depth (91% on 64-layer MoE). Independently validated by @Corianas_ on NanoGPT.

Enabled by default. Activate manually on older builds with TURBO_LAYER_ADAPTIVE=7 env var.

# Boundary V โ€” boundary layers q8_0-V, rest turbo2-V
llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

See full paper.

Prefill Context Scaling (Verified 2K-32K)

Contextturbo4 tok/sturbo3 tok/sq8_0 tok/sturbo4/q8_0turbo3/q8_0
2K2682270826651.01x1.02x
4K2370228922551.05x1.01x
8K2041205420021.02x1.03x
16K1621169816051.01x1.06x
32K1141120410981.04x1.10x

Prefill: both turbo3 and turbo4 match or exceed q8_0 speed. Compressed cache uses less bandwidth.

Decode Speed โ€” MoE (M5 Max 128GB, Qwen3.5-35B-A3B, Sparse V)

ConfigShort (tg128)pp32768+tg128Short vs q8_0
q8_085.71 tok/s1173.91 tok/sโ€”
turbo479.87 tok/s1060.12 tok/s0.93x
turbo376.84 tok/s1141.74 tok/s0.90x

turbo4 decode is faster than turbo3 due to simpler nibble packing and direct-extract dequant.

Real-world server benchmark (70-page PDF, ~24K context):

ConfigPrefill tok/sDecode tok/sDecode vs q8_0
q8_01449.968.2โ€”
turbo41405.963.70.93x
turbo31417.853.30.78x

NIAH Retrieval (turbo4)

Testq8_0turbo4turbo3 + sparse V
Single needle (33 positions)30/33 (90.9%)31/33 (93.9%)9/9 (3-pos)

turbo4 beats q8_0 on retrieval (31/33 vs 30/33). Shared failure at 8K/100% is a model weakness, not quantization. See turbo4 resurrection for the full investigation.

Large Model Stress Tests (M5 Max 128GB)

ModelParamsWeightsConfigPPLvs q8_0Max ContextNIAH
Llama-3.1-70B70BQ4_K_Mturbo4/turbo43.461+6.3%48K30/30
Llama-3.1-70B70BQ4_K_Mturbo3/turbo33.629+11.4%48K30/30
Command-R+ 104B104BQ4_K_Mturbo4/turbo46.312+1.9%128K10/10
Command-R+ 104B104BQ4_K_Mturbo3/turbo36.415+3.6%128K10/10

turbo3 prefill is faster than q8_0 at 32K on both models (70B: 80.8 vs 75.2 t/s, 104B: 64.5 vs 62.3 t/s). Smaller KV cache = less memory bandwidth during attention.

104B at 128K requires raising macOS GPU memory cap: sudo sysctl iogpu.wired_limit_mb=117964 (90% of 128GB). Without this, Metal stalls at ~49K context on 70B+ models. See Getting Started Guide for per-RAM values.

See M5 Max stress test for the full data.

KL Divergence vs f16

CacheMean KLDฮ”p RMSSame top-p %
q8_00.0015491.23%98.43%
turbo40.0096332.71%95.98%
q4_00.0080912.75%95.83%
turbo30.0161454.09%94.31%

turbo4 KLD is 40% lower than turbo3. Same top-p agreement matches q4_0.

Decode Speed โ€” Dense (M5 Max 128GB, Qwen3.5-27B, Sparse V)

TestWith sparse VWithoutDelta
Short (tg128)16.7316.61+0.7%
8K (pp8192+tg128)298.27294.52+1.3%
16K (pp16384+tg128)316.98311.24+1.8%

Dense models see smaller gains (attention is <5% of decode โ€” FFN dominates). No regressions. Safe to enable by default.

Sparse V dequant skips V dequantization for positions where softmax attention weight < 1e-6. At long context, most attention weights are negligible โ€” this saves approximately half the total dequant cost. +22.8% decode at 32K vs turbo3 without sparse V, pushing the ratio from 0.76x to 0.93x of q8_0. Sparse V introduces no additional PPL degradation beyond the underlying compression (validated at 32K with 50 chunks on wikitext-103, CI ยฑ0.021). Benefit scales with context length. This is implemented as a minimal kernel modification.

Sparse V is not TurboQuant-specific: on q8_0 KV cache it yields a +5% decode speedup with identical PPL and NIAH, confirming this is a general attention-aware optimization rather than a compression-specific trick. See the full paper.

On M2/M1 (pre-M5), the auto-detected 4-mag LUT gives an additional +38-45% decode improvement at long context, and is additive with sparse V. See Decode Speed Hardware Analysis for the full 14-approach experiment log, and Context Scaling Deep Dive for the M5 Max optimization journey.

Community Hardware: CUDA (RTX 3090)

Tested by @jaker86 on RTX 3090. Model: Qwen3.5-9B Q4_K_M. Build from signalnine's CUDA fork PR #24.

ConfigKVPPL (wikitext-2)vs q8_0Decode t/sPrefill t/s
q8_0q8_0q8_08.2018โ€”102.693774
turbo3turbo3turbo38.3124+1.3%98.683707
turbo4turbo4turbo48.3012+1.2%95.873628
turbo2turbo2turbo28.6639+5.6%98.053680
mixedturbo3turbo28.5312+4.0%97.323524
mixedturbo2turbo38.4356+2.9%96.613608

CUDA decode within 4-7% of q8_0 across all configs. Prefill within 4-7%. Mixed K/V configs working correctly after PR #24 fix (prefill was 329 t/s before fix, now 3500+).

Community Hardware: M1 Max 64GB

Tested by @mariotomich. Model: Qwen3.5-35B-A3B Q8_0, Sparse V ON. Real prompt: 38,596 tokens (70-pages.md), llama-cli with Qwen chat template.

KVPrefill t/sDecode t/svs q8_0
q8_0399.012.4โ€”
turbo2406.210.8-12.9%
turbo3370.47.7-37.9%
turbo4365.016.6+33.9%

turbo4 decode beats q8_0 by +33.9% at long context on M1 Max. At 38K tokens, KV bandwidth savings outweigh dequant cost. Sparse V amplifies the gain. turbo3 decode regression (-37.9%) is the known M1 L2 cache wall โ€” turbo3 dequant complexity causes cache eviction on pre-M5 hardware.

Asymmetric q8_0-K + turbo4-V (recommended for pre-M5):

Synthetic (llama-bench):

KVpp512 t/stg128 t/spp65536+tg128 t/s
q8_0876.139.55275.0
q8_0-K + turbo4-V894.9 (+2.2%)38.64 (-2.3%)271.0 (-1.5%)

Asymmetric avoids the turbo3 decode regression (-37.9%) on pre-M5 hardware.

KV cache memory at 262K context:

KVCache MiBSavedCompression
q8_02782โ€”baseline
turbo414221360 MiB1.96x
q8_0-K + turbo4-V2102680 MiB1.32x

PPL on real document (70-pages.md, ctx=512, 20 chunks): q8_0 16.29, turbo4 16.44 (+0.93%), turbo3 16.42 (+0.76%), turbo2 17.22 (+5.69%).

Community Hardware: AMD RX 9070 XT (RDNA 4, gfx1201, Windows 11)

First AMD GPU validation. First attempt โ€” no debugging, no analysis, just raw testing out of the box. Qwen2.5-7B Q4_K_M on HIP SDK 7.1. gfx1201 detected natively โ€” no HSA_OVERRIDE_GFX_VERSION needed.

KVPPL (wikitext-2)vs q8_0Prefill t/sDecode t/sStatus
q8_0q8_07.794baseline589.584.7OK
q8_0turbo47.876+1.0%588.486.8recommended
q8_0turbo3NaNcatastrophic605.187.8broken (HIP-specific)
turbo4turbo4401.4catastrophic556.484.0broken (Q4_K_M)
turbo3turbo381,277catastrophic580.386.0broken (Q4_K_M)

Key findings:

  • q8_0-K + turbo4-V confirmed on AMD โ€” +1.0% PPL, no speed penalty, 25% KV memory savings
  • Symmetric turbo catastrophic on Q4_K_M, consistent with Metal/CUDA results
  • q8_0/turbo3 produces NaN on this model (Metal gets +2.0%) โ€” HIP-specific, under investigation
  • Speed flat across configs (~85 t/s decode, ~590 t/s prefill at pp512)
  • Context scaling: 0.96-0.99x vs q8_0 at pp2048-8192

See Windows RDNA 4 Setup Guide for build instructions and 9 gotchas.

Speed Optimization Journey

OptimizationPrefill tok/svs q8_0
turbo3 fp32 WHT (initial)7390.27x
+ fp16 WHT10740.40x
+ half4 vectorized butterfly14110.52x
+ graph-side WHT rotation20950.78x
+ block-32 storage27471.02x
+ optimized dequant25240.98x

The final number (2524 at 4K) is lower than the peak (2747 at 512) because longer context is naturally slower. The key metric is the ratio vs q8_0, which stays flat at 0.99x. See Speed Experiments for the full journey.

Compression Quality (Python Prototype)

ConfigCompressionCosine SimMSE
TurboQuant 2-bit7.1ร—0.790.0047
TurboQuant 2.5-bit (outlier)4.9ร—0.860.0029
TurboQuant 3-bit4.9ร—0.910.0018
TurboQuant 3.5-bit (outlier)3.8ร—0.950.0009
TurboQuant 4-bit3.8ร—0.960.0007

Needle-In-A-Haystack (NIAH) Retrieval

Tested using Kamradt and NVIDIA RULER methodology. Qwen3.5-35B-A3B on M5 Max 128GB.

Single Needle Retrieval (with sparse V dequant):

Testq8_0turbo3turbo3 + sparse V
Single needle (9 positions)7/97/99/9 (100%)

turbo3 + sparse V achieves 9/9 in this setup (vs 7/9 baseline), suggesting a potential denoising effect from removing low-weight quantization noise. Needle positions have meaningful attention weights (well above the 1e-6 threshold) and are never skipped.

Sparse V shows no measurable impact on perplexity across all tested contexts and datasets. Observed improvements in retrieval tasks (e.g., NIAH) are treated as secondary signals and may reflect reduced quantization noise rather than fundamental model quality changes.

Single Needle โ€” Depth (0-100%) x Context Length (pre-sparse-V):

Depth4K8K16K32K
q8_05/54/54/54/5
turbo35/54/55/53/5

Pre-sparse-V aggregate: q8_0 85% (17/20), turbo3 80% (16/20). No systematic degradation from compression. N=10 needles remarkably stable (9-10/10 at every depth).

Multi-Key with 3 Distractors (RULER MK-NIAH):

Cache Type4K8K16K32K
q8_01/11/11/11/1
turbo31/11/11/11/1

100% retrieval accuracy with distractors through 32K. turbo3 correctly ignores distractor needles at all context depths.

Long-Context Perplexity (Primary Quality Metric)

50-chunk wikitext-103 at 32K context (strongest validation, CI ยฑ0.021):

ConfigPPLvs q8_0Sparse V ฮ”
q8_0 (8-bit KV)7.0638โ€”โ€”
q4_0 (4-bit KV)7.0857+0.31%โ€”
turbo3 WITHOUT sparse V (3.5-bit)7.1796+1.64%โ€”
turbo3 WITH sparse V (3.5-bit)7.1796+1.64%0.0000

Note: q4_0 is included as a reference baseline. No optimization was applied to q4_0 in this work. Development focused on q8_0 and turbo3 paths.

Key Validation

Real Qwen3-1.7B KV tensor rotation Gaussianization:

Raw kurtosis:       900.4  โ†’ After rotation: 2.9  (Gaussian = 3.0)
Std after rotation:  0.088388
Expected (1/โˆšd):     0.088388
Ratio:               1.000 exactly

Getting Started

Prerequisites

  • Python >= 3.10
  • NumPy >= 1.24, SciPy >= 1.10
  • cmake + C/C++ compiler (for llama.cpp build)
  • Xcode Command Line Tools (macOS Metal build)
  • Optional: torch, transformers, accelerate (~4GB download, for real model validation)

Install the Python Prototype

git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Verify โ€” should print "141 passed"
python3 -m pytest tests/ -v

Run the Demo

# Quick compression demo (no model needed)
python3 benchmarks/demo.py

# Validate on real model KV tensors (downloads Qwen3-1.7B, ~4GB)
pip install transformers torch accelerate
python3 benchmarks/validate_real_model.py

Build llama.cpp with TurboQuant

The llama.cpp port adds two new KV cache types: turbo3 (3.25 bits, 4.9ร— compression) and turbo4 (4.25 bits, 3.8ร— compression).

# Clone the llama.cpp fork with TurboQuant support
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache

# Build with Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Build with CUDA (NVIDIA) โ€” community tested on RTX 3080 Ti/3090/4090/5090
# cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
# cmake --build build -j

# Build with HIP (AMD) โ€” tested on RX 9070 XT (RDNA 4, gfx1201)
# See docs/windows-rdna4-setup.md for Windows gotchas
# cmake -S . -B build -G Ninja -DGPU_TARGETS=gfx1201 -DGGML_HIP=ON \
#   -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_C_COMPILER=clang \
#   -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
# cmake --build build --config Release

# Verify turbo types are available
./build/bin/llama-server --help | grep turbo
# Expected output includes: turbo3, turbo4

The fork modifies these files from upstream llama.cpp:

  • ggml/include/ggml.h โ€” new type enum entries
  • ggml/src/ggml-common.h โ€” block structures
  • ggml/src/ggml-quants.h โ€” function declarations
  • ggml/src/ggml-turbo-quant.c โ€” C quantize/dequantize (new file)
  • ggml/src/ggml.c โ€” type traits registration
  • ggml/src/CMakeLists.txt โ€” build config
  • ggml/src/ggml-metal/ggml-metal.metal โ€” Metal GPU kernels
  • ggml/src/ggml-metal/ggml-metal-device.m โ€” Metal device validation
  • common/arg.cpp โ€” CLI arg parsing

Run Inference with TurboQuant KV Cache

# Server mode (for Hermes Agent, Claude Code, OpenCode, etc.)
./build/bin/llama-server \
  -m models/your-model.gguf \
  --alias "model-turbo" \
  --jinja -ngl 99 -c 262144 -fa on \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -np 1 --metrics --host 0.0.0.0 --port 8080

# CLI mode (quick test)
./build/bin/llama-cli \
  -m models/your-model.gguf \
  -ngl 99 -c 2048 -fa on \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -n 100 -p "Hello world" --jinja

Cache Type Reference

FlagBits/valCompression vs fp16Description
turbo33.5โ€ 4.6xโ€ 3-bit PolarQuant + WHT rotation. Best compression, q8_0 speed.
turbo44.253.8x4-bit PolarQuant (16 centroids). Best quality.
q8_082.0xllama.cpp default quantized cache.
q4_044.0xllama.cpp 4-bit cache.

Architecture

Input: KV cache vector x โˆˆ R^d (one attention head)
    โ”‚
    โ”œโ”€โ”€ Extract norm: ฮณ = ||x||, xฬ‚ = x/ฮณ
    โ”‚
    โ”œโ”€โ”€ Random rotation: WHT + random sign flips
    โ”‚   coordinates ~ N(0, 1/d) after rotation
    โ”‚
    โ”œโ”€โ”€ Optimal scalar quantization (Lloyd-Max)
    โ”‚   turbo4: 16 centroids (4-bit), turbo3: 8 centroids (3-bit), turbo2: 4 centroids (2-bit)
    โ”‚
    โ””โ”€โ”€ Output: quantized indices + norm per block
        Compression: 3.8x (turbo4), 5.1x (turbo3), 7.5x (turbo2)

Note on QJL: reference only, not used in production.

The original TurboQuant paper (Zandieh et al. 2024, arXiv 2406.03482) includes a 1-bit QJL error-correction stage. The Python qjl.py here implements it for paper reproducibility.

Production drops QJL on both K and V. TheTom/llama-cpp-turboquant recommended config is --cache-type-k q8_0 --cache-type-v turbo3. No QJL on either side.

QJL eliminates reconstruction bias but amplifies variance, which softmax turns into attention noise. Five independent groups confirmed (buun, scos-lab, Arclabs001, +2). See turbo4-resurrection.md for the full ablation and mechanism.

If you're building on this repo: use TurboQuantMSE (V cache), or implement straight 4 to 8 bit PolarQuant on K. Only enable the QJL / TurboQuant (with QJL) classes if you are reproducing the original paper or doing K-side research below 8-bit. Even then, validate at your target context length. Historical evidence is that QJL noise accumulates past ~16K context.

Project Structure

turboquant/
โ”œโ”€โ”€ rotation.py        # Walsh-Hadamard Transform + random sign flips
โ”œโ”€โ”€ codebook.py        # Lloyd-Max optimal centroid computation
โ”œโ”€โ”€ polar_quant.py     # PolarQuant โ€” norm extraction + WHT rotation + scalar quantization
โ”œโ”€โ”€ qjl.py            # QJL 1-bit quantizer (paper-faithful reference, see README ยงQJL). Not used by TheTom/llama-cpp-turboquant in production.
โ”œโ”€โ”€ turboquant.py      # Full TurboQuant pipeline
โ”œโ”€โ”€ kv_cache.py        # KV cache integration layer
โ”œโ”€โ”€ outlier.py         # Outlier channel strategy (2.5-bit, 3.5-bit)
โ”œโ”€โ”€ lloyd_max.py       # Lloyd-Max quantizer implementation
โ”œโ”€โ”€ utils.py           # Bit packing, memory measurement
โ”œโ”€โ”€ isoquant.py        # IsoQuant (quaternion SO(4)) experimental comparison
โ””โ”€โ”€ rotorquant.py      # RotorQuant experimental comparison

tests/                 # 14 test files, 500+ tests
benchmarks/
โ”œโ”€โ”€ demo.py                       # Quick compression demo
โ”œโ”€โ”€ run_benchmark.py              # Server-based benchmark runner
โ”œโ”€โ”€ benchmark_results.md          # Full benchmark report
โ”œโ”€โ”€ benchmark_llama.sh            # llama.cpp benchmark script
โ”œโ”€โ”€ benchmark_norm_correction.py  # Norm correction validation
โ”œโ”€โ”€ benchmark_ppl_tq_vs_rq.py    # TurboQuant vs RotorQuant PPL comparison
โ”œโ”€โ”€ temporal_decay_prototype.py   # Temporal decay experiment
โ”œโ”€โ”€ test_with_llama.py            # Integration test at Qwen 3.5 dimensions
โ”œโ”€โ”€ test_outlier_comparison.py    # Outlier strategy comparison
โ””โ”€โ”€ validate_real_model.py        # Real model KV tensor validation

docs/
โ”œโ”€โ”€ turboquant-recommendations.md # Configuration guide (tested matrix)
โ”œโ”€โ”€ windows-rdna4-setup.md        # Windows + AMD RDNA 4 build guide
โ”œโ”€โ”€ papers/
โ”‚   โ”œโ”€โ”€ turbo4-resurrection.md    # turbo4 bug hunt (PPL 679 โ†’ 6.125)
โ”‚   โ”œโ”€โ”€ sparse-v-dequant.md       # Sparse V attention-gated optimization
โ”‚   โ”œโ”€โ”€ layer-aware-v-compression.md  # Boundary V (layer-adaptive V precision)
โ”‚   โ””โ”€โ”€ block-size-experiment.md  # Block size 32โ†’128 (12% compression win)
โ””โ”€โ”€ (25+ engineering docs, investigations, experiment logs)

Roadmap

PhaseStatusDetails
Core algorithms (NumPy)โœ…500+ tests across 14 test files
Distortion validationโœ…Matches paper bounds (Table 2)
Real model validationโœ…Rotation validated on Qwen3 KV tensors (kurtosis 900โ†’2.9)
llama.cpp C portโœ…Metal GPU inference working on M1 through M5
Metal shader optimizationโœ…q8_0 speed parity: prefill matches or beats q8_0
CUDA backendโœ…Community-tested on RTX 3080 Ti/3090/4090/5090, DGX Spark Blackwell
HIP/AMD backendโœ…RX 9070 XT (RDNA 4) validated, gfx1201 native
Asymmetric K/Vโœ…q8_0-K + turbo-V rescues Q4_K_M models
Boundary Vโœ…Layer-aware V compression, 37-91% quality recovery
Sparse Vโœ…Attention-gated dequant skip, +22.8% decode on MoE. Upstream PR #21119
Block size optimizationโœ…32โ†’128, 12% better compression, zero quality cost
Upstream coordination๐Ÿ”„llama.cpp PR preparation (#27)
TurboQuant+ extensionsโณAdaptive bits, temporal decay, MoE-aware compression
MLX Swift port๐Ÿ”„Active collaboration with @ekryski on mlx-swift-lm โ€” turbo4v2 working, Gemma4 fixes in progress

Paper Reference

Engineering Docs

Detailed debugging logs, gotchas, and benchmarks from the llama.cpp port:

MLX Framework Port (Experimental)

TurboQuant KV cache compression is being ported to Apple's MLX framework for native Python/Swift inference on Apple Silicon.

Fork: TheTom/mlx feature/turboquant-plus

Results (M5 Max)

Qwen2.5-3B 4bit โ€” delegated KVCache (5-run avg, 500 decode tokens, 7ad7500):

ConfigDecode tok/svs BaselinePPLPPL Delta
Baseline (f16 KV)172.6100%1.8764โ€”
Sym turbo4171.299.2%1.9083+1.70%
Asym (K=FP16, V=turbo4)171.099.0%1.8859+0.51%

Quality: Output text indistinguishable from baseline. KL divergence < 0.001, cosine similarity > 0.989.

35B MoE (Qwen3.5-35B-A3B 8bit):

ConfigPrefillDecodevs Baseline
Baseline11.495.7100%
turbo4 fused + boundary132.794.296%

Qwen3.5-27B Dense 8bit (16/64 KV layers):

ConfigPPLPPL DeltaDecodevs Baseline
Baseline1.4800โ€”17.9100%
turbo4 asymmetric1.5082+1.91%15.587%
turbo4 symmetric1.5219+2.83%15.486%

Quality Validation (Qwen2.5-7B 8bit, dense, all 28 layers KV):

TestSymmetric turbo4Asymmetric (K=FP16)
KLD6.86 (broken)0.003
Top-1 match10.5% (broken)98.1%
NIAH0/15 FAIL15/15 PASS

Warning

Symmetric turbo is catastrophic on dense models. All K layers compressed โ†’ softmax error compounds across 28 layers. Asymmetric (K=FP16, V=turbo4) is mandatory for dense architectures. Hybrid models (Qwen3.5) with delta net layers are accidentally safe because only a fraction of layers use KV cache.

Dense models (short context, deferred compression):

ModelBaseline Decodeturbo4 asym DecodePPL Delta
Qwen2.5-7B 8bit64.264.10.00%
phi-4 8bit32.932.70.00%

M2 Pro โ€” Qwen2.5-1.5B 8bit (dense, 28/28 KV layers, asymmetric):

TestResult
KLD0.004
Top-1 match96.8%
NIAH30/30 PASS
ContextBaseline DecodeTurbo Asymmetricvs Baseline
12834.835.2101%
409646.921.646%

M2 Pro shows more decode regression at long context โ€” lower memory bandwidth amplifies turbo overhead.

M5 Max Context Scaling (Qwen2.5-7B 8bit, delegated KVCache, 7ad7500):

ContextBaselineSym turbo4vs BaselineAsym (K=FP16)vs Baseline
51263.663.6100%64.0101%
1K63.162.8100%62.699%
2K62.761.898%62.299%
4K61.060.299%61.0100%
8K58.256.998%57.799%
16K54.653.097%53.899%

Previous numbers (61-83%) were measured before the delegated KVCache optimization (7ad7500). Root cause was mx.concatenate allocating new arrays every decode step ร— n_layers. Fixed by delegating FP16 storage to an internal KVCache with pre-allocated buffers.

MLX Python vs llama.cpp (Qwen2.5-7B, M5 Max):

FrameworkPrefill (400 tok)DecodeMemory
llama.cpp (Q8_0)38720.97.5 GB
MLX (8bit)24321.28.5 GB

MLX decode matches llama.cpp. Prefill 37% slower (lazy graph vs pre-compiled).

MLX Python vs llama.cpp (M2 Pro, Qwen2.5-7B):

FrameworkPrefill (400 tok)Decode
llama.cpp38720.9
MLX24321.3

Note: Future benchmark logs should record Apple Silicon power mode (Low / Auto / High) when known, as it can materially affect throughput.

Quick Start (MLX Python)

import mlx_lm
from mlx.nn.layers.turbo_kv_cache import TurboKVCache

model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-8bit")
n_layers = len(model.model.layers)
cache = [TurboKVCache(bits=4, key_bits=4) for _ in range(n_layers)]
text = mlx_lm.generate(model, tokenizer, prompt="Hello!",
                        max_tokens=200, prompt_cache=cache, verbose=True)
pip install git+https://github.com/TheTom/mlx.git@feature/turboquant-plus
pip install mlx-lm

How it works

TurboKVCache is a drop-in replacement for mlx-lm's KVCache that adds TurboQuant 4-bit K+V compression. Compatible with mlx-lm and mlx-vlm โ€” no framework changes needed.

Delegated KVCache architecture (7ad7500): During prefill, stores raw FP16. On first decode step, compresses to packed TurboQuant storage and seeds an internal KVCache with decoded FP16. Subsequent decode tokens go through the native KVCache (pre-allocated buffers, zero-alloc slice-assign). Packed storage updated in background via periodic batch recompression on CPU stream.

  • 97โ€“100% baseline decode speed across 512โ€“16K context (Qwen2.5-7B, M5 Max)
  • +0.51% PPL (asymmetric), +1.70% PPL (symmetric)
  • 99% answer agreement with baseline (520 multimodal samples)
  • Works with stock mlx-lm and mlx-vlm, no fork needed
  • All TurboQuant+ papers applied (beta centroids, dual SRHT signs, boundary layers)

Quick Start โ€” mlx-vlm (multimodal)

from mlx_vlm import load
from mlx_vlm.models.cache import make_prompt_cache
from mlx_lm.models.cache import KVCache
from mlx.nn.layers.turbo_kv_cache import TurboKVCacheLite, compact_turbo_cache

model, processor = load("mlx-community/gemma-4-26b-a4b-it-bf16")

# Wrap KV layers with TurboKVCacheLite
cache = make_prompt_cache(model.language_model)
kv_indices = [i for i, c in enumerate(cache) if isinstance(c, KVCache)]
for idx in kv_indices:
    cache[idx] = TurboKVCacheLite(cache[idx], bits=4, key_bits=4)

# Generate as normal โ€” prefill stores FP16
from mlx_vlm import generate
generate(model, processor, prompt="...", max_tokens=1, prompt_cache=cache)

# Compact: compress K+V to 4-bit TurboQuant
compact_turbo_cache(cache)

# Continue generating โ€” native SDPA at full speed
generate(model, processor, prompt="Continue.", max_tokens=200, prompt_cache=cache)
pip install git+https://github.com/TheTom/mlx.git@feature/turboquant-plus
pip install mlx-vlm

Quick Start โ€” mlx-lm (text)

import mlx_lm
from mlx.nn.layers.turbo_kv_cache import make_turbo_cache, compact_turbo_cache

model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-8bit")
cache = make_turbo_cache(model, bits=4)
mlx_lm.generate(model, tokenizer, prompt="Hello!", max_tokens=1, prompt_cache=cache)
compact_turbo_cache(cache)
mlx_lm.generate(model, tokenizer, prompt="Continue.", max_tokens=200,
                 prompt_cache=cache, verbose=True)

MM-NIAH Multimodal Benchmark (520 samples)

gemma-4-26b-a4b-it ยท BF16 ยท 4-bit TQ+ Compact ยท MM-NIAH (val) ยท M5 Max 128GB

BucketBL AccTQ+ AccAgreeBL DecodeTQ+ DecodeSpeedupBL KVTQ+ KVKV saved
~1K85%84%99%55.154.70.99x0.21G0.19G10%
~3K81%79%99%55.254.10.98x0.27G0.21G22%
~7K80%81%99%54.151.70.96x0.37G0.24G35%
~15K76%76%100%52.047.80.92x0.53G0.28G47%
~30K77%75%98%46.940.20.86x0.87G0.36G59%
~60K75%76%99%42.633.70.79x1.30G0.47G64%
Total79%78%99%51.147.20.92x0.58G0.29G50%

99% answer agreement with baseline across all context lengths โ€” zero systematic quality degradation. KV savings of 10โ€“64% where TQ+ is active. Decode speedup scales from 0.99x at ~1K to 0.79x at ~60K (dequant-once overhead on longer prefills).


Contributing

Issues and PRs welcome. The main areas where help is needed:

  1. Upstream PR โ€” prepare llama.cpp contribution (CONTRIBUTING.md requirements)
  2. CUDA kernel optimization โ€” fused FA kernels, decode speed parity
  3. MLX memory recovery โ€” implement FP16 KV drop + compressed-only attention for memory-constrained long context
  4. Quality metrics โ€” multi-run statistics, additional task benchmarks (GSM8K, code gen, reasoning)
  5. Long context validation โ€” 64K+ testing across architectures

Support

If you find this work useful, you can support it via GitHub Sponsors or BTC:

BTC: bc1qsfaaf6mkz2yxx2vavg2n0zgsf3qj25uh94t83rwuq7de67dey05sc3tgjx

Commercial support: For inference optimization and KV cache tuning engagements, DM @no_stp_on_snek on X.

License

Apache License 2.0 โ€” see LICENSE.

Copyright 2026 Tom Turney.

Based on Google Research's TurboQuant paper (arXiv 2504.19874).