TurboQuant+

June 12, 2026 ยท View on GitHub

๐Ÿš€ TurboQuant KV cache compression is now in vLLM (PR #38479, merged April 2026): --kv-cache-dtype turboquant_k8v4 and friends, with fused Triton store/decode kernels. The PR discussion drew on the asymmetric K/V findings from this repo. Upstream llama.cpp has merged the core idea too: Hadamard KV cache rotation (#21038, citing TurboQuant directly) with fast WHT kernels on CPU (#22631), CUDA (#23615), and Vulkan (#23687). Rotation + the stock q4_0 cache is essentially turbo4's rotation stage; the PolarQuant codebook, norm extraction, and asymmetric policies remain here and in the fork.

Getting Started Guide | Configuration Recommendations | Benchmarks | Commercial Support

Implementation of TurboQuant (ICLR 2026) with implementation work, experiments, and follow-on findings beyond the base paper. Compresses transformer KV cache 3.8-6.4x using PolarQuant + Walsh-Hadamard rotation, at near q8_0 prefill speed and ~0.9x decode throughput at long context. Validated end-to-end from 1.5B to 104B at 128K context on a MacBook (turbo3, PPL 4.024, 74 GB peak memory).

This repository is the research home: the Python reference implementation, the validation papers, and the benchmark data. To run TurboQuant in an inference engine, pick from the table below. Pieces that prove useful and stable get upstreamed incrementally as small, reviewable patches.

Run It Today

EnginePlatformStatusNotes
vLLMCUDA / ROCm, datacenterUpstream, merged--kv-cache-dtype turboquant_k8v4 and friends (PR #38479)
llama.cppAll backendsUpstream, rotation mergedHadamard KV cache rotation (#21038) + fast WHT kernels (CPU #22631, CUDA #23615, Vulkan #23687). Rotation + q4_0 cache approximates turbo4; full PolarQuant codec is in the fork below
llama-cpp-turboquantMetal, CUDA, HIP, CPUProduction forkturbo2/3/4 KV cache + TQ3_1S/TQ4_1S weight formats; prebuilt binaries for Mac (Metal) and Windows (CUDA)
mlx-swift-lmApple Silicon, SwiftActive collaborationFastest Apple path: ~2.5x faster decode than Python mlx-lm, full TQ+ support including turbo4v2. 144 tok/s on Qwen3.5-35B-A3B MoE at 4K on M5 Max
vllm-swiftApple Silicon, SwiftActiveOpenAI-compatible serving built on mlx-swift-lm; no Python in the inference hot path
AtlasRust, MetalIntegratedturbo4 KV cache append/decode kernels
mlxcelRust, MLXCommunity portTurbo KV cache docs. Thanks to @inureyes and the Lablup team for the careful port and upstream attribution
MLX Python forkApple Silicon, PythonExperimentalTurboKVCache drop-in for mlx-lm and mlx-vlm; see docs/mlx-port.md
turboquant-pytorchPyTorchIndependent implementationFrom-scratch PyTorch TurboQuant for KV cache; applies this repo's layer-adaptive compression and Sparse V findings
turboquant-vllmCUDA, vLLM packageCommunity packageTQ+ KV cache for vLLM with fused CUDA kernels; builds on this repo's block_size=128 and K-dominance research
QuanslothLocal server productCommunity productAir-gapped local AI server built on the TurboQuant+ stack

Key Findings

Three follow-on findings, independently validated by multiple researchers across different hardware and backends:

  1. V compression is free. Compressing the value cache (even down to 2 bits) has zero measurable effect on attention quality when key precision is maintained. Confirmed on Metal (M5 Max), CUDA RTX 4090 (@sztlink), and CUDA RTX 3090 (@HyperionMS2040). See asymmetric K/V paper.
  2. All quality degradation comes from K compression. This is why asymmetric configs (q8_0-K + turbo-V) rescue models where symmetric fails. Validated across Qwen, Llama, Mistral, and Command-R+ families. See M5 Max stress test.
  3. Boundary layers are disproportionately sensitive. Protecting the first 2 + last 2 layers at higher precision recovers 37-91% of the quality gap. See Boundary V paper.

Additional experiments and writeups: Sparse V dequant (+22.8% decode at 32K, not TurboQuant-specific), block size optimization (5.12x compression), turbo4 resurrection (QJL hurts, PolarQuant works), EDEN optimal-S response (rotation is first-order, scale is second-order).

Quality at a Glance (M5 Max 128GB)

Cache TypeBits/valCompressionPPL (wikitext-2, 512c)vs q8_0
f1616.01.0x6.121-0.16%
q8_08.51.9x6.111baseline
turbo44.253.8x6.125+0.23%
q4_04.53.6x6.142+0.52%
turbo33.5โ€ 4.6xโ€ 6.176+1.06%
turbo22.56.4x6.507+6.48%

turbo4 (4-bit PolarQuant) has the best quality after q8_0 โ€” closer to q8_0 than q4_0, at better compression. turbo3 trades quality for maximum compression. turbo2 (2-bit) trades more quality for extreme compression โ€” best used asymmetrically.

โ€ turbo3 at default block_size=32. At block_size=128, turbo3 achieves 3.125 bits/val and 5.12x compression with identical PPL. See block size study.

Important: choosing the right config for your model. TurboQuant quality depends on your base weight quantization. Models with Q8_0+ weights work well with symmetric turbo (e.g., -ctk turbo3 -ctv turbo3). Some low-bit models with Q4_K_M weights may benefit from asymmetric K/V: use -ctk q8_0 -ctv turbo4 to keep K precision high while compressing V. K precision is the dominant quality factor because it controls attention routing via softmax. Bigger models absorb quantization stacking better (104B: +3.6% vs 70B: +11.4% for turbo3). Validate on your specific model. See Configuration Recommendations for the full tested matrix.

Everything else lives in docs/benchmarks.md: asymmetric K/V and Boundary V tables, prefill context scaling, MoE and dense decode speed, NIAH retrieval, KL divergence, 70B/104B stress tests, community results on RTX 3090, M1 Max, and AMD RX 9070 XT, and the speed optimization journey.

Getting Started

Python Reference Implementation

git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Verify โ€” should print "141 passed"
python3 -m pytest tests/ -v

# Quick compression demo (no model needed)
python3 benchmarks/demo.py

# Validate on real model KV tensors (downloads Qwen3-1.7B, ~4GB)
pip install transformers torch accelerate
python3 benchmarks/validate_real_model.py

Requires Python >= 3.10, NumPy >= 1.24, SciPy >= 1.10. torch / transformers / accelerate are optional (real-model validation only).

Run Inference

The fastest path is a prebuilt binary, or build from the fork โ€” see the llama-cpp-turboquant quick start for build instructions, supported backends, and usage details.

# Server mode with TurboQuant KV cache
./build/bin/llama-server -m models/your-model.gguf \
  --jinja -ngl 99 -c 262144 -fa on \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -np 1 --host 0.0.0.0 --port 8080
FlagBits/valCompression vs fp16Description
turbo33.5โ€ 4.6xโ€ 3-bit PolarQuant + WHT rotation. Best compression, q8_0 speed.
turbo44.253.8x4-bit PolarQuant (16 centroids). Best quality.
q8_082.0xllama.cpp default quantized cache.
q4_044.0xllama.cpp 4-bit cache.

For per-model configuration guidance (symmetric vs asymmetric, Boundary V, large-model memory caps), see the Getting Started Guide and Configuration Recommendations.

Architecture

Input: KV cache vector x โˆˆ R^d (one attention head)
    โ”‚
    โ”œโ”€โ”€ Extract norm: ฮณ = ||x||, xฬ‚ = x/ฮณ
    โ”‚
    โ”œโ”€โ”€ Random rotation: WHT + random sign flips
    โ”‚   coordinates ~ N(0, 1/d) after rotation
    โ”‚
    โ”œโ”€โ”€ Optimal scalar quantization (Lloyd-Max)
    โ”‚   turbo4: 16 centroids (4-bit), turbo3: 8 centroids (3-bit), turbo2: 4 centroids (2-bit)
    โ”‚
    โ””โ”€โ”€ Output: quantized indices + norm per block
        Compression: 3.8x (turbo4), 5.1x (turbo3), 7.5x (turbo2)

Note on QJL: reference only, not used in production.

The original TurboQuant paper (Zandieh et al. 2024, arXiv 2406.03482) includes a 1-bit QJL error-correction stage. The Python qjl.py here implements it for paper reproducibility.

Production drops QJL on both K and V. QJL eliminates reconstruction bias but amplifies variance, which softmax turns into attention noise. Five independent groups confirmed (buun, scos-lab, Arclabs001, +2). See turbo4-resurrection.md for the full ablation and mechanism.

If you're building on this repo: use TurboQuantMSE (V cache), or implement straight 4 to 8 bit PolarQuant on K. Only enable the QJL / TurboQuant (with QJL) classes if you are reproducing the original paper or doing K-side research below 8-bit.

Project Structure
turboquant/
โ”œโ”€โ”€ rotation.py        # Walsh-Hadamard Transform + random sign flips
โ”œโ”€โ”€ codebook.py        # Lloyd-Max optimal centroid computation
โ”œโ”€โ”€ polar_quant.py     # PolarQuant โ€” norm extraction + WHT rotation + scalar quantization
โ”œโ”€โ”€ qjl.py            # QJL 1-bit quantizer (paper-faithful reference, see README ยงQJL). Not used in production.
โ”œโ”€โ”€ turboquant.py      # Full TurboQuant pipeline
โ”œโ”€โ”€ kv_cache.py        # KV cache integration layer
โ”œโ”€โ”€ outlier.py         # Outlier channel strategy (2.5-bit, 3.5-bit)
โ”œโ”€โ”€ lloyd_max.py       # Lloyd-Max quantizer implementation
โ”œโ”€โ”€ utils.py           # Bit packing, memory measurement
โ”œโ”€โ”€ isoquant.py        # IsoQuant (quaternion SO(4)) experimental comparison
โ””โ”€โ”€ rotorquant.py      # RotorQuant experimental comparison

tests/                 # 14 test files, 500+ tests
benchmarks/
โ”œโ”€โ”€ demo.py                       # Quick compression demo
โ”œโ”€โ”€ run_benchmark.py              # Server-based benchmark runner
โ”œโ”€โ”€ benchmark_results.md          # Full benchmark report
โ”œโ”€โ”€ benchmark_llama.sh            # llama.cpp benchmark script
โ”œโ”€โ”€ benchmark_norm_correction.py  # Norm correction validation
โ”œโ”€โ”€ benchmark_ppl_tq_vs_rq.py    # TurboQuant vs RotorQuant PPL comparison
โ”œโ”€โ”€ temporal_decay_prototype.py   # Temporal decay experiment
โ”œโ”€โ”€ test_with_llama.py            # Integration test at Qwen 3.5 dimensions
โ”œโ”€โ”€ test_outlier_comparison.py    # Outlier strategy comparison
โ””โ”€โ”€ validate_real_model.py        # Real model KV tensor validation

docs/
โ”œโ”€โ”€ benchmarks.md                 # Full benchmark and validation data
โ”œโ”€โ”€ mlx-port.md                   # MLX framework port (Python)
โ”œโ”€โ”€ changelog.md                  # v1 milestone history
โ”œโ”€โ”€ turboquant-recommendations.md # Configuration guide (tested matrix)
โ”œโ”€โ”€ windows-rdna4-setup.md        # Windows + AMD RDNA 4 build guide
โ”œโ”€โ”€ papers/                       # Validation papers and experiment writeups
โ””โ”€โ”€ (25+ engineering docs, investigations, experiment logs)

Roadmap

PhaseStatusDetails
Core algorithms (NumPy)โœ…500+ tests across 14 test files
Distortion validationโœ…Matches paper bounds (Table 2)
Real model validationโœ…Rotation validated on Qwen3 KV tensors (kurtosis 900โ†’2.9)
llama.cpp C portโœ…Metal GPU inference working on M1 through M5
Metal shader optimizationโœ…q8_0 speed parity: prefill matches or beats q8_0
CUDA backendโœ…Community-tested on RTX 3080 Ti/3090/4090/5090, DGX Spark Blackwell
HIP/AMD backendโœ…RX 9070 XT (RDNA 4) validated, gfx1201 native
Asymmetric K/Vโœ…q8_0-K + turbo-V rescues Q4_K_M models
Boundary Vโœ…Layer-aware V compression, 37-91% quality recovery
Sparse Vโœ…Attention-gated dequant skip, +22.8% decode on MoE. Upstream PR #21119
Block size optimizationโœ…32โ†’128, 12% better compression, zero quality cost
vLLM upstreamโœ…Merged as the TurboQuant attention backend (PR #38479)
llama.cpp upstream (rotation)โœ…Hadamard KV rotation + FWHT kernels merged upstream (#21038, #22631)
Upstream coordination๐Ÿ”„llama.cpp PR preparation for the full codec (#27)
TurboQuant+ extensionsโณAdaptive bits, temporal decay, MoE-aware compression
MLX Swift port๐Ÿ”„Active collaboration with @ekryski on mlx-swift-lm โ€” turbo4v2 working

Paper Reference

Docs

Contributing

Issues and PRs welcome. The main areas where help is needed:

  1. Upstream PR โ€” prepare llama.cpp contribution (CONTRIBUTING.md requirements)
  2. CUDA kernel optimization โ€” fused FA kernels, decode speed parity
  3. MLX memory recovery โ€” implement FP16 KV drop + compressed-only attention for memory-constrained long context
  4. Quality metrics โ€” multi-run statistics, additional task benchmarks (GSM8K, code gen, reasoning)
  5. Long context validation โ€” 64K+ testing across architectures

Support

If you find this work useful, you can support it via GitHub Sponsors or BTC:

BTC: bc1qsfaaf6mkz2yxx2vavg2n0zgsf3qj25uh94t83rwuq7de67dey05sc3tgjx

Commercial support: For inference optimization and KV cache tuning engagements, DM @no_stp_on_snek on X.

License

Apache License 2.0 โ€” see LICENSE.

Copyright 2026 Tom Turney.

Based on Google Research's TurboQuant paper (arXiv 2504.19874).