TurboQuant+
June 12, 2026 ยท View on GitHub
๐ TurboQuant KV cache compression is now in vLLM (PR #38479, merged April 2026):
--kv-cache-dtype turboquant_k8v4and friends, with fused Triton store/decode kernels. The PR discussion drew on the asymmetric K/V findings from this repo. Upstream llama.cpp has merged the core idea too: Hadamard KV cache rotation (#21038, citing TurboQuant directly) with fast WHT kernels on CPU (#22631), CUDA (#23615), and Vulkan (#23687). Rotation + the stock q4_0 cache is essentially turbo4's rotation stage; the PolarQuant codebook, norm extraction, and asymmetric policies remain here and in the fork.
Getting Started Guide | Configuration Recommendations | Benchmarks | Commercial Support
Implementation of TurboQuant (ICLR 2026) with implementation work, experiments, and follow-on findings beyond the base paper. Compresses transformer KV cache 3.8-6.4x using PolarQuant + Walsh-Hadamard rotation, at near q8_0 prefill speed and ~0.9x decode throughput at long context. Validated end-to-end from 1.5B to 104B at 128K context on a MacBook (turbo3, PPL 4.024, 74 GB peak memory).
This repository is the research home: the Python reference implementation, the validation papers, and the benchmark data. To run TurboQuant in an inference engine, pick from the table below. Pieces that prove useful and stable get upstreamed incrementally as small, reviewable patches.
Run It Today
| Engine | Platform | Status | Notes |
|---|---|---|---|
| vLLM | CUDA / ROCm, datacenter | Upstream, merged | --kv-cache-dtype turboquant_k8v4 and friends (PR #38479) |
| llama.cpp | All backends | Upstream, rotation merged | Hadamard KV cache rotation (#21038) + fast WHT kernels (CPU #22631, CUDA #23615, Vulkan #23687). Rotation + q4_0 cache approximates turbo4; full PolarQuant codec is in the fork below |
| llama-cpp-turboquant | Metal, CUDA, HIP, CPU | Production fork | turbo2/3/4 KV cache + TQ3_1S/TQ4_1S weight formats; prebuilt binaries for Mac (Metal) and Windows (CUDA) |
| mlx-swift-lm | Apple Silicon, Swift | Active collaboration | Fastest Apple path: ~2.5x faster decode than Python mlx-lm, full TQ+ support including turbo4v2. 144 tok/s on Qwen3.5-35B-A3B MoE at 4K on M5 Max |
| vllm-swift | Apple Silicon, Swift | Active | OpenAI-compatible serving built on mlx-swift-lm; no Python in the inference hot path |
| Atlas | Rust, Metal | Integrated | turbo4 KV cache append/decode kernels |
| mlxcel | Rust, MLX | Community port | Turbo KV cache docs. Thanks to @inureyes and the Lablup team for the careful port and upstream attribution |
| MLX Python fork | Apple Silicon, Python | Experimental | TurboKVCache drop-in for mlx-lm and mlx-vlm; see docs/mlx-port.md |
| turboquant-pytorch | PyTorch | Independent implementation | From-scratch PyTorch TurboQuant for KV cache; applies this repo's layer-adaptive compression and Sparse V findings |
| turboquant-vllm | CUDA, vLLM package | Community package | TQ+ KV cache for vLLM with fused CUDA kernels; builds on this repo's block_size=128 and K-dominance research |
| Quansloth | Local server product | Community product | Air-gapped local AI server built on the TurboQuant+ stack |
Key Findings
Three follow-on findings, independently validated by multiple researchers across different hardware and backends:
- V compression is free. Compressing the value cache (even down to 2 bits) has zero measurable effect on attention quality when key precision is maintained. Confirmed on Metal (M5 Max), CUDA RTX 4090 (@sztlink), and CUDA RTX 3090 (@HyperionMS2040). See asymmetric K/V paper.
- All quality degradation comes from K compression. This is why asymmetric configs (q8_0-K + turbo-V) rescue models where symmetric fails. Validated across Qwen, Llama, Mistral, and Command-R+ families. See M5 Max stress test.
- Boundary layers are disproportionately sensitive. Protecting the first 2 + last 2 layers at higher precision recovers 37-91% of the quality gap. See Boundary V paper.
Additional experiments and writeups: Sparse V dequant (+22.8% decode at 32K, not TurboQuant-specific), block size optimization (5.12x compression), turbo4 resurrection (QJL hurts, PolarQuant works), EDEN optimal-S response (rotation is first-order, scale is second-order).
Quality at a Glance (M5 Max 128GB)
| Cache Type | Bits/val | Compression | PPL (wikitext-2, 512c) | vs q8_0 |
|---|---|---|---|---|
| f16 | 16.0 | 1.0x | 6.121 | -0.16% |
| q8_0 | 8.5 | 1.9x | 6.111 | baseline |
| turbo4 | 4.25 | 3.8x | 6.125 | +0.23% |
| q4_0 | 4.5 | 3.6x | 6.142 | +0.52% |
| turbo3 | 3.5โ | 4.6xโ | 6.176 | +1.06% |
| turbo2 | 2.5 | 6.4x | 6.507 | +6.48% |
turbo4 (4-bit PolarQuant) has the best quality after q8_0 โ closer to q8_0 than q4_0, at better compression. turbo3 trades quality for maximum compression. turbo2 (2-bit) trades more quality for extreme compression โ best used asymmetrically.
โ turbo3 at default block_size=32. At block_size=128, turbo3 achieves 3.125 bits/val and 5.12x compression with identical PPL. See block size study.
Important: choosing the right config for your model. TurboQuant quality depends on your base weight quantization. Models with Q8_0+ weights work well with symmetric turbo (e.g.,
-ctk turbo3 -ctv turbo3). Some low-bit models with Q4_K_M weights may benefit from asymmetric K/V: use-ctk q8_0 -ctv turbo4to keep K precision high while compressing V. K precision is the dominant quality factor because it controls attention routing via softmax. Bigger models absorb quantization stacking better (104B: +3.6% vs 70B: +11.4% for turbo3). Validate on your specific model. See Configuration Recommendations for the full tested matrix.
Everything else lives in docs/benchmarks.md: asymmetric K/V and Boundary V tables, prefill context scaling, MoE and dense decode speed, NIAH retrieval, KL divergence, 70B/104B stress tests, community results on RTX 3090, M1 Max, and AMD RX 9070 XT, and the speed optimization journey.
Getting Started
Python Reference Implementation
git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
# Verify โ should print "141 passed"
python3 -m pytest tests/ -v
# Quick compression demo (no model needed)
python3 benchmarks/demo.py
# Validate on real model KV tensors (downloads Qwen3-1.7B, ~4GB)
pip install transformers torch accelerate
python3 benchmarks/validate_real_model.py
Requires Python >= 3.10, NumPy >= 1.24, SciPy >= 1.10. torch / transformers / accelerate are optional (real-model validation only).
Run Inference
The fastest path is a prebuilt binary, or build from the fork โ see the llama-cpp-turboquant quick start for build instructions, supported backends, and usage details.
# Server mode with TurboQuant KV cache
./build/bin/llama-server -m models/your-model.gguf \
--jinja -ngl 99 -c 262144 -fa on \
--cache-type-k turbo3 --cache-type-v turbo3 \
-np 1 --host 0.0.0.0 --port 8080
| Flag | Bits/val | Compression vs fp16 | Description |
|---|---|---|---|
turbo3 | 3.5โ | 4.6xโ | 3-bit PolarQuant + WHT rotation. Best compression, q8_0 speed. |
turbo4 | 4.25 | 3.8x | 4-bit PolarQuant (16 centroids). Best quality. |
q8_0 | 8 | 2.0x | llama.cpp default quantized cache. |
q4_0 | 4 | 4.0x | llama.cpp 4-bit cache. |
For per-model configuration guidance (symmetric vs asymmetric, Boundary V, large-model memory caps), see the Getting Started Guide and Configuration Recommendations.
Architecture
Input: KV cache vector x โ R^d (one attention head)
โ
โโโ Extract norm: ฮณ = ||x||, xฬ = x/ฮณ
โ
โโโ Random rotation: WHT + random sign flips
โ coordinates ~ N(0, 1/d) after rotation
โ
โโโ Optimal scalar quantization (Lloyd-Max)
โ turbo4: 16 centroids (4-bit), turbo3: 8 centroids (3-bit), turbo2: 4 centroids (2-bit)
โ
โโโ Output: quantized indices + norm per block
Compression: 3.8x (turbo4), 5.1x (turbo3), 7.5x (turbo2)
Note on QJL: reference only, not used in production.
The original TurboQuant paper (Zandieh et al. 2024, arXiv 2406.03482) includes a 1-bit QJL error-correction stage. The Python
qjl.pyhere implements it for paper reproducibility.Production drops QJL on both K and V. QJL eliminates reconstruction bias but amplifies variance, which softmax turns into attention noise. Five independent groups confirmed (buun, scos-lab, Arclabs001, +2). See turbo4-resurrection.md for the full ablation and mechanism.
If you're building on this repo: use
TurboQuantMSE(V cache), or implement straight 4 to 8 bit PolarQuant on K. Only enable theQJL/TurboQuant(with QJL) classes if you are reproducing the original paper or doing K-side research below 8-bit.
Project Structure
turboquant/
โโโ rotation.py # Walsh-Hadamard Transform + random sign flips
โโโ codebook.py # Lloyd-Max optimal centroid computation
โโโ polar_quant.py # PolarQuant โ norm extraction + WHT rotation + scalar quantization
โโโ qjl.py # QJL 1-bit quantizer (paper-faithful reference, see README ยงQJL). Not used in production.
โโโ turboquant.py # Full TurboQuant pipeline
โโโ kv_cache.py # KV cache integration layer
โโโ outlier.py # Outlier channel strategy (2.5-bit, 3.5-bit)
โโโ lloyd_max.py # Lloyd-Max quantizer implementation
โโโ utils.py # Bit packing, memory measurement
โโโ isoquant.py # IsoQuant (quaternion SO(4)) experimental comparison
โโโ rotorquant.py # RotorQuant experimental comparison
tests/ # 14 test files, 500+ tests
benchmarks/
โโโ demo.py # Quick compression demo
โโโ run_benchmark.py # Server-based benchmark runner
โโโ benchmark_results.md # Full benchmark report
โโโ benchmark_llama.sh # llama.cpp benchmark script
โโโ benchmark_norm_correction.py # Norm correction validation
โโโ benchmark_ppl_tq_vs_rq.py # TurboQuant vs RotorQuant PPL comparison
โโโ temporal_decay_prototype.py # Temporal decay experiment
โโโ test_with_llama.py # Integration test at Qwen 3.5 dimensions
โโโ test_outlier_comparison.py # Outlier strategy comparison
โโโ validate_real_model.py # Real model KV tensor validation
docs/
โโโ benchmarks.md # Full benchmark and validation data
โโโ mlx-port.md # MLX framework port (Python)
โโโ changelog.md # v1 milestone history
โโโ turboquant-recommendations.md # Configuration guide (tested matrix)
โโโ windows-rdna4-setup.md # Windows + AMD RDNA 4 build guide
โโโ papers/ # Validation papers and experiment writeups
โโโ (25+ engineering docs, investigations, experiment logs)
Roadmap
| Phase | Status | Details |
|---|---|---|
| Core algorithms (NumPy) | โ | 500+ tests across 14 test files |
| Distortion validation | โ | Matches paper bounds (Table 2) |
| Real model validation | โ | Rotation validated on Qwen3 KV tensors (kurtosis 900โ2.9) |
| llama.cpp C port | โ | Metal GPU inference working on M1 through M5 |
| Metal shader optimization | โ | q8_0 speed parity: prefill matches or beats q8_0 |
| CUDA backend | โ | Community-tested on RTX 3080 Ti/3090/4090/5090, DGX Spark Blackwell |
| HIP/AMD backend | โ | RX 9070 XT (RDNA 4) validated, gfx1201 native |
| Asymmetric K/V | โ | q8_0-K + turbo-V rescues Q4_K_M models |
| Boundary V | โ | Layer-aware V compression, 37-91% quality recovery |
| Sparse V | โ | Attention-gated dequant skip, +22.8% decode on MoE. Upstream PR #21119 |
| Block size optimization | โ | 32โ128, 12% better compression, zero quality cost |
| vLLM upstream | โ | Merged as the TurboQuant attention backend (PR #38479) |
| llama.cpp upstream (rotation) | โ | Hadamard KV rotation + FWHT kernels merged upstream (#21038, #22631) |
| Upstream coordination | ๐ | llama.cpp PR preparation for the full codec (#27) |
| TurboQuant+ extensions | โณ | Adaptive bits, temporal decay, MoE-aware compression |
| MLX Swift port | ๐ | Active collaboration with @ekryski on mlx-swift-lm โ turbo4v2 working |
Paper Reference
- TurboQuant: arXiv 2504.19874 (ICLR 2026)
- PolarQuant: arXiv 2502.02617 (AISTATS 2026)
- QJL: arXiv 2406.03482
- Google Research Blog: TurboQuant: Redefining AI Efficiency
Docs
- Benchmarks โ full quality/speed/retrieval data, community hardware results
- MLX Port โ the experimental MLX Python port: results, quickstarts, MM-NIAH
- Changelog โ v1 milestone history
- Quality Benchmarks โ perplexity validation, bisection log
- Speed Investigation โ Metal gotchas, fp16 WHT results, optimization history
- Speed Experiments โ the full 739 โ 2747 tok/s optimization journey
- Context Scaling Deep Dive โ why turbo3 degraded at long context, how we fixed it
- Pre-Rotate-Queries Investigation โ why graph-side WHT failed initially
- Quality + Speed Gate โ pre-push script checking PPL AND context scaling ratio
Contributing
Issues and PRs welcome. The main areas where help is needed:
- Upstream PR โ prepare llama.cpp contribution (CONTRIBUTING.md requirements)
- CUDA kernel optimization โ fused FA kernels, decode speed parity
- MLX memory recovery โ implement FP16 KV drop + compressed-only attention for memory-constrained long context
- Quality metrics โ multi-run statistics, additional task benchmarks (GSM8K, code gen, reasoning)
- Long context validation โ 64K+ testing across architectures
Support
If you find this work useful, you can support it via GitHub Sponsors or BTC:
BTC: bc1qsfaaf6mkz2yxx2vavg2n0zgsf3qj25uh94t83rwuq7de67dey05sc3tgjx
Commercial support: For inference optimization and KV cache tuning engagements, DM @no_stp_on_snek on X.
License
Apache License 2.0 โ see LICENSE.
Copyright 2026 Tom Turney.
Based on Google Research's TurboQuant paper (arXiv 2504.19874).