REFRACT Leaderboard

April 30, 2026 · View on GitHub

⚠️ READ THIS BEFORE USING THESE NUMBERS.

This leaderboard ranks how faithfully a quantized KV-cache config preserves a model's own fp16-KV behaviour. It is NOT a model quality leaderboard. Specifically:

  • A high REFRACT score does NOT mean the model is good. It means the quantization didn't make it materially worse than its own fp16 baseline.
  • A low REFRACT score does NOT necessarily mean the model is bad. It means the quantization is materially diverging from the fp16 reference; the underlying model could be excellent at fp16.
  • REFRACT scores do not represent how a model plays in the field. Real downstream performance depends on prompt distribution, task difficulty, sampling choices, the user's prompts, and hundreds of factors REFRACT can't see. For real-world task evaluation use HELM, lm-eval, or your own task suite. REFRACT is the tool to answer: "did my quantization break this model relative to its own fp16 self?"

Treat this leaderboard as "which models tolerate which KV configs", not "which models are best".


Each row is tagged with the framework version that produced it. Within a single REFRACT version, rows are directly comparable. Across versions, expect ±2 composite points of methodology drift.

Best models for ctk=q8_0,ctv=turbo4 candidate (REFRACT v0.3.0)

What this asks: does this candidate KV config faithfully preserve this model's fp16 behaviour? What this does NOT ask: which of these models is smartest, best at code, best at reasoning, etc.

RankModelCompositeBandTrajectoryKLDKLD natsR-NIAHPLADNotes
1Mistral-Small-24B Q4_K_M90.86EXCELLENT76.6599.710.0029100.0091.34Tolerates q8/turbo4 best of the matrix.
2phi-4 Q890.25EXCELLENT77.9599.550.0046100.0087.35Microsoft's phi-4. Distribution-faithful under turbo.
3Llama-4-Scout-17B-16E Q4_K_M89.77PASS73.5897.320.0272100.0093.54Just below EXCELLENT. PLAD highest in the matrix.
4Qwen3.5-2B Q881.48PASS60.0798.350.0167100.0081.47Trajectory drift visible but distribution + retrieval intact.
5Gemma-4-E2B Q4_K_L78.51DEGRADED52.7393.500.0672100.0088.57Borderline pass; audit before deploying.
6Qwen2.5-7B Q877.98DEGRADED55.1398.750.0126100.0076.73KLD high but Trajectory + PLAD struggle.
7Gemma-4-31B Q850.78FAIL26.4149.230.7086100.0094.45Distribution materially shifted (0.7 nats). Don't ship.
8Gemma-4-26B-A4B Q829.12FAIL17.3217.591.7381100.0078.40Catastrophic distribution shift (1.7 nats). The motivating model from the paper.

Negative control (sanity check, not a real candidate)

ModelCandidate KVCompositeBandNotes
Gemma-4-26B-A4B Q8turbo4/turbo4 (symmetric)~11FAILDeliberately broken config from the paper. KLD = 2.13 nats, Trajectory = 3.93. If your run on this combo does NOT FAIL, your framework setup has a problem.

How to read the rankings

The rank order means: which model tolerates this exact KV config best while staying close to its own fp16 reference. Several non-obvious properties of these numbers:

  • Mistral-24B at #1 doesn't mean it's the strongest model. It means q8/turbo4 quantization barely moves it from its own fp16 distribution. Mistral happens to be robust to KV quantization at this bit budget.

  • Gemma-4-31B at #7 with KLD 49.23 doesn't mean Gemma is a bad model. It means q8/turbo4 specifically doesn't preserve its distribution. Try ctk=q8_0,ctv=q8_0 (no turbo) on the same model and the score will jump dramatically.

  • Llama-4-Scout PASS at #3 doesn't mean Llama-4 is good. It means the q8/turbo4 quant doesn't degrade it relative to its own fp16 baseline. Llama-4's broader reputation issues (hallucination, reasoning shallowness) are model-architecture problems REFRACT cannot see — REFRACT only measures quant fidelity.

In short: REFRACT ranks the quant, not the model.

What REFRACT does measure

For each candidate KV config × model:

  • Trajectory — does greedy decode produce the same tokens as fp16?
  • KLD@D — does the per-token distribution match fp16's on a corpus?
  • R-NIAH — does long-context retrieval work as well as fp16?
  • PLAD — does the model's robustness to typos / casing / paraphrase match fp16's?

A high composite says all four surfaces look like fp16. It says nothing about whether fp16 itself was any good for your workload.

What REFRACT does NOT measure

  • Real downstream task accuracy (HELM, lm-eval, MMLU territory)
  • Code quality, reasoning depth, instruction-following
  • Agentic capability, tool-use fidelity (T-Call axis is v0.4)
  • Hallucination rate
  • Safety, refusal calibration
  • Production latency, throughput, memory footprint
  • Anything the user actually cares about end-to-end

A model can pass REFRACT and still be bad at your task. A model can pass REFRACT under one quant and fail at your task because the model itself is a poor fit. You still need real-world evaluation.

How to submit a result

Alpha-stage informal: open an issue or PR with:

  1. The JSON of your run (refract score ... --json-out report.json). It embeds framework_version, environment.backend, and (for llama.cpp) environment.llama_cpp_commit so we can validate the row is from a known-good methodology version.
  2. The HTML report if you want a visual share (--html-out).
  3. refract selftest --backend X --model Y output.
  4. A repeatability run (refract repeatability --runs 4 ...) if the score is in a band that surprises you.

We accept rows where:

  • framework_version >= 0.3.0 (chat-template handling fixed)
  • Confidence flags are clean (no R-NIAH low without an explanation)

Reproducibility info for the rows above

  • REFRACT version: v0.3.0
  • Backend: llamacpp (TheTom/llama.cpp feature/turboquant-kv-cache)
  • Hardware: M5 Max 128 GB RAM, macOS Tahoe 26.4
  • Reference KV: ctk=f16,ctv=f16
  • R-NIAH ctx_max: 16384
  • Total matrix runtime: 7370s (2h 2min) for v0.2.0 + 7336s for v0.3.0

Sample report files: refract/examples/*.json + *.html.

Versioning

REFRACT version dictates methodology. Major changes that affect the score surface:

VersionMethodology changeEffect
v0.1.4Trajectory axis replacing buggy GTMMost v0.1.4+ scores 5–15 points different from v0.1.x
v0.2.0Added R-NIAH + PLAD axesComposite went from harmonic_mean(2) to harmonic_mean(2..4)
v0.2.1R-NIAH neutral needle + n_predict=256Fixed refusal artifacts
v0.3.0Chat-template handling via --jinja -rea offFixed instruct-model engagement
v0.3.1Backend abstraction + confidence guards + version stampsNo score impact
v0.3.2HTML reports + repeatability subcommandNo score impact

Cite results as "Mistral-24B got 90.86 EXCELLENT on REFRACT v0.3.0, candidate ctk=q8_0/ctv=turbo4, llamacpp backend, M5 Max" — the version

  • candidate are load-bearing.

Future automation

A refract leaderboard --in dir/ --out LEADERBOARD.md subcommand that walks a directory of submitted JSONs is a v0.4 target. For now the table is curated from the matrix output JSONs.