REFRACT QUICKSTART
May 4, 2026
Runtime expectations (7B Q8 model on Apple Silicon)
| Mode | Axes | Time | When |
|---|---|---|---|
| selftest | preflight only | ~1s static / ~30s with --model | Before your first real run. Free. |
| score (default) | Trajectory + KLD | ~5–7 min | Most runs. Go/no-go on a candidate KV config. |
| score --full | + R-NIAH + PLAD | ~25–30 min | Pre-ship audit. Adds long-context retrieval + brittleness. |
| repeatability --runs 4 | repeats default | 4× default | Sanity-check reproducibility. |

Default is quick. Most users should never wait 30 minutes unless they're explicitly running --full for a ship-decision audit.
⚠️ ALPHA — for initial testing and feedback only.
The framework works end-to-end and produces real, useful numbers today, but:
- Setup is manual: clone the repo, build llama.cpp / install mlx-lm yourself, fetch the corpus + prompts, edit paths in flags. No pip install refract yet (entry point is in pyproject as of v0.3.2; PyPI publish pending).
- All four backends work end-to-end (v0.3.2): llama.cpp, MLX, vLLM, SGLang. vLLM and SGLang were verified on AMD MI300X / ROCm 7.2 in the cross-engine bench at docs/papers/cross-engine-mi300x.md.
- Confidence guards exist but aren't exhaustive, so you may find edge cases. Please open an issue with the JSON.
- Score interpretation is calibrated on one matrix run of 7 models. Bands (90/80/60) are provisional and may shift in v0.4.
Goals for this alpha: real users on real models exposing real friction we can fix. If you hit a wall, open an issue with your selftest output and the JSON of the failing run.
Goal: get from "git clone" to a real REFRACT score in about 5–7 minutes of run time on the default (quick) mode.
What REFRACT does (one paragraph)
REFRACT scores how faithful a quantized KV-cache config is to the same model's fp16-KV reference. Score 0–100, higher is better. It's a multi-axis composite (Trajectory + KLD + R-NIAH + PLAD), bit-exact on Metal, fail-loud (any single broken axis tanks the composite). Replaces "lower PPL = better" because PPL inverts sign on instruct-tuned models.
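To make "fail-loud" concrete, here is a minimal sketch comparing a plain average with a harmonic mean over four per-axis scores. This is an illustration only: the harmonic mean is an assumption chosen because one low value dominates it, not REFRACT's documented formula, and the function name and sample scores are hypothetical.

```python
from statistics import harmonic_mean, mean


def fail_loud_composite(axis_scores):
    """Hypothetical composite on a 0-100 scale.

    Uses a harmonic mean, which is pulled hard toward the lowest axis.
    This is NOT taken from REFRACT's source; it only shows the behaviour
    the quickstart describes as "fail-loud".
    """
    values = list(axis_scores.values())
    if not values:
        raise ValueError("need at least one axis score")
    if any(v <= 0 for v in values):
        # A fully broken axis zeroes the composite.
        return 0.0
    return harmonic_mean(values)


# Made-up scores: three healthy axes and one broken one.
scores = {"trajectory": 95, "kld": 95, "rniah": 95, "plad": 10}

print(f"arithmetic mean: {mean(scores.values()):.1f}")
print(f"fail-loud composite: {fail_loud_composite(scores):.1f}")
```

With an arithmetic mean the broken axis is partly hidden by the three healthy ones; with the harmonic mean it drags the composite well below any of the healthy axes.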
Step 0 — install REFRACT
Recommended: PyPI
# Apple Silicon
pip install 'refract-llm[refract-mlx]'
# CUDA / ROCm (vLLM in-process)
pip install 'refract-llm[refract-vllm]'
# SGLang HTTP client (you run the SGLang server separately, e.g. via Docker)
pip install 'refract-llm[refract-sglang]'
# All three backends in one shot
pip install 'refract-llm[full]'
After install, the refract CLI is on your PATH, and the v0.1 prompts file, the example reports, and the llama.cpp trajectory patch all ship inside the wheel.
macOS gotcha: system Python 3.9 won't work. macOS ships /usr/bin/python3 as 3.9; mlx-lm requires Python 3.10+. Use a newer Python (e.g. brew install python@3.13, then python3.13 -m venv ...) for the MLX backend. The base framework runs on 3.9, but the backend extra refuses to resolve there.
Source install (for hacking / contributing)
git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
pip install -e . # editable install, base
# Quote the extras: zsh (the macOS default shell) treats [ ] as a glob pattern
pip install -e '.[refract-mlx]' # editable + MLX backend
pip install -e '.[refract-vllm]' # editable + vLLM backend
pip install -e '.[refract-sglang]' # editable + SGLang backend
pip install -e '.[dev]' # editable + pytest + coverage + build tooling
Every later command (python3 -m refract.cli ...) assumes you are
running from the turboquant_plus/ checkout.
If you also want the patched llama.cpp binaries (the llamacpp backend
needs them on PATH / LD_LIBRARY_PATH):
# In a sibling directory
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
cmake -B build-allquants -DGGML_METAL=ON # or -DGGML_HIP=ON, -DGGML_CUDA=ON
cmake --build build-allquants -j --target llama-cli llama-completion llama-tokenize llama-perplexity
# Set LLAMA_CPP_BIN_DIR to the build-allquants/bin path before running refract
export LLAMA_CPP_BIN_DIR=$PWD/build-allquants/bin
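Before running anything long, it can help to confirm that the build actually produced the binaries. The sketch below checks LLAMA_CPP_BIN_DIR for the four targets named in the cmake command above. It assumes those four are the ones REFRACT needs; the function name is hypothetical, and the selftest subcommand remains the authoritative check.

```python
import os
from pathlib import Path

# The four build targets from the cmake --build line above.
REQUIRED_BINARIES = (
    "llama-cli",
    "llama-completion",
    "llama-tokenize",
    "llama-perplexity",
)


def missing_llama_binaries(bin_dir):
    """Return the required binaries that are absent or not executable in bin_dir."""
    root = Path(bin_dir)
    missing = []
    for name in REQUIRED_BINARIES:
        candidate = root / name
        if not candidate.is_file() or not os.access(candidate, os.X_OK):
            missing.append(name)
    return missing


if __name__ == "__main__":
    bin_dir = os.environ.get("LLAMA_CPP_BIN_DIR")
    if not bin_dir:
        print("LLAMA_CPP_BIN_DIR is not set")
    else:
        missing = missing_llama_binaries(bin_dir)
        print("all binaries found" if not missing else "missing: " + ", ".join(missing))
```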
For the vLLM backend on CUDA / ROCm, either install upstream vllm
(pip install vllm) or pull the author's fork:
git clone https://github.com/TheTom/vllm.git
cd vllm
pip install -e . # this can take a while on ROCm; refer to vLLM's own ROCm docs
For SGLang, the simplest path is the published Docker image (the bench
in docs/papers/cross-engine-mi300x.md uses
lmsysorg/sglang:v0.5.10.post1-rocm720-mi30x for AMD MI300X — see §6
for the in-container patches that image needs).
Prereqs
Once REFRACT is installed and you're inside the turboquant_plus/
checkout, you need:
- Python 3.10+
- One of:
  - llama.cpp build with --jinja support and the REFRACT v0.1.4 patch in tools/completion/completion.cpp. (Patch emits per-token JSONL when REFRACT_TRAJECTORY env var is set.)
  - mlx-lm (pip install mlx mlx-lm). MLX backend is native Python; no patches needed.
  - vllm (pip install vllm or pip install -e .[refract-vllm]). Working backend as of v0.3.2. Caches one LLM at a time, evicts on KV-config change. Tunable via REFRACT_VLLM_* env knobs.
  - SGLang server (Docker recommended; pip install -e .[refract-sglang] for the HTTP client). Backend posts to a pre-launched SGLang server. KV dtype is fixed at server launch, so run_kld requires either two simultaneous servers (REFRACT_SGLANG_REF_URL + REFRACT_SGLANG_CAND_URL) or a two-phase orchestrator (example in docs/papers/cross-engine-mi300x.md).
- A model in the right format for your backend:
  - .gguf for llama.cpp
  - directory with config.json + model.safetensors for mlx
  - HF safetensors directory for vllm and sglang
- Corpus + haystack: automatic. REFRACT auto-downloads wikitext-2-raw (~10MB) to ~/.cache/refract/ on first run and uses wiki.test.raw for KLD + wiki.train.raw for R-NIAH unless you pass paths explicitly. Pre-fetch with:
  python3 -m refract.cli fetch
  Disable auto-download with --no-auto-fetch if you want to require explicit paths (CI-friendly).
- The prompts JSONL ships at refract/prompts/v0.1.jsonl.
Constrained VRAM? Pass extra llama.cpp flags
REFRACT defaults to -ngl 99 (all layers on GPU) for the llama.cpp
backend. Consumer-card users running large MoE models (e.g.
Qwen3.6-35B-A3B on a 12 GB 3060) won't fit that — they need
-ncmoe N to offload some MoE expert layers to CPU.
Pass any extra llama.cpp flags via REFRACT_LLAMA_EXTRA_FLAGS:
# 12 GB consumer GPU running Qwen3.6-35B-A3B with MoE offload
export REFRACT_LLAMA_EXTRA_FLAGS="-ngl 28 -ncmoe 32"
python3 -m refract.cli score --backend llamacpp --model /path/to/model.gguf ...
The flags get appended to every llama-cli, llama-completion, and
llama-perplexity invocation after REFRACT's own. llama.cpp uses
last-wins for repeated flags, so REFRACT_LLAMA_EXTRA_FLAGS="-ngl 28 -ncmoe 32" overrides the default -ngl 99. Parsed with shlex so
quoted args work the same as on the command line.
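The append-then-last-wins behaviour described above can be sketched in a few lines. This is an illustration, not REFRACT's actual code: build_argv and the base command line are hypothetical, while shlex.split is the standard-library call the text refers to.

```python
import os
import shlex


def build_argv(base_argv, env=None):
    """Append REFRACT_LLAMA_EXTRA_FLAGS (shell-style quoting) after the base args."""
    env = os.environ if env is None else env
    extra = shlex.split(env.get("REFRACT_LLAMA_EXTRA_FLAGS", ""))
    return list(base_argv) + extra


# Hypothetical base command line with the default -ngl 99.
argv = build_argv(
    ["llama-cli", "-m", "model.gguf", "-ngl", "99"],
    env={"REFRACT_LLAMA_EXTRA_FLAGS": "-ngl 28 -ncmoe 32"},
)
print(argv)
# ['llama-cli', '-m', 'model.gguf', '-ngl', '99', '-ngl', '28', '-ncmoe', '32']
```

Both -ngl values end up on the command line; the later one (28) is the one that takes effect under last-wins parsing.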
Confirmed working scenarios:
- Consumer 12 GB GPU + 35B-A3B MoE: -ngl 28 -ncmoe 32
- CPU-only fallback: -ngl 0
- Tensor split across multiple GPUs: -ts 1,1
If a flag REFRACT doesn't recognize trips up its own subprocess, open an issue with the failing command line and we'll plumb it.
Step 2 — preflight (~30 seconds)
# llama.cpp model (.gguf)
python3 -m refract.cli selftest --backend auto --model /path/to/model.gguf
# OR an MLX model (directory with config.json + model.safetensors)
python3 -m refract.cli selftest --backend auto --model /path/to/mlx-model-dir/
# Without --model: static checks only (~1 second)
python3 -m refract.cli selftest
--backend auto infers from the path: .gguf → llamacpp; directory →
mlx (or vllm if REFRACT_BACKEND=vllm). Override with
--backend llamacpp|mlx|vllm|sglang or set REFRACT_BACKEND env var.
Verifies binaries, flags, env vars, and a tiny generation. If it bails, fix the reported issue before going further. Don't burn a long run finding out your setup is broken.
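The --backend auto rule described above can be sketched as follows. This is a simplified reading of the text, not the real implementation: infer_backend is a hypothetical name, and the only environment override it handles is the REFRACT_BACKEND=vllm case mentioned for directories.

```python
import os
from pathlib import Path


def infer_backend(model_path, env=None):
    """Guess the backend from the model path, following the quickstart's rule."""
    env = os.environ if env is None else env
    path = Path(model_path)
    if path.suffix == ".gguf":
        return "llamacpp"
    if path.is_dir():
        # A directory is treated as MLX unless the env var asks for vLLM.
        return "vllm" if env.get("REFRACT_BACKEND") == "vllm" else "mlx"
    raise ValueError(f"cannot infer a backend from {model_path!r}; pass --backend explicitly")


print(infer_backend("/path/to/model.gguf", env={}))  # llamacpp
```

The .gguf check comes first and only looks at the file suffix, so it does not touch the filesystem; the directory branch does.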
Step 3 — first quick score (5–7 min on a 7B Q8)
python3 -m refract.cli score \
--model /path/to/model.gguf \
--candidate "ctk=q8_0,ctv=q8_0" \
--prompts refract/prompts/v0.1.jsonl \
--json-out my-first-report.json \
--html-out my-first-report.html
--corpus is auto-resolved from ~/.cache/refract/ (downloaded on
first run). This runs Trajectory + KLD@D — the two cheap axes. You'll
get a composite score, a band (EXCELLENT/PASS/DEGRADED/FAIL), and a
plain-English diagnosis of what the per-axis pattern means.
Step 4 — full audit (25–30 min on a 7B Q8)
Add --full. Both haystack file and corpus are auto-resolved from the cache.
python3 -m refract.cli score \
--model /path/to/model.gguf \
--candidate "ctk=q8_0,ctv=q8_0" \
--prompts refract/prompts/v0.1.jsonl \
--full \
--rniah-up-to 16384 \
--json-out my-full-report.json \
--html-out my-full-report.html
Long-context audit knob
--rniah-up-to N controls how deep R-NIAH probes. Lengths are
auto-generated as a doubling step-up from 4K up to N:
| --rniah-up-to | Lengths tested | R-NIAH wall-time on 7B Q8 |
|---|---|---|
| 16384 (default) | 4K, 8K, 16K | ~10–15 min |
| 32768 | 4K, 8K, 16K, 32K | ~25–35 min |
| 65536 | 4K, 8K, 16K, 32K, 64K | ~60–90 min |
| 131072 | 4K … 128K | ~3+ hours |
Pick a value matching your model's actual usable context. If the model
fails at 64K under fp16, R-NIAH will report confidence: low for those
cells (per-cell base_acc = 0) — cleaner to cap below that.
Power users: --rniah-lengths 4096,16384,65536 overrides the doubling
step-up with an explicit list.
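For reference, the doubling step-up can be reproduced with a short loop. This is a sketch of the rule as stated in this section; rniah_lengths is a hypothetical helper, and how REFRACT treats a cap that is not a power-of-two multiple of 4096 is an assumption (here the list simply stops at the last doubling that fits).

```python
def rniah_lengths(up_to, start=4096):
    """Context lengths probed by a doubling step-up from `start` up to `up_to`."""
    if up_to < start:
        raise ValueError(f"--rniah-up-to must be at least {start}")
    lengths = []
    n = start
    while n <= up_to:
        lengths.append(n)
        n *= 2
    return lengths


for cap in (16384, 32768, 65536):
    print(cap, rniah_lengths(cap))
# 16384 [4096, 8192, 16384]
# 32768 [4096, 8192, 16384, 32768]
# 65536 [4096, 8192, 16384, 32768, 65536]
```

Each doubling of the cap adds one more (and longer) probe length, which is why wall-time grows faster than the cap itself.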
Generating the HTML report
Pass --html-out path.html to any score invocation. The HTML report
is a single self-contained file (~40 KB) you can email, paste into
Discord, or open offline:
python3 -m refract.cli score \
--model /path/to/model.gguf \
--candidate "ctk=q8_0,ctv=q8_0" \
--prompts refract/prompts/v0.1.jsonl \
--json-out report.json \
--html-out report.html
What's in it:
- Composite + per-axis stats strip at the top
- Plain-English diagnosis (colored callout)
- Per-axis breakdown with bars and bands
- R-NIAH per-cell heatmap + PLAD per-perturbation table when --full
- Run details (model size, hardware, KV configs)
- Reproduce command (sanitized: no personal paths)
- Embedded raw JSON in a <details> section
- Sun/moon toggle in the top-right for light/dark mode (follows OS by default)
What's bundled vs external:
- HTML, CSS, JS, raw JSON: all inline. Works offline.
- Geist font: loads from Google Fonts CDN with system-ui fallback. Online → polished typography. Offline → system fonts (Apple SF / Segoe UI), still readable.
- Dark mode uses light-dark() CSS, which needs Chrome 123+ / Safari 17.5+ / Firefox 120+ (all 2024). Older browsers see the light theme cleanly; dark mode is progressive enhancement.
Sample reports live in examples/ (4 real reports from
the 2026-04-30 matrix run). Open one to preview the format before
running your own.
Step 5 — interpret the result
Quick table:
| Composite | Band | What it means |
|---|---|---|
| 90–100 | EXCELLENT | Indistinguishable from fp16. Safe to deploy. |
| 80–90 | PASS | Minor drift; safe to deploy in most uses. |
| 60–80 | DEGRADED | Visible drift; audit on your workload first. |
| 0–60 | FAIL | Material quality loss; treat as broken. |
If the composite is below 90, look at the per-axis breakdown and the Diagnosis block in the report. It will tell you in plain English which surface broke (e.g., "decode distribution drift detected; candidate generates different tokens than fp16 on short-context prompts") and a suggested next move.
For deeper interpretation see INTERPRETATION.md.
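If you post-process reports in scripts, the band table above maps to a simple threshold check. A minimal sketch, assuming the lower bound of each band is inclusive (so exactly 90 counts as EXCELLENT); the table's ranges overlap at the edges, and the function name is hypothetical.

```python
def band(composite):
    """Map a 0-100 composite score to the quickstart's provisional bands."""
    if not 0 <= composite <= 100:
        raise ValueError(f"composite must be within 0-100, got {composite}")
    if composite >= 90:
        return "EXCELLENT"
    if composite >= 80:
        return "PASS"
    if composite >= 60:
        return "DEGRADED"
    return "FAIL"


print(band(87))  # PASS
```

Since the 90/80/60 thresholds are marked provisional, keep them in one place like this rather than scattering the numbers through your tooling.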
Step 6 — compare candidates side by side
python3 -m refract.cli compare \
report-q8q8.json report-q8turbo4.json report-q4q4.json
Prints a comparison table. Useful for finding the breaking point of a model under increasingly aggressive quants.
Backends
| Backend | Status | Use for |
|---|---|---|
| llamacpp | shipping | .gguf models, all four axes, TurboQuant configs |
| mlx | shipping | MLX models (directory layout); Trajectory + R-NIAH + PLAD work; KLD has limitations on RotatingKVCache models |
| vllm | shipping | HF safetensors models on CUDA / ROCm; all four axes; in-process LLM (caches one at a time, evicts on KV-config change). Verified on MI300X (Qwen3.6-35B-A3B). |
| sglang | shipping | HF safetensors models served via a pre-launched SGLang server (HTTP). KV dtype is fixed at server launch — see docs/papers/cross-engine-mi300x.md §6 for a two-phase orchestrator that handles this. |
Override default with --backend mlx (or REFRACT_BACKEND=mlx).
Common pitfalls (also see PITFALLS.md)
- Don't use the v0.1.x gtm axis: it has a known detokenize→retokenize unit-mismatch bug. v0.3.1 default is --axis-a trajectory (the proper fix).
- Instruct models need chat-template handling. REFRACT v0.3.0+ applies it automatically via --jinja. If you see all-zero retrieval (R-NIAH base_acc = 0 everywhere), your llama.cpp build may be too old.
- Thinking-mode models are auto-detected at run start; reasoning is disabled via -rea off. The detection line in the banner says whether your model triggered it.
- R-NIAH with base_acc < 0.2 averaged across cells flags confidence: low in the JSON. The model isn't engaging the task and the score is noise-floor.
- PLAD paraphrase = NaN means no synonym matches in your prompts set. Other perturbations (typo/case/punct) still produce valid numbers; the cell is recorded as skipped_perturbations in JSON.
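The R-NIAH confidence rule in the list above amounts to a mean-and-threshold check. The sketch below shows that logic under stated assumptions: base_acc is taken to be a per-cell accuracy between 0 and 1, the 0.2 threshold comes from the text, and the function name and the "ok" label for the non-low case are hypothetical.

```python
from statistics import mean

LOW_CONFIDENCE_THRESHOLD = 0.2  # from the pitfall above: base_acc < 0.2 averaged across cells


def rniah_confidence(base_accs):
    """Return "low" when the mean per-cell base_acc is below the threshold."""
    if not base_accs:
        raise ValueError("need at least one R-NIAH cell")
    return "low" if mean(base_accs) < LOW_CONFIDENCE_THRESHOLD else "ok"


# Made-up cells: the model retrieves at short context and fails at long context.
print(rniah_confidence([0.9, 0.8, 0.0, 0.0]))  # mean 0.425 -> ok
print(rniah_confidence([0.3, 0.1, 0.0, 0.0]))  # mean 0.1   -> low
```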
Reproducibility
Reports embed:
- framework_version (REFRACT version)
- environment.backend (llamacpp / mlx / vllm)
- environment.llama_cpp_commit (when llamacpp)
- environment.mlx_lm_version (when mlx)
- score_direction and score_range (so machine consumers can't accidentally invert the comparison)
When sharing scores ("I got 87 on Mistral-7B"), include the JSON. The number alone is not reproducible without the version stamp.
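Before sharing a report, a quick script can confirm the version stamps are present. This sketch assumes the dotted names listed above are nested JSON keys (for example environment.backend is report["environment"]["backend"]); that layout, the helper names, and the sample values are assumptions, not taken from a real report.

```python
REQUIRED_KEYS = ("framework_version", "environment.backend", "score_direction", "score_range")

# Backend-specific stamps listed in this section.
BACKEND_KEYS = {
    "llamacpp": "environment.llama_cpp_commit",
    "mlx": "environment.mlx_lm_version",
}


def lookup(report, dotted):
    """Follow a dotted key path through nested dicts; None if any part is missing."""
    node = report
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_stamps(report):
    """List the reproducibility fields that a report dict lacks."""
    required = list(REQUIRED_KEYS)
    backend = lookup(report, "environment.backend")
    if backend in BACKEND_KEYS:
        required.append(BACKEND_KEYS[backend])
    return [key for key in required if lookup(report, key) is None]


# Made-up report fragment; the values are placeholders.
report = {
    "framework_version": "0.3.2",
    "environment": {"backend": "llamacpp"},
    "score_direction": "higher_is_better",
    "score_range": [0, 100],
}
print(missing_stamps(report))  # ['environment.llama_cpp_commit']
```

To check a real file, load it with json.load and pass the resulting dict to missing_stamps.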