Benchmarks

May 8, 2026 · View on GitHub

SQLRite ships with a curated benchmark suite that pits the engine against SQLite (the reference) and optionally DuckDB (analytical-slice comparator) on a fixed set of OLTP, SQL-feature-scaling, and AI-era workloads. Numbers are produced on demand with make bench; raw JSON envelopes get committed under benchmarks/results/ so trends can be diffed mechanically.

The point isn't to "win" — SQLite has 25+ years of optimization behind it, and the ratios shipped here range from healthy single-digit gaps to known-broken multi-million-x outliers that surfaced honest engine TODOs (SQLR-18 / SQLR-19 / SQLR-20 / SQLR-21). The point is to (a) baseline so future engine work has a number to move, (b) prove the differentiator workloads (HNSW, BM25, hybrid retrieval) actually deliver, and (c) ground roadmap conversations about LSM / columnar / JIT directions with evidence rather than vibes.

For the design rationale (resolved Q1–Q8) see docs/benchmarks-plan.md. For the workload-by-workload narrative numbers see benchmarks/README.md.


Table of contents


Quick start

# Lean profile — SQLRite + SQLite (rusqlite-bundled, WAL+NORMAL).
# ~5 min on an M-series MBP.
make bench

# Full profile — adds the optional DuckDB driver on Group B (W7/W8/W9).
# ~30 min on the same host (DuckDB's per-row INSERT path is heavy on
# the 1M-row Group B setup).
make bench-duckdb

Both commands run cargo bench -p sqlrite-benchmarks and then aggregate criterion's per-bench JSON into a single envelope under benchmarks/results/. Open target/criterion/report/index.html afterwards for the full criterion HTML report (per-bench p50/p95/p99 distributions, regression plots, etc.).


What the suite measures

Twelve workloads, three groups. Each one has a fixed input dataset (deterministic seed), a correctness gate that runs once before timing starts, and a fixed criterion configuration. Workloads are versioned (W1.v1, W1.v2, …) so a future shape change is an explicit, traceable bump (Q8).

Group A — OLTP baseline

IDNameShape
W1Read-by-PK100k-row table, 10k random WHERE id = ? probes
W2Range scanWHERE indexed_col >= ? AND <= ?, three width buckets (100 / 1k / 10k rows)
W3Bulk insert100k rows in one transaction
W4Single-row insert1k auto-committed INSERTs against a 1k-row preloaded table
W5Mixed OLTP50/50 SELECT-by-PK + UPDATE-by-PK on a 100k-row table
W6Index lookup10k probes by secondary = ? on a unique non-PK index

Group B — SQL-feature scaling

IDNameShape
W7AggregateSELECT SUM(v) FROM big, 1M rows
W8GROUP BYSELECT k, COUNT(*) FROM big GROUP BY k, three cardinalities (10 / 1k / 100k)
W9INNER JOINper-PK probe on customers ⨝ orders, 10k×10k tables (plan target was 100k×100k — see SQLR-20)

Group C — Differentiators

IDNameShape
W10Vector top-1010k 384-dim vectors, cosine top-10. Brute-force + HNSW variants.
W11BM25 top-101k-doc corpus, top-10 BM25. SQLite FTS5 as comparator.
W12Hybrid retrieval50/50 BM25 + cosine fusion on a 1k-doc corpus. SQLRite-only.

W11 / W12 cap at 1k docs because of an engine constraint (SQLR-21); plan target was 10k.


Reading the numbers

A few methodology notes that change how you read the table.

Q3 — SQLite is run with the tuned profile. journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY, cache_size=-65536 (64 MB). Rationale: SQLRite's WAL is mandatory + always-on, so SQLite-default (journal_mode=DELETE, synchronous=FULL, on-commit fsync) is not apples-to-apples. WAL+NORMAL matches SQLRite's commit fsync semantics. A SQLite-default secondary column is a future opt-in.

Q3 — DuckDB runs at defaults. No equivalent to SQLite's WAL+NORMAL knob; DuckDB's MVCC + commit semantics are uniform.

Parser tax — historical, addressed in SQLR-23. Pre-SQLR-23, the engine parsed SQL on every Connection::execute / Connection::prepare call. The bench driver's SQLRite path inlined ? placeholders into the SQL string, so every iteration also walked the full sqlparser AST. Several workloads' headline numbers were dominated by this overhead — W1 and W6 (sub-µs paths where parser cost was most of the per-iter time), W10 (the 384-dim bracket-array literal in ORDER BY was ~4 KB of SQL the parser walked every iteration; brute-force vs HNSW looked indistinguishable as a result).

SQLR-23 shipped:

  • Connection::prepare_cached — small per-connection LRU of parsed plans (default cap 16, matches rusqlite).
  • Statement::query_with_params(&[Value]) / Statement::execute_with_params(&[Value]) — bind ? placeholders at execute time without re-running sqlparser.
  • Value::Vector(Vec<f32>) as a first-class bind type — the 4 KB query vector for W10 is now bound directly instead of being re-lexed every iteration. The HNSW probe optimizer still recognizes the bound shape, so the algorithmic shortcut keeps firing.

The bench harness Driver::query_one / query_all paths route through prepare_cached + the bound API. Every workload's WorkloadId.version was bumped v1 → v2 in lockstep — old JSON envelopes keep the v1 tag and stay readable, but cross-version comparisons require an explicit acknowledgment in the comparison script. The headline table below carries the v2 numbers from the post-SQLR-23 republished run (SQLR-25); the retired v1 baseline lives in the historical section underneath.

Where DuckDB is misleading. Per-PK-probe single-row OLTP queries (W9) are SQLite's home turf, not DuckDB's. The plan flags this as "apples-to-oranges"; we still publish the number because the directional comparison is informative.

Workload-shape capacity caps. Two workloads ship at smaller-than-plan dataset sizes because of engine constraints, both tracked as separate follow-ups (SQLR-20, SQLR-21). The deviations are documented inline in benchmarks/src/workloads/.


Headline numbers

Median latency from the post-SQLR-23 pinned-host run — benchmarks/results/2026-05-08-apple-ac84d560.json, Apple M1 Pro / macOS 23.5.0, criterion defaults (3 s warm-up, 5 s measurement, 100 samples on light workloads / 10 samples on heavy ones — see the JSON envelope's per-sample samples field). Only medians here; the JSON carries 95 % CIs, mean, std-dev, ops/s.

WorkloadSQLRiteSQLite (WAL+NORMAL)DuckDBNotes
W1 read-by-PK3.92 µs2.09 µs~1.9× — gap closed by SQLR-23 (was ~4.8× in v1)
W2 range-10024.27 ms66.62 µs~364× — full-scan vs index range probe
W2 range-1k26.64 ms649.30 µs~41×
W2 range-10k30.73 ms7.01 ms~4.4× — converges as scan dominates
W3 bulk insert (100k/txn)606.20 ms183.96 ms~3.3× — 100k INSERT plan parsed once, not per-row (was ~6.2× in v1)
W4 single-row insert6.57 ms11.35 µs~579× ⚠️ SQLR-18
W5 mixed OLTP58.00 ms9.65 µs~6,010× ⚠️ SQLR-18
W6 index lookup4.04 µs2.56 µs~1.6× — gap closed by SQLR-23 (was ~4.2× in v1)
W7 SUM (1M rows)103.62 ms31.57 ms478.78 µsDuckDB ~66× faster than SQLite
W8 GROUP BY card-10197.32 ms366.52 ms949.75 µsDuckDB ~386× faster than SQLite
W8 GROUP BY card-1k1.380 s240.64 ms1.039 msDuckDB ~232× faster than SQLite
W8 GROUP BY card-100kskipped239.72 ms22.93 msSQLRite skipped ⚠️ SQLR-19; DuckDB ~10× faster than SQLite
W9 INNER JOIN (10k×10k)30.30 s2.16 µs484.97 µs~14M× ⚠️ SQLR-20; DuckDB ~225× slower than SQLite (analytical-engine OLTP weakness)
W10 vector top-10 (brute-force, 10k×384)120.88 mscompute-bound; modest ~13% drop vs v1
W10 vector top-10 (HNSW)2.40 ms~53× faster than v1 ⭐ — SQLR-23 + SQLR-28 unmasked the index; HNSW now ~50× faster than brute-force
W11 BM25 top-10 (1k docs)501.63 µs23.65 µs~21× — fts_match / bm25_score no longer re-parsed (was ~43× in v1)
W12 hybrid (1k docs)607.90 µsRAG headline (~15% faster than v1)

The canonical v2 run is benchmarks/results/2026-05-08-apple-ac84d560.json. It supersedes the v1 baseline (table below) end-to-end: every workload was rerun on the same canonical Apple M1 Pro host after SQLR-23 bumped WorkloadId.version from v1 → v2 in lockstep (W10 → v3 after SQLR-28 widened the HNSW probe to cosine + dot). The dirty=true flag reflects the working-tree state at run time (this doc update + the new envelope itself uncommitted); the measurements themselves only depend on the bench binary, which was built from the clean ac84d560 tip. Subsequent official runs land alongside this file with their own date / host / commit.

Historical (v1, retired)

The pre-SQLR-23 baseline from benchmarks/results/2026-05-07-apple-9ffd55a5.json, retained so the methodology shift is visible. The v1→v2 jump is not an algorithmic improvement — it's the bench-driver methodology change (per-iter inline_paramsprepare_cached + bound ? parameters; Value::Vector for HNSW-eligible KNN). Cross-version comparisons (W1.v1 vs W1.v2) are flagged in the comparison script per Q8; the compare.py v1↔v2 report walks each one.

WorkloadSQLRite (v1)SQLite (v1)DuckDB (v1)
W1 read-by-PK9.87 µs2.05 µs
W2 range-10023.99 ms60.50 µs
W2 range-1k24.92 ms585.21 µs
W2 range-10k30.15 ms6.24 ms
W3 bulk insert (100k/txn)1.029 s166.43 ms
W4 single-row insert6.76 ms9.78 µs
W5 mixed OLTP55.63 ms9.96 µs
W6 index lookup10.45 µs2.50 µs
W7 SUM (1M rows)109.47 ms31.14 ms468.74 µs
W8 GROUP BY card-10201.80 ms438.09 ms761.40 µs
W8 GROUP BY card-1k1.372 s251.13 ms871.80 µs
W8 GROUP BY card-100kskipped238.96 ms19.58 ms
W9 INNER JOIN (10k×10k)34.25 s2.23 µs699.23 µs
W10 brute-force138.66 ms
W10 HNSW126.81 ms
W11 BM25 top-10 (1k docs)1.079 ms25.03 µs
W12 hybrid (1k docs)713.53 µs

Engineering debts surfaced

Every bench-suite-found gap that exceeds the plan's "informational, not a gate" heuristic gets its own task. As of v1 of the suite:

TaskWorkloadSymptomLikely root cause
SQLR-17(CI infra)desktop-build apt-get hung 39 min on the 9.1 PRAzure-side runner / mirror flake — only act if recurs
SQLR-18W4 single-row INSERT~673× vs SQLiteBottom-up B-tree rebuild on every COMMIT (CLAUDE.md "B-tree commit strategy")
SQLR-19W8 GROUP BY 100k-cardinality~245 s/iter (skipped by default)Suspected Vec-backed group store — should be HashMap
SQLR-20W9 INNER JOIN~14M× at 10k×10k; intractable at the 100k×100k plan targetNested-loop driver doesn't push ON predicate to inner-side index probe
SQLR-21W11 / W12 corpus capFTS doc-lengths sidecar must fit in one 4 KiB page (~1,360-doc cap)Phase 8.1 — overflow chaining for posting + sidecar cells

All five are "investigation, not a release gate" — the suite ships with the gap measured + the workaround documented inline + the task linked. Each task carries a reproducer and a sketch of the fix.


Reproducing a run

# Default — lean (SQLRite + SQLite), criterion defaults
# (3s warmup, 5s measurement, 100 samples), heavy workloads
# capped at 10 samples.
make bench

# Full — adds DuckDB on Group B
make bench-duckdb

# Sharpen estimates on heavy workloads (override the per-group
# sample_size cap):
cargo bench -p sqlrite-benchmarks --bench suite -- \
    --measurement-time 30 --sample-size 30 'W3'

# Force-include SQLRite × W8/card-100k (default-skipped — SQLR-19):
SQLRITE_BENCH_W8_CARD_100K_SQLRITE=1 make bench

The aggregator picks an output filename based on host + commit short-SHA: benchmarks/results/<YYYY-MM-DD>-<host_token>-<short_sha>.json. Override with OUTPUT=path/to.json scripts/run.sh.

Local results are gitignored (benchmarks/results/.gitignore); only the pinned-host "official" runs get committed.


Comparing two runs

benchmarks/scripts/compare.py \
    benchmarks/results/2026-05-07-applem1pro-aaaaaaaa.json \
    benchmarks/results/2026-06-01-applem1pro-bbbbbbbb.json \
    --md /tmp/diff.md

Reads two JSON envelopes, matches samples by (workload, driver), computes per-workload percent change + ratio, and prints a Markdown table to stdout (or --md OUT.md). Same-version-only by Q8 — cross-version pairs land in their own "ignored" section. Cross-host pairs get a header warning; the script still runs but the numbers shouldn't be trusted as a true delta.

Pure stdlib Python — no third-party deps.


What's NOT measured (and why)

  • CPU%. Noisy on a shared machine; redundant with wall-clock for single-threaded workloads.
  • Concurrency curves. Engine is single-writer by design (Phase 4e). No concurrent-writer workload is meaningful until that changes.
  • Network I/O. All targets are in-process. Cloudflare D1 / rqlite are explicitly out of scope per the plan's viability section — they're network-dominated and would measure latency, not engine throughput.
  • libSQL. Deferred — its embedded SQL surface tracks SQLite within a few percent, so a third row-oriented OLTP comparator would mostly add noise. Worth revisiting alongside a vector-only benchmark page when sqlite-vec / libSQL native vector indexes become a useful comparison axis.
  • fsync count on macOS. /proc/self/io exists on Linux only. On macOS the equivalent would need dtrace; out of scope for v1.

See also