Benchmarks plan
May 11, 2026
Status: approved 2026-05-06 — Q1–Q8 resolved; all six sub-phases (9.1–9.6) shipped. User-facing canonical reference: docs/benchmarks.md. All eight design questions were resolved by the project owner; see the Decisions section below for the canonical answers. Tracks task SQLR-4 / SQLR-16 (execution).
TL;DR. Stand up a small, focused benchmark suite that pits the engine against SQLite (mandatory) and DuckDB (optional, analytical-slice only) under a curated set of OLTP + analytical + AI-era workloads. Skip distributed and network-resident options (Cloudflare D1, rqlite) — they don't share SQLRite's deployment shape. Defer libSQL until we have a vector- or replication-flavored axis that justifies a third row-oriented embedded engine. Suite lives in a new top-level benchmarks/ workspace member, not built by default, runs on demand on a pinned host, emits JSON for trend tracking.
The point isn't to "win": SQLite has 25+ years of optimization behind it, so expect a 10–100× gap on most workloads at first. The point is to (a) get a baseline so future engine work has a number to move, (b) prove the differentiator workloads (HNSW, BM25, hybrid retrieval) actually deliver, and (c) ground roadmap conversations about LSM / columnar engines and JIT'd executors with evidence rather than vibes.
Why this exists
The "Possible extras" list in roadmap.md has carried "Benchmarks against SQLite" since the project was reopened. With Phase 8 done, JOINs (SQLR-5) and GROUP BY / aggregates (SQLR-3) shipped, the engine now has enough SQL surface that benchmark workloads beyond "single-row by PK" actually exercise something interesting.
Three concrete drivers:
- Decision support for the Possible extras list. Roadmap items like "Alternate storage engines (LSM/SSTable for write-heavy workloads)" and the deferred Phase 5a.2 cursor refactor are speculative without measurement. A single-row-INSERT throughput number against SQLite tells us whether the bottom-up B-tree rebuild is actually the problem worth fixing, or whether the executor's row reassembly dominates.
- Differentiator validation. Phase 7d (HNSW), Phase 8 (BM25), and Phase 8d (hybrid retrieval) shipped without comparable numbers. We have correctness tests; we don't have "this query against this corpus runs in N ms." That gap matters for users evaluating SQLRite for RAG.
- Regression detection. Once the harness exists and emits JSON, it becomes mechanical to spot a 30% slowdown in a future PR. No regression detector is wired in v1, but the JSON shape is designed to support one.
Scope philosophy
Same posture as the phase plans: stay proportional, ship in narrow sub-phases, each one independently useful.
- Curated, not exhaustive. ~10 workloads, hand-picked for the questions we want answered. Not a YCSB clone, not TPC-C — those measure things we don't ship (concurrency, real transactions across nodes).
- One driver trait, multiple engines behind it. A workload is engine-agnostic Rust code; the engine choice is a generic parameter. Adding libSQL or DuckDB is one file, not a fork.
- Read-only on shared infra. No CI runs at first (too noisy). The harness emits JSON to benchmarks/results/ keyed by host + commit, and a pinned local machine produces the publishable numbers.
- No SLOs in v1. Publish numbers, don't gate merges. Once we have ~3 dated runs we can look at adding a "fail PR if SQLRite regresses >20% on workload X" job.
Comparison-target viability
The task brief lists five candidates. Honest assessment of each:
✅ SQLite — primary target
The reference implementation. SQLRite is explicitly modeled on SQLite (see docs/architecture.md, docs/file-format.md), so apples-to-apples comparison is the whole point.
- Driver. rusqlite: Rust bindings to libsqlite3, links the C library. Mature, well-optimized, the right "is this our fault or libsqlite's fault?" baseline.
- Configuration. Run with journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY. SQLRite's WAL is mandatory + always-on, so SQLite-default (journal_mode=DELETE, full fsync per commit) is not apples-to-apples. Q3 below locks in the WAL+NORMAL profile.
- Coverage. Every workload (W1–W9 below). The differentiator workloads (W10–W12) bring in SQLite's FTS5 virtual table for BM25 and sqlite-vec for vectors as well-defined opponents.
✅ DuckDB — secondary, analytical-slice only
Embedded, in-process, single-file. Same deployment shape as SQLRite, different storage model: columnar OLAP, vectorized executor, MVCC. That divergence is what makes DuckDB interesting as a comparison.
- Driver. duckdb-rs: official Rust bindings.
- Where it's apples-to-apples. Read-only SELECTs, COUNT/SUM/AVG aggregates, GROUP BY at scale, indexed range scans on read-only data. These are workloads where SQLRite's row-store is structurally disadvantaged and the question is how disadvantaged.
- Where it's apples-to-oranges (skip). Single-row INSERTs (DuckDB's bulk-load path is heavy; per-row insert is pathological by design), UPDATE/DELETE workloads, transactional mixed OLTP, secondary-index lookups by PK on small reads. Including DuckDB on these would be misleading.
- Why include at all. Two reasons: (1) the roadmap "Possible extras" mentions LSM/SSTable as a future write-heavy storage engine — DuckDB-on-OLAP gives us a sister number for "what does a different storage model look like?"; (2) it grounds the "we are not DuckDB, we are a SQLite alternative" positioning with a measurement, not just a sentence.
- Gating. Optional --features duckdb on the bench crate. Default make bench doesn't pull DuckDB; make bench-duckdb does. Avoids a heavy dep for users who only care about the SQLite comparison.
🟡 libSQL (Turso embedded) — defer to v2 of the suite
libSQL is a fork of SQLite (still C, with extensions: native vector type, server-side filter pushdown, Hrana wire protocol for remote sync). For pure embedded use, the SQL execution path mostly tracks SQLite — same parser tree, same VM, same B-tree. Differences worth measuring only show up on:
- Native vector: libSQL has a built-in vector index. Comparing SQLRite's HNSW (Phase 7d) against libSQL's would be more interesting than against sqlite-vec, since both are "first-party" implementations rather than extensions.
- Replication-aware writes: irrelevant for embedded benchmarks.
Verdict: skip in v1. The embedded SQL surface tracks SQLite closely enough that adding libSQL as a third row-oriented OLTP driver would mostly produce noise within a few percent of the SQLite numbers. Revisit when we want to publish a vector-only benchmark page (post-9.4) and want a non-extension competitor.
❌ Cloudflare D1 — out of scope
D1 is a remote, managed SQLite-compatible service that runs at Cloudflare's edge. Every query goes over HTTP. Even single-digit-millisecond network round-trips will dominate every workload — we'd be measuring the network, not the engine.
Document as out-of-scope. Revisit only if SQLRite ever ships a remote-server mode (no such phase is on the roadmap).
❌ rqlite — out of scope
rqlite is a distributed SQLite (Raft consensus over a cluster of nodes, accessed via HTTP). Not embedded. Reads from a follower still cross HTTP; writes pay Raft consensus latency. Same network-dominates-everything problem as D1.
Document as out-of-scope. Worth revisiting only if SQLRite explores a distributed mode — interesting then as a "what does the consistency cost look like?" reference.
Workloads
Thirteen workloads in four groups (Groups A–C make up the original twelve; Group D's W13 was added in Phase 11.11b). Each one has a fixed input dataset (deterministic seed), a fixed expected result that's checked once before timing starts, and a fixed criterion configuration. Group A = OLTP baseline; Group B = SQL-feature scaling; Group C = SQLRite differentiators; Group D = concurrent writes.
Group A — OLTP baseline (vs SQLite)
| ID | Name | Shape | Why it matters |
|---|---|---|---|
| W1 | Read-by-PK | 100k-row table, prepared SELECT … WHERE id = ?, 10k random keys | Reference latency for the hottest path |
| W2 | Range scan | WHERE indexed_col BETWEEN x AND y, ranges sized 100 / 1k / 10k rows | Tests B-tree leaf walk + secondary-index path |
| W3 | Bulk insert | 100k rows in one transaction | Throughput; isolates COMMIT cost |
| W4 | Single-row insert | 1k INSERTs, each in its own implicit transaction | The fsync / WAL-commit hot path — expected gap vs SQLite |
| W5 | Mixed OLTP | YCSB-A flavor: 50/50 SELECT-by-PK / UPDATE-by-PK, 100k-row keyed table, 10k ops | Realistic-ish read+write mix |
| W6 | Index lookup | SELECT * FROM t WHERE secondary = ?, 10k probes on 100k rows | Tests secondary-index ROWID indirection |
Group B — SQL-feature scaling (vs SQLite, optionally DuckDB)
| ID | Name | Shape | Why it matters |
|---|---|---|---|
| W7 | Aggregate | SELECT SUM(x) FROM t, 1M rows | Full-scan + accumulator throughput |
| W8 | GROUP BY | SELECT k, COUNT(*) FROM t GROUP BY k, group counts of 10 / 1k / 100k | Hash aggregator behavior under cardinality pressure |
| W9 | JOIN | INNER JOIN on indexed PK/FK between two 100k-row tables | New territory after SQLR-5 — exercises the join planner |
Group C — Differentiators (SQLRite-flavored, opportunistic comparators)
| ID | Name | Shape | Comparator |
|---|---|---|---|
| W10 | Vector top-10 | 10k 384-dim vectors, cosine top-10 query, with HNSW + brute-force variants | sqlite-vec extension if installable; else SQLRite-only baseline |
| W11 | BM25 top-10 | 10k-doc corpus, top-10 BM25 query | SQLite FTS5 virtual table |
| W12 | Hybrid retrieval | 50/50 BM25 + cosine fusion (mirrors examples/hybrid-retrieval/) | SQLRite-only baseline |
For W10/W11, the goal isn't to beat the comparators — they're battle-hardened — it's to publish absolute numbers ("HNSW top-10 over 10k vectors: N ms") so users evaluating SQLRite for RAG have something concrete.
For W12, no off-the-shelf comparator exists in a single embedded engine; the number stands on its own.
Group D — Concurrent writes (Phase 11.11b, the Phase-11 MVCC differentiator)
| ID | Name | Shape | Comparator |
|---|---|---|---|
| W13 | Concurrent writers | 4 worker threads × 50 BEGIN/UPDATE/COMMIT cycles each, random rowid in 1..=1000 (≈ 0.4% collision per op), UPDATE counters SET n = n + 1 WHERE id = ? | SQLite (BEGIN IMMEDIATE + busy_timeout = 5s per-connection) |
The headline workload Phase 11's MVCC machinery was designed for. SQLRite drives BEGIN CONCURRENT across sibling Connection::connect handles minted from the same process; SQLite drives BEGIN IMMEDIATE across separate rusqlite::Connection handles serializing through the WAL write lock. Both engines run the same retry-on-busy outer loop (is_retryable_busy is engine-dispatched); only SQLRite actually exercises the retry path under this workload's shape — the contrast is the measurement.
Workload parameters live in benchmarks/src/workloads/concurrent_writers.rs as named constants (W13_PRELOAD_ROWS, W13_N_WORKERS, W13_TXS_PER_WORKER). Bumping any of them is a workload-version bump under Q8.
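The retry-on-busy outer loop that both W13 drivers share can be sketched roughly as follows. This is a minimal std-only illustration, not the harness's code: TxError, run_with_retry, and the max_retries cap are hypothetical names, and TxError::Busy stands in for whatever the driver's is_retryable_busy check classifies as retryable.

```rust
/// Outcome of one failed transaction attempt, as classified by the driver.
/// (Hypothetical type; the real harness dispatches through is_retryable_busy.)
#[derive(Debug, PartialEq)]
pub enum TxError {
    /// Write conflict or lock timeout: safe to re-run the whole transaction.
    Busy,
    /// Anything else: surface it to the caller.
    Fatal(String),
}

/// Run `attempt` (one BEGIN / UPDATE / COMMIT cycle) until it commits,
/// retrying only on `TxError::Busy` and at most `max_retries` times.
/// Returns how many retries were needed.
pub fn run_with_retry<F>(max_retries: u32, mut attempt: F) -> Result<u32, TxError>
where
    F: FnMut() -> Result<(), TxError>,
{
    let mut retries = 0;
    loop {
        match attempt() {
            Ok(()) => return Ok(retries),
            // Retryable and still under the cap: go around again.
            Err(TxError::Busy) if retries < max_retries => retries += 1,
            // Fatal error, or Busy with the retry budget exhausted.
            Err(e) => return Err(e),
        }
    }
}
```

Returning the retry count is one way to make the contrast measurable: under this workload's shape the SQLite side is expected to report zero retries because it blocks on busy_timeout, while the SQLRite side reports the conflicts it actually hit.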
Metrics
Keep tight. The task brief lists many candidates; the suite measures these:
- Latency: p50 / p95 / p99 / max. Per workload, captured via criterion's HDR-style sampling. Tail latency matters more than mean for OLTP.
- Throughput: ops/s and rows/s. Reported on bulk paths (W3, W7, W8).
- Wall-clock per workload. Sanity figure on the README table.
- Disk usage at rest. .sqlrite / .sqlite / .duckdb file size after each insert workload. One-line per-workload metric, useful for spotting page-fragmentation regressions.
- Peak RSS. Captured by wrapping the harness in /usr/bin/time -v (Linux) or /usr/bin/time -l (macOS), one number per run. Don't over-rotate on this: RSS is noisy and the engine's whole-DB-in-RAM model means it's mostly a function of dataset size.
- fsync count (W4 only). Linux: /proc/self/io. macOS: skip (no equivalent without dtrace). The point is to confirm that the gap between SQLRite and SQLite-WAL-NORMAL on W4 is fsync-shaped, not parser-shaped.
Explicitly not measured in v1:
- CPU%. Noisy on a shared machine, redundant with wall-clock for single-threaded workloads.
- Network I/O. All targets are in-process.
Updated post-Phase 11.11b: Group D's W13 (concurrent writers) lifts the single-writer caveat — SQLRite now has a real multi-writer story via BEGIN CONCURRENT, and W13 measures it directly against SQLite's single-writer baseline. Full concurrency-curve sweeps (varying N workers and collision rate) are a clean follow-up; v1 reports a single representative point.
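As an illustration of the latency and throughput figures in this section, here is a small std-only sketch that reduces raw per-op timings to p50 / p95 / p99 / max and ops/s. It uses the nearest-rank percentile definition, which is an assumption: the plan delegates sampling to criterion and does not say which percentile method the harness uses. LatencySummary, percentile, and summarize are hypothetical names.

```rust
/// Latency summary for one (workload, driver) pair, in nanoseconds.
#[derive(Debug, Clone, PartialEq)]
pub struct LatencySummary {
    pub p50_ns: u64,
    pub p95_ns: u64,
    pub p99_ns: u64,
    pub max_ns: u64,
    pub ops_per_sec: f64,
}

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
/// `pct` is a whole-number percentage in 1..=100.
pub fn percentile(sorted_ns: &[u64], pct: usize) -> u64 {
    assert!(!sorted_ns.is_empty(), "need at least one sample");
    assert!((1..=100).contains(&pct), "pct must be in 1..=100");
    // Smallest 1-based rank covering at least pct% of the samples
    // (integer ceiling of pct * n / 100).
    let rank = (pct * sorted_ns.len() + 99) / 100;
    sorted_ns[rank - 1]
}

/// Sort a copy of the samples and compute the summary.
/// Panics on an empty slice (via `percentile`).
pub fn summarize(samples_ns: &[u64]) -> LatencySummary {
    let mut sorted = samples_ns.to_vec();
    sorted.sort_unstable();
    let total_ns: u64 = sorted.iter().sum();
    LatencySummary {
        p50_ns: percentile(&sorted, 50),
        p95_ns: percentile(&sorted, 95),
        p99_ns: percentile(&sorted, 99),
        max_ns: sorted[sorted.len() - 1],
        // Assumes the ops ran back to back on one thread, so the sum of
        // per-op latencies approximates the elapsed time.
        ops_per_sec: sorted.len() as f64 / (total_ns as f64 / 1e9),
    }
}
```

With only 30 samples per run, p99 and max land on the same sample under this definition, which is one reason to treat the tail figures as indicative until sample counts grow.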
Methodology
Tooling
- Harness: criterion = "0.5". Settings: 3 s warm-up, 5 s measurement, 30 samples (criterion's own default sample size is 100, so the 30 has to be set explicitly), statistical confidence intervals. Outputs HTML reports + JSON.
- Drivers: Rust crates only (rusqlite, duckdb-rs, sqlrite::Connection). Keeps the harness language-uniform; no cross-language harness drift.
- Process isolation: each criterion bench_function opens a fresh DB file in a per-run TempDir. Cleanup on drop. Prevents page-cache / WAL-state bleed across runs.
- Data generation: deterministic seed (StdRng::seed_from_u64(42)). Datasets generated once into benchmarks/data/ and reused across runs to keep generation cost out of timing.
- Correctness gate: every workload returns a result hash; the hash is verified against a fixed expected value before any timing runs. Wrong answers fast = not a win, and we want to keep bugs that change query results from showing up as a "speedup."
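The correctness gate described in the list above could look something like the sketch below. The plan does not say which hash the harness uses; FNV-1a over the stringified cells is an assumption chosen because it fits in a few lines of std-only Rust, and result_hash and correctness_gate are hypothetical names.

```rust
// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Fold `bytes` into a running FNV-1a hash.
fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hash a result set row by row. Separator bytes keep ["ab"] and
/// ["a", "b"] from hashing to the same value.
pub fn result_hash(rows: &[Vec<String>]) -> u64 {
    let mut hash = FNV_OFFSET;
    for row in rows {
        for cell in row {
            hash = fnv1a(hash, cell.as_bytes());
            hash = fnv1a(hash, &[0x1f]); // unit separator after each cell
        }
        hash = fnv1a(hash, &[0x1e]); // record separator after each row
    }
    hash
}

/// Fail before any timing starts if the result set does not match
/// the expected hash recorded for this workload.
pub fn correctness_gate(rows: &[Vec<String>], expected: u64) -> Result<(), String> {
    let actual = result_hash(rows);
    if actual == expected {
        Ok(())
    } else {
        Err(format!(
            "result hash mismatch: expected {expected:#018x}, got {actual:#018x}"
        ))
    }
}
```

This hash is order-sensitive, so a workload whose query has no ORDER BY would need its rows sorted before hashing for the gate to be comparable across engines.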
SQLite tuning
Locked-in profile (see Q3):
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA temp_store = MEMORY;
PRAGMA cache_size = -65536; -- 64 MB
Rationale: SQLRite's WAL is mandatory + checkpointer is always on; SQLRite's commit fsync semantics map most closely to SQLite's synchronous=NORMAL (fsync on checkpoint, not on every commit). Comparing against SQLite-default (DELETE + full sync) would flatter SQLRite by measuring SQLite's most paranoid mode against SQLRite's only mode. We can publish a SQLite-default column too once the harness exists, but the headline number uses NORMAL.
Hardware + reproducibility
- v1 host: owner's M-series MBP. Specs captured into the JSON envelope (CPU model, RAM, OS, kernel). Background processes quiesced via [criterion's nocapture notes].
- CI: off. Adds noise without value at this stage. The JSON output format is designed so that a future "regression detector" job can compare two runs from the same host.
- Run command: make bench from repo root → invokes cargo bench -p sqlrite-benchmarks. make bench-duckdb adds the --features duckdb axis.
Repository layout
New top-level benchmarks/ workspace member. CI skips it via an explicit --exclude sqlrite-benchmarks on every cargo build / cargo test / cargo clippy / cargo doc invocation (the same pattern that hides sqlrite-desktop, sqlrite-python, and sqlrite-nodejs). Run locally with make bench.
benchmarks/
├── Cargo.toml                    # depends on rusqlite (default), duckdb (optional), criterion
├── README.md                     # how to run, results table, host pinning notes
├── src/
│   ├── lib.rs                    # Driver trait, common helpers
│   ├── data.rs                   # deterministic dataset generators (seeded)
│   ├── drivers/
│   │   ├── sqlrite.rs
│   │   ├── sqlite.rs
│   │   └── duckdb.rs             # feature-gated
│   └── workloads/
│       ├── kv.rs                 # W1
│       ├── range_scan.rs         # W2
│       ├── bulk_insert.rs        # W3
│       ├── single_insert.rs      # W4
│       ├── mixed_oltp.rs         # W5
│       ├── index_lookup.rs       # W6
│       ├── aggregate.rs          # W7
│       ├── group_by.rs           # W8
│       ├── join.rs               # W9
│       ├── vector.rs             # W10
│       ├── fts.rs                # W11
│       ├── hybrid.rs             # W12
│       └── concurrent_writers.rs # W13 (Phase 11.11b)
├── benches/
│   └── suite.rs                  # single criterion entry point that fans out
├── scripts/
│   ├── run.sh                    # pipeline + JSON capture into results/
│   └── compare.py                # render JSON → markdown table
└── results/
    └── 2026-MM-DD-<host>-<commit>.json
The Driver trait carries just enough surface to express every workload:
pub trait Driver {
    type Conn;
    fn name(&self) -> &'static str;
    fn open(&self, path: &Path) -> anyhow::Result<Self::Conn>;
    fn execute(&self, conn: &mut Self::Conn, sql: &str) -> anyhow::Result<()>;
    fn execute_with_params(&self, conn: &mut Self::Conn, sql: &str, params: &[Value]) -> anyhow::Result<()>;
    fn query_one(&self, conn: &mut Self::Conn, sql: &str, params: &[Value]) -> anyhow::Result<Vec<Value>>;
    fn query_all(&self, conn: &mut Self::Conn, sql: &str, params: &[Value]) -> anyhow::Result<Vec<Vec<Value>>>;
}
Workloads are generic over D: Driver; the criterion entry point fans the same workload across (SQLRiteDriver, SQLiteDriver, [DuckDBDriver]) and emits one bench per (workload, driver) pair.
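To show how one workload can fan out across engines through a generic parameter, here is a deliberately pared-down, std-only sketch. It is not the plan's Driver trait: KvDriver, BTreeDriver, VecDriver, and read_by_pk are hypothetical stand-ins, in-memory maps replace real database connections, and String replaces anyhow for errors so the example has no dependencies.

```rust
use std::collections::BTreeMap;

/// Pared-down stand-in for the plan's Driver trait (no SQL, no files).
pub trait KvDriver {
    type Conn;
    fn name(&self) -> &'static str;
    fn open(&self) -> Result<Self::Conn, String>;
    fn insert(&self, conn: &mut Self::Conn, id: u64, value: i64) -> Result<(), String>;
    fn get(&self, conn: &mut Self::Conn, id: u64) -> Result<Option<i64>, String>;
}

/// "Engine" backed by an ordered map (logarithmic lookups).
pub struct BTreeDriver;

impl KvDriver for BTreeDriver {
    type Conn = BTreeMap<u64, i64>;
    fn name(&self) -> &'static str { "btree" }
    fn open(&self) -> Result<Self::Conn, String> { Ok(BTreeMap::new()) }
    fn insert(&self, conn: &mut Self::Conn, id: u64, value: i64) -> Result<(), String> {
        conn.insert(id, value);
        Ok(())
    }
    fn get(&self, conn: &mut Self::Conn, id: u64) -> Result<Option<i64>, String> {
        Ok(conn.get(&id).copied())
    }
}

/// "Engine" backed by an unsorted vector (linear-scan lookups).
pub struct VecDriver;

impl KvDriver for VecDriver {
    type Conn = Vec<(u64, i64)>;
    fn name(&self) -> &'static str { "vec" }
    fn open(&self) -> Result<Self::Conn, String> { Ok(Vec::new()) }
    fn insert(&self, conn: &mut Self::Conn, id: u64, value: i64) -> Result<(), String> {
        conn.push((id, value));
        Ok(())
    }
    fn get(&self, conn: &mut Self::Conn, id: u64) -> Result<Option<i64>, String> {
        Ok(conn.iter().find(|(k, _)| *k == id).map(|(_, v)| *v))
    }
}

/// W1-shaped workload: load `n_rows` rows, probe the given keys, and return
/// the sum of the values read so a correctness gate can compare engines.
pub fn read_by_pk<D: KvDriver>(driver: &D, n_rows: u64, probes: &[u64]) -> Result<i64, String> {
    let mut conn = driver.open()?;
    for id in 0..n_rows {
        driver.insert(&mut conn, id, id as i64 * 2)?;
    }
    let mut sum = 0i64;
    for &key in probes {
        let value = driver
            .get(&mut conn, key)?
            .ok_or_else(|| format!("{}: missing key {}", driver.name(), key))?;
        sum += value;
    }
    Ok(sum)
}
```

The workload body never names a concrete engine, so adding a third driver means adding one more impl block, which is the "one file, not a fork" property the scope philosophy asks for.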
Sub-phases
Each ships as its own PR, runs the full existing test suite green, and adds one row to benchmarks/README.md's results table.
9.1 — Harness scaffolding (~400 LOC + tests)
- benchmarks/ crate skeleton, Driver trait, data.rs seeded generators.
- Two drivers: SQLRite (via sqlrite::Connection) + SQLite (via rusqlite).
- One workload end-to-end: W1 (read-by-PK). Proves the harness shape.
- Lock in JSON output schema (workload, driver, p50/p95/p99/max, ops/s, dataset size, commit, host fingerprint).
- make bench target.
Exit criterion: make bench produces a JSON file under benchmarks/results/, benchmarks/README.md shows a 2-row table for W1.
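One possible shape for a row of the JSON output, built from the field list in 9.1 plus the workload_version key decided under Q8, is sketched below. The exact key names, the nanosecond units, and the BenchRow / results_file_name helpers are assumptions for illustration; the real schema is whatever the harness locked in. The serializer is hand-rolled only to keep the example dependency-free.

```rust
/// One (workload, driver) result row. Field names are hypothetical.
pub struct BenchRow {
    pub workload: String,         // e.g. "W1"
    pub workload_version: String, // e.g. "v1"
    pub driver: String,
    pub p50_ns: u64,
    pub p95_ns: u64,
    pub p99_ns: u64,
    pub max_ns: u64,
    pub ops_per_sec: f64,
    pub dataset_rows: u64,
    pub commit: String,
    pub host: String,
}

/// Escape the characters that would break a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl BenchRow {
    /// Render the row as a single JSON object.
    /// Assumes ops_per_sec is finite (NaN / infinity are not valid JSON).
    pub fn to_json(&self) -> String {
        format!(
            "{{\"workload\":\"{}\",\"workload_version\":\"{}\",\"driver\":\"{}\",\
             \"p50_ns\":{},\"p95_ns\":{},\"p99_ns\":{},\"max_ns\":{},\
             \"ops_per_sec\":{:.1},\"dataset_rows\":{},\"commit\":\"{}\",\"host\":\"{}\"}}",
            json_escape(&self.workload),
            json_escape(&self.workload_version),
            json_escape(&self.driver),
            self.p50_ns,
            self.p95_ns,
            self.p99_ns,
            self.max_ns,
            self.ops_per_sec,
            self.dataset_rows,
            json_escape(&self.commit),
            json_escape(&self.host),
        )
    }
}

/// File name following the results/ pattern: <date>-<host>-<commit>.json
pub fn results_file_name(date: &str, host: &str, commit: &str) -> String {
    format!("{date}-{host}-{commit}.json")
}
```

Keeping commit and host on every row, rather than only in the file name, means a row stays self-describing if results from several files are later concatenated for trend tracking.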
9.2 — Group A workloads (~400 LOC + tests)
W2–W6. Each workload is one file in src/workloads/, plus one entry in benches/suite.rs. No engine changes — workloads compose only existing public API.
Exit criterion: all 6 Group A rows in the results table; if W4 (single-row insert) shows >100× gap, file a follow-up to investigate the commit path before moving on. (Investigation, not a gate — the gap is informational.)
9.3 — Group B workloads (~200 LOC + tests)
W7–W9. Aggregates / GROUP BY / JOIN — exercises SQLR-3 and SQLR-5 surface.
Exit criterion: all 3 Group B rows. JOIN performance vs SQLite is the most informative number here — SQLite has 25 years of join-planner tuning that SQLRite skipped; the magnitude of the gap is itself a roadmap input.
9.4 — Group C differentiators (~300 LOC + tests)
W10–W12. Vector top-10 (with sqlite-vec if installable), BM25 (vs SQLite FTS5), hybrid (SQLRite-only).
Exit criterion: absolute latency numbers published in the README. These are the headline numbers for the "SQLRite for RAG" pitch.
9.5 — DuckDB driver (optional) (~150 LOC)
Add the duckdb-rs driver under a --features duckdb flag. Wire only into Group B workloads (W7–W9) per the viability section. make bench-duckdb runs the extended suite.
Exit criterion: Group B table grows a third column. If a workload is misleading on DuckDB (e.g. DuckDB needs CHECKPOINT semantics that don't translate), document and skip rather than publish bad numbers.
9.6 — Reporting + first published run (~50 LOC + docs)
- scripts/compare.py renders any two JSONs into a Markdown diff table.
- First "official" pinned-host run committed under benchmarks/results/.
- docs/benchmarks.md becomes the canonical reference (mirrors how docs/fts.md is the canonical FTS reference for Phase 8). Cross-links from README.md "Roadmap" section and docs/_index.md.
Exit criterion: docs/benchmarks.md exists, the README has a "Benchmarks" section pointing at it, the first dated results JSON is committed.
9.7 — Group D concurrent writers (Phase 11.11b, shipped)
Adds W13 (concurrent writers, mostly-disjoint rows) under a new Group D. The Driver trait grows three optional methods (connect_sibling, concurrent_begin_sql, is_retryable_busy) with defaults that make sense for engines without an MVCC story; SQLRite overrides all three. SQLite gains a busy_timeout = 5s pragma at open so its BEGIN IMMEDIATE blocks rather than fails on contention. The workload lives in benchmarks/src/workloads/concurrent_writers.rs.
Exit criterion: W13 runs under both drivers, correctness gate passes (SUM(n) == n_workers * txs_per_worker after a sample), and the JSON envelope picks up W13.v1 rows for both drivers.
Post-9.7 ideas (parked)
- libSQL driver if/when we want a non-extension vector competitor for W10.
- Per-PR regression detector. A GitHub Action that runs the bench on a self-hosted runner and posts a comment if any workload regresses >20% from the last main baseline.
- Concurrency curves for W13. Sweep N workers (1, 2, 4, 8, 16) and K rows (10, 100, 1k, 10k) to chart SQLRite-MVCC's scaling envelope vs SQLite's serial baseline. v1 reports a single representative point; the sweep is a clean follow-up.
- W13b hot-row contention. Same workload, K = 10 rows instead of 1000: collision probability climbs to ~40% per op, exercising the retry loop hard. Useful for stressing the GC + retry path under adversarial contention.
- Larger datasets (10M, 100M). v1 is sized for fast iteration on a laptop. A "release-blocker run" config could 100× the row counts.
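The plan does not say how its collision figures were derived, but both quoted numbers (≈ 0.4% at K = 1000 for W13, ~40% at K = 10 for W13b) are reproduced by the simple estimate N / K with N = 4 workers. The sketch below computes that estimate over the proposed sweep grid, next to a slightly stricter variant (the chance that at least one of the other N − 1 in-flight transactions picked the same row). Both formulas assume independent, uniformly random row picks with every worker always mid-transaction; the function names are hypothetical.

```rust
/// Rough per-op collision estimate: N workers in flight over K rows.
/// Clamped to 1.0 because the ratio exceeds 1 when N > K.
pub fn collision_estimate(n_workers: u32, k_rows: u32) -> f64 {
    assert!(k_rows >= 1);
    (f64::from(n_workers) / f64::from(k_rows)).min(1.0)
}

/// Probability that at least one of the other (N - 1) in-flight
/// transactions picked the same row as this one.
pub fn collision_at_least_one(n_workers: u32, k_rows: u32) -> f64 {
    assert!(n_workers >= 1 && k_rows >= 1);
    let miss_one = 1.0 - 1.0 / f64::from(k_rows);
    1.0 - miss_one.powi(n_workers as i32 - 1)
}

/// The sweep grid from the "Concurrency curves" idea:
/// (n_workers, k_rows, N/K estimate) for every combination.
pub fn sweep() -> Vec<(u32, u32, f64)> {
    let mut out = Vec::new();
    for &n in &[1u32, 2, 4, 8, 16] {
        for &k in &[10u32, 100, 1_000, 10_000] {
            out.push((n, k, collision_estimate(n, k)));
        }
    }
    out
}
```

The two estimates agree closely when K is large and diverge on hot-row shapes like W13b, so a published sweep would do well to state which definition it reports.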
Total scope estimate
~1.5 kLOC of new Rust + ~150 lines of Python + a Makefile target + the canonical docs/benchmarks.md reference. Comparable in scope to Phase 8 (~1.2 kLOC across 6 sub-phases). No engine changes: pure additive testing infrastructure.
Decisions (was: open questions)
Q1–Q8 were resolved by the project owner on 2026-05-06. Each question keeps its original options + recommendation as a record of the rationale; the Decided: line at the top is the canonical answer the implementation should follow.
Q1. Bench harness host
Decided: pinned local M-series MBP for v1. Bare-metal hosts (Hetzner / Equinix) revisited post-9.6 once we're publishing numbers externally and want stability across runs. The JSON envelope already captures CPU model / RAM / OS / kernel so a future host swap is documentable, not a silent break.
Pinned local M-series MBP for v1, or rent a bare-metal box (Hetzner / Equinix) for stable numbers from day one?
Recommendation: local laptop. Cheaper, faster iteration, "developer wall-clock" is the right unit at this stage. Switch to bare-metal when we want to publish numbers externally (post-9.6).
Q2. SLO thresholds
Decided: no PR-gating in v1. The harness publishes numbers; it does not fail PRs. Tracked as a post-9.6 follow-up — needs ~3+ same-host baseline runs before any threshold (e.g. "regress >20% on workload X") is anything but noise-fitting.
Should the bench harness gate PRs ("fail if SQLRite regresses >X% on workload Y")?
Recommendation: no for v1. Need ~3+ baseline runs to know what "noise" looks like before setting a threshold. Document as a Post-9.6 idea.
Q3. SQLite tuning
Decided: tuned (WAL + synchronous=NORMAL) is the headline. SQLite-default (journal_mode=DELETE + synchronous=FULL) is the secondary column, opt-in via the harness, not the default make bench axis. Rationale stays in docs/benchmarks.md so anyone reading "SQLite Y ms vs SQLRite Z ms" sees up front that both engines are durability-comparable, not "SQLRite vs SQLite's most paranoid mode."
Compare against SQLite default settings (journal_mode=DELETE, synchronous=FULL) or tuned (WAL + synchronous=NORMAL)?
Recommendation: tuned (WAL+NORMAL) as the headline number, with a note in benchmarks.md explaining why. Optionally publish a "SQLite-default" column too, since some users compare against the default. Apples-to-apples is the goal — SQLRite has no synchronous=FULL mode to opt into.
Q4. DuckDB inclusion
Decided: opt-in via --features duckdb on the bench crate. make bench stays lean (rusqlite + sqlrite only); make bench-duckdb pulls the heavy dep and runs only Group B (W7–W9), per the viability section.
Hard-out, opt-in feature, or default-on?
Recommendation: opt-in via --features duckdb. Heavy dep, only useful on Group B. make bench stays lean.
Q5. libSQL
Decided: punted to post-9.6. Embedded libSQL tracks SQLite within a few percent on the OLTP path; not enough signal to justify a third row-oriented driver in v1. Revisit alongside any "vector-only" benchmark page (post-9.4) where a non-extension vector competitor would be informative.
Add as a +1 driver in 9.5 alongside DuckDB, or punt to post-9.6?
Recommendation: punt. The OLTP numbers will track SQLite within a few percent and add noise without insight. Worth adding when we want a non-extension vector competitor on W10.
Q6. D1 / rqlite
Decided: out of scope. Both are network-resident; round-trip latency dominates every workload. Out-of-scope rationale already lives in the viability section; no driver work, no follow-up. Revisit only if SQLRite ever ships a remote-server / distributed mode.
Already proposed out-of-scope in the viability section. Confirming by Q.
Recommendation: out. Document the rationale, don't burn cycles.
Q7. Where to publish
Decided: in-repo. Canonical reference at docs/benchmarks.md (lands in 9.6); raw JSON committed under benchmarks/results/ keyed by date + host + commit; cross-link from README.md and docs/_index.md. A standalone docs site can grow out of this if/when demand appears, but versioned-with-the-code is the right v1 default.
In-repo Markdown (docs/benchmarks.md + raw JSON in benchmarks/results/), or a separate docs site?
Recommendation: in-repo. Versioned with the code, no extra infra, clickable from the README. A separate site can grow out of this if there's demand.
Q8. Workload shape changes mid-suite
Decided: workloads carry an explicit version (W1.v1, W1.v2, …). The JSON output schema includes workload_version per row; the comparison script only diffs same-version pairs and warns on cross-version compares. Bumping the version is the explicit "we changed the benchmark" gesture; old JSON files remain readable forever.
If we add a column or change a query in a workload between releases, how do we keep historical comparison meaningful?
Recommendation: workloads are versioned (W1.v1, W1.v2). Old JSON keeps the old workload-version key; results page only compares same-version runs. Cheap, opt-in, avoids "we silently changed the benchmark" mistakes.
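The same-version rule can be sketched as a small comparison routine. The plan's actual script is Python (scripts/compare.py); this version is written in Rust only to stay in one language with the rest of the document's code, and Row, Diff, and compare are hypothetical names. Comparing on p50 alone is also a simplification.

```rust
/// The subset of a result row that the comparison needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub workload: String,
    pub workload_version: String,
    pub driver: String,
    pub p50_ns: u64,
}

/// Result of comparing one (workload, driver) pair across two runs.
#[derive(Debug, PartialEq)]
pub enum Diff {
    /// Same workload version: percentage change in p50 (positive = slower).
    Change { workload: String, driver: String, pct: f64 },
    /// Versions differ: report a warning instead of a misleading number.
    VersionMismatch { workload: String, driver: String, old: String, new: String },
}

/// Compare `new` against `old`, pairing rows by (workload, driver).
/// Rows in `new` with no baseline in `old` are skipped.
/// Assumes baseline p50 values are non-zero.
pub fn compare(old: &[Row], new: &[Row]) -> Vec<Diff> {
    let mut out = Vec::new();
    for n in new {
        let baseline = old
            .iter()
            .find(|o| o.workload == n.workload && o.driver == n.driver);
        if let Some(o) = baseline {
            if o.workload_version != n.workload_version {
                out.push(Diff::VersionMismatch {
                    workload: n.workload.clone(),
                    driver: n.driver.clone(),
                    old: o.workload_version.clone(),
                    new: n.workload_version.clone(),
                });
            } else {
                let pct = (n.p50_ns as f64 - o.p50_ns as f64) * 100.0 / o.p50_ns as f64;
                out.push(Diff::Change {
                    workload: n.workload.clone(),
                    driver: n.driver.clone(),
                    pct,
                });
            }
        }
    }
    out
}
```

Emitting an explicit VersionMismatch entry, rather than silently dropping the pair, keeps the "we changed the benchmark" event visible in the rendered table.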
Risks + things to watch
- Driver bias. A poorly-written SQLite driver call (e.g. forgetting prepare_cached) makes SQLRite look 5× better than it is. Mitigation: code review every driver impl with the question "is this how a perf-conscious user of the engine would write it?", and the correctness gate (hash-matching) catches divergent semantics. Update (SQLR-23): SQLRite gained prepare_cached + parameter binding, and the bench harness's SQLRite driver was flipped from per-call SQL formatting (inline_params) to the bound + cached path. Every workload's WorkloadId.version was bumped v1 → v2 in lockstep so this methodology change is captured explicitly. Old v1 JSON envelopes stay readable but the comparison script flags cross-version pairs.
- Criterion overhead in micro-workloads. For W1 (sub-microsecond per op territory after warmup), criterion's per-iter accounting can dominate. Mitigation: batch iterations inside the bench closure (criterion's iter_batched + a 1k-iteration inner loop), report ops/s computed against the inner-loop count.
- sqlite-vec availability. The extension isn't shipped with stock SQLite. W10 should treat the SQLite vector comparator as opportunistic: if not installed, run SQLRite-only and note it in the table. Don't make Group C hard-depend on it.
- macOS vs Linux skew. fsync semantics differ; W4 numbers won't be portable across OSes. Mitigation: JSON envelope captures os.kind, results page only compares within the same OS family.
- Future format-version bumps. Workloads write .sqlrite files. A future on-disk format change (e.g. v5→v6) means historical results files reference databases the engine can't reopen. Mitigation: results JSON only stores numbers + dataset spec, never the DB file. Datasets are always regenerated from seed.
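The inner-loop accounting behind the criterion-overhead mitigation can be illustrated without criterion itself. The sketch below times a batch of 1k calls as one sample and derives ops/s from the inner-loop count; it uses only std::time::Instant and is not the harness's iter_batched wiring. INNER_ITERS, time_batch, and ops_per_sec are hypothetical names.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Number of operations folded into one timed sample.
pub const INNER_ITERS: u64 = 1_000;

/// Time `INNER_ITERS` back-to-back calls of `op` as a single sample, so
/// per-sample bookkeeping is amortized over many sub-microsecond ops.
pub fn time_batch<F: FnMut() -> u64>(mut op: F) -> Duration {
    let start = Instant::now();
    for _ in 0..INNER_ITERS {
        // black_box keeps the optimizer from discarding the op's result.
        black_box(op());
    }
    start.elapsed()
}

/// Throughput computed against the inner-loop count, not the sample count.
pub fn ops_per_sec(total_elapsed: Duration, batches: u64) -> f64 {
    (batches * INNER_ITERS) as f64 / total_elapsed.as_secs_f64()
}
```

Dividing by the number of samples instead of samples × INNER_ITERS would understate throughput by a factor of 1,000, which is the reporting mistake the mitigation guards against.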
See also
- roadmap.md: Phase 8 closeout + "Possible extras" entry that this plan replaces.
- docs/architecture.md: engine layer map; benchmarks bind to the public Connection surface only.
- docs/fts.md, examples/hybrid-retrieval/: input shapes for W11, W12.
- docs/phase-7-plan.md, docs/phase-8-plan.md: plan-doc shape this document mirrors.