Full-text search + hybrid retrieval
June 11, 2026 · View on GitHub
fts is SQLRite's keyword-search feature: build an inverted index on a TEXT column with CREATE INDEX … USING fts (col), then ask BM25-relevance-ranked queries from any product surface. It composes naturally with vec_distance_cosine (Phase 7d) for hybrid retrieval — the modern RAG default that fuses lexical exact-match with semantic embedding-space proximity.
This doc is the canonical reference. For the design rationale (resolved Q1–Q10), see docs/phase-8-plan.md.
Table of contents
- What FTS does
- Quick start
- SQL surface
- Hybrid retrieval
- Tokenizer rules
- BM25 parameters
- Lifecycle: INSERT / DELETE / UPDATE
- Persistence + file format
- Optimizer hook
- How to use it from each surface
- Limitations
- See also
What FTS does
Given a TEXT column with an FTS index, you can:
- Match rows by query terms:
WHERE fts_match(body, 'rust embedded')— true if the row contains any of the query's tokens (post-tokenization). - Rank by relevance:
ORDER BY bm25_score(body, 'rust embedded') DESC LIMIT k— top-k by BM25, served from the inverted index in O(query-term-count × k log k). - Fuse with vector similarity:
ORDER BY 0.5 * bm25_score(...) + 0.5 * (1.0 - vec_distance_cosine(...)) DESC LIMIT k— hybrid retrieval via raw arithmetic.
FTS does not auto-index every TEXT column. It only kicks in for columns with an explicit CREATE INDEX … USING fts (col). Mirrors how HNSW vector indexes opt in.
Quick start
-- 1. Schema + data
CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT);
INSERT INTO docs (body) VALUES ('rust is a systems programming language');
INSERT INTO docs (body) VALUES ('sqlite is an embedded database engine');
INSERT INTO docs (body) VALUES ('postgres is a relational database server');
-- 2. Build the FTS index (any time after the column exists; the index
-- backfills from current rows and stays in lockstep on INSERT).
CREATE INDEX docs_fts ON docs USING fts (body);
-- 3a. Lexical filter — true / false predicate.
SELECT id FROM docs WHERE fts_match(body, 'database');
-- 3b. Relevance ranking — top-k.
SELECT id FROM docs
WHERE fts_match(body, 'embedded database')
ORDER BY bm25_score(body, 'embedded database') DESC
LIMIT 5;
-- 3c. Hybrid: BM25 + vector cosine, equal weights.
-- (Assumes an `embedding VECTOR(N)` column on the same table.)
SELECT id FROM docs
WHERE fts_match(body, 'embedded database')
ORDER BY 0.5 * bm25_score(body, 'embedded database')
+ 0.5 * (1.0 - vec_distance_cosine(embedding, [0.12, -0.04, /* ... */]))
DESC LIMIT 5;
For a runnable hybrid example with hand-baked embeddings, see examples/hybrid-retrieval/.
SQL surface
CREATE INDEX … USING fts (col)
CREATE INDEX <name> ON <table> USING fts (<column>);
Constraints:
- Single-column only. Multi-column FTS (
fts(title, body)) is deferred to Phase 8.1 (Q2). - TEXT columns only.
INTEGER,REAL,BOOLEAN,VECTOR,JSONcolumns are rejected with a clear error message at CREATE-INDEX time. UNIQUEis meaningless and rejected. UNIQUE on an inverted index doesn't have a sensible interpretation.- Name uniqueness spans all index kinds. A B-Tree, HNSW, or other FTS index on the same table can't share a name;
IF NOT EXISTSsilently no-ops in that case. - Backfill is automatic. Existing rows are tokenized + indexed at CREATE time. New rows added via INSERT update the index incrementally.
fts_match(col, 'q')
Boolean predicate; returns TRUE if the row has at least one query term in its tokenized representation, FALSE otherwise. Most-common shape:
WHERE fts_match(body, 'rust embedded')
Multi-token queries use any-term (OR) semantics — a row matches if any of rust / embedded is in its tokenized body. Boolean operators inside the query string (AND / OR / NOT) and phrase queries (MATCH '"exact phrase"') are deferred to Phase 8.1.
The function requires a TEXT column with an FTS index attached. Without one, it errors at evaluation time:
fts_match(body, ...): no FTS index on column 'body' (run CREATE INDEX <name> ON <table> USING fts(body) first)
The column argument may be qualified (fts_match(docs.body, 'q') or the FROM alias). The qualifier is validated the same way as any column reference (SQLR-15): it must name the table in scope, else the query errors with unknown table qualifier.
This contrasts with SQLite's MATCH operator, which is parser-level; SQLRite uses a function-call shape because sqlparser's SQLite dialect doesn't expose a BinaryOperator::Match variant we can dispatch on (Q1).
bm25_score(col, 'q')
Per-row BM25 relevance score; returns Value::Real (f64). Higher = more relevant. Use in ORDER BY for top-k retrieval:
ORDER BY bm25_score(body, 'rust embedded') DESC LIMIT 10
Notes:
DESCis the conventional direction.ASCworks (returns least-relevant first) but disables the optimizer probe — see Optimizer hook.- Same query string must appear in
WHERE fts_match(...)andORDER BY bm25_score(...). The optimizer recognizes the joint pattern; mismatched strings fall through to a slow per-row evaluation. - Same caveats as
fts_match: requires an FTS index on the column; errors clearly otherwise. Qualified column args (bm25_score(docs.body, …)) are validated against the table in scope (SQLR-15) — note the optimizer probe only matches a bare column identifier, so a qualified arg always takes the per-row evaluation path.
The math is the standard BM25+ variant (k1 = 1.5, b = 0.75, fixed at SQLite FTS5 defaults). Full formula in src/sql/fts/bm25.rs.
Hybrid retrieval
Vector-only retrieval misses keyword-grounded queries; lexical-only retrieval misses paraphrases. Hybrid combines both via SQL arithmetic:
SELECT id, title FROM docs
WHERE fts_match(body, 'rust embedded database')
ORDER BY 0.5 * bm25_score(body, 'rust embedded database')
+ 0.5 * (1.0 - vec_distance_cosine(embedding, [...]))
DESC LIMIT 10;
Two gotchas:
- Cosine returns distance, not similarity.
vec_distance_cosinereturns1 - cos(a, b)—0for identical,1for orthogonal,2for diametrically opposite. Lower is closer. Hybrid scoring assumes "higher is better", so the SQL inverts:1.0 - vec_distance_cosine(...). Forgetting this flip silently inverts the ranking. WHERE fts_matchfilters before scoring. The optimizer pre-filters rows with no lexical match, so docs the embedding would have surfaced (paraphrases with no shared tokens) never get scored. Tradeoff per Q6.
Why arithmetic and not a typed hybrid_score(...) function? Composability — different aggregations (MAX, three-way fusion, RRF) all work via the same arithmetic; a typed function would lock in one shape. See Q8.
The runnable examples/hybrid-retrieval/ walkthrough has weight-tuning sketches and a production checklist (BM25 normalization, real embeddings, cross-encoder re-rank).
Tokenizer rules
Resolves Q3: ASCII MVP. The tokenizer (src/sql/fts/tokenizer.rs) splits on [^A-Za-z0-9]+ and lowercases.
| Input | Tokens |
|---|---|
"Hello, World!" | ["hello", "world"] |
"co-op" | ["co", "op"] |
"rust2026" | ["rust2026"] (digits are alphanumeric) |
"FooBar BAZ" | ["foobar", "baz"] |
"café" | ["caf"] (non-ASCII bytes act as separators) |
"日本語" | [] (every byte is non-ASCII) |
"" | [] |
No stemming and no stop-list (Q4 + Q5). BM25's IDF naturally downweights common terms; modern RAG pipelines rely on exact lexical matches for technical retrieval ("python the language" vs "pythons the snakes").
Unicode-aware tokenization is deferred to Phase 8.1 behind a unicode cargo feature.
BM25 parameters
Fixed at SQLite FTS5's defaults for the MVP:
| Parameter | Value | Meaning |
|---|---|---|
k1 | 1.5 | Term-frequency saturation. Higher → less aggressive saturation. |
b | 0.75 | Length-normalization weight. 0 → off; 1 → fully proportional. |
Per-call overrides aren't exposed yet. See Bm25Params in src/sql/fts/bm25.rs — the public struct is shaped to grow per-call configuration without breaking signatures.
Lifecycle: INSERT / DELETE / UPDATE
| Operation | FTS-index behavior |
|---|---|
| CREATE INDEX … USING fts | Backfills from current rows; tokenizes each cell; index is ready immediately. |
| INSERT | Incremental update on the in-memory PostingList — single-row tokenize + posting append. Cheap. |
| DELETE | Marks the index needs_rebuild = true. The next save (auto-save on the next write, or explicit COMMIT) rebuilds the posting list from current rows before serializing. |
| UPDATE on the indexed TEXT column | Same as DELETE: marks dirty + rebuild on save. |
| UPDATE on a different column | No effect on FTS state. |
| DROP INDEX | Not yet supported — Phase 8.1 follow-up. Drop the table to remove the index. |
Resolves Q7: rebuild-on-save mirrors HNSW's strategy. Incremental delete + tombstone compaction is a Phase 8.1 candidate when someone hits the perf wall.
Persistence + file format
The in-memory PostingList (src/sql/fts/posting_list.rs) survives save_database → open_database round-trips byte-equivalently. Each FTS index is a B-Tree of KIND_FTS_POSTING cells parallel to the table's data tree, with a sidecar cell carrying per-doc lengths so total_docs stays honest in BM25 even after a row tokenized to zero tokens.
On-demand v4 → v5 file-format bump (Q10): existing v4 databases without FTS keep writing v4; the first save with at least one FTS index attached promotes the file to v5. Decoders accept both versions. Zero migration friction for v0.1.x users who don't use FTS.
Cell format reference: docs/file-format.md.
Optimizer hook
The executor's try_fts_probe (src/sql/executor.rs) recognizes the canonical top-k shape:
SELECT … FROM <t>
WHERE fts_match(<col>, '<q>')
ORDER BY bm25_score(<col>, '<q>') DESC
LIMIT k
When matched, the engine serves top-k directly from the inverted index in O(query-term-count × k log k) — no full-row scan. Falls through to per-row scalar evaluation if:
- ORDER BY direction is
ASC(BM25 is "higher = better"; ASC almost certainly indicates user error). - The function name isn't
bm25_score. - The query string in
WHERE fts_matchdoesn't match the one inORDER BY bm25_scoreliterally. WHEREreferences columns other than the one(s) the optimizer can prove are subsumed.
Mirrors try_hnsw_probe for vector KNN. Same WHERE-drop posture per Q6 — additional WHERE conditions on the optimizer fast path are silently skipped; the canonical FTS query has only WHERE fts_match(...) so the trade-off is rarely visible. A correctness-preserving multi-index composer is deferred to a later phase.
How to use it from each surface
The SQL surface is identical across every product. Here's how each entry point exposes it:
REPL
$ cargo run -- mydb.sqlrite
sqlrite> CREATE INDEX docs_fts ON docs USING fts (body);
sqlrite> SELECT id FROM docs WHERE fts_match(body, 'rust') ORDER BY bm25_score(body, 'rust') DESC LIMIT 5;
Rust library
use sqlrite::{Connection, Result};
let mut conn = Connection::open_in_memory()?;
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT);")?;
conn.execute("INSERT INTO docs (body) VALUES ('rust embedded database');")?;
conn.execute("CREATE INDEX docs_fts ON docs USING fts (body);")?;
let stmt = conn.prepare(
"SELECT id FROM docs WHERE fts_match(body, 'rust') \
ORDER BY bm25_score(body, 'rust') DESC LIMIT 5;"
)?;
let mut rows = stmt.query()?;
while let Some(row) = rows.next()? {
let id: i64 = row.get_by_name("id")?;
println!("{id}");
}
# Ok::<_, sqlrite::error::SQLRiteError>(())
MCP server (bm25_search tool)
LLM clients driving SQLRite over MCP get a typed shortcut that wraps the canonical SQL:
{
"name": "bm25_search",
"arguments": {
"table": "docs",
"column": "body",
"query": "rust embedded database",
"k": 5
}
}
See mcp.md for the full MCP tool catalog. Symmetric with vector_search for vector KNN.
Python / Node.js / Go / WASM SDKs
The SDKs go through the same Connection API as the Rust library — call execute() to issue CREATE INDEX … USING fts (...), call prepare() + query() for fts_match / bm25_score reads. There's no SDK-specific FTS shortcut; the SQL composes the same way everywhere.
Desktop app
Just type the SQL into the query editor. The app embeds the engine directly, so the SQL surface is identical.
Limitations
Per the Phase 8 plan's "Out of scope" section, the MVP intentionally defers:
- Multi-column FTS — single-column per index (Q2). Compose via
OR:WHERE fts_match(title, 'q') OR fts_match(body, 'q'). - Unicode tokenization — ASCII MVP per Q3. CJK / accented Latin won't be searchable until the
unicodecargo feature lands in Phase 8.1. - Stemming + stop words — none (Q4 + Q5).
- Configurable
k1/b— fixed at FTS5 defaults; per-column field-weighting deferred. - Phrase queries (
MATCH '"exact phrase"') — not in the MVP. Phrase queries need positional postings, doubling the index size. Phase 8.1 candidate. - Boolean query operators (
AND/OR/NOTinside the query string) — MVP uses any-term semantics. Boolean syntax deferred. - Highlight / snippet generation (
snippet(col, 'q')) — not in the MVP. Easy to add later if users want it. - Multi-index intersection optimizer — FTS pre-filter + scalar WHERE (Q6).
- Incremental DELETE / UPDATE — rebuild-on-save instead (Q7).
A single posting cell that exceeds page capacity (~4 KiB) errors loudly; overflow chaining is a Phase 8.1 stretch goal. This shouldn't bite real corpora — even 'the' in a million-row English corpus stays under the limit with varint encoding.
See also
docs/phase-8-plan.md— design doc with all Q1–Q10 resolutions and per-sub-phase scope.docs/file-format.md—KIND_FTS_POSTINGcell layout, on-demand v5 bump.docs/sql-engine.md—try_fts_probeoptimizer hook in the executor pipeline.docs/supported-sql.md—USING ftssyntax +fts_match/bm25_scorefunction reference.docs/mcp.md—bm25_searchMCP tool.examples/hybrid-retrieval/— runnable hybrid-retrieval walkthrough with weight-tuning sketches and production checklist.docs/ask.md— natural-language → SQL across every surface; the prompt may pick upfts_match/bm25_scoreonce 8f.1 wires them into the system rules.