Storage model
April 23, 2026 · View on GitHub
How data lives in memory while a database is open. For how it's laid out on disk, see file-format.md.
The canonical code is src/sql/db/table.rs.
The big picture
Database
├── db_name: "users.sqlrite"
├── source_path: Some("/path/to/users.sqlrite")
├── pager: Some(Pager)
└── tables: HashMap<String, Table>
"users" -> Table
"notes" -> Table
...
Table
├── tb_name: "users"
├── columns: Vec<Column> ordered, schema
├── rows: Rc<RefCell<HashMap<String, Row>>>
│ keyed by column name
│ value = one column's full data
├── indexes: HashMap<String, String> (placeholder; not used yet)
├── last_rowid: i64
└── primary_key: "id" | "-1" "-1" = table has no explicit PK
Column
├── column_name: "name"
├── datatype: DataType::Text
├── is_pk: false
├── not_null: true
├── is_unique: true
├── is_indexed: true
└── index: Index::Text(BTreeMap<String, i64>)
Row (stored per column inside Table.rows)
├── Integer(BTreeMap<rowid: i64, value: i32>)
├── Text(BTreeMap<rowid: i64, value: String>)
├── Real(BTreeMap<rowid: i64, value: f32>)
├── Bool(BTreeMap<rowid: i64, value: bool>)
└── None
Column-oriented layout
The surprising thing is that Table.rows is not keyed by rowid. It's keyed by column name, and each value is a Row enum holding a BTreeMap<rowid, value> for that one column.
In other words, to read "what's at row 5?" you have to ask each column's BTreeMap for key 5:
users.rows["id"] = Row::Integer({5 => 5})
users.rows["name"] = Row::Text({5 => "alice"})
users.rows["age"] = Row::Integer({5 => 30})
All columns share the same keyset. Insertion pads missing columns with NULL-ish values (see below), so every rowid has an entry in every column's map.
Why column-oriented?
Historical rather than principled. The first pass of the codebase in 2020 used this shape, probably because it made per-column index maintenance straightforward. Phase 3c will replace it with row-oriented cells as part of moving to real B-Tree storage.
Consequences
- Projecting a column (e.g.,
SELECT name FROM users) is one BTreeMap scan. Efficient. - Reassembling a row (
SELECT *) requires N map lookups for N columns. Each lookup is O(log R), so row reconstruction is O(C log R). Not terrible for small N but not ideal. - Deleting a row means removing the rowid from every column's BTreeMap and every index's BTreeMap. Handled by
Table::delete_row.
ROWID
Every row has a rowid: i64. It's the implicit primary key when no explicit INTEGER PRIMARY KEY is declared, and it's the alias for the PK when one is. Table.last_rowid tracks the most recent assignment and is bumped on each insert.
When the user omits an INTEGER PRIMARY KEY column in an INSERT, insert_row auto-assigns last_rowid + 1. When the user supplies an explicit value, that value becomes the new rowid (and last_rowid advances to it, if it's larger).
Non-integer PRIMARY KEY columns are unusual; the code handles them but auto-assign only kicks in for the integer case.
Indexes
Each Column carries an Index:
enum Index {
Integer(BTreeMap<i32, i64>), // value -> rowid
Text(BTreeMap<String, i64>), // value -> rowid
None, // not indexed
}
Indexes exist only on columns marked PRIMARY KEY or UNIQUE, and only for Integer and Text types today. Real and Bool columns get Index::None even if declared unique, and UNIQUE enforcement falls back to a linear scan.
The index is maintained inline with every insert, update, and delete:
insert_rowinserts into the column's index after successfully inserting into the Row map.set_valueremoves the old index entry (aretain-based scan, since we look up by rowid) then inserts the new one.delete_rowstrips every index entry pointing at the rowid being deleted.
Indexes are also used on write paths for UNIQUE-constraint checks — validate_unique_constraint (called before every INSERT) does O(log R) contains_key lookups.
They are not yet used on read paths. Today's SELECT always does a full table scan via Table::rowids. A planner that can turn WHERE id = 5 into an index probe is Phase 3+ work.
NULL handling
Storage for NULL is inconsistent by type, which is a known wart:
- Text columns can store NULL by encoding the literal string
"Null"in the BTreeMap. Reads special-case this back toValue::NullinRow::get. Consequence: a user who actually inserts the string'Null'into a Text column will read backNULL. Acceptable for now, will be cleaned up in Phase 3c. - Integer / Real / Bool columns can't store NULL. If a user omits such a column in INSERT,
insert_rowreturns aType mismatcherror. This is stricter than SQL (which allows NULL by default unlessNOT NULLis declared) but safer than the old behavior of panicking on"Null".parse::<i32>().
A proper NULL-bitmap mechanism is on the Phase 3c to-do list.
Runtime Value vs storage Row
When the executor evaluates an expression, it works with Value — a runtime enum with wider variants:
pub enum Value {
Integer(i64),
Text(String),
Real(f64),
Bool(bool),
Null,
}
Conversion:
Row::get(rowid) → Option<Value>widens storagei32toValue::Integer(i64),f32toValue::Real(f64).Table::set_value(col, rowid, Value)narrows back withascasts. The declared column type enforces what's allowed (e.g., aValue::Textinto an Integer column is a type error, not a silent corruption).
This split keeps the executor type-agnostic — it just uses Value arithmetic — while storage stays compact.
Lifecycle: write paths
INSERT
executor::execute_insert(inline in theStatement::Insertarm): delegate toInsertQuery::newto get(table_name, columns, rows).- Look up the table via
db.get_table_mut. - For each row of values:
a. Check every value's column exists on the table.
b. Call
Table::validate_unique_constraint— uses the column indexes for O(log R) lookups. c. CallTable::insert_row— auto-assigns the PK if missing, then writes every column's BTreeMap + every index's BTreeMap in lockstep.
UPDATE
executor::execute_update: resolve assignment targets to column names, verify they exist.- Two passes to avoid borrow-checker fights:
- Read pass (
&db): walk every rowid, evaluate WHERE, for matching rows evaluate the assignment RHS expressions, collect planned writes. - Write pass (
&mut db): for each(rowid, [(col, value)]), callTable::set_value.
- Read pass (
set_valueenforces the column's declared type and UNIQUE constraint before mutation. It refreshes the index (old entry removed, new entry inserted).
DELETE
executor::execute_delete: same two-pass pattern.- Read pass collects matching rowids.
- Write pass calls
Table::delete_row(rowid)for each.
CREATE TABLE
CreateQuery::newproduces aParsedColumnlist with type + constraints.Table::newbuilds theTablestruct: allocates an emptyBTreeMapper column inrows, buildsColumnstructs including emptyIndexes on UNIQUE/PK columns.- Top-level dispatcher inserts the table into
db.tables.
Lifecycle: read paths
SELECT
executor::execute_selectlooks up the table.- Resolves projection to a concrete ordered column list.
- Walks every rowid from
Table::rowids(grabbed from the first column's BTreeMap, since all columns share the same keyset). - For each rowid:
- If a WHERE clause exists, evaluate
eval_predicateagainst the row's data. - If matched, keep the rowid for later.
- If a WHERE clause exists, evaluate
- Sort the matched rowids if ORDER BY is present.
- Truncate to LIMIT.
- Render as a prettytable string, column by column via
Table::get_value.
Full table scan, every time. Index-backed lookups are Phase 3+.
What Phase 3c changed — and what it didn't
Changed (on-disk). Rows are no longer stored as one opaque bincode blob per table. Each row is serialized as a cell (length-prefixed, kind-tagged, null-bitmap + typed value blocks) and packed into TableLeaf pages behind a slot directory. Cells that don't fit spill into an overflow chain. The schema catalog is itself a table — sqlrite_master — stored in the same cell format. File format version is now 2. See file-format.md for the byte-level details.
Didn't change (in-memory). The Database / Table / Column / Row / Index structures described in the sections above are still the runtime representation. On save_database, Table::extract_row turns each in-memory row into a Cell; on open_database, each Cell is decoded and its values are handed to Table::restore_row, which populates the per-column BTreeMaps and rebuilds the indexes.
What Phase 3d changed
Phase 3d put a real B-Tree on top of the leaf layer. The on-disk file now has InteriorNode pages above TableLeaf pages; find_leftmost_leaf during open descends the tree, then the existing sibling-chain walk delivers cells in rowid order. Interior cells use the same cell_length | kind_tag | body prefix as local/overflow cells, so binary search for a rowid works uniformly across all page kinds.
Write path: save rebuilds the tree bottom-up from the in-memory sorted rows. No in-place splits or merges — the whole tree is emitted fresh on every commit. That's the educational and engineering trade-off to stay compatible with the "Table lives in memory, save flushes a full snapshot" model. Deterministic build (tables sorted by name, rows by rowid) keeps the Pager's diff commit effective: unchanged tables produce byte-identical pages and aren't rewritten.
Read path is still eager-load: load_table_rows walks every leaf and populates Table's in-memory maps up front. A cursor abstraction that streams rows through the pager without materializing the whole table is deferred to Phase 5 (the library-API split), where the cost of the bigger refactor also buys us the public Connection/Statement/Rows API.
What Phase 3e changed
Phase 3e added secondary indexes. Every UNIQUE / PRIMARY KEY column auto-creates a SecondaryIndex at CREATE TABLE time (named sqlrite_autoindex_<table>_<col>); CREATE [UNIQUE] INDEX name ON t (col) adds explicit ones. Column is now a pure schema descriptor — the per-column BTreeMap moved to Table::secondary_indexes, where it's shared infrastructure between auto and explicit indexes.
On disk. Every index persists as its own cell-based B-Tree using a new KIND_INDEX cell carrying (value, original_rowid). sqlrite_master grew a type column distinguishing 'table' rows from 'index' rows (file format v3).
Executor. The WHERE col = literal (and literal = col) shape now probes the matching index for an O(log N) lookup instead of scanning every row. Other predicate shapes still fall back to the full-scan path.
What Phase 3f / later phases will bring
See roadmap.md for the next phase (WAL and file locks in Phase 4, library + WASM in Phase 5). The cursor / lazy-load refactor is parked explicitly in Phase 5 — it touches the executor, every Table accessor, and needs cursor state threaded through, so it pairs naturally with the public Connection / Statement / Rows API.