Storage model

April 23, 2026 · View on GitHub

How data lives in memory while a database is open. For how it's laid out on disk, see file-format.md.

The canonical code is src/sql/db/table.rs.

The big picture

Database
├── db_name: "users.sqlrite"
├── source_path: Some("/path/to/users.sqlrite")
├── pager: Some(Pager)
└── tables: HashMap<String, Table>
      "users"  -> Table
      "notes"  -> Table
      ...

Table
├── tb_name: "users"
├── columns: Vec<Column>         ordered, schema
├── rows: Rc<RefCell<HashMap<String, Row>>>
│                               keyed by column name
│                               value = one column's full data
├── indexes: HashMap<String, String>   (placeholder; not used yet)
├── last_rowid: i64
└── primary_key: "id" | "-1"    "-1" = table has no explicit PK

Column
├── column_name: "name"
├── datatype: DataType::Text
├── is_pk: false
├── not_null: true
├── is_unique: true
├── is_indexed: true
└── index: Index::Text(BTreeMap<String, i64>)

Row   (stored per column inside Table.rows)
├── Integer(BTreeMap<rowid: i64, value: i32>)
├── Text(BTreeMap<rowid: i64, value: String>)
├── Real(BTreeMap<rowid: i64, value: f32>)
├── Bool(BTreeMap<rowid: i64, value: bool>)
└── None

Column-oriented layout

The surprising thing is that Table.rows is not keyed by rowid. It's keyed by column name, and each value is a Row enum holding a BTreeMap<rowid, value> for that one column.

In other words, to read "what's at row 5?" you have to ask each column's BTreeMap for key 5:

users.rows["id"]    = Row::Integer({5 => 5})
users.rows["name"]  = Row::Text({5 => "alice"})
users.rows["age"]   = Row::Integer({5 => 30})

All columns share the same keyset. Insertion pads missing columns with NULL-ish values (see below), so every rowid has an entry in every column's map.

Historical rather than principled. The first pass of the codebase in 2020 used this shape, probably because it made per-column index maintenance straightforward. Phase 3c will replace it with row-oriented cells as part of moving to real B-Tree storage.

Consequences

Projecting a column (e.g., SELECT name FROM users) is one BTreeMap scan. Efficient.
Reassembling a row (SELECT *) requires N map lookups for N columns. Each lookup is O(log R), so row reconstruction is O(C log R). Not terrible for small N but not ideal.
Deleting a row means removing the rowid from every column's BTreeMap and every index's BTreeMap. Handled by Table::delete_row.

ROWID

Every row has a rowid: i64. It's the implicit primary key when no explicit INTEGER PRIMARY KEY is declared, and it's the alias for the PK when one is. Table.last_rowid tracks the most recent assignment and is bumped on each insert.

When the user omits an INTEGER PRIMARY KEY column in an INSERT, insert_row auto-assigns last_rowid + 1. When the user supplies an explicit value, that value becomes the new rowid (and last_rowid advances to it, if it's larger).

Non-integer PRIMARY KEY columns are unusual; the code handles them but auto-assign only kicks in for the integer case.

Indexes

Each Column carries an Index:

enum Index {
    Integer(BTreeMap<i32, i64>),  // value -> rowid
    Text(BTreeMap<String, i64>),  // value -> rowid
    None,                          // not indexed
}

Indexes exist only on columns marked PRIMARY KEY or UNIQUE, and only for Integer and Text types today. Real and Bool columns get Index::None even if declared unique, and UNIQUE enforcement falls back to a linear scan.

The index is maintained inline with every insert, update, and delete:

insert_row inserts into the column's index after successfully inserting into the Row map.
set_value removes the old index entry (a retain-based scan, since we look up by rowid) then inserts the new one.
delete_row strips every index entry pointing at the rowid being deleted.

Indexes are also used on write paths for UNIQUE-constraint checks — validate_unique_constraint (called before every INSERT) does O(log R) contains_key lookups.

They are not yet used on read paths. Today's SELECT always does a full table scan via Table::rowids. A planner that can turn WHERE id = 5 into an index probe is Phase 3+ work.

NULL handling

Storage for NULL is inconsistent by type, which is a known wart:

Text columns can store NULL by encoding the literal string "Null" in the BTreeMap. Reads special-case this back to Value::Null in Row::get. Consequence: a user who actually inserts the string 'Null' into a Text column will read back NULL. Acceptable for now, will be cleaned up in Phase 3c.
Integer / Real / Bool columns can't store NULL. If a user omits such a column in INSERT, insert_row returns a Type mismatch error. This is stricter than SQL (which allows NULL by default unless NOT NULL is declared) but safer than the old behavior of panicking on "Null".parse::<i32>().

A proper NULL-bitmap mechanism is on the Phase 3c to-do list.

Runtime `Value` vs storage `Row`

When the executor evaluates an expression, it works with Value — a runtime enum with wider variants:

pub enum Value {
    Integer(i64),
    Text(String),
    Real(f64),
    Bool(bool),
    Null,
}

Conversion:

Row::get(rowid) → Option<Value> widens storage i32 to Value::Integer(i64), f32 to Value::Real(f64).
Table::set_value(col, rowid, Value) narrows back with as casts. The declared column type enforces what's allowed (e.g., a Value::Text into an Integer column is a type error, not a silent corruption).

This split keeps the executor type-agnostic — it just uses Value arithmetic — while storage stays compact.

Lifecycle: write paths

INSERT

executor::execute_insert (inline in the Statement::Insert arm): delegate to InsertQuery::new to get (table_name, columns, rows).
Look up the table via db.get_table_mut.
For each row of values: a. Check every value's column exists on the table. b. Call Table::validate_unique_constraint — uses the column indexes for O(log R) lookups. c. Call Table::insert_row — auto-assigns the PK if missing, then writes every column's BTreeMap + every index's BTreeMap in lockstep.

UPDATE

executor::execute_update: resolve assignment targets to column names, verify they exist.
Two passes to avoid borrow-checker fights:
- Read pass (&db): walk every rowid, evaluate WHERE, for matching rows evaluate the assignment RHS expressions, collect planned writes.
- Write pass (&mut db): for each (rowid, [(col, value)]), call Table::set_value.
set_value enforces the column's declared type and UNIQUE constraint before mutation. It refreshes the index (old entry removed, new entry inserted).

DELETE

executor::execute_delete: same two-pass pattern.
Read pass collects matching rowids.
Write pass calls Table::delete_row(rowid) for each.

CREATE TABLE

CreateQuery::new produces a ParsedColumn list with type + constraints.
Table::new builds the Table struct: allocates an empty BTreeMap per column in rows, builds Column structs including empty Indexes on UNIQUE/PK columns.
Top-level dispatcher inserts the table into db.tables.

Lifecycle: read paths

SELECT

executor::execute_select looks up the table.
Resolves projection to a concrete ordered column list.
Walks every rowid from Table::rowids (grabbed from the first column's BTreeMap, since all columns share the same keyset).
For each rowid:
- If a WHERE clause exists, evaluate eval_predicate against the row's data.
- If matched, keep the rowid for later.
Sort the matched rowids if ORDER BY is present.
Truncate to LIMIT.
Render as a prettytable string, column by column via Table::get_value.

Full table scan, every time. Index-backed lookups are Phase 3+.

What Phase 3c changed — and what it didn't

Changed (on-disk). Rows are no longer stored as one opaque bincode blob per table. Each row is serialized as a cell (length-prefixed, kind-tagged, null-bitmap + typed value blocks) and packed into TableLeaf pages behind a slot directory. Cells that don't fit spill into an overflow chain. The schema catalog is itself a table — sqlrite_master — stored in the same cell format. File format version is now 2. See file-format.md for the byte-level details.

Didn't change (in-memory). The Database / Table / Column / Row / Index structures described in the sections above are still the runtime representation. On save_database, Table::extract_row turns each in-memory row into a Cell; on open_database, each Cell is decoded and its values are handed to Table::restore_row, which populates the per-column BTreeMaps and rebuilds the indexes.

What Phase 3d changed

Phase 3d put a real B-Tree on top of the leaf layer. The on-disk file now has InteriorNode pages above TableLeaf pages; find_leftmost_leaf during open descends the tree, then the existing sibling-chain walk delivers cells in rowid order. Interior cells use the same cell_length | kind_tag | body prefix as local/overflow cells, so binary search for a rowid works uniformly across all page kinds.

Write path: save rebuilds the tree bottom-up from the in-memory sorted rows. No in-place splits or merges — the whole tree is emitted fresh on every commit. That's the educational and engineering trade-off to stay compatible with the "Table lives in memory, save flushes a full snapshot" model. Deterministic build (tables sorted by name, rows by rowid) keeps the Pager's diff commit effective: unchanged tables produce byte-identical pages and aren't rewritten.

Read path is still eager-load: load_table_rows walks every leaf and populates Table's in-memory maps up front. A cursor abstraction that streams rows through the pager without materializing the whole table is deferred to Phase 5 (the library-API split), where the cost of the bigger refactor also buys us the public Connection/Statement/Rows API.

What Phase 3e changed

Phase 3e added secondary indexes. Every UNIQUE / PRIMARY KEY column auto-creates a SecondaryIndex at CREATE TABLE time (named sqlrite_autoindex_<table>_<col>); CREATE [UNIQUE] INDEX name ON t (col) adds explicit ones. Column is now a pure schema descriptor — the per-column BTreeMap moved to Table::secondary_indexes, where it's shared infrastructure between auto and explicit indexes.

On disk. Every index persists as its own cell-based B-Tree using a new KIND_INDEX cell carrying (value, original_rowid). sqlrite_master grew a type column distinguishing 'table' rows from 'index' rows (file format v3).

Executor. The WHERE col = literal (and literal = col) shape now probes the matching index for an O(log N) lookup instead of scanning every row. Other predicate shapes still fall back to the full-scan path.

What Phase 3f / later phases will bring

See roadmap.md for the next phase (WAL and file locks in Phase 4, library + WASM in Phase 5). The cursor / lazy-load refactor is parked explicitly in Phase 5 — it touches the executor, every Table accessor, and needs cursor state threaded through, so it pairs naturally with the public Connection / Statement / Rows API.