Storage model

April 23, 2026 · View on GitHub

How data lives in memory while a database is open. For how it's laid out on disk, see file-format.md.

The canonical code is src/sql/db/table.rs.

The big picture

Database
├── db_name: "users.sqlrite"
├── source_path: Some("/path/to/users.sqlrite")
├── pager: Some(Pager)
└── tables: HashMap<String, Table>
      "users"  -> Table
      "notes"  -> Table
      ...

Table
├── tb_name: "users"
├── columns: Vec<Column>         ordered, schema
├── rows: Rc<RefCell<HashMap<String, Row>>>
│                               keyed by column name
│                               value = one column's full data
├── indexes: HashMap<String, String>   (placeholder; not used yet)
├── last_rowid: i64
└── primary_key: "id" | "-1"    "-1" = table has no explicit PK

Column
├── column_name: "name"
├── datatype: DataType::Text
├── is_pk: false
├── not_null: true
├── is_unique: true
├── is_indexed: true
└── index: Index::Text(BTreeMap<String, i64>)

Row   (stored per column inside Table.rows)
├── Integer(BTreeMap<rowid: i64, value: i32>)
├── Text(BTreeMap<rowid: i64, value: String>)
├── Real(BTreeMap<rowid: i64, value: f32>)
├── Bool(BTreeMap<rowid: i64, value: bool>)
└── None

Column-oriented layout

The surprising thing is that Table.rows is not keyed by rowid. It's keyed by column name, and each value is a Row enum holding a BTreeMap<rowid, value> for that one column.

In other words, to read "what's at row 5?" you have to ask each column's BTreeMap for key 5:

users.rows["id"]    = Row::Integer({5 => 5})
users.rows["name"]  = Row::Text({5 => "alice"})
users.rows["age"]   = Row::Integer({5 => 30})

All columns share the same keyset. Insertion pads missing columns with NULL-ish values (see below), so every rowid has an entry in every column's map.

Why column-oriented?

Historical rather than principled. The first pass of the codebase in 2020 used this shape, probably because it made per-column index maintenance straightforward. Phase 3c will replace it with row-oriented cells as part of moving to real B-Tree storage.

Consequences

  • Projecting a column (e.g., SELECT name FROM users) is one BTreeMap scan. Efficient.
  • Reassembling a row (SELECT *) requires N map lookups for N columns. Each lookup is O(log R), so row reconstruction is O(C log R). Not terrible for small N but not ideal.
  • Deleting a row means removing the rowid from every column's BTreeMap and every index's BTreeMap. Handled by Table::delete_row.

ROWID

Every row has a rowid: i64. It's the implicit primary key when no explicit INTEGER PRIMARY KEY is declared, and it's the alias for the PK when one is. Table.last_rowid tracks the most recent assignment and is bumped on each insert.

When the user omits an INTEGER PRIMARY KEY column in an INSERT, insert_row auto-assigns last_rowid + 1. When the user supplies an explicit value, that value becomes the new rowid (and last_rowid advances to it, if it's larger).

Non-integer PRIMARY KEY columns are unusual; the code handles them but auto-assign only kicks in for the integer case.

Indexes

Each Column carries an Index:

enum Index {
    Integer(BTreeMap<i32, i64>),  // value -> rowid
    Text(BTreeMap<String, i64>),  // value -> rowid
    None,                          // not indexed
}

Indexes exist only on columns marked PRIMARY KEY or UNIQUE, and only for Integer and Text types today. Real and Bool columns get Index::None even if declared unique, and UNIQUE enforcement falls back to a linear scan.

The index is maintained inline with every insert, update, and delete:

  • insert_row inserts into the column's index after successfully inserting into the Row map.
  • set_value removes the old index entry (a retain-based scan, since we look up by rowid) then inserts the new one.
  • delete_row strips every index entry pointing at the rowid being deleted.

Indexes are also used on write paths for UNIQUE-constraint checks — validate_unique_constraint (called before every INSERT) does O(log R) contains_key lookups.

They are not yet used on read paths. Today's SELECT always does a full table scan via Table::rowids. A planner that can turn WHERE id = 5 into an index probe is Phase 3+ work.

NULL handling

Storage for NULL is inconsistent by type, which is a known wart:

  • Text columns can store NULL by encoding the literal string "Null" in the BTreeMap. Reads special-case this back to Value::Null in Row::get. Consequence: a user who actually inserts the string 'Null' into a Text column will read back NULL. Acceptable for now, will be cleaned up in Phase 3c.
  • Integer / Real / Bool columns can't store NULL. If a user omits such a column in INSERT, insert_row returns a Type mismatch error. This is stricter than SQL (which allows NULL by default unless NOT NULL is declared) but safer than the old behavior of panicking on "Null".parse::<i32>().

A proper NULL-bitmap mechanism is on the Phase 3c to-do list.

Runtime Value vs storage Row

When the executor evaluates an expression, it works with Value — a runtime enum with wider variants:

pub enum Value {
    Integer(i64),
    Text(String),
    Real(f64),
    Bool(bool),
    Null,
}

Conversion:

  • Row::get(rowid) → Option<Value> widens storage i32 to Value::Integer(i64), f32 to Value::Real(f64).
  • Table::set_value(col, rowid, Value) narrows back with as casts. The declared column type enforces what's allowed (e.g., a Value::Text into an Integer column is a type error, not a silent corruption).

This split keeps the executor type-agnostic — it just uses Value arithmetic — while storage stays compact.

Lifecycle: write paths

INSERT

  1. executor::execute_insert (inline in the Statement::Insert arm): delegate to InsertQuery::new to get (table_name, columns, rows).
  2. Look up the table via db.get_table_mut.
  3. For each row of values: a. Check every value's column exists on the table. b. Call Table::validate_unique_constraint — uses the column indexes for O(log R) lookups. c. Call Table::insert_row — auto-assigns the PK if missing, then writes every column's BTreeMap + every index's BTreeMap in lockstep.

UPDATE

  1. executor::execute_update: resolve assignment targets to column names, verify they exist.
  2. Two passes to avoid borrow-checker fights:
    • Read pass (&db): walk every rowid, evaluate WHERE, for matching rows evaluate the assignment RHS expressions, collect planned writes.
    • Write pass (&mut db): for each (rowid, [(col, value)]), call Table::set_value.
  3. set_value enforces the column's declared type and UNIQUE constraint before mutation. It refreshes the index (old entry removed, new entry inserted).

DELETE

  1. executor::execute_delete: same two-pass pattern.
  2. Read pass collects matching rowids.
  3. Write pass calls Table::delete_row(rowid) for each.

CREATE TABLE

  1. CreateQuery::new produces a ParsedColumn list with type + constraints.
  2. Table::new builds the Table struct: allocates an empty BTreeMap per column in rows, builds Column structs including empty Indexes on UNIQUE/PK columns.
  3. Top-level dispatcher inserts the table into db.tables.

Lifecycle: read paths

SELECT

  1. executor::execute_select looks up the table.
  2. Resolves projection to a concrete ordered column list.
  3. Walks every rowid from Table::rowids (grabbed from the first column's BTreeMap, since all columns share the same keyset).
  4. For each rowid:
    • If a WHERE clause exists, evaluate eval_predicate against the row's data.
    • If matched, keep the rowid for later.
  5. Sort the matched rowids if ORDER BY is present.
  6. Truncate to LIMIT.
  7. Render as a prettytable string, column by column via Table::get_value.

Full table scan, every time. Index-backed lookups are Phase 3+.

What Phase 3c changed — and what it didn't

Changed (on-disk). Rows are no longer stored as one opaque bincode blob per table. Each row is serialized as a cell (length-prefixed, kind-tagged, null-bitmap + typed value blocks) and packed into TableLeaf pages behind a slot directory. Cells that don't fit spill into an overflow chain. The schema catalog is itself a table — sqlrite_master — stored in the same cell format. File format version is now 2. See file-format.md for the byte-level details.

Didn't change (in-memory). The Database / Table / Column / Row / Index structures described in the sections above are still the runtime representation. On save_database, Table::extract_row turns each in-memory row into a Cell; on open_database, each Cell is decoded and its values are handed to Table::restore_row, which populates the per-column BTreeMaps and rebuilds the indexes.

What Phase 3d changed

Phase 3d put a real B-Tree on top of the leaf layer. The on-disk file now has InteriorNode pages above TableLeaf pages; find_leftmost_leaf during open descends the tree, then the existing sibling-chain walk delivers cells in rowid order. Interior cells use the same cell_length | kind_tag | body prefix as local/overflow cells, so binary search for a rowid works uniformly across all page kinds.

Write path: save rebuilds the tree bottom-up from the in-memory sorted rows. No in-place splits or merges — the whole tree is emitted fresh on every commit. That's the educational and engineering trade-off to stay compatible with the "Table lives in memory, save flushes a full snapshot" model. Deterministic build (tables sorted by name, rows by rowid) keeps the Pager's diff commit effective: unchanged tables produce byte-identical pages and aren't rewritten.

Read path is still eager-load: load_table_rows walks every leaf and populates Table's in-memory maps up front. A cursor abstraction that streams rows through the pager without materializing the whole table is deferred to Phase 5 (the library-API split), where the cost of the bigger refactor also buys us the public Connection/Statement/Rows API.

What Phase 3e changed

Phase 3e added secondary indexes. Every UNIQUE / PRIMARY KEY column auto-creates a SecondaryIndex at CREATE TABLE time (named sqlrite_autoindex_<table>_<col>); CREATE [UNIQUE] INDEX name ON t (col) adds explicit ones. Column is now a pure schema descriptor — the per-column BTreeMap moved to Table::secondary_indexes, where it's shared infrastructure between auto and explicit indexes.

On disk. Every index persists as its own cell-based B-Tree using a new KIND_INDEX cell carrying (value, original_rowid). sqlrite_master grew a type column distinguishing 'table' rows from 'index' rows (file format v3).

Executor. The WHERE col = literal (and literal = col) shape now probes the matching index for an O(log N) lookup instead of scanning every row. Other predicate shapes still fall back to the full-scan path.

What Phase 3f / later phases will bring

See roadmap.md for the next phase (WAL and file locks in Phase 4, library + WASM in Phase 5). The cursor / lazy-load refactor is parked explicitly in Phase 5 — it touches the executor, every Table accessor, and needs cursor state threaded through, so it pairs naturally with the public Connection / Statement / Rows API.