The Pager

May 8, 2026 · View on GitHub

The Pager is the component that actually talks to disk. Everything above it — the executor, the meta-commands, the Database itself — is blissfully unaware of file offsets. It hands bytes to the pager; the pager decides what gets written when.

The canonical code is in src/sql/pager/pager.rs.

What it does

  • Owns the open handles for the .sqlrite main file and its -wal sidecar (Phase 4c).
  • Maintains three maps of page bytes:
    • on_disk — byte snapshot of the main file as last checkpointed. Frozen between checkpoints.
    • wal_cache — latest committed body for every page whose new contents live in the WAL instead of the main file.
    • staged — pages queued for the next commit.
  • On commit(new_header):
    1. Compare each staged page against the effective committed state (wal_cache layered on on_disk).
    2. Append a WAL frame for every page whose bytes actually differ, in ascending page order.
    3. Append a final commit frame whose body is the new page 0 (encoded header) and whose commit_page_count field carries the post-commit page count. The commit frame is fsync'd — that's the barrier that seals the transaction.
    4. Promote the just-written frames into wal_cache; clear staged; update current_header.

The main file isn't touched. It only changes when the checkpointer (Phase 4d) flushes accumulated WAL frames back and truncates the log.

The effect: the first commit appends one frame per page; subsequent commits append frames only for what changed.

Data structures

pub struct Pager {
    storage: FileStorage,                          // main-file handle (read at open; checkpointer writes it)
    current_header: DbHeader,                      // last committed header, reflecting WAL replay
    on_disk: HashMap<u32, Box<[u8; PAGE_SIZE]>>,   // main-file pages, frozen between checkpoints
    staged: HashMap<u32, Box<[u8; PAGE_SIZE]>>,    // queued writes for the next commit
    wal_cache: HashMap<u32, Box<[u8; PAGE_SIZE]>>, // latest committed WAL bodies, layered on on_disk
    wal: Wal,                                      // open WAL sidecar (see wal.rs)
}

Box<[u8; PAGE_SIZE]> heap-allocates each 4 KiB page so the hash map isn't dragging 4-KiB value inlines through rehashes.

Lifecycle

Opening an existing file

Pager::open(path) (read-write, shorthand for open_with_mode(path, AccessMode::ReadWrite)):

  1. Open the main file read-write, acquire the exclusive advisory lock (flock(LOCK_EX)).
  2. Read and verify page 0 (header) → initial current_header.
  3. For page numbers 1..page_count, seek + read the page, store in on_disk.
  4. Open (or create, if missing) the -wal sidecar; acquire its exclusive lock.
  5. Replay committed WAL frames into wal_cache.
  6. If the WAL contains a page-0 frame, decode it and override current_header — the WAL's copy is always more recent than the main file's.
  7. staged starts empty.

After open, the pager carries a complete snapshot of what's on disk plus every WAL-resident update since the last checkpoint. A pre-Phase-4c .sqlrite file (no sidecar) gets a fresh empty WAL on first open.

Opening read-only (Phase 4e)

Pager::open_read_only(path) (shorthand for open_with_mode(path, AccessMode::ReadOnly)) takes a shared advisory lock (flock(LOCK_SH)) on the main file and on the WAL sidecar (if it exists). Multiple read-only openers coexist; any writer is excluded (POSIX flock). Differences from read-write open:

  • Main file opened read-only (no write(true) on OpenOptions).
  • WAL sidecar: not created if missing — a read-only caller must not materialize one. With an absent sidecar the Pager serves reads straight from on_disk with an empty wal_cache.
  • stage_page / commit / checkpoint all short-circuit with General error: cannot commit: database is opened read-only via the require_writable guard. The wal field is Option<Wal> and becomes None only on the "no sidecar" read-only path.

Library-level callers use sqlrite::open_database_read_only(path, name); the REPL exposes it as --readonly / -r.

Creating a new file

Pager::create(path):

  1. Truncate-or-create the main file; acquire its lock.
  2. Encode an empty sqlrite_master as page 1 (a TableLeaf with zero cells).
  3. Write the header with page_count = 2, schema_root_page = 1; fsync.
  4. Create a matching empty WAL sidecar at path + "-wal" (any stale WAL from a prior crashed session gets truncated).
  5. Populate on_disk with page 1 so no-op commits don't append frames for it.

The fresh DB is durable immediately — a valid header + empty schema on disk, a valid empty WAL alongside it.

Staging writes

The higher-level save_database (src/sql/pager/mod.rs) does the actual re-serialization:

pub fn save_database(db: &mut Database, path: &Path) -> Result<()> {
    let same_path = db.source_path.as_deref() == Some(path);
    let mut pager = /* re-attach or open or create */;
    pager.clear_staged();

    let mut next_free_page = 1;
    let mut catalog_entries = Vec::new();

    let mut table_names: Vec<&String> = db.tables.keys().collect();
    table_names.sort();                         // deterministic ordering
    for name in table_names {
        let bytes = bincode::encode(&db.tables[name])?;
        let start = next_free_page;
        next_free_page = stage_chain(&mut pager, &bytes, PageType::TableData, start)?;
        catalog_entries.push((name.clone(), start));
    }
    let catalog_bytes = bincode::encode(&catalog_entries)?;
    let schema_root = next_free_page;
    next_free_page = stage_chain(&mut pager, &catalog_bytes, PageType::SchemaRoot, schema_root)?;

    pager.commit(DbHeader { page_count: next_free_page, schema_root_page: schema_root })?;
    if same_path { db.pager = Some(pager); }
    Ok(())
}

Every page is staged via pager.stage_page(n, bytes), which is just staged.insert(n, bytes).

Committing

commit(new_header) (WAL-backed, Phase 4c):

// 1. Diff staged against the effective committed state.
let dirty: Vec<_> = staged.into_iter()
    .filter(|(n, bytes)| {
        let existing = wal_cache.get(n).or_else(|| on_disk.get(n));
        match existing {
            Some(e) => e != bytes,
            None => true,
        }
    })
    .collect();
dirty.sort_by_key(|(n, _)| *n);

// 2. Append one dirty frame per changed page.
for (n, bytes) in &dirty {
    wal.append_frame(*n, bytes, None)?;
}

// 3. Seal the transaction with a commit frame for page 0.
//    Body = encoded new header, commit_page_count = new page count.
//    append_frame fsyncs when commit_page_count.is_some().
let page0 = encode_header(&new_header);
wal.append_frame(0, &page0, Some(new_header.page_count))?;

// 4. Promote everything we just wrote into wal_cache.
for (n, bytes) in dirty { wal_cache.insert(n, bytes); }
wal_cache.insert(0, page0);
self.current_header = new_header;

The sort keeps the WAL append order deterministic and matches ascending-page iteration order for the future checkpointer.

commit returns the number of dirty data frames (the commit frame is implicit and not counted). Unit tests assert on that count to confirm the diff still works — a repeated identical commit writes zero data frames, only the commit frame.

Reads

read_page(page_num):

  1. staged — pending writes always shadow everything.
  2. Bounds check: page_num < current_header.page_count. Pages logically dropped by a shrink-commit stay in on_disk until checkpoint, but readers don't see them.
  3. wal_cache — committed WAL updates since the last checkpoint.
  4. on_disk — the frozen main-file snapshot.

Checkpointing (Phase 4d)

checkpoint() folds accumulated WAL pages back into the main file and truncates the WAL. It fires automatically from commit once Wal::frame_count() >= AUTO_CHECKPOINT_THRESHOLD_FRAMES (currently 100) and can also be triggered explicitly. The sequence uses two fsync barriers, matching SQLite's ordering so no reordered writeback can expose a half-migrated file:

  1. Write every wal_cache data page at its page_num * PAGE_SIZE offset in the main file.
  2. fsync — data pages must hit stable storage before the header publishes them.
  3. Overwrite the main-file header with current_header.
  4. set_len(page_count * PAGE_SIZE) — shrink the main file if we dropped pages.
  5. fsync — header + truncate durable together. This is the checkpoint's commit point.
  6. Wal::truncate() — resets the sidecar to header-only, rolls the salt, bumps the checkpoint sequence.
  7. Drain wal_cache entries (minus page 0) into on_disk; drop on_disk entries past the new page count.

Crash safety.

  • Crash between 1 and 2: data-page writes are buffered, header untouched — the main file still reads as its pre-checkpoint self. WAL is intact; retry rewrites the same bytes.
  • Crash between 2 and 5: data pages durable, header old. WAL still holds the authoritative bytes including page 0; next open overrides the stale header from wal_cache[0] and reads resolve correctly.
  • Crash between 5 and 6: main file fully migrated, WAL lingers. Next open sees wal_cache entries byte-identical to the main file; the next checkpoint cleans the stale WAL.

checkpoint() returns the number of data pages written to the main file (excluding the header). Back-to-back checkpoints return 0 the second time. wal.truncate() runs before the in-memory cache swap so that a wal.truncate I/O failure leaves the Pager in a well-defined state (main file consistent on disk, stale WAL present, wal_cache still populated) rather than an uninspectable intermediate.

How the diff earns its keep

Consider an UPDATE that touches exactly one row in one table:

  1. executor::execute_update updates the in-memory Table.
  2. process_command calls pager::save_database.
  3. Every table is re-serialized to bytes and staged into the pager.
  4. For the table whose row changed, the bincode bytes differ → those pages are dirty.
  5. For every other table, the bincode bytes are byte-identical to what's already on disk → those pages stay clean → no writes.

Without the diff, step 3's "re-serialize every table" would trigger a full file rewrite on every statement. With the diff, disk I/O scales with the number of changed tables, not total tables.

This only works because save_database iterates tables in sorted order — if the order were random, a table that didn't change might land at a different page number, appearing dirty. See Design decisions §7.

Free-page list and VACUUM (SQLR-6)

Save now uses a PageAllocator instead of a bare next_free_page counter. The allocator pulls pages from three sources, in preference order:

  1. Per-table preferred pool — every table/index/master is given the page numbers it occupied last save (collected by walking from its old rootpage). An unchanged table re-stages byte-identical pages at the same numbers, so the diff pager skips every write for it.
  2. Global freelist — pages from dropped tables/indexes that are recorded in the persisted freelist (rooted at header.freelist_head).
  3. Extendnext_extend++, monotonic past the high-water mark.

After staging, pages that were live before this save but didn't get restaged this round (e.g., the leaves of a dropped table) move onto the new freelist. The freelist itself is encoded into a chain of FreelistTrunk pages — each trunk holds up to 1021 free leaf-page numbers plus a next_page pointer to the following trunk. Trunks consume some of the free pages they describe (a trunk page IS a free page borrowed for metadata), so a freelist of N pages takes ceil(N / 1022) trunks and persists N − T leaf entries.

VACUUM; (a SQL statement) calls vacuum_database, which is save_database with empty per-table preferred pools and an empty initial freelist. Allocation falls through to extend on every page → contiguous layout from page 1, no freelist trunks, file truncates to the new high-water mark on the next checkpoint.

Format-version side effect: a save that produces a non-empty freelist promotes the file from v4/v5 to v6 (mirrors Phase 8c's v4→v5 FTS rule). VACUUM clears the freelist but doesn't downgrade — v6 is a strict superset.

Auto-VACUUM trigger (SQLR-10)

After SQLR-6, the file still required a manual VACUUM; to actually shrink — the freelist absorbed orphan pages but the high-water mark stayed put. SQLR-10 adds a heuristic that fires vacuum_database automatically after a page-releasing DDL (DROP TABLE, DROP INDEX, ALTER TABLE DROP COLUMN) when the freelist exceeds a configurable fraction of page_count.

Configuration lives on Database::auto_vacuum_threshold: Option<f32> and is exposed at the connection level via Connection::set_auto_vacuum_threshold / auto_vacuum_threshold, and via SQL through PRAGMA auto_vacuum (SQLR-13 — see src/sql/pragma.rs). Defaults: Some(0.25) (SQLite parity at 25%); pass None (or PRAGMA auto_vacuum = OFF) to opt out per connection. The threshold is per-Connection runtime state and is not persisted in the file header — every reopen starts at the default.

The trigger lives at the end of process_command_with_render, immediately after the auto-save. Order matters: the freelist isn't accurate until the bottom-up rebuild runs during save, so we save first, then check the ratio. The check itself is freelist::should_auto_vacuum(pager, threshold), which:

  • skips databases under MIN_PAGES_FOR_AUTO_VACUUM (16 pages = 64 KiB) so tiny files don't churn,
  • counts both leaf and trunk pages in the freelist (trunks are reclaimable bytes too),
  • returns true iff (leaves + trunks) / page_count > threshold.

Auto-VACUUM is also skipped mid-transaction (no save → freelist is stale and the compact would publish in-flight work) and on in-memory databases (no file). The path bypasses executor::execute_vacuum — that wrapper builds a user-facing status string and rejects in-transaction calls, both wrong for a silent maintenance hook — and calls vacuum_database directly.

What it doesn't do (yet)

  • No LRU eviction. on_disk + wal_cache together grow with the page count. For a 1 GiB database, that's ~1 GiB of page cache. Bounded cache is future work.
  • No per-statement granularity. The whole database is re-serialized on every commit; the diff keeps the written set small but the CPU cost of reserialization is unchanged.
  • No concurrent reader-and-writer. Phase 4e graduated to shared/exclusive lock modes (multi-reader or single-writer), but POSIX flock can't give us both at once. True concurrent access would need a shared-memory coordination file with read marks — not on the roadmap.
  • Savepoints / nested transactions. Phase 4f added top-level BEGIN / COMMIT / ROLLBACK (snapshot-based rollback, auto-save suppressed inside a transaction), but nested BEGIN is rejected — real savepoints aren't on the roadmap.

Interaction with Database

Database holds an Option<Pager>:

  • None → in-memory only. Writes don't touch disk.
  • Some(pager) → file-backed. save_database(db, path) takes the pager off, drives it, and re-attaches on success.

Being able to take() and put back a pager avoids threading lifetimes through the call stack. It does mean a mid-save failure leaves db.pager == None — which is fine, the next .save or auto-save will re-open the file from scratch.

Testing

Unit tests in src/sql/pager/pager.rs:

  • create_then_open_round_trips — verify create writes a valid empty DB and open can read it back.
  • create_spawns_wal_sidecar — Phase 4c: create produces a header-only -wal file alongside the main file.
  • commit_writes_only_dirty_pages — the diff test: write 3 pages, re-stage 2 unchanged + 1 changed, assert the commit only wrote 1 dirty data frame.
  • commit_leaves_main_file_untouched_and_shrink_hides_dropped_pages — confirms the main file stays frozen across commits, the WAL grows, and a shrink-commit hides dropped pages via the bounds check even though their bytes linger in the main file until checkpoint.
  • wal_replay_on_reopen_restores_committed_state — end-to-end: close a Pager after a commit, reopen, verify every staged page comes back via WAL replay.
  • two_commits_only_stage_the_delta — two identical commits produce zero dirty data frames the second time (only the implicit commit frame).
  • second_pager_on_same_file_is_rejected — Phase 4a: exclusive lock rejects simultaneous openers.
  • Phase 4d checkpoint suite:
    • explicit_checkpoint_folds_wal_into_main_file_and_truncates_wal — after checkpoint, main file holds the committed content, WAL is back to header-only.
    • checkpoint_is_idempotent — back-to-back checkpoints: second is a no-op.
    • checkpoint_with_shrink_truncates_main_file — a shrink-commit followed by checkpoint actually makes the main file smaller.
    • auto_checkpoint_fires_past_frame_threshold — enough commits trigger auto-checkpoint; the WAL ends up empty on its own.
    • reopen_after_crash_mid_checkpoint_recovers_via_wal — closing before checkpoint is equivalent to a crash mid-checkpoint; WAL replay restores the post-commit view.
  • Phase 4e shared/exclusive suite:
    • two_read_only_openers_coexist — two open_read_only calls on the same file succeed simultaneously.
    • read_write_blocks_read_only_and_vice_versa — POSIX flock semantics: RO excludes RW and vice versa.
    • read_only_pager_rejects_mutationscommit / checkpoint return typed errors; reads still work.
    • read_only_open_without_wal_sidecar_succeeds — a deleted sidecar isn't recreated in RO mode; reads fall back to the main file.

The 9 WAL-format tests live in src/sql/pager/wal.rs and cover header / frame round-trips, torn-write recovery, and the orphan-dirty replay invariant in isolation from the Pager.

Higher-level tests in src/sql/pager/mod.rs cover round-tripping populated databases, rejecting garbage, and handling tables that span multiple pages.