The Pager
May 8, 2026 · View on GitHub
The Pager is the component that actually talks to disk. Everything above it — the executor, the meta-commands, the Database itself — is blissfully unaware of file offsets. It hands bytes to the pager; the pager decides what gets written when.
The canonical code is in src/sql/pager/pager.rs.
What it does
- Owns the open handles for the
.sqlritemain file and its-walsidecar (Phase 4c). - Maintains three maps of page bytes:
on_disk— byte snapshot of the main file as last checkpointed. Frozen between checkpoints.wal_cache— latest committed body for every page whose new contents live in the WAL instead of the main file.staged— pages queued for the next commit.
- On
commit(new_header):- Compare each staged page against the effective committed state (
wal_cachelayered onon_disk). - Append a WAL frame for every page whose bytes actually differ, in ascending page order.
- Append a final commit frame whose body is the new page 0 (encoded header) and whose
commit_page_countfield carries the post-commit page count. The commit frame is fsync'd — that's the barrier that seals the transaction. - Promote the just-written frames into
wal_cache; clearstaged; updatecurrent_header.
- Compare each staged page against the effective committed state (
The main file isn't touched. It only changes when the checkpointer (Phase 4d) flushes accumulated WAL frames back and truncates the log.
The effect: the first commit appends one frame per page; subsequent commits append frames only for what changed.
Data structures
pub struct Pager {
storage: FileStorage, // main-file handle (read at open; checkpointer writes it)
current_header: DbHeader, // last committed header, reflecting WAL replay
on_disk: HashMap<u32, Box<[u8; PAGE_SIZE]>>, // main-file pages, frozen between checkpoints
staged: HashMap<u32, Box<[u8; PAGE_SIZE]>>, // queued writes for the next commit
wal_cache: HashMap<u32, Box<[u8; PAGE_SIZE]>>, // latest committed WAL bodies, layered on on_disk
wal: Wal, // open WAL sidecar (see wal.rs)
}
Box<[u8; PAGE_SIZE]> heap-allocates each 4 KiB page so the hash map isn't dragging 4-KiB value inlines through rehashes.
Lifecycle
Opening an existing file
Pager::open(path) (read-write, shorthand for open_with_mode(path, AccessMode::ReadWrite)):
- Open the main file read-write, acquire the exclusive advisory lock (
flock(LOCK_EX)). - Read and verify page 0 (header) → initial
current_header. - For page numbers
1..page_count, seek + read the page, store inon_disk. - Open (or create, if missing) the
-walsidecar; acquire its exclusive lock. - Replay committed WAL frames into
wal_cache. - If the WAL contains a page-0 frame, decode it and override
current_header— the WAL's copy is always more recent than the main file's. stagedstarts empty.
After open, the pager carries a complete snapshot of what's on disk plus every WAL-resident update since the last checkpoint. A pre-Phase-4c .sqlrite file (no sidecar) gets a fresh empty WAL on first open.
Opening read-only (Phase 4e)
Pager::open_read_only(path) (shorthand for open_with_mode(path, AccessMode::ReadOnly)) takes a shared advisory lock (flock(LOCK_SH)) on the main file and on the WAL sidecar (if it exists). Multiple read-only openers coexist; any writer is excluded (POSIX flock). Differences from read-write open:
- Main file opened read-only (no
write(true)onOpenOptions). - WAL sidecar: not created if missing — a read-only caller must not materialize one. With an absent sidecar the Pager serves reads straight from
on_diskwith an emptywal_cache. stage_page/commit/checkpointall short-circuit withGeneral error: cannot commit: database is opened read-onlyvia therequire_writableguard. Thewalfield isOption<Wal>and becomesNoneonly on the "no sidecar" read-only path.
Library-level callers use sqlrite::open_database_read_only(path, name); the REPL exposes it as --readonly / -r.
Creating a new file
Pager::create(path):
- Truncate-or-create the main file; acquire its lock.
- Encode an empty
sqlrite_masteras page 1 (aTableLeafwith zero cells). - Write the header with
page_count = 2,schema_root_page = 1;fsync. - Create a matching empty WAL sidecar at
path + "-wal"(any stale WAL from a prior crashed session gets truncated). - Populate
on_diskwith page 1 so no-op commits don't append frames for it.
The fresh DB is durable immediately — a valid header + empty schema on disk, a valid empty WAL alongside it.
Staging writes
The higher-level save_database (src/sql/pager/mod.rs) does the actual re-serialization:
pub fn save_database(db: &mut Database, path: &Path) -> Result<()> {
let same_path = db.source_path.as_deref() == Some(path);
let mut pager = /* re-attach or open or create */;
pager.clear_staged();
let mut next_free_page = 1;
let mut catalog_entries = Vec::new();
let mut table_names: Vec<&String> = db.tables.keys().collect();
table_names.sort(); // deterministic ordering
for name in table_names {
let bytes = bincode::encode(&db.tables[name])?;
let start = next_free_page;
next_free_page = stage_chain(&mut pager, &bytes, PageType::TableData, start)?;
catalog_entries.push((name.clone(), start));
}
let catalog_bytes = bincode::encode(&catalog_entries)?;
let schema_root = next_free_page;
next_free_page = stage_chain(&mut pager, &catalog_bytes, PageType::SchemaRoot, schema_root)?;
pager.commit(DbHeader { page_count: next_free_page, schema_root_page: schema_root })?;
if same_path { db.pager = Some(pager); }
Ok(())
}
Every page is staged via pager.stage_page(n, bytes), which is just staged.insert(n, bytes).
Committing
commit(new_header) (WAL-backed, Phase 4c):
// 1. Diff staged against the effective committed state.
let dirty: Vec<_> = staged.into_iter()
.filter(|(n, bytes)| {
let existing = wal_cache.get(n).or_else(|| on_disk.get(n));
match existing {
Some(e) => e != bytes,
None => true,
}
})
.collect();
dirty.sort_by_key(|(n, _)| *n);
// 2. Append one dirty frame per changed page.
for (n, bytes) in &dirty {
wal.append_frame(*n, bytes, None)?;
}
// 3. Seal the transaction with a commit frame for page 0.
// Body = encoded new header, commit_page_count = new page count.
// append_frame fsyncs when commit_page_count.is_some().
let page0 = encode_header(&new_header);
wal.append_frame(0, &page0, Some(new_header.page_count))?;
// 4. Promote everything we just wrote into wal_cache.
for (n, bytes) in dirty { wal_cache.insert(n, bytes); }
wal_cache.insert(0, page0);
self.current_header = new_header;
The sort keeps the WAL append order deterministic and matches ascending-page iteration order for the future checkpointer.
commit returns the number of dirty data frames (the commit frame is implicit and not counted). Unit tests assert on that count to confirm the diff still works — a repeated identical commit writes zero data frames, only the commit frame.
Reads
read_page(page_num):
staged— pending writes always shadow everything.- Bounds check:
page_num < current_header.page_count. Pages logically dropped by a shrink-commit stay inon_diskuntil checkpoint, but readers don't see them. wal_cache— committed WAL updates since the last checkpoint.on_disk— the frozen main-file snapshot.
Checkpointing (Phase 4d)
checkpoint() folds accumulated WAL pages back into the main file and truncates the WAL. It fires automatically from commit once Wal::frame_count() >= AUTO_CHECKPOINT_THRESHOLD_FRAMES (currently 100) and can also be triggered explicitly. The sequence uses two fsync barriers, matching SQLite's ordering so no reordered writeback can expose a half-migrated file:
- Write every
wal_cachedata page at itspage_num * PAGE_SIZEoffset in the main file. fsync— data pages must hit stable storage before the header publishes them.- Overwrite the main-file header with
current_header. set_len(page_count * PAGE_SIZE)— shrink the main file if we dropped pages.fsync— header + truncate durable together. This is the checkpoint's commit point.Wal::truncate()— resets the sidecar to header-only, rolls the salt, bumps the checkpoint sequence.- Drain
wal_cacheentries (minus page 0) intoon_disk; drop on_disk entries past the new page count.
Crash safety.
- Crash between 1 and 2: data-page writes are buffered, header untouched — the main file still reads as its pre-checkpoint self. WAL is intact; retry rewrites the same bytes.
- Crash between 2 and 5: data pages durable, header old. WAL still holds the authoritative bytes including page 0; next open overrides the stale header from wal_cache[0] and reads resolve correctly.
- Crash between 5 and 6: main file fully migrated, WAL lingers. Next open sees wal_cache entries byte-identical to the main file; the next checkpoint cleans the stale WAL.
checkpoint() returns the number of data pages written to the main file (excluding the header). Back-to-back checkpoints return 0 the second time. wal.truncate() runs before the in-memory cache swap so that a wal.truncate I/O failure leaves the Pager in a well-defined state (main file consistent on disk, stale WAL present, wal_cache still populated) rather than an uninspectable intermediate.
How the diff earns its keep
Consider an UPDATE that touches exactly one row in one table:
executor::execute_updateupdates the in-memoryTable.process_commandcallspager::save_database.- Every table is re-serialized to bytes and staged into the pager.
- For the table whose row changed, the bincode bytes differ → those pages are dirty.
- For every other table, the bincode bytes are byte-identical to what's already on disk → those pages stay clean → no writes.
Without the diff, step 3's "re-serialize every table" would trigger a full file rewrite on every statement. With the diff, disk I/O scales with the number of changed tables, not total tables.
This only works because save_database iterates tables in sorted order — if the order were random, a table that didn't change might land at a different page number, appearing dirty. See Design decisions §7.
Free-page list and VACUUM (SQLR-6)
Save now uses a PageAllocator instead of a bare next_free_page counter. The allocator pulls pages from three sources, in preference order:
- Per-table preferred pool — every table/index/master is given the page numbers it occupied last save (collected by walking from its old
rootpage). An unchanged table re-stages byte-identical pages at the same numbers, so the diff pager skips every write for it. - Global freelist — pages from dropped tables/indexes that are recorded in the persisted freelist (rooted at
header.freelist_head). - Extend —
next_extend++, monotonic past the high-water mark.
After staging, pages that were live before this save but didn't get restaged this round (e.g., the leaves of a dropped table) move onto the new freelist. The freelist itself is encoded into a chain of FreelistTrunk pages — each trunk holds up to 1021 free leaf-page numbers plus a next_page pointer to the following trunk. Trunks consume some of the free pages they describe (a trunk page IS a free page borrowed for metadata), so a freelist of N pages takes ceil(N / 1022) trunks and persists N − T leaf entries.
VACUUM; (a SQL statement) calls vacuum_database, which is save_database with empty per-table preferred pools and an empty initial freelist. Allocation falls through to extend on every page → contiguous layout from page 1, no freelist trunks, file truncates to the new high-water mark on the next checkpoint.
Format-version side effect: a save that produces a non-empty freelist promotes the file from v4/v5 to v6 (mirrors Phase 8c's v4→v5 FTS rule). VACUUM clears the freelist but doesn't downgrade — v6 is a strict superset.
Auto-VACUUM trigger (SQLR-10)
After SQLR-6, the file still required a manual VACUUM; to actually shrink — the freelist absorbed orphan pages but the high-water mark stayed put. SQLR-10 adds a heuristic that fires vacuum_database automatically after a page-releasing DDL (DROP TABLE, DROP INDEX, ALTER TABLE DROP COLUMN) when the freelist exceeds a configurable fraction of page_count.
Configuration lives on Database::auto_vacuum_threshold: Option<f32> and is exposed at the connection level via Connection::set_auto_vacuum_threshold / auto_vacuum_threshold, and via SQL through PRAGMA auto_vacuum (SQLR-13 — see src/sql/pragma.rs). Defaults: Some(0.25) (SQLite parity at 25%); pass None (or PRAGMA auto_vacuum = OFF) to opt out per connection. The threshold is per-Connection runtime state and is not persisted in the file header — every reopen starts at the default.
The trigger lives at the end of process_command_with_render, immediately after the auto-save. Order matters: the freelist isn't accurate until the bottom-up rebuild runs during save, so we save first, then check the ratio. The check itself is freelist::should_auto_vacuum(pager, threshold), which:
- skips databases under
MIN_PAGES_FOR_AUTO_VACUUM(16 pages = 64 KiB) so tiny files don't churn, - counts both leaf and trunk pages in the freelist (trunks are reclaimable bytes too),
- returns
trueiff(leaves + trunks) / page_count > threshold.
Auto-VACUUM is also skipped mid-transaction (no save → freelist is stale and the compact would publish in-flight work) and on in-memory databases (no file). The path bypasses executor::execute_vacuum — that wrapper builds a user-facing status string and rejects in-transaction calls, both wrong for a silent maintenance hook — and calls vacuum_database directly.
What it doesn't do (yet)
- No LRU eviction.
on_disk+wal_cachetogether grow with the page count. For a 1 GiB database, that's ~1 GiB of page cache. Bounded cache is future work. - No per-statement granularity. The whole database is re-serialized on every commit; the diff keeps the written set small but the CPU cost of reserialization is unchanged.
- No concurrent reader-and-writer. Phase 4e graduated to shared/exclusive lock modes (multi-reader or single-writer), but POSIX flock can't give us both at once. True concurrent access would need a shared-memory coordination file with read marks — not on the roadmap.
- Savepoints / nested transactions. Phase 4f added top-level
BEGIN/COMMIT/ROLLBACK(snapshot-based rollback, auto-save suppressed inside a transaction), but nestedBEGINis rejected — real savepoints aren't on the roadmap.
Interaction with Database
Database holds an Option<Pager>:
None→ in-memory only. Writes don't touch disk.Some(pager)→ file-backed.save_database(db, path)takes the pager off, drives it, and re-attaches on success.
Being able to take() and put back a pager avoids threading lifetimes through the call stack. It does mean a mid-save failure leaves db.pager == None — which is fine, the next .save or auto-save will re-open the file from scratch.
Testing
Unit tests in src/sql/pager/pager.rs:
create_then_open_round_trips— verifycreatewrites a valid empty DB andopencan read it back.create_spawns_wal_sidecar— Phase 4c:createproduces a header-only-walfile alongside the main file.commit_writes_only_dirty_pages— the diff test: write 3 pages, re-stage 2 unchanged + 1 changed, assert the commit only wrote 1 dirty data frame.commit_leaves_main_file_untouched_and_shrink_hides_dropped_pages— confirms the main file stays frozen across commits, the WAL grows, and a shrink-commit hides dropped pages via the bounds check even though their bytes linger in the main file until checkpoint.wal_replay_on_reopen_restores_committed_state— end-to-end: close a Pager after a commit, reopen, verify every staged page comes back via WAL replay.two_commits_only_stage_the_delta— two identical commits produce zero dirty data frames the second time (only the implicit commit frame).second_pager_on_same_file_is_rejected— Phase 4a: exclusive lock rejects simultaneous openers.- Phase 4d checkpoint suite:
explicit_checkpoint_folds_wal_into_main_file_and_truncates_wal— after checkpoint, main file holds the committed content, WAL is back to header-only.checkpoint_is_idempotent— back-to-back checkpoints: second is a no-op.checkpoint_with_shrink_truncates_main_file— a shrink-commit followed by checkpoint actually makes the main file smaller.auto_checkpoint_fires_past_frame_threshold— enough commits trigger auto-checkpoint; the WAL ends up empty on its own.reopen_after_crash_mid_checkpoint_recovers_via_wal— closing before checkpoint is equivalent to a crash mid-checkpoint; WAL replay restores the post-commit view.
- Phase 4e shared/exclusive suite:
two_read_only_openers_coexist— twoopen_read_onlycalls on the same file succeed simultaneously.read_write_blocks_read_only_and_vice_versa— POSIX flock semantics: RO excludes RW and vice versa.read_only_pager_rejects_mutations—commit/checkpointreturn typed errors; reads still work.read_only_open_without_wal_sidecar_succeeds— a deleted sidecar isn't recreated in RO mode; reads fall back to the main file.
The 9 WAL-format tests live in src/sql/pager/wal.rs and cover header / frame round-trips, torn-write recovery, and the orphan-dirty replay invariant in isolation from the Pager.
Higher-level tests in src/sql/pager/mod.rs cover round-tripping populated databases, rejecting garbage, and handling tables that span multiple pages.