Concurrent writes plan

May 11, 2026 Β· View on GitHub

πŸ“˜ Looking for the user-facing reference? This is the original design proposal, kept as the historical record of the decisions that shaped Phase 11. For the shipped surface β€” SQL, embedding API, SDK error mapping, REPL meta-commands, durability story, limitations β€” read concurrent-writes.md first; come back here when you want the why and the sequencing discussion.

Status: shipped end-to-end through Phase 11.11a (May 2026); a small set of follow-ups remain explicitly parked β€” see the roadmap. Drafted 2026-05-07. Inspiration: Turso β€” a SQLite-compatible engine, written in Rust, that implements multi-version concurrency control to lift SQLite's single-writer ceiling. See turso/core/mvcc/ and the Turso concurrent-writes docs. Tracks: SQLR-22 (Marvin).

This document proposes adding multi-version concurrency control (MVCC) and a BEGIN CONCURRENT transaction mode to SQLRite, enabling multiple writers in the same process to make progress in parallel under snapshot isolation, with row-level write-write conflict detection at commit. It is intentionally a plan β€” there is no code yet.

The output of this work, when shipped, would put SQLRite in the same conceptual bucket as Turso's --experimental-mvcc: the file format stays compatible with the existing v5 layout, the legacy serialized path keeps working, and apps opt in by setting a journal mode and using a new BEGIN CONCURRENT statement.


1. Why bother

SQLite's single-writer rule is the design choice users hit first when they scale. It's not a bug β€” it's the consequence of a single global file lock (PENDING/EXCLUSIVE) that serializes commit ordering β€” but in workloads where two concurrent writers touch disjoint rows it costs throughput unnecessarily.

Turso's measurements report up to 4Γ— the write throughput of SQLite when the workload is dominated by non-conflicting writes (Turso v0.5.0 blog post; Beyond the single-writer limitation). For SQLRite β€” which today inherits SQLite's single-writer model end-to-end (advisory flock(LOCK_EX) plus a single Pager per process; see pager.md and Decision 11 in design-decisions.md) β€” the same uplift is theoretically available, if the engine learns to track row versions and detect write-write conflicts at commit instead of holding an exclusive write lock for the duration.

This is also a natural fit for SQLRite's stated remit (a hands-on study of database internals). MVCC, snapshot isolation, optimistic concurrency control, and version-chain garbage collection are textbook database-systems topics; implementing them turns the engine into a credible reference for those topics rather than a SQLite clone with extra storage tricks.


2. What Turso does

Turso's MVCC implementation is the most directly relevant prior art β€” same target SQL, same general layout, same Rust language, and they've already navigated most of the design questions we'd face. A summary of the moving parts (cross-checked against core/mvcc/mod.rs and core/mvcc/database/mod.rs):

2.1 Activation

PRAGMA journal_mode = mvcc;     -- per-connection switch
BEGIN CONCURRENT;
-- writes
COMMIT;                          -- may return SQLITE_BUSY β†’ caller retries

Compiled in behind --experimental-mvcc. BEGIN CONCURRENT doesn't acquire any locks; BEGIN IMMEDIATE / BEGIN DEFERRED retain SQLite's exclusive-write semantics.

2.2 In-memory version index

The MVCC store keeps an in-memory map keyed by RowID { table_id, row_key } whose value is a chain of RowVersion records. Each version carries:

  • begin: TxTimestampOrID β€” the commit timestamp (or transaction ID, while uncommitted) at which this version becomes visible
  • end: Option<TxTimestampOrID> β€” when it stops being visible (a later UPDATE/DELETE pushes a new head onto the chain)
  • the row payload itself

Visibility for a reader transaction with begin-timestamp T is the textbook snapshot-isolation rule: pick the version whose begin <= T < end.

The Hekaton paper (Larson et al., VLDB 2011) is the explicit reference. The same paper also covers the validation phase used at commit and the garbage-collection rules sketched in Β§7 below.

2.3 Read path

  • If the row is present in the MVCC index β†’ use it (latest-wins; all writes funnel through the index, so the in-memory copy is the source of truth).
  • Otherwise β†’ fall through to the existing pager β†’ WAL β†’ main-file read path. Result is materialized into the index on first touch ("eager loaded" today, which is the source of Turso's memory-pressure caveats).

2.4 Write path

Writes go entirely to the MVCC index, tagged with the transaction ID. They're invisible to other transactions until the writer commits. The transaction also tracks a read set and a write set for validation.

2.5 Commit / conflict detection

Three phases:

  1. Take a commit timestamp from a logical clock.
  2. Walk the write set; for each row, check whether any version newer than the transaction's begin timestamp exists in the index. If yes β†’ write-write conflict. The transaction is aborted and the caller sees SQLITE_BUSY (or, in libSQL terms, Busy / BusySnapshot).
  3. Otherwise, stamp the new versions with the commit timestamp, append a WAL log record describing the row deltas, and fsync. The pager's WAL machinery is reused unchanged.

Conflict detection is row-level, not page-level. This is the critical contrast with the upstream SQLite BEGIN CONCURRENT patch, which only resolves at page granularity and forces aborts when two transactions touch unrelated rows that happen to share a page.

2.6 Checkpointing

The MVCC log is "eventually checkpointed into the SQLite database file via the WAL." Specifically: the writer that checkpoints walks the committed-but-unmaterialized portion of the version chain, materializes the latest visible version of each row into the page cache, and lets the regular pager flush it to the WAL/main file. Recovery on reopen relies on the WAL plus the persistent on-disk state β€” the in-memory MVCC log is rebuilt empty.

2.7 Garbage collection

Versions whose end timestamp is older than the oldest active reader's begin-timestamp are dead and may be reclaimed. Turso's implementation today uses a contiguous Vec per row guarded by an RwLock, with a scan-based GC pass that runs opportunistically. The known issue (turso#3499) is that the lock is the bottleneck under heavy contention; a wait-free chain is on their roadmap.

2.8 Limitations (as of writing)

  • CREATE INDEX is not yet supported in MVCC mode β€” neither maintaining secondary indexes through the version chain nor reading them under snapshot isolation.
  • AUTOINCREMENT is not supported (the global counter is not yet versioned).
  • wal_checkpoint(TRUNCATE) is the only checkpoint variant; it blocks readers and writers.
  • "Eager loading" β€” first access to a table loads it fully into the version index. Memory cost scales with the table, not the working set.
  • Row deltas are full row copies, not column-level diffs (turso#3498).
  • I/O is synchronous; io_uring adoption is on the roadmap (turso#1848).

These are honest limits, not gotchas. They suggest where SQLRite's plan should explicitly punt, and where it might do better by virtue of being smaller.


3. SQLRite's current concurrency model

A clear-eyed inventory of what we have today, since the gap is what the plan has to bridge.

3.1 Process-level locking (src/sql/pager/pager.rs)

Pager::open takes flock(LOCK_EX | LOCK_NB) on both the main file and the WAL sidecar via fs2. Pager::open_read_only takes flock(LOCK_SH | LOCK_NB). Reads and writes are mutually exclusive across processes β€” POSIX flock semantics, no shared-memory coordination file. Within a process, the engine runs one writer because the public surface only exposes one Connection per Database/Pager.

3.2 Transactions (src/sql/db/database.rs:149-195)

BEGIN deep-clones the entire tables: HashMap<String, Table> into a TxnSnapshot stashed on Database::txn. ROLLBACK swaps it back. COMMIT clears the snapshot and the next save_database call appends one WAL commit frame. Auto-save is suppressed while a transaction is open. Nested begins are rejected. There is no concept of read-set / write-set tracking, no logical clock, no row-version metadata.

3.3 Storage (src/sql/pager/ + Decision 6 in design-decisions.md)

  • 4 KiB pages, cell-encoded B-trees per table and per secondary index.
  • save_database rebuilds every B-tree bottom-up on every commit from sorted in-memory rows. This is correct-by-construction but assumes a single committer β€” concurrent committers would each want to rebuild the same tree from a different in-memory state.
  • The pager owns three page maps (on_disk / wal_cache / staged), all HashMap<u32, Box<[u8; PAGE_SIZE]>>. There is no per-page locking or versioning; reads inside the engine are sequential.

3.4 Connection model (src/connection.rs)

Connection owns its Database by value. It is Send, not Sync, and has no clone(). The engine's documented embedding pattern when callers want multi-threaded access is to wrap the connection in Mutex<Connection> β€” i.e. serialize access externally.

3.5 What this means

To get from "one writer, one process" to "many writers, one process, conflict-detected at commit" we need to change every layer:

  1. Connection layer: turn Connection into a thin handle on a shared Database/Pager so multiple connections can coexist within a process. Today there's only ever one Connection per Database.
  2. Transaction layer: replace deep-clone snapshots with per-transaction read/write sets and a logical clock. Snapshots scale O(database size) on every BEGIN, which is unworkable with many concurrent transactions.
  3. Storage layer: introduce an MVCC version index sitting in front of the existing pager. Writes land in the index and only become pager-visible at checkpoint time; reads consult the index first, the pager second.
  4. Commit path: validate write sets against committed-after-begin versions. On conflict, return a typed Busy error and let the caller retry. On success, append a WAL transaction record covering the row deltas.
  5. Checkpoint: drain the version index into the existing bottom-up B-tree rebuild path. The legacy commit path doesn't go away β€” BEGIN and BEGIN IMMEDIATE keep their current single-writer semantics β€” but BEGIN CONCURRENT short-circuits past it until checkpoint time.

The good news: the existing WAL and the existing on-disk format are reusable as-is. Turso's checkpoint design ("MVCC log β†’ page cache β†’ WAL β†’ main file") works because the WAL is the same WAL with or without MVCC. The same is true for SQLRite's WAL.


4. Proposed design

A high-level sketch of the layered architecture this proposal targets.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Connection (handle) ── Connection (handle) ── ...       β”‚   <- many per process
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚              Shared Database (Arc<DatabaseInner>)         β”‚
β”‚                                                          β”‚
β”‚    Schema cache    β”‚    MVCC store    β”‚   Logical clock  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                     β”‚                    β”‚
β”‚   Pager (existing) ─ WAL ─ main fileβ”‚   GC worker        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4.1 Logical clock

A monotonic u64 counter, per-Database. Hands out begin_ts at BEGIN CONCURRENT and commit_ts at the start of validation. Wrapped in AtomicU64; no contention because each transaction calls it twice. Persisted to the WAL header on each commit so reopens resume past the highest committed timestamp. (Turso's MvccClock is the same shape.)

4.2 Version index

pub struct MvStore {
    // (table_id, rowid) -> head of version chain
    versions: DashMap<RowID, RowVersionChain>,
    clock: MvccClock,
    active: ActiveTxRegistry,   // for GC: min active begin_ts
}

pub struct RowVersion {
    begin: TxTsOrId,            // commit_ts once committed; tx_id while in-flight
    end:   Option<TxTsOrId>,    // None == latest visible version
    payload: Row,               // for v0; column deltas later (turso#3498-style)
    next: Option<Arc<RowVersion>>,
}

A few intentional simplifications versus Turso for v0:

  • One chain per row, behind RwLock (or parking_lot::RwLock). The wait-free chain is a known follow-up; it's not on the v0 critical path.
  • Full-row payloads (defer column-level deltas).
  • Lazy materialization from the pager on first touch. Deviates from Turso's "load whole table eagerly", which they themselves treat as debt; we'd be starting from the better default.

4.3 Transaction record

pub struct ConcurrentTx {
    id: TxId,
    begin_ts: u64,
    state: TxState,                          // Active | Committed | Aborted
    read_set: HashSet<RowID>,                // for serializable-isolation upgrade later
    write_set: HashMap<RowID, Arc<RowVersion>>,
    schema_snapshot: Arc<SchemaSnapshot>,    // pinned at BEGIN
}

Read-set tracking is optional for snapshot isolation but cheap to keep β€” and unlocks serializable later if we want it.

4.4 Read protocol

read(row_id, begin_ts):
    if row_id in MvStore.versions:
        walk chain, return version with begin <= begin_ts < end
    else:
        bytes = pager.read_via_existing_path(row_id)
        materialize into MvStore.versions as a single committed-at-zero version
        return it

4.5 Commit protocol

commit(tx):
    commit_ts = clock.tick()
    for row_id in tx.write_set:
        chain = MvStore.versions[row_id]
        head = chain.read()
        if head.begin > tx.begin_ts:        # someone committed under us
            return Err(Busy)
    for row_id, new_version in tx.write_set:
        new_version.begin = commit_ts
        MvStore.versions[row_id].push(new_version)
    pager.append_wal_log_record(tx.write_set, commit_ts)   # one fsync
    return Ok

The validation pass walks the write set only β€” typically tiny. The fsync is one barrier per transaction, same cost as today's commit_transaction.

4.6 Checkpoint

Existing checkpoint logic (Phase 4d) folds WAL frames into the main file when frame count crosses AUTO_CHECKPOINT_THRESHOLD_FRAMES. The MVCC variant adds a step before that: walk the version index, pick the latest committed version per row visible at min(active begin_ts) - 1, hand it to save_database (which rebuilds the B-tree). Older versions get GC'd at the same time. Schema changes still require an exclusive lock.

4.7 Garbage collection

A separate worker (or piggy-backed on commit) walks MvStore.versions and prunes versions whose end timestamp is below the oldest active reader's begin_ts. v0 can run this synchronously inside commit β€” sweep the rows touched in the write set only β€” and add an explicit background sweep later.

4.8 Compatibility with existing transactions

BEGIN and BEGIN IMMEDIATE keep their current semantics β€” full deep-clone snapshot, exclusive write, no row-version tracking. They remain the right choice for DDL. A BEGIN IMMEDIATE running on the same connection blocks new BEGIN CONCURRENT commits (they get Busy) and waits for in-flight ones to finish. This matches Turso's "exclusive vs concurrent" coexistence rule.


5. Phased plan

Sequenced so each sub-phase is independently shippable, follows SQLRite's existing roadmap discipline (every phase ends with a working build + full test suite + a commit on main), and the engine never regresses while in flight. Tentatively numbered as Phase 10 (Phase 9 = benchmarks; Phase 8 = FTS, shipped). Slot may shift depending on roadmap rebalancing.

Phase 10.1 β€” Multi-connection foundation

Goal: more than one Connection can target the same Database within a process, with no behavior change otherwise.

  • Split Database into DatabaseInner (state) + Connection (handle that holds Arc<DatabaseInner>).
  • Move the Pager behind a Mutex or RwLock inside DatabaseInner.
  • New Database::connect() -> Connection. The existing Connection::open path opens the database and returns a freshly-connected handle.
  • Make the engine Send + Sync end-to-end. No new SQL surface yet.
  • Tests: spawn N threads, each running independent BEGIN; INSERT; COMMIT against the same DB, prove the existing single-writer locking still serializes them correctly with no panics.

Phase 10.2 β€” Logical clock + transaction registry

  • Add MvccClock with tick() -> u64 + now() -> u64.
  • Add ActiveTxRegistry exposing min_active_begin_ts() for the GC.
  • Persist the clock high-water-mark in the WAL header (extend WalHeader with a 16-byte field; bump WAL_FORMAT_VERSION from 1 β†’ 2 β€” backward-compatible because the WAL is regenerated on checkpoint).
  • Tests: round-trip the clock through reopen, reject WAL files where the persisted timestamp is non-monotonic.

Phase 10.3 β€” MvStore skeleton + reads

  • Implement MvStore and RowVersion. v0: one chain per row behind RwLock<Vec<RowVersion>>.
  • Wire reads through the store: lazy-load on first touch, return the visible version for the caller's begin_ts.
  • New SQL gate: PRAGMA journal_mode = mvcc switches the connection into MVCC mode for reads. Writes still go through the legacy path.
  • Tests: snapshot isolation reads β€” one connection inserts and commits, a second connection that began before the commit doesn't see the new row; a third connection that began after does.

Phase 10.4 β€” BEGIN CONCURRENT writes

  • Extend the parser to map BEGIN CONCURRENT (sqlparser already recognizes the modifier) to a new TxKind::Concurrent.
  • Route writes from TxKind::Concurrent transactions into the MvStore's write set rather than the live tables map.
  • Implement the commit-time validation loop. New SQLRiteError::Busy and SQLRiteError::BusySnapshot variants (decision: use both, mirroring Turso, so SDKs can map cleanly).
  • WAL log record format: a new frame kind carrying (table_id, rowid, op, payload) tuples. Distinct from the existing per-page commit frame; the checkpointer flattens log records into page-level updates.
  • Tests:
    • two concurrent inserts on disjoint rowids both commit
    • two concurrent updates on the same rowid: one commits, one aborts with Busy
    • aborted transaction's writes never become visible
    • retry-after-Busy succeeds

Phase 10.5 β€” Checkpoint + crash recovery

Status (roadmap 11.9 β€” May 2026): The crash-recovery half landed in roadmap Phase 11.9. WAL format is bumped to v3; commits append a typed MvccCommitBatch frame before the legacy save's fsync; reopen replays those frames into MvStore and seeds MvccClock past the highest commit_ts. The checkpoint-drain half β€” folding MVCC log records into pager-level updates and re-enabling the Mvcc β†’ Wal journal-mode downgrade β€” is the remaining slice and stays parked for a follow-up.

  • Extend the checkpointer to drain MVCC log records into pager-level updates before folding the WAL into the main file. Deferred β€” see status note above.
  • Crash recovery: on open, replay WAL log records into MvStore, then replay pager-level commit frames as today. (Shipped β€” 11.9.)
  • Tests: kill the process mid-MVCC-commit (between log-record append and version-chain push), reopen, verify the committed transaction is visible and the half-written one is not. (Shipped β€” 11.9 covers the clean-drop case which exercises the same recovery codepath; a real OS-kill test is parked with the checkpoint-drain follow-up.)

Phase 10.6 β€” Garbage collection

  • Per-commit sweep over the write set's chains.
  • Background sweep behind a new PRAGMA mvcc_gc_interval_ms (default 1000).
  • Memory-pressure trigger: if MvStore size > mvcc_max_memory_bytes, force a sweep at the next safe point.
  • Tests: chain length stays bounded under heavy single-row update pressure.

Phase 10.7 β€” Indexes under MVCC (deferred-by-design β€” separate later phase)

Index maintenance under MVCC is hard enough that Turso explicitly punted on it. SQLRite should ship the v0 with CREATE INDEX rejected when journal_mode = mvcc, and tackle indexes as a follow-up phase once the v0 is stable. Sketched solution: each secondary-index entry becomes a RowVersion itself, keyed by (index_id, key, rowid). The cost is one version chain per indexed (column, row) pair.

Phase 10.8 β€” Public API + SDK propagation

  • Surface Connection::begin_concurrent() -> Result<()> as a typed convenience wrapper.
  • Surface SQLRiteError::Busy and SQLRiteError::BusySnapshot through the FFI shim (sqlrite-ffi/) and each SDK (Python, Node, Go, WASM). The right level of abstraction: an exception per SDK plus a typed retry helper.
  • New REPL meta-command .spawn that opens an additional connection to the same DB so users can demo BEGIN CONCURRENT interactively.
  • New benchmark workload in benchmarks/ that pits SQLRite-MVCC against SQLite + SQLRite-default + Turso on a "N writers, mostly disjoint rows" scenario. Slots into the existing SQLR-16 harness as a Group D differentiator workload.

Phase 10.9 β€” Docs

Status (roadmap 11.12 β€” May 2026): Shipped. The canonical user-facing reference at docs/concurrent-writes.md covers the SQL surface, embedding API, SDK error mapping, REPL meta-commands, durability story, and limitations as of Phase 11.11a. This plan-doc is now the historical record. Cross-references in _index.md, supported-sql.md, embedding.md, and design-decisions.md point at the canonical doc; a runnable example lives at examples/rust/concurrent_writers.rs.

  • Promote this plan to docs/concurrent-writes.md (the canonical user-facing reference), keeping concurrent-writes-plan.md as the historical design document. (Shipped β€” 11.12.)
  • Update roadmap.md, docs/_index.md, supported-sql.md, embedding.md, design-decisions.md. (Shipped β€” 11.12.)
  • Add a worked example under examples/rust/concurrent_writers.rs. (Shipped β€” 11.12.)

6. SQL & API surface

-- Opt in (per connection, per session)
PRAGMA journal_mode = mvcc;

-- Snapshot-isolation transaction with row-level conflict detection at commit
BEGIN CONCURRENT;
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
UPDATE accounts SET balance = balance + 50 WHERE id = 2;
COMMIT;   -- may return SQLRITE_BUSY β†’ caller retries

-- Existing transactions still work (single-writer, exclusive)
BEGIN;             -- aka BEGIN DEFERRED, identical to today
BEGIN IMMEDIATE;   -- new alias of BEGIN; documented for Turso parity
// Rust embedding API
let conn = db.connect()?;          // new β€” multi-connection
conn.execute("PRAGMA journal_mode = mvcc")?;

loop {
    conn.execute("BEGIN CONCURRENT")?;
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")?;
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")?;
    match conn.execute("COMMIT") {
        Ok(_) => break,
        Err(SQLRiteError::Busy | SQLRiteError::BusySnapshot) => {
            conn.execute("ROLLBACK").ok();
            continue;
        }
        Err(e) => return Err(e),
    }
}

A typed Connection::with_concurrent_tx<F>(f: F) helper that hides the retry loop is worth considering once the bare API is shaken out.


7. File-format implications

  • The on-disk .sqlrite file format (currently v5) does not change. MVCC state lives in memory and in the WAL.
  • The WAL header gains a clock_high_water: u64 field. Bump WAL_FORMAT_VERSION from 1 β†’ 2. Pre-v2 WALs open as v1 (clock starts from 0); the next checkpoint rewrites the header at v2.
  • A new WAL frame kind for MVCC log records. Existing per-page commit frames keep their format. Both kinds can coexist in the same WAL β€” the checkpointer drains log records first, then per-page commits, then truncates.

This keeps sqlrite-engine v0.x readers compatible with v1.x files and vice versa as long as MVCC isn't enabled at write time.


8. Risks, open questions, and explicit non-goals

Risks

  • Bottom-up B-tree rebuild (Decision 6) is the biggest architectural mismatch. Today every commit rebuilds the whole tree. With MVCC the rebuild only runs at checkpoint, so the cost is amortized β€” but tables that grow large between checkpoints can produce huge checkpoint stalls. Mitigation: in-place B-tree updates are an obvious follow-up, separately tracked.
  • Memory growth β€” the version index holds full row payloads. Mitigation: column-level deltas (Turso's #3498) and an mvcc_max_memory_bytes cap.
  • Lock contention on the per-row chain β€” RwLock<Vec<RowVersion>> is good enough for v0 but won't scale to extreme concurrency. Mitigation: wait-free chain (Turso's #3499) as a v1 follow-up.
  • DDL under MVCC β€” schema changes need an exclusive promotion of the lock. v0 should reject DDL inside a BEGIN CONCURRENT block with a clean error.
  • Multi-process MVCC is out of scope. Turso is intra-process; the in-memory version index has no cross-process visibility, and adding it would require a shared-memory coordination file (the same machinery SQLite's WAL uses for read marks). Multi-process writes continue to be serialized by the existing flock(LOCK_EX). Documented as a non-goal until someone has a workload that justifies the engineering.

Open questions

  • Should journal_mode = mvcc be per-connection or per-database? Turso treats it per-connection; SQLite's journal_mode is per-database. Per-connection is more flexible (a maintenance connection can stay in WAL mode while app connections use MVCC), but per-database is closer to user expectation.
  • Should MvStore materialize the whole table on first touch, or one row at a time? Per-row is cheaper and avoids Turso's eager-load complaint. The cost is one pager.read per cold row.
  • Does BEGIN CONCURRENT upgrade to exclusive on the first write to a versioned table that has no MvStore entry? The simpler answer is no β€” first read materializes the row into the index, no lock upgrade needed.
  • AUTOINCREMENT. Globally-incrementing rowid counters are not naturally versioned. Acceptable v0 answer: reject INSERT into an AUTOINCREMENT table from a BEGIN CONCURRENT block (matches Turso). Better answer: a serializable counter sequence service. Defer.
  • Hermitage-style anomaly tests. Turso's hermitage_tests.rs is a port of Martin Kleppmann's hermitage suite. SQLRite should adopt the same tests as the snapshot-isolation conformance bar.

Non-goals (v0)

  • Concurrent reads of an in-progress checkpoint (Turso's wal_checkpoint(TRUNCATE) blocks; ours can too).
  • Automatic backoff in the SDK retry helpers β€” the caller picks the policy.
  • Cross-process MVCC.
  • Asynchronous I/O / io_uring.

9. Estimated effort

Order-of-magnitude only β€” each sub-phase is a roadmap entry and gets a real estimate when scheduled.

Sub-phaseRough sizeNotes
10.1 multi-connectionMmostly mechanical, tests heavy
10.2 clock + tx registrySsmall, isolated
10.3 MvStore readsMnew module, lazy materialization
10.4 BEGIN CONCURRENT writesLthe meat
10.5 checkpoint + recoveryMreuses existing checkpointer
10.6 GCS–Mone synchronous + one background
10.7 indexes under MVCCLdeferred
10.8 SDK propagationMone slice per SDK
10.9 docsSone pass once 10.4 lands

Total before 10.7: β‰ˆ 4–6 weeks of focused work for a single engineer, given the existing test discipline. Index maintenance under MVCC (10.7) is its own multi-week effort and may slip to a separate phase.


10. References