Enabling MVCC
How MVCC differs from WAL
| Aspect | WAL | MVCC |
|---|---|---|
| Write granularity | Full pages per commit | Affected rows only |
| Readers vs. writers | Don’t block each other | Don’t block each other |
| Persistence file | .db-wal | .db-log (logical log) |
| Isolation level | Page-level snapshot | Row-level snapshot |
| Garbage collection | Automatic (WAL reuse) | Not yet implemented |
Architecture
MVCC uses a three-tier storage hierarchy:

mv_tx tracking its current MVCC transaction. The MvStore is shared and uses lock-free crossbeam_skiplist data structures.
Row versioning
Every row version carries three fields:

| Field | Meaning |
|---|---|
| `begin` | Timestamp (transaction ID) when this version became visible |
| `end` | Timestamp when this version was deleted or replaced; ∞ if still current |
| `btree_resident` | True if this version existed before MVCC was enabled |
For example, consider a table with a single `id` column:
- T0 inserts `id=1` → version `(begin=0, end=∞, id=1)`
- T1 inserts `id=2` → version `(begin=1, end=∞, id=2)`
- T2 deletes `id=1` and inserts `id=3`:
  - Deletes row 1: updates its version to `(begin=0, end=2, id=1)`
  - Inserts row 3: version `(begin=2, end=∞, id=3)`
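As a rough illustration, the timeline above can be replayed against a toy version store. The `RowVersion` and `VersionStore` types are hypothetical and are not Turso's actual structures; `end = None` stands in for ∞.

```rust
/// Hypothetical row version; `end == None` means "still current" (∞).
#[derive(Debug, Clone, PartialEq)]
pub struct RowVersion {
    pub begin: u64,
    pub end: Option<u64>,
    pub id: i64,
}

/// Toy append-only store of versions (illustrative only).
#[derive(Default)]
pub struct VersionStore {
    pub versions: Vec<RowVersion>,
}

impl VersionStore {
    /// An insert at timestamp `ts` creates a new current version.
    pub fn insert(&mut self, ts: u64, id: i64) {
        self.versions.push(RowVersion { begin: ts, end: None, id });
    }

    /// A delete at timestamp `ts` closes the current version of `id`.
    /// Returns false if no current version exists.
    pub fn delete(&mut self, ts: u64, id: i64) -> bool {
        for v in self.versions.iter_mut() {
            if v.id == id && v.end.is_none() {
                v.end = Some(ts);
                return true;
            }
        }
        false
    }
}

/// Replays T0, T1, and T2 from the example.
pub fn replay_example() -> VersionStore {
    let mut store = VersionStore::default();
    store.insert(0, 1); // T0 inserts id=1
    store.insert(1, 2); // T1 inserts id=2
    store.delete(2, 1); // T2 deletes id=1
    store.insert(2, 3); // T2 inserts id=3
    store
}
```

Note that the delete does not remove anything: it only stamps `end` on the existing version, so older snapshots can still read row 1.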
Read path
Reads follow a lazy loading strategy with a strict precedence order:
- Query the in-memory MVCC index. If the row exists and is visible to the current transaction’s snapshot, return it.
- If the row is not in the MVCC index, fall back to the page cache (which may itself load from the WAL or the main `.db` file).

A version is visible to a transaction with `begin_ts = T` if `version.begin ≤ T < version.end`.
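A minimal sketch of the visibility rule and the fallback order, under stated assumptions: the `Store` type and its two maps are hypothetical stand-ins for the MVCC index and the page cache, and a row that is present in the index but has no visible version is treated as absent rather than falling through (an assumption, not something the text confirms).

```rust
use std::collections::HashMap;

/// Hypothetical version record; `end == None` means ∞.
#[derive(Debug, Clone)]
pub struct Version {
    pub begin: u64,
    pub end: Option<u64>,
    pub value: String,
}

/// Visibility rule: begin <= T < end.
pub fn is_visible(v: &Version, t: u64) -> bool {
    v.begin <= t && v.end.map_or(true, |e| t < e)
}

/// Toy two-level store: MVCC index first, page cache second.
pub struct Store {
    pub mvcc_index: HashMap<i64, Vec<Version>>,
    pub page_cache: HashMap<i64, String>,
}

impl Store {
    pub fn read(&self, row_id: i64, t: u64) -> Option<String> {
        match self.mvcc_index.get(&row_id) {
            // Row is tracked by MVCC: return the newest visible version, if any.
            Some(chain) => chain
                .iter()
                .rev()
                .find(|v| is_visible(v, t))
                .map(|v| v.value.clone()),
            // Row is unknown to MVCC: fall back to the page cache.
            None => self.page_cache.get(&row_id).cloned(),
        }
    }
}
```

Because the upper bound is exclusive, a transaction whose `begin_ts` equals a version's `end` no longer sees that version.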
Write path
All writes during a transaction are handled entirely within the in-memory MVCC index. This provides:
- High-performance writes with minimal latency.
- Immediate read-your-own-writes visibility within the transaction.
- Isolation from other concurrent transactions until commit.
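One way to picture these three properties is a version whose `begin` field holds either the owning transaction's ID or a commit timestamp. The sketch below is an assumption built on the `begin=TxID` notation used in the garbage collection rules; the `Index` type and its methods are hypothetical, and only inserts are modeled.

```rust
/// A version is stamped with a transaction ID until its writer commits.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stamp {
    TxId(u64),
    Timestamp(u64),
}

#[derive(Debug, Clone)]
pub struct Version {
    pub begin: Stamp,
    pub id: i64,
}

/// Toy shared in-memory index (illustrative only).
#[derive(Default)]
pub struct Index {
    pub versions: Vec<Version>,
}

impl Index {
    /// Writes go straight into the in-memory index, tagged with the writer.
    pub fn insert(&mut self, tx_id: u64, id: i64) {
        self.versions.push(Version { begin: Stamp::TxId(tx_id), id });
    }

    /// Row ids visible to transaction `tx_id` reading at `snapshot_ts`.
    pub fn visible_ids(&self, tx_id: u64, snapshot_ts: u64) -> Vec<i64> {
        self.versions
            .iter()
            .filter(|v| match v.begin {
                // Uncommitted: visible only to its own writer.
                Stamp::TxId(owner) => owner == tx_id,
                // Committed: visible to snapshots at or after the commit.
                Stamp::Timestamp(ts) => ts <= snapshot_ts,
            })
            .map(|v| v.id)
            .collect()
    }

    /// On commit, the writer's versions receive the commit timestamp.
    pub fn commit(&mut self, tx_id: u64, commit_ts: u64) {
        for v in self.versions.iter_mut() {
            if v.begin == Stamp::TxId(tx_id) {
                v.begin = Stamp::Timestamp(commit_ts);
            }
        }
    }
}
```

In this model the writer sees its own row immediately, other transactions see nothing until `commit` runs, and snapshots older than the commit timestamp never see it.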
Commit protocol
Commit uses a two-phase approach to ensure durability:
- Write the complete transaction write set from the MVCC index to the page cache.
- Flush page cache contents to the WAL, making the transaction durable.
Unlike Hekaton, Turso does not maintain a record of logical changes after flushing to the WAL. This simplifies compatibility with the SQLite file format.
Checkpointing
Checkpointing flushes the in-memory row versions to the B-tree on disk. It is currently a blocking, stop-the-world operation that runs the following steps in order:
- Acquire the blocking checkpoint lock (blocks all other transactions, including reads).
- Begin a pager transaction.
- Write committed MVCC row/index versions into the pager.
- Upsert the `persistent_tx_ts_max` metadata row atomically in the same pager transaction.
- Commit the pager transaction (WAL now contains committed frames).
- Checkpoint WAL (backfill frames into the `.db` file).
- Fsync the `.db` file.
- Truncate the logical log to 0 bytes (salt regenerated; header written with next frame).
- Fsync the logical log.
- Truncate the WAL.
- GC checkpointed versions; release the lock.

Until the `.db` file and logical log cleanup are durable, the WAL remains the authoritative recovery source.
Garbage collection
The MVCC store accumulates row versions in memory as writes occur. Without GC, memory grows monotonically. The GC system reclaims versions that no active reader can see and that are redundant with the B-tree. GC is driven by two parameters computed at GC time:
- LWM (low-water mark): `min(tx.begin_ts)` across all active/preparing transactions, or `u64::MAX` if none. Defines the oldest snapshot any reader holds.
- `ckpt_max` (`durable_txid_max`): the highest committed timestamp whose data has been written to the B-tree. Defines when B-tree fallthrough is safe.

Each version is then classified by its `begin` and `end` fields:
- Aborted garbage (`begin=None, end=None`): remove unconditionally.
- Superseded versions (`end=Timestamp(e), e ≤ lwm`): remove, unless doing so would let the dual cursor surface a stale B-tree row (tombstone guard).
- Sole-survivor current version (`end=None, begin ≤ ckpt_max, begin < lwm, chain length = 1`): remove; the B-tree already has the same data.
- TxID references (`begin=TxID` or `end=TxID`): keep; the owning transaction hasn’t resolved yet.

GC runs in the Finalize stage of each checkpoint.
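The four rules can be sketched as a single classification function. Everything here is a hypothetical model: `Stamp`, `Version`, `GcContext`, and `Verdict` are illustrative names, and the tombstone guard is reduced to a boolean supplied by the caller rather than computed from the B-tree.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stamp {
    TxId(u64),
    Timestamp(u64),
}

#[derive(Debug, Clone, Copy)]
pub struct Version {
    pub begin: Option<Stamp>,
    pub end: Option<Stamp>,
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Remove,
    Keep,
}

pub struct GcContext {
    pub lwm: u64,
    pub ckpt_max: u64,
    /// Number of versions in this row's chain.
    pub chain_len: usize,
    /// True if removing a superseded version would expose a stale B-tree row.
    pub tombstone_guard: bool,
}

/// LWM: the minimum begin_ts over active transactions, or u64::MAX if none.
pub fn low_water_mark(active_begin_ts: &[u64]) -> u64 {
    active_begin_ts.iter().copied().min().unwrap_or(u64::MAX)
}

pub fn classify(v: &Version, cx: &GcContext) -> Verdict {
    use Stamp::{Timestamp, TxId};
    match (v.begin, v.end) {
        // Aborted garbage.
        (None, None) => Verdict::Remove,
        // The owning transaction has not resolved yet.
        (Some(TxId(_)), _) | (_, Some(TxId(_))) => Verdict::Keep,
        // Superseded and invisible to every active reader.
        (_, Some(Timestamp(e))) if e <= cx.lwm => {
            if cx.tombstone_guard { Verdict::Keep } else { Verdict::Remove }
        }
        // Sole survivor that the B-tree already holds.
        (Some(Timestamp(b)), None)
            if b <= cx.ckpt_max && b < cx.lwm && cx.chain_len == 1 =>
        {
            Verdict::Remove
        }
        _ => Verdict::Keep,
    }
}
```

The TxID arm is checked before the timestamp arms so that an unresolved transaction always wins, whatever the other field contains.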
Recovery
Turso uses four durable artifacts to determine startup state:
- Main database file (`.db`)
- WAL file (`.db-wal`)
- MVCC logical log (`.db-log`)
- Metadata table row: `__turso_internal_mvcc_meta(k='persistent_tx_ts_max')`
Case 1: WAL has committed frames + valid log header
Complete the interrupted checkpoint: backfill WAL into DB, sync DB, truncate WAL. Then run logical-log recovery with the metadata cutoff.

Case 2: WAL has committed frames + log header missing
Fail closed with `Corrupt`. A committed WAL without a logical log is an inconsistent state.

Case 3: WAL has committed frames + log header invalid/torn
Fail closed with `Corrupt`.

Case 4: WAL has no committed frames
Truncate/discard WAL tail bytes and continue logical-log recovery.

Case 5: No WAL + invalid/torn log header
Fail closed with `Corrupt`.

Case 6: No WAL + valid header, no frames
No replay needed; timestamp state comes from the metadata row.

Case 7: No WAL + empty log (0 bytes)
Normal post-checkpoint state. Timestamp loaded from metadata row if present; no replay.
After recovery, the logical clock is set to `max(persistent_tx_ts_max, max_replayed_log_commit_ts) + 1`.
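The seven cases can be read as a decision table over the WAL state and the logical-log state. The sketch below encodes that table with hypothetical enums and makes three assumptions the text does not spell out: "log header missing" is modeled as the same state as a 0-byte log, Case 4 is treated like the no-WAL cases once the WAL tail is discarded, and a valid log that does contain frames leads to a replay.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WalState {
    Absent,
    NoCommittedFrames,
    CommittedFrames,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LogState {
    /// 0 bytes, no header present (assumed to cover "header missing").
    Empty,
    /// Header is invalid or torn.
    Torn,
    ValidNoFrames,
    ValidWithFrames,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    /// Backfill WAL into DB, sync, truncate WAL, then run log recovery.
    FinishCheckpointThenReplay,
    /// Fail closed with a corruption error.
    FailCorrupt,
    /// Replay logical-log frames above the metadata cutoff.
    ReplayLog,
    /// Nothing to replay; timestamp state comes from the metadata row.
    NoReplay,
}

pub fn recovery_action(wal: WalState, log: LogState) -> Action {
    match (wal, log) {
        // Case 1
        (WalState::CommittedFrames, LogState::ValidNoFrames | LogState::ValidWithFrames) => {
            Action::FinishCheckpointThenReplay
        }
        // Cases 2 and 3
        (WalState::CommittedFrames, LogState::Empty | LogState::Torn) => Action::FailCorrupt,
        // Case 5 (and Case 4 with a torn log, by assumption)
        (WalState::NoCommittedFrames | WalState::Absent, LogState::Torn) => Action::FailCorrupt,
        // Cases 6 and 7
        (_, LogState::ValidNoFrames | LogState::Empty) => Action::NoReplay,
        // Valid log with frames and no committed WAL (assumed)
        (_, LogState::ValidWithFrames) => Action::ReplayLog,
    }
}
```

Writing the table as an exhaustive `match` means the compiler rejects any combination that is left undecided, which mirrors the fail-closed intent.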
Correctness invariants
- Startup reaches one consistent state or fails closed — no best-effort ambiguity.
- Committed WAL state is never ignored.
- Invalid logical-log tail frames are never replayed.
- `persistent_tx_ts_max` is advanced atomically with the pager commit during checkpoint.
- Replay applies only frames with `commit_ts > persistent_tx_ts_max`.
- After interrupted-checkpoint reconciliation, the WAL is truncated.
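The replay cutoff and the timestamp formula from the Recovery section can be illustrated with two small functions. `LogFrame` and its `valid` flag are hypothetical, and stopping at the first invalid frame is an assumption about how "invalid tail frames are never replayed" is enforced.

```rust
/// Hypothetical logical-log frame; `valid` models a checksum result.
#[derive(Debug, Clone, Copy)]
pub struct LogFrame {
    pub commit_ts: u64,
    pub valid: bool,
}

/// Commit timestamps of the frames that replay would apply.
pub fn frames_to_replay(frames: &[LogFrame], persistent_tx_ts_max: u64) -> Vec<u64> {
    frames
        .iter()
        // Assumed: everything from the first invalid frame onward is a torn tail.
        .take_while(|f| f.valid)
        // Frames at or below the cutoff are already in the B-tree.
        .filter(|f| f.commit_ts > persistent_tx_ts_max)
        .map(|f| f.commit_ts)
        .collect()
}

/// max(persistent_tx_ts_max, max_replayed_log_commit_ts) + 1
pub fn next_timestamp(persistent_tx_ts_max: u64, replayed: &[u64]) -> u64 {
    let max_replayed = replayed.iter().copied().max().unwrap_or(0);
    persistent_tx_ts_max.max(max_replayed) + 1
}
```

With a cutoff of 4 and frames at timestamps 3, 5, 6 followed by a torn frame, only 5 and 6 are replayed and the next timestamp is 7.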
Current limitations
- No garbage collection of old versions — memory use grows monotonically with write volume.
- No recovery from the logical log on restart — the in-memory state must be reconstructed from the log on every open.
- Checkpoint blocks all transactions, including reads.
- No disk spill — the in-memory SkipMap is unbounded.
- No serializability — MVCC provides snapshot isolation, not serializable isolation.
Key source files
| File | Contents |
|---|---|
| `core/mvcc/mod.rs` | Module overview; documents data anomaly definitions |
| `core/mvcc/database/mod.rs` | Main implementation (~3,000 LOC): reads, writes, commit, GC |
| `core/mvcc/cursor.rs` | Dual cursor merging MVCC SkipMap with B-tree reads |
| `core/mvcc/persistent_storage/logical_log.rs` | Logical log on-disk format |
| `core/mvcc/database/checkpoint_state_machine.rs` | Checkpoint logic and GC triggers |
| `core/mvcc/clock.rs` | Logical clock |
Testing
Use the `#[turso_macros::test(mvcc)]` attribute to write MVCC-enabled Rust tests.