Compare commits

...

2 Commits

Author SHA1 Message Date
Ben
5a5d19184a docs(assets): add NS-506 session DB durability infographic 2026-06-17 15:57:15 +10:00
Ben
541d155e96 fix(state): harden session DB against torn writes on constrained hosts (NS-506)
A beta tester on a small Fly machine hit a corrupted session database
(state.db). The instance was memory/disk constrained — the agent was even
trying to add a swapfile (blocked by Fly seccomp), and SIGTERM-under-s6
mid-write plus a disk filling with a large npm cache are exactly the
conditions that tear a SQLite file.

The session DB opened with WAL (good) but never set `synchronous` or an
explicit `busy_timeout`, so it ran at SQLite's default durability and a
1s Python-level timeout. This commit pins the SQLite-recommended WAL
durability settings:

- WAL → `synchronous=NORMAL`: the DB file stays consistent across OS crash /
  power loss / process kill. A process kill loses no committed transaction;
  after an OS crash or power loss, commits made since the last WAL sync can
  roll back. This avoids FULL's per-commit fsync cost on the hot
  session-write path.
- DELETE fallback (NFS/SMB/FUSE, where WAL is unavailable) → `synchronous=
  FULL`, since without WAL only FULL is crash-safe.
- explicit `busy_timeout=2000` so a checkpoint/contention spike surfaces
  as a brief wait, not an immediate "database is locked".
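
The `busy_timeout` bullet can be illustrated with a minimal sketch, assuming plain `sqlite3` rather than the project's `SessionDB`. One connection holds the write lock for a short time while a second connection tries to write. `contended_write` and its parameters are hypothetical names for illustration only. Python's `connect(timeout=...)` and `PRAGMA busy_timeout` configure the same SQLite busy handler, so the sketch passes `timeout=0` and then lets the pragma decide.

```python
import os
import sqlite3
import tempfile
import threading
import time


def contended_write(busy_timeout_ms: int, hold_seconds: float = 0.3) -> str:
    """Write while another connection holds the write lock.

    Returns "ok" on success, otherwise the SQLite error text.
    Hypothetical helper for illustration; not part of hermes_state.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")

        # The holder takes the write lock and releases it after hold_seconds.
        holder = sqlite3.connect(path, isolation_level=None, check_same_thread=False)
        holder.execute("PRAGMA journal_mode=WAL")
        holder.execute("CREATE TABLE t (v INTEGER)")
        holder.execute("BEGIN IMMEDIATE")
        holder.execute("INSERT INTO t VALUES (1)")

        def release() -> None:
            time.sleep(hold_seconds)
            holder.execute("COMMIT")

        releaser = threading.Thread(target=release)
        releaser.start()

        # timeout=0 disables Python's own wait; the pragma then sets the real one.
        writer = sqlite3.connect(path, isolation_level=None, timeout=0)
        writer.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
        try:
            writer.execute("INSERT INTO t VALUES (2)")
            result = "ok"
        except sqlite3.OperationalError as exc:
            result = str(exc)
        finally:
            releaser.join()
            writer.close()
            holder.close()
        return result
```

Under these assumptions, a 2000 ms timeout should ride out the 0.3 s lock hold, while a timeout of 0 should fail at once with a locked-database error.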

The existing malformed-schema detection + timestamped backup + auto-repair
(`is_malformed_db_error` / `repair_state_db_schema`, surfaced by
`hermes doctor`) already covers *recovery*; this closes the *prevention*
gap that let the corruption happen in the first place.

Tests: 4 new pragma assertions (WAL→NORMAL, busy_timeout, never-OFF,
foreign_keys preserved) that fail without the fix. Plus a real E2E:
SIGKILL a child mid-write, reopen → `PRAGMA integrity_check` returns ok
with all rows intact. Full hermes_state suite (37) green.

Reported via beta (NS-506).
2026-06-17 15:56:08 +10:00
3 changed files with 76 additions and 1 deletions

Binary file not shown.

@@ -722,7 +722,24 @@ class SessionDB:
             isolation_level=None,
         )
         self._conn.row_factory = sqlite3.Row
-        apply_wal_with_fallback(self._conn, db_label="state.db")
+        journal_mode = apply_wal_with_fallback(self._conn, db_label="state.db")
+        # Durability hardening (NS-506). On memory/disk-constrained
+        # hosts (e.g. a small Fly machine that OOM-kills or SIGTERMs
+        # the process mid-write, or fills the disk with a large npm
+        # cache), an interrupted write can tear the DB. With WAL,
+        # ``synchronous=NORMAL`` fsyncs the WAL at each checkpoint and
+        # is crash-safe against OS crash / power loss / process kill
+        # (only the very last un-checkpointed transaction can be lost,
+        # never the database file itself), the SQLite-recommended
+        # setting for WAL. On the DELETE fallback (NFS/SMB/FUSE) we use
+        # FULL, since without WAL only FULL is crash-safe. We also set
+        # an explicit ``busy_timeout`` so a checkpoint/contention spike
+        # doesn't surface as an immediate "database is locked".
+        if journal_mode == "wal":
+            self._conn.execute("PRAGMA synchronous=NORMAL")
+        else:
+            self._conn.execute("PRAGMA synchronous=FULL")
+        self._conn.execute("PRAGMA busy_timeout=2000")
         self._conn.execute("PRAGMA foreign_keys=ON")
         self._init_schema()
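
The helper `apply_wal_with_fallback` is not part of this diff, so its behaviour can only be inferred from how the hunk uses its return value. The following is a minimal sketch, assuming the helper requests WAL, falls back to DELETE when WAL is not granted, and returns the journal mode in effect. `apply_wal_with_fallback_sketch`, `open_hardened`, and `demo` are hypothetical names for illustration, not the project's code.

```python
import os
import sqlite3
import tempfile


def apply_wal_with_fallback_sketch(conn: sqlite3.Connection) -> str:
    """Assumed behaviour: try WAL, fall back to DELETE, return the active mode."""
    try:
        # PRAGMA journal_mode returns the mode actually in effect, which can
        # differ from the requested one when WAL is unavailable.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0].lower()
    except sqlite3.OperationalError:
        mode = ""
    if mode != "wal":
        mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0].lower()
    return mode


def open_hardened(path: str) -> sqlite3.Connection:
    """Open a connection and pin the pragmas in the same order as the hunk."""
    conn = sqlite3.connect(path, isolation_level=None)
    mode = apply_wal_with_fallback_sketch(conn)
    if mode == "wal":
        conn.execute("PRAGMA synchronous=NORMAL")
    else:
        conn.execute("PRAGMA synchronous=FULL")
    conn.execute("PRAGMA busy_timeout=2000")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


def demo() -> dict:
    """Read the pragmas back from a throwaway database."""
    names = ("journal_mode", "synchronous", "busy_timeout", "foreign_keys")
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_hardened(os.path.join(tmp, "state.db"))
        try:
            return {n: conn.execute(f"PRAGMA {n}").fetchone()[0] for n in names}
        finally:
            conn.close()
```

Reading the pragmas back, as `demo` does, is the same check the new tests perform. `synchronous` is reported as an integer (0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA).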

@@ -0,0 +1,58 @@
"""NS-506: session DB durability pragmas.

On memory/disk-constrained hosts an interrupted write (OOM kill, SIGTERM
mid-write, full disk) can tear the SQLite session DB. These tests pin the
durability settings the connection must apply so a regression that drops
them is caught.
"""

import sqlite3

import pytest

from hermes_state import SessionDB


@pytest.fixture
def db(tmp_path):
    database = SessionDB(tmp_path / "state.db")
    try:
        yield database
    finally:
        database.close()


def _pragma(conn: sqlite3.Connection, name: str):
    return conn.execute(f"PRAGMA {name}").fetchone()[0]


def test_wal_mode_uses_synchronous_normal(db):
    """On a normal local filesystem the DB runs WAL + synchronous=NORMAL.

    NORMAL is the SQLite-recommended WAL setting: crash-safe against OS
    crash / power loss / process kill (the DB file is never corrupted, only
    the last un-checkpointed txn can be lost), without the per-write fsync
    cost of FULL.
    """
    conn = db._conn
    assert _pragma(conn, "journal_mode").lower() == "wal"
    # 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA
    assert _pragma(conn, "synchronous") == 1


def test_busy_timeout_is_set(db):
    """An explicit busy_timeout keeps a checkpoint/contention spike from
    surfacing as an immediate 'database is locked'."""
    assert _pragma(db._conn, "busy_timeout") == 2000


def test_synchronous_is_never_off(db):
    """The corruption-prone setting (synchronous=OFF) must never be in
    effect for the session store, regardless of journal mode."""
    assert _pragma(db._conn, "synchronous") >= 1


def test_foreign_keys_still_enabled(db):
    """Regression guard: the durability pragmas must not displace the
    existing foreign_keys=ON."""
    assert _pragma(db._conn, "foreign_keys") == 1
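
The commit message also describes an end-to-end check (SIGKILL a child mid-write, reopen, run `PRAGMA integrity_check`) that does not appear in this diff. The following is a minimal sketch of how such a check could look, assuming plain `sqlite3` with WAL and `synchronous=NORMAL` rather than `SessionDB`. `CHILD`, `kill_mid_write`, and the table `t` are hypothetical names, and this is not the author's test.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile

# Child process: commit one row per statement and report each commit on stdout.
CHILD = r"""
import sqlite3, sys
conn = sqlite3.connect(sys.argv[1], isolation_level=None)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload TEXT)")
i = 0
while True:
    conn.execute("INSERT INTO t (payload) VALUES (?)", ("x" * 1024,))
    i += 1
    print(i, flush=True)
"""


def kill_mid_write(min_rows: int = 50) -> tuple:
    """Kill a writer mid-loop, then reopen and verify the database.

    Returns (integrity_check result, rows acknowledged before the kill,
    rows found after reopening).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.db")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD, path],
            stdout=subprocess.PIPE,
            text=True,
        )
        acked = 0
        try:
            # Wait until the child has acknowledged at least min_rows commits.
            while acked < min_rows:
                line = child.stdout.readline()
                if not line:
                    raise RuntimeError("child exited before committing enough rows")
                acked = int(line)
        finally:
            child.kill()  # SIGKILL on POSIX; the child keeps writing until this lands
            child.wait()
            child.stdout.close()

        conn = sqlite3.connect(path)
        try:
            status = conn.execute("PRAGMA integrity_check").fetchone()[0]
            rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        finally:
            conn.close()
        return status, acked, rows
```

Because the child keeps inserting until the signal arrives, the row count after reopening can exceed the acknowledged count, so a check built on this sketch should assert `rows >= acked` rather than equality.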