Cascade Runbook

June 7, 2026

The cascade daemon keeps LanceDB in sync with the markdown files under the memory root. Service / entry points only ever write markdown; the daemon is the sole writer of the LanceDB index. This runbook covers the recurring operational questions.

What runs where

When everos server start boots, the FastAPI lifespan wires four providers in order:

Metrics — Prometheus collector.
SQLite — system DB + schema (SQLModel.metadata.create_all).
LanceDB — async connection + schema verification + FTS indexes.
Cascade — watcher + scanner + worker, all in-process tasks.

The cascade subsystem itself is three independent loops:

Loop	Source signal	Effect
Watcher	`watchdog` filesystem events (sync thread)	`md_change_state.upsert` per registered kind
Scanner	Periodic walk (`scan_interval_seconds`, default 30 s)	Same — catches changes the watcher missed
Worker	`claim_pending_batch` polling (default 1 s when idle)	Handler dispatch → LanceDB upsert / delete

Every loop talks to the same md_change_state SQLite table. The worker's claim mode (pending → processing → done/failed) keeps concurrent workers honest.
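The claim cycle can be sketched with the standard-library sqlite3 module. This is a minimal illustration, not the actual repository code: the two-column table is a hypothetical simplification of md_change_state, and the bodies of claim_pending_batch and mark_done are assumptions about how such a claim could work.

```python
import sqlite3

# Hypothetical, simplified stand-in for md_change_state (the real table
# carries more columns, e.g. retry bookkeeping).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE md_change_state ("
    " path TEXT PRIMARY KEY,"
    " status TEXT NOT NULL DEFAULT 'pending')"
)
conn.executemany(
    "INSERT INTO md_change_state (path) VALUES (?)",
    [("a.md",), ("b.md",), ("c.md",)],
)
conn.commit()


def claim_pending_batch(conn, limit):
    """Flip up to `limit` pending rows to processing; return the claimed paths."""
    claimed = []
    with conn:  # commits on success, rolls back on error
        candidates = conn.execute(
            "SELECT path FROM md_change_state "
            "WHERE status = 'pending' ORDER BY path LIMIT ?",
            (limit,),
        ).fetchall()
        for (path,) in candidates:
            # The status guard means a row already claimed elsewhere is skipped.
            cur = conn.execute(
                "UPDATE md_change_state SET status = 'processing' "
                "WHERE path = ? AND status = 'pending'",
                (path,),
            )
            if cur.rowcount == 1:
                claimed.append(path)
    return claimed


def mark_done(conn, path):
    with conn:
        conn.execute(
            "UPDATE md_change_state SET status = 'done' WHERE path = ?", (path,)
        )


first = claim_pending_batch(conn, 2)
second = claim_pending_batch(conn, 2)
for path in first:
    mark_done(conn, path)
```

Because the second call only sees rows still in pending, the two batches never overlap.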

Health: `everos cascade status`

queue:
  pending:                   3
  done:                      1247
  failed (retryable=TRUE):   1     (eligible for `cascade fix --apply`)
  failed (retryable=FALSE):  1     (fix md and re-save to recover)
lsn:
  max:           1252
  last_processed: 1250
  lag:            2

lag > 0 means the worker is behind. Steady state should hover near zero; sustained lag points at a slow handler or a stuck retry.
failed (retryable=FALSE) is always user-actionable. Cascade will never auto-clear these — they represent malformed md the user must edit.
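As a rough illustration of how the lag figure might feed a monitoring check, the sketch below computes lag from the two lsn values and flags it only when it persists across consecutive samples. The helper names and the three-sample threshold are hypothetical and not part of everos.

```python
def lsn_lag(max_lsn: int, last_processed: int) -> int:
    """Lag as reported by `everos cascade status`: max minus last_processed."""
    return max(0, max_lsn - last_processed)


def is_sustained(lag_samples, min_consecutive=3):
    """True when the most recent `min_consecutive` samples are all non-zero.

    A single non-zero reading is normal while a batch is in flight; only a
    run of them suggests a slow handler or a stuck retry.
    """
    if len(lag_samples) < min_consecutive:
        return False
    return all(sample > 0 for sample in lag_samples[-min_consecutive:])


current = lsn_lag(1252, 1250)          # values from the sample output above
transient = is_sustained([0, 2, 0, 1])
stuck = is_sustained([0, 2, 5, 9])
```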

Recovering from failures: `everos cascade fix`

cascade fix (no flag) lists every failed row. With --apply:

UPDATE md_change_state SET status='pending', retry_count=0 WHERE status='failed' AND retryable=TRUE (the partial index idx_md_change_retryable makes this O(retryable)).
Drain the worker once so the retry runs synchronously.
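The reset step can be reproduced against a scratch database. The sketch below is an assumption-laden illustration: the column list and the indexed column of idx_md_change_retryable are guesses, and retryable is stored as 0/1, which is how SQLite represents booleans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE md_change_state (
        path        TEXT PRIMARY KEY,
        status      TEXT NOT NULL,
        retryable   INTEGER,              -- 0/1 boolean
        retry_count INTEGER NOT NULL DEFAULT 0
    );
    -- Partial index: only failed + retryable rows are indexed, so the
    -- reset below touches an index sized by the retryable set.
    CREATE INDEX idx_md_change_retryable
        ON md_change_state (path)
        WHERE status = 'failed' AND retryable = 1;
    """
)
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?, ?, ?)",
    [
        ("transient.md", "failed", 1, 3),   # e.g. embedding backend returned 5xx
        ("malformed.md", "failed", 0, 0),   # bad frontmatter, user must edit
        ("fine.md", "done", None, 0),
    ],
)
conn.commit()

with conn:
    cur = conn.execute(
        "UPDATE md_change_state SET status = 'pending', retry_count = 0 "
        "WHERE status = 'failed' AND retryable = 1"
    )
reset_rows = cur.rowcount
```

Only the retryable row returns to pending; the non-retryable row stays failed until the markdown is edited.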

Retryable failures cover transient embedding / HTTP errors (5xx, 429, network resets) after the inline MAX_RETRY=3 was exhausted. The fix command resets the counter so a working backend gets a clean start.

retryable=FALSE rows require the user to edit the md (typically a YAML frontmatter issue) and re-save; the watcher picks the change up naturally.

One-shot replay: `everos cascade sync [PATH]`

Use this when the watcher missed an event (WSL mount, network share, external editor with no inotify) or when you want a deterministic flush before, say, a smoke test:

everos cascade sync                           # drain everything pending
everos cascade sync users/u1/episodes/X.md    # re-enqueue + drain

The CLI builds the same CascadeOrchestrator as the daemon but only calls sync_once / drain_once — no watcher / scanner background task. So it's safe to run in parallel with a live everos server.

Recovery paths

LanceDB schema drift on startup

LanceDBLifespanProvider.startup calls verify_business_schemas. If an on-disk table has columns the current Pydantic schema does not declare (or vice versa), the boot fails with:

LanceDB table 'episode' schema drift: missing=[...], extra=[...].
The index is rebuildable from md — recover with
`rm -rf ~/.everos/.index/lancedb` and restart.

This is the documented recovery: delete the index, restart the server, the scanner will pick up every md file on its first sweep and the worker repopulates LanceDB. Markdown is the source of truth, so no data is lost.

inotify watch-limit exhaustion (Linux)

Default kernel limit is 8 192 watches per user. On a sizeable memory root the watcher may silently miss events. Symptoms:

Scanner catches the file changes but the watcher never logs an event for the same path.
cat /proc/sys/fs/inotify/max_user_watches is at the limit.

Fix by bumping the kernel parameter:

echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
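To estimate whether a memory root is close to the limit, one option is to compare the directory count with the kernel value. The sketch below assumes the watcher registers roughly one inotify watch per directory, which is how recursive inotify watching generally works; the helper names are hypothetical.

```python
import os
from pathlib import Path

LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def read_watch_limit(limit_file=LIMIT_FILE):
    """Return the per-user watch limit, or None when it cannot be read (non-Linux)."""
    try:
        return int(Path(limit_file).read_text().strip())
    except (OSError, ValueError):
        return None


def count_directories(root):
    """Count `root` plus every directory below it (os.walk yields one entry each)."""
    return sum(1 for _ in os.walk(root))


def watch_headroom(root, limit_file=LIMIT_FILE):
    """Remaining watches if this root were the only consumer; None if unknown."""
    limit = read_watch_limit(limit_file)
    if limit is None:
        return None
    return limit - count_directories(root)
```

Other processes owned by the same user (editors, IDEs) also consume watches, so the real headroom is lower than this estimate.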

WSL2 / network mounts

Filesystem events do not propagate from the Windows host into WSL2 (or across most SMB / NFS shares). The watcher will start without error and silently see nothing.

Workarounds:

Rely on the scanner: at the default 30 s interval a change can take up to one sweep to be noticed, but the index stays eventually consistent.
Drop the scan interval to ~5 s if the memory root is small.
Run everos cascade sync explicitly after batch edits.

Daemon process crash mid-batch

claim_pending_batch flips rows to processing atomically. If the process dies before mark_done / mark_failed, those rows stay in processing until the next boot. The orchestrator auto-recovers on startup: CascadeOrchestrator.start calls md_change_state_repo.recover_orphan_processing() before launching the watcher / scanner / worker, which resets every processing row back to pending. Single-process cascade means no race — at boot time no other worker could legitimately own a processing row.

No operator action required; the structured log line cascade_recovered_orphan_processing reports the count when it fires.
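The startup recovery amounts to a single statement. The sketch below illustrates it against a hypothetical two-column version of md_change_state; the function body is an assumption about what recover_orphan_processing does, based on the description above.

```python
import sqlite3


def recover_orphan_processing(conn) -> int:
    """Reset every processing row to pending; return how many were reset."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending' "
            "WHERE status = 'processing'"
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE md_change_state (path TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?)",
    [("a.md", "processing"), ("b.md", "processing"), ("c.md", "done")],
)
conn.commit()

recovered = recover_orphan_processing(conn)  # the count the log line would report
```

This is safe only because it runs before any worker starts, so no live worker can own a processing row.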

FD exhaustion (`os error 24` / EMFILE)

Symptoms (any of these on a long-running daemon):

LanceDB query / index build fails with lance error: ... Too many open files (os error 24).
lsof -p <pid> | wc -l grows monotonically over hours / days.
Health log lines like cascade_lancedb_optimize_failed / cascade_lancedb_rebuild_failed carrying OSError: [Errno 24].
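For a lighter-weight check than lsof, the open-FD count can be sampled from /proc on Linux. The sketch below is a hypothetical monitoring helper, not part of everos; it returns None on platforms without /proc.

```python
import os


def count_open_fds(pid):
    """Number of entries in /proc/<pid>/fd, or None if unavailable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def grows_monotonically(samples):
    """True when every sample is strictly larger than the one before it."""
    return len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )


leaking = grows_monotonically([210, 260, 340, 415])
healthy = grows_monotonically([210, 260, 240, 250])
```

Sampling every few minutes and applying grows_monotonically to the last several readings separates a leak from normal fluctuation.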

Cause (verified against lance crate 4.0): the LanceDB index cache (GlobalIndexCache) holds one reader object per opened FTS / vector / scalar index, and each reader pins the file descriptors of its _indices/<uuid>/... files. With a long-running daemon and steady-state cascade ingest, every optimize() call adds new readers; with LanceDB's own default (index_cache_size_bytes=None, unbounded), they are never evicted and the FDs leak monotonically.

drop_index does not help — it is a manifest-only operation and leaves the on-disk UUID directories untouched. Even an explicit optimize(cleanup_older_than=0) unlink()-ing the files does not release FDs: POSIX keeps the inode alive as long as a process holds an open FD on it (the entries show as (deleted) in lsof). Only an LRU eviction inside the cache (or a connection close) actually closes the FDs.
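The POSIX behaviour described above is easy to reproduce in isolation. The sketch below uses only the standard library and a temporary file, not LanceDB: the file is unlinked while a descriptor is still open, and the data remains readable until the descriptor is closed.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"index bytes")

os.unlink(path)                          # the directory entry is removed...
exists_after_unlink = os.path.exists(path)

os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 100)                  # ...but the open FD still reads the inode

os.close(fd)                             # only now can the kernel free the inode
```

Between unlink and close, lsof would show this file marked (deleted), which is the same state the cached index readers are in.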

Fix (already wired in LanceDBSettings.index_cache_size_bytes — default 16 MB, ~290 FD ceiling): see Tuning knobs § LanceDB index cache for the sizing table and the env-var override path.

If you have already hit EMFILE in a running process, the cleanest recovery is a daemon restart — the open connection closes, every FD is released, and the next start comes up with the capped Session in place.

Tuning knobs

Cascade scheduler knobs

All defaults live in everos.memory.cascade.orchestrator.CascadeConfig and everos.memory.cascade.worker.CascadeWorker:

Knob	Default	Effect
`scan_interval_seconds`	30	Scanner sweep cadence
`worker_batch_size`	50	Rows claimed per worker cycle
`worker_max_retry`	3	Inline retries before `mark_failed(retryable=TRUE)`
`worker_poll_interval_seconds`	1	Idle wait between empty drain attempts
`worker_retry_backoff_seconds`	2	Backoff seed; doubles per attempt (exponential)
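With the defaults in the table, the retry delays can be worked out as follows. This sketch assumes the seed is the wait before the first retry and that each later retry doubles it; the table does not say whether a wait follows the final attempt, so the function is an illustration, not the worker's actual logic.

```python
def backoff_schedule(seed_seconds=2.0, max_retry=3):
    """Delays before each retry: the seed, doubled per attempt."""
    return [seed_seconds * 2 ** attempt for attempt in range(max_retry)]


delays = backoff_schedule()        # defaults: seed 2 s, 3 retries
worst_case_wait = sum(delays)      # total time spent waiting before mark_failed
```

Under these assumptions, a row spends at most about 14 s in backoff before it is marked failed.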

Tuning surface is intentionally not in Settings yet — once we have wall-clock numbers from real workloads, the values that need operator override will surface there.

LanceDB index cache (`index_cache_size_bytes`)

Lives in LanceDBSettings; overridable via the EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES environment variable. This is the only knob that bounds the steady-state file-descriptor count of a long-running EverOS daemon — see Recovery paths § FD exhaustion for why nothing else (prune, rebuild, drop_index) helps.

Measured cap → FD ceiling (30 add+optimize cycles + 100-query stress on the real Episode schema):

Cap	FD ceiling	Query latency (p50)	Safe under `ulimit -n`
`2 MB`	~45	~5 ms	macOS default 256 (5× headroom)
`4 MB`	~52	~3 ms	macOS default 256
`8 MB`	~140	~2.4 ms	macOS default 256 (1.8× headroom)
`16 MB` (default)	~290	~2.3 ms	Linux default 1024 (3.5× headroom); macOS needs `ulimit -n 1024`
`32 MB`	~630	~1.4 ms	Linux default 1024 (1.6× headroom)
`unbounded`	grows forever	~1.3 ms	NEVER use in a daemon

EverOS's measured steady-state working set after a rebuild_indexes cycle is roughly **50-100 readers / 3-6 MB resident** (5 tables × ~7 BM25 columns × ~10 part_N reader entries each), so the 16 MB default provides ~3× headroom for burst traffic and stale-but-not-yet-evicted readers.

When to override:

Tight ulimit -n environments (containers; macOS dev boxes that haven't bumped the default 256) → drop to 4 MB or 8 MB. Query latency increases by ~1-3 ms but correctness is unaffected.
Larger working sets (many more tables or much wider FTS indexes than the default schema set) → bump to 32-64 MB. Verify your platform's ulimit -n covers the corresponding FD ceiling with at least 2× headroom.
Diagnostic-only: set to a tiny value (e.g. 1 MB) to force LRU thrashing and reproduce cache-miss latency in tests.
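The headroom rule can be checked against the current process limit. The sketch below hard-codes the measured FD ceilings from the sizing table and uses the standard-library resource module (Unix only); the helper names and the 2× default are illustrative assumptions.

```python
import resource

# Cap (MB) -> measured FD ceiling, copied from the sizing table.
FD_CEILING_BY_CAP_MB = {2: 45, 4: 52, 8: 140, 16: 290, 32: 630}


def current_soft_limit():
    """Soft `ulimit -n` of this process (may be resource.RLIM_INFINITY)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def headroom(cap_mb, soft_limit):
    """How many times the FD ceiling for `cap_mb` fits into the limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return float("inf")
    return soft_limit / FD_CEILING_BY_CAP_MB[cap_mb]


def largest_safe_cap(soft_limit, min_headroom=2.0):
    """Largest measured cap that keeps at least `min_headroom`, or None."""
    safe = [
        cap for cap in FD_CEILING_BY_CAP_MB
        if headroom(cap, soft_limit) >= min_headroom
    ]
    return max(safe) if safe else None


macos_default = largest_safe_cap(256)
linux_default = largest_safe_cap(1024)
```

Calling largest_safe_cap(current_soft_limit()) gives a starting point for the environment the daemon actually runs in.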

Do not set metadata_cache_size_bytes — it is intentionally left at LanceDB's default (unbounded) because the metadata cache holds parsed manifests / fragment stats and has zero effect on FD count; capping it just thrashes parsing work without solving anything.

Concurrency

The worker is async, not multi-process. Inside one drain cycle, asyncio.gather(*[_process_one(row) for row in batch]) runs every claimed row concurrently — cascade is IO-bound (embedding HTTP calls dominate wall time) so single-process coroutine concurrency saturates the bottleneck. The worker_batch_size knob (default 50) caps in-flight rows.
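The effect of the gather pattern can be seen in a small stand-alone sketch. Here _process_one is a stand-in that sleeps instead of calling an embedding backend, and drain_once is a simplified assumption of what one drain cycle looks like, not the worker's real code.

```python
import asyncio
import time


async def _process_one(row):
    # Stand-in for the IO-bound embedding HTTP call.
    await asyncio.sleep(0.05)
    return row


async def drain_once(batch):
    # Every claimed row runs concurrently; results keep the batch order.
    return await asyncio.gather(*[_process_one(row) for row in batch])


batch = list(range(50))            # worker_batch_size default
start = time.perf_counter()
results = asyncio.run(drain_once(batch))
elapsed = time.perf_counter() - start
```

Processed one after another, 50 rows at 50 ms each would take about 2.5 s; run concurrently, the batch finishes in roughly the time of a single call.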

Multi-process workers are a scaling axis we'd reach for only if a single process becomes CPU-bound, which the current design does not anticipate. claim_pending_batch is already race-safe (the WHERE status='pending' filter ensures each row lands in exactly one batch even if multiple workers raced), so adding processes later is a deployment-side change with no schema work.

What cascade does NOT do (yet)

Schema migration: LanceDB column changes require deleting the index (rm -rf ~/.everos/.index/lancedb) and restarting so it is rebuilt from md.
Parent-id back-link: Episode rows currently carry parent_id=None; the writer doesn't preserve the source memcell id in the entry inline. Tracked separately.
Reference-file change detection (agent_skill): edits to references/*.md siblings won't trigger a re-index — only changes to SKILL.md itself fire the watcher. Workaround: run everos cascade sync agents/<a>/skills/skill_<n>/SKILL.md after editing references.