Cascade Runbook
June 7, 2026
The cascade daemon keeps LanceDB in sync with the markdown files under the memory root. Service / entry points only ever write markdown; the daemon is the sole writer of the LanceDB index. This runbook covers the recurring operational questions.
What runs where
When everos server start boots, the FastAPI lifespan wires four
providers in order:
- Metrics: Prometheus collector.
- SQLite: system DB + schema (`SQLModel.metadata.create_all`).
- LanceDB: async connection + schema verification + FTS indexes.
- Cascade: watcher + scanner + worker, all in-process tasks.
The cascade subsystem itself is three independent loops:
| Loop | Source signal | Effect |
|---|---|---|
| Watcher | watchdog filesystem events (sync thread) | md_change_state.upsert per registered kind |
| Scanner | Periodic walk (scan_interval_seconds, default 30 s) | Same — catches changes the watcher missed |
| Worker | claim_pending_batch polling (default 1 s when idle) | Handler dispatch → LanceDB upsert / delete |
Every loop talks to the same md_change_state SQLite table. The
worker's claim mode (pending → processing → done/failed) keeps
concurrent workers honest.
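The claim step can be sketched as follows. This is a minimal illustration, not the project's implementation: the three-column schema is a hypothetical simplification, and the function body only assumes that a claim is a guarded `pending → processing` update.

```python
import sqlite3

# Hypothetical, simplified schema; the real md_change_state table has more columns.
SCHEMA = """
CREATE TABLE md_change_state (
    path   TEXT PRIMARY KEY,
    lsn    INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
);
"""


def claim_pending_batch(conn: sqlite3.Connection, batch_size: int) -> list[str]:
    """Flip up to batch_size pending rows to processing and return their paths."""
    claimed = []
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            "SELECT path FROM md_change_state "
            "WHERE status = 'pending' ORDER BY lsn LIMIT ?",
            (batch_size,),
        ).fetchall()
        for (path,) in rows:
            # The status guard makes the flip a no-op if another claimer won the row.
            cur = conn.execute(
                "UPDATE md_change_state SET status = 'processing' "
                "WHERE path = ? AND status = 'pending'",
                (path,),
            )
            if cur.rowcount == 1:
                claimed.append(path)
    return claimed
```

Because the update is conditional on `status = 'pending'`, a row that was already claimed is skipped rather than handed out twice.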
Health: everos cascade status
queue:
pending: 3
done: 1247
failed (retryable=TRUE): 1 (eligible for `cascade fix --apply`)
failed (retryable=FALSE): 1 (fix md and re-save to recover)
lsn:
max: 1252
last_processed: 1250
lag: 2
- `lag > 0` means the worker is behind. Steady state should hover near zero; sustained lag points at a slow handler or a stuck retry.
- `failed (retryable=FALSE)` is always user-actionable. Cascade will never auto-clear these; they represent malformed md the user must edit.
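For alerting, "sustained lag" can be distinguished from a momentary blip by keeping a short window of samples. The sketch below is an assumption about how one might monitor it, not part of the CLI; `LagMonitor` and its window size are hypothetical names.

```python
from collections import deque


class LagMonitor:
    """Track recent lag samples and flag lag that never returns to zero."""

    def __init__(self, window: int = 5) -> None:
        self.samples = deque(maxlen=window)

    def observe(self, max_lsn: int, last_processed: int) -> int:
        # Same arithmetic as the status output: lag = max - last_processed.
        lag = max_lsn - last_processed
        self.samples.append(lag)
        return lag

    def sustained(self) -> bool:
        # Report only once the window is full and every sample was behind.
        full = len(self.samples) == self.samples.maxlen
        return full and all(sample > 0 for sample in self.samples)
```

Feeding it the values from the sample output (`max: 1252`, `last_processed: 1250`) yields a lag of 2, which only counts as sustained if the following samples stay above zero as well.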
Recovering from failures: everos cascade fix
cascade fix (no flag) lists every failed row. With --apply:
- `UPDATE md_change_state SET status='pending', retry_count=0 WHERE status='failed' AND retryable=TRUE` (the partial index `idx_md_change_retryable` makes this O(retryable)).
- Drain the worker once so the retry runs synchronously.
Retryable failures cover transient embedding / HTTP errors (5xx, 429,
network resets) after the inline MAX_RETRY=3 was exhausted. The
fix command resets the counter so a working backend gets a clean
start.
retryable=FALSE rows require the user to edit the md (typically a
YAML frontmatter issue) and re-save; the watcher picks the change up
naturally.
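The re-queue step and its partial index can be illustrated with plain `sqlite3`. The schema here is a hypothetical minimum built around the columns named in the UPDATE, and the index definition is an assumption about what a partial index of that name could look like.

```python
import sqlite3

# Hypothetical minimal schema; only the columns named in the UPDATE are modelled.
SCHEMA = """
CREATE TABLE md_change_state (
    path        TEXT PRIMARY KEY,
    status      TEXT NOT NULL,
    retryable   INTEGER,
    retry_count INTEGER NOT NULL DEFAULT 0
);
-- Partial index: only failed-and-retryable rows are indexed, so the
-- re-queue scan touches those rows rather than the whole table.
CREATE INDEX idx_md_change_retryable
    ON md_change_state (status)
    WHERE status = 'failed' AND retryable = 1;
"""


def requeue_retryable(conn: sqlite3.Connection) -> int:
    """Reset retryable failures to pending and return how many rows moved."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending', retry_count = 0 "
            "WHERE status = 'failed' AND retryable = 1"
        )
    return cur.rowcount
```

Rows with `retryable` set to false are left untouched, which matches the rule that they only recover after the markdown is edited and re-saved.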
One-shot replay: everos cascade sync [PATH]
Use this when the watcher missed an event (WSL mount, network share, external editor with no inotify) or when you want a deterministic flush before, say, a smoke test:
everos cascade sync # drain everything pending
everos cascade sync users/u1/episodes/X.md # re-enqueue + drain
The CLI builds the same CascadeOrchestrator as the daemon but only
calls sync_once / drain_once — no watcher / scanner background
task. So it's safe to run in parallel with a live everos server.
Recovery paths
LanceDB schema drift on startup
LanceDBLifespanProvider.startup calls verify_business_schemas. If
an on-disk table has columns the current Pydantic schema does not
declare (or vice versa), the boot fails with:
LanceDB table 'episode' schema drift: missing=[...], extra=[...].
The index is rebuildable from md — recover with
`rm -rf ~/.everos/.index/lancedb` and restart.
This is the documented recovery: delete the index, restart the server, the scanner will pick up every md file on its first sweep and the worker repopulates LanceDB. Markdown is the source of truth, so no data is lost.
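The drift check itself amounts to a two-way set difference over column names. The sketch below assumes `missing` means "declared in the schema but absent on disk" and `extra` means the reverse; the helper names are hypothetical and the real `verify_business_schemas` may compare more than names.

```python
def diff_schema(declared: set[str], on_disk: set[str]) -> tuple[list[str], list[str]]:
    """Return (missing, extra) column names, each sorted for stable messages."""
    missing = sorted(declared - on_disk)  # assumed: declared but not on disk
    extra = sorted(on_disk - declared)    # assumed: on disk but not declared
    return missing, extra


def check_drift(table: str, declared: set[str], on_disk: set[str]) -> None:
    """Raise if the on-disk columns differ from the declared ones."""
    missing, extra = diff_schema(declared, on_disk)
    if missing or extra:
        raise RuntimeError(
            f"LanceDB table {table!r} schema drift: missing={missing}, extra={extra}."
        )
```

Either direction of mismatch fails the check, which is why both adding and removing a column calls for rebuilding the index.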
inotify watch-limit exhaustion (Linux)
The traditional kernel default is 8192 watches per user (newer kernels scale the default with available memory). On a sizeable memory root the watcher may silently miss events. Symptoms:
- Scanner catches the file changes but the watcher never logs an event for the same path.
- `cat /proc/sys/fs/inotify/max_user_watches` is at the limit.
Fix by bumping the kernel parameter:
echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
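To judge whether a memory root is near the limit, compare its directory count with the kernel value. inotify is not recursive, so a recursive watcher needs one watch per directory; the helper names below are hypothetical, and other processes owned by the same user draw from the same budget.

```python
import os
from pathlib import Path
from typing import Optional

LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def count_directories(root: str) -> int:
    """Count root plus every directory beneath it (one inotify watch each)."""
    return sum(1 for _ in os.walk(root))


def read_watch_limit(limit_file: Path = LIMIT_FILE) -> Optional[int]:
    """Return the per-user watch limit, or None where the file is unavailable."""
    try:
        return int(limit_file.read_text().strip())
    except (OSError, ValueError):
        return None
```

If `count_directories(memory_root)` is a large fraction of `read_watch_limit()`, raising the limit as shown above is the safer choice.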
WSL2 / network mounts
Filesystem events do not propagate from the Windows host into WSL2 (or across most SMB / NFS shares). The watcher will start without error and silently see nothing.
Workarounds:
- Rely on the scanner — at default 30 s interval, throughput is bounded but eventually-consistent.
- Drop the scan interval to ~5 s if the memory root is small.
- Run `everos cascade sync` explicitly after batch edits.
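The reason the scanner still works on these mounts is that polling does not depend on filesystem events. A minimal sketch of one sweep, assuming change detection by modification time (the real scanner may use different signals, and `scan_once` is a hypothetical name):

```python
import os


def scan_once(root: str, seen: dict[str, int]) -> list[str]:
    """Return .md paths added, modified, or deleted since the previous sweep.

    `seen` maps path -> mtime in nanoseconds and is updated in place.
    """
    current: dict[str, int] = {}
    changed: list[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".md"):
                continue
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime_ns
            except FileNotFoundError:
                continue  # removed between listing and stat
            current[path] = mtime
            if seen.get(path) != mtime:
                changed.append(path)
    deleted = [path for path in seen if path not in current]
    seen.clear()
    seen.update(current)
    return sorted(changed + deleted)
```

Each sweep costs a full directory walk, which is why a shorter interval only suits a small memory root.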
Daemon process crash mid-batch
claim_pending_batch flips rows to processing atomically. If the
process dies before mark_done / mark_failed, those rows stay in
processing until the next boot. The orchestrator auto-recovers
on startup: CascadeOrchestrator.start calls
md_change_state_repo.recover_orphan_processing() before launching
the watcher / scanner / worker, which resets every processing row
back to pending. Single-process cascade means no race — at boot
time no other worker could legitimately own a processing row.
No operator action required; the structured log line
cascade_recovered_orphan_processing reports the count when it
fires.
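The recovery step reduces to a single statement. The sketch below assumes a `status` column as described above and is not the repository's actual method body.

```python
import sqlite3


def recover_orphan_processing(conn: sqlite3.Connection) -> int:
    """Reset rows stranded in processing back to pending; return the count."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending' "
            "WHERE status = 'processing'"
        )
    return cur.rowcount
```

Running it before any loop starts is what makes it safe: at that point nothing else can legitimately hold a `processing` row.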
FD exhaustion (os error 24 / EMFILE)
Symptoms (any of these on a long-running daemon):
- LanceDB query / index build fails with `lance error: ... Too many open files (os error 24)`.
- `lsof -p <pid> | wc -l` grows monotonically over hours / days.
- Health log lines like `cascade_lancedb_optimize_failed` / `cascade_lancedb_rebuild_failed` carrying `OSError: [Errno 24]`.
Cause (verified against lance crate 4.0): the LanceDB index cache
(GlobalIndexCache) holds one reader object per opened FTS / vector
/ scalar index, and each reader pins the file descriptors of its
_indices/<uuid>/... files. With a long-running daemon and steady-
state cascade ingest, every optimize() call adds new readers; with
LanceDB's own default (index_cache_size_bytes=None, unbounded), they
are never evicted and the FDs leak monotonically.
drop_index does not help — it is a manifest-only operation and
leaves the on-disk UUID directories untouched. Even an explicit
optimize(cleanup_older_than=0) unlink()-ing the files does not
release FDs: POSIX keeps the inode alive as long as a process holds
an open FD on it (the entries show as (deleted) in lsof). Only an
LRU eviction inside the cache (or a connection close) actually closes
the FDs.
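The POSIX behaviour described above is easy to reproduce with the standard library alone. This is a generic demonstration on a temporary file, not LanceDB code, and it applies to POSIX systems only.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> tuple[bytes, bool]:
    """Write payload, unlink the file, then read it back through the open FD.

    Returns (data read after unlink, whether the path still exists).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                     # the directory entry is gone
        still_listed = os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))    # the inode is still alive
        return data, still_listed
    finally:
        os.close(fd)  # only now can the kernel reclaim the inode
```

The file's contents remain readable after `unlink` because the open descriptor keeps the inode alive, which is the same reason deleting index files on disk does not lower the daemon's FD count.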
Fix (already wired in LanceDBSettings.index_cache_size_bytes —
default 16 MB, ~290 FD ceiling): see
Tuning knobs § LanceDB index cache
for the sizing table and the env-var override path.
If you have already hit EMFILE in a running process, the cleanest recovery is a daemon restart — the open connection closes, every FD is released, and the next start comes up with the capped Session in place.
Tuning knobs
Cascade scheduler knobs
All defaults live in everos.memory.cascade.orchestrator.CascadeConfig
and everos.memory.cascade.worker.CascadeWorker:
| Knob | Default | Effect |
|---|---|---|
| `scan_interval_seconds` | 30 | Scanner sweep cadence |
| `worker_batch_size` | 50 | Rows claimed per worker cycle |
| `worker_max_retry` | 3 | Inline retries before `mark_failed(retryable=TRUE)` |
| `worker_poll_interval_seconds` | 1 | Idle wait between empty drain attempts |
| `worker_retry_backoff_seconds` | 2 | Backoff seed; doubles per attempt |
Tuning surface is intentionally not in Settings yet — once we have
wall-clock numbers from real workloads, the values that need
operator override will surface there.
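To see what the retry defaults imply for wall-clock time, the sketch below lists the waits under the assumption that the first retry waits the seed and each later retry doubles it. The exact schedule in `CascadeWorker` may differ; `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(seed_seconds: float = 2.0, max_retry: int = 3) -> list[float]:
    """Delays before each inline retry, doubling from the seed (assumed schedule)."""
    return [seed_seconds * 2 ** attempt for attempt in range(max_retry)]
```

Under that assumption the defaults give waits of 2 s, 4 s, and 8 s, so a row spends roughly 14 s in backoff before it is marked failed.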
LanceDB index cache (index_cache_size_bytes)
Lives in LanceDBSettings; overridable via the
EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES environment variable. This
is the only knob that bounds the steady-state file-descriptor count
of a long-running EverOS daemon — see
Recovery paths § FD exhaustion
for why nothing else (prune, rebuild, drop_index) helps.
Measured cap → FD ceiling (30 add+optimize cycles + 100-query stress
on the real Episode schema):
| Cap | FD ceiling | Query latency (p50) | Safe under ulimit -n |
|---|---|---|---|
| `2 MB` | ~45 | ~5 ms | macOS default 256 (5× headroom) |
| `4 MB` | ~52 | ~3 ms | macOS default 256 |
| `8 MB` | ~140 | ~2.4 ms | macOS default 256 (1.8× headroom) |
| `16 MB` (default) | ~290 | ~2.3 ms | Linux default 1024 (3.5× headroom); macOS needs `ulimit -n 1024` |
| `32 MB` | ~630 | ~1.4 ms | Linux default 1024 (1.6× headroom) |
| unbounded | grows forever | ~1.3 ms | NEVER use in a daemon |
EverOS's measured steady-state working set after a `rebuild_indexes` cycle is roughly **50-100 readers / 3-6 MB resident** (5 tables × ~7 BM25 columns × ~10 `part_N` reader entries each), so the 16 MB default
provides ~3× headroom for burst traffic and stale-but-not-yet-evicted
readers.
When to override:
- Tight `ulimit -n` environments (containers; macOS dev boxes that haven't bumped the default 256) → drop to `4 MB` or `8 MB`. Query latency increases by ~1-3 ms but correctness is unaffected.
- Larger working sets (many more tables or much wider FTS indexes than the default schema set) → bump to `32-64 MB`. Verify your platform's `ulimit -n` covers the corresponding FD ceiling with at least 2× headroom.
- Diagnostic-only: set to a tiny value (e.g. `1 MB`) to force LRU thrashing and reproduce cache-miss latency in tests.
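The 2× headroom rule can be turned into a quick check against the current process limit. The sketch below hard-codes the measured ceilings from the table above and reads the soft limit with the standard `resource` module (Unix only); the helper names are hypothetical.

```python
import resource
from typing import Optional

# (cap in MB, measured FD ceiling), copied from the sizing table.
FD_CEILING_BY_CAP_MB = [(2, 45), (4, 52), (8, 140), (16, 290), (32, 630)]


def largest_safe_cap_mb(soft_limit: int, headroom: float = 2.0) -> Optional[int]:
    """Largest measured cap whose FD ceiling fits under soft_limit with headroom."""
    if soft_limit == resource.RLIM_INFINITY:
        return FD_CEILING_BY_CAP_MB[-1][0]
    safe = [cap for cap, fds in FD_CEILING_BY_CAP_MB if fds * headroom <= soft_limit]
    return max(safe) if safe else None


def current_soft_limit() -> int:
    """Soft RLIMIT_NOFILE of this process, i.e. what `ulimit -n` reports."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

With a soft limit of 256 this selects 4 MB, and with 1024 it selects 16 MB, which lines up with the default. The chosen value would then be passed through `EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES` as a byte count.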
Do not set metadata_cache_size_bytes — it is intentionally left
at LanceDB's default (unbounded) because the metadata cache holds
parsed manifests / fragment stats and has zero effect on FD count;
capping it just thrashes parsing work without solving anything.
Concurrency
The worker is async, not multi-process. Inside one drain cycle,
asyncio.gather(*[_process_one(row) for row in batch]) runs every
claimed row concurrently — cascade is IO-bound (embedding HTTP calls
dominate wall time) so single-process coroutine concurrency saturates
the bottleneck. The worker_batch_size knob (default 50) caps
in-flight rows.
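The drain pattern can be sketched with a stand-in handler. `process_one` below is a hypothetical placeholder for the real embedding call and LanceDB upsert, and the use of `return_exceptions=True` is an assumption made here so one bad row does not cancel the rest of the batch.

```python
import asyncio


async def process_one(row: str) -> str:
    """Stand-in for the real handler (embedding HTTP call + LanceDB upsert)."""
    await asyncio.sleep(0.01)  # simulated IO wait
    if row.endswith(".bad"):
        raise ValueError(f"cannot parse {row}")
    return row


async def drain_batch(batch: list[str]) -> tuple[list[str], list[str]]:
    """Run every claimed row concurrently and split the outcome into done/failed."""
    results = await asyncio.gather(
        *(process_one(row) for row in batch),
        return_exceptions=True,  # collect failures instead of aborting the batch
    )
    done = [row for row, res in zip(batch, results) if not isinstance(res, BaseException)]
    failed = [row for row, res in zip(batch, results) if isinstance(res, BaseException)]
    return done, failed
```

Since all rows wait on IO at the same time, a batch takes about as long as its slowest row rather than the sum of all rows.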
Multi-process workers are a scaling axis we'd reach for only if a
single process becomes CPU-bound, which the current design does not
anticipate. claim_pending_batch is already race-safe (the
WHERE status='pending' filter ensures each row lands in exactly
one batch even if multiple workers raced), so adding processes later
is a deployment-side change with no schema work.
What cascade does NOT do (yet)
- Schema migration: LanceDB column changes require `rm -rf`.
- Parent-id back-link: Episode rows currently carry `parent_id=None`; the writer doesn't preserve the source memcell id in the entry inline. Tracked separately.
- Reference-file change detection (agent_skill): edits to `references/*.md` siblings won't trigger a re-index; only changes to `SKILL.md` itself fire the watcher. Workaround: run `everos cascade sync agents/<a>/skills/skill_<n>/SKILL.md` after editing references.