Hive: Lessons Learned
June 6, 2026
Architecture Decisions (Pre-Implementation)
2026-03-01: Claude Code as orchestrator, not a custom framework
- Context: Evaluated CrewAI, LangGraph, AutoGen as orchestration options.
- Decision: Use Claude Code natively as orchestrator. Extend with MCP servers only.
- Rationale: Custom orchestrators replace Claude Code (lose ecosystem improvements). MCP servers extend it (ride the wave). Only ~650 lines of custom Python to maintain.
- Trade-off: Limited to what Claude Code's Agent/MCP system can do. Acceptable — it already does parallel subagents, hooks, and skills.
2026-03-01: Python over Go for MCP servers
- Context: User prefers Go for performance. Evaluated both options.
- Decision: Python (FastMCP SDK).
- Rationale: MCP servers are I/O bound (file reads, API calls). Go's performance advantage is ~1ms per request on workloads that take seconds. FastMCP SDK is mature, reduces boilerplate to ~400 lines. Can rewrite in Go later (MCP is protocol-based).
2026-03-01: Separate project from dotfiles
- Context: Could live inside dotfiles or as standalone project.
- Decision: Standalone ~/Projects/hive, deployed by dotfiles setup scripts.
- Rationale: Own lifecycle, own dependencies (pyproject.toml), own tests. Shareable with team members who don't need personal dotfiles. Dotfiles references hive, doesn't contain it.
2026-03-01: pydantic-settings over manual env var resolution
- Context: config.py had 6 _resolve_*() functions + module-level constants (52 lines). No startup validation, harder to test.
- Decision: Refactor to pydantic-settings BaseSettings with HIVE_ prefix.
- Rationale: Standard 12-factor pattern. Automatic type coercion (str→Path, str→float). Startup validation catches bad config immediately. AliasChoices for backward-compatible OPENROUTER_API_KEY (no prefix). 52 → 25 lines, zero test file changes needed (DI already in place).
- Trade-off: New dependency (pydantic-settings). Acceptable: pydantic is already a transitive dep of FastMCP, so it adds ~0 weight.
2026-03-01: Squash merge for feature branches
- Context: Feature branches accumulate WIP/fix commits that pollute master history.
- Decision: Always squash merge PRs into master.
- Rationale: Each commit on master = 1 complete feature/fix. Clean git log, easy git revert. Branch history preserved in GitHub PR for forensics if needed.
Operational Lessons
2026-03-04: max_lines=100 was dangerously aggressive
- Context: Default max_lines was 100. Benchmarking suite revealed that real vault files (roadmap, tasks, lessons) are 125-173 lines. At max_lines=100, tools captured only 7-22% of actual content.
- Impact: session_briefing was silently truncating critical information. Users had no signal that they were missing content.
- Fix: Raised default to 500. At 500 lines, content capture is 98-100% for all existing vault files.
- Lesson: Never pick defaults by intuition. Benchmark against real data first.
2026-03-04: Benchmark-driven defaults over gut feeling
- Context: Both max_lines (100) and budget cap ($5/mo) were set by rough estimation during initial development.
- Decision: Built a benchmarking suite that measures content capture ratio, token savings, and latency across real vault files.
- Outcome: max_lines raised 5x (100 to 500), budget lowered 5x ($5 to $1). Both changes backed by data, not guesswork.
- Lesson: For any configurable default, write a characterization benchmark before picking the value. The cost of being wrong silently is higher than the cost of measuring.
2026-03-04: The :free suffix bug — test your actual infrastructure
- Context: OpenRouter paid tier smoke test was failing. The model ID being sent included the :free suffix (e.g., qwen/qwen3-coder:free), which routed to the free tier instead of the paid model.
- Root cause: Configuration assumed the model string was used as-is, but the :free suffix is a routing hint, not part of the model name.
- Lesson: Smoke tests must exercise the actual production path. Unit tests with mocked responses would not have caught this; only a real HTTP call to OpenRouter revealed the suffix behavior.
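A minimal sketch of one way to guard against this, assuming routing variants are appended after a colon as in the qwen/qwen3-coder:free example. The helper names are hypothetical and are not part of Hive or the OpenRouter API; the source does not say how the configuration was actually fixed.

```python
from __future__ import annotations


def split_routing_suffix(model_id: str) -> tuple[str, str | None]:
    """Split 'vendor/model:variant' into ('vendor/model', 'variant')."""
    base, sep, variant = model_id.partition(":")
    return (base, variant) if sep else (model_id, None)


def paid_model_id(model_id: str) -> str:
    """Drop a ':free' routing suffix; leave every other model ID unchanged."""
    base, variant = split_routing_suffix(model_id)
    return base if variant == "free" else model_id
```

A smoke test can then assert that the ID actually sent on the paid path carries no :free suffix, which is the check a mocked unit test would have missed.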
2026-03-04: vault_search has best signal-to-noise ratio
- Context: Benchmarking measured signal-to-noise ratio across tools. vault_search scored 98.8% (almost all returned content is relevant). session_briefing scored 78.5% (includes boilerplate headers, health checks, git log noise).
- Implication: For targeted queries, prefer vault_search over session_briefing. Reserve session_briefing for cold-start orientation where breadth matters more than precision.
- Action: This data feeds into P1 (Context Curator) design: relevance scoring should weight search results higher than briefing sections.
[2026-03-05] Per-repo agent systems vs cross-project vault — competitive analysis
Context: Analyzing Bruno Bertolini's per-repo .agent/ system (rules, skills, agents, specs) that runs 6 CI agents on every PR for quality gates, security, architecture, and dogfood enforcement.
Problem: Should Hive adopt CI-based quality gate agents (architecture check, security scan, dogfood enforcement) or a PRD→techspec→exec pipeline? Is the per-repo .agent/ pattern superior to a cross-project vault?
Solution: No features to copy. Per-repo agents excel at CI automation but are limited to single-repo scope, scale poorly with .md files (author admits "too big"), and lack cross-project context. Hive's strengths (session_briefing, smart_search, EMA relevance, cross-project vault, capture_lesson) are exactly what per-repo systems cannot do. The only actionable takeaway is P3 drift detector (vault_validate) which is already on roadmap — validates code changes against vault ADRs/patterns. PRD→exec pipeline already covered by Claude Code skills (/prd, /writing-plans). Self-improvement loop is our P2 (auto-capture). RAG not needed until 500+ files (ADR-003).
Tags: #competitive-analysis #architecture #ci-cd #self-improvement
[2026-03-06] anyOf in JSON Schema makes MCP tools invisible to Claude Code
Context: Auditing why vault_patch, capture_lesson, and vault_list_files were missing from Claude Code's deferred tools list (14 of 17 visible).
Problem: Claude Code's deferred tool indexer silently drops tools whose JSON Schema contains anyOf — generated by Python | None union types via Pydantic. Affected tools become completely unusable in a session.
Solution: Replace T | None = None with empty defaults (str = "", list[T] = []). Suppress Ruff B006 (mutable default) with # noqa: B006 — safe because FastMCP creates Pydantic models per call. Design rule: never use | None in MCP tool parameters.
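One way to catch this before a session starts is to scan each tool's input schema for anyOf. The sketch below is a hypothetical lint helper, not part of Hive or FastMCP. The sample schema mimics the shape Pydantic typically emits for a str | None parameter, which is an assumption here.

```python
from __future__ import annotations

from typing import Any, Iterator


def find_anyof(schema: Any, path: str = "$") -> Iterator[str]:
    """Yield the path of every 'anyOf' key found anywhere in a JSON Schema."""
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}"
            if key == "anyOf":
                yield child
            yield from find_anyof(value, child)
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            yield from find_anyof(item, f"{path}[{index}]")


# Assumed schema shape for a parameter declared as `section: str | None = None`.
optional_param = {
    "type": "object",
    "properties": {
        "section": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": None},
    },
}

# The same parameter declared as `section: str = ""`.
plain_param = {
    "type": "object",
    "properties": {"section": {"type": "string", "default": ""}},
}
```

Running the helper over every registered tool schema in a unit test would turn the "never use | None" design rule into an enforced check. Note that a property literally named anyOf would be reported as a false positive.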
Tags: #mcp #claude-code #fastmcp #schema #interop
[2026-03-06] Subprocess and I/O resilience audit — 4 crash vectors found
Context: User reported vault_patch crashing MCP server on Windows. Expanded into full audit of all subprocess and I/O paths.
Problem: Four categories of unhandled exceptions could crash the server: (1) _git_commit missing catch-all, (2) _git_log/_git_recent only catching TimeoutExpired, (3) httpx.ReadTimeout not caught in HTTP clients, (4) file I/O in write tools with no except OSError.
Solution: (1) Catch-all except Exception in all git helpers + timeout 10→30s. (2) Same for _git_log/_git_recent. (3) Added except httpx.TimeoutException in OllamaClient and OpenRouterClient. (4) Wrapped read_text/write_text in write tools with except OSError. 10 new tests, 290 total.
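The git-helper part of this fix might look like the sketch below. run_git is a hypothetical name; the real _git_commit, _git_log and _git_recent helpers are not shown in this document. The idea is that no exception escapes into the MCP server: every failure becomes an error string the tool can return.

```python
from __future__ import annotations

import subprocess


def run_git(args: list[str], cwd: str, timeout: float = 30.0) -> tuple[bool, str]:
    """Run a git command and return (ok, output_or_error) without ever raising."""
    label = "git " + " ".join(args)
    try:
        proc = subprocess.run(
            ["git", *args],
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"{label} timed out after {timeout}s"
    except Exception as exc:  # catch-all: missing binary, bad cwd, OS errors
        return False, f"{label} failed: {exc!r}"
    if proc.returncode != 0:
        return False, proc.stderr.strip() or f"{label} exited with {proc.returncode}"
    return True, proc.stdout
```

Returning a tuple rather than raising keeps the decision about how to report the failure in the tool handler, where the error text can be relayed to the client.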
Tags: #resilience #subprocess #httpx #mcp-stability
ReDoS in vault_search (2026-03-06)
- Context: vault_search(use_regex=True) compiles user-supplied regex with re.compile() and applies it to every line of every vault file
- Root cause: Python's re module has no backtracking limit. Pathological patterns like (a+)+$ cause exponential time on non-matching lines
- Fix: Cap regex pattern length at 200 chars. Practical constraint: vault files are small markdown, per-line matching limits blast radius
- Lesson: Any tool that accepts regex from untrusted input needs a complexity gate: length limit at minimum, re2 library for production systems
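A minimal sketch of the length gate, assuming it sits in front of re.compile(). The function and constant names are hypothetical. The cap narrows the attack surface but does not remove it: (a+)+$ is only seven characters long, which is why the lesson treats a length limit as the minimum.

```python
from __future__ import annotations

import re

MAX_PATTERN_LENGTH = 200  # the cap described above


def compile_user_regex(pattern: str, max_length: int = MAX_PATTERN_LENGTH) -> re.Pattern[str]:
    """Compile an untrusted pattern, rejecting oversized or invalid input."""
    if len(pattern) > max_length:
        raise ValueError(f"regex too long: {len(pattern)} > {max_length} characters")
    try:
        return re.compile(pattern)
    except re.error as exc:
        # Surface syntax errors as the same exception type as the length check.
        raise ValueError(f"invalid regex: {exc}") from exc
```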
list_models() missing status check (2026-03-06)
- Context: OpenRouterClient.list_models() called resp.json() without checking resp.status_code first
- Root cause: The generate() method had proper status checks but list_models() was added later and missed the pattern
- Fix: Added if resp.status_code >= 400 guard + float() pricing wrapped in try/except ValueError
- Lesson: When adding new methods to an HTTP client class, copy the full error-handling pattern from existing methods, not just the happy path
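The guard can be sketched independently of the HTTP library. Everything below is illustrative: parse_model_listing is a hypothetical helper, and the payload layout (a data list whose entries carry id and pricing.prompt) is an assumption about the response, not something this document states. Catching TypeError alongside ValueError is also an addition of the sketch.

```python
from __future__ import annotations

from typing import Any


def parse_model_listing(status_code: int, payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return [{'id', 'prompt_price'}] entries, or [] when the request failed."""
    if status_code >= 400:
        # Error bodies do not have the listing shape, so do not parse them.
        return []
    models = []
    for entry in payload.get("data", []):
        try:
            price = float(entry.get("pricing", {}).get("prompt", 0))
        except (TypeError, ValueError):
            price = None  # malformed price: keep the model, drop the number
        models.append({"id": entry.get("id"), "prompt_price": price})
    return models
```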
[2026-03-07] MCP server vs agent — architectural boundary
Context: Evaluating whether hive should evolve from MCP server to agent framework
Problem: Confusion between MCP servers (tool providers, stateless, client-agnostic) and agents (autonomous actors with their own decision loops). Some competitor projects blur this line.
Solution: MCP servers extend the host — they provide tools the host decides when to call. Agents replace the host's decision loop. Hive is and should remain an MCP server. The host (Claude Code, Gemini CLI, etc.) is the orchestrator. Adding agent behavior would couple hive to a specific host.
Why: Protocol-level separation of concerns. MCP is a tool interface, not an execution framework. Staying protocol-pure keeps hive client-agnostic.
Tags: #architecture #mcp #design-principle
[2026-03-07] CI publish failures on duplicate versions — make idempotent
Context: release-please created a release, CI published to PyPI and MCP Registry. Re-running the workflow failed because the version was already published.
Problem: MCP Registry publish step returned a non-zero exit code on duplicate version, failing the entire CI pipeline. PyPI had the same issue but was already handled with --skip-existing.
Solution: Added continue-on-error: true to the MCP Registry publish step. Duplicate publishes are expected (re-runs, manual triggers) and should not block CI.
Why: Idempotency in CI pipelines. Any publish step that can be re-run must tolerate "already exists" as a success condition.
Tags: #ci #release #idempotency
[2026-03-07] Evaluating external tools — separation of concerns over feature envy
Context: Analyzed claude-qmd-sessions (hook-driven session transcript indexing via qmd). Evaluated 4 ideas: auto-briefing hooks, transcript indexing, CWD auto-detection, dual-cap context.
Problem: Temptation to absorb external tool patterns into hive (session transcript search, hook automation, CWD-based project detection).
Solution: None warranted inclusion. Hooks are user-config (not server feature) — documented instead. Transcript indexing adds noise (hive already has capture_lesson for curated extraction). CWD detection is impossible (MCP server doesn't receive client CWD). Dual-cap is YAGNI.
Why: An MCP server should do one thing well (vault access) rather than absorb tangential features. The right response to "interesting external pattern" is often documentation, not code.
Tags: #architecture #yagni #competitive-analysis
2026-03-08: Tool consolidation — 19 to 10 tools for Claude Code compatibility
- Context: Claude Code silently drops MCP tools beyond ~14 from its deferred tool list. Hive had 19 tools, 5 were invisible.
- Problem: Users could never discover or use vault_patch, vault_list_files, extract_lessons, vault_validate, or vault_usage because Claude Code's client-side limit hid them from the tool picker.
- Solution: Consolidated 19 tools into 10 by merging related functionality behind mode-switching parameters (e.g., vault_search gained ranked=True and since_days=N instead of separate vault_smart_search and vault_recent tools). No functionality lost: every feature accessible via the consolidated API.
- Lesson: MCP client implementations have undocumented limits. Design tool surfaces to stay under ~12 tools per server. Prefer fewer tools with mode parameters over many single-purpose tools.
2026-03-08: Test flakiness from real external services in unit tests
- Context: After merging vault_summarize into delegate_task, the large-file test became flaky: passing in isolation, failing in the full suite.
- Problem: The vault_mcp test fixture creates a server with default settings. When OPENROUTER_API_KEY is set in the environment (from dotfiles), the server actually connects to OpenRouter and returns a real summary instead of the expected fallback content.
- Solution: Made test assertions handle both cases (worker available → summary, worker unavailable → raw content). The proper fix would be injecting a null OpenRouter client in the vault_mcp fixture.
- Lesson: Test fixtures that create servers without explicit client injection will use real external services if env vars are set. Always inject mock clients in unit test fixtures.
[2026-03-10] Glama MCP Directory Listing Requirements
Context: Submitting hive-vault to MCP server directories for discoverability
Problem: punkpeye/awesome-mcp-servers requires a Glama listing link next to the GitHub link. Glama requires a glama.json in the repo root.
Solution: 1) Create glama.json with $schema and maintainers array. 2) Add Glama badge (PNG, 380x200) to README. 3) Update awesome-list PR entry with glama link after GitHub link. Schema: https://glama.ai/mcp/schemas/server.json — only required field is maintainers (GitHub usernames).
Tags: #distribution #mcp #glama
[2026-03-12] SQLite SQLITE_MISUSE from concurrent MCP tool calls
Context: FastMCP dispatches synchronous tool handlers to a thread pool via anyio.to_thread.run_sync. Concurrent tool calls share the same ServerContext, including SQLite-backed trackers.
Problem: RelevanceTracker and UsageTracker used check_same_thread=False without locking, causing SQLITE_MISUSE (error 21). BudgetTracker lacked the flag entirely, causing ProgrammingError on cross-thread access. vault_write/vault_patch had TOCTOU race conditions on file read-modify-write. _git_commit had no serialization, allowing interleaved git add/commit.
Solution: Added threading.Lock to all three SQLite trackers. Used Lock (not RLock) with internal _method() pattern to avoid deadlock on reentrant calls (e.g. month_stats calling _month_spent). Added module-level _WRITE_LOCK in _vault_write.py for atomic file I/O + git commit. Added _GIT_LOCK in _helpers.py to serialize all git operations.
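The Lock-plus-internal-method pattern can be sketched as follows. SpendTracker is a hypothetical stand-in, not Hive's BudgetTracker, and the single-column table is invented for the example. Public methods take the lock; underscore-prefixed methods assume it is already held, so one public method can reuse another's logic without re-acquiring a non-reentrant Lock.

```python
from __future__ import annotations

import sqlite3
import threading


class SpendTracker:
    """Thread-safe wrapper around one shared SQLite connection."""

    def __init__(self, path: str = ":memory:") -> None:
        self._lock = threading.Lock()
        # check_same_thread=False allows cross-thread use; the lock makes it safe.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._conn.execute("CREATE TABLE IF NOT EXISTS spend (amount REAL NOT NULL)")

    def record(self, amount: float) -> None:
        with self._lock:
            self._conn.execute("INSERT INTO spend (amount) VALUES (?)", (amount,))
            self._conn.commit()

    def _month_spent(self) -> float:
        # Internal: the caller must already hold self._lock.
        row = self._conn.execute("SELECT COALESCE(SUM(amount), 0) FROM spend").fetchone()
        return float(row[0])

    def month_spent(self) -> float:
        with self._lock:
            return self._month_spent()

    def month_stats(self, budget: float) -> dict[str, float]:
        # Calls the internal method, not month_spent(), which would deadlock.
        with self._lock:
            spent = self._month_spent()
        return {"spent": spent, "remaining": budget - spent}
```

If month_stats called the public month_spent() while holding the lock, the second acquire would block forever; routing through _month_spent() avoids that without switching to RLock.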
Tags: #concurrency #sqlite #threading #mcp
[2026-03-13] asyncio.timeout cannot interrupt threads — use lock timeouts for sync code
Context: Adding timeouts to MCP tool handlers to fix indefinite hangs (issue #63)
Problem: asyncio.timeout() only cancels at await points — it cannot interrupt a thread blocked on Lock.acquire() or subprocess.run(). Converting sync tools to async via to_thread gives false sense of control.
Solution: Use Lock.acquire(timeout=N) for sync blocking points, asyncio.timeout() for async handlers. Defense in depth: each layer has its own timeout mechanism matching its execution model.
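A sketch of the synchronous half of that rule. with_lock_timeout is a hypothetical helper name; the point is that the timeout lives on the blocking call itself, because a thread parked in Lock.acquire() cannot be cancelled from the event loop.

```python
from __future__ import annotations

import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def with_lock_timeout(lock: threading.Lock, timeout: float, fn: Callable[[], T]) -> T:
    """Run fn() under lock, raising TimeoutError if the lock stays busy."""
    if not lock.acquire(timeout=timeout):
        raise TimeoutError(f"could not acquire lock within {timeout}s")
    try:
        return fn()
    finally:
        lock.release()
```

An async handler would still wrap its own awaits in asyncio.timeout(); this helper only covers the sync blocking point underneath.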
Tags: #python #asyncio #concurrency #mcp
[2026-03-10] uv sync editable install breaks multi-stage Docker builds
Context: Building a multi-stage Docker image for hive-vault. The builder stage used uv sync --frozen --no-dev to install the local package, then only .venv was copied to the final stage.
Problem: Runtime ModuleNotFoundError: No module named 'hive'. uv sync installs the local project as an editable/direct-url reference pointing to /app/src, which doesn't exist in the final image (only .venv is copied).
Solution: Use uv sync --frozen --no-dev --no-install-project for third-party deps only, then .venv/bin/pip install --no-cache-dir --no-deps . to install the local package as a proper wheel embedded in .venv/lib/. The wheel is self-contained — no reference to source paths.
Why: uv sync optimizes for development (editable installs are faster for iteration). In multi-stage Docker builds where source isn't copied to the final stage, you need a non-editable wheel. This is a uv-specific gotcha — pip install . has always produced non-editable installs by default.
Tags: #docker #uv #multi-stage #python-packaging
[2026-03-15] MCP tool parameter names must match LLM mental models
Context: vault_patch tool had parameters named old_text and new_text. LLMs (Claude, Gemini) consistently hallucinated find and replace instead, causing Pydantic validation failures at runtime.
Problem: The parameter names were technically correct but didn't match the natural mental model that LLMs (and humans) have for text substitution operations. Every vault_patch call risked a validation error from the LLM guessing the "obvious" names.
Solution: Renamed old_text→find and new_text→replace across the entire codebase (7 files). Breaking API change, but eliminated an entire class of runtime failures. Evaluated adding aliases for backward compatibility — rejected as over-engineering since no external consumers exist yet.
Why: MCP tools are called by LLMs, not humans typing exact names. Parameter naming is a DX/UX decision that directly affects tool reliability. Shorter, more idiomatic names reduce schema misreads. Design rule: when naming MCP tool parameters, prefer the term an LLM would guess first.
Tags: #mcp #naming #llm-ergonomics #dx
[2026-03-26] BFS hierarchical scope resolution for nested vault directories
Context: Hive vault uses a flat 10_projects/<slug>/ layout, but the new 50_work/ scope is multi-level (e.g. 50_work/45-development/<family>/<component>/). The existing flat resolver only saw direct children of the scope root, so deep projects were unreachable via short slug — users had to spell out the full path.
Problem: _resolve_project_dir could not find a slug like hydra3d-plus under a nested work tree. Adding a work scope without changing resolution would have forced verbose literal paths for every work query, breaking the ergonomic short-slug API users already had for 10_projects.
Solution: Switched _resolve_project_dir to breadth-first traversal: try direct child first, then BFS through subdirectories — shallowest match wins. Slugs containing / bypass BFS and resolve as literal relative paths inside the scope (escape hatch for collisions or explicit targeting). Added scope filter to vault_search (restricts all 3 modes to a single scope) and _find_duplicate_names in vault_health to surface BFS collisions before they cause silent mis-resolution.
Why: BFS preserves the short-slug API across heterogeneous vault layouts (flat for projects, nested for work). Shallowest-wins keeps resolution deterministic; the explicit-path escape hatch covers the duplicate-name edge case without coupling tools to a single layout. Duplicate detection in vault_health makes the previously silent collision surface visible at audit time.
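The resolution order described above can be sketched with a plain breadth-first walk. This is not the actual _resolve_project_dir: the signature is assumed, same-depth ties are broken alphabetically for determinism, and the check that a literal path stays inside the scope is an addition of the sketch.

```python
from __future__ import annotations

from collections import deque
from pathlib import Path


def resolve_project_dir(scope_root: Path, slug: str) -> Path | None:
    """Resolve a slug to a directory under scope_root; shallowest match wins."""
    if "/" in slug:
        # Escape hatch: treat the slug as a literal path relative to the scope.
        candidate = scope_root / slug
        resolved = candidate.resolve()
        if resolved.is_dir() and resolved.is_relative_to(scope_root.resolve()):
            return candidate
        return None

    # Level-order traversal: every directory at depth d is examined before
    # any directory at depth d + 1. Symlink cycles are not handled here.
    queue = deque([scope_root])
    while queue:
        current = queue.popleft()
        children = sorted(p for p in current.iterdir() if p.is_dir())
        for child in children:
            if child.name == slug:
                return child
        queue.extend(children)
    return None
```

Because the root's children are checked first, the flat 10_projects/<slug>/ layout resolves exactly as before, and nested layouts only pay for the deeper walk when the direct lookup misses.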
Tags: #vault #scope-resolution #bfs #mcp
[2026-03-11] Three-pass cascading match for vault_patch tolerant text replacement
Context: vault_patch originally required old_text to match the file content byte-for-byte, including YAML frontmatter. LLMs typically copy snippets from vault_query output that has the frontmatter stripped or whitespace normalized, so the patch call would fail every time the body was sourced from a prior read (Issue #52).
Problem: Strict matching is brittle to two realistic LLM-induced drifts: (a) the frontmatter is missing because the LLM only copied the body, (b) trailing/leading whitespace in tables or code blocks got normalized during quoting. Either drift produced a hard error with no hint of how close the match was, breaking the read→patch workflow entirely.
Solution: _match_and_replace() in _helpers.py performs three cascading passes — (1) exact match on full file, (2) exact match on body only (strip frontmatter, then re-attach after replacement), (3) whitespace-normalized match on body (collapse runs of whitespace to a single space for comparison only). First successful pass wins. If all three miss, difflib.SequenceMatcher computes a similarity percentage and the error message includes the closest near-match excerpt — turning a dead-end "not found" into an actionable diagnostic.
Why: Cascading from strict→loose preserves correctness when the LLM gets the text right, while tolerating the common failure modes. Returning similarity diagnostics on total miss converts a UX dead end into a debugging signal: the LLM can see how close it got and adjust. replace_body mode was evaluated as an alternative and discarded — tolerant matching covers the same use cases without adding a second tool surface.
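A reduced sketch of the strict-to-loose cascade with a similarity diagnostic on total miss. It is not the real _match_and_replace(): the frontmatter-aware second pass is left out because its exact semantics are not spelled out here, the whitespace-tolerant pass is implemented with a generated regex (an assumption), and the error wording is invented.

```python
from __future__ import annotations

import difflib
import re


def closest_excerpt(content: str, find: str) -> tuple[float, str]:
    """Return (similarity, excerpt) for the line window most similar to find."""
    lines = content.splitlines()
    width = max(1, len(find.splitlines()))
    best_ratio, best = 0.0, ""
    for start in range(max(1, len(lines) - width + 1)):
        window = "\n".join(lines[start:start + width])
        ratio = difflib.SequenceMatcher(None, find, window).ratio()
        if ratio > best_ratio:
            best_ratio, best = ratio, window
    return best_ratio, best


def match_and_replace(content: str, find: str, replace: str) -> str:
    """Replace the first occurrence of find, trying strict then loose matching."""
    if not find.strip():
        raise ValueError("find text is empty")
    # Strict pass: exact substring match.
    if find in content:
        return content.replace(find, replace, 1)
    # Loose pass: any run of whitespace in find matches any run in content.
    pattern = r"\s+".join(re.escape(token) for token in find.split())
    match = re.search(pattern, content)
    if match:
        return content[:match.start()] + replace + content[match.end():]
    # Total miss: report how close the nearest candidate was.
    ratio, excerpt = closest_excerpt(content, find)
    raise ValueError(f"no match; closest candidate ({ratio:.0%} similar): {excerpt!r}")
```

Slicing around the match, rather than calling re.sub, keeps backslashes in the replacement text from being interpreted as group references.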
Tags: #vault-patch #tolerant-matching #llm-ergonomics #mcp
[2026-05-15] In-memory MCP tests do not exercise the stdio transport race
Context: Debugging issue #75 (Hive transport dying after first rejected tool call). The in-memory FastMCP call_tool tests passed cleanly when I cancelled the task and made another call — looked like the server was fine.
Problem: That gave false confidence. The actual bug only reproduces when the cancellation goes through the JSON-RPC wire as a notifications/cancelled message AND the server runs as a real subprocess with mcp.server.stdio. The race is in RequestResponder.__exit__'s interaction with anyio.CancelScope, which the in-memory path never touches.
Solution: For any future MCP transport-level bug, write the regression test as a subprocess driving real JSON-RPC. tests/test_transport_recovery.py spawns python -m hive.server and sends initialize → tools/call → notifications/cancelled → tools/call to verify the second call still responds. Pairs in-memory tests for the API surface with subprocess tests for the transport.
Why: FastMCP's call_tool shortcuts the full message dispatch and never instantiates RequestResponder with the cancel-scope contract. The receive loop's task group is where the bug lives, not in the handler. Two distinct surfaces need two distinct test strategies.
Tags: #mcp #testing #stdio #cancellation #issue-75
[2026-05-15] Forward-compatible monkey-patches must be self-gated on the failure mode
Context: Patching mcp.shared.session.RequestResponder.__exit__ to fix the issue #75 cancellation leak. Upstream will eventually fix it; we don't want our patch to mask a different bug if upstream changes the internal flow.
Problem: A naive monkey-patch overrides upstream behaviour permanently. Once we ship it, every future bug in that method becomes invisible (or worse, the patch's logic conflicts with a new upstream fix and creates a new bug).
Solution: Gate the patch on the exact failure-mode signature: if self._completed and isinstance(exc, anyio.get_cancelled_exc_class()). The first clause says "we already sent a response" — the second says "this is anyio's cancellation". Anything else re-raises normally. Wrap the patch application in a defensive try/except that logs a warning if RequestResponder was renamed or restructured upstream.
Why: A self-gated patch becomes inert the moment upstream lands a fix — the trigger condition simply stops being reachable. That makes the monkey-patch removable without coordination: even if we forget to drop it, it doesn't do anything harmful in the post-fix world. Defensive import keeps the production server from crashing if upstream renames the class.
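The gating idea can be shown on a toy class. Nothing below touches the mcp package: Responder, Cancelled and finish are hypothetical stand-ins for RequestResponder, anyio's cancellation exception and __exit__, chosen so the sketch runs on its own.

```python
from __future__ import annotations

import logging

log = logging.getLogger("compat")


class Cancelled(Exception):
    """Stand-in for the cancellation exception class."""


class Responder:
    """Stand-in for the third-party class being patched."""

    def __init__(self) -> None:
        self.completed = False

    def finish(self, exc: BaseException | None) -> str:
        if exc is not None:
            raise exc
        return "ok"


def apply_patch(target: type = Responder) -> bool:
    """Install the self-gated patch; return False if the target has changed shape."""
    try:
        original = target.finish
    except AttributeError:
        log.warning("%s.finish not found; patch skipped", target.__name__)
        return False

    def patched(self, exc):
        # Gate on the exact failure signature; everything else goes upstream.
        if getattr(self, "completed", False) and isinstance(exc, Cancelled):
            return "suppressed"
        return original(self, exc)

    target.finish = patched
    return True
```

Any call that does not match both clauses falls through to the original method, so the patch stops having any effect once the triggering combination can no longer occur.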
Tags: #monkey-patch #mcp #forward-compat #issue-75
[2026-05-15] release-please extra-files does not retroactively patch drifted files
Context: server.json was added to release-please-config.json as an extra-files entry pointing at $.version and $.packages[0].version. The file was already at v1.4.5 when the config landed. PyPI advanced through 1.5.x → 1.12.2 over eight releases, but server.json stayed at 1.4.5 the entire time, and the MCP Registry kept rejecting mcp-publisher publish with 400 duplicate version. The failure was hidden by a generic || echo "skipped" on the publish step.
Problem: release-please's extra-files updater only fires when it bumps a version inside a release PR. It does not synchronise pre-existing drift — if the file is wrong when you add the config, it stays wrong. Combined with a catch-all error silencer, the registry quietly froze for two months.
Solution: Bump the drifted file manually to the current version once. release-please picks up from there. While there, replace the catch-all silencer (|| echo skipped) with a grep on the specific failure string (cannot publish duplicate version) so genuine failures surface. Add a workflow_dispatch input to the release workflow so you can re-publish to the registry without inventing a new release.
Why: Generic error swallowers are tech-debt batteries: they capture symptoms forever until someone notices the divergence. Pair every "best-effort" step with a precise filter that distinguishes "expected idempotent miss" from "real failure", and provide a manual re-run path so corrective action doesn't require a fake feature commit.
Tags: #release-please #ci #error-handling #mcp-registry
[2026-05-15] Boolean workflow_dispatch inputs are real booleans in if: expressions
Context: Added workflow_dispatch with a boolean input republish_mcp and gated the publish job with if: inputs.republish_mcp == 'true'. The job skipped on every manual trigger.
Problem: GitHub Actions converts type: boolean inputs to actual booleans when reading via inputs.* in expressions. Comparing against the string 'true' always evaluates false, silently skipping the gated job.
Solution: Use the input truthily (inputs.republish_mcp) or compare against the unquoted boolean (inputs.republish_mcp == true). Confirmed by re-triggering and seeing publish-mcp actually run.
Why: Documented but easy to miss — most other Actions contexts are strings. When a workflow_dispatch input is declared as boolean, the expression engine respects the type. String comparison is a silent footgun: the workflow appears to "work" because the trigger succeeds, but the gated job is never reached.
Tags: #github-actions #workflow-dispatch #ci #boolean-coercion
[2026-05-17] release-please leaves uv.lock self-reference stale on every release
Context: hive-vault uses release-please for version bumps. release-please-config.json registers pyproject.toml and server.json (via extra-files), but uv.lock cannot be added there — its [[package]] array-of-tables format makes jsonpath targeting unreliable in release-please's TOML updater. Every release left the lock's editable self-reference anchored at the previous version. By 2026-05-17 master's uv.lock said 1.12.1 while pyproject.toml said 1.12.6 (five releases of drift).
Problem: The drift is invisible in CI because uv operations still resolve, but every developer who runs uv lock / uv sync / uv run on master gets an uncommitted uv.lock diff. The signal of meaningful lock changes is blurred and master is internally inconsistent (lock self-ref vs pyproject.toml version mismatch). Identical pattern hits any project pairing release-please with uv, poetry, or Cargo.
Solution: In release.yml, after googleapis/release-please-action@v4, gate four steps on if: steps.release.outputs.pr: (1) jq-extract the PR's headBranchName into GITHUB_ENV, (2) actions/checkout@v4 of that branch with the release PAT, (3) setup-uv + Python, (4) uv lock and conditionally commit/push any diff back to the PR branch. Route the dynamic branch name through GITHUB_ENV (not direct ${{ steps.… }} interpolation in run:) to satisfy the workflow-injection lint. Cross-project version: the generalized "Special case: lock files with self-references" pattern lives in the maintainer's cross-project knowledge store (this lesson is its L-HIVE-88 origin); not linked here to preserve repo->store independence.
Why: release-please's extra-files only mutates targets when it bumps versions during a release PR — it cannot reliably address array-of-table TOML entries by name. Regenerating the lock on the release-please branch shifts the work to where it actually has access to the new pyproject.toml version. The if: steps.release.outputs.pr gate keeps the steps no-op when release-please has nothing to release; the PAT keeps the commit attributable and re-triggers the standard CI workflow on the new commit.
Tags: #release-please #uv #ci #cross-project #lock-files
[2026-05-18] Multi-process MCP server contention surfaces — checklist + patterns
Context: Hive runs as N independent uvx hive-vault subprocesses (one per Claude Code session) sharing the same vault git repo + three SQLite trackers. PR #90 fixed the symptoms (39-min hang ending in AssertionError('Request already responded to'); recurring git commit timed out; silent capture_lesson loss). PR #92 hardened the design with the patterns that apply across any stateful multi-process MCP server.
Problem: Intra-process primitives (threading.Lock, default sqlite3 connection) silently fail to coordinate across separate MCP server subprocesses, even when each is correct on its own. The symptoms only appear under parallel usage and are hard to attribute (hangs, crash from monkey-patch-able assertions, silent data loss).
Solution: Four primitive patterns:
- Inter-process file lock (filelock) on the git index. The in-process _GIT_LOCK only serializes threads within one process; N processes still race on .git/index.lock. Wrap the whole write critical section (read-modify-write + commit), not just the git call, or you lose data on concurrent appends to the same file (the capture_lesson loss).
- SQLite as inter-process queue, not async cache. Set connect(timeout=10) + PRAGMA busy_timeout=10000 + PRAGMA synchronous=NORMAL + PRAGMA wal_autocheckpoint=200. Replace SELECT+UPDATE with INSERT ... ON CONFLICT DO UPDATE. Buffer writes in memory and flush in batches; reads flush first so they see fresh data.
- Rate-limit shared-state mutations (apply_decay was the canonical bug). Use an atomic INSERT ... ON CONFLICT DO UPDATE WHERE elapsed >= T claim: row updates = decay runs; row unchanged = skip. Multiple briefings within T seconds = exactly one decay.
- Cache subprocess-spawn ops by HEAD SHA. git log/git recent were spawning per call. _current_head_sha(vault) reads .git/HEAD directly (no subprocess), perfect cache key.
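The atomic claim from the third pattern can be sketched with the standard sqlite3 module. The table name, column names and function names are hypothetical; only the statement shape (an upsert whose DO UPDATE carries a WHERE on elapsed time) and the timeout settings come from the list above. The upsert syntax needs SQLite 3.24 or newer.

```python
from __future__ import annotations

import sqlite3
import time

CLAIM_SQL = (
    "INSERT INTO decay_state (id, last_run) VALUES (1, :now) "
    "ON CONFLICT(id) DO UPDATE SET last_run = excluded.last_run "
    "WHERE excluded.last_run - decay_state.last_run >= :interval"
)


def open_state_db(path: str) -> sqlite3.Connection:
    """Open the shared database with a busy timeout so writers wait, not fail."""
    conn = sqlite3.connect(path, timeout=10)
    conn.execute("PRAGMA busy_timeout=10000")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS decay_state ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), last_run REAL NOT NULL)"
    )
    conn.commit()
    return conn


def claim_decay(conn: sqlite3.Connection, interval: float, now: float | None = None) -> bool:
    """Return True only for the caller that wins the right to run decay."""
    now = time.time() if now is None else now
    cursor = conn.execute(CLAIM_SQL, {"now": now, "interval": interval})
    conn.commit()
    # One row inserted or updated means the claim succeeded; zero means skip.
    return cursor.rowcount == 1
```

Because the check and the write happen in one statement, two processes cannot both observe a stale last_run and both decide to run, which is the failure mode of a separate SELECT followed by UPDATE.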
Process-model patterns:
- Per-PID log file (hive-{pid}.log): RotatingFileHandler rotation races corrupt the log under N concurrent writers.
- Defer create_server() out of import-time → main(). Importing the module side-effect-free saves ~300-600ms × N spawns.
- For client cancellation races, monkey-patch BOTH RequestResponder.__exit__ (re-raised CancelledError, issue #75) AND RequestResponder.respond (AssertionError on _completed=True when handler finishes after cancel). Self-gate both on the exact failure mode so they degrade silently if upstream fixes.
UX: format_io_error(exc, path, action) discriminator returning per-class hints beats f"File I/O error: {exc!r}". The LLM relay can act on "permission denied — check writable by MCP process" but not on [Errno 13].
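A sketch of what such a discriminator could look like. Only the function name and argument order come from the line above; the set of exception classes handled and the hint wording are assumptions.

```python
from __future__ import annotations


def format_io_error(exc: OSError, path: str, action: str) -> str:
    """Turn an OSError into a message the calling LLM can act on."""
    if isinstance(exc, PermissionError):
        hint = "permission denied; check the path is writable by the MCP process"
    elif isinstance(exc, FileNotFoundError):
        hint = "file or parent directory does not exist"
    elif isinstance(exc, IsADirectoryError):
        hint = "path is a directory, not a file"
    else:
        # Fall back to the OS message, which is still clearer than a bare errno.
        hint = exc.strerror or exc.__class__.__name__
    return f"Could not {action} {path}: {hint}"
```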
Verdict on scaling: The original ADR-005 estimated wall at 50 sessions. Audit revised down to ~20–25 (every read also writes to SQLite via track(); triple-timeout stack up to 90s; apply_decay correctness break at 5+ concurrent briefings). Post-PR-92 (buffered writes + apply_decay gate + git_log cache), the wall should rise but exact number needs measurement.
Tags: #mcp #concurrency #multi-process #sqlite #filelock #scalability
Cross-project pattern: the multi-process-mcp-server pattern (maintainer's cross-project knowledge store) was distilled from this lesson (origin L-HIVE-90/92); not linked here to preserve repo->store independence.
[2026-05-19] GitHub Actions floating major tags are publisher-dependent — verify before bumping
Context: Bumping all CI actions to Node 24 ahead of 2026-06-02 deprecation. Started with @v6/@v8/@v5 across checkout/setup-uv/release-please-action.
Problem: CI failed with "Unable to resolve action astral-sh/setup-uv@v8, unable to find version v8" even though gh api showed v8.1.0 as latest release. Each action publisher uses a different tagging convention: actions/checkout publishes floating majors (v6 → v6.0.2), googleapis/release-please-action publishes floating majors (v5 → v5.0.0), but astral-sh/setup-uv publishes only floating MINORS (v7.4, v7.5, v7.6) plus exact SemVer (v8.0.0, v8.1.0). No floating major tag exists for setup-uv.
Tags: #ci #github-actions #deps
[2026-05-19] mypy 1→2 + fastmcp 3.1→3.3 were no-ops because --strict already covered the surface
Context: Two "risky major" dep bumps in v1.13.0 stabilization cycle: mypy 1.19.1 → 2.1.0 (tightened defaults) and fastmcp 3.1.0 → 3.3.1 (MCP framework). Both initially classified as needing dedicated smoke + risk assessment.
Problem: Risk classification of major bumps tends to over-estimate effort for codebases that already use strict configs. mypy 2's "tightened defaults" are a subset of what --strict already enforces. fastmcp 3.3's API surface for @mcp.tool / call_tool / FastMCP() was stable across 3.1 → 3.3. Spending separate PR cycles on each was lower ROI than expected.
Solution: Run make check once on the bumped lockfile BEFORE designing a multi-step smoke plan. If green, ship as a small lockfile-only PR. The integration tests already exercise the framework wire; the type checker already enforces the strictest contracts. Save the dedicated-smoke effort for bumps where the project's own strictness DOESN'T cover the change vector (e.g. API rename, behavior flag flip).
Tags: #deps #tooling #process
[2026-05-20] Empirical wire-level test must precede ADRs about MCP cancellation/race behavior
Context: Drafting ADR-006 (commit policy) and ADR-007 (MCP cancellation response) for HIVE-104. ADR-007 §1 originally decided that _compat._patched_respond would attempt a "best-effort raw stdio write" of the JSON-RPC response when _completed=True to recover user-visible ghost responses. Promoted to Accepted via vault_write after a multi-turn architectural design discussion. The decision rested on an unstated assumption: that no prior response had reached the wire by the time our patched respond() fires.
Problem: A 20-iteration empirical classifier (tests/test_compat_shim.py::test_classify_cancellation_race, spawns a real hive subprocess on Linux, drives tools/call + notifications/cancelled, inspects wire bytes) ran AFTER promotion and showed scenario (a), "ErrorData wins", in 20/20 cases. RequestResponder.cancel() at mcp/shared/session.py:148-150 always succeeds in calling _send_response(ErrorData) before our handler completes; the wire response is invariably {"id": N, "error": {"code": 0, "message": "Request cancelled"}}. The "best-effort raw send" decision would have produced a duplicate response (same request_id, success after error) in 100% of cases, a protocol violation worse than the silent-suppress status quo. ADR-007 §1 had to be retracted in Amendment #2 (same day as promotion), and Phase C scope dropped from ~80 LOC raw-stdio-framing to ~30 LOC observability-only.
Solution: For any ADR whose decision depends on wire-level behavior under cancellation or race conditions, write the empirical classifier BEFORE the ADR's decision section is locked in. The classifier pattern is cheap (~50 LOC, mirrors the subprocess fixture in tests/test_transport_recovery.py): spawn the real server, drive the race over N iterations, classify outcomes into well-defined scenarios, count distribution. Do NOT rely on in-memory mocks of anyio streams for this — stdio framing and cancellation timing only behave faithfully end-to-end via subprocess. Also: ADRs MUST allow Status amendments without supersession (ADR-007 carries two amendments stacked under one Status block); the document a future reader sees is the FULL audit trail of how the decision evolved, not just the latest verdict.
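The classification step of such a test can be sketched on its own. This is not the code in tests/test_compat_shim.py; it assumes the wire bytes were already captured as newline-delimited JSON-RPC strings, and the scenario labels are hypothetical:

```python
from __future__ import annotations

import json
from collections import Counter


def classify_wire(lines: list[str], request_id: int) -> str:
    """Classify what reached the wire for one request id."""
    kinds = []
    for line in lines:
        msg = json.loads(line)
        # Only responses carry the id plus a result or an error.
        if msg.get("id") == request_id and ("result" in msg or "error" in msg):
            kinds.append("error" if "error" in msg else "result")
    if not kinds:
        return "no_response"
    if len(kinds) > 1:
        return "duplicate_response"  # protocol violation
    return "error_wins" if kinds[0] == "error" else "result_wins"


def tally(runs: list[tuple[list[str], int]]) -> Counter:
    """Count the outcome distribution over N captured iterations."""
    return Counter(classify_wire(lines, rid) for lines, rid in runs)


# Hand-written sample captures standing in for real subprocess output.
cancelled_run = [
    '{"jsonrpc": "2.0", "id": 7, "error": {"code": 0, "message": "Request cancelled"}}'
]
duplicate_run = cancelled_run + ['{"jsonrpc": "2.0", "id": 7, "result": {"ok": true}}']

distribution = tally([(cancelled_run, 7)] * 20)
duplicate_label = classify_wire(duplicate_run, 7)
```

The duplicate_response label is the outcome the retracted raw-send decision would have produced, which is why counting the distribution before locking the ADR matters.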
Tags: #adr #testing #mcp #cancellation #design-process
[2026-05-20] Pattern-sweep vault before opening any non-trivial branch
Context: Starting the post-HIVE-104 docs cleanup PR. Named the branch docs/post-HIVE-104 and was about to push when a sweep of _meta/patterns/pattern-git-workflow.md surfaced two rule violations.
Problem: Violations were: (1) docs/ prefix not in the approved table (chore/, fix/, feat/, release/), and (2) post-HIVE-104 is a milestone/phase reference, forbidden by git-workflow §7 "phase tracking belongs in the vault backlog, not git history". Without the sweep, both would have landed on origin and been visible to humans/CI.
Solution: Before the first commit on any new branch, query the cross-cutting patterns that gate the work: vault_query(project="_meta", path="patterns/pattern-git-workflow.md") for branch/commit/PR rules, plus any topic-specific pattern (docs-site-starlight, language-standards, spec-driven-development). Renaming the branch later is cheap; renaming after push is not. The branch was renamed to chore/commit-policy-doc-followups before any push.
Tags: #workflow #git #rules-discipline
[2026-05-21] Substring assertions on full report output break when vault_path is included
Context: Implementing the ## server identity block in vault_health (issue #109) — the block embeds vault_path: <abs path>. Four pre-existing tests in test_server.py used substring assertions like assert "stale" not in result.lower(), assert "error" not in result.lower(), assert "ghost_responses" not in result. They started failing because pytest's tmp_path includes the test name (e.g. test_terminal_status_not_stale0), which is now legitimately printed verbatim inside the identity block.
Problem: Bare substring negation on a multi-line markdown report is fragile against any future field that interpolates user-controlled / path-like data. Once vault_path was added, four tests false-positive on substrings that happen to be inside the path. Adding the field exposed brittleness that had been latent.
Solution: Anchor negative assertions on the structural marker the producer code actually emits — ## ghost_responses (the section header), [error] (the issue marker), Stale files (the label). Positive assertions can stay loose. Rule of thumb: if you're asserting that section X is absent, assert the section header is absent, not the topic word.
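A small illustration of the difference, using a hand-built report whose layout is only assumed to resemble vault_health output:

```python
# A healthy report whose identity block embeds a pytest tmp_path.
report = "\n".join(
    [
        "## server identity",
        "vault_path: /tmp/pytest-of-user/pytest-0/test_terminal_status_not_stale0/vault",
        "## links",
        "All links OK",
    ]
)

# Fragile: the topic word also occurs inside the interpolated path,
# so this check fails even though nothing is stale.
bare_check_passes = "stale" not in report.lower()

# Robust: anchor on the structural markers the producer emits.
anchored_check_passes = "Stale files" not in report and "[error]" not in report
```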
Tags: #testing #regression #vault-health
[2026-05-21] Path.write_text in tests must pass encoding=utf-8 on Windows
Context: Re-running the full pytest suite on Windows during #109. tests/test_server.py::TestVaultValidate::test_posix_class_in_heading_not_flagged was already failing on master (unrelated to the identity block) — the test writes a markdown file with an em-dash and then calls vault_health(checks=["links"]), which routes the read through _safe_read (f.read_text(encoding="utf-8")).
Problem: Path.write_text(...) without an explicit encoding uses locale.getencoding() — cp1252 on Windows by default. cp1252 happily encodes the em-dash to byte 0x97. _safe_read then opens the file expecting UTF-8, sees the lone 0x97, raises UnicodeDecodeError, and the file is silently reported as [error] ... File unreadable (I/O or encoding error). The test then sees posix-heading.md in the error message and false-positives — the actual POSIX-class regression check is masked.
Solution: Always pass encoding="utf-8" to Path.write_text (and read_text) in tests that put non-ASCII content into vault files. The production reader is hard-coded to UTF-8; writers must match. This bites on Windows only — Linux/macOS CI defaults to UTF-8 — so it's invisible until someone runs the suite locally on Windows.
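The failure can be reproduced on any OS by forcing cp1252 explicitly, which simulates the Windows default. File names here are illustrative:

```python
import tempfile
from pathlib import Path

text = "# Heading \u2014 with an em-dash\n"  # \u2014 is the em-dash

with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "bad.md"
    good = Path(tmp) / "good.md"

    bad.write_text(text, encoding="cp1252")  # what Windows does implicitly
    good.write_text(text, encoding="utf-8")  # what the reader expects

    wrote_0x97 = b"\x97" in bad.read_bytes()

    try:
        bad.read_text(encoding="utf-8")
        bad_is_readable = True
    except UnicodeDecodeError:
        bad_is_readable = False

    good_round_trips = good.read_text(encoding="utf-8") == text
```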
Tags: #testing #windows #encoding
[2026-05-21] Three timeouts in a chain aren't a deadline
Context: Designing HIVE-115 latency-tail re-architecture. Hive's tool_span wraps async tool handlers with asyncio.timeout(60). Inside, Lock.acquire(timeout=30) enforces lock-wait. Inside that, subprocess.run(timeout=30) enforces git subprocess wall-time. Three nested timeouts, each correct at its layer. Live evidence in issue #111: capture_lesson elapsed 838s while ctx.tool_timeout was 60s — 14× the documented contract. Three failure modes observed in production: invisible hang, client interprets silence as user-rejection, "Server busy" canned string returned while operation still running. See the lesson "SQLite WAL doesn't auto-checkpoint when N processes hold readers" (below) for the SQLite half of the same systemic issue.
Problem: asyncio.timeout only cancels at await points. Once execution enters asyncio.to_thread(...), asyncio cancels the future but cannot interrupt the thread itself — Python has no portable way to inject an exception into a running thread. Inside that thread, Lock.acquire(timeout=30) and subprocess.run(timeout=30) enforce only their own deadlines. The composition is unsafe: 30s lock wait + 30s subprocess wait + repeated retries can chain into 60+ seconds outside the 60s asyncio envelope. None of the layers act as a true deadline over the composed chain. The deadline is advisory, not enforced. Logs show "ok elapsed_ms=838360" with tool_timeout=60 — the call returned eventually, but 14× over contract.
Solution: A real deadline requires ONE supervisor with termination authority over all sub-operations. Pattern: introduce bounded_call(fn, deadline_s) helper that holds a context-local registry of subprocess.Popen handles + ThreadPoolExecutor futures. On deadline expiry: cancel the future (best-effort), then Popen.terminate() on each registered child (SIGTERM with 2s grace → SIGKILL on POSIX; TerminateProcess on Windows with CREATE_NEW_PROCESS_GROUP so child trees go down), surface a real mcp.protocol.TimeoutError to the client. Migration cost: subprocess.run → Popen in all 5 git callsites (_helpers._git_commit, etc.). Tracked as ADR-008 hard-deadline-enforcement, lands in Phase B of HIVE-115. Generalization: ANY tool or API with a documented timeout must enforce it at one layer with kill authority. Per-step timeouts that compose do not compose into a global deadline. This refines the four-layer model in the maintainer's cross-project async-threading pattern §1 — defense-in-depth is correct, but ONE layer must own preemption.
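A minimal sketch of the supervisor idea, not hive's bounded_call. The names _CHILDREN and run_child are hypothetical, asyncio.wait_for is used so the sketch also runs on Python 3.9, and Windows process-group handling is omitted:

```python
import asyncio
import contextvars
import subprocess
import sys
import time

# Hypothetical per-call registry; None means no supervisor is active.
_CHILDREN: contextvars.ContextVar = contextvars.ContextVar("children", default=None)


def run_child(args):
    """Sync helper (runs in a worker thread): spawn a child and register it."""
    proc = subprocess.Popen(args)
    registry = _CHILDREN.get()
    if registry is not None:
        registry.append(proc)
    try:
        return proc.wait()  # no inner timeout: the supervisor owns the clock
    finally:
        if registry is not None and proc in registry:
            registry.remove(proc)


async def bounded_call(fn, deadline_s, *args, grace_s=2.0):
    """Run fn in a thread; on expiry terminate every registered child."""
    registry = []
    token = _CHILDREN.set(registry)
    try:
        return await asyncio.wait_for(asyncio.to_thread(fn, *args), deadline_s)
    except asyncio.TimeoutError:
        for proc in list(registry):
            proc.terminate()
            try:
                await asyncio.to_thread(proc.wait, grace_s)
            except subprocess.TimeoutExpired:
                proc.kill()  # grace period elapsed
        raise TimeoutError(f"deadline of {deadline_s}s exceeded")
    finally:
        _CHILDREN.reset(token)


async def demo():
    started = time.monotonic()
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    try:
        await bounded_call(run_child, 0.5, sleeper)
        outcome = "completed"
    except TimeoutError:
        outcome = "deadline"
    return outcome, time.monotonic() - started


outcome, elapsed_s = asyncio.run(demo())
```

The 30-second child is cut off shortly after the 0.5s deadline because one layer holds kill authority; the worker thread unblocks only because its child died, not because the thread was cancelled.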
Tags: #python #asyncio #deadline #timeout #subprocess #concurrency #HIVE-115 #ADR-008
[2026-05-21] SQLite WAL doesn't auto-checkpoint when N processes hold readers (baseline N=3-5)
Context: Investigating multi-process WAL bloat for HIVE-115. Local snapshot 2026-05-21: 3 concurrent hive-vault processes alive (PIDs 475646, 529429, 540650; 25min, 7min, 4min old). lsof confirms each holds open file handles to all 3 SQLite DBs simultaneously (worker.db, relevance.db, lesson_reinforcement.db). Observed sizes: relevance.db-wal = 4.1 MB vs .db = 53 KB (77× ratio), lesson_reinforcement.db-wal = 157 KB vs 12 KB (13×), worker.db-wal = 91 KB vs 8.2 KB (11×), and worker.db-wal was last modified 2 months ago (March 13). Critical context: 3-5 concurrent Claude Code sessions per user is the BASELINE daily usage pattern of hive, not an edge case. Prior ADR-005 scale table dimensioned "1-3 = fine, 5 = occasional waits"; the system is operating at the knee of its own scaling boundary in normal use.
Problem: PRAGMA journal_mode=WAL + PRAGMA wal_autocheckpoint=1000 does NOT mean "WAL stays small". Any concurrent reader that has an open snapshot blocks the checkpoint operation from advancing past the frame it's reading. With process-per-MCP-client orchestration (ADR-001/ADR-005), every additional Claude Code session adds a process that opens read handles on ALL trackers even if it only uses one. Result: at N=3-5 baseline, there is virtually always a snapshot holder; the WAL never drains. Worse: the original Phase A design called for PRAGMA wal_checkpoint(TRUNCATE) on shutdown — but at N=3-5, there is rarely a "last process to shut down". The shutdown drain is virtually inert; the WAL grows unboundedly until a true idle moment that may never come. Per-open WAL replay cost grows linearly with WAL size, so each new hive process pays an increasing startup tax.
Solution: Under multi-process patterns at N>1 baseline, the WAL drain must be PERIODIC, not shutdown-driven. Concretely (Phase A of HIVE-115, ADR-009 v1): (1) start a background threading.Thread(daemon=True) in every hive process that runs PRAGMA wal_checkpoint(PASSIVE) every 30s on each tracker's connection. PASSIVE does not block readers and advances checkpoint as far as current frames allow. (2) Keep wal_checkpoint(TRUNCATE) on graceful shutdown as a fallback — useful when N IS 0 (rare but possible). (3) Surface wal_size_bytes in vault_health(include_runtime=True) so growth is observable before it becomes contention. Cross-references: the lesson "Three timeouts in a chain aren't a deadline" (above) (same root: composition of locally-correct decisions failing at the actual usage envelope). For Phase B see ADR-009 v2 (Outbox+Reconciler) which makes the reconciler thread the single periodic checkpoint owner. Long-form analysis: HIVE-115 backlog (tracked in the forge — GitHub issues / milestones).
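A minimal sketch of steps (1) and (3), assuming a short-lived connection per checkpoint rather than each tracker's own connection as the design describes. The helper names are hypothetical:

```python
import os
import sqlite3
import tempfile
import threading


def checkpoint_passive(db_path):
    """Run one PASSIVE checkpoint; returns (busy, wal_frames, checkpointed)."""
    conn = sqlite3.connect(db_path)
    try:
        return tuple(conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone())
    finally:
        conn.close()


def wal_size_bytes(db_path):
    """Size of the -wal sidecar, or 0 when it does not exist."""
    wal = db_path + "-wal"
    return os.path.getsize(wal) if os.path.exists(wal) else 0


def start_checkpoint_thread(db_paths, interval_s=30.0):
    """Start the periodic drain; set the returned Event to stop it."""
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            for path in db_paths:
                try:
                    checkpoint_passive(path)
                except sqlite3.Error:
                    pass  # best effort; the next tick retries

    threading.Thread(target=loop, daemon=True).start()
    return stop


# Demo on a throwaway database (table and file names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "relevance.db")
    writer = sqlite3.connect(db_path)
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE hits (path TEXT)")
    writer.execute("INSERT INTO hits VALUES ('a.md')")
    writer.commit()

    wal_before = wal_size_bytes(db_path)
    busy, wal_frames, checkpointed = checkpoint_passive(db_path)
    writer.close()
```

In the demo no reader holds a snapshot, so every frame is checkpointed; with a long-lived reader the checkpointed count would lag behind wal_frames, which is the signal worth exporting.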
Tags: #sqlite #wal #multiprocess #checkpoint #concurrency #HIVE-115 #ADR-009
[2026-05-21] Cooperative external committer needs explicit coordination, not best-effort
Context: Investigating .git/index.lock contention for HIVE-115. Hive's vault git policy (ADR-006) treats commits as "best-effort, never fail the write". The Obsidian vault has the obsidian-git plugin configured: autoSaveInterval=10 minutes, autoPullInterval=10, autoPullOnBoot=true, pullBeforePush=true. Neither tool knows the other exists. When obsidian-git fires its 10-minute backup tick, it holds .git/index.lock for the duration of pull + commit + push (~5-15s, the pullBeforePush=true triples the window). During that window, any hive _GIT_LOCK + _git_filelock acquire blocks. Issue #110 evidence: silent 30-second freezes per call coinciding with obsidian-git ticks, leading to _LOCK_TIMEOUT=30 abandons. Prior ADR-006 §6 already added "detect_obsidian_git()" as informational signal in vault_health, but treated it as advisory only.
Problem: When hive treats git as "best-effort, never fail the write" and obsidian-git treats git as "auto-commit every interval", the two cooperative processes COMPETE for .git/index.lock instead of coordinating. The result is silent: hive abandons with a WARNING log after 30s, the user sees a freeze, but no errors propagate. Worse, the windows are not synchronized — obsidian-git's pullBeforePush=true extends the lock window 3× over a plain commit, so the probability of coincidence is high during write-heavy hive sessions. The pre-existing "informational detection" in vault_health is insufficient: it surfaces presence, not active coordination. There is no fallback path when the external committer is paused or broken.
Solution: Promote detect_obsidian_git() from informational to first-class design concept. Phase A of HIVE-115 (ADR-010 external-committer-coexistence): (1) HIVE_LOCK_TIMEOUT_S env-tunable so users with large vaults can absorb longer external windows; (2) structured mcp.lock_contention log per acquire attempt with waited_ms field; (3) obsidian_git_present boolean surfaced in vault_health(include_runtime=True). Phase B (ADR-009 v2 outbox path): when external committer detected AND recent (probe git log -1 --since="$((autoSaveInterval*2)) minutes ago" returns a commit), DEFER hive's writes via commit=False automatically (not just opt-in); when detector reports stale/missing, FALLBACK to hive's own backoff-retry reconciler. Critical: never blindly defer indefinitely — a paused obsidian-git plugin would silently halt all vault commits. Pattern: cooperate-or-fallback, never compete-blindly. Cross-ref the lesson "SQLite WAL doesn't auto-checkpoint when N processes hold readers" (above) for the SQLite half of the same multi-writer coordination problem.
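The Phase B decision rule can be written as a pure function. This is a sketch under stated assumptions: choose_commit_mode and its return labels are hypothetical, and the age of the last external commit is assumed to come from a separate git log probe:

```python
from __future__ import annotations


def choose_commit_mode(
    obsidian_git_present: bool,
    last_external_commit_age_min: float | None,
    auto_save_interval_min: int = 10,
) -> str:
    """Cooperate with a live external committer, otherwise fall back."""
    freshness_window_min = 2 * auto_save_interval_min
    is_recent = (
        last_external_commit_age_min is not None
        and last_external_commit_age_min <= freshness_window_min
    )
    if obsidian_git_present and is_recent:
        return "defer"  # let the external committer pick up the write
    # Missing, paused, or broken committer: never defer indefinitely.
    return "fallback"


live_plugin = choose_commit_mode(True, 7.0)
paused_plugin = choose_commit_mode(True, 95.0)
never_committed = choose_commit_mode(True, None)
no_plugin = choose_commit_mode(False, 3.0)
```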
Tags: #git #obsidian-git #coordination #filelock #multiprocess #HIVE-115 #ADR-010
[2026-05-21] Telemetry IS the design, not an afterthought
Context: Investigating root causes of HIVE-115. The 838s capture_lesson event from issue #111 was invisible until manual log archaeology surfaced one INFO line buried in a per-PID log file: mcp ok ... tool=capture_lesson id=11 elapsed_ms=838360. The configured tool_timeout=60 value lived in code, but the actual elapsed time was only logged at INFO with no structured fields. WAL bloat (4.1 MB relevance.db-wal) was invisible until manual ls showed the size — no metric exposed it. Lock contention with obsidian-git was suspected only by correlating obsidian-git's autoSaveInterval=10 config with hive's 30s freezes. None of these were observable from inside the system; all required external archaeology.
Problem: A tool with a documented contract (e.g. HIVE_TOOL_TIMEOUT=60) but no structured telemetry for the actual elapsed time, the wait breakdown, or the contract violations is operating blind. When something goes wrong, debugging requires log spelunking instead of metric query. By the time someone notices a 14-minute hang in a per-PID log file, the user has already abandoned the session and lost confidence. Worse, decisions about whether to re-architect become subjective ("seems slow lately") instead of measured. The prior plan for HIVE-115 included a Phase B (Outbox+Reconciler) that depends on sizing the bounded_call grace period correctly — without distribution data for last_git_lock_wait_ms, the grace period would be guessed, not measured.
Solution: Any tool/api with a configured deadline needs structured logging of {deadline, elapsed, wait_breakdown} from day one. For HIVE-115 Phase A: emit one structured mcp.lock_contention log per _GIT_LOCK.acquire attempt with {tool, lock, waited_ms, abandoned}; surface wal_size_bytes, competing_pid_count, last_git_lock_wait_ms (rolling N=100), obsidian_git_present via vault_health(include_runtime=True). Treat instrumentation as a SHIPPING REQUIREMENT alongside the fix, not as follow-up work. Without these metrics, the gate condition for Phase B advancement ("≥10 events of waited_ms>5000, or p99 wal_size > 5 MB, or ≥1 tool_timeout_exceeded") cannot be evaluated objectively. Generalization: when a design choice introduces a contract (deadline, capacity, freshness), the same PR must introduce its observability — they are inseparable. Cross-ref: the maintainer's cross-project phased-redesign-with-telemetry-gates pattern documents the gating discipline; the lesson "Three timeouts in a chain aren't a deadline" (above) is the canonical "broken contract due to lack of enforcement and observability" pair.
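A sketch of the two pieces named above: one structured record per lock acquire, and the Phase B gate evaluated from collected samples. Function names are hypothetical, 5 MB is assumed to mean 5 * 1024 * 1024 bytes, and the percentile uses the nearest-rank method:

```python
import json
import math
from collections import deque


def lock_contention_record(tool, lock, waited_ms, abandoned):
    """Serialize one mcp.lock_contention event with structured fields."""
    return json.dumps(
        {
            "event": "mcp.lock_contention",
            "tool": tool,
            "lock": lock,
            "waited_ms": waited_ms,
            "abandoned": abandoned,
        },
        sort_keys=True,
    )


def p99(values):
    """Nearest-rank 99th percentile; 0 for an empty sample."""
    if not values:
        return 0
    ordered = sorted(values)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


def phase_b_gate(waited_ms, wal_sizes_bytes, timeouts_exceeded):
    """True when any of the three documented gate conditions holds."""
    slow_waits = sum(1 for ms in waited_ms if ms > 5000)
    return (
        slow_waits >= 10
        or p99(wal_sizes_bytes) > 5 * 1024 * 1024
        or timeouts_exceeded >= 1
    )


# Rolling window for last_git_lock_wait_ms (N=100 as in the text).
recent_waits = deque(maxlen=100)
for ms in range(150):
    recent_waits.append(ms)

record = lock_contention_record("capture_lesson", "_GIT_LOCK", 30000, True)
gate_quiet = phase_b_gate([100] * 50, [4096] * 50, 0)
gate_nine_slow = phase_b_gate([6000] * 9, [], 0)
gate_ten_slow = phase_b_gate([6000] * 10 + [100] * 90, [4096], 0)
gate_timeout = phase_b_gate([], [], 1)
```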
Tags: #observability #telemetry #design #instrumentation #HIVE-115 #phase-a
[2026-05-22] ContextVar propagates across asyncio.to_thread — use over explicit parameter passing for per-call state
Context: HIVE-115 PR-3: designing bounded_call's process_registry parameter (a list[Popen] mutated by sync code in a worker thread, iterated by async code on deadline expiry). ADR-008 §1 originally specified explicit parameter passing because "asyncio.to_thread boundary makes contextvar propagation fragile across the async/sync layer".
Problem: Threading the registry through every git helper signature (_git_commit(vault, paths, message, registry=...), _git_commit_all(vault, message, registry=...), plus every wrapper that calls them) would be invasive across 8 callsites and break every existing test that passes positional args. ADR claim went unverified against current CPython docs.
Solution: CPython 3.9+ asyncio.to_thread uses contextvars.copy_context() internally — the same ContextVar-bound list reference is visible from both async land and the worker thread, and mutations land in the same object (not a copy). Adopted _GIT_REGISTRY_CV: ContextVar[list[Popen] | None] with default None. tool_span (async wrapper) sets the CV at entry, _run_git (sync, runs inside asyncio.to_thread) reads _GIT_REGISTRY_CV.get() and appends/removes. Zero signature changes at callsites. Verified by tests/test_bounded_call.py::test_subprocess_terminated killing a registered Popen from async land. Lesson: re-evaluate ADR claims about Python concurrency against current docs before designing workarounds; contextvar propagation across to_thread is documented and robust.
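The propagation is easy to confirm in isolation. In this sketch _REGISTRY_CV and the worker are stand-ins for _GIT_REGISTRY_CV and _run_git, and plain strings stand in for Popen handles:

```python
import asyncio
import contextvars

_REGISTRY_CV: contextvars.ContextVar = contextvars.ContextVar(
    "registry", default=None
)


def sync_worker(item):
    """Runs inside asyncio.to_thread; sees the caller's context copy."""
    registry = _REGISTRY_CV.get()
    if registry is not None:
        registry.append(item)  # mutates the shared list object
    return registry is not None


async def tool_call():
    registry = []
    token = _REGISTRY_CV.set(registry)
    try:
        saw_registry = await asyncio.to_thread(sync_worker, "child-1")
    finally:
        _REGISTRY_CV.reset(token)
    return saw_registry, registry


saw_registry, registry_after = asyncio.run(tool_call())
default_outside_call = _REGISTRY_CV.get()
```

One caveat that follows from the copy semantics: the worker shares the list object, but a ContextVar.set() performed inside the worker thread would not be visible to the async caller.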
Tags: #python #concurrency #async #contextvars #design
[2026-05-22] GitHub Closes #N keyword only parsed when on its own line at PR body footer
Context: HIVE-115 PR-3 (#119) body included Closes #111 inside a Summary section bullet list ("Closes #111 — the 838s..."). PR-4 (#121) body had the same Closes #110 styling.
Problem: After PR-3 merged, issue #111 stayed OPEN — had to close manually with a gh issue close + reference comment. GraphQL closingIssuesReferences returned empty. The auto-close keyword detection failed silently.
Solution: For PR-4 (#121), placed Closes #110 on its own line at the body footer. GraphQL verified closingIssuesReferences: [{number: 110}]. On merge, #110 auto-closed correctly. Rule: GitHub's close-keyword scanner is fragile within rich markdown (bold, inline list items, surrounding text); always put Closes #N / Fixes #N / Resolves #N on a bare line at the bottom of the PR body. Verify with gh api graphql -f query='{... pullRequest(number:N) { closingIssuesReferences { nodes { number } } } }' before relying on auto-close.
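A local pre-flight check for the rule can be sketched as follows. This is a hypothetical lint that enforces the bare-footer-line convention adopted here; it does not reproduce GitHub's own parser:

```python
from __future__ import annotations

import re

# The nine closing keywords GitHub documents, alone on a line with "#N".
_CLOSE_LINE = re.compile(
    r"^(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?) #(\d+)$", re.IGNORECASE
)


def footer_closing_refs(body: str) -> list[int]:
    """Return issue numbers from bare closing lines at the end of the body."""
    refs = []
    for line in reversed(body.strip().splitlines()):
        stripped = line.strip()
        if not stripped:
            continue  # tolerate blank lines inside the footer
        match = _CLOSE_LINE.match(stripped)
        if not match:
            break  # first non-footer line ends the scan
        refs.append(int(match.group(1)))
    return refs[::-1]


inline_body = "## Summary\n- Closes #111: the 838s hang\n- adds bounded_call\n"
footer_body = "## Summary\n- fixes lock contention\n\nCloses #110\n"

inline_refs = footer_closing_refs(inline_body)
footer_refs = footer_closing_refs(footer_body)
```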
Tags: #github #workflow #pull-request #automation
[2026-05-28] You cannot cancel a Python thread you started
Tags: #concurrency #deadlines #cooperation-pattern #hive-116
Context. HIVE-115 PR-3 introduced bounded_call/tool_span to enforce wall-clock deadlines on tool calls. The supervisor terminates registered Popen subprocesses on expiry — that part works. But two weeks of empirical use (issue #141) showed that the worker thread doing the sync _git_commit is NOT cancelled when the deadline fires. The client sees TimeoutError, but the thread keeps running. In one Windows case it ran 246 seconds after the supposedly-60s deadline. While the thread runs, it holds _GIT_LOCK (threading) + the singleton _git_filelock (filelock) and blocks every sibling.
Problem. asyncio.timeout is purely cooperative — it cancels the awaiting coroutine at the next await, but asyncio.to_thread's worker thread has no awaitable inside the body. CPython has no Thread.cancel(). PyThreadState_SetAsyncExc exists but is documented as not reliable for cancelling code inside C-implemented blocking calls (which is exactly where stuck threads live). So a thread inside subprocess.communicate() stays there until the subprocess flushes its stdio — and on Windows, communicate() on a SIGKILLed child can block far longer than it does on POSIX. Result: deadline supervisor + runaway thread + cached singleton lock = "fast client response, blocked siblings, orphan lock file." Foundation lesson is "Three timeouts in a chain aren't a deadline" (above); this lesson is the corollary that motivates the cooperation primitive.
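The behavior can be checked directly. A minimal sketch, using asyncio.wait_for rather than asyncio.timeout so it also runs on Python 3.9 and 3.10; the 0.3s sleep stands in for the blocking git call:

```python
import asyncio
import threading
import time

worker_finished = threading.Event()


def blocking_work():
    """Sync body with no await inside; nothing can interrupt it."""
    time.sleep(0.3)
    worker_finished.set()


async def call_with_deadline():
    started = time.monotonic()
    try:
        await asyncio.wait_for(asyncio.to_thread(blocking_work), timeout=0.05)
        return "ok", time.monotonic() - started
    except asyncio.TimeoutError:
        # The awaiting coroutine was cancelled; the thread was not.
        return "timeout", time.monotonic() - started


outcome, waited_s = asyncio.run(call_with_deadline())

# The caller saw a timeout, yet the worker still runs to completion.
thread_ran_to_completion = worker_finished.wait(timeout=2.0)
```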
Solution. Stop trying to cancel the thread. Instead, evict the cached lock object from the singleton cache so the next acquire constructs a fresh one. The runaway thread eventually releases (FD closes when the thread exits or when GC reaps the FileLock); meanwhile the new acquires bypass it. On POSIX the kernel tracks fcntl.flock per-fd, so the new FileLock can acquire the moment the runaway holder's fd closes. On Windows the orphan lock file persists in .git/ until the parent process exits (filelock library invariant), but no longer blocks new acquires.
The supervisor inserts the eviction between SIGKILL and the TimeoutError raise:
# In bounded_call / tool_span TimeoutError branch:
killed = await _terminate_registry(registry, grace_s)
if vault is not None and killed:
    _cleanup_index_lock(vault, killed)
    await asyncio.sleep(HIVE_POST_KILL_DRAIN_S)  # 5s default
    evict_filelock(vault)  # pop from cache
    _record_lock_eviction(vault, killed)  # telemetry
raise TimeoutError(...)
The 5s drain is intentional: it gives the worker thread a chance to exit its with lock: block naturally on the happy path (Linux subprocess.communicate returns within ~100ms of SIGKILL). Eviction is the safety net for the worst case where the thread is stuck. Drain calibration over {1, 5, 10} seconds converged on 5s as the smallest value that never raced eviction in 20-run validation.
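The eviction step can be sketched in isolation. Assumptions: threading.Lock stands in for filelock.FileLock, and get_lock, evict_lock, and _LOCK_CACHE are hypothetical names rather than hive's helpers:

```python
import threading

_LOCK_CACHE = {}
_CACHE_GUARD = threading.Lock()


def get_lock(vault):
    """Return the cached lock for a vault, constructing it on first use."""
    with _CACHE_GUARD:
        lock = _LOCK_CACHE.get(vault)
        if lock is None:
            lock = _LOCK_CACHE[vault] = threading.Lock()
        return lock


def evict_lock(vault):
    """Drop the cached lock so the next get_lock builds a fresh one."""
    with _CACHE_GUARD:
        return _LOCK_CACHE.pop(vault, None) is not None


# Simulate a runaway holder that never releases.
stuck = get_lock("/vault")
stuck.acquire()

blocked_before = not get_lock("/vault").acquire(blocking=False)
evicted = evict_lock("/vault")
fresh = get_lock("/vault")
acquired_after = fresh.acquire(blocking=False)
fresh.release()
```

With an in-process stand-in the old holder still owns its lock object, so this only demonstrates the cache mechanics. The design above relies on the child process already being killed, which is what makes the runaway holder no longer authoritative.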
Codified in adr/adr-012-cooperative-filelock-eviction-on-deadline.md (decision) + the maintainer's cross-project multi-process-mcp-server pattern §primitive-8 (reusable form). adr/adr-008-hard-deadline-enforcement.md §5 amendment cross-references this.
Anti-pattern caught. Earlier draft of HIVE-116 considered "Option A: Popen.wait(timeout=N) inside _run_git" — an inner timeout to make the thread escape voluntarily. This violates the SSOT principle from HIVE-115 audit B1 ("bounded_call is the single source of truth for deadlines, inner timeouts race the supervisor's external termination"). Eviction is the right shape because it doesn't add a second clock; it cleans up state the runaway thread is no longer authoritative over.
Cross-platform note. On Windows, the .git/hive.lock file may persist as a 0-byte file even after eviction — Device or resource busy on rm until the parent process exits. This is the filelock library's invariant (the file is the handle's medium; closing the handle doesn't unlink the file). Documented in docs/troubleshooting.md as a cosmetic artifact, not a functional issue.
When to escalate. If vault_health.runtime.lock_eviction.count_30d rises above ~10/month in normal use, the cooperation pattern is reaching its limit and ADR-011 (daemon model) becomes mandatory. ADR-012 buys observation time for the 2026-06-05 Phase C decision checkpoint (issue #124); it is not a permanent fix for sustained N≥10 multi-session usage.
[2026-05-27] Zero-byte WAL file means fully checkpointed, not broken
Context: Debt triage session 2026-05-27: investigating ~/.local/share/hive/worker.db WAL sidecar observed as stale during HIVE-116 investigation.
Problem: worker.db had a 0-byte WAL file with mtime from Mar 10 (~2.5 months stale). Initial impression was that the WAL checkpoint wasn't running. This caused diagnostic confusion during HIVE-116 investigation.
Solution: A 0-byte .db-wal file means the WAL was FULLY checkpointed (success signal, not failure). SQLite leaves the empty WAL file behind after checkpoint. The old mtime just means no writes happened since then — normal for budget.db when OpenRouter isn't heavily used. Added _clean_stale_wal_files() at server startup in _helpers.py to auto-remove 0-byte WALs ≥30d stale, preventing future diagnostic confusion.
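A sketch of what such a cleanup might look like. It is not the code in _helpers.py; the function name mirrors the one mentioned above, but its signature and the glob pattern are assumptions:

```python
from __future__ import annotations

import os
import tempfile
import time
from pathlib import Path


def clean_stale_wal_files(
    data_dir: Path, max_age_days: float = 30.0, now: float | None = None
) -> list[Path]:
    """Remove 0-byte *.db-wal files whose mtime is at least max_age_days old."""
    now = time.time() if now is None else now
    removed = []
    for wal in data_dir.glob("*.db-wal"):
        stat = wal.stat()
        is_empty = stat.st_size == 0
        is_old = (now - stat.st_mtime) >= max_age_days * 86400
        if is_empty and is_old:
            try:
                wal.unlink()
            except OSError:
                continue  # file in use or already gone; leave it
            removed.append(wal)
    return removed


# Demo: only the old AND empty sidecar is removed.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    old_empty = data_dir / "worker.db-wal"
    fresh_empty = data_dir / "relevance.db-wal"
    old_nonempty = data_dir / "lesson_reinforcement.db-wal"
    old_empty.touch()
    fresh_empty.touch()
    old_nonempty.write_bytes(b"frames")

    forty_days_ago = time.time() - 40 * 86400
    os.utime(old_empty, (forty_days_ago, forty_days_ago))
    os.utime(old_nonempty, (forty_days_ago, forty_days_ago))

    removed_names = [p.name for p in clean_stale_wal_files(data_dir)]
    remaining_names = sorted(p.name for p in data_dir.iterdir())
```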
Tags: #sqlite #wal #debt
[2026-05-27] Makefile DX improvements: cross-platform clean, test-one, logs target
Context: DX bundle improvements during 2026-05-27 debt triage session.
Problem: make clean used rm -rf (POSIX-only), no make test-one target existed for quick single-test runs, log path was only documented in troubleshooting docs.
Solution: Replaced rm -rf in make clean with uv run python -c (cross-platform), added make test-one ARGS=... target, added make logs target that shows path + tail -f tip. Updated .claude/CLAUDE.md with upstream _compat.py tracker + issue #127 reference.
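The portable replacement for rm -rf amounts to a few lines of standard-library Python. The directory names below are illustrative guesses, not the Makefile's actual list:

```python
import shutil
import tempfile
from pathlib import Path


def clean(root, names=("dist", ".pytest_cache", ".mypy_cache")):
    """Delete build and cache directories the same way on every OS."""
    removed = []
    for name in names:
        target = Path(root) / name
        if target.exists():
            shutil.rmtree(target, ignore_errors=True)
            removed.append(name)
    return removed


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dist").mkdir()
    (Path(tmp) / "dist" / "wheel.txt").write_text("x", encoding="utf-8")
    removed = clean(tmp)
    dist_still_exists = (Path(tmp) / "dist").exists()
```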
Tags: #dx #makefile #cross-platform
[2026-06-04] A CLI catch-all else that launches a server swallows --version/--help
Context: First Windows validation of the auto-update rollout (hive#176). Reflexively ran hive --version to read the installed version; instead it printed the FastMCP banner and blocked on the stdio MCP server — the command never returned. The version had to be read via uv tool list (which the rollout scripts already do, deliberately avoiding a hive probe).
Problem: main() dispatched with if argv[0] == "serve" ... else: create_server().run(). The else was a catch-all: ANY argv that wasn't serve/service/client — including --version, --help, and typos — fell through to launching the blocking stdio server. The load-bearing invariant is "bare hive (zero args) → server" (the v1 MCP per-session contract), but the code gated on "unrecognized argv → server", which is a strictly larger set. The two were conflated.
Solution: Gate the server launch on EMPTY argv, not unrecognized argv. Route explicit tokens: -V/--version → print version (exit 0), -h/--help → usage (exit 0), unknown token → usage error on stderr (exit 2). Bare invocation and serve/client/service routing unchanged, so no real consumer breaks (grep-verified across hive + dotfiles: every caller uses bare hive-vault / hive client / hive serve / hive service). A footgun on input nobody relied on → shipped as fix: (PR #203). Generalization: a default-action else in a CLI dispatcher is a footgun whenever the default has side effects (here: booting a daemon). Distinguish "no command given" (run the default) from "unknown command given" (exit-2 usage error).
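The corrected dispatch shape, as a minimal sketch. The dispatch signature, the USAGE text, and the handler table are hypothetical; only the routing rules come from the lesson:

```python
import io
import sys

USAGE = "usage: hive [serve|service|client] [-V|--version] [-h|--help]"


def dispatch(argv, run_server, subcommands, version, out=None, err=None):
    """Route argv; only EMPTY argv may launch the blocking server."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    if not argv:
        run_server()  # bare invocation: the per-session contract
        return 0
    token, rest = argv[0], argv[1:]
    if token in ("-V", "--version"):
        print(version, file=out)
        return 0
    if token in ("-h", "--help"):
        print(USAGE, file=out)
        return 0
    if token in subcommands:
        return subcommands[token](rest)
    # Unknown token: usage error, never a side effect.
    print(f"hive: unknown command {token!r}", file=err)
    print(USAGE, file=err)
    return 2


# Demo with a fake server so nothing blocks.
launches = []
out, err = io.StringIO(), io.StringIO()
handlers = {"serve": lambda rest: 0}


def fake_server():
    launches.append("server")


codes = {
    "bare": dispatch([], fake_server, handlers, "1.0.0", out, err),
    "version": dispatch(["--version"], fake_server, handlers, "1.0.0", out, err),
    "typo": dispatch(["sevre"], fake_server, handlers, "1.0.0", out, err),
}
```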
Tags: #cli #footgun #argv #version #PR-203
[2026-06-04] The daemon's restart-on-upgrade is POSIX-only — Windows breaks it three ways
Context: First real Windows validation of the Phase C daemon rollout (hive#176). ADR-011 §4 specified supervised restart-on-upgrade cross-OS, but only the Linux/systemd path was ever exercised — the Windows path was unspiked (the ADR-011 [MUST RESOLVE] covered transport/token, not supervision/upgrade).
Problem: Three independent breakages, all rooted in Windows OS semantics that differ from systemd: (1) Task Scheduler <RestartOnFailure> does NOT restart on the daemon's exit 75 — drift was detected and the process exited 75 (LastTaskResult=75 confirmed), but 6 min later the task was Ready, no process. RestartOnFailure reacts to the task engine failing to launch, not an application's non-zero exit code; it is not a 1:1 map of systemd Restart=on-failure. (2) uv tool upgrade cannot replace the in-use hive.exe (os error 32) because the daemon — and every hive client session — always holds it; POSIX swaps an in-use binary by inode, Windows refuses. (3) A console-app Task action under an interactive-token LogonTrigger shows a console window every logon — no parity with the silent systemd --user unit.
Solution (ADR-015): Keep ADR-011's shared daemon contract; make the Windows mechanism diverge. (B) An in-task PowerShell wrapper-loop relaunches hive serve while it exits non-zero, stops on exit 0 — the systemd semantics Task Scheduler lacks. (C) An S4U <Principal> runs the task in session 0 (no window, non-elevated, no stored password). (A) An orchestrated stop-before-upgrade (PowerShell, holds no lock): only-if-newer -> defer-if-locked -> stop daemon -> uv tool upgrade -> start. Shipped in hive#207 (B+C) + dotfiles#229 (A); all three validated on real hardware.
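The semantics of the wrapper-loop in (B) fit in a few lines. The shipped mechanism is PowerShell; this Python version only illustrates the restart rule, and supervise is a hypothetical name:

```python
def supervise(run_once):
    """Relaunch while the process exits non-zero; stop on exit 0."""
    exit_codes = []
    while True:
        code = run_once()
        exit_codes.append(code)
        if code == 0:
            return exit_codes  # clean stop: do not relaunch


# Demo with scripted exit codes: two drift restarts (75), then a clean stop.
scripted = iter([75, 75, 0])
history = supervise(lambda: next(scripted))
```

In real use run_once would launch the daemon and return its exit code, for example via subprocess.call.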
Lesson: A cross-OS service abstraction's contract generalizes; its mechanism (restart trigger, binary swap, windowless execution) does NOT. Audit a second OS empirically before generalizing supervision; this is the rule-of-three failure mode. ADR-011 generalized from one OS and the gaps surfaced only at the first real Windows rollout.
Tags: #windows #daemon #cross-os #task-scheduler #s4u #ADR-015 #hive-176
[2026-06-04] uv tool upgrade footguns on Windows: in-use lock, exact pins, orphaned children
Context: Building + validating the Windows-safe upgrade orchestration (ADR-015 / dotfiles#229).
Problem: Three traps hit while making uv tool upgrade hive-vault work around a live daemon: (1) uv tool install --reinstall is destructive on Windows — it removes the venv Scripts dir, which fails (os error 5) when a hive process holds it, leaving the tool env corrupted (uv tool list then could not find the package). Plain uv tool upgrade only fails the entrypoint copy (cosmetic — the launcher is a version-agnostic trampoline) while still updating pure-python site-packages. (2) uv tool upgrade respects the install-time version constraint — uv tool install hive-vault==1.32.4 pins it, after which uv tool upgrade reports "Nothing to upgrade". The production install must be unpinned (uv tool install --upgrade hive-vault, which the rollout's mcp prerequisite already uses). (3) Stop-Process -Name hive orphans the python child — .local\bin\hive.exe is a trampoline that spawns the real server as python.exe under the uv-tools dir; killing the trampoline leaves the child alive, still holding the install, so the next upgrade defers (a leftover daemon child is indistinguishable from a live client session).
Solution: The orchestration uses plain uv tool upgrade (never --reinstall), acts only when a newer version is published, and defers if any non-daemon process holds the install (conservative — never a partial upgrade). Process cleanups must kill the python.exe children under the uv-tools path, not just hive.
Lesson: On Windows, "upgrade a running tool" is neither atomic nor pin-agnostic. Stop the holder, keep the install unpinned, and treat any in-use process as a hard blocker (defer, don't force). The residual OS limitation is tracked upstream in uv (#8528, #11930, #11134), not reimplemented in hive.
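The decision order described above (only-if-newer, then defer-if-locked, then upgrade) can be sketched as a pure function. The real orchestration is PowerShell; the `Action`, `Holder`, `parse_version`, and `plan_upgrade` names are hypothetical, and the version parser is deliberately naive (dotted integers only).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SKIP = "skip"        # nothing newer is published
    DEFER = "defer"      # a non-daemon process holds the install
    UPGRADE = "upgrade"  # stop daemon, run `uv tool upgrade`, start daemon


@dataclass(frozen=True)
class Holder:
    """A process holding files in the tool install (hypothetical model)."""
    pid: int
    is_daemon: bool


def parse_version(text: str) -> tuple[int, ...]:
    """Naive dotted-integer parser; enough for versions like 1.32.4."""
    return tuple(int(part) for part in text.split("."))


def plan_upgrade(installed: str, published: str, holders: list[Holder]) -> Action:
    # 1. Act only when a newer version is published.
    if parse_version(published) <= parse_version(installed):
        return Action.SKIP
    # 2. Any non-daemon holder is a hard blocker: defer, never force.
    if any(not holder.is_daemon for holder in holders):
        return Action.DEFER
    # 3. Only the daemon (or nothing) holds the install: safe to stop and upgrade.
    return Action.UPGRADE


if __name__ == "__main__":
    print(plan_upgrade("1.32.4", "1.33.0", [Holder(pid=100, is_daemon=True)]))
```

Note that an orphaned `python.exe` child would have to be modeled as a non-daemon holder here, which is why it causes a deferral rather than a forced upgrade.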
Tags: #windows #uv #upgrade #footgun #ADR-015
[2026-06-05] httpx base_url join silently drops the base path on absolute request paths
Context: HIVE-211 PR1 — generalizing OpenRouterClient into a provider-parameterized OpenAICompatibleClient (Ollama / NaN / OpenRouter; chat + embeddings). The original client set httpx base_url="https://openrouter.ai" (no path) and POSTed to the absolute path /api/v1/chat/completions.
Problem: httpx merges base_url + request URL with RFC-3986 join semantics (httpx.URL(base).join(req)), NOT string concatenation. An absolute request path (leading /) replaces any path of base_url: base="https://host/api/v1" + .post("/chat/completions") → https://host/chat/completions — the /api/v1 is silently dropped. The old code only worked by accident because its base_url had no path. Naively reusing that shape with base_url="https://api.nan.builders/v1" would have hit /chat/completions (404) instead of /v1/chat/completions — and mocked unit tests would NOT catch it (they patch _http.post and never exercise URL joining).
Solution: Don't rely on httpx base_url join at all. Store self._base_url = base_url.rstrip("/") (the full prefix incl. version: NaN …/v1, OpenRouter …/api/v1, Ollama …/v1) and build full URLs explicitly: self._http.post(f"{self._base_url}/chat/completions", …) — the same convention the OpenAI SDK uses for base_url. Added test_generate_posts_to_full_chat_completions_url asserting the exact composed URL so the join behavior is pinned, not assumed.
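To make the join rule concrete, the sketch below uses the standard library's `urllib.parse.urljoin`, which implements RFC 3986 reference resolution. It is a stand-in for the semantics described above and does not exercise httpx itself. The `explicit_url` helper is a hypothetical name for the explicit-prefix convention, and `host` is a placeholder.

```python
from urllib.parse import urljoin


def explicit_url(base_url: str, endpoint: str) -> str:
    """Build the full URL by concatenation, so the base path is never dropped."""
    return f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"


if __name__ == "__main__":
    # RFC 3986 join: an absolute request path replaces the base path entirely.
    print(urljoin("https://host/api/v1", "/chat/completions"))
    # A base URL with no path is unaffected, which is why the old shape worked.
    print(urljoin("https://openrouter.ai", "/api/v1/chat/completions"))
    # Explicit composition keeps the version prefix regardless of slashes.
    print(explicit_url("https://api.nan.builders/v1", "/chat/completions"))
```

A test that asserts the exact composed URL, as the entry describes, is what catches a regression here; a mock that patches the POST call never sees the final URL.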
Tags: #httpx #footgun #base-url #openai-compatible #HIVE-211
[2026-05-31] Cross-OS spikes catch bugs Linux-only testing misses
Context: HIVE-118 Phase C daemon de-risking: built 5 runnable spikes (transport, load, idempotency, resilience, robustness) for the loopback-HTTP + bearer-token daemon and ran them on Linux AND Windows before writing the real hive serve.
Problem: Four real bugs surfaced only on Windows or were latent on Linux: (1) WinError 32 — deleting the daemon log while the child still held the handle (POSIX allows unlinking an open file; Windows does not); (2) an icacls owner-only ACL check false-matched the Users group inside the path C:\Users\Manu\...; (3) an unfaithful load model — a sync time.sleep tool blocked the event loop and a per-call SQLite connect tanked Windows throughput; (4) asyncio.CancelledError subclasses BaseException, not Exception, so suppress(Exception) let a cancelled call escape teardown. Separately, a blanket uv lock --upgrade pulled starlette 0.52->1.x (a MAJOR, sitting directly under the daemon's HTTP transport).
Solution: De-risk the platform-sensitive layer (transport, file handles, process kill) cross-OS BEFORE building on it — the spike converts "will it work on Windows?" from a project risk into a CI check. Concrete fixes: print the result BEFORE best-effort temp cleanup and kill+wait the child before unlinking (Windows handle lock); abstract owner-only behind explicit POSIX-mode vs Windows-icacls branches and strip the path before scanning ACL principals; model hive's real pattern (async tools + asyncio.to_thread offload + single owning connection) instead of a loop-blocking sync sleep; catch CancelledError explicitly. For deps: scope a relock to the dependency you have a concrete reason + audit for (mcp, capped <2.0 because _compat.py patches private internals) and leave transitive majors (starlette, cryptography) to Dependabot's individually-reviewable PRs — never bury a risky major in a 90-package blanket relock.
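Bug (4) is easy to reproduce in isolation. The sketch below is a minimal reproduction, not the spike's code: the `teardown` function is a hypothetical name, and the sleeping task stands in for an in-flight call.

```python
import asyncio
import contextlib


async def teardown(*suppressed: type[BaseException]) -> str:
    """Cancel an in-flight task and report whether the cancellation leaks."""
    task = asyncio.create_task(asyncio.sleep(60))
    await asyncio.sleep(0)  # yield once so the task actually starts
    task.cancel()
    try:
        with contextlib.suppress(*suppressed):
            await task  # raises asyncio.CancelledError in the awaiter
    except asyncio.CancelledError:
        return "escaped"
    return "contained"


if __name__ == "__main__":
    # CancelledError derives from BaseException (Python 3.8+), not Exception.
    print(issubclass(asyncio.CancelledError, Exception))
    print(asyncio.run(teardown(Exception)))
    print(asyncio.run(teardown(Exception, asyncio.CancelledError)))
```

Naming `asyncio.CancelledError` explicitly keeps the teardown narrow; suppressing `BaseException` would also swallow `KeyboardInterrupt` and `SystemExit`.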
Tags: #testing #cross-platform #windows #spikes #dependencies #mcp #asyncio
[2026-05-31] uvicorn SIGTERM bypasses finally/atexit (rc -15)
Context: HIVE-118 slice 2 hardening: tried to make the hive serve daemon clean up its published token/port state files on graceful shutdown by wrapping server.run() (FastMCP → await uvicorn.Server.serve()) in a try/finally and also registering an atexit handler.
Problem: Neither the finally block nor the atexit handler ran on SIGTERM, even though uvicorn logged a clean "Application shutdown complete". A direct probe showed the daemon's Popen.returncode was -15: the process exits via the SIGTERM signal itself after uvicorn's asyncio signal handler runs its graceful ASGI shutdown — control never unwinds back through our frame, and atexit is skipped on signal death. SIGTERM is exactly how systemd stop, Popen.terminate(), and kill stop the process, so the cleanup was effectively dead code.
Solution: Do not rely on finally/atexit for cleanup in a uvicorn-hosted daemon stopped by SIGTERM. Either (a) treat leftover state as benign and make the readers robust (here: the client TCP-probes the port and falls back; a daemon restart overwrites the state files), or (b) if cleanup on stop is truly required, install your own SIGTERM handler — but that means cooperating with / replacing uvicorn's handler, which risks breaking its graceful shutdown. We chose (a): document it as by-design and drop the finally/atexit attempt.
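The probe can be approximated without uvicorn. The sketch below is a POSIX-only stand-in modeled on the behavior described above: the child handles SIGTERM, finishes its "graceful shutdown", then restores the default disposition and re-raises the signal. That re-raise step is an assumption about the mechanism, and `CHILD` and `run_probe` are hypothetical names.

```python
import signal
import subprocess
import sys
import tempfile
from pathlib import Path

# Child process: registers atexit + try/finally cleanup, then dies by SIGTERM.
CHILD = r"""
import atexit, pathlib, signal, sys, time

marker = pathlib.Path(sys.argv[1])
atexit.register(lambda: marker.write_text("atexit"))

def serve():
    got = []
    signal.signal(signal.SIGTERM, lambda signum, frame: got.append(signum))
    print("ready", flush=True)
    while not got:               # stand-in for the server's main loop
        time.sleep(0.01)
    # "Graceful shutdown" is done; now exit via the signal itself.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.raise_signal(signal.SIGTERM)

try:
    serve()
finally:
    marker.write_text("finally")
"""


def run_probe() -> tuple[int, bool]:
    """Return (child return code, whether any cleanup wrote the marker)."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "cleaned"
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, str(marker)],
            stdout=subprocess.PIPE,
            text=True,
        )
        assert proc.stdout is not None
        proc.stdout.readline()   # wait until the handler is installed
        proc.terminate()         # sends SIGTERM on POSIX
        rc = proc.wait(timeout=10)
        proc.stdout.close()
        return rc, marker.exists()


if __name__ == "__main__":
    print(run_probe())
```

A negative return code means the child was killed by that signal number, so neither the `finally` block nor the `atexit` hook had a chance to write the marker.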
Tags: #python #uvicorn #signals #daemon #fastmcp
[2026-05-31] Load test confirms planned defenses (idempotency, init_timeout) hit real cases
Context: HIVE-118 Phase C: ran a real-daemon load/bug-hunt harness against hive serve — N concurrent sessions doing mixed vault_query reads + vault_write (auto-commit) writes + /status polls, up to 64 sessions x 40 calls, plus a proxy-FD probe and a mid-write SIGKILL durability probe.
Problem: Wanted to find bugs in the new single-owner daemon design under load. The core stress found NONE: zero lost writes, git fsck clean with exact commit counts, metrics accurate under 64-way concurrency, no fd leak (direct AND proxy paths — the proxy tears down its per-request backend session cleanly), latency backpressure-bounded. But two probes surfaced behaviors that map to ALREADY-PLANNED defenses, confirming they target real problems: (1) a raw fastmcp Client with no init_timeout hangs when its server dies mid-session; (2) a mid-write SIGKILL leaves git uncorrupted but can produce a committed-but-unacked write (commits=22 vs client-acked=21), so a naive retry would duplicate-append.
Solution: Treat these as validation, not new bugs. (1) confirms why the client shim sets init_timeout (H1 hardening) — without it the shim would hang on a wedged/dead daemon; it also argues for revisiting a per-request timeout for the wedge-after-initialize case. (2) confirms why ADR-011 §6.2 reserves a per-write idempotency key: under crash, vault_write is at-least-once (commit can outlive its ack), so the key is needed to make a retry a no-op (safe for append mode). A good load suite verifies that planned defenses attack real failure modes — both were demonstrated with real traffic. Crash-safety of git itself (fsck clean, no stale index.lock) holds under SIGKILL.
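The committed-but-unacked case from probe (2) can be shown with a toy model. This is an illustrative simulation, not the load harness: `ToyVault`, `AckLost`, and `naive_retry` are hypothetical names, and an in-memory list stands in for git commits.

```python
class AckLost(Exception):
    """The write was committed, but the daemon died before acknowledging it."""


class ToyVault:
    def __init__(self) -> None:
        self.commits: list[str] = []
        self.die_before_next_ack = False

    def append(self, line: str) -> str:
        self.commits.append(line)        # the commit happens first
        if self.die_before_next_ack:
            self.die_before_next_ack = False
            raise AckLost(line)          # the ack never reaches the client
        return "ack"


def naive_retry(vault: ToyVault, line: str, attempts: int = 2) -> str:
    """Retry on a lost ack with no idempotency key (the unsafe pattern)."""
    for _ in range(attempts):
        try:
            return vault.append(line)
        except AckLost:
            continue
    raise AckLost(line)


if __name__ == "__main__":
    vault = ToyVault()
    vault.die_before_next_ack = True
    print(naive_retry(vault, "note"), vault.commits)
```

The client sees one ack while the vault holds two commits of the same line, which is the at-least-once behavior the per-write key is reserved to neutralize.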
Tags: #load-testing #hive #daemon #idempotency #crash-safety #fastmcp
[2026-05-31] Sequence the at-most-once write primitive before transparent retry/reconnect
Context: HIVE-118 Phase C daemon: deciding the order of two remaining slices — mid-session auto-reconnect in hive client, and the idempotency key (ADR-011 §6.2). The proposed order had reconnect first.
Problem: Auto-reconnect's value is transparently retrying a forwarded call that failed because the daemon died mid-flight. For a side-effecting write (vault_write/patch), a transparent retry duplicates an already-applied write whenever the original was committed-but-unacked — a case the real-daemon load harness empirically confirmed under SIGKILL. Building reconnect first forces a bad choice: ship it retrying writes (a latent-corruption window masked only by "the daemon isn't activated yet" — a fragile cross-slice coupling), or ship it retrying reads only and then revisit/expand it once idempotency lands (rework).
Solution: Build the at-most-once / idempotency primitive BEFORE the retry mechanism that relies on it. It is self-contained (key in the tool envelope + applied-key store + TTL), has zero dependency on reconnect, and was already spike-proven 3/3 cross-OS — so it can be built and tested in isolation. Doing it first lets auto-reconnect be built once, correctly, retrying writes safely, and lets the reconnect test assert the real invariant: a write whose daemon dies mid-flight and reconnects produces exactly one applied write. General rule: when adding transparent retry/failover to a system with side-effecting operations, the correctness invariant (at-most-once) must exist before the mechanism that can violate it — never ship the violator first guarded only by "currently unreachable".
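A minimal sketch of the primitive's three parts (key, applied-key store, TTL) follows. It is an in-memory illustration under stated assumptions, not hive's implementation: `AppliedKeyStore` and `apply_once` are hypothetical names, and a real store would need to persist the key together with the write to survive a daemon crash.

```python
import time
from typing import Any, Callable


class AppliedKeyStore:
    """Remember applied idempotency keys for ttl_seconds (in-memory sketch)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._applied: dict[str, tuple[float, Any]] = {}  # key -> (expires_at, result)

    def apply_once(self, key: str, operation: Callable[[], Any]) -> tuple[Any, bool]:
        """Run operation unless key was already applied.

        Returns (result, applied_now). A retry with the same key returns the
        stored result and does not run the operation again.
        """
        now = self._clock()
        # Drop expired keys so the store stays bounded.
        self._applied = {k: v for k, v in self._applied.items() if v[0] > now}
        if key in self._applied:
            return self._applied[key][1], False
        result = operation()
        self._applied[key] = (now + self._ttl, result)
        return result, True


if __name__ == "__main__":
    log: list[str] = []
    store = AppliedKeyStore(ttl_seconds=60)
    print(store.apply_once("write-1", lambda: log.append("note") or len(log)))
    print(store.apply_once("write-1", lambda: log.append("note") or len(log)))
    print(log)
```

With this in place, the reconnect slice can retry a cut write using the same key and still end up with one applied write.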
Tags: #architecture #resilience #idempotency #crash-only #sequencing #daemon
[2026-06-01] FastMCP streamable-http cancels in-flight handlers on shutdown — no true drain
Context: HIVE-118 slice 1.3 restart-on-upgrade spike (spike/upgrade_spike.py): wanted the daemon to "drain/swap" — finish in-flight tool calls before restarting into an upgraded version. Spiked owning the uvicorn Server via the PUBLIC mcp.http_app() and setting server.should_exit (not signals).
Problem: should_exit gives a CLEAN process exit (serve() returns, exit 0 — strictly better than SIGTERM's rc -15), but the MCP session manager CANCELS the active tool handler on lifespan shutdown: measured client_got_response=False AND handler_completed_serverside=False. So neither the client ack nor the server-side write completes — a true in-flight drain is NOT deliverable over the streamable-http transport. (Separately the spike proved importlib.metadata.version() DOES reflect an in-place dist-info swap mid-process with no invalidate_caches, so version-drift detection is viable stdlib-only.)
Solution: Restart-on-upgrade must be CLEAN-STOP-ONLY, not drain/swap: drift-poll via importlib.metadata → own the uvicorn Server through the public http_app() → should_exit → exit 0 → supervisor restarts into the new code. In-flight safety does NOT come from draining (the transport can't); it comes from idempotency (at-most-once key) + auto-reconnect (the client safely retries the cut call). This is exactly why idempotency (slice 2, ADR-013) was sequenced before restart-on-upgrade. General rule: don't fight a transport for a graceful drain it structurally cannot deliver — make the cut safe instead.
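The drift-poll step might look like the sketch below. It relies only on `importlib.metadata.version` and a `should_exit` attribute on the server object; the `has_drifted` and `poll_once` names, the injectable `get_version` parameter, and the choice to treat a missing distribution as "no drift yet" are assumptions for illustration.

```python
from importlib import metadata
from types import SimpleNamespace
from typing import Any, Callable


def has_drifted(
    dist_name: str,
    running_version: str,
    get_version: Callable[[str], str] = metadata.version,
) -> bool:
    """True when the installed version differs from the one this process started with."""
    try:
        installed = get_version(dist_name)
    except metadata.PackageNotFoundError:
        # Assumption: mid-upgrade the dist-info may be briefly absent; poll again later.
        return False
    return installed != running_version


def poll_once(
    server: Any,
    dist_name: str,
    running_version: str,
    get_version: Callable[[str], str] = metadata.version,
) -> bool:
    """Request a cooperative stop on drift by setting server.should_exit."""
    if has_drifted(dist_name, running_version, get_version):
        server.should_exit = True
        return True
    return False


if __name__ == "__main__":
    fake_server = SimpleNamespace(should_exit=False)  # stand-in for a uvicorn.Server
    print(poll_once(fake_server, "hive-vault", "1.32.4", lambda name: "1.33.0"))
    print(fake_server.should_exit)
```

The injected `get_version` keeps the check testable without touching the real install; in the daemon it would default to the stdlib lookup.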
Tags: #hive #daemon #fastmcp #uvicorn #restart-on-upgrade #idempotency #spikes #crash-only
[2026-06-01] Enforce a single-owner invariant at the routing layer, not by making the fallback contend for the owner's lock
Context: HIVE-118 Phase C slice 3: adding client auto-reconnect to the hive client stdio shim. The shim proxies to a single-owner hive serve daemon (which owns git + SQLite under an exclusive singleton daemon.lock flock) and falls back to an in-process server when no daemon is reachable. Auto-reconnect makes the backend decision per-call instead of one-shot at startup, which introduces a dual-owner window: the shim can fall back to a write-capable in-process owner and a daemon can (re)appear concurrently.
Problem: How to stop the in-process fallback and the daemon from both owning git/SQLite at once. Three options: (1) prefer-daemon per-call routing; (2) prefer-daemon + async teardown of the cached in-process standby when the daemon returns; (3) flock-gate the fallback so it must take the same singleton lock before owning state. Option 3 looks like the "purest" single-owner design but has a fatal inversion: the singleton flock is exclusive AND the daemon declines (exit 0) if it cannot acquire it (the supervised-restart design), so a degraded fallback holding the flock would BLOCK the canonical daemon from ever starting — the opposite of the desired priority. Option 2 adds a genuine mid-call-close race to reclaim resources (idle SQLite connections + reconciler/checkpoint threads) that are already cross-process safe.
Solution: Enforce single-ownership at the WRITE-ROUTING layer (option 1): the per-call factory prefers the canonical owner — while the daemon is reachable every call is forwarded to it, so the fallback owner performs zero writes even if it exists. Build the fallback lazily (only on first unreachability) and cache it, so the happy path never creates a second owner at all. Do NOT make the fallback contend for the owner's exclusive lock: when the lock is also the owner's liveness gate (acquire-or-decline), a fallback that grabs it starves the canonical owner. Leave residual sub-second races to the defenses already designed for multi-owner contention (here: idempotency key + .git/index.lock self-heal). General rule: when adding transparent failover to a system with a single-owner write invariant, make the canonical owner win at the routing decision; reserve exclusive locks for the owner's own liveness, and never let a degraded fallback hold a lock that gates the thing it is supposed to defer to.
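Option 1 reduces to a small per-call factory. The sketch below is a simplified model, assuming a reachability probe and a fallback builder are supplied by the caller; `PreferDaemonRouter` and its parameter names are hypothetical, and plain strings stand in for the backends.

```python
from typing import Any, Callable, Optional


class PreferDaemonRouter:
    """Per-call routing: the canonical owner wins whenever it is reachable."""

    def __init__(
        self,
        daemon: Any,
        daemon_reachable: Callable[[], bool],
        build_fallback: Callable[[], Any],
    ) -> None:
        self._daemon = daemon
        self._daemon_reachable = daemon_reachable
        self._build_fallback = build_fallback
        self._fallback: Optional[Any] = None  # built lazily, then cached

    def backend_for_call(self) -> Any:
        if self._daemon_reachable():
            return self._daemon  # the fallback, even if it exists, gets no writes
        if self._fallback is None:
            # First unreachability: only now does a second owner come into being.
            self._fallback = self._build_fallback()
        return self._fallback


if __name__ == "__main__":
    state = {"up": True, "builds": 0}

    def build() -> str:
        state["builds"] += 1
        return "in-process"

    router = PreferDaemonRouter("daemon", lambda: state["up"], build)
    print(router.backend_for_call(), state["builds"])
    state["up"] = False
    print(router.backend_for_call(), state["builds"])
    state["up"] = True
    print(router.backend_for_call(), state["builds"])
```

The router never takes the daemon's lock, so a cached fallback cannot block the daemon from starting; it is simply bypassed once the daemon is reachable again.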
Tags: #architecture #resilience #single-owner #daemon #failover #concurrency #hive
[2026-06-02] Restart-on-upgrade: the daemon's exit code must match the supervisor's restart policy
Context: Implementing HIVE-118 slice 1.3: a long-lived daemon that should adopt a newly installed package version (after uv tool upgrade) by restarting into the new code under a process supervisor.
Problem: The intuitive design is "detect drift, stop cleanly, exit 0, let the supervisor restart". But under a systemd Restart=on-failure unit, exit 0 is a SUCCESS and the supervisor does NOT restart — so a clean exit(0) silently fails to pick up the upgrade. Forcing a restart by switching the unit to Restart=always + RestartPreventExitStatus= to suppress the no-op-decline case couples the code's exit-code contract to the unit file (fragile, two places to keep in sync). Separately, owning the server matters: FastMCP's mcp.run(transport="http") can only be stopped by a signal, which exits via the signal (rc -15) and cuts the in-flight handler.
Solution: Make the exit code carry intent and let one restart policy serve all cases. Under Restart=on-failure: a drift-triggered clean stop exits NON-ZERO (75 / EX_TEMPFAIL) so the supervisor relaunches into the new code; a graceful signal stop (systemctl stop) and a singleton-decline no-op exit 0 (no restart, no loop). No RestartPreventExitStatus= coupling. For the cooperative stop itself, OWN the server: build the app from the public mcp.http_app(), create your own uvicorn.Server, and set should_exit=True from a background drift poll — uvicorn drains (bounded by timeout_graceful_shutdown) and serve() returns, vs a signal that cuts the call. A true in-flight drain is unreachable over streamable-http (the handler is cancelled on lifespan shutdown), so at-most-once idempotency + client auto-reconnect must already cover the cut call. Also: re-raise SystemExit before any catch-all CRITICAL crash log so a clean restart code is not mislabelled a crash.
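The exit-code contract and the `SystemExit` re-raise can be sketched together. The `StopReason`, `exit_code_for`, and `run_guarded` names are hypothetical, and the crash exit code of 1 is an assumption; only the 75 / 0 mapping comes from the text above.

```python
import enum
import logging
from typing import Callable

logger = logging.getLogger("hive.daemon")  # hypothetical logger name

EX_TEMPFAIL = 75  # same value as os.EX_TEMPFAIL, which is not defined on Windows


class StopReason(enum.Enum):
    DRIFT = "drift"                          # newer version installed: ask for a restart
    SIGNAL = "signal"                        # graceful stop, e.g. systemctl stop
    SINGLETON_DECLINE = "singleton-decline"  # another daemon already owns the lock


def exit_code_for(reason: StopReason) -> int:
    """Non-zero only when the supervisor should relaunch (Restart=on-failure)."""
    return EX_TEMPFAIL if reason is StopReason.DRIFT else 0


def run_guarded(serve: Callable[[], StopReason]) -> int:
    """Run the daemon body and map its outcome to a process exit code."""
    try:
        return exit_code_for(serve())
    except SystemExit:
        raise  # a deliberate exit code is not a crash; keep it out of the crash log
    except BaseException:
        logger.critical("daemon crashed", exc_info=True)
        return 1  # assumed crash code; also non-zero, so the supervisor restarts


if __name__ == "__main__":
    raise SystemExit(run_guarded(lambda: StopReason.SIGNAL))
```

The explicit `except SystemExit` clause matters only because the catch-all is `BaseException`; with a plain `except Exception` the exit would pass through on its own.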
Tags: #daemon #systemd #uvicorn #process-supervision #fastmcp #exit-codes