Changelog

May 29, 2026

All notable changes to this project will be documented in this file.

Unreleased

Added

AC-728 Python parity slice 1: four base contract probes. Mirrors ts/src/control-plane/contract-probes/index.ts at the PR #957 shape (probeDirectoryContract / probeTerminalContract / probeServiceContract / probeArtifactContract) as pure Python functions in a new autocontext.control_plane.contract_probes package. Inputs / outputs / failures are Pydantic v2 frozen models (extra="forbid", arbitrary_types_allowed=True so compiled re.Pattern[str] regex objects can ride through unchanged). Same failure-kind enums as the TS surface: directory (unexpected-file, missing-file), terminal (unexpected-exit-code, missing-stdout-pattern, forbidden-stdout-pattern, missing-stderr-pattern, forbidden-stderr-pattern), service (missing-endpoint, unexpected-endpoint, wrong-interface with port + protocol normalization so tcp is the default), artifact (missing-substring, forbidden-substring, wrong-line-ending, invalid-json, missing-json-field with dotted-path JSON field lookup and the same early-return shape on JSON parse failure). The slice-1 audit invariant (every observation field is non-optional so the silent-pass shape cannot arise) is pinned by a parametrised test_expectation_against_minimum_observation_always_fails_loudly that mirrors the TS close-out audit (PR #1000). 25 new Python tests covering the per-probe pass / fail surfaces and the missing-observation pinning property. ruff + mypy clean; module is 407 lines (under the 800-line guard).
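Illustrative sketch (not the shipped module): the dotted-path JSON field lookup and the early return on parse failure described above could look roughly like this. To stay dependency-free it uses a frozen dataclass where the real package uses Pydantic models; `ArtifactFailure`, `lookup_dotted`, and `probe_required_json_fields` are hypothetical names, and parsing the content even when no fields are required is an assumption.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ArtifactFailure:
    kind: str      # e.g. "invalid-json" or "missing-json-field"
    message: str


_MISSING = object()  # sentinel: distinguishes "absent" from a JSON null


def lookup_dotted(payload: Any, dotted_path: str) -> Any:
    """Walk 'a.b.c' through nested dicts; return _MISSING if any hop is absent."""
    current = payload
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def probe_required_json_fields(content: str, required_fields: list[str]) -> list[ArtifactFailure]:
    """Return one failure per required dotted-path field missing from the content."""
    try:
        payload = json.loads(content)
    except json.JSONDecodeError as exc:
        # Early return: field checks are meaningless once parsing has failed.
        return [ArtifactFailure("invalid-json", f"content is not valid JSON: {exc.msg}")]
    return [
        ArtifactFailure("missing-json-field", f"required JSON field {path!r} is absent")
        for path in required_fields
        if lookup_dotted(payload, path) is _MISSING
    ]
```

Because the content argument is a required string, an empty observation fails with invalid-json rather than passing silently, which is the property the audit test pins.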
AC-697 slice 8: TypeScript autoctx queue add canonical subcommand. Slice 1 (PR #981) pinned the canonical CLI contract and parked TypeScript queue at intentional_gap because the runtime only exposed the legacy autoctx queue -s <spec> ... form for queue-add; slice 2 (PR #997) added queue status. This slice closes the last remaining contract gap (other than the explicitly out-of-scope mission Python entry) by promoting cmdQueue to dispatch an explicit add subcommand: the first sub-arg is inspected and stripped before the existing parseArgs / planQueueCommand workhorse runs, so autoctx queue add -s <spec> ... and the legacy autoctx queue -s <spec> ... route through the same planner. The legacy form stays registered for backward compatibility with existing automation. QUEUE_HELP_TEXT documents the canonical queue add form first and keeps the legacy alias plus queue status discoverable from autoctx queue --help. ts/README.md "Task queue" section now lists queue add, the legacy alias, and queue status so the README matches the contract and help text. Contract: queue typescript flips from intentional_gap to yes. 5 new TS tests pin the contract flip, the help-text update (canonical queue add form plus preserved -s legacy alias), and confirm the planQueueCommand / renderQueuedTaskResult shape is unchanged across the move. All prior TS contract + capabilities tests still pass; tsc --noEmit clean. AC-697 contract gaps are closed with this slice (apart from the out-of-scope mission Python entry).
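Illustrative sketch: the shipped dispatch is TypeScript, so the Python below only models the "inspect and strip the first sub-arg" idea. `split_queue_subcommand` is a hypothetical helper, not a function in either runtime.

```python
def split_queue_subcommand(args: list) -> tuple:
    """Return (action, remaining_args) for the arguments after `autoctx queue`."""
    if args and args[0] in ("add", "status"):
        return args[0], list(args[1:])  # explicit subcommand: strip it
    return "add", list(args)            # legacy `queue -s <spec>` form means add


# Both spellings hand the same arguments to the planner.
canonical = split_queue_subcommand(["add", "-s", "spec.json"])
legacy = split_queue_subcommand(["-s", "spec.json"])
```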
AC-697 slice 7: Python autoctx show + autoctx watch commands. Slice 1 (PR #981) pinned show and watch as canonical paved-road commands and TS shipped them; Python had stub gaps that this slice closes. New show <run-id> [--best] [--generation N] [--json] composes the existing store.get_run + store.run_status read surfaces: --best filters to the single generation with the highest best_score; --generation N filters to a specific generation index; bare show renders all generations. Renders a per-generation table (or JSON payload with run_id, scenario, status, generations[]). New watch <run-id> [--interval N] [--json] polls store.run_status on a configurable interval (default 2 seconds), emits one human-readable line (or one JSONL row under --json) per transition, and breaks when the latest generation enters a terminal status (completed, failed, succeeded, errored). Both commands emit actionable errors with a non-zero exit code when the run id is not found. Contract: Python show and watch both flip from intentional_gap to yes. 7 new Python tests: subcommands registered at canonical paths; contract entries flipped on both runtimes; show missing-run actionable error; show --best reduces to top-scoring generation; show --generation N filters to specific row; watch breaks immediately on a terminal status (no sleep); watch missing-run actionable error. All 27 prior Python contract/serve/capabilities tests still pass. ruff + mypy clean.
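Illustrative sketch of the watch loop's shape, assuming a caller-supplied `fetch_status` callable in place of the real store.run_status read. `watch_transitions` and the injectable `sleep` parameter are hypothetical and exist here so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Iterator

TERMINAL_STATUSES = frozenset({"completed", "failed", "succeeded", "errored"})


def watch_transitions(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield each status once per transition; stop after the first terminal status."""
    previous = None
    while True:
        status = fetch_status()
        if status != previous:
            yield status
            previous = status
        if status in TERMINAL_STATUSES:
            return  # terminal: exit without sleeping
        sleep(interval)
```

Checking for a terminal status before sleeping is what lets a run that is already finished return immediately.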
AC-697 slice 6: autoctx serve mcp canonical path on both runtimes. Slice 1 (PR #981) pinned serve mcp as the canonical MCP-server path with mcp-serve as a registered alias on both runtimes; this slice ships the matching CLI changes. Python: serve promoted to a sub-Typer group with invoke_without_command=True so the legacy autoctx serve [--host ...] [--port ...] form continues to start the HTTP API. Three subcommands registered: serve http (explicit canonical HTTP), serve mcp (canonical MCP), and the bare serve callback (legacy HTTP form). HTTP serve body extracted to _run_http_serve(host, port) so both serve (callback) and serve http (explicit) call the same code path. MCP serve body identical to the existing mcp-serve handler (calls autocontext.mcp.server.run_server). mcp-serve top-level alias kept registered for backward compatibility with existing Claude Code MCP configurations. TypeScript: cmdServeHttp detects mcp as the first sub-arg and rewrites argv to delegate to cmdMcpServe (same delegation pattern as slice-4 cmdScenario -> cmdNewScenario). mcp-serve top-level dispatch entry kept for backward compat. Contract: both serve.mcp entries flip from intentional_gap to yes; mcp-serve stays as the contract alias. 6 new tests (3 Python: subcommands registered at canonical paths, serve typer group is invokable without subcommand, contract serve.mcp is yes on both runtimes with mcp-serve alias preserved; 3 TS: contract serve.mcp is yes, mcp-serve alias preserved, serve + mcp-serve both registered in command-registry). All 23 prior Python contract/capabilities/queue tests + 30 prior TS contract tests still pass. ruff + mypy + tsc clean.
AC-697 slice 5: autoctx capabilities is now contract-driven on both runtimes. The slice-1 contract pinned capabilities as the operator-facing surface for the canonical command set, but both runtimes shipped legacy/no-op implementations: TS emitted only visibleSupportedCommandNames() (command names with no aliases or per-runtime support), Python did not ship the command at all. This slice loads docs/cli-contract.json from each runtime and emits a structured payload with the canonical commands, aliases, and per-runtime support. TypeScript: buildCapabilitiesPayload gains a contract: { schema_version, commands: [...] } field; each command carries id, path, summary, audience, maturity, aliases, runtime_support.{python,typescript}.{status,reason?}. The legacy commands: string[] field is preserved for backward compatibility. Optional contractPath parameter on buildCapabilitiesPayload lets tests override the default repo-relative path. Python: new autocontext.cli_capabilities module with build_capabilities_payload(contract_path=None) (loads contract, returns the same JSON shape) and register_capabilities_command(app, console=...) that mounts autoctx capabilities as a typer command. --json prints the structured payload via plain stdout (no rich ANSI coloring) so JSON consumers parse the output directly. Default human-readable output renders a per-command table with python/ts support status. Contract: both capabilities entries flip from intentional_gap to yes. 5 new Python tests (payload shape, paved-road command presence, intentional_gap reasons propagated, --json end-to-end, human-readable summary). 2 new TS tests (contract field with canonical commands and aliases, runtime_support enum validity). All 17 Python parity tests + 23 TS contract tests still pass. ruff + mypy + tsc clean.
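Illustrative sketch of a contract-driven payload, assuming the contract has already been parsed into a dict with the fields named above. `build_payload` and `support_rows` are hypothetical stand-ins, not the signatures of build_capabilities_payload or the table renderer.

```python
from __future__ import annotations

from typing import Any


def build_payload(contract: dict[str, Any]) -> dict[str, Any]:
    """Project the parsed contract into a structured capabilities payload."""
    commands = [
        {
            "id": spec["id"],
            "path": spec["path"],
            "aliases": spec.get("aliases", []),
            "runtime_support": spec["runtime_support"],
        }
        for spec in contract["commands"]
    ]
    return {"contract": {"schema_version": contract["schema_version"], "commands": commands}}


def support_rows(payload: dict[str, Any]) -> list[tuple[str, str, str]]:
    """One (command, python status, typescript status) row per command."""
    return [
        (
            " ".join(cmd["path"]),
            cmd["runtime_support"]["python"]["status"],
            cmd["runtime_support"]["typescript"]["status"],
        )
        for cmd in payload["contract"]["commands"]
    ]
```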
AC-697 slice 4: TS autoctx scenario create canonical path. Mirrors slice 3 (Python typer-group refactor, PR #998) on the TypeScript side. command-registry.ts adds scenario to the DbCommandName union and registers it as a primary command. cli/index.ts adds a cmdScenario handler that detects the first sub-arg: create rewrites process.argv so the existing cmdNewScenario handler runs unchanged (DRY: scaffolding logic stays single-sourced), --help or no args print a usage banner naming the subcommand, anything else exits with an unknown-subcommand error. The legacy top-level new-scenario command stays registered as the alias the slice-1 contract pins. Contract: TS scenario.create flips from intentional_gap to yes. The TS contract parity test now does a partial multi-token check: when a contract entry claims TS support for a path.length >= 2 command, the parent token must be a registered command (catches the case where TS claims yes but didn't even mount the parent). Full multi-token subcommand verification remains future work, gated on introducing a TS subcommand registry. 3 new TS tests (scenario is registered in visibleSupportedCommandNames; TS scenario.create is yes in the contract; new-scenario is preserved as an alias); the 18 existing TS contract + status-retargeting tests still pass + the 17 Python parity tests still pass. tsc --noEmit clean.
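Illustrative sketch of the partial multi-token check, written in Python for brevity even though the shipped test is TypeScript. `unmounted_parents` is a hypothetical name, and the contract entries are assumed to be plain dicts with the path and runtime_support fields.

```python
def unmounted_parents(commands: list, registered: set, runtime: str = "typescript") -> list:
    """List multi-token commands claimed as supported whose parent token is not registered."""
    problems = []
    for spec in commands:
        path = spec["path"]
        claimed = spec["runtime_support"][runtime]["status"] == "yes"
        if claimed and len(path) >= 2 and path[0] not in registered:
            problems.append(" ".join(path))
    return problems
```

Only the first token is checked; verifying the full path would need a subcommand registry, which is the future work noted above.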
AC-697 slice 3: Python queue and scenario typer-group refactor. Promotes the two action-positional Python commands to sub-Typer groups with registered subcommands so the canonical contract paths (queue add, queue status, scenario create) appear in iter_python_command_paths(). cli_queue.register_queue_command now mounts a queue sub-Typer with invoke_without_command=True so the legacy autoctx queue -s <spec> form still routes to the add subcommand via a group callback; explicit autoctx queue add and autoctx queue status subcommands are also registered. cli_new_scenario.register_new_scenario_command extracts the scaffolding body to a module-level _scaffold_scenario_body() helper, then registers both the legacy top-level new-scenario command and a new scenario sub-Typer group with create subcommand that delegates to the same body. Contract walker _walk_typer in cli_contract.py now yields each registered group's prefix in addition to recursing into its subcommands, so contract entries that pin a group's top-level path (e.g. queue as the umbrella) match the observed registry. Contract: Python queue.status flips from intentional_gap (the slice-2 action-positional reason) to yes; Python scenario.create flips from intentional_gap to yes; TypeScript scenario.create reason updated to point at a follow-up AC-697 slice 4 that will mirror the typer-group refactor on the TS side. 3 new Python tests (queue add and queue status registered at the canonical paths; legacy queue -s <spec> still routes to add without producing a usage error; the slice-2 "Supported actions" test repurposed to assert typer's standard subcommand-not-found banner). 17 existing slice-1 Python parity tests + 18 TS contract tests still pass after updating one slice-2 TS assertion that pinned the now-closed action-positional gap. ruff + mypy clean. 
The iter_python_command_paths walker change is backward-compatible: existing callers that consumed the path enumeration get a superset (all the same paths plus group-prefix entries), so no other tests broke.
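Illustrative sketch of why the walker change yields a superset. It models the command tree with a small hypothetical `Group` dataclass rather than real Typer objects, so it shows the traversal only, not the actual _walk_typer code.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Group:
    commands: list[str] = field(default_factory=list)
    groups: dict[str, Group] = field(default_factory=dict)


def walk(group: Group, prefix: tuple[str, ...] = ()) -> Iterator[tuple[str, ...]]:
    """Yield every command path, plus each group's own prefix."""
    for name in group.commands:
        yield prefix + (name,)
    for name, sub in group.groups.items():
        yield prefix + (name,)                  # the group prefix itself
        yield from walk(sub, prefix + (name,))  # then its subcommands
```

Every path produced before the change is still produced; the group-prefix entries are purely additive.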
AC-697 slice 2: TS status command retargeted from queue-pending to run-status, with queue status as the new canonical home for queue-pending counts. Slice 1 (PR #981) pinned the contract; this slice ships the matching CLI changes. TypeScript: cmdStatus now errors out when invoked without a <run-id> (no fallthrough to queue-pending), pointing operators at autoctx queue status for the queue-pending count; cmdQueue gains a subcommand dispatch that detects autoctx queue status and routes to executeStatusCommandWorkflow + renderStatusResult (the same workflow that used to drive top-level status, so the JSON output shape {"pendingCount": <int>} is preserved across the move). The existing autoctx queue -s <spec> queue-add path is unchanged for backward compatibility. Python: run_queue_command in cli_queue.py accepts action="status" (previously only "add") and emits a {"pending_count": <int>} payload via store.pending_task_count(). The Python top-level status already required a <run-id> positional, so no Python-side change was needed there. docs/cli-contract.json: TS status flips from intentional_gap to yes; TS queue.status flips from intentional_gap to yes; Python queue.status keeps intentional_gap with an updated reason explaining the action-positional dispatch ("Behavior shipped via autoctx queue status action-positional; contract walker reads Typer's registered subcommands and will not see it until a follow-up slice promotes queue to a sub-Typer group, which would break autoctx queue -s <spec> callers"). 7 new tests: 2 Python (autoctx queue status --json returns pending_count, unknown action emits a clear actionable error) + 5 TypeScript (contract entries flipped to yes for TS status + queue.status, Python queue.status retains the action-positional intentional_gap reason, summary still pins run-status as the canonical meaning, workflow shape preserved across the move). All 17 existing slice-1 Python parity tests + all 13 slice-1 TS parity tests still pass. 
ruff + mypy + tsc clean.
AC-708 slice 2c: PyTorch/CUDA-backed logistic-regression curator advisor. New autocontext.hermes.cuda_trained_advisor ships train_cuda_logistic(examples, *, epochs=200, learning_rate=0.5, l2=0.001, seed=0) and save_cuda_advisor(advisor, path). Same multinomial logistic regression architecture as slices 2a (PR #980) and 2b (PR #995) on the same fixed feature encoder; the gradient descent runs on PyTorch tensors with torch.cuda when torch.cuda.is_available(), falling back transparently to CPU torch otherwise. The checkpoint records the actual device under a device audit field ("cuda" or "cpu"); the kind stays cuda_logistic_regression either way because the backend (PyTorch) is what differs from slice 2b's MLX. HAS_CUDA_ADVISOR flag derived from importlib.util.find_spec("torch"); calling train_cuda_logistic without torch raises a clear RuntimeError naming the autocontext[cuda] extra. load_advisor already accepted cuda_logistic_regression (slice 2b reserved the kind). New autoctx hermes train-advisor --cuda --checkpoint <path> CLI subcommand wires the backend end-to-end; the slice-2b three-way --baseline / --logistic / --mlx mutex extends to four-way; passing --cuda without torch installed surfaces a loud actionable error. New cuda optional extra in pyproject.toml (torch>=2.0.0). 10 new tests: 5 platform-independent (_require_torch message, 4-way mutex with zero flags, 4-way mutex with two flags, --cuda without torch clear error, load_advisor accepts cuda_logistic_regression); 5 CUDA-gated (training shape, beats-baseline-on-separable-data, save/load round-trip preserves predictions, empty-dataset ValueError, end-to-end CLI train + checkpoint + load). Verified both paths locally: 9/10 pass with torch installed (1 skip is the not-installed error-path test); 5/10 pass + 5 cleanly skip without torch (CI default). 
The CLI runner consolidates the three-backend dispatch into a uniform payload-construction block (training function + saver + advisor kind selected via if/elif on the flag, rest of the payload built identically) and introduces a local _fail(message) helper to replace the repeated "json -> stderr, else -> console.print(red), raise typer.Exit" pattern; the file lands at 789/800 lines, under the module-size guard. All 20 existing slice-2a tests and all 5 platform-independent + 5 MLX-gated slice-2b tests still pass; ruff + mypy clean. docs/agent-integration.md train-advisor section documents --cuda alongside the existing flags with an inline recipe.
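Illustrative sketch of the optional-dependency guard and the four-way flag mutex. `HAS_TORCH`, `require_torch`, and `select_backend` are hypothetical names modelled on the description above, and the error wording is an assumption.

```python
import importlib.util

# True only when the torch package can be located; nothing is imported here.
HAS_TORCH = importlib.util.find_spec("torch") is not None


def require_torch() -> None:
    """Fail loudly, naming the extra to install, when torch is unavailable."""
    if not HAS_TORCH:
        raise RuntimeError(
            "PyTorch is not installed; install the autocontext[cuda] extra to use --cuda."
        )


def select_backend(*, baseline: bool = False, logistic: bool = False,
                   mlx: bool = False, cuda: bool = False) -> str:
    """Enforce that exactly one of the four backend flags is set."""
    flags = {"baseline": baseline, "logistic": logistic, "mlx": mlx, "cuda": cuda}
    chosen = [name for name, enabled in flags.items() if enabled]
    if len(chosen) != 1:
        raise ValueError(
            "pass exactly one of --baseline / --logistic / --mlx / --cuda "
            f"(got {len(chosen)})"
        )
    return chosen[0]
```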
AC-708 slice 2b: MLX-backed logistic-regression curator advisor. New autocontext.hermes.mlx_trained_advisor ships train_mlx_logistic(examples, *, epochs=200, learning_rate=0.5, l2=0.001, seed=0) and save_mlx_advisor(advisor, path). Same multinomial logistic regression architecture as slice 2a (PR #980) on the fixed feature encoder, but the gradient descent runs on MLX so the matrix multiplies can be GPU-accelerated on Apple silicon. Returns a LogisticRegressionAdvisor (the slice-2a dataclass) so the loaded checkpoint type stays uniform across backends and recommend --advisor does not need to dispatch on backend. The checkpoint JSON is the slice-2a schema with kind: "mlx_logistic_regression" + backend: "mlx" so audits can tell which backend produced a file; the extended load_advisor in trained_advisor.py now accepts logistic_regression, mlx_logistic_regression, and the reserved cuda_logistic_regression (slice 2c). HAS_MLX_ADVISOR flag derived from a guarded import mlx.core / import mlx.nn; calling train_mlx_logistic without MLX raises a clear RuntimeError naming the autocontext[mlx] extra. New autoctx hermes train-advisor --mlx --checkpoint <path> CLI subcommand wires the backend end-to-end; the existing two-way --baseline / --logistic mutex extends to three-way (--baseline / --logistic / --mlx); calling --mlx without the MLX extra installed surfaces a loud actionable error rather than crashing inside an opaque ImportError. 10 new tests: 4 platform-independent (_require_mlx message; three-way mutex with zero flags; three-way mutex with two flags; --mlx without MLX clear error); 1 schema test in the slice-2a file (load_advisor accepts mlx_logistic_regression); 5 MLX-gated (gated on HAS_MLX_ADVISOR) covering training shape, beats-baseline-on-separable-data, save/load round-trip preserves predictions, empty-dataset ValueError, end-to-end CLI train + checkpoint + load. 
Verified both paths on Apple silicon: 8/9 pass with MLX installed (the 1 skip is the not-installed error-path test, which cannot fire when MLX is installed); 4/9 pass + 5 cleanly skip without MLX (CI default). All 20 existing slice-2a tests still pass. ruff + mypy clean. The cli_hermes_runners.py module stays under the 800-line guard (798/800). docs/agent-integration.md train-advisor section gains the --mlx flag documentation + an inline recipe alongside the existing --baseline / --logistic examples.
AC-728 close-out audit: missing-observation invariant pinning tests for the four slice-1 probes (directory, terminal, service, artifact). Each probe's observation fields are non-optional at the TypeScript type layer and at the slice-5 ContractProbeSuiteSchema Zod layer, so the silent-pass shape that necessitated explicit missing-observation failure kinds in cleanup (PR #988), media (PR #985), and distributed (PR #993, slice 8) cannot arise here by construction. 5 new tests (directory requiredFiles against empty workdir; terminal requiredStdoutPatterns against empty stdout; service required against empty observed list; artifact requiredJsonFields against empty content; artifact requiredSubstrings against empty content) pin the loud-failure path so any future refactor that loosens an observation field surfaces immediately. Source-level design-note comment at the top of ts/src/control-plane/contract-probes/index.ts documents the audit conclusion so future contributors don't have to re-derive it. 132 -> 137 file total; tsc --noEmit clean. With this slice the AC-728 surface is fully shipped: directory/terminal/service/artifact probes (PR #957), cleanup probe + retrofit (PRs #983, #988), media probe (#985), distributed probe (#987), suite runner (#990), autoctx probes check CLI (#991), and autoctx probes extract CLI for all seven kinds with orphan-expectation rejection (#992, #993).
AC-728 slice 8: autoctx probes extract now covers cleanup, media, and distributed probe kinds. HarnessTraceSchema gains observations.cleanup (entries with optional symlink/mtime metadata), observations.media (per-path WxH, byte size, column metadata, line count, magic bytes), and observations.distributed (worldSize + per-rank reports with optional steps and observations: Record<string, string>). Matching expectation shapes: expectations.cleanup (lockfile-age policy, sidecar / backup pattern overrides, forbidSymlinks, allowedSymlinkTargets, ignoredPatterns), expectations.media (per-path expected magic bytes, dimensions, byte-size bounds, column expectations, line count), and expectations.distributed (expectedWorldSize, expectedSteps, mustMatchAcrossRanks). Each new section is rejected by superRefine when declared as an expectation without its matching observation, closing the same orphan-expectation class fixed in slice 7's PR #992 review. Per-media expectations join observations by path (mirrors the artifact convention). 11 new tests cover cleanup join + observation-only behaviour, media per-path matching + no-expectation no-op probe, distributed cross-rank divergence + observation-only pass, all four orphan-rejection paths (cleanup / media envelope / per-media path / distributed), and a seven-probe end-to-end round-trip through extractContractProbeSuite + ContractProbeSuiteSchema + runContractProbeSuite. 116 existing AC-728 cases still pass for a 127/127 file total. tsc --noEmit clean. The README's "Synthesizing a suite from a harness trace" section grew to a seven-probe example covering all observation + expectation shapes, and the EXTRACT_HELP_TEXT notes the expanded coverage.
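Illustrative sketch of orphan-expectation rejection, in Python rather than the Zod superRefine the slice uses. `find_orphan_expectations` is hypothetical, and media entries are assumed to be lists of dicts carrying a "path" key.

```python
def find_orphan_expectations(trace: dict) -> list:
    """Report expectations that have no matching observation to be checked against."""
    observations = trace.get("observations", {})
    expectations = trace.get("expectations", {})
    issues = []
    for kind in ("cleanup", "media", "distributed"):
        if kind in expectations and kind not in observations:
            issues.append(f"expectations.{kind} declared without observations.{kind}")
    if "media" in expectations and "media" in observations:
        observed_paths = {entry["path"] for entry in observations["media"]}
        for entry in expectations["media"]:
            if entry["path"] not in observed_paths:
                issues.append(f"expectations.media[{entry['path']}] has no matching observation")
    return issues
```

Rejecting these at validation time keeps a declared expectation from being dropped when the suite is synthesized.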
AC-728 slice 7: autoctx probes extract -- synthesize a runnable probe suite from a harness trace. New autoctx probes extract --trace <path> [--output <path>] reads a HarnessTrace JSON file that bundles both observations (what actually happened in a recorded run: terminal exit code / stdout / stderr; the workdir's present files; observed service endpoints; emitted artifacts) and optional expectations (what the operator declared should have happened: expected exit code, required / allowed / ignored files, required endpoints, per-artifact JSON-field / substring / line-ending expectations). The extractor joins observations with expectations into a ContractProbeSuite (slice 5 wire shape) ready to feed to autoctx probes check. Per-artifact expectations match observations by path; an observation without a matching expectation produces a probe with no declared substring / line-ending / JSON-field checks (the artifact's existence and content are recorded but no assertions fire). New HarnessTraceSchema (Zod) validates the trace envelope and per-kind nested shapes, all .strict() so unknown keys (typos) fail validation; reuses the slice-5 transform pattern for regex / date helpers so safeParse surfaces { success: false } for invalid regexes rather than throwing raw SyntaxError. Output goes to stdout by default (pipe-friendly for extract | check); --output <path> writes to a file (parent directories created). RegExp values in the emitted suite are serialised as { source, flags } objects so the slice-5 runner schema can re-parse them. Slice 7 supports the four AC-728 slice-1 probe kinds (terminal, directory, service, artifact); cleanup, media, and distributed extractors land in follow-up slices once their trace formats settle. 
New files: ts/src/control-plane/contract-probes/cli/extract.ts (the runExtract(args) in-process handler, the extractContractProbeSuite(trace) pure function, and the HarnessTraceSchema Zod schema), plus the dispatcher in ts/src/control-plane/contract-probes/cli/index.ts now routes the extract subcommand. 21 new vitest cases (schema parses observation-only and observations+expectations forms; rejects unknown keys at the envelope and nested in observations; safeParse surfaces invalid-regex issues; observation-only terminal passes; observation-only workdir fails by default with no allowlist; terminal observation + expectation joins; workdir + directory expectation joins with ignoredPatterns; missing allowlist surfaces unexpected-file failures; per-artifact path matching; no-expectation artifact path emits a no-op probe; end-to-end round-trip through ContractProbeSuiteSchema + runContractProbeSuite; CLI --help, missing --trace, missing file, malformed JSON, schema-invalid trace, stdout emission, --output emission with parent-directory creation, end-to-end emitted-suite passes slice-5 runner schema). The 87 existing AC-728 probe + runner + check-CLI cases still pass; 108/108 file total across the contract-probes test files. Package root barrel and tests/package-export-catalogs.test.ts are unchanged (the extractor is reached through autoctx probes extract). README "Contract Probes" section gains a "Synthesizing a suite from a harness trace" subsection with a minimal trace example and the extract | check pipe pattern. tsc --noEmit clean.
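Illustrative sketch of the per-artifact join by path, in Python for brevity. `join_artifacts` is hypothetical, and the dict keys beyond path, requiredSubstrings, and requiredJsonFields are assumptions about the wire shape.

```python
def join_artifacts(observed: list, expected: list) -> list:
    """Pair each observed artifact with the expectation declared for the same path."""
    expectations_by_path = {entry["path"]: entry for entry in expected}
    probes = []
    for artifact in observed:
        # No matching expectation: the probe records content but asserts nothing.
        declared = expectations_by_path.get(artifact["path"], {})
        probes.append({
            "kind": "artifact",
            "path": artifact["path"],
            "content": artifact["content"],
            "requiredSubstrings": list(declared.get("requiredSubstrings", [])),
            "requiredJsonFields": list(declared.get("requiredJsonFields", [])),
        })
    return probes
```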
AC-728 slice 6: autoctx probes CLI surface. New autoctx probes check --suite <path> runs a JSON-defined contract-probe suite and reports per-probe pass/fail. Exit code 0 on full pass; 1 on any failure or any load / parse error. Default output is human-readable (probes check: PASS/FAIL plus per-probe lines with failure detail); --json emits a structured ContractProbeSuiteResult payload (the discriminated-union shape from slice 5 with kind, optional label, passed, and per-probe-typed failures carrying probe-specific fields like path, rank, key, endpoint). Schema-invalid suites surface every Zod issue with its dotted path on stderr so operators can fix typos like requiredStdoutPattern (singular) at parse time rather than discover the missing expectation in a green-but-wrong run. New files: ts/src/control-plane/contract-probes/cli/check.ts (the runCheck(args) in-process handler) and ts/src/control-plane/contract-probes/cli/index.ts (the runProbesCommand(args) subcommand dispatcher). Wired into ts/src/cli/index.ts as the probes no-db command via cmdProbes, and into ts/src/cli/command-registry.ts so autoctx --help lists it. Mirrors the production-traces / instrument CLI pattern: handlers return { stdout, stderr, exitCode } with no process.exit or console inside, so tests consume the runner directly without spawning a subprocess. 12 new vitest cases (top-level help, unknown subcommand, dispatch to check, --help, missing --suite, missing file, malformed JSON, schema-invalid suite surfaces Zod issues, text PASS report, text FAIL report with per-probe detail, --json shape with discriminated-union result, dispatcher round-trip from the cli/index.ts entry); the 75 existing AC-728 probe + runner cases still pass for an 87/87 file total. tsc --noEmit clean. Closes the AC-728 acceptance criterion #2 ("The probe layer can be surfaced to agents as context or run as harness checks") in operator-visible form. 
A follow-up slice adds autoctx probes extract <trace>, which reads a recorded trace and synthesizes a probe suite from its observations.
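Illustrative sketch of the handler convention this slice follows (return stdout, stderr, and an exit code; never exit or print inside the handler). It is Python rather than the shipped TypeScript, and `CliResult`, `run_check`, and the injected `run_suite` callable are hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class CliResult:
    stdout: str
    stderr: str
    exit_code: int


def run_check(args: list, run_suite: Callable[[dict], bool]) -> CliResult:
    """In-process handler: every outcome is returned, nothing is printed or exited."""
    if "--suite" not in args or args.index("--suite") + 1 >= len(args):
        return CliResult("", "probes check: --suite <path> is required\n", 1)
    suite_path = Path(args[args.index("--suite") + 1])
    try:
        suite = json.loads(suite_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return CliResult("", f"probes check: cannot load suite: {exc}\n", 1)
    passed = run_suite(suite)
    verdict = "PASS" if passed else "FAIL"
    return CliResult(f"probes check: {verdict}\n", "", 0 if passed else 1)
```

Tests can call the handler directly and assert on the returned fields, with no subprocess involved.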
AC-728 slice 5: contract-probe suite runner. New ts/src/control-plane/contract-probes/runner.ts adds runContractProbeSuite(suite) (pure function that dispatches a JSON-defined probe spec across all seven AC-728 probes and aggregates results), loadContractProbeSuite(path) (file loader mirroring the cli-contract.ts pattern), and ContractProbeSuiteSchema (Zod schema validating the JSON wire format with a discriminated union over the seven probe kinds). Schema includes ContractProbeKindEnum, ContractProbeInvocation, ContractProbeFailure, ContractProbeRunResult, ContractProbeSuiteResult types. RegExp values can be serialised as either a bare string ("^trace\\.") or { source, flags? }; ISO-8601 strings transform to Date objects for cleanup probe now / per-entry mtime; malformed dates raise invalid ISO-8601 date: .... Aggregate result passed is the AND of per-probe passes; per-probe results carry kind, optional label (caller-supplied attribution string), passed, and a cross-kind ContractProbeFailure[] (each failure has at minimum kind and message; specific probes attach kind-specific extras like path, rank, key, endpoint). 13 new vitest cases (empty suite passes, unknown probe kind rejected, schema_version != 1 rejected, RegExp string transform, ISO-8601 to Date transform, malformed date rejected, exhaustive 7-kind dispatch with all passing, suite passed is AND across probes, failure entries carry kind + label, JSON file load + parse, missing file throws, malformed JSON throws). 55 existing AC-728 probe cases still pass; 68 total tests across the two files. Package root barrel re-exports runContractProbeSuite, loadContractProbeSuite, ContractProbeSuiteSchema, ContractProbeKindEnum and the new types; tests/package-export-catalogs.test.ts pins the public surface. tsc --noEmit clean. 
Wires the AC-728 acceptance criterion #2 ("The probe layer can be surfaced to agents as context or run as harness checks") at the library level; a follow-up slice adds the autoctx probes CLI command + trace-replay extractor on top of this foundation.
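Illustrative sketch of two runner behaviours described in this entry: the ISO-8601 transform with its error message, and the AND aggregation over per-probe results. It is a Python approximation of the TypeScript runner; `parse_iso_date`, `run_suite`, and the handler-table argument are hypothetical.

```python
from datetime import datetime
from typing import Callable, Dict, List


def parse_iso_date(value: str) -> datetime:
    """Turn an ISO-8601 string into a datetime, rejecting malformed input."""
    normalized = value[:-1] + "+00:00" if value.endswith("Z") else value
    try:
        return datetime.fromisoformat(normalized)
    except ValueError as exc:
        raise ValueError(f"invalid ISO-8601 date: {value}") from exc


def run_suite(probes: List[dict], handlers: Dict[str, Callable[[dict], list]]) -> dict:
    """Dispatch each probe by kind; the suite passes only if every probe passes."""
    results = []
    for probe in probes:
        handler = handlers.get(probe["kind"])
        if handler is None:
            raise ValueError(f"unknown probe kind: {probe['kind']}")
        failures = handler(probe)
        results.append({
            "kind": probe["kind"],
            "label": probe.get("label"),  # optional caller-supplied attribution
            "passed": not failures,
            "failures": failures,
        })
    return {"passed": all(result["passed"] for result in results), "results": results}
```

An empty suite passes because the AND over zero results is true.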
AC-728 slice 3: media / tabular contract probe. New probeMediaContract in ts/src/control-plane/contract-probes/index.ts closes the "media/data artifact dimensions, encoding, headers, and units" item from the original AC-728 ticket. Seven failure kinds (wrong-magic-bytes, wrong-dimensions, wrong-byte-size, wrong-column-count, missing-column, wrong-line-count, missing-observation) cover format header bytes, image / video dimensions, byte-size bounds, tabular column-count and column-name presence, and JSONL / CSV line counts. When the caller declares an expectation but the matching observation is undefined, the probe emits missing-observation rather than silently passing — a corrupt artifact or a broken metadata extractor would otherwise satisfy the contract by omitting observations. When the caller does not declare an expectation about a field, that field is not checked. Pure function, no IO. 12 new vitest cases (clean-pass, all-expectations-match, magic-byte mismatch, width / height mismatch, byte-size below min / above max, column-count mismatch, missing required column, line-count mismatch, declared-but-unobserved blanket fail across all seven fields, byte-size missing-observation with only one bound set, no-expectation-declared still passes); the prior 39 AC-728 cases still pass for a 51/51 file total. Package root barrel (ts/src/index.ts) re-exports probeMediaContract and its types; tests/package-export-catalogs.test.ts pins the public surface. tsc --noEmit clean.
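Illustrative sketch of the three-way rule for a single field (no expectation means no check; an expectation without an observation is missing-observation; otherwise compare), using byte-size bounds as the example. `check_byte_size` is a hypothetical Python helper, not part of probeMediaContract.

```python
from typing import Optional


def check_byte_size(
    observed: Optional[int],
    min_bytes: Optional[int] = None,
    max_bytes: Optional[int] = None,
) -> Optional[str]:
    """Return a failure kind, or None when the byte-size contract is satisfied."""
    if min_bytes is None and max_bytes is None:
        return None  # no expectation declared: the field is not checked
    if observed is None:
        return "missing-observation"  # a single declared bound is enough to require it
    too_small = min_bytes is not None and observed < min_bytes
    too_large = max_bytes is not None and observed > max_bytes
    return "wrong-byte-size" if too_small or too_large else None
```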
AC-728 cleanup-probe missing-observation retrofit. Applies the PR #985 review lesson (declared expectations without observations must fail rather than silently pass) to probeCleanupContract. New missing-observation failure kind added to CleanupContractFailureKind; two surfaces now emit it: (1) when maxLockfileAgeMs is set but a matched lockfile entry has no mtime, the probe fails with missing-observation rather than skipping the age check (a stat-failing extractor would otherwise satisfy the age contract by omitting mtime); (2) when allowedSymlinkTargets is set but a symlink entry has no symlinkTarget, the probe fails with missing-observation rather than treating the target as <unknown> and letting a broken extractor pass the allowlist contract. The pre-existing "no expectation declared → no failure" invariant is preserved: callers who declare neither maxLockfileAgeMs nor allowedSymlinkTargets still get the same behavior as before. 4 new tests (lockfile-without-mtime + age contract fails; lockfile-without-mtime without age contract still passes via the unconditional-flag path; symlink-without-target + allowlist fails; symlink-without-target without any symlink contract passes); the prior AC-728 cases still pass. tsc --noEmit clean.
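Illustrative sketch of the lockfile branch after the retrofit. `check_lockfile` is hypothetical; treating a lockfile as stale only when its age strictly exceeds the threshold, and reporting the unconditional case as stale-lockfile, are both assumptions.

```python
from typing import Optional


def check_lockfile(
    mtime_ms: Optional[int],
    now_ms: int,
    max_age_ms: Optional[int],
) -> Optional[str]:
    """Return a failure kind for one matched lockfile entry, or None if it is acceptable."""
    if max_age_ms is None:
        return "stale-lockfile"       # no threshold: every lockfile is flagged
    if mtime_ms is None:
        return "missing-observation"  # age contract declared, but no mtime to check
    return "stale-lockfile" if now_ms - mtime_ms > max_age_ms else None
```

Passing `now_ms` in, rather than reading the clock, keeps the check deterministic in tests.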
AC-728 slice 4: distributed / multi-process contract probe. New probeDistributedContract in ts/src/control-plane/contract-probes/index.ts closes the "distributed/multi-process parity checks beyond world-size 1" item from the original AC-728 ticket. Distributed tensor code can pass shallow checks (process started, gradient computed locally) and still fail multi-rank parity; this probe catches the cross-rank invariants. Six failure kinds (wrong-world-size, missing-rank, duplicate-rank, rank-divergence, wrong-step-count, missing-observation) cover observed-vs-expected world size, every rank in [0, worldSize) reporting (no missing rank, no duplicate report), per-key cross-rank observation equality for keys listed in mustMatchAcrossRanks (the divergence message enumerates the distinct values so the caller can see which value disagrees), and per-rank step-count parity against expectedSteps. Pure function: the caller does the runtime IO (torchrun / NCCL / MPI / whatever collects per-rank reports) and passes a DistributedRankReport per rank; the probe verifies. Same posture as the AC-728 slice 1, 2, 3 probes. Mirrors the PR #985 review lesson: a declared expectation without its observation fails as missing-observation rather than silently passing — e.g. mustMatchAcrossRanks: ["final_loss"] against a rank that did not report final_loss fails, as does expectedWorldSize against an undefined worldSize. 10 new vitest cases (clean 4-rank pass, wrong-world-size, missing-rank, rank-divergence with distinct-value enumeration, wrong-step-count per rank, missing-observation for both must-match keys and world size, no-expectation-declared still passes, world-size-1 degenerate pass, duplicate-rank guard); the 29 existing AC-728 slice 1/2 cases still pass for a 39/39 file total. 
Package root barrel (ts/src/index.ts) re-exports probeDistributedContract and its types alongside the existing AC-728 probes; tests/package-export-catalogs.test.ts pins the public surface so the export cannot silently disappear. tsc --noEmit clean.
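The cross-rank checks above can be sketched roughly as follows. This is an illustration in Python rather than the TypeScript the probe ships in: `RankReport`, `probe_distributed`, and every parameter name are hypothetical stand-ins, not the real `DistributedRankReport` / `probeDistributedContract` surface. Only the failure-kind strings are taken from this entry, and treating a missing step count as `missing-observation` is an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class RankReport:
    """Hypothetical per-rank report (stand-in for DistributedRankReport)."""
    rank: int
    steps: Optional[int] = None
    observations: dict[str, Any] = field(default_factory=dict)


def probe_distributed(
    reports: Sequence[RankReport],
    *,
    observed_world_size: Optional[int] = None,
    expected_world_size: Optional[int] = None,
    must_match: Sequence[str] = (),
    expected_steps: Optional[int] = None,
) -> list[tuple[str, str]]:
    """Return (kind, message) failures; an empty list means the contract held."""
    failures: list[tuple[str, str]] = []

    if expected_world_size is not None:
        # A declared expectation without its observation fails loudly.
        if observed_world_size is None:
            failures.append(("missing-observation", "world size was not reported"))
        elif observed_world_size != expected_world_size:
            failures.append(("wrong-world-size",
                             f"expected {expected_world_size}, observed {observed_world_size}"))
        seen = Counter(report.rank for report in reports)
        for rank in range(expected_world_size):
            if seen[rank] == 0:
                failures.append(("missing-rank", f"rank {rank} did not report"))
            elif seen[rank] > 1:
                failures.append(("duplicate-rank", f"rank {rank} reported {seen[rank]} times"))

    for key in must_match:
        for report in reports:
            if key not in report.observations:
                failures.append(("missing-observation",
                                 f"rank {report.rank} did not report {key}"))
        # repr() keeps unhashable values comparable and enumerable in the message.
        distinct = sorted({repr(r.observations[key]) for r in reports if key in r.observations})
        if len(distinct) > 1:
            failures.append(("rank-divergence",
                             f"{key} differs across ranks: {', '.join(distinct)}"))

    if expected_steps is not None:
        for report in reports:
            if report.steps is None:  # assumption: same missing-observation posture
                failures.append(("missing-observation",
                                 f"rank {report.rank} did not report a step count"))
            elif report.steps != expected_steps:
                failures.append(("wrong-step-count",
                                 f"rank {report.rank} ran {report.steps} steps, expected {expected_steps}"))
    return failures
```

Because the function does no IO, the same reports can be replayed from a recorded run; with no expectation declared it returns an empty list.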
AC-728 slice 2: cleanup contract probe. New probeCleanupContract in ts/src/control-plane/contract-probes/index.ts (alongside the AC-728 slice 1 directory/terminal/service/artifact probes from PR #957) catches the leftover-artifact class of contract bugs the directory probe alone can miss: broken symlinks, symlinks forbidden by contract or pointing outside an allowedSymlinkTargets allowlist, stale lockfiles (configurable maxLockfileAgeMs with now injection for deterministic tests; defaults to flagging every lockfile unconditionally when no threshold is set), editor / OS sidecars (vim swap .swp / .swo, emacs-style ~, macOS .DS_Store, LibreOffice .~lock.*#), and backup copies (.bak, .orig). Five failure kinds (stray-symlink, broken-symlink, stale-lockfile, stray-sidecar, stray-backup) with per-failure human-readable message. Reuses the existing isIgnored helper so ignoredPatterns semantics match probeDirectoryContract. Pure function: the probe does no filesystem IO, so it composes with the same trace-replay surfaces the slice 1 probes already use. Caller passes a directory listing as CleanupFileEntry records (path, optional isSymlink / symlinkTarget / symlinkBroken / mtime); default sidecar and backup patterns are intentionally narrow so the probe does not false-positive against legitimate dotfiles. 11 new tests (clean-directory passes, broken-symlink always fails, forbidSymlinks blanket fail, allowedSymlinkTargets allowlist, default sidecar / backup detection, lockfile unconditional and age-thresholded, ignoredPatterns parity with the directory probe, caller pattern overrides); the 18 existing AC-728 slice 1 tests still pass for a 29/29 file total. tsc --noEmit clean.
AC-697 slice 1: shared CLI contract + per-runtime parity tests. New docs/cli-contract.json is the single source of truth for the canonical autoctx surface (17 commands so far: the six paved-road plus the highest-friction items from the ticket). New autocontext.cli_contract (Python loader with frozen Contract / CommandSpec / Flag / RuntimeSupportPair value types, RuntimeStatus StrEnum, iter_python_command_paths Typer introspection helper, PAVED_ROAD constant) + ts/src/cli/cli-contract.ts (Zod-validated TypeScript loader with matching PAVED_ROAD constant and resolveAlias helper). Both runtimes' parity tests load the same JSON: 17 Python tests in tests/test_cli_contract.py and 13 TypeScript tests in ts/tests/cli-contract-ac697.test.ts cover schema sanity (no duplicate ids, alias uniqueness, intentional-gap reasons required, audience tier + domain concept validity, paved-road constant matches audience filter), runtime parity (every runtime_support.<runtime> == "yes" claim must resolve to a registered command at the canonical path), and AC-697 friction-point invariants pinned in the contract (status canonical meaning is run status; solve is not a domain noun; --iterations is the canonical iteration flag; queue.status does not occupy top-level status). Each intentional_gap entry carries a non-empty reason so reviewers can tell apart "decided not to ship" from "forgot to implement" and trace which AC-697 follow-up slice owns the fix. The contract is intentionally small in slice 1 (paved road + friction points); follow-up slices fill in the remaining 30+ commands and ship the actual semantics fixes (status retargeting, queue add parity, alias plumbing, paved-road help view, capabilities-from-contract).
AC-708 slice 2a: pure-Python logistic-regression curator advisor. New autocontext.hermes.trained_advisor exposes LogisticRegressionAdvisor (frozen value type carrying learned weights, intercepts, label order, fixed feature encoder), train_logistic(examples, *, epochs=200, learning_rate=0.5, l2=0.001, seed=0) (multinomial logistic regression via gradient descent on softmax cross-entropy), predict + predict_proba (calibrated per-label probabilities summing to 1), and save_advisor / load_advisor for JSON checkpoint round-trip with a stable schema (kind: "logistic_regression", version: 1). Implements the existing Advisor Protocol so the AC-708 slice 1 evaluate and the AC-709 recommend work unchanged. New autoctx hermes train-advisor --logistic --output metrics.json --checkpoint advisor.json trains and persists; autoctx hermes recommend --advisor advisor.json --home ~/.hermes --output recs.jsonl loads and emits recommendations — closes the AC-705 → AC-708 → AC-709 loop end-to-end with a real trained advisor. --baseline / --logistic are mutually exclusive (caller picks one explicitly); --baseline-from / --advisor on recommend are likewise mutually exclusive. Same-file guards reject --checkpoint equal to either --data (would clobber the source dataset) or --output (would clobber the metrics payload). load_advisor rejects dimension-invalid checkpoints (mismatched label / weights / intercepts row counts, or any weights row whose length disagrees with feature_names) so the failure surfaces at the file rather than later inside predict_proba. Pure Python (no numpy/sklearn/GPU dep) so the trained backend runs in CI smoke mode against fixture-sized data; MLX (slice 2b) and CUDA (slice 2c) backends ship behind the same Advisor Protocol + checkpoint schema (kind discriminator rejects foreign checkpoints). 
19 tests cover learned-weight shape, Advisor-Protocol dispatch, predict_proba normalization, beats-baseline-on-separable-data, deterministic-for-same-seed, empty-dataset rejection, single-label graceful fallback, save/load round-trip, stable checkpoint schema, unknown-kind/missing-file/corrupt-JSON/dimension-mismatch load errors, the two --checkpoint same-file guards, and CLI integration (train --logistic + recommend --advisor end-to-end). The full hermes, spec-verifier, and module-size test suites pass; ruff + mypy clean.
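For readers unfamiliar with the technique, here is a minimal pure-Python sketch of multinomial logistic regression trained by batch gradient descent on softmax cross-entropy. It is not the `train_logistic` implementation: `fit_softmax`, its positional arguments, and the zero initialization (which makes a seed unnecessary here) are assumptions for illustration; only the default hyperparameter values are borrowed from the entry above.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, intercepts, x):
    """Per-label probabilities for one feature vector; they sum to 1."""
    scores = [b + sum(w_j * x_j for w_j, x_j in zip(w, x))
              for w, b in zip(weights, intercepts)]
    return softmax(scores)


def fit_softmax(rows, targets, n_labels, *, epochs=200, learning_rate=0.5, l2=0.001):
    """Hypothetical trainer: rows are feature lists, targets are label indices."""
    n, n_features = len(rows), len(rows[0])
    weights = [[0.0] * n_features for _ in range(n_labels)]
    intercepts = [0.0] * n_labels
    for _ in range(epochs):
        grad_w = [[0.0] * n_features for _ in range(n_labels)]
        grad_b = [0.0] * n_labels
        for x, target in zip(rows, targets):
            probs = predict_proba(weights, intercepts, x)
            for k in range(n_labels):
                # Cross-entropy gradient w.r.t. the score: p_k - 1[k == target].
                err = probs[k] - (1.0 if k == target else 0.0)
                grad_b[k] += err
                for j in range(n_features):
                    grad_w[k][j] += err * x[j]
        for k in range(n_labels):
            intercepts[k] -= learning_rate * grad_b[k] / n
            for j in range(n_features):
                # L2 penalty applies to weights only, not intercepts.
                weights[k][j] -= learning_rate * (grad_w[k][j] / n + l2 * weights[k][j])
    return weights, intercepts
```

On a small linearly separable fixture this recovers the training labels, which is the kind of property the beats-baseline-on-separable-data test pins.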
AC-770 + AC-771: two new rules in the AC-769 remediation router. rule_threshold_budget (AC-770) emits a new BudgetIncrease(parameter, current, suggested_factor, reason) hint when an assertion error matches the k/total at N trials or at N trials: k/total shape with a near-zero pass rate (k <= 25% of total), or contains insufficient samples / convergence not reached. Factor heuristic: 16x for k == 0 (the c32 marker from the Cryptopals validation campaign), 4x otherwise. rule_indexing_base (AC-771) emits a new IndexingCheck(reason) hint when a near-zero k/N hits|bytes recovered failure shape pairs with source code containing a Z_N / index_N / idx_N identifier alongside a position = N / index = N constant — the c56 shape where literature naming (1-indexed Z_16) and code (0-indexed position = 16) disagree. When source code is unavailable, a low-confidence generic hint still fires for 0/N failures so the agent considers indexing as a candidate. The router signature gains a source_code: str | None = None kwarg, forwarded to all rules; existing AC-769 callers are unaffected (kwarg has a None default). Both new hints render through render_hints() with human-readable descriptions. 19 new tests (8 threshold-budget, 6 indexing-base, 3 router integration, 2 rendering) plus the existing 22 AC-769 tests all pass; lint + mypy clean.
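The threshold-budget factor heuristic can be illustrated with a short sketch. The function name and the exact regular expressions are assumptions (the entry names the two message shapes but not the patterns), and the keyword triggers (insufficient samples, convergence not reached) are left out because the entry does not say which factor they map to.

```python
import re
from typing import Optional

# The two message shapes named in the entry; the concrete patterns are assumed.
_SHAPES = (
    re.compile(r"(?P<k>\d+)\s*/\s*(?P<total>\d+)\s+at\s+\d+\s+trials"),
    re.compile(r"at\s+\d+\s+trials:\s*(?P<k>\d+)\s*/\s*(?P<total>\d+)"),
)


def suggested_budget_factor(message: str) -> Optional[int]:
    """Return 16 or 4 for a near-zero pass rate, or None when no rule applies."""
    for shape in _SHAPES:
        match = shape.search(message)
        if match is None:
            continue
        k, total = int(match["k"]), int(match["total"])
        # Near-zero pass rate: k <= 25% of total, written without floats.
        if total > 0 and 4 * k <= total:
            return 16 if k == 0 else 4
    return None
```

For example, `0/32 at 100 trials` yields 16, `at 100 trials: 3/32` yields 4, and `30/32 at 100 trials` yields no hint.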
AC-711: validate the Hermes autocontext skill against realistic agent prompts via a static content rubric. New autocontext.hermes.skill_validation exposes TaskPrompt, ExpectedBehavior, ValidationCase, ValidationResult, ValidationReport value types plus the validate_skill() entry point and a DEFAULT_RUBRIC covering all six AC-711 fixture prompts (evaluate_and_improve, export_best_as_skill, look_at_curator_reports, use_local_mlx_to_train, mcp_vs_cli, improve_curator_without_replacing). Six typed predicates enforce the AC-711 evaluation criteria: prefers_cli_when_mcp_unconfigured, uses_mcp_only_when_configured, never_mutates_hermes_skills_for_inspect_or_train, explains_privacy_before_session_ingest, documents_export_skill_path, separates_curator_and_autocontext_responsibilities. New autoctx hermes validate-skill --output report.md --json runs the rubric and exits non-zero on any failure so CI gates skill drift. Skill patch: building the rubric surfaced a real gap — the shipped SKILL.md had no explicit privacy posture for session/trajectory imports. Per AC-711 deliverable, hermes/skill.py now ships a Privacy Before Session and Trajectory Ingest section distinguishing Curator decision reports (safe metadata) from sessions/trajectories (raw content), documenting --redact standard|strict|off plus --dry-run, naming autoctx hermes ingest-sessions and autoctx hermes ingest-trajectories as the affected commands. The AC-712 committed skills/autocontext/SKILL.md snapshot is regenerated against the patched renderer so the AC-712 sync invariant test stays green after this PR lands. Three negative regression tests (CLI-first guidance stripped, every privacy keyword stripped, export-skill stripped) prove the rubric has teeth. Validation results recorded at docs/hermes-skill-validation.md. 17 rubric tests pass; total hermes-cluster tests pass.
AC-712: distribution path for the Hermes autocontext skill. Ships a committed snapshot of render_autocontext_skill() at skills/autocontext/SKILL.md plus the four AC-702 references under skills/autocontext/references/. CI sync invariant (autocontext/tests/test_hermes_skill_distribution.py, 5 tests) pins the committed bytes byte-for-byte to the renderer and rejects orphan reference files, so the snapshot can never drift silently. New docs/hermes-skill-distribution.md documents three install paths (Option A: autoctx hermes export-skill --output ~/.hermes/skills/autocontext/SKILL.md --with-references; Option B: curl raw URLs from main or a pinned SHA; Option C: shallow + sparse git clone), the /reload-skills reload story, frontmatter-based versioning, and the local-edits-as-fork pitfall. Upstream Hermes submission and agentskills.io / hub registration are scoped as AC-712 follow-ups so the supported install matrix is unblocked today without waiting on external approval. DRY: the renderer is still the single source of truth; the committed snapshot is generated via the shipped CLI and re-generated the same way.
AC-707 (spike): Hermes plugin emitter prototype + decision doc. New autocontext.hermes.plugin_emitter module ships a fail-open HermesTraceEmitter orchestrator with LLMCallEvent / ToolCallEvent value types, a TraceSink Protocol, and a LocalJsonlSink concrete write surface. The emitter reuses the existing RedactionPolicy (DRY with AC-706) and production_traces.emit.build_trace (DRY with AC-704 / AC-706) so a future production plugin can adopt the shape without redesigning anything in autocontext. Decision documented at docs/hermes-plugin-emitter-spike.md: DEFER until either a concrete operator workflow demands the extra fidelity (sub-second timing, structured tool calls, provider usage) or Hermes publishes a stable plugin API contract. The file importers (AC-704 / AC-706) plus the advisor pipeline (AC-708 / AC-709) cover the current operator scenarios; paying the cross-package contract cost now would not unlock any active payoff thread. 12 tests pin the safety properties (sink fail-open, hook fail-open, late finalize ignored, concurrent sessions isolated, no network IO in default mode, shared-policy redaction, ProductionTrace shape) so a future revisit is glue work, not a green-field rewrite. AC-707 closed.
AC-709: autoctx hermes recommend --home ~/.hermes --baseline-from training/hermes-curator-decisions.jsonl --output recommendations.jsonl [--include-protected] [--json] is the read-only recommendation surface. Trains a baseline advisor on AC-705 export data, walks the live Hermes inventory, and emits one JSONL row per recommendation. New autocontext.hermes.recommendations module exposes Recommendation (skill_name, predicted_action, confidence, status, features, reason) and recommend(inventory, advisor, *, include_protected, reason). Read-only invariant: never writes to ~/.hermes; Curator stays the mutation owner. Protected skills (pinned / bundled / hub provenance) are filtered out by default so a recommendation cannot mistakenly target upstream-owned content; --include-protected surfaces them tagged status="protected" for audit. Same-file guard on --baseline-from / --output mirrors the AC-706 / AC-708 ingest posture. Slice-1 refactor of autocontext.hermes.advisor: introduces SkillFeatures as the inference-time input shape so advisors take features (not labeled examples), with CuratorDecisionExample.features bridging training to inference cleanly. BaselineAdvisor.predict(features) is unchanged behaviorally; the slice-1 tests update one direct call site. 13 recommendation tests + 1 refactor regression cover features bridge, advisor protocol, protected-skill filtering, include-protected audit path, JSON round-trip, default rationale per advisor type, and 4 CLI integration tests (success, same-file guard, empty training rejection, all-protected empty-output, include-protected surfacing). 186 total hermes tests pass.
AC-769: failure-type → remediation routing on top of FailureReport. New autocontext/src/autocontext/loop/remediation_router.py pattern-matches a FailureReport (plus optional AC-767 fixtures map) into typed RemediationHint instances. Three built-in rules ship: rule_off_by_one (matches "expected X, got Y" where diff ∈ {1, BLOCK, BLOCK²} for common block sizes, plus "off-by-N" keywords) → SmallCaseVerify; rule_positional_typerror (matches TypeError: foo() takes N positional arguments and extracts modules from File "..." traceback lines) → SurfaceSignatures; rule_stale_fixture (matches missing-substring failures referencing a fixture key whose cached payload is older than stale_after_days) → RefreshFixture. Rules are pluggable via a Rule Protocol and DEFAULT_RULES list. route_remediations(report, *, fixtures, stale_after_days, rules) runs every rule and concatenates hints in order; render_hints(hints) emits a ## Suggested next moves prompt block. Wired into the tree-search refinement loop (loop/stage_tree_search.py): HypothesisNode gains last_errors: list[list[str]], HypothesisTree.update accepts an optional errors_per_match kwarg, and the refinement-prompt build site calls remediation_hints_for_node(selected, fixtures=ctx.fixtures) then threads the result into build_refinement_prompt(remediation_hints=...). build_refinement_prompt gains a remediation_hints: str = "" opt-in kwarg (existing callers unchanged). 23 tests cover rules, router, render, the stage_tree_search wiring helper, and an end-to-end test through build_refinement_prompt.
AC-767 (docs follow-up): operator-facing documentation for the authoritative ground-truth fixture loader that landed in #968. New autocontext/docs/fixture-loader.md covers quick-start (drop a manifest at autocontext/knowledge/<scenario>/fixtures.json, set AUTOCONTEXT_FIXTURE_LOADER_ENABLED=true), manifest format (key, source, optional expected_sha256), cache semantics (rehash on read, source-URL change invalidates, missing manifest is a no-op), programmatic API (FixtureManifest, FixtureCache, UrlFetcher, load_scenario_fixtures, render_fixtures), and the settings reference. No code changes; the implementation already shipped via #968.
AC-708 (slice 1): autoctx hermes train-advisor --data <jsonl> --baseline --output metrics.json lays down the data + evaluation contract for the local Hermes curator advisor. New autocontext.hermes.advisor module exposes a DDD domain layer: CuratorDecisionExample value type loaded from AC-705 export JSONL, BaselineAdvisor (always-majority-class with deterministic tie-break in CANONICAL_LABELS order), LabelMetrics / AdvisorMetrics (per-label precision/recall + overall accuracy + insufficient_data flag), train_baseline(), and evaluate(). load_curator_examples is per-line tolerant (matches AC-704 / AC-706 ingest posture): malformed JSON, missing required fields, and unknown labels skip the row rather than aborting. INSUFFICIENT_DATA_THRESHOLD = 20 floors when per-label metrics are meaningful — datasets below the floor still get metrics back but with the flag set, addressing the AC-708 acceptance criterion "a clear 'not enough data' failure mode for small Hermes homes". The baseline establishes the floor every later trained advisor (slice 2: logistic regression / MLX / CUDA, AC-709 recommendation surface) must beat without redesigning the data contract. 15 tests cover loader robustness, baseline determinism, per-label precision/recall on a known fixture, insufficient-data thresholds, JSON-serializable metrics, and CLI integration (--baseline --json --output, insufficient-data warning, empty-dataset rejection).
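A minimal sketch of an always-majority-class baseline with a deterministic tie-break follows. The function names are hypothetical, the ordering of `CANONICAL_LABELS` is assumed (the label names come from the AC-705 export), and applying the threshold to the total example count is also an assumption.

```python
from collections import Counter

# Label names from the AC-705 export; this particular ordering is an assumption.
CANONICAL_LABELS = ("consolidated", "pruned", "archived", "added")
INSUFFICIENT_DATA_THRESHOLD = 20


def majority_label(labels):
    """Most frequent canonical label; ties resolve in CANONICAL_LABELS order."""
    counts = Counter(label for label in labels if label in CANONICAL_LABELS)
    if not counts:
        raise ValueError("no labeled examples to train on")
    best = max(counts.values())
    # Iterating in canonical order makes a tie resolve identically every run.
    return next(label for label in CANONICAL_LABELS if counts[label] == best)


def baseline_report(labels):
    """Hypothetical metrics payload: metrics are still returned below the floor."""
    predicted = majority_label(labels)
    correct = sum(1 for label in labels if label == predicted)
    return {
        "predicted": predicted,
        "accuracy": correct / len(labels),
        # Assumption: the floor is applied to the total number of examples.
        "insufficient_data": len(labels) < INSUFFICIENT_DATA_THRESHOLD,
    }
```

The point of the flag is that a small dataset still produces numbers, but the caller is told not to trust them.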
AC-706 (slice 2): autoctx hermes ingest-sessions --home ~/.hermes --output traces/hermes-sessions.jsonl --redact standard|strict|off [--since <ISO>] [--limit n] [--dry-run] reads the Hermes session SQLite DB (<home>/state.db) in read-only URI mode and writes one autocontext ProductionTrace JSONL row per session. New autocontext.hermes.sessions module exposes a DDD domain layer: HermesSession, HermesMessage, HermesSessionRepository (read-only SQLite + schema-drift tolerance + WAL/SHM sidecar independence), and SessionDBMissing for the "no DB to ingest" boundary. New autocontext.hermes.session_ingest is the application service that maps domain objects into ProductionTraces via the same production_traces.emit.build_trace helper that AC-704 uses (DRY). Per-message content goes through the shared RedactionPolicy from slice 1 (DRY across both ingest paths), so a strict-mode user-pattern set behaves identically for trajectories and sessions. The RAW_CONTENT_WARNING opt-in marker from slice 1 is reused so --redact off --json surfaces the same audit signal for sessions. Per-trace metadata carries session_id, agent_id, session_started_at, session_ended_at, session_metadata, and source: "hermes.session". Missing DB returns an empty summary (graceful, exit 0). 10 repository tests cover read-only refusal, missing-DB error path, since-filter, sequence order, schema drift (extra and missing columns), WAL/SHM-less open, and corrupt metadata JSON. 13 ingester tests cover end-to-end emission, shared-policy redaction, since/limit/dry-run, importer-never-mutates-DB invariant (mtime + size check), --redact off warning surfacing, per-trace metadata, invalid --since rejection, and CLI integration. AC-706 closed.
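Read-only URI mode is a standard `sqlite3` feature, and a short sketch shows why it backs the never-mutates-DB invariant. The helper name `open_read_only` is hypothetical and is not the `HermesSessionRepository` API; the `sqlite3.connect(..., uri=True)` call and the `mode=ro` query parameter are the standard library and SQLite behavior.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file read-only; never creates or modifies it.

    With mode=ro, SQLite refuses writes and refuses to create a missing file,
    so both cases raise sqlite3.OperationalError instead of touching disk.
    """
    uri = db_path.resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

A caller can map the `OperationalError` raised for a missing file onto its own "no DB to ingest" boundary, while any accidental `INSERT` or `UPDATE` on the connection fails rather than mutating the source database.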
AC-706 (slice 1): autoctx hermes ingest-trajectories --input <jsonl> --output <jsonl> --redact standard|strict|off reads a Hermes trajectory JSONL file (ShareGPT-like, line-per-trajectory) and writes a redacted copy. Default --redact standard runs the existing sharing/redactor pipeline (Anthropic / OpenAI / AWS / GitHub / Slack keys, bearer tokens, emails, IPs, env values, absolute paths, high-risk file refs). --redact strict requires --user-patterns (a JSON array of {name, pattern} regex objects) and tags hits as [REDACTED_USER_PATTERN:<name>]. --redact off writes raw content and surfaces a CLI warning on the privacy posture (AC-706 requires explicit operator opt-in). --dry-run reports redaction counts without writing the output (AC-706 privacy preview). Per-line tolerance: corrupt JSON, non-object trajectories, and blank lines are skipped (not aborted) with per-line warnings. The redaction stats are returned per-category so operators can audit what was removed. New autocontext.hermes.redaction module exposes RedactionPolicy, compile_user_patterns, and redact_text as the shared policy surface that the AC-706 slice 2 (sessions) will reuse. 11 redaction-policy tests + 13 trajectory-ingester tests (including the CLI subcommand entry point and the input-never-mutated invariant). AC-706 slice 2 (ingest-sessions from ~/.hermes/state.db with WAL/SHM tolerance and schema drift) is a follow-up; this slice ships the redaction primitives and the simpler JSONL surface first.
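The strict-mode user-pattern behavior can be sketched as below. `compile_patterns` and `redact` are hypothetical names chosen to avoid implying the real `compile_user_patterns` / `redact_text` signatures, which this entry does not spell out; the `{name, pattern}` input shape and the `[REDACTED_USER_PATTERN:<name>]` marker are taken from the entry.

```python
import json
import re


def compile_patterns(raw_json: str) -> list[tuple[str, re.Pattern[str]]]:
    """Parse a JSON array of {"name", "pattern"} objects into compiled regexes."""
    return [(item["name"], re.compile(item["pattern"])) for item in json.loads(raw_json)]


def redact(text: str, patterns: list[tuple[str, re.Pattern[str]]]) -> tuple[str, dict[str, int]]:
    """Replace every hit with a named marker and count hits per pattern name."""
    counts: dict[str, int] = {}
    for name, pattern in patterns:
        marker = f"[REDACTED_USER_PATTERN:{name}]"
        # A callable replacement avoids backslash/group expansion in the marker.
        text, hits = pattern.subn(lambda _match, marker=marker: marker, text)
        if hits:
            counts[name] = hits
    return text, counts
```

Returning the per-name counts alongside the text is what lets a dry run report what would be removed without writing any output.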
AC-702: Hermes skill references for progressive disclosure. Adds autocontext/src/autocontext/hermes/references.py exposing 4 markdown references (hermes-curator, cli-workflows, mcp-workflows, local-training) accessible via list_references() / render_reference(name). The rendered SKILL.md from render_autocontext_skill() now ends with a ## References section that cross-links each one. autoctx hermes export-skill --with-references --output <dir>/SKILL.md writes the references next to the skill in a references/ subdirectory; --force propagates to both SKILL.md and references. The skill remains useful on its own when --with-references is not passed. Atomic preflight: every destination is checked before any write so a reference-name collision can't leave SKILL.md half-installed. 12 tests cover canonical order, content invariants (read-only rule in curator alignment doc; concrete commands in CLI workflows; CLI-vs-MCP guidance in MCP workflows; small-dataset warning in local-training), SKILL.md cross-linking, the CLI overwrite-without-force guardrail, and the atomicity regression test.
AC-705: autoctx hermes export-dataset --kind curator-decisions --home ~/.hermes --output training/hermes-curator-decisions.jsonl exports Hermes curator decision artifacts as supervised training JSONL for narrow advisor classifiers (per the AC-708 scope). Each row carries example_id, source.curator_run_path, source.started_at, input.skill_{name,state,provenance,pinned,use_count,view_count,patch_count,activity_count,last_activity_at}, label (consolidated | pruned | archived | added, strongest-wins precedence), confidence: "strong", redactions: [], and context.run_{provider,model,counts}. Label quality rules pinned by tests: pinned skills NEVER become mutation targets; bundled and hub skills NEVER become mutation targets (they appear only as context). Skills missing from the inventory still emit an example with unknown features so historical curator decisions can be trained on. Both Hermes v0.12 action shapes are accepted (list of strings OR list of {"name": ...} dicts). --since <ISO-8601> raises ValueError on invalid input rather than silently disabling the filter; runs without parseable started_at fall back to file mtime for the comparison. Pinned-via-.usage.json, bundled-via-.bundled_manifest, and hub-via-.hub/lock.json names are protected even when no active SKILL.md folder exists. Other documented dataset kinds (consolidation-pairs, skill-selection, skill-quality-signals) raise NotImplementedError with a clear message so callers know they're planned but not yet implemented. 18 fixture-based tests cover schema, label quality rules, since/limit filters, unknown-kind dispatch, dict-shape actions, protected-name fallbacks, and --since hardening. Module docstring documents the full schema; the schema is intentionally flat and feature-engineered so it can feed autoctx train --backend mlx|cuda via a one-step adapter (the adapter is a follow-up). 
NOTE: small personal Hermes homes may not have enough data for useful model training yet -- the dataset shape ships first; usefulness depends on Curator-decision volume.
AC-704: autoctx hermes ingest-curator --home ~/.hermes --output traces/hermes-curator.jsonl reads Hermes v0.12 curator run reports (<home>/logs/curator/**/run.json) and emits autocontext ProductionTrace JSONL. The ingester is tolerant: malformed JSON is skipped with a warning rather than aborting; missing started_at falls back to file mtime; missing duration_seconds falls back to 0. Curator action lists (consolidated/pruned/archived/added) and counts land in trace.metadata.curator_* so downstream dataset exporters (AC-705) can consume them without re-parsing raw files. Privacy defaults: --include-llm-final (off by default) gates whether the curator's LLM final summary is attached as an assistant message; --include-tool-args (off by default) gates whether raw tool-call args are preserved. --since <ISO-8601> and --limit <n> filter the run set. CLI returns a JSON summary (runs_read, traces_written, skipped, warnings) under --json. 11 fixture-based tests cover normal run / consolidation-only / auto-transition-only / malformed JSON / missing curator dir / since-filter / limit / synthesized-messages-satisfy-schema / include-llm-final opt-in / metadata round-trip / timing derivation.
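The tolerant-ingest posture (skip and warn on malformed files, fall back on missing fields) might look roughly like this. `read_curator_runs` and its return shape are hypothetical, and skipping non-object JSON is an assumption; the glob path and the two fallbacks come from the entry.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def read_curator_runs(home: Path) -> tuple[list[dict], list[str]]:
    """Collect run reports under <home>/logs/curator/**/run.json without aborting."""
    runs: list[dict] = []
    warnings: list[str] = []
    for path in sorted(home.glob("logs/curator/**/run.json")):
        try:
            report = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:  # ValueError covers JSON and decoding errors
            warnings.append(f"{path}: skipped ({exc})")
            continue
        if not isinstance(report, dict):  # assumption: non-object payloads are skipped too
            warnings.append(f"{path}: skipped (not a JSON object)")
            continue
        mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        runs.append({
            "path": str(path),
            # Missing started_at falls back to the file's mtime.
            "started_at": report.get("started_at") or mtime.isoformat(),
            # Missing duration_seconds falls back to 0.
            "duration_seconds": report.get("duration_seconds", 0),
        })
    return runs, warnings
```

One bad report costs one warning; the remaining runs are still returned.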
AC-710: docs/hermes-positioning.md records the Hermes Curator + autocontext positioning. Headline: Hermes Curator is the live skill-library maintainer; autocontext is the evaluation, trace, replay, export, and local-training layer. Includes an at-a-glance complementarity table, the default operator flow (autoctx hermes inspect -> autoctx hermes export-skill -> autoctx judge / improve), the read-only import boundary on ~/.hermes, the privacy posture for session/trajectory imports, the narrow scope of autoctx train for advisor models, and an explicit "autocontext does not replace Curator" section. Cross-linked from docs/README.md "Integrating External Agents". Status footer enumerates shipped / in-flight / out-of-scope work so the doc stays accurate as the rest of the Hermes cluster lands.
AC-682 (slice 1): TypeScript OpenTelemetry bridge for PublicTrace. New ts/src/traces/otel-bridge.ts exposes publicTraceToOtelResourceSpans (forward) and otelResourceSpansToPublicTrace (reverse) over a minimal validated subset of OTel JSON ResourceSpans (OtelResourceSpansSchema Zod). Bidirectional round-trip preserves traceId, sourceHarness (via service.name), collectedAt, sessionId, message order/content, tool calls (name/args/duration/error -> span status.code = "ERROR"), outcome (score/reasoning/dimensions), and redactions metadata. Reverse path validates the reconstructed trace against PublicTraceSchema before returning so a broken bridge cannot emit invalid traces. 11 tests cover schema validation, forward emission, round-trip, missing-service-name error path, missing-root-span error path, optional-outcome handling, zero-tool-call messages, and redaction preservation. Design note + mapping table at docs/opentelemetry-bridge.md enumerates the known-gap fields (file references, metadata, tool results) that survive as opaque JSON blobs rather than as structured OTel attributes. Python parity, OTLP protobuf wire format, and the ProductionTrace bridge are out of scope for slice 1.
AC-725: docs/flue-influences.md design note records what the runtime workspace/session contract, scoped command/tool grants, child-agent task execution, and cwd discovery model borrowed from an external review, and what was explicitly NOT borrowed (no upstream dependency, no API names, no provider stack, no vocabulary replacement). Cross-linked from docs/README.md "Architecture And Parity"; the canonical docs/concept-model.md is intentionally NOT cross-linked to keep its vocabulary autocontext-native (a tests/package-topology.test.ts invariant pins this). Pins the guardrail that sandbox / workspace / session are runtime isolation/boundary concepts, not peer top-level product nouns alongside Scenario / Mission.
AC-728: verifier-facing contract probes for terminal, service, and artifact tasks. Extends ts/src/control-plane/contract-probes/index.ts (previously only probeDirectoryContract) with three new pure probes: probeTerminalContract (exit code + required/forbidden stdout/stderr patterns), probeServiceContract (required endpoints with host/port/protocol matching + wrong-interface detection for 127.0.0.1 vs 0.0.0.0 confusion + optional allowed-endpoint allowlist), and probeArtifactContract (required/forbidden substrings + LF/CRLF line-ending check + required JSON fields via dot-paths with invalid-json failure when JSON parse fails). All probes follow the existing { passed: boolean, failures: readonly Failure[] } shape; failures carry a typed kind for client filtering. 17 new tests + the existing directory probe test. Distributed/multi-rank parity probes deferred to a follow-up slice.
AC-679 (slice 3b): autoctx trace-findings --trace-id <id> extends the slice-2 CLI to load a stored ProductionTrace by id from .autocontext/production-traces/ingested/<date>/*.jsonl (the local data plane that flows through autoctx production-traces ingest). --trace <path> and --trace-id <id> are mutually exclusive input modes; exactly one is required. The workflow adapts ProductionTrace to PublicTrace inline (flatten source.emitter -> sourceHarness, derive collectedAt from timing.startedAt, map outcome only when both score and reasoning are present, copy embedded toolCalls per message) so the slice-1 extractor runs unchanged. 5 new tests cover load + Markdown, JSON shape, missing-id error, mutual exclusivity, and the "neither flag" failure case. AC-679 is now substantively feature-complete (criteria 1-8 met); the only deferred work is additional taxonomy categories (slice 3e).
AC-679 (slice 3d): WeaknessReport variant in ts/src/analytics/trace-findings.ts. Adds WeaknessReportSchema (Zod), generateWeaknessReport(trace), and renderWeaknessReportMarkdown(report). Mirrors Python's WeaknessReport shape (recommendation-focused with recovery analysis text) alongside the existing TraceFindingReport. Recommendations are one-per-distinct-category, deduplicated, sourced from a fixed RECOMMENDATION_BY_CATEGORY table. Recovery analysis is a narrative string composed from the outcome score and weakness count. 8 tests cover schema completeness, generation across the four taxonomy categories, deduplicated recommendations, and Markdown output sections / empty states.
AC-679 (slice 3c): renderTraceFindingReportHtml(report) ships in ts/src/analytics/trace-findings.ts. Emits an offline-first self-contained HTML document with an inline <style> block, anchored finding rows (id="finding-<id>" so external references can link directly), and data-category + data-severity attributes on each <li> for client-side filtering hooks. Mirrors the shape of Python's render_trace_writeup_html so operator muscle memory transfers between the two runtimes. User-originated content (titles, descriptions, summary, traceId) is escaped through a single htmlEscape helper that handles & < > " '. 7 tests cover scaffolding, escaping, anchors, data attributes, empty states, offline-style block, and evidence references.
AC-679 (slice 3a): cross-runtime TraceFindingReport JSON contract. A shared fixture at fixtures/cross-runtime/trace-finding-report.json (at repo root) is the wire-format contract that both Python and TypeScript validate against. Python adds CrossRuntimeTraceFinding / CrossRuntimeFailureMotif / CrossRuntimeTraceFindingReport Pydantic models at analytics/cross_runtime_trace_findings.py with camelCase JSON aliases mirroring the TS Zod schema; snake_case kwargs work for ergonomic Python use, model_dump(by_alias=True) is the canonical wire form. 9 Python tests + 6 TS tests on the same fixture catch shape/taxonomy/enum drift before a TS-produced report can fail to parse on Python (and vice versa). Closes AC-679 criterion 8 (cross-runtime contract tests catch Python/TS drift).
AC-679 (slice 2): autoctx trace-findings --trace <path> [--json] CLI subcommand wires the slice-1 extractor library into an operator-facing TypeScript command. Reads a PublicTrace JSON file, runs generateTraceFindingReport, and emits the report as Markdown (default) or JSON. Handler is pure (runTraceFindingsCommand(args) -> {stdout, stderr, exitCode}) so the 11 unit tests drive it directly without subprocess spawn or stdout capture; the top-level cli/index.ts shim writes the result. Coupling to the ProductionTrace store (--trace-id <id>) and the extra slice-1-deferred taxonomy categories remain follow-up work.
AC-679 (slice 1): TypeScript trace-finding extractor library at analytics/trace-findings.ts. Re-targets AC-679 to operate over PublicTrace (the TS data plane primitive) rather than mirroring Python's harness-internal RunTrace shape, so cross-runtime parity lives in the output contract (TraceFindingReportSchema Zod schema) rather than the input trace. Slice 1 ships the Zod schemas (TraceFindingSchema, FailureMotifSchema, TraceFindingReportSchema), a four-category taxonomy targeting agent-behavior failures detectable from a PublicTrace (tool_call_failure, agent_refusal, low_outcome_score, dimension_inconsistency), pure extractor functions (extractFindings, extractFailureMotifs, generateTraceFindingReport), and renderTraceFindingReportMarkdown. Captures the agent-behavior axis that the AC-678 Python slice deferred. CLI subcommand, HTML rendering, additional categories (context loss / error-recovery loops), and cross-runtime fixture parity tests land in follow-up slices.
AC-678 (slice): autoctx analytics trace-findings --trace-id <id> [--kind writeup|weakness] [--json] emits a trace-grounded findings report for a stored RunTrace. Exposes the existing TraceReporter.generate_writeup / generate_weakness_report pipeline as an operator CLI without changing the canonical report model; Markdown body matches the run-end-time writeup artifact. Reuses the _validated_trace_id traversal guard from render-timeline. Closes the headline AC-678 gap (Python report model existed without a CLI surface); semantic failure-taxonomy mapping beyond the current event_type grouping remains open.
AC-749 (slice): autoctx analytics render-timeline --trace-id <id> [--output path.html] renders an existing persisted RunTrace as an interactive HTML timeline. On-demand counterpart to the run-end-time renderer that already lives in loop/trace_artifacts.persist_run_inspection; reuses the same timeline_inspection_view extractor + render_timeline_inspection_html view. The rendered HTML now also surfaces a "Generations" section with per-generation failure/recovery counts (data attributes data-generation-index, data-generation-failure-count, data-generation-recovery-count for client-side hooks). The view layer exposes the same inspect_generation data the JSON payload already carries -- no new analytics model.
Harness proposal decisions now require explicit evidence references before heldout/fresh validation can accept or reject a proposal. Missing --evidence-ref keeps the durable decision inconclusive, and corrupted accepted/rejected proposal JSON with empty evidenceRefs, dev-only evidence, or missing baseline evidence is rejected by schema validation.
Python and TypeScript prompt budgeting now share a domain policy for canonical duplicate-context removal, per-component token caps, protected components, and trim order; semantic compaction also caches repeated component compactions by policy version and content hash.
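Illustration of the compaction cache mentioned above: a minimal sketch, assuming the cache key is simply the pair (policy version, SHA-256 of the component content). The class name CompactionCache, its methods, and the misses counter are hypothetical and are not taken from the codebase.

```python
import hashlib
from typing import Callable, Dict, Tuple


class CompactionCache:
    """Hypothetical cache keyed by (policy version, content hash)."""

    def __init__(self, policy_version: str, compact: Callable[[str], str]) -> None:
        self.policy_version = policy_version
        self._compact = compact
        self._entries: Dict[Tuple[str, str], str] = {}
        self.misses = 0  # counts how often the real compaction had to run

    def get(self, content: str) -> str:
        # Hash the content so large components do not become dictionary keys.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        key = (self.policy_version, digest)
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._compact(content)
        return self._entries[key]
```

Because the policy version is part of the key, changing the policy naturally invalidates earlier entries without an explicit flush.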
AC-727 (slice): autoctx improve --checkpoint-cmd runs a user-supplied command after each round to preserve partial progress (e.g. git -C /repo commit -am 'round checkpoint' or cp {file} /tmp/round.lean). Same {file} placeholder semantics as --verify-cmd, plus --checkpoint-suffix and --checkpoint-timeout companions. Unlike the verifier, a checkpoint command's non-zero exit is logged but does NOT veto the round; it surfaces as a new checkpoint_done(round=N, checkpoint_ok=..., checkpoint_exit_code=...) event in the --ndjson stream. Lets long-running improve loops salvage near-miss artifacts before later rounds overshoot or time out.
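Illustration of the non-vetoing checkpoint behavior: a minimal sketch, assuming the {file} placeholder is replaced with a shell-quoted path and that the event is a flat dictionary. The function name run_checkpoint and the "event" key are hypothetical; only the checkpoint_done name and the round / checkpoint_ok / checkpoint_exit_code fields come from the entry above.

```python
import shlex
import subprocess
from typing import Optional


def run_checkpoint(cmd_template: str, file_path: str, round_no: int,
                   timeout: float = 30.0) -> dict:
    """Run a checkpoint command and report the outcome without raising."""
    # Assumption: {file} is substituted with a shell-quoted path.
    cmd = cmd_template.replace("{file}", shlex.quote(file_path))
    exit_code: Optional[int]
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True, timeout=timeout)
        exit_code = proc.returncode
    except subprocess.TimeoutExpired:
        exit_code = None  # a timed-out checkpoint has no exit code to report
    # A failing checkpoint is only reported; the caller keeps the round.
    return {
        "event": "checkpoint_done",
        "round": round_no,
        "checkpoint_ok": exit_code == 0,
        "checkpoint_exit_code": exit_code,
    }
```

The function never raises on a non-zero exit, which mirrors the "logged but does NOT veto" rule: the loop can emit the event and continue.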
AC-723: the TypeScript CLI now exposes autoctx agent run <agent> and autoctx agent dev for experimental .autoctx/agents handlers. The one-shot runner accepts --id, JSON --payload, explicit --env files with shell env precedence, provider/model overrides for runtime-backed handlers, and --json output; the dev server exposes GET /manifest and POST /agents/<name>/invoke.
Context-selection analytics reports now include actionable diagnostics for duplicate selected content, low useful-artifact recall, and selected-token bloat.
Python analytics now includes autoctx analytics context-selection --run-id <run-id> [--json] to summarize persisted context-selection artifacts by selected tokens, selection rate, duplicate-content rate, useful-artifact recall, and freshness.
AC-757: TypeScript control-plane EvalRuns now support verified and experimental tracks. autoctx eval attach accepts --track verified|experimental, eval list --output json reports the effective track, and promotion decisions reject explicitly experimental EvalRuns as non-promotion evidence.
AC-758: Candidate artifacts now record deterministic strategy identity metadata: a canonical strategy fingerprint, component fingerprints, parent strategy lineage, and exact/near duplicate assessment. autoctx candidate register/show include the metadata, and candidate list surfaces the strategy fingerprint and duplicate kind.
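Illustration of deterministic strategy identity: a minimal sketch, not the project's implementation. The canonical-JSON-plus-SHA-256 scheme, the per-key component fingerprints, the overlap rule for near duplicates, the "none" kind, and all function names are assumptions chosen for illustration.

```python
import hashlib
import json


def strategy_fingerprint(strategy: dict) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(strategy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def component_fingerprints(strategy: dict) -> dict:
    """Fingerprint each top-level component separately (hypothetical split)."""
    return {name: strategy_fingerprint({name: value})
            for name, value in strategy.items()}


def duplicate_kind(candidate: dict, existing: dict,
                   near_threshold: float = 0.5) -> str:
    """Classify a candidate as an exact, near, or non-duplicate."""
    if strategy_fingerprint(candidate) == strategy_fingerprint(existing):
        return "exact"
    a = set(component_fingerprints(candidate).values())
    b = set(component_fingerprints(existing).values())
    union = a | b
    # Share of components the two strategies have in common.
    overlap = len(a & b) / len(union) if union else 0.0
    return "near" if overlap >= near_threshold else "none"
```

Sorting keys before hashing is what makes the fingerprint stable across serializers that emit fields in different orders.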
AC-759: Candidate artifacts now quarantine repeated invalid strategies by fingerprint. Re-registering an exact or near duplicate of a disabled/quarantined strategy records strategyQuarantine, candidate list surfaces quarantineReason, promotion decisions reject quarantined strategies, and operational memory skips findings tied to quarantined strategy fingerprints.
AC-760: EvalRuns can now carry opt-in ablation verification evidence for accepted strategy and harness changes. autoctx eval attach accepts --ablation-verification ./ablation.json, promotion decide --require-ablation records an ablationVerification assessment, and --ablation-targets strategy,harness narrows the required target coverage.
AC-680: TypeScript control-plane harness/context changes now have a durable HarnessChangeProposal workflow. autoctx harness proposal create/list/show/decide records finding lineage, proposed patches, expected impact, rollback criteria, and an evidence-gated decision that accepts only heldout/fresh validation against matching-suite baseline evidence.
Strategy duplicate and quarantine checks now span all environments for the same scenario/actuator and use payloadHash as an exact-match fallback for legacy artifacts without strategyIdentity.
AC-752: autoctx improve --ndjson streams per-round events as newline-delimited JSON to stdout for visibility into long-running loops. Event kinds: round_start, judge_done, verifier_done (only when --verify-cmd is set), round_summary, and a final summary line. Under --ndjson the Rich human-readable summary is suppressed so stdout is pure JSON. --json and --ndjson are mutually exclusive output modes and are rejected up front when both are passed.
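Illustration of consuming the stream: a minimal sketch of a reader that groups newline-delimited JSON records by event kind. The key that carries the event kind is not specified in this entry, so the "event" default below is an assumption, as is the function name group_events.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def group_events(lines: Iterable[str], kind_key: str = "event") -> Dict[str, List[dict]]:
    """Group NDJSON records by their event kind."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        grouped[record.get(kind_key, "unknown")].append(record)
    return dict(grouped)
```

A consumer could feed this from sys.stdin when piping the improve command's stdout, since under --ndjson every line is expected to be a standalone JSON document.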
AC-753: the ndjson stream now also emits a revision_done(round=N, output=<content>) event right after round_start for every round, carrying the exact output the loop is about to evaluate. For round 1 the payload is the seed; for round N>1 it is the result of task.revise_output() from round N-1. Lets consumers salvage near-miss verifier-vetoed rounds. Pass --no-ndjson-include-output (default --ndjson-include-output) to suppress these events when the bulk output is unwanted; that flag drops the revision_done event entirely and never writes the output payload anywhere on stdout.
AC-751: autoctx improve --claude-max-total-seconds FLOAT exposes settings.claude_max_total_seconds (the wall-clock ceiling on total claude-cli runtime in a single run; env: AUTOCONTEXT_CLAUDE_MAX_TOTAL_SECONDS). Only applied when the effectively-resolved judge provider is claude-cli; judge_provider='auto' paths that inherit agent_provider='claude-cli' are honored. --timeout help on improve now explicitly names the per-provider setting it writes (claude_timeout/codex_timeout/pi_timeout).
Python and TypeScript now expose autoctx worker to run the existing task queue TaskRunner as a daemon or one-shot batch worker, with persistent-host deployment docs for serve + worker.
Added narrow Python/TypeScript task queue store contracts so future hosted storage adapters can provide Postgres-backed claim/complete/fail/enqueue semantics without changing TaskRunner.
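Illustration of a narrow store contract: a minimal sketch of how a runner can depend only on enqueue/claim/complete/fail. The Protocol name, the method signatures, the status strings, and the in-memory adapter are all hypothetical; the real contract may differ.

```python
from typing import Callable, Dict, List, Optional, Protocol


class TaskQueueStore(Protocol):
    """Hypothetical storage contract a hosted adapter would implement."""

    def enqueue(self, task_id: str, payload: dict) -> None: ...
    def claim(self) -> Optional[dict]: ...
    def complete(self, task_id: str, result: dict) -> None: ...
    def fail(self, task_id: str, error: str) -> None: ...


class InMemoryQueueStore:
    """In-memory adapter, standing in for a database-backed one."""

    def __init__(self) -> None:
        self.tasks: Dict[str, dict] = {}
        self._order: List[str] = []

    def enqueue(self, task_id: str, payload: dict) -> None:
        self.tasks[task_id] = {"id": task_id, "payload": payload, "status": "pending"}
        self._order.append(task_id)

    def claim(self) -> Optional[dict]:
        # Hand out the oldest pending task and mark it as running.
        for task_id in self._order:
            task = self.tasks[task_id]
            if task["status"] == "pending":
                task["status"] = "running"
                return task
        return None

    def complete(self, task_id: str, result: dict) -> None:
        self.tasks[task_id].update(status="completed", result=result)

    def fail(self, task_id: str, error: str) -> None:
        self.tasks[task_id].update(status="failed", error=error)


def drain(store: TaskQueueStore, handler: Callable[[dict], dict]) -> int:
    """Process tasks until the store has nothing left to claim."""
    processed = 0
    while (task := store.claim()) is not None:
        try:
            store.complete(task["id"], handler(task["payload"]))
        except Exception as exc:
            store.fail(task["id"], str(exc))
        processed += 1
    return processed
```

The runner only sees the four methods, so swapping the in-memory adapter for a hosted one would not require changes to drain.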
Gondolin is documented as a reserved optional microVM sandbox backend, fails closed until a real adapter is configured, and now has public request/policy/backend contracts for out-of-tree adapters.
TypeScript autoctx runtime-sessions now lists, shows, and renders operator-facing timelines for persisted runtime-session event logs from CLI-backed provider runs, including show --run-id <run-id> and timeline --run-id <run-id> for run-scoped logs. status, show, and watch --json surface a runtime_session summary when one exists. MCP exposes the same read surface via list_runtime_sessions, get_runtime_session, and get_runtime_session_timeline. Cockpit HTTP clients can read logs and timelines from /api/cockpit/runtime-sessions, /api/cockpit/runtime-sessions/:session_id/timeline, /api/cockpit/runs/:run_id/runtime-session, and /api/cockpit/runs/:run_id/runtime-session/timeline, and cockpit run list/status/resume payloads include runtime_session plus runtime_session_url for discovery. The interactive TUI exposes /timeline <run-id> for the same grouped view and summarizes live runtime-session activity as it arrives, with persisted /activity filters, quiet/normal/verbose detail controls, /activity reset, read-only bare /activity and /activity status, and startup readback of loaded activity settings. /ws/events streams live runtime_session_event envelopes as runtime-session events are appended.
Python now has parity readers for runtime-session event logs: a TypeScript-compatible event/store/read-model/timeline layer, cockpit endpoints for listing logs and resolving run-scoped timelines, run list/status/resume discovery fields, and MCP tools autocontext_list_runtime_sessions, autocontext_get_runtime_session, and autocontext_get_runtime_session_timeline with unprefixed aliases.
Python runtime-backed run and solve role calls now automatically append provider prompts and responses to the run-scoped runtime-session log, preserving runtime failure semantics while making the new Python readers useful without manual recorder wiring.
Python now exposes a core RuntimeWorkspaceEnv contract with local filesystem and in-memory adapters, virtual path resolution, scoped command grants, and explicit cleanup semantics to match the TypeScript runtime workspace boundary.
TypeScript runtime workspace command grants now expose structured start/end/error observability events, a no-shell local process wrapper with explicit env inheritance, redacted/truncated command output previews, child-task inheritance policy, and scoped command/tool grant types for runtime-session calls without serializing trusted env values into prompts or session logs.
The canonical concept model now documents durable runtime-session event storage as an Artifact model for provider turns, shell/tool activity, child-task lineage, compaction summaries, replay, and the boundary with RunTrace/production traces.
Python and TypeScript runtime-session logs now record semantic compaction ledger writes as COMPACTION events with entry ids, component names, ledger paths, and generation metadata for replay timelines; TypeScript records the hook-finalized ledger entries and paths after artifact write hooks run.
Python and TypeScript now expose explicit runtime-session-to-RunTrace adapters for analytics reuse, mapping child-task lineage, command/tool status, and compaction artifact references without copying raw prompts, model responses, stdout/stderr, or arbitrary runtime metadata.

Fixed

AC-764 / AC-765: Python and TypeScript Pi CLI runtimes no longer rely on raw subprocess.run(..., timeout=...) / execFileSync(..., { timeout }) cleanup. Both runtimes now isolate pi --print in a subprocess/session where supported, kill the full process group on timeout, close inherited stdout/stderr pipes, bound post-kill cleanup to 5s, and preserve timeout metadata (error: "timeout", timeout seconds) for callers. Regression coverage includes process-group kill, interrupted/abnormal cleanup, and leaked-pipe timeout return paths.
AC-761 / AC-735: claude-cli subprocesses are now hard-killed at their process group on timeout AND on any other abnormal exit (KeyboardInterrupt, SystemExit, ...). The previous code path used subprocess.run(..., timeout=...), which only proc.kill()s the immediate child; claude-cli helper processes that inherit pipe fds kept the post-kill communicate() drain open, so a --timeout 1200 invocation was observed still alive at 2h24m (AC-761) and AUTOCONTEXT_CLAUDE_MAX_TOTAL_SECONDS=28800 runs were observed still alive at 8h45m (AC-735). The runtime now spawns claude in its own session (start_new_session=True) and os.killpg(pgid, SIGKILL)s the whole group, with a bounded 5s grace on the post-kill drain. Because start_new_session=True also detaches the child from the terminal's signal-delivery group, Ctrl-C / SIGINT no longer reaches the claude process group automatically; the helper's except BaseException branch (PR #940 review) ensures interrupted runs still clean up the detached children before re-raising. The call now returns within claude_timeout + 5s of wall-clock time even when grandchildren hold pipes open. POSIX only; Windows falls back to proc.kill().
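Illustration of the process-group kill pattern: a minimal POSIX-only sketch using start_new_session=True and os.killpg, with a bounded post-kill drain. The function names, the return shape, and the default grace value are assumptions for illustration; this is not the runtime's actual helper.

```python
import os
import signal
import subprocess


def _kill_group(proc: subprocess.Popen, grace: float) -> None:
    """SIGKILL the child's whole process group, then drain with a bound."""
    try:
        # The child leads its own session, so its pgid equals its pid.
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group already exited
    try:
        proc.communicate(timeout=grace)
    except subprocess.TimeoutExpired:
        # Something still holds the pipes open; stop waiting and close them.
        for pipe in (proc.stdout, proc.stderr):
            if pipe is not None:
                pipe.close()


def run_with_group_kill(argv, timeout: float, grace: float = 5.0) -> dict:
    """Run argv in a new session and never leave its descendants behind."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        _kill_group(proc, grace)
        return {"exit_code": None, "stdout": b"", "error": "timeout",
                "timeout_seconds": timeout}
    except BaseException:
        # KeyboardInterrupt / SystemExit: clean up detached children, re-raise.
        _kill_group(proc, grace)
        raise
    return {"exit_code": proc.returncode, "stdout": out, "error": None,
            "timeout_seconds": timeout}
```

The BaseException branch matters because the new session no longer receives the terminal's SIGINT, so the parent has to kill the group itself when interrupted.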
AC-756: ImprovementResult.met_threshold now consistently mirrors the same predicate used by the early-return paths -- the best round both cleared quality_threshold and satisfied dimension_threshold if one was configured. Previously the fallthrough exit (plateau-stall, unchanged-output, max-rounds, consecutive-failures) hard-coded met_threshold=False, so a run that produced above-threshold output via, e.g., a plateau-stall path was flagged as "didn't meet threshold" and could be discarded by automation. The fix tracks best_dims_ok alongside best_score so the per-dimension gate is honored at fallthrough exits too.
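Illustration of the shared predicate: a minimal sketch, assuming "cleared" means greater than or equal to quality_threshold and that best_dims_ok is None when no dimension_threshold is configured. The standalone function form is hypothetical.

```python
from typing import Optional


def met_threshold(best_score: float, quality_threshold: float,
                  best_dims_ok: Optional[bool]) -> bool:
    """True when the best round cleared the score gate and any dimension gate."""
    # Assumption: None means no dimension_threshold was configured.
    dims_ok = True if best_dims_ok is None else best_dims_ok
    return best_score >= quality_threshold and dims_ok
```

Under this sketch a plateau-stall exit with best_score=0.9 against a 0.8 threshold reports True, which is the case the earlier hard-coded False got wrong.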
AC-754: ImprovementLoop now peels off an outer markdown code fence (e.g. ```lean ... ```) when cleaning agent output, so verifiers that compile the output directly (lake env lean, mypy, cargo check, ...) no longer reject otherwise-valid content on the literal fence lines. Applied to both the seed (round 1's input) and the result of every task.revise_output() call. The strip is conservative: only the outer wrapper is removed, inner nested fences and unbalanced fences are preserved.
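Illustration of a conservative outer-fence strip: a minimal sketch, not the loop's actual cleaner. The function name and the exact rules for deciding that a wrapper is unambiguous are assumptions; anything the sketch cannot classify is returned unchanged.

```python
FENCE = "`" * 3  # the three-backtick fence marker


def strip_outer_fence(text: str) -> str:
    """Remove one outer code fence; leave nested or unbalanced input alone."""
    lines = text.strip().splitlines()
    if len(lines) < 2:
        return text
    opening, closing = lines[0].strip(), lines[-1].strip()
    # The opening line may carry an info string such as "lean".
    if not opening.startswith(FENCE) or "`" in opening[3:] or closing != FENCE:
        return text
    inner = lines[1:-1]
    nested_open = False
    for line in inner:
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if stripped == FENCE:
            if not nested_open:
                return text  # the wrapper closed early: several blocks, not one
            nested_open = False
        else:
            nested_open = True  # a labelled fence opens a nested block
    if nested_open:
        return text  # unbalanced nested fence
    return "\n".join(inner)
```

Returning the input untouched in every doubtful case keeps the behavior aligned with the "conservative" wording: a verifier sees either the unwrapped content or exactly what the agent produced.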
AC-750: ImprovementLoop no longer fires a misleading max_score_delta warning when the previous round was zeroed by the external --verify-cmd verifier. The loop now tracks last_unvetoed_score separately from prev_valid_score; the delta check compares against the last legitimate judge score, while plateau detection still treats consecutive verifier vetoes as a stall.
Runtime-session event stores now preserve existing events when saving stale or partial logs, and the TypeScript timeline pairs repeated child-task completions by child session id before falling back to task aliases.
Worker commands now clamp concurrency to one for stateful persistent runtimes, and Python runtime-bridge providers close underlying runtimes on shutdown.
TypeScript task runners now await queue-store methods so hosted Postgres adapters can implement the queue contract asynchronously.
AC-733..AC-738 batch from the putnam_2013_a5 stress test. autoctx improve now exposes --verify-cmd/--verify-suffix/--verify-timeout for compile/test gates that can force score=0 and feed stderr back into revision. autoctx solve accepts --task-prompt to bypass the LLM scenario designer (which truncated long Lean/Putnam-style prompts), --task-file for file-backed descriptions, --generations as an alias for --gens, and a -d short form for --description. --family typos surface a did_you_mean suggestion via the new FamilyName value object instead of silently falling through. AUTOCONTEXT_CLAUDE_TOOLS="" now renders as a single --tools= argv token rather than a stray double-space. AUTOCONTEXT_CLAUDE_MAX_TOTAL_SECONDS (default 0/off) attaches a RuntimeBudget to every settings-driven ClaudeCLIRuntime (default agent provider, per-role overrides, and the judge/provider registry path), with retry backoff sleeps bounded by both the per-invocation cap and the attached budget.
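Illustration of a did_you_mean suggestion for mistyped family names: a minimal sketch built on the standard library's difflib. How FamilyName actually computes its suggestion is not stated here, so the matching method, the cutoff, and the function name are assumptions.

```python
import difflib
from typing import Optional, Sequence


def did_you_mean(name: str, known: Sequence[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the closest known name, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Returning None for distant inputs lets the caller fail with a plain "unknown family" error instead of offering a misleading suggestion.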

Changed

Python autocontext and TypeScript autoctx package metadata are bumped to 0.5.1 for the Pi CLI timeout-hardening release. Follow-up Pi pi-autocontext package metadata is bumped to 0.2.5, its extension imports and peer dependencies are migrated to the Pi 0.74 @earendil-works/* / typebox package names, and its autoctx dependency now requires the hardened ^0.5.1 line.
Default of AUTOCONTEXT_CLAUDE_MAX_TOTAL_SECONDS is now 0 (disabled, opt-in). Set explicitly when you want a wall-clock cap on total Claude CLI runtime; the per-invocation retry cap inside ClaudeCLIConfig keeps its 25-minute default for in-process retry sequences.

0.5.0 - 2026-05-01

Added

Python and TypeScript autoctx solve now accept the plain-language goal as a positional argument while keeping --description as a named option.
Python and TypeScript solve/run commands now accept --iterations as the plain-language alias for --gens.
Python and TypeScript autoctx run <scenario> now accept a positional scenario while keeping --scenario for scripts.
Python and TypeScript autoctx export <run-id> now export knowledge from a specific run while keeping scenario-level export support.
TypeScript CLI/TUI help now uses the same plain-language run vocabulary, including status <run-id>, show <run-id> --best, and watch <run-id>.
Python autoctx hermes inspect now reads Hermes v0.12 skill usage telemetry and Curator reports without mutating ~/.hermes, and autoctx hermes export-skill emits a first-class Hermes autocontext skill that teaches CLI-first workflows with MCP as optional.

Fixed

Python installed autoctx no longer crashes on no-args startup when packaged banner assets are missing.

Changed

Python autocontext and TypeScript autoctx package metadata are bumped to 0.5.0.
Pi pi-autocontext package metadata is bumped to 0.2.4, and its autoctx dependency range accepts both the current 0.4.9 package and the upcoming 0.5.0 npm line.

0.4.9 - 2026-04-30

Fixed

TypeScript simulate now uses the schema-evolution scenario designer for schema-evolution prompts and rejects zero-mutation generated specs before persistence (AC-694).
Python Pi/Pi-RPC budget errors now report the effective bounded role timeout instead of the original unbounded Pi timeout (AC-695).
RLM sessions can soft-finalize from explicit final-answer tags, cautious natural-language closure cues, and repeated silent no-progress turns, while preserving real inspection progress (AC-696).
Rubric drift monitoring now flags within-generation mean-versus-best compression and catches slower dimension decline patterns (AC-686).

Changed

Python autocontext and TypeScript autoctx package metadata are bumped to 0.4.9.
Pi pi-autocontext package metadata is bumped to 0.2.3 while intentionally keeping its autoctx dependency one package behind at ^0.4.8.

0.4.8 - 2026-04-30

Fixed

TypeScript generated schema_evolution scenarios no longer score empty mutation plans as perfect, and generated actions now record mutation lineage before schema-coverage scoring (AC-666).
Python Claude CLI runtime calls now use bounded timeout retries with exponential backoff, total wall-clock caps, retry metadata, and warning/error logs for long-running live-agent calls (AC-684).
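Illustration of bounded retries with exponential backoff under a wall-clock cap: a minimal sketch, assuming a timeout surfaces as TimeoutError and that the backoff doubles from a base delay. The function name, parameters, defaults, and return shape are hypothetical.

```python
import time
from typing import Any, Callable


def call_with_retries(
    call: Callable[[], Any],
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_total_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Retry on timeout, never sleeping past the total wall-clock cap."""
    start = clock()
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            return {"result": call(), "attempts": attempts, "error": None}
        except TimeoutError:
            pass  # fall through to the backoff logic
        if attempts >= max_attempts:
            break
        delay = base_delay * (2 ** (attempts - 1))  # 1x, 2x, 4x, ...
        remaining = max_total_seconds - (clock() - start)
        if remaining <= 0:
            break  # the total budget is spent; do not start another attempt
        sleep(min(delay, remaining))
    return {"result": None, "attempts": attempts, "error": "timeout"}
```

Injecting sleep and clock keeps the helper testable without real waiting, and the attempts count in the result plays the role of retry metadata.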
Python solve now enforces generation budgets across Pi/Pi-RPC role calls, including per-role overrides, and closes one-shot budgeted persistent Pi RPC clients after use (AC-691).
TypeScript schema-evolution creation now recovers from Pi-style invalid JSON responses with markdown fences, prose wrappers, comments, trailing commas, and camelCase fields (AC-692).
Python solve JSON/status output now includes resolved scenario-family metadata for stress harnesses and user workflows (AC-693).
Iterative investigation no longer requires resolving the architect runtime before the first analyst step.
Task-like solve lifecycle hooks now report persisted generation counts separately from improvement rounds.

Changed

Python autocontext and TypeScript autoctx package metadata are bumped to 0.4.8.
Pi pi-autocontext package metadata is bumped to 0.2.2 while intentionally keeping its autoctx dependency one package behind at ^0.4.7.

0.4.7 - 2026-04-29

Added

Python autoctx export now accepts --format pi-package to write a Pi-local package directory with package.json, SKILL.md, prompt markdown, and the original autocontext strategy payload.
Python and TypeScript autocontext now expose Pi-shaped extension hook buses via AUTOCONTEXT_EXTENSIONS, covering run/generation lifecycle, context transforms, semantic compaction, provider requests/responses, judge calls, and artifact writes.
Pi pi-autocontext now exposes autocontext_runtime_snapshot for run artifacts, package provenance, session branch lineage, and recent event-stream context.
TypeScript Pi RPC now supports an opt-in persistent runtime via AUTOCONTEXT_PI_RPC_PERSISTENT=true, reusing one pi --mode rpc subprocess for prompt and live-control calls.
TypeScript CLI now exposes autoctx solve as a DB-backed solve-on-demand entrypoint with --description, --gens, --timeout, and --json support (AC-619).
TypeScript solve now preserves Python-shaped controls for structured family overrides, per-generation runtime-budget enforcement, output file writing, and classifier fallback status metadata (AC-620).

Fixed

TypeScript capabilities now report the provider factory support surface and no longer mark the visible train command as Python-only (AC-626).
TypeScript run now supports saved custom agent_task scenarios through the agent-task improvement runner instead of rejecting scenarios already discoverable in the control plane (AC-625).

Changed

Restructured the top-level README.md: leads with the Pi runtime quick start, adds an MCP-driven natural-language entry path ("Or Just Talk To Your Agent"), shows a structured artifact tree with concrete playbook.md and trace.jsonl excerpts, surfaces production-trace capture as its own section, merges the surfaces table with command examples, and adds a short FAQ. Removes redundant "How People Use It" / "Choose An Entry Point" / "Repository Layout" sections (the last is already covered in AGENTS.md).
Bumped subpackage README references from 0.4.4 to 0.4.7 (autocontext/README.md, ts/README.md) to track the next release line.
Python autocontext, TypeScript autoctx, and Pi pi-autocontext package metadata are bumped for the release.

0.4.6 - 2026-04-23

Added

Browser integration surface (AC-598–603): Chrome CDP backend for Python (autocontext.integrations.browser) and TypeScript (autoctx/integrations/browser), wired into investigations and the task queue. Includes a browser exploration contract, cross-runtime validation fixtures, parity enforcement, and selector generation for CDP element refs.
A2-III Anthropic integration: instrument_client / InstrumentedAsyncAnthropic (Python) and instrumentClient (TypeScript) intercept Anthropic SDK calls and route production traces through the autocontext pipeline, with AnthropicStreamProxy/AnthropicStreamProxyAsync for streaming and AnthropicTaxonomyMapper for outcome classification. Available at autocontext.integrations.anthropic and autoctx/integrations/anthropic. Includes cross-runtime parity (9 fixtures + 50-run property tests), anthropic-python/ts detector plugins, bundle-size enforcement, and zero-telemetry guarantee.
Production traces build-dataset filters (AC-606): --provider, --app, --env, and --outcome filters on the build-dataset CLI and MCP tool, plus an E2E integration test covering OpenAI + Anthropic traces through ingest→build-dataset.
Hierarchical investigation evidence with evidence cards cache and artifact drill-down hardening.
Tail context preservation in secondary prompt reducer surfaces.
Solve runtime floor raised for generated scenarios.

Fixed

Provider proxy runtime plumbing centralized into a shared _shared/proxy-runtime module so Anthropic and OpenAI integration proxies share consistent lifecycle and error handling (AC-611).
TypeScript scenario family designers now share response parsing across agent-task, artifact-editing, and tool-fragility families so generated specs preserve family-specific semantics (AC-612).
Install salt identity invariant preserved across process restarts (AC-609).
Cross-runtime migration ledger reconciliation so Python and TypeScript DBs stay aligned after schema divergence (AC-608).
CLI dispatch moved into a command registry so mission routes resolve correctly (AC-610).
Babel reverse solve designer retries restored and scenario creation stabilized (AC-607).

Changed

Python and TypeScript package metadata are bumped to 0.4.6.

0.4.5 - 2026-04-21

Fixed

quality_threshold auto-heal no longer silently drops below the configured floor during multi-round improvement loops (AC-585).
Judge-provider inheritance now propagates correctly to nested evaluation calls so role-routing overrides are honored end-to-end (AC-586).
Claude CLI timeout default bumped from 300 to 600 seconds, reducing spurious failures in longer live-agent solve runs (AC-588).
Release-sweep accounting hardened to prevent double-counting across concurrent sweep legs.

Added

Added a shared browser exploration contract and package-safe configuration surface across Python and TypeScript, including canonical schemas, validation helpers, secure AUTOCONTEXT_BROWSER_* defaults, and policy helpers.
Added the TypeScript Chrome DevTools Protocol backend for browser exploration, including attach-only target discovery, websocket transport, policy-gated actions, and evidence artifacts.
Added Python browser exploration integration for investigations and queued tasks, including policy-gated snapshot capture, prompt/evidence enrichment, and fail-closed task-runner wiring.
Added a thin Python Chrome CDP browser backend with debugger-target discovery, evidence persistence, WebSocket transport, runtime factory, and policy-checked session actions.
Added cross-runtime browser contract fixtures so Python and TypeScript validators stay in lockstep.
Added TypeScript browser-context integration for investigations, queued tasks, and MCP queueing, including fail-closed navigation policy handling and artifact-backed browser evidence.

0.4.4 - 2026-04-20

Added

Added the production-traces contract and traffic-to-eval pipeline across Python and TypeScript, including cross-runtime schemas, emit/validate helpers, redaction, retention, dataset building, CLI/MCP surfaces, and golden integration flows.
Added the TypeScript control-plane model-routing actuator plus the published chooseModel runtime helper for deterministic route, rollout, guardrail, fallback, and trace-integrated model selection.
Added Python solve ergonomics for family overrides and improved classifier observability/fallback vocabulary for finance, schema-evolution, geopolitical simulation, and alignment-stress prompts.

Fixed

Hardened Python scenario design and solve paths around malformed designer responses, intent-drift retry feedback, mandatory calibration examples, structured quality thresholds, readable sample prompts, and schema/geopolitical simulate routing.
Preserved the latest control-plane hardening while restacking the production-traces/model-routing foundation, including candidate artifact boundary validation and model-routing payload registration.

Changed

Python and TypeScript package metadata are bumped to 0.4.4.

0.4.3 - 2026-04-17

Fixed

Hardened Pi-backed solve/runtime execution so Pi RPC waits for assistant completion, honors model/context-file options consistently, and solve runs enforce timeout budgets.
Preserved generated-scenario family behavior across solve, export, TypeScript new-scenario, and improve flows, including empty-action family specs and improve calls without an initial output.
Made custom scenario loading resilient and diagnosable: malformed specs no longer block registry discovery, spec-only directories surface actionable diagnostics, import-time missing files keep their real reason, and non-agent family specs can auto-materialize Python scenario.py sources.
Normalized structured agent-task prompt payloads before validation and code generation, so JSON-like sample inputs, reference context, preparation instructions, and revision prompts no longer crash generated runtimes.

Changed

Python and TypeScript package metadata are bumped to 0.4.3.

0.4.2 - 2026-04-16

Fixed

Preserved TypeScript workflow and custom-scenario semantics across broader scenario generation, including workflow compensation/side-effect metadata and camelCase final score weights.
Hardened Python judge, improve, simulate, and list CLI flows around timeout overrides, fresh workspaces, provider overrides, rubric guardrails, and simulation-family routing.
Added the Python autoctx investigate surface with generation fallbacks and kept its CLI implementation below the repository module-size gate.
Restored Python autoctx queue add --task-prompt ... --rubric ... compatibility for prompt-backed queued tasks, including direct ad hoc queueing without a saved spec name.

Changed

Python and TypeScript package metadata are bumped to 0.4.2.

0.4.1 - 2026-04-14

Fixed

Restored operator-loop escalation accounting when explicit escalation actions also mention clarification, so generated Python scenarios preserve both escalation and clarification signals.
Preserved operator-loop family routing through Python solve creation and replay-safe feedback validation without violating the Pydantic serialization convention.
Routed TypeScript new-scenario operator-loop requests through the dedicated family designer and allowed generated operator-loop scenarios to execute through the solve codegen path.

Changed

Python and TypeScript package metadata are bumped to 0.4.1.

0.4.0 - 2026-04-14

Changed

Refactored the TypeScript platform foundation, analytics/trace/training, and control-plane integration surfaces into thinner workflow modules while preserving CLI, MCP, and package parity.
Hardened the extracted package-surface workflows around typed MCP tool boundaries, simulation dashboard report parsing, and deterministic simulation score normalization.
Python and TypeScript package metadata are bumped to 0.4.0.

0.3.7 - 2026-04-08

Added

TypeScript autoctx campaign CLI with create, status, list, add-mission, progress, pause, resume, and cancel subcommands, completing the CLI surface for CampaignManager (AC-533).
Campaign API endpoints and MCP tools for multi-mission coordination with budget tracking and dependency graphs.

Changed

Standardized Anthropic credential loading around ANTHROPIC_API_KEY while keeping AUTOCONTEXT_ANTHROPIC_API_KEY as a compatibility alias across Python and TypeScript settings.
Added optional role-scoped credential and endpoint overrides (AUTOCONTEXT_{ROLE}_API_KEY, AUTOCONTEXT_{ROLE}_BASE_URL) for competitor, analyst, coach, and architect, falling back to the global provider configuration when unset.
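Illustration of role-scoped credential resolution: a minimal sketch, assuming Anthropic is the global provider and that ANTHROPIC_API_KEY takes precedence over the AUTOCONTEXT_ANTHROPIC_API_KEY alias. The function name and the precedence between the two global variables are assumptions; the variable names themselves come from the entries above.

```python
from typing import Mapping, Optional

ROLES = ("competitor", "analyst", "coach", "architect")


def resolve_role_api_key(role: str, env: Mapping[str, str]) -> Optional[str]:
    """Prefer AUTOCONTEXT_{ROLE}_API_KEY, then fall back to the global key."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    scoped = env.get(f"AUTOCONTEXT_{role.upper()}_API_KEY")
    if scoped:
        return scoped
    # Assumed precedence: canonical name first, compatibility alias second.
    return env.get("ANTHROPIC_API_KEY") or env.get("AUTOCONTEXT_ANTHROPIC_API_KEY")
```

Passing the environment as a mapping (for example os.environ) keeps the lookup pure and easy to exercise with different configurations.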

Fixed

Python autoctx simulate now resolves live generation through the effective architect-role runtime surface, so AUTOCONTEXT_ARCHITECT_PROVIDER and other role-routing overrides are honored instead of being bypassed by the raw client builder.
Python simulation spec normalization now tolerates LLM-friendly action/spec shapes such as postconditions, nested criteria objects, and extra action-planning metadata without failing code generation.
Structured simulation preconditions now preserve referenced action ids when LLM output includes both an action field and human-readable prose, so generated dependencies remain executable.
Regenerating a custom scenario with the same name in one process now force-reloads the generated module so solve and creator validation do not reuse stale scenario classes from sys.modules.
Pi-backed live flows now default to a 300-second timeout, reducing spurious failures in longer solve runs.
Public docs now describe operator-in-the-loop as a runnable family and no longer contradict the executable tests.

0.3.6 - 2026-04-07

Changed

Hardened bootstrap, evidence, and privacy handling so environment snapshots redact shell paths correctly, rematerialized workspaces do not retain stale artifacts, and live prompt/evidence flows now wire the collected snapshot and evidence manifest into the real loop.
Tightened scenario-generation safety in the TypeScript surface so operator_loop validation requires its real escalation/clarification hooks and spec auto-heal preserves punctuation-heavy precondition dependencies instead of dropping valid ordering.
Improved evidence and security backstops by failing closed on TruffleHog execution errors and making the evidence workspace/MCP integration rely on a materialized runtime workspace instead of dead helper-only paths.
Hardened blob-store backends so local keys cannot escape the configured root and Hugging Face bucket metadata/list/delete behavior remains accurate across fresh process boundaries.
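Illustration of a root-escape guard for local blob keys: a minimal sketch using pathlib, assuming keys are relative paths joined onto a configured root. The function name and error messages are hypothetical and do not describe the actual backend.

```python
from pathlib import Path


def resolve_blob_path(root: Path, key: str) -> Path:
    """Map a blob key to a path, refusing anything outside the root."""
    root_resolved = root.resolve()
    # resolve() collapses ".." segments; an absolute key replaces the root.
    candidate = (root_resolved / key).resolve()
    if root_resolved not in candidate.parents:
        raise ValueError(f"blob key escapes the configured root: {key!r}")
    return candidate
```

Checking the resolved path rather than the raw key string covers "..", absolute keys, and the empty key in one comparison.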
Python and TypeScript package metadata are bumped to 0.3.6.
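
The blob-store root-containment guarantee listed above is commonly enforced with a resolve-then-check pattern. The sketch below assumes a hypothetical `resolve_key` helper and is not the project's backend code.

```python
from pathlib import Path


def resolve_key(root: Path, key: str) -> Path:
    """Map a blob key to a path under `root`, rejecting escapes."""
    resolved_root = root.resolve()
    # Resolving collapses ".." segments and follows symlinks, so the
    # containment check runs against the real target path.
    candidate = (resolved_root / key).resolve()
    if not candidate.is_relative_to(resolved_root):  # Python 3.9+
        raise ValueError(f"key escapes blob root: {key!r}")
    return candidate
```

Absolute keys are rejected as well, because joining an absolute path onto the root discards the root entirely.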

0.3.5 - 2026-04-06

Changed

Stabilized the post-0.3.4 simulation path so operator-loop scenarios preserve behavioral-contract signals across multi-run, sweep, and replay flows instead of silently dropping them.
Hardened plain-language simulation execution around explicit family detection, operator-loop contract enforcement, and shared CLI engine-result handling so incomplete runs surface consistently across Python and TypeScript surfaces.
Tightened the simulation-engine implementation without regressing the repo module-size guardrail, including the compatibility shim needed by existing abstract-class filtering tests.
Python and TypeScript package metadata are bumped to 0.3.5.

0.3.4 - 2026-04-04

Changed

Added action-label and living-docs surfaces to the operator workflow, including reviewer-driven cleanup on the action-label taxonomy and living-docs maintenance path.
Landed the TypeScript/Python parity tranche for session store and the full research package, keeping the rebased cross-surface runtime behavior aligned on current main.
Folded in the pi-autocontext polish follow-up so the published Pi package line reflects the renamed extension and its best-practices cleanup.
Python and TypeScript package metadata are bumped to 0.3.4.

0.3.3 - 2026-04-03

Changed

Expanded the research surface with validated domain contracts, runtime gating, persistence hardening, and better evaluation wiring for briefs, prompts, and adapters.
Hardened Python and TypeScript operator-control surfaces around terminal lifecycle transitions, remote approvals, progress digests, and agentOS session/runtime error handling.
Improved SQLite bootstrap and migration compatibility so packaged installs and fresh databases stay aligned with the live generation schema.
Expanded the TypeScript provider compatibility surface with env-driven config for gemini, mistral, groq, openrouter, and azure-openai, and synced the public provider docs/tests to match.
Python and TypeScript package metadata are bumped to 0.3.3.

0.3.2 - 2026-04-02

Changed

Completed the TypeScript session-runtime parity pass across lifecycle management, coordinator state transitions, supervision, context pressure, remote approvals, progress digests, memory consolidation, and skill registry behavior.
Hardened the TypeScript operator control plane so terminal session and worker states stay terminal, remote approvals require connected controllers, and redirected work remains visible in progress summaries.
Python and TypeScript package metadata are bumped to 0.3.2.

0.3.1 - 2026-04-01

Changed

Python package publishing now uses the canonical PyPI name autocontext instead of autoctx.
Public install docs now reflect the package split accurately: PyPI is autocontext, while npm remains autoctx.
Python and TypeScript package metadata are bumped to 0.3.1.

0.3.0 - 2026-03-29

Added

Commands

autoctx simulate — plain-language multi-variable simulation with sweeps, replay, compare, and export.
autoctx investigate — evidence-driven diagnosis with hypotheses, confidence scoring, and unknowns.
autoctx analyze — interpret and compare runs, simulations, investigations, and missions.
autoctx train — train distilled models from curated datasets with backend selection.
Python autoctx simulate — full parity with the TypeScript surface: run, replay, compare, and export.

Scenarios

All 11 scenario families are now fully executable in TypeScript (previously 2 of 11) via secure-exec V8 isolate codegen.
operator_loop is now a fully runnable family in both packages.
Unified family classifier: all families reachable through the CLI.
Spec auto-heal: codegen failures trigger automatic recovery.
Scenario revision flow: refine created scenarios with feedback.
Deep execution validation: generated code is executed and verified before registration.
Three scenario templates: content-generation, prompt-optimization, and rag-accuracy.
new-scenario CLI materializes runnable artifacts to disk.
Scenario parity matrix documents Python/TypeScript surface coverage.

Missions & Campaigns

Adaptive mission execution: LLM-driven goal decomposition and step planning replace generic bookkeeping.
Campaign abstraction: coordinate multiple missions under long-term goals with budget tracking and dependencies.
Mission-simulation integration: missions invoke simulations as planning tools.

Trace Pipeline

Open public trace schema v1.0.0: versioned interchange format for coding agent traces.
Sensitive-data detection and redaction with policy-backed actions.
Privacy-aware trace export workflow: redaction, validation, manifest, and attestation.
Publishing connectors for local JSONL, GitHub Gist, and Hugging Face.
Trace-to-model data plane with DatasetCurator and DataPlane.
Repo-local dataset discovery: scan repo trees and convert JSONL, JSON, CSV, and markdown into ShareGPT-style records.
Curated distillation dataset pipeline with gate filtering, top-quartile selection, family filtering, and failure-example policy.
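
As a rough illustration of the top-quartile selection step in the curated dataset pipeline, the sketch below keeps the highest-scoring quarter of records. The record shape, the `score` field name, and the round-up rule are assumptions; the entry does not specify how the project defines the quartile cutoff.

```python
import math
from typing import Any, Dict, List, Sequence


def top_quartile(
    records: Sequence[Dict[str, Any]], score_key: str = "score"
) -> List[Dict[str, Any]]:
    """Return the best-scoring 25% of records, rounding the count up."""
    if not records:
        return []
    ranked = sorted(records, key=lambda r: r[score_key], reverse=True)
    keep = math.ceil(len(ranked) / 4)  # at least one record survives
    return ranked[:keep]
```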

Training & Distillation

Base model selection maps scenario families to training modes (from-scratch, LoRA, and full fine-tune).
Training backend abstraction with MLX and CUDA plus an injectable TrainingExecutor hook.
Prompt alignment ensures distilled models match runtime invocation.
Candidate-shadow-active promotion lifecycle with configurable quantitative gates and rollback.
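
The candidate-shadow-active lifecycle with quantitative gates and rollback can be pictured as a small state machine. Everything in this sketch is hypothetical (the `Stage` enum, `next_stage`, the minimum-threshold gate format, and the one-step rollback rule); it only illustrates the general shape of gate-driven promotion.

```python
from enum import Enum
from typing import Mapping


class Stage(str, Enum):
    CANDIDATE = "candidate"
    SHADOW = "shadow"
    ACTIVE = "active"


_ORDER = [Stage.CANDIDATE, Stage.SHADOW, Stage.ACTIVE]


def next_stage(
    stage: Stage, metrics: Mapping[str, float], gates: Mapping[str, float]
) -> Stage:
    """Promote one step when every gate passes, otherwise roll back one step."""
    # A gate passes when the metric exists and meets its minimum threshold.
    passed = all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in gates.items()
    )
    index = _ORDER.index(stage)
    if passed:
        return _ORDER[min(index + 1, len(_ORDER) - 1)]
    return _ORDER[max(index - 1, 0)]
```

A missing metric counts as a failed gate in this sketch, so a model cannot be promoted on incomplete evaluation data.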

Changed

Consolidated operator UI: the Python serve and tui surfaces are API/WebSocket-first, while interactive terminal UI remains available through the TypeScript client surfaces.
Richer sweep DSL: categorical sweeps, logarithmic scales, sweep file loading, and named presets.
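
For readers unfamiliar with logarithmic sweep scales, the sketch below generates geometrically spaced values between two endpoints. The `log_sweep` function and its signature are illustrative assumptions and do not reflect the sweep DSL's actual syntax.

```python
from typing import List


def log_sweep(start: float, stop: float, steps: int) -> List[float]:
    """Return `steps` values spaced evenly on a log scale, endpoints included."""
    if start <= 0 or stop <= 0:
        raise ValueError("logarithmic sweeps need positive endpoints")
    if steps < 2:
        raise ValueError("steps must be at least 2")
    # Consecutive values share a constant ratio rather than a constant gap.
    ratio = (stop / start) ** (1.0 / (steps - 1))
    return [start * ratio**i for i in range(steps)]
```

For example, `log_sweep(1, 1000, 4)` yields values close to 1, 10, 100, and 1000, up to floating-point rounding.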

Fixed

Trace pipeline audit: expanded redaction patterns, ISO 8601 timestamp validation, explicit role mapping, export warnings, and Hugging Face format fixes.
Distillation audit: training executor hook, base model validation, CSV parser edge cases, silent catches now surfaced as warnings, and end-to-end integration coverage.

0.2.4 - 2026-03-26

Added

Session notebook context now flows into runtime prompts and cockpit views for active runs.
World-state abstractions now support stateful scenario families and workflow-style scenarios.

Changed

Agent-task scaffolding and execution now use separate phased budgets.
Operator-loop scenarios remain available as typed family metadata, but executable operator-loop scaffolding has been removed so the harness no longer bakes in escalation-specific runtime behavior.
Public repo docs now include a docs landing page, package-selection guidance, an analytics/adoption guide, a release checklist, and copy-paste integration examples for CLI, MCP, Python SDK, and TypeScript usage.

Fixed

Python package fallback version metadata now matches the published 0.2.0 package version.

0.2.0 - 2026-03-15

Added

Initial public release with Python and TypeScript packages.
Generation loop with Elo-based progression gating.
Agent roles: competitor, analyst, coach, architect, and curator.
Pluggable scenarios including grid_ctf, othello, and the custom creation pipeline.
LLM judge with multi-sample evaluation.
Task runner daemon with improvement loops.
MCP server with tool implementations.
FastAPI dashboard with WebSocket events.
CLI via Typer (Python) and parseArgs (TypeScript).
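
The Elo-based progression gating listed above can be sketched with the standard Elo formulas. The rating update below is the textbook rule; the `should_advance` gate, its `min_gain` threshold, and the default K-factor of 32 are assumptions, since the entry does not describe how the project applies ratings to gate a generation.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def update_rating(
    rating: float, opponent: float, outcome: float, k: float = 32.0
) -> float:
    """Apply one Elo update; `outcome` is 1.0 win, 0.5 draw, 0.0 loss."""
    return rating + k * (outcome - expected_score(rating, opponent))


def should_advance(
    candidate_rating: float, baseline_rating: float, min_gain: float = 0.0
) -> bool:
    """Hypothetical gate: advance only if the candidate out-rates the baseline."""
    return candidate_rating - baseline_rating > min_gain
```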