Changelog
May 29, 2026
All notable changes to this project will be documented in this file.
Unreleased
Added
-
AC-728 Python parity slice 1: four base contract probes. Mirrors ts/src/control-plane/contract-probes/index.ts at the PR #957 shape (probeDirectoryContract/probeTerminalContract/probeServiceContract/probeArtifactContract) as pure Python functions in a new autocontext.control_plane.contract_probes package. Inputs / outputs / failures are Pydantic v2 frozen models (extra="forbid", arbitrary_types_allowed=True so compiled re.Pattern[str] regex objects can ride through unchanged). Same failure-kind enums as the TS surface: directory (unexpected-file, missing-file), terminal (unexpected-exit-code, missing-stdout-pattern, forbidden-stdout-pattern, missing-stderr-pattern, forbidden-stderr-pattern), service (missing-endpoint, unexpected-endpoint, wrong-interface with port + protocol normalization so tcp is the default), artifact (missing-substring, forbidden-substring, wrong-line-ending, invalid-json, missing-json-field with dotted-path JSON field lookup and the same early-return shape on JSON parse failure). The slice-1 audit invariant (every observation field is non-optional so the silent-pass shape cannot arise) is pinned by a parametrised test_expectation_against_minimum_observation_always_fails_loudly that mirrors the TS close-out audit (PR #1000). 25 new Python tests cover the per-probe pass / fail surfaces and the missing-observation pinning property. ruff + mypy clean; module is 407 lines (under the 800-line guard).
-
AC-697 slice 8: TypeScript autoctx queue add canonical subcommand. Slice 1 (PR #981) pinned the canonical CLI contract and parked TypeScript queue at intentional_gap because the runtime only exposed the legacy autoctx queue -s <spec> ... form for queue-add; slice 2 (PR #997) added queue status. This slice closes the last remaining contract gap (other than the explicitly out-of-scope mission Python entry) by promoting cmdQueue to dispatch an explicit add subcommand: the first sub-arg is inspected and stripped before the existing parseArgs/planQueueCommand workhorse runs, so autoctx queue add -s <spec> ... and the legacy autoctx queue -s <spec> ... route through the same planner. The legacy form stays registered for backward compatibility with existing automation. QUEUE_HELP_TEXT documents the canonical queue add form first and keeps the legacy alias plus queue status discoverable from autoctx queue --help. ts/README.md "Task queue" section now lists queue add, the legacy alias, and queue status so the README matches the contract and help text. Contract: queue typescript flips from intentional_gap to yes. 5 new TS tests pin the contract flip, the help-text update (canonical queue add form plus preserved -s legacy alias), and confirm the planQueueCommand/renderQueuedTaskResult shape is unchanged across the move. All prior TS contract + capabilities tests still pass; tsc --noEmit clean. AC-697 contract gaps are closed with this slice (apart from the out-of-scope mission Python entry).
-
AC-697 slice 7: Python autoctx show + autoctx watch commands. Slice 1 (PR #981) pinned show and watch as canonical paved-road commands and TS shipped them; Python had stub gaps that this slice closes. New show <run-id> [--best] [--generation N] [--json] composes the existing store.get_run + store.run_status read surfaces: --best filters to the single generation with the highest best_score; --generation N filters to a specific generation index; bare show renders all generations. Renders a per-generation table (or JSON payload with run_id, scenario, status, generations[]). New watch <run-id> [--interval N] [--json] polls store.run_status on a configurable interval (default 2 seconds), emits one human-readable line (or one JSONL row under --json) per transition, and breaks when the latest generation enters a terminal status (completed, failed, succeeded, errored). Both commands emit actionable errors with a non-zero exit code when the run id is not found. Contract: Python show and watch both flip from intentional_gap to yes. 7 new Python tests: subcommands registered at canonical paths; contract entries flipped on both runtimes; show missing-run actionable error; show --best reduces to top-scoring generation; show --generation N filters to specific row; watch breaks immediately on a terminal status (no sleep); watch missing-run actionable error. All 27 prior Python contract/serve/capabilities tests still pass. ruff + mypy clean.
-
AC-697 slice 6: autoctx serve mcp canonical path on both runtimes. Slice 1 (PR #981) pinned serve mcp as the canonical MCP-server path with mcp-serve as a registered alias on both runtimes; this slice ships the matching CLI changes. Python: serve promoted to a sub-Typer group with invoke_without_command=True so the legacy autoctx serve [--host ...] [--port ...] form continues to start the HTTP API. Three subcommands registered: serve http (explicit canonical HTTP), serve mcp (canonical MCP), and the bare serve callback (legacy HTTP form). HTTP serve body extracted to _run_http_serve(host, port) so both serve (callback) and serve http (explicit) call the same code path. MCP serve body identical to the existing mcp-serve handler (calls autocontext.mcp.server.run_server). mcp-serve top-level alias kept registered for backward compatibility with existing Claude Code MCP configurations. TypeScript: cmdServeHttp detects mcp as the first sub-arg and rewrites argv to delegate to cmdMcpServe (same delegation pattern as slice-4 cmdScenario -> cmdNewScenario). mcp-serve top-level dispatch entry kept for backward compat. Contract: both serve.mcp entries flip from intentional_gap to yes; mcp-serve stays as the contract alias. 6 new tests (3 Python: subcommands registered at canonical paths, serve typer group is invokable without subcommand, contract serve.mcp is yes on both runtimes with mcp-serve alias preserved; 3 TS: contract serve.mcp is yes, mcp-serve alias preserved, serve + mcp-serve both registered in command-registry). All 23 prior Python contract/capabilities/queue tests + 30 prior TS contract tests still pass. ruff + mypy + tsc clean.
-
AC-697 slice 5: autoctx capabilities is now contract-driven on both runtimes. The slice-1 contract pinned capabilities as the operator-facing surface for the canonical command set, but both runtimes shipped legacy/no-op implementations: TS emitted only visibleSupportedCommandNames() (command names with no aliases or per-runtime support), and Python did not ship the command at all. This slice loads docs/cli-contract.json from each runtime and emits a structured payload with the canonical commands, aliases, and per-runtime support. TypeScript: buildCapabilitiesPayload gains a contract: { schema_version, commands: [...] } field; each command carries id, path, summary, audience, maturity, aliases, runtime_support.{python,typescript}.{status,reason?}. The legacy commands: string[] field is preserved for backward compatibility. Optional contractPath parameter on buildCapabilitiesPayload lets tests override the default repo-relative path. Python: new autocontext.cli_capabilities module with build_capabilities_payload(contract_path=None) (loads contract, returns the same JSON shape) and register_capabilities_command(app, console=...) that mounts autoctx capabilities as a typer command. --json prints the structured payload via plain stdout (no rich ANSI coloring) so JSON consumers parse the output directly. Default human-readable output renders a per-command table with python/ts support status. Contract: both capabilities entries flip from intentional_gap to yes. 5 new Python tests (payload shape, paved-road command presence, intentional_gap reasons propagated, --json end-to-end, human-readable summary). 2 new TS tests (contract field with canonical commands and aliases, runtime_support enum validity). All 17 Python parity tests + 23 TS contract tests still pass. ruff + mypy + tsc clean.
-
AC-697 slice 4: TS autoctx scenario create canonical path. Mirrors slice 3 (Python typer-group refactor, PR #998) on the TypeScript side. command-registry.ts adds scenario to the DbCommandName union and registers it as a primary command. cli/index.ts adds a cmdScenario handler that detects the first sub-arg: create rewrites process.argv so the existing cmdNewScenario handler runs unchanged (DRY: scaffolding logic stays single-sourced), --help or no args print a usage banner naming the subcommand, anything else exits with an unknown-subcommand error. The legacy top-level new-scenario command stays registered as the alias the slice-1 contract pins. Contract: TS scenario.create flips from intentional_gap to yes. The TS contract parity test now does a partial multi-token check: when a contract entry claims TS support for a path.length >= 2 command, the parent token must be a registered command (catches the case where TS claims yes but didn't even mount the parent). Full multi-token subcommand verification remains future work, gated on introducing a TS subcommand registry. 3 new TS tests (scenario is registered in visibleSupportedCommandNames; TS scenario.create is yes in the contract; new-scenario is preserved as an alias); the 18 existing TS contract + status-retargeting tests and the 17 Python parity tests still pass. tsc --noEmit clean.
-
AC-697 slice 3: Python queue and scenario typer-group refactor. Promotes the two action-positional Python commands to sub-Typer groups with registered subcommands so the canonical contract paths (queue add, queue status, scenario create) appear in iter_python_command_paths(). cli_queue.register_queue_command now mounts a queue sub-Typer with invoke_without_command=True so the legacy autoctx queue -s <spec> form still routes to the add subcommand via a group callback; explicit autoctx queue add and autoctx queue status subcommands are also registered. cli_new_scenario.register_new_scenario_command extracts the scaffolding body to a module-level _scaffold_scenario_body() helper, then registers both the legacy top-level new-scenario command and a new scenario sub-Typer group with a create subcommand that delegates to the same body. Contract walker _walk_typer in cli_contract.py now yields each registered group's prefix in addition to recursing into its subcommands, so contract entries that pin a group's top-level path (e.g. queue as the umbrella) match the observed registry. Contract: Python queue.status flips from intentional_gap (the slice-2 action-positional reason) to yes; Python scenario.create flips from intentional_gap to yes; TypeScript scenario.create reason updated to point at a follow-up AC-697 slice 4 that will mirror the typer-group refactor on the TS side. 3 new Python tests (queue add and queue status registered at the canonical paths; legacy queue -s <spec> still routes to add without producing a usage error; the slice-2 "Supported actions" test repurposed to assert typer's standard subcommand-not-found banner). 17 existing slice-1 Python parity tests + 18 TS contract tests still pass after updating one slice-2 TS assertion that pinned the now-closed action-positional gap. ruff + mypy clean. The iter_python_command_paths walker change is backward-compatible: existing callers that consumed the path enumeration get a superset (all the same paths plus group-prefix entries), so no other tests broke.
-
AC-697 slice 2: TS status command retargeted from queue-pending to run-status, with queue status as the new canonical home for queue-pending counts. Slice 1 (PR #981) pinned the contract; this slice ships the matching CLI changes. TypeScript: cmdStatus now errors out when invoked without a <run-id> (no fallthrough to queue-pending), pointing operators at autoctx queue status for the queue-pending count; cmdQueue gains a subcommand dispatch that detects autoctx queue status and routes to executeStatusCommandWorkflow + renderStatusResult (the same workflow that used to drive top-level status, so the JSON output shape {"pendingCount": <int>} is preserved across the move). The existing autoctx queue -s <spec> queue-add path is unchanged for backward compatibility. Python: run_queue_command in cli_queue.py accepts action="status" (previously only "add") and emits a {"pending_count": <int>} payload via store.pending_task_count(). The Python top-level status already required a <run-id> positional, so no Python-side change was needed there. docs/cli-contract.json: TS status flips from intentional_gap to yes; TS queue.status flips from intentional_gap to yes; Python queue.status keeps intentional_gap with an updated reason explaining the action-positional dispatch ("Behavior shipped via autoctx queue status action-positional; contract walker reads Typer's registered subcommands and will not see it until a follow-up slice promotes queue to a sub-Typer group, which would break autoctx queue -s <spec> callers"). 7 new tests: 2 Python (autoctx queue status --json returns pending_count, unknown action emits a clear actionable error) + 5 TypeScript (contract entries flipped to yes for TS status + queue.status, Python queue.status retains the action-positional intentional_gap reason, summary still pins run-status as the canonical meaning, workflow shape preserved across the move). All 17 existing slice-1 Python parity tests + all 13 slice-1 TS parity tests still pass. ruff + mypy + tsc clean.
-
AC-708 slice 2c: PyTorch/CUDA-backed logistic-regression curator advisor. New autocontext.hermes.cuda_trained_advisor ships train_cuda_logistic(examples, *, epochs=200, learning_rate=0.5, l2=0.001, seed=0) and save_cuda_advisor(advisor, path). Same multinomial logistic regression architecture as slices 2a (PR #980) and 2b (PR #995) on the same fixed feature encoder; the gradient descent runs on PyTorch tensors with torch.cuda when torch.cuda.is_available(), falling back transparently to CPU torch otherwise. The checkpoint records the actual device under a device audit field ("cuda" or "cpu"); the kind stays cuda_logistic_regression either way because the backend (PyTorch) is what differs from slice 2b's MLX. HAS_CUDA_ADVISOR flag derived from importlib.util.find_spec("torch"); calling train_cuda_logistic without torch raises a clear RuntimeError naming the autocontext[cuda] extra. load_advisor already accepted cuda_logistic_regression (slice 2b reserved the kind). New autoctx hermes train-advisor --cuda --checkpoint <path> CLI subcommand wires the backend end-to-end; the slice-2b three-way --baseline/--logistic/--mlx mutex extends to four-way; passing --cuda without torch installed surfaces a loud actionable error. New cuda optional extra in pyproject.toml (torch>=2.0.0). 10 new tests: 5 platform-independent (_require_torch message, 4-way mutex with zero flags, 4-way mutex with two flags, --cuda without torch clear error, load_advisor accepts cuda_logistic_regression); 5 CUDA-gated (training shape, beats-baseline-on-separable-data, save/load round-trip preserves predictions, empty-dataset ValueError, end-to-end CLI train + checkpoint + load). Verified both paths locally: 9/10 pass with torch installed (1 skip is the not-installed error-path test); 5/10 pass + 5 cleanly skip without torch (CI default).
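The flag handling described in this entry can be sketched in plain Python. This is an illustration only, not the project's code: select_backend, require_torch, pick_device, and HAS_TORCH are hypothetical names, and only the find_spec("torch") availability check, the four mutually exclusive flags, and the autocontext[cuda] extra name come from the entry itself.

```python
import importlib.util

# Hypothetical stand-in for the HAS_CUDA_ADVISOR flag: true when torch is importable.
HAS_TORCH = importlib.util.find_spec("torch") is not None

BACKEND_FLAGS = ("baseline", "logistic", "mlx", "cuda")


def select_backend(*, baseline=False, logistic=False, mlx=False, cuda=False):
    """Return the one selected backend name; reject zero or several flags."""
    values = (baseline, logistic, mlx, cuda)
    chosen = [name for name, on in zip(BACKEND_FLAGS, values) if on]
    if len(chosen) != 1:
        raise ValueError(
            "pass exactly one of --baseline / --logistic / --mlx / --cuda "
            f"(got {len(chosen)})"
        )
    return chosen[0]


def require_torch(has_torch=HAS_TORCH):
    """Fail loudly, naming the extra to install, when torch is missing."""
    if not has_torch:
        raise RuntimeError(
            "PyTorch is not installed; install the autocontext[cuda] extra"
        )


def pick_device(cuda_available):
    """Value for the checkpoint's device audit field: 'cuda' or 'cpu'."""
    return "cuda" if cuda_available else "cpu"
```

In a real runner, pick_device would presumably be fed torch.cuda.is_available(); the sketch takes a boolean so it runs without torch installed.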
The CLI runner consolidates the three-backend dispatch into a uniform payload-construction block (training function + saver + advisor kind selected via if/elif on the flag, rest of the payload built identically) and introduces a local _fail(message) helper to consolidate the repeating "json -> stderr, else -> console.print(red), raise typer.Exit" pattern; the file lands at 789/800 lines, comfortably under the module-size guard. All 20 existing slice-2a tests and all 5 platform-independent + 5 MLX-gated slice-2b tests still pass; ruff + mypy clean. docs/agent-integration.md train-advisor section documents --cuda alongside the existing flags with an inline recipe.
-
AC-708 slice 2b: MLX-backed logistic-regression curator advisor. New autocontext.hermes.mlx_trained_advisor ships train_mlx_logistic(examples, *, epochs=200, learning_rate=0.5, l2=0.001, seed=0) and save_mlx_advisor(advisor, path). Same multinomial logistic regression architecture as slice 2a (PR #980) on the fixed feature encoder, but the gradient descent runs on MLX so the matrix multiplies can be GPU-accelerated on Apple silicon. Returns a LogisticRegressionAdvisor (the slice-2a dataclass) so the loaded checkpoint type stays uniform across backends and recommend --advisor does not need to dispatch on backend. The checkpoint JSON is the slice-2a schema with kind: "mlx_logistic_regression" + backend: "mlx" so audits can tell which backend produced a file; the extended load_advisor in trained_advisor.py now accepts logistic_regression, mlx_logistic_regression, and the reserved cuda_logistic_regression (slice 2c). HAS_MLX_ADVISOR flag derived from a guarded import mlx.core/import mlx.nn; calling train_mlx_logistic without MLX raises a clear RuntimeError naming the autocontext[mlx] extra. New autoctx hermes train-advisor --mlx --checkpoint <path> CLI subcommand wires the backend end-to-end; the existing two-way --baseline/--logistic mutex extends to three-way (--baseline/--logistic/--mlx); calling --mlx without the MLX extra installed surfaces a loud actionable error rather than crashing inside an opaque ImportError. 10 new tests: 4 platform-independent (_require_mlx message; three-way mutex with zero flags; three-way mutex with two flags; --mlx without MLX clear error); 1 schema test in the slice-2a file (load_advisor accepts mlx_logistic_regression); 5 MLX-gated (gated on HAS_MLX_ADVISOR) covering training shape, beats-baseline-on-separable-data, save/load round-trip preserves predictions, empty-dataset ValueError, end-to-end CLI train + checkpoint + load. Verified both paths on Apple silicon: 8/9 pass with MLX installed (the 1 skip is the not-installed error-path test which can't fire when MLX IS installed); 4/9 pass + 5 cleanly skipped without MLX (CI default).
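For readers unfamiliar with the model family, a minimal pure-Python sketch of multinomial logistic regression trained by full-batch gradient descent follows. It borrows only the hyperparameter names and defaults from the signature above (epochs, learning_rate, l2, seed); the function names, the (features, label) example shape, and the small random initialisation are assumptions, and the sketch uses neither MLX nor the project's feature encoder.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, features):
    """Class probabilities for one feature vector."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


def predict(weights, biases, features):
    """Index of the most probable class."""
    probs = predict_proba(weights, biases, features)
    return max(range(len(probs)), key=probs.__getitem__)


def train_softmax_regression(examples, n_classes, *, epochs=200,
                             learning_rate=0.5, l2=0.001, seed=0):
    """examples: list of (feature_vector, class_index) pairs (assumed shape)."""
    if not examples:
        raise ValueError("cannot train on an empty dataset")
    rng = random.Random(seed)
    n_features = len(examples[0][0])
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_features)]
               for _ in range(n_classes)]
    biases = [0.0] * n_classes
    n = len(examples)
    for _ in range(epochs):
        grad_w = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for features, label in examples:
            probs = predict_proba(weights, biases, features)
            for k in range(n_classes):
                # Cross-entropy gradient: predicted probability minus one-hot target.
                err = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err
                for j, x in enumerate(features):
                    grad_w[k][j] += err * x
        for k in range(n_classes):
            biases[k] -= learning_rate * grad_b[k] / n
            for j in range(n_features):
                # Mean gradient plus the L2 penalty term on the weight.
                step = grad_w[k][j] / n + l2 * weights[k][j]
                weights[k][j] -= learning_rate * step
    return weights, biases
```

On linearly separable toy data (for example, one-hot features mapped to matching class indices) this recovers the labels, which is the kind of property the "beats-baseline-on-separable-data" test name suggests.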
All 20 existing slice-2a tests still pass. ruff + mypy clean. The cli_hermes_runners.py module stays under the 800-line guard (798/800). docs/agent-integration.md train-advisor section gains the --mlx flag documentation + an inline recipe alongside the existing --baseline/--logistic examples.
-
AC-728 close-out audit: missing-observation invariant pinning tests for the four slice-1 probes (directory, terminal, service, artifact). Each probe's observation fields are non-optional at the TypeScript type layer and at the slice-5 ContractProbeSuiteSchema Zod layer, so the silent-pass shape that necessitated explicit missing-observation failure kinds in cleanup (PR #988), media (PR #985), and distributed (PR #993, slice 8) cannot arise here by construction. 5 new tests (directory requiredFiles against empty workdir; terminal requiredStdoutPatterns against empty stdout; service required against empty observed list; artifact requiredJsonFields against empty content; artifact requiredSubstrings against empty content) pin the loud-failure path so any future refactor that loosens an observation field surfaces immediately. Source-level design-note comment at the top of ts/src/control-plane/contract-probes/index.ts documents the audit conclusion so future contributors don't have to re-derive it. 132 -> 137 file total; tsc --noEmit clean. With this slice the AC-728 surface is fully shipped: directory/terminal/service/artifact probes (PR #957), cleanup probe + retrofit (PRs #983, #988), media probe (#985), distributed probe (#987), suite runner (#990), autoctx probes check CLI (#991), and autoctx probes extract CLI for all seven kinds with orphan-expectation rejection (#992, #993).
-
AC-728 slice 8: autoctx probes extract now covers cleanup, media, and distributed probe kinds. HarnessTraceSchema gains observations.cleanup (entries with optional symlink/mtime metadata), observations.media (per-path WxH, byte size, column metadata, line count, magic bytes), and observations.distributed (worldSize + per-rank reports with optional steps and observations: Record<string, string>). Matching expectation shapes: expectations.cleanup (lockfile-age policy, sidecar / backup pattern overrides, forbidSymlinks, allowedSymlinkTargets, ignoredPatterns), expectations.media (per-path expected magic bytes, dimensions, byte-size bounds, column expectations, line count), and expectations.distributed (expectedWorldSize, expectedSteps, mustMatchAcrossRanks). Each new section is rejected by superRefine when declared as an expectation without its matching observation, closing the same orphan-expectation class fixed in slice 7's PR #992 review. Per-media expectations join observations by path (mirrors the artifact convention). 11 new tests cover cleanup join + observation-only behaviour, media per-path matching + no-expectation no-op probe, distributed cross-rank divergence + observation-only pass, all four orphan-rejection paths (cleanup / media envelope / per-media path / distributed), and a seven-probe end-to-end round-trip through extractContractProbeSuite + ContractProbeSuiteSchema + runContractProbeSuite. 116 existing AC-728 cases still pass for a 127/127 file total. tsc --noEmit clean. The README's "Synthesizing a suite from a harness trace" section grew to a seven-probe example covering all observation + expectation shapes, and the EXTRACT_HELP_TEXT notes the expanded coverage.
-
AC-728 slice 7: autoctx probes extract, which synthesizes a runnable probe suite from a harness trace. New autoctx probes extract --trace <path> [--output <path>] reads a HarnessTrace JSON file that bundles both observations (what actually happened in a recorded run: terminal exit code / stdout / stderr; the workdir's present files; observed service endpoints; emitted artifacts) and optional expectations (what the operator declared should have happened: expected exit code, required / allowed / ignored files, required endpoints, per-artifact JSON-field / substring / line-ending expectations). The extractor joins observations with expectations into a ContractProbeSuite (slice 5 wire shape) ready to feed to autoctx probes check. Per-artifact expectations match observations by path; an observation without a matching expectation produces a probe with no declared substring / line-ending / JSON-field checks (the artifact's existence and content are recorded but no assertions fire). New HarnessTraceSchema (Zod) validates the trace envelope and per-kind nested shapes, all .strict() so unknown keys (typos) fail validation; reuses the slice-5 transform pattern for regex / date helpers so safeParse surfaces { success: false } for invalid regexes rather than throwing raw SyntaxError. Output goes to stdout by default (pipe-friendly for extract | check); --output <path> writes to a file (parent directories created). RegExp values in the emitted suite are serialised as { source, flags } objects so the slice-5 runner schema can re-parse them. Slice 7 supports the four AC-728 slice-1 probe kinds (terminal, directory, service, artifact); cleanup, media, and distributed extractors land in follow-up slices once their trace formats settle. New files: ts/src/control-plane/contract-probes/cli/extract.ts (the runExtract(args) in-process handler, the extractContractProbeSuite(trace) pure function, and the HarnessTraceSchema Zod schema), plus the dispatcher in ts/src/control-plane/contract-probes/cli/index.ts, which now routes the extract subcommand.
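The join-by-path rule and the { source, flags } regex serialisation can be sketched as follows. The shipped extractor is TypeScript; this Python version is only an illustration under assumptions: the function names and the dict shapes (path, content) are hypothetical, requiredSubstrings is borrowed from the artifact probe named elsewhere in this changelog, and the orphan-expectation rejection mirrors behaviour the changelog attributes to the PR #992 review rather than anything stated in this entry.

```python
import re

# Map Python regex flags to the JavaScript-style letters a { source, flags }
# object would carry (only the three flags shared by both languages).
_FLAG_LETTERS = ((re.IGNORECASE, "i"), (re.MULTILINE, "m"), (re.DOTALL, "s"))


def serialise_pattern(pattern):
    """Turn a compiled pattern into a JSON-friendly {source, flags} dict."""
    flags = "".join(letter for bit, letter in _FLAG_LETTERS if pattern.flags & bit)
    return {"source": pattern.pattern, "flags": flags}


def join_artifacts(observed, expected):
    """Join artifact observations with expectations by path.

    observed: list of {"path", "content"} dicts (assumed shape).
    expected: list of {"path", "requiredSubstrings"} dicts (assumed shape).
    """
    by_path = {exp["path"]: exp for exp in expected}
    observed_paths = {obs["path"] for obs in observed}
    orphans = sorted(set(by_path) - observed_paths)
    if orphans:
        # An expectation with nothing to check against must not pass silently.
        raise ValueError(f"expectation without observation: {orphans}")
    probes = []
    for obs in observed:
        exp = by_path.get(obs["path"], {})
        probes.append({
            "kind": "artifact",
            "path": obs["path"],
            "content": obs["content"],
            # No matching expectation means an empty list: a no-op probe.
            "requiredSubstrings": list(exp.get("requiredSubstrings", [])),
        })
    return probes
```

An observation with no matching expectation still yields a probe, so the artifact is recorded even though no assertion fires.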
21 new vitest cases (schema parses observation-only and observations+expectations forms; rejects unknown keys at the envelope and nested in observations; safeParse surfaces invalid-regex issues; observation-only terminal passes; observation-only workdir fails by default with no allowlist; terminal observation + expectation joins; workdir + directory expectation joins with ignoredPatterns; missing allowlist surfaces unexpected-file failures; per-artifact path matching; no-expectation artifact path emits a no-op probe; end-to-end round-trip through ContractProbeSuiteSchema + runContractProbeSuite; CLI --help, missing --trace, missing file, malformed JSON, schema-invalid trace, stdout emission, --output emission with parent-directory creation, end-to-end emitted-suite passes slice-5 runner schema). The 87 existing AC-728 probe + runner + check-CLI cases still pass; 108/108 file total across the contract-probes test files. Package root barrel and tests/package-export-catalogs.test.ts are unchanged (the extractor is reached through autoctx probes extract). README "Contract Probes" section gains a "Synthesizing a suite from a harness trace" subsection with a minimal trace example and the extract | check pipe pattern. tsc --noEmit clean.
-
AC-728 slice 6: autoctx probes CLI surface. New autoctx probes check --suite <path> runs a JSON-defined contract-probe suite and reports per-probe pass/fail. Exit code 0 on full pass; 1 on any failure or any load / parse error. Default output is human-readable (probes check: PASS/FAIL plus per-probe lines with failure detail); --json emits a structured ContractProbeSuiteResult payload (the discriminated-union shape from slice 5 with kind, optional label, passed, and per-probe-typed failures carrying probe-specific fields like path, rank, key, endpoint). Schema-invalid suites surface every Zod issue with its dotted path on stderr so operators can fix typos like requiredStdoutPattern (singular) at parse time rather than discover the missing expectation in a green-but-wrong run. New files: ts/src/control-plane/contract-probes/cli/check.ts (the runCheck(args) in-process handler) and ts/src/control-plane/contract-probes/cli/index.ts (the runProbesCommand(args) subcommand dispatcher). Wired into ts/src/cli/index.ts as the probes no-db command via cmdProbes, and into ts/src/cli/command-registry.ts so autoctx --help lists it. Mirrors the production-traces / instrument CLI pattern: handlers return { stdout, stderr, exitCode } with no process.exit or console inside, so tests consume the runner directly without spawning a subprocess. 12 new vitest cases (top-level help, unknown subcommand, dispatch to check, --help, missing --suite, missing file, malformed JSON, schema-invalid suite surfaces Zod issues, text PASS report, text FAIL report with per-probe detail, --json shape with discriminated-union result, dispatcher round-trip from the cli/index.ts entry); the 75 existing AC-728 probe + runner cases still pass for an 87/87 file total. tsc --noEmit clean. Closes the AC-728 acceptance criterion #2 ("The probe layer can be surfaced to agents as context or run as harness checks") in operator-visible form. A follow-up slice adds autoctx probes extract <trace>, which reads a recorded trace and synthesizes a probe suite by extracting observations.
-
AC-728 slice 5: contract-probe suite runner. New ts/src/control-plane/contract-probes/runner.ts adds runContractProbeSuite(suite) (pure function that dispatches a JSON-defined probe spec across all seven AC-728 probes and aggregates results), loadContractProbeSuite(path) (file loader mirroring the cli-contract.ts pattern), and ContractProbeSuiteSchema (Zod schema validating the JSON wire format with a discriminated union over the seven probe kinds). Schema includes ContractProbeKindEnum, ContractProbeInvocation, ContractProbeFailure, ContractProbeRunResult, ContractProbeSuiteResult types. RegExp values can be serialised as either a bare string ("^trace\\.") or { source, flags? }; ISO-8601 strings transform to Date objects for cleanup probe now / per-entry mtime; malformed dates raise invalid ISO-8601 date: .... Aggregate result passed is the AND of per-probe passes; per-probe results carry kind, optional label (caller-supplied attribution string), passed, and a cross-kind ContractProbeFailure[] (each failure has at minimum kind and message; specific probes attach kind-specific extras like path, rank, key, endpoint). 13 new vitest cases (empty suite passes, unknown probe kind rejected, schema_version != 1 rejected, RegExp string transform, ISO-8601 to Date transform, malformed date rejected, exhaustive 7-kind dispatch with all passing, suite passed is AND across probes, failure entries carry kind + label, JSON file load + parse, missing file throws, malformed JSON throws). 55 existing AC-728 probe cases still pass; 68 total tests across the two files. Package root barrel re-exports runContractProbeSuite, loadContractProbeSuite, ContractProbeSuiteSchema, ContractProbeKindEnum and the new types; tests/package-export-catalogs.test.ts pins the public surface. tsc --noEmit clean. Wires the AC-728 acceptance criterion #2 ("The probe layer can be surfaced to agents as context or run as harness checks") at the library level; a follow-up slice adds the autoctx probes CLI command + trace-replay extractor on top of this foundation.
-
AC-728 slice 3: media / tabular contract probe. New probeMediaContract in ts/src/control-plane/contract-probes/index.ts closes the "media/data artifact dimensions, encoding, headers, and units" item from the original AC-728 ticket. Seven failure kinds (wrong-magic-bytes, wrong-dimensions, wrong-byte-size, wrong-column-count, missing-column, wrong-line-count, missing-observation) cover format header bytes, image / video dimensions, byte-size bounds, tabular column-count and column-name presence, and JSONL / CSV line counts. When the caller declares an expectation but the matching observation is undefined, the probe emits missing-observation rather than silently passing, because a corrupt artifact or a broken metadata extractor would otherwise satisfy the contract by omitting observations. When the caller does not declare an expectation about a field, that field is not checked. Pure function, no IO. 12 new vitest cases (clean-pass, all-expectations-match, magic-byte mismatch, width / height mismatch, byte-size below min / above max, column-count mismatch, missing required column, line-count mismatch, declared-but-unobserved blanket fail across all seven fields, byte-size missing-observation with only one bound set, no-expectation-declared still passes); the prior 39 AC-728 cases still pass for a 51/51 file total. Package root barrel (ts/src/index.ts) re-exports probeMediaContract and its types; tests/package-export-catalogs.test.ts pins the public surface. tsc --noEmit clean.
-
AC-728 cleanup-probe missing-observation retrofit. Applies the PR #985 review lesson (declared expectations without observations must fail rather than silently pass) to probeCleanupContract. New missing-observation failure kind added to CleanupContractFailureKind; two surfaces now emit it: (1) when maxLockfileAgeMs is set but a matched lockfile entry has no mtime, the probe fails with missing-observation rather than skipping the age check (a stat-failing extractor would otherwise satisfy the age contract by omitting mtime); (2) when allowedSymlinkTargets is set but a symlink entry has no symlinkTarget, the probe fails with missing-observation rather than treating the target as <unknown> and letting a broken extractor pass the allowlist contract. The pre-existing "no expectation declared → no failure" invariant is preserved: callers who declare neither maxLockfileAgeMs nor allowedSymlinkTargets still get the same behavior as before. 4 new tests (lockfile-without-mtime + age contract fails; lockfile-without-mtime without age contract still passes via the unconditional-flag path; symlink-without-target + allowlist fails; symlink-without-target without any symlink contract passes); the prior AC-728 cases still pass. tsc --noEmit clean.
-
AC-728 slice 4: distributed / multi-process contract probe. New probeDistributedContract in ts/src/control-plane/contract-probes/index.ts closes the "distributed/multi-process parity checks beyond world-size 1" item from the original AC-728 ticket. Distributed tensor code can pass shallow checks (process started, gradient computed locally) and still fail multi-rank parity; this probe catches the cross-rank invariants. Six failure kinds (wrong-world-size, missing-rank, duplicate-rank, rank-divergence, wrong-step-count, missing-observation) cover observed-vs-expected world size, every rank in [0, worldSize) reporting (no missing rank, no duplicate report), per-key cross-rank observation equality for keys listed in mustMatchAcrossRanks (the divergence message enumerates the distinct values so the caller can see which value disagrees), and per-rank step-count parity against expectedSteps. Pure function: the caller does the runtime IO (torchrun / NCCL / MPI / whatever collects per-rank reports) and passes a DistributedRankReport per rank; the probe verifies. Same posture as the AC-728 slice 1, 2, 3 probes. Mirrors the PR #985 review lesson: a declared expectation without its observation fails as missing-observation rather than silently passing; e.g. mustMatchAcrossRanks: ["final_loss"] against a rank that did not report final_loss fails, as does expectedWorldSize against an undefined worldSize. 10 new vitest cases (clean 4-rank pass, wrong-world-size, missing-rank, rank-divergence with distinct-value enumeration, wrong-step-count per rank, missing-observation for both must-match keys and world size, no-expectation-declared still passes, world-size-1 degenerate pass, duplicate-rank guard); the 29 existing AC-728 slice 1/2 cases still pass for a 39/39 file total. Package root barrel (ts/src/index.ts) re-exports probeDistributedContract and its types alongside the existing AC-728 probes; tests/package-export-catalogs.test.ts pins the public surface so the export cannot silently disappear. tsc --noEmit clean.
-
AC-728 slice 2: cleanup contract probe. New probeCleanupContract in ts/src/control-plane/contract-probes/index.ts (alongside the AC-728 slice 1 directory/terminal/service/artifact probes from PR #957) catches the leftover-artifact class of contract bugs the directory probe alone can miss: broken symlinks, symlinks forbidden by contract or pointing outside an allowedSymlinkTargets allowlist, stale lockfiles (configurable maxLockfileAgeMs with now injection for deterministic tests; defaults to flagging every lockfile unconditionally when no threshold is set), editor / OS sidecars (vim swap .swp/.swo, emacs-style ~, macOS .DS_Store, LibreOffice .~lock.*#), and backup copies (.bak, .orig). Five failure kinds (stray-symlink, broken-symlink, stale-lockfile, stray-sidecar, stray-backup) with per-failure human-readable message. Reuses the existing isIgnored helper so ignoredPatterns semantics match probeDirectoryContract. Pure function: the probe does no filesystem IO, so it composes with the same trace-replay surfaces the slice 1 probes already use. Caller passes a directory listing as CleanupFileEntry records (path, optional isSymlink/symlinkTarget/symlinkBroken/mtime); default sidecar and backup patterns are intentionally narrow so the probe does not false-positive against legitimate dotfiles. 11 new tests (clean-directory passes, broken-symlink always fails, forbidSymlinks blanket fail, allowedSymlinkTargets allowlist, default sidecar / backup detection, lockfile unconditional and age-thresholded, ignoredPatterns parity with the directory probe, caller pattern overrides); the 18 existing AC-728 slice 1 tests still pass for a 29/29 file total. tsc --noEmit clean. -
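As a rough illustration of the default sidecar / backup detection and the lockfile age rule in the AC-728 slice 2 entry, here is a small Python sketch. It is not the shipped TypeScript probe: the function names are hypothetical, the glob spellings are my reading of the patterns listed in the entry, and the strict "older than the threshold" comparison is an assumption.

```python
import fnmatch
import posixpath
from typing import Optional

# Patterns transcribed from the changelog entry; matched against the basename only.
SIDECAR_PATTERNS = ("*.swp", "*.swo", "*~", ".DS_Store", ".~lock.*#")
BACKUP_PATTERNS = ("*.bak", "*.orig")


def classify_leftover(path: str) -> Optional[str]:
    """Return a failure kind for a stray sidecar or backup file, else None."""
    name = posixpath.basename(path)
    if any(fnmatch.fnmatchcase(name, pattern) for pattern in SIDECAR_PATTERNS):
        return "stray-sidecar"
    if any(fnmatch.fnmatchcase(name, pattern) for pattern in BACKUP_PATTERNS):
        return "stray-backup"
    return None


def is_stale_lockfile(
    mtime_ms: float, now_ms: float, max_age_ms: Optional[float] = None
) -> bool:
    """With no threshold every lockfile is flagged; otherwise only old ones."""
    if max_age_ms is None:
        return True
    # Assumption: "stale" means strictly older than the threshold.
    return now_ms - mtime_ms > max_age_ms
```

Passing now_ms explicitly mirrors the "now injection" idea: tests can pin the clock instead of depending on wall time.

-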
AC-697 slice 1: shared CLI contract + per-runtime parity tests. New docs/cli-contract.json is the single source of truth for the canonical autoctx surface (17 commands so far: the six paved-road plus the highest-friction items from the ticket). New autocontext.cli_contract (Python loader with frozen Contract/CommandSpec/Flag/RuntimeSupportPair value types, RuntimeStatus StrEnum, iter_python_command_paths Typer introspection helper, PAVED_ROAD constant) + ts/src/cli/cli-contract.ts (Zod-validated TypeScript loader with matching PAVED_ROAD constant and resolveAlias helper). Both runtimes' parity tests load the same JSON: 17 Python tests in tests/test_cli_contract.py and 13 TypeScript tests in ts/tests/cli-contract-ac697.test.ts cover schema sanity (no duplicate ids, alias uniqueness, intentional-gap reasons required, audience tier + domain concept validity, paved-road constant matches audience filter), runtime parity (every runtime_support.<runtime> == "yes" claim must resolve to a registered command at the canonical path), and AC-697 friction-point invariants pinned in the contract (status canonical meaning is run status; solve is not a domain noun; --iterations is the canonical iteration flag; queue.status does not occupy top-level status). Each intentional_gap entry carries a non-empty reason so reviewers can tell apart "decided not to ship" from "forgot to implement" and trace which AC-697 follow-up slice owns the fix. The contract is intentionally small in slice 1 (paved road + friction points); follow-up slices fill in the remaining 30+ commands and ship the actual semantics fixes (status retargeting, queue add parity, alias plumbing, paved-road help view, capabilities-from-contract). -
AC-708 slice 2a: pure-Python logistic-regression curator advisor. New autocontext.hermes.trained_advisor exposes LogisticRegressionAdvisor (frozen value type carrying learned weights, intercepts, label order, fixed feature encoder), train_logistic(examples, *, epochs=200, learning_rate=0.5, l2=0.001, seed=0) (multinomial logistic regression via gradient descent on softmax cross-entropy), predict + predict_proba (calibrated per-label probabilities summing to 1), and save_advisor/load_advisor for JSON checkpoint round-trip with a stable schema (kind: "logistic_regression", version: 1). Implements the existing Advisor Protocol so the AC-708 slice 1 evaluate and the AC-709 recommend work unchanged. New autoctx hermes train-advisor --logistic --output metrics.json --checkpoint advisor.json trains and persists; autoctx hermes recommend --advisor advisor.json --home ~/.hermes --output recs.jsonl loads and emits recommendations, which closes the AC-705 → AC-708 → AC-709 loop end-to-end with a real trained advisor. --baseline/--logistic are mutually exclusive (caller picks one explicitly); --baseline-from/--advisor on recommend are likewise mutually exclusive. Same-file guards reject --checkpoint equal to either --data (would clobber the source dataset) or --output (would clobber the metrics payload). load_advisor rejects dimension-invalid checkpoints (mismatched label / weights / intercepts row counts, or any weights row whose length disagrees with feature_names) so the failure surfaces at the file rather than later inside predict_proba. Pure Python (no numpy/sklearn/GPU dep) so the trained backend runs in CI smoke mode against fixture-sized data; MLX (slice 2b) and CUDA (slice 2c) backends ship behind the same Advisor Protocol + checkpoint schema (kind discriminator rejects foreign checkpoints).
19 tests cover learned-weight shape, Advisor-Protocol dispatch, predict_proba normalization, beats-baseline-on-separable-data, deterministic-for-same-seed, empty-dataset rejection, single-label graceful fallback, save/load round-trip, stable checkpoint schema, unknown-kind/missing-file/corrupt-JSON/dimension-mismatch load errors, the two --checkpoint same-file guards, and CLI integration (train --logistic + recommend --advisor end-to-end). Total hermes/spec-verifier/module-size tests pass; ruff + mypy clean. -
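To make "multinomial logistic regression via gradient descent on softmax cross-entropy" concrete, here is a minimal pure-Python sketch. It is not the code in autocontext.hermes.trained_advisor: the function names, the plain list-of-floats feature input, and the small random initialisation are all assumptions; only the hyperparameter defaults are taken from the train_logistic signature quoted in the entry.

```python
import math
import random
from typing import List, Sequence, Tuple


def predict_proba(
    weights: Sequence[Sequence[float]],
    intercepts: Sequence[float],
    x: Sequence[float],
) -> List[float]:
    """Softmax over per-label linear scores; the result sums to 1."""
    scores = [b + sum(w * v for w, v in zip(row, x)) for row, b in zip(weights, intercepts)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_softmax(
    xs: Sequence[Sequence[float]],
    ys: Sequence[str],
    labels: Sequence[str],
    *,
    epochs: int = 200,
    learning_rate: float = 0.5,
    l2: float = 0.001,
    seed: int = 0,
) -> Tuple[List[List[float]], List[float]]:
    """Full-batch gradient descent on softmax cross-entropy with L2 on weights."""
    rng = random.Random(seed)  # seeded so the same inputs give the same weights
    n, d, k = len(xs), len(xs[0]), len(labels)
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(d)] for _ in range(k)]
    intercepts = [0.0] * k
    for _ in range(epochs):
        grad_w = [[0.0] * d for _ in range(k)]
        grad_b = [0.0] * k
        for x, y in zip(xs, ys):
            probs = predict_proba(weights, intercepts, x)
            for c in range(k):
                # Gradient of cross-entropy w.r.t. the score: p_c - 1[y == c].
                err = probs[c] - (1.0 if labels[c] == y else 0.0)
                grad_b[c] += err / n
                for j in range(d):
                    grad_w[c][j] += err * x[j] / n
        for c in range(k):
            intercepts[c] -= learning_rate * grad_b[c]
            for j in range(d):
                weights[c][j] -= learning_rate * (grad_w[c][j] + l2 * weights[c][j])
    return weights, intercepts
```

The weights-plus-intercepts pair is plain nested lists, which is what makes a JSON checkpoint round-trip straightforward without numpy.

-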
AC-770 + AC-771: two new rules in the AC-769 remediation router. rule_threshold_budget (AC-770) emits a new BudgetIncrease(parameter, current, suggested_factor, reason) hint when an assertion error matches the k/total at N trials or at N trials: k/total shape with a near-zero pass rate (k <= 25% of total), or contains insufficient samples / convergence not reached. Factor heuristic: 16x for k == 0 (the c32 marker from the Cryptopals validation campaign), 4x otherwise. rule_indexing_base (AC-771) emits a new IndexingCheck(reason) hint when a near-zero k/N hits|bytes recovered failure shape pairs with source code containing a Z_N/index_N/idx_N identifier alongside a position = N / index = N constant; this is the c56 shape where literature naming (1-indexed Z_16) and code (0-indexed position = 16) disagree. When source code is unavailable, a low-confidence generic hint still fires for 0/N failures so the agent considers indexing as a candidate. The router signature gains a source_code: str | None = None kwarg, forwarded to all rules; existing AC-769 callers are unaffected (kwarg has a None default). Both new hints render through render_hints() with human-readable descriptions. 19 new tests (8 threshold-budget, 6 indexing-base, 3 router integration, 2 rendering) plus the existing 22 AC-769 tests all pass; lint + mypy clean. -
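The threshold-budget heuristic in the AC-770 entry can be sketched as follows. This is an assumption-laden illustration, not the shipped rule_threshold_budget: the regular expressions are my guess at the two message shapes, suggested_budget_factor is a hypothetical name, and the keyword triggers (insufficient samples / convergence not reached) are left out because the entry does not say which factor they receive.

```python
import re
from typing import Optional

# Assumed spellings of the two shapes named in the entry.
_RATE_THEN_TRIALS = re.compile(r"(\d+)\s*/\s*(\d+)\s+at\s+\d+\s+trials")
_TRIALS_THEN_RATE = re.compile(r"at\s+\d+\s+trials:\s*(\d+)\s*/\s*(\d+)")


def suggested_budget_factor(message: str) -> Optional[int]:
    """Return 16 or 4 for a near-zero pass rate, or None when no hint applies."""
    match = _RATE_THEN_TRIALS.search(message) or _TRIALS_THEN_RATE.search(message)
    if match is None:
        return None
    k, total = int(match.group(1)), int(match.group(2))
    # Near-zero means k <= 25% of total; integer arithmetic avoids float rounding.
    if total == 0 or k * 4 > total:
        return None
    return 16 if k == 0 else 4
```

For example, a message containing 0/32 at 64 trials yields 16, while 18/20 at 100 trials yields None because the pass rate is not near zero.

-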
AC-711: validate the Hermes autocontext skill against realistic agent prompts via a static content rubric. New autocontext.hermes.skill_validation exposes TaskPrompt, ExpectedBehavior, ValidationCase, ValidationResult, ValidationReport value types plus the validate_skill() entry point and a DEFAULT_RUBRIC covering all six AC-711 fixture prompts (evaluate_and_improve, export_best_as_skill, look_at_curator_reports, use_local_mlx_to_train, mcp_vs_cli, improve_curator_without_replacing). Six typed predicates enforce the AC-711 evaluation criteria: prefers_cli_when_mcp_unconfigured, uses_mcp_only_when_configured, never_mutates_hermes_skills_for_inspect_or_train, explains_privacy_before_session_ingest, documents_export_skill_path, separates_curator_and_autocontext_responsibilities. New autoctx hermes validate-skill --output report.md --json runs the rubric and exits non-zero on any failure so CI gates skill drift. Skill patch: building the rubric surfaced a real gap, namely that the shipped SKILL.md had no explicit privacy posture for session/trajectory imports. Per AC-711 deliverable, hermes/skill.py now ships a Privacy Before Session and Trajectory Ingest section distinguishing Curator decision reports (safe metadata) from sessions/trajectories (raw content), documenting --redact standard|strict|off plus --dry-run, naming autoctx hermes ingest-sessions and autoctx hermes ingest-trajectories as the affected commands. The AC-712 committed skills/autocontext/SKILL.md snapshot is regenerated against the patched renderer so the AC-712 sync invariant test stays green after this PR lands. Three negative regression tests (CLI-first guidance stripped, every privacy keyword stripped, export-skill stripped) prove the rubric has teeth. Validation results recorded at docs/hermes-skill-validation.md. 17 rubric tests pass; total hermes-cluster tests pass. -
AC-712: distribution path for the Hermes autocontext skill. Ships a committed snapshot of render_autocontext_skill() at skills/autocontext/SKILL.md plus the four AC-702 references under skills/autocontext/references/. CI sync invariant (autocontext/tests/test_hermes_skill_distribution.py, 5 tests) pins the committed bytes byte-for-byte to the renderer and rejects orphan reference files, so the snapshot can never drift silently. New docs/hermes-skill-distribution.md documents three install paths (Option A: autoctx hermes export-skill --output ~/.hermes/skills/autocontext/SKILL.md --with-references; Option B: curl raw URLs from main or a pinned SHA; Option C: shallow + sparse git clone), the /reload-skills reload story, frontmatter-based versioning, and the local-edits-as-fork pitfall. Upstream Hermes submission and agentskills.io / hub registration are scoped as AC-712 follow-ups so the supported install matrix is unblocked today without waiting on external approval. DRY: the renderer is still the single source of truth; the committed snapshot is generated via the shipped CLI and re-generated the same way. -
AC-707 (spike): Hermes plugin emitter prototype + decision doc. New autocontext.hermes.plugin_emitter module ships a fail-open HermesTraceEmitter orchestrator with LLMCallEvent/ToolCallEvent value types, a TraceSink Protocol, and a LocalJsonlSink concrete write surface. The emitter reuses the existing RedactionPolicy (DRY with AC-706) and production_traces.emit.build_trace (DRY with AC-704 / AC-706) so a future production plugin can adopt the shape without redesigning anything in autocontext. Decision documented at docs/hermes-plugin-emitter-spike.md: DEFER until either a concrete operator workflow demands the extra fidelity (sub-second timing, structured tool calls, provider usage) or Hermes publishes a stable plugin API contract. The file importers (AC-704 / AC-706) plus the advisor pipeline (AC-708 / AC-709) cover the current operator scenarios; paying the cross-package contract cost now would not unlock any active payoff thread. 12 tests pin the safety properties (sink fail-open, hook fail-open, late finalize ignored, concurrent sessions isolated, no network IO in default mode, shared-policy redaction, ProductionTrace shape) so a future revisit is glue work, not a green-field rewrite. AC-707 closed. -
AC-709: autoctx hermes recommend --home ~/.hermes --baseline-from training/hermes-curator-decisions.jsonl --output recommendations.jsonl [--include-protected] [--json] is the read-only recommendation surface. Trains a baseline advisor on AC-705 export data, walks the live Hermes inventory, and emits one JSONL row per recommendation. New autocontext.hermes.recommendations module exposes Recommendation (skill_name, predicted_action, confidence, status, features, reason) and recommend(inventory, advisor, *, include_protected, reason). Read-only invariant: never writes to ~/.hermes; Curator stays the mutation owner. Protected skills (pinned / bundled / hub provenance) are filtered out by default so a recommendation cannot mistakenly target upstream-owned content; --include-protected surfaces them tagged status="protected" for audit. Same-file guard on --baseline-from/--output mirrors the AC-706 / AC-708 ingest posture. Slice-1 refactor of autocontext.hermes.advisor: introduces SkillFeatures as the inference-time input shape so advisors take features (not labeled examples), with CuratorDecisionExample.features bridging training to inference cleanly. BaselineAdvisor.predict(features) is unchanged behaviorally; the slice-1 tests update one direct call site. 13 recommendation tests + 1 refactor regression cover features bridge, advisor protocol, protected-skill filtering, include-protected audit path, JSON round-trip, default rationale per advisor type, and 4 CLI integration tests (success, same-file guard, empty training rejection, all-protected empty-output, include-protected surfacing). 186 total hermes tests pass. -
AC-769: failure-type → remediation routing on top of FailureReport. New autocontext/src/autocontext/loop/remediation_router.py pattern-matches a FailureReport (plus optional AC-767 fixtures map) into typed RemediationHint instances. Three built-in rules ship: rule_off_by_one (matches "expected X, got Y" where diff ∈ {1, BLOCK, BLOCK²} for common block sizes, plus "off-by-N" keywords) → SmallCaseVerify; rule_positional_typerror (matches TypeError: foo() takes N positional arguments and extracts modules from File "..." traceback lines) → SurfaceSignatures; rule_stale_fixture (matches missing-substring failures referencing a fixture key whose cached payload is older than stale_after_days) → RefreshFixture. Rules are pluggable via a Rule Protocol and DEFAULT_RULES list. route_remediations(report, *, fixtures, stale_after_days, rules) runs every rule and concatenates hints in order; render_hints(hints) emits a ## Suggested next moves prompt block. Wired into the tree-search refinement loop (loop/stage_tree_search.py): HypothesisNode gains last_errors: list[list[str]], HypothesisTree.update accepts an optional errors_per_match kwarg, and the refinement-prompt build site calls remediation_hints_for_node(selected, fixtures=ctx.fixtures) then threads the result into build_refinement_prompt(remediation_hints=...). build_refinement_prompt gains a remediation_hints: str = "" opt-in kwarg (existing callers unchanged). 23 tests cover rules, router, render, the stage_tree_search wiring helper, and an end-to-end test through build_refinement_prompt. -
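The rule / router / render split in the AC-769 entry can be illustrated with a simplified Python sketch. It is not the shipped remediation_router: rules here take a raw error string rather than a FailureReport, hints are plain strings rather than typed RemediationHint instances, the "off-by-N" keyword branch is omitted, and the block-size list is an assumption because the entry only says "common block sizes".

```python
import re
from typing import Callable, List, Sequence

BLOCK_SIZES = (8, 16, 32, 64)  # assumption: the real list is not given in the entry
_EXPECTED_GOT = re.compile(r"expected (-?\d+), got (-?\d+)")

Rule = Callable[[str], List[str]]


def rule_off_by_one(error: str) -> List[str]:
    """Fire when |expected - got| is 1, a block size, or a squared block size."""
    match = _EXPECTED_GOT.search(error)
    if match is None:
        return []
    diff = abs(int(match.group(1)) - int(match.group(2)))
    suspicious = {1} | set(BLOCK_SIZES) | {b * b for b in BLOCK_SIZES}
    if diff in suspicious:
        return [f"SmallCaseVerify: a difference of {diff} suggests a boundary or block-size slip"]
    return []


def route(error: str, rules: Sequence[Rule]) -> List[str]:
    """Run every rule and concatenate the hints in rule order."""
    hints: List[str] = []
    for rule in rules:
        hints.extend(rule(error))
    return hints


def render(hints: Sequence[str]) -> str:
    """Render hints as a prompt block; an empty hint list renders as nothing."""
    if not hints:
        return ""
    return "## Suggested next moves\n" + "\n".join(f"- {hint}" for hint in hints)
```

Returning an empty string when nothing fires is what lets a caller pass the result straight into an opt-in prompt argument whose default is also the empty string.

-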
AC-767 (docs follow-up): operator-facing documentation for the authoritative ground-truth fixture loader landed in #968. New autocontext/docs/fixture-loader.md covers quick-start (drop a manifest at autocontext/knowledge/<scenario>/fixtures.json, set AUTOCONTEXT_FIXTURE_LOADER_ENABLED=true), manifest format (key, source, optional expected_sha256), cache semantics (rehash on read, source-URL change invalidates, missing manifest is a no-op), programmatic API (FixtureManifest, FixtureCache, UrlFetcher, load_scenario_fixtures, render_fixtures), and the settings reference. No code changes; the implementation already shipped via #968. -
AC-708 (slice 1): autoctx hermes train-advisor --data <jsonl> --baseline --output metrics.json lays down the data + evaluation contract for the local Hermes curator advisor. New autocontext.hermes.advisor module exposes a DDD domain layer: CuratorDecisionExample value type loaded from AC-705 export JSONL, BaselineAdvisor (always-majority-class with deterministic tie-break in CANONICAL_LABELS order), LabelMetrics/AdvisorMetrics (per-label precision/recall + overall accuracy + insufficient_data flag), train_baseline(), and evaluate(). load_curator_examples is per-line tolerant (matches AC-704 / AC-706 ingest posture): malformed JSON, missing required fields, and unknown labels skip the row rather than aborting. INSUFFICIENT_DATA_THRESHOLD = 20 floors when per-label metrics are meaningful: datasets below the floor still get metrics back but with the flag set, addressing the AC-708 acceptance criterion "a clear 'not enough data' failure mode for small Hermes homes". The baseline establishes the floor every later trained advisor (slice 2: logistic regression / MLX / CUDA, AC-709 recommendation surface) must beat without redesigning the data contract. 15 tests cover loader robustness, baseline determinism, per-label precision/recall on a known fixture, insufficient-data thresholds, JSON-serializable metrics, and CLI integration (--baseline --json --output, insufficient-data warning, empty-dataset rejection). -
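The always-majority-class baseline with a deterministic tie-break can be sketched in a few lines of Python. This is illustrative only: train_majority_baseline is a hypothetical name, the returned dict is not the real AdvisorMetrics shape, and the order of CANONICAL_LABELS is assumed from the order in which the four labels are listed elsewhere in this changelog.

```python
from collections import Counter
from typing import Dict, Sequence

# Assumed order; the entry names the constant but not its contents.
CANONICAL_LABELS = ("consolidated", "pruned", "archived", "added")
INSUFFICIENT_DATA_THRESHOLD = 20


def train_majority_baseline(labels: Sequence[str]) -> Dict[str, object]:
    """Pick the most frequent canonical label; ties resolve in canonical order."""
    counts = Counter(label for label in labels if label in CANONICAL_LABELS)
    if not counts:
        raise ValueError("cannot train a baseline on an empty dataset")
    best = max(counts.values())
    majority = next(label for label in CANONICAL_LABELS if counts.get(label, 0) == best)
    return {
        "majority_label": majority,
        # Small datasets still get a result, but flagged as not meaningful.
        "insufficient_data": sum(counts.values()) < INSUFFICIENT_DATA_THRESHOLD,
    }
```

Iterating over the canonical tuple rather than over the counter is what makes the tie-break independent of input order.

-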
AC-706 (slice 2): autoctx hermes ingest-sessions --home ~/.hermes --output traces/hermes-sessions.jsonl --redact standard|strict|off [--since <ISO>] [--limit n] [--dry-run] reads the Hermes session SQLite DB (<home>/state.db) in read-only URI mode and writes one autocontext ProductionTrace JSONL row per session. New autocontext.hermes.sessions module exposes a DDD domain layer: HermesSession, HermesMessage, HermesSessionRepository (read-only SQLite + schema-drift tolerance + WAL/SHM sidecar independence), and SessionDBMissing for the "no DB to ingest" boundary. New autocontext.hermes.session_ingest is the application service that maps domain objects into ProductionTraces via the same production_traces.emit.build_trace helper that AC-704 uses (DRY). Per-message content goes through the shared RedactionPolicy from slice 1 (DRY across both ingest paths), so a strict-mode user-pattern set behaves identically for trajectories and sessions. The RAW_CONTENT_WARNING opt-in marker from slice 1 is reused so --redact off --json surfaces the same audit signal for sessions. Per-trace metadata carries session_id, agent_id, session_started_at, session_ended_at, session_metadata, and source: "hermes.session". Missing DB returns an empty summary (graceful, exit 0). 10 repository tests cover read-only refusal, missing-DB error path, since-filter, sequence order, schema drift (extra and missing columns), WAL/SHM-less open, and corrupt metadata JSON. 13 ingester tests cover end-to-end emission, shared-policy redaction, since/limit/dry-run, importer-never-mutates-DB invariant (mtime + size check), --redact off warning surfacing, per-trace metadata, invalid --since rejection, and CLI integration. AC-706 closed. -
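"Read-only URI mode" refers to a standard SQLite feature that Python's sqlite3 module exposes through uri=True. The sketch below shows the general technique rather than the HermesSessionRepository implementation; open_read_only and the SessionDBMissingError stand-in are hypothetical names.

```python
import sqlite3
from pathlib import Path


class SessionDBMissingError(Exception):
    """Hypothetical stand-in for the 'no DB to ingest' boundary."""


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt is refused."""
    if not db_path.is_file():
        raise SessionDBMissingError(str(db_path))
    # Path.as_uri() yields a file:// URI; mode=ro asks SQLite for a read-only handle.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, a stray INSERT raises sqlite3.OperationalError instead of mutating the source database, which is the property an importer-never-mutates-DB test would rely on.

-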
AC-706 (slice 1): autoctx hermes ingest-trajectories --input <jsonl> --output <jsonl> --redact standard|strict|off reads a Hermes trajectory JSONL file (ShareGPT-like, line-per-trajectory) and writes a redacted copy. Default --redact standard runs the existing sharing/redactor pipeline (Anthropic / OpenAI / AWS / GitHub / Slack keys, bearer tokens, emails, IPs, env values, absolute paths, high-risk file refs). --redact strict requires --user-patterns (a JSON array of {name, pattern} regex objects) and tags hits as [REDACTED_USER_PATTERN:<name>]. --redact off writes raw content and surfaces a CLI warning on the privacy posture (AC-706 requires explicit operator opt-in). --dry-run reports redaction counts without writing the output (AC-706 privacy preview). Per-line tolerance: corrupt JSON, non-object trajectories, and blank lines are skipped (not aborted) with per-line warnings. The redaction stats are returned per-category so operators can audit what was removed. New autocontext.hermes.redaction module exposes RedactionPolicy, compile_user_patterns, and redact_text as the shared policy surface that the AC-706 slice 2 (sessions) will reuse. 11 redaction-policy tests + 13 trajectory-ingester tests (including the CLI subcommand entry point and the input-never-mutated invariant). AC-706 slice 2 (ingest-sessions from ~/.hermes/state.db with WAL/SHM tolerance and schema drift) is a follow-up; this slice ships the redaction primitives and the simpler JSONL surface first. -
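The strict-mode user patterns can be pictured with a short Python sketch. It is not the autocontext.hermes.redaction module: load_user_patterns and apply_user_patterns are hypothetical names chosen to avoid implying the real compile_user_patterns / redact_text signatures, and returning a per-name hit count is my reading of "redaction stats are returned per-category".

```python
import json
import re
from typing import Dict, List, Pattern, Tuple


def load_user_patterns(raw_json: str) -> List[Tuple[str, Pattern[str]]]:
    """Parse a JSON array of {name, pattern} objects into compiled regexes."""
    entries = json.loads(raw_json)
    return [(entry["name"], re.compile(entry["pattern"])) for entry in entries]


def apply_user_patterns(
    text: str, patterns: List[Tuple[str, Pattern[str]]]
) -> Tuple[str, Dict[str, int]]:
    """Replace every hit with a named tag and count hits per pattern name."""
    counts: Dict[str, int] = {}
    for name, pattern in patterns:
        tag = f"[REDACTED_USER_PATTERN:{name}]"
        # A callable replacement keeps the tag literal (no backslash escapes applied).
        text, hits = pattern.subn(lambda _match, tag=tag: tag, text)
        if hits:
            counts[name] = hits
    return text, counts
```

Returning the counts alongside the text is also what would make a dry-run preview possible: the caller can report the counts and simply discard the redacted text.

-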
AC-702: Hermes skill references for progressive disclosure. Adds autocontext/src/autocontext/hermes/references.py exposing 4 markdown references (hermes-curator, cli-workflows, mcp-workflows, local-training) accessible via list_references() / render_reference(name). The rendered SKILL.md from render_autocontext_skill() now ends with a ## References section that cross-links each one. autoctx hermes export-skill --with-references --output <dir>/SKILL.md writes the references next to the skill in a references/ subdirectory; --force propagates to both SKILL.md and references. The skill remains useful on its own when --with-references is not passed. Atomic preflight: every destination is checked before any write so a reference-name collision can't leave SKILL.md half-installed. 12 tests cover canonical order, content invariants (read-only rule in curator alignment doc; concrete commands in CLI workflows; CLI-vs-MCP guidance in MCP workflows; small-dataset warning in local-training), SKILL.md cross-linking, the CLI overwrite-without-force guardrail, and the atomicity regression test. -
AC-705: autoctx hermes export-dataset --kind curator-decisions --home ~/.hermes --output training/hermes-curator-decisions.jsonl exports Hermes curator decision artifacts as supervised training JSONL for narrow advisor classifiers (per the AC-708 scope). Each row carries example_id, source.curator_run_path, source.started_at, input.skill_{name,state,provenance,pinned,use_count,view_count,patch_count,activity_count,last_activity_at}, label (consolidated|pruned|archived|added, strongest-wins precedence), confidence: "strong", redactions: [], and context.run_{provider,model,counts}. Label quality rules pinned by tests: pinned skills NEVER become mutation targets; bundled and hub skills NEVER become mutation targets (they appear only as context). Skills missing from the inventory still emit an example with unknown features so historical curator decisions can be trained on. Both Hermes v0.12 action shapes are accepted (list of strings OR list of {"name": ...} dicts). --since <ISO-8601> raises ValueError on invalid input rather than silently disabling the filter; runs without parseable started_at fall back to file mtime for the comparison. Pinned-via-.usage.json, bundled-via-.bundled_manifest, and hub-via-.hub/lock.json names are protected even when no active SKILL.md folder exists. Other documented dataset kinds (consolidation-pairs, skill-selection, skill-quality-signals) raise NotImplementedError with a clear message so callers know they're planned but not yet implemented. 18 fixture-based tests cover schema, label quality rules, since/limit filters, unknown-kind dispatch, dict-shape actions, protected-name fallbacks, and --since hardening. Module docstring documents the full schema; the schema is intentionally flat and feature-engineered so it can feed autoctx train --backend mlx|cuda via a one-step adapter (the adapter is a follow-up). NOTE: small personal Hermes homes may not have enough data for useful model training yet -- the dataset shape ships first; usefulness depends on Curator-decision volume. -
AC-704: autoctx hermes ingest-curator --home ~/.hermes --output traces/hermes-curator.jsonl reads Hermes v0.12 curator run reports (<home>/logs/curator/**/run.json) and emits autocontext ProductionTrace JSONL. The ingester is tolerant: malformed JSON is skipped with a warning rather than aborting; missing started_at falls back to file mtime; missing duration_seconds falls back to 0. Curator action lists (consolidated/pruned/archived/added) and counts land in trace.metadata.curator_* so downstream dataset exporters (AC-705) can consume them without re-parsing raw files. Privacy defaults: --include-llm-final (off by default) gates whether the curator's LLM final summary is attached as an assistant message; --include-tool-args (off by default) gates whether raw tool-call args are preserved. --since <ISO-8601> and --limit <n> filter the run set. CLI returns a JSON summary (runs_read, traces_written, skipped, warnings) under --json. 11 fixture-based tests cover normal run / consolidation-only / auto-transition-only / malformed JSON / missing curator dir / since-filter / limit / synthesized-messages-satisfy-schema / include-llm-final opt-in / metadata round-trip / timing derivation. -
AC-710: docs/hermes-positioning.md records the Hermes Curator + autocontext positioning. Headline: Hermes Curator is the live skill-library maintainer; autocontext is the evaluation, trace, replay, export, and local-training layer. Includes an at-a-glance complementarity table, the default operator flow (autoctx hermes inspect -> autoctx hermes export-skill -> autoctx judge/improve), the read-only import boundary on ~/.hermes, the privacy posture for session/trajectory imports, the narrow scope of autoctx train for advisor models, and an explicit "autocontext does not replace Curator" section. Cross-linked from docs/README.md "Integrating External Agents". Status footer enumerates shipped / in-flight / out-of-scope work so the doc stays accurate as the rest of the Hermes cluster lands. -
AC-682 (slice 1): TypeScript OpenTelemetry bridge for PublicTrace. New ts/src/traces/otel-bridge.ts exposes publicTraceToOtelResourceSpans (forward) and otelResourceSpansToPublicTrace (reverse) over a minimal validated subset of OTel JSON ResourceSpans (OtelResourceSpansSchema Zod). Bidirectional round-trip preserves traceId, sourceHarness (via service.name), collectedAt, sessionId, message order/content, tool calls (name/args/duration/error -> span status.code = "ERROR"), outcome (score/reasoning/dimensions), and redactions metadata. Reverse path validates the reconstructed trace against PublicTraceSchema before returning so a broken bridge cannot emit invalid traces. 11 tests cover schema validation, forward emission, round-trip, missing-service-name error path, missing-root-span error path, optional-outcome handling, zero-tool-call messages, and redaction preservation. Design note + mapping table at docs/opentelemetry-bridge.md enumerates the known-gap fields (file references, metadata, tool results) that survive as opaque JSON blobs rather than as structured OTel attributes. Python parity, OTLP protobuf wire format, and the ProductionTrace bridge are out of scope for slice 1. -
AC-725: docs/flue-influences.md design note records what the runtime workspace/session contract, scoped command/tool grants, child-agent task execution, and cwd discovery model borrowed from an external review, and what was explicitly NOT borrowed (no upstream dependency, no API names, no provider stack, no vocabulary replacement). Cross-linked from docs/README.md "Architecture And Parity"; the canonical docs/concept-model.md is intentionally NOT cross-linked to keep its vocabulary autocontext-native (a tests/package-topology.test.ts invariant pins this). Pins the guardrail that sandbox/workspace/session are runtime isolation/boundary concepts, not peer top-level product nouns alongside Scenario/Mission. -
AC-728: verifier-facing contract probes for terminal, service, and artifact tasks. Extends ts/src/control-plane/contract-probes/index.ts (previously only probeDirectoryContract) with three new pure probes: probeTerminalContract (exit code + required/forbidden stdout/stderr patterns), probeServiceContract (required endpoints with host/port/protocol matching + wrong-interface detection for 127.0.0.1 vs 0.0.0.0 confusion + optional allowed-endpoint allowlist), and probeArtifactContract (required/forbidden substrings + LF/CRLF line-ending check + required JSON fields via dot-paths with invalid-json failure when JSON parse fails). All probes follow the existing { passed: boolean, failures: readonly Failure[] } shape; failures carry a typed kind for client filtering. 17 new tests + the existing directory probe test. Distributed/multi-rank parity probes deferred to a follow-up slice. -
AC-679 (slice 3b): autoctx trace-findings --trace-id <id> extends the slice-2 CLI to load a stored ProductionTrace by id from .autocontext/production-traces/ingested/<date>/*.jsonl (the local data plane that flows through autoctx production-traces ingest). --trace <path> and --trace-id <id> are mutually exclusive input modes; exactly one is required. The workflow adapts ProductionTrace to PublicTrace inline (flatten source.emitter -> sourceHarness, derive collectedAt from timing.startedAt, map outcome only when both score and reasoning are present, copy embedded toolCalls per message) so the slice-1 extractor runs unchanged. 5 new tests cover load + Markdown, JSON shape, missing-id error, mutual exclusivity, and the "neither flag" failure case. AC-679 is now substantively feature-complete (criteria 1-8 met); the only deferred work is additional taxonomy categories (slice 3e). -
AC-679 (slice 3d): WeaknessReport variant in ts/src/analytics/trace-findings.ts. Adds WeaknessReportSchema (Zod), generateWeaknessReport(trace), and renderWeaknessReportMarkdown(report). Mirrors Python's WeaknessReport shape (recommendation-focused with recovery analysis text) alongside the existing TraceFindingReport. Recommendations are one-per-distinct-category, deduplicated, sourced from a fixed RECOMMENDATION_BY_CATEGORY table. Recovery analysis is a narrative string composed from the outcome score and weakness count. 8 tests cover schema completeness, generation across the four taxonomy categories, deduplicated recommendations, and Markdown output sections / empty states. -
AC-679 (slice 3c): renderTraceFindingReportHtml(report) ships in ts/src/analytics/trace-findings.ts. Emits an offline-first self-contained HTML document with an inline <style> block, anchored finding rows (id="finding-<id>" so external references can link directly), and data-category + data-severity attributes on each <li> for client-side filtering hooks. Mirrors the shape of Python's render_trace_writeup_html so operator muscle memory transfers between the two runtimes. User-originated content (titles, descriptions, summary, traceId) is escaped through a single htmlEscape helper that handles & < > " '. 7 tests cover scaffolding, escaping, anchors, data attributes, empty states, offline-style block, and evidence references. -
AC-679 (slice 3a): cross-runtime TraceFindingReport JSON contract. A shared fixture at fixtures/cross-runtime/trace-finding-report.json (at repo root) is the wire-format contract that both Python and TypeScript validate against. Python adds CrossRuntimeTraceFinding/CrossRuntimeFailureMotif/CrossRuntimeTraceFindingReport Pydantic models at analytics/cross_runtime_trace_findings.py with camelCase JSON aliases mirroring the TS Zod schema; snake_case kwargs work for ergonomic Python use, model_dump(by_alias=True) is the canonical wire form. 9 Python tests + 6 TS tests on the same fixture catch shape/taxonomy/enum drift before a TS-produced report can fail to parse on Python (and vice versa). Closes AC-679 criterion 8 (cross-runtime contract tests catch Python/TS drift). -
AC-679 (slice 2): autoctx trace-findings --trace <path> [--json] CLI subcommand wires the slice-1 extractor library into an operator-facing TypeScript command. Reads a PublicTrace JSON file, runs generateTraceFindingReport, and emits the report as Markdown (default) or JSON. Handler is pure (runTraceFindingsCommand(args) -> {stdout, stderr, exitCode}) so the 11 unit tests drive it directly without subprocess spawn or stdout capture; the top-level cli/index.ts shim writes the result. Coupling to the ProductionTrace store (--trace-id <id>) and the extra slice-1-deferred taxonomy categories remain follow-up work. -
AC-679 (slice 1): TypeScript trace-finding extractor library at analytics/trace-findings.ts. Re-targets AC-679 to operate over PublicTrace (the TS data plane primitive) rather than mirroring Python's harness-internal RunTrace shape, so cross-runtime parity lives in the output contract (TraceFindingReportSchema Zod schema) rather than the input trace. Slice 1 ships the Zod schemas (TraceFindingSchema, FailureMotifSchema, TraceFindingReportSchema), a four-category taxonomy targeting agent-behavior failures detectable from a PublicTrace (tool_call_failure, agent_refusal, low_outcome_score, dimension_inconsistency), pure extractor functions (extractFindings, extractFailureMotifs, generateTraceFindingReport), and renderTraceFindingReportMarkdown. Captures the agent-behavior axis that the AC-678 Python slice deferred. CLI subcommand, HTML rendering, additional categories (context loss / error-recovery loops), and cross-runtime fixture parity tests land in follow-up slices. -
AC-678 (slice): autoctx analytics trace-findings --trace-id <id> [--kind writeup|weakness] [--json] emits a trace-grounded findings report for a stored RunTrace. Exposes the existing TraceReporter.generate_writeup/generate_weakness_report pipeline as an operator CLI without changing the canonical report model; Markdown body matches the run-end-time writeup artifact. Reuses the _validated_trace_id traversal guard from render-timeline. Closes the headline AC-678 gap (Python report model existed without a CLI surface); semantic failure-taxonomy mapping beyond the current event_type grouping remains open. -
AC-749 (slice):
autoctx analytics render-timeline --trace-id <id> [--output path.html]renders an existing persistedRunTraceas an interactive HTML timeline. On-demand counterpart to the run-end-time renderer that already lives inloop/trace_artifacts.persist_run_inspection; reuses the sametimeline_inspection_viewextractor +render_timeline_inspection_htmlview. The rendered HTML now also surfaces a "Generations" section with per-generation failure/recovery counts (data attributesdata-generation-index,data-generation-failure-count,data-generation-recovery-countfor client-side hooks). The view layer exposes the sameinspect_generationdata the JSON payload already carries -- no new analytics model. -
Harness proposal decisions now require explicit evidence references before heldout/fresh validation can accept or reject a proposal. Missing
--evidence-refkeeps the durable decisioninconclusive, and corrupted accepted/rejected proposal JSON with emptyevidenceRefs, dev-only evidence, or missing baseline evidence is rejected by schema validation. -
Python and TypeScript prompt budgeting now share a domain policy for canonical duplicate-context removal, per-component token caps, protected components, and trim order; semantic compaction also caches repeated component compactions by policy version and content hash.
-
AC-727 (slice):
autoctx improve --checkpoint-cmdruns a user-supplied command after each round to preserve partial progress (e.g.git -C /repo commit -am 'round checkpoint'orcp {file} /tmp/round.lean). Same{file}placeholder semantics as--verify-cmd, plus--checkpoint-suffixand--checkpoint-timeoutcompanions. Unlike the verifier, a checkpoint command's non-zero exit is logged but does NOT veto the round; it surfaces as a newcheckpoint_done(round=N, checkpoint_ok=..., checkpoint_exit_code=...)event in the--ndjsonstream. Lets long-running improve loops salvage near-miss artifacts before later rounds overshoot or time out. -
AC-723: the TypeScript CLI now exposes
autoctx agent run <agent>andautoctx agent devfor experimental.autoctx/agentshandlers. The one-shot runner accepts--id, JSON--payload, explicit--envfiles with shell env precedence, provider/model overrides for runtime-backed handlers, and--jsonoutput; the dev server exposesGET /manifestandPOST /agents/<name>/invoke. -
Context-selection analytics reports now include actionable diagnostics for duplicate selected content, low useful-artifact recall, and selected-token bloat.
-
Python analytics now includes
autoctx analytics context-selection --run-id <run-id> [--json]to summarize persisted context-selection artifacts by selected tokens, selection rate, duplicate-content rate, useful-artifact recall, and freshness. -
AC-757: TypeScript control-plane EvalRuns now support
verifiedandexperimentaltracks.autoctx eval attachaccepts--track verified|experimental,eval list --output jsonreports the effective track, and promotion decisions reject explicitly experimental EvalRuns as non-promotion evidence. -
AC-758: Candidate artifacts now record deterministic strategy identity metadata: a canonical strategy fingerprint, component fingerprints, parent strategy lineage, and exact/near duplicate assessment.
autoctx candidate register/showinclude the metadata, andcandidate listsurfaces the strategy fingerprint and duplicate kind. -
AC-759: Candidate artifacts now quarantine repeated invalid strategies by fingerprint. Re-registering an exact or near duplicate of a disabled/quarantined strategy records
strategyQuarantine,candidate listsurfacesquarantineReason, promotion decisions reject quarantined strategies, and operational memory skips findings tied to quarantined strategy fingerprints. -
AC-760: EvalRuns can now carry opt-in ablation verification evidence for accepted strategy and harness changes.
autoctx eval attachaccepts--ablation-verification ./ablation.json,promotion decide --require-ablationrecords anablationVerificationassessment, and--ablation-targets strategy,harnessnarrows the required target coverage. -
AC-680: TypeScript control-plane harness/context changes now have a durable
HarnessChangeProposalworkflow.autoctx harness proposal create/list/show/deciderecords finding lineage, proposed patches, expected impact, rollback criteria, and an evidence-gated decision that accepts only heldout/fresh validation against matching-suite baseline evidence. -
Strategy duplicate and quarantine checks now span all environments for the same scenario/actuator and use
payloadHashas an exact-match fallback for legacy artifacts withoutstrategyIdentity. -
AC-752:
autoctx improve --ndjsonstreams per-round events as newline-delimited JSON to stdout for visibility into long-running loops. Event kinds:round_start,judge_done,verifier_done(only when--verify-cmdis set),round_summary, and a final summary line. Under--ndjsonthe Rich human-readable summary is suppressed so stdout is pure JSON.--jsonand--ndjsonare mutually exclusive output modes and are rejected up front when both are passed. -
AC-753: the ndjson stream now also emits a
revision_done(round=N, output=<content>)event right afterround_startfor every round, carrying the exact output the loop is about to evaluate. For round 1 the payload is the seed; for round N>1 it is the result oftask.revise_output()from round N-1. Lets consumers salvage near-miss verifier-vetoed rounds. Pass--no-ndjson-include-output(default--ndjson-include-output) to suppress these events when the bulk output is unwanted; that flag drops therevision_doneevent entirely and never writes the output payload anywhere on stdout. -
AC-751:
autoctx improve --claude-max-total-seconds FLOATexposessettings.claude_max_total_seconds(the wall-clock ceiling on total claude-cli runtime in a single run; env:AUTOCONTEXT_CLAUDE_MAX_TOTAL_SECONDS). Only applied when the effectively-resolved judge provider is claude-cli;judge_provider='auto'paths that inheritagent_provider='claude-cli'are honored.--timeouthelp onimprovenow explicitly names the per-provider setting it writes (claude_timeout/codex_timeout/pi_timeout). -
Python and TypeScript now expose
autoctx workerto run the existing task queueTaskRunneras a daemon or one-shot batch worker, with persistent-host deployment docs forserve + worker. -
Added narrow Python/TypeScript task queue store contracts so future hosted storage adapters can provide Postgres-backed claim/complete/fail/enqueue semantics without changing
TaskRunner. -
Gondolin is documented as a reserved optional microVM sandbox backend, fails closed until a real adapter is configured, and now has public request/policy/backend contracts for out-of-tree adapters.
-
TypeScript
autoctx runtime-sessionsnow lists, shows, and renders operator-facing timelines for persisted runtime-session event logs from CLI-backed provider runs, includingshow --run-id <run-id>andtimeline --run-id <run-id>for run-scoped logs;status,show, andwatch --jsonsurface aruntime_sessionsummary when one exists, MCP exposes the same read surface vialist_runtime_sessions,get_runtime_session, andget_runtime_session_timeline, cockpit HTTP clients can read logs and timelines from/api/cockpit/runtime-sessions,/api/cockpit/runtime-sessions/:session_id/timeline,/api/cockpit/runs/:run_id/runtime-session, and/api/cockpit/runs/:run_id/runtime-session/timeline, cockpit run list/status/resume payloads includeruntime_sessionplusruntime_session_urlfor discovery, the interactive TUI exposes/timeline <run-id>for the same grouped view and summarizes live runtime-session activity as it arrives with persisted/activityfilters, quiet/normal/verbose detail controls,/activity reset, read-only bare/activityand/activity status, and startup readback of loaded activity settings, and/ws/eventsstreams liveruntime_session_eventenvelopes as runtime-session events are appended. -
Python now has parity readers for runtime-session event logs: a TypeScript-compatible event/store/read-model/timeline layer, cockpit endpoints for listing logs and resolving run-scoped timelines, run list/status/resume discovery fields, and MCP tools
autocontext_list_runtime_sessions,autocontext_get_runtime_session, andautocontext_get_runtime_session_timelinewith unprefixed aliases. -
Python runtime-backed run and solve role calls now automatically append provider prompts and responses to the run-scoped runtime-session log, preserving runtime failure semantics while making the new Python readers useful without manual recorder wiring.
-
Python now exposes a core
RuntimeWorkspaceEnvcontract with local filesystem and in-memory adapters, virtual path resolution, scoped command grants, and explicit cleanup semantics to match the TypeScript runtime workspace boundary. -
TypeScript runtime workspace command grants now expose structured start/end/error observability events, a no-shell local process wrapper with explicit env inheritance, redacted/truncated command output previews, child-task inheritance policy, and scoped command/tool grant types for runtime-session calls without serializing trusted env values into prompts or session logs.
-
The canonical concept model now documents durable runtime-session event storage as an
Artifactmodel for provider turns, shell/tool activity, child-task lineage, compaction summaries, replay, and the boundary withRunTrace/production traces. -
Python and TypeScript runtime-session logs now record semantic compaction ledger writes as
COMPACTIONevents with entry ids, component names, ledger paths, and generation metadata for replay timelines; TypeScript records the hook-finalized ledger entries and paths after artifact write hooks run. -
Python and TypeScript now expose explicit runtime-session-to-
RunTraceadapters for analytics reuse, mapping child-task lineage, command/tool status, and compaction artifact references without copying raw prompts, model responses, stdout/stderr, or arbitrary runtime metadata.
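
The `--ndjson` event stream described under AC-752, AC-753, and AC-727 is meant to be read line by line. The following is a minimal consumer sketch, not the project's own tooling. It assumes each line is a JSON object whose event kind sits under an `event` key; that key name is an assumption, since the changelog names only the event kinds and the `round`, `output`, `checkpoint_ok`, and `checkpoint_exit_code` fields. `salvage_outputs` and the sample stream are hypothetical.

```python
import json
from typing import Iterable


def salvage_outputs(lines: Iterable[str], kind_key: str = "event") -> dict[int, str]:
    """Collect the output carried by each revision_done event, keyed by round.

    kind_key is an ASSUMPTION: the changelog does not say which JSON key
    holds the event kind, so it is left configurable here.
    """
    outputs: dict[int, str] = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between records
        record = json.loads(raw)
        if not isinstance(record, dict):
            continue  # ignore anything that is not a JSON object
        if record.get(kind_key) == "revision_done":
            outputs[record["round"]] = record["output"]
    return outputs


# Hypothetical sample stream, shaped after the event kinds named in the changelog.
sample = [
    '{"event": "round_start", "round": 1}',
    '{"event": "revision_done", "round": 1, "output": "seed text"}',
    '{"event": "judge_done", "round": 1}',
    '{"event": "checkpoint_done", "round": 1, "checkpoint_ok": false, "checkpoint_exit_code": 1}',
    '{"event": "round_start", "round": 2}',
    '{"event": "revision_done", "round": 2, "output": "revised text"}',
]
outputs = salvage_outputs(sample)
```

Keeping every round's output lets a consumer recover a near-miss candidate even if a later round is vetoed by the verifier or the run times out.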
Fixed
- AC-764 / AC-765: Python and TypeScript Pi CLI runtimes no longer rely on raw `subprocess.run(..., timeout=...)`/`execFileSync(..., { timeout })` cleanup. Both runtimes now isolate `pi --print` in a subprocess/session where supported, kill the full process group on timeout, close inherited stdout/stderr pipes, bound post-kill cleanup to 5s, and preserve timeout metadata (`error: "timeout"`, timeout seconds) for callers. Regression coverage includes process-group kill, interrupted/abnormal cleanup, and leaked-pipe timeout return paths.
- AC-761 / AC-735: claude-cli subprocesses are now hard-killed at their process group on timeout AND on any other abnormal exit (`KeyboardInterrupt`, `SystemExit`, ...). The previous code path used `subprocess.run(..., timeout=...)`, which only `proc.kill()`s the immediate child; claude-cli helper processes that inherit pipe fds kept the post-kill `communicate()` drain open, so a `--timeout 1200` invocation was observed still alive at 2h24m (AC-761) and `AUTOCONTEXT_CLAUDE_MAX_TOTAL_SECONDS=28800` runs were observed at 8h45m (AC-735). The runtime now spawns claude in its own session (`start_new_session=True`) and `os.killpg(pgid, SIGKILL)`s the whole group, with a bounded 5s grace on the post-kill drain. Because `start_new_session=True` also detaches the child from the terminal's signal-delivery group, Ctrl-C / SIGINT no longer reaches the claude process group automatically; the helper's `except BaseException` branch (PR #940 review) ensures interrupted runs still clean up the detached children before re-raising. Wall-clock returns within `claude_timeout + 5s` even when grandchildren hold pipes open. POSIX only; Windows uses `proc.kill()` fallback.
- AC-756: `ImprovementResult.met_threshold` now consistently mirrors the same predicate used by the early-return paths -- the best round both cleared `quality_threshold` and satisfied `dimension_threshold` if one was configured. Previously the fallthrough exit (plateau-stall, unchanged-output, max-rounds, consecutive-failures) hard-coded `met_threshold=False`, so a run that produced above-threshold output via, e.g., a plateau-stall path was flagged as "didn't meet threshold" and could be discarded by automation. The fix tracks `best_dims_ok` alongside `best_score` so the per-dimension gate is honored at fallthrough exits too.
- AC-754: `ImprovementLoop` now peels off an outer markdown code fence (e.g. `` ```lean ... ``` ``) when cleaning agent output, so verifiers that compile the output directly (`lake env lean`, `mypy`, `cargo check`, ...) no longer reject otherwise-valid content on the literal fence lines. Applied to both the seed (round 1's input) and the result of every `task.revise_output()` call. The strip is conservative: only the outer wrapper is removed, inner nested fences and unbalanced fences are preserved.
- AC-750: `ImprovementLoop` no longer fires a misleading `max_score_delta` warning when the previous round was zeroed by the external `--verify-cmd` verifier. The loop now tracks `last_unvetoed_score` separately from `prev_valid_score`; the delta check compares against the last legitimate judge score, while plateau detection still treats consecutive verifier vetoes as a stall.
- Runtime-session event stores now preserve existing events when saving stale or partial logs, and the TypeScript timeline pairs repeated child-task completions by child session id before falling back to task aliases.
- Worker commands now clamp concurrency to one for stateful persistent runtimes, and Python runtime-bridge providers close underlying runtimes on shutdown.
- TypeScript task runners now await queue-store methods so hosted Postgres adapters can implement the queue contract asynchronously.
- AC-733..AC-738 batch from the putnam_2013_a5 stress test: `improve` now exposes `--verify-cmd`/`--verify-suffix`/`--verify-timeout` for compile/test gates that can force score=0 and feed stderr back into revision; `solve` accepts `--task-prompt` to bypass the LLM scenario designer (which truncated long Lean/Putnam-style prompts), `--task-file` for file-backed descriptions, `--generations` as an alias for `--gens`, and `-d` short form for `--description`; `--family` typos surface a `did_you_mean` suggestion via the new `FamilyName` value object instead of silently falling through; `AUTOCONTEXT_CLAUDE_TOOLS=""` now renders as a single `--tools=` argv token rather than a stray double-space; and `AUTOCONTEXT_CLAUDE_MAX_TOTAL_SECONDS` (default `0`/off) attaches a `RuntimeBudget` to every settings-driven `ClaudeCLIRuntime` (default agent provider, per-role overrides, and the judge/provider registry path), with retry backoff sleeps bounded by both the per-invocation cap and the attached budget.
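
The process-group kill described for AC-761 / AC-735 and AC-764 / AC-765 can be sketched as follows. This is an illustrative, POSIX-only sketch built on standard-library calls, not the project's implementation; `run_with_group_kill`, its helpers, and the returned dictionary shape are hypothetical names chosen for the example.

```python
import os
import signal
import subprocess
import sys


def run_with_group_kill(argv: list[str], timeout: float, grace: float = 5.0) -> dict:
    """Run argv in its own session and SIGKILL the whole group on timeout or interrupt."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,  # child leads a new session, so its pgid equals its pid
    )
    try:
        out, err = proc.communicate(timeout=timeout)
        return {"exit_code": proc.returncode, "stdout": out, "stderr": err, "error": None}
    except subprocess.TimeoutExpired:
        _kill_group(proc)
        out, err = _bounded_drain(proc, grace)
        return {"exit_code": proc.returncode, "stdout": out, "stderr": err,
                "error": "timeout", "timeout_seconds": timeout}
    except BaseException:
        # KeyboardInterrupt, SystemExit, ...: the detached group will not
        # receive the terminal's SIGINT, so clean it up before re-raising.
        _kill_group(proc)
        _bounded_drain(proc, grace)
        raise


def _kill_group(proc: subprocess.Popen) -> None:
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already gone


def _bounded_drain(proc: subprocess.Popen, grace: float) -> tuple[str, str]:
    """Drain pipes for at most `grace` seconds, then give up on them."""
    try:
        return proc.communicate(timeout=grace)
    except subprocess.TimeoutExpired:
        # Something outside the killed group still holds the pipes open:
        # close our ends instead of blocking on them.
        for stream in (proc.stdout, proc.stderr):
            if stream is not None:
                stream.close()
        proc.wait()
        return "", ""


if __name__ == "__main__":
    result = run_with_group_kill(
        [sys.executable, "-c", "import time; time.sleep(30)"], timeout=1.0
    )
    print(result["error"], result["exit_code"])
```

Killing by group rather than by pid is what lets the call return even when helper processes inherited the pipe file descriptors; the bounded drain covers the remaining case where a process outside the group keeps a pipe open.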
Changed
- Python `autocontext` and TypeScript `autoctx` package metadata are bumped to `0.5.1` for the Pi CLI timeout-hardening release. Follow-up Pi `pi-autocontext` package metadata is bumped to `0.2.5`, its extension imports and peer dependencies are migrated to the Pi 0.74 `@earendil-works/*`/`typebox` package names, and its `autoctx` dependency now requires the hardened `^0.5.1` line.
- Default of `AUTOCONTEXT_CLAUDE_MAX_TOTAL_SECONDS` is now `0` (disabled, opt-in). Set explicitly when you want a wall-clock cap on total Claude CLI runtime; the per-invocation retry cap inside `ClaudeCLIConfig` keeps its 25-minute default for in-process retry sequences.
0.5.0 - 2026-05-01
Added
- Python and TypeScript `autoctx solve` now accept the plain-language goal as a positional argument while keeping `--description` as a named option.
- Python and TypeScript `solve`/`run` commands now accept `--iterations` as the plain-language alias for `--gens`.
- Python and TypeScript `autoctx run <scenario>` now accept a positional scenario while keeping `--scenario` for scripts.
- Python and TypeScript `autoctx export <run-id>` now export knowledge from a specific run while keeping scenario-level export support.
- TypeScript CLI/TUI help now uses the same plain-language run vocabulary, including `status <run-id>`, `show <run-id> --best`, and `watch <run-id>`.
- Python `autoctx hermes inspect` now reads Hermes v0.12 skill usage telemetry and Curator reports without mutating `~/.hermes`, and `autoctx hermes export-skill` emits a first-class Hermes `autocontext` skill that teaches CLI-first workflows with MCP as optional.
Fixed
- Python installed `autoctx` no longer crashes on no-args startup when packaged banner assets are missing.
Changed
- Python `autocontext` and TypeScript `autoctx` package metadata are bumped to `0.5.0`.
- Pi `pi-autocontext` package metadata is bumped to `0.2.4`, and its `autoctx` dependency range accepts both the current `0.4.9` package and the upcoming `0.5.0` npm line.
0.4.9 - 2026-04-30
Fixed
- TypeScript `simulate` now uses the schema-evolution scenario designer for schema-evolution prompts and rejects zero-mutation generated specs before persistence (AC-694).
- Python Pi/Pi-RPC budget errors now report the effective bounded role timeout instead of the original unbounded Pi timeout (AC-695).
- RLM sessions can soft-finalize from explicit final-answer tags, cautious natural-language closure cues, and repeated silent no-progress turns, while preserving real inspection progress (AC-696).
- Rubric drift monitoring now flags within-generation mean-versus-best compression and catches slower dimension decline patterns (AC-686).
Changed
- Python `autocontext` and TypeScript `autoctx` package metadata are bumped to `0.4.9`.
- Pi `pi-autocontext` package metadata is bumped to `0.2.3` while intentionally keeping its `autoctx` dependency one package behind at `^0.4.8`.
0.4.8 - 2026-04-30
Fixed
- TypeScript generated `schema_evolution` scenarios no longer score empty mutation plans as perfect, and generated actions now record mutation lineage before schema-coverage scoring (AC-666).
- Python Claude CLI runtime calls now use bounded timeout retries with exponential backoff, total wall-clock caps, retry metadata, and warning/error logs for long-running live-agent calls (AC-684).
- Python solve now enforces generation budgets across Pi/Pi-RPC role calls, including per-role overrides, and closes one-shot budgeted persistent Pi RPC clients after use (AC-691).
- TypeScript schema-evolution creation now recovers from Pi-style invalid JSON responses with markdown fences, prose wrappers, comments, trailing commas, and camelCase fields (AC-692).
- Python solve JSON/status output now includes resolved scenario-family metadata for stress harnesses and user workflows (AC-693).
- Iterative investigation no longer requires resolving the architect runtime before the first analyst step.
- Task-like solve lifecycle hooks now report persisted generation counts separately from improvement rounds.
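
The AC-684 entry combines three ideas: a bounded number of timeout retries, exponential backoff between attempts, and a total wall-clock cap. A minimal sketch of how those pieces can fit together is shown below. It is not the project's code: `call_with_retries`, its parameters, and the metadata keys are hypothetical, and treating a cap of `0` as "disabled" is an assumption made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_total_seconds: float = 0.0,  # ASSUMPTION: 0 means "no wall-clock cap"
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[T, dict]:
    """Retry call() on TimeoutError with exponential backoff under a wall-clock cap."""
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            value = call()
            return value, {"attempts": attempts, "elapsed_seconds": clock() - start}
        except TimeoutError:
            if attempts >= max_attempts:
                raise  # retry budget exhausted
            delay = base_delay * 2 ** (attempts - 1)  # 1x, 2x, 4x, ...
            if max_total_seconds > 0:
                remaining = max_total_seconds - (clock() - start)
                if remaining <= 0:
                    raise  # wall-clock cap already spent
                delay = min(delay, remaining)  # never sleep past the cap
            sleep(delay)
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting; a caller would normally leave both at their defaults.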
Changed
- Python `autocontext` and TypeScript `autoctx` package metadata are bumped to `0.4.8`.
- Pi `pi-autocontext` package metadata is bumped to `0.2.2` while intentionally keeping its `autoctx` dependency one package behind at `^0.4.7`.
0.4.7 - 2026-04-29
Added
- Python `autoctx export` now accepts `--format pi-package` to write a Pi-local package directory with `package.json`, `SKILL.md`, prompt markdown, and the original autocontext strategy payload.
- Python and TypeScript autocontext now expose Pi-shaped extension hook buses via `AUTOCONTEXT_EXTENSIONS`, covering run/generation lifecycle, context transforms, semantic compaction, provider requests/responses, judge calls, and artifact writes.
- Pi `pi-autocontext` now exposes `autocontext_runtime_snapshot` for run artifacts, package provenance, session branch lineage, and recent event-stream context.
- TypeScript Pi RPC now supports an opt-in persistent runtime via `AUTOCONTEXT_PI_RPC_PERSISTENT=true`, reusing one `pi --mode rpc` subprocess for prompt and live-control calls.
- TypeScript CLI now exposes `autoctx solve` as a DB-backed solve-on-demand entrypoint with `--description`, `--gens`, `--timeout`, and `--json` support (AC-619).
- TypeScript solve now preserves Python-shaped controls for structured family overrides, per-generation runtime-budget enforcement, output file writing, and classifier fallback status metadata (AC-620).
Fixed
- TypeScript capabilities now report the provider factory support surface and no longer mark the visible `train` command as Python-only (AC-626).
- TypeScript `run` now supports saved custom `agent_task` scenarios through the agent-task improvement runner instead of rejecting scenarios already discoverable in the control plane (AC-625).
Changed
- Restructured the top-level `README.md`: leads with the Pi runtime quick start, adds an MCP-driven natural-language entry path ("Or Just Talk To Your Agent"), shows a structured artifact tree with concrete `playbook.md` and `trace.jsonl` excerpts, surfaces production-trace capture as its own section, merges the surfaces table with command examples, and adds a short FAQ. Removes redundant "How People Use It" / "Choose An Entry Point" / "Repository Layout" sections (the last is already covered in `AGENTS.md`).
- Bumped subpackage README references from `0.4.4` to `0.4.7` (`autocontext/README.md`, `ts/README.md`) to track the next release line.
- Python `autocontext`, TypeScript `autoctx`, and Pi `pi-autocontext` package metadata are bumped for the release.
0.4.6 - 2026-04-23
Added
- Browser integration surface (AC-598–603): Chrome CDP backend for Python (`autocontext.integrations.browser`) and TypeScript (`autoctx/integrations/browser`), wired into investigations and the task queue. Includes a browser exploration contract, cross-runtime validation fixtures, parity enforcement, and selector generation for CDP element refs.
- A2-III Anthropic integration: `instrument_client`/`InstrumentedAsyncAnthropic` (Python) and `instrumentClient` (TypeScript) intercept Anthropic SDK calls and route production traces through the autocontext pipeline, with `AnthropicStreamProxy`/`AnthropicStreamProxyAsync` for streaming and `AnthropicTaxonomyMapper` for outcome classification. Available at `autocontext.integrations.anthropic` and `autoctx/integrations/anthropic`. Includes cross-runtime parity (9 fixtures + 50-run property tests), anthropic-python/ts detector plugins, bundle-size enforcement, and zero-telemetry guarantee.
- Production traces `build-dataset` filters (AC-606): `--provider`, `--app`, `--env`, and `--outcome` filters on the `build-dataset` CLI and MCP tool, plus an E2E integration test covering OpenAI + Anthropic traces through ingest→build-dataset.
- Hierarchical investigation evidence with evidence cards cache and artifact drill-down hardening.
- Tail context preservation in secondary prompt reducer surfaces.
- Solve runtime floor raised for generated scenarios.
Fixed
- Provider proxy runtime plumbing centralized into a shared `_shared/proxy-runtime` module so Anthropic and OpenAI integration proxies share consistent lifecycle and error handling (AC-611).
- TypeScript scenario family designers now share response parsing across agent-task, artifact-editing, and tool-fragility families so generated specs preserve family-specific semantics (AC-612).
- Install salt identity invariant preserved across process restarts (AC-609).
- Cross-runtime migration ledger reconciliation so Python and TypeScript DBs stay aligned after schema divergence (AC-608).
- CLI dispatch moved into a command registry so mission routes resolve correctly (AC-610).
- Babel reverse solve designer retries restored and scenario creation stabilized (AC-607).
Changed
- Python and TypeScript package metadata are bumped to `0.4.6`.
0.4.5 - 2026-04-21
Fixed
- `quality_threshold` auto-heal no longer silently drops below the configured floor during multi-round improvement loops (AC-585).
- Judge-provider inheritance now propagates correctly to nested evaluation calls so role-routing overrides are honored end-to-end (AC-586).
- Claude CLI timeout default bumped from 300 to 600 seconds, reducing spurious failures in longer live-agent solve runs (AC-588).
- Release-sweep accounting hardened to prevent double-counting across concurrent sweep legs.
Added
- Added a shared browser exploration contract and package-safe configuration surface across Python and TypeScript, including canonical schemas, validation helpers, secure `AUTOCONTEXT_BROWSER_*` defaults, and policy helpers.
- Added the TypeScript Chrome DevTools Protocol backend for browser exploration, including attach-only target discovery, websocket transport, policy-gated actions, and evidence artifacts.
- Added Python browser exploration integration for investigations and queued tasks, including policy-gated snapshot capture, prompt/evidence enrichment, and fail-closed task-runner wiring.
- Added a thin Python Chrome CDP browser backend with debugger-target discovery, evidence persistence, WebSocket transport, runtime factory, and policy-checked session actions.
- Added cross-runtime browser contract fixtures so Python and TypeScript validators stay in lockstep.
- Added TypeScript browser-context integration for investigations, queued tasks, and MCP queueing, including fail-closed navigation policy handling and artifact-backed browser evidence.
0.4.4 - 2026-04-20
Added
- Added the production-traces contract and traffic-to-eval pipeline across Python and TypeScript, including cross-runtime schemas, emit/validate helpers, redaction, retention, dataset building, CLI/MCP surfaces, and golden integration flows.
- Added the TypeScript control-plane `model-routing` actuator plus the published `chooseModel` runtime helper for deterministic route, rollout, guardrail, fallback, and trace-integrated model selection.
- Added Python solve ergonomics for family overrides and improved classifier observability/fallback vocabulary for finance, schema-evolution, geopolitical simulation, and alignment-stress prompts.
Fixed
- Hardened Python scenario design and solve paths around malformed designer responses, intent-drift retry feedback, mandatory calibration examples, structured quality thresholds, readable sample prompts, and schema/geopolitical simulate routing.
- Preserved the latest control-plane hardening while restacking the production-traces/model-routing foundation, including candidate artifact boundary validation and model-routing payload registration.
Changed
- Python and TypeScript package metadata are bumped to `0.4.4`.
0.4.3 - 2026-04-17
Fixed
- Hardened Pi-backed solve/runtime execution so Pi RPC waits for assistant completion, honors model/context-file options consistently, and solve runs enforce timeout budgets.
- Preserved generated-scenario family behavior across solve, export, TypeScript `new-scenario`, and `improve` flows, including empty-action family specs and improve calls without an initial output.
- Made custom scenario loading resilient and diagnosable: malformed specs no longer block registry discovery, spec-only directories surface actionable diagnostics, import-time missing files keep their real reason, and non-agent family specs can auto-materialize Python `scenario.py` sources.
- Normalized structured agent-task prompt payloads before validation and code generation, so JSON-like sample inputs, reference context, preparation instructions, and revision prompts no longer crash generated runtimes.
Changed
- Python and TypeScript package metadata are bumped to `0.4.3`.
0.4.2 - 2026-04-16
Fixed
- Preserved TypeScript workflow and custom-scenario semantics across broader scenario generation, including workflow compensation/side-effect metadata and camelCase final score weights.
- Hardened Python judge, improve, simulate, and list CLI flows around timeout overrides, fresh workspaces, provider overrides, rubric guardrails, and simulation-family routing.
- Added the Python `autoctx investigate` surface with generation fallbacks and kept its CLI implementation below the repository module-size gate.
- Restored Python `autoctx queue add --task-prompt ... --rubric ...` compatibility for prompt-backed queued tasks, including direct ad hoc queueing without a saved spec name.
Changed
- Python and TypeScript package metadata are bumped to `0.4.2`.
0.4.1 - 2026-04-14
Fixed
- Restored operator-loop escalation accounting when explicit escalation actions also mention clarification, so generated Python scenarios preserve both escalation and clarification signals.
- Preserved operator-loop family routing through Python solve creation and replay-safe feedback validation without violating the Pydantic serialization convention.
- Routed TypeScript `new-scenario` operator-loop requests through the dedicated family designer and allowed generated operator-loop scenarios to execute through the solve codegen path.
- Python and TypeScript package metadata are bumped to `0.4.1`.
0.4.0 - 2026-04-14
Changed
- Refactored the TypeScript platform foundation, analytics/trace/training, and control-plane integration surfaces into thinner workflow modules while preserving CLI, MCP, and package parity.
- Hardened the extracted package-surface workflows around typed MCP tool boundaries, simulation dashboard report parsing, and deterministic simulation score normalization.
- Python and TypeScript package metadata are bumped to `0.4.0`.
0.3.7 - 2026-04-08
Added
- TypeScript `autoctx campaign` CLI with create, status, list, add-mission, progress, pause, resume, and cancel subcommands, completing the CLI surface for CampaignManager (AC-533).
- Campaign API endpoints and MCP tools for multi-mission coordination with budget tracking and dependency graphs.
Changed
- Standardized Anthropic credential loading around `ANTHROPIC_API_KEY` while keeping `AUTOCONTEXT_ANTHROPIC_API_KEY` as a compatibility alias across Python and TypeScript settings.
- Added optional role-scoped credential and endpoint overrides (`AUTOCONTEXT_{ROLE}_API_KEY`, `AUTOCONTEXT_{ROLE}_BASE_URL`) for `competitor`, `analyst`, `coach`, and `architect`, falling back to the global provider configuration when unset.
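
The fallback order described above can be illustrated with a short sketch. It is hedged throughout: the function names are hypothetical, the `{ROLE}` placeholder is assumed to expand to the upper-cased role name, and the canonical `ANTHROPIC_API_KEY` is assumed to win over the alias when both are set (the changelog does not state the precedence).

```python
from typing import Mapping, Optional

ROLES = ("competitor", "analyst", "coach", "architect")


def resolve_anthropic_key(env: Mapping[str, str]) -> Optional[str]:
    """Read the canonical variable first, then the compatibility alias.

    ASSUMPTION: canonical-over-alias precedence is not confirmed by the changelog.
    """
    return env.get("ANTHROPIC_API_KEY") or env.get("AUTOCONTEXT_ANTHROPIC_API_KEY")


def resolve_role_credentials(
    role: str,
    env: Mapping[str, str],
    global_api_key: Optional[str],
    global_base_url: Optional[str],
) -> dict:
    """Prefer role-scoped overrides; fall back to the global provider settings."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    prefix = f"AUTOCONTEXT_{role.upper()}"  # ASSUMPTION: {ROLE} is upper-cased
    return {
        "api_key": env.get(f"{prefix}_API_KEY") or global_api_key,
        "base_url": env.get(f"{prefix}_BASE_URL") or global_base_url,
    }


# Hypothetical environment: one role override plus the alias as the global key.
env = {
    "AUTOCONTEXT_ANALYST_API_KEY": "role-key",
    "AUTOCONTEXT_ANTHROPIC_API_KEY": "alias-key",
}
global_key = resolve_anthropic_key(env)
analyst = resolve_role_credentials("analyst", env, global_key, None)
coach = resolve_role_credentials("coach", env, global_key, None)
```

Here `analyst` picks up its role-scoped key while `coach`, which has no override, falls back to the global one.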
Fixed
- Python `autoctx simulate` now resolves live generation through the effective architect-role runtime surface, so `AUTOCONTEXT_ARCHITECT_PROVIDER` and other role-routing overrides are honored instead of being bypassed by the raw client builder.
- Python simulation spec normalization now tolerates LLM-friendly action/spec shapes such as `postconditions`, nested criteria objects, and extra action-planning metadata without failing code generation.
- Structured simulation preconditions now preserve referenced action ids when LLM output includes both an `action` field and human-readable prose, so generated dependencies remain executable.
- Regenerating a custom scenario with the same name in one process now force-reloads the generated module so `solve` and creator validation do not reuse stale scenario classes from `sys.modules`.
- Pi-backed live flows now default to a 300 second timeout, reducing spurious failures in longer `solve` runs.
- Public docs now describe `operator-in-the-loop` as a runnable family and no longer contradict the executable tests.
0.3.6 - 2026-04-07
Changed
- Hardened bootstrap, evidence, and privacy handling so environment snapshots redact shell paths correctly, rematerialized workspaces do not retain stale artifacts, and live prompt/evidence flows now wire the collected snapshot and evidence manifest into the real loop.
- Tightened scenario-generation safety in the TypeScript surface so `operator_loop` validation requires its real escalation/clarification hooks and spec auto-heal preserves punctuation-heavy precondition dependencies instead of dropping valid ordering.
- Improved evidence and security backstops by failing closed on TruffleHog execution errors and making the evidence workspace/MCP integration rely on a materialized runtime workspace instead of dead helper-only paths.
- Hardened blob-store backends so local keys cannot escape the configured root and Hugging Face bucket metadata/list/delete behavior remains accurate across fresh process boundaries.
- Python and TypeScript package metadata are bumped to `0.3.6`.
0.3.5 - 2026-04-06
Changed
- Stabilized the post-`0.3.4` simulation path so operator-loop scenarios preserve behavioral-contract signals across multi-run, sweep, and replay flows instead of silently dropping them.
- Hardened plain-language simulation execution around explicit family detection, operator-loop contract enforcement, and shared CLI engine-result handling so incomplete runs surface consistently across Python and TypeScript surfaces.
- Tightened the simulation-engine implementation without regressing the repo module-size guardrail, including the compatibility shim needed by existing abstract-class filtering tests.
- Python and TypeScript package metadata are bumped to `0.3.5`.
0.3.4 - 2026-04-04
Changed
- Added action-label and living-docs surfaces to the operator workflow, including reviewer-driven cleanup on the action-label taxonomy and living-docs maintenance path.
- Landed the TypeScript/Python parity tranche for session store and the full research package, keeping the rebased cross-surface runtime behavior aligned on current `main`.
- Folded in the `pi-autocontext` polish follow-up so the published Pi package line reflects the renamed extension and its best-practices cleanup.
- Python and TypeScript package metadata are bumped to `0.3.4`.
0.3.3 - 2026-04-03
Changed
- Expanded the research surface with validated domain contracts, runtime gating, persistence hardening, and better evaluation wiring for briefs, prompts, and adapters.
- Hardened Python and TypeScript operator-control surfaces around terminal lifecycle transitions, remote approvals, progress digests, and agentOS session/runtime error handling.
- Improved SQLite bootstrap and migration compatibility so packaged installs and fresh databases stay aligned with the live generation schema.
- Expanded the TypeScript provider compatibility surface with env-driven config for `gemini`, `mistral`, `groq`, `openrouter`, and `azure-openai`, and synced the public provider docs/tests to match.
- Python and TypeScript package metadata are bumped to `0.3.3`.
0.3.2 - 2026-04-02
Changed
- Completed the TypeScript session-runtime parity pass across lifecycle management, coordinator state transitions, supervision, context pressure, remote approvals, progress digests, memory consolidation, and skill registry behavior.
- Hardened the TypeScript operator control plane so terminal session and worker states stay terminal, remote approvals require connected controllers, and redirected work remains visible in progress summaries.
- Python and TypeScript package metadata are bumped to `0.3.2`.
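The "terminal states stay terminal" guarantee in this release can be pictured as an absorbing-state rule. The sketch below is illustrative only: the state names and the `transition` function are assumptions, not the identifiers used by the TypeScript control plane.

```python
from enum import Enum


class SessionState(Enum):
    # Hypothetical state names chosen for illustration.
    ACTIVE = "active"
    PAUSED = "paused"
    COMPLETED = "completed"
    FAILED = "failed"


TERMINAL_STATES = frozenset({SessionState.COMPLETED, SessionState.FAILED})


def transition(current: SessionState, requested: SessionState) -> SessionState:
    """Return the next state; a terminal state absorbs every later request."""
    if current in TERMINAL_STATES:
        return current
    return requested
```

A stricter variant could raise on a rejected request instead of ignoring it; which behavior the project uses is not stated here.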
0.3.1 - 2026-04-01
Changed
- Python package publishing now uses the canonical PyPI name `autocontext` instead of `autoctx`.
- Public install docs now reflect the package split accurately: PyPI is `autocontext`, while npm remains `autoctx`.
- Python and TypeScript package metadata are bumped to `0.3.1`.
0.3.0 - 2026-03-29
Added
Commands
- `autoctx simulate`: plain-language multi-variable simulation with sweeps, replay, compare, and export.
- `autoctx investigate`: evidence-driven diagnosis with hypotheses, confidence scoring, and unknowns.
- `autoctx analyze`: interpret and compare runs, simulations, investigations, and missions.
- `autoctx train`: train distilled models from curated datasets with backend selection.
- Python `autoctx simulate`: full parity with the TypeScript surface (run, replay, compare, and export).
Scenarios
- All 11 scenario families now fully executable in TypeScript (was 2/11) via secure-exec V8 isolate codegen.
- `operator_loop` is now a fully runnable family in both packages.
- Unified family classifier: all families reachable through the CLI.
- Spec auto-heal: codegen failures trigger automatic recovery.
- Scenario revision flow: refine created scenarios with feedback.
- Deep execution validation: generated code is executed and verified before registration.
- Three scenario templates: content-generation, prompt-optimization, and rag-accuracy.
- `new-scenario` CLI materializes runnable artifacts to disk.
- Scenario parity matrix documents Python/TypeScript surface coverage.
Missions & Campaigns
- Adaptive mission execution: LLM-driven goal decomposition and step planning replaces generic bookkeeping.
- Campaign abstraction: coordinate multiple missions under long-term goals with budget tracking and dependencies.
- Mission-simulation integration: missions invoke simulations as planning tools.
Trace Pipeline
- Open public trace schema v1.0.0: versioned interchange format for coding agent traces.
- Sensitive-data detection and redaction with policy-backed actions.
- Privacy-aware trace export workflow: redact, validate, manifest, and attestation.
- Publishing connectors for local JSONL, GitHub Gist, and Hugging Face.
- Trace-to-model data plane with `DatasetCurator` and `DataPlane`.
- Repo-local dataset discovery: scan repo trees and convert JSONL, JSON, CSV, and markdown into ShareGPT-style records.
- Curated distillation dataset pipeline with gate filtering, top-quartile selection, family filtering, and failure-example policy.
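The top-quartile selection step mentioned in the curated distillation pipeline could look roughly like the sketch below. It assumes each record is a dict carrying a numeric score; the function name `top_quartile` and the `score` key are hypothetical and do not describe the real `DatasetCurator` interface.

```python
import math


def top_quartile(records: list[dict], score_key: str = "score") -> list[dict]:
    """Keep the highest-scoring quarter of records, rounding up (hypothetical helper)."""
    if not records:
        return []
    keep = max(1, math.ceil(len(records) / 4))
    ranked = sorted(records, key=lambda record: record[score_key], reverse=True)
    return ranked[:keep]
```

Rounding up means a small dataset still yields at least one example, which is an assumption of this sketch rather than documented behavior.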
Training & Distillation
- Base model selection maps scenario families to training modes (from-scratch, LoRA, and full fine-tune).
- Training backend abstraction with MLX and CUDA plus an injectable `TrainingExecutor` hook.
- Prompt alignment ensures distilled models match runtime invocation.
- Candidate-shadow-active promotion lifecycle with configurable quantitative gates and rollback.
Changed
- Consolidated operator UI: the Python `serve` and `tui` surfaces are API/WebSocket-first, while interactive terminal UI remains available through the TypeScript client surfaces.
- Richer sweep DSL: categorical sweeps, logarithmic scales, sweep file loading, and named presets.
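To make the sweep additions concrete, the sketch below shows how a logarithmic axis and a categorical axis might expand into a grid of runs. It does not use the actual sweep DSL syntax, which is not shown in this changelog; `log_sweep`, `expand_grid`, and the axis names are hypothetical.

```python
import itertools
import math


def log_sweep(start: float, stop: float, steps: int) -> list[float]:
    """Return `steps` values spaced evenly on a log10 scale from start to stop."""
    if start <= 0 or stop <= 0 or steps < 2:
        raise ValueError("log sweeps need positive bounds and at least 2 steps")
    lo, hi = math.log10(start), math.log10(stop)
    return [10 ** (lo + (hi - lo) * i / (steps - 1)) for i in range(steps)]


def expand_grid(axes: dict[str, list]) -> list[dict]:
    """Cartesian product of all axes, one dict of variable values per run."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in itertools.product(*axes.values())]


# Hypothetical sweep: one logarithmic axis and one categorical axis.
grid = expand_grid({
    "error_rate": log_sweep(0.001, 1.0, 4),
    "policy": ["greedy", "cautious"],
})
```

Here `grid` holds eight run configurations: four error rates (approximately 0.001, 0.01, 0.1, and 1.0) crossed with two policies.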
Fixed
- Trace pipeline audit: expanded redaction patterns, ISO 8601 timestamp validation, explicit role mapping, export warnings, and Hugging Face format fixes.
- Distillation audit: training executor hook, base model validation, CSV parser edge cases, silent catches now surfaced as warnings, and end-to-end integration coverage.
0.2.4 - 2026-03-26
Added
- Session notebook context now flows into runtime prompts and cockpit views for active runs.
- World-state abstractions now support stateful scenario families and workflow-style scenarios.
Changed
- Agent-task scaffolding and execution now use separate phased budgets.
- Operator-loop scenarios remain available as typed family metadata, but executable operator-loop scaffolding has been removed so the harness no longer bakes in escalation-specific runtime behavior.
- Public repo docs now include a docs landing page, package-selection guidance, an analytics/adoption guide, a release checklist, and copy-paste integration examples for CLI, MCP, Python SDK, and TypeScript usage.
Fixed
- Python package fallback version metadata now matches the published `0.2.0` package version.
0.2.0 - 2026-03-15
Added
- Initial public release with Python and TypeScript packages.
- Generation loop with Elo-based progression gating.
- Agent roles: competitor, analyst, coach, architect, and curator.
- Pluggable scenarios including `grid_ctf`, `othello`, and the custom creation pipeline.
- LLM judge with multi-sample evaluation.
- Task runner daemon with improvement loops.
- MCP server with tool implementations.
- FastAPI dashboard with WebSocket events.
- CLI via Typer (Python) and `parseArgs` (TypeScript).
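For readers unfamiliar with the Elo-based progression gating listed in this release, the sketch below shows the standard Elo expected-score and update formulas plus one possible gate. The gate rule (`passes_gate`, with its `min_gain` threshold) and the K-factor of 32 are assumptions for illustration; the project's actual gating criteria are not described in this changelog.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Elo update; `score` is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    return rating + k * (score - expected_score(rating, opponent))


def passes_gate(candidate: float, incumbent: float, min_gain: float = 0.0) -> bool:
    """Hypothetical gate: advance only if the candidate out-rates the incumbent."""
    return candidate - incumbent > min_gain
```

For example, a 1500-rated candidate that beats a 1500-rated incumbent moves to 1516 under these assumptions, which would pass a gate with `min_gain=10`.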