Measured-Run Proof Bundle
June 3, 2026
Status: explainable demo, not a runnable quickstart. Uses existing committed golden artifacts from the Phase 1 and Phase 2 acceptance fixtures. No install instructions, no eBPF setup, no claim that this is a standalone product.
This page walks through what one Assay-Runner measured run produces, using
the canonical golden artifacts already checked into the repo. The goal is
not to teach you how to run a measured run yourself — that requires a
delegated Linux/eBPF host class (assay-bpf-runner) — but to make it
concrete what comes out of one. If you are evaluating whether a
deterministic proof-bundle layer would be useful next to your existing
observability or testing setup, this is the document that answers "what
am I actually looking at?".
If you arrived here from GitHub Discussion #1329 or the AgentSight Issue #44, this is the conceptual companion to the Phase 1 + 2 retrospective.
What This Is Not
- Not an install guide. There is no cargo install assay-runner.
- Not a live monitor. Nothing in this document streams.
- Not an instrumentation library. There is no SDK for the user to import.
- Not a comparison against any specific observability product. The point is to explain the shape of the artifact, not to position it against alternatives.
What One Measured Run Produces
A measured run produces one deterministic .tar.gz archive. The archive
contains, per the
assay.runner.archive_manifest.v0 layout (see also
the schema constants in crates/assay-runner-schema/src/archive_manifest.rs):
- Three load-bearing v0 JSON artifacts that reviewers and CI gates read: observation-health.json, capability-surface.json, correlation-report.json. Each carries an assay.runner.*.v0 schema string.
- One archive manifest: manifest.json (schema assay.runner.archive_manifest.v0) listing every file in the archive with its SHA-256 and byte count.
- One whole-archive event stream: events.ndjson.
- Three per-layer event streams: layers/kernel.ndjson, layers/policy.ndjson, layers/sdk.ndjson.
run-archive.tar.gz
├── manifest.json # archive manifest (schema, run_id, files[])
├── observation-health.json # honest health-of-observation report
├── capability-surface.json # what the run touched (paths/tools/decisions)
├── correlation-report.json # SDK/policy/kernel correlation by tool_call_id
├── events.ndjson # whole-archive event stream
└── layers/
    ├── kernel.ndjson # cgroup-scoped normalized kernel events
    ├── policy.ndjson # MCP allow/deny decisions
    └── sdk.ndjson # normalized SDK tool-call events
The three load-bearing JSON files are what reviewers, CI gates, and
cross-runtime diff projections actually read. The ndjson streams are the
layer-level evidence the JSON files are computed from. The archive
verifies through the existing Assay evidence path: every file is hashed
and recorded in manifest.json, and the runner does not ship its own
verifier.
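The manifest check described above can be sketched in a few lines of Python. This is not the Assay evidence path (which this page does not show), only an illustration of re-hashing archive members against manifest.json. The entry field names path, sha256, and bytes are assumptions, as is the choice that manifest.json does not list itself. build_demo_archive is a hypothetical helper that exists only to make the sketch self-contained.

```python
import hashlib
import io
import json
import tarfile


def check_manifest(archive_bytes: bytes) -> list:
    """Return mismatches between manifest.json and the archive members."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        members = {
            m.name: tar.extractfile(m).read()
            for m in tar.getmembers()
            if m.isfile()
        }
    manifest = json.loads(members["manifest.json"])
    problems = []
    listed = {"manifest.json"}  # assumption: the manifest does not hash itself
    for entry in manifest["files"]:
        # "path", "sha256" and "bytes" are hypothetical field names.
        path = entry["path"]
        listed.add(path)
        data = members.get(path)
        if data is None:
            problems.append(f"missing: {path}")
        elif len(data) != entry["bytes"]:
            problems.append(f"size mismatch: {path}")
        elif hashlib.sha256(data).hexdigest() != entry["sha256"]:
            problems.append(f"hash mismatch: {path}")
    problems += [f"unlisted: {name}" for name in sorted(set(members) - listed)]
    return problems


def build_demo_archive(files: dict, tamper: str = "") -> bytes:
    """Build an in-memory .tar.gz with a manifest; optionally corrupt one file."""
    manifest = {
        "schema": "assay.runner.archive_manifest.v0",
        "run_id": "run_demo",
        "files": [
            {"path": p, "sha256": hashlib.sha256(b).hexdigest(), "bytes": len(b)}
            for p, b in sorted(files.items())
        ],
    }
    payload = dict(files)
    if tamper:
        payload[tamper] = payload[tamper][::-1]  # same length, different bytes
    payload["manifest.json"] = json.dumps(manifest).encode()
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in sorted(payload.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

Calling check_manifest on an untouched demo archive returns an empty list. Corrupting one member after the manifest was written surfaces as a hash mismatch for that path.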
Observation Health — Honesty About Gaps
This is the most important artifact. It says what was observed cleanly and what was not. A measured run that lost kernel events, missed a policy decision, or had a degraded SDK capture is required to say so here.
Canonical golden, from
golden/observation-health-openai-agents-kernel-policy-v0.json:
{
  "schema": "assay.runner.observation_health.v0",
  "run_id": "run_openai_agents_kernel_policy_determinism",
  "platform": "linux",
  "kernel_layer": "complete",
  "ringbuf_drops": 0,
  "policy_layer": "present",
  "sdk_layer": "self_reported",
  "cgroup_correlation": "clean",
  "network_protocol_coverage": "connect_only",
  "network_endpoint_claim_scope": "diagnostic_only",
  "notes": [
    "s2_kernel_capture: monitor_events=4 ringbuf_drops=0 network_protocol_coverage=connect_only network_endpoint_claim_scope=diagnostic_only",
    "s4_policy_capture: policy_events=1",
    "s5_sdk_capture: sdk_events=3 sdk_tool_calls=1"
  ]
}
How to read this:
- kernel_layer: complete plus ringbuf_drops: 0 means the kernel did not lose any events the eBPF ring buffer handed us. If it had, this would say degraded and the count would be non-zero, and the rest of the bundle would have to be interpreted in that light.
- network_protocol_coverage: connect_only is the honesty boundary for the current Runner network surface: clean capture does not imply protocol-complete QUIC peer attribution.
- Runs that emit sendto or sendmsg peer events can report datagram_peer_observed or connect_and_datagram_peer_observed instead. That strengthens the transport observation but does not, by itself, create a request-level or exact peer-set claim.
- network_endpoint_claim_scope: diagnostic_only means any network_endpoints values are coarse/diagnostic evidence, not an exact datagram peer set.
- policy_layer: present means MCP policy decisions were captured.
- sdk_layer: self_reported is the honest framing: SDK events come from the SDK itself, so we record them but never call them kernel-corroborated.
- cgroup_correlation: clean means the child process landed in the measured cgroup before it spawned, so the kernel observation window matches the process's actual lifetime.
If any of these degrade, the v0 contract requires the bundle to say so. A measured run does not pretend to be cleaner than it is.
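A reviewer or CI gate could read the health report along these lines. This is a minimal sketch, not part of Assay-Runner: health_gaps is a hypothetical helper, and treating every value other than the ones in the golden as "not clean" is an assumption, since this page does not enumerate the full value set. The embedded JSON is the golden above with notes omitted.

```python
import json

# Values the golden shows for a clean run. Flagging any other value is an
# assumption of this sketch, not a statement of the v0 contract.
CLEAN_VALUES = {
    "kernel_layer": "complete",
    "ringbuf_drops": 0,
    "policy_layer": "present",
    "cgroup_correlation": "clean",
}


def health_gaps(health: dict) -> list:
    """List every gated field whose value differs from the clean golden value."""
    gaps = []
    for field, expected in CLEAN_VALUES.items():
        actual = health.get(field)
        if actual != expected:
            gaps.append(f"{field}: expected {expected!r}, got {actual!r}")
    return gaps


golden = json.loads("""
{
  "schema": "assay.runner.observation_health.v0",
  "run_id": "run_openai_agents_kernel_policy_determinism",
  "platform": "linux",
  "kernel_layer": "complete",
  "ringbuf_drops": 0,
  "policy_layer": "present",
  "sdk_layer": "self_reported",
  "cgroup_correlation": "clean",
  "network_protocol_coverage": "connect_only",
  "network_endpoint_claim_scope": "diagnostic_only"
}
""")

# A hypothetical degraded run: same report, but the ring buffer dropped events.
degraded = dict(golden, kernel_layer="degraded", ringbuf_drops=7)

print(health_gaps(golden))
print(health_gaps(degraded))
```

The sketch deliberately leaves sdk_layer and the two network fields ungated: as read above, they describe the scope of a claim rather than a loss of observation.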
Capability Surface — What The Run Touched
A normalized, set-shaped view of what the run did at the policy and kernel layers. Deterministic across runs that do the same thing.
Canonical golden, from
golden/capability-surface-openai-agents-kernel-policy-v0.json:
{
  "schema": "assay.runner.capability_surface.v0",
  "run_id": "run_openai_agents_kernel_policy_determinism",
  "filesystem_paths": [
    "/tmp/assay-runner-openai-agents-kernel-policy/work/openai-agents-input.txt",
    "/tmp/assay-runner-openai-agents-kernel-policy/work/policy-input.txt"
  ],
  "network_endpoints": [],
  "process_execs": [],
  "mcp_tools": [
    "read_file"
  ],
  "policy_decisions": [
    "allow:read_file"
  ]
}
How to read this:
- The fixture used one MCP tool (read_file), policy allowed it once, and two filesystem paths were touched. No network, no extra process execs.
- Sets are sorted and deduplicated. Two runs of the same fixture produce byte-identical surfaces. Two runs that diverge in observed behaviour produce a non-empty diff on this artifact, which is exactly the regression signal CI gates can read.
This is the artifact a release gate would diff against a baseline. Not "did the eval pass" — "did the surface change in a way we did not expect".
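Because every field is a sorted, deduplicated set, a baseline diff reduces to set arithmetic. The sketch below is illustrative only: surface_diff is a hypothetical helper, not the script under scripts/ci/, and the added endpoint is an invented value from a documentation address range.

```python
# The five set-shaped fields shown in the golden capability surface.
SET_FIELDS = (
    "filesystem_paths",
    "network_endpoints",
    "process_execs",
    "mcp_tools",
    "policy_decisions",
)


def surface_diff(baseline: dict, current: dict) -> dict:
    """Return only the fields that changed, with sorted added/removed entries."""
    diff = {}
    for field in SET_FIELDS:
        old = set(baseline.get(field, []))
        new = set(current.get(field, []))
        if old != new:
            diff[field] = {
                "added": sorted(new - old),
                "removed": sorted(old - new),
            }
    return diff


baseline = {
    "filesystem_paths": [
        "/tmp/assay-runner-openai-agents-kernel-policy/work/openai-agents-input.txt",
        "/tmp/assay-runner-openai-agents-kernel-policy/work/policy-input.txt",
    ],
    "network_endpoints": [],
    "process_execs": [],
    "mcp_tools": ["read_file"],
    "policy_decisions": ["allow:read_file"],
}

# Hypothetical regression: the same run now also reaches a network endpoint.
changed = dict(baseline, network_endpoints=["203.0.113.7:443"])

print(surface_diff(baseline, baseline))  # empty dict: nothing to review
print(surface_diff(baseline, changed))
```

An empty result means the surface is unchanged. Any non-empty result is the signal a gate would stop on.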
Correlation Report — Cross-Layer Binding
This is the artifact that makes the SDK / policy / kernel layers
comparable rather than three separate streams. It binds tool-calls
across layers by tool_call_id.
Canonical golden, from
golden/correlation-report-openai-agents-kernel-policy-v0.json:
{
  "schema": "assay.runner.correlation_report.v0",
  "run_id": "run_openai_agents_kernel_policy_determinism",
  "status": "clean",
  "bindings": [
    {
      "tool_call_id": "tc_runner_policy_001",
      "policy_decision": "allow",
      "kernel_event_count": 2,
      "window": {
        "start": "run_started",
        "end": "run_finished"
      }
    }
  ],
  "ambiguities": []
}
How to read this:
- The SDK declared a tool call with id tc_runner_policy_001. The policy layer recorded an allow decision under the same id. The kernel layer recorded two normalized events inside the binding window.
- status: clean means every SDK tool-call binding had a stable tool_call_id and a matching policy decision. If a runtime omitted tool_call_id, the v0 contract requires this to degrade to partial or failed with the ambiguity recorded. We do not invent ordering to paper over the gap.
- window.start / window.end are runner-defined phase markers from one canonical runner clock. SDK timestamps are informational only.
SDK And Policy Layer Streams
The ndjson layers are not checked in as goldens (the fixture produces them at acceptance time), but their shape is contract-frozen. Illustrative slices from a clean run look like this.
SDK layer, one event per line, assay.runner.sdk_event.v0:
{"schema":"assay.runner.sdk_event.v0","run_id":"run_openai_agents_kernel_policy_determinism","seq":0,"event_type":"run_started","source":"openai-agents-fixture","sdk_name":"@openai/agents","sdk_version":"0.11.4"}
{"schema":"assay.runner.sdk_event.v0","run_id":"run_openai_agents_kernel_policy_determinism","seq":1,"event_type":"tool_call_started","source":"openai-agents-fixture","sdk_name":"@openai/agents","sdk_version":"0.11.4","tool_call_id":"tc_runner_policy_001","tool":"read_file"}
{"schema":"assay.runner.sdk_event.v0","run_id":"run_openai_agents_kernel_policy_determinism","seq":2,"event_type":"tool_call_completed","source":"openai-agents-fixture","sdk_name":"@openai/agents","sdk_version":"0.11.4","tool_call_id":"tc_runner_policy_001","tool":"read_file"}
Policy layer (also ndjson, MCP decision records). One illustrative entry:
{
  "tool_call_id": "tc_runner_policy_001",
  "tool": "read_file",
  "decision": "allow",
  "reason": "policy:tools.allow",
  "ts_runner": "policy_layer_captured"
}
tool_call_id is the join key across all three layers. That is the
whole v0 correlation contract: one stable id, one window, three layers,
no inferred ordering.
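The join itself can be sketched as a plain group-by on tool_call_id. This is an illustration, not the normalizer in crates/assay-runner-core/: join_by_tool_call_id and unmatched are hypothetical helpers, the SDK lines are the slices above trimmed to the fields the join needs, and the kernel layer is left out because its event shape is not shown on this page.

```python
import json
from collections import defaultdict

# SDK slice from above, trimmed to the fields used by the join.
SDK_LINES = "\n".join([
    '{"schema":"assay.runner.sdk_event.v0","seq":0,"event_type":"run_started"}',
    '{"schema":"assay.runner.sdk_event.v0","seq":1,"event_type":"tool_call_started",'
    '"tool_call_id":"tc_runner_policy_001","tool":"read_file"}',
    '{"schema":"assay.runner.sdk_event.v0","seq":2,"event_type":"tool_call_completed",'
    '"tool_call_id":"tc_runner_policy_001","tool":"read_file"}',
])

POLICY_LINES = (
    '{"tool_call_id":"tc_runner_policy_001","tool":"read_file","decision":"allow",'
    '"reason":"policy:tools.allow","ts_runner":"policy_layer_captured"}'
)


def join_by_tool_call_id(sdk_ndjson: str, policy_ndjson: str) -> dict:
    """Group SDK event types and policy decisions under their tool_call_id."""
    joined = defaultdict(lambda: {"sdk_events": [], "policy_decisions": []})
    for line in sdk_ndjson.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        # Events such as run_started carry no tool_call_id and are skipped.
        if "tool_call_id" in event:
            joined[event["tool_call_id"]]["sdk_events"].append(event["event_type"])
    for line in policy_ndjson.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        joined[record["tool_call_id"]]["policy_decisions"].append(record["decision"])
    return dict(joined)


def unmatched(joined: dict) -> list:
    """Ids seen in only one layer; no ordering is inferred to repair them."""
    return sorted(
        tc for tc, rows in joined.items()
        if not rows["sdk_events"] or not rows["policy_decisions"]
    )


print(join_by_tool_call_id(SDK_LINES, POLICY_LINES))
```

The sketch reports ids that appear in only one layer instead of guessing a match, which mirrors the "no inferred ordering" rule. How the real report maps such cases to partial or failed is not specified here.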
Cross-Runtime Diff — Comparing Two Runtimes
A cross-runtime diff projects the capability surface across two
different runtime fixtures (here @openai/agents vs google-genai),
applies the canonicalization rules from
cross-runtime-diff-decisions.md,
and produces a diff with explicit non-claims.
Excerpt from
golden/cross-runtime-diff-s5-gemini-v0.json:
{
  "schema": "assay.runner.cross_runtime_diff.v0",
  "base_runtime": "s5_openai_agents",
  "head_runtime": "gemini_google_genai",
  "status": "clean",
  "preconditions": {
    "base_health_clean": true,
    "head_health_clean": true,
    "stable_tool_call_ids_required": true,
    "stable_tool_call_ids_present": true,
    "runtimes_distinct": true
  },
  "surface": {
    "filesystem_paths": {
      "added": ["<work>/gemini-input.txt"],
      "removed": ["<work>/openai-agents-input.txt"],
      "unchanged": ["<work>/policy-input.txt"]
    },
    "mcp_tools": { "added": [], "removed": [], "unchanged": ["read_file"] },
    "policy_decisions": { "added": [], "removed": [], "unchanged": ["allow:read_file"] }
  },
  "sdk_metadata": {
    "comparison": "side_band_provenance",
    "base": { "sdk_name": "@openai/agents", "sdk_version": "0.11.4" },
    "head": { "sdk_name": "google-genai", "sdk_version": "2.6.0" }
  },
  "non_claims": [
    "cross_runtime_no_acceptability_judgment",
    "cross_runtime_no_declared_capability_input",
    "cross_runtime_no_derived_binding_identity",
    "cross_runtime_no_filename_semantic_equivalence",
    "cross_runtime_no_sdk_capability_equivalence"
  ]
}
How to read this:
- The two runtimes touched different per-fixture filename prefixes (openai-agents-input.txt vs gemini-input.txt) but reached the same policy decision on the same MCP tool (read_file, allow). That difference is in the surface diff, surfaced explicitly, not silently normalized away.
- The <work>/ prefix is the A1 work-dir canonicalization rule: per-fixture work directories are normalized to a stable prefix so the diff is meaningful across runtimes without losing per-fixture filenames.
- non_claims is the heart of v0 honesty: the diff does not claim that two runtimes touching read_file means they have semantically equivalent capabilities, and it does not pretend to derive binding identity across runtimes. Cross-runtime equivalence is a separate, not-yet-opened, contract question.
If a schema change made these two runtimes diverge unexpectedly, the diff catches it before it ships. That is the regression surface, and it's the thing CI can read.
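The work-dir canonicalization and the added/removed/unchanged projection can be sketched as follows. This is a rough illustration under stated assumptions, not the script in scripts/ci/: canonicalize and project are hypothetical helpers, the exact A1 rule lives in cross-runtime-diff-decisions.md and is only approximated here as a prefix rewrite, and the head work directory is an invented path.

```python
def canonicalize(paths, work_dir: str) -> set:
    """Rewrite paths under work_dir to a stable <work>/ prefix; keep others as-is."""
    prefix = work_dir.rstrip("/") + "/"
    return {
        "<work>/" + p[len(prefix):] if p.startswith(prefix) else p
        for p in paths
    }


def project(base: set, head: set) -> dict:
    """Three-way projection matching the shape of the surface entries above."""
    return {
        "added": sorted(head - base),
        "removed": sorted(base - head),
        "unchanged": sorted(base & head),
    }


# Base paths come from the capability-surface golden above.
BASE_DIR = "/tmp/assay-runner-openai-agents-kernel-policy/work"
base_paths = [
    BASE_DIR + "/openai-agents-input.txt",
    BASE_DIR + "/policy-input.txt",
]

# Hypothetical head work dir; only the filenames are taken from the diff excerpt.
HEAD_DIR = "/tmp/assay-runner-gemini-demo/work"
head_paths = [
    HEAD_DIR + "/gemini-input.txt",
    HEAD_DIR + "/policy-input.txt",
]

result = project(
    canonicalize(base_paths, BASE_DIR),
    canonicalize(head_paths, HEAD_DIR),
)
print(result)
```

Without the prefix rewrite, every path would show up as both added and removed because the work directories differ. With it, only the per-fixture filenames remain as differences, and policy-input.txt is reported as unchanged.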
What This Bundle Is Useful For
- Release gating. Diff today's surface against last release's baseline; non-empty diff blocks the release unless explicitly approved.
- Regression testing. Run the same agent fixture before and after a change; the surface should be byte-identical or the diff has to explain itself.
- Cross-runtime comparison. When the same prompt runs under two different agent runtimes, the cross-runtime diff says where they agree and where they don't, without making semantic-equivalence claims.
- Honest evidence under load. The observation health is the contract that the bundle is not lying about gaps. Degradation is recorded, not hidden.
What it is not useful for:
- "What is my agent doing right now" — Assay-Runner is not a live monitor. Use a system-level observability tool for that.
- Production traffic analysis — the contract is per-run, not per-fleet.
- Live LLM call observability — the supported path uses deterministic local providers; live LLM observability is a different problem.
Where Each Artifact Comes From
If you want to read the code:
- Schemas and manifest types: crates/assay-runner-schema/
- Archive assembly and layer normalizers: crates/assay-runner-core/
- Cgroup placement primitives: crates/assay-runner-linux/
- Cross-runtime diff projection: scripts/ci/ (currently script-hosted)
- Fixtures producing the goldens above: runner-fixtures/openai-agents/ and runner-fixtures/gemini-google-genai/
All four runner crates are publish = false. The fixtures require a
delegated Linux/eBPF host class (assay-bpf-runner) to produce real
acceptance runs. The golden JSON artifacts referenced from this page
are checked-in snapshots so reviewers can read the contract without
running anything.
Further Reading
- Phase 1 + 2 retrospective — the long-form story
- Assay-Runner reference index — all internal contracts
- Runner artifact v0 contracts — full schema specs
- Runner cross-runtime diff v0 contract
- Runner cross-runtime diff decisions (A1+B3+C1)
- Phase 2D consolidation audit — extraction-readiness gating