Glossary

April 22, 2026 · View on GitHub

Short definitions for the concepts and commands you will encounter in falsify's docs. Terms are ordered alphabetically; cross-references use the link target #term-slug.

Audit trail

The append-only JSONL stream produced by falsify export. Every lock, run, and verdict writes one line, so the trail reconstructs the full history of a claim from pre-registration through current state. It is what makes post-hoc threshold edits visible rather than plausibly-deniable. See also: Lock file, Verdict, ADVERSARIAL.md.

Canonical YAML

The deterministic YAML serialization used as the hash substrate: yaml.safe_dump(spec, sort_keys=True, default_flow_style=False, allow_unicode=True, width=4096). Two specs with the same semantic content produce byte-identical canonical YAML, and therefore the same spec hash. See also: Hash (spec hash), Spec, ARCHITECTURE.md.

Claim

A falsifiable statement about a measurable quantity — a threshold compared via a direction against a number returned by a metric function over a dataset. The unit of pre-registration that falsify locks, runs, and verdicts. See also: Spec, Failure criteria, TUTORIAL.md.

Claim contract

The triple (claim, failure_criteria, experiment) as it exists at lock time. It is the "contract" in the sense that changing any of its parts changes the spec hash and therefore invalidates the lock — which is the point. See also: Hash mismatch, Honest relock, ARCHITECTURE.md.

Direction

One of above, below, or equals. Governs the comparison of a metric value against the threshold and is strict: above means strictly greater-than, below means strictly less-than. The strict semantic is documented in the schema comments so that boundary cases (value exactly equal to threshold) are always unambiguous. See also: Failure criteria, Threshold, hypothesis.schema.yaml.

Dogfood

The practice of running falsify against claims about its own behaviour. The three shipped self-claims (cli_startup, test_coverage_count, claude_surface) are re-verified by make dogfood on every CI run. The shorthand makes the loop visible: if falsify breaks its own contract, you see it first. See also: Self-claim, Self-dogfooding, CLAUDE.md.

Exit code

A deterministic integer signal emitted by every falsify subcommand. Fixed semantics: 0 PASS, 2 bad spec or INCONCLUSIVE, 3 hash mismatch, 10 FAIL, 11 guard violation. Treated as API — exit codes never change without a major version bump. See also: Verdict, Replay, README.md.

Failure criteria

The falsification block of a spec: metric, direction, and threshold, plus a minimum sample size. A run PASSes when the observed metric on the required sample survives the criterion, FAILs when it does not, and goes INCONCLUSIVE when the sample is too small. See also: Claim, Threshold, Direction.

Falsification

The Popper sense: a claim is meaningful only if it is testable and could fail. falsify the tool encodes this as a workflow — the spec names the condition under which the claim would die, and the verdict records whether that condition held. See also: Claim, Pre-registration, FAQ.md.

Hash (spec hash)

A SHA-256 digest of the canonical YAML bytes of a spec. Identifies a locked claim uniquely and detects any post-lock edit. Stored in spec.lock.json under spec_hash; 64 lowercase hex characters. See also: Canonical YAML, Lock file, ARCHITECTURE.md.

Hash mismatch

The exit-3 condition when the stored spec_hash in spec.lock.json disagrees with a freshly-computed hash of the current spec. falsify run refuses to execute until the operator resolves the divergence with an honest relock or by reverting the edit. See also: Hash (spec hash), STALE, ADVERSARIAL.md.

Honest relock

Explicitly running falsify lock <name> --force after an intentional edit to a spec. Produces a new hash and an entry in the audit trail, so the change is never silent. See also: Hash mismatch, Pre-registration, CONTRIBUTING.md.

Hypothesis

The author-facing name for a spec draft — used by the hypothesis-author skill and in hypothesis.schema.yaml. Once a hypothesis is locked it becomes a claim in the CLI vocabulary; the terms are otherwise interchangeable. See also: Claim, Spec, hypothesis.schema.yaml.

INCONCLUSIVE

A verdict state where a run executed but the sample size fell below minimum_sample_size. Not a PASS, not a FAIL — the run simply did not have enough data to decide. Exit code 2. See also: Minimum sample size, Verdict, ARCHITECTURE.md.

Lock file

spec.lock.json, the small JSON artefact written by falsify lock. Stores the spec hash, the locked_at timestamp, and the canonical YAML bytes that were hashed. Tracked in git so the lock is part of the claim's history. See also: Audit trail, Spec, TUTORIAL.md.

MCP (Model Context Protocol)

The stdio protocol that exposes falsify verdicts to Claude Desktop and Claude Code sessions via mcp_server/. Ships four tools (list_verdicts, get_verdict, get_stats, check_claim) and three resource URIs. Optional install: pip install -e '.[mcp]'. See also: Verdict, ARCHITECTURE.md.

Metric function

A Python callable referenced by experiment.metric_fn in the form module:function. Must return a (value, sample_size) pair. Runs in the same process as falsify run; the metric module is imported from the current working directory. See also: Spec, Failure criteria, EXAMPLES.md.

Minimum sample size

falsification.minimum_sample_size in the spec. A floor below which a run is recorded as INCONCLUSIVE rather than PASS or FAIL — the machinery refuses to decide on underpowered data. See also: INCONCLUSIVE, Failure criteria.

Pre-registration

Locking a claim before running the experiment. The single most important discipline in falsify: the hash published at lock time is what proves the claim did not move to fit the data. See also: Prime Directive, Hash (spec hash), ADVERSARIAL.md.

Prime Directive

The three-line rule stated in CLAUDE.md: every claim is locked before it is run; hash mismatches are never silenced; threshold edits require an explicit relock and a visible audit entry. See also: Pre-registration, Honest relock, CLAUDE.md.

Replay

Re-executing a stored run's metric function against its recorded dataset, via falsify replay <run_id>. Exits 0 if the replay reproduces the stored value within tolerance, 10 otherwise — the standard reproducibility test. See also: Verdict, Exit code, ARCHITECTURE.md.

Self-claim

A claim about falsify itself, sourced under claims/self/ and locked under .falsify/. The shipped set is cli_startup, test_coverage_count, and claude_surface. See also: Dogfood, Self-dogfooding, CLAUDE.md.

Self-dogfooding

See Dogfood. The longer name is the one used in CI logs and release-check Gate 12; it means the same thing. See also: Dogfood, Self-claim.

Skill

A Claude Code capability described in .claude/skills/<name>/SKILL.md. falsify ships five (hypothesis-author, falsify, claim-audit, claim-review, falsify-ci-doctor). Skills run in the main session; heavier work routes to subagents. See also: Slash command, Subagent, ARCHITECTURE.md.

Slash command

A Claude Code prompt defined in .claude/commands/<name>.md. falsify ships three (/new-claim, /audit-claims, /ship-verdict). Commands compose skills and CLI calls into a single-keystroke workflow. See also: Skill, Subagent, README.md.

Spec

The YAML file declaring a claim, its failure criteria, and its experiment reference. The canonical bytes of this file are what gets hashed at lock time. See also: Claim, Canonical YAML, hypothesis.schema.yaml.

STALE

A verdict state where a run exists but its spec_hash no longer matches the current canonical hash — i.e., the spec has changed since the last run. Neither PASS nor FAIL; the run is simply out-of-date. See also: Hash mismatch, Verdict, ADVERSARIAL.md.

Subagent

A Claude Code agent described in .claude/agents/<name>.md, forked-context by default. falsify ships two: claim-auditor (semantic PR cross-reference) and verdict-refresher (autonomous re-runs of STALE specs). See also: Skill, Slash command, MANAGED_AGENTS.md.

Threshold

The numeric cutoff inside a failure criteria block. Compared against the metric value via the chosen direction. Changing it silently is the canonical dishonesty pattern falsify is built to surface. See also: Direction, Failure criteria, Honest relock.

UNLOCKED

A state where a spec.yaml exists but has no spec.lock.json beside it. falsify run refuses to execute an UNLOCKED spec — that would violate pre-registration. See also: Pre-registration, Lock file, Verdict.

UNRUN

A state where a spec is locked but has never been executed. Seen immediately after falsify lock on a fresh claim. The corresponding exit code depends on the subcommand; falsify why <claim> reports state: UNRUN. See also: Verdict, Pre-registration, TUTORIAL.md.

Verdict

The recorded outcome of a run: one of PASS, FAIL, INCONCLUSIVE, STALE, UNRUN, UNLOCKED, or UNKNOWN. Stored in verdict.json beside the run it describes; the value reported by falsify verdict and by the MCP get_verdict tool. See also: Exit code, Audit trail, ARCHITECTURE.md.