Examples
May 28, 2026
Each example has: the claim in plain English, the falsification
criterion, a spec.yaml sketch, and what PASS / FAIL would mean.
These are illustrative — drop them into .falsify/<name>/spec.yaml
and adapt the experiment.command to your codebase.
Example 1 — Model accuracy regression
Scaffold this with falsify init --template accuracy.
Claim (plain): "Our classifier maintains at least 92% accuracy on the holdout set."
Spec sketch:
claim: Classifier accuracy ≥ 92% on holdout.
falsification:
  failure_criteria:
    metric: accuracy
    direction: below
    threshold: 0.92
  minimum_sample_size: 500
  stopping_rule: fixed sample
experiment:
  command: python3 eval.py --holdout data/holdout.csv --report out.json
  dataset: data/holdout.csv
  metric_fn: eval:accuracy
Run falsify why <name> for a plain-English explanation of the
current state. Behavior:
- PASS (exit 0): accuracy ≥ 0.92 on ≥ 500 rows.
- FAIL (exit 10): accuracy < 0.92.
- INCONCLUSIVE (exit 2): fewer than 500 rows evaluated.
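The three outcomes above can be sketched as a small decision function. This is an illustrative sketch, not falsify's own implementation: the name `decide` and its parameters are hypothetical, and checking the sample size before the threshold is an assumption. Only the exit codes (0, 10, 2) and the `direction` / `threshold` semantics come from the text.

```python
from typing import Tuple

# Exit codes as listed in the example above.
EXIT_PASS, EXIT_INCONCLUSIVE, EXIT_FAIL = 0, 2, 10


def decide(value: float, n: int, direction: str,
           threshold: float, minimum_sample_size: int) -> Tuple[str, int]:
    """Return (verdict, exit_code) for one metric reading (hypothetical helper)."""
    # Assumption: with too few rows, neither PASS nor FAIL is reported.
    if n < minimum_sample_size:
        return "INCONCLUSIVE", EXIT_INCONCLUSIVE
    if direction == "below":
        falsified = value < threshold   # claim fails when the metric drops below
    elif direction == "above":
        falsified = value > threshold   # claim fails when the metric rises above
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return ("FAIL", EXIT_FAIL) if falsified else ("PASS", EXIT_PASS)


print(decide(0.93, 600, "below", 0.92, 500))  # enough rows, above threshold
print(decide(0.90, 600, "below", 0.92, 500))  # enough rows, below threshold
print(decide(0.99, 120, "below", 0.92, 500))  # too few rows
```

The same function covers the latency and Brier examples by passing `direction="above"`.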
Example 2 — Latency regression gate
Scaffold this with falsify init --template latency.
Claim: "P95 request latency stays under 200ms after this refactor."
Spec sketch:
claim: P95 latency ≤ 200ms on staging benchmark.
falsification:
  failure_criteria:
    metric: p95_latency_ms
    direction: above
    threshold: 200
  minimum_sample_size: 10000
  stopping_rule: fixed sample
experiment:
  command: python3 bench/run.py --n 10000 --out bench.json
  metric_fn: bench:p95_ms
Run falsify why <name> for a plain-English explanation of the
current state. Behavior: PASS = latency stays within the budget; FAIL = a
regression merged in. Useful as a pre-release gate — CI blocks
the release tag when falsify verdict exits 10.
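As a rough illustration of what a function like `bench:p95_ms` might compute, the sketch below takes raw latency samples and returns the P95 together with the sample count. It uses the nearest-rank percentile definition, which is an assumption: your benchmark may use interpolation instead. Passing the samples in as a list (rather than reading bench.json) is also a simplification for the example.

```python
from typing import Sequence, Tuple


def p95_ms(latencies_ms: Sequence[float]) -> Tuple[float, int]:
    """Nearest-rank 95th percentile and sample size (illustrative sketch)."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank: smallest 1-based rank r with r >= 0.95 * n,
    # computed in integers to avoid floating-point rounding.
    rank = (95 * n + 99) // 100
    return ordered[rank - 1], n


samples = [float(ms) for ms in range(1, 101)]  # 1.0 .. 100.0 ms
print(p95_ms(samples))
```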
Example 3 — Prediction market calibration
Scaffold this with falsify init --template brier.
Claim: "Our market prices are well-calibrated: Brier score below 0.25 across last-30-day closed markets."
Spec sketch:
claim: Brier score < 0.25 on closed markets, last 30 days.
falsification:
  failure_criteria:
    metric: brier
    direction: above
    threshold: 0.25
  minimum_sample_size: 20
  stopping_rule: fixed sample (30-day rolling window)
experiment:
  command: python3 calibration/brier.py --window 30d --out out.json
  metric_fn: calibration.brier:compute
Run falsify why <name> for a plain-English explanation of the
current state. Behavior: this spec generalizes the calibration sample in
examples/calibration_sample/. PASS = calibrated,
FAIL = re-train or re-price.
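For reference, the Brier score is the mean squared difference between forecast probabilities and binary outcomes. The sketch below is a minimal version under stated assumptions: the function name and the in-memory inputs are hypothetical, and the real `calibration.brier:compute` presumably reads its data from the experiment output instead.

```python
from typing import Sequence, Tuple


def brier(forecasts: Sequence[float], outcomes: Sequence[int]) -> Tuple[float, int]:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("no closed markets to score")
    n = len(forecasts)
    score = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n
    return score, n


# A forecaster that always says 0.5 scores exactly 0.25.
print(brier([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

Because a constant 0.5 forecast already scores 0.25, the threshold in this example only rules out markets that do worse than an uninformed forecaster.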
Example 4 — Code review LLM agreement
Scaffold this with falsify init --template llm-judge.
Claim: "When our LLM code reviewer flags a line as critical, a human reviewer agrees at least 80% of the time."
Spec sketch:
claim: Human-LLM agreement on critical-line flags ≥ 80%.
falsification:
  failure_criteria:
    metric: agreement_rate
    direction: below
    threshold: 0.80
  minimum_sample_size: 100
  stopping_rule: fixed sample
experiment:
  command: python3 eval_agreement.py --labeled labeled.csv --predictions preds.csv --out out.json
  metric_fn: eval_agreement:agreement_rate
Run falsify why <name> for a plain-English explanation of the
current state. Behavior: FAIL = the LLM reviewer is disagreeing with humans too
often → don't ship it as the default path. PASS = safe to enable
without human-in-the-loop for every flag.
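One way to read the claim is as a conditional rate: among lines the LLM flagged as critical, how often did the human agree? The sketch below computes that under stated assumptions. The row shape `(llm_flagged_critical, human_agrees)` is hypothetical, and the column layout of labeled.csv and preds.csv is not specified in this document.

```python
from typing import Iterable, Tuple


def agreement_rate(rows: Iterable[Tuple[bool, bool]]) -> Tuple[float, int]:
    """Share of LLM-critical flags a human agreed with, plus the flag count."""
    # Only lines the LLM flagged as critical count toward the denominator.
    human_votes = [human_agrees for llm_flagged, human_agrees in rows if llm_flagged]
    if not human_votes:
        raise ValueError("the LLM flagged no lines as critical")
    return sum(human_votes) / len(human_votes), len(human_votes)


rows = [
    (True, True),
    (True, False),
    (False, True),   # not flagged by the LLM, so ignored
    (True, True),
    (True, True),
]
print(agreement_rate(rows))
```

Here the sample size is the number of flagged lines, not the number of reviewed lines, so minimum_sample_size: 100 would mean 100 critical flags.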
Example 5 — AB test preregistration (bonus)
Scaffold this with falsify init --template ab.
Claim: "Variant B has higher click-through rate than A, at p<0.05 with a minimum detectable effect of 2 percentage points."
Spec sketch:
claim: Variant B CTR > Variant A CTR (MDE 2pp, alpha 0.05).
falsification:
  failure_criteria:
    metric: ctr_b_minus_a
    direction: below
    threshold: 0.02
  minimum_sample_size: 20000
  stopping_rule: fixed sample (no peeking)
experiment:
  command: python3 ab/run.py --snapshot out.json
  metric_fn: ab:lift
Run falsify why <name> for a plain-English explanation of the
current state. Behavior: the stopping_rule: fixed sample (no peeking) note
guards against p-hacking via repeated looks. FAIL = observed lift below the
2pp minimum detectable effect; PASS = lift ≥ 2pp on the full pre-registered sample.
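The metric ctr_b_minus_a is an absolute difference in click-through rates. A minimal sketch of a function like `ab:lift` follows, with hypothetical parameter names and the assumption that the sample size is the total number of impressions across both arms. It computes only the point estimate; the significance test named in the claim is not shown.

```python
from typing import Tuple


def lift(clicks_a: int, views_a: int, clicks_b: int, views_b: int) -> Tuple[float, int]:
    """Absolute CTR difference (B minus A) and total impressions."""
    if views_a <= 0 or views_b <= 0:
        raise ValueError("both variants need at least one impression")
    ctr_a = clicks_a / views_a
    ctr_b = clicks_b / views_b
    # Assumption: sample size counts impressions in both arms together.
    return ctr_b - ctr_a, views_a + views_b


value, n = lift(clicks_a=500, views_a=10_000, clicks_b=750, views_b=10_000)
print(round(value, 4), n)
```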
Why the template is the same
All five examples share the same schema: metric + direction + threshold + minimum_sample_size + stopping_rule. This uniformity is the point — every empirical claim in your stack becomes a hash-anchored, CI-gated artifact, regardless of domain.
Adapting an example to your repo
- Copy the spec sketch into .falsify/<name>/spec.yaml.
- Fill in experiment.command to run your evaluation.
- Point metric_fn at the module:function that reads the experiment output and returns a (float, int) tuple, where the float is the metric and the int is the sample size.
- Run the pipeline:
  python3 falsify.py lock <name>
  python3 falsify.py run <name>
  python3 falsify.py verdict <name>
- If this check belongs in your release flow, install the hook so CI enforces it automatically:
  python3 falsify.py hook install
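The metric_fn contract from the steps above can be sketched as follows. Only the `(float, int)` return shape comes from this document. The `report_path` argument and the `correct` / `total` keys in the report are hypothetical, so adapt them to whatever your experiment.command actually writes.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple


def accuracy(report_path: str = "out.json") -> Tuple[float, int]:
    """Read an experiment report and return (metric, sample_size)."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    # Hypothetical report keys: number of correct predictions and total rows.
    correct, total = int(report["correct"]), int(report["total"])
    if total <= 0:
        raise ValueError("report contains no evaluated rows")
    return correct / total, total


# Demo against a throwaway report file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "out.json"
    path.write_text(json.dumps({"correct": 470, "total": 500}), encoding="utf-8")
    result = accuracy(str(path))
print(result)
```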
Sharing a claim with a peer
When you want a reviewer, regulator, or collaborator to verify your
verdicts from scratch, export the audit trail and ship them the
JSONL plus the original spec.yaml:
python3 falsify.py export --include-runs > audit.jsonl
The file has one JSON object per event — every lock, every run,
every verdict — each with a schema_version, a timestamp, and
(for verdict records) a locked_hash that chains back to the
original lock. Two invocations against the same .falsify/
produce byte-identical output, so the peer can re-run export
after re-running your experiment and diff the two files to confirm
the audit chain matches.
On the receiving end:
# peer just received audit.jsonl
python3 falsify.py verify audit.jsonl # exit 0 if trustworthy
verify walks the JSONL, confirms each verdict's locked_hash
resolves to a preceding lock's canonical_hash, checks timestamps
are monotonic per spec, and refuses any file that was reordered
or whose hash chain broke after export (exit 10).
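The checks described above can be approximated in a few lines. This is a sketch under stated assumptions, not the real verify command: the `event` and `spec` field names are hypothetical, timestamps are assumed to be ISO-8601 UTC strings in one fixed format (so they sort correctly as text), and the sketch does not recompute any hashes. The `timestamp`, `locked_hash`, and `canonical_hash` fields and the exit codes come from the text.

```python
import json
from typing import Dict, Iterable, Set, Tuple

EXIT_OK, EXIT_BROKEN = 0, 10


def verify_lines(lines: Iterable[str]) -> int:
    """Walk JSONL events; return 0 if the chain holds, 10 otherwise."""
    seen_locks: Set[Tuple[str, str]] = set()   # (spec, canonical_hash) pairs
    last_timestamp: Dict[str, str] = {}        # spec -> latest timestamp seen
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        spec = event["spec"]                   # hypothetical field name
        stamp = event["timestamp"]
        # Timestamps must not go backwards within one spec.
        if spec in last_timestamp and stamp < last_timestamp[spec]:
            return EXIT_BROKEN
        last_timestamp[spec] = stamp
        kind = event["event"]                  # hypothetical field name
        if kind == "lock":
            seen_locks.add((spec, event["canonical_hash"]))
        elif kind == "verdict":
            # Each verdict must point at a lock that appeared earlier.
            if (spec, event["locked_hash"]) not in seen_locks:
                return EXIT_BROKEN
    return EXIT_OK


good = [
    '{"event": "lock", "spec": "acc", "timestamp": "2026-05-01T10:00:00Z", "canonical_hash": "abc"}',
    '{"event": "run", "spec": "acc", "timestamp": "2026-05-01T10:05:00Z"}',
    '{"event": "verdict", "spec": "acc", "timestamp": "2026-05-01T10:06:00Z", "locked_hash": "abc"}',
]
print(verify_lines(good))                   # intact chain
print(verify_lines(list(reversed(good))))   # reordered file
```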