Pattern 1
May 8, 2026
When to use: One model. One benchmark. One number reported. The 90% case.
What you're committing to
Before you run the eval, you commit to ten fields:
prml_version: "0.1"
metric: "accuracy"
threshold: 0.92
threshold_direction: ">="
dataset: "imagenet-1k-val"
dataset_hash: "sha256:9e2c8d1a..." # SHA-256 of the resized 50,000-image tarball
model_version: "resnet50-2026-05-08-fp16"
sample_size: 50000
seed: 42
pre_registered: "2026-05-08T20:00:00Z"
The hash of these fields is your receipt that the threshold (>= 0.92) was set before the run.
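To make the receipt idea concrete, here is a minimal sketch of hashing a set of manifest fields. It assumes a sorted-key JSON serialization as the canonical form, which is an illustration only and not necessarily what falsify computes; manifest_digest is a hypothetical helper name.

```python
import hashlib
import json


def manifest_digest(fields: dict) -> str:
    """Digest of the manifest fields.

    Assumption: sorted-key compact JSON stands in for the canonical form.
    The real PRML canonicalization may differ.
    """
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


manifest = {
    "prml_version": "0.1",
    "metric": "accuracy",
    "threshold": 0.92,
    "threshold_direction": ">=",
    "dataset": "imagenet-1k-val",
    "dataset_hash": "sha256:9e2c8d1a...",  # elided in the source, kept as-is
    "model_version": "resnet50-2026-05-08-fp16",
    "sample_size": 50000,
    "seed": 42,
    "pre_registered": "2026-05-08T20:00:00Z",
}

receipt = manifest_digest(manifest)

# Moving the goalposts after the fact yields a different digest.
moved = manifest_digest({**manifest, "threshold": 0.90})
print(receipt != moved)
```

Any edit to a committed field, including a quietly lowered threshold, changes the digest, which is why the digest recorded before the run works as evidence.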
Run it
pip install falsify
falsify lock manifest.yaml
# sha256:e3b0c44298fc1c14a...
# ...run your eval...
falsify verify manifest.yaml --hash sha256:e3b0c44298fc1c14a...
# OK
What goes wrong
1. The dataset hash drifts. The most common screw-up. You compute dataset_hash against imagenet-1k-val.tar; the next day someone re-downloads the images and re-packs them in a different order, and sha256sum produces a different hash for the new tarball. Pin the content, not the path: compute the hash once and paste it into the manifest.
2. Float precision. threshold: 0.92 and threshold: 0.920 are different bytes. Pick one form for your project and stick with it. The reference impls handle this consistently — but your YAML editor may not.
3. Pre-registration drift. You commit at T=0, you start the run at T=2h. If something forces you to re-launch the eval at T=4h, the manifest still says T=0. That's correct: the commitment didn't change; only the run did. If you want to mark a re-run, emit a new manifest with a fresh pre_registered timestamp; don't edit the old one.
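One way to "pin the content, not the path" is to hash the files themselves in a fixed order, so archive order and download date stop mattering. The sketch below is an assumption about how a project might do this, not part of PRML or falsify; content_hash and the directory layout are hypothetical. The manifest above hashes the tarball, so whichever digest you commit to, compute it the same way every time.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(root: str) -> str:
    """Digest over (relative path, file digest) pairs, sorted by path."""
    base = Path(root)
    entries = sorted(
        (p.relative_to(base).as_posix(), p)
        for p in base.rglob("*")
        if p.is_file()
    )
    outer = hashlib.sha256()
    for rel, path in entries:
        inner = hashlib.sha256(path.read_bytes()).hexdigest()
        outer.update(f"{rel}\0{inner}\n".encode("utf-8"))
    return "sha256:" + outer.hexdigest()


def write_files(root: str, files: dict, names: list) -> None:
    """Write files under root in the given order (demo helper)."""
    for name in names:
        target = Path(root) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(files[name])


files = {"val/0001.jpg": b"cat", "val/0002.jpg": b"dog"}

with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    write_files(a, files, list(files))            # original order
    write_files(b, files, list(reversed(files)))  # unpacked in reverse order
    same = content_hash(a) == content_hash(b)

    # Changing a single file's bytes must change the digest.
    (Path(b) / "val/0002.jpg").write_bytes(b"doh")
    changed = content_hash(a) != content_hash(b)

print(same, changed)  # True True
```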
What doesn't work
- Hashing the model. PRML does not (and shouldn't) commit to the model weights. model_version is a label: a stable identifier you control. Wrap PRML inside a Sigstore signature if you want artifact-level integrity.
- Multiple thresholds in one manifest. Pick one. If you have two metrics, emit two manifests with two hashes. This is by design: composite thresholds make pre-registration ambiguous.
- Threshold direction left implicit. "We hit ~92% accuracy" is not a PRML claim. PRML requires you to commit to >= 0.92 or <= 0.05 or whatever shape your direction takes.
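An explicit direction makes the claim mechanically checkable. As a rough sketch (claim_holds is a hypothetical helper, not a falsify API), the five directions listed in the manifest skeleton map directly onto comparison operators:

```python
import operator

# The directions listed in the manifest skeleton.
COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def claim_holds(observed: float, threshold: float, direction: str) -> bool:
    """Return True if the observed metric satisfies the committed claim."""
    try:
        compare = COMPARATORS[direction]
    except KeyError:
        raise ValueError(f"unsupported threshold_direction: {direction!r}") from None
    return compare(observed, threshold)


print(claim_holds(0.921, 0.92, ">="))  # True
print(claim_holds(0.919, 0.92, ">="))  # False
```

A vague "~92%" has no entry in this table, so it cannot be evaluated at all, which is the point of requiring the direction up front.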
Minimal manifest skeleton (copy/paste)
prml_version: "0.1"
metric: "" # e.g. "accuracy", "f1", "refusal_rate"
threshold: 0.0
threshold_direction: ">=" # >= | <= | > | < | ==
dataset: ""
dataset_hash: "sha256:"
model_version: ""
sample_size: 0
seed: 0
pre_registered: "" # RFC 3339 UTC, e.g. "2026-05-08T20:00:00Z"
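Because the skeleton is meant to be copied, it is easy to lock a manifest with a placeholder still in it. The check below is a sketch of a pre-lock sanity pass over an already-parsed manifest. skeleton_problems is a hypothetical helper, the rules are assumptions rather than PRML's validation, and the timestamp check accepts only the Z-suffixed form shown in the skeleton, not every RFC 3339 variant.

```python
from datetime import datetime

REQUIRED = (
    "prml_version", "metric", "threshold", "threshold_direction", "dataset",
    "dataset_hash", "model_version", "sample_size", "seed", "pre_registered",
)
DIRECTIONS = {">=", "<=", ">", "<", "=="}


def skeleton_problems(manifest: dict) -> list:
    """List placeholder values left over from the copy/paste skeleton."""
    problems = [f"missing: {key}" for key in REQUIRED if key not in manifest]
    for key in ("metric", "dataset", "model_version"):
        if manifest.get(key) == "":
            problems.append(f"empty: {key}")
    if manifest.get("dataset_hash") in ("", "sha256:"):
        problems.append("empty: dataset_hash")
    if manifest.get("threshold_direction") not in DIRECTIONS:
        problems.append("bad threshold_direction")
    size = manifest.get("sample_size")
    if not isinstance(size, int) or size <= 0:
        problems.append("sample_size must be a positive integer")
    try:
        datetime.strptime(str(manifest.get("pre_registered", "")), "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        problems.append("pre_registered is not like 2026-05-08T20:00:00Z")
    return problems


skeleton = {
    "prml_version": "0.1", "metric": "", "threshold": 0.0,
    "threshold_direction": ">=", "dataset": "", "dataset_hash": "sha256:",
    "model_version": "", "sample_size": 0, "seed": 0, "pre_registered": "",
}
filled = {
    **skeleton,
    "metric": "accuracy", "threshold": 0.92, "dataset": "imagenet-1k-val",
    "dataset_hash": "sha256:9e2c8d1a...",  # elided in the source, kept as-is
    "model_version": "resnet50-2026-05-08-fp16", "sample_size": 50000,
    "seed": 42, "pre_registered": "2026-05-08T20:00:00Z",
}

print(skeleton_problems(skeleton))
print(skeleton_problems(filled))
```

The sketch deliberately leaves seed: 0 and threshold: 0.0 unflagged, since both can be legitimate committed values.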
Next pattern
If you report mean ± std across multiple seeds: see Pattern 2 — Multi-seed eval claim.