PRML Cookbook

June 1, 2026

Short, opinionated patterns for using PRML in real ML evaluation pipelines.


This is the field manual for the PRML specification. The spec tells you what a manifest is. The cookbook tells you how to use it without shooting yourself in the foot.

Every pattern is:

  • One page — read in under three minutes
  • Self-contained — the example runs end-to-end with the snippets shown
  • Failure-mode-first — what goes wrong is named before what goes right

Patterns

| # | Pattern | When to use |
|---|---|---|
| 1 | Single-shot eval claim | One model, one benchmark, one number — the 90% case. |
| 2 | Multi-seed eval claim | When you report mean ± std over N seeds. |
| 3 | Streaming Elo / arena eval | Live leaderboards. (Uses v0.2 streaming variant.) |
| 4 | Dataset version pinning | Benchmarks evolve; how to commit to a specific revision. |
| 5 | CI gate via prml-verify-action | Block PRs that ship a model with a tampered eval claim. |
| 6 | Public registry anchoring | When and when not to publish your hash publicly. |
| 7 | Revocation | Withdrawing a manifest after publication. (v0.2 feature.) |
| 8 | Pre-registration without infrastructure | The minimum-viable workflow: a YAML file and sha256sum. |
| 9 | RLHF win-rate evaluations | Judge-model comparisons (AlpacaEval, MT-Bench, Arena-Hard). |
| 10 | Federated evaluation | Multi-org replication: shared hash, distinct producers, regulator-grade audit trail. |
| 11 | PRML + Sigstore for execution integrity | Closes the §8.1 gap: who ran the eval, when, against which exact artefacts. |
| 12 | PRML in Hugging Face model cards | Make the accuracy number on a published HF model card verifiable, not trust-me prose. |
| 13 | PRML + commit-reveal validation for independence attestation | Closes the other §8.1 gap: structural proof that independent evaluators couldn't coordinate verdicts. Co-authored with ValiChord. |
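
The minimum-viable workflow named in Pattern 8 (a YAML file and sha256sum) can be sketched in Python for environments where the `sha256sum` binary is not available. This is a sketch under stated assumptions, not the spec's procedure: the file name `claim.prml.yaml`, the helper names, and every manifest field shown are hypothetical, and the code hashes the file's raw bytes, as `sha256sum` does. If the PRML spec defines a canonical serialization for hashing, follow the spec instead.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical manifest content. The field names are illustrative only and
# are NOT taken from the PRML specification.
MANIFEST_TEXT = """\
model: example-model
dataset: example-benchmark
metric: accuracy
threshold: 0.90
direction: ">="
"""


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 of the file's raw bytes, as `sha256sum` prints it."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def preregister(directory: Path) -> str:
    """Write the manifest, then hash it. Call this BEFORE running the eval."""
    manifest = directory / "claim.prml.yaml"  # hypothetical file name
    # write_bytes avoids platform newline translation, which would change the hash.
    manifest.write_bytes(MANIFEST_TEXT.encode("utf-8"))
    return sha256_of_file(manifest)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        digest = preregister(Path(tmp))
        # Record this digest somewhere timestamped before the run starts.
        print(digest)
```

The ordering is what matters here: the digest is computed and recorded before any evaluation code executes, which is what anti-pattern A1 warns about.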

Anti-patterns

| # | Anti-pattern | Why it bites |
|---|---|---|
| A1 | Computing the hash after the run | The whole point is committing before. |
| A2 | Editing the manifest "to fix a typo" | Any edit breaks the hash. Use revocation. |
| A3 | Storing private data in the manifest | The hash is published; the manifest content might be too. |
| A4 | Treating the hash as proof of truth | The hash proves commitment, not correctness. |
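
To make A2 concrete, here is a minimal sketch showing that even a one-character typo fix produces a different digest. The manifest bytes and the helper names `digest` and `matches_commitment` are hypothetical, and hashing raw bytes with SHA-256 is an assumption carried over from the `sha256sum` workflow in Pattern 8.

```python
import hashlib
import hmac


def digest(manifest_bytes: bytes) -> str:
    """Hex SHA-256 of the manifest's raw bytes."""
    return hashlib.sha256(manifest_bytes).hexdigest()


def matches_commitment(manifest_bytes: bytes, committed_hex: str) -> bool:
    """True only if the manifest is byte-for-byte what was committed."""
    return hmac.compare_digest(digest(manifest_bytes), committed_hex.lower())


# Hypothetical manifest containing a typo ("acuracy"); fields are illustrative.
original = b"metric: acuracy\nthreshold: 0.90\n"
committed = digest(original)  # recorded before the run

# A well-meaning "typo fix" after the commitment was recorded.
fixed = original.replace(b"acuracy", b"accuracy")

print(matches_commitment(original, committed))  # True
print(matches_commitment(fixed, committed))     # False
```

A `True` result only says the bytes are unchanged since the commitment. It says nothing about whether the claim inside the manifest is correct, which is the point of A4.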

Reference

  • Identity levels (0–4) — a non-normative ladder for the binding strength between producer and the real-world authoring entity. Used by Pattern 11 and the v0.3 RFC.

Audit & compliance crosswalks

Subcategory-by-subcategory maps from major AI governance frameworks to PRML fields, each tagged FULL, PARTIAL, or NONE.

Examples

Working code in examples/:

  • pytorch-imagenet/ — Full example: PRML manifest before a PyTorch ImageNet eval, hash committed, post-run verification
  • stable-baselines3-rl/ — RL agent on LunarLander-v2, mean episode reward claim, threshold direction >=
  • inspect-ai-refusal/ — Refusal-rate eval via Inspect AI, PRML pre-registration via falsify-inspect
  • huggingface-eval/ — lm-eval-harness integration, multi-task pre-registration

License

  • Documentation, patterns, examples: CC0 1.0 — public domain dedication. Mirror, fork, modify without attribution.
  • Any tooling: MIT.

Contributing

Pattern proposals welcome via PR. Each new pattern must:

  1. Solve a real problem someone hit while implementing PRML
  2. Be reproducible — name the tools and their versions
  3. Include a "what doesn't work" section (we are not selling)
  4. Be under 800 words

Open an issue first if you're unsure whether your pattern fits.

Authors

Cüneyt Öztürk
Contact: hello@falsify.dev · falsify.dev


Status

  • v0.1 stable. v0.2 RFC open through 2026-05-22 — spec.falsify.dev/v0.2-rfc.
  • The PRML JSON Schema is in the SchemaStore catalog (merged 2026-05-11), so *.prml.yaml files autocomplete in VS Code, JetBrains, Helix, Zed, and Cursor out of the box.

Contributing

See CONTRIBUTING.md and the good first issue label for scoped work.

Cite the spec: Öztürk, C. (2026). PRML v0.1. Zenodo. https://doi.org/10.5281/zenodo.20177839