PRML
June 2, 2026 · View on GitHub
Version: 0.1 (Draft) Date: 2026-05-01 Status: Working Draft — Public Review Editor: Cüneyt Öztürk <hello@falsify.dev> Reference Implementation: falsify (MIT) Canonical URL: https://spec.falsify.dev/v0.1 License: CC BY 4.0
Abstract
PRML defines a content-addressed serialization format for pre-registered machine learning evaluation claims. A PRML manifest binds a metric, a numeric threshold, a dataset content hash, and a random seed to a SHA-256 digest produced before the experiment runs. After the experiment, an independent verifier recomputes the hash, executes the evaluation against the pre-registered parameters, and emits a deterministic verdict.
The format is designed to be implementable in any language, transmittable as a plain text artifact, and verifiable without network access. PRML is not an experiment-tracking platform; it is a primitive intended to underlie such platforms and to satisfy regulatory audit-trail obligations under regimes including the EU AI Act (Regulation 2024/1689) Articles 12 and 18.
Status of This Memo
This document is a working draft published for public review. It is not a
finished standard. Comments are invited at
github.com/studio-11-co/falsify/discussions or by email to hello@falsify.dev.
The next planned revision (v0.2) will incorporate review feedback and freeze the canonicalization rules of §3.
1. Introduction
1.1 Motivation
Machine learning evaluations suffer from a credibility gap that conventional experiment-tracking tools do not close. The metric, threshold, dataset, and seed that a team claims to have committed to before a training run are typically recorded only after results are observed, if at all. Post-hoc revision of these parameters — moving a threshold from 0.85 to 0.83, swapping a held-out split, re-rolling a seed — is mechanically indistinguishable from honest reporting in the absence of a cryptographic pre-commitment.
Three contemporary forces make this gap urgent:
- Regulatory. The EU AI Act's logging (Article 12) and recordkeeping (Article 18) obligations enter force August 2, 2026. High-risk AI providers must demonstrate that performance claims attached to a deployed model are the same claims registered prior to deployment.
- Scientific. Benchmark contamination, data leakage, and selective reporting consistently degrade the informativeness of public evaluations.
- Commercial. Capability claims attached to frontier model releases are frequently disputed precisely because no public, tamper-evident record of the evaluation contract exists.
PRML proposes the smallest sufficient primitive to close this gap: a hash-bound manifest, written before the run, verified after.
1.2 Conventions and Terminology
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [RFC 2119].
The following terms have specific meaning in this specification:
- Manifest — A document conforming to §2 that pre-registers an evaluation claim.
- Canonical bytes — The byte sequence produced by serializing a manifest according to §3.
- Manifest hash —
SHA-256(canonical bytes), encoded as 64 lowercase hexadecimal characters. - Producer — The party that creates and publishes a manifest before the evaluation runs.
- Verifier — Any party (including the producer) that independently recomputes the manifest hash and the evaluation outcome.
- Audit log — The append-only sequence of manifests covering successive amendments to a registered claim (§6).
2. Manifest Structure
2.1 Required Fields
A PRML manifest is a YAML 1.2 document. Implementations MUST populate the following top-level keys:
| Key | Type | Description |
|---|---|---|
version | string | Spec version. MUST equal "prml/0.1" for this revision. |
claim_id | string | UUIDv7 identifier for the claim. MUST be unique per producer. |
created_at | string | RFC 3339 timestamp in UTC, second precision. |
metric | string | Identifier of the metric being claimed. See §2.3.1. |
comparator | string | One of >=, >, ==, <=, <. See §5.1. |
threshold | number | Real number the metric is compared against. |
dataset | mapping | Identifier and content hash of the dataset. See §2.3.2. |
seed | integer | Non-negative 64-bit integer. |
producer | mapping | Identity of the manifest producer. See §2.3.3. |
2.2 Optional Fields
| Key | Type | Description |
|---|---|---|
metric_args | mapping | Free-form arguments parameterizing the metric. |
model | mapping | Identifier of the model under test, if known pre-run. |
code | mapping | Identifier of the code (e.g., git commit) used to evaluate. |
prior_hash | string | Manifest hash of the previous claim in an amendment chain (§6). |
notes | string | Human-readable annotation. MUST NOT affect verification. |
2.3 Field Semantics
2.3.1 metric
The metric value MUST be either:
- A registered identifier from the PRML Metric Registry (forthcoming, §11), or
- A URI dereferencing to a published definition.
Examples: accuracy, f1_macro, https://example.org/metrics/calibration_ece.
2.3.2 dataset
dataset:
id: <human-readable identifier>
hash: <hex SHA-256 of canonical dataset bytes>
uri: <optional retrieval URI>
The hash field MUST be the SHA-256 digest of the dataset's canonical byte
representation. The canonical representation is dataset-format-specific and
SHOULD be documented in the dataset's accompanying datasheet.
2.3.3 producer
producer:
id: <DNS-name, ORCID, or GitHub handle>
signature: <optional detached PGP or minisign signature>
The signature field, when present, MUST be a detached signature over the
canonical bytes of the manifest (the same byte sequence whose SHA-256 is
the manifest hash). This matches standard cryptographic practice (Sigstore
detached signatures, minisign, PGP --detach-sign) and protects against
a second-preimage scenario where an adversary substitutes a different
canonical byte sequence yielding an identical hash: under hash-only signing,
such a substitution would still verify; under bytes-signing, it does not.
Verifiers MUST run canonicalization regardless (it is a prerequisite for the
hash check in §5.2 step 1), so signing over the canonical bytes adds no
redundant work. Implementations SHOULD store the signature in a sidecar
file <claim_id>.prml.sig alongside the existing <claim_id>.prml.sha256
sidecar. v0.2 normatively adopts this sidecar convention.
v0.1 erratum (2026-05-02): earlier drafts of this spec instructed implementations to sign over the manifest hash. That recommendation is withdrawn. Existing v0.1 implementations that signed over the hash should re-sign over the canonical bytes before any regulatory submission.
2.3.4 created_at — declared time vs. anchor time
The created_at field is the time the producer declares to have authored
the manifest. It is part of the canonical bytes and therefore part of the
hash: any change to created_at changes the hash. This is sufficient to
prevent a producer from retroactively editing the timestamp on a published
manifest without breaking the signature chain.
It is not sufficient to prove the manifest existed at the declared time.
A producer can write any RFC 3339 string into created_at at any moment;
the spec has no way to constrain that string against a wall clock the
producer does not control.
Audit-strength timestamps come from anchor mechanisms external to the manifest:
- Git commit author/committer timestamps in a public repository,
- Registry receipt timestamps (e.g.
registry.falsify.devrecords the server-side wall clock at which it first observed a given manifest hash), - RFC 3161 timestamping authorities,
- Sigstore Rekor transparency log entries,
- arXiv submission timestamps and DOI registration dates,
- CI run timestamps recorded in public workflow logs.
A verifier evaluating "when was this committed?" MUST treat the
created_at field as a producer-side claim and look to one or more anchor
mechanisms for the authoritative answer. A v0.1-conforming producer
SHOULD anchor every published manifest in at least one such mechanism
and document the choice. v0.2 makes this a normative SHOULD; v0.1 leaves
the choice informative.
This distinction matters for §8.1 threat-model analysis: the
threat that created_at defends against is the producer retroactively
editing a published manifest. The threat that anchoring defends against is
the producer back-dating a manifest that was authored after the fact.
The two are different and require different mechanisms.
3. Canonical Serialization
3.1 YAML Subset
A PRML manifest MUST be expressible in the following YAML subset:
- Block-style mappings only (no flow-style).
- Plain scalars, double-quoted scalars, and integers.
- No anchors, aliases, or tags beyond
!!str,!!int,!!float. - ASCII-only, except where UTF-8 is explicitly permitted (e.g.,
notes).
3.2 Key Ordering
For canonicalization, all mappings MUST be reserialized with keys in lexicographic byte order. Nested mappings are ordered recursively.
3.3 Whitespace and Encoding
- Canonical output MUST be UTF-8 encoded.
- Indentation MUST be exactly two spaces per level.
- Each key-value line MUST terminate with a single LF (
0x0A). - The canonical byte sequence MUST end with a single LF.
- Trailing whitespace is PROHIBITED.
- Comments are PROHIBITED in canonical form.
A reference canonicalizer is provided by the falsify implementation and produces output byte-equivalent to the rules above for any conforming input.
4. Hash Algorithm
The manifest hash MUST be computed as:
hash = lowercase_hex(SHA-256(canonical_bytes))
Implementations MUST NOT strip a trailing newline, normalize line endings to
CRLF, or otherwise alter canonical_bytes before hashing.
The hash SHOULD be published alongside the manifest in a sidecar file
named <claim_id>.prml.sha256.
5. Verification Semantics
5.1 Comparison Operators
comparator | Pass condition |
|---|---|
>= | observed ≥ threshold |
> | observed > threshold |
== | abs(observed - threshold) < tolerance |
<= | observed ≤ threshold |
< | observed < threshold |
The == comparator's tolerance defaults to 1e-9. Producers MAY override
this by setting metric_args.tolerance.
5.2 Pass/Fail Determination
A verifier MUST:
- Recompute the manifest hash from
canonical_bytesand verify it matches the published hash. - Recompute the dataset hash from the dataset content and verify it matches
dataset.hash. - Execute the evaluation using the manifest's
metric,metric_args,seed, and dataset. - Apply the comparator from §5.1.
- Emit a verdict per §7.
5.3 Tampering Detection
If the recomputed manifest hash does not match the published hash, the verifier MUST abort verification and MUST NOT emit a Pass or Fail verdict. Implementations MUST signal tampering distinctly from evaluation failure (see §7).
6. Amendment Protocol
PRML treats every claim as immutable once hashed. Honest revision is supported through an explicit, append-only amendment chain.
6.1 Forward-Only Audit Log
A producer who needs to change any field of a previously-registered claim
MUST create a new manifest whose prior_hash field equals the manifest
hash of the previous claim. The new manifest MUST retain the claim_id of
the previous claim.
The full sequence of manifests sharing a claim_id, ordered by created_at
and verified by the prior_hash chain, constitutes the audit log for that
claim.
6.2 Amendment Semantics
- The previous manifest is NOT deleted, overwritten, or revoked.
- Verifiers MUST treat the latest manifest in the chain as the operative one, but MUST also expose the full chain when requested.
- Hash-equality of two claims with identical content but different
created_atvalues MUST NOT occur; canonicalization includes the timestamp.
6.3 Amendment Chain Hash
An aggregate identifier for the full chain, suitable for public posting, is:
chain_hash = SHA-256(concat(canonical_bytes_1, canonical_bytes_2, ..., canonical_bytes_n))
where the manifests are concatenated in created_at order.
7. Exit Code Specification
Reference implementations MUST signal verification outcomes via the following exit codes:
| Code | Meaning |
|---|---|
0 | Pass — manifest verified, evaluation satisfies comparator. |
10 | Fail — manifest verified, evaluation does not satisfy comparator. |
3 | Tampered — manifest hash mismatch, verification aborted. |
11 | Guard violation — manifest is well-formed but a producer-declared invariant (e.g., dataset-hash mismatch, seed out of range) failed. |
2 | Usage error — invalid command-line arguments or unreadable manifest. |
1 | Unspecified runtime error. |
Codes other than 0, 1, 2, 3, 10, 11 are RESERVED.
8. Security Considerations
8.1 Threat Model
PRML protects against silent post-hoc revision of a registered claim. It does NOT protect against:
- A producer who never publishes the manifest at all.
- A producer who publishes a manifest privately, runs the evaluation, then publishes only on Pass (selective publication).
- A producer colluding with the dataset host to alter dataset content while preserving its declared hash (broken hash function).
- A producer signing a manifest with a key the verifier cannot validate.
Mitigations require external mechanisms: timestamping services (RFC 3161), public manifest registries, or signed dataset hosts.
v0.1 erratum on selective publication (2026-05-02). For regulatory use — particularly EU AI Act Annex III high-risk system audits — selective publication is the most likely real-world adversary, not the theoretical ones above. The cryptographic protocol is satisfied by a producer who publishes only Pass results; the regulatory purpose is not. v0.1 implementations used in compliance contexts MUST adopt one of the following deployment-level mitigations, none of which v0.1 enforces but all of which are compatible with the v0.1 manifest format:
Publish-before-run discipline. The manifest URL is committed to a public registry (a Git tag, an RFC 3161 timestamping authority, or an immutable S3 object with public-read) before the evaluation runs, not after. The registrar's timestamp becomes the publication-time proof; the manifest itself remains regulator-verifiable offline.
Sequential
claim_idallocation. A producer'sclaim_idsequence is published as a monotonic chain (eachclaim_idis the successor of the previous, regardless of outcome). A regulator can detect missing entries in the sequence and demand explanation.External pre-registration anchor. The manifest hash is committed to a third-party pre-registration registry (OSF, ClinicalTrials.gov pattern adapted, or an in-house immutable log) before any evaluation. The anchor is what the regulator verifies; the manifest is the provenance.
v0.2 will normatively adopt option (3) for the
producer.tier: high-riskprofile. v0.1 deployments choosing not to adopt one of these three mitigations are NOT suitable for EU AI Act Article 12 evidence submission and the producer SHOULD declare so in their accompanying conformity-assessment documentation.
8.2 Hash Algorithm Agility
This revision fixes SHA-256 as the hash algorithm. Future revisions MAY
introduce algorithm agility via a hash_algorithm field defaulting to
sha-256. Verifiers conforming to v0.1 MUST reject manifests declaring any
other algorithm.
8.3 Canonical Form Attacks
A producer who serializes a manifest non-canonically and publishes the non-canonical hash is detectable: any verifier recanonicalizing the manifest will compute a different hash and emit Tampered (exit 3). This places canonicalization correctness on the verifier, not the trust path.
9. Compliance Mapping (Informative)
This section is non-normative.
9.1 EU AI Act (Regulation 2024/1689)
| Article | Obligation | PRML coverage |
|---|---|---|
| 12 | Automatic recording of events over the system's lifetime | A PRML chain is the record of evaluation events with a tamper-evident hash chain. |
| 18 | Documentation retention for 10 years post-market | PRML manifests are plain text artifacts <1 KB; retention is trivial. |
| 17 | Quality management system covering performance | Pre-registered thresholds satisfy the "objective performance metric" requirement. |
| 50 | Transparency obligations for deployers | Public manifest hashes provide the receipt deployers need. |
9.2 NIST AI Risk Management Framework
PRML directly supports the MEASURE and MANAGE functions: pre-registered manifests establish the metric framework before deployment and provide the evidence trail for ongoing monitoring.
9.3 ISO/IEC 42001 (AI Management System)
PRML manifests are admissible as objective evidence under §8.4 (Operational Planning and Control) of ISO/IEC 42001:2023.
10. Reference Implementation
The falsify project provides a reference implementation in Python. Conformance to this specification is defined as:
- Producing canonical bytes byte-equivalent to the falsify reference for the PRML test vector suite (Appendix B).
- Computing manifest hashes byte-equivalent to the reference.
- Emitting exit codes per §7 for the test vector suite.
A conformance test harness will be published with v0.2.
11. IANA Considerations
This document requests the registration of:
- File extension:
.prml - MIME type:
application/vnd.prml+yaml - Sidecar extension:
.prml.sha256
A PRML Metric Registry will be established with v0.2.
12. References
Normative
- [RFC 2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", March 1997.
- [RFC 3339] Klyne, G., "Date and Time on the Internet: Timestamps", July 2002.
- [FIPS 180-4] NIST, "Secure Hash Standard", August 2015.
- [YAML 1.2] YAML Specification, October 2009.
Informative
- [EU 2024/1689] Regulation (EU) 2024/1689 (AI Act), June 2024.
- [NIST AI RMF] NIST AI Risk Management Framework 1.0, January 2023.
- [ISO 42001] ISO/IEC 42001:2023, AI Management System.
- [Gelman 2018] Gelman, A. & Loken, E., "The garden of forking paths".
- [Ioannidis 2005] Ioannidis, J., "Why most published research findings are false".
Appendix A — Minimal Example Manifest
version: "prml/0.1"
claim_id: "01900000-0000-7000-8000-000000000000"
created_at: "2026-05-01T12:00:00Z"
metric: "accuracy"
comparator: ">="
threshold: 0.85
dataset:
id: "imagenet-val-2012"
hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
seed: 42
producer:
id: "falsify.dev"
Canonical bytes hash (illustrative; verify with reference implementation):
b2c3a1f0d8e7c6b5a4938271605f4e3d2c1b0a9988776655443322110ffeeddc
Appendix B — Test Vectors
A conformance suite of 12 test vectors is published alongside this specification at:
https://github.com/studio-11-co/falsify/tree/main/spec/test-vectors/v0.1/
Each vector defines:
- An input manifest (logical YAML mapping; key order irrelevant).
- The canonical UTF-8 byte sequence the canonicalizer MUST produce.
- The lowercase hex SHA-256 of those bytes.
An implementation conforms to PRML v0.1 if and only if it reproduces all 13 vectors exactly. The vectors cover:
| ID | Property exercised |
|---|---|
| TV-001 | Minimal valid manifest (matches Appendix A) |
| TV-002 | Key-ordering invariance — random insertion order produces same hash |
| TV-003 | Single-bit-of-content sensitivity — 0.85 vs 0.86 produces different hash |
| TV-004 | Optional fields populated (model.id, model.hash, dataset.uri) |
| TV-005 | Unicode in producer.id (UTF-8 byte handling) |
| TV-006 | Maximum seed value (2⁶⁴ - 1) |
| TV-007 | Minimum seed value (0) |
| TV-008 | Equality comparator (== with integer-valued threshold) |
| TV-009 | Amendment with prior_hash linkage to TV-001 |
| TV-010 | pass@k metric with model fields |
| TV-011 | AUROC with strict-greater comparator (>) |
| TV-012 | Regression metric (mae) with <= minimization |
Vectors are regeneratable via
python3 spec/test-vectors/v0.1/generate.py. Once v0.1 is frozen
(2026-05-22), the vectors and their hashes are immutable. Any change
requires a v0.2 spec bump.
Conformance is enforceable via the falsify reference test suite
(tests/test_prml_vectors.py); CI fails if any vector diverges.
Change Log
- v0.1 (2026-05-01) — Initial public draft.
- v0.1 (2026-05-01) — Test vector suite (13 vectors) published in
spec/test-vectors/v0.1/; Appendix B finalized.
Editor's note: This document is intended to be readable, implementable, and boring. Excitement is reserved for what gets built on top of it.