PRML Integrity Index
June 1, 2026
A public, reproducible scorecard of how well 25+ well-known ML evaluation claims meet the 9 falsifiability criteria PRML considers minimum hygiene.
- Live page: falsify.dev/integrity
- Spec: spec.falsify.dev/v0.1
- License: CC0 1.0 (public domain dedication for data and tooling)
What this is
A snapshot dated 2026-Q2 that scores well-known evaluation claims (proprietary and open) against the eight-field PRML manifest format plus one extra criterion (pre-registration timestamp).
This is heuristic, not authoritative. The score reflects the canonical public source of each claim, that is, what the publisher chose to record. Two claims with the same accuracy can score very differently here. That's the point: the score reflects format hygiene, not model quality.
What this is not
- Not a moral ranking
- Not affiliated with, endorsed by, or written in cooperation with any of the listed publishers
- Not a tool to embarrass anyone: a low score reflects a format gap, not a conduct gap
- Not a measure of truth: PRML §8.1 states explicitly that a high score means a claim is checkable, not that it is true
The nine criteria
- Metric named: accuracy, refusal-rate, F1, etc.
- Numeric value: a scalar, not "state of the art"
- Dataset named: HumanEval, GPQA, MMLU, etc.
- Dataset hash / version pin: a specific revision or content hash
- Model version pinned: a build, date, or revision (not "GPT-4")
- Threshold direction stated: `>= 0.95`, not "around 95%"
- Sample size given: N for the eval run
- Seed published: RNG state when applicable
- Pre-registration date: timestamp showing the threshold was set before the run
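The nine criteria above can be read as nine yes/no checks per claim. The following is a minimal sketch, assuming the score is a simple count of satisfied criteria (the text does not spell out the formula). The `Criteria` class and its field names are hypothetical and are not taken from the PRML spec or from `data/schema.json`.

```python
from dataclasses import dataclass, fields


@dataclass
class Criteria:
    """One boolean per criterion. Field names are illustrative, not PRML's."""
    metric_named: bool = False
    numeric_value: bool = False
    dataset_named: bool = False
    dataset_pinned: bool = False
    model_version_pinned: bool = False
    threshold_direction: bool = False
    sample_size: bool = False
    seed_published: bool = False
    preregistered: bool = False


def score(c: Criteria) -> int:
    """Count how many of the nine criteria the public source satisfies."""
    return sum(bool(getattr(c, f.name)) for f in fields(c))


def gaps(c: Criteria) -> list:
    """List the criteria that the public source leaves unrecorded."""
    return [f.name for f in fields(c) if not getattr(c, f.name)]


# A made-up claim that names its metric, value, and dataset but nothing else.
example = Criteria(metric_named=True, numeric_value=True, dataset_named=True)
print(score(example), gaps(example))
```

Two claims reporting the same accuracy would differ here only in which fields their sources record, which matches the point that the score tracks format hygiene rather than model quality.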
Disagree with a score?
Open an issue. Use the Re-score request template:
- Link to the canonical public source (paper, model card, blog post)
- Name the field you think we missed
- Quote the relevant text or screenshot
We re-score on receipt of evidence. Re-scoring is cheap; arguing about scores is not the point.
Add a new claim
Use the New entry template. Submissions must:
- Cite a public primary source (academic paper, official model card, or vendor release notes; not third-party summaries)
- Be reproducible: anyone reading the source must be able to verify each of the 9 bits
- Use neutral language in the `note` field: describe the structural gap, never the publisher
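The submission rules above could be pre-checked before opening an issue. This is a rough sketch under stated assumptions: the keys `source_url`, `publisher`, `note`, and `bits` are hypothetical stand-ins, not the actual layout defined in `data/schema.json`, and the neutrality check is only a crude heuristic that flags notes naming the publisher.

```python
def check_submission(entry: dict) -> list:
    """Return a list of problems; an empty list means the basic checks pass.

    All key names are assumptions made for illustration.
    """
    problems = []

    # Rule 1: the entry must link to a public primary source.
    url = str(entry.get("source_url", ""))
    if not url.startswith(("http://", "https://")):
        problems.append("source_url must link to a public primary source")

    # Rule 2: each of the nine criteria must be recorded as a yes/no bit.
    bits = entry.get("bits", [])
    if len(bits) != 9 or not all(isinstance(b, bool) for b in bits):
        problems.append("bits must hold exactly nine booleans")

    # Rule 3 (heuristic): the note should describe the gap, not the publisher.
    publisher = str(entry.get("publisher", "")).lower()
    note = str(entry.get("note", "")).lower()
    if publisher and publisher in note:
        problems.append("note should describe the structural gap, not the publisher")

    return problems


good = {
    "source_url": "https://example.org/model-card",
    "publisher": "ExampleLab",
    "note": "no dataset revision pinned",
    "bits": [True] * 5 + [False] * 4,
}
print(check_submission(good))
```

A check like this cannot confirm that a source is primary or that each bit is verifiable from it; that still needs a human reading the source.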
Repository structure
data/
    entries.json        # canonical scored entries; single source of truth
    schema.json         # JSON schema for entries.json
.github/
    ISSUE_TEMPLATE/
        rescore.md      # template for re-score requests
        add-entry.md    # template for new entries
README.md               # this file
LICENSE                 # CC0 1.0; see file
CONTRIBUTING.md         # how to propose changes
The page at falsify.dev/integrity is rendered from data/entries.json (currently inlined in the page; the repo is the auditable source).
Audit & compliance crosswalks
Subcategory-by-subcategory maps from major AI governance frameworks to PRML fields, with each mapping tagged FULL, PARTIAL, or NONE:
- EU AI Act Article 12: code-level pattern for the 2 December 2027 high-risk deadline
- NIST AI RMF 1.0: GOVERN / MAP / MEASURE / MANAGE subcategory map
- ISO/IEC 42001:2023: AI Management System clause-by-clause evidence map
Versioning
The Index is versioned by quarter. A re-scored entry produces a delta entry in data/changelog.md so historical versions are reproducible.
- 2026-Q2: initial release (this version)
- 2026-Q3: planned, with 50 entries, including more open-weight claims and AISI/METR-style audited evaluations
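The delta entry that a re-score produces can be sketched as a diff between the old and new criterion bits. This is only an illustration: the actual format of `data/changelog.md` is not shown here, so the line layout, the `delta_line` helper, and the `example-claim` identifier are all hypothetical.

```python
def delta_line(entry_id, old_bits, new_bits, date):
    """Build one changelog line for a re-scored entry, or None if nothing changed.

    The output format is an assumption, not the repository's real changelog format.
    """
    if len(old_bits) != len(new_bits):
        raise ValueError("old and new scores must cover the same criteria")

    # Criteria are numbered 1..9 in the order they are listed in the README.
    changed = [
        str(i)
        for i, (old, new) in enumerate(zip(old_bits, new_bits), start=1)
        if old != new
    ]
    if not changed:
        return None

    return (
        f"- {date} {entry_id}: score {sum(old_bits)} -> {sum(new_bits)} "
        f"(criteria changed: {', '.join(changed)})"
    )


before = [True, True, True] + [False] * 6
after = [True, True, True, True] + [False] * 5
print(delta_line("example-claim", before, after, "2026-07-01"))
```

Recording which criteria flipped, rather than only the new total, is what lets a reader reconstruct an earlier quarter's scores from the current data plus the changelog.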
License
- Everything in this repository (data, README, any tooling): CC0 1.0, a public domain dedication. Mirror, fork, dispute, or re-derive it; no attribution is required.
Authors
Cüneyt Öztürk
Contact: hello@falsify.dev · falsify.dev
The Index was produced as part of the PRML v0.1 launch (2026-05) and is maintained as a community resource. Submitting a re-score is not a hostile act; it is the intended workflow.
Status
- v0.1 stable. v0.2 RFC open through 2026-05-22: spec.falsify.dev/v0.2-rfc.
- The PRML JSON Schema is in the SchemaStore catalog (merged 2026-05-11), so `*.prml.yaml` files autocomplete in VS Code, JetBrains, Helix, Zed, and Cursor out of the box.
Contributing
See CONTRIBUTING.md and the good first issue label for scoped work.
Cite the spec: Öztürk, C. (2026). PRML v0.1. Zenodo. https://doi.org/10.5281/zenodo.20177839