Benchmark Report Pattern (v3.3.5+)

April 15, 2026 · View on GitHub

Status: v3.3.5 — hub doc for benchmark disclosure schema
Schema: shared/benchmark_report.schema.json
Validator: scripts/check_benchmark_report.py
Template: examples/benchmark_report_template.json


Why this exists

The Anthropic automated-w2s-researcher (2026) paper headlined a performance comparison: "2 researchers × 7 days (PGR=0.23) vs 9 agents × 5 days (PGR=0.97)."

Dramatic. Also: n=2, author-conducted, no independence, self-scored. Not a scientific comparison — a presentation artifact. The authors knew what the agent would produce before sitting down to do the human baseline; their "7 days" is implicitly benchmarked against their own foreknowledge. Unaware external researchers doing the same task cold would likely score higher. The gap may partly be a sandbagged human baseline.

ARS inherits this risk the moment a user publishes an "ARS beats manual" claim. This pattern defines a mandatory schema for anyone publishing an ARS benchmark comparison. Reports that don't satisfy the schema aren't "ARS benchmarks" — they're anecdotes.


The schema at a glance

Six required top-level fields in benchmark_report.schema.json:

  • ars_version — exact semver the benchmark ran against; ties results to a specific skill version so readers can check whether that version is still current.
  • task_definition — description, task_type (outcome-gradable or open-ended), and outcome_gradable flag; distinguishes rubric-scored tasks from genuinely open-ended ones.
  • human_baseline — the five provenance fields that determine whether the human side of the comparison is credible (see field-by-field rationale below).
  • ars_run — cost, time, skills used, and declared data access level; lets readers check the "agents are cheap" or "agents are fast" claims against concrete numbers.
  • metrics — primary metric name, numeric value, and scoring independence; forces explicit disclosure of who scored the outputs.
  • caveats — non-empty array of known limitations; at least one honest statement is required by schema enforcement.

Field-by-field rationale

human_baseline

This block exists because the human side of a comparison is where benchmark credibility lives or dies. A weak human baseline inflates an agent's apparent advantage.

  • sample_size (integer, minimum 1): Schema rejects zero, forces an explicit number. n=1 or n=2 get a validator warning rather than a hard failure — some tasks are expert-bounded — but the number must be stated. Silence is worse than admitting n=1.

  • author_independence (enum: author-conducted | author-blinded | third-party-conducted): author-conducted is the "2 authors did it themselves" trap: allowed, but the schema flags the downward-bias risk. Authors know what a good ARS output looks like; their human effort is calibrated against that prior. author-blinded (authors worked without seeing ARS outputs first) is better. third-party-conducted eliminates experimenter bias entirely.

  • hours_spent (number, minimum 0): Lets third parties sanity-check the time claim. "7 days" reads differently against 20 papers versus 400.

  • recruitment (non-empty string): One sentence forces self-description of selection bias. "Authors," "hired on Upwork," "convenience sample" are all honest. Blank is not permitted.

  • tools_allowed (array of strings): "No tools" vs "ChatGPT allowed" is not the same comparison as ARS. If the human baseline used no AI assistance while ARS ran full pipeline, the comparison is testing infrastructure access, not cognitive quality.

ars_run

  • cost_usd (number, minimum 0): A run costing $500 in API tokens is not "cheap" just because it took 3 hours. Hours and cost together are needed for a fair comparison.

  • data_access_level_declared (enum: raw | redacted | verified_only): If ARS ran on raw data while the human baseline used redacted data, the benchmark is testing data privilege, not skill. Cross-references ground-truth isolation from v3.3.2.

  • skills_used (array, minItems 1): A benchmark running academic-pipeline full-mode is not comparable to one running only deep-research. Both are ARS; they're different scope claims.

metrics.scoring_independence

Four enum values: authors-scored, third-party-scored, blind-scored, self-scored.

self-scored parses and exits 0, but the validator emits a warning to stderr. Self-scoring is the worst-case alignment failure: the same agents that produced the output are being asked whether the output is good. blind-scored (human raters without knowledge of which output came from ARS versus human baseline) is the minimum credible standard for any claim that will be shared publicly.

caveats

Non-empty array, items non-empty strings, minItems 1. The schema physically prevents an empty caveats field. A benchmark report with no caveats either has no known limitations (implausible for any real-world evaluation) or the author didn't think about limitations (which disqualifies the report more than any specific limitation would).

The caveats field is where sample size warnings, scorer bias, task-selection bias, and tool-access asymmetries should land when they don't rise to schema errors. It is the honest disclosure box.


How to use

  1. Copy examples/benchmark_report_template.json to your benchmark directory.
  2. Fill every FILL IN: field with real values.
  3. Set human_baseline.sample_size to the actual number (schema rejects 0).
  4. Run: python scripts/check_benchmark_report.py your-report.json
  5. Fix all ERROR: lines (schema violations, exit 1).
  6. Read the WARNING: lines (stderr, exit 0). Either address them in your methodology or document them explicitly in the caveats array.
  7. Publish the JSON file in your repository alongside the benchmark write-up. The machine-readable format lets future scripts cross-check version-specific claims.

What this pattern is NOT

  • Not a quality gate for your methodology. A report can be schema-valid and still be methodologically weak. The schema forces disclosure; it does not evaluate task choice, rubric validity, or baseline skill. Compliance is necessary but not sufficient.

  • Not a way to launder self-scored or n=2 comparisons. self-scored and sample_size: 2 produce a valid document. The warnings are loud. Schema compliance means the claim is honest about its weaknesses — not that the claim is strong.

  • Not a substitute for peer review. The schema is the minimum bar for internal credibility. Peer review catches confounded variables, cherry-picked tasks, post-hoc metric selection — problems the schema cannot see.

  • Not frozen. A benchmark declaring ars_version: 3.3.5 is always checked against the v3.3.5 schema. Future versions may add fields without invalidating existing reports.


Honesty red lines

These are behaviors the schema is specifically designed to make visible. They are not prohibited — the schema accepts all of them — but they will appear in validator output or be apparent to any reader of the JSON.

  • Omitting sample size. The field is required. There is no valid ARS benchmark report without a stated human_baseline.sample_size. n=1 is worse than n=10, but it is infinitely better than unstated.

  • Author-conducted with no caveat disclosure. Using author-conducted is allowed. Not noting it in caveats as a limitation is an editorial failure, not a schema failure — but reviewers will notice.

  • Self-scoring with no warning acknowledgment. self-scored produces a warning on every validator run. If you publish a report with self-scored and don't address it in caveats, the omission is visible in the JSON to anyone who checks.

  • Empty or single-word caveats. The schema requires minLength: 1 per item and minItems: 1 for the array. A caveat of "none" is technically valid but signals that the author did not engage seriously with limitations.

  • Claiming ARS advantage without disclosing cost_usd. The field is required. A claim that "ARS completed this in 4 hours vs 40 human hours" without a dollar figure omits the dimension where the trade-off may be reversed.


Future evolution

Per-skill benchmark templates (a deep-research-specific template, an academic-paper-reviewer-specific template) are likely as v3.4 adds domain-specific evaluation protocols. The cost_usd field may split into cost_usd_api and cost_usd_compute once cloud execution costs become a factor. The data_access_level_declared enum may expand to match the v3.4 ground-truth tier vocabulary if that spec evolves. None of these changes will break existing v3.3.5 reports, which remain valid against the version they declared in ars_version.