BenchFlow Task Package Standard
June 17, 2026 · View on GitHub
Status: current task package standard (2026-06-14)
This document defines the direction for BenchFlow-native task packages. The
short version: task.md is the native authoring entrypoint; oracle/ and
verifier/ are the BenchFlow-native names for held-out reference behavior and
reward checks. Split-layout names such as solution/ and tests/ remain
compatibility names for migration and export only.
The standard is intentionally split into three views:
| View | Purpose | Owner |
|---|---|---|
| Authoring document | What humans and generators write: one task.md plus sidecar dirs | TaskDocument |
| Runtime task view | What rollout, verifier, hardening, provenance, and trajectories consume | TaskRuntimeView plus first TaskPackage boundary |
| Foreign adapter view | What external benchmark formats and hosted environments import/export | adapters |
Do not treat these as the same interface. A good authoring document can include more information than a foreign format can export, and a foreign import can preserve unknown data that native authoring would reject.
Goals and scope
A BenchFlow task is one task.md that selects a mode on each of three planes
(see Planes below): how the environment is built, how the agent interacts, and
how the result is scored. The standard has three goals:
- Native authoring — humans and generators write one
task.md(plus sidecar dirs) as the primary surface. - Interoperability — existing split-layout formats (
task.toml+instruction.md+solution/+tests/) import directly, and packages export back to that layout with an explicit, honest loss report when a native concept has no equivalent. - Coverage — the schema can express the full range of eval shapes (single-shot, multi-round, simulated-user, multi-agent, and live-arena interaction; workspace, trajectory, rubric, judge, and leaderboard reward), even where the runtime for a given mode lands in a later milestone (see Open Primitives and Roadmap).
Native authoring is the priority surface; import is best-effort-faithful; export is best-effort with a loss report. The standard does not require bidirectional-lossless round-tripping with any one external format.
Field discipline
Every normative standard field must map to a mode on one of the three planes
(below) or to a concrete BenchFlow runtime need. Fields without that mapping
belong under metadata or in an adapter, not in the standard.
Native Layout
task/
|-- task.md
|-- environment/
| `-- Dockerfile
|-- verifier/
| |-- verifier.md
| `-- test.sh
|-- oracle/
| `-- solve.sh
`-- evidence/
`-- validation.json
Compatibility aliases:
| Native | Compatibility / foreign export | Meaning |
|---|---|---|
task.md | task.toml + instruction.md | Task config plus prompt material |
oracle/ | solution/ | Held-out reference implementation |
verifier/ | tests/ | Verifier package, reward code, hidden checks, rubrics, and judges |
Target validation: native packages may carry both native and compatibility directories only when hashes prove the duplicate content is equivalent. Current runtime prefers the native spelling when both aliases exist; it does not yet prove equivalence.
Within a BenchFlow-native package, selection is fail-closed:
- if
task.mdexists, it is the authoritative task definition - if
verifier/exists, it is the authoritative verifier directory; an empty or invalidverifier/does not fall back totests/ - if
oracle/exists, it is the authoritative oracle directory; an empty or invalidoracle/does not fall back tosolution/ - duplicate alias trees must be byte-identical after normalized traversal or validation should report a collision
Split layouts are compatibility inputs and export artifacts, not the native
authoring surface. New BenchFlow tasks should publish task.md as the only
authoritative entrypoint.
Versioning
There are two versions:
schema_version: "1.3" # BenchFlow task config surface
benchflow:
document_version: "0.6" # BenchFlow task.md document syntax
schema_version is for the runtime config model shared with compatibility
imports. benchflow.document_version is for document-only concepts such as
teams, prompt composition, agent policy, runtime policy, private assets,
provenance, evidence, nudges, and export policy.
Root Frontmatter
The root frontmatter has three classes of keys.
BenchFlow config keys are modeled by TaskConfig and must be rejected
when unknown in native authoring mode:
schema_version/versiontaskmetadataagentverifierenvironmentoracle(validation alias:solution)sourceartifactsstepsmulti_step_reward_strategy
Document orchestration keys are parsed by TaskDocument:
agentsscenesuser
BenchFlow extension keys live under the reserved namespace:
benchflow
Do not add new root keys for every new idea. Put draft or BenchFlow-specific
extensions under benchflow: until they have a stable interface.
Native authoring should use oracle. Importers may accept the compatibility
solution alias, but a config that contains both names is invalid.
Prompt Body
The task.md body is the base prompt — free-form markdown, exactly like a
SKILL.md body. No ## prompt heading is required: if the body carries no
reserved section headings, the entire body is the prompt. The common single-shot
task is just frontmatter plus prose, so a bespoke benchmark ports by dropping its
existing instruction text in as the body with no markup.
Tasks that need more than a base prompt — multiple roles, multiple scenes, or a
simulated-user persona — author each as its own free-form file under
prompts/, so no body ever carries reserved-heading ceremony:
| File | Meaning |
|---|---|
prompts/role.<name>.md | Role prompt |
prompts/scene.<name>.md | Scene prompt |
prompts/user-persona.md | Simulated user persona |
Each sidecar file is itself a clean free-form body. For backward compatibility,
single-prompt source formats may instead embed the same content in the body via
reserved ## prompt, ## role:<name>, ## scene:<name>, and ## user-persona
headings; these import losslessly and normalize to the file layout. Sidecar files
take precedence over a heading of the same name. New tasks should prefer the files.
Default runtime precedence remains a simple fallback:
- inline turn prompt
- scene prompt
- role prompt
- base prompt
Native tasks can make composition explicit:
benchflow:
prompt:
composition: append
order: [base, role, scene, turn]
The first TaskPackage prompt plan now compiles append and explicit replace
policies deterministically. That avoids losing role guardrails when a scene
prompt is present. RolloutConfig consumes the compiled plan for task.md scene
execution when explicit CLI/SDK prompts do not override the document.
Agent And Runtime Policy
Agent isolation is distinct from task environment networking. A no-search task can still need dependency, LLM, or provider egress outside the sandbox; a no-network task can still need the agent harness to call its model.
benchflow:
agent_policy:
skill_access: none # none | installed | declared
allowed_skill_roots: []
search: disabled # allowed | disabled
forbidden_tools: [web_search, browser_fetch]
trajectory_audit:
require_no_forbidden_tool_calls: true
runtime_policy:
backend: modal # docker | daytona | modal | kubernetes | podman | hpc | queue
required_capabilities: [gpu:B200, private_mounts, persistent_state]
network:
task_default: no-network
agent_egress: model-and-dependencies
allowed_hosts: []
phase_overrides:
verifier: no-network
private_mounts:
- source: modal-volume://org-models/qwen
target: /mnt/models/qwen
mode: ro
visibility: agent
secret_ref: org-models-readonly
persistent_state:
required: true
scope: task-run
cleanup: after-verifier
Unsupported agent_policy, runtime_policy, private mounts, registry secrets,
GPU types, phase-specific network overrides, or persistent state semantics must
fail closed before launch for the selected sandbox. No-search proof must come
from both launch policy and trajectory audit; it is not implied by
network_mode: no-network.
Planes
The package has six planes. Keeping them separate is the core abstraction.
| Plane | Owns | Should not own |
|---|---|---|
| Package | identity, versions, provenance, source hashes, export policy | sandbox execution |
| Runtime | environment image, resources, network policy, setup/reset/readiness, mounts | prompt orchestration |
| Interaction | agents, roles, teams, scenes, turns, user/nudge loops, handoff policy | verifier checkpoints |
| Verifier | verifier package entrypoint, reward file contract, hidden fixtures, scorer type, rubrics, judge roles, separate verifier env | agent prompt text |
| Oracle | held-out reference implementation and oracle-only env | tests/verifier code |
| Evidence | validation runs, flake data, anti-cheat review, artifact hashes, leaderboard metadata | mutable task source |
Imported steps are verifier/runtime checkpoints. BenchFlow scenes are
interaction checkpoints. They are orthogonal. A task can have both, but a
runtime must define how they compose before executing them.
Of the six planes, three are the authoring axes an author selects a mode from
— Runtime (the environment), Interaction, and Verifier — while the remaining three
(Package, Oracle, Evidence) are supporting planes. A task is one mode-selection
per axis: environment × interaction-mode × verifier-strategy. New normative
fields must add or compose a mode on one of these axes; otherwise they belong in
metadata or an adapter.
Presets And Profiles
Two different reuse mechanisms were historically both called "profile," which is why the mental model felt overloaded. v0.6 separates them:
preset— authoring sugar. A named bundle of defaults thatbench tasks normalizeexpands into the canonical contract and then discards; it has no runtime existence. Presets are the lightweight-authoring path: a tinytask.mdbecomes a full contract. Examples:code-change,acceptance-live,harbor-compatible. (The current implementation still spells the preset key asprofile:; M0 renames it topreset:and keepsprofile:as a deprecated alias.)*-profile— runtime-real shared config. A benchmark-level object that many tasks reference and that persists intoTaskRuntimeView, one per authoring axis:environment-profile— a shared world reused across tasks (e.g. one clawsbench service catalog referenced by all 44 tasks)agent-profile— shared harness / model / role wiringverifier-profile— shared reward strategy and rubric
Rule of thumb: if removing it changes only how much you typed, it is a preset;
if removing it changes what runs, it is a *-profile. Presets normalize away;
profiles are inherited. *-profile resolution is M1 runtime work; preset exists
today.
Verifier Package
The verifier is a peer package, not just a tests/ directory. Native verifier
packages should have their own entry document:
verifier/
|-- verifier.md
|-- task.toml
|-- test.sh
|-- rewards/
| |-- correctness/
| | `-- reward.py
| `-- quality/
| `-- criteria.toml
|-- rubrics/
| |-- verifier.md
| `-- verifier.toml
|-- judges/
| `-- reviewer.md
`-- fixtures/
`-- hidden_cases.jsonl
verifier/verifier.md is analogous to task.md for the evaluation side. It
describes what evidence is read, which strategies can score the task, how
rubric dimensions compose, which judge agents are allowed, and what outputs the
runtime must preserve.
verifier/task.toml is an optional compatibility projection of
verifier/verifier.md; it is not a second native surface. If both files exist,
their canonical projection must match or validation fails closed.
---
document_version: "0.3"
verifier:
name: hidden-patch-and-quality
default_strategy: deterministic
strategies:
deterministic:
type: script
command: ./test.sh
rewardkit:
type: reward-kit
root: rewards/
criteria: rewards/quality/criteria.toml
judge:
type: agent-judge
role: verifier_judge
model: gpt-5.5
inputs: [trajectory/acp_trajectory.jsonl, /logs/artifacts/patch.diff]
isolation: verifier-only
ors:
type: ors-episode
reward_aggregation: last_non_null
terminal_policy: finished_true
rubric:
combine: weighted_sum
dimensions:
correctness: {weight: 0.7, source: deterministic}
maintainability: {weight: 0.2, source: judge}
evidence_quality: {weight: 0.1, source: rewardkit}
files:
human: rubrics/verifier.md
structured: rubrics/verifier.toml
outputs:
reward_text: /logs/verifier/reward.txt
reward_json: /logs/verifier/reward.json
details_json: /logs/verifier/reward-details.json
aggregate_policy:
field: reward
fallback: weighted_mean
---
## role:verifier_judge
Grade only the submitted artifact and declared evidence. Do not infer intent
from private oracle files or hidden verifier fixtures.
test.sh is the minimum executable strategy and the compatibility export target.
Reward Kit-style criteria, ORS-style episode rewards, and AgentBeats-style
assessor agents are verifier strategies. Runtime adapters may write declared
evidence artifacts such as trajectory/ors-rewards.jsonl, but the ORS-specific
judge/normalization semantics stay inside verifier scope. They must not leak
into agent prompts, and they must emit either the canonical reward envelope or a
declared multi-metric map.
Verifier packages must be isolated:
- judge models and credentials are verifier-scoped, not agent-visible
- judge prompts live under
verifier/, not inside the task prompt - rubrics are separate from judge persona:
rubrics/verifier.mdis the human scoring contract,rubrics/verifier.tomlor JSON is the structured scoring contract, and## role:verifier_judgeis the assessor posture - hidden fixtures and rubrics are mounted only during verifier execution
- agent-as-judge trajectories are recorded as evidence and audited for input leakage
- unsupported verifier strategies fail closed before scoring starts
Verifier Contract
Native verifier code lives in verifier/ and is mounted at /verifier.
Compatibility tests/ remains supported and is mounted at /tests.
The minimum script verifier contract applies to the selected verifier
entrypoint: native packages with no verifier/verifier.md run
verifier/test.sh; imported split-layout packages may run tests/test.sh.
When verifier/verifier.md is present, structural verifier validity follows
the selected strategy. A package can therefore be valid without
verifier/test.sh if, for example, the selected Reward Kit runner and criteria
files exist.
- write
/logs/verifier/reward.txtwith one float from0.0to1.0 - optionally write
/logs/verifier/reward.jsonwith structured rubric/evidence; BenchFlow preserves structured rewards, keepsreward.txtas scalar compatibility, and can compute a scalar from declared aggregate policies - prefer exit
0after writing reward - treat nonzero verifier exit without a fresh reward file as infrastructure failure; a nonzero exit with a fresh reward is a scored task result with verifier diagnostics
The richer standard should model verifier inputs explicitly:
verifier:
type: test-script
timeout_sec: 900
benchflow:
verifier:
entrypoint: verifier/test.sh
visibility: hidden
inputs:
- path: verifier/test.patch
kind: test_patch
visibility: hidden_verifier
outputs:
reward_text: /logs/verifier/reward.txt
reward_json: /logs/verifier/reward.json
transfer:
agent_to_verifier:
- /logs/artifacts/**
verifier_to_result:
- /logs/verifier/reward.*
- /logs/verifier/artifacts/**
This covers tests/test.patch, Windows entrypoints, artifact-only graders,
separate verifier images, and hidden fixtures without overloading the directory
name.
Separate verifier execution must define image resolution, hidden fixture
mounting, step-level verifier environments, and transfer rules. Hidden verifier
inputs such as tests/test.patch or verifier/test.patch are mounted only for
the verifier phase; agent logs and workspace files move into verifier scope
only through declared artifact transfer paths.
Target reward precedence:
reward.txtremains the scalar compatibility minimum- when
reward.jsonexists, it should be the authoritative rich reward artifact reward.jsonmay be an envelope with a numericreward, or a reward-kit style multi-metric map withmetricsplus a declaredaggregatepolicy- if both files exist and
reward.jsonhas a scalar aggregate, the scalar must matchfloat(reward.txt)or validation should fail closed - if
reward.jsonis a multi-metric map without a scalarreward, verifier metadata may declare the aggregate policy used for scalar exports; current runtime computesrewardformean,weighted_mean, andweighted_sum, andreward.txtmay still carry the scalar compatibility value reward-details.jsonshould be preserved when present- reward artifacts should preserve structured reserved keys such as
rubric,items,evidence,artifacts,metadata,reason,reasons,errors, and task-specific payloads such asmetrics,regressions,participants,winner,raw, anddebug
Current runtime now prefers reward.json over reward.txt, rejects
disagreeing scalar outputs, and preserves a first set of structured reward
fields. src/benchflow/task/verifier_document.py parses
verifier/verifier.md strategies, rubric metadata, output contracts, and
verifier-scoped role prompts. When verifier/verifier.md is present,
Verifier.verify() selects its default strategy: script runs the declared
command relative to the uploaded verifier directory, llm-judge uses the
existing deliverables judge with verifier-local rubric, model, input directory,
and context/context-file overrides, reward-kit runs a safe relative
reward.py package runner inside verifier scope, writes a
reward-kit-manifest.json contract, and, when criteria are declared, parses
those criteria before launch and computes/verifies canonical reward from
matching reward.json.metrics. agent-judge runs a
verifier-scoped judge role over declared evidence inputs. ORS runtime helpers
can normalize tool-output rewards into trajectory/ors-rewards.jsonl;
ors-episode then reads declared ORS reward evidence, normalizes reward
responses or event streams through the existing ORS adapter, and emits
canonical reward.json plus reward-details.json.
reward-details.json is a named rollout artifact, stale copies are cleared
before verification, script verifiers can preserve it, target-service
verifiers download it with the rest of /logs/verifier, and the built-in LLM
judge emits criterion details there. Metrics-only reward.json maps can now
use the verifier document's outputs.aggregate_policy or selected Reward Kit
criteria policy to compute and persist the canonical reward. It still does
not fully match the target: full Reward Kit parity, full OpenReward environment
import/export, and AgentBeats
assessor lifecycles are not yet first-class; selected unsupported strategies
fail closed.
Validation evidence should include parser checks, legacy-vs-native migration parity, live rollout artifacts, negative-contract failures, and explicit fail-closed results for parsed fields that the selected runtime cannot honor.
Oracle Contract
Native oracle code lives in oracle/ and is mounted at /oracle.
Compatibility solution/ remains supported and is mounted at /solution.
The naming change is semantic:
oraclemeans "held-out reference behavior used to prove solvability"solutionis a compatibility export name for older split layouts
Proposed outbound exporters should map native oracle/ to foreign
solution/. Current runtime supports native and compatibility oracle paths, and
the first split-layout exporter maps native oracle/ to solution/.
Publication-grade oracle equivalence enforcement remains target behavior.
Assets, Provenance, And Evidence
Native task packages need more than source code. The standard should track assets as first-class objects:
benchflow:
provenance:
images:
- field: environment.docker_image
reference: ghcr.io/org/task-image:2026-06
digest: sha256:...
registry: ghcr.io
provider: modal
build_source:
path: environment/Dockerfile
sha256: "<hash>"
credential_secret_ref: task-image-pull
never_persist_credentials: true
assets:
- path: environment/dataset.parquet
visibility: agent
sha256: "<hash>"
source:
url: https://huggingface.co/datasets/org/name
revision: "<commit>"
license: apache-2.0
generated: false
mount_phase: agent
- path: verifier/hidden_cases.jsonl
visibility: hidden_verifier
sha256: "<hash>"
mount_phase: verifier
secrets:
- name: task-image-pull
scope: image-pull
visibility: runtime_secret
never_persist: true
Suggested visibility values:
agenthidden_verifierhidden_oracleruntime_secretexternal_datasetevidence_only
Evidence should prove task validity, not just package shape:
benchflow:
evidence:
oracle_runs:
required_reward: 1.0
last_job: jobs/task-standard/oracle/...
artifact: evidence/calibration/oracle-run.json
verifier:
reruns: 5
flake_rate: 0.0
report: evidence/calibration/verifier-stability-report.json
review:
anti_cheat: passed
instruction_alignment: passed
reviewer: benchflow
artifact: evidence/calibration/review.json
calibration:
no_op_reward_max: 0.0
known_bad_reward_max: 0.2
partial_solution_range: [0.3, 0.8]
report: evidence/calibration/calibration-report.json
human_or_reference_examples:
- name: gold
expected_reward: 1.0
artifact: evidence/calibration/gold-result.json
judge_agreement:
required: true
sample_count: 5
min_pairwise_agreement: 0.8
trajectories:
- path: trajectory/acp_trajectory.jsonl
kind: acp
visibility: evidence_only
sha256: "<hash>"
- path: trajectory/critique.jsonl
kind: critique
visibility: evidence_only
sha256: "<hash>"
artifacts:
- path: evidence/calibration/oracle-run.json
kind: oracle_run
visibility: evidence_only
sha256: "<hash>"
- path: evidence/calibration/verifier-stability-report.json
kind: verifier_stability
visibility: evidence_only
sha256: "<hash>"
- path: evidence/calibration/review.json
kind: acceptance_review
visibility: evidence_only
sha256: "<hash>"
- path: evidence/calibration/calibration-report.json
kind: calibration_report
visibility: evidence_only
sha256: "<hash>"
- path: evidence/calibration/gold-result.json
kind: reference_result
visibility: evidence_only
sha256: "<hash>"
- path: artifacts/session.har
kind: har
mime_type: application/json
produced_by: browser
visibility: evidence_only
sha256: "<hash>"
Borrow the discipline, not the package shape, from METR-style task standards:
- keep BenchFlow document-first rather than Python
TaskFamily-first - make asset visibility explicit instead of relying on directory folklore
- declare required secrets/resources with scope and "never persist" semantics
- record oracle/no-op/partial/human baseline scores where available
- keep protected/intermediate scoring behind explicit visibility controls
Valid evidence artifact kinds include screenshot, video, har,
browser_trace, trajectory, critique, viewer_metadata, and
verifier_artifact.
Oracle and calibration evidence is required for leaderboard-grade tasks and recommended for all native tasks with hidden tests, subjective rubrics, or LLM/agent-as-judge scoring.
Interaction Model
An interaction declares one mode. The modes form the Interaction axis:
| Mode | Meaning | Runtime |
|---|---|---|
single-shot | one agent, one prompt, scored once | executable |
multi-round | oracle access / progressive disclosure across rounds (BaseUser) | executable |
simulated-user | a declarative, model-backed user persona drives turns (not a live wire peer) | partial (linear; one active role per scene) |
multi-agent-sequential | roles hand off in one shared workspace | partial (sequential handoff only) |
arena-concurrent | an assessor agent and the agent-under-test run as live A2A+MCP peers, scored during the interaction | target (runtime-deferred; see Open Primitives) |
arena-concurrent models a live interaction where the reward is produced during
the exchange (by an assessor agent) rather than by a post-hoc verifier. It is
declared in the schema so such tasks parse without loss; its runtime (an A2A bridge
for the agent-under-test plus a concurrently-running assessor) is deferred to a
later milestone (see Open Primitives and Roadmap). Note ACP (BenchFlow <-> agent)
is not A2A (agent <-> agent); arena needs the A2A leg.
The current executable slice is linear scenes that reference agents.roles.
The first team handoff subset is intentionally narrow: a document-declared user
loop may execute explicit multi-role scene turns sequentially when
benchflow.teams.<team>.handoff declares mode: sequential,
workspace_visibility: shared, and trajectory_visibility: none|metadata.
agents:
roles:
planner:
agent: claude-agent-acp
model: claude-sonnet-4-6
implementer:
agent: codex-acp
model: gpt-5.5
reasoning_effort: high
benchflow:
teams:
default:
handoff:
mode: sequential
workspace_visibility: shared
trajectory_visibility: metadata
Richer team semantics such as role membership enforcement, summaries, handoff artifacts, parallel teams, branch routing, and full trajectory sharing are parsed as draft surface but must fail closed until a runtime owns them.
Simulated users and nudges should be explicit about runtime type:
user:
model: scripted
stop_rule: satisfied-or-3-rounds
private_facts:
hidden_need: reveal only after the solver asks for it
benchflow:
nudges:
mode: simulated-user
nudge_budget: 2
This deterministic subset compiles into RolloutConfig.user as a
DocumentNudgeUser: the public prompt runs first, private facts stay out of the
package metadata and initial solver prompt, and a fact is revealed only after a
targeted clarification question. Bounded model-linear simulated users should
stay equally explicit:
user:
model: claude-haiku
stop_rule: satisfied-or-5-rounds
benchflow:
nudges:
mode: simulated-user
branchable: true
branch_execution: option-kinds-preserved
confirmation_policy:
destructive_actions: human
If a runtime cannot compile user into a concrete loop, it should fail closed
or mark the field metadata-only rather than silently ignoring the user.
Today the first model-linear slice accepts claude-*, gpt-*, and
gemini*-style models for linear single- or multi-scene simulated users through
ModelDocumentNudgeUser. confirmation_policy: human installs a fail-closed
ACP permission handler unless the caller supplies an explicit on_ask_user
handler, and the ACP ask_user bridge preserves both option IDs and option
kinds so reject/allow choices are explicit branchable evidence. Authors may
spell the current executable branch slice as
branch_execution: option-kinds-preserved; branch_execution: forked-snapshot
fails closed until the user loop is integrated with the Environment snapshot
branch engine. The first sequential shared-workspace team handoff slice records
scene, role, handoff_from, and handoff_to metadata per user round.
branchable is still not automatic branch execution; interactive approval UI,
parallel teams, handoff artifacts, full trajectory sharing, and
branch/message-routing policy remain fail-closed target work.
Compatibility
Compatibility must be explicit. A native package can guarantee export only for the subset the target format supports.
benchflow:
compatibility:
harbor:
export: full
emits:
config: task.toml
prompt: instruction.md
oracle: solution/
verifier: tests/
Target compatibility rules:
-
Native authoring rejects unknown root config keys.
-
Foreign adapter import preserves unknown
task.tomlkeys outside native config. The first implementation isimport_task_config_toml(): strict native validation still fails on unknown keys, while compatibility import returns a validatedTaskConfigplusInboundCompatibility.config_extra. -
Legacy-to-native migration writes preserved foreign keys under
benchflow.compat.extra:benchflow: compat: source: harbor extra_paths: - environment.modal.image - steps[0].runner - verifier.reward_kit.metric extra: environment: modal: image: registry.example.com/task:latest steps: - runner: harbor-step-runner verifier: reward_kit: metric: exact_match -
Split export rehydrates
benchflow.compat.extraback intotask.tomlwithout overwriting supported native keys. The export report recordsrestored_extension_pathsso compatibility is auditable. -
build_harbor_roundtrip_conformance_report()proves the supported split surface across a split ->task.md-> split hop: canonicalTaskConfig, normalized prompt, and environment/solution/tests file-map hashes. -
Export emits a degraded-export report when the target format cannot express a native concept. The first implementation is
src/benchflow/task/export.py, which writes a compatibility split layout pluscompatibility/export-report.json. -
Mixed native/legacy files are structurally invalid when task config, prompt, oracle/solution, or verifier/tests aliases drift.
-
Compatibility export should emit
task.toml,instruction.md,solution/, andtests/. -
BenchFlow import should prefer
task.mdonly when compatibility metadata proves the legacy files are equivalent. -
Export reports include selected definition, selected verifier/oracle dirs, input/output file hashes, alias collisions, restored foreign extensions, and any lost semantics.
-
tests/compatibility is path compatibility, not a separate native standard. For native packages,verifier/is authoritative. For split packages,tests/remains valid and may contain onlytest.sh. -
Exporters map the selected native verifier tree to
tests/; importers may preserve split-layouttests/without requiringverifier.mdor rubric files. If bothverifier/andtests/exist, validation should compare normalized file maps and fail closed on drift. -
ORS/AgentBeats imports/exports live at the adapter boundary: ORS tool-output rewards become declared reward-event artifacts plus a terminal aggregate, and AgentBeats assessor agents become verifier strategies rather than root task syntax.
Round-trip guarantees are semantic, not byte-exact. Config equivalence should
be checked through canonical TaskConfig dumps; prompt equivalence through
normalized prompt text; verifier/oracle equivalence through deterministic
SHA-256 file maps over regular files. Comments, TOML/YAML formatting, and blank
line trivia are not preserved unless a same-format no-op export asks for that.
Runtime Capability Matrix
Current implementation status:
| Feature | Parse | Runtime | Next gate |
|---|---|---|---|
task.md prompt | yes | yes | keep |
native verifier/ | yes | yes | keep |
native oracle/ | yes | yes | keep |
verifier script strategy | yes | yes | keep |
verifier llm-judge strategy | yes | yes | keep |
verifier reward-kit strategy | yes | partial | safe relative reward.py runner executes; declared criteria parse before launch, emit a runtime manifest, require exact metrics, and compute/verify canonical reward; fuller Reward Kit parity remains target work |
verifier agent-judge strategy | yes | partial | verifier-scoped LLM judge over declared inputs; richer ACP-backed judge agents remain target work |
verifier ors-episode strategy | yes | partial | runtime helper writes ORS tool-output rewards to trajectory/ors-rewards.jsonl; declared reward responses/event streams normalize into reward.json and reward-details.json; fuller OpenReward environment import/export remains target work |
agents.roles | yes | partial | TaskRuntimeView carries parsed scenes; explicit sequential shared-workspace handoff can switch roles through the user loop |
scenes | yes | partial | prompt composition compiles; multi-role document-user scenes execute only with explicit turns and supported team handoff |
user / ## user-persona | yes | partial | model: scripted + string private_facts compiles to DocumentNudgeUser; bounded model-linear users compile to ModelDocumentNudgeUser; linear single- and multi-scene user loops execute when every scene is single-role, or when explicit multi-role turns opt into sequential shared-workspace team handoff; confirmation_policy: human gates ACP permissions fail-closed without an explicit handler; branch_execution: option-kinds-preserved preserves option IDs and kinds; forked branch execution, interactive approval UI, parallel teams, and rich handoff artifacts fail closed |
benchflow.teams | yes | partial | supports exactly one handoff with mode: sequential, workspace_visibility: shared, and `trajectory_visibility: none |
benchflow: | raw | no | typed document schema after v0.3 stabilizes |
imported steps | yes | no/partial | fail closed per sandbox until implemented |
| root/step artifacts | yes | no/partial | implement collection or fail closed |
| network allowlist | yes | no/partial | per-sandbox capability check |
| separate verifier env | yes | no/partial | materializer plus verifier runner support |
| Windows / TPU | yes | no | fail closed |
| healthcheck | yes | no/partial | fail closed until sandbox healthcheck support lands |
| workdir | yes | partial | absolute non-root paths are materialized; root/relative paths fail closed |
reward.json precedence | yes | partial | prefer JSON when present and reject both-present mismatches |
| metrics aggregate policy | yes | partial | mean, weighted_mean, and weighted_sum; richer engines remain target work |
arena-concurrent interaction (G4) | no | no | add interaction-mode schema now; A2A bridge for the agent-under-test + concurrently-running assessor at M2 |
| hybrid reward envelope (G1) | partial | no | declared cross-surface product/sum of factors; M1 |
| GAIN aggregation (G2) | no | no | dynamic live baseline + ceiling; M1 |
| leaderboard-submission (G5) | partial | no | hosted / hidden external scorer with durable result record; M1 |
| RL step-reward (G6) | no | no | per-action environment reaction; M1/M2 |
Today bench tasks check is structural by default. bench tasks check --level schema checks only the task authoring entrypoint and prompt parse, so
schema-only fixtures can be validated without pretending to be runnable task
packages. bench tasks check --sandbox <backend> runs the sandbox-aware
capability gate for this matrix, and bench tasks check --level runtime-capability --sandbox <backend> names that gate explicitly.
bench tasks check --level publication-grade adds the first static native
publication gate: the package must use task.md, native oracle/, native
verifier/verifier.md, rubric files, and selected verifier strategy artifacts
with an explicit reward_json output contract. Unknown sandbox names fail
runtime-capability validation instead of becoming no-op checks. bench tasks check --level acceptance adds the first static evidence
gate: benchflow.evidence must declare oracle proof, verifier reruns and
flake rate, a verifier stability report with concrete run records, anti-cheat
and instruction-alignment review status, calibration bounds, reference
artifacts, a calibration report with no-op/known-bad/partial/reference cases,
and SHA-256 pins for every primary evidence file. The static gate parses the
declared JSON artifacts and checks that oracle rewards, review status,
calibration examples, calibration report cases, and verifier run records agree
with the declared metadata. bench tasks check --level acceptance-live --sandbox <backend> now adds the first executable live-evidence slice:
benchflow.evidence.acceptance_live.cases can declare fresh verifier cases
and executable oracle/solve.sh reruns that pass through the selected sandbox
and BenchFlow verifier boundary, with reward thresholds checked from live
reward.txt / reward.json output. acceptance_live.calibration.from: calibration.report can also generate live no-op/known-bad/partial/reference
cases from the static calibration report; generated low/partial cases require
single-line sandbox commands in the report so the checker runs real
perturbations instead of treating missing artifacts as expected verifier
errors. When acceptance_live.report is declared, the runner writes a
package-local acceptance-live-report JSON artifact plus a .sha256 sidecar
with run records, reward summaries, task/spec hashes, generated/declared case
sources, staged workspace hash, and secret-safe diagnostic fields such as
verifier_error_category, diagnostic_code, and artifact_hint. Dependency
install flakes point to verifier/test-stdout.txt instead of embedding raw
resolver output in the report. For dogfood against checked-in examples that must
not dirty the package, bench tasks check --level acceptance-live --report-output <path> writes the report and sidecar to a host path instead of
the package-local path, and --no-report-write skips writing the report and
sidecar entirely (validation only); the package-local write is reserved for
intentionally refreshing checked-in evidence. Repeated live cases can declare
expect.flake_rate_max to enforce observed flake rate across fresh sandbox
reruns; without that field, each failed rerun fails the gate. Larger repeated
flake campaigns, model/submission metadata, hosted leaderboard publication, and
leaderboard export are still target work. If
acceptance_live.leaderboard.required: true is declared, the live report
includes a local leaderboard_suitability verdict requiring live runs, all
runs passing, an observed flake rate within
acceptance_live.leaderboard.max_flake_rate, oracle/reference proof, and
generated calibration coverage for no-op, known-bad, partial, and reference
cases. A parsed
field that the selected runtime cannot honor is worse than a parse error.
Architecture Slices
P0: Add TaskPackage / TaskRuntimeView.
The first runtime-facing slices exist in src/benchflow/task/runtime_view.py
and src/benchflow/task/package.py. They answer:
- which entrypoint is authoritative
- what prompt goes into
/instruction.mdcompatibility materialization - which verifier/oracle directories are native versus legacy
- which scenes were parsed for execution
- which source hashes and compatibility metadata apply
- which verifier document and selected strategy apply
- which sandbox runtime issues block launch
- which compatibility export report describes target-format loss
- which prompt plan composes base, role, scene, and turn prompts for rollout, with redacted document-declared user metadata
The remaining package-boundary work is richer adapter import/export state, acceptance/calibration validation levels, and freezing selected verifier/user runtime semantics through launch instead of reparsing at every edge.
P1: Add fail-closed capability checks.
Rollout, verifier, hardening, and adapters should stop parsing fields they do not execute without surfacing a validation result.
This includes explicit gates for task.md plus split-file drift,
verifier/ plus tests/, and oracle/ plus solution/.
The first module is src/benchflow/task/runtime_capabilities.py with a pure
validator:
validate_task_runtime_support(task, *, sandbox, task_dir) -> list[UnsupportedTaskFeature]
It reports stable config paths and reasons for unsupported steps, root/step
artifacts, allowlists, separate verifier environments, Windows, TPU,
healthchecks, unsafe workdirs, document-only user/benchflow runtime
semantics, and non-main verifier services on backends that cannot run them.
It is wired into bench tasks check --sandbox <backend> and the shared sandbox
factory used by rollouts and Environment.from_task(). Unsupported parsed
semantics now raise UnsupportedTaskFeatureError before Docker, Daytona, or
Modal construction. Safe absolute non-root environment.workdir values are
materialized before agent and verifier setup.
P2: Split native and adapter validation modes.
Native authoring should be strict. Foreign import should preserve and warn.
P3: Type the benchflow: namespace.
Start with document_version, compatibility, provenance, assets,
secrets, evidence, teams, nudges, prompt, agent_policy, and
runtime_policy.
P4: Extend exporters.
The first bench tasks export path exports to a compatibility split layout
with explicit loss reports, backed by export_task_to_split_layout(). Extend
the same reporting discipline to external benchmark datasets and same-format
no-op exports.