Authoring native task.md tasks

June 16, 2026

A native BenchFlow task is one task.md document plus sidecar directories. The YAML frontmatter carries the task configuration; the markdown body is the prompt. This page teaches the native format hands-on. For the normative specification, see the task standard.

When a directory contains both task.md and the legacy split pair (task.toml plus instruction.md), task.md is the authoritative task definition: the runtime selects it and ignores the split pair.

Minimal task — three files

my-task/
├── task.md                # config frontmatter + prompt body
├── environment/
│   └── Dockerfile         # sandbox image
└── verifier/
    └── test.sh            # verifier entry point

That is the complete runnable surface: structural validation requires task.md, an environment/ directory with a Dockerfile, and a verifier directory with a runnable entrypoint. An oracle/ directory is optional.

---
agent:
  timeout_sec: 300         # strongly recommended — unset means no wall-clock cap
verifier:
  timeout_sec: 120
environment:
  cpus: 1
  memory_mb: 2048
---

Create a file `/app/hello.txt` containing exactly `Hello, world!`.

#!/bin/bash
# verifier/test.sh
REWARD=0
if [ "$(cat /app/hello.txt 2>/dev/null | tr -d '\n')" = "Hello, world!" ]; then
    REWARD=1
fi
echo "$REWARD" > /logs/verifier/reward.txt

Scaffold this shape with the CLI (task.md is the default format):

bench tasks init my-task                    # task.md, environment/, verifier/, oracle/
bench tasks check tasks/my-task             # structural validation
bench tasks check tasks/my-task --level schema   # frontmatter + prompt parse only

Frontmatter

task.md must start with a YAML frontmatter block delimited by --- lines, and the frontmatter must be a mapping; a document without a frontmatter block fails to parse. The keys fall into three classes.

Task config keys are the BenchFlow task config surface, validated as TaskConfig. Unknown keys are rejected (the schema is extra="forbid"), so typos fail at parse time instead of becoming silently-ignored config:

Key	Meaning
`schema_version` (alias `version`)	Config schema version, currently `"1.3"`
`task`	Package identity: `name` (`org/name` format), `description`, `authors`, `keywords`
`metadata`	Freeform mapping — difficulty, category, tags, anything descriptive
`agent`	Agent run policy: `timeout_sec`, `user`, `network_mode`, `allowed_hosts`
`verifier`	Verifier run policy: `timeout_sec` (default 600), `env`, `user`, `service`, …
`environment`	Sandbox: `docker_image`, `cpus`, `memory_mb`, `storage_mb`, `network_mode`, `env`, `workdir`, …
`oracle`	Oracle run policy: `env`, `timeout_sec` (import alias: `solution`)
`source`, `artifacts`, `steps`, `multi_step_reward_strategy`, `reward`	Provenance, artifact, and reward metadata

Setting agent.timeout_sec is strongly recommended. The key is optional and unset by default, so a task that omits it runs the agent with no wall-clock cap unless the caller supplies a per-run timeout. Set it on every published task.

Declaring both oracle and the legacy solution alias in one config is invalid and rejected; native tasks use oracle.
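The two rejection rules above (unknown keys, and oracle declared together with solution) can be illustrated with a small stdlib-only sketch. This is not BenchFlow's TaskConfig validator: the function name check_config_keys is hypothetical, and the key set is simply copied from the table above.

```python
# Illustrative only: mimics the effect of extra="forbid" and the
# oracle/solution conflict rule. The real schema is TaskConfig.
TOP_LEVEL_KEYS = {
    "schema_version", "version", "task", "metadata", "agent", "verifier",
    "environment", "oracle", "solution", "source", "artifacts", "steps",
    "multi_step_reward_strategy", "reward",
}


def check_config_keys(config: dict) -> None:
    """Raise ValueError for unknown keys or an oracle/solution conflict."""
    unknown = sorted(set(config) - TOP_LEVEL_KEYS)
    if unknown:
        raise ValueError(f"unknown config keys: {', '.join(unknown)}")
    if "oracle" in config and "solution" in config:
        raise ValueError("declare either 'oracle' or 'solution', not both")


check_config_keys({"agent": {"timeout_sec": 300}})  # passes silently

try:
    check_config_keys({"agnet": {"timeout_sec": 300}})  # typo in "agent"
except ValueError as exc:
    print(exc)  # unknown config keys: agnet
```

The point of failing at parse time is visible in the second call: the misspelled key never reaches the run as silently ignored config.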

Document orchestration keys are parsed by TaskDocument, not TaskConfig: agents (named roles with agent, model, reasoning_effort, capabilities, …), scenes (ordered turns referencing declared roles — a turn that names an undeclared role is a parse error), and user (simulated user). benchflow is the reserved extension namespace.

Authoring shorthands are expanded during parsing and never reach the canonical config under their short names:

Shorthand	Expands to
`name: hello-world`	`task.name: benchflow/hello-world` (a `/` in the value keeps your org)
`image: ubuntu:24.04`	`environment.docker_image: ubuntu:24.04`
`verifier: verifier/` (string form)	`benchflow.verifier.path` / `.spec` / `.entrypoint` defaults
`oracle: oracle/` (string form)	`benchflow.oracle.path`
`profile: code-change`	Merges a named defaults bundle (see below)
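As a rough illustration of the expansion rules in the table, the sketch below handles the name, image, and string-form oracle shorthands. It is an assumption-laden approximation, not the actual parser: expand_shorthands is a hypothetical name, the verifier and profile shorthands are left out because their defaults are not listed here, and how the real parser treats a shorthand that collides with an explicit long-form key is not covered.

```python
import copy


def expand_shorthands(frontmatter: dict) -> dict:
    """Return a copy with `name`, `image`, and string-form `oracle` expanded."""
    doc = copy.deepcopy(frontmatter)

    if "name" in doc:
        name = doc.pop("name")
        # A bare name gets the default org; a value containing "/" is kept.
        full_name = name if "/" in name else f"benchflow/{name}"
        doc.setdefault("task", {})["name"] = full_name

    if "image" in doc:
        image = doc.pop("image")
        doc.setdefault("environment", {})["docker_image"] = image

    if isinstance(doc.get("oracle"), str):
        path = doc.pop("oracle")
        doc.setdefault("benchflow", {}).setdefault("oracle", {})["path"] = path

    return doc


expanded = expand_shorthands({"name": "hello-world", "image": "ubuntu:24.04"})
print(expanded["task"]["name"])                 # benchflow/hello-world
print(expanded["environment"]["docker_image"])  # ubuntu:24.04
```

After expansion the short keys are gone from the mapping, which matches the statement that shorthands never reach the canonical config under their short names.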

Profiles (profile: / profiles:) merge predefined default bundles — code-change, harbor-compatible, reward-kit, acceptance-live, multi-agent, leaderboard-local — under your explicit keys; an unknown profile name is a parse error. bench tasks normalize <task-dir> prints the fully expanded canonical document (--write replaces task.md in place), so a minimal authored file and its canonical form never drift apart.

Prompt body and prompts/ sidecars

The body below the frontmatter is the base prompt — free-form markdown, no heading ceremony required. If the body contains no reserved section headings, the entire body is the instruction the agent receives.
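To make the frontmatter/body boundary concrete, here is a minimal stdlib-only sketch that splits a task.md string into its raw frontmatter text and its body. The helper name split_task_md is hypothetical and this is not the TaskDocument parser; parsing the frontmatter as YAML (and checking that it is a mapping) would be a further step with a YAML library and is not shown.

```python
def split_task_md(text: str) -> tuple[str, str]:
    """Split a task.md string into (frontmatter_yaml, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("task.md must start with a '---' frontmatter block")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).strip()
            return frontmatter, body
    raise ValueError("frontmatter block is never closed")


doc = """---
agent:
  timeout_sec: 300
---

Create a file `/app/hello.txt` containing exactly `Hello, world!`.
"""
frontmatter, prompt = split_task_md(doc)
print(prompt)  # the whole body, since it has no reserved headings
```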

Four reserved headings are recognized for compatibility imports: ## prompt, ## role:<name>, ## scene:<name>, and ## user-persona. Repeating the same section heading is a parse error. bench tasks init scaffolds a single ## prompt section as a starting point — for a single-prompt task that is equivalent to a bare body, so keep it or drop the heading as you prefer. The multi-prompt headings (## role:, ## scene:, ## user-persona) are for compatibility imports only; new multi-prompt material belongs in sidecar files under prompts/:

File	Meaning
`prompts/role.<name>.md`	Role prompt — the whole file body is the prompt text
`prompts/scene.<name>.md`	Scene prompt
`prompts/user-persona.md`	Simulated-user persona

Sidecar files take precedence over a reserved heading of the same name, so a compat-imported task can be cleaned up incrementally. Runtime prompt precedence for a turn is: inline turn prompt, then scene prompt, then role prompt, then base prompt.
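The two precedence rules in the paragraph above can be sketched as a pair of small functions. This assumes that "precedence" means the most specific available prompt wins outright rather than being combined with the others; the source does not spell that out, and both function names are hypothetical.

```python
from typing import Optional


def pick_section(sidecar: Optional[str], heading: Optional[str]) -> Optional[str]:
    """A prompts/ sidecar file wins over a reserved heading of the same name."""
    return sidecar if sidecar is not None else heading


def resolve_prompt(
    base: str,
    role: Optional[str] = None,
    scene: Optional[str] = None,
    inline: Optional[str] = None,
) -> str:
    """Pick the prompt for a turn: inline turn > scene > role > base."""
    for candidate in (inline, scene, role):
        if candidate is not None:
            return candidate
    return base


role_prompt = pick_section(sidecar="Solve carefully.", heading="Old role text.")
print(resolve_prompt(base="Do the task.", role=role_prompt))  # Solve carefully.
```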

A multi-role task wires the pieces together in frontmatter:

agents:
  roles:
    solver:
      agent: claude-agent-acp
scenes:
  - name: solve
    turns:
      - role: solver

with the solver guidance, if any, in prompts/role.solver.md. See docs/examples/task-md/ for runnable examples, including real converted SkillsBench packages.

Verifier package and strategy declaration

The native verifier directory is verifier/. At verify time the directory is uploaded into the sandbox at /verifier, and the verifier must write its reward to /logs/verifier/reward.txt (and optionally /logs/verifier/reward.json).

A plain verifier/test.sh is a complete verifier: with no other declaration, the runtime executes it directly. Write a float 0.0–1.0 to /logs/verifier/reward.txt, then exit 0; a nonzero exit means verifier infrastructure failure, not a scored task failure.
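When the scoring logic outgrows shell, test.sh can delegate to a helper script. The sketch below shows one way to compute a partial-credit reward and write it in the expected text form. Everything here is illustrative: the function names, the 0.5 rule for a case-insensitive match, and the idea of calling the script from test.sh are assumptions, and the sandbox image would need a Python interpreter.

```python
import tempfile
from pathlib import Path


def score_output(path: Path, expected: str) -> float:
    """Return 1.0 on an exact match (ignoring trailing newlines), 0.5 on a
    case-insensitive match, and 0.0 otherwise or when the file is unreadable."""
    try:
        actual = path.read_text().rstrip("\n")
    except (OSError, UnicodeDecodeError):
        return 0.0
    if actual == expected:
        return 1.0
    if actual.lower() == expected.lower():
        return 0.5
    return 0.0


def write_reward(reward: float, reward_file: Path) -> None:
    """Clamp the reward to [0.0, 1.0] and write it as plain text."""
    reward = max(0.0, min(1.0, reward))
    reward_file.parent.mkdir(parents=True, exist_ok=True)
    reward_file.write_text(f"{reward}\n")


# Demo against a temporary directory. Inside a sandbox the paths would be
# /app/hello.txt and /logs/verifier/reward.txt.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hello.txt"
    target.write_text("Hello, world!\n")
    reward_file = Path(tmp) / "logs" / "verifier" / "reward.txt"
    write_reward(score_output(target, "Hello, world!"), reward_file)
    print(reward_file.read_text().strip())  # 1.0
```

Note that a missing or wrong answer yields a reward of 0.0 and a normal exit, in line with the rule that a nonzero exit signals verifier infrastructure failure rather than a scored task failure.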

To declare how the task is scored, add verifier/verifier.md. Its frontmatter must contain a verifier: mapping with at least one entry under strategies; default_strategy selects which one runs (it defaults to the first declared strategy and must name a declared one):

---
document_version: "0.3"
verifier:
  name: my-task-verifier
  default_strategy: deterministic
  strategies:
    deterministic:
      type: script
      command: ./test.sh
  outputs:
    reward_text: /logs/verifier/reward.txt
    reward_json: /logs/verifier/reward.json
---

## verifier intent

What the verifier measures and which task outputs it reads.

Five strategy types are recognized, each with fail-closed required fields:

`type`	Required config	Notes
`script`	`command`	Runs as `cd /verifier && <command>`; local script files named in the command must exist in `verifier/`
`llm-judge`	`rubric`	Optional `model`, `input_dir`, and `context` or `context_file` (not both)
`reward-kit`	`root`	Optional `entrypoint` (default `reward.py`) and `criteria`; paths must be safe-relative
`agent-judge`	`role`, `isolation: verifier-only`, `inputs`	`role` must match a `## role:<name>` section in the verifier.md body
`ors-episode`	`inputs`	Optional `format`: `json`, `jsonl`, or `auto`

An unknown type is a parse error. bench tasks check also verifies the selected strategy is actually runnable — e.g. a script strategy whose referenced files are missing, or an llm-judge strategy whose rubric file does not exist, fails validation.
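The selection and fail-closed rules can be approximated in a few lines. The sketch below takes the verifier mapping from verifier.md as a plain dict, picks the strategy that would run, and checks the required fields from the table. It is not the real validator: select_strategy is a hypothetical name, only the selected strategy is checked, and the file-existence checks that bench tasks check performs are omitted.

```python
# Required fields per strategy type, copied from the table above.
REQUIRED_FIELDS = {
    "script": ("command",),
    "llm-judge": ("rubric",),
    "reward-kit": ("root",),
    "agent-judge": ("role", "isolation", "inputs"),
    "ors-episode": ("inputs",),
}


def select_strategy(verifier: dict) -> tuple[str, dict]:
    """Return (name, config) of the strategy that would run, or raise ValueError."""
    strategies = verifier.get("strategies") or {}
    if not strategies:
        raise ValueError("verifier.strategies needs at least one entry")

    # Dicts preserve insertion order, so the first key is the first declared strategy.
    name = verifier.get("default_strategy", next(iter(strategies)))
    if name not in strategies:
        raise ValueError(f"default_strategy {name!r} is not declared")

    config = strategies[name]
    kind = config.get("type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown strategy type: {kind!r}")
    missing = [field for field in REQUIRED_FIELDS[kind] if field not in config]
    if missing:
        raise ValueError(f"{kind} strategy is missing: {', '.join(missing)}")
    if kind == "agent-judge" and config["isolation"] != "verifier-only":
        raise ValueError("agent-judge requires isolation: verifier-only")
    return name, config


verifier = {
    "strategies": {
        "deterministic": {"type": "script", "command": "./test.sh"},
        "judge": {"type": "llm-judge", "rubric": "rubric.md"},
    }
}
print(select_strategy(verifier)[0])  # deterministic
```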

outputs declares the reward artifact contract (defaults shown above; details_json and aggregate_policy are optional). bench tasks check --level publication-grade additionally requires the native package shape: task.md, native oracle/, verifier/verifier.md with rubric files, and an explicit reward_json output contract.

Oracle

oracle/solve.sh is the held-out reference solution (solution/ is the legacy alias; oracle/ wins when both exist). Native oracles are uploaded to /oracle in the sandbox (legacy solution/ to /solution) and run in place of an agent when you pass --agent oracle:

bench eval run --tasks-dir tasks/my-task --agent oracle --sandbox docker

A correct task should score 1.0 on its oracle run; confirm this before any model sees the task.

Multi-container tasks

A task may ship an environment/docker-compose.yaml alongside the Dockerfile. The agent always runs in the main service; any additional services you declare become sibling containers on the same Docker network. This supports vulhub-style CVE tasks where the agent attacks a separate target container over the network.

environment/Dockerfile is always required — bench tasks check rejects a task that ships only a docker-compose.yaml. If your main service uses a prebuilt image: and needs no build context, still include a minimal Dockerfile (e.g. FROM <same-image>) so structural validation and other tooling agree on the task package shape.

# environment/docker-compose.yaml
services:
  main: {}            # agent container — BenchFlow injects build/image/limits
  target:             # vulnerable service the agent must exploit
    image: vulhub/struts2-s2-001:latest
    expose: ["8080"]

main reaches target by service name (http://target:8080). The verifier can inspect target-side state — not just the agent's workspace — by passing a service argument when running commands:

# In a Python-driven run or pre/post hook
await env.exec_in_service("target", "test -f /tmp/exploit_proof.txt")
await env.exec("cat /flag", service="target")          # equivalent form
services = await env.inner.services()                  # ["main", "target"]

exec(..., service=...) works on the Docker sandbox and the Daytona DinD (compose) sandbox. Single-container backends (Modal, direct Daytona) raise a clear error for any non-main service. This lets a verifier check write-based oracles (/tmp/exploit_proof.txt in the target), database modifications, or RCE markers without trusting the agent container.

Target-side verifier with `verifier.service`

For tasks whose success oracle lives in a target container (an RCE marker file, a modified database row), run verifier/test.sh in that service by setting the service key under verifier in the frontmatter:

verifier:
  service: target     # run verifier/test.sh inside the `target` container

With this set, BenchFlow uploads the task's verifier/ directory into the target container, runs test.sh there, and copies the resulting reward.txt / reward.json back to the host. service defaults to "main" (the agent container), so single-container tasks are unaffected.

verifier.service is the declarative, task-schema way to do cross-container verification; the env.exec_in_service(...) Python API above is the imperative equivalent for hook-driven runs.

Use the same service name you declared in docker-compose.yaml. A test.sh running in the target reaches main (and vice versa) by service name over the Docker network, just like the agent does.

Hardening policy for multi-container tasks

BenchFlow's pre-verification hardening — killing the sandbox user's processes, scrubbing PATH/PYTHONPATH, restoring build-config files — applies only to the main (agent) container. Target containers are deliberately left unhardened: a vulhub-style target is meant to be vulnerable, the agent never has a shell inside it, and hardening it would risk breaking the very vulnerability the task exercises. verifier.service selects where test.sh runs; it does not move hardening off main.

Migrating a legacy task

bench tasks migrate converts a task.toml + instruction.md pair into task.md:

bench tasks migrate tasks/my-task                  # writes task.md, keeps legacy files
bench tasks migrate tasks/my-task --overwrite      # replace an existing task.md
bench tasks migrate tasks/my-task --remove-legacy  # delete the split pair and
                                                   # promote tests/ -> verifier/,
                                                   # solution/ -> oracle/

The migration is non-destructive by default and refuses to write anything lossy: the generated document is re-parsed and must reproduce the original config semantics and instruction text exactly, or the command fails. Unknown task.toml keys that the schema does not model are preserved under benchflow.compat in the generated frontmatter rather than dropped. After migrating, validate the result:

bench tasks check tasks/my-task
bench eval run --tasks-dir tasks/my-task --agent oracle --sandbox docker

Compatibility export

To produce a compatibility split package from a task.md package, use bench tasks export:

bench tasks export tasks/my-task out/my-task-split
bench tasks export tasks/my-task --report-only                # loss report only

The export writes a compatibility loss report to compatibility/export-report.json so you can see what (if anything) the split layout cannot represent. Publication-grade validation requires task.md to be the only authoritative entrypoint, so keep exported split layouts in a separate output directory rather than beside task.md. See CLI reference: bench tasks export for all flags.