CLI reference
June 17, 2026 · View on GitHub
BenchFlow uses a resource-verb pattern: bench <resource> <verb>.
bench --version
bench agent
bench agentis agent management only.bench agent listandbench agent showoperate on registered AI agents (Claude Code, Gemini CLI, Codex, OpenHands, …) — the programs that solve tasks. Onboarding a third-party benchmark (scaffold → drive → parity-gate abenchmarks/<name>/adoption) is a separate workflow underbench eval adopt. The legacybench agent create|run|verifystill work as hidden deprecated aliases through 0.6, printing a one-line notice; they are removed in 0.7.
bench agent list
List all registered agents with their protocol and native/default auth
requirements. Provider-prefixed models may use provider-specific credentials;
Azure Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT.
bench agent list
bench agent show
Show details for a specific agent, including native/default auth and a note about provider-specific credentials.
bench agent show gemini
bench eval adopt
Bring a third-party benchmark into the environment framework. bench eval adopt
is a single multi-mode command: it scaffolds a benchmarks/<name>/ package,
drives the codex conversion, and parity-gates the result. The conversion guide is
embedded in the command itself. It was previously a subgroup with
init/convert/verify subcommands, and before that bench agent create|run|verify; both bench adopt init|convert|verify and bench agent create|run|verify still work as hidden deprecated aliases through 0.6 (they print
a one-line notice and are removed in 0.7).
The mode is selected by flags:
bench eval adopt <source>(default, convert) — scaffoldbenchmarks/<name>/if it is missing, then drive the codex conversion of the upstream benchmark at<source>. Use--dry-runto preview the launch command without running it (and without writing any files).bench eval adopt <name> --scaffold-only— only scaffold the package, do not convert.bench eval adopt <name> --verify— run the parity gate for the named benchmark.
In convert mode the argument is the SOURCE repo/path to adopt; in --verify /
--scaffold-only mode it is the benchmark SLUG. --verify and --scaffold-only
are mutually exclusive.
Convert (default). The command resolves the slug (--name, else derived from
the source basename), auto-scaffolds benchmarks/<name>/ if it does not exist
(a no-op if it already does), then launches the host codex CLI to drive the
conversion toward a benchmarks/<name>/ pull request. It assembles the adoption
context — the source, the target path, the adoption skills, and the embedded
conversion guide — and runs codex exec against the repo root. It is fail-closed
on credentials: codex needs OPENAI_API_KEY (or CODEX_API_KEY) in the
environment, or a ~/.codex/auth.json from codex login, otherwise the command
exits before assembling any context. --dry-run prints the exact launch command
without running it (no credentials required) and writes no files.
# Print the codex launch command without running it
bench eval adopt https://github.com/org/some-benchmark --dry-run
# Scaffold-if-missing, then launch the host codex driver against a local source
bench eval adopt ./vendor/some-benchmark --name my-bench --model o3
| Flag | Default | Description |
|---|---|---|
--name | derived from source | Benchmark slug (default: from source basename) |
--model | codex default | Model for the codex driver |
--dry-run | false | Print the launch command, do not run (writes no files) |
--codex-bin | codex | Host codex binary |
-c, --codex-config | — | Codex config override as key=value, passed through to codex as -c key=value; repeatable. Use it to work around host ~/.codex/config.toml drift without editing the file — e.g. -c service_tier=flex when an installed codex version rejects a stale value. |
--benchmarks-dir | repo benchmarks/ | Target benchmarks/ directory (used by the auto-scaffold) |
Scaffold only. bench eval adopt <name> --scaffold-only writes only the
package layout, which mirrors the reference benchmark benchmarks/programbench/:
benchflow.py (converter), main.py, parity_test.py, run_<name>.py,
<name>.yaml, benchmark.yaml, parity_experiment.json (status template),
README.md, and __init__.py. It is fail-closed: the slug is validated
(lowercase, leading letter, single internal hyphens, max 64 chars) and the
command refuses to overwrite an existing benchmark directory.
bench eval adopt my-bench --scaffold-only
bench eval adopt my-bench --scaffold-only --benchmarks-dir ./benchmarks
| Flag | Default | Description |
|---|---|---|
--benchmarks-dir | repo benchmarks/ | Target benchmarks/ directory |
Verify. bench eval adopt <name> --verify runs the parity gate for an
adopted benchmark and emits a confidence verdict. It reads
benchmarks/<name>/parity_experiment.json and scores two layers: a deterministic
conversion-faithfulness floor (every compared criterion's converted verdict must
match the original's verdict on identical inputs) and a statistical
reward-distribution layer (every legacy-vs-converted reward delta must sit within
--tolerance). The gate is parity-only — a faithful conversion reproduces the
original's behavior, including any reward-hackability the source has; it never
"improves" or sanitizes the source. The verdict is one of parity-confirmed,
parity-divergent, or insufficient-evidence (no recorded comparisons). On any
non-confirmed verdict the command exits non-zero and emits a draft GitHub issue
body for human support — printed to stdout, or written to --issue-out. The
draft is never filed automatically. Pass --roundtrip-task to also run the
structural round-trip conformance check on a concrete task directory.
By default the gate scores the recorded parity_experiment.json — fast, but
it trusts an artifact the conversion produced about itself. Pass --rerun to
independently re-execute parity_test.py --mode side-by-side and score its
fresh output instead. --rerun is fail-closed: a missing/failing parity_test.py,
a timeout, or output that is not in the scoreable parity_experiment.json shape
all exit non-zero (rather than silently reporting insufficient-evidence).
bench eval adopt my-bench --verify
bench eval adopt my-bench --verify --tolerance 0.05 --issue-out divergence.md
bench eval adopt my-bench --verify --roundtrip-task benchmarks/my-bench/tasks/example
bench eval adopt my-bench --verify --rerun # re-run parity_test.py, score fresh output
| Flag | Default | Description |
|---|---|---|
--benchmarks-dir | repo benchmarks/ | Target benchmarks/ directory |
--tolerance | 0.02 | Max abs reward delta (statistical layer) |
--issue-out | — | Write the divergence issue draft to this path instead of stdout |
--roundtrip-task | — | Also run the structural round-trip check on this task dir |
--rerun | false | Re-execute parity_test.py --mode side-by-side and score its fresh output instead of the recorded parity_experiment.json |
bench eval
bench eval run
Run an evaluation — single task or batch. Use it for YAML configs and batch runs; it also accepts a single task directory.
Renamed from
bench eval create. The old name still works as a deprecated alias and prints a deprecation notice; switch tobench eval run.
# From YAML config
bench eval run --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml
# From remote repo (fast Daytona batch; token usage may be unavailable)
bench eval run \
--source-repo benchflow-ai/skillsbench \
--source-path tasks \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--concurrency 64 \
--sandbox-setup-timeout 300
# From remote repo with required token usage telemetry
bench eval run \
--source-repo benchflow-ai/skillsbench \
--source-path tasks \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--usage-tracking required \
--concurrency 16 \
--sandbox-setup-timeout 300
# From local directory
bench eval run --tasks-dir ./tasks --agent gemini --model gemini-3.1-flash-lite-preview
# From a hosted PrimeIntellect / Verifiers environment
bench eval run \
--source-env primeintellect/general-agent \
--source-env-version 0.1.1 \
--source-env-arg task=calendar_scheduling_t0 \
--agent gemini \
--model google/gemini-2.5-flash-lite
# Single task with mounted skills and the recommended skill nudge
bench eval run \
--tasks-dir tasks/pdf-fix \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--skill-mode with-skill \
--agent-env BENCHFLOW_SKILL_NUDGE=name
# Pinned registry dataset: resolves skillsbench@1.1, verifies task digests,
# and stamps dataset identity into every result.json/config.json
bench eval run -d skillsbench@1.1 --agent gemini --model gemini-3.1-flash-lite-preview
| Flag | Default | Description |
|---|---|---|
--config | — | YAML config file |
--run-config | — | Explicit alias for the YAML run-config source file; equivalent to --config |
--tasks-dir | — | Local task dir (single native task.md package, compatibility split-layout task, or parent of many) |
-d, --dataset | — | Registry dataset to run as <name>@<version> (e.g. skillsbench@1.1). Resolves the pinned snapshot from the registry, clones tasks at their pinned commit, verifies each task's sha256 content digest, and checks the dataset's bench_version range against the installed benchflow. Each result.json/config.json is stamped with dataset_name, dataset_version, and the task's task_digest. |
--registry | skillsbench registry | Dataset registry JSON URL or local file. Only valid with --dataset. |
--source-repo | — | Remote repo as org/repo (e.g. benchflow-ai/skillsbench) |
--source-path | — | Subpath within the repo (e.g. tasks) |
--source-ref | — | Branch or tag to clone (e.g. main) |
--source-env | — | Hosted environment source (e.g. primeintellect/general-agent) |
--source-env-version | — | Hosted environment version |
--source-env-arg | — | Hosted environment argument as KEY=VALUE; repeatable |
--source-env-num-examples | 1 | Number of hosted environment examples |
--source-env-rollouts-per-example | 1 | Rollouts per hosted environment example |
--source-env-max-tokens | 1024 | Max tokens for hosted environment model calls |
--source-env-temperature | 0.0 | Temperature for hosted environment model calls |
--source-env-sampling-arg | — | Verifiers sampling argument as KEY=VALUE; repeatable (for example reasoning_effort=minimal) |
--agent | claude-agent-acp | Agent name |
--model | Agent default | Model ID |
--reasoning-effort | — | Agent reasoning/thinking effort when the agent exposes one (e.g. max) |
--sandbox | docker | Sandbox: docker, daytona, or modal |
--usage-tracking | auto | Token usage telemetry policy: auto, required, or off |
--environment-manifest | — | Path to an Environment-plane manifest (environment.toml); applied to every rollout in the batch |
--state | — | S-axis environment binding; inline JSON, registry name@version, or manifest path. Takes precedence over --environment-manifest |
--prompt | task prompt | Prompt to send to the agent; repeatable for multi-prompt runs |
--config-override | — | C-axis task config overlay; inline JSON/YAML/TOML or @file, deep-merged into each task's resolved config |
--concurrency | 4 | Max concurrent tasks (batch mode only) |
--build-concurrency | --concurrency | Max concurrent docker image builds; set lower (e.g. 8) when --concurrency is high to avoid overwhelming the docker daemon |
--worker-concurrency | — | Run batch eval through isolated worker subprocesses, each with at most this many concurrent tasks; --concurrency remains the aggregate target |
--worker-retries | 1 | Retry a crashed worker shard this many times, resuming its jobs dir |
--worker-start-stagger-sec | 1.0 | Seconds to stagger worker starts to avoid Daytona connection storms |
--agent-idle-timeout | (built-in default) | Abort ACP prompts after this many idle seconds; 0 disables idle detection |
--jobs-dir | jobs | Output directory |
--sandbox-user | agent | Sandbox user (null for root) |
--sandbox-setup-timeout | 120 | Timeout in seconds for sandbox user setup |
--skills-dir | — | Advanced custom skills directory; valid only with --skill-mode with-skill. Omit it to use each task's environment/skills. |
--skill-mode | no-skill | Skill mode: no-skill, with-skill, or self-gen |
--skill-creator-dir | — | Path to a skill-creator directory (or a skills root containing it); used when --skill-mode self-gen |
--self-gen-no-internet | false | Disable web tools for the self-generated skill run |
--agent-env | — | Agent environment variable as KEY=VALUE; repeatable |
--include | — | Only run these task names; repeatable (e.g. --include jax-computing-basics --include data-to-d3) |
--exclude | — | Skip these task names; repeatable (e.g. --exclude quantum-numerical-simulation) |
--loop-strategy | — | Wrap each rollout in a loop, e.g. verify-retry:k=3,feedback=names or self-review:k=3 (omit for single-shot) |
--ignore-bench-version | false | With --dataset, skip the dataset's bench_version compatibility gate |
When mounting skills, the recommended docs default is
--agent-env BENCHFLOW_SKILL_NUDGE=name. See
Architecture: skill loading for how
with-skill mode is registered with each agent and how the nudge modes differ.
Daytona batch runs collect provider token/cost telemetry by default with a
sandbox-local LiteLLM gateway. Use --usage-tracking required when missing telemetry
should fail the rollout, or --usage-tracking off for recovery runs that should
leave provider traffic untouched.
--source-env is for external hosted environment hubs. The first supported
runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity
(env_uid, hub_url), installs the versioned package into an isolated local
virtual environment, and runs vf-eval. --sandbox remains the BenchFlow task
sandbox selector for local/repo task sources; Verifiers source environments own
their own harness and sandbox behavior. --model is passed to the Verifiers
model endpoint; use a model id available to that provider. Provider-specific
sampling options are not inferred; pass them explicitly with
--source-env-sampling-arg.
bench eval list
List completed evaluations from a jobs directory.
bench eval list jobs/
bench eval metrics
Collect and display metrics (pass/fail/score, memory score, tool calls, duration)
from a jobs directory. Use --json for machine-readable output.
bench eval metrics jobs/
bench eval metrics jobs/ --json
bench eval view
Serve a trial trajectory viewer in the browser for a rollout or job directory.
bench eval view jobs/run/task__abc123
bench eval view jobs/ --port 9000
bench skills
bench skills list
List skills discovered under the default skills roots (or --dir).
bench skills list
bench skills list --dir ./skills
bench skills eval
Evaluate a skill against its evals.json test cases.
bench skills eval skills/my-skill/ \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona
bench tasks
bench tasks init
Scaffold a new benchmark task.
bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/
| Flag | Default | Description |
|---|---|---|
--format | task-md | Task format. New tasks use task-md; the legacy scaffold path is retired. |
bench tasks check
Validate a task directory. Native packages use task.md, environment/, and
verifier/; older split packages should be migrated with bench tasks migrate.
bench tasks check tasks/my-task
With --level, validation runs at a chosen depth: schema, structural,
runtime-capability, publication-grade, acceptance, or acceptance-live.
Acceptance-level errors such as
acceptance validation requires benchflow.evidence mapping refer to the
benchflow.evidence schema documented in the "Assets, Provenance, And
Evidence" section of docs/task-standard.md.
bench tasks migrate
Convert an older split task package into the unified task.md format. By
default the old files are kept alongside the new task.md; for publication,
use --remove-legacy.
bench tasks migrate tasks/my-task
bench tasks migrate tasks/my-task --overwrite --remove-legacy
| Flag | Default | Description |
|---|---|---|
--overwrite | false | Replace an existing task.md |
--remove-legacy | false | Delete split files and promote tests/ to verifier/ and solution/ to oracle/ after task.md is verified |
bench tasks normalize
Expand minimal task.md authoring profiles into the canonical task.md
form. Prints the normalized document to stdout unless told otherwise.
bench tasks normalize tasks/my-task
bench tasks normalize tasks/my-task --write
bench tasks normalize tasks/my-task -o normalized-task.md
| Flag | Default | Description |
|---|---|---|
--output, -o | — | Write normalized task.md to this path instead of stdout |
--write | false | Replace task.md in place with the normalized canonical form |
bench tasks export
Export a task.md task to a compatibility split package, with a compatibility
loss report written to compatibility/export-report.json in the export
directory.
bench tasks export tasks/my-task out/my-task-split
bench tasks export tasks/my-task --report-only
bench tasks export tasks/my-task out/my-task-split --overwrite
Arguments: TASK_DIR (task directory to export) and optional OUTPUT_DIR
(destination split-layout directory; may be omitted with --report-only).
| Flag | Default | Description |
|---|---|---|
--target | harbor | Compatibility target: harbor |
--overwrite | false | Replace an existing export directory |
--report-only | false | Print the compatibility loss report without writing files |
bench tasks digest
Compute the content digest that pins a task's files, independent of git — the
sha256 the dataset registry keys on (matches the digests bench eval run -d
verifies and the task_digest stamped into every result.json). Recognizes
both legacy task.toml tasks and native task.md tasks. Given a single task
directory it prints the digest; given a directory of tasks it prints one
<name> <digest> line per task. Output goes to stdout via echo (not Rich), so
it is safe to pipe into machine-readable tooling.
bench tasks digest tasks/my-task # -> sha256:<hex>
bench tasks digest tasks/ # one "<name> sha256:<hex>" line per task
Arguments: PATH (a task directory, or a directory of task directories).
bench tasks generate
Generate benchmark task directories from real agent traces.
bench tasks generate --from-local --project my-repo --limit 5
bench tasks generate --from-file session.jsonl --dry-run
bench tasks generate --from-hf opentraces-test --limit 50
| Flag | Default | Description |
|---|---|---|
--from-local | — | Generate from local Claude Code sessions |
--from-file | — | Generate from a JSONL trace file |
--from-hf | — | Generate from a HuggingFace dataset ID or alias |
--output | tasks | Output directory for generated tasks |
--projects-dir | ~/.claude/projects/ | Claude Code projects directory |
--project | — | Filter local sessions by project path substring |
--format | auto | Trace format override |
--split | train | HuggingFace dataset split |
--max-rows | 100 | Max rows to download from HuggingFace |
--limit | 20 | Max traces to process |
--min-steps | 2 | Minimum steps per trace |
--outcome | — | Filter by outcome: success, failure, unknown |
--author | benchflow-traces | Author name for generated task metadata |
--task-format | task-md | Generated task package format: task-md or legacy |
--dry-run | false | Preview traces without generating tasks |
bench tasks list-sources
List known HuggingFace trace datasets. The aliases listed here can be passed
to bench tasks generate --from-hf.
bench tasks list-sources
bench sandbox
Local sandbox lifecycle: provision a task on a docker/daytona/modal backend, list active sandboxes, and reap stale ones.
bench sandbox create
Create an environment object from a task directory. This validates environment construction but does not start the sandbox.
bench sandbox create tasks/my-task --sandbox daytona
bench sandbox list
List active local (Daytona) sandboxes.
bench sandbox list
bench sandbox cleanup
Clean up orphaned Daytona sandboxes. By default this deletes sandboxes older
than 24 hours; use --dry-run to preview what would be deleted.
bench sandbox cleanup --dry-run --max-age 1440
Daytona-backed evals also reap orphaned sandboxes automatically at run start
(failure states such as BUILD_FAILED are reaped sooner than healthy ones, and
an idle-activity guard means concurrent live runs are never reaped). Set
BENCHFLOW_DAYTONA_AUTO_REAP to any of 0/false/no/off (case-insensitive)
to disable that automatic pass and rely on the manual command above.
bench environment (deprecated)
bench environment is a hidden deprecated alias group, removed in 0.7. The
local lifecycle moved to bench sandbox (create/list/cleanup)
and hosted-provider browsing to bench hub list. The old
bench environment create|list|cleanup and show|inspect (plus list --provider/--hub) still work, each printing a one-line stderr notice.
bench hub
External environment hubs: browse a hub's environments (list/show/inspect)
and check Harbor registry compatibility (check).
bench hub list / show / inspect
Read-only browsing of a hub's environments. list covers two hubs via
--provider: primeintellect (hosted "Environments") and harbor (the
benchmark registry). To run a hosted environment, use
bench eval run --source-env.
bench hub list --provider primeintellect --owner primeintellect --search general-agent --limit 5
bench hub list --provider harbor --search coding
bench hub show primeintellect/general-agent --version 0.1.1
bench hub inspect primeintellect/general-agent --version 0.1.1 --path README.md
bench hub env list|show|inspect still resolves as a hidden back-compat alias.
bench hub check
Inventory or structurally check representative tasks from an environment hub's registry. Defaults to an inventory pass against the public Harbor registry JSON.
# Inventory the public Harbor hub registry
bench hub check
# Structural check, two tasks per dataset, JSONL output
bench hub check --level check --tasks-per-dataset 2 --out hub.jsonl
| Flag | Default | Description |
|---|---|---|
--registry | Harbor public registry URL | Harbor registry JSON URL or local file |
--tasks-per-dataset | 2 | Representative tasks selected per dataset |
--level | inventory | Compatibility level: inventory or check |
--out | — | Optional JSONL output path |
--cache-dir | .cache/hub/harbor | Cache directory for sparse clones |
--limit | — | Optional cap on selected task refs |
YAML Config Format
Batch config with skills and skill nudge
source:
repo: benchflow-ai/skillsbench
path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skill_mode: with-skill
skills_dir: shared-skills/
agent_env:
BENCHFLOW_SKILL_NUDGE: name
max_retries: 2
Multi-scene (BYOS skill generation)
Use the Python API for multi-scene experiments. bench eval run --config is for
batch job configs; scene configs are loaded with benchflow._utils.yaml_loader or built
directly in Python.
task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300
scenes:
- name: skill-gen
roles:
- name: creator
agent: gemini
model: gemini-3.1-flash-lite-preview
turns:
- role: creator
prompt: "Analyze the task and write a skill document to /app/generated-skill.md"
- name: solve
roles:
- name: solver
agent: gemini
model: gemini-3.1-flash-lite-preview
turns:
- role: solver
bench eval continue
Resume a previous, unfinished (timed-out) openhands run to completion via
record-replay. Standalone — it does not touch the normal run path. See
Continuing timed-out runs for the full guide.
bench eval continue path/to/original/run-folder --tasks-dir path/to/tasks
The original top-level bench continue still works as a hidden, deprecated alias.
Key options: --model (override the live-continuation model; defaults to the
original run's model), --timeout, --output, --require-timeout,
--strict-divergence, --replay-only (rebuild via replay and stop at the
cut-point — no live model or API key needed), and --proxy-mode (replay
proxy placement: auto, host, or sandbox; default auto uses
sandbox-local replay for Daytona/Modal and host replay for Docker).
bench eval continue-batch
Continue all timed-out OpenHands runs found under a directory tree. Discovers
run folders (config.json + trajectory/llm_trajectory.jsonl) recursively,
continues each, and prints a JSON batch summary (exits 1 if any continuation
failed).
bench eval continue-batch path/to/jobs-root --tasks-dir path/to/tasks
| Flag | Default | Description |
|---|---|---|
--tasks-dir | — | Directory holding task sources; required unless the recorded task path exists |
--model | original run's model | Override the live-continuation model |
--timeout | — | Wall-clock budget per continuation |
--output | — | Output jobs dir for continued runs |
--concurrency | 100 | Maximum number of continuation runs in flight |
--limit | — | Limit discovered timeout folders |
--strict-divergence | false | Abort a run if replay leaves the original rails |
--proxy-mode | auto | Replay proxy placement: auto, host, or sandbox |