CLI reference

June 17, 2026

BenchFlow uses a resource-verb pattern: bench <resource> <verb>.

bench --version

bench agent

bench agent is agent management only. bench agent list and bench agent show operate on registered AI agents (Claude Code, Gemini CLI, Codex, OpenHands, …) — the programs that solve tasks. Onboarding a third-party benchmark (scaffold → drive → parity-gate a benchmarks/<name>/ adoption) is a separate workflow under bench eval adopt. The legacy bench agent create|run|verify still work as hidden deprecated aliases through 0.6, printing a one-line notice; they are removed in 0.7.

bench agent list

List all registered agents with their protocol and native/default auth requirements. Provider-prefixed models may use provider-specific credentials; Azure Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT.

bench agent list

bench agent show

Show details for a specific agent, including native/default auth and a note about provider-specific credentials.

bench agent show gemini

bench eval adopt

Bring a third-party benchmark into the environment framework. bench eval adopt is a single multi-mode command: it scaffolds a benchmarks/<name>/ package, drives the codex conversion, and parity-gates the result. The conversion guide is embedded in the command itself. It was previously a subgroup with init/convert/verify subcommands, and before that bench agent create|run|verify; both bench adopt init|convert|verify and bench agent create|run|verify still work as hidden deprecated aliases through 0.6 (they print a one-line notice and are removed in 0.7).

The mode is selected by flags:

  • bench eval adopt <source> (default, convert) — scaffold benchmarks/<name>/ if it is missing, then drive the codex conversion of the upstream benchmark at <source>. Use --dry-run to preview the launch command without running it (and without writing any files).
  • bench eval adopt <name> --scaffold-only — only scaffold the package, do not convert.
  • bench eval adopt <name> --verify — run the parity gate for the named benchmark.

In convert mode the argument is the SOURCE repo/path to adopt; in --verify / --scaffold-only mode it is the benchmark SLUG. --verify and --scaffold-only are mutually exclusive.
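The flag-to-mode rules above can be sketched in a few lines of Python. This is an illustration of the documented behavior only, not BenchFlow's actual implementation; the helper names `select_adopt_mode` and `argument_role` are hypothetical.

```python
def select_adopt_mode(scaffold_only: bool = False, verify: bool = False) -> str:
    """Return the mode implied by the two flags (hypothetical helper)."""
    if scaffold_only and verify:
        # The reference states the two flags are mutually exclusive.
        raise ValueError("--verify and --scaffold-only are mutually exclusive")
    if scaffold_only:
        return "scaffold-only"
    if verify:
        return "verify"
    return "convert"  # default mode when neither flag is given


def argument_role(mode: str) -> str:
    """Say how the positional argument is interpreted in each mode."""
    # Convert mode takes the SOURCE repo/path; the other modes take the SLUG.
    return "source" if mode == "convert" else "slug"
```

For example, `argument_role(select_adopt_mode(verify=True))` yields `"slug"`, matching the rule that `--verify` expects a benchmark name rather than an upstream source.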

Convert (default). The command resolves the slug (--name, else derived from the source basename), auto-scaffolds benchmarks/<name>/ if it does not exist (a no-op if it already does), then launches the host codex CLI to drive the conversion toward a benchmarks/<name>/ pull request. It assembles the adoption context — the source, the target path, the adoption skills, and the embedded conversion guide — and runs codex exec against the repo root. It is fail-closed on credentials: codex needs OPENAI_API_KEY (or CODEX_API_KEY) in the environment, or a ~/.codex/auth.json from codex login, otherwise the command exits before assembling any context. --dry-run prints the exact launch command without running it (no credentials required) and writes no files.

# Print the codex launch command without running it
bench eval adopt https://github.com/org/some-benchmark --dry-run

# Scaffold-if-missing, then launch the host codex driver against a local source
bench eval adopt ./vendor/some-benchmark --name my-bench --model o3

| Flag | Default | Description |
| --- | --- | --- |
| --name | derived from source | Benchmark slug (default: from source basename) |
| --model | codex default | Model for the codex driver |
| --dry-run | false | Print the launch command, do not run (writes no files) |
| --codex-bin | codex | Host codex binary |
| -c, --codex-config | | Codex config override as key=value, passed through to codex as -c key=value; repeatable. Use it to work around host ~/.codex/config.toml drift without editing the file, for example -c service_tier=flex when an installed codex version rejects a stale value. |
| --benchmarks-dir | repo benchmarks/ | Target benchmarks/ directory (used by the auto-scaffold) |
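
The fail-closed credential rule for convert mode (an API key in the environment, or an auth file written by codex login) can be approximated as below. This is a minimal sketch under stated assumptions, not the command's real check: the function name `has_codex_credentials` is hypothetical, and treating an empty variable as missing is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def has_codex_credentials(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> bool:
    """Return True if any documented codex credential source is present."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    # Either environment variable is enough (empty values count as missing
    # here, which is an assumption of this sketch).
    if env.get("OPENAI_API_KEY") or env.get("CODEX_API_KEY"):
        return True
    # Otherwise fall back to the auth file created by `codex login`.
    return (home / ".codex" / "auth.json").is_file()
```

A wrapper script could call this before invoking the command and stop early with its own message. --dry-run needs no such check, since it requires no credentials.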

Scaffold only. bench eval adopt <name> --scaffold-only writes only the package layout, which mirrors the reference benchmark benchmarks/programbench/: benchflow.py (converter), main.py, parity_test.py, run_<name>.py, <name>.yaml, benchmark.yaml, parity_experiment.json (status template), README.md, and __init__.py. It is fail-closed: the slug is validated (lowercase, leading letter, single internal hyphens, max 64 chars) and the command refuses to overwrite an existing benchmark directory.

bench eval adopt my-bench --scaffold-only
bench eval adopt my-bench --scaffold-only --benchmarks-dir ./benchmarks

| Flag | Default | Description |
| --- | --- | --- |
| --benchmarks-dir | repo benchmarks/ | Target benchmarks/ directory |
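
The slug rules stated for scaffold-only mode (lowercase, leading letter, single internal hyphens, at most 64 characters) translate naturally into a regular expression. The sketch below is an approximation rather than the validator BenchFlow ships: `is_valid_slug` is a hypothetical name, and allowing digits after the first letter is an assumption the reference does not confirm.

```python
import re

# One leading lowercase letter, then lowercase alphanumeric runs separated
# by single hyphens. Digits are an assumption of this sketch.
_SLUG_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")
MAX_SLUG_LENGTH = 64


def is_valid_slug(slug: str) -> bool:
    """Check a benchmark slug against the documented constraints."""
    if len(slug) > MAX_SLUG_LENGTH:
        return False
    return _SLUG_RE.fullmatch(slug) is not None
```

Under these rules `my-bench` passes, while `My-bench`, `my--bench`, `-bench`, and `bench-` are all rejected.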

Verify. bench eval adopt <name> --verify runs the parity gate for an adopted benchmark and emits a confidence verdict. It reads benchmarks/<name>/parity_experiment.json and scores two layers: a deterministic conversion-faithfulness floor (every compared criterion's converted verdict must match the original's verdict on identical inputs) and a statistical reward-distribution layer (every legacy-vs-converted reward delta must sit within --tolerance). The gate is parity-only — a faithful conversion reproduces the original's behavior, including any reward-hackability the source has; it never "improves" or sanitizes the source. The verdict is one of parity-confirmed, parity-divergent, or insufficient-evidence (no recorded comparisons). On any non-confirmed verdict the command exits non-zero and emits a draft GitHub issue body for human support — printed to stdout, or written to --issue-out. The draft is never filed automatically. Pass --roundtrip-task to also run the structural round-trip conformance check on a concrete task directory.

By default the gate scores the recorded parity_experiment.json — fast, but it trusts an artifact the conversion produced about itself. Pass --rerun to independently re-execute parity_test.py --mode side-by-side and score its fresh output instead. --rerun is fail-closed: a missing/failing parity_test.py, a timeout, or output that is not in the scoreable parity_experiment.json shape all exit non-zero (rather than silently reporting insufficient-evidence).

bench eval adopt my-bench --verify
bench eval adopt my-bench --verify --tolerance 0.05 --issue-out divergence.md
bench eval adopt my-bench --verify --roundtrip-task benchmarks/my-bench/tasks/example
bench eval adopt my-bench --verify --rerun   # re-run parity_test.py, score fresh output

| Flag | Default | Description |
| --- | --- | --- |
| --benchmarks-dir | repo benchmarks/ | Target benchmarks/ directory |
| --tolerance | 0.02 | Max abs reward delta (statistical layer) |
| --issue-out | | Write the divergence issue draft to this path instead of stdout |
| --roundtrip-task | | Also run the structural round-trip check on this task dir |
| --rerun | false | Re-execute parity_test.py --mode side-by-side and score its fresh output instead of the recorded parity_experiment.json |
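
The two-layer verdict described for verify mode can be illustrated with a small function. This is a sketch of the documented logic only: the input shapes (pairs of original/converted verdicts and pairs of legacy/converted rewards) and the name `parity_verdict` are hypothetical and do not reflect the real parity_experiment.json schema, which this reference does not spell out.

```python
from typing import Sequence, Tuple


def parity_verdict(
    criteria: Sequence[Tuple[bool, bool]],
    rewards: Sequence[Tuple[float, float]],
    tolerance: float = 0.02,
) -> str:
    """Score (original, converted) verdict pairs and (legacy, converted) rewards."""
    # No recorded comparisons at all means there is nothing to judge.
    if not criteria and not rewards:
        return "insufficient-evidence"
    # Deterministic floor: every converted verdict must match the original.
    floor_ok = all(original == converted for original, converted in criteria)
    # Statistical layer: every reward delta must sit within the tolerance.
    stats_ok = all(abs(legacy - converted) <= tolerance for legacy, converted in rewards)
    return "parity-confirmed" if floor_ok and stats_ok else "parity-divergent"
```

With the default tolerance of 0.02, a reward pair of (0.50, 0.53) is divergent; the same pair passes when the tolerance is raised to 0.05, which mirrors the effect of --tolerance 0.05.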

bench eval

bench eval run

Run an evaluation — single task or batch. Use it for YAML configs and batch runs; it also accepts a single task directory.

Renamed from bench eval create. The old name still works as a deprecated alias and prints a deprecation notice; switch to bench eval run.

# From YAML config
bench eval run --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml

# From remote repo (fast Daytona batch; token usage may be unavailable)
bench eval run \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --concurrency 64 \
  --sandbox-setup-timeout 300

# From remote repo with required token usage telemetry
bench eval run \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --usage-tracking required \
  --concurrency 16 \
  --sandbox-setup-timeout 300

# From local directory
bench eval run --tasks-dir ./tasks --agent gemini --model gemini-3.1-flash-lite-preview

# From a hosted PrimeIntellect / Verifiers environment
bench eval run \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite

# Single task with mounted skills and the recommended skill nudge
bench eval run \
  --tasks-dir tasks/pdf-fix \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --skill-mode with-skill \
  --agent-env BENCHFLOW_SKILL_NUDGE=name

# Pinned registry dataset: resolves skillsbench@1.1, verifies task digests,
# and stamps dataset identity into every result.json/config.json
bench eval run -d skillsbench@1.1 --agent gemini --model gemini-3.1-flash-lite-preview

| Flag | Default | Description |
| --- | --- | --- |
| --config | | YAML config file |
| --run-config | | Explicit alias for the YAML run-config source file; equivalent to --config |
| --tasks-dir | | Local task dir (single native task.md package, compatibility split-layout task, or parent of many) |
| -d, --dataset | | Registry dataset to run as <name>@<version> (e.g. skillsbench@1.1). Resolves the pinned snapshot from the registry, clones tasks at their pinned commit, verifies each task's sha256 content digest, and checks the dataset's bench_version range against the installed benchflow. Each result.json/config.json is stamped with dataset_name, dataset_version, and the task's task_digest. |
| --registry | skillsbench registry | Dataset registry JSON URL or local file. Only valid with --dataset. |
| --source-repo | | Remote repo as org/repo (e.g. benchflow-ai/skillsbench) |
| --source-path | | Subpath within the repo (e.g. tasks) |
| --source-ref | | Branch or tag to clone (e.g. main) |
| --source-env | | Hosted environment source (e.g. primeintellect/general-agent) |
| --source-env-version | | Hosted environment version |
| --source-env-arg | | Hosted environment argument as KEY=VALUE; repeatable |
| --source-env-num-examples | 1 | Number of hosted environment examples |
| --source-env-rollouts-per-example | 1 | Rollouts per hosted environment example |
| --source-env-max-tokens | 1024 | Max tokens for hosted environment model calls |
| --source-env-temperature | 0.0 | Temperature for hosted environment model calls |
| --source-env-sampling-arg | | Verifiers sampling argument as KEY=VALUE; repeatable (for example reasoning_effort=minimal) |
| --agent | claude-agent-acp | Agent name |
| --model | Agent default | Model ID |
| --reasoning-effort | | Agent reasoning/thinking effort when the agent exposes one (e.g. max) |
| --sandbox | docker | Sandbox: docker, daytona, or modal |
| --usage-tracking | auto | Token usage telemetry policy: auto, required, or off |
| --environment-manifest | | Path to an Environment-plane manifest (environment.toml); applied to every rollout in the batch |
| --state | | S-axis environment binding; inline JSON, registry name@version, or manifest path. Takes precedence over --environment-manifest |
| --prompt | task prompt | Prompt to send to the agent; repeatable for multi-prompt runs |
| --config-override | | C-axis task config overlay; inline JSON/YAML/TOML or @file, deep-merged into each task's resolved config |
| --concurrency | 4 | Max concurrent tasks (batch mode only) |
| --build-concurrency | --concurrency | Max concurrent docker image builds; set lower (e.g. 8) when --concurrency is high to avoid overwhelming the docker daemon |
| --worker-concurrency | | Run batch eval through isolated worker subprocesses, each with at most this many concurrent tasks; --concurrency remains the aggregate target |
| --worker-retries | 1 | Retry a crashed worker shard this many times, resuming its jobs dir |
| --worker-start-stagger-sec | 1.0 | Seconds to stagger worker starts to avoid Daytona connection storms |
| --agent-idle-timeout | (built-in default) | Abort ACP prompts after this many idle seconds; 0 disables idle detection |
| --jobs-dir | jobs | Output directory |
| --sandbox-user | agent | Sandbox user (null for root) |
| --sandbox-setup-timeout | 120 | Timeout in seconds for sandbox user setup |
| --skills-dir | | Advanced custom skills directory; valid only with --skill-mode with-skill. Omit it to use each task's environment/skills. |
| --skill-mode | no-skill | Skill mode: no-skill, with-skill, or self-gen |
| --skill-creator-dir | | Path to a skill-creator directory (or a skills root containing it); used when --skill-mode self-gen |
| --self-gen-no-internet | false | Disable web tools for the self-generated skill run |
| --agent-env | | Agent environment variable as KEY=VALUE; repeatable |
| --include | | Only run these task names; repeatable (e.g. --include jax-computing-basics --include data-to-d3) |
| --exclude | | Skip these task names; repeatable (e.g. --exclude quantum-numerical-simulation) |
| --loop-strategy | | Wrap each rollout in a loop, e.g. verify-retry:k=3,feedback=names or self-review:k=3 (omit for single-shot) |
| --ignore-bench-version | false | With --dataset, skip the dataset's bench_version compatibility gate |

When mounting skills, the recommended default is --agent-env BENCHFLOW_SKILL_NUDGE=name. See Architecture: skill loading for how with-skill mode is registered with each agent and how the nudge modes differ.

Daytona batch runs collect provider token/cost telemetry by default with a sandbox-local LiteLLM gateway. Use --usage-tracking required when missing telemetry should fail the rollout, or --usage-tracking off for recovery runs that should leave provider traffic untouched.

--source-env is for external hosted environment hubs. The first supported runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity (env_uid, hub_url), installs the versioned package into an isolated local virtual environment, and runs vf-eval. --sandbox remains the BenchFlow task sandbox selector for local/repo task sources; Verifiers source environments own their own harness and sandbox behavior. --model is passed to the Verifiers model endpoint; use a model id available to that provider. Provider-specific sampling options are not inferred; pass them explicitly with --source-env-sampling-arg.
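
When bench eval run is driven from a script, the repeatable flags (--agent-env, --include, --exclude) are easiest to handle by building the argument list programmatically. The helper below is a sketch for illustration: `build_eval_run_argv` is a hypothetical name, it covers only a handful of the flags documented above, and it only assembles the command without running it.

```python
import shlex
from typing import List, Mapping, Optional, Sequence


def build_eval_run_argv(
    tasks_dir: str,
    agent: str,
    model: str,
    agent_env: Optional[Mapping[str, str]] = None,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
) -> List[str]:
    """Assemble a `bench eval run` argument list from Python values."""
    argv = ["bench", "eval", "run", "--tasks-dir", tasks_dir,
            "--agent", agent, "--model", model]
    # Repeatable flags are emitted once per value.
    for key, value in (agent_env or {}).items():
        argv += ["--agent-env", f"{key}={value}"]
    for name in include:
        argv += ["--include", name]
    for name in exclude:
        argv += ["--exclude", name]
    return argv


if __name__ == "__main__":
    argv = build_eval_run_argv(
        "./tasks", "gemini", "gemini-3.1-flash-lite-preview",
        agent_env={"BENCHFLOW_SKILL_NUDGE": "name"},
        include=["jax-computing-basics"],
    )
    # shlex.join renders a copy-pasteable shell command for logging.
    print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run as-is; keeping it as a list rather than a string avoids shell quoting problems with values that contain spaces.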

bench eval list

List completed evaluations from a jobs directory.

bench eval list jobs/

bench eval metrics

Collect and display metrics (pass/fail/score, memory score, tool calls, duration) from a jobs directory. Use --json for machine-readable output.

bench eval metrics jobs/
bench eval metrics jobs/ --json

bench eval view

Serve a trial trajectory viewer in the browser for a rollout or job directory.

bench eval view jobs/run/task__abc123
bench eval view jobs/ --port 9000

bench skills

bench skills list

List skills discovered under the default skills roots (or --dir).

bench skills list
bench skills list --dir ./skills

bench skills eval

Evaluate a skill against its evals.json test cases.

bench skills eval skills/my-skill/ \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona

bench tasks

bench tasks init

Scaffold a new benchmark task.

bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/

| Flag | Default | Description |
| --- | --- | --- |
| --format | task-md | Task format. New tasks use task-md; the legacy scaffold path is retired. |

bench tasks check

Validate a task directory. Native packages use task.md, environment/, and verifier/; older split packages should be migrated with bench tasks migrate.

bench tasks check tasks/my-task

With --level, validation runs at a chosen depth: schema, structural, runtime-capability, publication-grade, acceptance, or acceptance-live. Acceptance-level errors such as "acceptance validation requires benchflow.evidence mapping" refer to the benchflow.evidence schema documented in the "Assets, Provenance, And Evidence" section of docs/task-standard.md.

bench tasks migrate

Convert an older split task package into the unified task.md format. By default the old files are kept alongside the new task.md; for publication, use --remove-legacy.

bench tasks migrate tasks/my-task
bench tasks migrate tasks/my-task --overwrite --remove-legacy

| Flag | Default | Description |
| --- | --- | --- |
| --overwrite | false | Replace an existing task.md |
| --remove-legacy | false | Delete split files and promote tests/ to verifier/ and solution/ to oracle/ after task.md is verified |

bench tasks normalize

Expand minimal task.md authoring profiles into the canonical task.md form. Prints the normalized document to stdout unless told otherwise.

bench tasks normalize tasks/my-task
bench tasks normalize tasks/my-task --write
bench tasks normalize tasks/my-task -o normalized-task.md

| Flag | Default | Description |
| --- | --- | --- |
| --output, -o | | Write normalized task.md to this path instead of stdout |
| --write | false | Replace task.md in place with the normalized canonical form |

bench tasks export

Export a task.md task to a compatibility split package, with a compatibility loss report written to compatibility/export-report.json in the export directory.

bench tasks export tasks/my-task out/my-task-split
bench tasks export tasks/my-task --report-only
bench tasks export tasks/my-task out/my-task-split --overwrite

Arguments: TASK_DIR (task directory to export) and optional OUTPUT_DIR (destination split-layout directory; may be omitted with --report-only).

| Flag | Default | Description |
| --- | --- | --- |
| --target | harbor | Compatibility target: harbor |
| --overwrite | false | Replace an existing export directory |
| --report-only | false | Print the compatibility loss report without writing files |

bench tasks digest

Compute the content digest that pins a task's files, independent of git — the sha256 the dataset registry keys on (matches the digests bench eval run -d verifies and the task_digest stamped into every result.json). Recognizes both legacy task.toml tasks and native task.md tasks. Given a single task directory it prints the digest; given a directory of tasks it prints one <name> <digest> line per task. Output goes to stdout via echo (not Rich), so it is safe to pipe into machine-readable tooling.

bench tasks digest tasks/my-task          # -> sha256:<hex>
bench tasks digest tasks/                  # one "<name> sha256:<hex>" line per task

Arguments: PATH (a task directory, or a directory of task directories).
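
Because the directory form prints one `<name> sha256:<hex>` line per task, the output is straightforward to consume from a script. The parser below is a minimal sketch based only on that documented line format; `parse_digest_output` is a hypothetical helper, and rejecting any line that is not exactly two whitespace-separated fields is a choice made for this example.

```python
from typing import Dict


def parse_digest_output(text: str) -> Dict[str, str]:
    """Map task name to digest from `bench tasks digest <dir-of-tasks>` output."""
    digests: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split()
        # Each line is expected to be "<name> sha256:<hex>".
        if len(parts) != 2 or not parts[1].startswith("sha256:"):
            raise ValueError(f"unexpected digest line: {raw_line!r}")
        name, digest = parts
        digests[name] = digest
    return digests
```

Comparing the resulting mapping against a previously saved one is a simple way to spot tasks whose content changed between two checkouts.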

bench tasks generate

Generate benchmark task directories from real agent traces.

bench tasks generate --from-local --project my-repo --limit 5
bench tasks generate --from-file session.jsonl --dry-run
bench tasks generate --from-hf opentraces-test --limit 50

| Flag | Default | Description |
| --- | --- | --- |
| --from-local | | Generate from local Claude Code sessions |
| --from-file | | Generate from a JSONL trace file |
| --from-hf | | Generate from a HuggingFace dataset ID or alias |
| --output | tasks | Output directory for generated tasks |
| --projects-dir | ~/.claude/projects/ | Claude Code projects directory |
| --project | | Filter local sessions by project path substring |
| --format | auto | Trace format override |
| --split | train | HuggingFace dataset split |
| --max-rows | 100 | Max rows to download from HuggingFace |
| --limit | 20 | Max traces to process |
| --min-steps | 2 | Minimum steps per trace |
| --outcome | | Filter by outcome: success, failure, unknown |
| --author | benchflow-traces | Author name for generated task metadata |
| --task-format | task-md | Generated task package format: task-md or legacy |
| --dry-run | false | Preview traces without generating tasks |

bench tasks list-sources

List known HuggingFace trace datasets. The aliases listed here can be passed to bench tasks generate --from-hf.

bench tasks list-sources

bench sandbox

Local sandbox lifecycle: provision a task on a docker/daytona/modal backend, list active sandboxes, and reap stale ones.

bench sandbox create

Create an environment object from a task directory. This validates environment construction but does not start the sandbox.

bench sandbox create tasks/my-task --sandbox daytona

bench sandbox list

List active local (Daytona) sandboxes.

bench sandbox list

bench sandbox cleanup

Clean up orphaned Daytona sandboxes. By default this deletes sandboxes older than 24 hours; use --dry-run to preview what would be deleted.

bench sandbox cleanup --dry-run --max-age 1440

Daytona-backed evals also reap orphaned sandboxes automatically at run start (failure states such as BUILD_FAILED are reaped sooner than healthy ones, and an idle-activity guard means concurrent live runs are never reaped). Set BENCHFLOW_DAYTONA_AUTO_REAP to any of 0/false/no/off (case-insensitive) to disable that automatic pass and rely on the manual command above.
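
The opt-out values for BENCHFLOW_DAYTONA_AUTO_REAP can be mirrored in a wrapper script with a check like the one below. This is a sketch of the documented rule, not the code BenchFlow runs: `auto_reap_enabled` is a hypothetical name, and both the whitespace stripping and the treatment of unrecognized values as "enabled" are assumptions.

```python
import os
from typing import Mapping, Optional

# Values the reference lists as disabling the automatic reap pass.
_DISABLE_VALUES = {"0", "false", "no", "off"}


def auto_reap_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only when the variable is set to a documented opt-out value."""
    env = os.environ if env is None else env
    value = env.get("BENCHFLOW_DAYTONA_AUTO_REAP")
    if value is None:
        return True  # unset: the automatic pass runs
    # Matching is case-insensitive per the reference.
    return value.strip().lower() not in _DISABLE_VALUES
```

For instance, `OFF`, `False`, and `0` all disable the pass, while leaving the variable unset keeps it on.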

bench environment (deprecated)

bench environment is a hidden deprecated alias group, removed in 0.7. The local lifecycle moved to bench sandbox (create/list/cleanup) and hosted-provider browsing to bench hub list. The old bench environment create|list|cleanup and show|inspect (plus list --provider/--hub) still work, each printing a one-line stderr notice.

bench hub

External environment hubs: browse a hub's environments (list/show/inspect) and check Harbor registry compatibility (check).

bench hub list / show / inspect

Read-only browsing of a hub's environments. list covers two hubs via --provider: primeintellect (hosted "Environments") and harbor (the benchmark registry). To run a hosted environment, use bench eval run --source-env.

bench hub list --provider primeintellect --owner primeintellect --search general-agent --limit 5
bench hub list --provider harbor --search coding
bench hub show primeintellect/general-agent --version 0.1.1
bench hub inspect primeintellect/general-agent --version 0.1.1 --path README.md

bench hub env list|show|inspect still resolves as a hidden back-compat alias.

bench hub check

Inventory or structurally check representative tasks from an environment hub's registry. Defaults to an inventory pass against the public Harbor registry JSON.

# Inventory the public Harbor hub registry
bench hub check

# Structural check, two tasks per dataset, JSONL output
bench hub check --level check --tasks-per-dataset 2 --out hub.jsonl

| Flag | Default | Description |
| --- | --- | --- |
| --registry | Harbor public registry URL | Harbor registry JSON URL or local file |
| --tasks-per-dataset | 2 | Representative tasks selected per dataset |
| --level | inventory | Compatibility level: inventory or check |
| --out | | Optional JSONL output path |
| --cache-dir | .cache/hub/harbor | Cache directory for sparse clones |
| --limit | | Optional cap on selected task refs |

YAML Config Format

Batch config with skills and skill nudge

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skill_mode: with-skill
skills_dir: shared-skills/
agent_env:
  BENCHFLOW_SKILL_NUDGE: name
max_retries: 2

Multi-scene (BYOS skill generation)

Use the Python API for multi-scene experiments. bench eval run --config is for batch job configs; scene configs are loaded with benchflow._utils.yaml_loader or built directly in Python.

task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300

scenes:
  - name: skill-gen
    roles:
      - name: creator
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: creator
        prompt: "Analyze the task and write a skill document to /app/generated-skill.md"

  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver

bench eval continue

Resume a previous, unfinished (timed-out) OpenHands run to completion via record-replay. The command is standalone: it does not touch the normal run path. See Continuing timed-out runs for the full guide.

bench eval continue path/to/original/run-folder --tasks-dir path/to/tasks

The original top-level bench continue still works as a hidden, deprecated alias.

Key options: --model (override the live-continuation model; defaults to the original run's model), --timeout, --output, --require-timeout, --strict-divergence, --replay-only (rebuild via replay and stop at the cut-point — no live model or API key needed), and --proxy-mode (replay proxy placement: auto, host, or sandbox; default auto uses sandbox-local replay for Daytona/Modal and host replay for Docker).
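
The --proxy-mode auto rule (sandbox-local replay for Daytona/Modal, host replay for Docker) amounts to a small lookup. The function below is a sketch of that documented mapping; `resolve_proxy_mode` is a hypothetical name, and raising on unknown values is a choice made for this example.

```python
def resolve_proxy_mode(proxy_mode: str, sandbox: str) -> str:
    """Resolve auto/host/sandbox to the effective replay proxy placement."""
    if proxy_mode not in {"auto", "host", "sandbox"}:
        raise ValueError(f"unknown proxy mode: {proxy_mode!r}")
    if proxy_mode != "auto":
        return proxy_mode  # an explicit choice wins
    # auto: sandbox-local replay for Daytona/Modal, host replay for Docker.
    if sandbox in {"daytona", "modal"}:
        return "sandbox"
    if sandbox == "docker":
        return "host"
    raise ValueError(f"unknown sandbox backend: {sandbox!r}")
```

So a Daytona run with the default setting replays inside the sandbox, while passing --proxy-mode host overrides that regardless of backend.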

bench eval continue-batch

Continue all timed-out OpenHands runs found under a directory tree. Discovers run folders (config.json + trajectory/llm_trajectory.jsonl) recursively, continues each, and prints a JSON batch summary (exits 1 if any continuation failed).

bench eval continue-batch path/to/jobs-root --tasks-dir path/to/tasks

| Flag | Default | Description |
| --- | --- | --- |
| --tasks-dir | | Directory holding task sources; required unless the recorded task path exists |
| --model | original run's model | Override the live-continuation model |
| --timeout | | Wall-clock budget per continuation |
| --output | | Output jobs dir for continued runs |
| --concurrency | 100 | Maximum number of continuation runs in flight |
| --limit | | Limit discovered timeout folders |
| --strict-divergence | false | Abort a run if replay leaves the original rails |
| --proxy-mode | auto | Replay proxy placement: auto, host, or sandbox |
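
To preview which folders continue-batch would consider, the discovery rule (a directory holding config.json plus trajectory/llm_trajectory.jsonl, found recursively) can be reproduced with pathlib. This sketch checks only for those two files; it does not decide whether a run actually timed out, and `discover_run_folders` is a hypothetical name rather than a BenchFlow function.

```python
from pathlib import Path
from typing import List, Union


def discover_run_folders(root: Union[str, Path]) -> List[Path]:
    """Find directories that contain both marker files of a run folder."""
    root = Path(root)
    found: List[Path] = []
    for config in root.rglob("config.json"):
        run_dir = config.parent
        # A run folder also needs the recorded LLM trajectory.
        if (run_dir / "trajectory" / "llm_trajectory.jsonl").is_file():
            found.append(run_dir)
    return sorted(found)
```

Calling `discover_run_folders("path/to/jobs-root")` returns a sorted list of candidate run directories that can be eyeballed before launching the batch.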