Python API
June 17, 2026
The Rollout/Scene API is the primary way to run agent benchmarks programmatically.
Install
uv tool install --upgrade benchflow
Quick Start
import asyncio
import benchflow as bf
result = asyncio.run(bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview"))
print(f"Reward: {result.rewards}")
print(f"Tool calls: {result.n_tool_calls}")
Core Types
RolloutConfig
Declarative configuration for a rollout — a sequence of Scenes in a shared sandbox.
from pathlib import Path
from benchflow import RolloutConfig, Scene, Role, Turn
# Single-agent (simplest)
config = RolloutConfig(
task_path=Path("tasks/my-task"),
scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
environment="daytona",
sandbox_setup_timeout=120,
)
# Multi-scene BYOS (skill-gen → solve)
config = RolloutConfig(
task_path=Path("tasks/my-task"),
scenes=[
Scene(name="prep", roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
turns=[Turn("gen", "Generate a skill for this task...")]),
Scene(name="solve", roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
turns=[Turn("solver")]),
],
environment="daytona",
sandbox_setup_timeout=120,
)
Set sandbox_setup_timeout when sandbox user setup needs more than the default 120 seconds.
The same field is also available on JobConfig and RuntimeConfig.
Scene
Authoring sugar for role, prompt, and skill attribution. Scenes compile to explicit rollout Steps before execution; there is no runtime Scene object or message scheduler.
# Single-role shortcut
scene = Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")
# Multi-role with explicit turn order
scene = Scene(
name="coder-reviewer",
roles=[
Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"),
],
turns=[
Turn("coder"), # None prompt = native task goal
Turn("reviewer", "Review the current workspace."),
Turn("coder", "Fix the issues."),
],
)
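To make the "Scenes compile to Steps" idea concrete, here is a minimal sketch of how such a flattening pass might look. It does not use BenchFlow: SketchRole, SketchTurn, SketchScene, SketchStep, and compile_scenes are hypothetical stand-ins invented for illustration, and the real compiler may carry more metadata.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for illustration only; these are not BenchFlow types.
@dataclass(frozen=True)
class SketchRole:
    name: str
    agent: str
    model: str


@dataclass(frozen=True)
class SketchTurn:
    role: str
    prompt: Optional[str] = None  # None means "use the task's own goal"


@dataclass(frozen=True)
class SketchScene:
    name: str
    roles: list
    turns: list


@dataclass(frozen=True)
class SketchStep:
    scene: str
    role: str
    agent: str
    model: str
    prompt: Optional[str]


def compile_scenes(scenes):
    """Flatten scenes into one ordered list of steps."""
    steps = []
    for scene in scenes:
        roles = {role.name: role for role in scene.roles}
        for turn in scene.turns:
            role = roles[turn.role]  # KeyError if a turn names an undeclared role
            steps.append(
                SketchStep(scene.name, role.name, role.agent, role.model, turn.prompt)
            )
    return steps


scene = SketchScene(
    name="coder-reviewer",
    roles=[
        SketchRole("coder", "gemini", "some-model"),
        SketchRole("reviewer", "gemini", "some-model"),
    ],
    turns=[
        SketchTurn("coder"),
        SketchTurn("reviewer", "Review the current workspace."),
        SketchTurn("coder", "Fix the issues."),
    ],
)
steps = compile_scenes([scene])
```

After this pass nothing Scene-shaped is left: the executor only sees an ordered list in which each entry names its role, agent, model, and prompt.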
Rollout
The execution engine — decomposed into independently-callable phases.
from benchflow import Rollout
rollout = await Rollout.create(config)
# Full lifecycle (most common)
result = await rollout.run()
# Manual composition (for custom flows)
await rollout.setup()
await rollout.start()
await rollout.install_agent()
await rollout.connect()
await rollout.execute(prompts=["custom prompt"])
await rollout.disconnect()
await rollout.verify()
await rollout.cleanup()
RuntimeConfig
Runtime-level configuration for the Agent + Environment execution path.
from benchflow.runtime import Agent, Environment, Runtime, RuntimeConfig
config = RuntimeConfig(sandbox_setup_timeout=300)
agent = Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = Environment.from_task("tasks/X", sandbox="daytona")
runtime = Runtime(env, agent, config=config)
result = await runtime.execute()
bf.run()
Convenience function — multiple calling conventions:
import benchflow as bf
# 1. RolloutConfig (full control)
result = await bf.run(config)
# 2. Agent + Environment (0.3 style)
agent = bf.Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = bf.Environment.from_task("tasks/X", sandbox="daytona")
runtime_config = bf.RuntimeConfig(sandbox_setup_timeout=300)
result = await bf.run(agent, env, runtime_config)
# 3. String shortcut (simplest)
result = await bf.run(
"gemini",
task_path="tasks/X",
model="gemini-3.1-flash-lite-preview",
config=bf.RuntimeConfig(sandbox_setup_timeout=300),
)
Rollout Lifecycle
Rollout.run()
│
├─ setup() — resolve config, create env object
├─ start() — spin up sandbox, upload task files, start services
├─ install_agent() — install agent binary, credentials, sandbox user
│ (sandbox user setup: create non-root user, prepare
│ small config/auth dirs, chown the workspace — no
│ recursive copy of /root tool trees; agent binaries
│ must live on shared prefixes like /usr/local/bin)
├─ compile scenes → Steps
├─ for step in steps:
│ ├─ connect_as(role) — open/reuse ACP session for this role
│ └─ execute(prompt) — send prompt, collect trajectory, grow tree
├─ verify() — run verifier, collect rewards
└─ cleanup() — stop sandbox
Key: scene boundaries are gone by execution time; role changes are represented as Step metadata and handled by the rollout executor.
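The step loop in the diagram can be sketched in plain Python. This is an illustration under assumptions, not the BenchFlow executor: FakeSession and run_steps are hypothetical names, and real sessions would speak ACP rather than record strings. The sketch shows two properties described above: a session is opened once per role and reused, and teardown runs even if a step fails.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a per-role agent session; it only records prompts."""

    def __init__(self, role):
        self.role = role
        self.prompts = []

    async def send(self, prompt):
        self.prompts.append(prompt)
        return f"{self.role} handled: {prompt}"


async def run_steps(steps, task_goal):
    """steps: (role, prompt-or-None) pairs, already in execution order."""
    sessions = {}  # role name -> open session, reused across steps
    trajectory = []
    opened = 0
    try:
        for role, prompt in steps:
            if role not in sessions:
                sessions[role] = FakeSession(role)  # open on first use
                opened += 1
            # A None prompt falls back to the task's own goal.
            text = prompt if prompt is not None else task_goal
            reply = await sessions[role].send(text)
            trajectory.append((role, reply))
    finally:
        sessions.clear()  # placeholder for closing sessions and stopping the sandbox
    return trajectory, opened


trajectory, opened = asyncio.run(
    run_steps(
        [
            ("coder", None),
            ("reviewer", "Review the current workspace."),
            ("coder", "Fix the issues."),
        ],
        task_goal="Implement the feature.",
    )
)
```

Three steps run here but only two sessions are opened, because the second coder step reuses the first coder session.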
Multi-Turn vs Multi-Role
| Pattern | Roles | Turns | Communication | Example |
|---|---|---|---|---|
| Single-turn | 1 | 1 | — | Baseline benchmark |
| Multi-turn | 1 | 2+ | Same session, sequential prompts | Self-review |
| Multi-role | 2+ | 2+ | Explicit prompt sequence | Coder + Reviewer |
Multi-turn = same agent gets multiple prompts. Use when a second pass catches errors (self-review, iterative refinement). The agent keeps its context across turns.
Multi-role = different agents receive explicit turns. Use when tasks need multiple perspectives (code review, client-advisor). Any handoff text must be part of the declared prompt or agent-native communication, not a BenchFlow Scene scheduler.
Both use the same API — RolloutConfig with different Scene configurations.
Multi-Agent Patterns
Coder + Reviewer (followup-bench)
config = RolloutConfig(
task_path=task_path,
scenes=[Scene(
roles=[Role("coder", "gemini", "flash"), Role("reviewer", "gemini", "flash")],
turns=[
Turn("coder"),
Turn("reviewer", "Review /app/. Summarize any issues."),
Turn("coder", "Read feedback and fix."),
],
)],
environment="daytona",
)
Skill Generation + Solve (BYOS)
config = RolloutConfig(
task_path=task_path,
scenes=[
Scene(name="skill-gen",
roles=[Role("gen", "gemini", "flash")],
turns=[Turn("gen", "Generate a skill document to /app/generated-skill.md")]),
Scene(name="solve",
roles=[Role("solver", "gemini", "flash")],
turns=[Turn("solver")]),
],
environment="daytona",
)
User-Driven Loops
Use BaseUser or FunctionUser when one agent should run multiple rounds and
Python should decide the next prompt from verifier feedback. This is the
progressive-disclosure path: the user callback can stop early, read
RoundResult after each soft_verify(), and optionally receive the oracle
solution during setup() when oracle_access=True.
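from pathlib import Path
import benchflow as bf
from benchflow import FunctionUser, RolloutConfig, RoundResult, Scene
def user(round: int, instruction: str, rr: RoundResult | None) -> str | None:
    # Round 0: disclose only the first line of the task instruction.
    if round == 0:
        return instruction.splitlines()[0]
    # Stop early once the verifier reports full reward.
    if rr and (rr.rewards or {}).get("reward") == 1.0:
        return None
    # Otherwise feed the verifier output back together with the full spec.
    feedback = rr.verifier_output if rr else "(no verifier output)"
    return f"Tests failed:\n{feedback}\n\nUse the full spec:\n{instruction}"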
config = RolloutConfig(
task_path=Path("tasks/my-task"),
scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
user=FunctionUser(user),
max_user_rounds=3,
environment="daytona",
)
result = await bf.run(config)
Use multi-role Scenes when another LLM should act as the reviewer or simulated
user. Use BaseUser when the loop is deterministic or verifier-driven. See
progressive-disclosure.md and
docs/examples/scene-patterns.ipynb.
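The control flow of a user-driven loop can be sketched without BenchFlow. Everything here is a hypothetical stand-in: SketchRoundResult mimics only the two RoundResult fields used above, drive_rounds is an invented driver, and fake_round replaces the agent run plus soft verification. The actual scheduling inside BenchFlow may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchRoundResult:
    """Hypothetical stand-in with only the fields the callback reads."""
    rewards: dict
    verifier_output: str


def drive_rounds(
    user: Callable[[int, str, Optional[SketchRoundResult]], Optional[str]],
    instruction: str,
    run_round: Callable[[str], SketchRoundResult],
    max_rounds: int,
):
    """Ask the callback for a prompt each round; stop on None or at max_rounds."""
    last = None
    prompts = []
    for round_index in range(max_rounds):
        prompt = user(round_index, instruction, last)
        if prompt is None:  # the callback asked to stop early
            break
        prompts.append(prompt)
        last = run_round(prompt)  # agent run followed by a soft verification
    return prompts, last


def user(round_index, instruction, rr):
    if round_index == 0:
        return instruction.splitlines()[0]
    if rr and rr.rewards.get("reward") == 1.0:
        return None
    return f"Tests failed:\n{rr.verifier_output}\n\nUse the full spec:\n{instruction}"


def fake_round(prompt):
    # Pretend the agent only succeeds once it has seen the second spec line.
    passed = "edge cases" in prompt
    return SketchRoundResult(
        {"reward": 1.0 if passed else 0.0}, "ok" if passed else "1 failed"
    )


instruction = "Write a parser.\nHandle edge cases."
prompts, last = drive_rounds(user, instruction, fake_round, max_rounds=3)
```

In this run the first round sees only the first line of the instruction and fails, the second round receives the full spec and passes, and the callback returns None before a third round starts.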
YAML Rollout Configs
from benchflow._utils.yaml_loader import rollout_config_from_yaml
config = rollout_config_from_yaml("rollout.yaml")
result = await bf.run(config)
Registered Agents
| Agent | Protocol | Auth | Aliases |
|---|---|---|---|
| gemini | ACP | GEMINI_API_KEY | — |
| claude-agent-acp | ACP | ANTHROPIC_API_KEY | claude |
| codex-acp | ACP | OPENAI_API_KEY, CODEX_API_KEY, CODEX_ACCESS_TOKEN, or host login | codex |
| opencode | ACP | inferred from model/provider | — |
| openhands | ACP | LLM_API_KEY | oh |
| pi-acp | ACP | ANTHROPIC_API_KEY | pi |
| openclaw | ACP | inferred from model | — |
The Auth column shows each agent's native/default credentials. Provider-prefixed
models can use provider-specific credentials instead; for example, Azure
Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT with prefixes such
as azure-foundry-openai/gpt-5.5 or
azure-foundry-anthropic/claude-opus-4-5. BenchFlow routes these providers
through LiteLLM on both Docker and Daytona.
Any agent can be prefixed with acpx/ to run via ACPX (e.g. acpx/gemini, acpx/claude). ACPX is a headless ACP client with persistent sessions and crash recovery. The underlying agent's install, env, credentials, and skill paths are preserved.
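As an illustration of how the alias column and the acpx/ prefix combine, the sketch below resolves a user-supplied agent string to a registered name plus a flag. The table data comes from above; the function resolve_agent and its return shape are hypothetical, and BenchFlow's own name parsing is not shown in this document.

```python
# Alias -> registered agent name, taken from the table above.
ALIASES = {
    "claude": "claude-agent-acp",
    "codex": "codex-acp",
    "oh": "openhands",
    "pi": "pi-acp",
}
REGISTERED = {
    "gemini",
    "claude-agent-acp",
    "codex-acp",
    "opencode",
    "openhands",
    "pi-acp",
    "openclaw",
}
ACPX_PREFIX = "acpx/"


def resolve_agent(name):
    """Hypothetical helper: return (registered_name, via_acpx) for an agent string."""
    via_acpx = name.startswith(ACPX_PREFIX)
    base = name[len(ACPX_PREFIX):] if via_acpx else name
    canonical = ALIASES.get(base, base)
    if canonical not in REGISTERED:
        raise ValueError(f"unknown agent: {name!r}")
    return canonical, via_acpx


resolved = resolve_agent("acpx/claude")
```

Keeping the prefix as a separate flag mirrors the statement that the underlying agent's install, env, credentials, and skill paths are preserved when it runs through ACPX.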
Retry and Error Handling
Rollout.run() catches common errors:
- TimeoutError: agent exceeded timeout
- ConnectionError: SSH/ACP pipe closed (retried 3x with exponential backoff)
- ACPError: agent protocol error
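The ConnectionError handling above follows the usual retry-with-exponential-backoff pattern, which can be sketched as follows. This is a generic illustration, not BenchFlow's code: retry_on_connection_error, its parameters, and the delay values are assumptions, and "3x" is read here as up to three retries after the first attempt.

```python
import asyncio


async def retry_on_connection_error(op, *, retries=3, base_delay=0.01, factor=2.0):
    """Run op(); on ConnectionError wait and retry, growing the delay each time."""
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return await op()
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            await asyncio.sleep(delay)
            delay *= factor  # exponential backoff


calls = {"n": 0}


async def flaky():
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("pipe closed")
    return "connected"


outcome = asyncio.run(retry_on_connection_error(flaky))
```

Only ConnectionError is retried in this sketch; a timeout or protocol error would propagate immediately, matching the list above where only the connection case mentions retries.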
Evaluation-level retry with RetryConfig:
from benchflow.evaluation import Evaluation, EvaluationConfig, RetryConfig
config = EvaluationConfig(
retry=RetryConfig(
max_retries=2,
wait_multiplier=2.0,
min_wait_sec=1.0,
max_wait_sec=30.0,
),
)
Sandbox and Reward Types
Sandbox Protocol
The Sandbox protocol defines the interface any sandbox backend must implement.
Docker and Daytona are built-in; you can bring your own (Modal, Firecracker, E2B, etc.).
from pathlib import Path
from benchflow import Sandbox, ImageBuilder, ImageConfig, ImageRef
# Sandbox is a runtime-checkable Protocol
class MySandbox:
    # "ExecResult" is kept as a string annotation because its import path is not shown here
    async def exec(self, cmd: str, *, user: str = "root", timeout_sec: int = 30) -> "ExecResult": ...
    async def upload_file(self, src: Path, dst: str) -> None: ...
    async def download_file(self, src: str, dst: Path) -> None: ...
    async def start(self) -> None: ...
    async def stop(self, *, delete: bool = True) -> None: ...
    # ... plus snapshot/restore + host/expose_ports; see sandbox/protocol.py
my_sandbox = MySandbox()
# Passes at runtime once every method of the protocol is defined
assert isinstance(my_sandbox, Sandbox)
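For readers unfamiliar with runtime-checkable protocols, the following standalone sketch shows the mechanism using only the standard library. MiniSandbox is a deliberately tiny, hypothetical protocol with two methods; it is not the BenchFlow Sandbox protocol, which has more members.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MiniSandbox(Protocol):
    """Hypothetical two-method protocol used only for this illustration."""

    async def start(self) -> None: ...
    async def stop(self, *, delete: bool = True) -> None: ...


class InMemorySandbox:
    """Satisfies MiniSandbox structurally; no inheritance is needed."""

    def __init__(self):
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self, *, delete: bool = True) -> None:
        self.running = False


class MissingStop:
    """Lacks stop(), so it does not satisfy the protocol."""

    async def start(self) -> None: ...


conforms = isinstance(InMemorySandbox(), MiniSandbox)
rejected = isinstance(MissingStop(), MiniSandbox)
```

The isinstance check only confirms that the required method names exist; it does not validate signatures or return types, so a static type checker is still useful for a custom backend.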
Rubric + RewardFunc (Composable Rewards)
Declarative scoring via composable reward functions.
from benchflow import Rubric, RewardFunc, RewardEvent, VerifyResult
from benchflow import TestRewardFunc, StringMatchRewardFunc, LLMJudgeRewardFunc
# Built-in reward functions
test_reward = TestRewardFunc() # runs pytest, binary pass/fail
match_reward = StringMatchRewardFunc(expected="hello world")
# Compose into a weighted Rubric
rubric = Rubric(
reward_funcs=[test_reward, match_reward],
weights=[0.7, 0.3],
)
# Score a workspace
result: VerifyResult = await rubric.score(rollout_dir=my_rollout_dir)
print(result.reward) # weighted float [0.0, 1.0]
print(result.events) # list[RewardEvent] — per-function breakdown
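The weighted combination can be illustrated with a few lines of arithmetic. This is a sketch of one common convention (a weighted average normalized by the sum of weights); whether Rubric normalizes weights this way is an assumption, and weighted_reward is a hypothetical helper.

```python
def weighted_reward(scores, weights):
    """Weighted average of per-function scores, normalized by the weight total."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(score * weight for score, weight in zip(scores, weights)) / total


# Tests pass (1.0) but the string match fails (0.0), with weights 0.7 and 0.3.
combined = weighted_reward([1.0, 0.0], [0.7, 0.3])
```

With these inputs the combined reward is approximately 0.7: the passing test suite contributes its full weight and the failed string match contributes nothing.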
Adapters (Inspect AI + ORS)
Convert between BenchFlow types and external frameworks.
from benchflow import InspectAdapter, ORSAdapter, to_inspect_task, to_ors_reward
# BenchFlow Scene → Inspect AI task format
inspect_task = to_inspect_task(scene, rubric=rubric)
# BenchFlow VerifyResult → ORS reward format
ors_payload = to_ors_reward(verify_result)
Evaluation
Batch orchestration with concurrency and retries.
from benchflow import Evaluation, EvaluationConfig, EvaluationResult, RetryConfig
# EvaluationConfig holds the per-job settings (agent/model/environment/...)
# applied to every task discovered under tasks_dir.
config = EvaluationConfig(
model="gemini-3.1-flash-lite-preview",
environment="daytona",
concurrency=8,
retry=RetryConfig(max_retries=2),
)
evaluation = Evaluation(tasks_dir="tasks", jobs_dir="jobs/my-run", config=config)
eval_result: EvaluationResult = await evaluation.run()
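A concurrency limit like concurrency=8 is commonly implemented with a semaphore around each task. The sketch below shows that pattern with asyncio only; run_all and fake_task are hypothetical, and how Evaluation schedules tasks internally is not described in this document.

```python
import asyncio


async def run_all(task_names, run_one, concurrency):
    """Run run_one(name) for every task, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(name):
        async with semaphore:
            return await run_one(name)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(name) for name in task_names))


state = {"in_flight": 0, "peak": 0}


async def fake_task(name):
    # Track how many tasks overlap to show that the limit holds.
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)
    state["in_flight"] -= 1
    return f"{name}: done"


results = asyncio.run(
    run_all([f"task-{i}" for i in range(10)], fake_task, concurrency=3)
)
```

All ten tasks finish, but the recorded peak never exceeds the limit, which is the property that keeps a large batch from exhausting sandbox quota.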