ContextBench Agent Runner Usage Guide

June 12, 2026

contextbench.run is the unified Agent runner for ContextBench. It supports a three-step workflow: load the task list, filter a subset, then run the Agent and output trajectories.

Overview

  1. Load task list: Load instances from CSV or gold JSONL
  2. Filter subset: Filter by bench type, instance ID, or custom CSV
  3. Run Agent and output trajectories: Dispatch each instance to the appropriate agent-framework implementation based on its bench

Run Strategy

For each instance, the script:

  1. Parses the Agent to use (e.g. agentless, miniswe)
  2. Determines the instance's native bench (Verified / Pro / Poly / Multi)
  3. Invokes the agent framework adapted for that bench:
| Agent | Verified | Pro | Poly | Multi |
| --- | --- | --- | --- | --- |
| agentless | run_bench.py | run_bench.py | run_bench.py | run_bench.py |
| miniswe | swebench_context_aware | swebench_context_aware | swebench_context_aware | swebench_context_aware |
| prometheus | run_bench.py (HTTP API) | run_bench.py | run_bench.py | run_bench.py |
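
The dispatch matrix above can be pictured as a nested lookup keyed by agent and bench. The following is a minimal sketch, not the runner's actual code: `ENTRY_POINTS` and `resolve_entry_point` are hypothetical names, and only the three agents listed in the table are included.

```python
BENCHES = ("Verified", "Pro", "Poly", "Multi")

# Hypothetical lookup mirroring the table above; not the runner's real data structure.
ENTRY_POINTS = {
    "agentless": {bench: "run_bench.py" for bench in BENCHES},
    "miniswe": {bench: "swebench_context_aware" for bench in BENCHES},
    "prometheus": {bench: "run_bench.py" for bench in BENCHES},  # calls the HTTP API
}


def resolve_entry_point(agent: str, bench: str) -> str:
    """Return the entry-point name for an (agent, bench) pair, or raise KeyError."""
    try:
        return ENTRY_POINTS[agent][bench]
    except KeyError:
        raise KeyError(f"no entry point for agent={agent!r}, bench={bench!r}") from None


print(resolve_entry_point("miniswe", "Pro"))  # swebench_context_aware
```

Raising on an unknown pair makes a misspelled agent or bench fail early instead of silently skipping instances.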

Command-Line Arguments

Required

| Argument | Description |
| --- | --- |
| --agent | Agent to use: agentless, miniswe, sweagent, openhands, or prometheus |

Task Source

| Argument | Default | Description |
| --- | --- | --- |
| --task-csv | data/selected_500_instances.csv | Path to task list CSV |
| --subset-csv | - | Custom subset CSV (overrides --task-csv) |
| --gold-jsonl | - | Use gold JSONL instead of CSV (bench inferred from instance_id) |
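
How the three task-source options interact can be sketched as below. This is an illustrative helper, not the runner's implementation: `load_tasks` is a hypothetical name, and the choice to check the gold JSONL before either CSV is an assumption, since the guide only says it is used "instead of CSV".

```python
import csv
import json
from pathlib import Path
from typing import Optional


def load_tasks(task_csv: str = "data/selected_500_instances.csv",
               subset_csv: Optional[str] = None,
               gold_jsonl: Optional[str] = None) -> list[dict]:
    """Load task rows from exactly one source (hypothetical helper)."""
    if gold_jsonl:
        # Assumption: a gold JSONL takes priority over any CSV option.
        with Path(gold_jsonl).open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]
    # --subset-csv overrides --task-csv, as described in the table above.
    csv_path = subset_csv or task_csv
    with Path(csv_path).open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

Each returned row is a plain dict, so CSV rows and JSONL records can be passed to the same filtering code.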

Task Filtering

| Argument | Description |
| --- | --- |
| --bench | Filter by bench: Verified, Pro, Poly, Multi (comma-separated for multiple) |
| --instances | Specify instance_id or original_inst_id (comma-separated) |
| --limit | Process at most N instances (0 = no limit) |
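
The filters above compose naturally. The sketch below assumes each task is a dict with `bench`, `instance_id`, and optionally `original_inst_id` keys, that bench names are matched exactly, and that `--limit` is applied after the other filters. `filter_tasks` is a hypothetical name and those ordering details are assumptions, not confirmed behaviour.

```python
from typing import Optional


def filter_tasks(tasks: list[dict],
                 bench: Optional[str] = None,
                 instances: Optional[str] = None,
                 limit: int = 0) -> list[dict]:
    """Apply --bench, --instances and --limit style filters (hypothetical helper)."""
    if bench:
        wanted = {b.strip() for b in bench.split(",") if b.strip()}
        tasks = [t for t in tasks if t.get("bench") in wanted]
    if instances:
        ids = {i.strip() for i in instances.split(",") if i.strip()}
        # An instance matches on either identifier column.
        tasks = [t for t in tasks
                 if t.get("instance_id") in ids or t.get("original_inst_id") in ids]
    # 0 means "no limit"; assumed to apply after the other filters.
    return tasks[:limit] if limit > 0 else tasks


demo = [
    {"instance_id": "django__django-14434", "bench": "Verified"},
    {"instance_id": "SWE-PolyBench__example", "bench": "Poly"},
]
print(filter_tasks(demo, bench="Verified,Pro"))
```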

Output & Control

| Argument | Default | Description |
| --- | --- | --- |
| --output / -o | results/agent_runs | Trajectory output directory |
| --timeout | 1800 | Timeout per instance (seconds) |
| --dry-run | false | Only list tasks, do not run Agent |
| --debug | false | Enable debug mode with verbose logs |
| --rerun | false | Rerun instances that already have trajectories (default: skip) |
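
The interaction between `--rerun` and `--timeout` can be illustrated with a small wrapper around `subprocess.run`. This is a sketch under stated assumptions: `run_instance` and its status strings are hypothetical, and the runner may detect existing trajectories or enforce the timeout differently.

```python
import subprocess
from pathlib import Path


def run_instance(cmd: list[str], traj_path: Path,
                 timeout: int = 1800, rerun: bool = False) -> str:
    """Return 'skipped', 'ok', 'failed' or 'timeout' for one instance (hypothetical)."""
    if traj_path.exists() and not rerun:
        return "skipped"  # default: keep the existing trajectory
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        result = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"
```

Returning a status string rather than raising lets a batch loop continue past a slow or failing instance and summarize the outcomes at the end.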

SWE-agent Options

| Argument | Description |
| --- | --- |
| --sweagent-config | Path to SWE-agent config YAML (or set SWEAGENT_CONFIG env var) |

OpenHands Options

| Argument | Description |
| --- | --- |
| --openhands-model-config | OpenHands LLM config name (or set OPENHANDS_MODEL_CONFIG) |
| --openhands-agent | OpenHands Agent class name (or set OPENHANDS_AGENT) |

Prometheus Options

| Argument | Description |
| --- | --- |
| --prometheus-url | Prometheus API base URL (or set PROMETHEUS_URL; default http://localhost:9002/v1.3) |

Prometheus LLM credentials are not passed through contextbench.run. Configure them in agent-frameworks/prometheus/prometheus/.env and restart the Prometheus container after changes. See agent-frameworks/prometheus/README.md.
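
A plausible way to resolve the Prometheus base URL from the three sources is shown below. The precedence (CLI flag, then `PROMETHEUS_URL`, then the documented default) is an assumption, and `resolve_prometheus_url` is a hypothetical helper rather than part of contextbench.run.

```python
import os
from typing import Mapping, Optional

DEFAULT_PROMETHEUS_URL = "http://localhost:9002/v1.3"


def resolve_prometheus_url(cli_value: Optional[str] = None,
                           env: Mapping[str, str] = os.environ) -> str:
    """Pick the API base URL (assumed precedence: CLI > env var > default)."""
    return cli_value or env.get("PROMETHEUS_URL") or DEFAULT_PROMETHEUS_URL


print(resolve_prometheus_url(env={}))  # http://localhost:9002/v1.3
```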

Usage Examples

Basic Usage

# Run agentless on Verified
python -m contextbench.run --agent agentless --bench Verified

# Run miniswe on Pro, first 5 instances only
python -m contextbench.run --agent miniswe --bench Pro --limit 5

# Run agentless on Poly
python -m contextbench.run --agent agentless --bench Poly

Specific Instances

# Run only specified instances (instance_id or original_inst_id)
python -m contextbench.run --agent agentless \
    --instances "scikit-learn__scikit-learn-25232,django__django-14434"

# Specify via original_inst_id
python -m contextbench.run --agent miniswe \
    --instances "keras-team__keras-18553"

Custom Task List

# Use custom subset CSV
python -m contextbench.run --agent miniswe \
    --subset-csv my_subset.csv \
    --output results/my_run

# Use gold JSONL (bench inferred automatically)
python -m contextbench.run --agent agentless \
    --gold-jsonl results/gold/contextbench_verified.gold.jsonl \
    --limit 10

Debug & Preview

# List tasks only, do not run
python -m contextbench.run --agent miniswe --bench Verified --dry-run

# Combined filters
python -m contextbench.run --agent agentless \
    --bench Verified,Pro \
    --limit 3 \
    --dry-run

Prometheus

# Default local API (http://localhost:9002/v1.3)
python -m contextbench.run --agent prometheus --bench Verified --limit 1

# Custom API URL via CLI
python -m contextbench.run --agent prometheus --bench Verified --limit 1 \
    --prometheus-url http://localhost:9002/v1.3

# Or via environment variable
export PROMETHEUS_URL=http://localhost:9002/v1.3
python -m contextbench.run --agent prometheus --bench Pro --limit 1

Prerequisites

Agentless

  • Uses unified entry point agent-frameworks/agentless/run_bench.py with --instance for single-instance runs
  • Ensure data/ contains datasets for each bench (Verified, Pro, Poly, Multi); see agentless README for details
  • Configure OpenAI API and related services per Agentless requirements (script/api_key.sh)

MiniSWE-agent

  • Entry point: mini-swe-agent/multi-poly-pro-verified/mini-swe-agent/src/minisweagent/run/extra/swebench_context_aware.py
  • Install mini-swe-agent and its dependencies
  • When using Docker, ensure the environment can pull the relevant bench images

SWE-agent

  • Entry point: swe-agent/{bench}/sweagent/run/run_batch.py (via sweagent run-batch)
  • Configure --sweagent-config or SWEAGENT_CONFIG to point to a valid config YAML
  • Supports Verified, Pro, Poly, Multi

OpenHands

  • Entry point: openhands/{verified|poly-pro|multi}/evaluation/benchmarks/swe_bench/scripts/run_infer.sh
  • Model and Agent can be configured via OPENHANDS_MODEL_CONFIG, OPENHANDS_AGENT
  • Single-instance runs use EVAL_LIMIT=1; exact filtering requires a pre-configured config.toml

Prometheus

  • Entry point: agent-frameworks/prometheus/run_bench.py (calls Prometheus HTTP API)
  • Vendored service: agent-frameworks/prometheus/prometheus/
  • Start stack: cd agent-frameworks/prometheus/prometheus && cp example.env .env && docker-compose up -d
  • LLM keys (required for /issue/answer/): edit prometheus/.env (PROMETHEUS_OPENAI_FORMAT_API_KEY, PROMETHEUS_OPENAI_FORMAT_BASE_URL, model names). Restart after changes: docker compose restart prometheus
  • API URL (adapter → service): pass --prometheus-url to contextbench.run, or set PROMETHEUS_URL. LLM keys cannot be passed via the run script.
  • Optional adapter env vars: PROMETHEUS_TIMEOUT, GITHUB_TOKEN, PROMETHEUS_JWT_TOKEN (if authentication enabled)
  • Uses SWE-bench Docker images via image_name + workdir on /issue/answer/
  • Output: prometheus/{bench}/{instance_id}.log + .json (patch)

Output Structure

Trajectories are organized by agent and bench:

<output_dir>/
├── agentless/
│   ├── Verified/
│   │   └── *_traj.json
│   ├── Pro/
│   ├── Poly/
│   └── Multi/
├── miniswe/
│   ├── Verified/
│   ├── Pro/
│   ├── Poly/
│   └── Multi/
└── prometheus/
    ├── Verified/
    │   ├── {instance_id}.log
    │   └── {instance_id}.json
    ├── Pro/
    ├── Poly/
    └── Multi/
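
To work with this layout programmatically, paths can be built with `pathlib`. The helpers below are a hypothetical sketch that follows the tree above. The text before `_traj.json` in trajectory filenames is not specified in this guide, so the sketch only globs for that suffix.

```python
from pathlib import Path


def bench_dir(output_dir: str, agent: str, bench: str) -> Path:
    """Directory holding one agent's results for one bench."""
    return Path(output_dir) / agent / bench


def prometheus_outputs(output_dir: str, bench: str, instance_id: str) -> tuple[Path, Path]:
    """Return the (.log, .json) paths that Prometheus runs are documented to produce."""
    base = bench_dir(output_dir, "prometheus", bench)
    return base / f"{instance_id}.log", base / f"{instance_id}.json"


def existing_trajectories(output_dir: str, agent: str, bench: str) -> list[Path]:
    """List *_traj.json files already present; empty if the directory is missing."""
    return sorted(bench_dir(output_dir, agent, bench).glob("*_traj.json"))


print(bench_dir("results/agent_runs", "agentless", "Verified").as_posix())
```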

Bench Inference Rules

When the CSV has no bench column, bench is inferred from instance_id:

| instance_id pattern | Inferred bench |
| --- | --- |
| SWE-Bench-Pro__* or instance_* (length > 50) | Pro |
| SWE-PolyBench__* | Poly |
| Contains multi | Multi |
| SWE-Bench-Verified__* or org__repo-number | Verified |
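
The rules above can be expressed as a short function. This is a minimal sketch, not the runner's code: it assumes the rules are checked top to bottom in table order, that the "multi" check is case-insensitive, and that an ID matching no rule yields `None`. `infer_bench` and the regular expression are illustrative.

```python
import re
from typing import Optional

# org__repo-number, e.g. "django__django-14434" (illustrative pattern)
_ORG_REPO_NUMBER = re.compile(r"^.+__.+-\d+$")


def infer_bench(instance_id: str) -> Optional[str]:
    """Infer the bench from an instance_id, checking rules in table order."""
    if instance_id.startswith("SWE-Bench-Pro__") or (
            instance_id.startswith("instance_") and len(instance_id) > 50):
        return "Pro"
    if instance_id.startswith("SWE-PolyBench__"):
        return "Poly"
    if "multi" in instance_id.lower():  # assumption: case-insensitive match
        return "Multi"
    if (instance_id.startswith("SWE-Bench-Verified__")
            or _ORG_REPO_NUMBER.match(instance_id)):
        return "Verified"
    return None  # assumption: unmatched IDs are left for the caller to handle


print(infer_bench("django__django-14434"))  # Verified
```

Because the Verified pattern is the loosest, checking it last keeps prefixed Pro, Poly, and Multi IDs from being misclassified.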

Troubleshooting

  • Agent script not found: Check that the run script exists under agent-frameworks/ for the chosen agent.
  • No tasks matched filters: Verify that --bench and --instances match the task list.
  • Timeout: Increase --timeout or test with --limit 1 first.