Twin Router Bench

June 5, 2026

Twin Router Bench is a single benchmark suite for per-step LLM routing: a router chooses which pooled model_id to use on every agent step, under locked pricing and cache rules. The suite ships in one Python distribution (twinrouterbench) and one source tree (TwinRouterBench/).

It contains two tracks that share one protocol and one set of locked tables; they are not two separate benchmarks:

| Track | Role | CLI entry |
| --- | --- | --- |
| Static | Fast validation on a fixed supervision bank (tier labels + nominal cost metrics). | twinrouterbench static … |
| Dynamic | End-to-end evaluation on SWE-bench Verified with real tool use (mini-swe-agent scaffold or editor scaffold). | twinrouterbench dynamic … / twinrouterbench swe … |

Where to run commands (important)

Many examples use paths relative to the checkout parent directory (the parent of TwinRouterBench/), for example:

  • semantic-router/... — checkout of the semantic-router repo (KNN weights and models.py).
  • TwinRouterBench/data/static/... — static track: question_bank.jsonl, manifest.json.
  • TwinRouterBench/data/dynamic/... — dynamic track: locked pool, pricing, TTL, tier map, SR-KNN mapping.

Recommended: cd to the checkout parent directory, pip install -e "./TwinRouterBench[dynamic]", then run twinrouterbench … from there so those relative paths resolve.

The dynamic CLI also loads TwinRouterBench/.env via miniswerouter.cli (see Configuration). That file is resolved from the TwinRouterBench package directory, not from your shell cwd, so keeping .env next to pyproject.toml under TwinRouterBench/ is the supported layout.


Install

Static track only (lightweight dependencies):

pip install -e ./TwinRouterBench
# or from PyPI once published:
# pip install twinrouterbench

Full suite (adds Docker, SWE-bench harness, mini-swe-agent, LiteLLM, etc.):

pip install -e "./TwinRouterBench[dynamic]"

If you run twinrouterbench dynamic or twinrouterbench swe without [dynamic], the CLI exits with an explicit message to install the extra.


Configuration

TwinRouterBench/.env

  1. Copy TwinRouterBench/.env.example to TwinRouterBench/.env.
  2. Fill in real credentials (never commit .env).

Load order (dynamic / mini CLI):

  1. _load_mini_dotenv — reads TwinRouterBench/.env and sets any key that is missing or empty in the process environment (already-set variables win).
  2. _apply_gateway_aliases — if optional gateway override variables are set, they override chat-related OpenRouter/SWERouter URL and API key (see table below).

| Variable | Purpose |
| --- | --- |
| OPENROUTER_BASE_URL | OpenAI-compatible root URL (must end with /v1; clients append /chat/completions). Example: https://openrouter.ai/api/v1. |
| OPENROUTER_API_KEY | Bearer token for the gateway above. |
| OPENROUTER_API_KEY_EXP | Optional alternate key; if OPENROUTER_API_KEY is empty after dotenv, it is copied from this. |
| SWEROUTER_BASE_URL / SWEROUTER_API_KEY | Optional explicit names; miniswerouterbench run defaults fall back to OPENROUTER_* when unset. |
| COMMONSTACK_API_BASE | Optional override: if set, replaces OPENROUTER_BASE_URL and SWEROUTER_BASE_URL after bootstrap (legacy env name; see .env.example). |
| COMMONSTACK_API_KEY | Optional override: if set, replaces OPENROUTER_API_KEY, OPENROUTER_API_KEY_EXP, and SWEROUTER_API_KEY after bootstrap. |

Common pitfall: OPENROUTER_BASE_URL and OPENROUTER_API_KEY must refer to the same provider. Mixing a custom gateway base URL with another vendor’s API key (or the reverse) yields 401 or “invalid access key”.
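
The load order above can be sketched in a few lines of Python. This is a minimal illustration operating on plain dictionaries, not the actual _load_mini_dotenv / _apply_gateway_aliases code; the function names load_dotenv_defaults and apply_gateway_aliases and the gateway.example URL are hypothetical.

```python
from typing import Mapping, MutableMapping


def load_dotenv_defaults(dotenv: Mapping[str, str], env: MutableMapping[str, str]) -> None:
    """Step 1: fill only keys that are missing or empty; already-set values win."""
    for key, value in dotenv.items():
        if not env.get(key):
            env[key] = value
    # The alternate key is copied in only when the primary key is still empty.
    if not env.get("OPENROUTER_API_KEY") and env.get("OPENROUTER_API_KEY_EXP"):
        env["OPENROUTER_API_KEY"] = env["OPENROUTER_API_KEY_EXP"]


def apply_gateway_aliases(env: MutableMapping[str, str]) -> None:
    """Step 2: COMMONSTACK_* overrides replace the chat URL and keys when set."""
    base = env.get("COMMONSTACK_API_BASE")
    if base:
        env["OPENROUTER_BASE_URL"] = base
        env["SWEROUTER_BASE_URL"] = base
    key = env.get("COMMONSTACK_API_KEY")
    if key:
        for name in ("OPENROUTER_API_KEY", "OPENROUTER_API_KEY_EXP", "SWEROUTER_API_KEY"):
            env[name] = key


# Stand-ins for the process environment and the parsed .env file (placeholder values).
env = {"OPENROUTER_API_KEY": "from-shell", "COMMONSTACK_API_BASE": "https://gateway.example/v1"}
dotenv = {"OPENROUTER_API_KEY": "from-dotenv", "OPENROUTER_BASE_URL": "https://openrouter.ai/api/v1"}

load_dotenv_defaults(dotenv, env)
apply_gateway_aliases(env)
print(env["OPENROUTER_API_KEY"], env["OPENROUTER_BASE_URL"])
```

In this sketch the shell-provided key survives the dotenv step, while the base URL ends up pointing at the override gateway. That combination is exactly the mismatch the pitfall above warns about.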

Shell snippets under TwinRouterBench/scripts/examples/env.inc.sh mirror the Python alias logic for bash-driven runs.

Prerequisites (dynamic track)

  • Docker running locally; enough disk for SWE-bench images.
  • Network for image pulls and the chat gateway.
  • API keys as above. The dynamic CLIs load TwinRouterBench/.env before parsing arguments.

The static metrics subcommand needs neither Docker nor network access if you only aggregate local JSON.

semantic-router (SR KNN only)

The SemanticRouterKNNRouter loads:

  • knn_model.json (feature dim 1024 + 14 category one-hot),
  • semantic-router repo root (for ml_model_selection/models.py / KNNModel.load),
  • sentence-transformers embedder (default Qwen/Qwen3-Embedding-0.6B; you should download the weights first).

Ensure a checkout exists at semantic-router/ relative to the checkout parent directory (or pass absolute --router-arg paths).
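
To make the feature layout concrete, here is a minimal sketch of how a 1024-dimensional embedding and a 14-way category one-hot could be combined. It assumes the two parts are simply concatenated with the embedding first, which the text above does not confirm. The category names are placeholders (only other appears in this README), and build_features is a hypothetical helper, not part of semantic-router.

```python
EMBED_DIM = 1024
NUM_CATEGORIES = 14

# Placeholder category names; the real list lives in the semantic-router checkout.
CATEGORIES = [f"category_{i}" for i in range(NUM_CATEGORIES - 1)] + ["other"]


def build_features(embedding, category="other"):
    """Concatenate an embedding with a one-hot category vector (assumed layout)."""
    if len(embedding) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} embedding values, got {len(embedding)}")
    one_hot = [0.0] * NUM_CATEGORIES
    one_hot[CATEGORIES.index(category)] = 1.0
    return list(embedding) + one_hot


features = build_features([0.0] * EMBED_DIM, category="other")
print(len(features))
```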


Command-line interface

Primary entrypoint:

twinrouterbench static <subcommand> [args]   # static track
twinrouterbench dynamic <subcommand> [args]  # mini-swe-agent harness
twinrouterbench swe <subcommand> [args]     # editor-scaffold harness

Compatibility console scripts (same code after the same install):

| Script | Module |
| --- | --- |
| CommonRouterBench | main.cli (legacy console script name) |
| miniswerouterbench | miniswerouter.cli |
| swerouterbench | swerouter.cli |

For debugging without installing entrypoints:

export PYTHONPATH=/abs/path/to/TwinRouterBench
python -m miniswerouter.cli run

Static track

twinrouterbench static metrics

Aggregates Section 11–style metrics from a JSON file containing an array of CaseMetrics objects (see main.metrics.CaseMetrics and case_metrics_from_dict).

twinrouterbench static metrics --cases /path/to/cases.json

Each element must at least include case_id, task_passed, and either nominal cost fields or baseline_steps / optimal_steps / test_steps lists with completion_tokens (and optional tier / model). Example skeleton:

[
  {
    "case_id": "example-1",
    "task_passed": true,
    "baseline_cost_nominal": 10.0,
    "optimal_cost_nominal": 4.0,
    "test_cost_nominal": 5.0
  }
]
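
A cases file like the skeleton above can be generated and sanity-checked before it is passed to --cases. The following is a rough sketch: check_case is a hypothetical helper, and it assumes that all three nominal cost fields (or all three step lists) must be present, which is one reading of the requirement above rather than the behavior of case_metrics_from_dict.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("case_id", "task_passed")
NOMINAL_KEYS = ("baseline_cost_nominal", "optimal_cost_nominal", "test_cost_nominal")
STEP_KEYS = ("baseline_steps", "optimal_steps", "test_steps")


def check_case(case: dict) -> None:
    """Raise ValueError if a case lacks the fields listed in the text above."""
    missing = [key for key in REQUIRED_KEYS if key not in case]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    has_nominal = all(key in case for key in NOMINAL_KEYS)
    has_steps = all(key in case for key in STEP_KEYS)
    if not (has_nominal or has_steps):
        raise ValueError("need nominal cost fields or per-step lists")


cases = [
    {
        "case_id": "example-1",
        "task_passed": True,
        "baseline_cost_nominal": 10.0,
        "optimal_cost_nominal": 4.0,
        "test_cost_nominal": 5.0,
    }
]
for case in cases:
    check_case(case)

# Write the array to a temporary location; pass this path to --cases.
cases_path = Path(tempfile.mkdtemp()) / "cases.json"
cases_path.write_text(json.dumps(cases, indent=2), encoding="utf-8")
print(cases_path)
```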

Question bank: shipped under data/static/ (question_bank.jsonl, manifest.json). The main package exposes DATA_DIR / STATIC_DATA_DIR / QUESTION_BANK_PATH pointing at that directory (see main.dataset). setuptools package-data includes those files for wheel installs. Tier-only eval APIs live under main.eval.

Static track — metric field names (outputs)

1) twinrouterbench static metrics --cases … prints one JSON object from main.metrics.aggregate_routerbench_metrics (Section 11–style task-level savings on nominal tier rates):

| Field | Meaning |
| --- | --- |
| valid_cases | Number of cases in the input array. |
| passed_cases | Cases with task_passed == true. |
| pass_rate | passed_cases / valid_cases. |
| cost_score_cases_used | Passed cases with save_gt > 0 included in the cost ratio. |
| sum_save_gt_usd / sum_save_test_usd | Sums of per-case savings vs baseline / vs test (USD, passed + positive-save_gt subset). |
| cost_savings_score | 100 * sum_save_test_usd / sum_save_gt_usd on that subset (NaN if denominator is 0). |
| money_saved_test | Per-case baseline_nominal - test_nominal stats over passed cases (mean_per_case_over_passed, total_over_passed). |
| pricing / cost_score_rule | Fixed tier rates used and a short rule string (documentation only). |
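
The arithmetic behind these fields can be worked through on a tiny input. This sketch is not main.metrics.aggregate_routerbench_metrics. It assumes save_gt = baseline - optimal and save_test = baseline - test per case, which is an interpretation of the field names rather than something stated above, and the case values are illustrative only.

```python
import math


def aggregate(cases):
    """Compute the headline static metrics under the assumptions stated above."""
    valid = len(cases)
    passed = [c for c in cases if c["task_passed"]]
    used = []
    for c in passed:
        save_gt = c["baseline_cost_nominal"] - c["optimal_cost_nominal"]    # assumed definition
        save_test = c["baseline_cost_nominal"] - c["test_cost_nominal"]     # assumed definition
        if save_gt > 0:  # only cases with positive attainable savings enter the ratio
            used.append((save_gt, save_test))
    sum_gt = sum(gt for gt, _ in used)
    sum_test = sum(test for _, test in used)
    return {
        "valid_cases": valid,
        "passed_cases": len(passed),
        "pass_rate": len(passed) / valid if valid else math.nan,
        "cost_score_cases_used": len(used),
        "sum_save_gt_usd": sum_gt,
        "sum_save_test_usd": sum_test,
        "cost_savings_score": 100 * sum_test / sum_gt if sum_gt else math.nan,
    }


cases = [
    {"case_id": "example-1", "task_passed": True,
     "baseline_cost_nominal": 10.0, "optimal_cost_nominal": 4.0, "test_cost_nominal": 5.0},
    {"case_id": "example-2", "task_passed": False,
     "baseline_cost_nominal": 8.0, "optimal_cost_nominal": 2.0, "test_cost_nominal": 2.0},
]
metrics = aggregate(cases)
print(metrics)
```

With these two cases only example-1 passes, so the ratio is 100 * 5 / 6, about 83.3; the failed case contributes nothing to the cost score.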

2) Tier-supervision eval summary (main.eval.build_eval_summary / run_question_bank_eval): one top-level JSON with routing rows and several metric blocks. Headline tier metrics most papers care about live under scores_v2 (main.eval.compute_v2_scores):

| Field (scores_v2) | Meaning |
| --- | --- |
| case_pass_rate_percent | Row fraction with pred_tier_id >= gold_tier_id (errors count as fail). |
| case_exact_match_percent | Row fraction with pred_tier_id == gold_tier_id. |
| trajectory_pass_rate_percent | Case-weighted share of rows whose whole trajectory passes (every step pred >= gold, no error). |
| cost_savings_score_percent | Macro-averaged trajectory-level cost savings vs always-high baseline (see scores_v2.note in the JSON). |
| combined_score_percent | Mean of the four percentages above (NaN if any component is NaN). |
| total_rows / error_rows | Row counts; case_pass_count / case_exact_count are integer numerators. |
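
The three tier percentages can be illustrated with a short sketch. It is not main.eval.compute_v2_scores. The row keys trajectory_id and error are hypothetical, and "case-weighted share of rows" is read here as the fraction of rows that belong to a fully passing trajectory. The cost savings and combined scores are left out because their inputs are not described in this README.

```python
def tier_scores(rows):
    """Row-level and trajectory-level tier percentages under the stated assumptions."""
    total = len(rows)
    if total == 0:
        return {}

    def passes(row):
        # Error rows count as failures; otherwise pred >= gold passes.
        return not row.get("error") and row["pred_tier_id"] >= row["gold_tier_id"]

    def exact(row):
        return not row.get("error") and row["pred_tier_id"] == row["gold_tier_id"]

    # A trajectory passes only if every one of its rows passes.
    trajectory_ok = {}
    for row in rows:
        tid = row["trajectory_id"]
        trajectory_ok[tid] = trajectory_ok.get(tid, True) and passes(row)

    return {
        "total_rows": total,
        "error_rows": sum(1 for r in rows if r.get("error")),
        "case_pass_rate_percent": 100.0 * sum(1 for r in rows if passes(r)) / total,
        "case_exact_match_percent": 100.0 * sum(1 for r in rows if exact(r)) / total,
        "trajectory_pass_rate_percent":
            100.0 * sum(1 for r in rows if trajectory_ok[r["trajectory_id"]]) / total,
    }


# Illustrative rows only: trajectory "a" passes on every step, trajectory "b" does not.
rows = [
    {"trajectory_id": "a", "pred_tier_id": 2, "gold_tier_id": 1},
    {"trajectory_id": "a", "pred_tier_id": 1, "gold_tier_id": 1},
    {"trajectory_id": "b", "pred_tier_id": 0, "gold_tier_id": 1},
    {"trajectory_id": "b", "pred_tier_id": None, "gold_tier_id": 1, "error": True},
]
scores = tier_scores(rows)
print(scores)
```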

Other useful keys on the same summary object:

| Field | Meaning |
| --- | --- |
| tier_match_accuracy / accuracy_excluding_errors | Exact tier match rate on rows without API errors. |
| exact_match | Integer count of exact matches (same scope as tier accuracy numerator). |
| api_errors / valid_response_rate | Error count and 1 - api_errors/sampled. |
| section_11 | Older single-step pass rate + cost_savings_score (uniform tokens); distinct from scores_v2. |
| router_accounting | Trajectory-level USD accounting: D_usd, N_usd, pass_rate_percent, exact_match_rate_percent, accounting_savings_score_percent, overall_score_percent (mean of those three headline percents). |

Dynamic track (twinrouterbench dynamic …)

This forwards to miniswerouter.cli (run, score, audit-infra, audit-trace-cost, render).

run — main flags

| Flag | Meaning |
| --- | --- |
| --router-import | Required. module:factory, e.g. swerouter.routers.sr_knn_adapter:SemanticRouterKNNRouter.from_cli_args. |
| --router-arg KEY=VALUE | Repeatable; passed as kwargs to the factory. Values are strings. |
| --router-label | Required label stored in eval_summary.json and traces. |
| --output-dir | Required. Run artifacts root (created if needed). |
| --base-url / --api-key | Default from SWEROUTER_* then OPENROUTER_* env after .env load. |
| --instances id1 id2 … | Optional explicit SWE-bench instance IDs. |
| --limit N | Optional cap on how many dataset instances to consider (ordering is harness-defined). |
| --workers | Parallel workers (default 2). |
| --max-steps | Agent step limit (default 250, matches mini-swe-agent SWE profile). |
| --budget-usd | Agent cost limit in USD (default 3). |
| --run-id | Stored in summaries; use a new id when you want a logically separate run. |
| --force-rerun | Re-run instances even if results/<instance_id>.json already exists. |
| --pool, --pricing, --ttl, --tier-map | Override locked JSON paths (defaults under TwinRouterBench/data/dynamic/). |
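
The module:factory convention and the KEY=VALUE arguments can be resolved with the standard library alone. The sketch below shows one plausible way to do it; resolve_factory and parse_router_args are hypothetical names, not the harness's own functions, and the demo uses a standard-library factory so that it runs without the benchmark installed.

```python
import importlib


def resolve_factory(spec: str):
    """Turn 'package.module:Attr.sub_attr' into the callable it names."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected module:factory, got {spec!r}")
    obj = importlib.import_module(module_name)
    for part in attr_path.split("."):  # supports Class.classmethod paths
        obj = getattr(obj, part)
    return obj


def parse_router_args(pairs):
    """Collect repeated KEY=VALUE strings into kwargs; values stay strings."""
    kwargs = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        kwargs[key] = value
    return kwargs


# Standard-library stand-in for a router factory.
factory = resolve_factory("collections:OrderedDict.fromkeys")
kwargs = parse_router_args(["embedding_device=cpu", "label=sr_knn_smoke"])
print(list(factory(kwargs)), kwargs)
```

Because values arrive as strings, a factory that needs numbers or booleans has to convert them itself.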

Resume: without --force-rerun, instances that already have output_dir/results/<instance_id>.json are skipped.
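
The resume rule amounts to a filter over instance IDs. A minimal sketch follows, assuming the only check is whether the per-instance result file exists; pending_instances is a hypothetical helper, not part of the harness.

```python
import tempfile
from pathlib import Path


def pending_instances(output_dir, instance_ids, force_rerun=False):
    """Return the instance IDs that would still run under the resume rule."""
    if force_rerun:
        return list(instance_ids)
    results_dir = Path(output_dir) / "results"
    return [iid for iid in instance_ids if not (results_dir / f"{iid}.json").exists()]


# Simulate a run directory in which one instance has already finished.
run_dir = Path(tempfile.mkdtemp())
(run_dir / "results").mkdir()
(run_dir / "results" / "django__django-11066.json").write_text("{}", encoding="utf-8")

ids = ["django__django-11066", "django__django-13410"]
todo = pending_instances(run_dir, ids)
todo_forced = pending_instances(run_dir, ids, force_rerun=True)
print(todo, todo_forced)
```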

run output layout

Under --output-dir:

  • results/<instance_id>.json — per-instance outcome (resolved, step_count, errors, etc.).
  • <instance_id>.trace.jsonl — per-step router and usage trace.
  • eval_summary.json — run-level aggregate (completed, resolved_count, errors, …).
  • case_summaries/<instance_id>.summary.json — condensed per-case view.
  • agent_logs/<instance_id>/agent.log — mini-swe-agent log.

eval_summary.json fields (miniswerouter.harness.run_eval.EvalSummary; the swe harness uses the same keys via swerouter.harness.run_eval): router_label, run_id, started_at, finished_at, dataset_name, dataset_split, pool_fingerprint, pricing_schema_version, ttl_policy_name (may be empty when unused), total_instances, completed, resolved_count, resolved_rate (resolved_count / completed, same resolved predicate as SWE-bench), total_router_cost_usd (sum of realized routed API spend only; no failure penalty), per_instance_paths, errors.
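
A quick consistency check on a loaded summary can catch truncated or hand-edited files. The sketch below verifies only the stated relation resolved_rate = resolved_count / completed; the summary values are illustrative, not real results, and check_resolved_rate is a hypothetical helper.

```python
import math


def check_resolved_rate(summary: dict) -> bool:
    """Return True when resolved_rate matches resolved_count / completed."""
    completed = summary["completed"]
    if completed == 0:
        # How the harness reports an empty run is not specified here; skip the check.
        return True
    expected = summary["resolved_count"] / completed
    return math.isclose(summary["resolved_rate"], expected, rel_tol=1e-9)


# In practice this dict would come from json.load on <output-dir>/eval_summary.json.
summary = {
    "router_label": "gold_tier_oracle",
    "run_id": "mini_gt_one",
    "total_instances": 4,
    "completed": 4,
    "resolved_count": 3,
    "resolved_rate": 0.75,
}
print(check_resolved_rate(summary))
```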

Long stretches without new console output are normal (Docker pull, repository setup, multi-step LLM calls).

Example — gold-tier oracle (paths under TwinRouterBench)

From the checkout parent directory, after installing [dynamic]:

twinrouterbench dynamic run \
  --router-import swerouter.routers.gold_tier:GoldTierRouter.from_cli_args \
  --router-arg question_bank_path=TwinRouterBench/data/static/question_bank.jsonl \
  --router-arg tier_to_model_path=TwinRouterBench/data/dynamic/tier_to_model.json \
  --router-arg allowed_instance_ids=django__django-11133 \
  --router-arg label=gold_tier_oracle \
  --router-label gold_tier_oracle \
  --output-dir runs/mini_gt_one \
  --instances django__django-11133 \
  --max-steps 250 --budget-usd 3 --run-id mini_gt_one --force-rerun

Adjust question_bank_path if your bank lives elsewhere.

Example — Semantic Router SR KNN router

Requires semantic-router/ at the checkout parent directory and CPU/GPU for embeddings (embedding_device).

twinrouterbench dynamic run \
  --router-import swerouter.routers.sr_knn_adapter:SemanticRouterKNNRouter.from_cli_args \
  --router-arg knn_json_path=semantic-router/src/training/model_selection/ml_model_selection/.cache/ml-models/knn_model.json \
  --router-arg mapping_path=TwinRouterBench/data/dynamic/sr_knn_to_pool.json \
  --router-arg sr_repo_root=semantic-router \
  --router-arg embedding_model=Qwen/Qwen3-Embedding-0.6B \
  --router-arg embedding_device=cpu \
  --router-arg label=sr_knn_smoke \
  --router-arg category=other \
  --router-label sr_knn_smoke \
  --output-dir runs/sr_knn_smoke \
  --instances django__django-11066 django__django-13410 \
  --workers 2 \
  --max-steps 40 \
  --budget-usd 5.0 \
  --run-id sr_knn_smoke \
  --force-rerun

| --router-arg | Role |
| --- | --- |
| knn_json_path | Pretrained knn_model.json (e.g. under semantic-router .cache/ml-models/). |
| mapping_path | sr_knn_to_pool.json; maps KNN label strings to model_ids in the locked pool. |
| sr_repo_root | Root of semantic-router checkout (for KNNModel loader code path). |
| embedding_model | SentenceTransformers model id (training default: Qwen3 embedding). |
| embedding_device | cpu, cuda, or mps. |
| category | VSR one-hot bucket passed into the feature vector (default other for smoke). |

score, audit-*, render

twinrouterbench dynamic score --run-dir runs/your_run --router-label your_label
twinrouterbench dynamic audit-infra --run-dir runs/your_run
twinrouterbench dynamic audit-trace-cost --run-dir runs/your_run
twinrouterbench dynamic render --score runs/a/score.json runs/b/score.json --out leaderboard.md

score writes score.json (or --out) using the same scorer as the editor harness.

score.json fields (from swerouter.leaderboard.score.score_run_dir):

| Field | Meaning |
| --- | --- |
| router_label / run_dir | Router name and scored directory path. |
| pool_fingerprint / pricing_schema_version / pricing_fingerprint | Locked pool + pricing identity used when repricing traces. |
| high_baseline_model_id | Tier-high / Opus pool id (benchmark metadata). |
| failure_penalty_usd | Fixed add-on per unresolved instance (default 0.60 USD, the per-case price of a hypothetical perfect solver). |
| total_leaderboard_bill_usd | Leaderboard sort key (lower is better): Σ instance_bill_usd = router cost + penalty cost on unresolved. |
| total_router_cost_usd | Σ realized router cost only (no penalty). |
| total_penalty_cost_usd | Σ penalty_usd (unresolved instances only). |
| resolved_count / resolved_rate / instance_count | Resolution stats (denominator respects exclude_infra_failures when set). |
| avg_steps / avg_cost_per_resolved_usd | Mean steps over counted instances; total_leaderboard_bill_usd / resolved_count (or inf if none resolved). |
| per_instance | List of rows: instance_id, resolved, step_count, router_actual_cost_usd, penalty_usd, instance_bill_usd, plus model_distribution, errors, excluded_from_metrics, etc. |
| exclude_infra_failures / raw_instance_count / infra_excluded_count | Present when infra failures are excluded from aggregates. |
| reprice_from_raw_usage | Present when costs were recomputed from trace usage + current pricing tables. |
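
The bill arithmetic in the table can be followed on two illustrative instances. This sketch is not swerouter.leaderboard.score.score_run_dir. It ignores infra-failure exclusion and repricing and only applies the stated rule that every unresolved instance adds the fixed penalty; the costs are made-up numbers.

```python
import math

FAILURE_PENALTY_USD = 0.60  # default stated in the table above


def bill(rows, penalty=FAILURE_PENALTY_USD):
    """Aggregate router cost and penalties into the leaderboard bill."""
    total_router = sum(r["router_actual_cost_usd"] for r in rows)
    total_penalty = sum(0.0 if r["resolved"] else penalty for r in rows)
    resolved = sum(1 for r in rows if r["resolved"])
    total_bill = total_router + total_penalty
    return {
        "instance_count": len(rows),
        "resolved_count": resolved,
        "total_router_cost_usd": total_router,
        "total_penalty_cost_usd": total_penalty,
        "total_leaderboard_bill_usd": total_bill,
        "avg_cost_per_resolved_usd": total_bill / resolved if resolved else math.inf,
    }


rows = [
    {"instance_id": "django__django-11066", "resolved": True, "router_actual_cost_usd": 0.25},
    {"instance_id": "django__django-13410", "resolved": False, "router_actual_cost_usd": 0.50},
]
totals = bill(rows)
print(totals)
```

Here the router spends 0.75 USD in total and pays one 0.60 USD penalty, so the bill is 1.35 USD; a cheaper router that resolves fewer instances can still rank worse once penalties are added.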

Older score.json files may still use the deprecated key total_actual_bill_usd (same role as total_leaderboard_bill_usd); twinrouterbench dynamic render accepts both.
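
Scripts that read score.json directly can apply the same fallback. A small sketch follows, with leaderboard_bill as a hypothetical helper and illustrative values; it prefers the current key and falls back to the deprecated one.

```python
def leaderboard_bill(score: dict) -> float:
    """Read the leaderboard sort key, accepting the deprecated name as a fallback."""
    for key in ("total_leaderboard_bill_usd", "total_actual_bill_usd"):
        if key in score:
            return float(score[key])
    raise KeyError("score has neither total_leaderboard_bill_usd nor total_actual_bill_usd")


# Illustrative score dicts; real ones would be loaded from score.json files.
scores = [
    {"router_label": "new_style", "total_leaderboard_bill_usd": 2.0},
    {"router_label": "old_style", "total_actual_bill_usd": 1.5},
]
ranked = sorted(scores, key=leaderboard_bill)  # lower bill ranks first
print([s["router_label"] for s in ranked])
```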


Editor scaffold (twinrouterbench swe …)

Forwards to swerouter.cli (full SWE-bench harness in the editor-oriented layout). Requires the same [dynamic] extra, Docker, and credentials.


Shell helpers

Under TwinRouterBench/scripts/examples/:

  • env.inc.sh — source TwinRouterBench/.env and apply the same gateway aliases as Python.
  • example_router_a.sh / example_router_b.sh — wrapped smoke patterns.
  • resume_until_n.sh — loop python -m miniswerouter.cli run until results/ contains TARGET_N JSON files (for long campaigns).

Troubleshooting

| Symptom | Things to check |
| --- | --- |
| HTTP 401 on chat | OPENROUTER_BASE_URL and OPENROUTER_API_KEY must be from the same provider. Unset COMMONSTACK_* overrides if you intend to use only OPENROUTER_* / SWEROUTER_*. |
| missing required connection settings on run | Set SWEROUTER_API_KEY / OPENROUTER_API_KEY (after .env) or pass --api-key / --base-url. |
| Dynamic import errors for main.* | Install editable from TwinRouterBench/ or set PYTHONPATH to the TwinRouterBench root. |
| SR KNN FileNotFoundError for knn JSON | Ensure paths exist; using the checkout parent directory as cwd is simplest. |
| Very slow first SR KNN step | Embedding model download + CPU encoding; use embedding_device=cuda when available. |
| “No output” for many minutes | Docker image pull + SWE environment + agent steps; watch agent_logs/ or docker ps. |

Repository layout (TwinRouterBench/)

| Path | Purpose |
| --- | --- |
| main/ | Static-track package (main.cli, tokenizer, pricing, eval). |
| miniswerouter/ | Dynamic track on mini-swe-agent. |
| swerouter/ | Router protocol, pricing, cache simulation, harness, and leaderboard. |
| data/static/ | Static track JSONL: question_bank.jsonl, manifest.json. |
| data/dynamic/ | Dynamic track locked JSON: model_pool.json, model_pricing.json, ttl_policy.json, tier_to_model.json, sr_knn_to_pool.json, … |
| twinrouterbench/ | Meta-CLI dispatcher. |
| .env.example | Template for gateway credentials. |

Citation

If you use Twin Router Bench in research, please cite the associated paper. Bibliographic details are withheld for anonymous review and will be added after publication (no preprint URL in this release).

Implementation note (CLI forwarding)

twinrouterbench static|dynamic|swe dispatches in-process to the existing CLIs. For debugging, you may still invoke python -m miniswerouter.cli or python -m swerouter.cli with PYTHONPATH set to TwinRouterBench/.


Appendix: migration

This tree unifies the static and dynamic router benchmark tracks in one install. Use Twin Router Bench naming in new scripts; keep legacy console script names and pathnames only where required for backward compatibility.