Bashkit Eval

May 26, 2026

LLM evaluation harness for bashkit tool usage. Measures how well models use bashkit's bash tool in agentic workloads.

Usage

# Run eval (terminal output only)
ANTHROPIC_API_KEY=... cargo run -p bashkit-eval -- run \
  --dataset crates/bashkit-eval/data/eval-tasks.jsonl \
  --provider anthropic --model claude-sonnet-4-20250514

# Run and save results (Chat Completions API)
OPENAI_API_KEY=... cargo run -p bashkit-eval -- run \
  --dataset crates/bashkit-eval/data/eval-tasks.jsonl \
  --provider openai --model gpt-5.2 --save

# Run against OpenAI Responses API (required for codex models)
OPENAI_API_KEY=... cargo run -p bashkit-eval -- run \
  --dataset crates/bashkit-eval/data/eval-tasks.jsonl \
  --provider openresponses --model gpt-5.3-codex --save

# Custom moniker
cargo run -p bashkit-eval -- run \
  --dataset crates/bashkit-eval/data/eval-tasks.jsonl \
  --provider anthropic --model claude-sonnet-4-20250514 \
  --save --moniker my-test-run

# Via just
just eval
just eval-save

Options

| Option | Description |
| --- | --- |
| `--dataset <path>` | Path to JSONL dataset file |
| `--provider <name>` | `anthropic`, `openai`, or `openresponses` |
| `--model <name>` | Model name (e.g., `claude-sonnet-4-20250514`, `gpt-5.2`, `gpt-5.3-codex`) |
| `--max-turns <n>` | Max agent turns per task (default: 10) |
| `--save` | Save JSON + Markdown results to disk |
| `--output <dir>` | Output directory (default: `crates/bashkit-eval/results`) |
| `--moniker <id>` | Custom run identifier (default: `{provider}-{model}`) |
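
To run the same dataset against several models, the documented flags can be combined in a small wrapper. The following is a minimal sketch and not part of the repo: `eval_cmd` is a hypothetical helper that only prints the command line (a dry run), so nothing executes until its output is piped to a shell. The `--max-turns 10` value restates the documented default, the moniker mirrors the documented `{provider}-{model}` default, and API keys are assumed to be set in the environment already.

```shell
# Hypothetical helper (not part of bashkit-eval): prints one eval command
# built only from the flags documented in the Options table.
eval_cmd() {
  # $1 = provider, $2 = model
  printf 'cargo run -p bashkit-eval -- run --dataset %s --provider %s --model %s --max-turns 10 --save --moniker %s\n' \
    "crates/bashkit-eval/data/eval-tasks.jsonl" "$1" "$2" "$1-$2"
}

# Dry run over a small lineup; append `| sh` to actually execute the commands.
eval_cmd anthropic claude-sonnet-4-20250514
eval_cmd openresponses gpt-5.3-codex
```

Printing instead of executing keeps the sketch safe to try: the generated commands can be reviewed before any API credits are spent.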

Dataset

58 hand-curated tasks in JSONL format across 15 categories: file_operations, text_processing, pipelines, scripting, data_transformation, error_recovery, system_info, archive_operations, json_processing, complex_tasks, code_search, environment, database_operations, config_management, build_simulation.

Smoke test dataset (data/smoke-test.jsonl) has 3 tasks for quick verification.
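
Because JSONL stores one JSON object per line, a dataset's task count can be checked without parsing it. A minimal sketch, assuming every non-blank line is exactly one task (`count_tasks` is a hypothetical helper, and the path is the one used in the Usage section):

```shell
# Hypothetical helper: count tasks in a JSONL file as the number of
# lines that contain at least one non-whitespace character.
count_tasks() {
  grep -c '[^[:space:]]' "$1"
}

# Example: report the size of the main dataset if it exists locally.
dataset="crates/bashkit-eval/data/eval-tasks.jsonl"
if [ -f "$dataset" ]; then
  echo "$dataset: $(count_tasks "$dataset") tasks"
fi
```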

Results

2026-05-26 — Opus 4.7 + GPT-5.5 Lineup (58 tasks, latest)

Refreshed model lineup: upgraded flagships to claude-opus-4-7 (from 4.6) and gpt-5.5 (from 5.2). Haiku 4.5, Sonnet 4.6, and GPT-5.3-Codex kept as continuity anchors (5.3-codex is still the newest codex variant — no gpt-5.5-codex exists).

Opus 4.7 takes the top spot at 56/58 (98%), a +6 task improvement over Opus 4.6's 50/58. Haiku 4.5 holds steady at 54/58 (98%). GPT-5.5 jumps to 50/58 (93%) — a +9 task gain over GPT-5.2's 41/58 (77%) on the same dataset.

| Metric | Haiku 4.5 | Sonnet 4.6 | Opus 4.7 | GPT-5.5 | GPT-5.3-Codex |
| --- | --- | --- | --- | --- | --- |
| Tasks passed | 54/58 | 49/58 | 56/58 | 50/58 | 54/58 |
| Score | 98% | 94% | 98% | 93% | 93% |
| Tool calls | 195 (180 ok, 15 err) | 188 (171 ok, 17 err) | 175 (158 ok, 17 err) | 118 (108 ok, 10 err) | 114 (99 ok, 15 err) |
| Tool call success | 92% | 91% | 90% | 92% | 87% |
| Tokens | 372K in / 54K out | 413K in / 68K out | 440K in / 63K out | 118K in / 32K out | 91K in / 49K out |
| Duration | 8.0 min | 19.7 min | 22.6 min | 11.2 min | 13.7 min |
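
The tool call success figures follow from the ok/err counts in the tool calls row: successful calls divided by total calls. A quick check of that arithmetic (the helper name `success_pct` is illustrative, and rounding to the nearest whole percent is an assumption that happens to reproduce the reported values):

```shell
# success_pct OK ERR -> integer percentage of successful tool calls.
success_pct() {
  awk -v ok="$1" -v err="$2" 'BEGIN { printf "%.0f\n", 100 * ok / (ok + err) }'
}

success_pct 180 15   # Haiku 4.5      -> 92
success_pct 171 17   # Sonnet 4.6     -> 91
success_pct 158 17   # Opus 4.7       -> 90
success_pct 108 10   # GPT-5.5        -> 92
success_pct 99 15    # GPT-5.3-Codex  -> 87
```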

Highlights

  1. Opus 4.7 is the new leader: 56/58 (98%), +6 tasks over Opus 4.6. It is the first model to hit 100% on scripting (7/7) and fails only the two persistently hard tasks (file_path_organizer, config_ini_merge).
  2. GPT-5.5 is a big jump: +9 tasks over GPT-5.2 (41→50), matching GPT-5.3-Codex's score (93%) via Chat Completions instead of Responses. Its tool-call success rate (92%) ties Haiku for the highest.
  3. Haiku 4.5 is still the value play: the same 98% score as Opus 4.7 (54/58 vs 56/58 tasks passed) in 8.0 min vs 22.6 min wall clock, with about 15% fewer tokens (372K in / 54K out vs 440K in / 63K out). If you don't need Opus-level reasoning headroom, Haiku is hard to beat.
  4. Sonnet 4.6 looks worse than it is: its 9 failures cluster in a few categories (system_info 50%, code_search 85%, pipelines 85%), and four of them (pipe_process_sub, text_comm_setops, search_recursive_grep, search_find_replace) are tasks every other model passes. These look like model-specific quirks rather than bashkit gaps.
  5. config_ini_merge is resolved for GPT models: previously all 5 models failed it; now GPT-5.5 and GPT-5.3-Codex pass, as does Haiku 4.5. Opus and Sonnet still struggle with section-aware awk.

Delta from 2026-02-28 (same 58-task dataset)

| Model | Prior | Current | Delta |
| --- | --- | --- | --- |
| Opus 4.6 → Opus 4.7 | 50/58 (91%) | 56/58 (98%) | +6 tasks |
| GPT-5.2 → GPT-5.5 | 41/58 (77%) | 50/58 (93%) | +9 tasks |
| Haiku 4.5 | 54/58 (97%) | 54/58 (98%) | unchanged |
| Sonnet 4.6 | 48/58 (93%) | 49/58 (94%) | +1 task |
| GPT-5.3-Codex | 51/58 (91%) | 54/58 (93%) | +3 tasks |

Per-Category Comparison

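| Category | Haiku 4.5 | Sonnet 4.6 | Opus 4.7 | GPT-5.5 | GPT-5.3-Codex |
| --- | --- | --- | --- | --- | --- |
| archive_operations | 100% | 100% | 100% | 100% | 100% |
| build_simulation | 100% | 100% | 100% | 100% | 100% |
| code_search | 100% | 85% | 100% | 100% | 100% |
| complex_tasks | 100% | 100% | 100% | 100% | 100% |
| config_management | 100% | 64% | 64% | 100% | 100% |
| data_transformation | 97% | 100% | 100% | 91% | 100% |
| database_operations | 100% | 100% | 100% | 100% | 100% |
| environment | 100% | 100% | 100% | 100% | 100% |
| error_recovery | 100% | 100% | 100% | 100% | 100% |
| file_operations | 92% | 100% | 92% | 67% | 67% |
| json_processing | 100% | 100% | 100% | 93% | 100% |
| pipelines | 100% | 85% | 100% | 90% | 100% |
| scripting | 94% | 91% | 100% | 89% | 69% |
| system_info | 100% | 50% | 100% | 67% | 50% |
| text_processing | 100% | 89% | 100% | 100% | 100% |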

Failure Analysis

| Task | Haiku 4.5 | Sonnet 4.6 | Opus 4.7 | GPT-5.5 | GPT-5.3-Codex | Root Cause |
| --- | --- | --- | --- | --- | --- | --- |
| file_path_organizer | FAIL | PASS | FAIL | FAIL | FAIL | Models burn turns on edge cases (persistent from prior runs) |
| config_ini_merge | PASS | FAIL | FAIL | PASS | PASS | Section-aware awk logic (resolved for GPT models, blocks Opus/Sonnet) |
| script_assoc_array | FAIL | FAIL | PASS | PASS | FAIL | Associative array handling |
| script_getopts_parser | FAIL | FAIL | PASS | PASS | PASS | getopts/wc interaction (Opus 4.7 now passes) |
| sysinfo_env_report | PASS | FAIL | PASS | PASS | FAIL | Env output format |
| script_array_stats | PASS | PASS | PASS | FAIL | FAIL | Array min/max/sum |
| data_csv_join | FAIL | PASS | PASS | PASS | PASS | CSV join (Haiku-only regression) |
| data_log_summarize | PASS | PASS | PASS | FAIL | PASS | Log aggregation |
| sysinfo_date_calc | PASS | PASS | PASS | FAIL | PASS | Date arithmetic |
| json_to_csv_export | PASS | PASS | PASS | FAIL | PASS | jq @csv quoting |
| json_order_totals | PASS | PASS | PASS | FAIL | PASS | JSON aggregation |
| pipe_xargs_batch | PASS | FAIL | PASS | FAIL | PASS | xargs batching |
| pipe_process_sub | PASS | FAIL | PASS | PASS | PASS | Process substitution (Sonnet only) |
| text_comm_setops | PASS | FAIL | PASS | PASS | PASS | comm set operations (Sonnet only) |
| search_recursive_grep | PASS | FAIL | PASS | PASS | PASS | Recursive grep (Sonnet only) |
| search_find_replace | PASS | FAIL | PASS | PASS | PASS | find+replace (Sonnet only) |
| file_ops_find_and_delete | PASS | PASS | PASS | FAIL | PASS | find -delete (GPT-5.5 regression) |

Model Behavior

  • Opus 4.7 is the new leader at 56/58 (98%): perfect on scripting (100%), failing only file_path_organizer and config_ini_merge, a +6 task jump over Opus 4.6.
  • Haiku 4.5 ties Opus 4.7 on score at 54/58 (98%): still the fastest run (8.0 min) and most economical, with 100% in 12 of 15 categories.
  • GPT-5.3-Codex at 54/58 (93%): strong on complex tasks, weakest on scripting (69%) and system_info (50%). Lowest token usage (91K in).
  • GPT-5.5 at 50/58 (93%): major jump from GPT-5.2 (+9 tasks), with the highest tool-call success (92%), tied with Haiku. Weakest on file_operations (67%).
  • Sonnet 4.6 at 49/58 (94%): similar behavioral pattern to the prior eval. It still trips on system_info (50%) and now also drops to 85% on code_search.

Previous Results

2026-02-28 — Post v0.1.7 Interpreter Fixes (58 tasks)

Dataset expanded from 52 to 58 tasks with 3 new categories (database_operations, config_management, build_simulation). 20+ interpreter fixes since v0.1.7 release: heredoc redirects (#370), xargs execution (#364), IFS splitting (#374), ANSI-C quoting (#371), stderr redirects (#377), subshell isolation (#376), find -exec (#386), tr/cut features (#391), and more.

All 5 models ran the full 58-task dataset.

| Metric | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 | GPT-5.3-Codex | GPT-5.2 |
| --- | --- | --- | --- | --- | --- |
| Tasks passed | 54/58 | 48/58 | 50/58 | 51/58 | 41/58 |
| Score | 97% | 93% | 91% | 91% | 77% |
| Tool calls | 238 (209 ok, 29 err) | 261 (222 ok, 39 err) | 269 (236 ok, 33 err) | 186 (154 ok, 32 err) | 156 (105 ok, 51 err) |
| Tool call success | 88% | 85% | 88% | 83% | 67% |
| Tokens | 547K in / 69K out | 561K in / 67K out | 518K in / 61K out | 239K in / 69K out | 201K in / 29K out |
| Duration | 8.6 min | 20.5 min | 20.1 min | 19.6 min | 7.0 min |

Delta from v0.1.7 Release

Comparison on the shared 37-task subset from the v0.1.7 release (2026-02-25). Interpreter fixes unblocked json_to_csv_export (jq @csv) and script_function_lib (tr character classes) across models.

| Model | v0.1.7 (37 tasks) | Current (37 tasks) | Delta | Newly Passing |
| --- | --- | --- | --- | --- |
| Haiku 4.5 | 35/37 (98%) | 37/37 (100%) | +2pp | json_to_csv_export, script_function_lib |
| Opus 4.6 | 33/37 (93%) | 34/37 (96%) | +3pp | script_function_lib, script_health_check |
| GPT-5.2 | 27/37 (86%) | 30/37 (86%) | +0pp | archive_create_extract, complex_todo_app, data_log_summarize, pipe_dedup_merge |
| Sonnet 4→4.6 | 34/37 (97%) | 33/37 (95%) | -2pp | json_to_csv_export, script_health_check |
| GPT-5.3-Codex | | 35/37 (97%) | NEW | |

Note: Sonnet upgraded from 4 to 4.6 between releases; delta reflects both interpreter and model changes. GPT-5.2 gained 3 more tasks despite unchanged percentage due to rounding.

Per-Category Comparison

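| Category | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 | GPT-5.3-Codex | GPT-5.2 |
| --- | --- | --- | --- | --- | --- |
| archive_operations | 100% | 50% | 100% | 100% | 50% |
| build_simulation | 100% | 50% | 50% | 50% | 0% |
| code_search | 100% | 100% | 100% | 100% | 100% |
| complex_tasks | 100% | 100% | 67% | 100% | 50% |
| config_management | 50% | 50% | 50% | 0% | 0% |
| data_transformation | 100% | 100% | 100% | 67% | 83% |
| database_operations | 50% | 100% | 50% | 100% | 100% |
| environment | 100% | 100% | 100% | 100% | 100% |
| error_recovery | 100% | 100% | 100% | 100% | 100% |
| file_operations | 75% | 50% | 75% | 75% | 75% |
| json_processing | 100% | 100% | 88% | 100% | 88% |
| pipelines | 100% | 80% | 100% | 100% | 80% |
| scripting | 86% | 57% | 86% | 86% | 43% |
| system_info | 100% | 50% | 100% | 100% | 100% |
| text_processing | 100% | 100% | 100% | 100% | 83% |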

Failure Analysis

| Task | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 | GPT-5.3-Codex | GPT-5.2 | Root Cause |
| --- | --- | --- | --- | --- | --- | --- |
| config_ini_merge | FAIL | FAIL | FAIL | FAIL | FAIL | INI merging requires complex awk; models struggle with section-aware logic |
| file_path_organizer | FAIL | FAIL | FAIL | FAIL | FAIL | Models burn turns on edge cases, delete own work |
| build_script_generator | PASS | FAIL | FAIL | FAIL | FAIL | Complex Makefile-like dependency graph generation |
| script_getopts_parser | FAIL | FAIL | FAIL | PASS | FAIL | getopts/wc interaction produces wrong output |
| archive_selective | PASS | FAIL | PASS | PASS | FAIL | tar extraction content mismatch |
| complex_release_notes | PASS | PASS | FAIL | PASS | FAIL | Model exhausts turn budget |
| complex_todo_app | PASS | PASS | FAIL | PASS | PASS | Opus exit code issue |
| json_to_csv_export | PASS | PASS | FAIL | PASS | FAIL | jq @csv quoting edge case |
| sysinfo_env_report | PASS | FAIL | PASS | PASS | PASS | Sonnet env output format |
| pipe_process_sub | PASS | FAIL | PASS | PASS | PASS | Sonnet process substitution approach |
| data_column_transform | PASS | PASS | PASS | FAIL | PASS | Codex awk column formatting |
| data_regex_extract | PASS | PASS | PASS | FAIL | FAIL | BASH_REMATCH extraction approach |
| config_env_template | PASS | PASS | PASS | FAIL | FAIL | Template variable substitution |

Model Behavior

  • Haiku 4.5 leads at 54/58 (97%): perfect 37/37 on the v0.1.7 task subset and strong across most categories (its low points are 50% on config_management and database_operations).
  • GPT-5.3-Codex posts an impressive 51/58 (91%): it matches Opus's score despite using fewer tool calls and excels at complex tasks and JSON.
  • Opus 4.6 is solid at 50/58 (91%): highest tool call success rate (88%), tied with Haiku; struggles with turn-budget-intensive tasks.
  • Sonnet 4.6 at 48/58 (93%): weakest on scripting (57%) and system_info (50%); triggers a bashkit awk Unicode panic on some tasks.
  • GPT-5.2 at 41/58 (77%): lowest tool call success (67%); weakest on build_simulation (0%), config_management (0%), and scripting (43%).

2026-02-27 — Expanded Dataset (52 tasks)

Dataset expanded from 37 to 52 tasks with 2 new categories (code_search, environment) and new tasks in existing categories (heredoc, getopts, associative arrays, process substitution, xargs, comm, trap). Format-sensitive expectations relaxed to use stdout_regex — focus on job done, not exact output format.
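
The difference between an exact-match expectation and a regex expectation can be illustrated in plain shell. This is only a sketch of the idea, not the harness's implementation: `check_exact` and `check_regex` are hypothetical helpers, and the sample output and pattern are made up for illustration.

```shell
# Hypothetical checkers: exact string equality vs. extended-regex match.
check_exact() { [ "$1" = "$2" ]; }
check_regex() { printf '%s\n' "$1" | grep -Eq -- "$2"; }

# The model got the right answer but chose its own label and spacing.
output="Total lines:   42"
pattern='(^|[^0-9])42([^0-9]|$)'   # the number 42, not part of a longer number

if check_exact "$output" "42"; then echo "exact: pass"; else echo "exact: fail"; fi
if check_regex "$output" "$pattern"; then echo "regex: pass"; else echo "regex: fail"; fi
```

The exact comparison rejects a correct answer because of formatting, while the regex accepts it and still rejects a wrong value such as `142`.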

Haiku 4.5 and GPT-5.2 ran on the full 52-task dataset. Sonnet 4.6 and Opus 4.6 ran partial datasets (26 and 23 tasks respectively) because Anthropic API credits were exhausted during parallel runs.

| Metric | Haiku 4.5 (52) | Sonnet 4.6 (26†) | Opus 4.6 (23†) | GPT-5.2 (52) |
| --- | --- | --- | --- | --- |
| Tasks passed | 43/52 | 23/26 | 23/23 | 32/52 |
| Score | 92% | 94% | 100% | 79% |
| Tool calls | 223 (207 ok, 16 err) | 104 (90 ok, 14 err) | 95 (86 ok, 9 err) | 127 (112 ok, 15 err) |
| Tool call success | 93% | 87% | 91% | 88% |
| Tokens | 397K in / 46K out | 211K in / 27K out | 143K in / 16K out | 123K in / 20K out |
| Duration | 7.3 min | 6.5 min | 6.1 min | 5.9 min |

† Partial run — API credits exhausted. Covers original 37-task core subset.

2026-02-27 — GPT-5.3-Codex via Responses API (37 tasks)

First eval using the OpenAI Responses API (--provider openresponses). GPT-5.3-Codex scores 30/37 (93%) — a significant jump over GPT-5.2's 27/37 (86%) via Chat Completions. Notably fixes json_to_csv_export and script_function_lib which blocked all previous models.

| Metric | Haiku 4.5 | Sonnet 4 | Opus 4.6 | GPT-5.2 | GPT-5.3-Codex |
| --- | --- | --- | --- | --- | --- |
| Tasks passed | 35/37 | 34/37 | 33/37 | 27/37 | 30/37 |
| Score | 98% | 97% | 93% | 86% | 93% |
| Tool calls | 104 (100 ok, 4 err) | 151 (144 ok, 7 err) | 169 (152 ok, 17 err) | 102 (74 ok, 28 err) | 95 (68 ok, 27 err) |
| Tool call success | 96% | 95% | 90% | 73% | 72% |
| Tokens | 171K in / 21K out | 197K in / 25K out | 276K in / 33K out | 87K in / 14K out | 97K in / 33K out |
| Duration | 3.2 min | 8.7 min | 11.2 min | 4.1 min | 10.6 min |

GPT-5.3-Codex is the first model to pass script_function_lib and json_to_csv_export (previously blocked across all models due to bashkit interpreter bugs). It works around the tr character class issue and avoids jq @csv quoting. However, it introduces new failures on tasks other models pass (e.g., data_csv_to_json, error_graceful_parse, file_ops_find_and_delete).

2026-02-25 — Post-Interpreter Fixes (37 tasks)

Major interpreter improvements since last eval: awk arithmetic accumulation, pipe-to-while-loop variable scoping, tail -n +N, sed capture groups, grep BRE/ERE mode, script execution via path, plus new features (declare -n/-l/-u, set -x, shopt, select, let, trap -p, FUNCNAME). All four models show significant gains — Haiku leads at 35/37 (98%), Sonnet close behind at 34/37 (97%).

| Metric | Haiku 4.5 | Sonnet 4 | Opus 4.6 | GPT-5.2 |
| --- | --- | --- | --- | --- |
| Tasks passed | 35/37 | 34/37 | 33/37 | 27/37 |
| Score | 98% | 97% | 93% | 86% |
| Tool calls | 104 (100 ok, 4 err) | 151 (144 ok, 7 err) | 169 (152 ok, 17 err) | 102 (74 ok, 28 err) |
| Tool call success | 96% | 95% | 90% | 73% |
| Tokens | 171K in / 21K out | 197K in / 25K out | 276K in / 33K out | 87K in / 14K out |
| Duration | 3.2 min | 8.7 min | 11.2 min | 4.1 min |

2026-02-17 — Sonnet 4 Baseline (37 tasks)

First eval run with Claude Sonnet 4. Sonnet matches Haiku's pass rate (32/37) while achieving the highest tool call success rate (89%) of any model tested.

| Metric | Sonnet 4 | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
| --- | --- | --- | --- | --- |
| Tasks passed | 32/37 | 32/37 | 29/37 | 23/37 |
| Score | 93% | 95% | 87% | 80% |
| Tool calls | 182 (162 ok, 20 err) | 150 (121 ok, 29 err) | 198 (163 ok, 35 err) | 108 (77 ok, 31 err) |
| Tool call success | 89% | 81% | 82% | 71% |
| Tokens | 248K in / 30K out | 286K in / 35K out | 315K in / 31K out | 119K in / 17K out |
| Duration | 10.2 min | 6.4 min | 25.2 min | 4.8 min |

2026-02-09 — Expanded Dataset (37 tasks)

| Metric | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
| --- | --- | --- | --- |
| Tasks passed | 32/37 | 29/37 | 23/37 |
| Score | 95% | 87% | 80% |
| Tool calls | 150 (121 ok, 29 err) | 198 (163 ok, 35 err) | 108 (77 ok, 31 err) |
| Tool call success | 81% | 82% | 71% |
| Tokens | 286K in / 35K out | 315K in / 31K out | 119K in / 17K out |
| Duration | 6.4 min | 25.2 min | 4.8 min |

2026-02-08 — Multi-Model Comparison (25 tasks)

| Metric | Haiku 4.5 | Opus 4.6 | GPT-5.2 |
| --- | --- | --- | --- |
| Tasks passed | 23/25 | 21/25 | 18/25 |
| Score | 98% | 93% | 81% |
| Tool calls | 93 (81 ok, 12 err) | 143 (125 ok, 18 err) | 103 (80 ok, 23 err) |
| Tool call success | 87% | 87% | 78% |
| Tokens | 167K in / 19K out | 242K in / 26K out | 84K in / 10K out |
| Duration | 2.9 min | 8.7 min | 3.4 min |

2026-02-07 — Baseline (pre-interpreter fixes)

| Metric | Opus 4.6 | Haiku 4.5 | GPT-5.2 |
| --- | --- | --- | --- |
| Tasks passed | 17/25 | 19/25 | 19/25 |
| Score | 87% | 92% | 87% |
| Tool calls | 141 (106 ok, 35 err) | 116 (93 ok, 23 err) | 84 (48 ok, 36 err) |
| Tool call success | 75% | 80% | 57% |
| Tokens | 319K in / 27K out | 312K in / 29K out | 148K in / 15K out |
| Duration | ~9.4 min | ~4.1 min | ~4.2 min |

Full per-task traces in saved markdown/JSON reports under results/.