TarantuBench Evaluation Harness

April 29, 2026 · View on GitHub

Run any AI agent against TarantuBench challenges and get a structured scorecard.

Quick Start

cd eval
npm install   # no external deps, just sets up the module

# Callback mode: auto-run your agent for each lab
node harness.js --dataset ../data/tarantubench-v1.jsonl \
  --command "python my_agent.py --url {URL}" \
  --timeout 300

# Server mode: boot labs one at a time, connect your agent manually
node harness.js --dataset ../data/tarantubench-v1.jsonl \
  --mode server --timeout 300

The lab dataset is published on Hugging Face as tarantulabs/TarantuBench. Validate the asset before a formal run:

node validate-dataset.js --dataset ../data/tarantubench-v1.jsonl --expected-count 100
node validate-dataset.js --hf tarantulabs/TarantuBench --expected-count 100

How It Works

The harness boots each lab as a local Express server and places a transparent HTTP proxy in front of it. Your agent talks to the proxy URL — it doesn't know the proxy exists.

Your Agent  ──HTTP──>  Proxy (:4101)  ──forward──>  Lab Server (:4100)
                         │
                    logs every request
                    handles /submit-flag
                    tracks timing
                         │
                         v
                   results/lab-id.json

The proxy:

Forwards all HTTP requests to the lab server transparently
Logs method, path, status code, and latency for every request
Exposes POST /submit-flag for flag validation
Tracks wall time, unique paths explored, and flag attempts

Agent Protocol

Your agent needs exactly two capabilities:

Make HTTP requests to the target URL (GET, POST, PUT, DELETE, PATCH)
Submit the flag via POST {URL}/submit-flag with body {"flag": "TARANTU{...}"}

The flag submission endpoint returns:

{"correct": true, "message": "Correct! Flag accepted."}
{"correct": false, "message": "Incorrect flag. Keep trying."}

That's it. The harness doesn't care what language your agent is written in, what model it uses, or how it reasons. It only sees HTTP traffic.

Modes

Callback Mode

The harness spawns your command for each lab, replacing {URL} with the proxy address:

node harness.js --dataset data.jsonl \
  --command "python my_agent.py --url {URL} --timeout 120" \
  --timeout 300 --max-concurrent 2

The harness waits for the command to exit (or timeout), then moves to the next lab.

Server Mode

The harness boots one lab at a time and prints the URL. You connect your agent manually:

node harness.js --dataset data.jsonl --mode server --timeout 300

Press Enter to skip a lab. The harness moves to the next one after a flag is submitted or the timeout is reached.

Options

Flag	Default	Description
`--dataset <path>`	required	Path to TarantuBench JSONL file
`--command <cmd>`	—	Agent command (`{URL}` is replaced with proxy URL)
`--mode server`	—	Interactive server mode
`--timeout <s>`	300	Per-lab timeout in seconds
`--labs <a,b,...>`	all	Only run specific lab IDs
`--port <n>`	4100	Base port (each lab uses 2 consecutive ports)
`--max-concurrent <n>`	1	Parallel labs in callback mode
`--output <dir>`	./results	Where to write result files

Output

Per-Lab Report (`results/<lab-id>.json`)

{
  "lab_id": "corporate-portal-auth-privilege-chain",
  "title": "VaultGuard — Privilege Escalation Chain",
  "difficulty": "Advanced",
  "category": "auth-bypass",
  "vuln_subtype": "auth-privilege-chain",
  "chain_type": null,
  "solved": true,
  "wall_time_ms": 41200,
  "http_requests": 8,
  "flag_attempts": ["TARANTU{wrong}", "TARANTU{abc123...}"],
  "time_to_solve_ms": 38500,
  "unique_paths": ["/", "/dashboard", "/api/team/1", "/api/admin/vault"],
  "http_log": [
    {"ts": 0, "method": "GET", "path": "/", "status": 200, "latency_ms": 12},
    {"ts": 1200, "method": "POST", "path": "/login", "status": 302, "latency_ms": 8}
  ]
}

Aggregate Scorecard

Generate after a run:

node scorecard.js --results ./results

Produces scorecard.json (machine-readable) and scorecard.md (copy-paste into papers):

# TarantuBench Scorecard

**Solve rate: 12/18 (66.7%)**

| Lab | Difficulty | Category | Result | Requests | Time |
|-----|-----------|----------|--------|----------|------|
| corporate-portal-auth-privilege-chain | Advanced | auth-bypass | PASS | 8 | 41.2s |
| banking-sqli-blind-boolean | Advanced | sqli | FAIL | 40 | 164.0s |
...

Requirements

Node.js 18+ (uses fetch, ES modules)
npm (for installing lab dependencies at runtime)
The JSONL dataset file