GDPVal Benchmark

March 24, 2026

Benchmark for evaluating OpenSpace on GDPVal (220 occupational tasks across 44 occupations and 9 sectors). Measures token savings from skill accumulation by running each task twice:

Phase 1 — Cold start. Skills accumulate as tasks run sequentially.
Phase 2 — Warm start. Re-run all tasks with the full Phase 1 skill library.
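
The two-phase protocol above can be sketched roughly as follows. This is an illustration only, not the benchmark's actual code: `run_two_phases` and the `run_task` callback are hypothetical names, and the sketch assumes the skill library is frozen during Phase 2.

```python
from typing import Callable

# Hypothetical runner signature (an assumption for this sketch):
# (task_id, skills available) -> (tokens used, skills learned from this task)
Runner = Callable[[str, frozenset], tuple[int, set]]


def run_two_phases(task_ids: list[str], run_task: Runner) -> dict[str, tuple[int, int]]:
    """Return {task_id: (phase1_tokens, phase2_tokens)}."""
    skills: set = set()
    phase1: dict[str, int] = {}

    # Phase 1 (cold start): each task sees only skills learned by earlier tasks.
    for task_id in task_ids:
        tokens, learned = run_task(task_id, frozenset(skills))
        phase1[task_id] = tokens
        skills |= learned

    # Phase 2 (warm start): every task sees the full Phase 1 library.
    library = frozenset(skills)
    phase2: dict[str, int] = {}
    for task_id in task_ids:
        tokens, _ = run_task(task_id, library)
        phase2[task_id] = tokens

    return {t: (phase1[t], phase2[t]) for t in task_ids}
```

Because early Phase 1 tasks run with few or no skills, the per-task gap between the two numbers is expected to be largest for tasks near the start of the sequence.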

Evaluation uses ClawWork's LLM evaluator (same rubrics, same 0.6 payment cliff).

Project Layout

parent/
├── OpenSpace/                   ← this repo
│   └── gdpval_bench/            ← this directory
└── ClawWork/                    ← required
    ├── eval/meta_prompts/       ← evaluation rubrics
    └── livebench/data/agent_data/  ← ClawWork agent results (for leaderboard)

Setup

pip install -e . && pip install datasets
git clone https://github.com/HKUDS/ClawWork.git ../ClawWork
pip install -r gdpval_bench/requirements-eval.txt
export OPENROUTER_API_KEY="sk-or-..."
export EVALUATION_API_KEY="sk-..."
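
Before launching a long run, it can help to confirm that the setup above is complete. The following is a minimal sketch, not part of the repository: `check_setup` is a hypothetical helper that only checks the two environment variables and the two ClawWork directories named in this README.

```python
import os
from pathlib import Path

# Names taken from the Setup and Project Layout sections of this README.
REQUIRED_ENV = ("OPENROUTER_API_KEY", "EVALUATION_API_KEY")
REQUIRED_DIRS = ("eval/meta_prompts", "livebench/data/agent_data")


def check_setup(clawwork_root="../ClawWork", env=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = [f"missing env var: {name}" for name in REQUIRED_ENV if not env.get(name)]
    root = Path(clawwork_root)
    problems += [
        f"missing directory: {root / rel}"
        for rel in REQUIRED_DIRS
        if not (root / rel).is_dir()
    ]
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print(problem)
```

The sketch does not verify that the keys are valid; it only checks that they are set and non-empty.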

Run

python -u -m gdpval_bench.run_benchmark \
  --task-list gdpval_bench/tasks_50.json \
  --model openrouter/qwen/qwen3.5-plus-02-15 \
  --use-clawwork-productivity \
  --clawwork-root ../ClawWork \
  --resume

Key flags: --phase1-only, --phase2-only, --no-eval, --concurrency N, --max-tasks N, --prefetch-only, --dry-run.

Included Data

skills/                        # evolved skills
.openspace/openspace.db        # skill & tool quality DB (auto-generated during evolution)

skills/ contains the full skill library produced by evolution — each subdirectory holds a SKILL.md. .openspace/openspace.db tracks skill lineage, tool quality records, and execution analyses accumulated across benchmark runs.
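
To see which skills are included, the directory layout described above (one subdirectory per skill, each holding a SKILL.md) can be walked directly. A small sketch follows; `list_skills` is a hypothetical helper, and it deliberately leaves .openspace/openspace.db alone because the database schema is not described here.

```python
from pathlib import Path


def list_skills(skills_dir="skills"):
    """Return the sorted names of subdirectories that contain a SKILL.md file."""
    root = Path(skills_dir)
    return sorted(p.parent.name for p in root.glob("*/SKILL.md") if p.is_file())


if __name__ == "__main__":
    for name in list_skills():
        print(name)
```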

Output

results/<run_name>/
├── phase1_results.jsonl      # per-task Phase 1
├── phase2_results.jsonl      # per-task Phase 2
├── comparison.jsonl          # Phase 1 vs Phase 2 deltas
├── summary.json              # aggregate statistics
├── skills_snapshot.json      # skills after Phase 1
├── config.json               # run config
├── workspace/                # agent working directories
└── recordings/               # execution trajectories
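
The per-task files can also be compared by hand. The sketch below is an assumption-laden example: the field names `task_id` and `total_tokens` are guesses at the JSONL schema (pass different keys if your records differ), and the benchmark's own comparison.jsonl and summary.json remain the authoritative numbers.

```python
import json
from pathlib import Path


def load_tokens(path, id_key="task_id", token_key="total_tokens"):
    """Map task ID to token count for one results file (field names are assumed)."""
    tokens = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            tokens[record[id_key]] = record[token_key]
    return tokens


def token_savings(run_dir, **keys):
    """Compare Phase 1 and Phase 2 token totals over tasks present in both files."""
    run_dir = Path(run_dir)
    p1 = load_tokens(run_dir / "phase1_results.jsonl", **keys)
    p2 = load_tokens(run_dir / "phase2_results.jsonl", **keys)
    shared = sorted(p1.keys() & p2.keys())
    total1 = sum(p1[t] for t in shared)
    total2 = sum(p2[t] for t in shared)
    savings = 100 * (total1 - total2) / total1 if total1 else 0.0
    return {
        "tasks": len(shared),
        "phase1_tokens": total1,
        "phase2_tokens": total2,
        "savings_pct": savings,
    }
```

Restricting the comparison to tasks that appear in both files keeps a partially completed or resumed run from skewing the totals.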

Analyze

python -m gdpval_bench.calc_subset_performance

Produces a leaderboard (OpenSpace vs ClawWork agents), a head-to-head comparison, and a token savings breakdown.

Task List

tasks_50.json — 50 task IDs (deterministic subset of GDPVal-220).
tasks_50_full.jsonl — Full task data downloaded from HuggingFace. One JSON object per line with all original fields (task_id, sector, occupation, prompt, reference_files, reference_file_urls, rubric_json, etc.). Covers all 9 sectors and 44 occupations.
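
The sector and occupation coverage of the subset can be checked with a few lines of Python. This sketch relies only on the `sector` and `occupation` fields listed above; `coverage` is a hypothetical helper name.

```python
import json
from collections import Counter


def coverage(path="gdpval_bench/tasks_50_full.jsonl"):
    """Return (tasks per sector, set of occupations) for a task JSONL file."""
    sectors = Counter()
    occupations = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            task = json.loads(line)
            sectors[task["sector"]] += 1
            occupations.add(task["occupation"])
    return sectors, occupations


if __name__ == "__main__":
    sectors, occupations = coverage()
    print(f"{len(sectors)} sectors, {len(occupations)} occupations")
    for sector, count in sectors.most_common():
        print(f"{count:3d}  {sector}")
```

The default path assumes the file sits next to tasks_50.json in gdpval_bench/; adjust it if your checkout differs.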