GDPVal Benchmark
March 24, 2026
Benchmark for evaluating OpenSpace on GDPVal (220 occupational tasks across 44 occupations and 9 sectors). Measures token savings from skill accumulation by running each task twice:
- Phase 1 — Cold start. Skills accumulate as tasks run sequentially.
- Phase 2 — Warm start. Re-run all tasks with the full Phase 1 skill library.
Evaluation uses ClawWork's LLM evaluator (same rubrics, same 0.6 payment cliff).
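One way to read "token savings" is the relative drop in token usage from the cold run to the warm run of the same task. The sketch below assumes that definition; the function name `token_savings` and the choice of Phase 1 as the baseline are illustrative and not taken from the benchmark code.

```python
def token_savings(cold_tokens: int, warm_tokens: int) -> float:
    """Fraction of Phase 1 (cold) tokens saved in Phase 2 (warm).

    Assumed definition: (cold - warm) / cold. A negative result means the
    warm run used more tokens than the cold run.
    """
    if cold_tokens <= 0:
        raise ValueError("cold_tokens must be positive")
    return (cold_tokens - warm_tokens) / cold_tokens
```

For example, a task that used 1000 tokens cold and 600 tokens warm would score 0.4 under this definition.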
Project Layout
parent/
├── OpenSpace/ ← this repo
│   └── gdpval_bench/ ← this directory
└── ClawWork/ ← required
    ├── eval/meta_prompts/ ← evaluation rubrics
    └── livebench/data/agent_data/ ← ClawWork agent results (for leaderboard)
Setup
pip install -e . && pip install datasets
git clone https://github.com/HKUDS/ClawWork.git ../ClawWork
pip install -r gdpval_bench/requirements-eval.txt
export OPENROUTER_API_KEY="sk-or-..."
export EVALUATION_API_KEY="sk-..."
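Before starting a long run, it can help to confirm that both keys exported above are visible to Python. This is a minimal sketch: the helper `missing_keys` is hypothetical and not part of `gdpval_bench`; only the two variable names come from the setup steps.

```python
import os

# The two environment variables named in the setup steps.
REQUIRED_KEYS = ("OPENROUTER_API_KEY", "EVALUATION_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Example: report anything that still needs to be exported.
for name in missing_keys():
    print(f"missing environment variable: {name}")
```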
Run
python -u -m gdpval_bench.run_benchmark \
--task-list gdpval_bench/tasks_50.json \
--model openrouter/qwen/qwen3.5-plus-02-15 \
--use-clawwork-productivity \
--clawwork-root ../ClawWork \
--resume
Key flags: --phase1-only, --phase2-only, --no-eval, --concurrency N, --max-tasks N, --prefetch-only, --dry-run.
Included Data
skills/ # evolved skills
.openspace/openspace.db # skill & tool quality DB (auto-generated during evolution)
skills/ contains the full skill library produced by evolution — each subdirectory holds a SKILL.md. .openspace/openspace.db tracks skill lineage, tool quality records, and execution analyses accumulated across benchmark runs.
Output
results/<run_name>/
├── phase1_results.jsonl # per-task Phase 1
├── phase2_results.jsonl # per-task Phase 2
├── comparison.jsonl # Phase 1 vs Phase 2 deltas
├── summary.json # aggregate statistics
├── skills_snapshot.json # skills after Phase 1
├── config.json # run config
├── workspace/ # agent working directories
└── recordings/ # execution trajectories
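The per-task files can also be compared by hand. The sketch below pairs Phase 1 and Phase 2 rows and reports the token difference per task. The field names `task_id` and `total_tokens` are assumptions about the result schema, which is not documented here, so check a line of your own `phase1_results.jsonl` and adjust them.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def token_deltas(phase1_rows, phase2_rows, key="task_id", tokens="total_tokens"):
    """Map each task to Phase 1 tokens minus Phase 2 tokens.

    `key` and `tokens` are assumed field names, not a documented schema.
    Tasks that appear in only one phase are skipped.
    """
    warm = {row[key]: row for row in phase2_rows}
    deltas = {}
    for row in phase1_rows:
        match = warm.get(row[key])
        if match is not None:
            deltas[row[key]] = row[tokens] - match[tokens]
    return deltas


# Usage, with <run_name> replaced by your own run directory:
# run = Path("results") / "<run_name>"
# deltas = token_deltas(load_jsonl(run / "phase1_results.jsonl"),
#                       load_jsonl(run / "phase2_results.jsonl"))
```

A positive delta means the warm run used fewer tokens for that task.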
Analyze
python -m gdpval_bench.calc_subset_performance
Produces a leaderboard (OpenSpace vs ClawWork agents), a head-to-head comparison, and a token savings breakdown.
Task List
- tasks_50.json: 50 task IDs (deterministic subset of GDPVal-220).
- tasks_50_full.jsonl: Full task data downloaded from HuggingFace. One JSON object per line with all original fields (task_id, sector, occupation, prompt, reference_files, reference_file_urls, rubric_json, etc.). Covers all 9 sectors and 44 occupations.
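To check the sector and occupation coverage of the subset yourself, the JSONL file can be tallied with the standard library. This sketch relies only on the `sector` and `occupation` fields listed above; the helper name `coverage` is illustrative, and the file path in the usage comment assumes the file sits next to `tasks_50.json`.

```python
import json
from collections import Counter


def coverage(lines):
    """Count tasks per sector and per occupation from JSONL lines."""
    sectors, occupations = Counter(), Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        task = json.loads(line)
        sectors[task["sector"]] += 1
        occupations[task["occupation"]] += 1
    return sectors, occupations


# Usage (path is an assumption; adjust to where the file lives):
# with open("gdpval_bench/tasks_50_full.jsonl", encoding="utf-8") as fh:
#     sectors, occupations = coverage(fh)
# print(len(sectors), "sectors,", len(occupations), "occupations")
```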