Benchmarks

May 15, 2026 · View on GitHub

This folder records benchmark-specific integration contracts that live outside agent_base so the core harness stays generic, lightweight, and fair across different evaluations.

Benchmark	Directory	Tracked contract
ResearchClawBench	`benchmarks/ResearchClawBench/`	`README.md` + `role_prompt.md` + `adapter.py`
QA / VQA-style benchmarks	`benchmarks/QA/`	`README.md` + `role_prompt.md`
SGI-DeepResearch	`benchmarks/SGI-DeepResearch/`	`README.md` + `role_prompt.md`
SGI-IdeaGeneration	`benchmarks/SGI-IdeaGeneration/`	`README.md` + `role_prompt.md`
SGI-DryExperiment	`benchmarks/SGI-DryExperiment/`	`README.md` + `role_prompt.md`
SGI-Reasoning	`benchmarks/SGI-Reasoning/`	`README.md` + `role_prompt.md`
SGI-WetExperiment	`benchmarks/SGI-WetExperiment/`	`README.md` + `role_prompt.md`

Notes

agent_base/ stays focused on the reusable harness runtime.
Benchmark-specific prompts, adapters, and integration notes should live under their own benchmark subdirectory.
Local benchmark helpers may exist for private experimentation, but they do not define the formal external integration contract.