Benchmarks

May 15, 2026

This folder records benchmark-specific integration contracts that live outside agent_base so the core harness stays generic, lightweight, and fair across different evaluations.

| Benchmark | Directory | Tracked contract |
| --- | --- | --- |
| ResearchClawBench | benchmarks/ResearchClawBench/ | README.md + role_prompt.md + adapter.py |
| QA / VQA-style benchmarks | benchmarks/QA/ | README.md + role_prompt.md |
| SGI-DeepResearch | benchmarks/SGI-DeepResearch/ | README.md + role_prompt.md |
| SGI-IdeaGeneration | benchmarks/SGI-IdeaGeneration/ | README.md + role_prompt.md |
| SGI-DryExperiment | benchmarks/SGI-DryExperiment/ | README.md + role_prompt.md |
| SGI-Reasoning | benchmarks/SGI-Reasoning/ | README.md + role_prompt.md |
| SGI-WetExperiment | benchmarks/SGI-WetExperiment/ | README.md + role_prompt.md |
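
The tracked contract listed above can be checked mechanically. The following is a minimal sketch, not part of the repository: the `TRACKED_CONTRACTS` mapping and the `missing_contract_files` helper are hypothetical names introduced here for illustration, and the file lists are copied from the table.

```python
from pathlib import Path

# Hypothetical mapping (not from the repository): benchmark directory name
# -> tracked contract files, copied from the table in this README.
TRACKED_CONTRACTS = {
    "ResearchClawBench": ["README.md", "role_prompt.md", "adapter.py"],
    "QA": ["README.md", "role_prompt.md"],
    "SGI-DeepResearch": ["README.md", "role_prompt.md"],
    "SGI-IdeaGeneration": ["README.md", "role_prompt.md"],
    "SGI-DryExperiment": ["README.md", "role_prompt.md"],
    "SGI-Reasoning": ["README.md", "role_prompt.md"],
    "SGI-WetExperiment": ["README.md", "role_prompt.md"],
}


def missing_contract_files(benchmarks_root):
    """Return {benchmark_dir: [missing files]} for incomplete directories.

    Directories whose tracked contract files are all present are omitted,
    so an empty dict means every listed contract is complete.
    """
    root = Path(benchmarks_root)
    missing = {}
    for name, files in TRACKED_CONTRACTS.items():
        # A contract file counts as present only if it is a regular file.
        absent = [f for f in files if not (root / name / f).is_file()]
        if absent:
            missing[name] = absent
    return missing
```

Calling `missing_contract_files("benchmarks")` from the repository root would report any benchmark subdirectory that lacks one of its tracked files, assuming the layout matches the table.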

Notes

  • agent_base/ stays focused on the reusable harness runtime.
  • Benchmark-specific prompts, adapters, and integration notes should live under their own benchmark subdirectory.
  • Local benchmark helpers may exist for private experimentation, but they do not define the formal external integration contract.