Inspect Eval Template
May 26, 2026
A template repository for building Inspect AI evaluations, part of the inspect_evals registry. Supports multiple evaluations in a single repository.
Important Features
This template contains:
- Skills: Claude Code skills that help you produce evaluations, improve their quality, and speed up development.
- Documentation: guides on best practices and recommended standards.
- Examples: example evaluations that show how to build evaluations in Inspect.
Getting started
- Fork this repository: click "Use this template" (or fork) on GitHub to create your own copy, then clone it:
      git clone https://github.com/<your-username>/<your-repo>.git
      cd <your-repo>
- Install dependencies and run:
      # Install dependencies
      uv sync
      # Run an evaluation
      uv run inspect eval src/<eval_name>/<task_file>.py@<task_name>
      # Run tests
      uv run pytest -k "not agentic"
      # Run linters
      uv run ruff check && uv run ruff format --check && uv run mypy src tests
Quickstart
- Fork and clone (see above), then run:
      uv sync
- Copy an example:
      cp -r src/examples/simple_qa src/<eval_name>
- Rename the module file, @task function, and __init__.py import to match your package (replace simple_qa / examples.simple_qa.simple_qa references with your own names).
- In pyproject.toml, set the distribution name and add an entry point:
      [project]
      name = "<eval_name>"
      [project.entry-points.inspect_ai]
      <eval_name> = "<eval_name>"
- Run it:
      uv run inspect eval <eval_name>/<task_name> --model openai/gpt-5-nano
  Note: the task_name is saved in src/eval_dir/task_file. If the project name differs from eval_dir, this form will work instead:
      uv run inspect eval task_name
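The copy-and-rename steps above can be sketched in Python. This is a minimal illustration, not a script shipped with the template: scaffold_eval is a hypothetical helper, and it assumes the example's module file is named simple_qa.py (as the examples.simple_qa.simple_qa reference suggests) and that a plain text substitution is enough to rename the references. Review the result by hand before committing.

```python
import re
import shutil
from pathlib import Path


def scaffold_eval(repo_root: Path, eval_name: str) -> Path:
    """Copy src/examples/simple_qa to src/<eval_name> and rename references.

    Hypothetical helper: the template does not ship this function.
    """
    source = repo_root / "src" / "examples" / "simple_qa"
    target = repo_root / "src" / eval_name
    shutil.copytree(source, target)

    # Assumption: the example's module file is called simple_qa.py.
    module = target / "simple_qa.py"
    if module.exists():
        module.rename(target / f"{eval_name}.py")

    # Rewrite "examples.simple_qa" and bare "simple_qa" in a single pass so
    # already-substituted text is never scanned again.
    pattern = re.compile(r"(?:examples\.)?simple_qa")
    for path in target.rglob("*.py"):
        text = path.read_text(encoding="utf-8")
        path.write_text(pattern.sub(lambda _: eval_name, text), encoding="utf-8")
    return target
```

Calling scaffold_eval(Path("."), "my_eval") from the repository root would create src/my_eval/; the pyproject.toml changes still have to be made separately.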
See src/examples/ for complete working examples, including a real
benchmark adaptation (GPQA).
Structure
Each evaluation lives in its own directory under src/ and is registered via
entry points in pyproject.toml. Tests go in tests/<eval_name>/.
src/
  <eval_name>/           # Your evaluation
    __init__.py          # Exports task function(s) for Inspect discovery
    <eval_name>.py       # Task implementation
    eval.yaml            # Evaluation metadata
  examples/              # Example evaluations (not registered)
tests/
  <eval_name>/           # Tests for your evaluation
  examples/              # Tests for examples
Adding evaluations
- Create a new directory under src/ (e.g., src/my_eval/). Copying one of src/examples/ is the easiest start. It needs an __init__.py that exports your @task functions and an eval.yaml (see src/examples/gpqa/eval.yaml for a fully populated example).
- Add an entry point in pyproject.toml:
      [project.entry-points.inspect_ai]
      my_eval = "my_eval"
- Add the module name to the [tool.setuptools.packages.find] include list.
- Add a mypy override for the new module.
Examples
The src/examples/ directory contains four working evaluations demonstrating
common patterns. These are not registered as evaluations — they exist purely
as reference implementations you can copy from. Keep them in place: they're
managed files (kept up-to-date by sync-template.yml) and don't ship as
part of your eval's wheel.
Simple Q&A (src/examples/simple_qa/)
A straightforward question-answering evaluation using match() scoring.
Demonstrates dataset conversion, prompt templates, and few-shot examples.
Similar to evaluations like GPQA.
LLM-as-Judge (src/examples/llm_judge/)
An evaluation that uses a language model to grade open-ended responses via
model_graded_qa(). Demonstrates custom grading templates and judge model
configuration. Similar to evaluations like Healthbench.
GPQA (src/examples/gpqa/)
A real-world benchmark adaptation demonstrating external dataset loading,
multiple-choice scoring, and domain filtering. Adapted from the
inspect_evals implementation.
Includes a fully-populated eval.yaml showing all available metadata fields.
Agentic (src/examples/agentic/)
An agent-based evaluation where the model uses bash() and python() tools
in a Docker sandbox via basic_agent(). Demonstrates sandbox configuration
and tool-use scoring. Similar to evaluations like GAIA.
For more examples of production evaluations, see the inspect_evals registry, which contains reference implementations of benchmarks like GPQA, GAIA, HumanEval, and many others.
Skills
The .claude/skills/ directory ships Claude Code skills that activate when
you ask Claude to perform the matching task — e.g. "create a new evaluation"
or "review this PR against the template standards". They are managed files,
kept up to date by sync-template.yml. The reliable way to invoke one is
"Please run the /SKILL_NAME skill on EVAL_NAME."
Authoring
- create-eval — implement a new evaluation from an issue, paper, or benchmark spec, with checkpoints between phases.
- investigate-dataset — explore HuggingFace, CSV, or JSON datasets to understand their structure and data quality.
- ensure-test-coverage — review existing tests or create missing ones for a single evaluation against repo conventions.
Reviewing
- eval-quality-workflow — fix or review a single evaluation against the standards in EVALUATION_CHECKLIST.md.
- eval-validity-review — assess whether an evaluation accurately measures what it claims to measure and has good methodological standards.
- review-pr-workflow — review a PR against the agent-checkable standards
  in EVALUATION_CHECKLIST.md and BEST_PRACTICES.md; designed to run in CI.
Running and analysing
- eval-report-workflow: produce a README results table by selecting models, estimating costs, and running evaluations.
- read-eval-logs: view and analyse Inspect .eval log files via the Python API and CLI.
- check-trajectories-workflow: use Inspect Scout to scan agent trajectories for failures, formatting issues, reward hacking, and refusals.
Submission
- prepare-submission-workflow — finalize an evaluation for PR submission
  (dependencies, tests, lint, eval.yaml, README).
Features
- Multiple evaluation support — add as many evaluations as needed, each with its own directory, tests, and metadata
- Automated quality checks — pre-commit hooks and CI for ruff, mypy, and pytest
- Managed file sync — template updates are synced automatically via GitHub Actions with three-way merging to preserve your customizations (see MANAGED_FILES.md)
CI workflows
The template includes several GitHub Actions workflows that run automatically.
Standard checks (always active)
- Checks (checks.yml): runs ruff, mypy, POSIX code check, unlisted-eval check, package build, autolint, generated-docs check, and large-file scan on every push and PR. Each check is individually enforceable; see Checks and enforcement.
- Markdown Lint (markdown-lint.yml): lints markdown files on PRs.
- PR Template Check (pr-template-check.yml): verifies that the PR body contains the required checklist.
- Sync Template (sync-template.yml): weekly sync of managed files from the upstream template (skips inactive repos).
- Sync Upstream (sync-upstream.yml): weekly sync of managed files from inspect_evals.
Repository settings for sync workflows
The sync workflows create branches and open PRs automatically. Two things need to be set up in your repository for this to work:
1. Workflow permissions and PR creation
- Go to Settings > Actions > General > Workflow permissions
- Select Read and write permissions
- Check Allow GitHub Actions to create and approve pull requests
2. SYNC_PAT secret
Sync runs often include changes to files under .github/workflows/
(template-managed workflow files). GitHub blocks the default
GITHUB_TOKEN from creating or updating workflow files, so the sync
needs a separate token with the Workflows permission. Without it,
the sync run fails on push with the error "refusing to allow a GitHub App to create or update workflow … without 'workflows' permission".
To set it up:
- At github.com → Settings → Developer settings → Personal access tokens → Fine-grained tokens, generate a new token. Set Resource owner to your org, Repository access to this repo, and Repository permissions to: Contents = Read and write, Pull requests = Read and write, Workflows = Read and write (Metadata = Read is added automatically).
- If your org requires admin approval for fine-grained PATs, approve it on the org's PAT page.
- Add the token as a repository secret named SYNC_PAT (Settings > Secrets and variables > Actions > New repository secret).
See MANAGED_FILES.md for details on which files are synced and how the three-way merge preserves your customizations.
Claude-powered workflows
These workflows use Claude to automate code review and issue resolution. To enable them, add one of these secrets to your repository settings (Settings > Secrets and variables > Actions > Secrets):
- ANTHROPIC_API_KEY: an Anthropic API key (recommended for most users). Create one at https://console.anthropic.com/settings/keys.
- ANTHROPIC_ROLE_ARN: an AWS IAM role ARN for Bedrock access via OIDC (for organisations using AWS Bedrock).
Without either secret, the workflow is skipped.
At time of writing, each Claude review costs roughly $0.50–$2 using Claude Opus 4.6 ($5/$25 per million tokens input/output).
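To see how the per-review figure follows from the quoted prices, here is a small worked calculation. The $5 and $25 per million token rates come from the text above; the token counts are hypothetical and chosen only to illustrate the arithmetic, so actual reviews will vary with PR size.

```python
INPUT_USD_PER_MILLION = 5.0    # quoted input price per million tokens
OUTPUT_USD_PER_MILLION = 25.0  # quoted output price per million tokens


def review_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one review from its token counts."""
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


# Hypothetical review: 150,000 input tokens and 10,000 output tokens.
print(f"${review_cost_usd(150_000, 10_000):.2f}")  # prints $1.00
```

Under these assumed counts the review lands in the middle of the $0.50 to $2 range quoted above.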
- Claude Code Review (claude-review.yaml): reviews PRs against the evaluation standards in EVALUATION_CHECKLIST.md and BEST_PRACTICES.md.
  Three modes, controlled by repository variables (Settings > Variables > Actions):
  - Disabled (default if no secret is set): workflow skips entirely.
  - On-demand only (AUTO_REVIEW_ALL_COMMITS unset or false): review fires only when someone comments /review on a PR or dispatches the workflow manually. Recommended starting point.
  - Every commit (AUTO_REVIEW_ALL_COMMITS=true): review fires on every push to a non-draft PR. Highest cost; use when the team wants a review on each iteration.
  Optional CLAUDE_MODEL repository variable overrides the model (defaults to claude-opus-4-6-20250725 for API key users).
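The three modes can be read as a small decision table. The sketch below models it in Python purely for illustration; the real logic lives in the workflow file and may differ. ReviewMode, review_mode, should_review, and the trigger labels are hypothetical names, and the sketch assumes that /review comments and manual dispatches are also honoured in every-commit mode.

```python
from enum import Enum
from typing import Optional


class ReviewMode(Enum):
    DISABLED = "disabled"
    ON_DEMAND = "on-demand"
    EVERY_COMMIT = "every-commit"


def review_mode(has_secret: bool, auto_review_all_commits: Optional[str]) -> ReviewMode:
    """Map the repository configuration to one of the three documented modes."""
    if not has_secret:
        return ReviewMode.DISABLED
    if (auto_review_all_commits or "").strip().lower() == "true":
        return ReviewMode.EVERY_COMMIT
    return ReviewMode.ON_DEMAND


def should_review(mode: ReviewMode, trigger: str, is_draft: bool = False) -> bool:
    """Decide whether a trigger fires a review.

    trigger is a hypothetical label: "push", "comment" (a /review comment),
    or "dispatch" (a manual workflow run).
    """
    if mode is ReviewMode.DISABLED:
        return False
    if trigger in ("comment", "dispatch"):
        # Assumption: explicit requests are honoured in both active modes.
        return True
    return mode is ReviewMode.EVERY_COMMIT and trigger == "push" and not is_draft
```

For example, should_review(review_mode(True, None), "push") is False, because an unset AUTO_REVIEW_ALL_COMMITS leaves the workflow in on-demand mode.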
Checks and enforcement
This template is calibrated against the inspect_evals registry's quality standards. Those standards are recommended, not required in the template — meeting them is what we suggest if you want a smooth path to registry submission, but the template doesn't block your work if you don't.
Each check has an ENFORCE_<NAME> setting in
tools/enforcement.config. When
ENFORCE_<NAME>=true, the check blocks merge on failure. When =false, it
reports as advisory in the PR (visible in the run logs but doesn't prevent
merging). To change enforcement for your repo, edit that file and commit —
the change is git-tracked and reviewable.
The same file is honoured locally by make check (which runs
tools/run_checks.sh): advisory failures are reported with a ⚠ and a
"not enforced" note; only enforced failures cause make check to exit
non-zero. Environment variables override the file for one-off local runs:
ENFORCE_AUTOLINT=true bash tools/run_checks.sh
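The reporting rule described above (advisory failures get a ⚠ and a "not enforced" note, and only enforced failures produce a non-zero exit) can be sketched as follows. This is an illustration in Python, not the contents of tools/run_checks.sh; summarise and the ✓ and ✗ markers are assumptions made for the example.

```python
def summarise(results: dict[str, bool], enforced: dict[str, bool]) -> tuple[list[str], int]:
    """Turn per-check pass/fail results into report lines and an exit code.

    results maps a check name to True (passed) or False (failed);
    enforced maps the same names to whether a failure should block.
    """
    lines = []
    exit_code = 0
    for name, passed in results.items():
        if passed:
            lines.append(f"✓ {name}")
        elif enforced.get(name, False):
            lines.append(f"✗ {name}")
            exit_code = 1  # only enforced failures make the run fail
        else:
            lines.append(f"⚠ {name} (not enforced)")
    return lines, exit_code
```

With a failing autolint check that is advisory, the run reports the warning and still exits with 0; flipping that check to enforced turns the same failure into exit code 1.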
All toggles live in tools/enforcement.config — edit and commit to change behaviour for your fork. The defaults committed there are:
Default-enforced (blocking unless ENFORCE_<NAME>=false):
- ENFORCE_RUFF: Ruff format + lint
- ENFORCE_MYPY: Mypy static types
- ENFORCE_POSIX_CHECK: POSIX-only Python idioms (tools/check_posix_code.py)
- ENFORCE_UV_LOCK: uv.lock in sync with pyproject.toml
- ENFORCE_UNLISTED_EVALS: eval directories must be registered in pyproject.toml entry-points and have an eval.yaml
- ENFORCE_PACKAGE: package builds and inspects cleanly
Default-advisory (reported but non-blocking unless ENFORCE_<NAME>=true):
- ENFORCE_AUTOLINT: inspect_evals structural standards (eval.yaml schema, README sections, test patterns, etc.) via tools/run_autolint.py
- ENFORCE_GENERATED_DOCS: auto-generated README sections committed
- ENFORCE_MARKDOWN_LINT: markdown style (.markdownlint.yaml)
- ENFORCE_LARGE_FILES: no files >10MB
- ENFORCE_PR_TEMPLATE: PR body contains the template checklist
Quick recipes
- Strict mode (registry-ready): set every ENFORCE_* to true. Your PRs now fail on the same standards inspect_evals applies.
- Loose mode (default): leave everything unset. Correctness checks block merging (Ruff/Mypy/POSIX/lock/registration/build); style and structure checks are advisory.
- Per-check: set just the toggles you want, e.g. ENFORCE_AUTOLINT=true if you want eval-structure rules enforced but don't care about markdown style.
To opt out of a default-enforced check, set its variable to false. To opt
into a default-advisory check, set its variable to true.
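The precedence between defaults, the config file, and environment variables can be modelled as below. This is a hedged sketch rather than the template's implementation: it assumes tools/enforcement.config holds simple KEY=value lines with # comments, and parse_config and resolve_enforcement are hypothetical names. The default values mirror the two lists above.

```python
from typing import Mapping

# Defaults as documented: the first six are enforced, the last five advisory.
DEFAULTS = {
    "ENFORCE_RUFF": True,
    "ENFORCE_MYPY": True,
    "ENFORCE_POSIX_CHECK": True,
    "ENFORCE_UV_LOCK": True,
    "ENFORCE_UNLISTED_EVALS": True,
    "ENFORCE_PACKAGE": True,
    "ENFORCE_AUTOLINT": False,
    "ENFORCE_GENERATED_DOCS": False,
    "ENFORCE_MARKDOWN_LINT": False,
    "ENFORCE_LARGE_FILES": False,
    "ENFORCE_PR_TEMPLATE": False,
}


def parse_config(text: str) -> dict[str, str]:
    """Parse KEY=value lines, ignoring blanks and # comments (assumed format)."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            values[key.strip()] = value.strip()
    return values


def resolve_enforcement(config_text: str, env: Mapping[str, str]) -> dict[str, bool]:
    """Resolve each toggle: environment first, then the file, then the default."""
    file_values = parse_config(config_text)
    resolved = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, file_values.get(name, "")).strip().lower()
        # Anything other than "true" or "false" falls back to the default.
        resolved[name] = {"true": True, "false": False}.get(raw, default)
    return resolved
```

Passing os.environ as env reproduces the one-off override pattern: a variable set on the command line wins over the committed file for that run only.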
Documentation
- CONTRIBUTING.md: development setup, testing, and submission guidelines.
- BEST_PRACTICES.md: evaluation design best practices (synced from inspect_evals).
- EVALUATION_CHECKLIST.md: quality checklist used when reviewing evaluations (synced from inspect_evals).
- AUTOMATED_CHECKS.md: what tools/run_autolint.py checks, and how to suppress individual rules.
- TASK_VERSIONING.md: when to bump an eval's task version.
- AGENTS.md: repo-wide tips for coding agents and pointers to the skills above.
- MANAGED_FILES.md: which files in this repo are template-managed vs. user-owned, and how the sync preserves customizations.
- CLAUDE.md: project-level instructions Claude Code reads on every session in this repo.