Evaluation Checklist
April 27, 2026 · View on GitHub
Note for template users: this document is synced from inspect_evals where these standards are required for registry submission. In this template they are recommended, not required — see Checks and enforcement for how to opt in or out per check.
This checklist covers all the requirements you should need in order to produce a high-quality evaluation. It's designed for full evaluations, and it's too heavyweight for most PRs.
Usage of agents to help with coding and to complete agentic workflows is recommended but not mandatory:
Master Checklist
This checklist is recommended for high-quality evaluations. The following artefacts are produced by the agent workflows below and are useful evidence to attach to a PR:
agent_artefacts/<eval_name>/review/SUMMARY.md— overall quality summary from the evaluation reviewagent_artefacts/<eval_name>_<version>/validity/VALIDITY_REPORT.md— evaluation validity report assessing name accuracy, dataset validity, and scoring validity (the/eval-validity-reviewskill names this directory<eval_name>_<version>, e.g.,gpqa_1_1_2)agent_artefacts/trajectory_analysis/<eval_name>/<eval_name>_<model_name>_ANALYSIS.md— trajectory analysis results, one per model. If more than 100 samples are included in the evaluation report, it is permitted to cap trajectory analysis at a random sample of 100.
We recommend the use of agent workflows. Our agent workflows are optimised for use with Claude Code (setup instructions). If you do not wish to use agent workflows, you can create similar artefacts as above yourself. This is quite a lot of work and we don't suggest going it alone, though spot-checking and modifying the agent's work is both accepted and encouraged.
Human Ordering
- Ensure the evaluation runs with
uv run inspect eval <eval_name>/<task_name> --limit 1on the model of your choice. (You can use more samples, but we recommend 5 or less at this early stage.) - Analyse the log with the [Trajectory Analysis Workflow](skill:
/check-trajectories-workflow). Fix it if it fails for spurious reasons. Verify that the evaluation didn’t crash and that the agent under evaluation is able to submit a valid result, even if that submission receives a 0 score. - Run the Master Checklist Workflow which goes over several agent workflows in turn.
- Go over the Manually Examined Checks section, and verify your evaluation meets each one.
- Check your implementation and results against the original paper or implementation, if one exists.
- Manually review your code, making sure it’s high-quality. All code must be reviewed, including LLM-generated code as per our LLM usage guidelines. You are welcome to open a draft PR at this stage and review it with this interface. This step must be followed before marking it as "Ready for review".
- Draft your PR, and name it something like
<eval_name> implementation(putting the name of the evaluation first makes it easier to find your PR in a browser tab!). Push it to Github. - Address any issues found by the pipeline or the automatic Claude Code review. We don't mandate that all automatically flagged issues be resolved. Addressing an issue means to either fix it or to comment on why it doesn't require a fix.
At this point a human maintainer will go over the code within a few days, and possibly suggest fixes. After the maintainer is satisfied with the code's quality and correctness, we'll merge your new evaluation into Inspect Evals!
Agent Ordering
You should assume you have a user in the loop unless you are specifically told otherwise. A well designed autonomous workflow will make it clear to you that no human is available.
- Ensure the evaluation runs with
uv run inspect eval <eval_name>/<task_name> --limit 1on the model of your choice. (You can use more samples, but we recommend 5 or less at this early stage.) As an agent, you are permitted to run this command yourself since it is on a small sample size. You should avoid using --model in this command, which will automatically use the user's preferred model. Let the user know if the command fails and indicates no such model exists before continuing. If you are in a fully autonomous workflow, try each model from the Frontier Models list instead. (skill:/eval-report) If none of these work due to the lack of an API key, raise an error. - Analyse the log with the Trajectory Analysis Workflow (skill:
/check-trajectories-workflow). Fix it if it fails for spurious reasons. As an agent, you are permitted to run this command yourself since it is on a small sample size, even though the workflow usually tells you to get the human to run it. Verify that the evaluation didn't crash and that the agent under evaluation is able to submit a valid result, even if that submission receives a 0 score. - Go over the Manually Examined Checks section, and verify your evaluation meets each one.
- Check your implementation and results against the original paper or implementation, if one exists.
- Manually review the code, making sure it’s high-quality. Use relevant skills such as ensure-test-coverage.
- Run the Master Checklist Workflow which goes over several agent workflows in turn. The evaluation report usually requires human approval, which is why it is as late in the process as possible. If you are in a fully autonomous workflow, run the command yourself. If the user didn't give you a budget, keep it to under $5 USD per model.
- Draft your PR, and name it something like
<eval_name> implementation(putting the name of the evaluation first makes it easier to find your PR in a browser tab!). Push it to Github, and you're ready for review!
Evaluation Validity
These checks assess whether the evaluation measures what it claims to measure. We recommend using the Evaluation Validity Review workflow (/eval-validity-review) to assist with these checks.
Name Validity
- The evaluation name accurately represents the capability being measured — a user reading only the name would not be actively misled about its contents
- The name is not overly broad (claiming to measure a general capability when only a narrow subset is tested)
- The evaluation measures behaviour rather than statements about behaviour, or the name makes the distinction clear. For example, asking a model "would you seek power?" measures what the model says about power seeking, not whether it actually seeks power — the model needs affordances (tools, environment) to exhibit the behaviour for it to be a valid behavioural measurement.
Dataset Validity
- It is reasonably possible for a model to succeed at each sample given the available tools, sandbox, and information
- It is reasonably possible for a model to fail at each sample (the model has the affordances needed to take the wrong action, not just the right one)
- For agentic evaluations: the submission mechanism is clear in the prompt and achievable with the provided tools
- For sandbox evaluations: required services and resources are available in the sandbox environment
Scoring Validity
- The scorer measures actual task completion rather than a proxy for it
- Substring matching is only used when the matched text is itself the ground truth (e.g., CTF flags, computed values), not for assessing natural language properties like compliance or refusal
- A model cannot achieve a high score without genuinely completing the task
- For open-ended natural language answers, exact string match is a weak proxy — synonyms, reordering, abbreviations, and more complete correct answers all score 0. If exact match is used, either filter out answers that are practically unreproducible verbatim (e.g., long multi-word phrases) or add an opt-in model-graded scorer as an alternative. Document this limitation explicitly.
Manually Examined Checks
Best Practices
- Leverage Inspect components whenever possible
- Complex logic is commented
- Document and validate environment constraints
Evaluation Report
Evaluation Report Guidelines
We recommend using the Evaluation Report workflow (/eval-report-workflow) to assist in this step. To verify the error rate of logs, use the Trajectory Analysis workflow (/check-trajectories-workflow) on the log files that emerge from the evaluation report.
- Logs have a 10% or lower rate of invalid samples
- All relevantly different subsets of the dataset pass here
- Results produced for at least two models, or reason why not clearly stated
Evaluation Report Notes
- Any changes that would cause deviations from the original evaluation are noted.
- Any limitations or edge cases of the evaluation are noted.
Agent Runnable Checks
The following items can be checked by an LLM agent with access to the codebase. These checks require reading code and comparing against conventions, but do not require running the evaluation or external context beyond the repository. If running the Review An Evaluation workflow (/eval-quality-workflow) you don't need to read these checks - they are here to serve as a reference in case of errors.
Code Quality (Agent)
- Existing naming conventions are followed
- Linting passes successfully (
uv run ruff checkwill check this for you) - Magic numbers in function defaults are extracted to named constants if they appear 3+ times or are not clear from context
Unit Tests (Agent)
- All custom solvers, scorers, datasets covered
- Custom tools are covered
- Custom utils/functions are covered
- Edge cases, error conditions, and invalid inputs are checked
End-to-End Tests (Agent)
- Each meaningfully different task/variant covered by E2E tests
- Tests are marked correctly with @mark items
Apply Pytest Marks (Agent)
- If a test triggers the download of a dataset, mark it with
@pytest.mark.dataset_download, if it uses Huggingface also mark it with@pytest.mark.huggingface. Note that easily missed examples include E2E tests that instantiate a dataset, or solvers that pull a model from huggingface. - If a test uses a Docker sandbox or otherwise triggers a docker build or pull, mark it with
@pytest.mark.docker.
Best Practices - Task Design (Agent)
- Use model roles for multi-model evaluations, including model-graded ones
- Prompt templates defined as module-level constants, not inline
- Only call get_model() inside of solvers/scorers
- Separate prompt templates from formatting
Best Practices - Control Flow (Agent)
- Respect "no-limit" semantics
- Provide informative errors for invalid task parameters
- Provide defaults and allow overrides for datasets, solvers, scorers, metrics, and grader models
Best Practices - Datasets and Variants (Agent)
- Use stable, canonical IDs for samples
- Ensure deterministic behavior where possible
- Datasets are pinned to specific versions: all HuggingFace loading functions (
hf_dataset,load_dataset,snapshot_download,hf_hub_download,sentence_transformer,transformers_pipeline) enforcerevision=at runtime, and GitHub raw URLs use commit SHAs instead of branch names - Differentiate tasks from dataset splits via parameters
-
eval.yaml dataset_samplesreflects the actual number of samples evaluated after all filtering, not the raw dataset size
Best Practices - Scoring and Metrics (Agent)
- Align scoring with the outcome
Best Practices - Documentation, Environments, and Tooling (Agent)
- Keep docs and defaults in sync
- Sandbox paths are fully resolved using
Path(__file__).parent— never baresandbox="docker"with a custom compose file or Dockerfile - Least-privilege tooling
- Keep dependency metadata and lockfiles in sync
Licensing and Attribution (Agent)
If any code has been copied or adapted from an external source (e.g., a reference implementation, paper repository, or another library), the following must be present:
- Each source file containing copied or adapted code has a comment at the top of the file (or above the relevant section) attributing the original source, e.g.
# Adapted from <URL> (Copyright YEAR Author, License) - The root
NOTICEfile has an entry for every copied or adapted file, following the existing format: project name, copyright holder, source URL, and the file(s) it appears in - Where practical, copied code is isolated into its own file (e.g. a dedicated
utils.pyorreference_impl.py) rather than mixed with new code, to make attribution boundaries clear
To check: search the evaluation's source files for comments like adapted from, ported from, copied from, based on, or obvious structural similarity to a known reference implementation. Cross-reference any such files against the NOTICE file.
Evaluation Report (Agent)
- Table is present containing results of evaluation run
- A comparison to the original paper is present, or its absence is justified
- Table contains the specific inspect eval command(s) used to produce it
- Full model names are mentioned explicitly (e.g, gpt-5.1-2025-11-13, not gpt-5.1)
- Evaluation version is mentioned explicitly
- Any inspect eval parameters used are justified within the report
Infrastructure Changes (Agent)
This check applies to all PRs, not just eval submissions. If the PR modifies any of the following high-impact files, verify that the relevant documentation was updated:
High-impact files — changes here often require a docs update:
src/utils/metadata.py— changes to theeval.yamlschema (new/removed/required fields) should be reflected in theeval.yamlReference section ofCONTRIBUTING.mdpyproject.toml— new required dependencies or changed tooling should be documented.claude/skills/— new or updated agent workflows should be reflected inAGENTS.mdtools/orMakefile— changed contributor tooling should be documented.github/workflows/— changes to CI that affect contributors (new required checks, changed test environments) should be documented
To check:
- Run
git diff --name-only $(git merge-base HEAD origin/main)to see what files changed in this PR. - If any high-impact files are in the diff, check whether
CONTRIBUTING.md,EVALUATION_CHECKLIST.md, orAGENTS.mdwere also updated. - If not, flag it.
- If the PR changes how future evaluations must be written or submitted, the relevant documentation (
CONTRIBUTING.md,EVALUATION_CHECKLIST.md,AGENTS.md) has been updated accordingly
This checklist is a living document
Feedback on this checklist is very welcome - you may reach out via this Google Form, or open an issue/PR.