Validation Checklist

December 24, 2025 · View on GitHub

Use this checklist to validate your problem before submitting a PR.

Reading path: Design Philosophy → Example → Tutorial → You are here

Problem Design

Is the problem solvable with only module calls?
- Yes: rethink it to add twists
Is this a natural thing someone would want solved?
Is there a clear bad and good solution?
Is there a straightforward way to break it down into checkpoints?
Is there a deterministic way to grade this?
- This does not exclude random simulations -- you just need to frame the problem as running MANY simulations so the output can fall into a known outcome (i.e. flip coins 1000 times, EV is about 50%)

First Checkpoint

Does this lay out the core problem?
Does it leak any information about upcoming checkpoints?
- Yes --> Rewrite that part to remove details
- No --> LGTM
Does it specify how to standardize errors?
Does it specify what the entrypoint interface is?
- No --> Make sure all flags/arguments/outputs are described
Would this first checkpoint be trivial (<4 hours) to implement?
- Yes --> Add more
Have I described all of the terminology/equations/priors required so that someone can solve it without requiring a web search?
- NOTE: if there is provided data (i.e. an API they must query or data files they must query), you likely don't need to describe it if it can be explored.

Subsequent Checkpoints

Is there a single core focus to this checkpoint?
- No --> Think about ways to break it up into multiple checkpoints
If the prior checkpoints were coded in a near perfect way, would it take you < 5 hours to implement?
- Yes --> Add more meat to the checkpoint
Does the spec describe every prior functionality (error messages/data formats) that needs to be changed/modified?
- Yes --> Check you didnt leak anything. If not, LGTM
- No --> Add those details in

ALL Checkpoint Validation

Can you write 5+ test cases without ambiguity?
Could two correct implementations produce different outputs?
Is any behavior undefined or left to interpretation?
Could an agent reasonably need to search the web to understand requirements?
Would a human SWE need to ask clarifying questions?

If you answer YES to any of 2-5, add more specification.

Submitting Your Problem

Files to Include

Your PR should include:

problems/your_problem/
├── config.yaml           # Problem configuration (with inline checkpoints)
├── checkpoint_1.md       # Checkpoint 1 specification
├── checkpoint_2.md       # Checkpoint 2 specification
├── tests/
│   ├── conftest.py       # Pytest fixtures
│   ├── test_checkpoint_1.py
│   ├── test_checkpoint_2.py
│   └── data/             # Test case data
│       ├── checkpoint_1/
│       └── checkpoint_2/
└── files/                # Static assets (if needed)

PR Description Template

## Problem: [Problem Name]

### Overview
[1-2 sentences describing what the problem tests]

### Checkpoints
1. **Checkpoint 1**: [Brief description]
2. **Checkpoint 2**: [Brief description]
...

### Test Coverage
- Core cases: [N] tests
- Error cases: [N] tests
- Edge cases: [N] tests

### Design Notes
[Any important design decisions or trade-offs]

What Reviewers Look For

Spec clarity: Can someone implement this without asking questions?
Test diversity: Are there core, error, and edge cases?
Checkpoint progression: Do checkpoints build naturally on each other?
No structure leakage: Does the spec describe behavior, not implementation?
Deterministic grading: Will two correct solutions produce the same output?

After Submission

Respond to reviewer feedback promptly
Test your changes locally before pushing updates
Use the Troubleshooting Guide if tests fail