tests/
May 20, 2026 · View on GitHub
Eval suite for the PM Brain skill. A scenario here is a short, scripted story we feed the skill (a sequence of interviews, meetings, or messages) so we can check that the resulting brain (the folder of markdown files the skill maintains) ends up the way it should. The check happens after every turn.
New to PM Brain terms? The glossary covers brain, ingest, provenance, hypothesis, decision file, audit trail, and anything else that shows up in this doc and isn't obvious.
This file is the operator-level quickstart. For everything else (scenario format, ground-truth schema, harness internals, assumptions, full coverage / gap list, and how to add a scenario) see TESTING.md. Design rationale lives in ../docs/testing.md.
Quickstart
# Single scenario, single run (cheap, fast)
python tests/harness/run_scenario.py tests/scenarios/01-b2b-churn
# Single scenario, N runs (for convergence pass rate)
python tests/harness/run_scenario.py tests/scenarios/01-b2b-churn --runs 5
# All scenarios, N runs each
python tests/harness/run_all.py --runs 5
Results land in tests/results/<date>-<scenario>-<run>.json. The folder is gitignored by default.
Layout
tests/
├── README.md # This file
├── scenarios/ # Synthetic scenarios with ground truth
│ └── 01-b2b-churn/
│ ├── README.md # What this scenario covers
│ ├── inputs/ # Cached synthetic artifacts, ordered
│ └── expected.yaml # Ground-truth assertions
└── harness/
├── run_scenario.py # Per-scenario runner
├── run_all.py # Batch runner
├── checks/ # Structural assertion helpers
│ └── structural.py
└── judges/ # LLM-judge rubrics
├── hypothesis_promoted.md
├── contradiction_surfaced.md
└── decision_quality.md
What a scenario looks like
A scenario is an ordered stream of synthetic PM artifacts (we call them turns) plus assertions about what should be true after each turn.
Each turn under inputs/ is a single file: an interview transcript, a meeting note, an analytics screenshot description, a Slack thread, a competitor changelog. The harness replays them in filename order, calling the skill on each one as if a real PM had just dropped it in.
expected.yaml declares the assertions. Two kinds:
- Structural: file-system checks (does this hypothesis file exist? did the decision file get updated?). Deterministic, fast, free.
- Content: LLM-judge calls with a tight rubric (did the brain surface the contradiction? did the synthesis tag provenance correctly?). Non-deterministic, costs money. Reserved for things a structural check can't answer.
See docs/testing.md § Scenario format for the full ground-truth schema.
Adding a scenario
- Pick a lifecycle move not covered by existing scenarios (see the coverage gaps in
docs/testing.md). - Create
tests/scenarios/<NN-slug>/. - Write
README.mddeclaring what the scenario covers and which lifecycle moves it exercises. - Generate cached synthetic inputs under
inputs/. Name themturn-NN-<kind>.md. Commit them. Don't regenerate on the fly. - Write
expected.yamlwith per-turn structural + content assertions and afinal_stateblock. - Run the harness against the scenario. Iterate on
expected.yamluntil the scenario passes at the threshold you set.
Cost guard
The harness uses real LLM calls. Cost ballpark with the default Sonnet model split (see TESTING.md § Cost model for the full table):
- Per turn: ~$0.10–0.40
- Per content (judge) assertion: ~$0.02–0.05 (Sonnet) / ~$0.10–0.20 (Opus, opt-in)
- Per full scenario run (10 turns + ~15 judges): ~$3–5
--runs 5of one scenario: ~$15–25
Under a Claude subscription the printed cost is the API-equivalent price, useful as a proxy for 5-hour-window quota pressure, not real billing. Either way it's the number to optimize.
--max-cost (default $20) aborts the run on cumulative overrun. Use it. Don't loop the harness in a debug session without it.