Architecture

May 20, 2026 · View on GitHub

PM Brain is two architectural decisions and one operating loop. Everything else follows.

Decision 1: Deterministic scaffold + adaptive prompts

The skill is split into two layers that evolve independently.

Layer	Lives in	Behavior	Why
Static structure: schemas, `CLAUDE.md`, `INDEX.md`, folder tree, file templates	`.claude/skills/pm-brain/scaffold/`	Deterministic. Same every install.	Copy-paste reliability. No generation drift. Schemas can evolve without touching reasoning.
Adaptive reasoning: mode detection, migration, interview, post-scaffold self-test	`.claude/skills/pm-brain/prompts/`	Probabilistic. Depends on inputs.	Reasoning can improve without rewriting schemas. Behavior evolves in `prompts/`, structure stays stable.
Orchestration: when to load what	`.claude/skills/pm-brain/SKILL.md`	Glue.	Single entry point.

The first version of this skill mixed both layers in one prompt. Forcing deterministic content through probabilistic generation caused inconsistencies, formatting drift, occasional missing files, path errors. Splitting them eliminates that whole class of failure.

Decision 2: Markdown files in a git repo, no vector layer

Most "AI memory" systems reach for embeddings, vector databases, hidden retrieval. PM Brain reaches for markdown files the PM can read, edit, version-control, and grep.

What we picked	What we rejected	Why
Plain markdown	Vector DB / embedded chunks	Inspectable. Editable. No black box.
Git for versioning	Hidden auto-snapshots	The PM can see what changed and revert.
Explicit promotion (PM signs off)	Automatic memory consolidation	Memory promotion is judgment work. Auto-promotion fills the durable layer with noise.
One `CLAUDE.md` per repo	Long system prompts in chat history	The operating manual is part of the codebase. It survives sessions, models, and tools.
Five knowledge areas + three lifecycle areas	One flat note library	Bounded retrieval. The agent loads the relevant area, not half the repo.

The cost of this choice: no fuzzy semantic search out of the box. The benefit: every claim the system makes traces to a specific file. That trade-off is right for PM judgment work, where provenance matters more than recall breadth.

The architecture map

knowledge/                 (durable, synthesized)
├── strategy.md            North-star metric, priorities, non-goals, tensions
├── product/               Metrics (AARRR + north-star), features, roadmap
├── users/                 Personas, segments, synthesized insights
├── market/                Landscape, competitors, trends
└── org/                   Team, rituals, tools

hypotheses/                (durable, evidence-state)
└── <feature-slug>.md      One file per feature, 5 risk areas as sections

decisions/                 (durable, append-only)
└── <date>-<slug>.md       One file per decision, with "what would reverse this"

stakeholders/              (durable, people-state)
└── <slug>.md              One file per person, with touchpoint log

ingestion/                 (working memory)
├── interviews/            Customer interviews
├── meetings/              1:1s, reviews, syncs
├── market/                Competitor screenshots, articles
└── adhoc/                 "Just learned this" dumps

source/                    (immutable copies of inputs)
maintenance/log/           (dated sweep reports)
rules/                     (PM-specific rules: discovery, data, prioritization, shipping, writing)
docs/                      (workflow + schema reference)

The cognition pipeline

Evidence flows in one direction. It fans out at the durable layer.

source/            →   ingestion/        →   knowledge/      (synthesized observations, durable facts)
(immutable copy)       (working memory)      hypotheses/     (testable beliefs, evidence accrual)
                                             decisions/      (committed choices, append-only)
                                             stakeholders/   (people state, touchpoint log)

The same ingestion can update multiple durable destinations in parallel. A 45-minute customer interview might touch six files: one source copy, one ingestion record, one insight promoted to knowledge/users/, one hypothesis strengthened, one stakeholder touchpoint logged, one decision drafted as pending.

Provenance: a vocabulary, not a workflow

Every load-bearing claim in hypotheses/, decisions/, and knowledge/users/insights.md carries a provenance tag from a small canonical enum:

Tag	Trust
`[ingestion/<path>](...)`	Highest. Went through synthesis, links back to `source/`.
`[source/<path>](...)`	High. Direct citation to a raw artifact.
`(stakeholder-verbal, <name>, <date>)`	Medium. Heard from a person, no recording.
`(intuition, PM, <date>)`	Low for external defense, useful internally.
`(industry-knowledge)`	Low. Accepted background, flag for replacement.
`(chat, no artifact)`	Low. Synthesized in-session, nothing written down.

The system enforces the vocabulary, not the workflow. Real PM work is messy: PMs have intuitions, hear things off-the-record from execs, and inherit claims with no clear pedigree. Those are legitimate inputs. The brain just makes them wear their actual provenance instead of laundering them through a fake ingestion/ record. The auditability promise is "every claim wears its source," not "every claim was synthesized."

A path-typed tag walks (in two clicks) from a decision back to a source/ artifact. A non-path tag tells you honestly that no artifact exists. Both are auditable; only the missing tag is a bug.

The hypothesis / decision split

This is the load-bearing distinction.

Hypotheses are bets being tested. Feature-scoped. Each file has the 5 risk areas as sections (value, usability, feasibility, viability, and "other" for the ones that don't fit the canonical four: regulatory, partnership, and so on). Each hypothesis has evidence-for, evidence-against, confidence, a test, and a decision trigger.

Decisions are commitments made. Append-only log. Every shipped feature has at least one. Every decision has a "what would reverse this" field, which is the most useful field a decision record can have.

When a hypothesis is confirmed, it gets promoted and a decision record is auto-drafted (status: pending, waiting for PM sign-off). When a decision's "what would reverse this" condition triggers, the maintenance sweep surfaces it.

Most systems mash these together. They become useless.

What survives sessions

The operating manual (CLAUDE.md) and the durable layer. Not the chat history. Not the agent's working memory. Not whatever model you happened to use that week.

The portability test: clone the repo to a fresh machine, open Claude Code, ask "what's the current state of feature X?". If the answer comes from files (not chat history), the system is doing its job.