Quality Playbook v1.3.50
April 20, 2026
Version: 1.3.50
Status: Shipped
Date: 2026-04-15
Author: Andrew Stellman
Primary commits:
881879a: "v1.3.50: Renumber phases 1-6, add --phase flag to runner" (2026-04-14)
f888d16: "v1.3.50: Six-phase architecture, iteration strategies, quality gate, agent" (2026-04-15)
What This Version Introduced
Version 1.3.50 is the release in which the Quality Playbook stopped being a monolithic skill with a main pipeline and a handful of sub-phases, and became a six-phase architecture with first-class, independently runnable phases. Three changes landed together across two commits spanning fifteen hours, and each one is load-bearing for everything that followed. The first commit, 881879a, renumbered the phases from the older 1 / 2 / 2b / 2c / 2d / 3 scheme into a flat, linear 1 / 2 / 3 / 4 / 5 / 6 — and, more consequentially, added a --phase flag to run_playbook.sh that lets each phase execute in its own separate CLI invocation with its own context window and its own exit gate. The second commit, f888d16, completed the release: it bumped version stamps across SKILL.md and README.md, rewrote the "What's new" section around the new architecture, added a new agents/quality-playbook.agent.md file that exposes the skill as a named agent for tools that support the awesome-copilot agent format, and fixed the one stray comment in quality_gate.sh that still referenced the old Phase 3 verification slot after the renumbering had moved verification to Phase 6.
Read in isolation, any of those three changes could look like a cleanup. The phase renumbering alone would be a cosmetic renaming. The --phase flag alone would be a runner convenience. The agent file alone would be an ecosystem adapter. Read together, they are a structural rewrite of how the skill is meant to execute. The old pipeline was a single long run that a model worked through from exploration to verification in one session, occasionally splitting into a multi-pass mode that batched phases into four CLI calls. The new pipeline is six independent agentic steps, each with documented prerequisites checked on entry, each producing artifacts that the next phase consumes from disk, each runnable either as part of an --phase all sweep or as a standalone --phase 3 or --phase 3,4,5 invocation. The phase-numbering change is what made this possible — 2b, 2c, and 2d are not addressable as independent units in a way that command-line flags can express or that exit gates can reason about, but 3, 4, and 5 are. The flat numbering is the operational handle on which the rest of the architecture hangs.
The six phases that emerge from this release, preceded by an automatic Phase 0, are the ones still in use today. Phase 0 (Prior Run Analysis) loads seed data from previous runs when they exist and is automatic. Phase 1 (Explore) performs the three-stage exploration (open exploration, quality-risk analysis, selected structured patterns) and writes quality/EXPLORATION.md. Phase 2 (Generate) reads EXPLORATION.md and produces the ten core quality artifacts: QUALITY.md, REQUIREMENTS.md, CONTRACTS.md, COVERAGE_MATRIX.md, the functional test file, RUN_CODE_REVIEW.md, RUN_INTEGRATION_TESTS.md, RUN_SPEC_AUDIT.md, RUN_TDD_TESTS.md, and AGENTS.md. Phase 3 (Code Review) executes the three-pass code review against HEAD and writes regression test patches for every confirmed finding. Phase 4 (Spec Audit) runs the Council of Three independent auditors plus the triage synthesis with verification probes. Phase 5 (Reconciliation) closes the loop: bug writeups, TDD red-green execution, sidecar JSON generation, the terminal gate. Phase 6 (Verification) runs the 45 self-check benchmarks and the script-verified closure gate. The SKILL.md plan-overview section in commit 881879a is organized exactly this way, and every subsequent release preserves that organization.
Alongside the phase-structure rewrite, v1.3.50 formalized two pieces of infrastructure that had existed in more tentative forms before. The quality gate script, quality_gate.sh, was promoted in this release from a benchmark-harness-specific utility to the canonical final-step verification tool that Phase 6 mandatorily invokes. The comment fix in f888d16, changing "Phase 3 verification step" to "Phase 6 verification step", is tiny in character count but signals what the script's role has become. The agent runner, run_playbook.sh, was rewritten in 881879a to add the --phase flag, to expand from about 340 lines to 505 lines (a 47% increase), and to carry a matched pair of new functions, check_phase_gate and run_one_phase, that encode the exit-gate contract in shell. And the agents/ directory made its first appearance in f888d16 with the new quality-playbook.agent.md file, giving the skill a named-agent front door that lets host environments invoke it by name.
Iteration strategies, which had started landing in v1.3.44 as optional flags and accumulated through v1.3.49, also got consolidated in v1.3.50. The README "What's new" section introduced in f888d16 lists four strategies as first-class citizens of the pipeline — gap, unfiltered, parity, adversarial — and states the empirical result that motivated the consolidation: iterations consistently add 40–60% more confirmed bugs on top of the baseline. The ITERATION.md file that governs strategy execution was updated in 881879a to refer to the new phase numbers ("Continue with Phases 2–6" instead of "Continue with Phases 2–3", "At the end of Phase 6" instead of "At the end of Phase 3") and to route TDD enforcement through "Phase 5" rather than the old "Phase 2d". The iteration system did not newly appear in v1.3.50, but v1.3.50 is the release where it became part of the formal architecture rather than an opt-in branch.
The net effect of the two commits is a skill that has taken the shape it still has today. Later releases refine, harden, and extend, but none of them change the phase structure, none of them change the basic shape of the runner, and none of them retire the quality-gate or agent infrastructure introduced here. v1.3.51 moved ITERATION.md from the repo root into references/. v1.4.0 added orchestrator agents that run phases programmatically. v1.4.2 introduced recheck mode. v1.4.3 split functional_tests into per-language references and added the challenge gate. v1.4.4 hardened the orchestrator against single-context collapse and replaced quality_gate.sh with quality_gate.py. v1.4.5 ported the benchmark runner to Python and moved quality_gate into .github/skills/. Every one of those changes is a modification of something v1.3.50 put in place. The phases are still six, numbered one through six. The runner still accepts --phase. The quality gate still runs as the final verification step. The agent file still sits in agents/. The iteration strategies are still gap → unfiltered → parity → adversarial. Once the scaffolding hardened, everything built on top of it rather than replacing it.
Why It Was Needed
The pipeline that existed before v1.3.50 had reached the structural limit of what could be expressed in a flat two-phase model with alphabetic sub-phases. Understanding why the rewrite was needed requires understanding what the old pipeline looked like — and the answer is that it had been accreting for months without any flat-numbering scheme to govern the accretion. The earliest versions of the skill had a simple two-phase structure: Phase 1 (Explore) and Phase 2 (Generate). When code review became a formal step, it was grafted on as Phase 2b. When the Council of Three spec audit became a formal step, it joined as Phase 2c. When post-review reconciliation with TDD verification became a formal step, it became Phase 2d. When the 45-benchmark self-verification checklist was introduced, it was called Phase 3. The result was a pipeline whose numbering — 1 / 2 / 2b / 2c / 2d / 3 — preserved historical ordering but obscured operational parity. Phase 2b and Phase 3 were both full phases with their own gates, artifacts, and reference files, but one was formatted as a letter-suffixed sub-phase and the other as a top-level integer, and the difference was an artifact of when each had been added rather than of any property they had.
The numbering problem mattered because it shaped what was addressable from outside the pipeline. Shell scripts, documentation, gate checks, and agent prompts all had to spell out phase identifiers exactly, and the mixed integer-and-alpha scheme made those references brittle. A gate that said "mandatory before marking Phase 2d complete" was unambiguous but not uniform with a gate that said "Phase 3 verification." A --phase flag in the runner could accept 2b, 2c, 2d as strings, but the shell case-statements that would validate those values had to special-case letter-suffixed tokens in a way that integer phases did not require. More importantly, when a human operator or a downstream tool wanted to say "run the code review and spec audit," the natural expression — "phases 3 through 4" — did not correspond to any actual phase labels. The old numbering forced every consumer of the skill to know the history. Flat numbering lets every consumer work from the current structure alone.
The second pressure was context-window exhaustion on larger codebases. In the pre-v1.3.50 architecture, the default execution mode was single-pass: a single prompt handed to the runner produces a single CLI invocation that tries to carry a model from Phase 1 exploration through Phase 3 verification in one session. This worked on the benchmark repositories — virtio, chi, httpx, gson — which are small enough that the full pipeline fits in the context window with room to spare. It worked less reliably on larger targets. When the context window fills during Phase 2d, the model either truncates its reconciliation artifacts, skips the terminal gate, or produces a BUG tracker that does not match the number of bugs found. When the context window fills during Phase 3 verification, the model does not have room to read both the verification checklist and the artifacts it is supposed to check, and benchmarks go unrun or are answered from the model's summary memory rather than from disk. A multi-pass mode existed — the older --multi-pass flag, which split execution into four CLI calls — but it was coarse-grained and its four passes (Explore / Generate / Review / Gate) did not map cleanly onto the actual phase boundaries.
A separate thread of the same problem showed up in the incremental Phase 3 verification rewrite that had landed in v1.3.49 just three days before v1.3.50 shipped. The v1.3.49 commit message names the failure directly: "Phase 3 restructured from monolithic verification to 5 independent steps (3.1-3.5), each reading only the files it needs and appending results to phase3-verification.log. Fixes context-limit hangs at the Phase 3 boundary seen in two consecutive Claude runs." In other words, Phase 3 verification by itself was too much to fit in a single context, and the fix in v1.3.49 was to shard it internally into five sub-steps that each touched only the files relevant to that step. That fix solved Phase 3 verification in isolation, but it was an admission that the pipeline as a whole was running up against context limits. Once one phase has been sharded because it is too big, the argument that every phase boundary is a natural context-dropping point becomes much harder to resist. v1.3.50 made that argument explicit: every phase is now a natural context-dropping point, and the runner supports it by spawning a fresh CLI session per phase.
The third pressure was the growing importance of the quality gate as the final arbiter of run conformance. Through v1.3.27, v1.3.28, v1.3.32, v1.3.33, and v1.3.49, the quality_gate.sh script had been steadily expanded to cover more mechanical checks: JSON schema validation, canonical field names, writeup inline diffs, regression-test patch presence, TDD red-green log files, use case identifiers, heading formats, version stamps. Each expansion came from a benchmark run in which the model had self-attested to a check that the script could have caught. By v1.3.50, the gate had become the only trustworthy sign-off on a completed run. But in the old numbering, the gate's role was nominally "Phase 3" verification, and nothing in the phase structure distinguished running the gate from the other self-checks that happened around it. Promoting the gate to a named final step in a flat numbering scheme clarified what the pipeline was really doing in its last phase: the pipeline's final act is running a shell script that either reports zero FAILs and exits zero, or reports FAILs and forces remediation. That act deserved a phase number of its own, and Phase 6 is what it got.
A fourth pressure is easier to see in retrospect than it was at the time: the skill was about to be adopted by external tools that treat "phase" and "agent" as addressable concepts. The awesome-copilot skill repository uses an agents/ directory convention for named agent files. Claude Code treats subskills as first-class invocables. Host environments increasingly want to be able to say "run phase 3 of the quality playbook" or "invoke the quality-playbook agent" without needing to read the full SKILL.md to figure out what that means. A pipeline that numbers its phases 1, 2, 2b, 2c, 2d, 3 cannot be driven from outside without a lookup table. A pipeline that numbers its phases 1, 2, 3, 4, 5, 6 can. The addition of agents/quality-playbook.agent.md in f888d16 was an explicit step into that world — the agent file tells external hosts how to invoke the skill, and its instructions reference the six-phase structure by name ("Run quality playbook phase 1 — explore the codebase", "Run quality playbook phase 3 — code review"). The flat numbering is a precondition for the agent front door.
The release was not triggered by a single empirical failure in the way that v1.3.35 was triggered by two specific bugs that unaided exploration missed. v1.3.50 is a structural release, motivated by accumulated pressure across four fronts: addressability of phases from outside the pipeline, context-window exhaustion on larger codebases, the maturing role of the quality gate as the conformance arbiter, and the growing ecosystem expectation that skills expose themselves as addressable agents. The two commits that make up the release address each of those pressures in parallel. The renumbering answers addressability; the --phase flag and its exit gates answer context; the quality-gate comment fix and its elevation to Phase 6 answer conformance; the agent file answers ecosystem.
The Six-Phase Architecture
The structural change at the heart of v1.3.50 is the move from 1 / 2 / 2b / 2c / 2d / 3 to 1 / 2 / 3 / 4 / 5 / 6. Commit 881879a is where this happens. The diff touches 176 lines of SKILL.md (insertions plus deletions nearly even at roughly 88 lines each), plus matching changes in ITERATION.md, README.md, TOOLKIT.md, quality_gate.sh, references/review_protocols.md, references/spec_audit.md, and references/verification.md. Every surface that named a phase was updated in lockstep. The scale of the rename is a tell: an operation that touches SKILL.md and seven other files across the skill, the documentation, the tooling, and the reference library is not a cosmetic improvement. It had to be a wholesale rename, because nothing could be promoted from a sub-phase letter to an integer without forcing every downstream reference to be rewritten.
The mapping itself is straightforward. Old Phase 1 stays as new Phase 1 — exploration is unchanged. Old Phase 2 stays as new Phase 2 — artifact generation is unchanged. Old Phase 2b becomes new Phase 3 — code review with regression tests. Old Phase 2c becomes new Phase 4 — spec audit with triage. Old Phase 2d becomes new Phase 5 — post-review reconciliation, TDD, completeness closure. Old Phase 3 becomes new Phase 6 — the 45-benchmark self-verification. Phase 0, which handles prior-run analysis and seed injection when previous_runs/ exists, keeps its number because it is a precondition rather than a sub-phase. Phase 7, which is the interactive "Present, Explore, Improve" step that exists after verification is complete, likewise keeps its position — the rename shifts it from "Phase 4" (its old number) to "Phase 7", because it now sits after the six-phase core rather than after a three-phase core.
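The mapping can be written down as a small lookup. The sketch below is illustrative only: map_old_phase is a hypothetical helper, not something the release ships, and it does nothing more than restate the old-to-new correspondence described above.

```shell
# Hypothetical helper: translate a pre-v1.3.50 phase label into its
# v1.3.50 number, following the mapping described in the text.
map_old_phase() {
  case "$1" in
    0)  echo 0 ;;   # prior-run analysis keeps its number
    1)  echo 1 ;;   # Explore
    2)  echo 2 ;;   # Generate
    2b) echo 3 ;;   # Code Review
    2c) echo 4 ;;   # Spec Audit
    2d) echo 5 ;;   # Reconciliation
    3)  echo 6 ;;   # Verification
    4)  echo 7 ;;   # Present, Explore, Improve
    *)  echo "unknown old phase label: $1" >&2; return 1 ;;
  esac
}

# Example: the old code-review sub-phase.
map_old_phase 2b
```

Only the old labels need a lookup; the new numbers are plain integers, which is why consumers no longer need to know the history.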
The conceptual reorganization the rename effects is more substantial than the mechanical one. Under the old scheme, the skill had two major phases (1 and 2) and one minor phase (3), with three sub-phases (2b, 2c, 2d) nested under Phase 2. That grouping implied that everything from artifact generation through reconciliation belonged to a single "generation" super-phase. In practice, the four sub-phases 2, 2b, 2c, 2d are not a single cohesive activity — they are four distinct jobs: generating documents, reviewing the code, auditing against the spec, and closing the loop. Collapsing them into one bracketed Phase 2 made sense when the skill was younger and they were understood as one pipeline. It made less sense as each sub-phase accreted its own gates, its own reference document, and its own characteristic failure modes.
The flat renaming makes each sub-phase a peer of the others. Phase 3 (Code Review) has the same syntactic weight as Phase 2 (Generate); Phase 4 (Spec Audit) has the same syntactic weight as Phase 3 (Code Review). The section headings change from ### Phase 2b: Code Review and Regression Tests to ## Phase 3: Code Review and Regression Tests — note the heading-level promotion from H3 to H2, which the diff in 881879a applies consistently. That heading promotion is the visible surface of a conceptual promotion: code review is no longer a sub-activity of artifact generation; it is a phase in its own right. The same promotion happens for spec audit (from H3 ### Phase 2c to H2 ## Phase 4) and for reconciliation (from H3 ### Phase 2d to H2 ## Phase 5). The old "Required references for this sub-phase" language is updated to "Required references for this phase" throughout the diff, matching the conceptual elevation.
The reorganization also separates the reconciliation-and-TDD work cleanly from the verification work. Under the old numbering, the final post-reconciliation run of quality_gate.sh was a Phase 2d step — it was invoked from inside the reconciliation phase as part of the terminal gate. The 45-benchmark self-check was then Phase 3, a nominally separate phase that in practice ran in the same context immediately after. Under the new numbering, those are two distinct phases with a hard break between them. Phase 5 runs reconciliation, TDD, writeup generation, sidecar JSON writing, and the mechanical verification receipts, finishing with a marked Phase 5 complete and the artifacts on disk. Phase 6 is a fresh activity that reads those artifacts back from disk and runs both quality/mechanical/verify.sh and .github/skills/quality_gate.sh against them, plus the file-by-file verification checklist. The break exists because the two activities consume different inputs and produce different outputs — Phase 5 produces artifacts; Phase 6 validates them — and because running them in separate sessions lets the validation phase start from a clean slate, which is important since Phase 6 is where hallucinated self-attestation had been the most frequent failure mode.
The Phase 5 → Phase 6 split also effected a real change in the terminal-gate story. In the old pipeline, the Phase 2d terminal gate and the Phase 3 benchmark checklist were both nominally running the same gate script, but they were loosely coupled — Phase 2d ran the gate once as part of closure, and Phase 3 ran it again as part of verification, and sometimes the redundancy was lost when a model ran low on context between them. In the new pipeline, the gate has one canonical runtime slot — Phase 6, Step 6.2 — and Phase 5's closure gate is a prerequisite check that uses the gate script once to confirm the pipeline passed before the phase is marked complete. The script's own comment in quality_gate.sh was updated in f888d16 to match: "This script is also copied into each repo at .github/skills/quality_gate.sh so the playbook agent can run it as its final Phase 6 verification step." The old comment had said "Phase 3" and the mismatch with the new numbering is what f888d16 caught.
Beyond the rename itself, the phase restructuring carried with it a small but important shift in the plan-overview language at the top of SKILL.md. The old Phase 2 description was a single sentence: "Read EXPLORATION.md and produce the quality artifacts: requirements, constitution, functional tests, code review protocol, integration tests, spec audit protocol, TDD protocol, AGENTS.md. Then execute the code review (Phase 2b), spec audit (Phase 2c), and reconciliation (Phase 2d). Every bug found traces back to a requirement, and every requirement traces back to an exploration finding." The new Phase 2 description stops after "AGENTS.md" and the code-review / spec-audit / reconciliation promises are lifted out into their own plan-overview bullets — Phase 3, Phase 4, Phase 5, each given one sentence of description. The tracing statement that used to trail Phase 2 — "Every bug found traces back to a requirement, and every requirement traces back to an exploration finding" — is moved to stand on its own as a summary of the pipeline after the six-phase list. This is a small textual rearrangement, but it is what the plan-overview section looks like today, and it is what subsequent releases have built on.
The structural change is completed by a short but telling addition to the critical-dependency-chain language. The old SKILL.md said: "Exploration findings → EXPLORATION.md → Requirements → Code review + Spec audit → Bug discovery." The new version preserves that chain but now has six visible phases for the pipeline to walk through, so the chain reads against an explicit scaffolding rather than against a pipeline that has to be reconstructed from prose. A reader of the plan-overview section can now see the six phases, can see the dependency chain in parallel, and can see that each link in the chain is named and numbered. The plan becomes self-describing in a way the older prose was not.
The Quality Gate Script
The quality gate script had existed for months before v1.3.50, but v1.3.50 is the release where it became the canonical final-step verification tool and where every part of the skill was updated to refer to it consistently. The script itself is repos/quality_gate.sh in the source repository, copied into each deployed project at .github/skills/quality_gate.sh during setup. By v1.3.50 it had grown to 632 lines — substantial for a shell script — and it mechanically validated: file existence, BUGS.md heading format (### BUG-NNN — the canonical three-hash form), sidecar JSON required root keys and per-bug field names, sidecar JSON enumerated values (verdict must be one of five strings, recommendation must be one of three), sidecar JSON summary consistency, use case identifier format (UC-01 style), terminal gate section presence, mechanical verification receipts, version stamps, writeup completeness, regression-test patch presence for every confirmed bug, and inline fix diffs in every writeup. It also validated TDD log files — BUG-NNN.red.log for every confirmed bug and BUG-NNN.green.log for every bug with a fix patch — which was the most recent expansion before v1.3.50, landing in v1.3.49.
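To make one of these checks concrete, here is a minimal sketch of the red-log presence check. It is not the code in quality_gate.sh. The function name check_red_logs_sketch, the way bug IDs are pulled from ### BUG-NNN headings, the treatment of every heading as a confirmed bug, and the log directory passed as an argument are all assumptions made for illustration.

```shell
# Hypothetical sketch: for every "### BUG-NNN" heading in a BUGS.md file,
# require a matching BUG-NNN.red.log in the given log directory.
# Prints one PASS/FAIL line per bug and returns nonzero if any log is missing.
check_red_logs_sketch() {
  local bugs_file="$1" log_dir="$2" fails=0 id
  # Extract IDs such as BUG-001 from three-hash headings only.
  for id in $(sed -n 's/^### \(BUG-[0-9][0-9]*\).*/\1/p' "$bugs_file"); do
    if [ -f "${log_dir}/${id}.red.log" ]; then
      echo "PASS: ${id}.red.log present"
    else
      echo "FAIL: ${id}.red.log missing"
      fails=$((fails + 1))
    fi
  done
  [ "$fails" -eq 0 ]
}
```

The point of the shape is that the result depends only on what is on disk, not on what the model reports about its own run.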
What v1.3.50 did was fix the one stray reference that still called the script's role "Phase 3 verification." The diff in f888d16 for repos/quality_gate.sh is two lines: the comment at line 29 changed from "so the playbook agent can run it as its final Phase 3 verification step" to "so the playbook agent can run it as its final Phase 6 verification step." That comment is not executable code, but its location inside the script's header block — immediately after the exit-code documentation, before the actual set -uo pipefail directive — makes it part of the script's self-description. A shell script whose header comment tells the reader it is invoked in Phase 6 is in effect telling the orchestrator where to invoke it. The comment is the script's declaration of where in the pipeline it lives.
The role the script plays in the new architecture is more consequential than the textual change suggests. Under the old numbering, the script was nominally invoked during Phase 2d's terminal gate (the "script-verified closure gate" step) and then again during Phase 3's Step 3.2. In practice, the second invocation was often skipped — by the time Phase 3 started, the model had already executed the script in Phase 2d and could see the result in quality/results/quality-gate.log, and the incentive to rerun it in a new step was weak. The two invocations were nominally independent but functionally redundant, and the redundancy made the script's actual role ambiguous: was it the terminal gate, or was it a Phase 3 verification check, or was it both?
The v1.3.50 rewrite makes the answer unambiguous. The script has one canonical invocation slot — Phase 6, Step 6.2 — and Phase 5's closure gate invokes it as a prerequisite check, with the requirement that the invocation's output is saved to quality/results/quality-gate.log and that the exit code is zero before Phase 5 can be marked complete. Step 6.2 then re-invokes the script from the verification phase's fresh context window, this time with the explicit job of exercising the script against the final state of the artifacts after reconciliation, appending the exit code to quality/results/phase6-verification.log. The two invocations are no longer redundant; they are before-and-after checks, one from within the writing phase (Phase 5) to prove the run is closable, and one from within the verification phase (Phase 6) to prove the closure survived the transition to a clean context.
The second part of the quality gate's v1.3.50 maturation is how SKILL.md now refers to it. The prose throughout the skill was updated in both commits to route through the script as the mechanical truth about run conformance. The Phase 6 prose reads: "Run bash .github/skills/quality_gate.sh . > quality/results/quality-gate.log 2>&1. Read quality/results/quality-gate.log. If it reports any FAIL results, fix each failing check before proceeding. The most common FAILs are: (1) missing quality/patches/BUG-NNN-regression-test.patch files, (2) non-canonical JSON field names like bug_id instead of id, (3) missing confirmed_open in the TDD summary, (4) writeups without inline fix diffs, (5) missing TDD red/green log files. Do not proceed until quality_gate.sh exits 0." The script is named as the arbiter, the failure modes are enumerated, and the skill's own prose defers to the script's output rather than attempting to duplicate the checks.
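The run-then-read pattern in that prose can be sketched as a small wrapper. The function name run_gate_sketch and the assumption that failing checks show up as log lines containing FAIL are illustrative. Only the gate path, the log path, and the exit-zero requirement come from the text above.

```shell
# Hypothetical wrapper: run a gate script against a repo, save all output
# to a log file, and refuse to continue unless the script exits 0.
run_gate_sketch() {
  local gate_script="$1" repo_dir="$2" log_file="$3" rc
  mkdir -p "$(dirname "$log_file")"
  if bash "$gate_script" "$repo_dir" > "$log_file" 2>&1; then
    rc=0
  else
    rc=$?
  fi
  if [ "$rc" -ne 0 ]; then
    echo "gate exited ${rc}; failing checks:"
    # Assumes failing checks are reported on lines containing FAIL.
    grep 'FAIL' "$log_file" || echo "(no FAIL lines found in log)"
    return 1
  fi
  echo "gate exited 0"
}

# Intended use, with the paths quoted in the text:
#   run_gate_sketch .github/skills/quality_gate.sh . quality/results/quality-gate.log
```

Keeping the log on disk matters as much as the exit code: a later phase can read the same file without rerunning anything from memory.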
This deferral is what makes the gate the center of gravity for conformance. Before v1.3.50, a skill revision that wanted to strengthen a check had a choice: add it to the gate, add it to the Phase 3 verification checklist in references/verification.md, add it to the terminal gate language in Phase 2d, or add it to the artifact-file-existence gate in Phase 2d. Each addition would land in one place and the others would slowly fall out of sync. The v1.3.50 architecture collapses the choices: new conformance checks go into the gate script, and the skill prose refers to the script. The script becomes the source of truth; the prose becomes documentation for what the script verifies. This is why the v1.4+ releases that added recheck mode, the challenge gate, and the orchestrator protocol all extend the script first and then update the skill prose to refer to the new checks. The cascade from gate script to skill prose is a v1.3.50 invention.
The earlier judgment-based gates that the quality gate replaced are worth naming for contrast. The Phase 2d prose in earlier versions contained gate language like "do not mark Phase 2d complete until the counts match" and "if any referenced test function does not exist, write it now before passing the gate" and "the verdict is deferred to Phase 2d post-reconciliation, which produces the only verdict that counts for closure." These are gates in the sense that they tell the model not to advance until a condition is met, but they are judgment-based gates — the model has to decide whether the condition is met, and the decision happens inside a single long-running context where the pressure to declare completion is high. A judgment-based gate is trustworthy only to the extent that the model is honest about unresolved gaps. Benchmark runs on v1.3.24 had shown that models reliably skip this check when under context pressure — declaring a terminal gate verified in PROGRESS.md while BUGS.md, writeups, and spec-audit reports were absent from disk. The script-verified gate replaces judgment with shell. test -f quality/BUGS.md cannot be rationalized. Either the file exists or it does not, and the script returns zero or one accordingly.
v1.3.50 does not retire every judgment-based gate — some checks remain prose-only because they require semantic interpretation. But the script-verified closure gate is now mandatory, its output is saved to a durable log, and the two-phase (Phase 5 closure plus Phase 6 verification) invocation pattern ensures the script runs twice at well-defined points in the pipeline. The earlier pattern of "the model says the run is complete" is now "the script says the run is complete, and its log file is on disk." This is the mechanical-over-judgmental move that subsequent releases extend with quality_gate.py (v1.4.4) and with the Sonnet bootstrap self-audit (v1.4.2), both of which build on the assumption that the gate script is the conformance arbiter.
The Agent Runner and the --phase Flag
The first commit of the release, 881879a, added the --phase flag to repos/run_playbook.sh, and the combined effect of that flag and the functions it enabled is what made phase-by-phase execution a real operational pattern rather than a documented aspiration. The runner grew from 343 lines before the commit to 505 lines after, a 47% expansion. The expansion is not uniform. About 120 lines are new functions and helpers, and the rest is modification of existing flag parsing and dispatch logic. The substance of the change is two new concepts in the runner: per-phase prompts, and per-phase exit gates.
The flag itself is declared in the argument parser with EXPECT_PHASE=true state and a new --phase case that captures the next argument. The accepted forms are --phase 1 (run phase 1 only), --phase all (run all six phases sequentially with gates between each), and comma-separated lists like --phase 3,4,5 (run phases 3, 4, and 5). A validation block immediately after argument parsing walks the comma-separated list and rejects any token that is not one of 1 2 3 4 5 6 or the special value all. The old --single-pass and --multi-pass flags are preserved as back-compat aliases: --single-pass now sets PHASE_MODE="" (no phase-by-phase execution, one prompt per repo) and --multi-pass now sets PHASE_MODE="all" (all six phases sequentially, one prompt per phase). The older four-pass multi-pass mode is gone; "multi-pass" now means six-pass, one phase per pass.
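The validation rule is simple enough to sketch. The two functions below are hypothetical stand-ins written from the description above, not the code in run_playbook.sh. The names validate_phase_list and expand_phase_list, and the choice to expand all into 1,2,3,4,5,6, are assumptions.

```shell
# Hypothetical sketch: accept "all" or a comma-separated list of 1..6.
# Walks the list one token at a time and rejects anything else,
# including empty tokens such as the one produced by "3,,4".
validate_phase_list() {
  local rest="$1" token
  [ -n "$rest" ] || { echo "ERROR: --phase needs a value" >&2; return 1; }
  while :; do
    token="${rest%%,*}"            # text before the first comma
    case "$token" in
      1|2|3|4|5|6|all) ;;
      *) echo "ERROR: invalid phase '${token}'" >&2; return 1 ;;
    esac
    case "$rest" in
      *,*) rest="${rest#*,}" ;;    # drop the token just checked
      *)   break ;;
    esac
  done
  return 0
}

# Hypothetical sketch: turn the special value "all" into an explicit list.
expand_phase_list() {
  if [ "$1" = "all" ]; then
    echo "1,2,3,4,5,6"
  else
    echo "$1"
  fi
}
```

Because the new labels are single integers, the whole check fits in one case pattern; the old 2b / 2c / 2d labels would have needed their own branch.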
The per-phase prompts are defined as shell functions named phase1_prompt through phase6_prompt, each emitting a here-doc that tells the model which SKILL.md section to read, which artifacts to write, which references to consult, and where the phase ends. The prompts are explicit about scope: phase1_prompt says "Do NOT proceed to Phase 2. Your only job is exploration and writing findings to disk." phase2_prompt says "Do NOT proceed to Phase 3 (code review). Your job is artifact generation only. The next phase will execute the review protocols you generated." Each prompt follows the same pattern — read files for context, read the SKILL.md section for the phase, execute the phase, mark the phase complete in PROGRESS.md, stop. The phrase "IMPORTANT: Do NOT proceed to Phase N" appears in every prompt except Phase 6, and its ubiquity is a deliberate pushback against the model's natural tendency to keep going.
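A prompt function of this kind might look roughly like the following. This is a sketch of the pattern, not the text of the shipped prompt: the function name phase1_prompt_sketch and the first three lines of the here-doc are assumptions, and only the closing stop instruction is quoted from the description above.

```shell
# Hypothetical sketch of a per-phase prompt function: a shell function
# that prints a fixed here-doc, ending with an explicit stop instruction.
phase1_prompt_sketch() {
  cat <<'EOF'
Read the Phase 1 (Explore) section of SKILL.md.
Write your findings to quality/EXPLORATION.md.
Mark Phase 1 complete in PROGRESS.md, then stop.

IMPORTANT: Do NOT proceed to Phase 2. Your only job is exploration and writing findings to disk.
EOF
}

# Print the prompt as the runner would hand it to the model.
phase1_prompt_sketch
```

Quoting the here-doc delimiter ('EOF') keeps the shell from expanding anything inside the prompt, so the model receives the text exactly as written.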
The third piece is the exit gates. The function check_phase_gate accepts a repo directory, a phase number, and a log file, and returns zero if the phase's prerequisites are satisfied and one if they are not. The prerequisites are disk-observable: Phase 2 requires quality/EXPLORATION.md to exist with at least 80 lines; Phase 3 requires REQUIREMENTS.md, QUALITY.md, CONTRACTS.md, and RUN_CODE_REVIEW.md to all exist; Phase 4 requires REQUIREMENTS.md, RUN_SPEC_AUDIT.md, and warns if code_reviews/ is empty; Phase 5 requires PROGRESS.md and warns if neither BUGS.md nor spec_audits/ exists; Phase 6 requires PROGRESS.md. The gate checks are phrased as [ ! -f "${q}/EXPLORATION.md" ] and [ ! -d "${q}/code_reviews" ] — test expressions that can be reasoned about by eye and cannot be rationalized by model self-attestation. The gate either passes or fails, and the caller in run_one_phased aborts the repo if any gate fails, with the message ABORT: Phase ${phase} gate failed for ${repo_name}.
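A gate of this kind is small enough to sketch. The version below covers only the Phase 2 and Phase 3 prerequisites named above, omits the log-file argument, and assumes that every required file lives under quality/; the function name check_phase_gate_sketch and the messages are hypothetical.

```shell
# Sketch only: Phase 2 and Phase 3 prerequisites as described in the text.
# Assumption: all required files live under <repo>/quality/.
check_phase_gate_sketch() {
    local repo_dir="$1" phase="$2"
    local q="${repo_dir}/quality" f lines
    case "$phase" in
        2)
            if [ ! -f "${q}/EXPLORATION.md" ]; then
                echo "GATE FAIL: EXPLORATION.md missing" >&2
                return 1
            fi
            # Arithmetic expansion strips the padding some wc builds emit.
            lines=$(( $(wc -l < "${q}/EXPLORATION.md") ))
            if [ "$lines" -lt 80 ]; then
                echo "GATE FAIL: EXPLORATION.md has ${lines} lines (need 80)" >&2
                return 1
            fi
            ;;
        3)
            for f in REQUIREMENTS.md QUALITY.md CONTRACTS.md RUN_CODE_REVIEW.md; do
                if [ ! -f "${q}/${f}" ]; then
                    echo "GATE FAIL: ${f} missing" >&2
                    return 1
                fi
            done
            ;;
        *)
            echo "phase ${phase} is not covered by this sketch" >&2
            return 2
            ;;
    esac
    return 0
}
```

Nothing in the function consults the model. The verdict depends only on what is on disk, which is the property the gate design is after.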
The wrapper function run_one_phase ties the three pieces together. It looks up the phase index in the PHASE_LIST variable (a comma-separated string set during dispatch from either the --phase argument or the --phase all expansion), calls check_phase_gate for the phase, and aborts if the gate fails. If the gate passes, it calls the per-phase prompt function to construct the prompt, runs the prompt through run_prompt with the phase name as a label, and logs the result. After the prompt returns, a per-phase post-hook reports characteristic completion metrics: Phase 1's hook counts EXPLORATION.md lines; Phase 3's hook counts bugs in BUGS.md and patches in patches/; Phase 4's hook counts auditor files in spec_audits/; Phase 5's hook counts writeups and TDD red-phase logs; Phase 6's hook reads the last line of quality-gate.log. These post-hooks run in the runner, not in the model, and produce log lines that any observer can use to tell at a glance whether the phase produced its expected output.
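The post-hooks are easy to picture as a case statement over the phase number. The sketch below shows two of them, for Phase 1 and Phase 4. The function name phase_post_hook_sketch, the output wording, and the assumption that EXPLORATION.md and spec_audits/ sit under quality/ are all illustrative.

```shell
# Illustrative post-hook: reports a completion metric from disk.
# Assumption: artifacts live under <repo>/quality/.
phase_post_hook_sketch() {
    local repo_dir="$1" phase="$2"
    local q="${repo_dir}/quality" n=0
    case "$phase" in
        1)
            if [ -f "${q}/EXPLORATION.md" ]; then
                n=$(( $(wc -l < "${q}/EXPLORATION.md") ))
            fi
            echo "Phase 1: EXPLORATION.md has ${n} lines"
            ;;
        4)
            if [ -d "${q}/spec_audits" ]; then
                n=$(( $(find "${q}/spec_audits" -type f | wc -l) ))
            fi
            echo "Phase 4: ${n} auditor files in spec_audits/"
            ;;
        *)
            echo "Phase ${phase}: no metric in this sketch"
            ;;
    esac
}
```

Because the hook runs in the runner's shell, a report of zero lines or zero files is a fact about the file system and not a claim made by the model.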
The outer function run_one_phased drives the loop for a single repo. It sets up logging, tees the output, optionally archives the existing quality/ directory (only if Phase 1 is in the list — otherwise it preserves the in-progress state so that resumption works correctly), sets up control_prompts/, and then iterates over PHASE_LIST:
    local IFS=','
    for phase in $PHASE_LIST; do
        if ! run_one_phase "$repo_dir" "$phase" "$log_file"; then
            logboth "$log_file" "$(log "ABORT: Phase ${phase} gate failed for ${repo_name}")"
            return 1
        fi
    done
Each phase runs in its own claude -p invocation with its own context window, its own prompt, and its own transcript file. The context from the previous phase is not carried over — the model reads what it needs from disk files that the previous phase wrote. This is the architectural commitment: inter-phase communication is through the file system, not through shared context. A phase that fails to write an artifact that the next phase needs will trip the next phase's gate. A phase that writes correctly will pass the gate and let the next phase proceed.
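The rule that quality/ is archived only when Phase 1 is in the list is what lets a later-phase run resume from disk. A minimal sketch of that decision follows, assuming the phase list has already been expanded to comma-separated numbers. The helper names and the archive directory naming are hypothetical.

```shell
# Succeeds when phase $2 appears as a whole token in the comma list $1.
phase_list_contains() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Archive quality/ only when the run starts over from Phase 1.
# The archive name used here is an assumption for illustration.
maybe_archive_quality() {
    local repo_dir="$1" phase_list="$2" stamp
    if phase_list_contains "$phase_list" 1 && [ -d "${repo_dir}/quality" ]; then
        stamp=$(date +%Y%m%d-%H%M%S)
        mv "${repo_dir}/quality" "${repo_dir}/quality.archive-${stamp}"
        echo "archived"
    else
        echo "preserved"
    fi
}
```

With a list of 3,4,5 the directory is left in place, so Phase 3 finds the artifacts that Phases 1 and 2 wrote. With a list that includes 1, the old state is moved aside before exploration starts again.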
The --phase 1 and --phase 3,4,5 forms make partial runs a first-class operation. A user who has run Phase 1 and wants to iterate on exploration before regenerating artifacts can run --phase 1 alone. A user who has completed Phases 1 and 2 and wants to run the review-and-audit-and-reconciliation cluster as a batch can run --phase 3,4,5. A user who wants the end-to-end experience with gates between each phase runs --phase all. Each of these modes uses the same prompts, the same gates, and the same post-hooks; the only thing that differs is which subset of PHASE_LIST is iterated. The uniformity is what makes the flag useful — every combination is addressable, and every combination runs the same machinery.
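The uniformity can be illustrated with a sketch in which every mode reduces to a comma-separated list and the same loop body runs for each entry. The function names below are hypothetical, and the echo line stands in for the real gate, prompt, and post-hook calls.

```shell
# Expand the special value "all" to the full list; pass anything else through.
expand_phase_mode() {
    case "$1" in
        all) echo "1,2,3,4,5,6" ;;
        *)   echo "$1" ;;
    esac
}

# Print the plan for a given --phase value. Every mode shares this loop.
plan_phases() {
    local list phase
    list=$(expand_phase_mode "$1")
    local IFS=','
    for phase in $list; do
        echo "gate -> prompt -> post-hook for phase ${phase}"
    done
}
```

Calling plan_phases all prints six lines and plan_phases 3,4,5 prints three. The loop body is identical in both cases, which is the point the paragraph above makes about addressability.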
The broader architectural implication is that the runner no longer implicitly chooses the pipeline's granularity. Before v1.3.50, the choice was binary: single-pass (one CLI call for the full pipeline) or multi-pass (four CLI calls for coarse-grained chunks). Neither choice corresponded to the actual phase boundaries. v1.3.50 replaces the binary choice with a spectrum. Single-prompt mode is still available (and is still the default when no --phase flag is present). --phase all provides phase-by-phase with gates. --phase N provides single-phase. --phase N,M,... provides arbitrary subsets. The spectrum maps the runner's addressability onto the pipeline's structure, and the --phase all mode is the variant that becomes the backbone for orchestrator agents in v1.4.0, recheck mode in v1.4.2, and the Python-ported runner in v1.4.5. Every later form of programmatic control over the pipeline routes through the --phase concept that v1.3.50 introduced.
The agents/ directory and the quality-playbook.agent.md file are the final piece. The file was added in f888d16 as 54 new lines. It is a YAML-frontmatter Markdown document in the awesome-copilot agent format, with a name ("Quality Playbook"), a description, a tools list (search/codebase, web/fetch), and a body that tells the hosting agent how to find SKILL.md in the repository, what to do if the skill is not installed, what the skill produces, and how users invoke the agent. The file's "How to invoke" section is noteworthy because it directly markets phase-by-phase execution as a first-class user-facing feature: "For large codebases, suggest running phase-by-phase to stay within context limits: 'Run quality playbook phase 1 — explore the codebase', 'Run quality playbook phase 3 — code review'." The agent file is the contract between the skill and the hosting environment, and its prose tells the host environment exactly how to expose the --phase flag to end users. An agent-driven invocation uses the same phase vocabulary that a command-line invocation uses. The agent file, the --phase flag, and the flat phase numbering are the same design viewed from three different surfaces.
Phase-by-Phase Prompts and the Clean-Context Contract
The per-phase prompts embedded in run_playbook.sh do more than tell the model what to do in each phase — they encode an explicit contract that each phase's context is bounded by the files it reads and the prompt it receives. This is worth examining in detail, because the phrasing is deliberate and the design pattern it establishes is what the rest of the skill's infrastructure builds on.
The Phase 1 prompt (phase1_prompt) starts with a one-line framing — "You are a quality engineer executing Phase 1 of the quality playbook for [repo]" — and then lists the three context files the model should read: .github/skills/SKILL.md (the skill itself), .github/skills/references/ (the reference library), and .github/skills/ITERATION.md (the iteration strategy reference, which is not used in baseline runs but is expected to be skimmed). The prompt then spells out the phase's output format — "Write your full exploration findings to quality/EXPLORATION.md" — and ends with the "IMPORTANT: Do NOT proceed to Phase 2" clause. The entire prompt is about 25 lines of here-doc text. A model entering this prompt does not need any memory of a prior phase, because there is no prior phase; it does not have context to drop, because the session was just created.
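The shape of such a prompt function can be sketched from the phrases quoted above. What follows is an abridged reconstruction and not the runner's actual text: the real prompt is described as about 25 lines, the function name phase1_prompt_sketch is hypothetical, and the connective wording between the quoted phrases is filler.

```shell
# Abridged reconstruction of the prompt's shape; only the quoted phrases
# come from the description, the rest is illustrative filler.
phase1_prompt_sketch() {
    local repo_name="$1"
    cat <<EOF
You are a quality engineer executing Phase 1 of the quality playbook for ${repo_name}.

Read these for context:
- .github/skills/SKILL.md
- .github/skills/references/
- .github/skills/ITERATION.md

Write your full exploration findings to quality/EXPLORATION.md.

IMPORTANT: Do NOT proceed to Phase 2. Your only job is exploration and writing findings to disk.
EOF
}
```

The function takes nothing but the repo name. It carries no state from any earlier session, which is what lets it open a fresh context.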
The Phase 2 prompt (phase2_prompt) is the first prompt that depends on the previous phase having run. It opens with "You are a quality engineer continuing a phase-by-phase quality playbook run. Phase 1 (exploration) is already complete." — a narrative framing that tells the model where it stands. It then enumerates the four files the model should read to get context: quality/EXPLORATION.md, quality/PROGRESS.md, the SKILL.md Phase 2 section, and specific reference documents (requirements_pipeline.md, functional_tests.md, review_protocols.md, spec_audit.md). It then describes the expected output — the nine core artifacts — and ends with "IMPORTANT: Do NOT proceed to Phase 3 (code review). Your job is artifact generation only." Again the prompt is about 30 lines. Crucially, the prompt does not try to summarize Phase 1's findings for the model. The model is instructed to read EXPLORATION.md directly and use that as its primary source. The prompt is a manifest, not a briefing.
The Phase 3 prompt (phase3_prompt) follows the same pattern: read PROGRESS.md for state, read EXPLORATION.md with emphasis on the "Candidate Bugs for Phase 2" section, read REQUIREMENTS.md and CONTRACTS.md, read the Phase 3 section of SKILL.md and references/review_protocols.md. The output is the code review plus regression tests plus patches. The phrase "Do NOT proceed to Phase 4 (spec audit). The next phase will run the spec audit with a fresh context window." is the explicit statement of the design: the next phase will have a fresh context window, and the current phase's job is to leave behind everything the next phase needs on disk.
The phrase "with a fresh context window" appears multiple times across the Phase 3, Phase 4, and Phase 5 prompts, and it is not decorative. It is telling the model that its context is about to be dropped. The implication is that anything the model has in working memory — candidate bugs it spotted but did not write down, reasoning about the codebase that did not make it into an artifact, notes it planned to elaborate later — will be lost when the phase ends. The only things that persist are files on disk. This is the clean-context contract: each phase is responsible for persisting everything it knows that downstream phases will need, and each phase is allowed to assume that its own context is the only one it has. A model that treats context as scratch paper it can come back to will lose work at every phase boundary. A model that treats context as a staging area whose output must be committed to disk will not.
The Phase 4 prompt is where the clean-context contract gets its most interesting application. Phase 4 runs the Council of Three — three independent auditors plus a triage synthesis. In the old pipeline, all four of those sub-activities (three auditors plus triage) ran in the same context, and the model had to manage its own rotation between pretending to be three different auditors and then synthesizing their findings. Benchmark runs on earlier versions had shown this failing in specific ways: the three "independent" auditor reports would be suspiciously consistent because they were really the same model's single pass with three different headings, and the triage would sometimes confabulate findings that none of the three auditor reports actually contained. The v1.3.50 phase-by-phase architecture does not fully solve this problem within a single Phase 4 invocation — the auditors and triage still run in the same CLI session by default — but it does make the problem addressable. A future orchestrator that wants to run each auditor in a separate session and then invoke a triage session on the collected reports can do so with a --phase 4 variant and three separate invocations, each with its own fresh context. v1.4.0's orchestrator agents take exactly this step. Phase 4 as written in v1.3.50 is a single prompt, but the prompt's role in the overall architecture is to be a boundary, and later releases exploit that boundary.
The Phase 5 prompt is the most elaborate because Phase 5 itself is the most elaborate phase — it runs reconciliation, TDD execution, writeup generation, sidecar JSON generation, mechanical verification, and the terminal gate. The prompt numbers its substeps explicitly: "1. Run the Post-Review Reconciliation per references/requirements_pipeline.md. 2. Run closure verification: every BUG in the tracker must have either a regression test or an explicit exemption. 3. Write bug writeups at quality/writeups/BUG-NNN.md for EVERY confirmed bug. 4. Run the TDD red-green cycle. 5. Generate sidecar JSON. 6. If mechanical verification artifacts exist, run quality/mechanical/verify.sh. 7. Run terminal gate verification." The substep numbering is substantive — it gives the model a checklist to work through — and the closing admonition ("IMPORTANT: Do NOT skip writeup inline diffs or TDD logs. The next phase runs quality_gate.sh which will FAIL on missing patches, missing diffs, or missing TDD logs.") tells the model what will happen if it skips any substep. The prompt is explicit that Phase 6 is the gate, not Phase 5's own judgment.
The Phase 6 prompt is the shortest of the six. It instructs the model to read the Phase 6 section of SKILL.md and follow the incremental verification steps 6.1 through 6.5, run quality_gate.sh, run functional tests if available, run the file-by-file verification checklist, run the metadata consistency check, and mark Phase 6 complete. The prompt's brevity reflects Phase 6's character: it is mostly reading other things the previous phases produced and checking them, not generating new content. The Phase 6 prompt is more a dispatcher than an executor.
The uniform structure across the six prompts is the design pattern. Every prompt opens with "You are a quality engineer..." and narrates where the run stands; every prompt enumerates the context files to read, names the SKILL.md section and reference documents to consult, and describes the output expected; and every prompt except Phase 6's, which has no successor, closes with "Do NOT proceed to Phase N+1." The structure makes the prompts legible as a family, and it makes the boundaries between phases visible. A reader of any prompt can see what that phase takes in, what it produces, and where it stops. The uniformity is also what makes the runner's run_one_phase function a clean abstraction: every phase has a phaseN_prompt function, so the runner can dispatch on phase number alone.
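Dispatching on the phase number alone amounts to building a function name from the number and calling it. A minimal sketch follows, with stub functions standing in for the real here-doc prompts. The dispatcher name build_phase_prompt is hypothetical.

```shell
# Stubs standing in for the real phaseN_prompt here-doc functions.
phase1_prompt() { echo "stub prompt for phase 1"; }
phase2_prompt() { echo "stub prompt for phase 2"; }

# Build the function name from the phase number and call it if it exists.
build_phase_prompt() {
    local phase="$1" fn="phase${1}_prompt"
    if ! command -v "$fn" >/dev/null 2>&1; then
        echo "no prompt function for phase ${phase}" >&2
        return 1
    fi
    "$fn"
}
```

The naming convention does the work that a lookup table would otherwise do. That only holds because the phase names are plain integers.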
The clean-context contract has one more operational consequence worth naming. Under the contract, the phase boundaries are also cost boundaries. Each phase starts a new CLI session, and each session costs tokens. A six-phase run uses more total tokens than a one-prompt run would, because each phase re-loads SKILL.md and some subset of the reference library, and because the context-building overhead is paid six times instead of once. The tradeoff is accepted because it solves the context-exhaustion problem that motivated the rewrite. A run that would have truncated in Phase 2d under the old pipeline can complete under the new pipeline because Phase 5 starts with a fresh context and has room to do the reconciliation work properly. The token cost is the price of having each phase get the full context it needs, and v1.3.50 pays that price deliberately.
Iteration Strategies Formalized
The fourth substantial change in v1.3.50 is the consolidation of iteration strategies into a formal part of the architecture. The strategies had been landing incrementally for weeks before v1.3.50: gap as the default iteration mode in v1.3.44, the extension to unfiltered and adversarial in the same release window, parity added in v1.3.45 as an explicit reference file (ITERATION.md), further hardening in v1.3.46 with the demoted candidates manifest, and the mandatory TDD enforcement for iteration runs added in v1.3.49. By the time v1.3.50 shipped, the strategies were operational but scattered — some lived in ITERATION.md, some in SKILL.md, some in the runner's argument parser, some in reference documents. v1.3.50 pulled them together as a coherent part of the six-phase architecture.
The consolidation shows up in three places. First, the README "What's new in v1.3.50" section introduced in f888d16 lists the four strategies by name and describes their intent: "After the baseline run, the playbook supports four iteration strategies that find different classes of bugs: gap (explore areas the baseline missed), unfiltered (fresh-eyes re-review), parity (parallel path comparison), and adversarial (challenge prior dismissals and recover Type II errors). Iterations consistently add 40-60% more confirmed bugs on top of the baseline." This is the first time the strategies appear together in the README as a first-class feature with an empirical claim attached. Before v1.3.50, they were mentioned as part of the iteration-mode documentation but not positioned as a headline capability.
Second, ITERATION.md — which lives at the repo root in v1.3.50 and will move to references/ in v1.3.51 — was updated in commit 881879a to use the new phase numbering throughout. The shared-rules section that says "Continue with Phases 2–3" in v1.3.49 now says "Continue with Phases 2–6"; the TDD enforcement paragraph that referred to "the TDD Log Closure Gate in Phase 2d" now refers to "the TDD Log Closure Gate in Phase 5"; the end-of-phase suggested-next-iteration step now triggers "at the end of Phase 6" instead of "at the end of Phase 3". The strategies themselves are unchanged — the file describes gap, unfiltered, parity, adversarial, and a meta-strategy all that runs them in sequence — but the plumbing now routes through the new phase numbers. This is a small textual rename, but it is what makes the iteration system composable with the six-phase architecture. A strategy cannot say "continue with Phase 2d" when Phase 2d no longer exists, and the rename is what keeps the strategy documentation valid.
Third, the runner supports the strategies in combination with --next-iteration. The argument parser in run_playbook.sh accepts --strategy gap, --strategy unfiltered, --strategy parity, --strategy adversarial, and --strategy all. The validation clause rejects any other value. The dispatch logic sends the strategy into the per-repo iteration prompt, which uses the strategy name to look up the strategy-specific section in ITERATION.md. The all strategy is implemented at the runner level as a loop over the four named strategies in order, with an early exit if any strategy finds zero new bugs. This implementation predates v1.3.50, but v1.3.50 is where the --phase flag and the --next-iteration flag are both first-class and the relationship between them is stated explicitly. The incompatibility clause in the runner ("--next-iteration is not compatible with --phase. Iteration uses a single prompt.") makes the point directly: the two flags are mutually exclusive, and iteration runs as a single-prompt invocation, distinct from the phase-by-phase invocations of the baseline run.
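The all loop and the incompatibility check can be sketched as follows. The stub run_iteration_stub is a hypothetical stand-in that returns canned bug counts; a real runner would launch an iteration prompt and count confirmed bugs afterwards. Only the strategy order, the early-exit rule, and the error message text come from the description.

```shell
# Hypothetical stand-in: returns a canned count of new bugs per strategy.
run_iteration_stub() {
    case "$1" in
        gap) echo 3 ;;
        unfiltered) echo 0 ;;
        *) echo 1 ;;
    esac
}

# Loop over the four strategies in order; stop when one finds nothing new.
run_all_strategies() {
    local strategy new_bugs
    for strategy in gap unfiltered parity adversarial; do
        new_bugs=$(run_iteration_stub "$strategy")
        echo "${strategy}: ${new_bugs} new bugs"
        if [ "$new_bugs" -eq 0 ]; then
            echo "stopping early after ${strategy}"
            return 0
        fi
    done
}

# $1 is "true" when --next-iteration was given; $2 is the --phase value.
check_flag_compat() {
    if [ "$1" = "true" ] && [ -n "$2" ]; then
        echo "--next-iteration is not compatible with --phase. Iteration uses a single prompt." >&2
        return 1
    fi
    return 0
}
```

With the canned counts above, the loop runs gap and unfiltered and then stops, so parity and adversarial never start.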
The empirical framing from the README "What's new" section deserves its own attention because it is the first public claim that iterations add bugs. The specific number — 40 to 60 percent more confirmed bugs on top of the baseline — is not documented in the commit messages for v1.3.50 itself; it emerges from the accumulated benchmarking across v1.3.44 through v1.3.49, when each strategy was landed and tested in turn. The claim is a retrospective summarization of that benchmarking, reported at the release where the strategies became a coherent feature. The README also reports validation across three codebases — Express.js with 14 confirmed bugs, Gson with 9, Linux virtio with 8 — each with 100% TDD red-phase coverage and zero gate failures. These are not new benchmark results landing in v1.3.50; they are the cumulative result of running the v1.3.44–v1.3.49 iteration machinery against those codebases and finding that the combined baseline-plus-iterations approach finds more bugs than the baseline alone.
The implication of the consolidation is that a full v1.3.50 run is not just six phases; it is six phases plus up to four iteration passes. A user who wants the full yield runs the baseline (Phases 1–6), then runs iteration with --strategy gap, then iteration with --strategy unfiltered, then iteration with --strategy parity, then iteration with --strategy adversarial, each pass re-running Phases 2 through 6 against a merged exploration that combines the baseline's EXPLORATION.md with the strategy-specific EXPLORATION_ITER{N}.md. This is a substantially larger operation than a single six-phase run, and the token cost compounds, but it is what the v1.3.50 architecture is designed to support.
The iteration cycle order — gap → unfiltered → parity → adversarial — is itself a design choice codified in ITERATION.md. Gap comes first because it targets the most obvious omission: code areas the baseline did not explore at all. Unfiltered comes second because it revisits the same scope without the structural biases that shaped the baseline's exploration, catching bugs that "structure suppresses." Parity comes third because it compares parallel implementations of the same operation and requires both to exist and to be understood — a comparison that depends on the two earlier strategies having surfaced both implementations. Adversarial comes last because it challenges the dismissed and demoted findings from all three earlier iterations — the raw material it feeds on is the demoted-candidates manifest that the earlier strategies populate. The ordering is not arbitrary; each strategy's effectiveness depends on what the preceding strategies have produced. A reader of the pipeline who wants to understand why iteration works at all must read the cycle order as part of the design, and v1.3.50 is the release where the cycle order becomes canonical.
A small but significant piece of the iteration-mode maturation in v1.3.50 is that the TDD red-green requirement applies to iteration runs unchanged from baseline runs. This was added in v1.3.49 as an explicit correction — prior to v1.3.49, iterations sometimes skipped the full TDD cycle on the theory that iterations are "just additions" to a previously-verified baseline. The correction made the TDD cycle mandatory, and v1.3.50's six-phase architecture preserves the correction by routing the closure gate through Phase 5. An iteration run that adds five new bugs must produce five BUG-NNN.red.log files and, if fix patches exist, five BUG-NNN.green.log files. The quality gate script enforces this, and it fails any iteration run that has a bug in BUGS.md without a corresponding red-phase log. The v1.3.50 architecture inherits this enforcement and makes it continuous with the baseline's closure semantics: iterations are not a second-class pipeline; they run the same gates against the same standards.
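The red-phase rule reduces to a set comparison between the bug IDs in BUGS.md and the log files on disk. The sketch below is not quality_gate.sh. It assumes that bug IDs appear in BUGS.md as BUG- followed by digits and that the logs sit in a single directory passed as the second argument; the function name check_red_logs is hypothetical.

```shell
# Assumptions: IDs look like BUG-<digits>; logs live in one directory ($2).
check_red_logs() {
    local bugs_file="$1" log_dir="$2" id missing=0
    if [ ! -f "$bugs_file" ]; then
        echo "BUGS.md not found: ${bugs_file}" >&2
        return 1
    fi
    for id in $(grep -oE 'BUG-[0-9]+' "$bugs_file" | sort -u); do
        if [ ! -f "${log_dir}/${id}.red.log" ]; then
            echo "missing red-phase log for ${id}" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

The check makes no distinction between baseline bugs and iteration bugs, which mirrors the claim that iterations run against the same standard.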
The Agent File and the Awesome-Copilot Surface
The agents/quality-playbook.agent.md file added in f888d16 is 54 lines, which makes it one of the smallest of the v1.3.50 changes by line count. Its significance is disproportionate to its size. The file is the skill's first externalized entry point: the declaration that the skill can be invoked by name through a host environment's agent registry, distinct from being read as a SKILL.md file by a model that happened to open it.
The file's frontmatter names the agent ("Quality Playbook") and gives it a description that mirrors the SKILL.md description but is trimmed to a single paragraph: "Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with patches and TDD verification. Finds the 35% of real defects that structural code review alone cannot catch." The description is notable for two reasons. First, it names the headline capabilities — requirement derivation, functional tests, three-pass code review, Council of Three, consolidated bug report, TDD verification. Second, it quotes the 35% statistic — the fraction of real defects that structural code review alone cannot catch — which is the empirical framing the skill has been using since v1.3.20 to motivate the requirements-plus-audit approach. The agent file is advertising the skill to host environments, and the advertisement uses the skill's own long-running rhetorical frame.
The body of the agent file is prescriptive about how the host should behave. "Before you start" tells the host to check for SKILL.md in one of two conventional locations: .github/skills/quality-playbook/SKILL.md or .github/skills/SKILL.md. It also tells the host to check for the reference files directory alongside SKILL.md. "If the skill is not installed" gives the host a specific installation message to print to the user, including a link to the awesome-copilot skills index and a link to the quality-playbook GitHub repository, plus copy-paste instructions for installing the skill into .github/skills/quality-playbook/. "If the skill is installed" tells the host to read SKILL.md and every file in the references/ directory, and then follow the skill's six-phase instructions "exactly — it defines six phases, each with entry gates and exit gates. Do not skip phases or reorder them." The six-phase language in the agent file is the same vocabulary the rest of v1.3.50 uses. The agent file does not re-explain the phases; it defers to SKILL.md and assumes the host will follow.
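The "Before you start" check is a two-location lookup, and a shell sketch makes the order concrete. The two paths come from the description above; the function names find_skill_md and has_references are hypothetical, and a host agent would perform the equivalent with its own file tools.

```shell
# Print the first SKILL.md found in the two conventional locations.
find_skill_md() {
    local repo_dir="$1" candidate
    for candidate in \
        "${repo_dir}/.github/skills/quality-playbook/SKILL.md" \
        "${repo_dir}/.github/skills/SKILL.md"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Succeeds when a references/ directory sits alongside the given SKILL.md.
has_references() {
    [ -d "$(dirname "$1")/references" ]
}
```

A non-zero return from find_skill_md corresponds to the "If the skill is not installed" branch, where the host prints the installation message.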
The "What you produce" section lists eight artifacts — REQUIREMENTS.md, QUALITY.md, functional tests, BUGS.md, code review output, spec audit output, TDD verification, AGENTS.md — that match the core artifact contract from the SKILL.md artifact table. This is a user-facing summary; a prospective user of the host environment can read the agent file and know what they will get, without needing to open SKILL.md themselves.
The "How to invoke" section is where the agent file's phase-by-phase positioning shows up. It lists three example prompts for full-pipeline invocation ("Run the quality playbook for this project", "Generate a complete quality system for this codebase", "Find bugs that require understanding the spec, not just the code") and then recommends phase-by-phase execution for large codebases, with two example prompts: "Run quality playbook phase 1 — explore the codebase" and "Run quality playbook phase 3 — code review". The recommendation is explicit about the context-limit motivation, and it positions phase-by-phase as the suggested mode for large codebases. This is the first time a user-facing surface of the skill recommends phase-by-phase execution for realistic work.
The tools declaration in the frontmatter — tools: [search/codebase, web/fetch] — is a minimal capability declaration. The skill does not request shell execution, file writing, or package management through the agent contract, because the agent file's job is to tell the host how to invoke the skill, and the skill itself requests the tools it needs through SKILL.md's own conventions. The minimality of the tools list is deliberate: an agent file that requested broad capabilities would be harder to approve in host environments that require tool-permission review, and the skill's actual tool usage is negotiated inside the execution, not declared in the agent frontmatter.
The awesome-copilot format itself is worth naming because its conventions shape what the file can do. awesome-copilot is a community-maintained index of AI-coding skills, agents, and workflows, and it uses the agents/ directory convention to register named agents with host environments that support the format. Adding quality-playbook.agent.md to the awesome-copilot registry makes the skill discoverable and invocable by name in those hosts. This is the ecosystem-integration move that motivated the flat phase numbering and the --phase flag — a skill that exposes phases as addressable units can register those phases with a host, and a host can offer "Run phase 3 of quality playbook" as a user-facing option. The agent file, the flat numbering, and the runner flag are three aspects of the same integration work.
The agent file also sets a precedent that later releases follow. v1.4.0 adds orchestrator agents in agents/ that run the phases programmatically (not just through a chat interface). The quality-playbook.agent.md file is the template — the orchestrator agents of v1.4.0 share its structure, its YAML frontmatter, its prescriptive body, and its phase-by-phase invocation pattern. The v1.3.50 agent file is the first, and the v1.4.0 orchestrator agents are its descendants. Without the v1.3.50 precedent, the v1.4.0 orchestration work would have needed to invent a file format and a host-interaction pattern; with the precedent, it just needed to extend.
Phase Renumbering — The Mechanics in Detail
The phase rename that commit 881879a effects deserves one more pass, because the diff is instructive and the detail matters for anyone reading SKILL.md today and trying to understand why certain sections are structured the way they are.
The Plan Overview section at the top of SKILL.md is where the rename is most visible. Before the commit, the section had three phase bullets: "Phase 1 (Explore)", "Phase 2 (Generate)", and "Phase 3 (Verify)". Phase 2's bullet bundled artifact generation, code review, spec audit, and reconciliation into a single paragraph ending "Then execute the code review (Phase 2b), spec audit (Phase 2c), and reconciliation (Phase 2d)." After the commit, the section has six phase bullets. Phase 1 is unchanged. Phase 2's bullet is trimmed to just artifact generation. Phase 3 (Code Review), Phase 4 (Spec Audit), Phase 5 (Reconciliation), and Phase 6 (Verify) each get their own bullet. The visual effect is that the plan overview now shows the full six-step sequence, not three steps with one of them implicitly expanding into four sub-steps.
The Artifact Contract table — the canonical registry added in v1.3.33 that lists every artifact the gate validates — was updated in lockstep with the rename. Rows for regression tests, BUG tracker, regression patches, fix patches, code review reports changed from "Phase 2b" to "Phase 3". Rows for triage probes, spec audit reports changed from "Phase 2c" to "Phase 4". Rows for completeness report, bug writeups, TDD sidecar, TDD red/green logs, integration sidecar, mechanical verify script, verify receipt changed from "Phase 2d" to "Phase 5". The table now reads cleanly in integer phases: every artifact is produced in Phase 2, 3, 4, 5, or "Throughout". No artifact is produced in a letter-suffixed sub-phase. The table was already coherent under the old numbering, but it is visibly cleaner under the new numbering.
The PROGRESS.md template in SKILL.md was updated to use the new phase-completion checkbox list. The old list was Phase 1, Phase 2, Phase 2b, Phase 2c, Phase 2d, TDD logs, Phase 3. The new list is Phase 1, Phase 2, Phase 3, Phase 4, Phase 5, TDD logs, Phase 6. The structure is preserved — five content phases plus a TDD-logs line plus a verification phase — but the numbering flattens out. This is the template that every benchmark run's PROGRESS.md uses, so the flattening propagates into every generated artifact going forward. A PROGRESS.md from a v1.3.50 run reads differently from one from a v1.3.49 run, and the difference is visible at a glance.
The verification reference document, references/verification.md, was updated throughout. The title changed from "Verification Checklist (Phase 3: Verify)" to "Verification Checklist (Phase 6: Verify)". Individual benchmarks that referred to specific earlier phases were updated: benchmark 31 ("Phase 2c Triage File Exists") became "Phase 4 Triage File Exists"; benchmark 34 references to "Phase 2d marked complete" became "Phase 5 marked complete"; benchmark 37 ("Phase 3 Mechanical Closure Uses Bash") became "Phase 6 Mechanical Closure Uses Bash"; benchmark 40 ("Artifact File-Existence Gate Passed") was updated from "Before Phase 2d is marked complete" to "Before Phase 5 is marked complete"; benchmark 42 ("Script-Verified Closure Gate Passed") was updated from "Before Phase 2d is marked complete" to "Before Phase 5 is marked complete". The verification checklist at the end of the document — the "Use this as a final sign-off" list — also received matching updates for every bulleted item that referenced a phase.
The review-protocols reference, references/review_protocols.md, got a smaller but pointed change: a section heading that used to read "### Phase 3: Results" (a heading inside the code review protocol's report structure) now reads "### Phase 6: Results", matching the new numbering. The spec-audit reference, references/spec_audit.md, was similarly updated — one place where PROGRESS.md language was quoted with an old phase number was corrected to the new.
The TOOLKIT.md file in ai_context/ got the most systematic rename of all the reference documents. TOOLKIT.md describes the playbook's pipeline to an AI assistant helping a user set up the skill, and it used the old numbering at section-heading level. Its sections changed from "### Phase 2b: Three-pass code review" to "### Phase 3: Three-pass code review", from "### Phase 2c: Council of Three" to "### Phase 4: Council of Three", from "### Phase 2d: Reconciliation and TDD" to "### Phase 5: Reconciliation and TDD", from "### Phase 3: Self-verification" to "### Phase 6: Self-verification". TOOLKIT.md is the document an AI assistant reads when a user asks "how does the quality playbook work?", and the rename means that assistant-mediated explanations of the pipeline to users now use the new phase numbers uniformly. Any user who was taught the pipeline under the old numbering by an AI assistant reading an old TOOLKIT.md will see the new numbering the next time they ask.
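The mapping these documents went through is small enough to write down. The following is a minimal Python sketch, not code from the repository: PHASE_MAP and renumber_phases are hypothetical names, and the sketch assumes phase references always appear in the form "Phase <label>".

```python
import re

# Old-to-new phase labels as described in the text above.
# "Phase 1", "Phase 2", and the "Phase 2a" integrity gate keep their names.
PHASE_MAP = {"2b": "3", "2c": "4", "2d": "5", "3": "6"}

# One alternation applied in a single pass, so no label is rewritten twice.
_OLD_PHASE = re.compile(r"\bPhase (2b|2c|2d|3)\b")


def renumber_phases(text: str) -> str:
    """Rewrite old phase labels to the flat 1-6 scheme in one pass."""
    return _OLD_PHASE.sub(lambda m: "Phase " + PHASE_MAP[m.group(1)], text)
```

Doing the substitution in one pass matters: replacing "Phase 2b" with "Phase 3" and then "Phase 3" with "Phase 6" as two sequential steps would turn the old Phase 2b into Phase 6.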
The scale and comprehensiveness of the rename are what make it effective. A partial rename that updated SKILL.md but left the references inconsistent would have produced a skill whose internal cross-references were broken. The v1.3.50 rename touches every surface simultaneously, and the commit message names the affected surfaces explicitly: "All cross-references updated: SKILL.md, ITERATION.md, README.md, TOOLKIT.md, quality_gate.sh, and 3 reference files." The three reference files are review_protocols.md, spec_audit.md, and verification.md. The rename is nearly complete at the moment of the first commit: the one stray quality_gate.sh comment was corrected in f888d16 within the same release, and no follow-up fix-ups lingered into v1.3.51 or later. A reader opening any of the v1.3.50 files sees consistent phase numbering throughout, apart from a single "3/2c" remnant in the Phase 2a prose.
One detail worth noting about the rename is that it kept the old ## Phase 2 completion gate (mandatory) language as a Phase 3 entry condition. The Phase 2 completion gate is the five-check gate that was introduced in v1.3.35 to force the model to verify artifact existence before proceeding to code review. Under the old numbering it was Phase 2 completion gate → Phase 2b entry; under the new numbering it is Phase 2 completion gate → Phase 3 entry. The gate's content is unchanged. Its checks include: all nine core artifacts must exist, REQUIREMENTS.md must contain specific file-path conditions, verify.sh must have been executed if mechanical artifacts exist, and PROGRESS.md must mark Phase 2 complete with a timestamp. The only update is the phrase "before proceeding to Phase 2b" becoming "before proceeding to Phase 3". This preservation is important because it means the gate model, the disk-observable precondition pattern, continues to work identically under the new numbering. The rename is a surface change to the numbers, not a redesign of the gate semantics.
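The disk-observable precondition pattern can be sketched in a few lines of Python. This is an illustration only: check_gate, its parameters, and the "[x] Phase 2" marker used in the note below are hypothetical, and the real gate lives in SKILL.md prose and the shell runner rather than in code like this.

```python
from pathlib import Path


def check_gate(root, required_files, required_markers=None):
    """Return a list of failure messages; an empty list means the gate passes.

    required_files: file names that must exist under root.
    required_markers: maps a file name to a substring it must contain.
    """
    root = Path(root)
    failures = []
    for name in required_files:
        if not (root / name).is_file():
            failures.append(f"missing file: {name}")
    for name, marker in (required_markers or {}).items():
        path = root / name
        # A missing file is treated as empty text, so the marker check fails too.
        text = path.read_text(encoding="utf-8") if path.is_file() else ""
        if marker not in text:
            failures.append(f"{name} does not contain {marker!r}")
    return failures
```

A call such as check_gate(run_dir, ["REQUIREMENTS.md", "PROGRESS.md"], {"PROGRESS.md": "[x] Phase 2"}) reads only the filesystem, so it gives the same answer in any session. That property is what lets a later phase trust an earlier one without sharing its context window.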
The renumbering also preserved the internal structure of the mechanical-verification integrity gate, which has its own sub-phase name ("Phase 2a") that persists through the rename. The integrity gate is not part of the main phase sequence; it is an immediate in-phase check that runs whenever a *_cases.txt file is written. In the diff, the Phase 2a language is preserved: "Do not advance to Phase 3/2c until verify.sh exits 0." The "3/2c" hybrid is a single remnant in the Phase 2a prose: a cross-reference that ended up with a new phase number and an old phase number in the same sentence. This is likely a transcription artifact that the commit did not fully clean up. It sits in a deep corner of SKILL.md and has not been corrected in subsequent releases, so it is part of the historical record of this release.
This Is The Foundation v1.4+ Builds On
v1.3.50 is the version that every later release works with rather than around. The structural shape it put in place has been preserved intact, and the extensions that v1.4+ makes are always extensions, never revisions. A chronology of what the release froze and what later releases added is the clearest way to see why v1.3.50 is load-bearing.
The six-phase numbering has not been changed in any release since. v1.3.51 is a version bump plus a file relocation (moving ITERATION.md from repo root to references/). v1.4.0 adds orchestrator agents that execute phases programmatically, but the phases they execute are the same six phases by the same names. v1.4.1 adds recheck mode, which is a narrow verification pass over a subset of previously-confirmed bugs; it runs as a specialized invocation but it still operates within the six-phase framework. v1.4.2 introduces the Sonnet bootstrap self-audit, which uses the full six-phase pipeline to audit the playbook itself against its own requirements. v1.4.3 splits functional_tests into per-language references and adds the challenge gate for false-positive detection; the challenge gate runs as part of Phase 3 but does not alter the phase structure. v1.4.4 extracts orchestrator hardening into a shared protocol reference and ports quality_gate.sh to quality_gate.py; the gate script moves directories and languages, but its role in Phase 6 is unchanged. v1.4.5 bumps versioning, updates benchmark protocols, and finalizes the Python port. None of these releases revisit the phase numbering, the --phase flag model, the exit-gate pattern, the per-phase prompt structure, or the agent file format. All of them assume v1.3.50 is the foundation.
The quality gate script has been extended but not restructured. quality_gate.sh grew from 632 lines at v1.3.50 to approximately 900 lines by v1.4.3 before being ported to Python. Every extension added new checks; none removed checks. The Python port in v1.4.4 preserves the check set and adds the ability to parse JSON structurally rather than through grep-and-sed. The gate's role as the Phase 6 conformance arbiter is the same in v1.4.5 as in v1.3.50. The gate is still the script the model is told to defer to for conformance determination, and the skill's prose still enumerates the "common FAILs" the script reports. The v1.3.50 decision to make the gate authoritative has been reaffirmed at every subsequent release by continuing to route new checks through it.
The agent runner has been extended substantially. v1.4.0 adds orchestrator agents that run the six phases in an automated sweep without human invocation of each phase. v1.4.4 hardens the orchestrator against single-context collapse (the failure mode where a single orchestrator session inadvertently accumulates all six phases' context). v1.4.5 ports the runner to Python for better cross-platform support. But every extension treats the --phase flag and the six-phase loop as the invariant. The orchestrator agents' principal addition is a new driver on top of --phase all, not a replacement for it. The Python port's principal addition is type safety and better error handling around the same phase dispatch logic. The run_one_phase / check_phase_gate / phaseN_prompt structure is preserved across the port; only the implementation language changes.
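The invariant described here, a loop that checks a phase's entry gate and then runs that phase, can be sketched as follows. The sketch is Python rather than the runner's shell, and run_phases, gate, and run are hypothetical stand-ins, not the signatures of run_one_phase or check_phase_gate.

```python
from typing import Callable, Iterable, List


def run_phases(
    phases: Iterable[int],
    gate: Callable[[int], List[str]],
    run: Callable[[int], None],
) -> List[int]:
    """Run phases in order, checking each phase's entry gate first.

    gate(phase) returns a list of failure messages (empty means pass).
    Returns the phases that completed; stops at the first gate failure.
    """
    completed = []
    for phase in phases:
        failures = gate(phase)
        if failures:
            print(f"Phase {phase} gate failed: " + "; ".join(failures))
            break
        run(phase)  # stands in for a separate CLI invocation per phase
        completed.append(phase)
    return completed
```

Passing the gate and the phase runner in as callables keeps the loop itself independent of how a phase is executed, which is the sense in which a new driver or a port to another language can leave the dispatch logic unchanged.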
The agents/ directory has grown. v1.3.50 added quality-playbook.agent.md. v1.4.0 added orchestrator agent files. v1.4.3 added a Claude-specific variant (quality-playbook-claude.agent.md). v1.4.4 duplicated critical hardening inline into the agent files. The directory is now a working collection of agent declarations, each using the format that the v1.3.50 agent file established. The format — YAML frontmatter with name, description, tools; a prescriptive body with "Before you start / If not installed / If installed / What you produce / How to invoke" structure — is unchanged. New agents conform to the template; the template has not been re-designed.
The iteration strategies have been preserved as-is. Gap, unfiltered, parity, and adversarial remain the four strategies. The cycle order remains gap → unfiltered → parity → adversarial. ITERATION.md is the authoritative reference (moved to references/ in v1.3.51 but otherwise unchanged in substance). The TDD enforcement for iteration runs that landed in v1.3.49 and was integrated into Phase 5 in v1.3.50 has continued to be enforced across all later releases. No new strategy work has happened, because the four-strategy cycle appears to cover the bug classes iteration is positioned to find. A strategy that does not exist in v1.3.50 also does not exist in v1.4.5.
The plan-overview section of SKILL.md has preserved its v1.3.50 structure. Every release since v1.3.50 opens with the six phase bullets — Phase 1 (Explore), Phase 2 (Generate), Phase 3 (Code Review), Phase 4 (Spec Audit), Phase 5 (Reconciliation), Phase 6 (Verify) — followed by the critical-dependency-chain statement and the MANDATORY FIRST ACTION paragraph. The version number changes. The phase descriptions occasionally gain a sentence. But the structure — five content phases plus one verification phase, with the reader told to explain the plan back in their own words — is the v1.3.50 structure, and it has not been restructured.
The plan-overview language also does something subtler that v1.4+ has preserved: it asserts that the pipeline is addressable by phase. "Phase 3 (Code Review): Run the three-pass code review against HEAD. Write regression tests for every confirmed bug. Generate patches." A reader of that sentence can form a concrete expectation of what Phase 3 does, what it produces, and when it stops. They can say "run phase 3" and be understood. The addressability is a product of the phase numbering and the one-sentence description; it was introduced in v1.3.50 and it has held up through every later release. A hypothetical user who learns the phase structure from the v1.3.50 plan-overview can read v1.4.5's plan-overview without retraining.
There is a meta-observation worth making. v1.3.35 (mandatory exploration) and v1.3.50 (six-phase architecture) are the two versions that define the skill's current shape. v1.3.35 decided what the skill's foundational discipline is: exploration produces specificity, specificity produces requirements, requirements produce bug-finding. v1.3.50 decided what the skill's operational structure is: six numbered phases, each independently runnable, each gated by disk-observable preconditions, each producing artifacts for the next phase to consume, each orchestrated by a runner that can invoke them individually or in sweep. Every other release between v1.3.35 and v1.4.5 is a refinement of these two versions' choices. The two foundational decisions are roughly three weeks apart, but they are not independent — the exploration discipline introduced by v1.3.35 produces an artifact (EXPLORATION.md) that the six-phase pipeline relies on as Phase 1's output, and the six-phase pipeline enforces exploration by making Phase 2 unable to start without a validated EXPLORATION.md on disk. The two versions are complementary halves of a single design: one decides what the quality of each phase should be, the other decides how the phases should fit together.
A reader of the skill today, trying to understand why things work the way they do, can point to v1.3.35 for any question about what the pipeline's discipline is and to v1.3.50 for any question about how the pipeline's structure is organized. The answers do not change as the skill version advances, because the decisions made in these two releases have been inherited by every version that followed. v1.3.50 is the larger of the two structurally — it touches more files, changes more surface area, and introduces more new infrastructure — but its significance is that what it put in place has remained in place. A design document about v1.4.5's orchestrator agents must reference v1.3.50's phase numbering. A design document about v1.4.4's quality_gate.py must reference v1.3.50's Phase 6 invocation slot. A design document about v1.4.3's per-language functional tests must reference v1.3.50's Phase 2 artifact contract. Nothing in the later architecture stands without the v1.3.50 foundation.
If a single sentence captures the historical role of v1.3.50, it is this: v1.3.50 is the version where the Quality Playbook stopped being a pipeline with a main track and three sub-phases and became a pipeline with six first-class phases, each runnable in its own context window and each guarded by a disk-verifiable gate, with a canonical quality gate script as the arbiter of Phase 6 conformance and a named-agent front door for host environments that want to invoke it by name. Every release since has built on that structure rather than reshaping it.
Provenance
Primary commits:
- 881879a: "v1.3.50: Renumber phases 1-6, add --phase flag to runner" (2026-04-14). Renumbered phases from 1 / 2 / 2b / 2c / 2d / 3 to 1 / 2 / 3 / 4 / 5 / 6 across all surfaces. Added --phase flag to repos/run_playbook.sh with phase values 1 through 6 and the special values all and comma-separated ranges. Added check_phase_gate function with disk-observable entry checks per phase. Added run_one_phase and run_one_phased functions to drive per-phase execution with per-phase prompts (phase1_prompt through phase6_prompt). Added phase_label helper. Preserved --single-pass and --multi-pass as back-compat aliases (the latter now expanding to --phase all). Added incompatibility check between --next-iteration and --phase. Files changed: ITERATION.md (6 lines), README.md (14 lines), SKILL.md (176 lines), ai_context/TOOLKIT.md (8 lines), references/review_protocols.md (2 lines), references/spec_audit.md (2 lines), references/verification.md (28 lines). Total: 122 insertions, 114 deletions across 7 files in the SKILL-and-references surface; additional 250+ lines of runner changes in a separate file. Net new substantive content: phase-by-phase prompts and exit-gate function bodies.
- f888d16: "v1.3.50: Six-phase architecture, iteration strategies, quality gate, agent" (2026-04-15). Bumped version stamps from 1.3.49 to 1.3.50 in SKILL.md metadata, SKILL.md version-stamp template, SKILL.md canonical JSON schema examples, SKILL.md regression test header example, and README.md. Added "What's new in v1.3.50" section to README.md with headline features: six-phase architecture with clean context windows, phase-by-phase runner with --phase flag, four iteration strategies, TDD red-green verification, quality gate script, benchmark results across three codebases (Express.js 14 bugs, Gson 9 bugs, Linux virtio 8 bugs). Added new file agents/quality-playbook.agent.md (54 lines) in awesome-copilot agent format with YAML frontmatter and prescriptive body. Rewrote repos/run_playbook.sh argument parsing and dispatch to fully implement --phase behavior (496 lines added, 157 deleted; the largest single-file change in the release). Fixed stray comment in repos/quality_gate.sh that still read "Phase 3 verification step" to now read "Phase 6 verification step". Total: 418 insertions, 157 deletions across 5 files.
Author: Andrew Stellman, co-authored by Claude Opus 4.6.
Dates: 881879a committed 2026-04-14 19:22 EDT. f888d16 committed 2026-04-15 09:17 EDT. The two commits are about 14 hours apart.
Release scope — aggregate across both commits:
- New architectural concept: six numbered phases. Phase 1 (Explore), Phase 2 (Generate), Phase 3 (Code Review), Phase 4 (Spec Audit), Phase 5 (Reconciliation), Phase 6 (Verify). Old names 2b, 2c, 2d, and the old Phase 3 verification slot, are retired.
- New runner feature: --phase flag. Supports --phase 1 through --phase 6, --phase all, and comma-separated ranges like --phase 3,4,5. Each invocation runs in its own CLI session with its own context window, reads artifacts from disk, and writes artifacts to disk for subsequent phases.
- New runner feature: per-phase exit gates. check_phase_gate implements disk-observable preconditions per phase (file existence, line counts, directory non-emptiness). Gate failures abort the repo with a clear message.
- New runner feature: per-phase prompts. phase1_prompt through phase6_prompt produce here-doc prompts that tell the model which SKILL.md section to read, which artifacts to produce, and where to stop.
- New file: agents/quality-playbook.agent.md. 54-line agent declaration in awesome-copilot format, with name, description, tools list, and prescriptive body explaining how hosts should invoke the skill.
- Quality gate elevation. quality_gate.sh header comment corrected from "Phase 3" to "Phase 6", matching the script's new canonical invocation slot. Skill prose throughout SKILL.md routes through the gate as the conformance arbiter for Phase 5 closure and Phase 6 verification.
- Iteration strategy consolidation. Gap, unfiltered, parity, adversarial strategies documented as first-class features in the README "What's new" section, with an empirical claim of 40–60% additional confirmed bugs over baseline. ITERATION.md rewritten to reference the new phase numbers throughout.
- PROGRESS.md template update. Checkbox list flattened from "Phase 1, Phase 2, Phase 2b, Phase 2c, Phase 2d, TDD logs, Phase 3" to "Phase 1, Phase 2, Phase 3, Phase 4, Phase 5, TDD logs, Phase 6".
- Artifact contract table update. Every artifact's "Created In" column updated to use new phase numbers.
- Plan-overview section restructure. SKILL.md opens with six phase bullets instead of three, each phase given its own one-line description.
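The accepted --phase values listed above can be illustrated with a short parser. This is a Python sketch of the described behavior, not the shell runner's actual argument handling; parse_phase_spec and parse_args are hypothetical names, and running the selected phases in ascending order is an assumption.

```python
import argparse

VALID_PHASES = [str(n) for n in range(1, 7)]


def parse_phase_spec(spec):
    """Turn 'all', '3', or '3,4,5' into an ascending list of phase numbers."""
    if spec.strip() == "all":
        return list(range(1, 7))
    phases = set()
    for part in spec.split(","):
        part = part.strip()
        if part not in VALID_PHASES:
            raise ValueError(f"invalid phase {part!r}: expected 1-6 or 'all'")
        phases.add(int(part))
    return sorted(phases)


def parse_args(argv):
    """Parse runner-style arguments and reject --phase with --next-iteration."""
    parser = argparse.ArgumentParser(prog="runner-sketch")
    parser.add_argument("--phase", type=parse_phase_spec, default=None)
    parser.add_argument("--next-iteration", action="store_true")
    args = parser.parse_args(argv)
    if args.phase is not None and args.next_iteration:
        parser.error("--next-iteration cannot be combined with --phase")
    return args
```

In this sketch the retired labels fail loudly: a value such as 2b is rejected, which mirrors the point that only the flat numbering is addressable from the command line.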
Empirical framing from the release: The README "What's new" section reports validation across three codebases — Express.js (14 confirmed bugs), Gson (9 confirmed bugs), Linux virtio (8 confirmed bugs) — all with 100% TDD red-phase coverage and zero gate failures. The 40–60% figure for iteration yield is the cumulative finding across v1.3.44–v1.3.49 benchmarking, reported at v1.3.50 when the iteration strategies became a coherent first-class feature.
Git is authoritative. All claims in this document are grounded in the commit diffs reviewed during its preparation. Where chat history and git disagree, git wins. The phase numbering mapping, the runner flag behavior, the exit-gate conditions, the per-phase prompt structure, and the agent file content are all drawn from the commits 881879a and f888d16 and from the SKILL.md, run_playbook.sh, quality_gate.sh, ITERATION.md, README.md, and reference files in the state they held immediately after f888d16 landed.