thrifty

June 10, 2026 · View on GitHub

Tiered-delegation task execution for Claude Code. A planner model writes the spec into sprints, a cheap model executes and self-verifies against the gate — same gate quality as the strong model, ~64% cheaper (≈⅓ the cost) (benchmarked).

Offload each sprint of work to the weakest model that can do it correctly.

The idea

Most of the work in a task is pattern-following execution, not reasoning. thrifty concentrates planning in one model and pushes execution down to a cheap one, with the gate (tests / a checklist) as the trust contract — independently re-run, never self-reported. This generalizes "tests as the source of truth" to any task, code or not.

Benchmarked architecture

The configuration that won the cost/quality bake-off (spec → working code, 7 tasks across JS / Python / Go / prose, gate-verified — see eval/RESULTS.md and experiments/):

spec ──▶ Sonnet  writes contract (pins cross-sprint + genuinely-ambiguous decisions)
                 + sprints.jsonl (one self-contained unit of work each)
            │
            ▼
         Haiku   one cached agent builds every sprint, runs the gate, and self-fixes
            │
            ▼
         Sonnet  a SCOPED patch — only if a specific failure is left (in practice: never;
                 the Haiku agent self-fixed to green on all 7 tasks)

Result: ~64% cheaper than Opus building the same spec, at equal gate quality — on large, multi-unit builds (the win grows with size). On trivial builds the planning overhead makes thrifty cost more than a single capable agent, so there's a crossover below which you should just use one strong model directly (see Usage). Two things make it work: a single cached agent reuses the contract across turns (cheap, no cold-call bloat), and the contract pins every decision that crosses a sprint boundary so the executor never invents a system-level choice (leave one ambiguous and the executor writes contradictory tests). Opus is not in the execution loop.

Install

thrifty ships as a Claude Code plugin (bundles all six skills):

/plugin marketplace add 2389-research/thrifty
/plugin install thrifty@thrifty

Then trigger it with "thrifty", "delegate this", or "tiered build".

The subagent-based skills (thrifty, thrifty-plan/brief/execute/check) need no external runtime. The lean thrifty-dispatch flow shells out to skills/thrifty-dispatch/dispatch.py, so it additionally requires Python 3 on PATH.

To hack on it locally instead, symlink the skills into ~/.claude/skills/:

for d in skills/*/; do ln -s "$PWD/$d" ~/.claude/skills/; done

Lineage

thrifty generalizes three existing systems to arbitrary tasks:

speed-run — offload first-pass generation to a fast model; the strong model does architecture and surgical fixes, never wholesale regeneration.
pipelines/local_code_gen — a strong architect pins every cross-sprint decision in a contract so the weak executor never makes a system-level choice.
Noospheric Orrery — proof of the philosophy: Sonnet never extracts; Haiku does all the work; Sonnet writes and refines the spec until the cheap model reliably passes.

Where local_code_gen writes 600-line byte-pinned contracts for a tiny local model (qwen3.6 / gemma), thrifty deliberately sits well above that floor. Haiku is far stronger, so the contract pins only what is cross-sprint AND genuinely ambiguous — the seams, not the interiors. Since the planner's (Sonnet) output is the expensive part, terseness is the goal: pin the few decisions two capable sprints would otherwise diverge on, and let Haiku infer the rest.

Skills

Skill	Role	Model	Runs as
`thrifty`	orchestrator (subagent flow)	—	this session
`thrifty-dispatch`	orchestrator (lean dispatch flow — default)	—	this session
`thrifty-plan`	director's planning discipline	Sonnet	this session
`thrifty-brief`	expand a unit spec into a brief (split tier)	Sonnet	dispatched subagent
`thrifty-execute`	execute one unit	Haiku	dispatched subagent
`thrifty-check`	verify + fix one unit	Sonnet	dispatched subagent

Planning tiers (who writes the briefs)

direct — the architect writes the contract and every brief. Fewest moving parts, one translation boundary. Best for few sprints / subtle, correctness-critical work.
split — the architect (director) writes the contract + terse unit specs; parallel Sonnet brief-writers expand them (mirrors local_code_gen's contract → sprint flow). Brief-writing runs in parallel and the architect's context stays lean. Best for many sprints (≳ 6) / mechanical briefs / scale.
hybrid — the architect writes the 1–2 subtle sprints' briefs, delegates the routine rest.

Rough rule: direct below ~5 sprints, split above — but it's per-task, and the subtle sprints can stay direct even in a split run. Authority follows the tier: the architect owns the contract (cross-sprint), the brief-writer owns its unit (within-sprint), so a within-sprint fix stays with a Sonnet brief-writer and only contract defects reach the architect.

Usage

Say "thrifty", "delegate this", or "tiered build" on a multi-part task.

Which flow runs: the agent should try the lean dispatch flow (thrifty-dispatch) first — it's the benchmarked, cheapest path — and fall back to the subagent flow below only when the dispatch runtime is unavailable (python3 / claude -p can't be spawned) or the task genuinely needs per-unit parallel verification. Cost ordering, measured: dispatch < subagent < direct-strong-model on large builds; on trivial builds the planning overhead makes either thrifty flow cost more than just one capable agent — so below ~a few real units, skip thrifty. The thrifty skill now points the agent at dispatch by default.

For the subagent flow, the orchestrator will: frame the goal, plan (write CONTRACT.md + per-sprint briefs with acceptance criteria, confirmed with you), dispatch Haiku executors in parallel where dependencies allow, verify each unit by criterion type (run the gate for runnable criteria; spend a Sonnet read only when a gate fails or there are assertional criteria to judge), apply a bounded tiered fix loop, then do a final coherence pass and report.

Verification is adaptive (don't pay Sonnet to read passing code)

Runnable criteria (tests, build, a CLI exit code) → the orchestrator just runs the gate. Pass = done, no model spent. This is still independent verification (the gate is re-run, not self-reported).
A gate fails → a Sonnet checker reads the failure and surgically fixes it.
Assertional criteria (prose, citations, canon, design quality) → a Sonnet checker reads and judges those dimensions.

So for code-with-tests the common path costs no checker tokens; Sonnet is spent only on failures and on things only a reader can judge.

Artifacts land in docs/thrifty/<task-slug>/ (CONTRACT.md, briefs/, LEDGER.md) so a run is auditable and resumable.

Decomposition modes

Not every task is "one file per agent." The architect picks a mode by how cohesive the final artifact must be:

partition — sprints own separate regions/files, run in parallel, architect merges. Best for separable outputs (code files, doc sections).
relay — one shared artifact extended segment by segment, sequentially, each agent reading the artifact-so-far. Best for flowing prose (a story chapter).
layered — role-specialized passes over the whole artifact (draft → continuity edit → polish), sequentially. Best for one seamless voice via multiple lenses.

A single artifact with no cross-sprint seams skips the contract entirely — one brief, one execute, one check (or just don't use thrifty).

Fix loop (bounded, terminates)

The checker diagnoses (pass / local / execution / brief); the orchestrator decides and counts:

surgical (Sonnet, in place) — ≤ 2
executor redo (fresh Haiku + notes) — ≤ 1
architect replan (revise contract/brief) — ≤ 1
otherwise surface to the human

A regression guard rolls back any fix that breaks a previously-passing criterion.

Design & evidence

The benchmarked architecture and the full cost/quality investigation (including the approaches we tried and rejected) are in eval/RESULTS.md; reproducible scripts + captured data are in experiments/.

Two implementations — use dispatch by default

Recommended: the lean dispatch flow (thrifty-dispatch). It is the benchmarked architecture and the one the ~64% cost claim is measured on. Reach for the subagent substrate only when you specifically need per-unit parallel verification — and know it costs materially more (see why below).

Lean dispatch flow (benchmarked, recommended) — thrifty-dispatch: Sonnet writes contract.md + sprints.jsonl, then dispatch.py runs bare claude -p calls that hard-pin every codegen sprint to Haiku (the plan's tier is ignored in code) and write outputs to disk; the orchestrator reads only a tiny manifest. ~64% cheaper than Opus at equal gate quality. This is the architecture diagrammed at the top.
Subagent substrate (richer, pricier) — the thrifty-* skills above (Sonnet architect → parallel Haiku executor subagents → Sonnet checker). Reach for it only when you need per-unit parallel verification.

Why it costs more (structural, not a bug): every subagent spawn carries ~40k of Claude Code harness context, and each subagent's report lands back in the orchestrator's context — which is then re-read every turn, so cost compounds with the number of units. The dispatch flow avoids both (bare calls, outputs to disk, manifest-only reads).

Executor-tier caveat: unlike dispatch, this path cannot force the executor model in code — it asks the orchestrator to dispatch with model: "haiku", which a runtime may silently ignore (falling back to Sonnet/the default). Verify the actual model each executor ran on before trusting any cost figure — and note a subagent's self-reported model is only a weak signal (a silently-substituted model often still claims to be Haiku); the authoritative check is observed cost/usage. The ledger's savings line is an estimate, not a measurement. (See the "Verify the executor tier" step in thrifty's Step 3.)

Status

v0.1 — benchmarked. The lean dispatch flow is ~64% cheaper than Opus building the same spec, at equal gate quality, across 7 tasks (JS / Python / Go / prose). Evidence: eval/RESULTS.md (main findings + full journey) and experiments/ (reproducible scripts + captured cost data).