Autoresearch

June 23, 2026

Turn Claude Code, OpenCode, or OpenAI Codex into a relentless improvement engine.

Based on Karpathy's autoresearch — constraint + mechanical metric + autonomous iteration = compounding gains.


"Set the GOAL → The agent runs the LOOP → You wake up to results"

You don't need AGI. You need a goal, a metric, and a loop that never quits.

Supports Claude Code, OpenCode, and OpenAI Codex. 14 commands. 9 safety hooks. 95% fewer tokens per invocation.

v2.2.0 — Autonomous Orchestrator: Type a plain-language goal to /autoresearch and it classifies your goal, derives a Success predicate, confirms it once, then loops across subcommands until done. No manual chaining required. Metric:/Verify: invocations run the classic loop unchanged. See guide/autoresearch-orchestrator.md.


How It Works · Commands · Quick Start · Guides · FAQ


     PLAN             LOOP            DEBUG             FIX             SECURE            SHIP
 ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
 │   Goal   │     │  Modify  │     │   Find   │     │   Fix    │     │  STRIDE  │     │  Stage   │
 │  Metric  │────▶│  Verify  │────▶│   Bugs   │────▶│  Errors  │────▶│  OWASP   │────▶│  Deploy  │
 │  Scope   │     │Keep/Drop │     │  Trace   │     │  Repair  │     │ Red Team │     │ Release  │
 └──────────┘     └──────────┘     └──────────┘     └──────────┘     └──────────┘     └──────────┘
 /autoresearch:   /autoresearch    /autoresearch:   /autoresearch:   /autoresearch:   /autoresearch:
   plan                              debug            fix              security         ship

 ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
 │  Probe   │     │ Scenario │     │ Predict  │     │  Reason  │
 │ Require- │     │   Edge   │     │ 5-Expert │     │  Debate  │
 │  ments   │     │  Cases   │     │  Swarm   │     │ Converge │
 └──────────┘     └──────────┘     └──────────┘     └──────────┘
 /autoresearch:   /autoresearch:   /autoresearch:   /autoresearch:
   probe            scenario         predict          reason

 ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
 │  Learn   │     │ Improve  │     │   Eval   │     │ Baseline │
 │   Docs   │     │ Research │     │ Analyze  │     │   Diff   │
 │   Gen    │     │   PRDs   │     │ Results  │     │ Verdict  │
 └──────────┘     └──────────┘     └──────────┘     └──────────┘
 /autoresearch:   /autoresearch:   /autoresearch:   /autoresearch:
   learn            improve          evals            regression

Why This Exists

Karpathy's autoresearch demonstrated that a 630-line Python script could autonomously improve ML models overnight — 100 experiments per night — by following simple principles: one metric, constrained scope, fast verification, automatic rollback, git as memory.

Claude Autoresearch generalizes these principles to ANY domain. Not just ML — code, content, marketing, sales, HR, DevOps, or anything with a number you can measure.

v2.1.0 is a major architecture rebuild. The monolithic SKILL.md (813 lines, ~100K tokens per invocation) is replaced with a thin 41-line routing file and 12 self-contained command files (94–120 lines each, ~5–8K tokens per invocation). That is a 95% token reduction with the same capability surface.


How It Works

LOOP (N iterations or until done):
  1. Review current state + git history + results log
  2. Pick the next change (based on what worked, what failed, what's untried)
  3. Make ONE focused change
  4. Git commit (before verification)
  5. Run mechanical verification (tests, benchmarks, scores)
  6. If improved → keep. If worse → git revert. If crashed → fix or skip.
  7. Log the result
  8. Repeat until N iterations complete or goal is met.

Every improvement stacks. Every failure auto-reverts. Progress is logged in TSV format.
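The keep/discard decision in the loop above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: `run_loop` and its parameters are hypothetical names, and each callable in `candidates` stands in for "make one change, commit, run the Verify command, return the metric".

```python
from typing import Callable, List, Tuple

LogRow = Tuple[int, float, str]


def run_loop(
    baseline: float,
    candidates: List[Callable[[], float]],
    higher_is_better: bool = True,
) -> Tuple[float, List[LogRow]]:
    """Keep a change only when its measured metric beats the current best."""
    best = baseline
    log: List[LogRow] = [(0, baseline, "baseline")]
    for i, measure in enumerate(candidates, start=1):
        try:
            # Hypothetical stand-in for: apply change, git commit, run Verify.
            metric = measure()
        except Exception:
            # Crashed iteration: skip it, the best metric is unchanged.
            log.append((i, best, "crash"))
            continue
        improved = metric > best if higher_is_better else metric < best
        if improved:
            best = metric
            log.append((i, metric, "keep"))
        else:
            # Stand-in for: git revert of the experiment commit.
            log.append((i, metric, "discard"))
    return best, log


# Usage with the metric values from the Results Tracking example.
best, log = run_loop(85.2, [lambda: 87.1, lambda: 86.5, lambda: 88.3])
```

Because a discarded change never updates `best`, each kept change is measured against the best state so far, which is why improvements stack.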

The Setup Phase

Before looping, Claude performs a one-time setup:

  1. Read context — reads all in-scope files
  2. Define goal — extracts or asks for a mechanical metric
  3. Define scope — which files can be modified vs read-only
  4. Establish baseline — runs verification on current state (iteration #0)
  5. Confirm and go — shows setup, then begins the loop

8 Critical Rules

| # | Rule |
|---|------|
| 1 | Bounded by default: every command has a default iteration count; unlimited is opt-in via Iterations: unlimited |
| 2 | Read before write: understand full context before modifying |
| 3 | One change per iteration: atomic changes; if it breaks, you know why |
| 4 | Mechanical verification only: no subjective "looks good"; use metrics |
| 5 | Automatic rollback: failed changes revert instantly |
| 6 | Simplicity wins: equal results + less code = keep |
| 7 | Git is memory: experiments committed with experiment: prefix; agent reads git log + git diff before each iteration |
| 8 | When stuck, think harder: re-read, combine near-misses, try radical changes |

Hooks & Safety

v2.1.1 ships a 9-hook safety system that protects your sessions automatically. Hooks fire on every session — not just during autoresearch commands.

What's Protected

| Hook | What it does | Event |
|------|--------------|-------|
| scout-block | Blocks node_modules/, .git/, __pycache__/, etc. from filling your context | PreToolUse |
| privacy-block | Blocks .env, SSH keys, credentials from being read in sessions | PreToolUse |
| dangerous-cmd-block | Blocks force-push, rm -rf, git reset --hard | PreToolUse |
| iteration-context | Injects recent TSV iteration data after context compaction | UserPromptSubmit |
| subagent-context | Gives subagents awareness of active loop state | SubagentStart |
| dev-rules-reminder | Re-injects plan path and code standards after compaction | UserPromptSubmit |
| simplify-gate | Warns at 400 LOC, blocks at 800 LOC before shipping | UserPromptSubmit |
| session-init | Sets up project context at session start | SessionStart |
| stop-notify | Terminal notification + optional webhook on session end | SessionEnd |

Configuration

All hooks are on by default. Disable individually:

# Disable a specific hook
export AR_DISABLE_SCOUT_BLOCK=1
export AR_DISABLE_PRIVACY_BLOCK=1
export AR_DISABLE_DANGEROUS_CMD_BLOCK=1
# ... etc for each hook name

Optional webhook for session completion notifications:

export AR_NOTIFY_WEBHOOK=https://hooks.slack.com/services/...

Customize blocked directories with a .ckignore file (gitignore syntax) at your project root.

See guide/hooks.md for full reference.


Commands

| Command | What it does | Default Iterations |
|---------|--------------|--------------------|
| /autoresearch | Classic: core iterate loop: modify → verify → keep/discard · Orchestrator: free-form goal → auto-select pipeline → loop until predicate met | 25 / goal-bounded |
| /autoresearch:plan | Convert goal into validated config | one-shot |
| /autoresearch:debug | Hunt bugs via hypothesis iteration | 15 |
| /autoresearch:fix | Crush errors one-by-one to zero | 20 |
| /autoresearch:security | STRIDE + OWASP audit with red-team | 15 |
| /autoresearch:ship | Ship through 8 phases | linear |
| /autoresearch:scenario | Generate edge cases across 12 dimensions | 20 |
| /autoresearch:predict | 5 expert personas debate | one-shot |
| /autoresearch:learn | Scout → generate docs → validate → fix | 10 |
| /autoresearch:reason | Adversarial debate with blind judges | 8 |
| /autoresearch:probe | 8 personas interrogate requirements | 15 |
| /autoresearch:improve | Research ICP, discover improvements, generate PRDs | 15 |
| /autoresearch:evals | Analyze iteration results: trends, plateaus | one-shot |
| /autoresearch:regression | Stability gate: baseline vs candidate, verdict STABLE/UNSTABLE | one-shot |

Universal flags: Iterations: N, Iterations: unlimited, --evals, --evals-interval N, --chain <targets>, --<subcommand> shorthand.

All commands use interactive setup when invoked without arguments. Just type the command — the agent asks for what it needs with smart defaults based on your codebase.

OpenCode users: Commands use underscore naming (/autoresearch_debug, /autoresearch_fix, etc.). All 14 commands available.

Codex users: Invoke via $autoresearch mention syntax. Subcommands are keywords: $autoresearch debug, $autoresearch plan, etc.

Quick Decision Guide

| I want to... | Use |
|--------------|-----|
| Give a plain-language goal, let it self-orchestrate | /autoresearch <goal> (bare, no Metric/Verify) |
| Improve test coverage / reduce bundle size / any metric | /autoresearch |
| Run bounded iterations | Add Iterations: N to any command |
| Don't know what metric to use | /autoresearch:plan |
| Run a security audit | /autoresearch:security |
| Ship a PR / deployment / release | /autoresearch:ship |
| Optimize without breaking existing tests | Add Guard: npm test |
| Hunt all bugs in a codebase | /autoresearch:debug |
| Fix all errors (tests, types, lint) | /autoresearch:fix |
| Debug then auto-fix | /autoresearch:debug --fix |
| Check if something is ready to ship | /autoresearch:ship --checklist-only |
| Explore edge cases for a feature | /autoresearch:scenario |
| Generate test scenarios | /autoresearch:scenario --format test-scenarios |
| Get expert opinions before starting | /autoresearch:predict |
| Analyze from multiple angles then debug | /autoresearch:predict --chain debug |
| Generate docs for a new codebase | /autoresearch:learn --mode init |
| Update existing docs after changes | /autoresearch:learn --mode update |
| Debate an architecture decision | /autoresearch:reason --domain software |
| Surface hidden constraints before starting | /autoresearch:probe |
| Pre-flight a fuzzy goal then loop | /autoresearch:probe --chain plan,autoresearch |
| Discover what to build next for your ICP | /autoresearch:improve |
| Research competitors and generate PRDs | /autoresearch:improve --depth deep |
| Probe requirements then research improvements | /autoresearch:probe --improve |
| Analyze trends and plateaus across past runs | /autoresearch:evals |
| Check if a run has stalled | /autoresearch:evals --file *-results.tsv |
| Verify a change won't regress before pushing | /autoresearch:regression |
| Gate a PR: predict, fix, re-gate, then ship | /autoresearch:regression --predict --fix --ship |

Quick Start

Claude Code

Option A — npx install (recommended):

npx skills add uditgoenka/autoresearch

All 14 commands are available after restarting Claude Code.

Option B — Plugin install:

/plugin marketplace add uditgoenka/autoresearch
/plugin install autoresearch@autoresearch

Note: Start a new Claude Code session after installing. Reference files aren't resolvable in the same session where installation happened — this is a Claude Code platform limitation.

Updating (no reinstall needed):

/plugin update autoresearch

Run /reload-plugins to activate. No need to uninstall or re-clone.

Option C — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git

# Copy skill + subcommands to your project
cp -r autoresearch/.claude/skills/autoresearch .claude/skills/autoresearch
cp -r autoresearch/.claude/commands/autoresearch .claude/commands/autoresearch
cp autoresearch/.claude/commands/autoresearch.md .claude/commands/autoresearch.md

Or install globally:

cp -r autoresearch/.claude/skills/autoresearch ~/.claude/skills/autoresearch
cp -r autoresearch/.claude/commands/autoresearch ~/.claude/commands/autoresearch
cp autoresearch/.claude/commands/autoresearch.md ~/.claude/commands/autoresearch.md

Option D — Guided installer:

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --claude --global

OpenCode Quick Start

Option A — Guided installer (recommended):

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --opencode --global

Option B — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git

cp -r autoresearch/.opencode/skills/autoresearch .opencode/skills/autoresearch
cp autoresearch/.opencode/commands/autoresearch*.md .opencode/commands/

Or globally:

cp -r autoresearch/.opencode/skills/autoresearch ~/.config/opencode/skills/autoresearch
cp autoresearch/.opencode/commands/autoresearch*.md ~/.config/opencode/commands/

All 14 commands available as /autoresearch_debug, /autoresearch_fix, /autoresearch_improve, etc.

Codex Quick Start

Option A — Guided installer (recommended):

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --codex --global

Option B — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git
cp -r autoresearch/.agents/skills/autoresearch ~/.codex/skills/autoresearch

Invoke via $autoresearch mention syntax. Subcommands are keywords: $autoresearch plan, $autoresearch debug, $autoresearch evals, etc.

Run It

/autoresearch
Goal: Increase test coverage from 72% to 90%
Scope: src/**/*.test.ts, src/**/*.ts
Metric: coverage % (higher is better)
Verify: npm test -- --coverage | grep "All files"
Iterations: 25

Claude reads all files, establishes a baseline, and starts iterating — one change at a time. Keeps improvements, auto-reverts failures, logs everything. Stops after N iterations or when you interrupt.


/autoresearch:plan — Goal to Config

The hardest part isn't the loop — it's defining Scope, Metric, and Verify correctly. /autoresearch:plan converts your plain-language goal into a validated, ready-to-execute configuration.

/autoresearch:plan
Goal: Make the API respond faster

Walks through 5 steps: capture goal → define scope → define metric → define direction → validate verify command (dry-run). Every gate is mechanical — scope must resolve to files, metric must output a number, verify must pass a dry-run. Emits a handoff.json for chaining.


/autoresearch:debug — Autonomous Bug Hunter

Scientific method meets autoresearch loop. Doesn't stop at one bug — iteratively hunts ALL bugs using falsifiable hypotheses, evidence-based investigation, and 7 investigation techniques.

/autoresearch:debug
Scope: src/api/**/*.ts
Symptom: API returns 500 on POST /users
Iterations: 15

How it works: Gather symptoms → Recon → Hypothesize (specific, testable) → Test (one experiment per iteration) → Classify (confirmed/disproven/inconclusive) → Log → Repeat.

Every finding requires code evidence (file:line + reproduction steps). Every disproven hypothesis is logged — equally valuable.

| Flag | Purpose |
|------|---------|
| --fix | After hunting, auto-switch to /autoresearch:fix |
| --scope <glob> | Limit investigation scope |
| --symptom "<text>" | Pre-fill symptom |
| --severity <level> | Minimum severity to report |

/autoresearch:fix — Autonomous Error Crusher

Takes a broken state and iteratively repairs it until everything passes. ONE fix per iteration. Atomic, committed, verified, auto-reverted on failure.

/autoresearch:fix
Iterations: 20

Auto-detects what's broken (tests, types, lint, build) → Prioritizes (blockers first) → Fixes ONE thing → Commits → Verifies error count decreased → Guard check → Keep/Revert → Repeat. Stops automatically when error count hits zero.

| Flag | Purpose |
|------|---------|
| --target <command> | Explicit verify command |
| --guard <command> | Safety command that must always pass |
| --category <type> | Only fix specific type (test, type, lint, build) |
| --from-debug | Read findings from latest debug session |

Chain them: /autoresearch:debug → /autoresearch:fix --from-debug


/autoresearch:security — Autonomous Security Audit

Read-only security audit using STRIDE threat modeling, OWASP Top 10 sweeps, and red-team adversarial analysis with 4 hostile personas.

/autoresearch:security
Iterations: 15

Codebase recon → asset inventory → trust boundaries → STRIDE threat model → attack surface map → autonomous testing loop → structured report. Every finding requires code evidence (file:line + attack scenario).

| Flag | Purpose |
|------|---------|
| --diff | Only audit files changed since last audit |
| --fix | Auto-fix confirmed Critical/High findings |
| --fail-on <severity> | Exit non-zero for CI/CD gating |

Output: Creates security/{date}-{slug}/ with 7 structured report files.


/autoresearch:ship — Universal Shipping Workflow

Ship anything through 8 phases: Identify → Inventory → Checklist → Prepare → Dry-run → Ship → Verify → Log.

/autoresearch:ship --auto

Auto-detects what you're shipping (code PR, deployment, blog post, email campaign, sales deck, research paper, design assets) and generates domain-specific checklists — every item mechanically verifiable.

| Flag | Purpose |
|------|---------|
| --dry-run | Validate everything but don't ship |
| --auto | Auto-approve if checklist passes |
| --force | Skip non-critical items (blockers still enforced) |
| --rollback | Undo last ship action |
| --monitor N | Post-ship monitoring for N minutes |
| --checklist-only | Just check readiness |

9 supported types: code-pr, code-release, deployment, content, marketing-email, marketing-campaign, sales, research, design.


/autoresearch:scenario — Scenario Explorer

Autonomous scenario exploration engine. Takes a seed scenario and iteratively generates situations across 12 dimensions — happy paths, errors, edge cases, abuse, scale, concurrency, temporal, data variation, permissions, integrations, recovery, and state transitions.

/autoresearch:scenario
Scenario: User attempts to checkout with multiple payment methods
Iterations: 20

Seed analysis → Decompose into 12 dimensions → Generate ONE situation per iteration → Classify (new/variant/duplicate) → Expand edge cases → Log → Repeat.

| Flag | Purpose |
|------|---------|
| --domain <type> | software, product, business, security, marketing |
| --depth <level> | shallow (10), standard (20), deep (50+) |
| --format <type> | use-cases, user-stories, test-scenarios, threat-scenarios |
| --focus <area> | edge-cases, failures, security, scale |

/autoresearch:predict — Multi-Persona Prediction

Before you debug, fix, or ship — get 5 expert perspectives in 2 minutes.

Simulates a team (Architect, Security Analyst, Performance Engineer, Reliability Engineer, Devil's Advocate) who independently analyze your code, debate findings, and reach consensus.

/autoresearch:predict --chain debug

  • --chain debug: pre-ranked hypotheses before debugging
  • --chain security: multi-persona red team analysis
  • --chain scenario,debug,fix: full quality pipeline

/autoresearch:learn — Autonomous Documentation Engine

Scout codebase → generate docs → validate → fix → repeat. 4 modes: init (create from scratch), update (refresh existing), check (read-only health report), summarize (quick overview).

/autoresearch:learn --mode init --depth deep
Iterations: 10

Dynamic doc discovery, project-type detection, validation-fix loop, git-diff scoping for updates, selective single-doc update with --file. Auto-generates Mermaid architecture diagrams, API reference, testing guide, config guide, and cross-reference links.


/autoresearch:reason — Adversarial Refinement

Extends autoresearch to subjective domains where no objective metric exists. The blind judge panel is the fitness function.

/autoresearch:reason
Task: Should we use event sourcing for our order management system?
Domain: software
Iterations: 8

How it works: Generate-A → Critic attacks → Author-B responds → Synthesizer merges → Blind judge panel (randomized labels) picks winner → Winner becomes new A → Repeat until convergence. Every agent is a cold-start fresh invocation — no history bleed.

| Flag | Purpose |
|------|---------|
| --judges N | Judge count (3-7, odd preferred) |
| --convergence N | Consecutive wins to converge (default 3) |
| --mode <mode> | convergent (default), creative, debate |
| --domain <type> | software, product, business, security, research, content |
| --chain <targets> | Chain converged output to any autoresearch command |

Output: Creates reason/{date}-{slug}/ with lineage.md, candidates.md, judge-transcripts.md, reason-results.tsv, handoff.json.


/autoresearch:probe — Adversarial Requirement Interrogation

Eight adversarial personas interrogate user and codebase together until net-new constraints saturate. Output is the 5 autoresearch primitives (Goal/Scope/Metric/Direction/Verify) plus a handoff.json ready to feed any downstream command.

/autoresearch:probe --chain plan,autoresearch
Topic: Add multi-tenant isolation to the database layer

The 8 personas: Skeptic, Edge-Case Hunter, Scope Sentinel, Ambiguity Detective, Contradiction Finder, Prior-Art Investigator, Success-Criteria Auditor, Constraint Excavator.

| Flag | Purpose |
|------|---------|
| --depth <level> | shallow (5 rounds), standard (15), deep (30) |
| --adversarial | Rotate Skeptic + Contradiction Finder + Edge-Case Hunter to front |
| --mode <mode> | interactive (default) or autonomous |
| --chain <targets> | plan, predict, debug, scenario, reason, fix, ship, learn |

Output: Creates probe/{date}-{slug}/ with probe-spec.md, constraints.tsv, autoresearch-config.yml, handoff.json.


/autoresearch:improve — Product Improvement Engine

Research what to build next. Discovers ICP challenges via deep multi-source research, scores and ranks improvements, generates per-feature PRDs with evidence chains.

/autoresearch:improve
Goal: Improve onboarding conversion
ICP: B2B SaaS product managers at 50-500 person companies

How it works: Resolve product context → Research across 5 categories (ICP challenges, competitor gaps, market trends, UX & experience, revenue & growth) → Saturate → ICP binary gate → Tiered ranking (Must-have / Nice-to-have / Moonshot) → User selects features → Generate PRDs.

| Flag | Purpose |
|------|---------|
| --icp "<text>" | Ideal customer profile |
| --discover | Force codebase scan even with existing context |
| --no-discover | Skip auto-discover |
| --depth <level> | shallow (5), standard (15), deep (30+) |
| --seeds <categories> | Override default research categories |

Output: Creates improve/{date}-{slug}/ with research-findings.md, improvement-plan.md, per-feature PRDs, summary.md, improve-results.tsv, handoff.json.

Terminal emitter — improve is the last link in any autoresearch chain. PRDs are consumed by external tools (/ck:plan, /ck:cook), not by other autoresearch commands.

Chain into improve: /autoresearch:probe --improve, /autoresearch:predict --improve, /autoresearch:debug --improve.


/autoresearch:evals — Results Analyzer

Analyzes *-results.tsv files from any autoresearch run. Surfaces trends, plateau detection, convergence signals, and iteration efficiency. Backward compatible with v2.0.x TSV format.

/autoresearch:evals
/autoresearch:evals --file coverage-results.tsv

Adaptive checkpoints: floor(max_iterations/3), minimum 1 checkpoint. Reports per-checkpoint delta, stall detection, best iteration, and a recommendation (continue / stop / change strategy).
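One way to read the checkpoint rule is as an interval of floor(max_iterations/3) iterations, floored at 1 so short runs still get a checkpoint. The sketch below follows that reading; it is an interpretation of the sentence above, and `checkpoint_iterations` is a hypothetical name rather than anything the command exposes.

```python
from typing import List


def checkpoint_iterations(max_iterations: int) -> List[int]:
    """Iterations at which a checkpoint report would fire under this reading."""
    # Interval is floor(max_iterations / 3), but never less than 1.
    interval = max(1, max_iterations // 3)
    return list(range(interval, max_iterations + 1, interval))


# A 30-iteration run checkpoints every 10 iterations.
points = checkpoint_iterations(30)
```

With 25 iterations the interval is 8, so the last checkpoint lands on iteration 24 rather than on the final iteration.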

Inline evals during a run:

/autoresearch
Goal: Reduce bundle size below 200kb
Iterations: 30
--evals-interval 10

Prints a checkpoint report every 10 iterations without interrupting the loop.


/autoresearch:regression — Stability Gate

Before you push, prove the change didn't break what already worked. Captures baseline behavior from a git worktree of the base ref, diffs the candidate across 8 dimensions, and emits a single STABLE / UNSTABLE verdict.

/autoresearch:regression --predict --evals --fix --ship

Core invariant: a regression is a green→red transition only. Pre-existing failures (red→red), new tests (absent→red), and flaky tests (flake→red) are classified and excluded — never counted as regressions.
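The invariant can be written as a small classifier over (baseline, candidate) test states. This is a sketch under assumptions: the state strings and the returned labels are hypothetical names chosen for illustration, not the command's actual vocabulary.

```python
def classify(baseline: str, candidate: str) -> str:
    """Classify one test's transition; only green -> red counts as a regression.

    baseline is one of "green", "red", "absent", "flake";
    candidate is "green" or "red".
    """
    if candidate != "red":
        return "ok"
    if baseline == "green":
        return "regression"
    # Every other path into red is classified and excluded from the verdict.
    excluded = {"red": "pre-existing", "absent": "new-test", "flake": "flaky"}
    return excluded.get(baseline, "unknown")


transitions = [("green", "red"), ("red", "red"), ("absent", "red"), ("flake", "red")]
labels = [classify(b, c) for b, c in transitions]
```

Only the first transition in the usage example is a regression; the other three are reported under their own labels and do not affect the gate.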

Tiered verdict:

  • HARD gate (any green→red = UNSTABLE): functional, api-contract, data-migration, integration-e2e
  • SCORE (0–100, noise-tolerant, weighted; UNSTABLE below threshold 95): flakiness .30, performance .30, resource .20, visual-ui .20

| Flag | Purpose |
|------|---------|
| --select auto | Use detected affected-test mapper (jest --findRelatedTests, nx affected) else FULL suite, never a silent subset |
| --samples N / --noise-band % | Tune the perf statistical gate (default 7 samples/side, Mann–Whitney U) |
| --fix / --fix-cycles N | Re-gate after fixing; each cycle must strictly shrink the blocking-set (max 3) |
| --predict | Pre-empt likely regressions before the gate runs |
| --reason | Adversarial root-cause when a regression's cause is ambiguous |
| --debug | Force the bisect Hunter (HARD dims passing 3/3 reproduction) |
| --max-runs N | Ceiling on dims × axes × samples × cells (warn+confirm past 200) |

Output: Creates regression/{date}-{slug}/ with regression-results.tsv, stability-report.md, dimensions/.md, baseline/, evals-summary.md, handoff.json.

data-migration is hard-guarded: opt-in, and refuses any DB URL that isn't ephemeral/allowlisted (*test*, *ci*, container). Migrations are forward-only by default.
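The tiered verdict described above combines a hard gate with a weighted score. A minimal sketch follows, using the dimension names, weights, and threshold listed in this section. The function name, the input shapes, and the choice to treat a missing score as 100 are assumptions made for illustration.

```python
from typing import Dict

HARD_DIMS = ("functional", "api-contract", "data-migration", "integration-e2e")
WEIGHTS = {"flakiness": 0.30, "performance": 0.30, "resource": 0.20, "visual-ui": 0.20}
THRESHOLD = 95.0


def verdict(hard_regressions: Dict[str, int], scores: Dict[str, float]) -> str:
    """Return "STABLE" or "UNSTABLE" from hard-gate counts and 0-100 scores."""
    # HARD gate: any green -> red transition in a hard dimension fails outright.
    if any(hard_regressions.get(dim, 0) > 0 for dim in HARD_DIMS):
        return "UNSTABLE"
    # SCORE: weighted sum; a dimension with no score is assumed clean (100).
    total = sum(weight * scores.get(dim, 100.0) for dim, weight in WEIGHTS.items())
    return "STABLE" if total >= THRESHOLD else "UNSTABLE"


# A performance score of 80 pulls the weighted total to 94, below the threshold.
result = verdict({}, {"flakiness": 100, "performance": 80, "resource": 100, "visual-ui": 100})
```

The hard gate is checked first so that a perfect score can never mask a functional regression.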


Guard — Prevent Regressions

When optimizing a metric, the loop might break existing behavior. Guard is an optional safety net.

/autoresearch
Goal: Reduce API response time to under 100ms
Verify: npm run bench:api | grep "p95"
Guard: npm test

  • Verify = "Did the metric improve?" (the goal)
  • Guard = "Did anything else break?" (the safety net)

If the metric improves but the guard fails, Claude reworks the optimization (up to 2 attempts). Guard/test files are never modified.
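The interaction between Verify and Guard can be sketched as a small decision function. The names are hypothetical, and the outcome after the rework attempts are exhausted (discard) is an assumption, since the text above only states the 2-attempt limit.

```python
def decide(
    metric_improved: bool,
    guard_passed: bool,
    rework_attempts: int,
    max_rework: int = 2,
) -> str:
    """Return "keep", "rework", or "discard" for one iteration."""
    if not metric_improved:
        return "discard"  # Verify failed: nothing to protect
    if guard_passed:
        return "keep"  # metric improved and nothing else broke
    # Metric improved but the guard failed: rework while attempts remain.
    # Discarding after the limit is an assumption, not documented behavior.
    return "rework" if rework_attempts < max_rework else "discard"


outcome = decide(metric_improved=True, guard_passed=False, rework_attempts=0)
```

Keeping the guard check separate from the metric check means an optimization is never kept on the strength of the metric alone.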

Credit: Guard was contributed by @pronskiy (JetBrains) in PR #7.


Results Tracking

Every iteration is logged in TSV format:

iteration  commit   metric  delta   status    description
0          a1b2c3d  85.2    0.0     baseline  initial state
1          b2c3d4e  87.1    +1.9    keep      add tests for auth edge cases
2          -        86.5    -0.6    discard   refactor test helpers (broke 2 tests)
3          c3d4e5f  88.3    +1.2    keep      add error handling tests

Run /autoresearch:evals at any time to analyze trends across any TSV file. Adaptive checkpoints fire at floor(max_iterations/3) intervals.
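Because the log is plain TSV, it is also easy to inspect outside the agent. The sketch below assumes tab-separated columns named as in the example above and finds the best kept iteration; `best_kept` is a hypothetical helper, not part of /autoresearch:evals.

```python
import csv
import io
from typing import Dict

# The example log from above, rebuilt as tab-separated text.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("iteration", "commit", "metric", "delta", "status", "description"),
    ("0", "a1b2c3d", "85.2", "0.0", "baseline", "initial state"),
    ("1", "b2c3d4e", "87.1", "+1.9", "keep", "add tests for auth edge cases"),
    ("2", "-", "86.5", "-0.6", "discard", "refactor test helpers (broke 2 tests)"),
    ("3", "c3d4e5f", "88.3", "+1.2", "keep", "add error handling tests"),
])


def best_kept(tsv_text: str, higher_is_better: bool = True) -> Dict[str, str]:
    """Return the baseline or kept row with the best metric value."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [row for row in rows if row["status"] in ("baseline", "keep")]
    pick = max if higher_is_better else min
    return pick(kept, key=lambda row: float(row["metric"]))


best = best_kept(SAMPLE)
```

Discarded rows are filtered out before comparison because their changes were reverted and are not part of the current state.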


Crash Recovery

| Failure | Response |
|---------|----------|
| Syntax error | Fix immediately, don't count as iteration |
| Runtime error | Attempt fix (max 3 tries), then move on |
| Resource exhaustion | Revert, try smaller variant |
| Infinite loop / hang | Kill after timeout, revert |
| External dependency | Skip, log, try different approach |

Repository Structure

autoresearch/
├── README.md
├── COMPARISON.md                                  ← Karpathy's vs Claude Autoresearch
├── guide/                                         ← Guides — one per command + advanced patterns
├── scripts/
│   ├── install.sh                                 ← Guided installer (Claude Code + OpenCode + Codex)
│   ├── transform.sh                               ← Single transform: .claude/ → .opencode/ + .agents/
│   ├── release.sh                                 ← Release automation
│   └── release.md                                 ← Release checklist
├── .claude/
│   ├── skills/autoresearch/
│   │   ├── SKILL.md                               ← Thin routing table (41 lines)
│   │   └── references/                            ← 3 focused reference files
│   │       ├── security-checklist.md              ← STRIDE + OWASP
│   │       ├── predict-personas.md                ← 5 personas + adversarial set
│   │       └── reason-judge-protocol.md           ← Adversarial refinement loop
│   └── commands/
│       ├── autoresearch.md                        ← Core loop (self-contained, ~100 lines)
│       └── autoresearch/                          ← 13 subcommand files (self-contained)
│           ├── plan.md
│           ├── debug.md
│           ├── fix.md
│           ├── security.md
│           ├── ship.md
│           ├── scenario.md
│           ├── predict.md
│           ├── learn.md
│           ├── reason.md
│           ├── improve.md
│           ├── probe.md
│           ├── evals.md
│           └── regression.md
├── .opencode/                                     ← OpenCode port (via transform.sh)
│   ├── skills/autoresearch/
│   └── commands/                                  ← 14 command files (autoresearch_*.md)
├── .agents/                                       ← Codex port (via transform.sh)
│   └── skills/autoresearch/
├── plugins/                                       ← Codex plugin metadata
│   └── openai.yaml
└── LICENSE

FAQ

Q: I don't know what metric to use. A: Run /autoresearch:plan — it analyzes your codebase, suggests metrics, and dry-runs the verify command before you launch.

Q: What changed in v2.2.0? A: The root /autoresearch command now supports an autonomous orchestrator mode. Type a plain-language goal (e.g., /autoresearch help me fix the login bug) instead of Metric:/Verify: and the orchestrator classifies your goal, derives a verifiable Success predicate, confirms it once, then loops across subcommands until done. Classic metric-loop behavior is unchanged when Metric: or Verify: are present.

Q: What changed in v2.1.0? A: Architecture rebuild. The monolithic SKILL.md (813 lines, ~100K tokens) is replaced with a thin routing file + 12 self-contained command files (~5–8K tokens each). 95% token reduction. A new /autoresearch:evals command analyzes iteration results. Every looping command now has a bounded default instead of running unlimited.

Q: How do bounded defaults work? A: Every looping command ships with a sensible default (e.g., /autoresearch defaults to 25 iterations). Override inline: Iterations: 50 for more, Iterations: unlimited for the old unbounded behavior.

Q: How does /autoresearch:evals work? A: Point it at any *-results.tsv file from a previous run. It reports trends, plateau detection, and a recommendation. Use --evals-interval N during a live run to get checkpoint reports without interrupting the loop.

Q: Does this work with any project? A: Yes. Any language, framework, or domain. Install via plugin (Claude Code), installer script, or manual copy.

Q: Does this work with OpenCode? A: Yes. Run ./scripts/install.sh --opencode --global or manually copy .opencode/ files. Commands use underscore naming (/autoresearch_debug, /autoresearch_evals, etc.). All 14 commands available.

Q: Does this work with OpenAI Codex? A: Yes. Run ./scripts/install.sh --codex --global or copy .agents/skills/autoresearch/ to ~/.codex/skills/autoresearch. Invoke via $autoresearch mention syntax.

Q: How do I stop the loop? A: Ctrl+C or add Iterations: N to your inline config. Claude commits before verifying, so your last successful state is always in git.

Q: Can I use this for non-code tasks? A: Absolutely. Sales emails, marketing copy, HR policies, runbooks — anything with a measurable metric. See Examples by Domain.

Q: Does /autoresearch:security modify my code? A: No. Read-only by default. Use --fix to opt into auto-remediation of confirmed Critical/High findings.

Q: What's the difference between /autoresearch:predict and /autoresearch:reason? A: Predict is a one-shot analysis — 5 experts debate your existing code. Reason is an iterative refinement loop — competing candidates are generated, critiqued, synthesized, and blind-judged over multiple rounds until convergence. Use predict for analysis before acting; use reason for decisions where no objective metric exists.

Q: What is handoff.json? A: A structured file emitted by plan, probe, reason, and other commands that carries Goal/Scope/Metric/Verify config for downstream commands. When you --chain plan,autoresearch, the chain reads handoff.json automatically.


Contributing

Contributions welcome. See CONTRIBUTING.md.

Areas of interest: new domain examples, verification script templates, CI/CD integrations, real-world benchmarks. All guides are in guide/.


License

MIT — see LICENSE.


About the Author

Udit Goenka — AI Product Expert, Founder & Angel Investor

Self-taught builder who went from a slow internet connection in India to founding multiple companies and helping 700+ startups generate roughly $25M in revenue.

Building: TinyCheque (India's first agentic AI venture studio) · Firstsales.io (sales automation)

Investing: 38 startups backed, 6 exits. Focused on early-stage AI and SaaS.

Connect: udit.co · @iuditg · @uditgoenka · Newsletter

"Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy."