autospec

March 30, 2026

Natural-language domain specs in, working service code out.

An autonomous keep-or-revert loop — inspired by karpathy/autoresearch — that reads business rules written in plain language and iteratively builds, tests, and verifies a service until the spec is satisfied.

Demo Results

We wrote 5 domain documents (67 lines of Korean). The orchestrator ran 7 cycles in 26 minutes and built a complete REST API from a 119-line skeleton:

Cycle  What the AI Did                               Tests    Lines  Time
1      CRUD + validation + status transitions        1 → 12   +384   4m44s
2      Error response consistency + edge cases       12 → 18  +121   5m19s
3      500 handler, null status check, test gaps     18 → 22  +97    4m29s
4      Lifecycle test, edge case coverage            22 → 28  +123   5m44s
5      Transactional safety, input validation tests  28 → 34  +101   5m58s
6-7    (no changes — converged)                      34

119-line skeleton → 950 lines of working Java. 34 tests. 5 accepts, 0 rejects. $0 cost.

How It Works

┌─────────────────────────┐
│  .autospec/domain/*.md  │  Human writes business rules (natural language)
│  .autospec/common/*.md  │  Human writes tech conventions (once)
└───────────┬─────────────┘


┌─────────────────────────┐
│    orchestrator.py      │  Loop controller
│                         │
│  1. Read previous runs  │
│  2. Build prompt        │
│  3. Call claude -p      │──► Claude Code CLI reads specs, writes code, commits
│  4. Evaluate result     │
│  5. Accept or reject    │
└───────────┬─────────────┘


┌─────────────────────────┐
│     evaluator.py        │  Judge (no AI)
│                         │
│  ./gradlew build        │
│  Parse JUnit XML        │
│                         │
│  Accept: build pass     │
│    + tests pass         │
│    + test count ≥ prev  │
│                         │
│  Reject: git reset      │
└─────────────────────────┘

The evaluator sits entirely outside the AI: the AI writes code, and a deterministic script judges it.
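The judging step above can be sketched in a few lines. This is a minimal illustration in the spirit of evaluator.py, not its actual code — the function names, report path, and record shapes here are assumptions:

```python
# Hypothetical sketch of the evaluator: run the Gradle build, count passing
# JUnit tests from the XML reports, and accept only if nothing regressed.
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

def count_passed_tests(results_dir: str) -> int:
    """Sum (tests - failures - errors) across all JUnit XML reports."""
    passed = 0
    for report in Path(results_dir).glob("*.xml"):
        suite = ET.parse(report).getroot()
        passed += (int(suite.get("tests", 0))
                   - int(suite.get("failures", 0))
                   - int(suite.get("errors", 0)))
    return passed

def evaluate(prev_test_count: int,
             results_dir: str = "build/test-results/test") -> bool:
    """Accept iff the build passes and the test count did not decrease."""
    build = subprocess.run(["./gradlew", "build"], capture_output=True)
    if build.returncode != 0:
        subprocess.run(["git", "reset", "--hard", "HEAD~1"])  # reject: build broke
        return False
    if count_passed_tests(results_dir) < prev_test_count:
        subprocess.run(["git", "reset", "--hard", "HEAD~1"])  # reject: regression
        return False
    return True
```

Because the judgment is just a build exit code plus an XML count, a cycle can never be accepted by persuasive-looking but broken output.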

Quick Start

git clone https://github.com/jeongph/autospec.git
cd autospec

# Requires: Java 17, Python 3, Claude Code CLI
python orchestrator.py examples/spring-boot-todo

Domain Documents

Domain docs are pure natural language — no code, no types, no API paths:

할일을 만들면 "대기" 상태가 된다. 작업을 시작하면 "진행중"으로 바뀌고, 끝나면 "완료"가 된다. 완료된 할일은 다시 되돌릴 수 없다.

(Translation: When a todo is created, it starts in the "대기" (pending) state. Starting work moves it to "진행중" (in progress); finishing it makes it "완료" (completed). A completed todo cannot be reverted.)

The AI reads this, maps "대기" to PENDING, figures out which endpoint handles status changes, and writes the validation logic.

Technical conventions (response format, naming, DB) live in .autospec/common/ — separated from business rules.
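The rule above boils down to a small state machine. As an illustration only (the generated service is Java; the English status names are the mappings the AI chose, and this helper is hypothetical), the derived validation logic looks like:

```python
# Allowed status transitions derived from the natural-language rule:
# 대기 (PENDING) → 진행중 (IN_PROGRESS) → 완료 (COMPLETED), with no way back.
ALLOWED = {
    "PENDING": {"IN_PROGRESS"},
    "IN_PROGRESS": {"COMPLETED"},
    "COMPLETED": set(),  # terminal: a completed todo cannot be reverted
}

def change_status(current: str, target: str) -> str:
    """Return the new status, or raise if the spec forbids the transition."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target
```

Nothing in the domain doc names an endpoint or a type; the AI infers both the enum and where this check belongs.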

Project Structure

autospec/
├── orchestrator.py          ← Loop controller
├── evaluator.py             ← Build/test judge (no AI)
├── history.py               ← Cycle records + context passing
└── examples/
    └── spring-boot-todo/    ← Example: Todo API
        ├── .autospec/
        │   ├── program.md   ← Agent instructions
        │   ├── common/      ← Tech conventions
        │   ├── domain/      ← Business rules (Korean)
        │   └── eval.md      ← Pass/fail criteria
        └── src/             ← Skeleton (AI fills this)

Safety

  • Reject on build failure → git reset --hard HEAD~1
  • Reject on test failure → rollback
  • Reject on test regression → test count cannot decrease
  • Max 3 consecutive failures → stop
  • Convergence detection → stop after 2 unchanged cycles
  • 10-minute timeout per cycle
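The stop conditions above are simple to state precisely. A hedged sketch (hypothetical helper; the real orchestrator.py may structure its history records differently):

```python
# Stop the loop after 3 consecutive rejected cycles, or after 2 consecutive
# cycles in which the AI changed nothing (convergence).
MAX_CONSECUTIVE_FAILURES = 3
CONVERGENCE_CYCLES = 2

def should_stop(history: list) -> bool:
    """history holds per-cycle records like {"accepted": bool, "changed": bool}."""
    if len(history) >= MAX_CONSECUTIVE_FAILURES and all(
        not c["accepted"] for c in history[-MAX_CONSECUTIVE_FAILURES:]
    ):
        return True  # repeated failures: give up
    if len(history) >= CONVERGENCE_CYCLES and all(
        not c["changed"] for c in history[-CONVERGENCE_CYCLES:]
    ):
        return True  # converged: nothing left to do
    return False
```

In the demo run, the second condition is what ended the loop: cycles 6 and 7 produced no changes.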

Autoresearch Correspondence

autoresearch              autospec
program.md                .autospec/program.md
prepare.py (immutable)    evaluator.py (no AI)
train.py (AI modifies)    src/ (AI writes)
val_bpb                   test count + build pass

License

MIT