README.md
June 6, 2026 · View on GitHub
Change an agent config, instantly know if it got better.
You tune an agent — a prompt, a tool, a memory setting, a temperature — and you don't actually know if it improved. Last week's good config is gone. Two setups can only be compared by gut. Your agent's configuration lives in scattered prompt strings, a Notion doc, and someone's head.
EvanCore is the lab that fixes that: run two harness configs on the same task, score the outcomes, and see which won — output, config diff, cost, verdict. And because every config is a content-addressed, signed, reversible artifact, the winner ships with one command.
The engine is closed. The design, the spec schema, the philosophy, and the evaluation log are open.
The lab: evan compare
Two configs differing only by model. Same task. The judge decides.
$ evan compare variant_a.yaml variant_b.yaml \
--trigger "Write a 3-sentence pitch for a tool that version-controls AI agent configs." \
--criteria "punchy, concrete, no buzzwords"
╭──────────────────────────────╮ ╭──────────────────────────────────╮
│ Variant A · variant_a.yaml │ │ Variant B · variant_b.yaml ◀ winner│
│ score 7.0/10 │ │ score 9.0/10 │
│ cost \$0.0020 │ │ cost \$0.0023 │
│ tokens 10 in / 509 out │ │ tokens 3 in / 152 out │
│ status ✓ │ │ status ✓ │
╰──────────────────────────────╯ ╰──────────────────────────────────╯
Config delta (A → B)
~ model.model_id: 'haiku' → 'sonnet'
🏆 Variant B wins — 9.0 vs 7.0
B opens with a sharper pain point, the 'last Tuesday' detail is concrete,
and 'git for the part of your stack that lives in someone's head or a
Notion doc' lands harder than A's generic closer.
Ship it: evan release variant_b.yaml
Change anything — prompt, tools, memory backend, temperature, routing — author it as an overlay, and compare again. That loop is the lab. The judge is an LLM scored on your stated criteria; the cost and token numbers are real; the config delta is computed from the specs.
See examples/compare/ for the variant specs that produce this.
Run it with no API key
EvanCore harnesses can run on the local claude CLI (Claude Code) — using the machine's existing auth, no ANTHROPIC_API_KEY required. If you have Claude Code installed, you can compare real Claude models out of the box:
model:
provider: claude-code # routes through your local `claude` CLI
model_id: sonnet # or haiku / opus / a full model id
$ evan compare a.yaml b.yaml -t "..." --judge "claude-code:sonnet"
Ollama (provider: ollama) works the same way for fully-offline runs. Native anthropic / openai providers are there when you want raw-API cost accounting.
The spine beneath the lab
A comparison is only worth trusting if the thing you compared is reproducible. EvanCore's delivery layer is what makes that true — and what turns "B won" into "B is shipped."
| Capability | What it buys the lab |
|---|---|
| Content-addressed specs (blake2b-64) | The exact config you scored is the exact config you ship — no drift |
evan diff | The config delta in every comparison |
evan release / rollback / upgrade / promote | Ship the winner, reverse it, move it across environments — every release has an inverse |
| Signed artifacts (Ed25519 + trust + revocation) | Distribute a winning config others can verify |
| Overlays (Kustomize-style) | Author A/B/C variants without copy-pasting whole specs |
# the winner becomes a signed, reversible artifact
$ evan release variant_b.yaml --sign alice@acme
📦 SPEC_RELEASED h-abc123f7d9e2
✍ SPEC_SIGNED alice@acme
🚀 HARNESS_DEPLOYED app (av)
# changed your mind after the next experiment?
$ evan rollback h-abc123 --yes
↩ HARNESS_ROLLED_BACK
What this is / is not
Is:
- A lab for agent harnesses — run two configs on one task, judge which won (
evan compare) - A delivery spine that makes the winner reproducible and shippable (assemble / diff / release / rollback / upgrade / promote / publish / pull / verify)
- Backend-agnostic runs —
claude-code(no key),ollama(offline),anthropic,openai
Is not:
- An agent framework (no runtime DSL, no orchestration primitives, no LangGraph/CrewAI overlap)
- A hosting platform (you still choose where agents run)
- A tracing/observability product — those watch runs; EvanCore version-controls and compares the config that produced them
What's open here / what's closed
This repository is the public surface of EvanCore:
| Open (this repo) | Closed (private engine) |
|---|---|
docs/index.html — interactive lifecycle visualization | Compare loop (run → judge → side-by-side) |
docs/philosophy.md — why a lab, not a runtime | Assembler (spec → harness packaging) |
docs/evaluations.md — ecosystem evaluation log | Release machinery (content-addressed storage, hash chain) |
docs/spec-schema.md — public harness spec schema | Signing / trust / revocation kernel |
examples/*.yaml — runnable example specs | Registry federation (file:// + https://) |
LICENSE — Apache-2.0 (on materials in this repo) | CLI implementation + v2 four-line backend runtime |
Ship status: v1 A–G + v2 four-line foundation + release closure trilogy + B3 supply-chain (A/A.1/B/B.1/B.2) + MCP Phase 2 + the lab (evan compare, LLM-judge, claude-code backend) — 633 tests passing as of 2026-06-06.
The verification shape (schema, evaluation transparency, signed artifacts, the real comparison transcript above) is here. The engine isn't.
Positioning (honest)
| If you want… | Use this instead |
|---|---|
| A runtime that executes agent steps | LangGraph, CrewAI, Microsoft Agent Framework |
| A hosted platform for running agents | Vercel AI, Modal, Fly.io Machines |
| Tracing / eval dashboards over agent runs | LangSmith, Langfuse, Braintrust |
| A tool-use protocol / integration substrate | MCP itself — EvanCore consumes it, doesn't replace it |
The difference from the tracing tools: they observe the run. EvanCore version-controls the config and tells you which config to keep. If your harness spec is the source of truth, EvanCore is the lab bench it sits on and the press that ships it.
Evaluation transparency
Every external project we evaluated for absorption — and every one we declined — is logged in docs/evaluations.md. The default verdict is decline or take idea only; absorbing code is rare and requires a specific gap hit. This log exists so the project's scope stays defensible in public.
Access
- Design, schema, examples, evaluation log: this repo (Apache-2.0)
- Engine / CLI binary: private beta — open an issue or email to request access
- Visual tour: open
docs/index.htmlin a browser
Related work
memory-core-eval— sibling project, open verification layer for agent memory systemsEvanyuan-builder— profile, three-layer strategy (open verification · semi-open interface · closed core)
License
Apache-2.0 on materials in this repository (docs, schema, examples, landing page). See LICENSE.
The EvanCore engine is not covered by this license and is not distributed here.