agent-eval
March 16, 2026
Compare coding agents head-to-head. Pass rate, cost, time, consistency -- one command.
The problem
Every "which coding agent is best?" discussion runs on vibes. There's no lightweight tool to systematically compare agents on your tasks, with your codebase, tracking what actually matters: does it work, how long did it take, and what did it cost?
Quickstart
pip install -e ".[dev]"
agent-eval run --tasks examples/tasks.yaml --agents claude-code,aider --runs 3 --output report.json
agent-eval report --input report.json
Example output
Sample Evaluation Suite
-------------------------------------------------------------------
Agent          Pass Rate    Avg Time    Avg Cost    Consistency
-------------------------------------------------------------------
claude-code    80%          45.2s       $0.1200     0.95
aider          60%          30.1s       $0.0300     0.85
-------------------------------------------------------------------
Best pass rate: claude-code (80%) | Fastest: aider (30.1s avg) | Cheapest: aider ($0.0300 avg)
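The four columns above could be derived from per-run records roughly as in the sketch below. The `RunResult` shape and the `summarize` helper are hypothetical names, not part of agent-eval, and the consistency formula (1 minus the standard deviation of per-run pass rates) is an assumption: this README only says consistency is measured over N runs using the standard deviation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RunResult:
    """One agent attempt at one task (hypothetical record shape)."""
    agent: str
    run: int          # index of the repeated run this attempt belongs to
    passed: bool
    seconds: float
    cost_usd: float


def summarize(results: list[RunResult]) -> dict[str, float]:
    """Aggregate one agent's attempts into the four reported columns."""
    if not results:
        raise ValueError("no results to summarize")

    # Group pass/fail flags by repeated run so run-to-run spread is visible.
    by_run: dict[int, list[bool]] = {}
    for r in results:
        by_run.setdefault(r.run, []).append(r.passed)
    per_run_rates = [sum(flags) / len(flags) for flags in by_run.values()]

    return {
        "pass_rate": sum(r.passed for r in results) / len(results),
        "avg_seconds": mean([r.seconds for r in results]),
        "avg_cost_usd": mean([r.cost_usd for r in results]),
        # ASSUMPTION: 1.0 means every run had the same pass rate.
        "consistency": 1.0 - pstdev(per_run_rates),
    }
```

Under this definition an agent that passes the same share of tasks in every run scores 1.0, and the score drops as the runs disagree with one another.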
How it works
- Define tasks in YAML -- description, repo, test command, timeout
- Isolated execution -- each agent runs in a fresh git worktree (no Docker needed)
- Run agents -- Claude Code, Aider, or any CLI agent via adapters
- Judge results -- test commands (exit 0 = pass) or LLM-as-judge
- Report -- pass rate, avg time, avg cost, consistency across runs
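The isolation and judging steps above can be sketched with plain `git worktree` and `subprocess` calls. This is a minimal illustration, not agent-eval's actual implementation: the function names `make_worktree`, `remove_worktree`, `judge`, and `run_isolated` are hypothetical, and the agent is assumed to be any command that edits files in its working directory.

```python
import subprocess
import tempfile
from pathlib import Path


def make_worktree(repo: Path, parent: Path, name: str) -> Path:
    """Check out HEAD of `repo` into a fresh detached worktree under `parent`."""
    path = (parent / name).resolve()
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(path)],
        check=True, capture_output=True,
    )
    return path


def remove_worktree(repo: Path, path: Path) -> None:
    """Delete the worktree, discarding whatever the agent changed."""
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "remove", "--force", str(path)],
        check=True, capture_output=True,
    )


def judge(test_cmd: str, cwd: Path, timeout_seconds: int) -> bool:
    """Run the task's test command; exit code 0 counts as a pass."""
    try:
        proc = subprocess.run(test_cmd, shell=True, cwd=cwd,
                              timeout=timeout_seconds, capture_output=True)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


def run_isolated(repo: Path, agent_cmd: list[str], test_cmd: str,
                 timeout_seconds: int) -> bool:
    """Run one agent attempt in a throwaway worktree and judge the result."""
    with tempfile.TemporaryDirectory() as tmp:
        tree = make_worktree(repo, Path(tmp), "attempt")
        try:
            # The agent edits files inside the worktree, never the original checkout.
            subprocess.run(agent_cmd, cwd=tree, timeout=timeout_seconds,
                           capture_output=True)
            return judge(test_cmd, tree, timeout_seconds)
        except subprocess.TimeoutExpired:
            return False
        finally:
            remove_worktree(repo, tree)
```

Because each attempt gets its own worktree, repeated runs of the same task start from an identical checkout and cannot see each other's edits.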
Supported agents
| Agent | CLI | Status |
|---|---|---|
| Claude Code | claude | Supported |
| Aider | aider | Supported |
| Custom | Extend AgentAdapter | Extensible |
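A custom adapter might look something like the sketch below. The real `AgentAdapter` interface is not shown in this README, so the base class here is a hypothetical stand-in with assumed method names (`build_command`, `run`); check the project source for the actual signatures before extending it. `EchoAgent` is a toy agent that only prints the task description.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from pathlib import Path


class AgentAdapter(ABC):
    """HYPOTHETICAL stand-in; the project's real base class may differ."""

    name: str

    @abstractmethod
    def build_command(self, description: str) -> list[str]:
        """Return the CLI invocation that asks the agent to do the task."""

    def run(self, description: str, workdir: Path, timeout_seconds: int) -> int:
        """Invoke the agent inside `workdir` and return its exit code."""
        proc = subprocess.run(
            self.build_command(description),
            cwd=workdir, timeout=timeout_seconds,
            capture_output=True, text=True,
        )
        return proc.returncode


class EchoAgent(AgentAdapter):
    """Toy adapter: 'solves' a task by printing its description."""

    name = "echo-agent"

    def build_command(self, description: str) -> list[str]:
        # Passing the description as an argv entry avoids shell quoting issues.
        return [sys.executable, "-c", "import sys; print(sys.argv[1])", description]
```

Keeping command construction separate from execution means a new adapter only has to describe how its CLI is invoked.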
Task format
name: "My Eval Suite"
tasks:
  - id: add-auth
    description: "Add JWT auth middleware to the Flask app"
    repo: "./my-flask-app"
    test_cmd: "python -m pytest tests/ -v"
    timeout_seconds: 180
    tags: [python, auth]
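As an illustration of what each task entry carries, the sketch below validates one entry after the YAML has been loaded into a plain dict (for example with PyYAML's `yaml.safe_load`). The `Task` dataclass and `parse_task` helper are hypothetical, and treating every field except `tags` as required is an assumption, not documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One entry under `tasks:` (hypothetical in-memory shape)."""
    id: str
    description: str
    repo: str
    test_cmd: str
    timeout_seconds: int
    tags: tuple[str, ...] = ()


def parse_task(raw: dict) -> Task:
    """Validate a single task mapping and convert it to a Task."""
    # ASSUMPTION: these five keys are mandatory; `tags` is optional.
    required = ("id", "description", "repo", "test_cmd", "timeout_seconds")
    missing = [key for key in required if key not in raw]
    if missing:
        raise ValueError(f"task is missing fields: {', '.join(missing)}")

    timeout = int(raw["timeout_seconds"])
    if timeout <= 0:
        raise ValueError("timeout_seconds must be positive")

    return Task(
        id=str(raw["id"]),
        description=str(raw["description"]),
        repo=str(raw["repo"]),
        test_cmd=str(raw["test_cmd"]),
        timeout_seconds=timeout,
        tags=tuple(raw.get("tags", ())),
    )


# The example task from this section, as it would look once loaded.
example = parse_task({
    "id": "add-auth",
    "description": "Add JWT auth middleware to the Flask app",
    "repo": "./my-flask-app",
    "test_cmd": "python -m pytest tests/ -v",
    "timeout_seconds": 180,
    "tags": ["python", "auth"],
})
```

Failing early on a malformed task keeps a typo in the YAML from surfacing only after an agent run has already been paid for.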
vs SWE-bench
| | SWE-bench | agent-eval |
|---|---|---|
| Tasks | Fixed dataset (GitHub issues) | Your custom tasks |
| Setup | Docker, heavy infra | Git worktrees, no Docker |
| Cost tracking | No | Yes |
| Consistency measurement | No (1 run) | Yes (N runs, std dev) |
| Time to first result | Hours | Minutes |
CLI reference
agent-eval run --tasks YAML --agents NAME,NAME --runs N --concurrency N --output PATH
agent-eval report --input PATH
agent-eval list-agents
License
MIT