Sniffbench

December 18, 2025


A benchmark suite for coding agents. Think pytest, but for evaluating AI assistants.


What is this?

When you change your AI coding setup—switching models, adjusting prompts, adding MCP servers, or trying new tools—you're flying blind. Did it actually get better? Worse? Hard to say without data.

Sniffbench gives you that data. It runs your coding agent through evaluation tasks, captures your configuration, and measures what matters.

Quick Start

# Install globally
npm install -g sniffbench

# Or clone and build
git clone https://github.com/answerlayer/sniffbench.git
cd sniffbench && npm install && npm run build && npm link

# Check it's working
sniff --help
sniff doctor

Core Workflow

1. Run a Comprehension Interview

sniff interview

This runs your agent through comprehension questions about your codebase. You grade each answer (1-10) to establish baselines.

Every interview automatically:

  • Creates a run with a unique ID
  • Captures your agent configuration (version, model, MCP servers, tools)
  • Auto-links to matching variants (if registered)

# With an optional label for easy reference
sniff interview --run "baseline"

2. Register Variants for A/B Testing

Before making configuration changes, snapshot your current setup:

sniff variant register "control" --description "Stock Claude Code config"

Make your changes (add MCP server, update CLAUDE.md, etc.), then register the new config:

sniff variant register "with-linear" --description "Added Linear MCP server"

3. Compare Results

# Compare two runs
sniff compare baseline after-changes

# Or by run ID
sniff compare run-1734567890-abc123 run-1734567891-def456

The output shows both a config diff (what changed) and a metrics diff (whether it helped):

Configuration Changes:
  MCP: Linear: none → stdio
  Allowed Tools: none → 1 tools

Case Comparison:
  comp-001: Tokens 10,959 → 8,234 (-25%) ✓
  comp-002: Grade 7/10 → 9/10 ↑

Aggregate Summary:
  Total tokens: 45,000 → 38,000 ↓ -15.6%
  Total cost: $0.52 → $0.44 ↓ -15.4%
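
The percentages in the sample output are consistent with a plain relative change against the first run. The README does not say how Sniffbench computes them internally, so the following is only a sketch that reproduces the aggregate figures; `percentChange` and `formatDelta` are hypothetical helper names, not part of the tool.

```typescript
// Relative change from a baseline value to a new value, in percent.
// Negative results mean the metric went down (fewer tokens, lower cost).
function percentChange(before: number, after: number): number {
  if (before === 0) {
    throw new Error("percent change is undefined for a zero baseline");
  }
  return ((after - before) / before) * 100;
}

// Format with one decimal place and an explicit sign for increases.
function formatDelta(before: number, after: number): string {
  const pct = percentChange(before, after);
  const sign = pct > 0 ? "+" : "";
  return `${sign}${pct.toFixed(1)}%`;
}

console.log(formatDelta(45000, 38000)); // total tokens from the sample: -15.6%
console.log(formatDelta(0.52, 0.44));   // total cost from the sample: -15.4%
```

For token and cost metrics a negative delta is an improvement, which is why the sample output marks the drop as a win.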

Commands

sniff interview                 # Run comprehension interview
sniff variant register <name>   # Snapshot current config
sniff compare <run1> <run2>     # Compare two runs
sniff closed-issues scan        # Find repo issues to use as cases
sniff closed-issues run         # Evaluate agent on real issues

See COMMANDS.md for the full reference.

What Gets Captured

Agent Configuration (Automatic)

Every run captures:

Field | Source | Example
--- | --- | ---
Agent name | CLI detection | claude-code
Version | claude --version | 2.0.55
Model | API response | claude-sonnet-4-20250514
CLAUDE.md hash | File hash | 8b28a4e5...
MCP servers | ~/.claude.json | Linear(stdio)
Allowed tools | ~/.claude.json | Bash(osgrep:*)
Permission mode | Settings | default
Thinking mode | Settings | enabled

Behavior Metrics (Per Case)

Metric | What it measures
--- | ---
totalTokens | Total tokens used
inputTokens | Input/prompt tokens
cacheReadTokens | Tokens read from cache
cacheWriteTokens | Tokens written to cache
toolCount | Number of tool calls
readCount | Number of file reads
costUsd | Estimated API cost
explorationRatio | Read vs write tool ratio
cacheHitRatio | Cache efficiency
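
The two ratio metrics are derived from the raw counts. The README does not give their formulas, so the definitions below are assumptions chosen for illustration: exploration as the share of tool calls that were file reads, and cache hit ratio as the share of prompt tokens served from cache. The field names come from the table; the sample values are made up.

```typescript
// Field names follow the metrics table; the formulas below are assumptions.
interface CaseMetrics {
  inputTokens: number;
  cacheReadTokens: number;
  toolCount: number;
  readCount: number;
}

// Guard against division by zero: an empty denominator yields 0.
function safeRatio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// Assumed definition: share of tool calls that were file reads.
function explorationRatio(m: CaseMetrics): number {
  return safeRatio(m.readCount, m.toolCount);
}

// Assumed definition: share of prompt tokens that were served from cache.
function cacheHitRatio(m: CaseMetrics): number {
  return safeRatio(m.cacheReadTokens, m.cacheReadTokens + m.inputTokens);
}

// Hypothetical case: 9 of 12 tool calls were reads, 6000 of 8000 prompt tokens cached.
const sampleMetrics: CaseMetrics = {
  inputTokens: 2000,
  cacheReadTokens: 6000,
  toolCount: 12,
  readCount: 9,
};
console.log(explorationRatio(sampleMetrics)); // 0.75
console.log(cacheHitRatio(sampleMetrics));    // 0.75
```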

Variant System

Variants enable scientific A/B testing of agent configurations.

Why Variants?

Without variants, you're comparing runs but don't know why one performed differently. Variants let you:

  1. Document what changed: "Added Linear MCP", "Updated CLAUDE.md prompts"
  2. Auto-link runs: Runs automatically link to matching variants
  3. Compare configs: See exactly what's different between setups
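
To make the "compare configs" idea concrete, here is a minimal sketch of a field-by-field diff between two configuration snapshots. It is not Sniffbench's implementation: `ConfigSnapshot` and `diffConfig` are hypothetical names, and the flat key/value shape is an assumption made to keep the example short.

```typescript
// A flat snapshot of agent configuration. Keys and values are illustrative.
type ConfigSnapshot = Record<string, string>;

// Return one "key: before → after" line per field that differs.
// A field missing on one side is reported as "none".
function diffConfig(before: ConfigSnapshot, after: ConfigSnapshot): string[] {
  const keys = Array.from(
    new Set([...Object.keys(before), ...Object.keys(after)])
  ).sort();
  const changes: string[] = [];
  for (const key of keys) {
    const oldValue = key in before ? before[key] : "none";
    const newValue = key in after ? after[key] : "none";
    if (oldValue !== newValue) {
      changes.push(`${key}: ${oldValue} → ${newValue}`);
    }
  }
  return changes;
}

// Hypothetical snapshots: the treatment adds a Linear MCP server over stdio.
const controlConfig: ConfigSnapshot = { Model: "claude-sonnet-4-20250514" };
const treatmentConfig: ConfigSnapshot = {
  Model: "claude-sonnet-4-20250514",
  "MCP: Linear": "stdio",
};
console.log(diffConfig(controlConfig, treatmentConfig));
```

Unchanged fields are omitted, so the diff lists only what distinguishes the two setups.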

Sandboxed Variant Execution

For true isolation, variants can be packaged as Docker containers with your configuration baked in:

# Register and build a container image
sniff variant register "control" -d "Stock config" --build

# Make changes to CLAUDE.md, add MCP servers, etc...

# Register the treatment variant
sniff variant register "with-osgrep" -d "Added semantic search" --build

# Run interview in sandboxed container
sniff interview --use-variant control
sniff interview --use-variant with-osgrep

# Compare the runs
sniff compare <control-run> <osgrep-run>

Each container includes:

  • Claude Code (same version as your host)
  • Your CLAUDE.md baked in
  • Tool permissions configured
  • Complete isolation from host config

Requirements: Docker must be installed for sandboxed execution.

Workflow Example (Without Containers)

# 1. Register your baseline config
sniff variant register "control" -d "Stock Claude Code"

# 2. Run some interviews
sniff interview --run "control-test-1"
sniff interview --run "control-test-2"

# 3. Make changes to your setup
# ... add MCP server, update CLAUDE.md, etc ...

# 4. Register the new config
sniff variant register "treatment" -d "Added semantic search"

# 5. Run more interviews (auto-links to "treatment")
sniff interview --run "treatment-test-1"

# 6. Compare!
sniff compare control-test-1 treatment-test-1

Storage

All data is stored in .sniffbench/ in your project root:

.sniffbench/
├── runs.json       # All runs with results and config
├── variants.json   # Registered variants
└── baselines.json  # Legacy format (auto-migrated)

Case Types

Type | Description | Status
--- | --- | ---
Comprehension | Questions about codebase architecture | ✅ Ready
Bootstrap | Common tasks (fix linting, rename symbols) | 🚧 In Progress
Closed Issues | Real issues from your repo's history | ✅ Ready

What We Measure

Sniffbench evaluates agents on behaviors that matter for real-world development:

  1. Comprehension - Does the agent understand the codebase?
  2. Efficiency - Does it explore without wasting tokens?
  3. Accuracy - Are its answers correct and complete?
  4. Consistency - Does it perform reliably across runs?
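
One simple way to put a number on consistency is the spread of grades for the same case across repeated runs. The sketch below uses mean and population standard deviation; this is an illustrative choice, not a description of how Sniffbench scores consistency, and the grades are hypothetical.

```typescript
// Mean of a non-empty list of numbers.
function mean(values: number[]): number {
  if (values.length === 0) {
    throw new Error("need at least one value");
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation: 0 means every run received the same grade.
function stdDev(values: number[]): number {
  const mu = mean(values);
  const variance = mean(values.map((v) => (v - mu) * (v - mu)));
  return Math.sqrt(variance);
}

// Hypothetical 1-10 grades for one case across four runs.
const caseGrades = [7, 9, 8, 8];
console.log(mean(caseGrades));   // 8
console.log(stdDev(caseGrades)); // spread around the mean
```

A lower standard deviation at a similar mean suggests the configuration behaves more predictably from run to run.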

See VALUES.md for our full evaluation philosophy.

Contributing

We welcome contributions! Areas that need work:

  • Agent wrappers - Integrate with Cursor, Aider, or other coding agents
  • Bootstrap cases - Detection and validation for common tasks
  • LLM-judge - Automated answer quality evaluation
  • Documentation - Examples, tutorials, case studies

See CONTRIBUTING.md to get started.

License

MIT - see LICENSE