
April 17, 2026

AgentAssay

Test More. Spend Less. Ship Confident.

The first agent testing framework that delivers statistical guarantees WITHOUT burning your token budget.

A Qualixar Research Initiative by Varun Pratap Bhardwaj



The Problem

Every time you change a prompt, swap a model, or update a tool, you need to know: does my agent still work?

Today, answering that question is painfully expensive. Run 100 trials across 20 scenarios and you have paid for 2,000 agent runs just to check for a regression. Most teams either:

  • Over-test: Run fixed-N trials and waste budget on scenarios that don't need it.
  • Under-test: Skip testing because the cost is too high, and ship broken agents.
  • Guess: Run a few trials, eyeball the results, and hope for the best.

None of these are engineering. They are gambling.

The Solution

AgentAssay introduces token-efficient agent testing -- three techniques that deliver the same statistical confidence at a fraction of the cost:

1. Behavioral Fingerprinting

Instead of comparing raw text outputs (high-dimensional, noisy, expensive), AgentAssay extracts behavioral fingerprints -- compact representations of what the agent did rather than what it said. Tool sequences, state transitions, decision patterns. Low-dimensional signals need fewer samples to detect change.
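
To make the idea concrete, here is a minimal sketch of one possible fingerprint: normalized tool-call bigram frequencies compared by cosine distance. It is an illustration only, not AgentAssay's actual `BehavioralFingerprint` implementation; the function names and the input format (a list of tool-name sequences) are assumptions.

```python
from collections import Counter
from math import sqrt


def fingerprint(tool_sequences):
    """Aggregate tool-call bigram frequencies over a set of runs (hypothetical helper)."""
    counts = Counter()
    for seq in tool_sequences:
        # Pad each run so the first and last tool calls are also captured.
        padded = ["<start>", *seq, "<end>"]
        counts.update(zip(padded, padded[1:]))
    total = sum(counts.values())
    return {bigram: c / total for bigram, c in counts.items()}


def cosine_distance(fp_a, fp_b):
    """Return 0.0 for identical behavior and up to 1.0 for disjoint behavior."""
    keys = set(fp_a) | set(fp_b)
    dot = sum(fp_a.get(k, 0.0) * fp_b.get(k, 0.0) for k in keys)
    norm_a = sqrt(sum(v * v for v in fp_a.values()))
    norm_b = sqrt(sum(v * v for v in fp_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


baseline = fingerprint([["search", "book"], ["search", "book"]])
same = fingerprint([["search", "book"]])
changed = fingerprint([["search", "search", "cancel"]])

print(cosine_distance(baseline, same))     # close to zero: same behavior
print(cosine_distance(baseline, changed))  # clearly larger: behavior drifted
```

The fingerprint has one dimension per observed tool transition, which is far smaller than the space of raw text outputs.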

2. Adaptive Budget Optimization

No more guessing how many trials to run. AgentAssay runs a small calibration set (5-10 runs), measures behavioral variance, and computes the exact minimum number of trials needed for your target confidence level. High-variance scenarios get more trials. Stable scenarios get fewer. Zero waste.
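
The calculation can be sketched with a textbook normal-approximation sample-size formula for a one-sided test. This is an assumption about the general approach, not the formula AgentAssay itself uses; `required_trials`, `min_effect`, and the per-run scores are hypothetical names for illustration.

```python
from math import ceil
from statistics import NormalDist, variance


def required_trials(calibration_scores, min_effect, alpha=0.05, beta=0.10):
    """Estimate the trials needed to detect a shift of `min_effect` in the mean score.

    Uses n = (z_{1-alpha} + z_{1-beta})^2 * sigma^2 / min_effect^2, with sigma^2
    estimated from the calibration runs (normal approximation, one-sided test).
    """
    sigma2 = variance(calibration_scores)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    n = (z_alpha + z_beta) ** 2 * sigma2 / min_effect ** 2
    return max(1, ceil(n))


# Ten calibration runs each for a stable and a noisy scenario (made-up scores).
stable = [0.90, 0.92, 0.88, 0.91, 0.90, 0.89, 0.90, 0.91, 0.90, 0.90]
noisy = [0.40, 0.90, 0.60, 1.00, 0.30, 0.80, 0.50, 0.95, 0.70, 0.20]

print(required_trials(stable, min_effect=0.1))
print(required_trials(noisy, min_effect=0.1))
```

With the same target effect and error rates, the noisy scenario is assigned many more trials than the stable one.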

3. Trace-First Offline Analysis

Coverage metrics, contract checks, metamorphic relations, and mutation analysis can all run on production traces you already have -- at zero additional token cost. Why re-run your agent when you can analyze runs that already happened?
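
As an illustration of an offline check, the sketch below computes tool coverage from traces that were already recorded, without invoking the agent. The trace layout (a dict with a `steps` list whose entries carry a `tool` key) is an assumed format for this example, not AgentAssay's trace schema.

```python
def tool_coverage(traces, expected_tools):
    """Return (coverage ratio, sorted list of expected tools never seen in the traces)."""
    seen = {
        step.get("tool")
        for trace in traces
        for step in trace.get("steps", [])
    }
    seen.discard(None)  # steps without a tool call do not count
    expected = set(expected_tools)
    covered = seen & expected
    ratio = len(covered) / len(expected) if expected else 1.0
    return ratio, sorted(expected - seen)


# Two stored traces (hypothetical data standing in for production logs).
stored_traces = [
    {"steps": [{"tool": "search"}, {"tool": "book"}]},
    {"steps": [{"tool": "search"}, {"note": "no tool call"}]},
]

ratio, missing = tool_coverage(stored_traces, ["search", "book", "cancel"])
print(f"Tool coverage: {ratio:.0%}, never exercised: {missing}")
```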

Result: Same confidence. 83% less cost.


Install

pip install agentassay                    # Core (works with CustomAdapter)
pip install agentassay[langgraph]         # + LangGraph support
pip install agentassay[crewai]            # + CrewAI support
pip install agentassay[all]               # All framework adapters

Supported Frameworks

AgentAssay works with every major agent framework — zero lock-in, plug-and-play.

| Framework | Install | Adapter |
|---|---|---|
| LangGraph | pip install agentassay[langgraph] | LangGraphAdapter |
| CrewAI | pip install agentassay[crewai] | CrewAIAdapter |
| AutoGen | pip install agentassay[autogen] | AutoGenAdapter |
| OpenAI Agents | pip install agentassay[openai] | OpenAIAgentsAdapter |
| smolagents | pip install agentassay[smolagents] | SmolAgentsAdapter |
| Semantic Kernel | pip install agentassay[semantic-kernel] | SemanticKernelAdapter |
| AWS Bedrock Agents | pip install agentassay[bedrock] | BedrockAgentsAdapter |
| MCP | pip install agentassay[mcp] | MCPToolsAdapter |
| Vertex AI Agents | pip install agentassay[vertex] | VertexAIAgentsAdapter |
| Any custom agent | pip install agentassay | CustomAdapter |

Don't see your framework? Use CustomAdapter — wrap any callable that returns execution traces.


Quick Start: Pick Your Framework

LangGraph:

from agentassay.integrations import LangGraphAdapter

adapter = LangGraphAdapter(graph=your_graph)
trace = adapter.run({"query": "Book a flight from NYC to London"})
print(f"Steps: {len(trace.steps)}, Cost: ${trace.total_cost_usd:.4f}")

CrewAI:

from agentassay.integrations import CrewAIAdapter

adapter = CrewAIAdapter(crew=your_crew)
trace = adapter.run({"task": "Research protein folding"})
print(f"Success: {trace.success}, Duration: {trace.total_duration_ms}ms")

Any Framework:

from agentassay.integrations import CustomAdapter

def my_agent_fn(input_data):
    # Your agent logic here
    return execution_trace

adapter = CustomAdapter(callable_fn=my_agent_fn)
trace = adapter.run({"query": "Hello world"})

Try the demo:

# See it in action instantly (no config needed)
agentassay demo

Full Example: Token-Efficient Testing

from agentassay.efficiency import BehavioralFingerprint, AdaptiveBudgetOptimizer
from agentassay.core.runner import TrialRunner
from agentassay.verdicts import VerdictFunction

# Step 1: Calibrate -- run just 10 trials to measure variance
optimizer = AdaptiveBudgetOptimizer(alpha=0.05, beta=0.10)
estimate = optimizer.calibrate(calibration_traces)

print(f"Recommended trials: {estimate.recommended_n}")   # e.g., 17 (not 100)
print(f"Estimated cost: ${estimate.estimated_cost_usd:.2f}")  # e.g., $0.34
print(f"Savings vs fixed-100: {estimate.savings_vs_fixed_100:.0%}")  # e.g., 83%

# Step 2: Run only the trials you need
runner = TrialRunner(agent_fn=my_agent, config=config)
results = runner.run_trials(scenario, n=estimate.recommended_n)

# Step 3: Compare fingerprints for regression detection
baseline_fp = BehavioralFingerprint.from_traces(baseline_traces)
current_fp = BehavioralFingerprint.from_traces(current_traces)
drift = baseline_fp.distance(current_fp)

# Step 4: Get a statistically-backed verdict
verdict = VerdictFunction(alpha=0.05).evaluate(results)
print(f"Verdict: {verdict.status}")  # PASS / FAIL / INCONCLUSIVE
print(f"Pass rate: {verdict.pass_rate:.1%} [{verdict.ci_lower:.1%}, {verdict.ci_upper:.1%}]")
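
For intuition about the three-valued verdict in Step 4, the sketch below derives PASS, FAIL, or INCONCLUSIVE from a Wilson score interval on the pass rate. It is a simplified stand-in under stated assumptions, not the logic inside `VerdictFunction`; the function names and the default threshold are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes, n, alpha=0.05):
    """Two-sided Wilson score interval for a binomial pass rate."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = passes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def verdict(passes, n, threshold=0.80, alpha=0.05):
    """PASS if the whole interval clears the threshold, FAIL if it lies entirely below."""
    lower, upper = wilson_interval(passes, n, alpha)
    if lower >= threshold:
        return "PASS"
    if upper < threshold:
        return "FAIL"
    return "INCONCLUSIVE"


print(verdict(30, 30))  # PASS
print(verdict(15, 30))  # FAIL
print(verdict(25, 30))  # INCONCLUSIVE
```

The third case shows why a binary answer would mislead: 25 of 30 passes is above an 80% threshold as a point estimate, but the interval still straddles it.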

How It Works

                      Token-Efficient Testing Pipeline
  +-----------------------------------------------------------------+
  |                                                                   |
  |  Production Traces -----> Trace Store -----> Offline Analysis     |
  |  (already paid for)                          (coverage, contracts,|
  |                                               metamorphic -- FREE)|
  |                                                     |             |
  |  New Agent Version --> Calibration (5-10 runs) --> Budget Estimate|
  |                                                     |             |
  |                   Targeted Testing (optimal N) --> Fingerprint    |
  |                                                    Comparison     |
  |                                                     |             |
  |                                          Statistical Verdict      |
  |                                          (5-20x cheaper)          |
  +-----------------------------------------------------------------+

The core insight: most of the information you need to test an agent is already in traces you have collected. AgentAssay extracts maximum signal from minimum runs.


Feature Matrix

| Feature | Description |
|---|---|
| Behavioral fingerprinting | Detect regression from behavioral patterns, not raw text. Fewer samples needed. |
| Adaptive budget optimization | Calibrate variance, compute exact minimum N. No over-testing. |
| Trace-first offline analysis | Run coverage, contracts, and metamorphic checks on existing traces. Zero token cost. |
| Multi-fidelity proxy testing | Use cheaper models for initial screening, expensive models only for confirmation. |
| Warm-start sequential testing | Incorporate prior results to reach verdicts faster. |
| Three-valued verdicts | PASS, FAIL, or INCONCLUSIVE -- never a misleading binary answer. |
| Confidence intervals | Know the true pass rate range, not a point estimate. |
| Statistical regression detection | Hypothesis tests catch regressions before production. |
| 5D coverage metrics | Measure tool, path, state, boundary, and model coverage. |
| Mutation testing | Perturb your agent to validate test sensitivity. |
| Metamorphic testing | Verify behavioral invariants across input transformations. |
| Contract oracle | Check behavioral specifications from AgentAssert contracts. |
| Deployment gates | Block broken deployments in CI/CD with statistical evidence. |
| Framework adapters | Works with popular agent frameworks out of the box. |
| pytest integration | Use familiar pytest conventions with statistical assertions. |
| CLI | Commands including run, compare, mutate, coverage, and report. |
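
As an example of one row in the matrix, metamorphic testing can be sketched as follows: apply a transformation that should not change behavior, then check that a relation holds between the original and transformed runs. The toy agent, the transformation (upper-casing the query), and the relation (identical tool sequence) are all hypothetical and unrelated to AgentAssay's own API.

```python
def check_metamorphic(agent_fn, inputs, transform, relation):
    """Return the inputs for which relation(original run, transformed run) fails."""
    violations = []
    for x in inputs:
        if not relation(agent_fn(x), agent_fn(transform(x))):
            violations.append(x)
    return violations


def toy_agent(query):
    """Stand-in agent that returns the list of tools it would call."""
    tools = ["search"]
    if "book" in query.lower():
        tools.append("book")
    return tools


queries = ["Book a flight from NYC to London", "Find flights to Paris"]

# Invariant under test: changing letter case must not change the tool sequence.
violations = check_metamorphic(
    toy_agent,
    queries,
    transform=str.upper,
    relation=lambda original, transformed: original == transformed,
)
print(violations)  # an empty list means the invariant held for every input
```

No expected output is needed for such a check, which is what makes it usable on agents with no single correct answer.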

Comparison

| Feature | AgentAssay | deepeval | agentrial | LangSmith |
|---|---|---|---|---|
| Statistical regression testing | ✅ | ❌ | ⚠️ | ❌ |
| Three-valued verdicts | ✅ | ❌ | ❌ | ❌ |
| Token-efficient testing | ✅ | ❌ | ❌ | ❌ |
| Behavioral fingerprinting | ✅ | ❌ | ❌ | ❌ |
| Adaptive budget optimization | ✅ | ❌ | ❌ | ❌ |
| Trace-first offline analysis | ✅ | ❌ | ❌ | ❌ |
| 5D coverage metrics | ✅ | ❌ | ❌ | ❌ |
| Mutation testing | ✅ | ❌ | ❌ | ❌ |
| Metamorphic testing | ✅ | ❌ | ❌ | ❌ |
| CI/CD deployment gates | ✅ | ❌ | ✅ | ❌ |
| Published research paper | ✅ | ❌ | ❌ | ❌ |

Architecture

+-------------------------------------------------------------------+
|  Layer 6: Efficiency                                               |
|  Fingerprinting | Budget Optimization | Trace Analysis             |
|  Multi-Fidelity | Warm-Start Sequential                           |
+-------------------------------------------------------------------+
|  Layer 5: Integration                                              |
|  Framework Adapters | pytest Plugin | CLI | Reporting              |
+-------------------------------------------------------------------+
|  Layer 4: Analysis                                                 |
|  Coverage (5D) | Mutation | Metamorphic | Contract Oracle          |
+-------------------------------------------------------------------+
|  Layer 3: Verdicts                                                 |
|  Stochastic Verdicts | Deployment Gates                           |
+-------------------------------------------------------------------+
|  Layer 2: Statistics                                               |
|  Hypothesis Tests | Confidence Intervals | SPRT | Effect Size      |
+-------------------------------------------------------------------+
|  Layer 1: Core                                                     |
|  Data Models | Execution Engine | Trace Format                    |
+-------------------------------------------------------------------+

Layer 6 (Efficiency) is the differentiator. It sits atop the full statistical testing stack, optimizing how many runs are needed while Layers 1-5 ensure every run produces rigorous results.
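
Layer 2 lists SPRT (the sequential probability ratio test), which lets a test stop as soon as the evidence is decisive. The sketch below is the classic Wald SPRT for pass/fail outcomes, shown for intuition only; the hypothesized pass rates `p0` and `p1` are illustrative choices, and this is not AgentAssay's implementation.

```python
from math import log


def sprt(outcomes, p0=0.9, p1=0.7, alpha=0.05, beta=0.10):
    """Wald SPRT: H0 pass rate is p0 (healthy) vs H1 pass rate is p1 (regressed).

    Returns (decision, number of trials consumed).
    """
    upper = log((1 - beta) / alpha)   # crossing this accepts H1 (regression)
    lower = log(beta / (1 - alpha))   # crossing this accepts H0 (no regression)
    llr = 0.0
    for i, passed in enumerate(outcomes, start=1):
        # Add the log-likelihood ratio contribution of this trial.
        llr += log(p1 / p0) if passed else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "regression", i
        if llr <= lower:
            return "no regression", i
    return "inconclusive", len(outcomes)


print(sprt([True] * 20))    # ('no regression', 9): stops after 9 straight passes
print(sprt([False] * 5))    # ('regression', 3): stops after 3 straight failures
print(sprt([True, False]))  # ('inconclusive', 2): not enough evidence yet
```

Stopping early on clear-cut cases is where sequential testing saves runs compared with a fixed trial count.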


Usage with pytest

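import pytest

# TestScenario, assert_pass_rate, and my_agent are assumed to be imported or
# defined elsewhere; their import paths are not shown in this example.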

@pytest.mark.agentassay(n=30, threshold=0.80)
def test_agent_booking_flow(trial_runner):
    runner = trial_runner(my_agent)
    scenario = TestScenario(
        scenario_id="booking",
        name="Flight booking",
        input_data={"task": "Book a flight from NYC to London"},
        expected_properties={"max_steps": 10, "must_use_tools": ["search", "book"]},
    )
    results = runner.run_trials(scenario)
    assert_pass_rate(results, threshold=0.80, confidence=0.95)

Run the tests:

python -m pytest tests/ -v --agentassay

CLI

AgentAssay provides commands for testing, analysis, and reporting:

# Try the interactive demo (no setup needed)
agentassay demo

# Run trials with adaptive budget
agentassay run --scenario booking.yaml --budget-mode adaptive

# Compare two versions for regression
agentassay compare --baseline v1.json --current v2.json

# Analyze coverage from existing traces
agentassay coverage --traces production-traces/ --tools search,book,cancel

# Mutation testing
agentassay mutate --scenario booking.yaml --operators prompt,tool,model

# Generate test reports
agentassay test-report --results trials.json --format html

# Generate full HTML report
agentassay report --results trials.json --output report.html

# Check version
agentassay --version


Research

AgentAssay is backed by a published research paper with formal definitions, theorems, and proofs.

Paper: arXiv:2603.02601 (cs.AI + cs.SE)
DOI: 10.5281/zenodo.18842011

@article{bhardwaj2026agentassay,
  title={AgentAssay: Formal Regression Testing for Non-Deterministic AI Agent Workflows},
  author={Bhardwaj, Varun Pratap},
  journal={arXiv preprint arXiv:2603.02601},
  year={2026},
  doi={10.5281/zenodo.18842011}
}

Contributing

Contributions welcome. See CONTRIBUTING.md for guidelines.


License

GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE.

For commercial licensing (closed-source, proprietary, or hosted use), see COMMERCIAL-LICENSE.md or contact varun.pratap.bhardwaj@gmail.com.

Copyright (c) 2026 Varun Pratap Bhardwaj / Qualixar.


Part of Qualixar — The Complete Agent Development Platform
A research initiative by Varun Pratap Bhardwaj

qualixar.com · varunpratap.com · arXiv:2603.02601


⭐ Support This Project

If this project solves a real problem for you, please star the repo — it helps other developers discover Qualixar and signals that the AI agent reliability community is growing. Every star matters.



Part of the Qualixar AI Agent Reliability Platform

Qualixar is building the open-source infrastructure for AI agent reliability engineering: seven products, five of them with an accompanying arXiv paper listed below, in one coherent platform. Each tool addresses one reliability pillar:

| Product | Purpose | Install | Paper |
|---|---|---|---|
| SuperLocalMemory | Persistent memory + learning for AI agents | npx superlocalmemory | arXiv:2604.04514 |
| Qualixar OS | Universal agent runtime (13 execution topologies) | npx qualixar-os | arXiv:2604.06392 |
| SLM Mesh | P2P coordination across AI agent sessions | npm i slm-mesh | |
| SLM MCP Hub | Federate 430+ MCP tools through one gateway | pip install slm-mcp-hub | |
| AgentAssay | Token-efficient AI agent testing | pip install agentassay | arXiv:2603.02601 |
| AgentAssert | Behavioral contracts + drift detection | pip install agentassert-abc | arXiv:2602.22302 |
| SkillFortify | Formal verification for AI agent skills | pip install skillfortify | arXiv:2603.00195 |

Zero cloud dependency. Local-first. EU AI Act compliant.

Start here → qualixar.com · All papers on Qualixar HuggingFace