
April 17, 2026

AgentAssert
Formal Behavioral Contracts for AI Agents



AgentAssert is the formal behavioral specification and runtime enforcement engine for autonomous AI agents. Define what your agent must and must not do in a YAML contract, then enforce those rules at runtime with mathematical guarantees.

It is the only framework combining all 6 pillars of rigorous agent governance:

  1. ContractSpec DSL -- YAML-based behavioral specification with 14 operators
  2. Hard/Soft Constraints -- Formal separation with graduated enforcement and recovery
  3. Drift Detection -- Jensen-Shannon Divergence for distributional behavioral analysis
  4. (p, delta, k)-Satisfaction -- Probabilistic compliance guarantees with statistical bounds
  5. Compositional Safety Proofs -- Formal bounds for multi-agent pipelines
  6. Mathematical Stability -- Ornstein-Uhlenbeck dynamics with Lyapunov stability proof

Paper: Bhardwaj, V.P. (2026). AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents. arXiv:2602.22302


Install

pip install "agentassert-abc[yaml,math]"

Requires Python 3.12+. Licensed under AGPL-3.0.

Optional extras:

| Extra | What it adds |
|-------|--------------|
| yaml | YAML contract parsing (ruamel.yaml) |
| math | Drift detection, Theta computation (scipy, numpy) |
| llm | Recovery re-prompting (LiteLLM) |
| otel | OpenTelemetry metric export |
| all | Everything above |

Quick Start -- 5 Minutes to Behavioral Contracts

import agentassert_abc as aa
from agentassert_abc.integrations.generic import GenericAdapter

# 1. Load a domain contract (12 included out of the box)
contract = aa.load("contracts/examples/ecommerce-product-recommendation.yaml")

# 2. Create an adapter
adapter = GenericAdapter(contract)

# 3. Monitor agent output on every turn
result = adapter.check({
    "output.pii_detected": False,
    "output.competitor_reference_detected": False,
    "output.sponsored_items_disclosed": True,
    "output.brand_tone_score": 0.85,
    "output.recommendation_relevance_score": 0.9,
})

print(f"Hard violations: {result.hard_violations}")
print(f"Soft violations: {result.soft_violations}")

# 4. Raise on critical violations
adapter.check_and_raise({
    "output.pii_detected": False,
    "output.competitor_reference_detected": False,
    "output.sponsored_items_disclosed": True,
    "output.brand_tone_score": 0.85,
    "output.recommendation_relevance_score": 0.9,
})

# 5. Get session reliability score (Theta)
summary = adapter.session_summary()
print(f"Reliability (Theta): {summary.theta:.3f}")
print(f"Deploy-ready: {summary.theta >= 0.90}")

Framework Integration

AgentAssert is plug-and-play with the major 2026 agent frameworks.

LangGraph -- Node Interception

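import agentassert_abc as aa
from langgraph.graph import StateGraph, START, END
from agentassert_abc.exceptions import ContractBreachError
from agentassert_abc.integrations.langgraph import LangGraphAdapter

# State, classify_fn, respond_fn, and initial_state come from your own application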

contract = aa.load("contracts/examples/customer-support.yaml")
adapter = LangGraphAdapter(contract)

builder = StateGraph(State)
builder.add_node("classify", adapter.wrap_node(classify_fn))
builder.add_node("respond", adapter.wrap_node(respond_fn))
builder.add_edge(START, "classify")
builder.add_edge("classify", "respond")
builder.add_edge("respond", END)

graph = builder.compile()

try:
    result = graph.invoke(initial_state)
except ContractBreachError as e:
    print(f"Hard violation blocked: {e}")

print(f"Session Theta: {adapter.session_summary().theta:.3f}")

CrewAI -- Task Guardrails

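import agentassert_abc as aa
from crewai import Agent, Task, Crew
from agentassert_abc.integrations.crewai import CrewAIAdapter

# `researcher` below is a crewai Agent defined elsewhere in your application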

contract = aa.load("contracts/examples/research-assistant.yaml")
adapter = CrewAIAdapter(contract)

# Guardrail rejects output on hard violations -- CrewAI retries automatically
research_task = Task(
    description="Research AI agent frameworks in 2026",
    expected_output="Cited report on top 5 frameworks",
    agent=researcher,
    guardrail=adapter.guardrail,
    guardrail_max_retries=3,
)

OpenAI Agents SDK -- Output Guardrails

import agentassert_abc as aa
from agents import Agent, Runner
from agentassert_abc.integrations.openai_agents import OpenAIAgentsAdapter

# TriageOutput below is your application's own structured output type

contract = aa.load("contracts/examples/healthcare-triage.yaml")
adapter = OpenAIAgentsAdapter(contract)

agent = Agent(
    name="triage-agent",
    instructions="You are a medical triage assistant.",
    output_guardrails=[adapter.output_guardrail],
    output_type=TriageOutput,
)

# Runner.run is a coroutine: await it from inside an async function
result = await Runner.run(agent, "I have chest pain", hooks=adapter.run_hooks)
print(f"Theta: {adapter.session_summary().theta:.3f}")

AgentContract-Bench -- 293 Scenarios, 12 Domains

AgentAssert ships with AgentContract-Bench, a benchmark suite of 293 scenarios across 12 real-world domains for testing contract enforcement accuracy.

Benchmark Results (v0.1.0)

| Domain | Scenarios | Pass Rate | Hard P/R/F1 | Soft P/R/F1 |
|--------|-----------|-----------|-------------|-------------|
| E-Commerce (Product) | 50 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Financial Advisor | 33 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Healthcare Triage | 33 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| MCP Tool Server | 28 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| RAG Agent | 28 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Code Generation | 23 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Customer Support | 23 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| E-Commerce (CS) | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| E-Commerce (Order) | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Research Assistant | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Retail Shopping | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Telecom Support | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Total | 293 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |

# Run benchmarks locally
python benchmarks/runner.py                     # All 293 scenarios
python benchmarks/runner.py --domain ecommerce  # Single domain
python benchmarks/runner.py --verbose           # Show details

Live LLM Benchmark -- Real Models, Real Contracts

We tested AgentAssert against 3 production LLMs in e-commerce sessions of 10 to 16 turns, using the retail-shopping-assistant contract with real Azure AI Foundry endpoints:

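| Model | Turns | Hard Violations | Soft Violations | Theta | Mean Drift |
|-------|-------|-----------------|-----------------|-------|------------|
| GPT-5.3 (OpenAI) | 16 | 0 | 11 | 0.688 | 0.034 |
| Claude Sonnet 4.6 (Anthropic) | 10 | 4 | 0 | 0.823 | 0.020 |
| Mistral-Large-3 (Mistral) | 10 | 5 | 0 | 0.813 | 0.025 |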

Key findings:

  • GPT-5.3 achieved zero hard violations but exhibited soft quality drift (response completeness and latency)
  • Claude Sonnet 4.6 and Mistral-Large-3 triggered no-false-availability hard violations -- fabricating product availability without catalog access
  • All three models scored below the 0.90 Theta threshold for autonomous deployment, demonstrating why runtime behavioral contracts are essential

These results are consistent with the findings reported in arXiv:2602.22302. AgentAssert catches violations that traditional guardrails miss because it tracks behavioral drift over entire sessions, not just individual outputs.
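
To make the drift metric more concrete, here is a minimal sketch of Jensen-Shannon divergence between two discrete distributions, using only the standard library. It is the textbook formula, not AgentAssert's internal code: the function names, the base-2 logarithm, and the example action categories are assumptions for illustration, and this README does not describe how the library builds its distributions from a session.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric, and bounded in [0, 1]."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical action frequencies over [recommend, clarify, refuse]
baseline = [0.70, 0.20, 0.10]   # early turns of a session
recent = [0.40, 0.30, 0.30]     # later turns of the same session

drift = js_divergence(baseline, recent)
print(f"JSD: {drift:.3f}")
```

Identical distributions give 0 and fully disjoint ones give 1, so a small per-session mean such as those in the table above indicates behavior that stayed close to its baseline.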


Domain Contracts -- Ready to Use

12 production-ready contracts ship with AgentAssert in contracts/examples/:

| Contract | Domain | Hard | Soft | Key Checks |
|----------|--------|------|------|------------|
| ecommerce-product-recommendation | E-Commerce | 7 | 8 | PII, competitor mentions, sponsored disclosure |
| ecommerce-order-management | E-Commerce | 7 | 8 | Payment data, order accuracy, refund policy |
| ecommerce-customer-service | E-Commerce | 7 | 8 | Escalation, SLA, customer sentiment |
| financial-advisor | Finance | 7 | 8 | Regulatory compliance, risk disclosure, suitability |
| healthcare-triage | Healthcare | 9 | 7 | Medical safety, urgency detection, no diagnosis |
| retail-shopping-assistant | Retail | 7 | 9 | Availability, pricing accuracy, upsell limits |
| telecom-customer-support | Telecom | 7 | 9 | Plan accuracy, billing, cancellation handling |
| code-generation | Dev Tools | 7 | 7 | License compliance, security, test coverage |
| research-assistant | Research | 6 | 7 | Citation accuracy, source attribution, bias |
| customer-support | General | 6 | 5 | Tone, escalation, resolution quality |
| mcp-tool-server | MCP (2026) | 6 | 5 | Tool authorization, rate limits, output bounds |
| rag-agent | RAG (2026) | 7 | 7 | Hallucination, source grounding, retrieval quality |

ContractSpec DSL

Define behavioral contracts in YAML:

contractspec: "0.1"
kind: agent
name: my-agent-contract
description: Behavioral contract for my agent
version: "1.0.0"

invariants:
  hard:
    - name: no-pii-leak
      description: Never expose personal information
      check:
        field: output.pii_detected
        equals: false

  soft:
    - name: tone-quality
      description: Maintain professional tone
      check:
        field: output.tone_score
        gte: 0.7
      recovery: fix-tone
      recovery_window: 2

recovery:
  strategies:
    - name: fix-tone
      type: inject_correction
      actions:
        - "Rewrite with professional tone"

satisfaction:
  p: 0.95
  delta: 0.1
  k: 3

14 operators: equals, not_equals, gt, gte, lt, lte, in, not_in, contains, not_contains, matches, exists, expr, between
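
As a rough mental model of how a `check` entry relates to the flat state dict, the sketch below evaluates a handful of the comparison operators. It is illustrative only and not the library's evaluator: `OPERATORS` and `evaluate_check` are hypothetical names, only six of the 14 operators are covered, and their semantics are assumed from their names.

```python
import operator

# Illustrative subset of the operators; semantics are assumed from the names.
OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def evaluate_check(check, state):
    """Return True if every operator in `check` holds for the named field."""
    value = state[check["field"]]  # raises KeyError if the field is missing
    return all(
        OPERATORS[name](value, expected)
        for name, expected in check.items()
        if name != "field"
    )

state = {"output.pii_detected": False, "output.tone_score": 0.65}

hard_ok = evaluate_check({"field": "output.pii_detected", "equals": False}, state)
soft_ok = evaluate_check({"field": "output.tone_score", "gte": 0.7}, state)
print(hard_ok, soft_ok)  # True False
```

Here the hard invariant holds while the soft one fails, which under the contract above would be the situation that triggers the `fix-tone` recovery strategy.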


Writing Your Own Contract

  1. Identify fields -- Examine your agent's output and list the fields that matter for safety and quality
  2. Map to flat dict -- AgentAssert uses output.field_name as keys (e.g., {"output.safe": True})
  3. Choose constraint type -- Hard for non-negotiable safety (violations halt execution), Soft for quality goals (violations trigger recovery)
  4. Set satisfaction -- p = target compliance rate, delta = tolerance, k = max violations before alert
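
Steps 1 and 2 can be sketched as a small helper that derives the fields you care about and prefixes them with `output.`. Everything here is hypothetical glue code rather than part of `agentassert_abc`: the helper name, the email-only regex (a crude stand-in for a real PII detector), and the externally supplied tone score are assumptions for illustration.

```python
import re

# Crude stand-in for a PII detector: flags anything that looks like an email address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def to_contract_state(response_text, tone_score):
    """Build the flat `output.<field>` dict that a contract check consumes."""
    fields = {
        "pii_detected": bool(EMAIL_RE.search(response_text)),
        "tone_score": tone_score,
    }
    return {f"output.{name}": value for name, value in fields.items()}

state = to_contract_state("You can reach Jane at jane@example.com.", tone_score=0.8)
print(state)  # {'output.pii_detected': True, 'output.tone_score': 0.8}
```

The resulting dict has the same shape as the one passed to `adapter.check(...)` in the Quick Start.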

SPRT Certification

Certify agents for production with 50-80% fewer test sessions using Sequential Probability Ratio Testing:

from agentassert_abc.certification.sprt import SPRTCertifier, SPRTDecision

certifier = SPRTCertifier(p0=0.85, p1=0.95, alpha=0.05, beta=0.10)
for session_passed in session_results:
    result = certifier.update(session_passed)
    if result.decision != SPRTDecision.CONTINUE:
        print(f"Decision: {result.decision.value} after {result.sessions_used} sessions")
        break
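
For readers unfamiliar with the method, the following is a minimal sketch of Wald's classic sequential probability ratio test for pass/fail sessions. It shows the general technique under the same parameters as above and should not be read as the internals of `SPRTCertifier`: the `sprt` function, its string return values, and the choice of Wald's approximate thresholds are assumptions.

```python
import math

def sprt(outcomes, p0=0.85, p1=0.95, alpha=0.05, beta=0.10):
    """Wald's SPRT for H0: pass rate = p0 versus H1: pass rate = p1.

    Returns (decision, sessions_used); decision is "accept", "reject",
    or "continue" when the outcomes run out before a boundary is crossed.
    """
    upper = math.log((1 - beta) / alpha)   # crossing it favors H1 (certify)
    lower = math.log(beta / (1 - alpha))   # crossing it favors H0 (do not certify)
    llr = 0.0                              # cumulative log-likelihood ratio
    n = 0
    for n, passed in enumerate(outcomes, start=1):
        if passed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept", n
        if llr <= lower:
            return "reject", n
    return "continue", n

print(sprt([True] * 100))
print(sprt([False] * 10))
```

With these parameters a passing session moves the ratio up only slightly while a failing one moves it down sharply, so an unbroken run of passes crosses the upper boundary at session 26 and three consecutive failures cross the lower one. Stopping as soon as either boundary is reached is where the saving in test sessions comes from.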

Compositional Guarantees

Prove safety bounds for multi-agent pipelines:

from agentassert_abc.certification.composition import compose_guarantees

# Agent A (p=0.95) -> Agent B (p=0.98), handoff reliability 0.99
bound = compose_guarantees(p_a=0.95, p_b=0.98, p_h=0.99)
print(f"Pipeline bound: {bound:.3f}")  # p_{A+B} >= 0.921
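
The arithmetic behind that comment can be checked by hand. The sketch below assumes the pipeline bound is the plain product of the two agent reliabilities and the handoff reliability, which treats the three events as independent. That product form is an assumption made here for illustration (the paper states the actual theorem), and `product_bound` is a hypothetical helper, not a function in `agentassert_abc`.

```python
def product_bound(p_a, p_b, p_h):
    """Assumed lower bound: agent A, the handoff, and agent B must all succeed."""
    return p_a * p_b * p_h

bound = product_bound(p_a=0.95, p_b=0.98, p_h=0.99)
print(f"{bound:.5f}")  # 0.92169
```

Under that assumption the value is about 0.9217, consistent with the 0.921 lower bound quoted above, and it shows how each additional stage or handoff lowers the guarantee for the pipeline as a whole.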

How AgentAssert Differs

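| Dimension | AgentAssert | Guardrails AI | NeMo Guardrails | Microsoft AGT |
|-----------|-------------|---------------|-----------------|---------------|
| Formal math (Theta, SPRT) | Yes | No | No | No |
| Session drift detection (JSD) | Yes | No | No | No |
| Compositional safety proofs | Yes | No | No | No |
| Hard/Soft constraint separation | Yes | Partial | No | No |
| Recovery re-prompting | Yes | Yes | Yes | No |
| Framework integrations | 10 adapters | 3 | 1 (LangChain) | 2 |
| Statistical certification (SPRT) | Yes | No | No | No |
| Benchmark suite | 293 scenarios | No | No | No |
| Academic paper | arXiv:2602.22302 | No | No | No |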

Examples

See examples/ for runnable demos:

| Example | What It Shows |
|---------|---------------|
| 01_basic_monitoring.py | Simplest usage -- load, monitor, get Theta |
| 02_ecommerce_session.py | Full e-commerce session from the paper |
| 03_drift_detection.py | JSD-based behavioral drift over 20 turns |
| 04_sprt_certification.py | SPRT statistical certification |
| 05_langgraph_middleware.py | LangGraph StateGraph integration |
| 06_crewai_integration.py | CrewAI task guardrails |
| 07_composition_pipeline.py | Multi-agent compositional bounds |
| 08_mcp_tool_monitoring.py | MCP tool server monitoring |

Research Paper

"AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents"

The theoretical foundations, formal proofs, and experimental validation are presented in an arXiv preprint covering all 6 pillars of the framework, with full mathematical treatment of the Reliability Index, drift dynamics, compositional guarantees, and SPRT certification.

Read the paper on arXiv (cs.AI + cs.SE)

Cite This Work

@article{bhardwaj2026agentassert,
  title={AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents},
  author={Bhardwaj, Varun Pratap},
  journal={arXiv preprint arXiv:2602.22302},
  year={2026},
  url={https://arxiv.org/abs/2602.22302}
}

Contributing

Contributions welcome. See CONTRIBUTING.md for setup instructions, coding standards, and submission guidelines.


License

GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE.

For commercial licensing (closed-source, proprietary, or hosted use), see COMMERCIAL-LICENSE.md or contact varun.pratap.bhardwaj@gmail.com.

Copyright (c) 2026 Varun Pratap Bhardwaj / Qualixar.


Part of Qualixar -- AI Agent Reliability Engineering
A research initiative by Varun Pratap Bhardwaj

qualixar.com · varunpratap.com · arXiv:2602.22302 · agentassert.com


⭐ Support This Project

If this project solves a real problem for you, please star the repo — it helps other developers discover Qualixar and signals that the AI agent reliability community is growing. Every star matters.



Part of the Qualixar AI Agent Reliability Platform

Qualixar is building the open-source infrastructure for AI agent reliability engineering. Seven products, seven peer-reviewed papers, one coherent platform. Each tool solves one reliability pillar:

| Product | Purpose | Install | Paper |
|---------|---------|---------|-------|
| SuperLocalMemory | Persistent memory + learning for AI agents | npx superlocalmemory | arXiv:2604.04514 |
| Qualixar OS | Universal agent runtime (13 execution topologies) | npx qualixar-os | arXiv:2604.06392 |
| SLM Mesh | P2P coordination across AI agent sessions | npm i slm-mesh | |
| SLM MCP Hub | Federate 430+ MCP tools through one gateway | pip install slm-mcp-hub | |
| AgentAssay | Token-efficient AI agent testing | pip install agentassay | arXiv:2603.02601 |
| AgentAssert | Behavioral contracts + drift detection | pip install agentassert-abc | arXiv:2602.22302 |
| SkillFortify | Formal verification for AI agent skills | pip install skillfortify | arXiv:2603.00195 |

Zero cloud dependency. Local-first. EU AI Act compliant.

Start here → qualixar.com · All papers on Qualixar HuggingFace