Rhesis: Collaborative Testing for LLM & Agentic Applications

June 25, 2026

Website · Docs · Discord · Changelog

More than just evals.
Collaborative agent testing for teams.

Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.


Core features

Test generation

AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.

Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.

Single-turn & conversation simulation

Single-turn for Q&A validation. Conversation simulation for dialogue flows.

Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.
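As a rough illustration of what a conversation simulation checks, here is a minimal, self-contained sketch of a multi-turn loop with a context-retention assertion. It does not use Penelope or the Rhesis SDK; `run_conversation`, `scripted_user`, and `toy_bot` are hypothetical names, and a real simulator would generate user turns dynamically rather than from a fixed script.

```python
from typing import Callable, Optional

Turn = tuple[str, str]  # (role, text)


def run_conversation(
    target: Callable[[str, list[Turn]], str],
    simulated_user: Callable[[list[Turn]], Optional[str]],
    max_turns: int = 5,
) -> list[Turn]:
    """Alternate simulated-user and target turns until the user stops or max_turns is hit."""
    history: list[Turn] = []
    for _ in range(max_turns):
        user_msg = simulated_user(history)
        if user_msg is None:  # the simulated user has nothing more to say
            break
        reply = target(user_msg, history)
        history += [("user", user_msg), ("assistant", reply)]
    return history


def scripted_user(history: list[Turn]) -> Optional[str]:
    """Hypothetical stand-in for a simulation agent: plays a fixed two-turn script."""
    script = ["My name is Ada.", "What is my name?"]
    turn_index = len(history) // 2  # each completed turn adds two entries
    return script[turn_index] if turn_index < len(script) else None


def toy_bot(message: str, history: list[Turn]) -> str:
    """Hypothetical application under test: recalls the user's name from history."""
    if "what is my name" in message.lower():
        for role, text in history:
            if role == "user" and text.startswith("My name is "):
                name = text[len("My name is "):].rstrip(".")
                return f"Your name is {name}."
        return "I don't know your name."
    return "Nice to meet you."


transcript = run_conversation(toy_bot, scripted_user)
retained = "Ada" in transcript[-1][1]  # context-retention check on the final reply
print(retained)
```

The check at the end is deliberately simple string matching; the metrics described below would replace it with an evaluated score.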

Adversarial testing (red-teaming)

Polyphemus Agent proactively finds vulnerabilities:

  • Jailbreak attempts and prompt injection
  • PII leakage and data extraction
  • Harmful content generation
  • Role violation and instruction bypassing

Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.

60+ pre-built metrics

| Framework | Example Metrics |
| --- | --- |
| RAGAS | Context relevance, faithfulness, answer accuracy |
| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |

All metrics include LLM-as-Judge reasoning explanations.
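To make the LLM-as-Judge idea concrete, the sketch below shows the general shape of a numeric judge: a judge callable returns a score plus a reasoning string, and a threshold turns the score into pass or fail. This is not the Rhesis `NumericJudge` API; `numeric_judge`, `JudgeResult`, and `stub_judge` are hypothetical, and the stub replaces a real LLM call with a keyword check so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeResult:
    score: float   # numeric score returned by the judge, expected in [0, 1]
    reason: str    # the judge's explanation for the score
    passed: bool   # True when the score meets the threshold


def numeric_judge(
    output: str,
    judge: Callable[[str], tuple[float, str]],
    threshold: float = 0.7,
) -> JudgeResult:
    """Score one output with a judge callable and apply a pass threshold."""
    score, reason = judge(output)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return JudgeResult(score=score, reason=reason, passed=score >= threshold)


def stub_judge(output: str) -> tuple[float, str]:
    """Stand-in for an LLM call: penalizes outputs that read like a diagnosis."""
    if "you have" in output.lower():
        return 0.1, "Output states a diagnosis."
    return 0.9, "Output avoids diagnosing and defers to a professional."


result = numeric_judge("Please consult a doctor about these symptoms.", stub_judge)
print(result.passed, result.reason)
```

Keeping the reasoning string next to the score is what makes a failed test reviewable by a domain expert rather than just a number.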

Traces & observability

Monitor your LLM applications with OpenTelemetry-based tracing:

from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Replace this placeholder with your actual LLM call
    response = f"(model output for: {prompt})"
    return response

Track LLM calls, latency, token usage, and link traces to test results for debugging.
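For readers curious what a tracing decorator of this kind does conceptually, here is a standard-library sketch that records one span-like dictionary per call, including latency and error status. It is not how `@observe.llm` is implemented; `traced` and `SPANS` are hypothetical, and a real exporter would batch and send spans instead of appending to a list.

```python
import functools
import time
from typing import Any, Callable

SPANS: list[dict[str, Any]] = []  # in-memory stand-in for a span exporter


def traced(kind: str, **attributes: Any) -> Callable:
    """Hypothetical decorator factory: records one span dict per function call."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both success and failure, so every call yields a span.
                SPANS.append({
                    "name": func.__name__,
                    "kind": kind,
                    "status": status,
                    "duration_s": time.perf_counter() - start,
                    **attributes,
                })
        return wrapper
    return decorator


@traced("llm", model="gpt-4")
def generate_response(prompt: str) -> str:
    return f"echo: {prompt}"  # placeholder for a real LLM call


generate_response("hello")
print(SPANS[-1]["name"], SPANS[-1]["status"])
```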

Bring your own model

Use any LLM provider for test generation and evaluation. Provider routing is powered by LiteLLM under the hood, giving you a single interface to 100+ models:

Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

Local/Self-hosted: Ollama, vLLM, LiteLLM

See Model Configuration Docs for setup instructions.


Why Rhesis?

Platform for teams. SDK for developers.

Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.

The testing lifecycle

Six integrated phases from project setup to team collaboration:

| Phase | What You Do |
| --- | --- |
| 1. Projects | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
| 2. Requirements | Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams |
| 3. Metrics | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
| 4. Tests | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage |
| 5. Execution | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
| 6. Collaboration | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |
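The Execution phase mentions CI/CD integration. One common pattern, sketched here without any Rhesis API, is a pass-rate gate: the pipeline fails when too few tests pass. `pass_rate`, `gate`, and the sample outcomes are hypothetical; a real pipeline would load pass/fail outcomes from a test run.

```python
from typing import Iterable


def pass_rate(results: Iterable[bool]) -> float:
    """Fraction of passing results; refuses to grade an empty run."""
    outcomes = list(results)
    if not outcomes:
        raise ValueError("no test results to evaluate")
    return sum(outcomes) / len(outcomes)


def gate(results: Iterable[bool], min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 if the pass rate meets the bar, else 1."""
    rate = pass_rate(results)
    print(f"pass rate: {rate:.0%} (required: {min_pass_rate:.0%})")
    return 0 if rate >= min_pass_rate else 1


# Hypothetical outcomes for four test cases, one of which failed.
outcomes = [True, True, True, False]
exit_code = gate(outcomes, min_pass_rate=0.9)
# In a CI job, finish with: sys.exit(exit_code)
```

Because LLM outputs are non-deterministic, a threshold below 100% is often more practical than requiring every test to pass on every run.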

Rhesis vs...

| Instead of... | Rhesis gives you... |
| --- | --- |
| Manual testing | AI-generated test cases based on your context, hundreds in minutes |
| Traditional test frameworks | Non-deterministic output handling built-in |
| LLM observability tools | Pre-production validation, not post-production monitoring |
| Red-teaming services | Continuous, self-service adversarial testing, not one-time audits |

What you can test

| Use Case | What Rhesis Tests |
| --- | --- |
| Conversational AI | Conversation simulation, role adherence, knowledge retention |
| RAG Systems | Context relevance, faithfulness, hallucination detection |
| NL-to-SQL / NL-to-Code | Query accuracy, syntax validation, edge case handling |
| Agentic Systems | Tool selection, goal achievement, multi-agent coordination |

SDK: Code-first testing

Test your Python functions directly with the @endpoint decorator:

from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    # Replace this placeholder with your actual LLM logic
    response = f"(model output for: {message})"
    return response

Features: Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).

Generate tests programmatically:

from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)

Deployment options

| Option | Best For | Setup Time |
| --- | --- | --- |
| Rhesis Cloud | Teams wanting managed deployment | Instant |
| Docker | Local development and testing | 5 minutes |
| Kubernetes | Production self-hosting | See docs |

Quick Start

Option 1: Cloud (fastest) - app.rhesis.ai - Managed service, just connect your app

Option 2: Self-host with Docker

git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start

./rh start pulls prebuilt images from GHCR. To build images from the repo instead, use ./rh start --build (and ./rh restart --build after local Dockerfile changes).

Access: Frontend at localhost:3000, API at localhost:8080/docs

Commands: ./rh logs · ./rh stop · ./rh restart · ./rh delete

Note: This setup enables auto-login for local testing. For production, see Self-hosting Documentation.
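After `./rh start`, the containers may take a moment to come up. The following sketch polls the API docs URL from the Access line until it answers. It uses only the standard library; `http_ok` and `wait_until_ready` are hypothetical helpers, and the attempt count and delay are arbitrary defaults rather than documented values.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses: refused, timed out, 4xx/5xx.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay_s: float = 2.0,
) -> bool:
    """Poll the URL until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay_s)
    return False


# Example (requires the local stack to be running):
# wait_until_ready("http://localhost:8080/docs")
```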

Option 3: Python SDK

pip install rhesis-sdk

Integrations

Rhesis integrates with your LLM stack across four layers, each addressing a different concern:

| Layer | What it covers |
| --- | --- |
| LLM providers | The model that runs your test generation and LLM-as-Judge evaluation |
| Tracing | Streaming spans from your application to Rhesis over OpenTelemetry |
| Test execution | Letting Rhesis invoke entry points in your application remotely to run test cases |
| REST API | Programmatic access to test sets, runs, and platform resources |

LLM providers (test generation & judges)

Choose any provider for the LLMs that drive test synthesis and LLM-as-Judge evaluation. Provider routing is powered by LiteLLM, giving you a single interface to 100+ models.

| Integration | Languages | Description |
| --- | --- | --- |
| OpenAI | Python | OpenAI supported models and embeddings. |
| Anthropic | Python | Native support for Claude models. |
| Google Gemini | Python | Native integration for Google's Gemini models. |
| Vertex AI | Python | Google Cloud Vertex AI model support. |
| Ollama | Python | Local LLM deployment with Ollama integration. |
| OpenRouter | Python | Access to multiple LLM providers through OpenRouter. |
| HuggingFace | Python | Direct integration with HuggingFace models. |
| LiteLLM | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). |

Tracing your application

Your application emits spans through the Rhesis SDK; spans are batched and sent to Rhesis over HTTP using OpenTelemetry span conventions. The integration mechanism depends on the framework — auto-instrumented frameworks need no code changes, while others use the @observe.llm decorator to mark the boundaries you want traced.

| Integration | Languages | Mechanism | Description |
| --- | --- | --- | --- |
| Rhesis SDK | Python, JS/TS | Decorators | Native SDK with @observe.llm and convenience variants (@observe.tool, @observe.retrieval, @observe.embedding, …). Wrap any function you want traced. |
| LangChain | Python | ✅ Automatic | Add the Rhesis callback handler once and every chain step, tool call, and LLM call is traced automatically, with no per-function decorators required. |
| LangGraph | Python | ✅ Automatic | Built-in integration for LangGraph agent workflows with full observability: every node transition, tool invocation, and graph step is captured automatically. |
| Microsoft Agent Framework | Python | ✅ Automatic | One-line auto_instrument("agent_framework") traces every agent, model call, tool, and handoff in MAF ChatAgent and HandoffBuilder workflows - no per-function decorators. See Microsoft Agent Framework tracing. |
| OpenTelemetry / OpenInference | Python | ✅ Automatic via OTel | Any framework with an OpenInference instrumentor (LlamaIndex, CrewAI, OpenAI Agents SDK, Google ADK, Pydantic AI, DSPy, Haystack, Semantic Kernel) exports to Rhesis through the SDK's OTel-based exporter. See Tracing setup docs for exact endpoint and header configuration. |
| AutoGen, OpenAI Agents SDK, LlamaIndex, CrewAI, and others | Python | Decorators | Wrap the functions, tools, or agents you want to trace with @observe.llm. Without decorators, only top-level inputs and outputs are captured. |

Test execution: the connector

For Rhesis to run test cases against your application, it needs a way to call your code from outside your environment. The Rhesis SDK provides a persistent outbound WebSocket connection — your application opens it at startup and Rhesis can then invoke registered entry points whenever a test run fires. The connection is outbound from your app, so it works through firewalls and from local laptops without exposing a public URL.

You register an entry point with the @endpoint decorator (see SDK: Code-first testing). When a test run starts, Rhesis sends each test case's input down the WebSocket; your application runs the function locally and sends the output back up the same connection. The same call path serves single-turn test cases and multi-turn conversations driven by Penelope (our multi-turn conversation runner).

Both channels run from the same SDK in the same process: spans flow up the HTTP/OTLP channel; test commands flow down the WebSocket. Production traffic and test traffic produce traces in the same format, so the same evaluation metrics grade both.
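The command path described above (register an entry point, receive a test input, run the function locally, send the output back) can be sketched as a small dispatcher. This leaves out the WebSocket transport entirely, and the JSON message shape (`id`, `endpoint`, `input`, `output`, `error`) is an assumption made for illustration, not the Rhesis wire format; `register`, `REGISTRY`, and `handle_message` are hypothetical names.

```python
import json
from typing import Any, Callable

REGISTRY: dict[str, Callable[..., Any]] = {}  # entry-point name -> function


def register(name: str) -> Callable:
    """Hypothetical analogue of an endpoint decorator: stores the function by name."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = func
        return func
    return decorator


def handle_message(raw: str) -> str:
    """Run the requested entry point locally and return a JSON reply."""
    msg = json.loads(raw)
    func = REGISTRY.get(msg["endpoint"])
    if func is None:
        reply = {"id": msg["id"], "error": f"unknown endpoint: {msg['endpoint']}"}
    else:
        try:
            # Bind the test case's input fields to the function's parameters.
            reply = {"id": msg["id"], "output": func(**msg["input"])}
        except Exception as exc:
            reply = {"id": msg["id"], "error": str(exc)}
    return json.dumps(reply)


@register("my-chatbot")
def chat(message: str) -> str:
    return f"You said: {message}"  # placeholder for real application logic


request = json.dumps({"id": "t1", "endpoint": "my-chatbot", "input": {"message": "hi"}})
print(handle_message(request))
```

Echoing the request `id` in every reply lets the caller match outputs to test cases when several are in flight on one connection.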

REST API

Direct API access for custom integrations and CI/CD pipelines: manage test sets, trigger test runs, fetch results, and inspect traces programmatically. Language-agnostic — call from Python, TypeScript, Go, shell scripts, or anywhere else. OpenAPI spec available.

See Integration Docs for setup instructions.


Open source

MIT licensed. No plans to relicense core features. The enterprise version will live in ee/ folders and remain separate.

We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.


Contributing

See CONTRIBUTING.md for guidelines.

Ways to contribute: Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions


Support


Security & privacy

We take data security seriously. See our Privacy Policy for details.

Telemetry: Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.

  • Self-hosted: Opt out by setting OTEL_RHESIS_TELEMETRY_ENABLED=false
  • Cloud: Telemetry enabled as part of Terms & Conditions

Made in Potsdam, Germany 🇩🇪

Learn more at rhesis.ai