LLM & Agent Mocking
June 26, 2026 · View on GitHub
Overview
MockServer provides first-class LLM mocking through a new action type httpLlmResponse that produces provider-correct responses from a high-level, provider-neutral Completion abstraction. The feature spans codec encoding, streaming physics, conversation-aware matching, session isolation, MCP tool exposure, and dashboard rendering.
Action Type
httpLlmResponse is a peer to httpResponse, httpSseResponse, etc. It lives on Expectation as a separate field and dispatches through HttpLlmResponseActionHandler.
flowchart LR
REQ([Incoming POST]) --> MATCH[RequestMatchers]
MATCH --> EXP["Expectation\nwith httpLlmResponse"]
EXP --> HANDLER["HttpLlmResponseActionHandler"]
HANDLER -->|non-streaming| CODEC["ProviderCodec.encode()"]
HANDLER -->|streaming| EXPAND["StreamingPhysicsExpander"]
EXPAND --> SSE_CODEC["ProviderCodec.encodeStreaming()"]
SSE_CODEC --> SSE_HANDLER["HttpSseResponseActionHandler"]
CODEC --> WIRE[HTTP Response]
SSE_HANDLER --> WIRE
Codec Registry
ProviderCodecRegistry is a singleton that maps Provider enum values to ProviderCodec implementations. Each codec exposes:
encode(Completion, model)-- non-streaming responseencodeStreaming(Completion, model, StreamingPhysics)-- SSE event listencodeEmbedding(EmbeddingResponse, input)/encodeEmbedding(EmbeddingResponse, input, model)-- embeddings (the model-aware overload lets Bedrock pick Titan vs Cohere; most codecs ignore the model and inherit the two-arg default)encodeRerank(RerankResponse, documents)-- rerank results (Cohere/Voyage)decode(HttpRequest)-- parse inbound request toParsedConversation(for conversation matchers)
All embedding codecs share EmbeddingVectors (deterministic-from-input or random, then L2-normalised); only the JSON envelope differs per provider. Embedding shapes: OpenAI/Azure {"object":"list","data":[{"embedding":[...]}]} (default 1536 dims); Gemini {"embedding":{"values":[...]}} (768); Ollama {"embeddings":[[...]]} — the /api/embed shape (768); Bedrock Titan {"embedding":[...],"inputTextTokenCount":N} or Bedrock Cohere {"embeddings":[[...]]} when the model id starts with cohere (1024). ANTHROPIC and OPENAI_RESPONSES have no embeddings endpoint and throw (surfaced as a 400 by the handler). Rerank shares RerankScoring (per-document relevance scores — reproducible when deterministicFromInput is set, else random — descending, capped to topN) and emits the provider-correct envelope via a RerankScoring.Envelope selector: Cohere {"results":[{"index":N,"relevance_score":F}, ...]}, Voyage {"object":"list","data":[...],"usage":{"total_tokens":N}}.
Currently registered codecs:
| Provider | Codec class | Status |
|---|---|---|
| ANTHROPIC | AnthropicCodec | Complete (no embeddings endpoint) |
| OPENAI | OpenAiChatCompletionsCodec | Complete (chat + embeddings) |
| OPENAI_RESPONSES | OpenAiResponsesCodec | Complete (no embeddings endpoint) |
| GEMINI | GeminiCodec | Complete (chat + embeddings) |
| BEDROCK | BedrockCodec | Complete (delegates chat to AnthropicCodec; streaming uses application/vnd.amazon.eventstream binary framing via BedrockEventStreamEncoder; SigV4 signing is a follow-up; embeddings = Titan default / Cohere by model) |
| AZURE_OPENAI | AzureOpenAiCodec | Complete (delegates to OpenAiChatCompletionsCodec) |
| OLLAMA | OllamaCodec | Complete (chat + embeddings; see security audit for NDJSON wire-format limitation) |
| COHERE | CohereCodec | Rerank only (/v1/rerank) |
| VOYAGE | VoyageCodec | Rerank only (/v1/rerank) |
Streaming Physics
StreamingPhysicsExpander converts a Completion + StreamingPhysics configuration into a List<SseEvent> with pre-computed per-event delays.
Parameters:
timeToFirstToken-- delay before the first SSE eventtokensPerSecond-- base rate (1-10000)jitter-- fractional uniform deviation (0.0-1.0)seed-- PRNG seed for reproducible timing
The expanded events are handed to HttpSseResponseActionHandler which already honours per-event delays.
Conversation Matchers
LlmConversationMatcher evaluates predicates against a ParsedConversation decoded from the inbound request body:
whenTurnIndex(n)-- assistant turn countwhenLatestMessageContains(text)-- substring match on last messagewhenLatestMessageMatches(pattern)-- regex match on last messagewhenLatestMessageRole(role)-- role of last messagewhenContainsToolResultFor(toolName)-- tool result presencewithNormalization(options)-- opt-in prompt normalisation applied before thecontains/matchestext predicates (see below)
Predicates are stored as ConversationPredicates on HttpLlmResponse for JSON round-tripping. The matcher is lazily reconstructed from predicates after deserialisation.
Multimodal (image) recognition
The decoders recognise image content parts on the request side so a mocked request can be matched on image presence. Each ParsedMessage exposes hasImage(), imageCount(), and getImages() (a list of ImagePart, each carrying the declared media type where the provider shape includes it):
| Provider | Image shape recognised |
|---|---|
| OPENAI / AZURE_OPENAI | image_url content part (media type parsed from a data: URL; null for a remote https URL) |
| ANTHROPIC / BEDROCK | image block with a source.media_type |
| GEMINI | inline_data / inlineData part with mime_type / mimeType |
This is request-side recognition only — MockServer notes that a message contains an image (and how many, and the media type) so conversation matchers can assert it; it does not store the image bytes or generate image responses.
Multimodal (audio) recognition
The OpenAI decoder also recognises audio content parts on the request side, mirroring the image handling. Each ParsedMessage exposes hasAudio(), audioCount(), and getAudio() (a list of AudioPart, each carrying the declared format where the provider shape includes it):
| Provider | Audio shape recognised |
|---|---|
| OPENAI / AZURE_OPENAI | input_audio content part (format read from input_audio.format, e.g. wav/mp3; null when absent) |
Like image recognition, this is request-side only — MockServer notes that a message contains audio (and how many, and the declared format); it does not store the audio bytes.
Normalised prompt matching
Agent prompts are dynamically assembled, so exact-byte matching is brittle. NormalizationOptions (carried on ConversationPredicates) applies a deterministic transform to the latest-message text before the text predicates run, via PromptNormalizer.normalize(text, options):
collapseWhitespace(default on) -- collapse runs of whitespace to a single space and trimlowercase(default off) -- lowercase the textsortJsonKeys(default on) -- when the prompt is JSON, sort object keys so key ordering is irrelevantdropBuiltInVolatileFields(default off) -- strip ISO-8601 timestamps, UUIDs, andprefix_…ids (req_,msg_,call_, …)dropVolatileFields-- names of JSON fields to drop before matching
For latestMessageContains, both the subject text and the expected substring are normalised; for latestMessageMatches, only the subject is normalised (normalising the regex source would corrupt the pattern). Normalisation applies only to the latest-message text — the containsToolResultFor tool name, turnIndex, and latestMessageRole are matched exactly as specified. Boolean options are nullable: an unset flag uses its default (collapseWhitespace and sortJsonKeys on; lowercase and dropBuiltInVolatileFields off), resolved identically whether the options arrive via the REST API or the MCP tool. Normalisation is idempotent and pure — it never makes a test flaky — and is a modifier, not a predicate: it does not count toward hasAnyPredicate() and has no effect unless a text predicate is also set.
Semantic prompt matching (opt-in, exploratory)
The semanticMatch predicate (ConversationPredicates.semanticMatchAgainst) matches when the latest message expresses a given intent, judged by a runtime LLM. It is deliberately quarantined from the deterministic path:
- Off by default.
SemanticMatchingis a static gate that is onlyinstalled (at server start) whenmockserver.llmSemanticMatchingEnabledis set and a backend resolves viaLlmBackendResolver. Until thenisEnabled()is false andLlmConversationMatcherignores the predicate (logs once, deterministic fallback) — so default behaviour is unchanged. - Fail-closed when active.
SemanticPromptMatcherasks the LLM (via the Phase-2LlmCompletionService,temperature=0, cached) a strict yes/no judge question; a non-affirmative, empty, or errored answer does not match. - Never for assertions. It is non-deterministic by construction (a live model) and documented as exploratory only.
Session Isolation
IsolationSource describes where to extract the isolation key from an inbound request (header, query parameter, or cookie). The key is encoded into the scenario name:
__llm_conv_<uuid>__iso=header:x-session-id
ScenarioManager uses composite keys (scenarioName, isolationValue) to maintain independent state per session.
Conversation Builder
LlmConversationBuilder produces an array of Expectation objects, one per turn, with:
- Auto-generated scenario name (with optional isolation suffix)
- State progression:
Started->turn_1->turn_2-> ... ->__done ConversationPredicateson eachHttpLlmResponse
The class relationships between the builder, predicates, matcher, and isolation model:
classDiagram
class LlmConversationBuilder {
+withPath(path)
+withProvider(provider)
+withModel(model)
+isolateBy(IsolationSource)
+turn() TurnBuilder
+build() Expectation[]
}
class TurnBuilder {
+whenTurnIndex(int)
+whenLatestMessageContains(String)
+whenLatestMessageMatches(Pattern)
+whenLatestMessageRole(Role)
+whenContainsToolResultFor(String)
+whenSemanticMatch(String)
+withNormalization(NormalizationOptions)
+respondingWith(Completion)
+andThen() TurnBuilder
}
class ConversationPredicates {
+turnIndex: Integer
+latestMessageContains: String
+latestMessageMatches: String
+latestMessageRole: Role
+containsToolResultFor: String
+semanticMatchAgainst: String
+normalization: NormalizationOptions
}
class LlmConversationMatcher {
+matches(HttpRequest) boolean
}
class IsolationSource {
+kind: Kind
+name: String
+header(String)
+queryParameter(String)
+cookie(String)
}
class LlmScenarioNames {
+ISOLATION_MARKER: String
+generate(IsolationSource) String
}
class ScenarioManager {
+getState(scenarioName, isolationValue)
+setState(scenarioName, isolationValue, state)
}
LlmConversationBuilder --> TurnBuilder : creates
LlmConversationBuilder --> IsolationSource : uses
LlmConversationBuilder --> LlmScenarioNames : uses
TurnBuilder --> ConversationPredicates : produces
LlmConversationMatcher --> ConversationPredicates : evaluates
LlmScenarioNames --> IsolationSource : encodes
ScenarioManager --> LlmScenarioNames : keyed by
MCP Tools
Two MCP tools expose the LLM mocking feature to agents:
| Tool | Description |
|---|---|
mock_llm_completion | Creates a single LLM expectation from provider, path, text, tool calls, usage, and an optional outputSchema (response-path structured-output validation) |
create_llm_conversation | Creates a multi-turn conversation with scenario state chain, optional isolation, and an optional per-turn match.normalization object |
verify_tool_call | Asserts an agent called a named tool atLeast/atMost times (optional args regex), by decoding recorded LLM requests. Supports provider=AUTO to auto-detect from request paths |
explain_agent_run | Summarises a recorded agent run: turn/tool-call sequence, tool results, latest role. Supports provider=AUTO to auto-detect from request paths |
verify_structured_output | Validates the JSON output text of recorded LLM responses against a JSON Schema (decodes each response via the runtime-LLM client SPI, then JsonSchemaValidator); reports per-response conformance |
verify_cost_budget | Sums input/output tokens from recorded responses' usage, prices them via org.mockserver.llm.cost.LlmPricing, and asserts the total USD cost is within maxCostUsd — a deterministic CI cost gate. Unpriceable models are reported and excluded |
mock_llm_failover | Creates a failover/retry scenario: the first N requests fail with specified HTTP statuses, then subsequent requests succeed with a provider-correct LLM response. Uses LlmFailoverBuilder |
The first two validate provider availability against ProviderCodecRegistry at registration time. The analysis tools delegate to org.mockserver.llm.analysis.AgentRunAnalyzer.
Structured-output validation
Structured-output validation against a JSON Schema works on both sides of a mock, both built on JsonSchemaValidator:
- Read side —
verify_structured_output(assertion over recorded traffic): decodes each recorded response for a provider via the runtime-LLM client SPI and checks the assistant's output text against the schema. Read-only and deterministic. - Response side —
Completion.outputSchema(fixture sanity check): a completion may declare the JSON Schema itstextshould conform to (Completion.withOutputSchema(...), theoutputSchemaexpectation-JSON field, or themock_llm_completionMCP param — string or inline object).HttpLlmResponseActionHandler.validateStructuredOutput(...)validates the configured text as the response is encoded. It is fail-soft by default: a mismatch never alters the response body — it adds thex-mockserver-structured-output-invaliddiagnostic header (a single-line, CR/LF-collapsed message; non-streaming only) and logs a warning. A blank schema, absent text, or a malformed schema are all "nothing to check" and can never break the response. This surfaces malformed fixtures while still letting you return a deliberately non-conforming response unchanged.- Strict enforcement (opt-in) —
Completion.enforceOutputSchema: set theenforceOutputSchemaflag (Completion.enforceOutputSchema()/withEnforceOutputSchema(true), theenforceOutputSchemaexpectation-JSON field, or themock_llm_completionMCP boolean param) alongside the schema to switch from fail-soft to strict.HttpLlmResponseActionHandler.enforcementErrorResponseOrNull(...)then fails loudly when the configured body does not conform: it returns a provider-correct error (HTTP502viaLlmErrorBodies, with thex-mockserver-structured-output-invalidheader) instead of the non-conforming body. This models a provider's strictresponse_format: json_schemamode, which guarantees schema-valid output.HttpActionHandlerchecks it before dispatch (after chaos, which takes priority — a transport-level failure independent of the body), so it applies on both the streaming and non-streaming paths and a strict streaming completion with a non-conforming body never begins streaming. The sharedstructuredOutputError(...)helper backs both the fail-soft and strict paths, so a blank/absent-text/malformed schema is a no-op in either mode and can never produce an enforcement error. The flag is back-compatible: unset/falsekeeps the fail-soft behaviour and the flag has no effect without anoutputSchema.
- Strict enforcement (opt-in) —
Approximate token counting and usage inference
TokenCounter (org.mockserver.llm) is a pure, deterministic helper that estimates token counts for text. It is an estimate, not a real tokenizer — it does not implement BPE/SentencePiece or any provider's vocabulary, so its counts differ from a provider's billed counts (roughly within ±20% for ordinary English prose, further off for code or non-Latin text). The heuristic averages two cheap signals — characters ÷ 4 and words × 4/3 — plus a small punctuation-density allowance (BPE tends to split punctuation into its own tokens), clamped to ≥1 for any non-empty text and 0 for null/empty. It exposes estimateTokens(text), estimatePromptTokens(ParsedConversation) (per-message text + a small per-message chat-format overhead, including tool-call args and tool results), and estimateCompletionTokens(text, toolCalls).
When the opt-in mockserver.llmInferUsageEnabled flag is set, HttpLlmResponseActionHandler.withInferredUsageIfEnabled(...) returns a per-request shallow copy of the completion carrying approximate prompt_tokens / completion_tokens for a mocked completion that omits usage, on both the non-streaming and streaming paths, before the codec encodes. The shared expectation Completion is never mutated, so the request-dependent prompt estimate is recomputed every request (no stale caching, no concurrent-write race). The prompt estimate comes from decoding the inbound request with the provider codec (ProviderCodec.decode); the completion estimate from the response text and tool-call arguments. It is off by default so existing responses are unchanged (an absent usage continues to encode as zeros) and a completion that already declares a non-zero usage is never overwritten. Decoding failures are fail-soft (prompt estimate degrades to 0, never an error). This is independent of HttpLlmResponseActionHandler.estimateTokenCount(...), the existing rough character estimate that backs the token-based chaos quota, which is unchanged.
Cached/reasoning token usage and reasoning content encoding
The Usage optional fields and the Completion reasoning fields are encoded by every chat codec onto the response wire — additively, and only when set, so a completion that omits them encodes byte-identically to before (the golden fixtures are unchanged).
Usage token details — emitted only when the corresponding Usage field is non-null and non-zero, under each provider's native key (the same keys the runtime-LLM clients decode, so encode/decode are symmetric):
| Provider(s) | cachedInputTokens key | cacheCreationTokens key | reasoningTokens key |
|---|---|---|---|
| ANTHROPIC / BEDROCK | usage.cache_read_input_tokens | usage.cache_creation_input_tokens | — (no native field) |
| OPENAI / AZURE_OPENAI | usage.prompt_tokens_details.cached_tokens | — | usage.completion_tokens_details.reasoning_tokens |
| OPENAI_RESPONSES | usage.input_tokens_details.cached_tokens | — | usage.output_tokens_details.reasoning_tokens |
| GEMINI | usageMetadata.cachedContentTokenCount | — | usageMetadata.thoughtsTokenCount |
| OLLAMA | — (no native field) | — | — (no native field) |
The Anthropic cache_read/cache_creation keys are mirrored into the streaming message_start usage; the OpenAI Responses details into the response.completed usage; the Gemini counts into the final streaming chunk's usageMetadata. (Bedrock and Azure inherit the Anthropic/OpenAI behaviour by delegation.)
Reasoning ("thinking") content — Completion.reasoningText (plus optional reasoningSignature for Anthropic redaction) encodes a provider-correct reasoning block before the visible text block, on both paths, absent unless set:
| Provider | Non-streaming shape | Streaming events |
|---|---|---|
| ANTHROPIC / BEDROCK | leading {"type":"thinking","thinking":…,"signature":…} content block | content_block_start(thinking) → thinking_delta → optional signature_delta → content_block_stop, at index 0 before the text block |
| OPENAI_RESPONSES | leading {"type":"reasoning","summary":[{"type":"summary_text","text":…}]} output item | response.output_item.added(reasoning) → response.reasoning_summary_part.added → response.reasoning_summary_text.delta/.done → …part.done → output_item.done |
| GEMINI | leading {"text":…,"thought":true} part | a thought chunk (parts:[{text,thought:true}]) before the text chunks |
| OLLAMA | message.thinking sibling string | a leading chunk with message.thinking set |
OpenAI Chat Completions has no reasoning-content representation on the response wire (reasoning is summarised only via the Responses API), so reasoningText is not encoded for OPENAI/AZURE_OPENAI. None of this touches the request matcher — only response encoding.
Adversarial-response harness
AdversarialResponseLibrary (org.mockserver.llm.adversarial) is a curated catalog of hostile/malformed responses an agent might receive from a compromised tool or jailbroken model — prompt injection, jailbreak persona-swaps, data-exfiltration requests, malformed/truncated JSON, an empty response, and an over-long repetition. The mock_adversarial_llm_response MCP tool mocks a chosen payload as the provider-correct LLM response so you can test that your agent resists it. The payloads are short, well-known benign test fixtures (not working exploits) — a defensive testing aid — and generation is deterministic (each id maps to fixed text).
Fault / chaos injection
LlmChaosProfile (org.mockserver.model) attaches a fault profile to any HttpLlmResponse for resilience testing. Applied by HttpLlmResponseActionHandler:
- Probabilistic error —
chaosErrorResponseOrNull(...)returns an errorHttpResponse(errorStatus+ optionalRetry-After) when triggered. AnerrorStatuswith noerrorProbabilityalways fires; a fractional probability draws once (reproducible viaseed).HttpActionHandlerchecks this first and, if present, returns the error on the normal (non-streaming) path — a provider error is a plain HTTP response, not an SSE stream, even for a would-be streaming completion. - Mid-stream truncation —
applyStreamingChaos(...)keeps a leadingtruncateAtFractionof the SSE events (default 0.5) so the stream ends early. - Malformed SSE — appends a deliberately broken-JSON chunk so the client must handle a corrupt event.
- Stateful request quota — a deterministic fixed-window rate limit (
quotaName+quotaLimit+quotaWindowMillis, optionalquotaErrorStatusdefault 429).quotaErrorResponseOrNull(...)(called first insidechaosErrorResponseOrNull) records one request againstorg.mockserver.llm.LlmQuotaRegistryand returns the quota error once the in-window count exceeds the limit. The registry is a process-wide singleton keyed byquotaName(expectations sharing a name share one counter — model an upstream account limit), thread-safe viaConcurrentHashMap, with an injectable clock for testing, cleared onHttpState.reset(). A misconfigured/partial quota fails open (never rate-limits).
Truncation, malformed-SSE, and the stateful quota are fully deterministic; the probabilistic error path is deterministic at probability 0.0/1.0. Each injection increments the LLM_CHAOS_INJECTED_COUNT metric. The profile round-trips as the top-level chaos field on HttpLlmResponse (alongside completion, embedding, and conversationPredicates) and is exposed per turn in the dashboard wizard and via the chaos MCP parameter.
Provider-specific error bodies
Both chaos error paths (the probabilistic errorStatus and a stateful quota breach) emit the provider-correct JSON error body for the detected provider, so client SDK retry/backoff logic — which parses the body's error.type / error.code — can be exercised faithfully against a mock. LlmErrorBodies (org.mockserver.llm) is a pure, deterministic helper that maps a Provider + a coarse error Kind (derived from the HTTP status) to the body shape. When the provider is null/unknown, the handler falls back to the previous generic body ({"error":{"type":"chaos_injected"|"quota_exceeded"|"token_quota_exceeded",...}}), so behaviour is unchanged for an unspecified provider.
The error Kind is derived from the status: 429 → RATE_LIMIT, 529 → OVERLOADED, any other status → SERVER_ERROR.
| Provider | Body shape (by error kind) |
|---|---|
| ANTHROPIC / BEDROCK | {"type":"error","error":{"type":"overloaded_error"|"rate_limit_error"|"api_error","message":...}} — Bedrock delivers the Anthropic body unchanged |
| OPENAI / OPENAI_RESPONSES / AZURE_OPENAI | {"error":{"message":...,"type":"rate_limit_exceeded"|"server_error","param":null,"code":...}} — code is the numeric status for server_error, the string "rate_limit_exceeded" for a 429 |
| GEMINI | {"error":{"code":<status>,"message":...,"status":"UNAVAILABLE"|"RESOURCE_EXHAUSTED"|"INTERNAL"}} |
| OLLAMA | {"error":"<message>"} (a plain message string) |
The Retry-After and provider-specific rate-limit headers (see below) are still applied on top of the body by the same code path, so a 429/529 carries both the correct body and the correct headers.
Token-based quota (TPM/TPD)
Real LLM providers (OpenAI, Anthropic) enforce token-per-minute (TPM) and token-per-day (TPD) limits in addition to request-count limits. MockServer models this with two additional LlmChaosProfile fields: tokenQuotaLimit (Long, >= 1) and tokenQuotaWindowMillis (Long, >= 1). When both are set alongside quotaName, each response charges its cumulative token count (from Usage.inputTokens + outputTokens, or ceil(text.length()/4) as a fallback when no Usage is present) against a separate fixed-window counter in LlmQuotaRegistry under the key quotaName + ":tokens". Once the in-window token sum exceeds tokenQuotaLimit, the response path returns a 429 (or custom quotaErrorStatus) with error type token_quota_exceeded and the Retry-After header when set. The request-count quota and token quota are independent counters that can coexist on the same profile; the request-count quota is checked first. Embeddings contribute zero tokens. The LlmQuotaRegistry.tryAcquire(name, limit, windowMillis, amount) overload supports arbitrary increment amounts for this purpose.
Provider rate-limit headers
When an LLM response path returns a rate-limit / quota error (probabilistic errorStatus or stateful quota 429), MockServer emits the provider-correct rate-limit HTTP headers that real LLM providers send, so client SDK retry/backoff logic (which reads those headers) can be exercised faithfully against a mock. The same headers are stamped on successful responses when a quota is configured, so a client can observe the limit counting down.
LlmRateLimitHeaders (org.mockserver.llm) is a pure, deterministic helper that maps a Provider + quota parameters to the provider-specific header names and values. The standard Retry-After header is generic HTTP (not provider-specific), so it is owned by HttpLlmResponseActionHandler.applyRateLimitHeaders(...) — emitted once for every provider on a 429 — rather than by the helper, so it can never appear twice on the wire.
| Provider | Provider-specific headers on error (429) | Headers on success (with quota) | Retry-After on 429 |
|---|---|---|---|
| OPENAI / OPENAI_RESPONSES / AZURE_OPENAI | x-ratelimit-limit-requests, x-ratelimit-remaining-requests, x-ratelimit-reset-requests (e.g. "60s") | x-ratelimit-limit-requests, x-ratelimit-reset-requests | yes (seconds) |
| ANTHROPIC | anthropic-ratelimit-requests-limit, anthropic-ratelimit-requests-remaining, anthropic-ratelimit-requests-reset (RFC 3339 timestamp) | anthropic-ratelimit-requests-limit, anthropic-ratelimit-requests-reset | yes (seconds) |
| GEMINI | (none) | (none) | yes (seconds) |
| BEDROCK | (none) | (none) | yes (seconds) |
| OLLAMA | (none) | (none) | yes, when a quota window or literal retryAfter is set |
Header values are derived from the LlmChaosProfile fields: quotaLimit becomes the limit header; the reset duration is quotaWindowMillis / 1000 (falling back to tokenQuotaWindowMillis / 1000 for a token-only quota, then to a numeric retryAfter), so a token-quota-only 429 still carries reset/Retry-After headers; remaining is 0 on a 429 (omitted on success since the registry window count is not cheaply re-readable). On a 429 Retry-After is the literal configured retryAfter (which may be an HTTP-date) when set, otherwise the computed reset seconds. Applied at three sites: the probabilistic chaos error, the quota-exceeded error, and the successful non-streaming response when a quota is configured.
Agent-run analysis
AgentRunAnalyzer (org.mockserver.llm.analysis) is a deterministic, read-only inspector. Given the LLM requests MockServer recorded (retrieved via the normal request log), it decodes each with the provider's ProviderCodec and treats the richest conversation (most messages — the latest dialogue snapshot) as the canonical run. From that it derives:
-
inspectToolCalls(requests, provider, toolName, argsRegex)→ count + matched tool calls (powersverify_tool_call). -
summarise(requests, provider)→ message count, assistant-turn count, ordered tool-call name sequence, tool-result keys, latest message role (powersexplain_agent_run). -
buildCallGraph(requests, provider)→ aCallGraphof nodes (one per message, one per assistant tool call) and directed edges:NEXT(message sequence),INVOKES(assistant turn → the tool calls it made),RESULT(tool call → the tool message that returned its result, correlated by tool-call id). Powers the dashboard call-graph view.
No LLM is called and no network is used — it reads the structure the codecs already produce, so assertions are reproducible. The MCP tools are thin wrappers that retrieve recorded requests (/mockserver/retrieve?type=REQUESTS) and format the analyzer's output; explain_agent_run includes the callGraph (nodes + edges). The dashboard Sessions view (SessionInspector → AgentRunGraph.tsx, with the pure transform mockserver-ui/src/lib/callGraph.ts) loads the graph per session via explain_agent_run and renders it as a step list (role + invoked tool calls + result indicator) plus a copyable Mermaid flowchart.
Proxied/forwarded traffic support
Agent-run analysis works identically for proxied/forwarded traffic. Every incoming request — whether it matches a mock expectation or is forwarded to an upstream provider — is recorded as a RECEIVED_REQUEST log entry with the full request body. The type=REQUESTS retrieval returns these entries, and AgentRunAnalyzer decodes them with the appropriate ProviderCodec.
Provider auto-detection. The verify_tool_call and explain_agent_run MCP tools accept "AUTO" as the provider parameter. ProviderDetector (org.mockserver.llm.ProviderDetector) infers the provider from recorded request paths (e.g. /v1/messages maps to ANTHROPIC, /v1/chat/completions to OPENAI), mirroring the UI-side detection in llmTraffic.ts. This is especially useful for proxy users who route real LLM calls through MockServer and may not know or want to specify the provider explicitly.
Dashboard Sessions view. The Sessions view groups proxied LLM traffic by upstream host (from the Host header) when no conversation-isolation expectations are configured, so proxy traffic to different providers appears in separate session lanes. The call graph (via AgentRunGraph) is shown for all sessions including these host-grouped proxy sessions.
Capturing coding-assistant CLIs. A common use is to point a coding assistant at MockServer as an HTTPS proxy and watch its model calls in the Traffic / LLM Traces / LLM Optimise views. Detection is path-based so it works regardless of the (possibly private) gateway host a tool is configured for:
| CLI | LLM endpoint | Detected as |
|---|---|---|
Claude Code (claude) | api.anthropic.com/v1/messages | ANTHROPIC |
| opencode (Codex backend) | chatgpt.com/backend-api/codex/responses | OPENAI_RESPONSES |
| Tabnine CLI (Gemini-CLI fork) | <gateway>/…/chat/completions | OPENAI |
opencode's OpenAI Codex backend serves the Responses API at the non-standard path /backend-api/codex/responses; LlmProviderSniffer, ProviderDetector, and llmTraffic.ts recognise the /codex/responses path (and the chatgpt.com host on it) the same as the hosted /v1/responses. A local-only smoke harness that drives the real CLIs end-to-end lives in scripts/llm-proxy-capture/; the CI-safe equivalent is CodingCliLlmCaptureTest (mockserver-core) and the llmTraffic.test.ts codex cases (mockserver-ui).
Dashboard Rendering
The expectation panel renders an "LLM Response" badge (with provider, model, and text preview) when httpLlmResponse is present on an expectation.
The ScriptedTurnsPanel component renders the scripted turn sequence for conversation expectations, showing per-turn predicates, responses, and scenario state transitions.
Domain Model
classDiagram
class HttpLlmResponse {
+provider: Provider
+model: String
+completion: Completion
+embedding: EmbeddingResponse
+rerank: RerankResponse
+conversationPredicates: ConversationPredicates
}
class Completion {
+text: String
+toolCalls: List~ToolUse~
+toolChoice: String
+stopReason: String
+usage: Usage
+streaming: Boolean
+streamingPhysics: StreamingPhysics
+outputSchema: String
+enforceOutputSchema: Boolean
+reasoningText: String
+reasoningSignature: String
}
class ToolUse {
+id: String
+name: String
+arguments: String
}
class Usage {
+inputTokens: Integer
+outputTokens: Integer
+cachedInputTokens: Integer
+cacheCreationTokens: Integer
+reasoningTokens: Integer
}
class StreamingPhysics {
+timeToFirstToken: Delay
+tokensPerSecond: Integer
+jitter: Double
+seed: Long
}
class EmbeddingResponse {
+dimensions: Integer
+deterministicFromInput: Boolean
+seed: Integer
}
class RerankResponse {
+topN: Integer
+deterministicFromInput: Boolean
+seed: Long
}
class Provider {
<<enum>>
ANTHROPIC
OPENAI
OPENAI_RESPONSES
GEMINI
BEDROCK
AZURE_OPENAI
OLLAMA
COHERE
VOYAGE
}
class ConversationPredicates {
+turnIndex: Integer
+latestMessageContains: String
+latestMessageMatches: String
+latestMessageRole: Role
+containsToolResultFor: String
+semanticMatchAgainst: String
+normalization: NormalizationOptions
}
class ParsedConversation {
+messages: List~ParsedMessage~
}
class ParsedMessage {
+role: Role
+textContent: String
+toolName: String
+toolCallId: String
+images: List~ImagePart~
+audio: List~AudioPart~
+hasImage() boolean
+imageCount() int
+hasAudio() boolean
+audioCount() int
}
class IsolationSource {
+kind: Kind
+name: String
}
class LlmScenarioNames {
+ISOLATION_MARKER: String
}
HttpLlmResponse --> Provider
HttpLlmResponse --> Completion
HttpLlmResponse --> EmbeddingResponse
HttpLlmResponse --> RerankResponse
HttpLlmResponse --> ConversationPredicates
Completion --> ToolUse
Completion --> Usage
Completion --> StreamingPhysics
Runtime LLM client SPI
Most LLM mocking is deterministic and offline. A few opt-in features (drift detection, semantic prompt matching) need MockServer to act as a client against a real LLM the user already runs. This is the opposite direction to the codecs (decode parses an inbound request; encode builds a mock response), so a sibling SPI mirrors the codec-registry shape:
org.mockserver.llm.client.LlmClient—provider(),buildCompletionRequest(LlmBackend, ParsedConversation),parseCompletionResponse(HttpResponse). Implementations are pure (no transport, no shared state) so they unit-test offline.AbstractLlmClientprovides URL parsing, base-request construction, and JSON helpers.org.mockserver.llm.client.LlmClientRegistry— singleton, static-block registration keyed byProvider, structurally identical toProviderCodecRegistry. All seven providers registered: Ollama, OpenAI, OpenAI Responses, Azure OpenAI, Anthropic, Gemini, Bedrock.org.mockserver.llm.client.LlmBackend— immutable record (name, provider, baseUrl, apiKey, model, headers, timeoutMillis);baseUrl/modeldefault per provider,apiKeyredacted intoString().org.mockserver.llm.client.LlmBackendResolver— three config layers: (1) provider env conventions (OPENAI_API_KEY/ANTHROPIC_API_KEY/GEMINI_API_KEY/OLLAMA_HOST), (2)mockserver.llmProvider/llmApiKey/llmModel/llmBaseUrl, (3) named backends JSON (mockserver.llmBackendsConfig). Properties take precedence over env; named backends are selectable by name.org.mockserver.llm.client.LlmCompletionService— the single entry point for runtime-LLM features. Looks up the client, builds the request, sends it via an injectedLlmTransport, parses the response. Enforces the safety rules: off unless a backend resolves, fail closed (timeout / transport error / non-2xx / parse failure →Optional.empty()+ one log line), and reproducible (clients pintemperature=0/seed; responses cached per provider+model+baseUrl+normalised prompt).LlmTransportis a seam;NettyHttpClientLlmTransportwraps the server'sNettyHttpClientin production.
flowchart TD
A["Runtime-LLM feature needs an LLM"] --> B{"Backend resolved?"}
B -- "no" --> C["Feature unavailable\nlog one line, deterministic fallback"]
B -- "yes" --> D["LlmClientRegistry.lookup(provider)"]
D --> E["client.buildCompletionRequest"]
E --> F["LlmTransport.send (NettyHttpClient)"]
F --> G["client.parseCompletionResponse to Completion"]
F -- "timeout / error / non-2xx" --> C
Adding a provider = implement LlmClient + one register(...) line — the same one-line story as codecs. Ollama is the reference backend (no auth, local, free) used to prove the path. Bedrock builds the Anthropic-on-Bedrock body and parses the Anthropic-shaped response, but automatic AWS SigV4 signing is not yet implemented — callers supply auth via the headers escape hatch or a signing proxy (tracked in llm-security-audit.md).
This SPI is never on the deterministic assertion/matching path. The features that consume it (drift detection, semantic matching) are opt-in and documented in this file above.
OpenTelemetry export
Optional, off-by-default OTLP export, in two independent parts (both fail-soft — a setup error logs one line and never affects the server or a response; io.opentelemetry is relocated in the shaded jar):
- Metrics (
org.mockserver.metrics.OtelMetricsExporter,mockserver.otelMetricsEnabled) — bridges the existingMetrics.Namegauges (the same set exposed for Prometheus, including the LLM/SSE/chaos counters) to OTLP as observable gauges that read the current values, so Prometheus and OTLP stay consistent. An alternative to the Prometheus endpoint. - GenAI spans (
org.mockserver.telemetry.GenAiSpanExporter+GenAiSpans,mockserver.otelTracesEnabled) — emits one span per LLM completion with GenAI semantic-convention attributes (gen_ai.system,gen_ai.request.model,gen_ai.usage.*,gen_ai.response.finish_reasons, tool-call count). When a provider reports them, cached-input and reasoning token counts are also emitted under themockserver.gen_ai.usage.*namespace (cached_input_tokens,cache_creation_tokens,reasoning_tokens) — there is no GenAI semconv attribute for these yet, and they are omitted entirely when absent. These are spans MockServer codes deliberately — no auto-instrumentation.GenAiSpansis a process-wide no-op untilGenAiSpanExporterinstalls a tracer. Spans fire on two paths:- Mock action path —
HttpLlmResponseActionHandlercallsGenAiSpans.recordCompletion()for mocked responses (streaming and non-streaming). - Forward/proxy path —
HttpActionHandler.emitForwardGenAiSpan()detects LLM traffic viaLlmProviderSniffer(maps the forwarded request's target host to aProvider), parses the upstream response using the provider'sLlmClient.parseCompletionResponse(), and records a completion span. Covers matched-expectation forwards and unmatched proxy-pass. Streaming forward paths emit the GenAI span in the completion listener after the full SSE body is captured. Model is extracted from the response body (most providers include it), falling back to the request body.
- Mock action path —
Both use the OTLP HTTP/protobuf exporter with the JDK HttpClient sender (no gRPC/OkHttp) and share mockserver.otelEndpoint (a base collector URL; /v1/metrics and /v1/traces appended per signal, resolved by telemetry.OtelEndpoints).
Drift detection
detect_llm_drift (MCP) closes the loop on stale cassettes: it replays a recorded cassette's exchanges against the live provider and reports structural drift in the responses. Built from two pieces in org.mockserver.llm.drift:
StructuralShapeDiff— pure: walks two JSON documents into path→type shape maps and reports added / removed / type-changed paths (values ignored; arrays use a representative-first-element model). Reusable.DriftDetector— for each recorded exchange, decodes the recorded request via theProviderCodec, builds a fresh live request via the runtime-LLMLlmClient(Phase 2 SPI), sends it through an injectedLlmTransport, and diffs the live response shape against the recorded one. Fails closed per exchange: a missing client/codec, network error, non-2xx, or non-JSON body is reported asCOULD_NOT_CHECK, never as drift, and never thrown.
The MCP tool resolves a backend via LlmBackendResolver and is disabled (returns {disabled:true}) when none is configured. When configured, it builds a transient NettyHttpClient-backed transport for the live calls. Because it needs real API keys/tokens and is inherently non-deterministic against a live API, it belongs in an opt-in/scheduled CI lane (see docs/infrastructure/ci-cd.md), never the per-commit build. No dashboard control — it is an operational/CI tool.
AI-Powered Stub Generation
PUT /mockserver/generateExpectation infers a plausible MockServer expectation from an unmatched HTTP request, optionally calling a configured LLM backend for intelligent generation.
Request format
{
"request": { "method": "GET", "path": "/api/users/42" },
"preview": true,
"limit": 1
}
| Field | Type | Default | Description |
|---|---|---|---|
request | object | required | The unmatched HttpRequest to generate a stub for |
preview | boolean | true | When true, returns the suggestion without registering it; false registers immediately |
limit | integer | 1 | Number of suggestions to return (1-5) |
Response format
{
"suggestions": [ { "httpRequest": { ... }, "httpResponse": { ... } } ],
"confidence": 0.75,
"preview": true,
"explanation": "Generated from request pattern (no LLM backend configured)"
}
Behaviour
-
LLM-powered (when a runtime LLM backend is configured via
LlmBackendResolver):StubGenerationPromptBuilderbuilds a prompt containing the unmatched request details and up to 10 existing expectations as context. The prompt is sent viaLlmCompletionServiceand the response is parsed as expectation JSON. If the LLM response is unparseable, falls back to template generation. -
Template fallback (no LLM backend): generates a simple expectation matching the request's method and path with an appropriate status code (200 for GET, 201 for POST, 204 for DELETE) and a
{"status":"ok"}body. Confidence is reported as0.5.
Wiring
HttpStateholds an optionalLlmCompletionService+LlmBackendpair, set byLifeCycle.installLlmCompletionServiceIfAvailable()at boot when a backend resolves viaLlmBackendResolver.- The handler uses
RequestMatchers.retrieveActiveExpectations(null)to obtain context expectations. - When
preview=false, generated expectations are registered viaRequestMatchers.add(expectation, Cause.API).
Source files
| File | Purpose |
|---|---|
llm/StubGenerationPromptBuilder.java | Builds the LLM prompt from the unmatched request + existing expectations context |
llm/StubGenerationResult.java | Result DTO with suggestions, confidence, explanation, raw LLM response |
mock/HttpState.handleGenerateExpectation() | Control-plane handler for PUT /mockserver/generateExpectation |
mock/HttpState.generateSimpleStub() | Template-based fallback when no LLM is available |
MCP server conformance testing
run_mcp_contract_test (MCP) verifies that a target MCP (Model Context Protocol) server correctly implements the protocol over Streamable HTTP. It is deterministic and involves no LLM — it checks the protocol, not any tool's semantics.
org.mockserver.netty.mcp.McpContractTest— the orchestrator. Pure, with an injectedJsonRpcExchangetransport (a(message, sessionId) → ExchangeResultfunction), so the whole check sequence is unit-testable without a network (McpContractTestTestdrives it with stub servers). Runs an ordered suite of checks, each producing aCheckResult(name, passed, statusCode, detail, validationErrors), aggregated into aReport(checks + negotiatedprotocolVersion+serverInfo):- initialize — POSTs
initialize, validates the JSON-RPC 2.0 envelope and thatresultcarriesprotocolVersion, acapabilitiesobject, andserverInfo.name; captures theMcp-Session-Id. A transport error here short-circuits (only this check is reported). - notifications/initialized — sends the notification (no
id) with the session; expects HTTP 200/202/204 and no JSON-RPC error. - ping — expects a JSON-RPC
result. - tools/list — expects
result.toolsto be an array where every tool has anameand an objectinputSchema(type: object). - rejects unknown method — sends a bogus method; expects a JSON-RPC
errorwith code-32601(Method not found). - tools/call (optional) — only when the caller passes
toolName, since a real call may have side effects; a shape check onresult.content[]+ theisErrorboolean.
- initialize — POSTs
- The
McpToolRegistryhandler validatestargetUrl(absolute http/https with a host), builds theJsonRpcExchangeover the existingsendHttpRequest(HttpURLConnection, 10 s timeout), extracts the session header case-insensitively, and parses the JSON-RPC body — handling bothapplication/jsonandtext/event-stream(SSEdata:framing) responses.
The check sequence is self-consistent with MockServer's own McpStreamableHttpHandler (which returns -32601 for unknown methods and 202 ACCEPTED for the initialized notification), so pointing the tool at MockServer's own /mockserver/mcp endpoint passes. No dashboard control — it is an agent/CI-invoked developer tool, like run_contract_test.
VCR (record / replay)
LLM/MCP traffic forwarded through MockServer can be snapshotted to committable fixture files and replayed deterministically:
- Record —
record_llm_fixtures(MCP) converts recorded request/response pairs (including SSE) into expectations viaSseAwareExpectationConverter, thenFixtureRedactormasks sensitive headers and — whenredactBodyFields/mockserver.fixtureBodyRedactFieldsis set — named JSON body fields (recursively, value →***REDACTED***). - Replay —
load_expectations_from_file(MCP) loads the fixture as active expectations. Two replay aids: strict mode (strictparam ormockserver.llmVcrStrict) registers a lowest-priority (Integer.MIN_VALUE) catch-all per cassette path returning HTTP 599 so an unmatched request fails loudly; replay normalisation (normalizeRequestBodyFields) drops volatile JSON fields from each recorded request body and rewrites the matcher toJsonBodywithMatchType.ONLY_MATCHING_FIELDS, so per-run values do not block the match.
These are operational settings (config + MCP, for CI/automation), not dashboard controls.
Configuration
| Property | Default | Range | Description |
|---|---|---|---|
mockserver.maxLlmConversationBodySize | 1048576 (1 MiB) | 16384 - 67108864 | Maximum request body size for conversation matcher parsing |
mockserver.fixtureBodyRedactFields | (unset) | — | Comma-separated JSON field names redacted from recorded fixture bodies |
mockserver.llmVcrStrict | false | — | Strict VCR mode: unmatched requests on a cassette path return HTTP 599 |
mockserver.llmProvider | (unset) | — | Default runtime-LLM backend provider (enables runtime-LLM features) |
mockserver.llmApiKey | (unset) | — | API key for the default backend (secret; redacted in logs) |
mockserver.llmModel | (provider default) | — | Model for the default backend |
mockserver.llmBaseUrl | (provider default) | — | Base URL override for the default backend |
mockserver.llmBackendsConfig | (unset) | — | Path to JSON file of named backends |
mockserver.llmRequestTimeoutMillis | 30000 | — | Per-request timeout for outbound runtime-LLM calls |
mockserver.llmSemanticMatchingEnabled | false | — | Opt-in: activate the exploratory semanticMatch predicate (needs a backend; never for assertions) |
mockserver.llmInferUsageEnabled | false | — | Opt-in: fill approximate prompt_tokens/completion_tokens (via TokenCounter) when a mocked completion omits usage. Off by default so responses are unchanged; never overwrites a declared usage |
mockserver.otelMetricsEnabled | false | — | Export MockServer's metrics to an OTLP collector (alternative to Prometheus) |
mockserver.otelTracesEnabled | false | — | Emit one explicit GenAI semantic-convention span per served LLM completion |
mockserver.otelEndpoint | (unset) | — | OTLP base endpoint shared by metrics and span export |
mockserver.otelMetricsExportIntervalSeconds | 60 | ≥1 | How often metrics are pushed to the OTLP collector |
mockserver.llmMetricsEnabled | false | — | Enable LLM token/cost Prometheus counters (requires metricsEnabled); activates forward-path response parsing even without OTLP tracing |
mockserver.llmCostBudgetUsd | -1.0 (disabled) | — | Cumulative LLM cost budget in USD; enforced on ALL forward paths (matched FORWARD, breakpoint-continuation, unmatched proxy). When exceeded, LLM forwards return 429. Negative = disabled. Resets on server reset. Trip surfaces via mock_server_llm_cost_budget_tripped counter, WARN log, and the dashboard Circuit Breakers section |
LLM failover scenarios
LlmFailoverBuilder (org.mockserver.client) produces an ordered array of expectations that simulate a provider returning failures for the first N attempts, then succeeding on subsequent attempts. This is the deterministic way to test retry/failover logic (LiteLLM, Envoy AI Gateway, SDK retries) against MockServer.
The mechanism relies on expectation registration order and Times exhaustion: failure expectations with Times.exactly(n) are registered before the success expectation with Times.unlimited(). MockServer matches expectations in priority-then-insertion order (SortableExpectationId), so the first N requests match and consume the failure expectations, then fall through to the unlimited success expectation.
// Java builder
llmFailover()
.withPath("/v1/chat/completions")
.withProvider(Provider.OPENAI)
.withModel("gpt-4o")
.failWith(503)
.failWith(503)
.failWith(429)
.thenRespondWith(completion().withText("The answer is 42."))
.applyTo(mockServerClient);
Consecutive failures with the same status code and body are coalesced into a single expectation with Times.exactly(count) for efficiency. Custom error bodies can be provided per failure via failWith(status, body).
| MCP Tool | Description |
|---|---|
mock_llm_failover | Creates a failover scenario: failStatuses array of HTTP status codes (one per failure attempt), then a success response with provider-correct encoding. Validates provider against ProviderCodecRegistry. |
Related Documents
- Security Audit -- M5 security review including known codec limitations
- Codec Golden-File Testing -- how to refresh provider fixtures
- Request Processing -- action dispatch pipeline (LLM dispatch flow)
- Domain Model -- model class hierarchy
- Event System -- event logging pipeline
- AI & RPC Protocol Mocking -- SSE, MCP, A2A mocking
Source References
Key source files under mockserver/mockserver-core/src/main/java/org/mockserver/:
| File | Role |
|---|---|
llm/ProviderCodecRegistry.java | Codec registry singleton; all 9 providers registered at boot (7 chat + 2 rerank-only) |
llm/codec/AnthropicCodec.java | Anthropic Messages API encoder/decoder |
llm/codec/OpenAiChatCompletionsCodec.java | OpenAI Chat Completions encoder/decoder |
llm/codec/OpenAiResponsesCodec.java | OpenAI Responses API encoder/decoder |
llm/codec/GeminiCodec.java | Gemini encoder/decoder |
llm/codec/BedrockCodec.java | Bedrock wrapper (delegates to Anthropic codec; streaming uses AWS event-stream framing) |
llm/codec/BedrockEventStreamEncoder.java | AWS event-stream binary framing encoder/decoder (application/vnd.amazon.eventstream) |
llm/codec/AzureOpenAiCodec.java | Azure OpenAI wrapper (delegates to OpenAI codec) |
llm/codec/OllamaCodec.java | Ollama encoder/decoder |
llm/codec/CohereCodec.java | Cohere rerank-only codec (/v1/rerank) |
llm/codec/VoyageCodec.java | Voyage AI rerank-only codec (/v1/rerank) |
llm/codec/EmbeddingVectors.java | Shared deterministic/L2-normalised embedding-vector generation used by every embedding codec |
llm/codec/RerankScoring.java | Shared deterministic rerank scoring + {"results":[...]} envelope used by the rerank codecs |
model/RerankResponse.java | Rerank action config (topN, deterministicFromInput, seed) carried on HttpLlmResponse |
llm/StreamingPhysicsExpander.java | Converts Completion + StreamingPhysics to List<SseEvent> |
llm/IsolationSource.java | Session isolation key extraction descriptor |
llm/LlmScenarioNames.java | Scenario name generation with isolation encoding |
llm/ParsedConversation.java | Decoded conversation model |
llm/ParsedMessage.java | Single decoded message (role, text, tool name, tool call ID) |
client/LlmConversationBuilder.java | Fluent builder producing per-turn Expectation arrays |
client/LlmFailoverBuilder.java | Fluent builder producing failover/retry Expectation arrays (failures then success) |
client/TurnBuilder.java | Per-turn predicate and response configuration |
matchers/LlmConversationMatcher.java | Evaluates ConversationPredicates against decoded requests |
llm/PromptNormalizer.java | Deterministic prompt normalisation (whitespace/case/JSON-key-sort/volatile-field drop) |
model/HttpLlmResponse.java | Action type holding provider, model, completion, predicates |
model/ConversationPredicates.java | Serialisable predicate set stored on HttpLlmResponse |
model/NormalizationOptions.java | Serialisable normalisation modifier carried on ConversationPredicates |
llm/client/LlmClient.java + AbstractLlmClient.java | Runtime-LLM client SPI (build request / parse response), pure |
llm/client/LlmClientRegistry.java | Singleton registry of runtime-LLM clients keyed by Provider |
llm/client/{Ollama,OpenAi,OpenAiResponses,AzureOpenAi,Anthropic,Gemini,Bedrock}LlmClient.java | Per-provider runtime clients |
llm/client/LlmProviderSniffer.java | Maps forwarded request host/path to LLM Provider for forward-path GenAI observability (path-gated fallback) |
llm/client/LlmBackend.java | Immutable backend config record (apiKey redacted) |
llm/client/LlmBackendResolver.java | Three-layer backend resolution (env / properties / named JSON) |
llm/client/LlmCompletionService.java | Orchestrator: off-unless-configured, fail-closed, cached |
llm/client/LlmTransport.java + NettyHttpClientLlmTransport.java | Transport seam over NettyHttpClient |
llm/ProviderDetector.java | Heuristic provider detection from request path; mirrors UI-side detection; powers AUTO provider for MCP tools |
llm/analysis/AgentRunAnalyzer.java | Deterministic read-only agent-run inspection (tool-call counts, run summary, call graph) |
llm/semantic/SemanticPromptMatcher.java + SemanticMatching.java | Opt-in LLM-judge fuzzy match + its off-by-default static gate |
llm/adversarial/AdversarialResponseLibrary.java | Curated adversarial-response payloads for agent-resilience testing |
model/LlmChaosProfile.java | Fault/chaos profile carried on HttpLlmResponse |
llm/LlmErrorBodies.java | Pure helper mapping a Provider + error kind to the provider-correct chaos/quota error JSON body |
llm/TokenCounter.java | Pure, deterministic approximate token-count estimator (char/word heuristic; not a real tokenizer); backs opt-in usage inference |
mock/action/http/HttpLlmResponseActionHandler.java | Encodes LLM responses and applies chaos (error / truncation / malformed SSE) |
fixture/FixtureRedactor.java | Masks sensitive headers and (optional) JSON body fields when recording fixtures |
llm/drift/StructuralShapeDiff.java | Pure JSON shape diff (added/removed/type-changed paths) |
llm/drift/DriftDetector.java + DriftReport.java | Replays a cassette against the live provider and reports structural drift, fail-closed |
llm/StubGenerationPromptBuilder.java | Builds the LLM prompt for AI stub generation from unmatched requests |
llm/StubGenerationResult.java | Result DTO for stub generation (suggestions, confidence, explanation) |
metrics/OtelMetricsExporter.java | Optional OTLP metrics export bridging the Prometheus gauges (off by default) |
telemetry/GenAiSpanExporter.java + GenAiSpans.java + OtelEndpoints.java | Optional explicit GenAI span export per served completion (off by default) |
llm/analysis/LlmOptimisationReport.java | Structured JSON bundle — nested Session, Totals, Call, ToolCall, Signal, Redaction POJOs; schema version 1 |
llm/analysis/LlmOptimisationReportBuilder.java | Builds the report from FORWARDED_REQUEST log entries via ProviderCodecRegistry + LlmProviderSniffer + LlmPricing + FixtureRedactor |
llm/analysis/OptimisationSignals.java | Six deterministic signal detectors (see below); pure — no network, no LLM |
llm/analysis/LlmOptimisationBriefRenderer.java | Renders an LlmOptimisationReport to a pre-framed Markdown brief |
llm/analysis/LlmOptimisationCsvRenderer.java | Renders an LlmOptimisationReport to CSV (per-call rows + totals/verdict summary, RFC-4180 escaped) |
llm/analysis/LlmOptimisationReportService.java | Façade: build(pairs, filter) + renderBrief(result) + renderCsv(result) — used by both the REST handler and the MCP tool; also pushes the verdict snapshot to Metrics for the optimisation gauges |
LLM Optimisation Export
MockServer can turn any captured LLM session into a structured optimisation report — either a copy-paste Markdown brief (pre-framed so a user can paste it into any LLM for cost-reduction advice) or a JSON bundle for programmatic use. The feature is export-only: MockServer never calls an LLM; every number is deterministic.
Data flow
flowchart LR
LOG["FORWARDED_REQUEST\nlog entries"] --> SVC["LlmOptimisationReportService\n.build(pairs, filter)"]
SVC --> SNIFF["LlmProviderSniffer\n(detect provider)"]
SVC --> CODEC["ProviderCodecRegistry\n.decode() → ParsedConversation"]
SVC --> PRICE["LlmPricing\n(estimate USD cost)"]
SVC --> REDACT["FixtureRedactor\n(strip secrets)"]
SVC --> BUILD["LlmOptimisationReportBuilder\n(assemble report)"]
BUILD --> SIG["OptimisationSignals\n(detect 9 signal types)"]
BUILD --> RPT["LlmOptimisationReport\n(JSON bundle)"]
RPT -->|"format=markdown"| RENDER["LlmOptimisationBriefRenderer\n(Markdown brief)"]
Only LLM traffic (where the sniffer recognises a provider) is included; non-LLM forwarded traffic is ignored.
Per-call upstream latency
The per-call latencyMs is the measured upstream round-trip time. It is carried from the forward path to the report via an internal x-mockserver-response-time-ms header attached only to the logged FORWARDED_REQUEST response clone (never the response written to the real client) — the same convention as x-mockserver-streamed / x-mockserver-chunk-delays-ms. HttpActionHandler defines the constant (HttpActionHandler.RESPONSE_TIME_HEADER) and sets the header on every forward path's logged clone:
- Non-streaming — the value prefers the precise
Timing.getTotalTimeInMillis()measured byNettyHttpClient(matchingrecordForwardMetrics), falling back to the wall-clock delta. This matters for the matched-FORWARDpath, wherescheduler.submit(responseFuture, …)only runs after the future has completed, so a naive nanos delta would read ~0. - Streaming — the full-stream duration is computed from the captured forward-start nanos at stream completion (the upstream
Timingonly covers the response head).
LlmOptimisationReportService reads the header off the recorded response and passes it as CapturedExchange.latencyMs; the builder applies it when non-null and >= 0 and aggregates totals.totalLatencyMs. A malformed/absent header degrades gracefully to a 0 latency for that call. The matched-FORWARD two-arg writeForwardActionResponse(HttpResponse, …) overload (pre-resolved responses, e.g. object-callback) has no upstream timing and leaves latency unset.
Endpoint and MCP tool
REST — GET /mockserver/llm/optimisationReport (mockserver-netty control-plane, handled by HttpRequestHandler.handleOptimisationReport):
| Query parameter | Values | Default |
|---|---|---|
format | json | markdown | csv | json |
session | grouping key | all captured LLM traffic |
host | upstream hostname | all hosts |
provider | OPENAI | ANTHROPIC | GEMINI | BEDROCK | AZURE_OPENAI | OLLAMA | all providers |
An unrecognised format returns 400 with format must be one of: json, markdown, csv. The csv format is served as text/csv; charset=utf-8.
CORS is enabled on this endpoint so the dashboard UI can call it even when the dashboard and control plane are on different origins.
MCP tool — export_optimisation_report (registered in McpToolRegistry.registerExportOptimisationReport), same parameters as the REST endpoint (format accepts markdown (default), json, or csv). Returns the brief text (brief), JSON bundle (report), or CSV text (csv) as a tool result.
CSV format
format=csv is served by LlmOptimisationCsvRenderer (mirroring the structure of LlmOptimisationBriefRenderer). It is intended for spreadsheets and data pipelines: deterministic ordering, RFC-4180 escaping (a field containing a comma, double quote, CR, or LF is wrapped in double quotes with embedded quotes doubled). The output has two sections separated by a blank line:
- Per-call rows — header
index,provider,model,input_tokens,output_tokens,cached_input_tokens,reasoning_tokens,estimated_cost_usd,cost_is_estimated,latency_ms,tool_calls,finish_reason, one row per capturedCall. - Totals/verdict summary — header
section,metric,value, carrying the aggregateTotals(call_count, token totals,estimated_cost_usd,total_latency_ms,tool_call_count,cache_hit_ratio,one_shot_rate,retry_call_count) and the headlineVerdict(grade,total_estimated_saving_usd,total_wasted_input_tokens,saving_fraction_of_spend, severity counts).
An empty report renders the per-call header row with no data rows, followed by the totals section with zeros — always a valid, header-bearing CSV. The CSV is additive to (not a substitute for) the JSON bundle, so unlike the JSON shape it is not a frozen wire contract.
Optimisation verdict Prometheus gauges
Three single global Prometheus gauges expose the latest report's headline verdict/totals — mock_server_llm_estimated_waste_usd (from verdict.totalEstimatedSavingUsd), mock_server_llm_cache_hit_ratio (from totals.cacheHitRatio), and mock_server_llm_one_shot_rate (from totals.oneShotRate). They read a cached snapshot that LlmOptimisationReportService.build(...) pushes on every build, so they reflect the most-recently-built report (0 until one is built, and after a reset). See metrics.md → LLM Optimisation Verdict Gauges for the source-of-truth rationale and OTLP mirroring.
Dashboard — the LLM Optimise screen (OptimiseView.tsx, the LLM Optimise nav tab, positioned immediately after Chaos) fetches format=json for display and format=markdown for the "Copy optimisation brief" and "Download bundle" buttons.
In-product verdict
LlmOptimisationReport.Verdict is always present on the report (an empty session yields grade A, zeros, and the rationale "No optimisation opportunities detected."). It is computed by LlmOptimisationReportBuilder.buildVerdict using per-call MAX attribution — the approach that ensures overlapping signals can never inflate the headline above actual spend.
Attribution algorithm. For each signal with k affected calls, divide its estimatedSavingUsd and estimatedWastedInputTokens by k to get per-call shares. For each call index i, take the maximum across all signals (not a sum) so the same wasted tokens cannot be counted twice. Sum those per-call maxima, then apply a final clamp: totalEstimatedSavingUsd ≤ totals.estimatedCostUsd.
Grade thresholds. score = 100 × (1 − savingFraction), where savingFraction = totalEstimatedSavingUsd / totals.estimatedCostUsd (falls back to a token-wasted fraction when cost is zero):
| Score | Grade |
|---|---|
| ≥ 90 | A |
| ≥ 75 | B |
| ≥ 60 | C |
| ≥ 45 | D |
| < 45 | F |
Severity floor: if highCount > 0 and the score would otherwise yield A, the grade is promoted to B.
Rationale string (templated, deterministic): no signals → "No optimisation opportunities detected."; otherwise → "Grade C — an estimated 18% of spend (\$1.42) is recoverable across 3 findings (1 high, 2 medium)." Zero-count severities are omitted.
| Verdict field | Type | Notes |
|---|---|---|
grade | String | "A" … "F" |
rationale | String | Templated one-liner |
totalEstimatedSavingUsd | double | Clamped ≤ totals.estimatedCostUsd |
totalWastedInputTokens | long | Sum of per-call MAX wasted input tokens |
savingFractionOfSpend | double | 0..1 |
costIsEstimated | boolean | Mirrors totals.costIsEstimated |
highCount / mediumCount / lowCount | int | Signal counts by severity |
The dashboard renders the verdict as a banner above the hero cards (data-testid="optimise-verdict"): the grade letter is colour-coded (A/B = success, C = warning, D/F = error), the headline shows "Est. $X recoverable (Y% of spend)", and the rationale appears beneath it. The "Copy verdict" button next to "Copy optimisation brief" builds a compact plain-text verdict client-side from the already-loaded JSON (no additional fetch).
The Markdown brief opens with a ## Verdict section (grade, rationale, recoverable estimate, cache-hit %, one-shot %) before the run summary.
Session KPIs
Three new KPIs are added to Totals and reflected in the run summary and hero cards:
| KPI field | Formula | Dashboard label |
|---|---|---|
cacheHitRatio | inputTokens > 0 ? cachedInputTokens / inputTokens : 0 | Cache hit |
oneShotRate | callCount > 0 ? 1 − retryCallCount / callCount : 1.0 | One-shot |
retryCallCount | Windowed retry count (window = 3): call i is a retry if it matches any of the prior 3 calls on path, model, messageCount, systemPromptFingerprint, and inputTokens. | — (used to derive oneShotRate) |
Optimisation signals
OptimisationSignals.detect(calls, providers, cachedInputTokens) runs nine pure detectors. Results are sorted by urgency descending (urgency = severityWeight × callShare, where severityWeight is 1.0 / 0.6 / 0.3 for HIGH / MEDIUM / LOW, and callShare = affected calls / total calls); severity rank is the tie-breaker.
Each signal also carries a structured Fix object (all String fields, any nullable):
Fix field | Content |
|---|---|
summary | Imperative ≤6-word headline, e.g. "Enable prompt caching" |
action | 1–2 sentence description of what to do |
configSnippet | Copy-paste env / JSON snippet, or null |
exampleExpectation | Example MockServer expectation JSON, or null |
docsUrl | Absolute URL into the consumer docs, or null |
The legacy recommendation String on each signal is retained for back-compat. When fix is present the dashboard renders fix.summary (bold) + fix.action + copy-button for configSnippet / exampleExpectation + a docs link; when fix is null it falls back to recommendation.
| Signal id | Severity | Urgency | Trigger | Lever |
|---|---|---|---|---|
REPEATED_SYSTEM_PROMPT | HIGH / MEDIUM | urgency-ranked | Same system-prompt fingerprint on ≥2 calls | Prompt caching / retrieval tool |
LARGE_STATIC_CONTEXT_RESENT | HIGH | urgency-ranked | Context block ≥2,000 tokens resent on ≥2 calls | Prompt caching / retrieval tool |
DETERMINISTIC_TOOL_CALL | MEDIUM | urgency-ranked | Same tool name + args fingerprint on ≥2 calls | Direct HTTP/MCP endpoint |
OVERSIZED_TOOL_RESULT | MEDIUM | urgency-ranked | Tool result ≥1,000 tokens | Trim/summarise output |
| `OUTPUT_TOKEN_BLOAT$ | \text{LOW} | \text{urgency}-\text{ranked} | \text{Output} ≥1{,}500 \text{tokens} \text{or} ≥3 \times \text{session} \text{median} | $max_tokens/response_format` |
DUPLICATE_CONSECUTIVE_CALL | MEDIUM | urgency-ranked | Near-identical consecutive request shape | De-duplicate / cache / retry guard |
LOW_CACHE_HIT_RATE | HIGH / MEDIUM | urgency-ranked | cacheHitRatio < 0.5 AND a repeated cacheable system-prompt fingerprint exists AND notYetCached > 0. HIGH when notYetCached ≥ 2,000 tokens AND ratio < 0.2, else MEDIUM. fix is provider-aware: Anthropic gets a cache_control snippet; OpenAI/Gemini get prefix-caching advice. | Enable prompt caching |
MODEL_OVERSPEND | LOW | urgency-ranked | ≥2 trivial calls (output < 256 tokens, no tool calls, no reasoning tokens) on a model whose blended rate is > 30% above the provider's cheapest model. estimatedSavingUsd = affectedCost × savingFraction; estimatedWastedInputTokens = null. | Switch to a smaller model |
UNUSED_TOOL_SCHEMA | MEDIUM / LOW | urgency-ranked | Tools defined in the tools array but never invoked across the session, resent on ≥2 calls with definedToolTokens > 0. MEDIUM when total wasted tokens ≥ 1,000. Per-call waste = definedToolTokens × unusedInCall / definedCount. | Remove unused tools from tools |
Each signal carries estimatedWastedInputTokens (nullable) and estimatedSavingUsd (nullable, scaled from per-call cost via LlmPricing). Additionally, each signal carries urgency (double, 0..1) used for sorting.
Call additions for UNUSED_TOOL_SCHEMA
buildCall captures the tool names and schema size from the request body:
Call field | Type | Source |
|---|---|---|
definedToolNames | List<String> | Tool/function names from the request tools array (OpenAI: function.name; Anthropic/Gemini: name). Best-effort; empty on parse failure. |
definedToolTokens | long | charsToTokens(serialized tools node length) (~4 chars/token). 0 when tools is absent. |
Anthropic top-level system field
The Anthropic provider codec (AnthropicCodec) maps the top-level system field in request bodies — both the simple "system": "text" string form and the "system": [{"type":"text","text":"...","cache_control":{...}}] blocks array form — to a leading SYSTEM message in the decoded conversation. This ensures REPEATED_SYSTEM_PROMPT and LOW_CACHE_HIT_RATE fire correctly on Anthropic traffic (including Bedrock). cachedInputTokens is populated from usage.cache_read_input_tokens in the Anthropic response by AnthropicLlmClient.
Session grouping
Sessions group by isolation key (when LLM conversation expectations with IsolationSource are active — isolation key encoded in the scenario name as __llm_conv_<uuid>__iso=<type>:<key>) or by upstream Host header otherwise. The groupingBasis field in the report (ISOLATION_KEY | PROXY_HOST) records which was used.
v1 deferral. Proxied (forwarded) LLM traffic — the optimisation use case — carries no matched-expectation scenario, so the builder groups it by upstream host (PROXY_HOST); server-side isolation-key grouping is deferred. The session filter in LlmOptimisationReportService therefore accepts the composite host:<host> key or the bare host, and the dashboard OptimiseView picker offers host-grouped sessions only (plus "All captured LLM traffic") so a selection always resolves rather than silently returning an empty report.
Redaction
FixtureRedactor strips Authorization, x-api-key, api-key, Cookie, Set-Cookie, and Proxy-Authorization headers. JSON body fields named in mockserver.fixtureBodyRedactFields are also redacted. The redaction object in the report lists what was stripped.
Configuration
mockserver.llmOptimisationMaxCalls(int, default 200) — caps report size; only the most recent N calls are included when the session is larger. Signal thresholds are v1 constants inOptimisationSignals(no config properties beyond this bound).mockserver.fixtureBodyRedactFields— shared withrecord_llm_fixtures; controls body-field redaction.
Markdown brief structure (frozen order)
- Framing preamble (verbatim instructions for the downstream LLM)
## Verdict— grade, rationale,"Est. $X (Y% of spend) / Z tokens recoverable", cache-hit %, one-shot %- Run summary (providers, models, token totals, estimated cost, latency, tool-call count, cache-hit rate, one-shot rate)
- Per-call table (
# | model | in tok | out tok | cost | latency | tools | finish) - Detected opportunities (urgency-ranked, each as a
###section with title, detail, affected call indices, estimated saving, recommendation,Fix:line, and optional fencedconfig/jsonblock for snippet / example expectation and docs link) - Conversations and tool definitions appendix (redacted messages + tool schemas per call)