Observability and Telemetry

April 21, 2026

This document describes the observability architecture implemented in ToolHive for monitoring MCP (Model Context Protocol) server interactions. ToolHive provides OpenTelemetry-based instrumentation with support for distributed tracing, metrics collection, and structured logging.

This document is intended for developers working on ToolHive. For user guides on setting up and using these features, see the ToolHive documentation:

  • Observability overview, including trace structure and example metrics
  • CLI guide, including how to enable and configure telemetry and send to common backends

For migrating from legacy attribute names to the new OTEL MCP semantic conventions, see the Telemetry Migration Guide.

Overview

ToolHive's observability stack provides complete visibility into MCP proxy operations through:

  1. Distributed tracing: Track requests across the proxy-container boundary with OpenTelemetry traces
  2. Metrics collection: Monitor performance, usage patterns, and error rates with Prometheus and OTLP metrics
  3. Structured logging: Capture detailed audit events for compliance and debugging
  4. Protocol-aware instrumentation: MCP-specific insights beyond generic HTTP metrics

See the original design document for more details on the design and goals of this observability architecture.

Architecture

graph TD
    A[MCP Client] --> B[ToolHive Proxy Runner]
    B --> C[Container MCP Server]

    B --> D[OpenTelemetry Middleware]
    D --> E[Trace Exporter]
    D --> F[Metrics Exporter]

    E --> G[OTLP Endpoint]
    E --> H[Jaeger]
    E --> I[DataDog]

    F --> J[Prometheus /metrics]
    F --> K[OTLP Metrics]

    G --> L[Observability Backend]
    K --> L
    J --> M[Prometheus Server]

    classDef toolhive fill:#EDD9A3,color:#000;
    classDef external fill:#7AB7FF,color:#000;
    class B,D toolhive;
    class L,M external;

Integration with Existing Middleware

The OpenTelemetry middleware runs inside ToolHive's existing middleware stack, after MCP parsing and before authorization:

graph TD
    A[HTTP Request] --> B[Authentication Middleware]
    B --> C[MCP Parsing Middleware]
    C --> D[OpenTelemetry Middleware]
    D --> E[Authorization Middleware]
    E --> F[Audit Middleware]
    F --> G[MCP Server Handler]

    style D fill:#EDD9A3,color:#000;

The telemetry middleware:

  • Leverages parsed MCP data from the parsing middleware
  • Includes authentication context from JWT claims
  • Captures authorization decisions for compliance
  • Correlates with audit events for complete observability

This provides end-to-end visibility across the entire request lifecycle while maintaining the modular architecture of ToolHive's middleware system.

Configuration

CLI Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --otel-endpoint | string | "" | OTLP endpoint URL (e.g., localhost:4317). Telemetry is disabled when empty and Prometheus is not enabled. |
| --otel-tracing-enabled | bool | true | Enable distributed tracing (requires endpoint) |
| --otel-metrics-enabled | bool | true | Enable OTLP metrics export (requires endpoint) |
| --otel-sampling-rate | float | 0.1 | Trace sampling rate (0.0–1.0). The CLI default is 0.1 (10%); the Kubernetes CRD default is 0.05 (5%). Config file values override the CLI default when the flag is not explicitly set. |
| --otel-service-name | string | "toolhive-mcp-proxy" | Service name for telemetry resource |
| --otel-headers | string[] | nil | OTLP authentication headers (key=value format) |
| --otel-insecure | bool | false | Use HTTP instead of HTTPS for the OTLP endpoint |
| --otel-enable-prometheus-metrics-path | bool | false | Expose Prometheus /metrics endpoint on the transport port |
| --otel-env-vars | string[] | nil | Environment variables to include in spans (comma-separated) |
| --otel-custom-attributes | string | "" | Custom resource attributes (key1=value1,key2=value2) |
| --otel-use-legacy-attributes | bool | true | Emit legacy attribute names alongside new OTEL semantic convention names |

Configuration File

Telemetry can also be configured via ~/.toolhive/config.yaml:

otel:
  endpoint: "localhost:4317"
  sampling-rate: 0.1
  env-vars:
    - NODE_ENV
    - DEPLOYMENT_ENV
  insecure: true
  use-legacy-attributes: false

CLI flags take precedence over configuration file values when explicitly set.
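The precedence between an explicitly set flag, a config file value, and the built-in default can be sketched as follows. This is a minimal illustration, not ToolHive's actual code: `resolveSamplingRate`, its parameters, and `defaultSamplingRate` are hypothetical names chosen for the example, and only the 0.1 default comes from the flag table.

```go
package main

import "fmt"

// defaultSamplingRate mirrors the documented CLI default of 0.1.
const defaultSamplingRate = 0.1

// resolveSamplingRate is a hypothetical helper (not a ToolHive function).
// flagChanged reports whether --otel-sampling-rate was passed explicitly;
// fileValue is nil when config.yaml has no sampling-rate entry.
func resolveSamplingRate(flagValue float64, flagChanged bool, fileValue *float64) float64 {
	if flagChanged {
		return flagValue // an explicitly set flag always wins
	}
	if fileValue != nil {
		return *fileValue // the file overrides the built-in default
	}
	return defaultSamplingRate
}

func main() {
	fromFile := 0.25
	fmt.Println("flag unset, file set:", resolveSamplingRate(0.1, false, &fromFile))
	fmt.Println("flag set, file set:  ", resolveSamplingRate(0.5, true, &fromFile))
	fmt.Println("flag unset, no file: ", resolveSamplingRate(0.1, false, nil))
}
```

The key point is that the flag's value alone is not enough to decide: the resolver also needs to know whether the flag was explicitly set, since an unset flag still carries its default.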

Kubernetes CRD

MCPTelemetryConfig (preferred): Define telemetry settings in a shared MCPTelemetryConfig resource and reference it via spec.telemetryConfigRef in MCPServer, MCPRemoteProxy, or VirtualMCPServer. This eliminates duplication when managing multiple servers. Each server provides a unique serviceName override. Sensitive headers (API keys, bearer tokens) are stored in Kubernetes Secrets via sensitiveHeaders[].secretKeyRef.

apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPTelemetryConfig
metadata:
  name: shared-otel
spec:
  openTelemetry:
    enabled: true
    endpoint: otel-collector:4318
    insecure: true
    tracing:
      enabled: true
      samplingRate: "0.1"
    metrics:
      enabled: true
---
apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPServer
metadata:
  name: my-server
spec:
  # ... other fields ...
  telemetryConfigRef:
    name: shared-otel
    serviceName: my-server    # unique per server

See examples/operator/mcp-servers/mcpserver_fetch_otel.yaml for a complete example.

Inline (deprecated): The inline spec.telemetry (MCPServer, MCPRemoteProxy) and spec.config.telemetry (VirtualMCPServer) fields still work but are deprecated and will be removed in a future API version. They are mutually exclusive with telemetryConfigRef; a CEL validation rule enforces this. All three resource types now support spec.telemetryConfigRef.

For VirtualMCPServer telemetry, see the vMCP observability docs.

Validation Rules

  • If an OTLP endpoint is configured but both tracingEnabled and metricsEnabled are false, configuration validation fails.
  • If only enablePrometheusMetricsPath is enabled (no OTLP endpoint), Prometheus metrics are served without OTLP export.
  • If nothing is configured (no endpoint, no Prometheus), telemetry is disabled.
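The three rules above can be expressed as a small validation function. The sketch below is illustrative only: `telemetryConfig`, `validate`, and `errNoSignals` are hypothetical names, and the struct is a simplified stand-in for whatever type ToolHive actually uses.

```go
package main

import (
	"errors"
	"fmt"
)

// telemetryConfig is a hypothetical, simplified stand-in for the real config type.
type telemetryConfig struct {
	Endpoint                    string
	TracingEnabled              bool
	MetricsEnabled              bool
	EnablePrometheusMetricsPath bool
}

var errNoSignals = errors.New("OTLP endpoint is set but tracing and metrics are both disabled")

// validate applies the documented rules and reports whether telemetry is active.
func validate(c telemetryConfig) (enabled bool, err error) {
	if c.Endpoint == "" {
		// No OTLP export: telemetry is active only if Prometheus is exposed.
		return c.EnablePrometheusMetricsPath, nil
	}
	if !c.TracingEnabled && !c.MetricsEnabled {
		return false, errNoSignals
	}
	return true, nil
}

func main() {
	configs := []telemetryConfig{
		{Endpoint: "localhost:4317"},                       // invalid: no signals enabled
		{EnablePrometheusMetricsPath: true},                // Prometheus only
		{},                                                 // nothing configured
		{Endpoint: "localhost:4317", TracingEnabled: true}, // OTLP tracing
	}
	for _, c := range configs {
		enabled, err := validate(c)
		fmt.Printf("enabled=%v err=%v\n", enabled, err)
	}
}
```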

Metrics Reference

MCP Proxy Metrics

These metrics are emitted by the telemetry middleware (pkg/telemetry/middleware.go) for each MCP server proxy.

toolhive_mcp_requests (Counter)

Total number of MCP requests processed.

| Attribute | Type | Description |
|-----------|------|-------------|
| method | string | HTTP method (POST, GET) |
| status_code | string | HTTP status code (200, 500) |
| status | string | "success" or "error" (error if status >= 400) |
| mcp_method | string | MCP method name (tools/call, resources/read, etc.) |
| mcp_resource_id | string | Tool name, resource URI, or prompt name |
| server | string | MCP server name |
| transport | string | Backend transport type (stdio, sse, streamable-http) |

Note: SSE connection establishment events also increment this counter with mcp_method="sse_connection" and do not include mcp_resource_id.

toolhive_mcp_request_duration (Histogram, seconds)

Duration of MCP requests. Uses default histogram bucket boundaries.

Attributes: Same as toolhive_mcp_requests.

mcp.server.operation.duration (Histogram, seconds)

Duration of MCP server operations per the OTEL MCP semantic conventions.

Bucket boundaries: [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]
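To see how a measured duration maps onto these boundaries, the sketch below looks up the bucket for a value in seconds. It assumes the upper-inclusive convention used by OpenTelemetry explicit-bucket histograms (a value equal to a boundary lands in that boundary's bucket); `bucketIndex` is a hypothetical helper, not part of ToolHive or the OTEL SDK.

```go
package main

import (
	"fmt"
	"sort"
)

// bounds copies the documented boundaries for mcp.server.operation.duration.
var bounds = []float64{0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 30, 60, 120, 300}

// bucketIndex returns the index of the bucket a duration falls into.
// Index i < len(bounds) means the value is <= bounds[i] and greater than the
// previous boundary; index len(bounds) is the overflow bucket (> 300 s).
func bucketIndex(seconds float64) int {
	// SearchFloat64s returns the smallest i with bounds[i] >= seconds.
	return sort.SearchFloat64s(bounds, seconds)
}

func main() {
	for _, d := range []float64{0.005, 0.1, 0.15, 45, 301} {
		i := bucketIndex(d)
		if i == len(bounds) {
			fmt.Printf("%gs -> overflow bucket\n", d)
			continue
		}
		fmt.Printf("%gs -> bucket with upper bound %gs\n", d, bounds[i])
	}
}
```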

| Attribute | Type | Condition | Description |
|-----------|------|-----------|-------------|
| mcp.method.name | string | Always | MCP method (tools/call, resources/read, etc.) |
| jsonrpc.protocol.version | string | Always | Always "2.0" |
| network.transport | string | Always | "tcp" or "pipe" |
| network.protocol.name | string | If applicable | "http" for SSE/streamable-http |
| network.protocol.version | string | If available | HTTP protocol version (1.1, 2) |
| error.type | string | On HTTP 5xx | HTTP status code as string |
| gen_ai.operation.name | string | For tools/call | Always "execute_tool" |
| gen_ai.tool.name | string | For tools/call | Tool name |
| gen_ai.prompt.name | string | For prompts/get | Prompt name |

toolhive_mcp_tool_calls (Counter)

Total number of MCP tool invocations (only recorded for tools/call requests).

| Attribute | Type | Description |
|-----------|------|-------------|
| server | string | MCP server name |
| tool | string | Tool name |
| status | string | "success" or "error" |

toolhive_mcp_active_connections (UpDownCounter)

Number of currently active MCP connections.

| Attribute | Type | Description |
|-----------|------|-------------|
| server | string | MCP server name |
| transport | string | Backend transport type |
| connection_type | string | "sse" (only present for SSE connections) |

Span Attributes

HTTP Attributes

These follow the OTEL HTTP semantic conventions. They are always emitted.

Request attributes:

| Attribute | Type | Description |
|-----------|------|-------------|
| http.request.method | string | HTTP request method |
| url.full | string | Full request URL |
| url.scheme | string | URL scheme (http, https) |
| url.path | string | URL path |
| url.query | string | URL query string (if present) |
| server.address | string | Server host |
| user_agent.original | string | User agent string |
| http.request.body.size | int64 | Request body size (if > 0) |

Response attributes:

| Attribute | Type | Description |
|-----------|------|-------------|
| http.response.status_code | int | Response HTTP status code |
| http.response.body.size | int64 | Response body size |

MCP Protocol Attributes

These are set when an MCP JSON-RPC request is parsed by the MCP parsing middleware (pkg/mcp/parser.go).

| Attribute | Type | Condition | Description |
|-----------|------|-----------|-------------|
| mcp.method.name | string | Always | MCP JSON-RPC method name |
| rpc.system.name | string | Always | Always "jsonrpc" |
| jsonrpc.protocol.version | string | Always | Always "2.0" |
| jsonrpc.request.id | string | If request has ID | JSON-RPC request ID |
| mcp.resource.uri | string | Resource methods only | Resource URI |
| mcp.server.name | string | Always | MCP server name |
| mcp.is_batch | bool | If batch request | Batch request indicator |

The mcp.resource.uri attribute is set only for the following methods: resources/read, resources/subscribe, resources/unsubscribe, notifications/resources/updated.

Tool, Prompt, and Resource Attributes

For tools/call:

| Attribute | Type | Description |
|-----------|------|-------------|
| gen_ai.tool.name | string | Tool name |
| gen_ai.operation.name | string | Always "execute_tool" |
| gen_ai.tool.call.arguments | string | Sanitized tool arguments (max 200 chars) |

For prompts/get:

| Attribute | Type | Description |
|-----------|------|-------------|
| gen_ai.prompt.name | string | Prompt name |

For initialize:

| Attribute | Type | Description |
|-----------|------|-------------|
| mcp.client.name | string | Client name from clientInfo |

Network and Transport Attributes

| Attribute | Type | Description | Values |
|-----------|------|-------------|--------|
| network.transport | string | Network transport protocol | "tcp" (SSE, streamable-http), "pipe" (stdio) |
| network.protocol.name | string | Application protocol | "http" (SSE, streamable-http), empty (stdio) |
| network.protocol.version | string | HTTP protocol version | "1.1", "2" |
| mcp.backend.protocol.version | string | Backend MCP protocol version | SSE: "1.1" |

Session and Client Attributes

| Attribute | Type | Condition | Description |
|-----------|------|-----------|-------------|
| mcp.session.id | string | Mcp-Session-Id header present | Session identifier |
| mcp.protocol.version | string | MCP-Protocol-Version header present | MCP protocol version |
| client.address | string | Remote address available | Client IP address |
| client.port | int | Port parseable from remote address | Client port |
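The conditions on client.address and client.port suggest that the port is derived by splitting the connection's remote address. A minimal sketch of that split, using only the Go standard library, is shown below. `clientAddr` is a hypothetical helper, and the fallback of keeping the raw address when no port can be parsed is an assumption, not confirmed ToolHive behavior.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// clientAddr splits a remote address such as http.Request.RemoteAddr.
// hasPort is false when no numeric port could be parsed, in which case a
// caller would emit client.address only (assumed behavior).
func clientAddr(remoteAddr string) (host string, port int, hasPort bool) {
	h, p, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr, 0, false // no port present: keep the raw address
	}
	n, err := strconv.Atoi(p)
	if err != nil {
		return h, 0, false // non-numeric port: skip the port attribute
	}
	return h, n, true
}

func main() {
	for _, addr := range []string{"192.0.2.10:54321", "[::1]:8080", "192.0.2.10"} {
		host, port, ok := clientAddr(addr)
		fmt.Printf("%-18s host=%s port=%d hasPort=%v\n", addr, host, port, ok)
	}
}
```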

Error Attributes

| Attribute | Type | Condition | Description |
|-----------|------|-----------|-------------|
| error.type | string | HTTP 5xx errors | HTTP status code as string (e.g., "500") |

Span status behavior:

  • HTTP 5xx: Span status set to Error with message "HTTP {code}"
  • HTTP 4xx: Span status left as Unset (client errors per OTEL semconv)
  • HTTP 2xx/3xx: Span status set to Ok
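Note that the span status rules differ from the status label on toolhive_mcp_requests, which reports "error" for any status of 400 or above. The sketch below places the two mappings side by side. `spanStatus`, `spanStatusFor`, and `metricStatus` are hypothetical names; the local `spanStatus` type stands in for the OTEL status codes so the example stays standard-library only, and leaving 1xx responses as Unset is an assumption, since this document does not cover them.

```go
package main

import (
	"fmt"
	"strconv"
)

// spanStatus is a local stand-in for the OTEL span status codes.
type spanStatus string

const (
	statusUnset spanStatus = "Unset"
	statusOk    spanStatus = "Ok"
	statusError spanStatus = "Error"
)

// spanStatusFor maps an HTTP status code to a span status and message.
func spanStatusFor(code int) (spanStatus, string) {
	switch {
	case code >= 500:
		return statusError, "HTTP " + strconv.Itoa(code)
	case code >= 400:
		return statusUnset, "" // client errors do not mark the server span as failed
	case code >= 200:
		return statusOk, ""
	default:
		return statusUnset, "" // 1xx: not covered by the documented rules (assumption)
	}
}

// metricStatus maps an HTTP status code to the metric "status" label.
func metricStatus(code int) string {
	if code >= 400 {
		return "error"
	}
	return "success"
}

func main() {
	for _, code := range []int{200, 302, 404, 503} {
		s, msg := spanStatusFor(code)
		fmt.Printf("%d span=%s message=%q metric=%s\n", code, s, msg, metricStatus(code))
	}
}
```

A 404 is the case worth watching when building dashboards: it counts as "error" in the metric but leaves the span status Unset.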

Environment and Custom Attributes

Environment variables (--otel-env-vars): Specified host environment variables are read and added to spans as environment.{VAR_NAME} attributes. Only variables explicitly listed in the configuration are captured.

Custom resource attributes (--otel-custom-attributes or OTEL_RESOURCE_ATTRIBUTES): Key-value pairs added as OTEL resource attributes to all telemetry signals.
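Both inputs reduce to building a key-value map. The following sketch shows one way to do that with the standard library. `parseCustomAttributes` and `envAttributes` are hypothetical helpers; rejecting malformed pairs and skipping unset variables are assumptions made for the example, not documented ToolHive behavior.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseCustomAttributes parses "key1=value1,key2=value2" into a map.
// Entries without "=" or with an empty key are rejected (assumed policy).
func parseCustomAttributes(raw string) (map[string]string, error) {
	attrs := make(map[string]string)
	if strings.TrimSpace(raw) == "" {
		return attrs, nil
	}
	for _, pair := range strings.Split(raw, ",") {
		key, value, found := strings.Cut(pair, "=")
		key = strings.TrimSpace(key)
		if !found || key == "" {
			return nil, fmt.Errorf("invalid attribute %q: want key=value", pair)
		}
		attrs[key] = strings.TrimSpace(value)
	}
	return attrs, nil
}

// envAttributes reads only the listed variables and prefixes each name with
// "environment.". Unset variables are skipped (assumed policy). The lookup
// function is injected so the helper can be tested without touching the
// process environment.
func envAttributes(names []string, lookup func(string) (string, bool)) map[string]string {
	attrs := make(map[string]string)
	for _, name := range names {
		if value, ok := lookup(name); ok {
			attrs["environment."+name] = value
		}
	}
	return attrs
}

func main() {
	custom, err := parseCustomAttributes("team=platform,region=eu-west-1")
	fmt.Println(custom, err)
	fmt.Println(envAttributes([]string{"NODE_ENV", "DEPLOYMENT_ENV"}, os.LookupEnv))
}
```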

SSE Connection Attributes

SSE connections get a dedicated short-lived span (sse.connection_established) with:

| Attribute | Type | Description |
|-----------|------|-------------|
| sse.event_type | string | Always "connection_established" |
| mcp.server.name | string | MCP server name |

The span also carries the standard HTTP, network, and transport attributes.

Span Naming Conventions

Span names follow the OTEL MCP semantic conventions:

| Pattern | When | Example |
|---------|------|---------|
| {mcp.method.name} {target} | MCP request with resource ID | "tools/call fetch" |
| {mcp.method.name} | MCP request without resource ID | "initialize" |
| {HTTP_METHOD} {url.path} | Non-MCP requests (fallback) | "GET /health" |
| sse.connection_established | SSE connection setup | |

All proxy spans use SpanKindServer.
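The request naming patterns amount to a three-way fallback, sketched below. `spanName` is a hypothetical function written for this illustration; the real middleware may structure the decision differently. The fixed sse.connection_established name is not covered here.

```go
package main

import "fmt"

// spanName picks a span name from the parsed request.
// mcpMethod is empty for non-MCP requests; target is the tool name,
// resource URI, or prompt name, and is empty when the method has none.
func spanName(mcpMethod, target, httpMethod, path string) string {
	switch {
	case mcpMethod != "" && target != "":
		return mcpMethod + " " + target
	case mcpMethod != "":
		return mcpMethod
	default:
		return httpMethod + " " + path // fallback for non-MCP requests
	}
}

func main() {
	fmt.Println(spanName("tools/call", "fetch", "POST", "/mcp"))
	fmt.Println(spanName("initialize", "", "POST", "/mcp"))
	fmt.Println(spanName("", "", "GET", "/health"))
}
```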

Distributed Tracing

Trace Context Propagation

ToolHive supports W3C Trace Context propagation through two mechanisms:

  1. HTTP headers: Standard traceparent and tracestate headers
  2. MCP _meta field: Trace context embedded in the JSON-RPC params._meta object, as recommended by the MCP OpenTelemetry specification

Priority: When both are present, _meta trace context takes precedence over HTTP headers, since _meta is the MCP-specified propagation mechanism.

How It Works

Inbound (client → ToolHive proxy):

The telemetry middleware first extracts trace context from HTTP headers, then checks for _meta in the parsed MCP request. If _meta contains traceparent (and optionally tracestate), the middleware extracts the trace context from it, which overrides the HTTP header context. A child span is then created with the extracted trace as parent.

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "fetch",
    "arguments": {"url": "https://example.com"},
    "_meta": {
      "traceparent": "00-abcdef1234567890abcdef1234567890-1234567890abcdef-01",
      "tracestate": "vendor=value"
    }
  }
}
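The precedence between the two sources can be illustrated with a standard-library sketch. The real middleware extracts context through the OTEL propagators; the code below only models the selection rule, using a simplified validity check for version-00 traceparent values. `validTraceparent` and `selectTraceparent` are hypothetical names, and falling back to the header when the _meta value is malformed is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

const hexDigits = "0123456789abcdef"

// validTraceparent is a simplified check for a version-00 W3C traceparent:
// "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>", lowercase hex only.
func validTraceparent(tp string) bool {
	parts := strings.Split(tp, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return false
	}
	for i, want := range []int{32, 16, 2} {
		p := parts[i+1]
		// Trim removes hex digits from both ends; anything left is non-hex.
		if len(p) != want || strings.Trim(p, hexDigits) != "" {
			return false
		}
	}
	// All-zero trace and span IDs are invalid per the W3C specification.
	return strings.Trim(parts[1], "0") != "" && strings.Trim(parts[2], "0") != ""
}

// selectTraceparent prefers a valid traceparent in params._meta over the
// HTTP header and reports which source was used.
func selectTraceparent(header string, meta map[string]any) (value, source string) {
	if tp, ok := meta["traceparent"].(string); ok && validTraceparent(tp) {
		return tp, "_meta"
	}
	if validTraceparent(header) {
		return header, "header"
	}
	return "", "none"
}

func main() {
	header := "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
	meta := map[string]any{
		"traceparent": "00-abcdef1234567890abcdef1234567890-1234567890abcdef-01",
	}
	fmt.Println(selectTraceparent(header, meta))
	fmt.Println(selectTraceparent(header, nil))
}
```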

Outbound (vMCP → backend):

The InjectMetaTraceContext function (pkg/telemetry/propagation.go) can inject the current trace context into the _meta field when forwarding requests to backends, enabling end-to-end distributed tracing across the vMCP aggregation layer.

Propagators

ToolHive configures the following OTEL propagators globally:

  • propagation.TraceContext{}: W3C Trace Context
  • propagation.Baggage{}: W3C Baggage

Implementation

The trace context propagation is implemented in pkg/telemetry/propagation.go using a MetaCarrier that implements propagation.TextMapCarrier for MCP _meta maps. The MCP _meta field is extracted by the MCP parsing middleware (pkg/mcp/parser.go) and stored in the request context.
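To make the carrier idea concrete, here is a minimal sketch of an adapter over a _meta map. It is not ToolHive's MetaCarrier: `metaCarrier` is a hypothetical type, and `textMapCarrier` redeclares the three-method set of propagation.TextMapCarrier (Get, Set, Keys) locally so the example compiles with the standard library alone. Returning an empty string for non-string values is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// textMapCarrier mirrors the method set of propagation.TextMapCarrier from
// go.opentelemetry.io/otel/propagation, redeclared here to avoid the import.
type textMapCarrier interface {
	Get(key string) string
	Set(key, value string)
	Keys() []string
}

// metaCarrier adapts an MCP params._meta map (hypothetical illustration).
type metaCarrier map[string]any

// Get returns the string value for key, or "" if absent or not a string.
func (c metaCarrier) Get(key string) string {
	s, _ := c[key].(string)
	return s
}

// Set stores a propagation field such as traceparent in the map.
func (c metaCarrier) Set(key, value string) { c[key] = value }

// Keys lists all keys in sorted order for deterministic output.
func (c metaCarrier) Keys() []string {
	keys := make([]string, 0, len(c))
	for k := range c {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// Compile-time check that metaCarrier satisfies the carrier interface.
var _ textMapCarrier = metaCarrier{}

func main() {
	meta := metaCarrier{"progressToken": 7}
	meta.Set("traceparent", "00-abcdef1234567890abcdef1234567890-1234567890abcdef-01")
	fmt.Println(meta.Keys())
	fmt.Println(meta.Get("traceparent"))
}
```

Because a propagator only sees the Get, Set, and Keys methods, the same TraceContext propagator that reads HTTP headers can read and write _meta without any MCP-specific logic.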

Legacy Attribute Compatibility

ToolHive supports dual emission of span attributes controlled by the useLegacyAttributes configuration option. When set to true (the current default), both legacy and new OTEL semantic convention attribute names are emitted on every span, allowing existing dashboards to continue working during migration.

For a complete mapping of legacy to new attribute names and migration instructions, see the Telemetry Migration Guide.

Virtual MCP Server Telemetry

For observability in the Virtual MCP Server (vMCP), including backend request metrics, workflow execution telemetry, and distributed tracing, see the dedicated Virtual MCP Server Observability documentation.