Observability Guide

February 17, 2026 · View on GitHub

This guide covers how to enable and use observability features in duroxide for production monitoring and debugging.

Overview

Duroxide provides production-grade observability through:

Structured Logging: Correlated logs with instance_id, execution_id, orchestration_name, and more
Metrics Facade: Counters, histograms, and gauges via the metrics crate - install your preferred exporter
Log Analytics Integration: Ready-to-use patterns for Elasticsearch, Loki, CloudWatch, and Azure Monitor

Quick Start

Basic Structured Logging (Text to stdout)

The simplest setup uses the Compact format:

use duroxide::runtime::{RuntimeOptions, ObservabilityConfig, LogFormat};

let options = RuntimeOptions {
    observability: ObservabilityConfig {
        log_format: LogFormat::Compact,
        log_level: "info".to_string(),
        ..Default::default()
    },
    ..Default::default()
};

let rt = Runtime::start_with_options(store, activities, orchestrations, options).await;

Output Format:

2025-11-01T17:22:04.494Z INFO duroxide::runtime Orchestration started instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54
2025-11-01T17:22:04.596Z INFO duroxide::runtime Activity started instance_id=order-123 execution_id=1 activity_name=ValidatePayment activity_id=3 worker_id=work-1-cd54
2025-11-01T17:22:04.698Z INFO duroxide::runtime Activity completed instance_id=order-123 execution_id=1 activity_name=ValidatePayment worker_id=work-1-cd54 outcome="success" duration_ms=102
2025-11-01T17:22:04.806Z INFO duroxide::orchestration Payment validated successfully instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54
2025-11-01T17:22:05.010Z INFO duroxide::runtime Orchestration completed instance_id=order-123 execution_id=1 worker_id=orch-0-cd54 history_events=9

Session-related log events include:

... Session claimed/refreshed on fetch session_id=sess-abc owner_id=my-pod
... Session locks renewed sessions_renewed=3
... Session capacity exceeded after fetch (race), abandoning work item session_id=sess-xyz

Format: timestamp level module message field1=value1 field2=value2 ...

All correlation fields are included directly on each log line (flat structure, no span nesting).

Log Formats

Compact Format

Best for development and debugging with minimal noise:

ObservabilityConfig {
    log_format: LogFormat::Compact,
    log_level: "info".to_string(),
    ..Default::default()
}

Output: timestamp level module [instance_id] [worker_id] message

Pretty Format

Full structured fields for detailed debugging:

ObservabilityConfig {
    log_format: LogFormat::Pretty,
    log_level: "debug".to_string(),
    ..Default::default()
}

Output includes all fields as key=value pairs:

2024-10-30T10:15:23.456Z INFO duroxide::runtime instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder version=1.0.0 worker_id=orch-0-a3f9: Orchestration started

JSON Format

For log aggregators and analytics platforms:

ObservabilityConfig {
    log_format: LogFormat::Json,
    log_level: "info".to_string(),
    ..Default::default()
}

Output: One JSON object per line with all structured fields

Correlation Fields

Every log entry includes these fields for correlation:

instance_id - Unique orchestration instance identifier
execution_id - Execution number (1, 2, 3... for ContinueAsNew)
orchestration_name - Name of the orchestration
orchestration_version - Semantic version
activity_name - Activity name (in activity context)
activity_id - Event ID of the activity
worker_id - Dispatcher worker ID (format: work-{index}-{runtime_id} or orch-{index}-{runtime_id})
timestamp - ISO8601 timestamp

Example: Tracing an Orchestration

All logs for instance order-123:

# In Elasticsearch
instance_id:"order-123"

# In CloudWatch Insights
fields @timestamp, message | filter instance_id = "order-123" | sort @timestamp

# In Loki (LogQL)
{job="duroxide"} | json | instance_id="order-123"

User Orchestration Logging

Use ctx.trace_*() methods in your orchestrations for replay-safe logging:

async fn process_order(ctx: OrchestrationContext, order: String) -> Result<String, String> {
    ctx.trace_info("Processing order started");
    
    let validated = ctx.schedule_activity("ValidateOrder", order.clone())
        .await?;
    
    ctx.trace_info(format!("Validation result: {}", validated));
    
    if validated == "invalid" {
        ctx.trace_error("Order validation failed");
        return Err("validation_failed".to_string());
    }
    
    ctx.trace_info("Order processing complete");
    Ok("processed".to_string())
}

Output:

2024-10-30T10:15:23.690Z INFO duroxide::orchestration [order-123] Processing order started
2024-10-30T10:15:23.750Z INFO duroxide::orchestration [order-123] Validation result: valid
2024-10-30T10:15:23.890Z INFO duroxide::orchestration [order-123] Order processing complete

Key Points:

All ctx.trace_*() calls are deterministic (replay-safe)
Context fields automatically included (instance_id, execution_id, etc.)
Only logged on first execution (not during replay)
Creates SystemCall events in history for determinism

Metrics

Duroxide emits metrics via the metrics facade crate. To export metrics, install a recorder before starting the runtime:

// Option 1: Prometheus (direct scraping)
metrics_exporter_prometheus::PrometheusBuilder::new()
    .with_http_listener(([0, 0, 0, 0], 9090))
    .install()?;

// Option 2: OpenTelemetry (via metrics-exporter-opentelemetry)
// metrics_exporter_opentelemetry::Recorder::builder("my-service").install_global()?;

// Option 3: None - metrics become zero-cost no-ops (default)

Available Metrics

All metrics follow Prometheus naming conventions with _total for counters and _seconds for time measurements.

Metrics Summary Table

Metric Name	Type	Category	Labels	Purpose
Orchestration Lifecycle
`duroxide_orchestration_starts_total`	Counter	Orchestration	orchestration_name, version, initiated_by	Track which orchestrations start
`duroxide_orchestration_completions_total`	Counter	Orchestration	orchestration_name, version, status, final_turn_count	Success rate, completion tracking
`duroxide_orchestration_failures_total`	Counter	Orchestration	orchestration_name, version, error_type, error_category	Failure analysis
`duroxide_orchestration_duration_seconds`	Histogram	Orchestration	orchestration_name, version, status	E2E duration, p95/p99 latency
`duroxide_orchestration_history_size`	Histogram	Orchestration	orchestration_name	Detect history growth issues
`duroxide_orchestration_turns`	Histogram	Orchestration	orchestration_name	Identify multi-turn orchestrations
`duroxide_orchestration_continue_as_new_total`	Counter	Orchestration	orchestration_name, execution_id	Track CAN operations
`duroxide_active_orchestrations`	Gauge	Orchestration	(none)	Current active count
Activity Execution
`duroxide_activity_executions_total`	Counter	Activity	activity_name, outcome, retry_attempt	Track success/failure rates
`duroxide_activity_duration_seconds`	Histogram	Activity	activity_name, outcome	Activity latency p95/p99
`duroxide_activity_errors_total`	Counter	Activity	activity_name, error_type, retryable	Detailed error tracking
Sub-Orchestration
`duroxide_suborchestration_calls_total`	Counter	Sub-Orch	parent_orchestration, child_orchestration, outcome	Track sub-orch calls
`duroxide_suborchestration_duration_seconds`	Histogram	Sub-Orch	parent_orchestration, child_orchestration, outcome	Sub-orch latency
Provider (Storage)
`duroxide_provider_operation_duration_seconds`	Histogram	Provider	operation, status	Database operation latency
`duroxide_provider_errors_total`	Counter	Provider	operation, error_type	Storage layer errors
Client Operations
`duroxide_client_orchestration_starts_total`	Counter	Client	orchestration_name	Client-initiated starts
`duroxide_client_external_events_raised_total`	Counter	Client	event_name	External events raised
`duroxide_client_cancellations_total`	Counter	Client	(none)	Cancellation requests
`duroxide_client_wait_duration_seconds`	Histogram	Client	(none)	Client wait latency
Internal (Dispatcher)
`duroxide.orchestration.dispatcher.items_fetched`	Counter	Internal	(none)	Items fetched by orch dispatcher
`duroxide.orchestration.dispatcher.processing_duration_ms`	Histogram	Internal	(none)	Orch processing time
`duroxide.worker.dispatcher.items_fetched`	Counter	Internal	(none)	Items fetched by worker dispatcher
`duroxide.worker.dispatcher.execution_duration_ms`	Histogram	Internal	(none)	Worker execution time

Note: Internal dispatcher metrics use dot notation and milliseconds (not exported to Prometheus by default).

Orchestration Lifecycle Metrics

duroxide_orchestration_starts_total (Counter)

Description: Total orchestrations started
Labels: orchestration_name, version, initiated_by (client|suborchestration|continueAsNew)
Use: Track which orchestrations are running, identify version distribution

duroxide_orchestration_completions_total (Counter)

Description: Orchestrations that completed (successfully or failed)
Labels: orchestration_name, version, status (success|failed|cancelled), final_turn_count (1-5|6-10|11-50|50+)
Use: Success rate tracking, identify orchestrations with high turn counts

duroxide_orchestration_failures_total (Counter)

Description: Orchestration failures with detailed error classification
Labels: orchestration_name, version, error_type (app_error|infrastructure_error|config_error), error_category
Use: Root cause analysis, distinguish infrastructure vs application errors

duroxide_orchestration_duration_seconds (Histogram)

Description: End-to-end orchestration execution time
Labels: orchestration_name, version, status
Buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 300, 600, 1800, 3600] seconds
Use: Identify slow orchestrations, track p50/p95/p99 latency

duroxide_orchestration_history_size (Histogram)

Description: History event count at orchestration completion
Labels: orchestration_name
Buckets: [10, 50, 100, 500, 1000, 5000, 10000] events
Use: Identify orchestrations with unbounded history growth

duroxide_orchestration_turns (Histogram)

Description: Number of turns to orchestration completion
Labels: orchestration_name
Buckets: [1, 2, 5, 10, 20, 50, 100, 200, 500]
Use: Detect orchestrations requiring many turns (potential optimization targets)

duroxide_orchestration_continue_as_new_total (Counter)

Description: Continue-as-new operations performed
Labels: orchestration_name, execution_id
Use: Verify continue-as-new is working, identify long-running actors

duroxide_active_orchestrations (Gauge) 🆕

Description: Current number of orchestration instances that are actively running (not completed/failed)
Value: Count of orchestrations currently in progress
Use: Track concurrent orchestrations, detect leaks, capacity planning
Critical: This is a GAUGE (can increase/decrease), unlike counters that only increase
Note: Continue-as-new does NOT change this count (orchestration stays active)

Activity Execution Metrics

duroxide_activity_executions_total (Counter)

Description: Activity execution attempts
Labels: activity_name, outcome (success|app_error|infra_error|config_error), retry_attempt (0|1|2|3+)
Use: Identify flaky activities, track retry rates

duroxide_activity_duration_seconds (Histogram)

Description: Activity execution time (wall clock)
Labels: activity_name, outcome
Buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300] seconds
Use: Identify slow activities, set appropriate timeouts

duroxide_activity_errors_total (Counter)

Description: Detailed activity error tracking
Labels: activity_name, error_type (app_error|infrastructure_error|config_error), retryable (true|false)
Use: Distinguish transient vs permanent errors

Sub-Orchestration Metrics

duroxide_suborchestration_calls_total (Counter)

Description: Sub-orchestration invocations
Labels: parent_orchestration, child_orchestration, outcome
Use: Understand orchestration composition, trace call graphs

duroxide_suborchestration_duration_seconds (Histogram)

Description: Sub-orchestration execution time
Labels: parent_orchestration, child_orchestration, outcome
Buckets: [0.1, 0.5, 1, 5, 10, 30, 60, 300, 600] seconds

Provider Metrics

duroxide_provider_operation_duration_seconds (Histogram)

Description: Database operation latency
Labels: operation (fetch|ack|save_event|create_instance), status (success|error)
Buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5] seconds
Use: Database performance monitoring, identify slow queries

duroxide_provider_errors_total (Counter)

Description: Provider/storage layer errors
Labels: operation, error_type (timeout|connection|deadlock|corruption)
Use: Database health monitoring, alerting

Client Metrics

duroxide_client_orchestration_starts_total (Counter)

Description: Orchestrations started via client
Labels: orchestration_name

duroxide_client_external_events_raised_total (Counter)

Description: External events raised via client
Labels: event_name

duroxide_client_cancellations_total (Counter)

Description: Orchestration cancellations via client

duroxide_client_wait_duration_seconds (Histogram)

Description: Client wait operation duration
Buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 120] seconds

Error Classification

Duroxide categorizes errors into three types for better observability:

Infrastructure Errors

Provider failures, data corruption, network issues:

Transaction failures
Lock timeouts
Serialization errors
Database connectivity issues

Metrics: *.infrastructure_errors counters

Configuration Errors

Deployment/registration issues:

Unregistered orchestrations/activities
Missing versions
Nondeterminism detected

Metrics: *.configuration_errors counters

Application Errors

Business logic failures:

Activity returns Err(...)
User-initiated cancellations
Business validation failures

Metrics: Activity/orchestration failure counters with app_error outcome

Log Analytics Integration

1. Elasticsearch / OpenSearch

Setup:

# otel-collector-config.yaml
exporters:
  elasticsearch:
    endpoints: ["http://elasticsearch:9200"]
    logs_index: duroxide-logs

Queries:

# All logs for an orchestration
instance_id:"order-123"

# Failed activities
activity_name:* AND outcome:"app_error"

# Nondeterminism errors
error_type:"nondeterminism" AND level:"ERROR"

2. Grafana Loki

Setup:

# promtail-config.yaml
scrape_configs:
  - job_name: duroxide
    static_configs:
      - targets:
          - localhost
        labels:
          job: duroxide-worker

LogQL:

# Instance timeline
{job="duroxide"} | json | instance_id="order-123"

# Error rate
rate({job="duroxide", level="error"}[5m]) by (orchestration_name)

3. AWS CloudWatch Logs

Setup: Use CloudWatch Logs agent or OTLP exporter

CloudWatch Insights Queries:

# Orchestration trace
fields @timestamp, instance_id, orchestration_name, message
| filter instance_id = "order-123"
| sort @timestamp asc

# Activity success rate
fields activity_name, outcome
| filter activity_name != ""
| stats count() by activity_name, outcome

4. Azure Monitor / Application Insights

Setup: Use Azure Monitor OTLP exporter

KQL Queries:

// Full orchestration timeline
traces
| where customDimensions.instance_id == "order-123"
| project timestamp, severityLevel, message
| order by timestamp asc

Debugging Workflows

Trace a Single Orchestration

Note the instance_id when starting: order-123
Query logs by instance_id
Filter by execution_id if using ContinueAsNew
Trace activity execution within the orchestration

Find Slow Activities

Query activity duration metrics:

# Prometheus query - p99 latency
histogram_quantile(0.99, 
  rate(duroxide_activity_duration_seconds_bucket[5m])
) by (activity_name)

# Activity success rate
rate(duroxide_activity_executions_total{outcome="success"}[5m])
/
rate(duroxide_activity_executions_total[5m])
by (activity_name)

Identify Nondeterminism Issues

Search logs for configuration errors:

error_type:"nondeterminism" AND level:"ERROR"

These indicate scheduling order mismatches that need code fixes.

Performance Impact

Without a Metrics Recorder (default)

Zero overhead from metrics collection (no-op fast path)
Minimal logging overhead (only structured field extraction)
Recommended for development and when metrics export is not needed

With a Metrics Recorder Installed

~2-5% overhead for metrics collection
Lock-free atomic operations
Background export (no blocking)
Recommended for production monitoring

Log Format Impact

Compact: Lowest overhead
Pretty: Slight overhead for formatting
JSON: Minimal additional overhead

Environment Variables

Override configuration via environment variables:

# Logging level
RUST_LOG=info,duroxide=debug

# OpenTelemetry (if using OTLP exporter in your application)
# These are consumed by opentelemetry crates, not duroxide directly
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=duroxide-worker
OTEL_SERVICE_VERSION=1.0.0

Example Prometheus Queries

Orchestration Success Rate

rate(duroxide_orchestration_completions_total{status="success"}[5m])
/
rate(duroxide_orchestration_completions_total[5m])
by (orchestration_name)

P95 Orchestration Duration

histogram_quantile(0.95,
  rate(duroxide_orchestration_duration_seconds_bucket[5m])
) by (orchestration_name)

Activity Error Rate

rate(duroxide_activity_executions_total{outcome!="success"}[5m])
by (activity_name)

Top 5 Slowest Activities

topk(5,
  histogram_quantile(0.95,
    rate(duroxide_activity_duration_seconds_bucket[5m])
  ) by (activity_name)
)

Infrastructure Errors (Actionable)

rate(duroxide_orchestration_failures_total{error_type="infrastructure_error"}[5m])
by (orchestration_name)

Orchestrations Requiring Many Turns

# Using completion labels (bucketed ranges)
sum(duroxide_orchestration_completions_total{final_turn_count="50+"})
by (orchestration_name)

# Using turns histogram (p95 turn count)
histogram_quantile(0.95,
  rate(duroxide_orchestration_turns_bucket[5m])
) by (orchestration_name)

# Orchestrations that took >100 turns
sum(rate(duroxide_orchestration_turns_bucket{le="+Inf"}[5m]))
-
sum(rate(duroxide_orchestration_turns_bucket{le="100"}[5m]))

Currently Active Orchestrations (Gauge)

# Total active right now
duroxide_active_orchestrations

# Detect orchestration leaks (if this keeps growing, you have a leak)
increase(duroxide_active_orchestrations[1h]) > 100

Best Practices

Use Compact format in development for readable logs
Use JSON format in production for log aggregation
Include context in user traces: Use descriptive messages
Monitor infrastructure_errors closely - indicates system issues
Track configuration_errors - may require deployment fixes
Use instance_id as primary correlation field
Query by orchestration_name to find patterns across instances
Enable metrics in production for performance insights
Use histogram metrics for p95/p99 latency tracking, not just averages
Alert on error_type labels to distinguish actionable errors from business logic failures

Troubleshooting

Logs Not Showing

Check the log level:

ObservabilityConfig {
    log_level: "debug".to_string(),  // Lower for more logs
    ..Default::default()
}

Missing Context Fields

Ensure logs are within an orchestration or activity span. Framework logs automatically include context.

Metrics Not Exporting

Verify:

observability feature flag is enabled
OTLP endpoint is accessible
Export interval allows time for collection

Example: Full Production Configuration

use duroxide::runtime::{RuntimeOptions, ObservabilityConfig, LogFormat};

let options = RuntimeOptions {
    observability: ObservabilityConfig {
        // Metrics
        metrics_enabled: true,
        metrics_export_endpoint: Some("http://otel-collector:4317".to_string()),
        metrics_export_interval_ms: 10000,
        
        // Logging
        log_format: LogFormat::Json,
        log_level: "info".to_string(),
        
        // Service identification
        service_name: "payment-processor".to_string(),
        service_version: Some(env!("CARGO_PKG_VERSION").to_string()),
    },
    orchestration_concurrency: 4,
    worker_concurrency: 8,
    ..Default::default()
};

Grafana Dashboard

A complete Grafana dashboard with 11 panels is available at:

examples/grafana-dashboard.json

Import this dashboard into Grafana to get:

Orchestration success rate by name
P95/P99 duration tracking
Activity execution rates and latency
Error classification
Continue-as-new operations
History size distribution
Provider operation latency
Turn count distribution