Observability Guide

February 17, 2026 ยท View on GitHub

This guide covers how to enable and use observability features in duroxide for production monitoring and debugging.

Overview

Duroxide provides production-grade observability through:

  • Structured Logging: Correlated logs with instance_id, execution_id, orchestration_name, and more
  • Metrics Facade: Counters, histograms, and gauges via the metrics crate - install your preferred exporter
  • Log Analytics Integration: Ready-to-use patterns for Elasticsearch, Loki, CloudWatch, and Azure Monitor

Quick Start

Basic Structured Logging (Text to stdout)

The simplest setup uses the Compact format:

use duroxide::runtime::{RuntimeOptions, ObservabilityConfig, LogFormat};

let options = RuntimeOptions {
    observability: ObservabilityConfig {
        log_format: LogFormat::Compact,
        log_level: "info".to_string(),
        ..Default::default()
    },
    ..Default::default()
};

let rt = Runtime::start_with_options(store, activities, orchestrations, options).await;

Output Format:

2025-11-01T17:22:04.494Z INFO duroxide::runtime Orchestration started instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54
2025-11-01T17:22:04.596Z INFO duroxide::runtime Activity started instance_id=order-123 execution_id=1 activity_name=ValidatePayment activity_id=3 worker_id=work-1-cd54
2025-11-01T17:22:04.698Z INFO duroxide::runtime Activity completed instance_id=order-123 execution_id=1 activity_name=ValidatePayment worker_id=work-1-cd54 outcome="success" duration_ms=102
2025-11-01T17:22:04.806Z INFO duroxide::orchestration Payment validated successfully instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54
2025-11-01T17:22:05.010Z INFO duroxide::runtime Orchestration completed instance_id=order-123 execution_id=1 worker_id=orch-0-cd54 history_events=9

Session-related log events include:

... Session claimed/refreshed on fetch session_id=sess-abc owner_id=my-pod
... Session locks renewed sessions_renewed=3
... Session capacity exceeded after fetch (race), abandoning work item session_id=sess-xyz

Format: timestamp level module message field1=value1 field2=value2 ...

All correlation fields are included directly on each log line (flat structure, no span nesting).

Log Formats

Compact Format

Best for development and debugging with minimal noise:

ObservabilityConfig {
    log_format: LogFormat::Compact,
    log_level: "info".to_string(),
    ..Default::default()
}

Output: timestamp level module [instance_id] [worker_id] message

Pretty Format

Full structured fields for detailed debugging:

ObservabilityConfig {
    log_format: LogFormat::Pretty,
    log_level: "debug".to_string(),
    ..Default::default()
}

Output includes all fields as key=value pairs:

2024-10-30T10:15:23.456Z INFO duroxide::runtime instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder version=1.0.0 worker_id=orch-0-a3f9: Orchestration started

JSON Format

For log aggregators and analytics platforms:

ObservabilityConfig {
    log_format: LogFormat::Json,
    log_level: "info".to_string(),
    ..Default::default()
}

Output: One JSON object per line with all structured fields

Correlation Fields

Every log entry includes these fields for correlation:

  • instance_id - Unique orchestration instance identifier
  • execution_id - Execution number (1, 2, 3... for ContinueAsNew)
  • orchestration_name - Name of the orchestration
  • orchestration_version - Semantic version
  • activity_name - Activity name (in activity context)
  • activity_id - Event ID of the activity
  • worker_id - Dispatcher worker ID (format: work-{index}-{runtime_id} or orch-{index}-{runtime_id})
  • timestamp - ISO8601 timestamp

Example: Tracing an Orchestration

All logs for instance order-123:

# In Elasticsearch
instance_id:"order-123"

# In CloudWatch Insights
fields @timestamp, message | filter instance_id = "order-123" | sort @timestamp

# In Loki (LogQL)
{job="duroxide"} | json | instance_id="order-123"

User Orchestration Logging

Use ctx.trace_*() methods in your orchestrations for replay-safe logging:

async fn process_order(ctx: OrchestrationContext, order: String) -> Result<String, String> {
    ctx.trace_info("Processing order started");
    
    let validated = ctx.schedule_activity("ValidateOrder", order.clone())
        .await?;
    
    ctx.trace_info(format!("Validation result: {}", validated));
    
    if validated == "invalid" {
        ctx.trace_error("Order validation failed");
        return Err("validation_failed".to_string());
    }
    
    ctx.trace_info("Order processing complete");
    Ok("processed".to_string())
}

Output:

2024-10-30T10:15:23.690Z INFO duroxide::orchestration [order-123] Processing order started
2024-10-30T10:15:23.750Z INFO duroxide::orchestration [order-123] Validation result: valid
2024-10-30T10:15:23.890Z INFO duroxide::orchestration [order-123] Order processing complete

Key Points:

  • All ctx.trace_*() calls are deterministic (replay-safe)
  • Context fields automatically included (instance_id, execution_id, etc.)
  • Only logged on first execution (not during replay)
  • Creates SystemCall events in history for determinism

Metrics

Duroxide emits metrics via the metrics facade crate. To export metrics, install a recorder before starting the runtime:

// Option 1: Prometheus (direct scraping)
metrics_exporter_prometheus::PrometheusBuilder::new()
    .with_http_listener(([0, 0, 0, 0], 9090))
    .install()?;

// Option 2: OpenTelemetry (via metrics-exporter-opentelemetry)
// metrics_exporter_opentelemetry::Recorder::builder("my-service").install_global()?;

// Option 3: None - metrics become zero-cost no-ops (default)

Available Metrics

All metrics follow Prometheus naming conventions with _total for counters and _seconds for time measurements.

Metrics Summary Table

Metric NameTypeCategoryLabelsPurpose
Orchestration Lifecycle
duroxide_orchestration_starts_totalCounterOrchestrationorchestration_name, version, initiated_byTrack which orchestrations start
duroxide_orchestration_completions_totalCounterOrchestrationorchestration_name, version, status, final_turn_countSuccess rate, completion tracking
duroxide_orchestration_failures_totalCounterOrchestrationorchestration_name, version, error_type, error_categoryFailure analysis
duroxide_orchestration_duration_secondsHistogramOrchestrationorchestration_name, version, statusE2E duration, p95/p99 latency
duroxide_orchestration_history_sizeHistogramOrchestrationorchestration_nameDetect history growth issues
duroxide_orchestration_turnsHistogramOrchestrationorchestration_nameIdentify multi-turn orchestrations
duroxide_orchestration_continue_as_new_totalCounterOrchestrationorchestration_name, execution_idTrack CAN operations
duroxide_active_orchestrationsGaugeOrchestration(none)Current active count
Activity Execution
duroxide_activity_executions_totalCounterActivityactivity_name, outcome, retry_attemptTrack success/failure rates
duroxide_activity_duration_secondsHistogramActivityactivity_name, outcomeActivity latency p95/p99
duroxide_activity_errors_totalCounterActivityactivity_name, error_type, retryableDetailed error tracking
Sub-Orchestration
duroxide_suborchestration_calls_totalCounterSub-Orchparent_orchestration, child_orchestration, outcomeTrack sub-orch calls
duroxide_suborchestration_duration_secondsHistogramSub-Orchparent_orchestration, child_orchestration, outcomeSub-orch latency
Provider (Storage)
duroxide_provider_operation_duration_secondsHistogramProvideroperation, statusDatabase operation latency
duroxide_provider_errors_totalCounterProvideroperation, error_typeStorage layer errors
Client Operations
duroxide_client_orchestration_starts_totalCounterClientorchestration_nameClient-initiated starts
duroxide_client_external_events_raised_totalCounterClientevent_nameExternal events raised
duroxide_client_cancellations_totalCounterClient(none)Cancellation requests
duroxide_client_wait_duration_secondsHistogramClient(none)Client wait latency
Internal (Dispatcher)
duroxide.orchestration.dispatcher.items_fetchedCounterInternal(none)Items fetched by orch dispatcher
duroxide.orchestration.dispatcher.processing_duration_msHistogramInternal(none)Orch processing time
duroxide.worker.dispatcher.items_fetchedCounterInternal(none)Items fetched by worker dispatcher
duroxide.worker.dispatcher.execution_duration_msHistogramInternal(none)Worker execution time

Note: Internal dispatcher metrics use dot notation and milliseconds (not exported to Prometheus by default).

Orchestration Lifecycle Metrics

duroxide_orchestration_starts_total (Counter)

  • Description: Total orchestrations started
  • Labels: orchestration_name, version, initiated_by (client|suborchestration|continueAsNew)
  • Use: Track which orchestrations are running, identify version distribution

duroxide_orchestration_completions_total (Counter)

  • Description: Orchestrations that completed (successfully or failed)
  • Labels: orchestration_name, version, status (success|failed|cancelled), final_turn_count (1-5|6-10|11-50|50+)
  • Use: Success rate tracking, identify orchestrations with high turn counts

duroxide_orchestration_failures_total (Counter)

  • Description: Orchestration failures with detailed error classification
  • Labels: orchestration_name, version, error_type (app_error|infrastructure_error|config_error), error_category
  • Use: Root cause analysis, distinguish infrastructure vs application errors

duroxide_orchestration_duration_seconds (Histogram)

  • Description: End-to-end orchestration execution time
  • Labels: orchestration_name, version, status
  • Buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 300, 600, 1800, 3600] seconds
  • Use: Identify slow orchestrations, track p50/p95/p99 latency

duroxide_orchestration_history_size (Histogram)

  • Description: History event count at orchestration completion
  • Labels: orchestration_name
  • Buckets: [10, 50, 100, 500, 1000, 5000, 10000] events
  • Use: Identify orchestrations with unbounded history growth

duroxide_orchestration_turns (Histogram)

  • Description: Number of turns to orchestration completion
  • Labels: orchestration_name
  • Buckets: [1, 2, 5, 10, 20, 50, 100, 200, 500]
  • Use: Detect orchestrations requiring many turns (potential optimization targets)

duroxide_orchestration_continue_as_new_total (Counter)

  • Description: Continue-as-new operations performed
  • Labels: orchestration_name, execution_id
  • Use: Verify continue-as-new is working, identify long-running actors

duroxide_active_orchestrations (Gauge) ๐Ÿ†•

  • Description: Current number of orchestration instances that are actively running (not completed/failed)
  • Value: Count of orchestrations currently in progress
  • Use: Track concurrent orchestrations, detect leaks, capacity planning
  • Critical: This is a GAUGE (can increase/decrease), unlike counters that only increase
  • Note: Continue-as-new does NOT change this count (orchestration stays active)

Activity Execution Metrics

duroxide_activity_executions_total (Counter)

  • Description: Activity execution attempts
  • Labels: activity_name, outcome (success|app_error|infra_error|config_error), retry_attempt (0|1|2|3+)
  • Use: Identify flaky activities, track retry rates

duroxide_activity_duration_seconds (Histogram)

  • Description: Activity execution time (wall clock)
  • Labels: activity_name, outcome
  • Buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300] seconds
  • Use: Identify slow activities, set appropriate timeouts

duroxide_activity_errors_total (Counter)

  • Description: Detailed activity error tracking
  • Labels: activity_name, error_type (app_error|infrastructure_error|config_error), retryable (true|false)
  • Use: Distinguish transient vs permanent errors

Sub-Orchestration Metrics

duroxide_suborchestration_calls_total (Counter)

  • Description: Sub-orchestration invocations
  • Labels: parent_orchestration, child_orchestration, outcome
  • Use: Understand orchestration composition, trace call graphs

duroxide_suborchestration_duration_seconds (Histogram)

  • Description: Sub-orchestration execution time
  • Labels: parent_orchestration, child_orchestration, outcome
  • Buckets: [0.1, 0.5, 1, 5, 10, 30, 60, 300, 600] seconds

Provider Metrics

duroxide_provider_operation_duration_seconds (Histogram)

  • Description: Database operation latency
  • Labels: operation (fetch|ack|save_event|create_instance), status (success|error)
  • Buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5] seconds
  • Use: Database performance monitoring, identify slow queries

duroxide_provider_errors_total (Counter)

  • Description: Provider/storage layer errors
  • Labels: operation, error_type (timeout|connection|deadlock|corruption)
  • Use: Database health monitoring, alerting

Client Metrics

duroxide_client_orchestration_starts_total (Counter)

  • Description: Orchestrations started via client
  • Labels: orchestration_name

duroxide_client_external_events_raised_total (Counter)

  • Description: External events raised via client
  • Labels: event_name

duroxide_client_cancellations_total (Counter)

  • Description: Orchestration cancellations via client

duroxide_client_wait_duration_seconds (Histogram)

  • Description: Client wait operation duration
  • Buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 120] seconds

Error Classification

Duroxide categorizes errors into three types for better observability:

Infrastructure Errors

Provider failures, data corruption, network issues:

  • Transaction failures
  • Lock timeouts
  • Serialization errors
  • Database connectivity issues

Metrics: *.infrastructure_errors counters

Configuration Errors

Deployment/registration issues:

  • Unregistered orchestrations/activities
  • Missing versions
  • Nondeterminism detected

Metrics: *.configuration_errors counters

Application Errors

Business logic failures:

  • Activity returns Err(...)
  • User-initiated cancellations
  • Business validation failures

Metrics: Activity/orchestration failure counters with app_error outcome

Log Analytics Integration

1. Elasticsearch / OpenSearch

Setup:

# otel-collector-config.yaml
exporters:
  elasticsearch:
    endpoints: ["http://elasticsearch:9200"]
    logs_index: duroxide-logs

Queries:

# All logs for an orchestration
instance_id:"order-123"

# Failed activities
activity_name:* AND outcome:"app_error"

# Nondeterminism errors
error_type:"nondeterminism" AND level:"ERROR"

2. Grafana Loki

Setup:

# promtail-config.yaml
scrape_configs:
  - job_name: duroxide
    static_configs:
      - targets:
          - localhost
        labels:
          job: duroxide-worker

LogQL:

# Instance timeline
{job="duroxide"} | json | instance_id="order-123"

# Error rate
rate({job="duroxide", level="error"}[5m]) by (orchestration_name)

3. AWS CloudWatch Logs

Setup: Use CloudWatch Logs agent or OTLP exporter

CloudWatch Insights Queries:

# Orchestration trace
fields @timestamp, instance_id, orchestration_name, message
| filter instance_id = "order-123"
| sort @timestamp asc

# Activity success rate
fields activity_name, outcome
| filter activity_name != ""
| stats count() by activity_name, outcome

4. Azure Monitor / Application Insights

Setup: Use Azure Monitor OTLP exporter

KQL Queries:

// Full orchestration timeline
traces
| where customDimensions.instance_id == "order-123"
| project timestamp, severityLevel, message
| order by timestamp asc

Debugging Workflows

Trace a Single Orchestration

  1. Note the instance_id when starting: order-123
  2. Query logs by instance_id
  3. Filter by execution_id if using ContinueAsNew
  4. Trace activity execution within the orchestration

Find Slow Activities

Query activity duration metrics:

# Prometheus query - p99 latency
histogram_quantile(0.99, 
  rate(duroxide_activity_duration_seconds_bucket[5m])
) by (activity_name)

# Activity success rate
rate(duroxide_activity_executions_total{outcome="success"}[5m])
/
rate(duroxide_activity_executions_total[5m])
by (activity_name)

Identify Nondeterminism Issues

Search logs for configuration errors:

error_type:"nondeterminism" AND level:"ERROR"

These indicate scheduling order mismatches that need code fixes.

Performance Impact

Without a Metrics Recorder (default)

  • Zero overhead from metrics collection (no-op fast path)
  • Minimal logging overhead (only structured field extraction)
  • Recommended for development and when metrics export is not needed

With a Metrics Recorder Installed

  • ~2-5% overhead for metrics collection
  • Lock-free atomic operations
  • Background export (no blocking)
  • Recommended for production monitoring

Log Format Impact

  • Compact: Lowest overhead
  • Pretty: Slight overhead for formatting
  • JSON: Minimal additional overhead

Environment Variables

Override configuration via environment variables:

# Logging level
RUST_LOG=info,duroxide=debug

# OpenTelemetry (if using OTLP exporter in your application)
# These are consumed by opentelemetry crates, not duroxide directly
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=duroxide-worker
OTEL_SERVICE_VERSION=1.0.0

Example Prometheus Queries

Orchestration Success Rate

rate(duroxide_orchestration_completions_total{status="success"}[5m])
/
rate(duroxide_orchestration_completions_total[5m])
by (orchestration_name)

P95 Orchestration Duration

histogram_quantile(0.95,
  rate(duroxide_orchestration_duration_seconds_bucket[5m])
) by (orchestration_name)

Activity Error Rate

rate(duroxide_activity_executions_total{outcome!="success"}[5m])
by (activity_name)

Top 5 Slowest Activities

topk(5,
  histogram_quantile(0.95,
    rate(duroxide_activity_duration_seconds_bucket[5m])
  ) by (activity_name)
)

Infrastructure Errors (Actionable)

rate(duroxide_orchestration_failures_total{error_type="infrastructure_error"}[5m])
by (orchestration_name)

Orchestrations Requiring Many Turns

# Using completion labels (bucketed ranges)
sum(duroxide_orchestration_completions_total{final_turn_count="50+"})
by (orchestration_name)

# Using turns histogram (p95 turn count)
histogram_quantile(0.95,
  rate(duroxide_orchestration_turns_bucket[5m])
) by (orchestration_name)

# Orchestrations that took >100 turns
sum(rate(duroxide_orchestration_turns_bucket{le="+Inf"}[5m]))
-
sum(rate(duroxide_orchestration_turns_bucket{le="100"}[5m]))

Currently Active Orchestrations (Gauge)

# Total active right now
duroxide_active_orchestrations

# Detect orchestration leaks (if this keeps growing, you have a leak)
increase(duroxide_active_orchestrations[1h]) > 100

Best Practices

  1. Use Compact format in development for readable logs
  2. Use JSON format in production for log aggregation
  3. Include context in user traces: Use descriptive messages
  4. Monitor infrastructure_errors closely - indicates system issues
  5. Track configuration_errors - may require deployment fixes
  6. Use instance_id as primary correlation field
  7. Query by orchestration_name to find patterns across instances
  8. Enable metrics in production for performance insights
  9. Use histogram metrics for p95/p99 latency tracking, not just averages
  10. Alert on error_type labels to distinguish actionable errors from business logic failures

Troubleshooting

Logs Not Showing

Check the log level:

ObservabilityConfig {
    log_level: "debug".to_string(),  // Lower for more logs
    ..Default::default()
}

Missing Context Fields

Ensure logs are within an orchestration or activity span. Framework logs automatically include context.

Metrics Not Exporting

Verify:

  1. observability feature flag is enabled
  2. OTLP endpoint is accessible
  3. Export interval allows time for collection

Example: Full Production Configuration

use duroxide::runtime::{RuntimeOptions, ObservabilityConfig, LogFormat};

let options = RuntimeOptions {
    observability: ObservabilityConfig {
        // Metrics
        metrics_enabled: true,
        metrics_export_endpoint: Some("http://otel-collector:4317".to_string()),
        metrics_export_interval_ms: 10000,
        
        // Logging
        log_format: LogFormat::Json,
        log_level: "info".to_string(),
        
        // Service identification
        service_name: "payment-processor".to_string(),
        service_version: Some(env!("CARGO_PKG_VERSION").to_string()),
    },
    orchestration_concurrency: 4,
    worker_concurrency: 8,
    ..Default::default()
};

Grafana Dashboard

A complete Grafana dashboard with 11 panels is available at:

Import this dashboard into Grafana to get:

  • Orchestration success rate by name
  • P95/P99 duration tracking
  • Activity execution rates and latency
  • Error classification
  • Continue-as-new operations
  • History size distribution
  • Provider operation latency
  • Turn count distribution

See Also