Observability Guide
February 17, 2026 ยท View on GitHub
This guide covers how to enable and use observability features in duroxide for production monitoring and debugging.
Overview
Duroxide provides production-grade observability through:
- Structured Logging: Correlated logs with instance_id, execution_id, orchestration_name, and more
- Metrics Facade: Counters, histograms, and gauges via the
metricscrate - install your preferred exporter - Log Analytics Integration: Ready-to-use patterns for Elasticsearch, Loki, CloudWatch, and Azure Monitor
Quick Start
Basic Structured Logging (Text to stdout)
The simplest setup uses the Compact format:
use duroxide::runtime::{RuntimeOptions, ObservabilityConfig, LogFormat};
let options = RuntimeOptions {
observability: ObservabilityConfig {
log_format: LogFormat::Compact,
log_level: "info".to_string(),
..Default::default()
},
..Default::default()
};
let rt = Runtime::start_with_options(store, activities, orchestrations, options).await;
Output Format:
2025-11-01T17:22:04.494Z INFO duroxide::runtime Orchestration started instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54
2025-11-01T17:22:04.596Z INFO duroxide::runtime Activity started instance_id=order-123 execution_id=1 activity_name=ValidatePayment activity_id=3 worker_id=work-1-cd54
2025-11-01T17:22:04.698Z INFO duroxide::runtime Activity completed instance_id=order-123 execution_id=1 activity_name=ValidatePayment worker_id=work-1-cd54 outcome="success" duration_ms=102
2025-11-01T17:22:04.806Z INFO duroxide::orchestration Payment validated successfully instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54
2025-11-01T17:22:05.010Z INFO duroxide::runtime Orchestration completed instance_id=order-123 execution_id=1 worker_id=orch-0-cd54 history_events=9
Session-related log events include:
... Session claimed/refreshed on fetch session_id=sess-abc owner_id=my-pod
... Session locks renewed sessions_renewed=3
... Session capacity exceeded after fetch (race), abandoning work item session_id=sess-xyz
Format: timestamp level module message field1=value1 field2=value2 ...
All correlation fields are included directly on each log line (flat structure, no span nesting).
Log Formats
Compact Format
Best for development and debugging with minimal noise:
ObservabilityConfig {
log_format: LogFormat::Compact,
log_level: "info".to_string(),
..Default::default()
}
Output: timestamp level module [instance_id] [worker_id] message
Pretty Format
Full structured fields for detailed debugging:
ObservabilityConfig {
log_format: LogFormat::Pretty,
log_level: "debug".to_string(),
..Default::default()
}
Output includes all fields as key=value pairs:
2024-10-30T10:15:23.456Z INFO duroxide::runtime instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder version=1.0.0 worker_id=orch-0-a3f9: Orchestration started
JSON Format
For log aggregators and analytics platforms:
ObservabilityConfig {
log_format: LogFormat::Json,
log_level: "info".to_string(),
..Default::default()
}
Output: One JSON object per line with all structured fields
Correlation Fields
Every log entry includes these fields for correlation:
- instance_id - Unique orchestration instance identifier
- execution_id - Execution number (1, 2, 3... for ContinueAsNew)
- orchestration_name - Name of the orchestration
- orchestration_version - Semantic version
- activity_name - Activity name (in activity context)
- activity_id - Event ID of the activity
- worker_id - Dispatcher worker ID (format:
work-{index}-{runtime_id}ororch-{index}-{runtime_id}) - timestamp - ISO8601 timestamp
Example: Tracing an Orchestration
All logs for instance order-123:
# In Elasticsearch
instance_id:"order-123"
# In CloudWatch Insights
fields @timestamp, message | filter instance_id = "order-123" | sort @timestamp
# In Loki (LogQL)
{job="duroxide"} | json | instance_id="order-123"
User Orchestration Logging
Use ctx.trace_*() methods in your orchestrations for replay-safe logging:
async fn process_order(ctx: OrchestrationContext, order: String) -> Result<String, String> {
ctx.trace_info("Processing order started");
let validated = ctx.schedule_activity("ValidateOrder", order.clone())
.await?;
ctx.trace_info(format!("Validation result: {}", validated));
if validated == "invalid" {
ctx.trace_error("Order validation failed");
return Err("validation_failed".to_string());
}
ctx.trace_info("Order processing complete");
Ok("processed".to_string())
}
Output:
2024-10-30T10:15:23.690Z INFO duroxide::orchestration [order-123] Processing order started
2024-10-30T10:15:23.750Z INFO duroxide::orchestration [order-123] Validation result: valid
2024-10-30T10:15:23.890Z INFO duroxide::orchestration [order-123] Order processing complete
Key Points:
- All
ctx.trace_*()calls are deterministic (replay-safe) - Context fields automatically included (instance_id, execution_id, etc.)
- Only logged on first execution (not during replay)
- Creates SystemCall events in history for determinism
Metrics
Duroxide emits metrics via the metrics facade crate. To export metrics, install a recorder before starting the runtime:
// Option 1: Prometheus (direct scraping)
metrics_exporter_prometheus::PrometheusBuilder::new()
.with_http_listener(([0, 0, 0, 0], 9090))
.install()?;
// Option 2: OpenTelemetry (via metrics-exporter-opentelemetry)
// metrics_exporter_opentelemetry::Recorder::builder("my-service").install_global()?;
// Option 3: None - metrics become zero-cost no-ops (default)
Available Metrics
All metrics follow Prometheus naming conventions with _total for counters and _seconds for time measurements.
Metrics Summary Table
| Metric Name | Type | Category | Labels | Purpose |
|---|---|---|---|---|
| Orchestration Lifecycle | ||||
duroxide_orchestration_starts_total | Counter | Orchestration | orchestration_name, version, initiated_by | Track which orchestrations start |
duroxide_orchestration_completions_total | Counter | Orchestration | orchestration_name, version, status, final_turn_count | Success rate, completion tracking |
duroxide_orchestration_failures_total | Counter | Orchestration | orchestration_name, version, error_type, error_category | Failure analysis |
duroxide_orchestration_duration_seconds | Histogram | Orchestration | orchestration_name, version, status | E2E duration, p95/p99 latency |
duroxide_orchestration_history_size | Histogram | Orchestration | orchestration_name | Detect history growth issues |
duroxide_orchestration_turns | Histogram | Orchestration | orchestration_name | Identify multi-turn orchestrations |
duroxide_orchestration_continue_as_new_total | Counter | Orchestration | orchestration_name, execution_id | Track CAN operations |
duroxide_active_orchestrations | Gauge | Orchestration | (none) | Current active count |
| Activity Execution | ||||
duroxide_activity_executions_total | Counter | Activity | activity_name, outcome, retry_attempt | Track success/failure rates |
duroxide_activity_duration_seconds | Histogram | Activity | activity_name, outcome | Activity latency p95/p99 |
duroxide_activity_errors_total | Counter | Activity | activity_name, error_type, retryable | Detailed error tracking |
| Sub-Orchestration | ||||
duroxide_suborchestration_calls_total | Counter | Sub-Orch | parent_orchestration, child_orchestration, outcome | Track sub-orch calls |
duroxide_suborchestration_duration_seconds | Histogram | Sub-Orch | parent_orchestration, child_orchestration, outcome | Sub-orch latency |
| Provider (Storage) | ||||
duroxide_provider_operation_duration_seconds | Histogram | Provider | operation, status | Database operation latency |
duroxide_provider_errors_total | Counter | Provider | operation, error_type | Storage layer errors |
| Client Operations | ||||
duroxide_client_orchestration_starts_total | Counter | Client | orchestration_name | Client-initiated starts |
duroxide_client_external_events_raised_total | Counter | Client | event_name | External events raised |
duroxide_client_cancellations_total | Counter | Client | (none) | Cancellation requests |
duroxide_client_wait_duration_seconds | Histogram | Client | (none) | Client wait latency |
| Internal (Dispatcher) | ||||
duroxide.orchestration.dispatcher.items_fetched | Counter | Internal | (none) | Items fetched by orch dispatcher |
duroxide.orchestration.dispatcher.processing_duration_ms | Histogram | Internal | (none) | Orch processing time |
duroxide.worker.dispatcher.items_fetched | Counter | Internal | (none) | Items fetched by worker dispatcher |
duroxide.worker.dispatcher.execution_duration_ms | Histogram | Internal | (none) | Worker execution time |
Note: Internal dispatcher metrics use dot notation and milliseconds (not exported to Prometheus by default).
Orchestration Lifecycle Metrics
duroxide_orchestration_starts_total (Counter)
- Description: Total orchestrations started
- Labels:
orchestration_name,version,initiated_by(client|suborchestration|continueAsNew) - Use: Track which orchestrations are running, identify version distribution
duroxide_orchestration_completions_total (Counter)
- Description: Orchestrations that completed (successfully or failed)
- Labels:
orchestration_name,version,status(success|failed|cancelled),final_turn_count(1-5|6-10|11-50|50+) - Use: Success rate tracking, identify orchestrations with high turn counts
duroxide_orchestration_failures_total (Counter)
- Description: Orchestration failures with detailed error classification
- Labels:
orchestration_name,version,error_type(app_error|infrastructure_error|config_error),error_category - Use: Root cause analysis, distinguish infrastructure vs application errors
duroxide_orchestration_duration_seconds (Histogram)
- Description: End-to-end orchestration execution time
- Labels:
orchestration_name,version,status - Buckets:
[0.1, 0.5, 1, 2, 5, 10, 30, 60, 300, 600, 1800, 3600]seconds - Use: Identify slow orchestrations, track p50/p95/p99 latency
duroxide_orchestration_history_size (Histogram)
- Description: History event count at orchestration completion
- Labels:
orchestration_name - Buckets:
[10, 50, 100, 500, 1000, 5000, 10000]events - Use: Identify orchestrations with unbounded history growth
duroxide_orchestration_turns (Histogram)
- Description: Number of turns to orchestration completion
- Labels:
orchestration_name - Buckets:
[1, 2, 5, 10, 20, 50, 100, 200, 500] - Use: Detect orchestrations requiring many turns (potential optimization targets)
duroxide_orchestration_continue_as_new_total (Counter)
- Description: Continue-as-new operations performed
- Labels:
orchestration_name,execution_id - Use: Verify continue-as-new is working, identify long-running actors
duroxide_active_orchestrations (Gauge) ๐
- Description: Current number of orchestration instances that are actively running (not completed/failed)
- Value: Count of orchestrations currently in progress
- Use: Track concurrent orchestrations, detect leaks, capacity planning
- Critical: This is a GAUGE (can increase/decrease), unlike counters that only increase
- Note: Continue-as-new does NOT change this count (orchestration stays active)
Activity Execution Metrics
duroxide_activity_executions_total (Counter)
- Description: Activity execution attempts
- Labels:
activity_name,outcome(success|app_error|infra_error|config_error),retry_attempt(0|1|2|3+) - Use: Identify flaky activities, track retry rates
duroxide_activity_duration_seconds (Histogram)
- Description: Activity execution time (wall clock)
- Labels:
activity_name,outcome - Buckets:
[0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]seconds - Use: Identify slow activities, set appropriate timeouts
duroxide_activity_errors_total (Counter)
- Description: Detailed activity error tracking
- Labels:
activity_name,error_type(app_error|infrastructure_error|config_error),retryable(true|false) - Use: Distinguish transient vs permanent errors
Sub-Orchestration Metrics
duroxide_suborchestration_calls_total (Counter)
- Description: Sub-orchestration invocations
- Labels:
parent_orchestration,child_orchestration,outcome - Use: Understand orchestration composition, trace call graphs
duroxide_suborchestration_duration_seconds (Histogram)
- Description: Sub-orchestration execution time
- Labels:
parent_orchestration,child_orchestration,outcome - Buckets:
[0.1, 0.5, 1, 5, 10, 30, 60, 300, 600]seconds
Provider Metrics
duroxide_provider_operation_duration_seconds (Histogram)
- Description: Database operation latency
- Labels:
operation(fetch|ack|save_event|create_instance),status(success|error) - Buckets:
[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]seconds - Use: Database performance monitoring, identify slow queries
duroxide_provider_errors_total (Counter)
- Description: Provider/storage layer errors
- Labels:
operation,error_type(timeout|connection|deadlock|corruption) - Use: Database health monitoring, alerting
Client Metrics
duroxide_client_orchestration_starts_total (Counter)
- Description: Orchestrations started via client
- Labels:
orchestration_name
duroxide_client_external_events_raised_total (Counter)
- Description: External events raised via client
- Labels:
event_name
duroxide_client_cancellations_total (Counter)
- Description: Orchestration cancellations via client
duroxide_client_wait_duration_seconds (Histogram)
- Description: Client wait operation duration
- Buckets:
[0.1, 0.5, 1, 2, 5, 10, 30, 60, 120]seconds
Error Classification
Duroxide categorizes errors into three types for better observability:
Infrastructure Errors
Provider failures, data corruption, network issues:
- Transaction failures
- Lock timeouts
- Serialization errors
- Database connectivity issues
Metrics: *.infrastructure_errors counters
Configuration Errors
Deployment/registration issues:
- Unregistered orchestrations/activities
- Missing versions
- Nondeterminism detected
Metrics: *.configuration_errors counters
Application Errors
Business logic failures:
- Activity returns
Err(...) - User-initiated cancellations
- Business validation failures
Metrics: Activity/orchestration failure counters with app_error outcome
Log Analytics Integration
1. Elasticsearch / OpenSearch
Setup:
# otel-collector-config.yaml
exporters:
elasticsearch:
endpoints: ["http://elasticsearch:9200"]
logs_index: duroxide-logs
Queries:
# All logs for an orchestration
instance_id:"order-123"
# Failed activities
activity_name:* AND outcome:"app_error"
# Nondeterminism errors
error_type:"nondeterminism" AND level:"ERROR"
2. Grafana Loki
Setup:
# promtail-config.yaml
scrape_configs:
- job_name: duroxide
static_configs:
- targets:
- localhost
labels:
job: duroxide-worker
LogQL:
# Instance timeline
{job="duroxide"} | json | instance_id="order-123"
# Error rate
rate({job="duroxide", level="error"}[5m]) by (orchestration_name)
3. AWS CloudWatch Logs
Setup: Use CloudWatch Logs agent or OTLP exporter
CloudWatch Insights Queries:
# Orchestration trace
fields @timestamp, instance_id, orchestration_name, message
| filter instance_id = "order-123"
| sort @timestamp asc
# Activity success rate
fields activity_name, outcome
| filter activity_name != ""
| stats count() by activity_name, outcome
4. Azure Monitor / Application Insights
Setup: Use Azure Monitor OTLP exporter
KQL Queries:
// Full orchestration timeline
traces
| where customDimensions.instance_id == "order-123"
| project timestamp, severityLevel, message
| order by timestamp asc
Debugging Workflows
Trace a Single Orchestration
- Note the instance_id when starting:
order-123 - Query logs by instance_id
- Filter by execution_id if using ContinueAsNew
- Trace activity execution within the orchestration
Find Slow Activities
Query activity duration metrics:
# Prometheus query - p99 latency
histogram_quantile(0.99,
rate(duroxide_activity_duration_seconds_bucket[5m])
) by (activity_name)
# Activity success rate
rate(duroxide_activity_executions_total{outcome="success"}[5m])
/
rate(duroxide_activity_executions_total[5m])
by (activity_name)
Identify Nondeterminism Issues
Search logs for configuration errors:
error_type:"nondeterminism" AND level:"ERROR"
These indicate scheduling order mismatches that need code fixes.
Performance Impact
Without a Metrics Recorder (default)
- Zero overhead from metrics collection (no-op fast path)
- Minimal logging overhead (only structured field extraction)
- Recommended for development and when metrics export is not needed
With a Metrics Recorder Installed
- ~2-5% overhead for metrics collection
- Lock-free atomic operations
- Background export (no blocking)
- Recommended for production monitoring
Log Format Impact
- Compact: Lowest overhead
- Pretty: Slight overhead for formatting
- JSON: Minimal additional overhead
Environment Variables
Override configuration via environment variables:
# Logging level
RUST_LOG=info,duroxide=debug
# OpenTelemetry (if using OTLP exporter in your application)
# These are consumed by opentelemetry crates, not duroxide directly
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=duroxide-worker
OTEL_SERVICE_VERSION=1.0.0
Example Prometheus Queries
Orchestration Success Rate
rate(duroxide_orchestration_completions_total{status="success"}[5m])
/
rate(duroxide_orchestration_completions_total[5m])
by (orchestration_name)
P95 Orchestration Duration
histogram_quantile(0.95,
rate(duroxide_orchestration_duration_seconds_bucket[5m])
) by (orchestration_name)
Activity Error Rate
rate(duroxide_activity_executions_total{outcome!="success"}[5m])
by (activity_name)
Top 5 Slowest Activities
topk(5,
histogram_quantile(0.95,
rate(duroxide_activity_duration_seconds_bucket[5m])
) by (activity_name)
)
Infrastructure Errors (Actionable)
rate(duroxide_orchestration_failures_total{error_type="infrastructure_error"}[5m])
by (orchestration_name)
Orchestrations Requiring Many Turns
# Using completion labels (bucketed ranges)
sum(duroxide_orchestration_completions_total{final_turn_count="50+"})
by (orchestration_name)
# Using turns histogram (p95 turn count)
histogram_quantile(0.95,
rate(duroxide_orchestration_turns_bucket[5m])
) by (orchestration_name)
# Orchestrations that took >100 turns
sum(rate(duroxide_orchestration_turns_bucket{le="+Inf"}[5m]))
-
sum(rate(duroxide_orchestration_turns_bucket{le="100"}[5m]))
Currently Active Orchestrations (Gauge)
# Total active right now
duroxide_active_orchestrations
# Detect orchestration leaks (if this keeps growing, you have a leak)
increase(duroxide_active_orchestrations[1h]) > 100
Best Practices
- Use Compact format in development for readable logs
- Use JSON format in production for log aggregation
- Include context in user traces: Use descriptive messages
- Monitor infrastructure_errors closely - indicates system issues
- Track configuration_errors - may require deployment fixes
- Use instance_id as primary correlation field
- Query by orchestration_name to find patterns across instances
- Enable metrics in production for performance insights
- Use histogram metrics for p95/p99 latency tracking, not just averages
- Alert on error_type labels to distinguish actionable errors from business logic failures
Troubleshooting
Logs Not Showing
Check the log level:
ObservabilityConfig {
log_level: "debug".to_string(), // Lower for more logs
..Default::default()
}
Missing Context Fields
Ensure logs are within an orchestration or activity span. Framework logs automatically include context.
Metrics Not Exporting
Verify:
observabilityfeature flag is enabled- OTLP endpoint is accessible
- Export interval allows time for collection
Example: Full Production Configuration
use duroxide::runtime::{RuntimeOptions, ObservabilityConfig, LogFormat};
let options = RuntimeOptions {
observability: ObservabilityConfig {
// Metrics
metrics_enabled: true,
metrics_export_endpoint: Some("http://otel-collector:4317".to_string()),
metrics_export_interval_ms: 10000,
// Logging
log_format: LogFormat::Json,
log_level: "info".to_string(),
// Service identification
service_name: "payment-processor".to_string(),
service_version: Some(env!("CARGO_PKG_VERSION").to_string()),
},
orchestration_concurrency: 4,
worker_concurrency: 8,
..Default::default()
};
Grafana Dashboard
A complete Grafana dashboard with 11 panels is available at:
Import this dashboard into Grafana to get:
- Orchestration success rate by name
- P95/P99 duration tracking
- Activity execution rates and latency
- Error classification
- Continue-as-new operations
- History size distribution
- Provider operation latency
- Turn count distribution
See Also
- Telemetry Implementation Summary - Complete metrics specification
- Provider Observability Guide - For custom provider implementors
- Library Observability Guide - For library developers
- examples/with_observability.rs - Working example
- examples/grafana-dashboard.json - Production dashboard