Metrics Reference

March 10, 2026

The Conductor JavaScript SDK provides built-in Prometheus metrics for monitoring worker performance, API latency, and task execution.

Overview

MetricsCollector implements TaskRunnerEventsListener and records 18 metric types (12 counters + 6 summaries). Metrics are exposed in Prometheus exposition format.

  • Default prefix: conductor_worker
  • Quantiles: p50, p75, p90, p95, p99 (computed from a sliding window)
  • Sliding window: Last 1,000 observations (configurable)

Quick Start

HTTP Server

import { MetricsCollector, TaskHandler } from "@io-orkes/conductor-javascript";

const metrics = new MetricsCollector({ httpPort: 9090 });

const handler = new TaskHandler({
  client, // an existing Conductor client instance (construction not shown here)
  scanForDecorated: true,
  eventListeners: [metrics],
});

await handler.startWorkers();
// GET http://localhost:9090/metrics  — Prometheus text format
// GET http://localhost:9090/health   — { "status": "UP" }

File Output

const metrics = new MetricsCollector({
  filePath: "/tmp/conductor_metrics.prom",
  fileWriteIntervalMs: 10000, // write every 10s
});

The file writer performs an immediate first write, then writes periodically at the configured interval. The timer is unreferenced so it does not prevent Node.js process exit.
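
The write-then-repeat behavior described above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: `startFileWriter`, `renderText`, and `writeFn` are hypothetical names, and the write function is injected so the timer logic can be read on its own.

```javascript
// Sketch only: `startFileWriter`, `renderText`, and `writeFn` are hypothetical
// names used for illustration; they are not part of the SDK.
function startFileWriter(renderText, writeFn, intervalMs = 5000) {
  // Immediate first write, so the file exists before the first interval elapses.
  writeFn(renderText());

  // Periodic writes at the configured interval.
  const timer = setInterval(() => writeFn(renderText()), intervalMs);

  // unref() lets the Node.js process exit even while this timer is pending.
  timer.unref();

  // Return a stop function so callers can cancel the timer explicitly.
  return () => clearInterval(timer);
}
```

With Node's `fs` module, `writeFn` could be something like `(text) => fs.writeFileSync(path, text)`. The returned function plays the role that `collector.stop()` plays for the real file writer.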

prom-client Integration

const metrics = new MetricsCollector({ usePromClient: true });
// Metrics are registered in prom-client's default registry.
// Use prom-client's register.metrics() for native scraping.

Requires prom-client to be installed (npm install prom-client). If it is not installed, the collector falls back to the built-in text format.

All-in-One

const metrics = new MetricsCollector({
  prefix: "myapp_worker",
  httpPort: 9090,
  filePath: "/tmp/metrics.prom",
  fileWriteIntervalMs: 10000,
  slidingWindowSize: 500,
  usePromClient: true,
});

Configuration

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| prefix | string | "conductor_worker" | Prometheus metric name prefix |
| httpPort | number | - | Start built-in HTTP server on this port |
| filePath | string | - | Periodically write metrics to this file path |
| fileWriteIntervalMs | number | 5000 | File write interval in milliseconds |
| slidingWindowSize | number | 1000 | Max observations kept for quantile calculation |
| usePromClient | boolean | false | Use prom-client for native Prometheus integration |
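
To make the defaults concrete, here is a small sketch of how a partial options object could be merged with the documented defaults. `resolveConfig` and `DEFAULTS` are hypothetical helpers written for this example; the SDK's internal merging logic may differ.

```javascript
// Defaults copied from the configuration table; `resolveConfig` and
// `DEFAULTS` are hypothetical names, not SDK exports.
const DEFAULTS = {
  prefix: "conductor_worker",
  fileWriteIntervalMs: 5000,
  slidingWindowSize: 1000,
  usePromClient: false,
};

function resolveConfig(options = {}) {
  // Later spreads win, so any option the caller sets overrides its default.
  // httpPort and filePath have no default and stay undefined unless provided.
  return { ...DEFAULTS, ...options };
}

// Only the overridden fields change; everything else keeps its default.
const resolved = resolveConfig({ httpPort: 9090, slidingWindowSize: 500 });
```

Under this sketch, leaving out `httpPort` or `filePath` leaves the corresponding feature unconfigured, which matches the table's lack of a default for both.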

Counter Metrics

Labeled by task_type

| Prometheus Name | Internal Key | Description |
| --- | --- | --- |
| {prefix}_task_poll_total | pollTotal | Total number of task polls initiated |
| {prefix}_task_poll_error_total | pollErrorTotal | Total number of failed task polls |
| {prefix}_task_execute_total | taskExecutionTotal | Total number of task executions completed |
| {prefix}_task_execute_error_total | taskExecutionErrorTotal | Total task execution errors. Label format: taskType:ExceptionName |
| {prefix}_task_update_error_total | taskUpdateFailureTotal | Total task result update failures (result lost from Conductor) |
| {prefix}_task_ack_error_total | taskAckErrorTotal | Total task acknowledgement errors |
| {prefix}_task_execution_queue_full_total | taskExecutionQueueFullTotal | Times the execution queue was full (concurrency limit reached) |
| {prefix}_task_paused_total | taskPausedTotal | Total task paused events |

Labeled by payload_type

| Prometheus Name | Internal Key | Description |
| --- | --- | --- |
| {prefix}_external_payload_used_total | externalPayloadUsedTotal | External payload storage usage (e.g., "workflow_input", "task_output") |
Global (no labels)

| Prometheus Name | Internal Key | Description |
| --- | --- | --- |
| {prefix}_thread_uncaught_exceptions_total | uncaughtExceptionTotal | Total uncaught exceptions in worker processes |
| {prefix}_worker_restart_total | workerRestartTotal | Total worker restart events |
| {prefix}_workflow_start_error_total | workflowStartErrorTotal | Total workflow start errors |
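
For orientation, the sketch below shows how a counter like those listed above could be rendered in Prometheus text format. `renderCounter` and `escapeLabelValue` are hypothetical helpers. The `# HELP` and `# TYPE` lines follow the Prometheus text format, but whether the SDK emits them in exactly this shape is an assumption.

```javascript
// Sketch only: `renderCounter` and `escapeLabelValue` are hypothetical helpers.
function escapeLabelValue(value) {
  // The text format requires escaping backslash, double quote, and newline.
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

function renderCounter(name, help, labelName, values) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const [labelValue, count] of values) {
    // Global counters pass a null labelName and are rendered without labels.
    const labels = labelName
      ? `{${labelName}="${escapeLabelValue(labelValue)}"}`
      : "";
    lines.push(`${name}${labels} ${count}`);
  }
  return lines.join("\n") + "\n";
}

const text = renderCounter(
  "conductor_worker_task_poll_total",
  "Total number of task polls initiated",
  "task_type",
  new Map([["my_task", 42]])
);
// Last line of `text`: conductor_worker_task_poll_total{task_type="my_task"} 42
```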

Summary Metrics

Each summary emits quantile values, a count, and a sum:

{name}{task_type="myTask",quantile="0.5"} 12.3
{name}{task_type="myTask",quantile="0.75"} 15.1
{name}{task_type="myTask",quantile="0.9"} 18.7
{name}{task_type="myTask",quantile="0.95"} 22.0
{name}{task_type="myTask",quantile="0.99"} 45.2
{name}_count{task_type="myTask"} 1000
{name}_sum{task_type="myTask"} 14523.7
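
The layout above can be reproduced with a few lines of string formatting. The following is a sketch under the assumption that quantile values, count, and sum have already been computed; `renderSummary` is a hypothetical helper, not an SDK method.

```javascript
// Sketch only: `renderSummary` is a hypothetical helper.
// `quantiles` is an array of [quantile, value] pairs, e.g. [[0.5, 12.3]].
function renderSummary(name, labelName, labelValue, quantiles, count, sum) {
  const lines = [];
  for (const [q, value] of quantiles) {
    // One sample line per quantile, carrying the extra `quantile` label.
    lines.push(`${name}{${labelName}="${labelValue}",quantile="${q}"} ${value}`);
  }
  // The _count and _sum series carry only the base label.
  lines.push(`${name}_count{${labelName}="${labelValue}"} ${count}`);
  lines.push(`${name}_sum{${labelName}="${labelValue}"} ${sum}`);
  return lines.join("\n");
}

const summaryText = renderSummary(
  "conductor_worker_task_poll_time",
  "task_type",
  "myTask",
  [[0.5, 12.3], [0.99, 45.2]],
  1000,
  14523.7
);
```

Dividing `_sum` by `_count` gives the mean observation (14523.7 / 1000 ≈ 14.5 in the example values above), which is a useful sanity check next to the quantiles.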

Labeled by task_type

| Prometheus Name | Internal Key | Unit | Description |
| --- | --- | --- | --- |
| {prefix}_task_poll_time | pollDurationMs | ms | Task poll round-trip duration |
| {prefix}_task_execute_time | executionDurationMs | ms | Worker function execution duration |
| {prefix}_task_update_time | updateDurationMs | ms | Task result update (SDK to server) duration |
| {prefix}_task_result_size_bytes | outputSizeBytes | bytes | Task result output payload size |

Labeled by workflow_type

| Prometheus Name | Internal Key | Unit | Description |
| --- | --- | --- | --- |
| {prefix}_workflow_input_size_bytes | workflowInputSizeBytes | bytes | Workflow input payload size |

Labeled by endpoint

| Prometheus Name | Internal Key | Unit | Description |
| --- | --- | --- | --- |
| {prefix}_http_api_client_request | apiRequestDurationMs | ms | API request duration. Label format: METHOD:/api/path:STATUS |
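
Because the endpoint label packs three fields into one string, dashboards or scripts that consume it may need to split it again. The helpers below are a hypothetical sketch of building and parsing the METHOD:/api/path:STATUS format; they are not SDK functions.

```javascript
// Sketch only: both helpers are hypothetical and not exported by the SDK.
function formatEndpointLabel(method, path, status) {
  return `${method}:${path}:${status}`;
}

function parseEndpointLabel(label) {
  // The path may itself contain ":", so split on the first and last colon only.
  const first = label.indexOf(":");
  const last = label.lastIndexOf(":");
  if (first === -1 || first === last) {
    throw new Error(`Unexpected endpoint label: ${label}`);
  }
  return {
    method: label.slice(0, first),
    path: label.slice(first + 1, last),
    status: Number(label.slice(last + 1)),
  };
}

const label = formatEndpointLabel("POST", "/api/tasks", 200);
const parsed = parseEndpointLabel(label);
```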

Event Listener Methods

These methods are called automatically by the TaskRunner when MetricsCollector is registered as an event listener:

| Method | Metrics Updated |
| --- | --- |
| onPollStarted(event) | Increments pollTotal |
| onPollCompleted(event) | Records pollDurationMs |
| onPollFailure(event) | Increments pollErrorTotal, records pollDurationMs |
| onTaskExecutionStarted(event) | (no-op, counted on completion) |
| onTaskExecutionCompleted(event) | Increments taskExecutionTotal, records executionDurationMs and outputSizeBytes |
| onTaskExecutionFailure(event) | Increments taskExecutionErrorTotal, records executionDurationMs |
| onTaskUpdateCompleted(event) | Records updateDurationMs |
| onTaskUpdateFailure(event) | Increments taskUpdateFailureTotal |
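
To illustrate the mapping for the three poll events, here is a stripped-down listener that mirrors those rows. It is a sketch, not the MetricsCollector source: the class name `PollStats` is made up, and the event fields `taskType` and `durationMs` are assumptions about the event payload that this page does not confirm.

```javascript
// Sketch only. Assumption: each event carries `taskType`, and completed or
// failed events also carry `durationMs`. These field names are illustrative.
class PollStats {
  constructor() {
    this.pollTotal = new Map();      // task type -> count
    this.pollErrorTotal = new Map(); // task type -> count
    this.pollDurationMs = new Map(); // task type -> observations
  }

  increment(map, key) {
    map.set(key, (map.get(key) ?? 0) + 1);
  }

  observe(key, value) {
    const list = this.pollDurationMs.get(key) ?? [];
    list.push(value);
    this.pollDurationMs.set(key, list);
  }

  // Increments pollTotal.
  onPollStarted(event) {
    this.increment(this.pollTotal, event.taskType);
  }

  // Records pollDurationMs.
  onPollCompleted(event) {
    this.observe(event.taskType, event.durationMs);
  }

  // Increments pollErrorTotal and records pollDurationMs.
  onPollFailure(event) {
    this.increment(this.pollErrorTotal, event.taskType);
    this.observe(event.taskType, event.durationMs);
  }
}
```

Note that a failed poll still contributes a duration observation, so poll latency quantiles include both successful and failed polls.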

Direct Recording Methods

For metrics outside the event listener system, call these methods directly:

const collector = new MetricsCollector();

collector.recordTaskExecutionQueueFull("my_task");
collector.recordUncaughtException();
collector.recordWorkerRestart();
collector.recordTaskPaused("my_task");
collector.recordTaskAckError("my_task");
collector.recordWorkflowStartError();
collector.recordExternalPayloadUsed("task_output");
collector.recordWorkflowInputSize("my_workflow", 2048);
collector.recordApiRequestTime("POST", "/api/tasks", 200, 35);

Exposition Formats

Built-in Prometheus Text

const text = collector.toPrometheusText();
// Returns Prometheus text format (text/plain; version=0.0.4)

Async (with prom-client support)

const text = await collector.toPrometheusTextAsync();
// Uses prom-client registry when available, falls back to built-in

HTTP Server (MetricsServer)

import { MetricsServer } from "@io-orkes/conductor-javascript";

const server = new MetricsServer(collector, 9090);
await server.start();
// GET /metrics — Content-Type from collector.getContentType()
// GET /health  — { "status": "UP" }
await server.stop();
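
The routing behind the two endpoints can be pictured with a small pure function. This is a sketch, not the MetricsServer source: `handleRequest` is a hypothetical name, and the 404 response and the `application/json` content type for /health are assumptions. The metrics content type shown is the built-in text format's. With prom-client enabled, it would come from collector.getContentType() instead.

```javascript
// Sketch only: `handleRequest` is hypothetical. `renderText` stands in for a
// function such as () => collector.toPrometheusText().
function handleRequest(url, renderText) {
  if (url === "/metrics") {
    return {
      status: 200,
      contentType: "text/plain; version=0.0.4",
      body: renderText(),
    };
  }
  if (url === "/health") {
    return {
      status: 200,
      contentType: "application/json", // assumed; not stated on this page
      body: JSON.stringify({ status: "UP" }),
    };
  }
  // Assumed behavior for any other path.
  return { status: 404, contentType: "text/plain", body: "Not Found" };
}
```

A function like this could be wired into Node's `http.createServer` by writing the returned status, content type, and body to the response object.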

File Output

Configured via filePath in MetricsCollectorConfig. Writes toPrometheusText() output to disk. The file writer performs an immediate first write on construction, then writes periodically at the configured interval.


Sliding Window and Quantile Calculation

Summary metrics use a sliding window (default: 1,000 observations) to calculate percentiles. This provides:

  • Accurate recent percentiles without unbounded memory growth
  • No need to pre-configure histogram bucket boundaries
  • Direct percentile values without the bucket-interpolation artifacts of histograms

Quantiles are computed on demand, when toPrometheusText() is called, by sorting the observations in the window and linearly interpolating between neighboring values.
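
A minimal sketch of this approach follows. It assumes the common interpolation rule in which quantile q maps to the fractional rank q × (n − 1) in the sorted window. The SDK's exact rule is not stated here, and `SlidingWindow` is a hypothetical class, not an SDK export.

```javascript
// Sketch only: `SlidingWindow` is hypothetical. The rank formula q * (n - 1)
// is one common linear-interpolation convention and is an assumption here.
class SlidingWindow {
  constructor(maxSize = 1000) {
    this.maxSize = maxSize;
    this.values = [];
  }

  observe(value) {
    this.values.push(value);
    // Drop the oldest observation once the window is full.
    if (this.values.length > this.maxSize) this.values.shift();
  }

  quantile(q) {
    if (this.values.length === 0) return NaN;
    // Sort a copy so insertion order (needed for eviction) is preserved.
    const sorted = [...this.values].sort((a, b) => a - b);
    // Fractional rank in [0, n - 1]; interpolate between its two neighbors.
    const rank = q * (sorted.length - 1);
    const lower = Math.floor(rank);
    const upper = Math.ceil(rank);
    const weight = rank - lower;
    return sorted[lower] + (sorted[upper] - sorted[lower]) * weight;
  }
}

const windowed = new SlidingWindow(3);
[1, 2, 3, 4].forEach((v) => windowed.observe(v)); // 1 is evicted; window is [2, 3, 4]
const median = windowed.quantile(0.5); // 3
```

Sorting on every read keeps writes cheap at the cost of O(n log n) work per scrape, which is a reasonable trade-off for a window of about a thousand values.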

When using prom-client (usePromClient: true), summaries use prom-client's native implementation with maxAgeSeconds: 600 and ageBuckets: 5.


Monitoring Best Practices

  • Use p95/p99 for SLO monitoring rather than averages. Percentile-based thresholds better capture user-impacting performance variations.
  • Alert on {prefix}_task_update_error_total: a rising count means workers are failing to report results back to the Conductor server, so those task results are being lost.
  • Alert on {prefix}_task_execution_queue_full_total: sustained growth means the concurrency limit is being reached repeatedly. Consider increasing worker concurrency.
  • Monitor {prefix}_task_poll_time p99: high poll latency suggests network issues or server overload.
  • Monitor {prefix}_task_execute_time p95: a rising value signals an execution-time regression in worker functions.
  • File output interval: 10-60 seconds recommended for production. Lower intervals increase disk I/O.
  • Clean metrics directory on startup when using file output with multiprocess workers to avoid stale data.

Programmatic Access

const metrics = collector.getMetrics();

// Counter values
metrics.pollTotal.get("my_task");           // number
metrics.taskExecutionTotal.get("my_task");  // number

// Summary observations (raw array)
metrics.pollDurationMs.get("my_task");      // number[]
metrics.executionDurationMs.get("my_task"); // number[]

// Reset all metrics
collector.reset();

// Stop file writer and HTTP server
await collector.stop();
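
As one example of what programmatic access enables, the sketch below derives a poll error rate from two counters. `pollErrorRate` is a hypothetical helper. It assumes that pollErrorTotal is exposed on the getMetrics() result as a Map keyed by task type, in the same way as pollTotal above.

```javascript
// Sketch only: `pollErrorRate` is hypothetical. `metrics` is expected to look
// like the object returned by collector.getMetrics(), with Map-valued counters.
function pollErrorRate(metrics, taskType) {
  // Map.get returns undefined for a task type that has not been seen yet.
  const total = metrics.pollTotal.get(taskType) ?? 0;
  const errors = metrics.pollErrorTotal.get(taskType) ?? 0;
  // Avoid dividing by zero before the first poll.
  return total === 0 ? 0 : errors / total;
}

// Stand-in data with the same shape, used here instead of a live collector.
const sample = {
  pollTotal: new Map([["my_task", 200]]),
  pollErrorTotal: new Map([["my_task", 5]]),
};
const rate = pollErrorRate(sample, "my_task"); // 0.025
```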