Prometheus Metrics

March 5, 2026 · View on GitHub

The TNS CSI Driver exposes Prometheus metrics on the controller pod to provide observability into volume operations, WebSocket connection health, and CSI operations.

Metrics Endpoint

By default, metrics are exposed on port 8080 at the /metrics endpoint. The metrics endpoint is only available on the controller pod.

Available Metrics

CSI Operation Metrics

These metrics track all CSI RPC operations:

tns_csi_operations_total (counter)
- Total number of CSI operations
- Labels: method (CSI method name, e.g., CreateVolume, DeleteVolume), grpc_status_code
tns_csi_operations_duration_seconds (histogram)
- Duration of CSI operations in seconds
- Labels: method, grpc_status_code
- Buckets: 0.1s, 0.5s, 1s, 2.5s, 5s, 10s, 30s, 60s

Volume Operation Metrics

Protocol-specific volume operations (NFS, NVMe-oF, iSCSI, and SMB):

tns_volume_operations_total (counter)
- Total number of volume operations
- Labels: protocol (nfs, nvmeof, iscsi, or smb), operation (create, delete, expand), status (success or error)
tns_volume_operations_duration_seconds (histogram)
- Duration of volume operations in seconds
- Labels: protocol, operation, status
- Buckets: 0.5s, 1s, 2s, 5s, 10s, 30s, 60s, 120s
tns_volume_capacity_bytes (gauge)
- Capacity of provisioned volumes in bytes
- Labels: volume_id, protocol

NVMe-oF Connect Concurrency Metrics

tns_csi_nvme_connect_concurrent (gauge)
- Number of NVMe-oF connect operations currently in progress
tns_csi_nvme_connect_waiting (gauge)
- Number of NVMe-oF connect operations waiting for the semaphore
- Non-zero values indicate the concurrency limit is actively throttling connections

WebSocket Connection Metrics

Metrics for the TrueNAS API WebSocket connection:

tns_websocket_connected (gauge)
- WebSocket connection status (1 = connected, 0 = disconnected)
tns_websocket_reconnects_total (counter)
- Total number of WebSocket reconnection attempts
tns_websocket_messages_total (counter)
- Total number of WebSocket messages
- Labels: direction (sent or received)
tns_websocket_message_duration_seconds (histogram)
- Duration of WebSocket RPC calls in seconds
- Labels: method (TrueNAS API method name)
- Buckets: 0.1s, 0.25s, 0.5s, 1s, 2s, 5s, 10s, 30s
tns_websocket_connection_duration_seconds (gauge)
- Current WebSocket connection duration in seconds (updated every 20s)

Configuration

Enabling Metrics

Metrics are enabled by default. To disable them:

controller:
  metrics:
    enabled: false

Changing Metrics Port

To use a different port:

controller:
  metrics:
    enabled: true
    port: 9090

Creating Metrics Service

A Kubernetes Service is created by default to expose the metrics endpoint:

controller:
  metrics:
    enabled: true
    service:
      enabled: true
      type: ClusterIP
      port: 8080

Prometheus Operator Integration

To enable automatic scraping with Prometheus Operator, enable the ServiceMonitor:

controller:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
      # Add labels that match your Prometheus serviceMonitorSelector
      labels:
        release: prometheus
      interval: 30s
      scrapeTimeout: 10s

Prometheus Configuration

If you're using Prometheus without the Operator, add a scrape config:

scrape_configs:
  - job_name: 'tns-csi-driver'
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - kube-system  # or your CSI driver namespace
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        action: keep
        regex: tns-csi-driver
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
        action: keep
        regex: controller

Example Queries

Volume Operations

Total volume operations by protocol:

sum by (protocol, operation) (rate(tns_volume_operations_total[5m]))

Volume operation error rate:

sum by (protocol, operation) (rate(tns_volume_operations_total{status="error"}[5m])) 
/ 
sum by (protocol, operation) (rate(tns_volume_operations_total[5m]))

95th percentile volume operation latency:

histogram_quantile(0.95, rate(tns_volume_operations_duration_seconds_bucket[5m]))

WebSocket Health

WebSocket connection status:

tns_websocket_connected

WebSocket reconnection rate:

rate(tns_websocket_reconnects_total[5m])

Average WebSocket message duration by method:

rate(tns_websocket_message_duration_seconds_sum[5m]) 
/ 
rate(tns_websocket_message_duration_seconds_count[5m])

CSI Operations

CSI operation rate by method:

sum by (method) (rate(tns_csi_operations_total[5m]))

CSI operation error rate:

sum by (method) (rate(tns_csi_operations_total{grpc_status_code!="OK"}[5m])) 
/ 
sum by (method) (rate(tns_csi_operations_total[5m]))

95th percentile CSI operation latency:

histogram_quantile(0.95, 
  sum by (method, le) (rate(tns_csi_operations_duration_seconds_bucket[5m]))
)

Grafana Dashboard

The Helm chart includes a pre-built Grafana dashboard (tns-csi-overview.json) that provides a comprehensive view of driver operations.

Enabling the Grafana Dashboard

Enable automatic provisioning via Helm values:

grafana:
  dashboards:
    enabled: true
    labels:
      grafana_dashboard: "1"    # Must match your Grafana sidecar label selector
    annotations: {}

This creates a ConfigMap (tns-csi-driver-grafana-dashboard) with the grafana_dashboard: "1" label. If your Grafana deployment uses a sidecar (standard with kube-prometheus-stack), the dashboard is auto-discovered and loaded.

Dashboard Panels

The dashboard includes:

WebSocket Connection — connection status, duration, and reconnect count
Operations Overview — total operations by protocol (NFS, NVMe-oF, iSCSI, SMB) with success/error breakdown
Operations by Type — create, delete, expand counts per protocol
Message Throughput — WebSocket messages sent/received over time
Per-Protocol Breakdown — dedicated panels for NFS, NVMe-oF, iSCSI, and SMB operations

Manual Import

If you don't use Grafana sidecar discovery, import the dashboard JSON manually:

Copy charts/tns-csi-driver/dashboards/tns-csi-overview.json
In Grafana: Dashboards > Import > paste the JSON
Select your Prometheus data source

In-Cluster Web Dashboard

The controller pod can serve a live web dashboard showing volume health, Kubernetes binding, and protocol-specific details.

Enabling the Dashboard

controller:
  dashboard:
    enabled: true
    port: 9090
    service:
      enabled: true
      type: ClusterIP
      port: 9090
    ingress:
      enabled: false    # Optional: expose via Ingress

Accessing the Dashboard

# Port-forward to the dashboard service
kubectl port-forward -n kube-system svc/tns-csi-driver-dashboard 9090:9090

# Open http://localhost:9090/dashboard/

Dashboard Features

The in-cluster dashboard provides:

Volume inventory — all managed volumes with protocol, capacity, and health status
Volume health checks — verifies dataset exists, NFS shares/SMB shares/NVMe-oF subsystems/iSCSI targets are valid
Kubernetes binding — shows PV/PVC names, namespaces, and attached pods
Snapshot and clone tracking — lists all snapshots and clones with source volumes
Unmanaged volume discovery — finds non-CSI volumes on the same pool (requires --dashboard-pool)
Metrics summary — parsed Prometheus metrics (operations, WebSocket health)

API Endpoints

The dashboard exposes JSON API endpoints at /dashboard/api/:

Endpoint	Description
`GET /dashboard/api/volumes`	List all managed volumes
`GET /dashboard/api/volumes/{id}`	Volume details with health check
`GET /dashboard/api/snapshots`	List all snapshots
`GET /dashboard/api/clones`	List all clones
`GET /dashboard/api/summary`	Summary statistics
`GET /dashboard/api/unmanaged`	Unmanaged volumes (needs `--dashboard-pool`)
`GET /dashboard/api/metrics`	Parsed Prometheus metrics
`GET /dashboard/api/metrics/raw`	Raw Prometheus text format

kubectl Plugin Dashboard

The kubectl plugin includes a local dashboard that connects directly to TrueNAS:

# Start dashboard (auto-opens browser at http://localhost:2137)
kubectl tns-csi dashboard

# Custom port, without auto-open
kubectl tns-csi dashboard --port 9090 --open=false

# With pool for unmanaged volume discovery
kubectl tns-csi dashboard --pool storage

The plugin auto-discovers TrueNAS credentials from the installed driver's Secret. Both dashboards (in-cluster and kubectl plugin) share the same UI — the difference is where they run: in-cluster runs inside the controller pod, while the plugin runs locally on your machine.

Troubleshooting

Metrics endpoint not accessible

Check if metrics are enabled:

kubectl get svc -n kube-system | grep tns-csi-driver-metrics

Check controller pod logs:

kubectl logs -n kube-system -l app.kubernetes.io/component=controller -c tns-csi-plugin

Port-forward to test locally:

kubectl port-forward -n kube-system svc/tns-csi-driver-metrics 8080:8080
curl http://localhost:8080/metrics

ServiceMonitor not being scraped

Verify ServiceMonitor labels match Prometheus selector:

kubectl get servicemonitor -n kube-system tns-csi-driver -o yaml

Check Prometheus serviceMonitorSelector:

kubectl get prometheus -A -o yaml | grep -A 5 serviceMonitorSelector

Check Prometheus logs for scrape errors:

kubectl logs -n monitoring prometheus-xxx

Development Notes

Metrics are collected in:

pkg/metrics/metrics.go - Metric definitions and registration
pkg/driver/driver.go - CSI operation metrics via gRPC interceptor
pkg/tnsapi/client.go - WebSocket connection metrics
pkg/driver/controller_nfs.go, controller_nvmeof.go, controller_iscsi.go, and controller_smb.go - Protocol-specific volume operation metrics