Monitoring and Debugging

May 17, 2026

GARM provides built-in tools for monitoring and debugging: Prometheus metrics, live log streaming, database event watching, and an interactive terminal dashboard.

Prometheus metrics

Enable metrics

In config.toml:

[metrics]
enable = true
disable_auth = false

Generate a metrics token

garm-cli metrics-token create

The token is valid for the duration set by time_to_live in [jwt_auth].

Prometheus configuration

scrape_configs:
  - job_name: "garm"
    scheme: https
    static_configs:
      - targets: ["garm.example.com"]
    authorization:
      credentials: "your-metrics-token"
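
To check a token before pointing Prometheus at GARM, the scrape can be reproduced by hand. The following is a minimal sketch, not part of GARM. It assumes that metrics are served at the Prometheus default path /metrics (the scrape config above does not override metrics_path) and that the token is sent as a Bearer credential, which is the default type for the Prometheus authorization block. The function names and the example host are illustrative.

```python
import urllib.request


def build_metrics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a request resembling the one Prometheus sends when scraping GARM."""
    # Assumption: metrics live at the Prometheus default path, /metrics.
    url = base_url.rstrip("/") + "/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(base_url: str, token: str, timeout: float = 10.0) -> str:
    """Return the raw exposition text; urlopen raises on HTTP errors such as 401."""
    request = build_metrics_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_metrics("https://garm.example.com", "your-metrics-token") should return plain-text metrics if the token is accepted.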

Metrics reference

All metrics use the garm_ namespace. Metrics fall into two groups:

  • Snapshot metrics are reset and recomputed on every tick (default every 60s, configured via period in [metrics]). These reflect the current state: pools, instances, entities, jobs.
  • Cumulative metrics are counters or gauges updated as GARM operates: webhooks received, provider operations, GitHub API calls, rate limits.

Health

| Metric | Type | Labels |
| --- | --- | --- |
| garm_health | Gauge | metadata_url, callback_url, webhook_url, controller_webhook_url, controller_id |

Set to 1 if GARM is healthy, 0 otherwise. Useful for alerting.
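
For a quick health probe outside Prometheus, the gauge can be read straight from the exposition text. This is a minimal sketch that handles only simple sample lines; the parser and function names are illustrative, and the sample input used to try it out would be hand-written rather than real GARM output.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for every sample line in exposition text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        yield name, labels, float(value)


def is_healthy(text):
    """True only if a garm_health sample exists and every one equals 1."""
    values = [v for name, _, v in parse_samples(text) if name == "garm_health"]
    return bool(values) and all(v == 1.0 for v in values)
```

Treating a missing garm_health series as unhealthy is a deliberate choice here: an absent metric usually means the scrape itself failed.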

Webhooks

| Metric | Type | Labels |
| --- | --- | --- |
| garm_webhook_received | Counter | valid, reason |

Increments on every webhook received from GitHub/Gitea. The valid label is true/false; reason explains why invalid webhooks were rejected.

Entities (repositories, organizations, enterprises)

| Metric | Type | Labels |
| --- | --- | --- |
| garm_repository_info | Gauge | name, id |
| garm_repository_pool_manager_status | Gauge | name, id, running |
| garm_organization_info | Gauge | name, id |
| garm_organization_pool_manager_status | Gauge | name, id, running |
| garm_enterprise_info | Gauge | name, id |
| garm_enterprise_pool_manager_status | Gauge | name, id, running |

The _info gauges are always set to 1; the labels are what carry the information. The pool_manager_status gauges are 1 when the pool manager for that entity is running.

Providers

| Metric | Type | Labels |
| --- | --- | --- |
| garm_provider_info | Gauge | name, type, description |

Pools

| Metric | Type | Labels |
| --- | --- | --- |
| garm_pool_info | Gauge | id, image, flavor, prefix, os_type, os_arch, tags, provider, pool_owner, pool_type |
| garm_pool_status | Gauge | id, enabled |
| garm_pool_max_runners | Gauge | id |
| garm_pool_min_idle_runners | Gauge | id |
| garm_pool_bootstrap_timeout | Gauge | id |

Scale sets

| Metric | Type | Labels |
| --- | --- | --- |
| garm_scaleset_info | Gauge | id, scaleset_id, name, image, flavor, prefix, os_type, os_arch, tags, provider, runner_group, scaleset_owner, scaleset_type |
| garm_scaleset_status | Gauge | id, enabled, state |
| garm_scaleset_max_runners | Gauge | id |
| garm_scaleset_min_idle_runners | Gauge | id |
| garm_scaleset_desired_runner_count | Gauge | id |
| garm_scaleset_bootstrap_timeout | Gauge | id |

The id label is GARM's internal scale set ID; scaleset_id is the numeric ID assigned by GitHub. garm_scaleset_desired_runner_count reflects the runner count GitHub has requested for the scale set (unique to scale sets, since GitHub drives scheduling).

Runner instances

| Metric | Type | Labels |
| --- | --- | --- |
| garm_runner_status | Gauge | name, status, runner_status, pool_owner, pool_type, pool_id, scaleset_id, provider |
| garm_runner_operations_total | Counter | operation, provider |
| garm_runner_errors_total | Counter | operation, provider |

garm_runner_status covers both pool-owned and scale-set-owned runners. For any given series, exactly one of pool_id / scaleset_id is populated. pool_owner and pool_type describe the owning entity (repo/org/enterprise) and apply to both.
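
When post-processing garm_runner_status series, the "exactly one of pool_id / scaleset_id" rule can be used to group runners by owner. The sketch below is illustrative only. It assumes the unused label is exported as an empty string; if GARM exports some other placeholder, the check needs adjusting. The function names are hypothetical.

```python
def runner_owner(labels):
    """Return ("pool", id) or ("scaleset", id) for one garm_runner_status series."""
    pool_id = labels.get("pool_id", "")
    scaleset_id = labels.get("scaleset_id", "")
    # Assumption: the label that does not apply is an empty string.
    if bool(pool_id) == bool(scaleset_id):
        raise ValueError(f"expected exactly one of pool_id/scaleset_id: {labels}")
    return ("pool", pool_id) if pool_id else ("scaleset", scaleset_id)


def count_by_owner(series):
    """Count series (one per runner and state) per owning pool or scale set."""
    counts = {}
    for labels in series:
        key = runner_owner(labels)
        counts[key] = counts.get(key, 0) + 1
    return counts
```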

The operation label on garm_runner_operations_total / garm_runner_errors_total takes one of these values:

| Operation | Description |
| --- | --- |
| CreateInstance | Create a new compute instance |
| DeleteInstance | Delete a compute instance |
| GetInstance | Get details about a compute instance |
| ListInstances | List all instances for a pool |
| RemoveAllInstances | Remove all instances created by a provider |
| Start | Boot up an instance |
| Stop | Shut down an instance |

Jobs

| Metric | Type | Labels |
| --- | --- | --- |
| garm_job_status | Gauge | job_id, workflow_job_id, scaleset_job_id, workflow_run_id, name, status, conclusion, runner_name, owner, repository, requested_labels |

GitHub/Gitea API

| Metric | Type | Labels |
| --- | --- | --- |
| garm_github_operations_total | Counter | operation, scope |
| garm_github_errors_total | Counter | operation, scope |
| garm_github_rate_limit_limit | Gauge | credential_name, credential_id, endpoint |
| garm_github_rate_limit_remaining | Gauge | credential_name, credential_id, endpoint |
| garm_github_rate_limit_used | Gauge | credential_name, credential_id, endpoint |
| garm_github_rate_limit_reset_timestamp | Gauge | credential_name, credential_id, endpoint |

The scope label is Repository, Organization, or Enterprise. The operation label takes one of the values listed below.

GitHub client operations (hooks, runners, registration tokens):

| Operation | Description |
| --- | --- |
| ListHooks | List webhooks on an entity |
| GetHook | Get a single webhook |
| CreateHook | Create a webhook |
| DeleteHook | Delete a webhook |
| PingHook | Ping a webhook |
| ListEntityRunners | List runners for an entity |
| ListEntityRunnerApplicationDownloads | List runner application downloads |
| RemoveEntityRunner | Remove a runner from an entity |
| CreateEntityRegistrationToken | Create a runner registration token |
| ListOrganizationRunnerGroups | List organization runner groups |
| ListRunnerGroups | List enterprise runner groups |
| GetEntityJITConfig | Generate a JIT runner configuration |
| GetRateLimit | Fetch API rate limit information |

Scale set operations (scale set management and message queue):

| Operation | Description |
| --- | --- |
| GetRunnerScaleSetByNameAndRunnerGroup | Look up a scale set by name and runner group |
| GetRunnerScaleSetByID | Look up a scale set by ID |
| ListRunnerScaleSets | List all scale sets |
| CreateRunnerScaleSet | Create a scale set |
| UpdateRunnerScaleSet | Update a scale set |
| DeleteRunnerScaleSet | Delete a scale set |
| GetRunnerGroupByName | Look up a runner group by name |
| GenerateJitRunnerConfig | Generate a JIT runner config for a scale set |
| GetRunner | Get a runner by ID |
| ListAllRunners | List all runners |
| GetRunnerByName | Get a runner by name |
| RemoveRunner | Remove a scale set runner |
| AcquireJobs | Acquire jobs for a scale set |
| GetAcquirableJobs | Get acquirable jobs for a scale set |
| GetActionServiceInfo | Get actions service admin info |
| CreateMessageSession | Create a message queue session |
| DeleteMessageSession | Delete a message queue session |
| RefreshMessageSession | Refresh a message queue session token |
| GetMessage | Get a message from the message queue |
| DeleteMessage | Delete a message from the message queue |
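
The garm_github_rate_limit_* gauges in the GitHub/Gitea API table are most useful in combination. The sketch below shows one way to turn three of them into an alerting signal for a single credential. It assumes garm_github_rate_limit_reset_timestamp is a Unix timestamp in seconds, which matches how GitHub reports rate limit resets but is not stated in this document. The function name is illustrative.

```python
import time


def rate_limit_summary(limit, remaining, reset_timestamp, now=None):
    """Summarize rate limit headroom from the limit, remaining, and reset gauges."""
    now = time.time() if now is None else now
    # Guard against a zero limit so the division cannot fail.
    fraction = remaining / limit if limit > 0 else 0.0
    return {
        "fraction_remaining": fraction,
        # Assumption: reset_timestamp is Unix epoch seconds.
        "seconds_until_reset": max(0.0, reset_timestamp - now),
    }
```

A low fraction_remaining combined with a long seconds_until_reset is the case worth alerting on, since GARM may be throttled until the window resets.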

Live log streaming

Stream GARM logs to your terminal in real time:

garm-cli debug-log

This requires enable_log_streamer = true in [logging].

Filtering logs

# Only ERROR level and above
garm-cli debug-log --log-level ERROR

# Filter by attribute
garm-cli debug-log --filter "pool_id=9daa34aa-..."

# Filter by message content
garm-cli debug-log --filter "msg=creating instance"

# Multiple filters (OR by default)
garm-cli debug-log --filter "pool_id=abc" --filter "pool_id=def"

# Multiple filters with AND
garm-cli debug-log --filter "pool_id=abc" --filter "msg=error" --filter-mode all
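
The difference between the default OR behaviour and --filter-mode all can be illustrated with a small model. This is not GARM's implementation: the exact-equality matching rule and the mode name "any" for the default are assumptions made for the sketch, and the real matcher may behave differently (for example, by matching substrings).

```python
def parse_filter(expr):
    """Split "key=value" into (key, value) on the first "=" only."""
    key, _, value = expr.partition("=")
    return key, value


def matches(record, filters, mode="any"):
    """Return True if a log record passes the given key=value filters.

    mode="any" models the default (OR); mode="all" models --filter-mode all (AND).
    """
    if not filters:
        return True  # no filters: everything passes
    # Assumption: a filter matches when the attribute equals the value exactly.
    results = [str(record.get(key, "")) == value
               for key, value in map(parse_filter, filters)]
    return all(results) if mode == "all" else any(results)
```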

Important

The log streaming and events WebSocket endpoints are authenticated, but you should still only expose them within trusted networks. If GARM is behind a reverse proxy, restrict access to the /api/v1/ws path from untrusted sources.

Database events

The debug-events command consumes database change events. Whenever an entity is created, updated, or deleted in the database, an event is generated and exported via WebSocket. This endpoint is designed for integration -- external tools can subscribe without polling the API.

Watch real-time entity changes:

# All events
garm-cli debug-events --filters='{"send-everything": true}'

# Only instance create/delete events
garm-cli debug-events --filters='{"filters": [{"entity-type": "instance", "operations": ["create", "delete"]}]}'

Available entity types: repository, organization, enterprise, pool, user, instance, job, controller, github_credentials, gitea_credentials, github_endpoint, scaleset

Operations: create, update, delete

Event structure

Each event is a JSON object:

{
    "entity-type": "instance",
    "operation": "create",
    "payload": { ... }
}

The payload contains the same JSON you would get from the corresponding REST API endpoint. Sensitive data (tokens, keys) is stripped. For delete operations, some entities return the full object prior to deletion while others return only the ID. Assume that future versions will return only the ID for all delete operations.
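
A consumer that mirrors entity state locally has to cope with both delete payload shapes. The following is a minimal sketch of that idea. It assumes full payload objects carry an "id" field and that an ID-only payload is either an object of the form {"id": ...} or the bare ID value; neither shape is specified here, so verify against real events. The function names and the cache layout are illustrative.

```python
import json


def entity_id(payload):
    """Extract the entity ID from a full object or an ID-only payload."""
    # Assumption: objects expose "id"; anything else is treated as the bare ID.
    if isinstance(payload, dict):
        return payload.get("id")
    return payload


def apply_event(raw, cache):
    """Apply one event to a local {(entity_type, id): payload} cache."""
    event = json.loads(raw)
    key = (event["entity-type"], entity_id(event.get("payload")))
    if event["operation"] == "delete":
        # Rely only on the ID, so ID-only delete payloads keep working.
        cache.pop(key, None)
    else:
        cache[key] = event["payload"]
    return key
```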

Programmatic access

The events endpoint is a WebSocket at /api/v1/ws/events. Connect with a JWT token and send a filter message to start receiving events. By default, the endpoint returns no events -- all events are filtered until you send a filter message:

// Receive all events
{"send-everything": true}

// Receive only specific entity/operation combinations
{
  "filters": [
    {"entity-type": "instance", "operations": ["create", "delete"]},
    {"entity-type": "pool", "operations": ["update"]}
  ]
}
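
When building filter messages in code, it helps to validate them against the entity types and operations listed earlier before sending. The helper below is a sketch; its name and its input shape (a mapping of entity type to operations) are illustrative choices, not part of any GARM client library.

```python
import json

ENTITY_TYPES = {
    "repository", "organization", "enterprise", "pool", "user", "instance",
    "job", "controller", "github_credentials", "gitea_credentials",
    "github_endpoint", "scaleset",
}
OPERATIONS = {"create", "update", "delete"}


def build_filter_message(filters=None):
    """Serialize a filter message; with no filters, request every event."""
    if not filters:
        return json.dumps({"send-everything": True})
    entries = []
    for entity_type, operations in filters.items():
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
        unknown = set(operations) - OPERATIONS
        if unknown:
            raise ValueError(f"unknown operations: {sorted(unknown)}")
        entries.append({"entity-type": entity_type, "operations": list(operations)})
    return json.dumps({"filters": entries})
```

For example, build_filter_message({"instance": ["create", "delete"], "pool": ["update"]}) produces the second message shown above.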

See the events documentation for the full filter schema and a Go code example using garm-provider-common.

Interactive dashboard

The top command shows a live terminal dashboard:

garm-cli top

This displays entities, pools, scale sets, runner instances, and jobs in an interactive view, refreshing every 5 seconds.

Job monitoring

View recorded workflow jobs:

garm-cli job list

GARM only records jobs for which it has a matching pool or scale set. Jobs whose labels don't match any configured pool are silently ignored -- there's no point in recording jobs GARM can't act on. If you've set everything up but garm-cli job list is empty, verify that your webhook URLs are correct and that GitHub can reach them (see Controller settings).

Reverse proxy considerations

If GARM is behind a reverse proxy, the WebSocket endpoints need special configuration. For nginx:

location /api/v1/ws {
    proxy_pass http://garm_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
    proxy_set_header Host $host;
}

This is required for debug-log, debug-events, top, and the Web UI. A full sample nginx config with TLS termination is available in the testdata folder.