AIStore Observability: Metrics Reference
March 25, 2026
AIStore (AIS) exposes a comprehensive set of metrics that provide insights into system performance, resource utilization, and operational status. This reference catalogs available metrics with descriptions and usage guidance.
Table of Contents
- Prometheus: major changes in v3.26
- Variable labels
- Common metrics: AIS targets and gateways
- Target metrics
- Backend metrics
- Related Documentation
Prometheus: major changes in v3.26
- So-called default `go_*` counters and gauges (`go_gc*`, `go_memstats*`, etc.) are completely gone
- Metrics are now updated directly in real time
  - Previously: periodically via `prometheus.Collect` interface
  - See related note in `stats/prom.go`
- AIS no longer publishes internally computed latencies and throughputs
  - Use `*.ns.total` (nanoseconds) and `*.size` (bytes) metrics to compute latency and throughput, respectively
    - Based on user-controlled time intervals; for reference, see CLI `performance throughput` and `performance latency`
  - Note: for Prometheus client, internal `.ns.total` suffix becomes `_ns_total`, and `.size`, respectively, `_bytes`
- In addition to total aggregated numbers there are now separately computed per-backend latency and throughput numbers
  - Those with `aws.` prefix, for instance.
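
The latency and throughput computation described above can be sketched as follows: scrape the cumulative counters twice and divide the deltas. This is a minimal illustration rather than AIS code. The `Sample` container and all numbers are hypothetical; the field names mirror the public names `get_count`, `get_ns_total`, and `get_bytes` listed in this reference; and MB is taken to mean 10^6 bytes, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape of three cumulative GET counters (hypothetical container)."""
    ts: float          # scrape time, seconds
    get_count: int     # number of executed GET requests
    get_ns_total: int  # cumulative GET time, nanoseconds
    get_bytes: int     # cumulative GET size, bytes


def avg_latency_ms(prev: Sample, curr: Sample) -> float:
    """Average GET latency (ms) over the interval between two scrapes."""
    requests = curr.get_count - prev.get_count
    if requests <= 0:
        return 0.0  # no requests (or a counter restart) in this interval
    return (curr.get_ns_total - prev.get_ns_total) / requests / 1e6


def throughput_mbps(prev: Sample, curr: Sample) -> float:
    """Average GET throughput (MB/s, assuming MB = 10**6 bytes)."""
    elapsed = curr.ts - prev.ts
    if elapsed <= 0:
        return 0.0
    return (curr.get_bytes - prev.get_bytes) / elapsed / 1e6


# Made-up numbers: 100 GETs taking 2 s in total, 500 MB read within 10 s.
prev = Sample(ts=0.0, get_count=1_000, get_ns_total=5_000_000_000, get_bytes=0)
curr = Sample(ts=10.0, get_count=1_100, get_ns_total=7_000_000_000,
              get_bytes=500_000_000)
print(avg_latency_ms(prev, curr))   # 20.0
print(throughput_mbps(prev, curr))  # 50.0
```

The interval between the two scrapes is entirely up to the caller, which is the point of publishing cumulative totals instead of precomputed averages.
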
Variable labels
Each AIS metric carries `node_id`, a static label in Prometheus terminology.
Starting with v3.26, the majority of metrics also carry variable labels:
- Variable Labels:
  - `bucket`: Name of the associated bucket.
  - `xkind`: Job kind.
  - `mountpath`: Mountpath.
- All I/O metrics now carry the bucket name (or `Cname`, to be precise) as a Prometheus variable label
- All in-cluster writes generated by xactions (jobs) now also carry the xaction label: the respective kind (`xkind`)
  - One major side effect of the above is that we will now see more PUT metrics, and not only those that result from user PUT requests
- All GET, PUT, and DELETE errors also have the bucket label
- All FSHC-related errors (the so-called I/O errors) carry the mountpath (i.e., faulty disk) label.
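
To illustrate how variable labels can be used, the sketch below parses a few labeled sample lines in the Prometheus text format and sums a counter by `bucket` and by `xkind`. Everything in `SCRAPE` is hypothetical: the bare metric name, the bucket names, the job kind, and the values are made up, and the empty `xkind` for a user PUT is an assumption. The regex-based parser is a simplification that does not handle escaped quotes inside label values.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; only the label keys (bucket, node_id, xkind)
# come from this reference.
SCRAPE = '''\
put_count{bucket="ais://src",node_id="t1",xkind=""} 40
put_count{bucket="ais://dst",node_id="t1",xkind="copy-bucket"} 25
put_count{bucket="ais://dst",node_id="t2",xkind="copy-bucket"} 35
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Yield (metric name, labels dict, value) for each labeled sample line."""
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            name, labels, value = m.groups()
            yield name, dict(LABEL_RE.findall(labels)), float(value)


def sum_by(text, metric, label):
    """Sum one metric's samples grouped by the value of a single label."""
    totals = defaultdict(float)
    for name, labels, value in parse(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


print(sum_by(SCRAPE, "put_count", "bucket"))
# {'ais://src': 40.0, 'ais://dst': 60.0}
print(sum_by(SCRAPE, "put_count", "xkind"))
# {'': 40.0, 'copy-bucket': 60.0}
```

Grouping by `xkind` separates PUTs generated by jobs from the rest, which is the side effect noted in the list above.
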
Common metrics: AIS targets and gateways
- Request Metrics:
  - `GetCount`: Total number of executed GET(object) requests.
    - Variable Labels: `bucket`
  - `PutCount`: Total number of executed PUT(object) requests.
    - Variable Labels: `bucket`, `xkind`
  - `HeadCount`: Total number of executed HEAD(object) requests (currently only remote HEAD).
    - Variable Labels: `bucket`
  - `AppendCount`: Total number of executed APPEND(object) requests.
    - Variable Labels: `bucket`
  - `DeleteCount`: Total number of executed DELETE(object) requests.
    - Variable Labels: `bucket`
  - `RenameCount`: Total number of executed rename(object) requests.
    - Variable Labels: `bucket`
  - `ListCount`: Total number of executed list-objects requests.
    - Variable Labels: `bucket`
Common Error Counters
- Error Metrics:
  - `ErrGetCount`: Total number of GET(object) errors.
    - Variable Labels: `bucket`
  - `ErrPutCount`: Total number of PUT(object) errors.
    - Variable Labels: `bucket`, `xkind`
  - `ErrHeadCount`: Total number of HEAD(object) errors.
    - Variable Labels: `bucket`
  - `ErrAppendCount`: Total number of APPEND(object) errors.
    - Variable Labels: `bucket`
  - `ErrDeleteCount`: Total number of DELETE(object) errors.
    - Variable Labels: `bucket`
  - `ErrRenameCount`: Total number of rename(object) errors.
    - Variable Labels: `bucket`
  - `ErrListCount`: Total number of list-objects errors.
    - Variable Labels: `bucket`
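
Because the error counters above are cumulative and labeled by bucket, a per-bucket error rate can be derived from two scrapes. The sketch below is one way to do it and makes no claim about how AIS or any dashboard computes it. The snapshot dictionaries and bucket names are hypothetical, and the restart handling assumes that a counter restarts from zero when a node restarts, as in-process Prometheus counters generally do.

```python
def counter_increase(prev: float, curr: float) -> float:
    """Increase of a cumulative counter between two scrapes.

    If the value dropped, assume the counter restarted from zero and count
    the whole current value as the increase.
    """
    return curr - prev if curr >= prev else curr


def errors_per_second(prev: dict, curr: dict, elapsed_s: float) -> dict:
    """Per-bucket error rate from two {bucket: error count} snapshots."""
    return {
        bucket: counter_increase(prev.get(bucket, 0.0), value) / elapsed_s
        for bucket, value in curr.items()
    }


# Made-up snapshots of a GET error counter taken 30 seconds apart.
prev = {"ais://a": 10, "ais://b": 4}
curr = {"ais://a": 16, "ais://b": 1, "ais://c": 3}  # "b" restarted, "c" is new
rates = errors_per_second(prev, curr, elapsed_s=30.0)
for bucket, rate in sorted(rates.items()):
    print(f"{bucket}: {rate:.3f} errors/s")
```
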
Common Latencies
- Latency Metrics:
  - `GetLatency`: GET average time (milliseconds) over the last `periodic.stats_time` interval.
    - Variable Labels: `bucket`
  - `GetLatencyTotal`: GET total cumulative time (nanoseconds).
    - Variable Labels: `bucket`
  - `ListLatency`: List-objects average time (milliseconds) over the last `periodic.stats_time` interval.
    - Variable Labels: `bucket`
For convenience, we also include here a (somewhat redundant) table that summarizes common metrics.
| Internal name | Public name | Internal Type | Description (Prometheus help) | Prometheus labels |
|---|---|---|---|---|
get.n | get_count | counter | total number of executed GET(object) requests | default |
put.n | put_count | counter | total number of executed PUT(object) requests | default |
head.n | head_count | counter | total number of executed HEAD(object) requests | default |
append.n | append_count | counter | total number of executed APPEND(object) requests | default |
del.n | del_count | counter | total number of executed DELETE(object) requests | default |
ren.n | ren_count | counter | total number of executed rename(object) requests | default |
lst.n | lst_count | counter | total number of executed list-objects requests | default |
err.get.n | err_get_count | counter | total number of GET(object) errors | default |
err.put.n | err_put_count | counter | total number of PUT(object) errors | default |
err.head.n | err_head_count | counter | total number of HEAD(object) errors | default |
err.append.n | err_append_count | counter | total number of APPEND(object) errors | default |
err.del.n | err_del_count | counter | total number of DELETE(object) errors | default |
err.ren.n | err_ren_count | counter | total number of rename(object) errors | default |
err.lst.n | err_lst_count | counter | total number of list-objects errors | default |
err.http.write.n | err_http_write_count | counter | total number of HTTP write-response errors | default |
err.dl.n | err_dl_count | counter | downloader: number of download errors | default |
err.put.mirror.n | err_put_mirror_count | counter | number of n-way mirroring errors | default |
get.ns | get_ms | latency | GET: average time (milliseconds) over the last periodic.stats_time interval | default |
get.ns.total | get_ns_total | total | GET: total cumulative time (nanoseconds) | default |
lst.ns | lst_ms | latency | list-objects: average time (milliseconds) over the last periodic.stats_time interval | default |
kalive.ns | kalive_ms | latency | in-cluster keep-alive (heartbeat): average time (milliseconds) over the last periodic.stats_time interval | default |
up.ns.time | uptime | special | this node's uptime since its startup (seconds) | default |
state.flags | state_flags | gauge | bitwise 64-bit value that carries enumerated node-state flags, including warnings and alerts; see https://github.com/NVIDIA/aistore/blob/main/cmn/cos/node_state.go | default |
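
The relationship between the "Internal name" and "Public name" columns follows a visible pattern: a handful of suffix rewrites, with the remaining dots turned into underscores. The sketch below encodes only that observed pattern. It is not taken from the AIS source, the function name `public_name` is hypothetical, and some rows do not follow the pattern (for example, `up.ns.time` maps to `uptime`, and backend-prefixed metrics are renamed differently).

```python
# Suffix rewrites observed in the tables of this reference. The longest
# suffix comes first so that ".ns.total" is not mistaken for ".ns".
SUFFIXES = [
    (".ns.total", "_ns_total"),
    (".size", "_bytes"),
    (".bps", "_mbps"),
    (".ns", "_ms"),
    (".n", "_count"),
]


def public_name(internal: str) -> str:
    """Map an internal metric name to its public name (observed pattern only)."""
    for suffix, replacement in SUFFIXES:
        if internal.endswith(suffix):
            stem = internal[: -len(suffix)]
            return stem.replace(".", "_") + replacement
    return internal.replace(".", "_")


for name in ("get.n", "get.ns", "get.ns.total", "err.http.write.n", "state.flags"):
    print(f"{name} -> {public_name(name)}")
```
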
Target metrics
- Out-of-Band Metrics:
  - `VerChangeCount`: Number of out-of-band updates (by a 3rd party performing remote PUTs from outside this cluster).
    - Variable Labels: `bucket`
  - `VerChangeSize`: Total cumulative size (bytes) of objects updated out-of-band across all backends combined.
    - Variable Labels: `bucket`
  - `RemoteDeletedDelCount`: Number of out-of-band deletes (by a 3rd party remote DELETE(object) from outside this cluster).
    - Variable Labels: `bucket`
- PUT Latency Metrics:
  - `PutLatency`: PUT average time (milliseconds) over the last `periodic.stats_time` interval.
    - Variable Labels: `bucket`, `xkind`
  - `PutLatencyTotal`: PUT total cumulative time (nanoseconds).
    - Variable Labels: `bucket`, `xkind`
- HEAD Latency Metrics:
  - `HeadLatencyTotal`: HEAD total cumulative time (nanoseconds).
    - Variable Labels: `bucket`
- APPEND Latency Metrics:
  - `AppendLatency`: APPEND average time (milliseconds) over the last `periodic.stats_time` interval.
    - Variable Labels: `bucket`
- Throughput Metrics:
  - `GetThroughput`: GET average throughput (MB/s) over the last `periodic.stats_time` interval.
    - Variable Labels: `bucket`
  - `PutThroughput`: PUT average throughput (MB/s) over the last `periodic.stats_time` interval.
    - Variable Labels: `bucket`, `xkind`
- Size Metrics:
  - `GetSize`: GET total cumulative size (bytes).
    - Variable Labels: `bucket`
  - `PutSize`: PUT total cumulative size (bytes).
    - Variable Labels: `bucket`, `xkind`
- Error Metrics:
  - `ErrPutCksumCount`: PUT number of checksum errors.
    - Variable Labels: `bucket`, `xkind`
  - `ErrFSHCCount`: Number of times filesystem health checker (FSHC) was triggered by an I/O error or errors.
    - Variable Labels: `mountpath`
  - `IOErrGetCount`: GET number of I/O errors (excluding remote backend and network errors).
    - Variable Labels: `bucket`
  - `IOErrDeleteCount`: DELETE(object) number of I/O errors (excluding remote backend and network errors).
    - Variable Labels: `bucket`
For convenience, a table that summarizes target metrics follows below.
| Internal name | Public name | Internal Type | Description (Prometheus help) | Prometheus labels |
|---|---|---|---|---|
disk.<DISK-NAME>.read.bps | disk_read_mbps | computed-bandwidth | read bandwidth (MB/s) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
disk.<DISK-NAME>.avg.rsize | disk_avg_rsize | gauge | average read size (bytes) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
disk.<DISK-NAME>.write.bps | disk_write_mbps | computed-bandwidth | write bandwidth (MB/s) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
disk.<DISK-NAME>.avg.wsize | disk_avg_wsize | gauge | average write size (bytes) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
disk.<DISK-NAME>.util | disk_util | gauge | disk utilization (%) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
lru.evict.n | lru_evict_count | counter | number of LRU evictions | default |
lru.evict.size | lru_evict_bytes | size | total cumulative size (bytes) of LRU evictions | default |
cleanup.store.n | cleanup_store_count | counter | space cleanup: number of removed misplaced objects and old work files | default |
cleanup.store.size | cleanup_store_bytes | size | space cleanup: total size (bytes) of all removed misplaced objects and old work files (not including removed deleted objects) | default |
ver.change.n | ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs from outside this cluster) | default |
ver.change.size | ver_change_bytes | size | total cumulative size (bytes) of objects that were updated out-of-band across all backends combined | default |
remote.deleted.del.n | remote_deleted_del_count | counter | number of out-of-band deletes (by a 3rd party remote DELETE(object) from outside this cluster) | default |
put.ns | put_ms | latency | PUT: average time (milliseconds) over the last periodic.stats_time interval | default |
put.ns.total | put_ns_total | total | PUT: total cumulative time (nanoseconds) | default |
append.ns | append_ms | latency | APPEND(object): average time (milliseconds) over the last periodic.stats_time interval | default |
get.redir.ns | get_redir_ms | latency | GET: average gateway-to-target HTTP redirect latency (milliseconds) over the last periodic.stats_time interval | default |
put.redir.ns | put_redir_ms | latency | PUT: average gateway-to-target HTTP redirect latency (milliseconds) over the last periodic.stats_time interval | default |
ratelim.retry.get.n | ratelim_retry_get_n | counter | GET: number of rate-limited retries triggered by remote backends returning 429 and 503 status codes | default |
ratelim.retry.get.ns.total | ratelim_retry_get_ns_total | total | GET: total retrying time (nanoseconds) caused by remote backends returning 429 and 503 status codes | default |
ratelim.retry.put.n | ratelim_retry_put_n | counter | PUT: number of rate-limited retries triggered by remote backends returning 429 and 503 status codes | default |
ratelim.retry.put.ns.total | ratelim_retry_put_ns_total | total | PUT: total retrying time (nanoseconds) caused by remote backends returning 429 and 503 status codes | default |
get.bps | get_mbps | bandwidth | GET: average throughput (MB/s) over the last periodic.stats_time interval | default |
put.bps | put_mbps | bandwidth | PUT: average throughput (MB/s) over the last periodic.stats_time interval | default |
get.size | get_bytes | size | GET: total cumulative size (bytes) | default |
put.size | put_bytes | size | PUT: total cumulative size (bytes) | default |
err.cksum.n | err_cksum_count | counter | PUT: number of checksum errors | default |
err.fshc.n | err_fshc_count | counter | number of times filesystem health checker (FSHC) was triggered by an I/O error or errors | default |
err.io.get.n | err_io_get_count | counter | GET: number of I/O errors not including remote backend and network errors | default |
err.io.put.n | err_io_put_count | counter | PUT: number of I/O errors not including remote backend and network errors | default |
err.io.del.n | err_io_del_count | counter | DELETE(object): number of I/O errors not including remote backend and network errors | default |
stream.out.n | stream_out_count | counter | intra-cluster streaming communications: number of sent objects | default |
stream.out.size | stream_out_bytes | size | intra-cluster streaming communications: total cumulative size (bytes) of all transmitted objects | default |
stream.in.n | stream_in_count | counter | intra-cluster streaming communications: number of received objects | default |
stream.in.size | stream_in_bytes | size | intra-cluster streaming communications: total cumulative size (bytes) of all received objects | default |
dl.size | dl_bytes | size | total downloaded size (bytes) | default |
dl.ns.total | dl_ns_total | total | total downloading time (nanoseconds) | default |
dsort.creation.req.n | dsort_creation_req_count | counter | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
dsort.creation.resp.n | dsort_creation_resp_count | counter | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
dsort.creation.resp.ns | dsort_creation_resp_ms | latency | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
dsort.extract.shard.dsk.n | dsort_extract_shard_dsk_count | counter | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
dsort.extract.shard.mem.n | dsort_extract_shard_mem_count | counter | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
dsort.extract.shard.size | dsort_extract_shard_bytes | size | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
lcache.collision.n | lcache_collision_count | counter | number of LOM cache collisions (core, internal) | default |
lcache.evicted.n | lcache_evicted_count | counter | number of LOM cache evictions (core, internal) | default |
lcache.flush.cold.n | lcache_flush_cold_count | counter | number of times a LOM from cache was written to stable storage (core, internal) | default |
remais.get.n | remote_get_count | counter | GET: total number of executed remote requests | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.get.ns.total | remote_get_ns_total | total | GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.get.size | remote_get_bytes_total | size | GET: total cumulative size (bytes) of all remote GET transactions | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.head.n | remote_head_count | counter | HEAD: total number of executed remote requests to a given backend | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.put.n | remote_put_count | counter | PUT: total number of executed remote requests to a given backend | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.put.ns.total | remote_put_ns_total | total | PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.e2e.put.ns.total | remote_e2e_put_ns_total | total | PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.put.size | remote_e2e_put_bytes_total | size | PUT: total cumulative size (bytes) of all PUTs to a given remote backend | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.ver.change.n | remote_ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster) | map[backend:remais node_id:<AIS-NODE-ID>] |
remais.ver.change.size | remote_ver_change_bytes_total | size | total cumulative size of objects that were updated out-of-band | map[backend:remais node_id:<AIS-NODE-ID>] |
gcp.get.n | remote_get_count | counter | GET: total number of executed remote requests | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.get.ns.total | remote_get_ns_total | total | GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.get.size | remote_get_bytes_total | size | GET: total cumulative size (bytes) of all remote transactions | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.head.n | remote_head_count | counter | HEAD: total number of executed remote requests to a given backend | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.put.n | remote_put_count | counter | PUT: total number of executed remote requests to a given backend | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.put.ns.total | remote_put_ns_total | total | PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.e2e.put.ns.total | remote_e2e_put_ns_total | total | PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.put.size | remote_e2e_put_bytes_total | size | PUT: total cumulative size (bytes) of all PUTs to a given remote backend | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.ver.change.n | remote_ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster) | map[backend:gcp node_id:<AIS-NODE-ID>] |
gcp.ver.change.size | remote_ver_change_bytes_total | size | total cumulative size of objects that were updated out-of-band | map[backend:gcp node_id:<AIS-NODE-ID>] |
aws.get.n | remote_get_count | counter | GET: total number of executed remote requests | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.get.ns.total | remote_get_ns_total | total | GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.get.size | remote_get_bytes_total | size | GET: total cumulative size (bytes) of all remote transactions | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.head.n | remote_head_count | counter | HEAD: total number of executed remote requests to a given backend | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.put.n | remote_put_count | counter | PUT: total number of executed remote requests to a given backend | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.put.ns.total | remote_put_ns_total | total | PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.e2e.put.ns.total | remote_e2e_put_ns_total | total | PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.put.size | remote_e2e_put_bytes_total | size | PUT: total cumulative size (bytes) of all PUTs to a given remote backend | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.ver.change.n | remote_ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster) | map[backend:aws node_id:<AIS-NODE-ID>] |
aws.ver.change.size | remote_ver_change_bytes_total | size | total cumulative size of objects that were updated out-of-band | map[backend:aws node_id:<AIS-NODE-ID>] |
azure.get.n | remote_get_count | counter | GET: total number of executed remote requests | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.get.ns.total | remote_get_ns_total | total | GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.get.size | remote_get_bytes_total | size | GET: total cumulative size (bytes) of all remote transactions | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.head.n | remote_head_count | counter | HEAD: total number of executed remote requests to a given backend | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.put.n | remote_put_count | counter | PUT: total number of executed remote requests to a given backend | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.put.ns.total | remote_put_ns_total | total | PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.e2e.put.ns.total | remote_e2e_put_ns_total | total | PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.put.size | remote_e2e_put_bytes_total | size | PUT: total cumulative size (bytes) of all PUTs to a given remote backend | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.ver.change.n | remote_ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster) | map[backend:azure node_id:<AIS-NODE-ID>] |
azure.ver.change.size | remote_ver_change_bytes_total | size | total cumulative size of objects that were updated out-of-band | map[backend:azure node_id:<AIS-NODE-ID>] |
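
Several rows in the table above come in pairs: a cumulative total (bytes or nanoseconds) and a matching event counter. Dividing the deltas of such a pair over an interval gives a mean per event. The sketch below shows the idea for three pairs; the choice of pairs, the `RATIOS` table, and the sample numbers are illustrative assumptions, while the metric names are the public names from the table.

```python
# (total metric, event counter, description of the derived value)
RATIOS = [
    ("lru_evict_bytes", "lru_evict_count", "bytes per LRU eviction"),
    ("ver_change_bytes", "ver_change_count", "bytes per out-of-band update"),
    ("ratelim_retry_get_ns_total", "ratelim_retry_get_n",
     "ns per rate-limited GET retry"),
]


def derive(prev: dict, curr: dict) -> dict:
    """Mean per event for each pair that saw at least one event in the interval."""
    out = {}
    for total, counter, description in RATIOS:
        events = curr.get(counter, 0) - prev.get(counter, 0)
        if events > 0:
            out[description] = (curr.get(total, 0) - prev.get(total, 0)) / events
    return out


# Made-up snapshots of one target's metrics at two points in time.
prev = {"lru_evict_bytes": 0, "lru_evict_count": 0,
        "ratelim_retry_get_ns_total": 1_000_000_000, "ratelim_retry_get_n": 10}
curr = {"lru_evict_bytes": 8_388_608, "lru_evict_count": 4,
        "ratelim_retry_get_ns_total": 4_000_000_000, "ratelim_retry_get_n": 16}
print(derive(prev, curr))
# {'bytes per LRU eviction': 2097152.0, 'ns per rate-limited GET retry': 500000000.0}
```

Pairs with no events in the interval (here, out-of-band updates) are omitted rather than reported as zero.
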
Backend metrics
- GET Metrics:
  - `remote_get_count`: Total number of executed remote GET requests.
    - Variable Labels: `bucket`
  - `remote_get_ns_total`: Total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects.
    - Variable Labels: `bucket`
  - `remote_get_bytes_total`: Total cumulative size (bytes) of all remote GET transactions.
    - Variable Labels: `bucket`
- PUT Metrics:
  - `remote_put_count`: Total number of executed remote PUT requests to a given backend.
    - Variable Labels: `bucket`, `xkind`
  - `remote_put_ns_total`: Total cumulative time (nanoseconds) to execute remote PUT requests and store new object versions in-cluster.
    - Variable Labels: `bucket`, `xkind`
  - `remote_e2e_put_ns_total`: Total end-to-end time (nanoseconds) servicing remote PUT requests (includes receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object).
    - Variable Labels: `bucket`, `xkind`
  - `remote_e2e_put_bytes_total`: Total cumulative size (bytes) of all PUTs to a given remote backend.
    - Variable Labels: `bucket`, `xkind`
- HEAD Metrics:
  - `remote_head_count`: Total number of executed remote HEAD requests to a given backend.
    - Variable Labels: `bucket`
  - `remote_head_ns_total`: Total cumulative time (nanoseconds) to execute remote HEAD requests.
    - Variable Labels: `bucket`
- Out-of-Band Updates:
  - `remote_ver_change_count`: Number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster).
    - Variable Labels: `bucket`
  - `remote_ver_change_bytes_total`: Total cumulative size (bytes) of objects that were updated out-of-band.
    - Variable Labels: `bucket`
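
Since the backend metrics share one public name per operation and differ only in the `backend` label, per-backend latency and throughput can be computed by grouping on that label. The following is a minimal sketch under stated assumptions: the nested snapshot layout and all numbers are hypothetical, MB is taken as 10^6 bytes, and the resulting "latency" includes the time to store, copy, or transform objects, as the description of `remote_get_ns_total` says.

```python
def per_backend_get_stats(prev: dict, curr: dict, elapsed_s: float) -> dict:
    """Return {backend: (avg latency in ms, throughput in MB/s)} for remote GETs.

    prev and curr map a backend name to a dict of cumulative counters keyed
    by public metric name. Backends with no new requests are skipped.
    """
    stats = {}
    for backend, now in curr.items():
        before = prev.get(backend, {})
        requests = now["remote_get_count"] - before.get("remote_get_count", 0)
        if requests <= 0 or elapsed_s <= 0:
            continue
        ns = now["remote_get_ns_total"] - before.get("remote_get_ns_total", 0)
        size = now["remote_get_bytes_total"] - before.get("remote_get_bytes_total", 0)
        stats[backend] = (ns / requests / 1e6, size / elapsed_s / 1e6)
    return stats


# Made-up snapshots taken 20 seconds apart.
prev = {
    "aws": {"remote_get_count": 0, "remote_get_ns_total": 0,
            "remote_get_bytes_total": 0},
    "gcp": {"remote_get_count": 10, "remote_get_ns_total": 900_000_000,
            "remote_get_bytes_total": 1_000_000},
}
curr = {
    "aws": {"remote_get_count": 50, "remote_get_ns_total": 5_000_000_000,
            "remote_get_bytes_total": 200_000_000},
    "gcp": {"remote_get_count": 10, "remote_get_ns_total": 900_000_000,
            "remote_get_bytes_total": 1_000_000},
}
print(per_backend_get_stats(prev, curr, elapsed_s=20.0))
# {'aws': (100.0, 10.0)}
```
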
Related Documentation
| Document | Description |
|---|---|
| Overview | Introduction to AIS observability |
| CLI | Command-line monitoring tools |
| Logs | Log-based observability |
| Prometheus | Configuring Prometheus with AIS |
| Grafana | Visualizing AIS metrics with Grafana |
| Kubernetes | Working with Kubernetes monitoring stacks |