Design and Features of PANIC

January 24, 2023

High-Level Design

The PANIC alerter can alert a node operator on the following sources:

  • The host systems that the Cosmos-SDK/Substrate/Chainlink nodes are running on based on system metrics obtained from the node via Node Exporter.
  • Chainlink nodes will be monitored through their Prometheus ports.
  • Chainlink contracts are monitored through the use of EVM nodes and Chainlink node addresses.
  • EVM nodes will be monitored through the RPC endpoint.
  • Cosmos nodes will be monitored through their Prometheus, REST, and Tendermint RPC endpoints.
  • Cosmos networks will be monitored using various Cosmos nodes' REST endpoints.
  • Substrate nodes will be monitored through their web-socket URL.
  • Substrate networks will be monitored using various Substrate nodes' web-socket URLs.
  • GitHub repository releases will be monitored using the GitHub Releases API.
  • DockerHub repository releases will be monitored using the Docker Hub API.

Note: Systems monitoring and GitHub/DockerHub repository monitoring were developed to be as general as possible, giving the node operator the option to monitor any system and/or any repository (these do not have to be Substrate/Cosmos-SDK/Chainlink based nodes/repositories).

The diagram below depicts the different components which constitute PANIC and how they interact with each other and the node operator.

PANIC Design

PANIC starts by loading the configurations (saved during installation).

For system monitoring and alerting, PANIC operates as follows:

  • When the Monitors Manager Process receives the configurations, it starts as many System Monitors as there are systems to be monitored.
  • Each System Monitor extracts the system data from the node's Node Exporter endpoint and forwards this data to the System Data Transformer via RabbitMQ.
  • The System Data Transformer starts by listening for data from the System Monitors via RabbitMQ. Whenever a system's data is received, the System Data Transformer combines the received data with the system's state obtained from Redis, and sends the combined data to the Data Store and the System Alerter via RabbitMQ.
  • The System Alerter starts by listening for data from the System Data Transformer via RabbitMQ. Whenever a system's transformed data is received, the System Alerter compares the received data with the alert rules set during installation, and raises an alert if any of these rules are triggered. This alert is then sent to the Alert Router via RabbitMQ.
  • The Data Store also receives data from the System Data Transformer via RabbitMQ and saves this data to both Redis and MongoDB as required.
  • When the Alert Router receives an alert from the System Alerter via RabbitMQ, it checks the configurations to determine which channels should receive this alert. As a result, this alert is then routed to the appropriate channel and the Data Store (so that the alert is stored in a Mongo database) via RabbitMQ.
  • When a Channel Handler receives an alert via RabbitMQ, it simply forwards it to the channel it handles, and the Node Operator is notified via this channel.
  • If the user sets up a Telegram or Slack Channel with Commands enabled, the user is able to control and query PANIC via Telegram Bot/Slack App Commands. A list of available commands is given in the Telegram and Slack Commands section.
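The flow described in the steps above can be sketched in a few lines. This is a minimal illustration only, not PANIC's implementation: in-memory queues stand in for RabbitMQ, a dictionary stands in for Redis, and the function names, the `storage_pct` metric and the rule values are assumptions made for the example.

```python
import queue

# Assumption: in-memory queues stand in for RabbitMQ, a dict stands in for Redis.
raw_q, alerter_q, router_q = queue.Queue(), queue.Queue(), queue.Queue()
state_store = {}
RULES = {"storage_pct": {"WARNING": 85, "CRITICAL": 100}}  # example alert rules


def monitor(system_id, metrics):
    """System Monitor: forward raw metrics to the Data Transformer."""
    raw_q.put({"system": system_id, **metrics})


def transform():
    """Data Transformer: combine new data with the previously stored state."""
    data = raw_q.get()
    previous = state_store.get(data["system"], {})
    state_store[data["system"]] = data
    alerter_q.put({"current": data, "previous": previous})


def alert():
    """Alerter: compare transformed data with the rules, most severe first."""
    item = alerter_q.get()
    value = item["current"]["storage_pct"]
    for severity in ("CRITICAL", "WARNING"):
        if value >= RULES["storage_pct"][severity]:
            router_q.put({"system": item["current"]["system"],
                          "severity": severity, "value": value})
            break


monitor("validator-host", {"storage_pct": 91})
transform()
alert()
routed = router_q.get_nowait()  # the alert the Alert Router would receive
print(routed)
```

In a real deployment each stage runs as its own process, so the queues are what decouple the stages from one another.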

For EVM Node, Cosmos Node, Substrate Node, and GitHub/DockerHub repositories monitoring and alerting, PANIC operates similarly to system monitoring and alerting. The difference is that each monitorable type has its own set of dedicated processes which monitor different endpoints/data sources as required. For example, to monitor Cosmos nodes, a Cosmos Node Monitor, a Cosmos Node Data Transformer and a Cosmos Node Alerter were written to monitor data obtained from the REST, Prometheus and Tendermint RPC endpoints.

For Chainlink node monitoring and alerting, PANIC operates as follows:

  • When the Monitors Manager Process receives the configurations, it starts as many Chainlink Node Monitors as there are Chainlink configurations to be monitored. A Chainlink configuration can have multiple Prometheus data points set up, since a node operator may have multiple Chainlink nodes set up with only one running. If one Chainlink node goes down, another starts operating to ensure fully functional operations. The node monitor is built to consider this and checks all Prometheus data points to find the active one. If none are found, an appropriate response is passed on.
  • Each Chainlink Node Monitor extracts the Chainlink data from the node's Prometheus endpoint and forwards this data to the Chainlink Node Data Transformer via RabbitMQ.
  • The Chainlink Node Data Transformer starts by listening for data from the Chainlink Node Monitors via RabbitMQ. Whenever a Chainlink node's data is received, the Chainlink Node Data Transformer combines the received data with the Chainlink node's state obtained from Redis, and sends the combined data to the Data Store and the Chainlink Node Alerter via RabbitMQ.
  • The Chainlink Node Alerter starts by listening for data from the Chainlink Node Data Transformer via RabbitMQ. Whenever a Chainlink node's transformed data is received, the Chainlink Node Alerter compares the received data with the alert rules set during installation, and raises an alert if any of these rules are triggered. This alert is then sent to the Alert Router via RabbitMQ.
  • The Data Store also receives data from the Chainlink Node Data Transformer via RabbitMQ and saves this data to both Redis and MongoDB as required.
  • When the Alert Router receives an alert from the Chainlink Node Alerter via RabbitMQ, it checks the configurations to determine which channels should receive this alert. As a result, this alert is then routed to the appropriate channel and the Data Store (so that the alert is stored in a Mongo database) via RabbitMQ.
  • When a Channel Handler receives an alert via RabbitMQ, it simply forwards it to the channel it handles, and the Node Operator is notified via this channel.
  • If the user sets up a Telegram or Slack Channel with Commands enabled, the user is able to control and query PANIC via Telegram Bot/Slack App Commands. A list of available commands is given in the Telegram and Slack Commands section.
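The way a Chainlink Node Monitor looks for the active Prometheus data point among several configured ones can be illustrated as follows. This is a hedged sketch: the function name, the URLs and the injected `is_reachable` check are all hypothetical, and the real monitor performs actual HTTP requests.

```python
from typing import Callable, Iterable, Optional


def find_active_source(urls: Iterable[str],
                       is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable Prometheus URL, or None if all are down."""
    for url in urls:
        if is_reachable(url):
            return url
    return None


# Hypothetical endpoints: the first node is a stopped standby.
prometheus_urls = [
    "http://cl-node-1:6688/metrics",
    "http://cl-node-2:6688/metrics",
]
up = {"http://cl-node-2:6688/metrics"}

active = find_active_source(prometheus_urls, lambda url: url in up)
none_active = find_active_source(prometheus_urls, lambda url: False)
```

When `None` is returned, the caller is the one that passes on the "no active source" response mentioned above.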

For Chainlink contract monitoring and alerting, PANIC operates as follows:

  • When the Monitors Manager Process receives the configurations, it starts one Chainlink Contract Monitor per chain and keeps the configurations updated. A Chainlink Contract Monitor uses EVM nodes to retrieve price feed data. The Chainlink Contract Monitor knows which contracts to monitor as it retrieves the addresses of the Chainlink nodes previously set up and checks if these addresses exist in the list of contracts from weiwatchers. If a user has multiple EVM nodes set up and one goes down, the monitor will attempt to retrieve data from the next node in the list. If none are reachable, an appropriate message is passed on.
  • Each Chainlink Contract Monitor extracts the Chainlink contract data from the EVM node's RPC endpoint and forwards this data to the Chainlink Contract Data Transformer via RabbitMQ.
  • The Chainlink Contract Data Transformer starts by listening for data from the Chainlink Contract Monitors via RabbitMQ. Whenever a Chainlink contract's data is received, the Chainlink Contract Data Transformer combines the received data with the Chainlink contract's state obtained from Redis, and sends the combined data to the Data Store and the Chainlink Contract Alerter via RabbitMQ.
  • The Chainlink Contract Alerter starts by listening for data from the Chainlink Contract Data Transformer via RabbitMQ. Whenever a Chainlink contract's transformed data is received, the Chainlink Contract Alerter compares the received data with the alert rules set during installation, and raises an alert if any of these rules are triggered. This alert is then sent to the Alert Router via RabbitMQ.
  • The Data Store also receives data from the Chainlink Contract Data Transformer via RabbitMQ and saves this data to both Redis and MongoDB as required.
  • When the Alert Router receives an alert from the Chainlink Contract Alerter via RabbitMQ, it checks the configurations to determine which channels should receive this alert. As a result, this alert is then routed to the appropriate channel and the Data Store (so that the alert is stored in a Mongo database) via RabbitMQ.
  • When a Channel Handler receives an alert via RabbitMQ, it simply forwards it to the channel it handles, and the Node Operator is notified via this channel.
  • If the user sets up a Telegram or Slack Channel with Commands enabled, the user is able to control and query PANIC via Telegram Bot/Slack App Commands. A list of available commands is given in the Telegram and Slack Commands section.
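The contract selection step, where the monitor keeps only the contracts that involve the operator's Chainlink node addresses, might look roughly like the sketch below. The shape of the feed list (a `contract` field plus a list of `nodes`) and all addresses are assumptions made for illustration and do not reflect the actual weiwatchers format.

```python
def contracts_to_monitor(feeds, node_addresses):
    """Keep only the contracts that list at least one of our node addresses."""
    # Addresses are compared case-insensitively.
    wanted = {address.lower() for address in node_addresses}
    return [
        feed["contract"]
        for feed in feeds
        if wanted & {address.lower() for address in feed["nodes"]}
    ]


# Hypothetical feed list and node address.
feeds = [
    {"contract": "0xFeedA", "nodes": ["0xNode1", "0xNode2"]},
    {"contract": "0xFeedB", "nodes": ["0xNode3"]},
    {"contract": "0xFeedC", "nodes": ["0xNODE2"]},
]
selected = contracts_to_monitor(feeds, ["0xnode2"])
```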

For Cosmos network monitoring and alerting, PANIC operates as follows:

  • When the Monitors Manager Process receives the configurations, it starts one Cosmos Network Monitor per chain and keeps the configurations updated. A Cosmos Network Monitor uses Cosmos nodes to retrieve governance data. If a user has multiple Cosmos nodes set up and one goes down, the monitor will attempt to retrieve data from the next node in the list. If no node is synced and reachable, an appropriate message is passed on.
  • Each Cosmos Network Monitor extracts the Cosmos network data from the Cosmos node's REST endpoint and forwards this data to the Cosmos Network Data Transformer via RabbitMQ.
  • The Cosmos Network Data Transformer starts by listening for data from the Cosmos Network Monitors via RabbitMQ. Whenever a Cosmos network's data is received, the Cosmos Network Data Transformer combines the received data with the Cosmos network's state obtained from Redis, and sends the combined data to the Data Store and the Cosmos Network Alerter via RabbitMQ.
  • The Cosmos Network Alerter starts by listening for data from the Cosmos Network Data Transformer via RabbitMQ. Whenever a Cosmos network's transformed data is received, the Cosmos Network Alerter compares the received data with the alert rules set during installation, and raises an alert if any of these rules are triggered. This alert is then sent to the Alert Router via RabbitMQ.
  • The Data Store also receives data from the Cosmos Network Data Transformer via RabbitMQ and saves this data to both Redis and MongoDB as required.
  • When the Alert Router receives an alert from the Cosmos Network Alerter via RabbitMQ, it checks the configurations to determine which channels should receive this alert. As a result, this alert is then routed to the appropriate channel and the Data Store (so that the alert is stored in a Mongo database) via RabbitMQ.
  • When a Channel Handler receives an alert via RabbitMQ, it simply forwards it to the channel it handles, and the Node Operator is notified via this channel.
  • If the user sets up a Telegram or Slack Channel with Commands enabled, the user is able to control and query PANIC via Telegram Bot/Slack App Commands. A list of available commands is given in the Telegram and Slack Commands section.
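For network monitoring, a data source has to be both reachable and synced before it is used. A minimal sketch of that selection, assuming each node is described by hypothetical `name`, `reachable` and `synced` fields:

```python
def select_data_source(nodes):
    """Return (node_name, None) for the first usable node, else (None, message)."""
    for node in nodes:
        if node["reachable"] and node["synced"]:
            return node["name"], None
    return None, "No synced and reachable data source found"


# Hypothetical node list, tried in order.
cosmos_nodes = [
    {"name": "cosmos-node-1", "reachable": False, "synced": True},
    {"name": "cosmos-node-2", "reachable": True, "synced": False},
    {"name": "cosmos-node-3", "reachable": True, "synced": True},
]
chosen, error = select_data_source(cosmos_nodes)
nothing, message = select_data_source(cosmos_nodes[:2])
```

The second call shows the fallback case, where the message is what would be passed on instead of network data.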

For Substrate Network monitoring and alerting, PANIC operates similarly to that of Cosmos Network monitoring and alerting. The difference is that each monitorable type has its own set of dedicated processes which monitor different endpoints/data sources as required. For example, to monitor Substrate networks a Substrate Network Monitor, Substrate Network Data Transformer and a Substrate Network Alerter were written to monitor data obtained from the web-socket URLs.

Notes:

  • Another important component which is not depicted above is the Health-Checker component. The Health-Checker was not included in the image above as it is not part of the monitoring and alerting process. In fact, it runs in its own Docker container. The Health-Checker component consists of two separate components, the Ping Publisher and the Heartbeat Handler. The Ping Publisher sends ping requests to PANIC's components every 30 seconds via RabbitMQ, and the Heartbeat Handler listens for heartbeats and saves them to Redis. This mechanism makes it possible to deduce whether PANIC's components are running as expected when the node operator enters the /status or /panicstatus commands described in the Telegram and Slack Commands section.
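How stored heartbeats can be turned into a status report may be easier to see in code. The sketch below uses a dictionary in place of Redis. The component names and the rule that a component counts as running if its last heartbeat is at most two ping intervals old are assumptions for illustration, not PANIC's documented behaviour.

```python
import time

PING_INTERVAL = 30  # seconds between pings, as stated above

heartbeats = {}  # stand-in for Redis: component name -> last heartbeat time


def handle_heartbeat(component: str, timestamp: float) -> None:
    """Heartbeat Handler: remember when a component last answered a ping."""
    heartbeats[component] = timestamp


def status(components, now: float, max_age: float = 2 * PING_INTERVAL) -> dict:
    """Report which components have a recent enough heartbeat."""
    return {
        name: name in heartbeats and now - heartbeats[name] <= max_age
        for name in components
    }


now = time.time()
handle_heartbeat("System Alerter", now - 10)   # answered recently
handle_heartbeat("Alert Router", now - 300)    # last answer is stale
report = status(["System Alerter", "Alert Router", "Data Store"], now)
```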

Alert Types

Different events vary in severity. We cannot treat an alert for a new version of the Cosmos-SDK as being on the same level as an alert for 100% Storage usage. PANIC makes use of four alert types:

  • CRITICAL: Alerts of this type are the most severe. Such alerts are raised to inform the node operator of a situation which requires immediate action. Example: System's storage usage reached 100%.
  • WARNING: A less severe alert type but which still requires attention as it may be a warning of an incoming critical alert. Example: System's storage usage reached 85%.
  • INFO: Alerts of this type have little to zero severity but consist of information which is still important to acknowledge. Info alerts also include positive events. Example: System's storage usage is no longer at a critical level.
  • ERROR: Alerts of this type are triggered by abnormal events and range from zero to high severity based on the error that has occurred and how many times it is triggered. Example: Cannot access GitHub page alert.

Note: The critical and warning values (100% and 85%) mentioned in the examples above are configurable, and can be set using the installation procedure mentioned here.

Alerting Channels

PANIC supports multiple alerting channels. By default, only the console and logging channels are enabled, allowing the node operator to run the alerter without having to set up extra alerting channels. This is not enough for a more serious and longer-term alerting setup, for which the node operator should set up the remaining alerting channels using the installation process described here.

PANIC supports the following alerting channels:

| Channel | Severities Supported | Configurable Severities | Description |
| --- | --- | --- | --- |
| Console | INFO, CRITICAL, WARNING, ERROR | All | Alerts printed to standard output (stdout) of the alerter's Docker container. |
| Log | INFO, CRITICAL, WARNING, ERROR | All | Alerts logged to an alerts log (alerter/logs/alerts/alerts.log). |
| Telegram | INFO, CRITICAL, WARNING, ERROR | All | Alerts delivered to a Telegram chat via a Telegram bot in the form of a text message. |
| Slack | INFO, CRITICAL, WARNING, ERROR | All | Alerts delivered to a Slack channel via a Slack app in the form of a text message. |
| E-mail | INFO, CRITICAL, WARNING, ERROR | All | Alerts sent as emails using an SMTP server, with option for authentication. |
| Twilio | CRITICAL | None | Alerts trigger a phone call to grab the node operator's attention. |
| Opsgenie | INFO, CRITICAL, WARNING, ERROR | All | Alerts are sent to the node operator's Opsgenie environment using the following severity mapping: CRITICAL → P1, WARNING → P3, ERROR → P3, INFO → P5. |
| PagerDuty | INFO, CRITICAL, WARNING, ERROR | All | Alerts are sent to the node operator's PagerDuty environment using the following severity mapping: CRITICAL → critical, WARNING → warning, ERROR → error, INFO → info. |
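The severity mappings listed for Opsgenie and PagerDuty are simple lookups. The sketch below writes them out as dictionaries. The variable and function names are hypothetical, and only the mapping values come from the table above.

```python
# PANIC severity -> Opsgenie priority, as listed in the table above.
OPSGENIE_PRIORITY = {"CRITICAL": "P1", "WARNING": "P3", "ERROR": "P3", "INFO": "P5"}

# PANIC severity -> PagerDuty severity, as listed in the table above.
PAGERDUTY_SEVERITY = {
    "CRITICAL": "critical",
    "WARNING": "warning",
    "ERROR": "error",
    "INFO": "info",
}

# Twilio only supports CRITICAL alerts.
TWILIO_SEVERITIES = {"CRITICAL"}


def twilio_should_call(severity: str) -> bool:
    """Return True if an alert of this severity would trigger a phone call."""
    return severity in TWILIO_SEVERITIES


example = (OPSGENIE_PRIORITY["WARNING"], PAGERDUTY_SEVERITY["WARNING"])
```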

Using the installation procedure, the user is able to specify the chain a node/system/GitHub repository belongs to (if the system/GitHub repository is not related to any chain, it can be associated with the GENERAL chain). As a result, the user can associate channels with specific chains, obtaining a more organized alerting system. In addition to this, the user can set up multiple alerting channels of the same type and enable/disable alert severities on each channel.

For example the node operator may have the following setup:

  • A Telegram Channel for Polkadot alerts with only WARNING and CRITICAL alerts enabled.
  • A Telegram Channel for Cosmos alerts with all severities enabled.
  • A Twilio Channel for all chains added to PANIC.
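The example setup above can be expressed as data, together with a small routing function that decides which channels receive an alert for a given chain and severity. This is only a sketch: the `Channel` class, the channel names and the `route` function are assumptions for illustration and are not PANIC's configuration format.

```python
from dataclasses import dataclass

ALL_SEVERITIES = frozenset({"INFO", "WARNING", "CRITICAL", "ERROR"})


@dataclass(frozen=True)
class Channel:
    name: str
    kind: str
    chains: frozenset       # chains this channel is associated with
    severities: frozenset   # severities enabled on this channel


CHANNELS = [
    Channel("polkadot-telegram", "telegram",
            frozenset({"Polkadot"}), frozenset({"WARNING", "CRITICAL"})),
    Channel("cosmos-telegram", "telegram",
            frozenset({"Cosmos"}), ALL_SEVERITIES),
    # Twilio only supports CRITICAL and is attached to every chain here.
    Channel("phone-calls", "twilio",
            frozenset({"Polkadot", "Cosmos", "GENERAL"}), frozenset({"CRITICAL"})),
]


def route(chain: str, severity: str) -> list:
    """Return the names of the channels that should receive this alert."""
    return [c.name for c in CHANNELS
            if chain in c.chains and severity in c.severities]


polkadot_info = route("Polkadot", "INFO")
polkadot_critical = route("Polkadot", "CRITICAL")
cosmos_info = route("Cosmos", "INFO")
```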

Telegram and Slack Commands

Telegram bots and Slack apps in PANIC serve two purposes. As mentioned above, they are used to send alerts. However, they can also accept commands, allowing the node operator to have some control over the alerter and check its status.

PANIC supports the following commands:

| Command | Parameters | Description |
| --- | --- | --- |
| /start | None | A welcome message is returned. |
| /ping | None | Pings the Telegram/Slack Commands Handler associated with the Telegram Chat/Slack Channel and returns PONG!. The user can use this command to check that the associated Telegram/Slack Commands Handler is running. |
| /help | None | Returns a guide of acceptable commands and their description. |
| /mute for Telegram, /panicmute for Slack | List of severities, for example: /mute INFO CRITICAL | Suppose that the user types /mute INFO CRITICAL in a Telegram Chat/Slack Channel associated with the chain Polkadot. The /mute command mutes INFO and CRITICAL alerts on all channels (including all other channels which are set up, for example Opsgenie) for the chain Polkadot. If no severities are given, all Polkadot alerts are muted on all channels. |
| /unmute | None | Suppose that the user types /unmute in a Telegram Chat/Slack Channel associated with the chain Polkadot. This command will unmute all alert severities on all channels (including all other channels which are set up, for example Opsgenie) for the chain Polkadot. |
| /muteall | List of severities, for example: /muteall INFO CRITICAL | Suppose that the user types /muteall INFO CRITICAL in a Telegram Chat/Slack Channel associated with the chain Polkadot. The /muteall command mutes INFO and CRITICAL alerts on all channels (including all other channels which are set up, for example Opsgenie) for every chain being monitored (including the GENERAL chain). If no severities are given, all alerts for all chains being monitored are muted on all channels. |
| /unmuteall | None | Suppose that the user types /unmuteall in a Telegram Chat/Slack Channel associated with the chain Polkadot. This command unmutes all alert severities on all channels (including all other channels which are set up, for example Opsgenie) for every chain being monitored (including the GENERAL chain). |
| /status for Telegram, /panicstatus for Slack | None | Returns whether the components that constitute PANIC are running or not. If there are problems, the problems are highlighted in the status message. |
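The scoping of the mute commands (one chain versus every chain, selected severities versus all of them) can be modelled with a small state object. The sketch below is an assumption-laden illustration: the `MuteState` class is hypothetical, per-chain and all-chain mutes are stored separately, and `/unmuteall` is assumed to clear both. How PANIC actually stores this state is not covered here.

```python
ALL_SEVERITIES = frozenset({"INFO", "WARNING", "CRITICAL", "ERROR"})


class MuteState:
    def __init__(self):
        self.per_chain = {}      # chain name -> set of muted severities
        self.all_chains = set()  # severities muted for every chain

    def mute(self, chain, severities=()):
        """/mute: mute the given severities (or all) for one chain."""
        self.per_chain.setdefault(chain, set()).update(severities or ALL_SEVERITIES)

    def unmute(self, chain):
        """/unmute: remove the chain-specific mute."""
        self.per_chain.pop(chain, None)

    def muteall(self, severities=()):
        """/muteall: mute the given severities (or all) for every chain."""
        self.all_chains.update(severities or ALL_SEVERITIES)

    def unmuteall(self):
        """/unmuteall: assumed here to clear every mute for every chain."""
        self.all_chains.clear()
        self.per_chain.clear()

    def is_muted(self, chain, severity):
        return (severity in self.all_chains
                or severity in self.per_chain.get(chain, set()))


state = MuteState()
state.mute("Polkadot", ["INFO", "CRITICAL"])   # /mute INFO CRITICAL
state.muteall(["ERROR"])                       # /muteall ERROR
```

Because muting is checked before an alert reaches any channel, a mute issued from Telegram or Slack also silences channels such as Opsgenie, as the table above describes.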

List of Alerts

A complete list of alerts will now be presented, grouped by the type of entity being monitored.

Each alert either has severity thresholds associated with it or is associated with a single severity. A severity threshold is a (value, severity) pair such that when a metric associated with the alert reaches value, an alert with severity is raised. For example, the System CPU Usage Critical severity threshold can be configured to 95%, meaning that you will get a CRITICAL SystemCPUUsageIncreasedAboveThresholdAlert alert if the CPU Usage of a system reaches 95%. On the other hand, if an alert is associated with a single severity, that alert will always be raised with the same severity whenever the alert rule is triggered. For example, when a System is back up again after it was down, a SystemBackUpAgainAlert with severity INFO is raised. In addition to this, not all alerts have configurable severities or severity thresholds, and some alerts can even be disabled altogether.

In the lists below we will show which alerts have severity thresholds and which alerts have a single severity associated. In addition to this we will state which alerts are configurable/non-configurable and which can be disabled/enabled.

Note: Alerts can be configured and/or enabled/disabled using the installation procedure described here.
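A severity threshold, as defined above, maps a metric value to a severity. A minimal sketch follows, assuming the alert fires when the value reaches the threshold. The function name and the 85% warning value are illustrative, while the 95% critical value is the one used in the CPU example above.

```python
from typing import Optional


def classify(value: float, warning_threshold: float,
             critical_threshold: float) -> Optional[str]:
    """Return the severity for a metric value, or None if below both thresholds."""
    if value >= critical_threshold:
        return "CRITICAL"
    if value >= warning_threshold:
        return "WARNING"
    return None


cpu_checks = [classify(v, warning_threshold=85, critical_threshold=95)
              for v in (40, 90, 95)]
```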

System Alerts

| Alert Class | Severity Thresholds | Severity | Configurable | Can be Enabled/Disabled | Description |
| --- | --- | --- | --- | --- | --- |
| SystemWentDownAtAlert | WARNING, CRITICAL | | | | A WARNING/CRITICAL alert is raised if warning_threshold/critical_threshold seconds pass after a system is down respectively. |
| SystemBackUpAgainAlert | | INFO | | Depends on SystemWentDownAtAlert | The system was down and is back up again. This alert can only be enabled/disabled if the downtime alert is enabled/disabled respectively. |
| SystemStillDownAlert | | CRITICAL | | | Raised periodically every critical_repeat seconds if a SystemWentDownAt alert has already been raised. |
| InvalidUrlAlert | | ERROR | | | The system's provided Node Exporter endpoint has an invalid URL schema. |
| ValidUrlAlert | | INFO | | | The system's provided Node Exporter endpoint is valid after being invalid. |
| MetricNotFoundErrorAlert | | ERROR | | | A metric that is being monitored cannot be found at the system's Node Exporter endpoint. |
| MetricFoundAlert | | INFO | | | All metrics can be found at the system's Node Exporter endpoint after a MetricNotFoundErrorAlert is raised. |
| OpenFileDescriptorsIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | A WARNING/CRITICAL alert is raised if the percentage number of open file descriptors increases above warning_threshold/critical_threshold respectively. This alert is raised periodically every critical_repeat seconds with CRITICAL severity if the percentage number of open file descriptors is still above critical_threshold. |
| OpenFileDescriptorsDecreasedBelowThresholdAlert | | INFO | | | The percentage number of open file descriptors decreases below warning_threshold/critical_threshold. This alert can only be enabled/disabled if the OpenFileDescriptorsIncreasedAboveThresholdAlert is enabled/disabled respectively. |
| SystemCPUUsageIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | A WARNING/CRITICAL alert is raised if the system's CPU usage percentage increases above warning_threshold/critical_threshold respectively. This alert is raised periodically every critical_repeat seconds with CRITICAL severity if the system's CPU usage percentage is still above critical_threshold. |
| SystemCPUUsageDecreasedBelowThresholdAlert | | INFO | | | The system's CPU usage percentage decreases below warning_threshold/critical_threshold. This alert can only be enabled/disabled if the SystemCPUUsageIncreasedAboveThresholdAlert is enabled/disabled respectively. |
| SystemRAMUsageIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | A WARNING/CRITICAL alert is raised if the system's RAM usage percentage increases above warning_threshold/critical_threshold respectively. This alert is raised periodically every critical_repeat seconds with CRITICAL severity if the system's RAM usage percentage is still above critical_threshold. |
| SystemRAMUsageDecreasedBelowThresholdAlert | | INFO | | | The system's RAM usage percentage decreases below warning_threshold/critical_threshold. This alert can only be enabled/disabled if the SystemRAMUsageIncreasedAboveThresholdAlert is enabled/disabled respectively. |
| SystemStorageUsageIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | A WARNING/CRITICAL alert is raised if the system's storage usage percentage increases above warning_threshold/critical_threshold respectively. This alert is raised periodically every critical_repeat seconds with CRITICAL severity if the system's storage usage percentage is still above critical_threshold. |
| SystemStorageUsageDecreasedBelowThresholdAlert | | INFO | | | The system's storage usage percentage decreases below warning_threshold/critical_threshold. This alert can only be enabled/disabled if the SystemStorageUsageIncreasedAboveThresholdAlert is enabled/disabled respectively. |

Note:

  • warning_threshold and critical_threshold represent the WARNING and CRITICAL configurable thresholds respectively. These are set by the user during installation.
  • critical_repeat represents the amount of time that needs to pass for a CRITICAL alert that has already been raised to be raised again. This can also be set by the user during installation.
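The interplay between warning_threshold, critical_threshold and critical_repeat can be sketched as a small state machine. This is a hedged illustration under stated assumptions: the class name, the shortened alert names and the exact transition rules (for example, raising an INFO alert whenever the level drops) are simplifications and not PANIC's actual alerter code.

```python
class ThresholdAlerter:
    """Tracks one metric and decides when an alert should be raised."""

    def __init__(self, warning_threshold, critical_threshold, critical_repeat):
        self.warning_threshold = warning_threshold
        self.critical_threshold = critical_threshold
        self.critical_repeat = critical_repeat  # seconds
        self.level = None           # severity currently in effect, if any
        self.last_critical = None   # time the last CRITICAL alert was raised

    def _level_for(self, value):
        if value >= self.critical_threshold:
            return "CRITICAL"
        if value >= self.warning_threshold:
            return "WARNING"
        return None

    def process(self, value, now):
        """Return (alert_kind, severity), or None if nothing should be raised."""
        new_level = self._level_for(value)
        previous, self.level = self.level, new_level
        if new_level == "CRITICAL":
            repeat_due = (previous == "CRITICAL"
                          and now - self.last_critical >= self.critical_repeat)
            if previous != "CRITICAL" or repeat_due:
                self.last_critical = now
                return ("IncreasedAboveThreshold", "CRITICAL")
            return None
        if new_level == "WARNING" and previous is None:
            return ("IncreasedAboveThreshold", "WARNING")
        if previous is not None and new_level != previous:
            return ("DecreasedBelowThreshold", "INFO")
        return None


alerter = ThresholdAlerter(warning_threshold=85, critical_threshold=95,
                           critical_repeat=60)
# (value, time in seconds) readings for a single metric.
readings = [(50, 0), (90, 10), (97, 20), (98, 30), (98, 80), (70, 90)]
raised = [alerter.process(value, now) for value, now in readings]
```

In this run the CRITICAL alert raised at 20 seconds is not repeated at 30 seconds, but is raised again at 80 seconds once critical_repeat (60 seconds) has elapsed.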

Chainlink Node Alerts

| Alert Class | Severity Thresholds | Severity | Configurable | Can be Enabled/Disabled | Description |
| --- | --- | --- | --- | --- | --- |
| NoChangeInHeightAlert | WARNING, CRITICAL | | | | There is no change in height for warning and critical time thresholds. |
| BlockHeightUpdatedAlert | | INFO | | Depends on NoChangeInHeightAlert | There is a change in height after warning or critical alerts of type NoChangeInHeightAlert have been raised. |
| NoChangeInTotalHeadersReceivedAlert | WARNING, CRITICAL | | | | There is no change in total headers received for warning and critical time thresholds. |
| ReceivedANewHeaderAlert | | INFO | | Depends on NoChangeInTotalHeadersReceivedAlert | There is a change in total headers received after warning or critical alerts of type NoChangeInTotalHeadersReceivedAlert have been raised. |
| MaxUnconfirmedBlocksIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | The number of max unconfirmed blocks passed the warning or critical block amount thresholds. |
| MaxUnconfirmedBlocksDecreasedBelowThresholdAlert | | INFO | | Depends on MaxUnconfirmedBlocksIncreasedAboveThresholdAlert | The number of max unconfirmed blocks, which was previously above warning or critical thresholds, is now below them. |
| ChangeInSourceNodeAlert | | WARNING | | | A node goes down and another node takes its place and begins operating. |
| GasBumpIncreasedOverNodeGasPriceLimitAlert | | CRITICAL | | | The gas bump increases over the node gas price limit. This alert doesn't repeat and only alerts once per instance of increase. |
| NoOfUnconfirmedTxsIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | The number of unconfirmed transactions being sent by the node has surpassed warning or critical thresholds. |
| NoOfUnconfirmedTxsDecreasedBelowThresholdAlert | | INFO | | Depends on NoOfUnconfirmedTxsIncreasedAboveThresholdAlert | The number of unconfirmed transactions has decreased below warning or critical thresholds. |
| TotalErroredJobRunsIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | The number of total errored job runs increased above warning or critical thresholds. |
| TotalErroredJobRunsDecreasedBelowThresholdAlert | | INFO | | Depends on TotalErroredJobRunsIncreasedAboveThresholdAlert | The number of total errored job runs decreased below warning or critical thresholds. |
| BalanceIncreasedAboveThresholdAlert | | INFO | | Depends on BalanceDecreasedBelowThresholdAlert | The account balance increases above warning or critical thresholds. |
| BalanceDecreasedBelowThresholdAlert | WARNING, CRITICAL | | | | The account balance decreases below warning or critical thresholds. |
| BalanceToppedUpAlert | | INFO | | | Raised when the account balance is topped up. |
| InvalidUrlAlert | | ERROR | | | The URL is unreachable, most likely due to an invalid configuration. |
| ValidUrlAlert | | INFO | | | The monitors manage to connect to a valid URL. |
| PrometheusSourceIsDownAlert | | WARNING | | | The URL given for the Prometheus endpoint is unreachable. |
| PrometheusSourceBackUpAgainAlert | | INFO | | | The URL given for the Prometheus endpoint is now reachable after being unreachable. |
| NodeWentDownAtAlert | WARNING, CRITICAL | | | | All endpoints of a node are unreachable, classifying the node as down. |
| NodeBackUpAgainAlert | | INFO | | Depends on NodeWentDownAtAlert | Valid endpoints have been found, meaning that the node is now reachable. |
| NodeStillDownAlert | | CRITICAL | | Depends on NodeWentDownAtAlert | If a node has been classified as down for some time, this alert will keep repeating for a period until it is back up again. |
| MetricNotFoundErrorAlert | | ERROR | | | The endpoint had its Prometheus data changed, therefore PANIC cannot find the correct metrics to read. Either the wrong endpoint was given or PANIC needs updating. |
| MetricFoundAlert | | INFO | | | Raised when a MetricNotFoundErrorAlert was previously raised and PANIC has now managed to locate the metric at the Prometheus endpoint. |

Chainlink Contract Alerts

| Alert Class | Severity Thresholds | Severity | Configurable | Can be Enabled/Disabled | Description |
| --- | --- | --- | --- | --- | --- |
| PriceFeedObservationsMissedIncreasedAboveThreshold | WARNING, CRITICAL | | | | The number of missed price feed observations increased above thresholds. |
| PriceFeedObservedAgain | | INFO | | Depends on PriceFeedObservationsMissedIncreasedAboveThreshold | A Chainlink node starts to observe price feeds again. |
| PriceFeedDeviationInreasedAboveThreshold | WARNING, CRITICAL | | | | The price feed observation submitted deviates from the consensus above thresholds. |
| PriceFeedDeviationDecreasedBelowThreshold | | INFO | | Depends on PriceFeedDeviationInreasedAboveThreshold | The Chainlink node's price feed submissions are no longer deviating from consensus. |
| ConsensusFailure | | WARNING | | | The price feed our Chainlink node submits to doesn't reach a consensus. |
| ErrorContractsNotRetrieved | | ERROR | | | Weiwatchers isn't available, therefore contracts cannot be retrieved. |
| ContractsNowRetrieved | | INFO | | | Weiwatchers is available again, therefore contracts can be retrieved. |
| ErrorNoSyncedDataSources | | ERROR | | | No EVM nodes are available to retrieve data from. |
| SyncedDataSourcesFound | | INFO | | | Synced EVM nodes are found and contract data can be retrieved again. |

EVM Node Alerts

| Alert Class | Severity Thresholds | Severity | Configurable | Can be Enabled/Disabled | Description |
| --- | --- | --- | --- | --- | --- |
| NoChangeInBlockHeight | WARNING, CRITICAL | | | | There hasn't been a change in node block height over a period of time. |
| BlockHeightUpdatedAlert | | INFO | | Depends on NoChangeInBlockHeight | The EVM node starts to update its block height. |
| BlockHeightDifferenceIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | The block height difference between multiple EVM nodes increased above thresholds. |
| BlockHeightDifferenceDecreasedBelowThresholdAlert | | INFO | | Depends on BlockHeightDifferenceIncreasedAboveThresholdAlert | The difference between EVM nodes' block heights decreased below thresholds. |
| InvalidUrlAlert | | ERROR | | | The EVM node URL is invalid. |
| ValidUrlAlert | | INFO | | | The EVM node URL is found after being invalid. |
| NodeWentDownAtAlert | WARNING, CRITICAL | | | | The EVM node is unreachable. |
| NodeBackUpAgainAlert | | INFO | | Depends on NodeWentDownAtAlert | The EVM node is back up again. |
| NodeStillDownAlert | | CRITICAL | | ✓ but depends on NodeWentDownAtAlert | The EVM node is still detected as down after a period of time. |

Cosmos Node Alerts

| Alert Class | Severity Thresholds | Severity | Configurable | Can be Enabled/Disabled | Description |
| --- | --- | --- | --- | --- | --- |
| NodeWentDownAtAlert | WARNING, CRITICAL | | | | All endpoints of a node are unreachable, classifying the node as down. |
| NodeBackUpAgainAlert | | INFO | | Depends on NodeWentDownAtAlert | Some node endpoints are accessible again, meaning that the node is now reachable. |
| NodeStillDownAlert | | CRITICAL | | ✓ but depends on NodeWentDownAtAlert | If a node has been classified as down for some time, this alert will keep repeating for a period until it is back up again. |
| ValidatorWasSlashedAlert | | CRITICAL | | | Validator has been slashed. |
| NodeIsSyncingAlert | INFO, WARNING | | | | Node or validator is syncing. |
| NodeIsNoLongerSyncingAlert | | INFO | | Depends on NodeIsSyncingAlert | Node or validator is no longer syncing. |
| NodeIsPeeredWithSentinelAlert | | INFO | | | Node or validator is peered with the sentinel (this is only relevant for mev-tendermint nodes). |
| NodeIsNotPeeredWithSentinelAlert | | INFO | | Depends on NodeIsPeeredWithSentinelAlert | Node or validator is not peered with the sentinel. |
| ValidatorIsNotActiveAlert | | CRITICAL | | | Validator is not active in the current consensus session. |
| ValidatorIsActiveAlert | | INFO | | Depends on ValidatorIsNotActiveAlert | Validator is active in the current consensus session after not being active in a previous consensus session. |
| ValidatorIsJailedAlert | | CRITICAL | | | Validator is jailed. |
| ValidatorIsNoLongerJailedAlert | | INFO | | Depends on ValidatorIsJailedAlert | Validator is no longer jailed. |
| BlocksMissedIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | The number of missed block signatures increased above warning or critical thresholds. |
| BlocksMissedDecreasedBelowThresholdAlert | | INFO | | Depends on BlocksMissedIncreasedAboveThresholdAlert | The number of missed block signatures decreased below warning or critical thresholds. |
| NoChangeInHeightAlert | WARNING, CRITICAL | | | | There hasn't been a change in node block height over a period of time. |
| BlockHeightUpdatedAlert | | INFO | | Depends on NoChangeInHeightAlert | The Cosmos node starts to update its block height. |
| BlockHeightDifferenceIncreasedAboveThresholdAlert | WARNING, CRITICAL | | | | The block height difference between multiple Cosmos nodes increased above thresholds. |
| BlockHeightDifferenceDecreasedBelowThresholdAlert | | INFO | | Depends on BlockHeightDifferenceIncreasedAboveThresholdAlert | The difference between Cosmos nodes' block heights decreased below thresholds. |
| PrometheusInvalidUrlAlert | | ERROR | | | A node's provided Prometheus endpoint has an invalid URL schema. |
| PrometheusValidUrlAlert | | INFO | | | A node's provided Prometheus endpoint is valid after PrometheusInvalidUrlAlert is raised. |
| CosmosRestInvalidUrlAlert | | ERROR | | | A node's provided Cosmos REST endpoint has an invalid URL schema. |
| CosmosRestValidUrlAlert | | INFO | | | A node's provided Cosmos REST endpoint is valid after CosmosRestInvalidUrlAlert is raised. |
| TendermintRPCInvalidUrlAlert | | ERROR | | | A node's provided Tendermint RPC endpoint has an invalid URL schema. |
| TendermintRPCValidUrlAlert | | INFO | | | A node's provided Tendermint RPC endpoint is valid after TendermintRPCInvalidUrlAlert is raised. |
| PrometheusSourceIsDownAlert | WARNING, CRITICAL | | | | A node's provided Prometheus endpoint is unreachable. |
| PrometheusSourceStillDownAlert | | CRITICAL | | ✓ but depends on PrometheusSourceIsDownAlert | If a node's Prometheus endpoint has been classified as down for some time, this alert will keep repeating for a period until it is back up again. |
| PrometheusSourceBackUpAgainAlert | | INFO | | Depends on PrometheusSourceIsDownAlert | A node's provided Prometheus endpoint is no longer unreachable. |
| CosmosRestSourceIsDownAlert | WARNING, CRITICAL | | | | The node's provided Cosmos REST endpoint is unreachable. |
| CosmosRestSourceStillDownAlert | | CRITICAL | | ✓ but depends on CosmosRestSourceIsDownAlert | If a node's Cosmos REST endpoint has been classified as down for some time, this alert will keep repeating for a period until it is back up again. |
| CosmosRestSourceBackUpAgainAlert | | INFO | | Depends on CosmosRestSourceIsDownAlert | A node's provided Cosmos REST endpoint is no longer unreachable. |
| TendermintRPCSourceIsDownAlert | WARNING, CRITICAL | | | | The node's provided Tendermint RPC endpoint is unreachable. |
| TendermintRPCSourceStillDownAlert | | CRITICAL | | ✓ but depends on TendermintRPCSourceIsDownAlert | If a node's Tendermint RPC endpoint has been classified as down for some time, this alert will keep repeating for a period until it is back up again. |
| TendermintRPCSourceBackUpAgainAlert | | INFO | | Depends on TendermintRPCSourceIsDownAlert | A node's provided Tendermint RPC endpoint is no longer unreachable. |
| ErrorNoSyncedCosmosRestDataSourcesAlert | | ERROR | | | No synced Cosmos node was available as a Cosmos REST data source. |
| SyncedCosmosRestDataSourcesFoundAlert | | INFO | | | PANIC found a Cosmos node that could act as a Cosmos REST data source again. |
| ErrorNoSyncedTendermintRPCDataSourcesAlert | | ERROR | | | No synced Cosmos node was available as a Tendermint-RPC data source. |
| SyncedTendermintRPCDataSourcesFoundAlert | | INFO | | | PANIC found a Cosmos node that could act as a Tendermint-RPC data source again. |
CosmosRestServerDataCouldNotBeObtainedAlertERRORCould not obtain data from Cosmos REST for a given node.
CosmosRestServerDataObtainedAlertINFOObtained data from Cosmos REST for a given node after CosmosRestServerDataCouldNotBeObtainedAlert is raised.
TendermintRPCDataCouldNotBeObtainedAlertERRORCould not obtain data from Tendermint RPC for a given node.
TendermintRPCDataObtainedAlertINFOObtained data from Tendermint RPC for a given node after TendermintRPCDataCouldNotBeObtainedAlert is raised.
MetricNotFoundErrorAlertERRORA node's prometheus data changed therefore PANIC cannot find the correct metrics to read. Either the wrong endpoint was given or PANIC needs updating.
MetricFoundAlertINFOManaged to locate the metric which was previously not found at the prometheus endpoint.
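
The paired threshold alerts in this table (for example BlocksMissedIncreasedAboveThresholdAlert and BlocksMissedDecreasedBelowThresholdAlert) can be pictured as a small state machine. The sketch below is illustrative only and is not taken from PANIC's code: the function names, the threshold values, and the rule that reaching a threshold counts as crossing it are all assumptions made for the example.

```python
from typing import Optional, Tuple

# Hypothetical thresholds for illustration; in practice the operator sets them.
WARNING_THRESHOLD = 10
CRITICAL_THRESHOLD = 50

# Ordering used to tell whether severity went up or down.
SEVERITY_RANK = {None: 0, "WARNING": 1, "CRITICAL": 2}

Alert = Tuple[str, str]  # (alert class name, severity)


def classify(missed_blocks: int) -> Optional[str]:
    """Return the highest severity whose threshold has been reached."""
    # Assumption: a value equal to the threshold already counts as crossing it.
    if missed_blocks >= CRITICAL_THRESHOLD:
        return "CRITICAL"
    if missed_blocks >= WARNING_THRESHOLD:
        return "WARNING"
    return None


def evaluate(previous: Optional[str],
             missed_blocks: int) -> Tuple[Optional[str], Optional[Alert]]:
    """Compare the new severity with the previous one.

    Returns the new severity and, if the severity changed, the alert to raise.
    """
    current = classify(missed_blocks)
    if SEVERITY_RANK[current] > SEVERITY_RANK[previous]:
        return current, ("BlocksMissedIncreasedAboveThresholdAlert", current)
    if SEVERITY_RANK[current] < SEVERITY_RANK[previous]:
        # The recovery alert can only follow an earlier "increased" alert,
        # which is why the table lists it as dependent.
        return current, ("BlocksMissedDecreasedBelowThresholdAlert", "INFO")
    return current, None
```

Calling `evaluate(None, 12)` in this sketch yields a WARNING alert, and a later `evaluate("CRITICAL", 5)` yields the dependent INFO alert, which mirrors the "Depends on" entries in the table.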

Cosmos Network Alerts

Alert Class | Severity Thresholds | Severity | Configurable | Can be Enabled/Disabled | Description
NewProposalSubmittedAlert | INFO | A new proposal has been submitted to the governance forum.
ProposalConcludedAlert | INFO | A governance proposal has concluded. The result is also returned.
ErrorNoSyncedCosmosRestDataSourcesAlert | ERROR | No synced Cosmos node was available as a Cosmos REST data source.
SyncedCosmosRestDataSourcesFoundAlert | INFO | PANIC found a Cosmos node that could act as a Cosmos REST data source again.
CosmosNetworkDataCouldNotBeObtainedAlert | ERROR | Could not obtain network data using the given nodes.
CosmosNetworkDataObtainedAlert | INFO | Obtained network data using a given node after CosmosNetworkDataCouldNotBeObtainedAlert is raised.

Substrate Node Alerts

Alert Class | Severity Thresholds | Severity | Configurable | Can be Enabled/Disabled | Description
NodeWentDownAtAlert | WARNING, CRITICAL | Web-socket of a node is unreachable, classifying the node as down.
NodeBackUpAgainAlert | INFO | Depends on NodeWentDownAtAlert | Web-socket is accessible again, meaning that the node is now reachable.
NodeStillDownAlert | CRITICAL | ✓ but depends on NodeWentDownAtAlert | If a node has been classified as down for some time, this alert repeats periodically until the node is back up again.
NoChangeInBestBlockHeightAlert | WARNING, CRITICAL | There hasn't been a change in the node's best block height over a period of time.
BestBlockHeightUpdatedAlert | INFO | Depends on NoChangeInBestBlockHeightAlert | Substrate node starts to update its best block height.
NoChangeInFinalizedBlockHeightAlert | WARNING, CRITICAL | There hasn't been a change in the node's finalized block height over a period of time.
FinalizedBlockHeightUpdatedAlert | INFO | Depends on NoChangeInFinalizedBlockHeightAlert | Substrate node starts to update its finalized block height.
NodeIsSyncingAlert | WARNING, CRITICAL | Node or validator is syncing. The difference between the target height and the node's best block height exceeded the threshold.
NodeIsNoLongerSyncingAlert | INFO | Depends on NodeIsSyncingAlert | Node or validator is no longer syncing.
ValidatorIsNotActiveAlert | WARNING | Validator is not in the active set of validators.
ValidatorIsActiveAlert | INFO | Depends on ValidatorIsNotActiveAlert | Validator is in the active set of validators after previously not being in the active set of validators.
ValidatorIsDisabledAlert | CRITICAL | Validator is disabled.
ValidatorIsNoLongerDisabledAlert | INFO | Depends on ValidatorIsDisabledAlert | Validator is no longer disabled.
ValidatorWasNotElectedAlert | WARNING | Validator was not elected for the next session.
ValidatorWasElectedAlert | INFO | Depends on ValidatorWasNotElectedAlert | Validator was elected for the next session after previously not being elected.
ValidatorBondedAmountChangedAlert | INFO | The bonded amount of a validator changed.
ValidatorNoHeartbeatAndBlockAuthoredYetAlert | WARNING, CRITICAL | Validator did not send a heartbeat and did not author a block in a session after the session has been ongoing for a period.
ValidatorHeartbeatSentOrBlockAuthoredAlert | INFO | Depends on ValidatorNoHeartbeatAndBlockAuthoredYetAlert | Validator sent a heartbeat or authored a block in a session after ValidatorNoHeartbeatAndBlockAuthoredYetAlert is raised.
ValidatorWasOfflineAlert | CRITICAL | An offline event was generated for a validator.
ValidatorWasSlashedAlert | CRITICAL | Validator was slashed.
ValidatorPayoutNotClaimedAlert | WARNING, CRITICAL | Validator has not claimed a payout within a threshold number of eras after the payout became available.
ValidatorPayoutClaimedAlert | INFO | Depends on ValidatorPayoutNotClaimedAlert | Validator claimed a payout.
ValidatorControllerAddressChangedAlert | WARNING | The controller address of a validator changed.
ErrorNoSyncedSubstrateWebSocketDataSourcesAlert | ERROR | No synced Substrate node was available as a web-socket data source.
SyncedSubstrateWebSocketDataSourcesFoundAlert | INFO | PANIC found a Substrate node that could act as a web-socket data source again.
SubstrateWebSocketDataCouldNotBeObtainedAlert | ERROR | Could not obtain data from web-socket for a given node.
SubstrateWebSocketDataObtainedAlert | INFO | Obtained data from web-socket for a given node after SubstrateWebSocketDataCouldNotBeObtainedAlert is raised.
SubstrateApiIsNotReachableAlert | ERROR | Could not reach the Substrate API. This probably means that the Substrate API container is not running.
SubstrateApiIsReachableAlert | INFO | Managed to reach the Substrate API after SubstrateApiIsNotReachableAlert is raised.
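
The relationship between NodeWentDownAtAlert, NodeStillDownAlert and NodeBackUpAgainAlert can be sketched as a downtime tracker. This is a simplified illustration rather than PANIC's actual logic: the class name, the timing constants, and the choice to raise the WARNING as soon as the node is first seen down are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical settings, in seconds.
CRITICAL_AFTER = 300.0   # downtime before escalating to CRITICAL
CRITICAL_REPEAT = 600.0  # interval between "still down" reminders


@dataclass
class DowntimeTracker:
    """Tracks one node's reachability and decides which alerts to raise."""
    went_down_at: Optional[float] = None
    last_critical_at: Optional[float] = None

    def update(self, reachable: bool, now: float) -> List[str]:
        alerts: List[str] = []
        if reachable:
            # A recovery alert is only raised if a down alert came first.
            if self.went_down_at is not None:
                alerts.append("NodeBackUpAgainAlert (INFO)")
            self.went_down_at = None
            self.last_critical_at = None
            return alerts

        if self.went_down_at is None:
            # First failed check: remember when the downtime started.
            self.went_down_at = now
            alerts.append("NodeWentDownAtAlert (WARNING)")
        elif self.last_critical_at is None:
            # Still down: escalate once the critical threshold has elapsed.
            if now - self.went_down_at >= CRITICAL_AFTER:
                self.last_critical_at = now
                alerts.append("NodeWentDownAtAlert (CRITICAL)")
        elif now - self.last_critical_at >= CRITICAL_REPEAT:
            # Already critical: repeat the reminder at the configured interval.
            self.last_critical_at = now
            alerts.append("NodeStillDownAlert (CRITICAL)")
        return alerts
```

Feeding the tracker one reachability check at a time (`tracker.update(False, now)`) produces at most one alert per check, and the repeated reminder stops as soon as a check succeeds.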

Substrate Network Alerts

Alert Class | Severity Thresholds | Severity | Configurable | Can be Enabled/Disabled | Description
GrandpaIsStalledAlert | WARNING | Alert is raised when GRANDPA is stalled.
GrandpaIsNoLongerStalledAlert | INFO | Depends on GrandpaIsStalledAlert | Alert is raised when GRANDPA is no longer stalled.
NewProposalSubmittedAlert | INFO | A new proposal has been submitted in the network.
NewReferendumSubmittedAlert | INFO | A new referendum has been submitted in the network.
ReferendumConcludedAlert | INFO | A governance referendum has concluded. The final result is also returned.
ErrorNoSyncedSubstrateWebSocketDataSourcesAlert | ERROR | No synced Substrate node was available as a web-socket data source.
SyncedSubstrateWebSocketDataSourcesFoundAlert | INFO | PANIC found a Substrate node that could act as a web-socket data source again.
SubstrateNetworkDataCouldNotBeObtainedAlert | ERROR | Could not obtain network data from web-socket for a given node.
SubstrateNetworkDataObtainedAlert | INFO | Obtained network data from web-socket for a given node after SubstrateNetworkDataCouldNotBeObtainedAlert is raised.
SubstrateApiIsNotReachableAlert | ERROR | Could not reach the Substrate API. This probably means that the Substrate API container is not running.
SubstrateApiIsReachableAlert | INFO | Managed to reach the Substrate API after SubstrateApiIsNotReachableAlert is raised.

GitHub Repository Alerts

Alert Class | Severity | Configurable | Can be Enabled/Disabled | Description
NewGitHubReleaseAlert | INFO | A new release is published for a GitHub repository. Some release details are also given. Note: this alert cannot be enabled/disabled unless the operator decides not to monitor a repo altogether.
CannotAccessGitHubPageAlert | ERROR | Alerter cannot access the GitHub repository's Releases API page.
GitHubPageNowAccessibleAlert | INFO | Alerter is able to access the GitHub repository's Releases API page after a CannotAccessGitHubPageAlert is raised.
GitHubAPICallErrorAlert | ERROR | The GitHub Releases API call fails.
GitHubAPICallErrorResolvedAlert | INFO | Alerter no longer detects errors related to the GitHub API call.

DockerHub Repository Alerts

Alert Class | Severity | Configurable | Can be Enabled/Disabled | Description
DockerHubNewTagAlert | INFO | A new tag is published for a DockerHub repository. The new tag is also given. Note: this alert cannot be enabled/disabled unless the operator decides not to monitor a repo altogether.
DockerHubUpdatedTagAlert | INFO | An existing tag for a DockerHub repository is updated. The updated tag is also given. Note: this alert cannot be enabled/disabled unless the operator decides not to monitor a repo altogether.
DockerHubDeletedTagAlert | INFO | An existing tag for a DockerHub repository is deleted. The deleted tag is also given. Note: this alert cannot be enabled/disabled unless the operator decides not to monitor a repo altogether.
CannotAccessDockerHubPageAlert | ERROR | Alerter cannot access the DockerHub API.
DockerHubPageNowAccessibleAlert | INFO | Alerter is able to access the DockerHub API after a CannotAccessDockerHubPageAlert is raised.
DockerHubTagsAPICallErrorAlert | ERROR | The DockerHub Tags API call fails.
DockerHubTagsAPICallErrorResolvedAlert | INFO | Alerter no longer detects errors related to the DockerHub Tags API call.
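
The three tag alerts above amount to comparing two snapshots of a repository's tags. The following is a minimal sketch under stated assumptions, not PANIC's implementation: `diff_tags` is a hypothetical helper, and each snapshot is assumed to map a tag name to its last-updated timestamp as a string.

```python
from typing import Dict, List, Tuple


def diff_tags(previous: Dict[str, str],
              current: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two snapshots mapping tag name -> last-updated timestamp.

    Returns (alert class name, tag) pairs for every detected change.
    """
    alerts: List[Tuple[str, str]] = []
    # Tags present now but not before are new.
    for tag in sorted(current.keys() - previous.keys()):
        alerts.append(("DockerHubNewTagAlert", tag))
    # Tags present in both snapshots with a different timestamp were updated.
    for tag in sorted(current.keys() & previous.keys()):
        if current[tag] != previous[tag]:
            alerts.append(("DockerHubUpdatedTagAlert", tag))
    # Tags present before but missing now were deleted.
    for tag in sorted(previous.keys() - current.keys()):
        alerts.append(("DockerHubDeletedTagAlert", tag))
    return alerts


previous_snapshot = {"1.0": "2023-01-01", "1.1": "2023-01-10", "old": "2022-06-01"}
current_snapshot = {"1.0": "2023-01-01", "1.1": "2023-01-20", "2.0": "2023-01-24"}
changes = diff_tags(previous_snapshot, current_snapshot)
```

With the example snapshots, `changes` contains one new tag (`2.0`), one updated tag (`1.1`) and one deleted tag (`old`); comparing a snapshot with itself yields no alerts.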
