Prometheus Alertmanager

If your Prometheus server is configured to send alerts to Alertmanager, you need to configure alert rules to be notified when issues occur. To simplify this process, MetricsHub provides the following alert rules, which you can tailor to your specific needs:

Alert Rules: MetricsHub
When to Use: Always
Alerts Triggered When:
  • A host cannot be reached
  • A connector has failed
  • A protocol has failed
  • The MetricsHub Agent is not sending metrics

Alert Rules: Hardware
When to Use: When hardware monitoring is performed
Alerts Triggered When:
  • Battery charge is critically or abnormally low
  • Devices report high error rates (e.g., CPU, memory, disks, network)
  • Fan speed is too low
  • LUN has too few or no available paths
  • Network card error ratio is high
  • Physical disk endurance is low
  • Power supply usage is abnormally high
  • Temperature or voltage is out of range
  • A hardware device is missing, degraded, predicted to fail, or failing

Alert Rules: System
When to Use: When system monitoring is performed
Alerts Triggered When:
  • CPU usage, file system utilization, memory usage, or bandwidth usage is abnormally high
  • Too many network errors are detected
  • A high page fault rate persists over an extended period of time

Notes:

  • These alert rules are distinct from the internal alerts generated by MetricsHub and emitted as OpenTelemetry logs. The alert rules described on this page are managed exclusively by Prometheus Alertmanager.
  • To see alert descriptions, you must use the full Prometheus Alertmanager interface (usually available on port 9093). The simple web UI bundled with Prometheus does not display this additional alert information.

Alert Rules Thresholds

The alert rules rely on two types of thresholds:

  • Static thresholds: Used when the same threshold applies to all devices (e.g., battery charge). The alert rule compares the metric to a fixed, hardcoded value.
  • Dynamic thresholds: Used when thresholds vary across devices (e.g., temperature or fan speed). In this case, two additional metrics define the warning and critical thresholds. The alert rules compare the base metric to the corresponding threshold metrics.

Static Threshold Example

For the hw_battery_charge_ratio metric:

  • a warning alert is triggered when the battery charge stays at or below 0.5 (50%) for 5 minutes
  • a critical alert is triggered when the battery charge stays below 0.3 (30%) for 5 minutes
  • both warning and critical alerts are triggered when the value is below 0.3, since both conditions are then met.

    - name: MetricsHub-Hardware-Battery-Charge
      rules:
      - alert: MetricsHub-Hardware-Battery-Charge-Warning
        expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 <= 50
        for: 5m
        labels:
          severity: warning

      - alert: MetricsHub-Hardware-Battery-Charge-Critical
        expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 < 30
        for: 5m
        labels:
          severity: critical
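
Tailoring a static threshold usually amounts to copying a rule and changing the constant. The following is a minimal sketch, not a rule shipped with MetricsHub: the group name, the alert name, the 0.2 threshold, and the annotation texts are all hypothetical. It assumes the fragment is stored in a rules file under the top-level groups key. The annotations illustrate the kind of descriptive text that the Alertmanager interface can display.

```yaml
groups:
  - name: Custom-Battery-Charge                  # hypothetical group name
    rules:
      - alert: Custom-Battery-Charge-Very-Low    # hypothetical alert name
        # Same pattern as the rules above: keep only non-negative readings,
        # then compare the ratio to a fixed value (here 0.2, i.e. 20%).
        expr: hw_battery_charge_ratio >= 0 and hw_battery_charge_ratio < 0.2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Battery charge is very low
          # $value holds the metric value that satisfied the expression
          description: 'Battery charge is at {{ $value | humanizePercentage }}.'
```

Because the threshold is a literal in the expression, changing it for a subset of devices would require either a separate rule with a label selector or a dynamic threshold, as described next.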

Dynamic Threshold Example

For the hw_temperature_celsius metric:

  • a warning alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.degraded"}
  • a critical alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.critical"}

    - name: Temperature
      rules:
      - alert: Temperature-High-Warning
        expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.degraded"}
        labels:
          severity: warning

      - alert: Temperature-High-Critical
        expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.critical"}
        labels:
          severity: critical

The table below summarizes the metrics that should be compared to their corresponding dynamic threshold metrics:

Base Metric                       Dynamic Threshold Metrics
rate(hw_errors_total[1h])         ignoring(limit_type) hw_errors_limit{limit_type="degraded"}
                                  ignoring(limit_type) hw_errors_limit{limit_type="critical"}
hw_fan_speed_rpm                  ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.degraded"}
                                  ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.critical"}
hw_fan_speed_ratio                ignoring(limit_type) hw_fan_speed_ratio_limit{limit_type="low.degraded"}
                                  ignoring(limit_type) hw_fan_speed_ratio_limit{limit_type="low.critical"}
hw_lun_paths{type="available"}    ignoring(limit_type) hw_lun_paths_limit{limit_type="low.degraded"}
hw_network_error_ratio            ignoring(limit_type) hw_network_error_ratio_limit{limit_type="degraded"}
                                  ignoring(limit_type) hw_network_error_ratio_limit{limit_type="critical"}
hw_other_device_uses              ignoring(limit_type) hw_other_device_uses_limit{limit_type="degraded"}
                                  ignoring(limit_type) hw_other_device_uses_limit{limit_type="critical"}
hw_other_device_value             ignoring(limit_type) hw_other_device_value_limit{limit_type="degraded"}
                                  ignoring(limit_type) hw_other_device_value_limit{limit_type="critical"}
hw_temperature_celsius            ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.degraded"}
                                  ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.critical"}
hw_voltage_volts                  ignoring(limit_type) hw_voltage_limit_volts{limit_type="low.critical"}
                                  ignoring(limit_type) hw_voltage_limit_volts{limit_type="high.critical"}
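
To illustrate how a row of this table might translate into rules, here is a sketch for hw_fan_speed_rpm. The group and alert names are hypothetical, and the <= operator is an assumption for low.* limits (the alert fires when the speed falls to or below the limit), mirroring the >= used for the high.* temperature limits. The rules actually shipped with MetricsHub may differ.

```yaml
groups:
  - name: Fan-Speed                        # hypothetical group name
    rules:
      - alert: Fan-Speed-Low-Warning       # hypothetical alert name
        # ignoring(limit_type) lets the base metric, which has no limit_type
        # label, match the threshold series that carries one.
        expr: hw_fan_speed_rpm <= ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.degraded"}
        labels:
          severity: warning

      - alert: Fan-Speed-Low-Critical      # hypothetical alert name
        expr: hw_fan_speed_rpm <= ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.critical"}
        labels:
          severity: critical
```

With this kind of matching, a device that reports no threshold metric produces no result on the right-hand side, so it never triggers the alert.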

Install

To activate the alert rules:

  1. Copy the required configuration files into your Prometheus installation folder:

    • config/metricshub-rules.yaml
    • config/metricshub-hardware-rules.yaml
    • config/metricshub-system-rules.yaml
  2. Declare them in the prometheus.yaml file:

    rule_files:
    - metricshub-rules.yaml
    - metricshub-hardware-rules.yaml
    - metricshub-system-rules.yaml
  3. Restart your Prometheus server to take the new rules into account.
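
For reference, the relevant parts of prometheus.yaml could look like the sketch below once the steps above are completed. The Alertmanager target localhost:9093 is an assumption (Alertmanager running on the same host, on its usual port); adjust it to your environment.

```yaml
# prometheus.yaml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093   # assumption: local Alertmanager on its usual port

rule_files:
  - metricshub-rules.yaml
  - metricshub-hardware-rules.yaml
  - metricshub-system-rules.yaml
```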