NVIDIA DCGM Exporter (Prometheus)

Description

This connector scrapes NVIDIA dcgm-exporter Prometheus metrics over HTTP(S).

Tags: gpu, hardware, nvidia

Enterprise Connector

This connector requires the Enterprise edition of MetricsHub.

Target

Typical platform: NVIDIA

Operating system: Out-Of-Band

This connector is not available for the local host (it is applicable to remote hosts only).

Prerequisites

Leverages: NVIDIA DCGM Exporter

Technology and protocols: HTTP

Examples

CLI

metricshub HOSTNAME -t oob -c +NvidiaDCGMExporter --https --http-port 443 -u USERNAME

metricshub.yaml

resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Change with actual host name
          host.type: oob
        connectors: [ +NvidiaDCGMExporter ] # Optional, to load only this connector
        protocols:
          http:
            https: true
            port: 443 # Adjust if the exporter listens on a different port
            username: <USERNAME> # Change with actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt

Connector Activation Criteria

The NVIDIA DCGM Exporter (Prometheus) connector is automatically activated, and its status is reported as OK, when all of the following criteria are met:

  • The HTTP request below to the managed host succeeds:
    • GET /metrics
    • The response body contains: DCGM_FI_DEV_GPU_UTIL (regex)
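The activation check above can be reproduced by hand. HOSTNAME and USERNAME below are placeholders for your own values; the simulated response body is a sample exporter line, not real output from your host:

```shell
# Manual sketch of the connector's activation check:
#
#   curl -sk -u "$USERNAME" "https://$HOSTNAME:443/metrics"
#
# The connector reports OK when the response body matches the regex
# DCGM_FI_DEV_GPU_UTIL. Simulated here against a sample exporter line:
body='DCGM_FI_DEV_GPU_UTIL{gpu="0"} 37'
if printf '%s\n' "$body" | grep -qE 'DCGM_FI_DEV_GPU_UTIL'; then
  echo "activation criteria met"
fi
```

If the curl command returns metrics but the connector stays inactive, confirm the regex actually matches the body, since some environments expose only a subset of DCGM fields.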

Metrics

Type: gpu

Collected metrics:
  • hw.errors{hw.type="gpu", hw.error.type="ecc-memory-corrected", hw.error.severity="non-critical"}
  • hw.errors{hw.type="gpu", hw.error.type="ecc-memory-uncorrected", hw.error.severity="critical"}
  • hw.errors{hw.type="gpu", hw.error.type="xid"}
  • hw.gpu.clock{hw.type="graphics"}
  • hw.gpu.clock{hw.type="memory"}
  • hw.gpu.clock{hw.type="sm"}
  • hw.gpu.io{hw.gpu.direction="receive", hw.gpu.interface="pcie"}
  • hw.gpu.io{hw.gpu.direction="transmit", hw.gpu.interface="pcie"}
  • hw.gpu.io{hw.gpu.interface="nvlink"}
  • hw.gpu.memory.limit
  • hw.gpu.memory.usage{hw.gpu.memory.state="free"}
  • hw.gpu.memory.usage{hw.gpu.memory.state="used"}
  • hw.gpu.utilization{hw.gpu.task="decoder"}
  • hw.gpu.utilization{hw.gpu.task="encoder"}
  • hw.gpu.utilization{hw.gpu.task="general"}
  • hw.power.limit{hw.type="gpu"}
  • hw.power{hw.type="gpu"}
  • hw.status{hw.type="gpu", state="present"}
  • hw.temperature{sensor_location="gpu"}
  • hw.temperature{sensor_location="gpu-memory"}

Specific attributes:
  • host.name
  • hw.gpu.index
  • hw.gpu.model
  • hw.gpu.nvidia.mig_instance_id
  • hw.gpu.pci.bus_id
  • hw.gpu.uuid
  • hw.id
  • hw.name
  • id
  • name
  • type
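For orientation, a raw dcgm-exporter exposition line typically looks like the fragment below (Prometheus text format; the label values are illustrative, not from a real host). Labels such as gpu, UUID, and modelName are the likely sources of attributes like hw.gpu.index, hw.gpu.uuid, and hw.gpu.model, though the exact mapping is internal to the connector:

```
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-<uuid>",device="nvidia0",modelName="NVIDIA A100-SXM4-40GB",Hostname="node1"} 37
```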