NVIDIA DCGM Exporter (Prometheus)
Description
This connector scrapes NVIDIA dcgm-exporter Prometheus metrics over HTTP(S).
Enterprise Connector
This connector requires the Enterprise edition of MetricsHub.
Target
Typical platform: NVIDIA
Operating system: Out-Of-Band
This connector is not available for the local host (it is applicable to remote hosts only).
Prerequisites
Leverages: NVIDIA DCGM Exporter
Technology and protocols: HTTP
Examples
CLI
metricshub HOSTNAME -t oob -c +NvidiaDCGMExporter --https --http-port 443 -u USERNAME
metricshub.yaml
resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Replace with the actual host name
          host.type: oob
        connectors: [ +NvidiaDCGMExporter ] # Optional, to load only this connector
        protocols:
          http:
            https: true
            port: 443 # Replace with the port on which dcgm-exporter is exposed
            username: <USERNAME> # Replace with actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt
Connector Activation Criteria
The NVIDIA DCGM Exporter (Prometheus) connector is activated automatically, and its status is reported as OK, if all of the criteria below are met:
- The HTTP request below to the managed host succeeds:
  - GET /metrics
  - The response body contains: DCGM_FI_DEV_GPU_UTIL (regex)
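The activation check above can be approximated outside MetricsHub to confirm that an endpoint is likely to qualify. The following is a minimal sketch, not the connector's actual implementation: the function names `body_matches` and `endpoint_qualifies` are hypothetical, and it assumes the endpoint is reachable without authentication (add credentials and TLS settings as your environment requires).

```python
import re
import urllib.request

# Pattern taken from the activation criteria; everything else is illustrative.
DCGM_PATTERN = re.compile(r"DCGM_FI_DEV_GPU_UTIL")


def body_matches(body: str) -> bool:
    """Return True if the response body contains the expected DCGM metric name."""
    return DCGM_PATTERN.search(body) is not None


def endpoint_qualifies(base_url: str, timeout: float = 10.0) -> bool:
    """GET <base_url>/metrics and apply the same regex test (hypothetical helper)."""
    url = base_url.rstrip("/") + "/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                return False
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection failures, timeouts, and HTTP errors
        # (urllib.error.URLError is a subclass of OSError).
        return False
    return body_matches(body)
```

For example, `endpoint_qualifies("https://HOSTNAME:443")` would return `True` only if the request succeeds and the body mentions `DCGM_FI_DEV_GPU_UTIL`.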
Metrics
| Type | Collected Metrics | Specific Attributes |
|---|---|---|
| gpu | hw.errors{hw.type="gpu", hw.error.type="ecc-memory-corrected", hw.error.severity="non-critical"}<br>hw.errors{hw.type="gpu", hw.error.type="ecc-memory-uncorrected", hw.error.severity="critical"}<br>hw.errors{hw.type="gpu", hw.error.type="xid"}<br>hw.gpu.clock{hw.type="graphics"}<br>hw.gpu.clock{hw.type="memory"}<br>hw.gpu.clock{hw.type="sm"}<br>hw.gpu.io{hw.gpu.direction="receive", hw.gpu.interface="pcie"}<br>hw.gpu.io{hw.gpu.direction="transmit", hw.gpu.interface="pcie"}<br>hw.gpu.io{hw.gpu.interface="nvlink"}<br>hw.gpu.memory.limit<br>hw.gpu.memory.usage{hw.gpu.memory.state="free"}<br>hw.gpu.memory.usage{hw.gpu.memory.state="used"}<br>hw.gpu.utilization{hw.gpu.task="decoder"}<br>hw.gpu.utilization{hw.gpu.task="encoder"}<br>hw.gpu.utilization{hw.gpu.task="general"}<br>hw.power.limit{hw.type="gpu"}<br>hw.power{hw.type="gpu"}<br>hw.status{hw.type="gpu", state="present"}<br>hw.temperature{sensor_location="gpu"}<br>hw.temperature{sensor_location="gpu-memory"} | host.name<br>hw.gpu.index<br>hw.gpu.model<br>hw.gpu.nvidia.mig_instance_id<br>hw.gpu.pci.bus_id<br>hw.gpu.uuid<br>hw.id<br>hw.name<br>id<br>name<br>type |
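When troubleshooting missing metrics, it can help to inspect the raw Prometheus text output that the exporter serves. The sketch below parses single lines of that text format into a name, a label dictionary, and a value. It is an illustration only: `parse_sample` is a hypothetical helper, the label names in the usage example are made-up sample data, and the way the connector maps exporter labels to the attributes listed above is not shown here. The simple regex also assumes that label values contain no escaped quotes.

```python
import re

# name{label="value",...} value   (labels are optional; a trailing timestamp is ignored)
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_sample(line: str):
    """Parse one line; return (name, labels, value), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)
```

For instance, `parse_sample('DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-1234"} 42')` yields the metric name, a dictionary with the two labels, and the value `42.0`.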