NVIDIA DCGM Exporter (Prometheus)

Description

This connector scrapes NVIDIA dcgm-exporter Prometheus metrics over HTTP(S).

Tags: gpu, hardware, nvidia

Enterprise Connector

This connector requires the Enterprise edition of MetricsHub.

Target

Typical platform: NVIDIA

Operating system: Out-Of-Band

This connector is not available for the local host (it is applicable to remote hosts only).

Prerequisites

Leverages: NVIDIA DCGM Exporter

Technology and protocols: HTTP

Examples

CLI

metricshub HOSTNAME -t oob -c +NvidiaDCGMExporter --https --http-port 443 -u USERNAME

metricshub.yaml

resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Change with actual host name
          host.type: oob
        connectors: [ +NvidiaDCGMExporter ] # Optional, to load only this connector
        protocols:
          http:
            https: true
            port: 443 # Adjust if the exporter listens on a different port
            username: <USERNAME> # Change with actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt

Connector Activation Criteria

The NVIDIA DCGM Exporter (Prometheus) connector is automatically activated, and its status is reported as OK, when all of the following criteria are met:

  • The HTTP request below to the managed host succeeds:
    • GET /metrics
    • The response body contains: DCGM_FI_DEV_GPU_UTIL (regex)
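The activation check above can be reproduced by hand. HOSTNAME and USERNAME below are placeholders for your own values; the simulated response body is a sample exporter line, not real output from your host:

```shell
# Manual sketch of the connector's activation check:
#
#   curl -sk -u "$USERNAME" "https://$HOSTNAME:443/metrics"
#
# The connector reports OK when the response body matches the regex
# DCGM_FI_DEV_GPU_UTIL. Simulated here against a sample exporter line:
body='DCGM_FI_DEV_GPU_UTIL{gpu="0"} 37'
if printf '%s\n' "$body" | grep -qE 'DCGM_FI_DEV_GPU_UTIL'; then
  echo "activation criteria met"
fi
```

If the curl command returns metrics but the connector stays inactive, confirm the regex actually matches the body, since some environments expose only a subset of DCGM fields.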

Metrics

Type: gpu

Collected metrics:
  • hw.errors{hw.type="gpu", hw.error.type="ecc-memory-corrected", hw.error.severity="non-critical"}
  • hw.errors{hw.type="gpu", hw.error.type="ecc-memory-uncorrected", hw.error.severity="critical"}
  • hw.errors{hw.type="gpu", hw.error.type="xid"}
  • hw.gpu.clock{hw.type="graphics"}
  • hw.gpu.clock{hw.type="memory"}
  • hw.gpu.clock{hw.type="sm"}
  • hw.gpu.io{hw.gpu.direction="receive", hw.gpu.interface="pcie"}
  • hw.gpu.io{hw.gpu.direction="transmit", hw.gpu.interface="pcie"}
  • hw.gpu.io{hw.gpu.interface="nvlink"}
  • hw.gpu.memory.limit
  • hw.gpu.memory.usage{hw.gpu.memory.state="free"}
  • hw.gpu.memory.usage{hw.gpu.memory.state="used"}
  • hw.gpu.utilization{hw.gpu.task="decoder"}
  • hw.gpu.utilization{hw.gpu.task="encoder"}
  • hw.gpu.utilization{hw.gpu.task="general"}
  • hw.power.limit{hw.type="gpu"}
  • hw.power{hw.type="gpu"}
  • hw.status{hw.type="gpu", state="present"}
  • hw.temperature{sensor_location="gpu"}
  • hw.temperature{sensor_location="gpu-memory"}

Specific attributes:
  • host.name
  • hw.gpu.index
  • hw.gpu.model
  • hw.gpu.nvidia.mig_instance_id
  • hw.gpu.pci.bus_id
  • hw.gpu.uuid
  • hw.id
  • hw.name
  • id
  • name
  • type
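For orientation, a raw dcgm-exporter exposition line typically looks like the fragment below (Prometheus text format; the label values are illustrative, not from a real host). Labels such as gpu, UUID, and modelName are the likely sources of attributes like hw.gpu.index, hw.gpu.uuid, and hw.gpu.model, though the exact mapping is internal to the connector:

```
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-<uuid>",device="nvidia0",modelName="NVIDIA A100-SXM4-40GB",Hostname="node1"} 37
```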