Nvidia DGX Server (REST)

Description

This connector monitors hardware for Nvidia DGX Servers.

Tags: hardware, nvidia

Enterprise Connector

This connector requires the Enterprise edition of MetricsHub.

Supersedes

This connector supersedes:

Target

Typical platform: Nvidia DGX

Operating system: Out-Of-Band

This connector is not available for the local host (it is applicable to remote hosts only).

Prerequisites

Leverages: Nvidia DGX REST API

Technology and protocols: HTTP

Examples

CLI

metricshub HOSTNAME -t oob -c +NvidiaDGXREST --https --http-port 443 -u USERNAME

metricshub.yaml

resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Replace with the actual host name
          host.type: oob
        connectors: [ +NvidiaDGXREST ] # Optional, to load only this connector
        protocols:
          http:
            https: true
            port: 443 # Change if the host uses a different HTTPS port
            username: <USERNAME> # Replace with the actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt

Connector Activation Criteria

The Nvidia DGX Server (REST) connector is automatically activated, and its status is reported as OK, if all of the following criteria are met:

  • The following HTTP request to the managed host succeeds:
    • GET /redfish/v1/Systems
    • Request Header: ${file::httpHeader}
    • The response body contains: redfish (regex)
  • The following HTTP request to the managed host succeeds:
    • GET /redfish/v1/Systems/DGX
    • Request Header: ${file::httpHeader}
    • The response body contains: redfish (regex)
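
To check by hand whether a host is likely to satisfy these criteria, the two requests can be reproduced outside MetricsHub. The Python sketch below is only an illustration, not the connector's actual implementation. The function names are hypothetical. The content of ${file::httpHeader} is not shown on this page, so the use of HTTP Basic authentication and an Accept header is an assumption. The regex is applied as a plain, case-sensitive search, which may differ from what MetricsHub does.

```python
import base64
import re
import urllib.request
from typing import Callable, Dict

# Paths and pattern taken from the activation criteria listed above.
CRITERIA = (
    ("/redfish/v1/Systems", r"redfish"),
    ("/redfish/v1/Systems/DGX", r"redfish"),
)


def check_activation(fetch: Callable[[str], str]) -> Dict[str, bool]:
    """Return, for each path, whether the GET succeeded and the body matched.

    `fetch` is a hypothetical callable that performs the GET request and
    returns the response body as text, raising an exception on failure.
    """
    results = {}
    for path, pattern in CRITERIA:
        try:
            body = fetch(path)
        except Exception:
            # Any request failure means the criterion is not met.
            results[path] = False
            continue
        results[path] = re.search(pattern, body) is not None
    return results


def connector_would_activate(fetch: Callable[[str], str]) -> bool:
    """True only if every criterion is met."""
    return all(check_activation(fetch).values())


def make_https_fetch(host: str, username: str, password: str,
                     port: int = 443) -> Callable[[str], str]:
    """Build a `fetch` callable that queries the host over HTTPS.

    Assumption: Basic authentication and a JSON Accept header stand in
    for the undisclosed content of ${file::httpHeader}.
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}", "Accept": "application/json"}

    def fetch(path: str) -> str:
        request = urllib.request.Request(
            f"https://{host}:{port}{path}", headers=headers
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")

    return fetch
```

For example, connector_would_activate(make_https_fetch("HOSTNAME", "USERNAME", "PASSWORD")) returns False as soon as one of the two requests fails or its body does not match. Hosts with a self-signed certificate will fail TLS verification in this sketch unless the certificate is trusted locally.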

Metrics

Type: cpu
  Collected Metrics:
    • hw.cpu.speed.limit{limit_type="max"}
    • hw.status{hw.type="cpu", state="degraded|failed|ok"}
    • hw.status{hw.type="cpu", state="present"}
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • model
    • name
    • vendor

Type: enclosure
  Collected Metrics:
    • hw.enclosure.energy
    • hw.enclosure.power
    • hw.power.limit{hw.type="enclosure", limit_type="high.critical"}
    • hw.status{hw.type="enclosure", state="degraded|failed|ok"}
    • hw.status{hw.type="enclosure", state="present"}
  Specific Attributes:
    • id
    • model
    • name
    • serial_number
    • type
    • vendor

Type: fan
  Collected Metrics:
    • hw.fan.speed
    • hw.fan.speed.limit{limit_type="low.critical"}
    • hw.fan.speed_ratio
    • hw.status{hw.type="fan", state="degraded|failed|ok"}
    • hw.status{hw.type="fan", state="present"}
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • name

Type: gpu
  Collected Metrics:
    • hw.energy{hw.type="gpu"}
    • hw.gpu.speed
    • hw.gpu.speed.limit{limit_type="high.critical"}
    • hw.gpu.speed.limit{limit_type="high.degraded"}
    • hw.gpu.speed.limit{limit_type="low.degraded"}
    • hw.power.limit{hw.type="gpu", limit_type="high.critical"}
    • hw.power{hw.type="gpu"}
    • hw.status{hw.type="gpu", state="degraded|failed|ok"}
    • hw.status{hw.type="gpu", state="present"}
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • model
    • name
    • serial_number
    • vendor

Type: memory
  Collected Metrics:
    • hw.memory.limit
    • hw.status{hw.type="memory", state="degraded|failed|ok"}
    • hw.status{hw.type="memory", state="present"}
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • model
    • name
    • serial_number
    • type
    • vendor

Type: network
  Collected Metrics:
    • hw.network.up
    • hw.status{hw.type="network", state="degraded|failed|ok"}
    • hw.status{hw.type="network", state="present"}
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • name
    • physical_address

Type: physical_disk
  Collected Metrics:
    • hw.physical_disk.size
    • hw.status{hw.type="physical_disk", state="degraded|failed|ok"}
    • hw.status{hw.type="physical_disk", state="present"}
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • model
    • name
    • vendor

Type: power_supply
  Collected Metrics:
    • hw.power_supply.limit
    • hw.power_supply.power
    • hw.power_supply.utilization
    • hw.status{hw.type="power_supply", state="degraded|failed|ok"}
    • hw.status{hw.type="power_supply", state="present"}
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • model
    • name
    • power_supply_type
    • serial_number
    • vendor

Type: temperature
  Collected Metrics:
    • hw.status{hw.type="temperature", state="degraded|failed|ok"}
    • hw.status{hw.type="temperature", state="present"}
    • hw.temperature
    • hw.temperature.limit{limit_type="high.critical"}
    • hw.temperature.limit{limit_type="high.degraded"}
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • name

Type: voltage
  Collected Metrics:
    • hw.status{hw.type="voltage", state="degraded|failed|ok"}
    • hw.status{hw.type="voltage", state="present"}
    • hw.voltage
  Specific Attributes:
    • hw.parent.id
    • hw.parent.type
    • id
    • name
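
The metric entries above use a compact notation: a metric name, optionally followed by attributes in braces, where a value such as "degraded|failed|ok" appears to list alternative values for that attribute. Assuming that reading is correct, the Python sketch below expands one entry into the individual name and attribute combinations it stands for. The function expand_metric is a hypothetical helper written for this page and is not part of MetricsHub.

```python
import itertools
import re
from typing import Dict, List, Tuple

# A metric name, optionally followed by {key="value", ...}.
_METRIC_RE = re.compile(r'^(?P<name>[\w.]+)(?:\{(?P<attrs>.*)\})?$')
# One key="value" pair inside the braces.
_ATTR_RE = re.compile(r'([\w.]+)="([^"]*)"')


def expand_metric(entry: str) -> List[Tuple[str, Dict[str, str]]]:
    """Expand an entry such as hw.status{state="a|b"} into one item per value.

    Assumption: "a|b|c" inside an attribute value lists alternatives.
    """
    match = _METRIC_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"Unrecognized metric entry: {entry!r}")
    name = match.group("name")
    attrs = _ATTR_RE.findall(match.group("attrs") or "")
    if not attrs:
        return [(name, {})]
    keys = [key for key, _ in attrs]
    # Split each value on "|" and build every combination of alternatives.
    choices = [value.split("|") for _, value in attrs]
    return [
        (name, dict(zip(keys, combination)))
        for combination in itertools.product(*choices)
    ]


if __name__ == "__main__":
    entry = 'hw.status{hw.type="cpu", state="degraded|failed|ok"}'
    for metric_name, attributes in expand_metric(entry):
        print(metric_name, attributes)
```

Under this assumption, the hw.status entry for cpu expands to three combinations, one each for the degraded, failed and ok states, all sharing hw.type="cpu".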