Nvidia DGX Server (REST)
Description
This connector monitors hardware for Nvidia DGX Servers.
Enterprise Connector
This connector requires the Enterprise edition of MetricsHub.
Supersedes
This connector supersedes:
Target
Typical platform: Nvidia DGX
Operating system: Out-Of-Band
This connector is not available for the local host (it is applicable to remote hosts only).
Prerequisites
Leverages: Nvidia DGX REST API
Technology and protocols: HTTP
Examples
CLI
metricshub HOSTNAME -t oob -c +NvidiaDGXREST --https --http-port 443 -u USERNAME
metricshub.yaml
resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Change with actual host name
          host.type: oob
        connectors: [ +NvidiaDGXREST ] # Optional, to load only this connector
        protocols:
          http:
            https: true
            port: 443 # or probably something else
            username: <USERNAME> # Change with actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt
Connector Activation Criteria
The Nvidia DGX Server (REST) connector is automatically activated, and its status is reported as OK, if all of the criteria below are met:
- The HTTP Request below to the managed host succeeds:
  - GET /redfish/v1/Systems
  - Request Header: ${file::httpHeader}
  - The response body contains: redfish (regex)
- The HTTP Request below to the managed host succeeds:
  - GET /redfish/v1/Systems/DGX
  - Request Header: ${file::httpHeader}
  - The response body contains: redfish (regex)
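The two activation checks can be reproduced by hand before configuring the connector. The following Python sketch is illustrative only and is not how MetricsHub implements detection: the function names `meets_activation_criteria` and `basic_auth_fetch` are hypothetical, the use of Basic authentication is an assumption (the actual request header comes from the connector's `${file::httpHeader}` file, whose content is not shown here), and the regex is assumed to be case-sensitive.

```python
import base64
import re
import urllib.request
from typing import Callable

# Paths and pattern taken from the activation criteria listed above.
ACTIVATION_PATHS = ("/redfish/v1/Systems", "/redfish/v1/Systems/DGX")
EXPECTED_BODY = re.compile(r"redfish")  # assumption: case-sensitive match


def meets_activation_criteria(fetch: Callable[[str], str]) -> bool:
    """Return True if every activation path succeeds and its body matches.

    `fetch` is a hypothetical callable: it takes a path, returns the response
    body as text, and raises an exception when the request fails.
    """
    for path in ACTIVATION_PATHS:
        try:
            body = fetch(path)
        except Exception:
            return False  # the request did not succeed
        if not EXPECTED_BODY.search(body):
            return False  # the body does not contain the expected pattern
    return True


def basic_auth_fetch(host: str, username: str, password: str,
                     port: int = 443) -> Callable[[str], str]:
    """Build a fetch function over HTTPS with Basic authentication (assumed)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()

    def fetch(path: str) -> str:
        request = urllib.request.Request(
            f"https://{host}:{port}{path}",
            headers={"Authorization": f"Basic {token}",
                     "Accept": "application/json"},
        )
        # urlopen raises an exception for HTTP error statuses and network errors.
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")

    return fetch
```

A call such as `meets_activation_criteria(basic_auth_fetch("HOSTNAME", "USERNAME", "PASSWORD"))` would then run both checks against a real host. Hosts that present a self-signed certificate would additionally need a custom SSL context, which this sketch leaves out.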
Metrics
| Type | Collected Metrics | Specific Attributes |
|---|---|---|
| cpu | hw.cpu.speed.limit{limit_type="max"}<br>hw.status{hw.type="cpu", state="degraded\|failed\|ok"}<br>hw.status{hw.type="cpu", state="present"} | hw.parent.id<br>hw.parent.type<br>id<br>model<br>name<br>vendor |
| enclosure | hw.enclosure.energy<br>hw.enclosure.power<br>hw.power.limit{hw.type="enclosure", limit_type="high.critical"}<br>hw.status{hw.type="enclosure", state="degraded\|failed\|ok"}<br>hw.status{hw.type="enclosure", state="present"} | id<br>model<br>name<br>serial_number<br>type<br>vendor |
| fan | hw.fan.speed<br>hw.fan.speed.limit{limit_type="low.critical"}<br>hw.fan.speed_ratio<br>hw.status{hw.type="fan", state="degraded\|failed\|ok"}<br>hw.status{hw.type="fan", state="present"} | hw.parent.id<br>hw.parent.type<br>id<br>name |
| gpu | hw.energy{hw.type="gpu"}<br>hw.gpu.speed<br>hw.gpu.speed.limit{limit_type="high.critical"}<br>hw.gpu.speed.limit{limit_type="high.degraded"}<br>hw.gpu.speed.limit{limit_type="low.degraded"}<br>hw.power.limit{hw.type="gpu", limit_type="high.critical"}<br>hw.power{hw.type="gpu"}<br>hw.status{hw.type="gpu", state="degraded\|failed\|ok"}<br>hw.status{hw.type="gpu", state="present"} | hw.parent.id<br>hw.parent.type<br>id<br>model<br>name<br>serial_number<br>vendor |
| memory | hw.memory.limit<br>hw.status{hw.type="memory", state="degraded\|failed\|ok"}<br>hw.status{hw.type="memory", state="present"} | hw.parent.id<br>hw.parent.type<br>id<br>model<br>name<br>serial_number<br>type<br>vendor |
| network | hw.network.up<br>hw.status{hw.type="network", state="degraded\|failed\|ok"}<br>hw.status{hw.type="network", state="present"} | hw.parent.id<br>hw.parent.type<br>id<br>name<br>physical_address |
| physical_disk | hw.physical_disk.size<br>hw.status{hw.type="physical_disk", state="degraded\|failed\|ok"}<br>hw.status{hw.type="physical_disk", state="present"} | hw.parent.id<br>hw.parent.type<br>id<br>model<br>name<br>vendor |
| power_supply | hw.power_supply.limit<br>hw.power_supply.power<br>hw.power_supply.utilization<br>hw.status{hw.type="power_supply", state="degraded\|failed\|ok"}<br>hw.status{hw.type="power_supply", state="present"} | hw.parent.id<br>hw.parent.type<br>id<br>model<br>name<br>power_supply_type<br>serial_number<br>vendor |
| temperature | hw.status{hw.type="temperature", state="degraded\|failed\|ok"}<br>hw.status{hw.type="temperature", state="present"}<br>hw.temperature<br>hw.temperature.limit{limit_type="high.critical"}<br>hw.temperature.limit{limit_type="high.degraded"} | hw.parent.id<br>hw.parent.type<br>id<br>name |
| voltage | hw.status{hw.type="voltage", state="degraded\|failed\|ok"}<br>hw.status{hw.type="voltage", state="present"}<br>hw.voltage | hw.parent.id<br>hw.parent.type<br>id<br>name |
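The `hw.status` entries in the table use a compact notation in which several state values are separated by pipes. Assuming this notation stands for one metric series per state value (an interpretation, not something this page states explicitly), the Python sketch below expands an entry into its individual identifiers. The helper name `expand_states` is hypothetical and is not part of MetricsHub.

```python
import re


def expand_states(metric: str) -> list[str]:
    """Expand state="a|b|c" into one metric identifier per state value."""
    match = re.search(r'state="([^"]+)"', metric)
    if match is None:
        return [metric]  # no state attribute, nothing to expand
    prefix = metric[:match.start(1)]
    suffix = metric[match.end(1):]
    return [prefix + state + suffix for state in match.group(1).split("|")]


for series in expand_states('hw.status{hw.type="cpu", state="degraded|failed|ok"}'):
    print(series)
# hw.status{hw.type="cpu", state="degraded"}
# hw.status{hw.type="cpu", state="failed"}
# hw.status{hw.type="cpu", state="ok"}
```

Entries without a `state` attribute, such as `hw.voltage`, are returned unchanged.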