GPU Operator Node Status Exporter Metrics
Monitor NVIDIA GPU Operator node validation with gpu_operator_node_driver_ready and status exporter metrics. Prometheus alerts for GPU node health.
💡 Quick Answer: The GPU Operator node-status-exporter exposes validation metrics at
:9400/metrics. Key metric: gpu_operator_node_driver_ready{node="gpu-node-1"} 1 indicates the driver is ready. Monitor gpu_operator_node_*_ready for driver, toolkit, device-plugin, and DCGM validation states. Scrape with a Prometheus ServiceMonitor and alert on gpu_operator_node_driver_ready == 0 to catch driver failures.
The Problem
GPU Operator manages multiple components per node (driver, toolkit, device-plugin, DCGM). When any component fails:
- Pods requesting GPUs stay Pending with no clear error
- Node labels show nvidia.com/gpu.present=true but nvidia.com/gpu allocatable is 0
- Manual kubectl describe node is required to diagnose
- No alerting on GPU node degradation
The Solution
Node Status Exporter Metrics
# The GPU Operator deploys node-status-exporter as a DaemonSet
kubectl get pods -n gpu-operator -l app=nvidia-operator-validator
# nvidia-operator-validator-xxxxx 1/1 Running 0 5m
# Check metrics endpoint
kubectl exec -n gpu-operator nvidia-operator-validator-xxxxx -- \
curl -s localhost:9400/metrics | grep gpu_operator
# Key metrics:
# gpu_operator_node_driver_ready{node="gpu-node-1"} 1
# gpu_operator_node_container_toolkit_ready{node="gpu-node-1"} 1
# gpu_operator_node_device_plugin_ready{node="gpu-node-1"} 1
# gpu_operator_node_dcgm_ready{node="gpu-node-1"} 1
# gpu_operator_node_dcgm_exporter_ready{node="gpu-node-1"} 1
# gpu_operator_node_mig_manager_ready{node="gpu-node-1"} 1
# gpu_operator_gpu_nodes_total 4
# gpu_operator_gpu_nodes_ready 4
Metrics Reference
| Metric | Values | Meaning |
|---|---|---|
| gpu_operator_node_driver_ready | 0/1 | NVIDIA driver loaded and functional |
| gpu_operator_node_container_toolkit_ready | 0/1 | nvidia-container-toolkit configured |
| gpu_operator_node_device_plugin_ready | 0/1 | nvidia-device-plugin running |
| gpu_operator_node_dcgm_ready | 0/1 | DCGM daemon running |
| gpu_operator_node_dcgm_exporter_ready | 0/1 | DCGM exporter scraping GPU metrics |
| gpu_operator_node_mig_manager_ready | 0/1 | MIG manager operational (if MIG enabled) |
| gpu_operator_gpu_nodes_total | int | Total nodes with GPU hardware |
| gpu_operator_gpu_nodes_ready | int | Nodes with all validations passing |
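The per-node metrics in the table can be turned into a quick "what is broken where" listing. The following is a minimal sketch, assuming the endpoint returns standard Prometheus text exposition lines without timestamps and that each sample carries a node label as in the examples above. The function name not_ready_components is hypothetical, not part of the GPU Operator.

```shell
# Hypothetical helper: print "<node> <component>" for every
# gpu_operator_node_*_ready sample whose value is 0.
# Reads Prometheus text exposition format on stdin.
not_ready_components() {
  awk '
    /^gpu_operator_node_[a-z_]+_ready[{]/ && $NF == 0 {
      metric = $1
      sub(/[{].*/, "", metric)                # drop the label set
      sub(/^gpu_operator_node_/, "", metric)  # drop the common prefix
      sub(/_ready$/, "", metric)              # drop the common suffix
      node = $0
      sub(/.*node="/, "", node)               # keep text after node="
      sub(/".*/, "", node)                    # cut at the closing quote
      print node, metric
    }'
}

# Sample input modeled on the metric lines shown above.
not_ready_components <<'EOF'
gpu_operator_node_driver_ready{node="gpu-node-1"} 1
gpu_operator_node_device_plugin_ready{node="gpu-node-2"} 0
gpu_operator_gpu_nodes_total 4
EOF
# prints: gpu-node-2 device_plugin
```

Against a live cluster, the same function could be fed from the curl command shown earlier instead of the sample text.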
Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gpu-operator-validator
  namespace: gpu-operator
  labels:
    app: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-operator-validator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-operator-alerts
  namespace: gpu-operator
spec:
  groups:
    - name: gpu-operator
      rules:
        # Alert when GPU driver is not ready on any node
        - alert: GPUDriverNotReady
          expr: gpu_operator_node_driver_ready == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "GPU driver not ready on {{ $labels.node }}"
            description: "NVIDIA driver validation failed for 5+ minutes. GPU workloads cannot schedule."
        # Alert when device plugin is down
        - alert: GPUDevicePluginNotReady
          expr: gpu_operator_node_device_plugin_ready == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "GPU device plugin not ready on {{ $labels.node }}"
        # Alert when not all GPU nodes are ready
        - alert: GPUNodesNotFullyReady
          expr: gpu_operator_gpu_nodes_ready < gpu_operator_gpu_nodes_total
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $value }} of {{ with query \"gpu_operator_gpu_nodes_total\" }}{{ . | first | value }}{{ end }} GPU nodes ready"
        # Alert on DCGM exporter failure (metrics gap)
        - alert: DCGMExporterNotReady
          expr: gpu_operator_node_dcgm_exporter_ready == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "DCGM exporter not ready on {{ $labels.node }}"
Grafana Dashboard
{
  "panels": [
    {
      "title": "GPU Nodes Ready",
      "type": "stat",
      "targets": [{
        "expr": "gpu_operator_gpu_nodes_ready / gpu_operator_gpu_nodes_total * 100"
      }],
      "fieldConfig": {
        "defaults": { "unit": "percent", "thresholds": { "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 80},
          {"color": "green", "value": 100}
        ]}}
      }
    },
    {
      "title": "Node Validation Status",
      "type": "table",
      "targets": [{
        "expr": "{__name__=~\"gpu_operator_node_.*_ready\"}",
        "format": "table",
        "instant": true
      }]
    }
  ]
}
Troubleshoot Failed Validations
# Check which validation is failing
kubectl get pods -n gpu-operator -l app=nvidia-operator-validator -o wide
kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx
# Check node labels for validation state
kubectl get node gpu-node-1 -o jsonpath='{.metadata.labels}' | jq 'with_entries(select(.key | startswith("nvidia")))'
# Key labels:
# nvidia.com/gpu.deploy.driver: "true"
# nvidia.com/gpu.deploy.container-toolkit: "true"
# nvidia.com/gpu.deploy.device-plugin: "true"
# nvidia.com/gpu.present: "true"
Common Issues
Metrics endpoint not accessible
ServiceMonitor selector doesn't match. Check labels: kubectl get svc -n gpu-operator --show-labels | grep validator.
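To make the label check less error-prone than reading grep output by eye, the Service list can be filtered for the exact key=value pair the ServiceMonitor selects on. This is a rough sketch, assuming the default column layout of kubectl get svc --show-labels --no-headers, where labels are the last column. The function name services_with_label and the Service names in the sample are hypothetical.

```shell
# Hypothetical helper: print the names of Services that carry the given
# key=value label. Reads `kubectl get svc --show-labels --no-headers`
# output on stdin; the label list is assumed to be the last column.
services_with_label() {
  awk -v want="$1" '{
    n = split($NF, labels, ",")
    for (i = 1; i <= n; i++) {
      if (labels[i] == want) { print $1; break }
    }
  }'
}

# Usage against a live cluster (not run here):
#   kubectl get svc -n gpu-operator --show-labels --no-headers \
#     | services_with_label app=nvidia-operator-validator
```

Empty output would mean no Service carries the label from the ServiceMonitor's matchLabels, so Prometheus has nothing to scrape.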
gpu_operator_node_driver_ready stuck at 0
The driver pod is usually in CrashLoopBackOff. Check: kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset. A common cause is a kernel version mismatch.
Metrics show ready but GPUs not allocatable
Device plugin is ready but failed to register with kubelet. Check: kubectl describe node <node> | grep nvidia.com/gpu.
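The per-node check above can be extended to the whole cluster. The sketch below assumes the node list is produced with kubectl custom columns as "NAME GPU" rows, where an unset allocatable value is rendered as <none>. The function name unallocatable_gpu_nodes is hypothetical.

```shell
# Hypothetical helper: print node names whose allocatable nvidia.com/gpu
# count is 0 or unset. Reads whitespace-separated "NAME GPU" rows on stdin.
unallocatable_gpu_nodes() {
  awk '$2 == "0" || $2 == "<none>" { print $1 }'
}

# Usage against a live cluster (not run here):
#   kubectl get nodes -l nvidia.com/gpu.present=true --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | unallocatable_gpu_nodes
```

Any node printed here has GPU hardware labeled as present but nothing for the scheduler to hand out, which matches the symptom described in this issue.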
Best Practices
- Alert on driver_ready == 0 as critical: no GPU workloads can run
- Alert on nodes_ready < nodes_total as warning: partial cluster degradation
- 30s scrape interval: validation state doesn't change frequently
- Include node label in alerts: identifies which physical node needs attention
- Pair with DCGM metrics for complete GPU observability (operator health + GPU hardware)
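The cluster-level comparison (nodes ready versus total) can also be evaluated from a raw scrape, for example in a cron job or CI step where Prometheus is not available. This is a sketch only, assuming the two gauge names used throughout this article appear without labels. The function name gpu_nodes_summary and its exit codes are hypothetical conventions.

```shell
# Hypothetical helper: print "<ready>/<total> GPU nodes ready (<pct>%)".
# Exit status: 0 when all nodes are ready, 1 when ready < total,
# 2 when the metrics are missing. Reads Prometheus text format on stdin.
gpu_nodes_summary() {
  awk '
    $1 == "gpu_operator_gpu_nodes_ready" { ready = $2; seen_ready = 1 }
    $1 == "gpu_operator_gpu_nodes_total" { total = $2; seen_total = 1 }
    END {
      if (!seen_ready || !seen_total || total == 0) {
        print "no GPU node metrics found"
        exit 2
      }
      printf "%d/%d GPU nodes ready (%.0f%%)\n", ready, total, ready / total * 100
      exit (ready < total ? 1 : 0)
    }'
}

# Usage (not run here), reusing the curl command from the metrics section:
#   curl -s localhost:9400/metrics | gpu_nodes_summary || echo "GPU capacity degraded"
```

The percentage mirrors the expression used in the Grafana stat panel, and the non-zero exit status mirrors the warning-level alert condition.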
Key Takeaways
- GPU Operator exposes gpu_operator_node_*_ready metrics via node-status-exporter
- Monitor driver, toolkit, device-plugin, DCGM, and MIG manager readiness per node
- Set Prometheus alerts on == 0 states to catch GPU node failures before users notice
- gpu_operator_gpu_nodes_ready vs total gives cluster-level GPU health at a glance
- Pair with DCGM Exporter metrics for hardware-level GPU monitoring

