Observability intermediate ⏱ 15 minutes K8s 1.28+

GPU Operator Node Status Exporter Metrics

Monitor NVIDIA GPU Operator node validation with gpu_operator_node_driver_ready and status exporter metrics. Prometheus alerts for GPU node health.

By Luca Berton • 5 min read

💡 Quick Answer: The GPU Operator node-status-exporter exposes validation metrics at :9400/metrics. Key metric: gpu_operator_node_driver_ready{node="gpu-node-1"} 1 indicates the driver is ready. Monitor gpu_operator_node_*_ready for driver, toolkit, device-plugin, and DCGM validation states. Scrape with Prometheus ServiceMonitor and alert on gpu_operator_node_driver_ready == 0 to catch driver failures.

The Problem

GPU Operator manages multiple components per node (driver, toolkit, device-plugin, DCGM). When any component fails:

  • Pods requesting GPUs stay Pending with no clear error
  • Node labels show nvidia.com/gpu.present=true but nvidia.com/gpu allocatable is 0
  • Diagnosis requires running kubectl describe node by hand
  • No alert fires when a GPU node degrades

The Solution

Node Status Exporter Metrics

# The GPU Operator deploys node-status-exporter as a DaemonSet
kubectl get pods -n gpu-operator -l app=nvidia-operator-validator
# nvidia-operator-validator-xxxxx   1/1   Running   0   5m

# Check metrics endpoint
kubectl exec -n gpu-operator nvidia-operator-validator-xxxxx -- \
  curl -s localhost:9400/metrics | grep gpu_operator

# Key metrics:
# gpu_operator_node_driver_ready{node="gpu-node-1"} 1
# gpu_operator_node_container_toolkit_ready{node="gpu-node-1"} 1
# gpu_operator_node_device_plugin_ready{node="gpu-node-1"} 1
# gpu_operator_node_dcgm_ready{node="gpu-node-1"} 1
# gpu_operator_node_dcgm_exporter_ready{node="gpu-node-1"} 1
# gpu_operator_node_mig_manager_ready{node="gpu-node-1"} 1
# gpu_operator_gpu_nodes_total 4
# gpu_operator_gpu_nodes_ready 4

Metrics Reference

| Metric | Values | Meaning |
|---|---|---|
| gpu_operator_node_driver_ready | 0/1 | NVIDIA driver loaded and functional |
| gpu_operator_node_container_toolkit_ready | 0/1 | nvidia-container-toolkit configured |
| gpu_operator_node_device_plugin_ready | 0/1 | nvidia-device-plugin running |
| gpu_operator_node_dcgm_ready | 0/1 | DCGM daemon running |
| gpu_operator_node_dcgm_exporter_ready | 0/1 | DCGM exporter scraping GPU metrics |
| gpu_operator_node_mig_manager_ready | 0/1 | MIG manager operational (if MIG enabled) |
| gpu_operator_gpu_nodes_total | int | Total nodes with GPU hardware |
| gpu_operator_gpu_nodes_ready | int | Nodes with all validations passing |
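
For a quick check without Prometheus, the per-node gauges can be filtered straight from the exposition text. The sketch below assumes the metric names and the node label shown in this recipe; the function name not_ready_components and the sample text are hypothetical, so compare them with what your endpoint actually returns.

```shell
# Print "<node> <component>" for every gpu_operator_node_*_ready sample equal to 0.
# Reads Prometheus text exposition format on stdin.
not_ready_components() {
  awk '
    /^gpu_operator_node_[a-z_]+_ready[{]/ && $NF == 0 {
      metric = $1
      sub(/[{].*/, "", metric)                # strip the label set
      sub(/^gpu_operator_node_/, "", metric)  # keep only the component name
      sub(/_ready$/, "", metric)
      node = "unknown"
      if (match($1, /node="[^"]*"/)) {
        node = substr($1, RSTART + 6, RLENGTH - 7)
      }
      print node, metric
    }'
}

# Hypothetical sample in the shape listed above, not captured from a real cluster.
sample='gpu_operator_node_driver_ready{node="gpu-node-1"} 1
gpu_operator_node_device_plugin_ready{node="gpu-node-1"} 0
gpu_operator_node_driver_ready{node="gpu-node-2"} 0
gpu_operator_gpu_nodes_total 4'

printf '%s\n' "$sample" | not_ready_components
```

Against a live cluster, the output of the curl command from the previous step can be piped into the function in place of the sample.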

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gpu-operator-validator
  namespace: gpu-operator
  labels:
    app: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-operator-validator
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-operator-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu-operator
    rules:
    # Alert when GPU driver is not ready on any node
    - alert: GPUDriverNotReady
      expr: gpu_operator_node_driver_ready == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU driver not ready on {{ $labels.node }}"
        description: "NVIDIA driver validation failed for 5+ minutes. GPU workloads cannot schedule."

    # Alert when device plugin is down
    - alert: GPUDevicePluginNotReady
      expr: gpu_operator_node_device_plugin_ready == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "GPU device plugin not ready on {{ $labels.node }}"

    # Alert when not all GPU nodes are ready
    - alert: GPUNodesNotFullyReady
      expr: gpu_operator_gpu_nodes_ready < gpu_operator_gpu_nodes_total
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "{{ $value }} of {{ with query \"gpu_operator_gpu_nodes_total\" }}{{ . | first | value }}{{ end }} GPU nodes ready"

    # Alert on DCGM exporter failure (metrics gap)
    - alert: DCGMExporterNotReady
      expr: gpu_operator_node_dcgm_exporter_ready == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "DCGM exporter not ready on {{ $labels.node }}"

Grafana Dashboard

{
  "panels": [
    {
      "title": "GPU Nodes Ready",
      "type": "stat",
      "targets": [{
        "expr": "gpu_operator_gpu_nodes_ready / gpu_operator_gpu_nodes_total * 100"
      }],
      "fieldConfig": {
        "defaults": { "unit": "percent", "thresholds": { "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 80},
          {"color": "green", "value": 100}
        ]}}
      }
    },
    {
      "title": "Node Validation Status",
      "type": "table",
      "targets": [{
        "expr": "{__name__=~\"gpu_operator_node_.*_ready\"}",
        "format": "table",
        "instant": true
      }]
    }
  ]
}
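
To sanity-check what the stat panel would show before importing the dashboard, the same ratio and colour steps can be reproduced from raw metrics. This is a rough sketch under the assumption that both cluster-level gauges appear unlabelled in the exposition text; gpu_nodes_ready_status is a hypothetical helper name.

```shell
# Compute the readiness percentage from exposition text on stdin and map it
# to the panel's colour steps (red below 80, yellow from 80, green at 100).
gpu_nodes_ready_status() {
  awk '
    $1 == "gpu_operator_gpu_nodes_ready" { ready = $2 }
    $1 == "gpu_operator_gpu_nodes_total" { total = $2 }
    END {
      if (total == 0) { print "no GPU nodes reported"; exit }
      pct = ready / total * 100
      if (pct >= 100)     colour = "green"
      else if (pct >= 80) colour = "yellow"
      else                colour = "red"
      printf "%.0f%% %s\n", pct, colour
    }'
}

# Hypothetical values: 3 of 4 GPU nodes pass all validations.
printf 'gpu_operator_gpu_nodes_total 4\ngpu_operator_gpu_nodes_ready 3\n' | gpu_nodes_ready_status
```

With three of four nodes ready the ratio is 75%, which falls in the red band, so a single failed node in a small cluster already turns the panel red.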

Troubleshoot Failed Validations

# Check which validation is failing
kubectl get pods -n gpu-operator -l app=nvidia-operator-validator -o wide
kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx

# Check node labels for validation state
kubectl get node gpu-node-1 -o json | jq '.metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Key labels:
# nvidia.com/gpu.deploy.driver: "true"
# nvidia.com/gpu.deploy.container-toolkit: "true"
# nvidia.com/gpu.deploy.device-plugin: "true"
# nvidia.com/gpu.present: "true"

Common Issues

Metrics endpoint not accessible

The ServiceMonitor selector doesn't match the Service labels. Check them with: kubectl get svc -n gpu-operator --show-labels | grep validator.

gpu_operator_node_driver_ready stuck at 0

Driver pod is in CrashLoopBackOff. Check: kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset. Common cause: kernel version mismatch.

Metrics show ready but GPUs not allocatable

Device plugin is ready but failed to register with kubelet. Check: kubectl describe node <node> | grep nvidia.com/gpu.
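
To find every affected node at once rather than describing them one by one, the allocatable count can be listed per node and filtered. The sketch below assumes GPU nodes carry the nvidia.com/gpu.present=true label mentioned earlier; nodes_without_gpus is a hypothetical helper, and the sample rows stand in for real kubectl output.

```shell
# Reads "<node> <allocatable-gpus>" lines on stdin and prints the nodes that
# advertise no nvidia.com/gpu capacity ("0", or "<none>" when the resource is absent).
nodes_without_gpus() {
  awk '$2 == "<none>" || $2 == "0" { print $1 }'
}

# Against a live cluster the input could come from:
#   kubectl get nodes -l nvidia.com/gpu.present=true --no-headers \
#     -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_without_gpus

# Hypothetical sample rows standing in for that output.
printf '%s\n' \
  'gpu-node-1   4' \
  'gpu-node-2   0' \
  'gpu-node-3   <none>' | nodes_without_gpus
```

Any node printed here has GPU hardware detected but nothing registered with the kubelet, which is the gap the readiness metrics alone do not reveal.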

Best Practices

  • Alert on driver_ready == 0 as critical: no GPU workloads can run
  • Alert on nodes_ready < nodes_total as warning: partial cluster degradation
  • Use a 30s scrape interval: validation state doesn't change frequently
  • Include the node label in alerts: it identifies which physical node needs attention
  • Pair with DCGM metrics for complete GPU observability (operator health + GPU hardware)

Key Takeaways

  • GPU Operator exposes gpu_operator_node_*_ready metrics via node-status-exporter
  • Monitor driver, toolkit, device-plugin, DCGM, and MIG manager readiness per node
  • Set Prometheus alerts on == 0 states to catch GPU node failures before users notice
  • gpu_operator_gpu_nodes_ready vs total gives cluster-level GPU health at a glance
  • Pair with DCGM Exporter metrics for hardware-level GPU monitoring
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.