Troubleshooting intermediate ⏱ 20 minutes K8s 1.28+

Troubleshooting Pods with GPU Devices

Fix GPU device issues in Kubernetes pods. Troubleshoot device plugin errors, DRA claims, CUDA failures, driver mismatches.

By Luca Berton • 📖 6 min read

💡 Quick Answer: GPU pod failures on Kubernetes typically fall into 5 categories: device plugin not running (no GPUs advertised), driver version mismatch (CUDA error), insufficient GPU resources (Unschedulable), DRA claim pending (no matching device), or container runtime misconfiguration (nvidia runtime not default). Start with kubectl describe pod → check Events → verify that kubectl get nodes -o json | jq '.items[].status.capacity' shows GPUs.

The Problem

Kubernetes v1.36 highlights device-related pod failures as a growing operational concern as GPU workloads increase. GPU scheduling adds complexity beyond CPU/memory: device drivers, container runtime hooks, device plugins or DRA drivers, CUDA library compatibility, and multi-GPU topology. When something fails, error messages are often cryptic.

flowchart TB
    POD["Pod requesting GPU"] --> SCHED{"Scheduler:<br/>GPU available?"}
    SCHED -->|"No"| UNSCHED["❌ Unschedulable<br/>0/3 nodes have<br/>nvidia.com/gpu"]
    SCHED -->|"Yes"| RUNTIME{"Container Runtime:<br/>nvidia hook?"}
    RUNTIME -->|"No"| HOOKFAIL["❌ OCI runtime error<br/>nvidia-container-cli"]
    RUNTIME -->|"Yes"| DRIVER{"Driver:<br/>compatible?"}
    DRIVER -->|"No"| CUDAFAIL["❌ CUDA error:<br/>driver version insufficient"]
    DRIVER -->|"Yes"| APP{"Application:<br/>finds GPU?"}
    APP -->|"No"| VISIBLE["❌ No CUDA devices<br/>visible"]
    APP -->|"Yes"| OK["✅ Running"]
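
The decision flow above can be sketched as a small lookup that maps an error string to the stage that most likely produced it. This is a minimal sketch, not an exhaustive classifier: `classify_gpu_failure` is a hypothetical helper name, and the patterns are only the messages quoted in this recipe, so real messages may differ between versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an error message to a stage of the flowchart.
# Patterns are the messages quoted in this recipe, not a complete list.
classify_gpu_failure() {
  local msg="$1"
  case "$msg" in
    *"Insufficient nvidia.com/gpu"*)     echo "scheduling" ;;
    *"nvidia-container-cli"*)            echo "runtime" ;;
    *"driver version is insufficient"*)  echo "driver" ;;
    *"failed to initialize NVML"*)       echo "device-plugin" ;;
    *"no devices match"*)                echo "dra" ;;
    *)                                   echo "unknown" ;;
  esac
}
```

Passing it a message copied from the pod's Events or container log gives a hint about which step below to start with.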

The Solution

Step 1: Check GPU Resources on Nodes

# Are GPUs visible to Kubernetes?
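# Quote the expression so the shell keeps the backslash that escapes the dot in nvidia.com
kubectl get nodes -o custom-columns="\
NAME:.metadata.name,\
GPU_CAPACITY:.status.capacity.nvidia\.com/gpu,\
GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu"

# Expected:
# NAME          GPU_CAPACITY   GPU_ALLOCATABLE
# gpu-node-1    4              4
# gpu-node-2    4              3    (1 GPU reported unhealthy)
# cpu-node-1    <none>         <none>

# Allocatable does not drop when pods consume GPUs; see Step 3 for current usage.
# If GPU_CAPACITY is <none> on a GPU node, the device plugin is not running or has not registered.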
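
On larger clusters it can help to reduce that table to one status word per node. The sketch below assumes the three-column layout shown above (name, capacity, allocatable); `gpu_node_report` is a hypothetical helper name, and the status words are labels chosen for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads the custom-columns table on stdin and prints
# "<node> <status>", where status is no-gpus, degraded, or ok.
gpu_node_report() {
  awk 'NR > 1 {
    if ($2 == "<none>" || $2 + 0 == 0) status = "no-gpus"       # nothing advertised
    else if ($3 + 0 < $2 + 0)          status = "degraded"      # allocatable below capacity
    else                               status = "ok"
    print $1, status
  }'
}

# Usage (pipe the Step 1 command into the function):
#   kubectl get nodes -o custom-columns="..." | gpu_node_report
```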

Step 2: Verify Device Plugin / GPU Operator

# Check NVIDIA device plugin pods
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl get pods -n kube-system -l app=nvidia-device-plugin

# Check logs for errors
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=50

# Common errors:
# "failed to initialize NVML" → driver not loaded
# "no devices found" → driver loaded but no GPUs detected
# "failed to start device plugin server" → socket permission issue

# Check GPU Operator status
kubectl get clusterpolicy -o yaml | grep -A5 "status:"

Step 3: Diagnose Pod Scheduling Failures

# Pod stuck in Pending
kubectl describe pod my-gpu-pod

# Look for Events like:
# "0/5 nodes are available: 5 Insufficient nvidia.com/gpu."
# → Not enough free GPUs
# "0/5 nodes are available: 5 node(s) didn't match Pod's node affinity/selector."
# → Node selector/affinity too restrictive

# Check current GPU allocation (one entry per pod, GPUs summed across its containers)
kubectl get pods -A -o json | jq '
  [.items[]
   | select(any(.spec.containers[]; .resources.limits["nvidia.com/gpu"] != null))
   | {name: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName,
      gpus: ([.spec.containers[] | (.resources.limits["nvidia.com/gpu"] // "0") | tonumber] | add)}]'
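
To compare requested GPUs against each node's allocatable count, the per-pod figures can be totalled per node. This is a minimal sketch that assumes the input has already been flattened to one "node gpus" pair per line; `sum_gpus_by_node` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "<node> <gpus>" pairs on stdin and prints the
# total number of requested GPUs per node, sorted by node name.
sum_gpus_by_node() {
  awk 'NF == 2 { total[$1] += $2 }
       END     { for (node in total) print node, total[node] }' | sort
}

# Usage (assumes jq is installed; completed pods are not filtered out here):
#   kubectl get pods -A -o json | jq -r '
#     .items[] | select(.spec.nodeName != null)
#     | "\(.spec.nodeName) \([.spec.containers[] | (.resources.limits["nvidia.com/gpu"] // "0") | tonumber] | add)"' \
#     | sum_gpus_by_node
```

Subtracting each total from the node's GPU_ALLOCATABLE value in Step 1 gives the number of GPUs still free for scheduling.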

Step 4: Fix Container Runtime Errors

# Error: "OCI runtime create failed: nvidia-container-cli: initialization error"
# → NVIDIA container runtime not configured

# Check containerd config
grep -A5 nvidia /etc/containerd/config.toml

# Expected:
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   runtime_type = "io.containerd.runc.v2"
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#   BinaryName = "/usr/bin/nvidia-container-runtime"

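# Fix: install nvidia-container-toolkit, then register the runtime with containerd
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Verify runtime works (ctr does not pull images implicitly, so pull first)
sudo ctr image pull docker.io/nvidia/cuda:12.5.0-base-ubuntu22.04
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.5.0-base-ubuntu22.04 test nvidia-smi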
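
The config check above can be scripted to distinguish "runtime missing" from "runtime present but not the default". This is a sketch that only pattern-matches the TOML text rather than parsing it; `check_nvidia_runtime` is a hypothetical helper name, and it assumes the section naming shown in the expected output above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: inspect a containerd config file (default path assumed)
# and print missing, configured, or default for the nvidia runtime.
check_nvidia_runtime() {
  local cfg="${1:-/etc/containerd/config.toml}"
  if ! grep -Fq 'runtimes.nvidia]' "$cfg" 2>/dev/null; then
    echo "missing"       # no [....runtimes.nvidia] section found
  elif grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"' "$cfg"; then
    echo "default"       # every pod on this node uses the nvidia runtime
  else
    echo "configured"    # section present, but nvidia is not the default runtime
  fi
}
```

A result of `configured` means the runtime exists but is not the default, which matches the "nvidia runtime not default" case from the Quick Answer.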

Step 5: Fix CUDA / Driver Mismatches

# Error: "CUDA driver version is insufficient for CUDA runtime version"
# → Container needs newer driver than what's installed on the node

# Check driver version on node
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check what CUDA version the container needs
# CUDA 12.5 needs driver ≥ 555.42
# CUDA 12.4 needs driver ≥ 550.54
# CUDA 12.2 needs driver ≥ 535.86
# CUDA 12.0 needs driver ≥ 525.60
# CUDA 11.8 needs driver ≥ 520.61

# Fix: either upgrade node driver or use older CUDA base image
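
The comparison can be automated with a version-aware sort. The sketch below copies the minimum driver versions from the list above and nothing else; `min_driver_for_cuda` and `driver_supports_cuda` are hypothetical helper names, and `sort -V` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
# Minimum driver per CUDA release, copied from the list in this step.
min_driver_for_cuda() {
  case "$1" in
    12.5) echo "555.42" ;;
    12.4) echo "550.54" ;;
    12.2) echo "535.86" ;;
    12.0) echo "525.60" ;;
    11.8) echo "520.61" ;;
    *)    return 1 ;;
  esac
}

# Succeeds (exit 0) when driver version $1 is at least the minimum for CUDA $2.
driver_supports_cuda() {
  local driver="$1" min
  min="$(min_driver_for_cuda "$2")" || { echo "unknown CUDA version: $2" >&2; return 2; }
  # sort -V orders version strings; the driver is new enough if the minimum sorts first.
  [ "$(printf '%s\n%s\n' "$min" "$driver" | sort -V | head -n1)" = "$min" ]
}

# Usage on a GPU node:
#   driver_supports_cuda "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" 12.4 \
#     && echo "driver OK" || echo "driver too old"
```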

Step 6: DRA-Specific Troubleshooting

# For DRA (v1.34+): ResourceClaim not being allocated
kubectl get resourceclaims -n ai-inference
kubectl describe resourceclaim training-gpu -n ai-inference

# Common DRA issues:
# "no devices match the claim's selectors"
# → CEL expression too restrictive or wrong device attribute name

# Check what devices are available
kubectl get resourceslices -o yaml

# Check device attributes (devices live under .spec; the v1beta1 API nests attributes under .basic)
kubectl get resourceslices -o json | jq '.items[].spec.devices[] | (.attributes // .basic.attributes)'
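
A wrong attribute name is easy to miss by eye, so one option is to check the names used in a selector against the names the driver actually advertises. This is a sketch under assumptions: `missing_attributes` is a hypothetical helper name, and the attribute names in the usage comment are illustrative, not taken from any particular DRA driver.

```bash
#!/usr/bin/env bash
# Hypothetical helper: $1 is a file with one advertised attribute name per line;
# the remaining arguments are names used in a claim selector.
# Prints every requested name that is not advertised.
missing_attributes() {
  local advertised="$1" name
  shift
  for name in "$@"; do
    grep -Fxq -- "$name" "$advertised" || echo "$name"
  done
}

# Usage (assumes jq; attribute names are illustrative):
#   kubectl get resourceslices -o json \
#     | jq -r '.items[].spec.devices[] | (.attributes // .basic.attributes // {}) | keys[]' \
#     | sort -u > attrs.txt
#   missing_attributes attrs.txt productName memory
```

Empty output means every name in the selector exists on at least one device, so the mismatch is more likely in the values being compared.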

Common Error Messages and Fixes

| Error | Cause | Fix |
|---|---|---|
| Insufficient nvidia.com/gpu | All GPUs allocated or no GPU nodes | Scale GPU nodes or wait for pods to finish |
| nvidia-container-cli: initialization error | NVIDIA runtime not configured | Install nvidia-container-toolkit, restart containerd |
| CUDA driver version insufficient | Node driver too old for container CUDA | Upgrade driver or pin older CUDA image |
| failed to initialize NVML: driver not loaded | NVIDIA kernel module not loaded | sudo modprobe nvidia or reboot node |
| GPU UUID not found | Device plugin stale after driver update | Restart nvidia-device-plugin pod |
| no devices found on plugin registration | GPU hardware fault or driver crash | Check dmesg for Xid errors, reset GPU |
| MIG configuration invalid | Wrong MIG profile for workload | Reconfigure MIG with nvidia-smi mig |
| ResourceClaim pending | DRA selector matches no device | Relax CEL expression or add matching hardware |
| ErrImagePull for NVCR images | Missing NGC API key | Create imagePullSecret with NGC credentials |

GPU Health Diagnostic Script

#!/bin/bash
# gpu-diagnostics.sh: run on the GPU node
echo "=== Driver ==="
nvidia-smi --query-gpu=driver_version,name,memory.total,memory.used \
  --format=csv

echo -e "\n=== Processes ==="
nvidia-smi pmon -c 1

echo -e "\n=== Xid Errors (hardware faults) ==="
dmesg | grep -i "NVRM: Xid" | tail -10

echo -e "\n=== Container Runtime ==="
nvidia-container-cli info

echo -e "\n=== Device Plugin ==="
ls -la /var/lib/kubelet/device-plugins/nvidia*.sock 2>/dev/null || echo "No device plugin socket found"

echo -e "\n=== MIG Status ==="
nvidia-smi mig -lgi 2>/dev/null || echo "MIG not enabled"
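
The Xid section of the script prints raw kernel log lines; a tally per Xid code is often easier to act on. The sketch below assumes the usual `NVRM: Xid (PCI:...): <code>, ...` line layout; `count_xid_errors` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<count> Xid <code>" for each Xid code found, ordered by code.
count_xid_errors() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' \
    | sort -n | uniq -c | awk '{ print $1, "Xid", $2 }'
}

# Usage on a GPU node (dmesg may require root):
#   dmesg | count_xid_errors
```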

Monitoring GPU Health

# DCGM Exporter for Prometheus GPU metrics
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter   # must match spec.selector.matchLabels
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
          ports:
            - containerPort: 9400
          env:
            - name: DCGM_EXPORTER_COLLECTORS
              value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
          securityContext:
            privileged: true
# Key metrics to alert on:
# DCGM_FI_DEV_GPU_TEMP > 90°C → thermal throttling risk
# DCGM_FI_DEV_XID_ERRORS > 0 → driver-reported error, possibly a hardware fault
# DCGM_FI_DEV_MEM_COPY_UTIL > 95% → memory bandwidth saturated
# DCGM_FI_DEV_GPU_UTIL == 0 for 30m → GPU idle (waste)
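
Before alert rules are in place, the temperature threshold can be spot-checked directly against the exporter's text output. This is a sketch: `hot_gpus` is a hypothetical helper name, and it assumes each sample carries a `gpu="..."` label and that the exporter is reachable on port 9400 as configured above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# "<gpu> <temp>" for each DCGM_FI_DEV_GPU_TEMP sample above the limit (default 90).
hot_gpus() {
  local limit="${1:-90}"
  awk -v limit="$limit" '
    index($0, "DCGM_FI_DEV_GPU_TEMP{") == 1 && $NF + 0 > limit + 0 {
      # Extract the value of the gpu="..." label.
      if (match($0, /gpu="[^"]*"/)) print substr($0, RSTART + 5, RLENGTH - 6), $NF
    }'
}

# Usage from a host that can reach the exporter pod:
#   curl -s http://<exporter-pod-ip>:9400/metrics | hot_gpus 85
```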

Best Practices

  • Always check kubectl describe pod first: the Events section usually names the stage that failed
  • Verify GPU capacity on nodes: if GPU nodes show no GPUs, fix the device plugin before anything else
  • Pin CUDA versions: match the container's CUDA version to the node driver using the compatibility matrix
  • Monitor Xid errors: codes such as Xid 48, 63, and 79 point to hardware faults that may require a GPU reset or replacement
  • Restart the device plugin after driver updates: stale GPU UUIDs cause phantom devices
  • Use DCGM Exporter: Prometheus metrics for temperature, utilization, and errors

Key Takeaways

  • GPU pod failures have 5 main root causes: device plugin, runtime, driver, scheduling, DRA
  • Start diagnosis with kubectl describe pod → Events → node GPU capacity
  • A CUDA/driver version mismatch is one of the most common causes of GPU failures inside containers
  • DRA (v1.34+) adds new failure modes: ResourceClaim selectors must match device attributes
  • Monitor GPU health with DCGM Exporter: Xid errors can indicate hardware faults
  • Always pin CUDA versions and driver versions for reproducible GPU deployments
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
