Troubleshooting Pods with GPU Devices
Fix GPU device issues in Kubernetes pods. Troubleshoot device plugin errors, DRA claims, CUDA failures, driver mismatches.
💡 Quick Answer: GPU pod failures on Kubernetes typically fall into 5 categories: device plugin not running (no GPUs advertised), driver version mismatch (CUDA error), insufficient GPU resources (Unschedulable), DRA claim pending (no matching device), or container runtime misconfiguration (nvidia runtime not default). Start with `kubectl describe pod`, check Events, then verify that `kubectl get nodes -o json | jq '.items[].status.capacity'` shows GPUs.
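The five categories lend themselves to a first-pass triage helper. The sketch below defines a hypothetical function, `classify_gpu_failure`, that maps an error string copied from pod Events or container logs to one category. The patterns come from the error messages quoted in this article and are assumptions, not an exhaustive list.

```shell
#!/usr/bin/env bash
# classify_gpu_failure (hypothetical helper): map an error string to one of the
# five failure categories. Patterns are illustrative, not exhaustive.
classify_gpu_failure() {
  local msg="$1"
  case "$msg" in
    *"Insufficient nvidia.com/gpu"*)                    echo "scheduling" ;;
    *"nvidia-container-cli"*)                           echo "runtime" ;;
    *"driver version"*"insufficient"*)                  echo "driver" ;;
    *"failed to initialize NVML"*|*"no devices found"*) echo "device-plugin" ;;
    *"no devices match"*)                               echo "dra" ;;
    *)                                                  echo "unknown" ;;
  esac
}
```

For example, `classify_gpu_failure "OCI runtime create failed: nvidia-container-cli: initialization error"` prints `runtime`. Anything that prints `unknown` still needs manual inspection.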
The Problem
Kubernetes v1.36 highlights device-related pod failures as a growing operational concern as GPU workloads increase. GPU scheduling adds complexity beyond CPU/memory: device drivers, container runtime hooks, device plugins or DRA drivers, CUDA library compatibility, and multi-GPU topology. When something fails, error messages are often cryptic.
flowchart TB
POD["Pod requesting GPU"] --> SCHED{"Scheduler:<br/>GPU available?"}
SCHED -->|"No"| UNSCHED["❌ Unschedulable<br/>0/3 nodes have<br/>nvidia.com/gpu"]
SCHED -->|"Yes"| RUNTIME{"Container Runtime:<br/>nvidia hook?"}
RUNTIME -->|"No"| HOOKFAIL["❌ OCI runtime error<br/>nvidia-container-cli"]
RUNTIME -->|"Yes"| DRIVER{"Driver:<br/>compatible?"}
DRIVER -->|"No"| CUDAFAIL["❌ CUDA error:<br/>driver version insufficient"]
DRIVER -->|"Yes"| APP{"Application:<br/>finds GPU?"}
APP -->|"No"| VISIBLE["❌ No CUDA devices<br/>visible"]
APP -->|"Yes"| OK["✅ Running"]
The Solution
Step 1: Check GPU Resources on Nodes
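On clusters with many nodes, the node listing in this step is easier to read once it is filtered. The sketch below is a hypothetical `flag_gpu_nodes` function that reads the three-column table (name, capacity, allocatable) on stdin and prints only the nodes that need attention. It assumes whitespace-separated columns with a single header row.

```shell
#!/usr/bin/env bash
# flag_gpu_nodes (hypothetical helper): read "NAME CAPACITY ALLOCATABLE" rows on
# stdin and report nodes that advertise no GPUs or whose counts differ.
flag_gpu_nodes() {
  awk 'NR > 1 {
    if ($2 == "<none>" || $2 == "0")
      print $1 ": no GPUs advertised"
    else if ($2 != $3)
      print $1 ": capacity " $2 ", allocatable " $3
  }'
}
```

Pipe the `kubectl get nodes -o custom-columns=...` output from this step into `flag_gpu_nodes`; healthy GPU nodes produce no output.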
# Are GPUs visible to Kubernetes?
# Quote the column spec so the shell keeps the backslash that escapes the dot in nvidia.com
kubectl get nodes -o custom-columns="\
NAME:.metadata.name,\
GPU_CAPACITY:.status.capacity.nvidia\.com/gpu,\
GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu"
# Expected:
# NAME GPU_CAPACITY GPU_ALLOCATABLE
# gpu-node-1 4 4
# gpu-node-2 4 3 (1 GPU marked unhealthy)
# cpu-node-1 <none> <none>
# If GPU_CAPACITY is <none>, the device plugin is not running or has not registered
Step 2: Verify Device Plugin / GPU Operator
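The pod listings in this step can be long on large clusters. As a minimal sketch, the hypothetical `unhealthy_pods` function below reads the default `kubectl get pods` table on stdin and prints only pods that are not Running or not fully ready. It assumes the default column order NAME, READY, STATUS.

```shell
#!/usr/bin/env bash
# unhealthy_pods (hypothetical helper): read default `kubectl get pods` output
# on stdin and print pods that are not Running or have unready containers.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")              # READY column looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2])
      print $1, $3, $2
  }'
}
```

For example, `kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset | unhealthy_pods` prints nothing when every plugin pod is healthy.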
# Check NVIDIA device plugin pods
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl get pods -n kube-system -l app=nvidia-device-plugin
# Check logs for errors
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=50
# Common errors:
# "failed to initialize NVML" → driver not loaded
# "no devices found" → driver loaded but no GPUs detected
# "failed to start device plugin server" → socket permission issue
# Check GPU Operator status
kubectl get clusterpolicy -o yaml | grep -A5 "status:"
Step 3: Diagnose Pod Scheduling Failures
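When the scheduler reports insufficient GPUs, per-node totals are often more useful than a per-pod list. The sketch below is a hypothetical `sum_gpus_per_node` reducer. It assumes tab-separated `node<TAB>gpus` lines on stdin, and the `jq` filter in the comment is one assumed way to produce them.

```shell
#!/usr/bin/env bash
# sum_gpus_per_node (hypothetical helper): total requested GPUs per node from
# tab-separated "node<TAB>gpus" lines. One assumed way to produce that input:
#   kubectl get pods -A -o json | jq -r '.items[] | . as $p | .spec.containers[]
#     | select(.resources.limits["nvidia.com/gpu"] != null)
#     | [$p.spec.nodeName, .resources.limits["nvidia.com/gpu"]] | @tsv'
sum_gpus_per_node() {
  awk -F'\t' '
    { node = ($1 == "" ? "<unscheduled>" : $1); total[node] += $2 }
    END { for (n in total) print n "\t" total[n] }
  ' | LC_ALL=C sort
}
```

Comparing these totals with each node's allocatable count shows which nodes still have free GPUs. Pods without a node yet are grouped under `<unscheduled>`.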
# Pod stuck in Pending
kubectl describe pod my-gpu-pod
# Look for Events like:
# "0/5 nodes are available: 5 Insufficient nvidia.com/gpu"
# → Not enough GPUs available
# "0/5 nodes are available: 5 didn't match Pod's node affinity"
# → Node selector/affinity too restrictive
# Check current GPU allocation (one entry per GPU-requesting container)
kubectl get pods -A -o json | jq '
[.items[] | . as $pod | .spec.containers[] |
select(.resources.limits["nvidia.com/gpu"] != null) |
{name: $pod.metadata.name, namespace: $pod.metadata.namespace,
node: $pod.spec.nodeName, container: .name,
gpus: .resources.limits["nvidia.com/gpu"]}]'
Step 4: Fix Container Runtime Errors
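The containerd check in this step can be scripted for use across many nodes. The sketch below defines two hypothetical helpers: `has_nvidia_runtime` tests whether a config file contains an nvidia runtime section and an nvidia `BinaryName`, and `default_runtime` prints the configured `default_runtime_name`. Both do plain text matching rather than TOML parsing, so treat the result as a hint only.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that inspect a containerd config with text matching.
# They do not parse TOML, so BinaryName is not tied to the nvidia section.
has_nvidia_runtime() {
  local config="${1:-/etc/containerd/config.toml}"
  grep -Eq '^[[:space:]]*\[plugins\..*\.runtimes\.nvidia\]' "$config" &&
    grep -Eq '^[[:space:]]*BinaryName[[:space:]]*=[[:space:]]*"[^"]*nvidia-container-runtime"' "$config"
}

# Print the first default_runtime_name value found, or nothing if it is unset.
default_runtime() {
  local config="${1:-/etc/containerd/config.toml}"
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$config" | head -n 1
}
```

If `has_nvidia_runtime` succeeds but `default_runtime` prints `runc`, GPU pods need a RuntimeClass that selects the nvidia runtime, or the default has to be changed.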
# Error: "OCI runtime create failed: nvidia-container-cli: initialization error"
# → NVIDIA container runtime not configured
# Check containerd config
grep -A5 nvidia /etc/containerd/config.toml
# Expected:
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
# runtime_type = "io.containerd.runc.v2"
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
# BinaryName = "/usr/bin/nvidia-container-runtime"
# Fix: install nvidia-container-toolkit, then register the runtime with containerd
# (add --set-as-default if pods do not select the runtime through a RuntimeClass)
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
# Verify runtime works (ctr does not pull images automatically, so fetch it first)
sudo ctr image pull docker.io/nvidia/cuda:12.5.0-base-ubuntu22.04
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.5.0-base-ubuntu22.04 test nvidia-smi
Step 5: Fix CUDA / Driver Mismatches
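The driver thresholds listed in this step can be checked mechanically before an image is deployed. The sketch below is one possible approach: `version_ge` compares dot-separated versions numerically, and the hypothetical `driver_supports_cuda` looks up the minimum driver for a CUDA release. The thresholds are copied from this step's list and are assumed to be the ones that matter for your images.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if dot-separated numeric version A >= B.
version_ge() {
  local IFS=.
  local -a a=($1) b=($2)
  local i x y
  for ((i = 0; i < ${#a[@]} || i < ${#b[@]}; i++)); do
    x=$((10#${a[i]:-0}))   # force base 10 so "05" is not read as octal
    y=$((10#${b[i]:-0}))
    if ((x > y)); then return 0; fi
    if ((x < y)); then return 1; fi
  done
  return 0
}

# min_driver_for_cuda: thresholds copied from the list in this step.
min_driver_for_cuda() {
  case "$1" in
    12.5|12.5.*) echo 555.42 ;;
    12.4|12.4.*) echo 550.54 ;;
    12.2|12.2.*) echo 535.86 ;;
    12.0|12.0.*) echo 525.60 ;;
    11.8|11.8.*) echo 520.61 ;;
    *) return 1 ;;
  esac
}

# driver_supports_cuda DRIVER CUDA: 0 = ok, 1 = driver too old, 2 = unknown CUDA.
driver_supports_cuda() {
  local min
  min=$(min_driver_for_cuda "$2") || return 2
  version_ge "$1" "$min"
}
```

On a node, something like `driver_supports_cuda "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 12.4` returns non-zero when the image would fail with the driver error above.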
# Error: "CUDA driver version is insufficient for CUDA runtime version"
# → Container needs newer driver than what's installed on the node
# Check driver version on node
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Check what CUDA version the container needs
# CUDA 12.5 needs driver ≥ 555.42
# CUDA 12.4 needs driver ≥ 550.54
# CUDA 12.2 needs driver ≥ 535.86
# CUDA 12.0 needs driver ≥ 525.60
# CUDA 11.8 needs driver ≥ 520.61
# Fix: either upgrade node driver or use older CUDA base image
Step 6: DRA-Specific Troubleshooting
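In a namespace with many claims, it helps to list only the ones that are still unallocated before describing them one by one. The sketch below is a hypothetical `pending_claims` filter. It assumes the default single-namespace table from `kubectl get resourceclaims` has NAME, STATE and AGE columns, and that allocated claims include the word `allocated` in STATE. Check this against your kubectl version before relying on it.

```shell
#!/usr/bin/env bash
# pending_claims (hypothetical helper): print ResourceClaim names whose STATE
# column does not contain "allocated". Assumes columns NAME STATE AGE; with -A
# the extra NAMESPACE column shifts the fields and this filter no longer applies.
pending_claims() {
  awk 'NR > 1 && $2 !~ /allocated/ { print $1 }'
}
```

For example, `kubectl get resourceclaims -n ai-inference | pending_claims` yields the names to pass to `kubectl describe resourceclaim`.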
# For DRA (v1.34+): ResourceClaim not being allocated
kubectl get resourceclaims -n ai-inference
kubectl describe resourceclaim training-gpu -n ai-inference
# Common DRA issues:
# "no devices match the claim's selectors"
# → CEL expression too restrictive or wrong device attribute name
# Check what devices are available
kubectl get resourceslices -o yaml
# Check device attributes (devices live under .spec; older beta APIs nest attributes under .basic)
kubectl get resourceslices -o json | jq '.items[].spec.devices[] | (.attributes // .basic.attributes)'
Common Error Messages and Fixes
| Error | Cause | Fix |
|---|---|---|
| `Insufficient nvidia.com/gpu` | All GPUs allocated or no GPU nodes | Scale GPU nodes or wait for pods to finish |
| `nvidia-container-cli: initialization error` | NVIDIA runtime not configured | Install nvidia-container-toolkit, restart containerd |
| `CUDA driver version insufficient` | Node driver too old for container CUDA | Upgrade driver or pin older CUDA image |
| `failed to initialize NVML: driver not loaded` | NVIDIA kernel module not loaded | `sudo modprobe nvidia` or reboot node |
| `GPU UUID not found` | Device plugin stale after driver update | Restart nvidia-device-plugin pod |
| `no devices found` on plugin registration | GPU hardware fault or driver crash | Check `dmesg` for Xid errors, reset GPU |
| `MIG configuration invalid` | Wrong MIG profile for workload | Reconfigure MIG with `nvidia-smi mig` |
| ResourceClaim pending | DRA selector matches no device | Relax CEL expression or add matching hardware |
| `ErrImagePull` for NVCR images | Missing NGC API key | Create imagePullSecret with NGC credentials |
GPU Health Diagnostic Script
#!/bin/bash
# gpu-diagnostics.sh: run on GPU node
echo "=== Driver ==="
nvidia-smi --query-gpu=driver_version,name,memory.total,memory.used \
--format=csv
echo -e "\n=== Processes ==="
nvidia-smi pmon -c 1
echo -e "\n=== Xid Errors (hardware faults) ==="
dmesg | grep -i "NVRM: Xid" | tail -10
echo -e "\n=== Container Runtime ==="
nvidia-container-cli info
echo -e "\n=== Device Plugin ==="
ls -la /var/lib/kubelet/device-plugins/nvidia*.sock 2>/dev/null || echo "No device plugin socket found"
echo -e "\n=== MIG Status ==="
nvidia-smi mig -lgi 2>/dev/null || echo "MIG not enabled"
Monitoring GPU Health
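Xid errors in the kernel log are a cheap health signal to track between scrapes. The sketch below is a hypothetical `count_xid_codes` function that reads kernel log lines on stdin and counts occurrences per Xid code. It assumes the usual `NVRM: Xid (PCI:...): <code>, ...` line layout.

```shell
#!/usr/bin/env bash
# count_xid_codes (hypothetical helper): read kernel log lines on stdin and
# print "Xid <code>: <count>" for each code found, sorted by code.
count_xid_codes() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' |
    sort -n | uniq -c |
    awk '{ print "Xid " $2 ": " $1 }'
}
```

Run it as `dmesg | count_xid_codes` on the GPU node. Empty output means no Xid lines were found in the current kernel ring buffer.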
# DCGM Exporter for Prometheus GPU metrics
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter  # required: must match spec.selector.matchLabels
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
          ports:
            - containerPort: 9400
          env:
            - name: DCGM_EXPORTER_COLLECTORS
              value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
          securityContext:
            privileged: true
# Key metrics to alert on:
# DCGM_FI_DEV_GPU_TEMP > 90°C → thermal throttling
# DCGM_FI_DEV_XID_ERRORS > 0 → hardware fault
# DCGM_FI_DEV_MEM_COPY_UTIL > 95% → memory bandwidth saturation
# DCGM_FI_DEV_GPU_UTIL == 0 for 30m → GPU idle (waste)
Best Practices
- Always check `kubectl describe pod` first: the Events section usually names the component that failed
- Verify GPU capacity on nodes: if nodes show 0 GPUs, fix the device plugin before anything else
- Pin CUDA versions: match the container's CUDA version to the node driver using the compatibility matrix
- Monitor Xid errors: hardware faults (Xid 48, 63, 79) may require physical GPU replacement
- Restart the device plugin after driver updates: stale GPU UUIDs cause phantom devices
- Use DCGM Exporter: Prometheus metrics for temperature, utilization, and errors
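Before alert rules exist in Prometheus, the instantaneous thresholds listed under Monitoring GPU Health can be spot-checked against a raw scrape. The sketch below is a hypothetical `check_dcgm_alerts` filter for Prometheus text-format lines on stdin. It assumes the sample value is the last field on each line, and it skips the 30-minute idle rule because that needs history.

```shell
#!/usr/bin/env bash
# check_dcgm_alerts (hypothetical helper): read Prometheus text-format metrics
# on stdin and print one line per sample that crosses a threshold.
check_dcgm_alerts() {
  awk '
    /^#/ || NF < 2 { next }            # skip HELP/TYPE comments and blank lines
    {
      name = $1
      sub(/\{.*/, "", name)            # drop the label set, keep the metric name
      value = $NF + 0                  # sample value is assumed to be the last field
      if (name == "DCGM_FI_DEV_GPU_TEMP" && value > 90)
        print "thermal " name "=" value
      else if (name == "DCGM_FI_DEV_XID_ERRORS" && value > 0)
        print "hardware-fault " name "=" value
      else if (name == "DCGM_FI_DEV_MEM_COPY_UTIL" && value > 95)
        print "memory-bandwidth " name "=" value
    }
  '
}
```

For example, after port-forwarding the exporter, `curl -s localhost:9400/metrics | check_dcgm_alerts` prints nothing when all samples are within limits.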
Key Takeaways
- GPU pod failures have 5 main root causes: device plugin, runtime, driver, scheduling, DRA
- Start diagnosis with `kubectl describe pod`, then the Events section, then node GPU capacity
- CUDA/driver version incompatibility is the #1 cause of container GPU failures
- DRA (v1.34+) adds new failure modes: ResourceClaim selectors must match device attributes
- Monitor GPU health with DCGM Exporter; Xid errors indicate hardware faults
- Always pin CUDA versions and driver versions for reproducible GPU deployments
