ai beginner ⏱ 15 minutes K8s 1.28+

nvidia-smi Monitoring in K8s Pods

Run nvidia-smi inside Kubernetes pods to monitor GPU memory usage, temperature, and utilization, and to automate health checks with liveness probes.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Run kubectl exec -it <gpu-pod> -- nvidia-smi to check GPU memory usage, temperature, and utilization from inside a pod, and use a liveness probe that calls nvidia-smi for automated health checks.

The Problem

GPU workloads on Kubernetes are hard to observe from the outside. Without monitoring inside the pod, they can waste resources, fail to schedule, or run with degraded inference performance without anyone noticing. Running nvidia-smi inside the pod shows what the container actually sees.

The Solution

Prerequisites

# Verify GPU nodes are available
kubectl get nodes -l nvidia.com/gpu.present=true
# nvidia.com/gpu is usually listed after cpu, storage, hugepages, and memory, so show enough lines
kubectl describe node <gpu-node> | grep -A7 "Allocatable"

# Check NVIDIA driver and CUDA
kubectl exec -it <gpu-pod> -- nvidia-smi

Configuration

# nvidia-smi Monitoring in K8s Pods: production configuration
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  namespace: gpu-inference
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/pytorch:24.07-py3
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Deployment

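# Create the namespace used in the manifest, then apply the GPU workload
kubectl create namespace gpu-inference
kubectl apply -f gpu-workload.yaml

# Verify GPU allocation (the pod lives in the gpu-inference namespace)
kubectl describe pod gpu-workload -n gpu-inference | grep -A3 "Limits"

# Monitor power, utilization, clocks, violations, memory, ECC, and PCIe throughput every 5 seconds
kubectl exec -it gpu-workload -n gpu-inference -- nvidia-smi dmon -s pucvmet -d 5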
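For scripted monitoring, nvidia-smi can also emit CSV through --query-gpu, which is easier to parse than the dmon table. The following is a minimal Python sketch, not part of the original recipe: the GpuSample class and the function names are hypothetical, and it assumes every queried field returns a number (some GPUs report values such as [N/A], which would raise a ValueError here).

```python
import csv
import io
import subprocess
from dataclasses import dataclass

# Fields requested from nvidia-smi; the order must match GpuSample below.
QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuSample:
    """One row of nvidia-smi output (hypothetical helper type)."""
    index: int
    name: str
    temperature_c: int
    utilization_pct: int
    memory_used_mib: int
    memory_total_mib: int

    @property
    def memory_used_fraction(self) -> float:
        return self.memory_used_mib / self.memory_total_mib


def parse_gpu_csv(text: str) -> list[GpuSample]:
    """Parse output produced with --format=csv,noheader,nounits."""
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        index, name, temp, util, used, total = (field.strip() for field in row)
        samples.append(
            GpuSample(int(index), name, int(temp), int(util), int(used), int(total))
        )
    return samples


def query_gpus() -> list[GpuSample]:
    """Run nvidia-smi in the current container and parse its CSV output."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)
```

Inside the gpu-workload pod, calling query_gpus() returns one GpuSample per visible GPU. Keeping parse_gpu_csv separate from the subprocess call means the parsing logic can be tested on a machine without a GPU.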

Verification

# Check GPU is accessible inside the pod
kubectl exec -it gpu-workload -n gpu-inference -- python3 -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
print(f'GPU name: {torch.cuda.get_device_name(0)}')
# total_memory is reported in bytes
print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
"

Workflow diagram (Mermaid source):

graph TD
    A[GPU Node] --> B[NVIDIA Driver]
    B --> C[Container Toolkit]
    C --> D[Device Plugin]
    D --> E[Pod GPU Access]
    E --> F{Inference / Training}
    F --> G[Monitor with nvidia-smi]
    G --> H[Scale with HPA/KEDA]

Common Issues

GPU not visible inside pod

Check that the NVIDIA device plugin DaemonSet is running on the node. Verify with kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset. If no device plugin pod is listed for the node, the GPU Operator may need reinstalling.

CUDA version mismatch

The container CUDA version must be compatible with the host driver. Run nvidia-smi on the node to check the driver version, then select a compatible container image from the NVIDIA NGC catalog.
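The driver check can be scripted. The sketch below is an illustration rather than an official compatibility check: the function names are hypothetical, and the minimum driver version is a value you supply after looking it up in NVIDIA's release notes for the CUDA version in your image.

```python
import subprocess


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '535.104.05' into (535, 104, 5)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_satisfies(driver_version: str, minimum: str) -> bool:
    """Return True when driver_version is at least the required minimum."""
    return parse_version(driver_version) >= parse_version(minimum)


def read_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU it lists."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

A call such as driver_satisfies(read_driver_version(), "525.60.13") would then gate a deployment step. The version string in that call is only an example; confirm the real minimum for your CUDA release before relying on it.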

Out of memory on GPU

Reduce batch size, enable gradient checkpointing for training, or use model quantization (AWQ/GPTQ) for inference. Monitor with nvidia-smi to track peak memory usage.
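To track peak memory with nvidia-smi, one option is to poll memory.used while the workload runs and keep the maximum. This is a rough sketch with hypothetical function names; it assumes GPU index 0 is the pod's GPU, and polling can miss spikes shorter than the interval.

```python
import subprocess
import time
from typing import Callable


def read_used_mib(gpu_index: int = 0) -> int:
    """Read the memory currently in use on one GPU, in MiB."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip())


def track_peak_mib(read: Callable[[], int], polls: int, interval_s: float = 1.0) -> int:
    """Call read() `polls` times and return the highest value seen."""
    peak = 0
    for i in range(polls):
        peak = max(peak, read())
        if i < polls - 1:  # no need to wait after the final sample
            time.sleep(interval_s)
    return peak
```

For example, track_peak_mib(read_used_mib, polls=60, interval_s=5) samples for about five minutes. Passing the reader as an argument keeps the polling loop testable without a GPU.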

Best Practices

  • Always set resources.limits for nvidia.com/gpu; without it, pods won't get GPU access
  • Use node selectors or affinity to target specific GPU types (A100, H100, etc.)
  • Monitor GPU utilization with DCGM Exporter + Prometheus; idle GPUs waste expensive resources
  • Pin CUDA container versions; don't use latest tags in production
  • Enable GPU health checks with liveness probes that verify CUDA functionality
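A liveness probe needs a command that exits non-zero when the GPU is unhealthy. The script below is a minimal sketch of such a check, with hypothetical names throughout. Note that it only confirms nvidia-smi can still reach the driver and list a GPU; verifying CUDA functionality itself would need a small CUDA operation on top, which is not shown here.

```python
import subprocess


def gpu_healthy(run=subprocess.run, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Return (healthy, reason) based on a short nvidia-smi query."""
    try:
        result = run(
            ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except FileNotFoundError:
        return False, "nvidia-smi not found in container"
    except subprocess.TimeoutExpired:
        return False, f"nvidia-smi did not answer within {timeout_s}s"
    if result.returncode != 0:
        return False, f"nvidia-smi exited with {result.returncode}: {result.stderr.strip()}"
    if not result.stdout.strip():
        return False, "nvidia-smi reported no GPUs"
    return True, "ok"


def main() -> int:
    """Print the reason and return an exit code suitable for a probe."""
    healthy, reason = gpu_healthy()
    print(reason)
    return 0 if healthy else 1
```

Assuming the file is saved as gpu_health.py somewhere on the container's Python path (a hypothetical name), an exec liveness probe could run python3 -c "import sys, gpu_health; sys.exit(gpu_health.main())". Set the probe's timeoutSeconds above the script's own timeout so the kubelet does not kill the check first.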

Key Takeaways

  • Running nvidia-smi inside pods gives direct visibility into production GPU workloads on Kubernetes
  • Proper resource configuration prevents scheduling failures and resource waste
  • Monitor GPU utilization to right-size allocations and reduce cloud costs
  • Use NVIDIA GPU Operator for automated driver and toolkit lifecycle management
  • Combine with KEDA or custom metrics HPA for GPU-aware autoscaling
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
