AI Β· Intermediate Β· ⏱ 25 minutes Β· Kubernetes 1.28+

Install NVIDIA GPU Operator on Kubernetes

Deploy the NVIDIA GPU Operator to automate GPU driver, container toolkit, and device plugin management across your Kubernetes cluster.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Install with Helm: helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace. The operator auto-deploys GPU drivers, container toolkit, device plugin, and monitoring on every GPU node. Verify with kubectl get pods -n gpu-operator and kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'.

The NVIDIA GPU Operator automates the deployment and lifecycle of GPU software components on Kubernetes. Instead of manually installing drivers and plugins on each node, the operator handles everything as DaemonSets.

What the GPU Operator Manages

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  GPU Operator (namespace: gpu-operator)      β”‚
β”‚                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ GPU Driver     β”‚  β”‚ Container Toolkit β”‚  β”‚
β”‚  β”‚ (DaemonSet)    β”‚  β”‚ (DaemonSet)       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Device Plugin  β”‚  β”‚ DCGM Exporter     β”‚  β”‚
β”‚  β”‚ (DaemonSet)    β”‚  β”‚ (Monitoring)      β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ GPU Feature    β”‚  β”‚ MIG Manager       β”‚  β”‚
β”‚  β”‚ Discovery      β”‚  β”‚ (Optional)        β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Install with Helm

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true

# Wait for all pods to be ready
kubectl wait --for=condition=Ready pods --all -n gpu-operator --timeout=600s
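Rather than a string of --set flags, the same options can live in a values file. A minimal sketch; the key names follow the gpu-operator chart, but the pinned driver version is purely illustrative, so check the values supported by your chart release:

```yaml
# values.yaml -- sketch; the pinned driver version is an example,
# not a recommendation
driver:
  enabled: true
  version: "550.90.07"   # pin a specific driver branch instead of the chart default
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
```

Then install with `helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace -f values.yaml`. Keeping the values in a file makes upgrades reproducible and diffable.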

Pre-Installed Drivers (Skip Driver Install)

If GPU drivers are already installed on nodes (common on OpenShift or cloud-managed nodes):

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false
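Before disabling the operator-managed driver, it is worth confirming the host driver really is present. A quick check, run on the GPU node itself (assumes a standard driver install that puts nvidia-smi on the PATH):

```shell
# Run on the GPU node: confirm the host driver is loaded and report
# its version before telling the operator not to manage it
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

If this errors out, the node has no usable driver and `driver.enabled=false` would leave GPUs unschedulable.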

OpenShift Installation

On OpenShift, install via the OperatorHub:

  1. Go to Operators β†’ OperatorHub
  2. Search for NVIDIA GPU Operator
  3. Click Install
  4. Select namespace nvidia-gpu-operator
  5. Accept defaults and click Install

Or via CLI:

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v24.9
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
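To confirm the Subscription resolved and the operator actually installed, something like the following should work (the exact CSV name varies by release):

```shell
# Check which ClusterServiceVersion the Subscription installed
oc get subscription gpu-operator-certified -n nvidia-gpu-operator \
  -o jsonpath='{.status.installedCSV}{"\n"}'

# The CSV should reach phase Succeeded before you create the ClusterPolicy
oc get csv -n nvidia-gpu-operator
```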

Then create a ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: crio
  driver:
    enabled: false    # OpenShift provides pre-installed drivers
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
  gfd:
    enabled: true
  migManager:
    enabled: false

Verify GPU Availability

# Check operator pods
kubectl get pods -n gpu-operator

# Verify GPU resources on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.nvidia\\.com/gpu

# Describe a GPU node
kubectl describe node <gpu-node-name> | grep -A5 "Capacity\|Allocatable"

# Check GPU details from within a pod (recent kubectl versions removed
# the --limits flag from kubectl run, so set the limit via --overrides)
kubectl run gpu-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.4.0-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.4.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

Expected nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15   Driver Version: 550.54.15   CUDA Version: 12.4      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id       Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM4   On    | 00000000:07:00.0 Off |                    0 |
| N/A   32C    P0    62W / 400W |      0MiB / 81920MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
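The same smoke test can be expressed as a plain Pod manifest, which sidesteps version differences in kubectl run flags. A sketch:

```yaml
# gpu-test-pod.yaml -- one-shot pod that requests a GPU and prints
# nvidia-smi output, then exits
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

Apply it and read the result with `kubectl apply -f gpu-test-pod.yaml && kubectl logs -f gpu-test`.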

Enable GPU Time-Slicing

Share a single GPU across multiple pods using time-slicing:

# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4    # Each GPU appears as 4 virtual GPUs

Apply it:

kubectl apply -f time-slicing-config.yaml

# Patch the ClusterPolicy to use time-slicing. ClusterPolicy is
# cluster-scoped (no namespace needed); with a Helm install the default
# name is typically cluster-policy -- confirm with: kubectl get clusterpolicy
kubectl patch clusterpolicy/gpu-cluster-policy \
  --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

Once the device plugin pods restart, each GPU node advertises 4Γ— its physical GPU count:

kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.nvidia\\.com/gpu
# Node with 1 A100 now shows GPUs: 4
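To see the sharing in action, a deployment whose replicas each request one (virtual) GPU should now fit on a single physical card. A sketch, assuming the 4-replica time-slicing config above:

```yaml
# Four replicas can schedule onto one physical GPU, each consuming
# one of the four time-sliced virtual GPUs
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-shared-demo
spec:
  replicas: 4
  selector:
    matchLabels:
      app: gpu-shared-demo
  template:
    metadata:
      labels:
        app: gpu-shared-demo
    spec:
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.0-base-ubuntu22.04
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
```

Note that time-slicing provides no memory isolation between the sharers; for hard isolation, use MIG below.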

Enable MIG (Multi-Instance GPU)

For A100/H100 GPUs, MIG provides hardware-isolated GPU partitions:

# Enable MIG Manager in ClusterPolicy
spec:
  migManager:
    enabled: true
    config:
      name: default-mig-parted-config
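With the MIG Manager enabled, partition layouts are requested by labeling the node. A sketch; the all-1g.10gb profile applies to 80 GB-class GPUs, and available profile names vary by model:

```shell
# Ask the MIG Manager to slice the node's GPUs into 1g.10gb instances
kubectl label nodes <gpu-node-name> nvidia.com/mig.config=all-1g.10gb --overwrite

# The manager sets a state label while it reconfigures; wait for "success"
kubectl get node <gpu-node-name> \
  -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}'
```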

DCGM Monitoring Metrics

The GPU Operator deploys DCGM Exporter which serves Prometheus metrics:

# Check DCGM Exporter is running
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Sample metrics
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

Key metrics:

Metric                     Description
DCGM_FI_DEV_GPU_UTIL       GPU utilization (%)
DCGM_FI_DEV_FB_USED        GPU framebuffer memory used (MiB)
DCGM_FI_DEV_FB_FREE        GPU framebuffer memory free (MiB)
DCGM_FI_DEV_GPU_TEMP       GPU temperature (Β°C)
DCGM_FI_DEV_POWER_USAGE    Power consumption (W)
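If the cluster runs the Prometheus Operator, these metrics feed alerting directly. A hypothetical rule on GPU temperature; the threshold and the label names used in the annotation should be checked against what your exporter actually emits:

```yaml
# Sketch: fire a warning when any GPU stays above 85 C for 5 minutes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-temperature
  namespace: gpu-operator
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUHotspot
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 85C"
```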

Troubleshooting

Symptom                          Cause                       Fix
No nvidia.com/gpu on nodes       Device plugin not running   Check kubectl get pods -n gpu-operator
Driver pod in CrashLoopBackOff   Kernel headers missing      Install the matching kernel-devel package
GPU test pod stays Pending       No allocatable GPUs         Verify node labels and taints
DCGM metrics empty               Exporter not running        Check the DCGM Exporter pod logs
#nvidia #gpu-operator #gpu #drivers #device-plugin #ai-workloads #infrastructure
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
