GPU Operator Canary Upgrade Strategy
Safely upgrade NVIDIA GPU Operator using canary node pools, 48-hour bake periods, validation gates, and Git-based rollback.
💡 Quick Answer: Label one GPU node as gpu-canary: "true", upgrade GPU Operator with a nodeSelector that targets only the canary, bake for 48 hours with validation gates (smoke training, inference, RDMA health, GPU errors), then promote to all nodes. Rollback = git revert.
The Problem
GPU Operator upgrades touch kernel modules, the CUDA runtime, the device plugin, and DCGM. A bad upgrade kills all GPU workloads cluster-wide. You can't "try it and see" on production; you need a staged rollout with validation before committing.
The Solution
Known-Good Version Matrix
# gpu-version-matrix.yaml (stored in Git)
versions:
  current:
    gpu_operator: "v24.9.0"
    network_operator: "v24.7.0"
    driver: "560.35.03"
    cuda: "12.6"
    firmware: "28.40.1000"
    sriov: "4.18.0"
    openshift: "4.16.23"
    status: "production"
  canary:
    gpu_operator: "v24.12.0"
    network_operator: "v24.10.0"
    driver: "565.57.01"
    cuda: "12.8"
    firmware: "28.42.1000"
    sriov: "4.18.0"
    openshift: "4.16.23"
    status: "testing"
Canary Node Setup
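Before labeling a node, it can help to see exactly which components the canary entry in the matrix above would change. The following is a minimal sketch, not part of the original recipe: `matrix_diff` is a hypothetical helper that does a naive line-based parse of the two-level layout shown above and assumes the `current` section appears before `canary`. A YAML-aware tool would be more robust.

```shell
#!/usr/bin/env bash
# matrix_diff FILE: print every component whose version differs between the
# "current" and "canary" sections of gpu-version-matrix.yaml.
# Hypothetical helper; understands only the simple key: "value" layout above.
matrix_diff() {
  awk '
    { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }   # trim indentation and trailing space
    /^#/ || $0 == "" { next }                      # skip comments and blank lines
    /^[A-Za-z_]+:$/ {                              # a bare "name:" line opens a section
      section = $0; sub(/:$/, "", section); next
    }
    /^[A-Za-z_]+:/ {                               # a key: "value" line
      key = $0; sub(/:.*/, "", key)
      val = $0; sub(/^[^:]*:[ \t]*/, "", val); gsub(/"/, "", val)
      if (section == "current") cur[key] = val
      if (section == "canary" && key != "status" && cur[key] != val)
        printf "%s: %s -> %s\n", key, cur[key], val
    }
  ' "$1"
}
```

Run against the matrix above, `matrix_diff gpu-version-matrix.yaml` would list `gpu_operator`, `network_operator`, `driver`, `cuda`, and `firmware` (for example `driver: 560.35.03 -> 565.57.01`) and stay silent about `sriov` and `openshift`, which are unchanged.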
# Label canary GPU node
oc label node gpu-worker-4 gpu-canary=true

# Create canary MachineConfigPool (OpenShift)
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: gpu-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: ["worker", "gpu-worker", "gpu-canary"]
  nodeSelector:
    matchLabels:
      gpu-canary: "true"
  paused: false
EOF
Staged Upgrade
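Step 3 below says "verify" without showing how. One possible check, sketched here under assumptions, is to poll until the driver pod on the canary node reports Ready before running the validation suite. `driver_ready` and `wait_for_driver` are hypothetical helper names; the `app=nvidia-driver-daemonset` label and the `gpu-operator` namespace are taken from the validation script in this recipe, and the namespace can be overridden through an assumed `GPU_OPERATOR_NS` variable.

```shell
#!/usr/bin/env bash
# driver_ready NODE: succeed only if the driver pod scheduled on NODE is Ready.
driver_ready() {
  local node=$1 ns=${GPU_OPERATOR_NS:-gpu-operator}
  local ready
  ready=$(kubectl get pods -n "$ns" -l app=nvidia-driver-daemonset \
    --field-selector "spec.nodeName=$node" \
    -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}')
  [ "$ready" = "True" ]
}

# wait_for_driver NODE [TIMEOUT_SECONDS] [POLL_SECONDS]
# Poll driver_ready until it passes, or give up after TIMEOUT_SECONDS.
wait_for_driver() {
  local node=$1 timeout=${2:-900} poll=${3:-15} waited=0
  until driver_ready "$node"; do
    if (( waited >= timeout )); then
      echo "driver not Ready on $node after ${timeout}s" >&2
      return 1
    fi
    sleep "$poll"
    waited=$(( waited + poll ))
  done
  echo "driver Ready on $node"
}
```

A call such as `wait_for_driver gpu-worker-4 && ./validate-gpu.sh gpu-worker-4` would then keep the validation suite from starting against a node whose driver is still building or loading.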
# Step 1: Drain canary node
oc adm drain gpu-worker-4 --ignore-daemonsets --delete-emptydir-data

# Step 2: Apply canary GPU Operator version
# In ClusterPolicy, use nodeSelector to target canary.
# This assumes your ClusterPolicy CRD accepts spec.driver.nodeSelector; if it does not,
# the operator's NVIDIADriver custom resource supports per-node-pool drivers via nodeSelector.
oc patch clusterpolicy gpu-cluster-policy --type=merge -p '
{
  "spec": {
    "driver": {
      "version": "565.57.01",
      "nodeSelector": {"gpu-canary": "true"}
    }
  }
}'

# Step 3: Uncordon and verify
oc adm uncordon gpu-worker-4

# Step 4: Run validation suite
./validate-gpu.sh gpu-worker-4

# Step 5: Bake for 48 hours
echo "Canary deployed at $(date). Monitor for 48h before promotion."

# Step 6: Promote to all nodes (after validation)
# A JSON merge patch ignores an empty object, so null is used to remove the selector.
oc patch clusterpolicy gpu-cluster-policy --type=merge -p '
{
  "spec": {
    "driver": {
      "version": "565.57.01",
      "nodeSelector": null
    }
  }
}'
Validation Gate Script
#!/bin/bash
# validate-gpu.sh: run against the canary node; exits non-zero when a gate fails
set -euo pipefail

NODE=${1:?usage: validate-gpu.sh <node-name>}
NS=${GPU_OPERATOR_NS:-gpu-operator}  # OpenShift installs often use nvidia-gpu-operator
IMAGE=nvcr.io/nvidia/pytorch:24.03-py3

# kubectl run has no --node-name flag, and recent kubectl releases no longer
# accept --limits, so the node pin and GPU limit go through --overrides.
run_on_gpu() {
  kubectl run "$1" --image="$IMAGE" --restart=Never --rm -i \
    --override-type=strategic \
    --overrides='{"spec":{"nodeName":"'"$NODE"'","containers":[{"name":"'"$1"'","resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
    -- python -c "$2"
}

echo "=== GPU Validation on $NODE ==="

# 1. Smoke training
echo "1. Smoke training test..."
run_on_gpu gpu-train-test "
import torch
x = torch.randn(1000, 1000, device='cuda')
y = torch.mm(x, x)
print(f'Training smoke: OK ({y.shape})')
"

# 2. Smoke inference
echo "2. Smoke inference test..."
run_on_gpu gpu-infer-test "
import torch
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device='cuda')
with torch.no_grad():
    y = model(x)
print(f'Inference smoke: OK ({y.shape})')
"

# Driver pod on the canary node, used by checks 3 and 4
DRIVER_POD=$(kubectl get pods -n "$NS" -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName="$NODE" -o name | head -1)

# 3. RDMA health
echo "3. RDMA health check..."
kubectl exec -n "$NS" "$DRIVER_POD" -- ibstat | grep -E "State|Rate"

# 4. GPU errors
echo "4. GPU error check..."
kubectl exec -n "$NS" "$DRIVER_POD" -- nvidia-smi \
  --query-gpu=ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total \
  --format=csv

echo "=== Validation Complete ==="
Upgrade Flow
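The "Pass?" decision in the flow below needs an exit status, but check 4 of the validation script only prints the ECC counters for a human to read. As a minimal sketch, the hypothetical `ecc_gate` function below reads the CSV that `nvidia-smi --query-gpu=... --format=csv` writes (one header row, then "corrected, uncorrected" per GPU) and fails when any GPU reports uncorrected errors. Treating non-numeric values as "unavailable" rather than as failures is an assumption; tighten it if ECC is expected to be enabled on every GPU.

```shell
#!/usr/bin/env bash
# ecc_gate: read nvidia-smi CSV on stdin and exit non-zero if any GPU row
# shows a non-zero uncorrected ECC count (second column).
ecc_gate() {
  awk -F',' '
    NR == 1 { next }                       # skip the CSV header row
    {
      uncorrected = $2
      gsub(/[ \t\r]/, "", uncorrected)     # strip padding around the value
      if (uncorrected ~ /^[0-9]+$/) {
        if (uncorrected + 0 > 0) {
          bad++
          printf "GPU row %d: %d uncorrected ECC errors\n", NR - 1, uncorrected
        }
      } else {
        printf "GPU row %d: ECC counters unavailable (%s)\n", NR - 1, uncorrected
      }
    }
    END { exit (bad > 0) }
  '
}
```

In the script, the output of the existing `nvidia-smi` query would be piped into `ecc_gate`, so a canary with uncorrected errors fails validation instead of passing silently.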
graph TD
A[New GPU Op Version] --> B[Update version-matrix.yaml]
B --> C[PR Review]
C --> D[Merge to canary branch]
D --> E[Deploy to Canary Node]
E --> F[Validation Suite]
F --> G{Pass?}
G -->|Yes| H[Bake 48 hours]
G -->|No| I[Git Revert - Rollback]
H --> J{Stable?}
J -->|Yes| K[Promote to All Nodes]
J -->|No| I
K --> L[Update matrix: canary to current]
I --> M[Investigate and retry]
Common Issues
- Canary upgrade breaks DaemonSet on all nodes: use nodeSelector in ClusterPolicy to scope the driver version to the canary only
- Validation passes but production fails: the canary may not exercise all workload patterns; extend validation to include multi-GPU and distributed training tests
- Rollback takes too long: git revert + ArgoCD sync is fastest; manual oc patch as fallback
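For the manual `oc patch` fallback, the patch body is easy to get subtly wrong under pressure. The sketch below uses a hypothetical `rollback_patch` helper to generate it from the known-good driver version in the matrix. It assumes, as the staged upgrade does, that the canary selector lives at `spec.driver.nodeSelector`; the `null` matters because a JSON merge patch treats an empty object as "no change" and only `null` deletes a key.

```shell
#!/usr/bin/env bash
# rollback_patch VERSION: print a JSON merge patch that pins the driver back
# to VERSION and removes the canary nodeSelector.
rollback_patch() {
  local version=${1:?usage: rollback_patch <driver-version>}
  printf '{"spec":{"driver":{"version":"%s","nodeSelector":null}}}\n' "$version"
}
```

It could then be applied with `oc patch clusterpolicy gpu-cluster-policy --type=merge -p "$(rollback_patch 560.35.03)"`, followed by a git revert so the repository and the cluster agree again.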
Best Practices
- Always maintain a version matrix in Git: know what's running everywhere
- Canary on a single GPU node first: never upgrade all nodes simultaneously
- Validate: smoke training, smoke inference, RDMA health, GPU ECC errors
- Bake for 48 hours minimum: some issues only surface under sustained load
- Rollback is a git revert: ArgoCD syncs the previous known-good state
- Test canary with real tenant workloads if possible (route a subset of traffic)
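The 48-hour minimum is easier to respect when the promotion step refuses to run early. A minimal sketch, assuming the canary deployment time was recorded as a Unix timestamp when the canary went live (the `bake_complete` name and the way the timestamp is stored are hypothetical):

```shell
#!/usr/bin/env bash
# bake_complete START_EPOCH [HOURS] [NOW_EPOCH]
# Succeeds once HOURS (default 48) have passed since START_EPOCH; otherwise
# reports the time remaining on stderr and fails. NOW_EPOCH defaults to the
# current time and exists mainly so the check can be tested.
bake_complete() {
  local start=$1 hours=${2:-48} now=${3:-$(date +%s)}
  local remaining=$(( start + hours * 3600 - now ))
  if (( remaining > 0 )); then
    echo "bake incomplete: $(( remaining / 3600 ))h $(( remaining % 3600 / 60 ))m remaining" >&2
    return 1
  fi
  echo "bake complete"
}
```

For example, writing `date +%s` to a file at deployment time and guarding promotion with `bake_complete "$(cat canary-start)" && <promotion command>` turns the bake period into a gate rather than a reminder.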
Key Takeaways
- Canary upgrade strategy: one node → validate → bake → promote
- Version matrix in Git provides audit trail and known-good reference
- Validation gates: training, inference, RDMA, GPU errors; all must pass
- 48-hour bake catches issues that quick tests miss
- Rollback = git revert; ArgoCD auto-syncs the previous version

