How to Upgrade Kubernetes Clusters Safely
Perform Kubernetes cluster upgrades with zero downtime. Learn upgrade strategies, pre-flight checks, rollback procedures, and best practices for production clusters.
The Problem
Kubernetes ships a new minor version roughly every four months, bringing security fixes, bug fixes, and new features. Clusters on outdated versions miss critical updates and eventually fall out of support. Upgrading incorrectly can cause outages.
The Solution
Follow a systematic upgrade process: prepare the cluster, back up critical data, upgrade the control plane first, then the worker nodes, and validate at each step. Always have a rollback plan.
Kubernetes Upgrade Overview
flowchart TB
subgraph P1["Phase 1: PREPARATION"]
P1A["✓ Review release notes and breaking changes"]
P1B["✓ Check component compatibility - CNI, CSI, Ingress"]
P1C["✓ Backup etcd"]
P1D["✓ Ensure PodDisruptionBudgets are in place"]
P1E["✓ Test upgrade in staging environment"]
end
subgraph P2["Phase 2: CONTROL PLANE UPGRADE"]
P2A["Control Plane Node 1 → Node 2 → Node 3<br/>one at a time for HA"]
P2B["Order: etcd → kube-apiserver →<br/>kube-controller-manager → kube-scheduler →<br/>cloud-controller-manager"]
end
subgraph P3["Phase 3: WORKER NODE UPGRADE"]
P3A["Worker 1: Cordon → Drain → Upgrade → Uncordon"]
P3B["Worker 2: Cordon → Drain → Upgrade → Uncordon"]
P3C["Worker N: Cordon → Drain → Upgrade → Uncordon"]
end
subgraph P4["Phase 4: VALIDATION"]
P4A["✓ All nodes Ready"]
P4B["✓ All system pods Running"]
P4C["✓ Application health checks passing"]
P4D["✓ Monitoring and logging functional"]
end
P1 --> P2
P2 --> P3
P3 --> P4

Version Skew Policy
flowchart LR
subgraph SKEW["Kubernetes Version Skew Rules"]
direction TB
API["kube-apiserver<br/>N/A - defines the version"]
CM["kube-controller-mgr<br/>Β±1 minor from apiserver"]
SCHED["kube-scheduler<br/>Β±1 minor from apiserver"]
KL["kubelet<br/>-3 to 0 from apiserver"]
KP["kube-proxy<br/>Same as kubelet"]
KC["kubectl<br/>Β±1 minor from apiserver"]
end

Example: If API server is 1.29:
- kube-controller-manager / kube-scheduler: Must be 1.28 or 1.29 (never newer than the API server)
- Kubelet: Can be 1.26, 1.27, 1.28, or 1.29
- kubectl: Can be 1.28, 1.29, or 1.30
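Before upgrading, it helps to verify the skew mechanically rather than by eye. A minimal sketch, assuming jq is installed and your kubectl context points at the cluster:

#!/bin/bash
# check-skew.sh - flag kubelets outside the supported skew vs the API server
SERVER_MINOR=$(kubectl version -o json | jq -r '.serverVersion.minor' | tr -d '+')
kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name) \(.status.nodeInfo.kubeletVersion)"' \
  | while read -r node v; do
      minor=$(echo "$v" | cut -d. -f2)
      if [ $((SERVER_MINOR - minor)) -gt 3 ]; then
        echo "WARNING: $node kubelet $v is more than 3 minors behind apiserver 1.$SERVER_MINOR"
      elif [ $((SERVER_MINOR - minor)) -lt 0 ]; then
        echo "WARNING: $node kubelet $v is NEWER than apiserver 1.$SERVER_MINOR - not allowed"
      fi
    done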
Phase 1: Pre-Upgrade Preparation
Check Current Versions
# Check cluster version
kubectl version
# Check node versions
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
VERSION:.status.nodeInfo.kubeletVersion,\
OS:.status.nodeInfo.osImage
# Check component versions
kubectl get pods -n kube-system -o custom-columns=\
NAME:.metadata.name,\
IMAGE:.spec.containers[*].image | grep -E "kube-|etcd|coredns"

Review Release Notes
# Key areas to check:
# 1. Deprecations and removals
# 2. API changes
# 3. Feature gate changes
# 4. Known issues
# Check for deprecated APIs in use
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
# Or use kubent (kube-no-trouble)
kubent
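If kubent is not an option, the deprecated-API metric itself can be made readable with standard text tools. A sketch that assumes the alphabetical label order the API server currently emits (group, removed_release, resource, subresource, version):

# Summarize deprecated APIs still being requested, e.g. "ingresses.networking.k8s.io/v1beta1 (removal: 1.22)"
kubectl get --raw /metrics \
  | grep '^apiserver_requested_deprecated_apis' \
  | sed -E 's/^apiserver_requested_deprecated_apis\{group="([^"]*)",removed_release="([^"]*)",resource="([^"]*)",subresource="([^"]*)",version="([^"]*)"\}.*/\3.\1\/\5 (removal: \2)/' \
  | sort -u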

Verify Component Compatibility
# Check CNI plugin compatibility
kubectl get pods -n kube-system -l k8s-app=calico-node -o jsonpath='{.items[0].spec.containers[0].image}'
# Check CSI driver compatibility
kubectl get csidrivers
# Check Ingress controller version
kubectl get pods -n ingress-nginx -o jsonpath='{.items[0].spec.containers[0].image}'
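If these components are Helm-managed, one listing shows every release with its chart and app version, which you can check against the target Kubernetes version's compatibility matrices (assumes Helm v3):

# Chart and app versions for all Helm releases across namespaces
helm list -A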

Backup etcd
# Create etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-pre-upgrade-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-pre-upgrade-*.db --write-out=table
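A snapshot that lives only on the control plane node disappears with the node. A minimal safeguard, with backup-host and /srv/k8s-backups as hypothetical placeholders:

# Copy the snapshot off the control plane node before the upgrade begins
scp /backup/etcd-pre-upgrade-$(date +%Y%m%d).db backup-host:/srv/k8s-backups/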

Ensure PodDisruptionBudgets
# Check existing PDBs
kubectl get pdb -A
# Create PDBs for critical workloads if missing
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app
EOF
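A PDB only protects a drain if it currently allows at least one disruption; a PDB with zero headroom blocks eviction indefinitely. The status fields make this easy to check:

# Show how many voluntary disruptions each PDB currently allows
kubectl get pdb -A -o custom-columns=\
NAMESPACE:.metadata.namespace,\
NAME:.metadata.name,\
MIN-AVAILABLE:.spec.minAvailable,\
ALLOWED:.status.disruptionsAllowed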

Pre-Upgrade Checklist Script
#!/bin/bash
# pre-upgrade-check.sh
echo "=== Kubernetes Pre-Upgrade Checklist ==="
echo -e "\n1. Cluster Health"
kubectl get nodes
kubectl get componentstatuses 2>/dev/null || kubectl get --raw='/readyz?verbose'
echo -e "\n2. System Pods Status"
kubectl get pods -n kube-system | grep -v Running
echo -e "\n3. Deprecated APIs in Use"
kubectl get --raw /metrics 2>/dev/null | grep apiserver_requested_deprecated_apis | head -10
echo -e "\n4. PodDisruptionBudgets"
kubectl get pdb -A
echo -e "\n5. Pending PVCs"
kubectl get pvc -A | grep -v Bound
echo -e "\n6. Nodes with Issues"
kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name'
echo -e "\n7. etcd Health"
kubectl exec -n kube-system etcd-$(hostname) -- etcdctl endpoint health \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key 2>/dev/null || echo "Run on control plane node"
echo -e "\n=== Pre-upgrade checks complete ==="Phase 2: Control Plane Upgrade (kubeadm)

Phase 2: Control Plane Upgrade (kubeadm)
Upgrade First Control Plane Node
# Check available versions
apt update
apt-cache madison kubeadm
# Upgrade kubeadm
apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm=1.29.0-1.1
apt-mark hold kubeadm
# Verify kubeadm version
kubeadm version
# Check upgrade plan
kubeadm upgrade plan
# Apply upgrade (first control plane node only)
kubeadm upgrade apply v1.29.0
# Upgrade kubelet and kubectl
apt-mark unhold kubelet kubectl
apt-get update && apt-get install -y kubelet=1.29.0-1.1 kubectl=1.29.0-1.1
apt-mark hold kubelet kubectl
# Restart kubelet
systemctl daemon-reload
systemctl restart kubelet
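Before starting on the next control plane node, confirm this one came back healthy; upgrading a second member while the first is unhealthy risks etcd quorum. A quick check, with cp-1 as a hypothetical node name:

# Node Ready on the new version?
kubectl get node cp-1
# Control plane pods on this node running?
kubectl get pods -n kube-system -o wide | grep cp-1
# API server still answering?
kubectl get --raw='/readyz'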

Upgrade Additional Control Plane Nodes
# On each additional control plane node:
# Upgrade kubeadm
apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm=1.29.0-1.1
apt-mark hold kubeadm
# Upgrade node (not "apply", use "node")
kubeadm upgrade node
# Upgrade kubelet and kubectl
apt-mark unhold kubelet kubectl
apt-get update && apt-get install -y kubelet=1.29.0-1.1 kubectl=1.29.0-1.1
apt-mark hold kubelet kubectl
# Restart kubelet
systemctl daemon-reload
systemctl restart kubelet

Phase 3: Worker Node Upgrade
Upgrade Worker Nodes (One at a Time)
# From a machine with kubectl access:
# 1. Cordon the node (mark unschedulable)
kubectl cordon worker-1
# 2. Drain the node (evict pods)
kubectl drain worker-1 \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=300
# 3. SSH to worker node and upgrade
ssh worker-1
# Upgrade kubeadm
apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm=1.29.0-1.1
apt-mark hold kubeadm
# Upgrade node configuration
kubeadm upgrade node
# Upgrade kubelet and kubectl
apt-mark unhold kubelet kubectl
apt-get update && apt-get install -y kubelet=1.29.0-1.1 kubectl=1.29.0-1.1
apt-mark hold kubelet kubectl
# Restart kubelet
systemctl daemon-reload
systemctl restart kubelet
exit
# 4. Uncordon the node (allow scheduling)
kubectl uncordon worker-1
# 5. Verify node is ready
kubectl get node worker-1
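Note that --force deletes pods that have no owning controller, and nothing recreates them. It is worth listing such pods before draining; a sketch assuming jq is installed:

# Standalone pods (no ownerReferences) are deleted permanently by drain --force
kubectl get pods -A -o json \
  | jq -r '.items[] | select(.metadata.ownerReferences == null) | "\(.metadata.namespace)/\(.metadata.name)"'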

Automated Worker Upgrade Script
#!/bin/bash
# upgrade-worker.sh
NODE=$1
VERSION=${2:-1.29.0-1.1}
echo "Upgrading $NODE to version $VERSION"
# Cordon
echo "Cordoning $NODE..."
kubectl cordon $NODE
# Drain
echo "Draining $NODE..."
kubectl drain $NODE \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=300 \
--timeout=600s
# Upgrade via SSH
echo "Upgrading packages on $NODE..."
ssh $NODE << EOF
apt-mark unhold kubeadm kubelet kubectl
apt-get update
apt-get install -y kubeadm=$VERSION kubelet=$VERSION kubectl=$VERSION
apt-mark hold kubeadm kubelet kubectl
kubeadm upgrade node
systemctl daemon-reload
systemctl restart kubelet
EOF
# Wait for node to be ready
echo "Waiting for $NODE to be ready..."
kubectl wait --for=condition=Ready node/$NODE --timeout=300s
# Uncordon
echo "Uncordoning $NODE..."
kubectl uncordon $NODE
echo "$NODE upgrade complete!"Phase 4: Post-Upgrade Validation

Phase 4: Post-Upgrade Validation
Validate Cluster Health
# Check all nodes are Ready and correct version
kubectl get nodes
# Check control plane components
kubectl get pods -n kube-system
# Check component health
kubectl get --raw='/readyz?verbose'
# Verify etcd health
kubectl exec -n kube-system etcd-controlplane -- etcdctl endpoint health \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key

Validate Applications
# Check all deployments are healthy
kubectl get deployments -A -o wide
# Check for pods not running
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Run application-specific health checks
kubectl exec -n production deploy/myapp -- /health-check.sh

Validation Script
#!/bin/bash
# post-upgrade-validation.sh
EXPECTED_VERSION=$1
echo "=== Post-Upgrade Validation ==="
echo -e "\n1. Node Versions"
# --no-headers keeps the header row from triggering a false version mismatch
kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion | \
grep -v "$EXPECTED_VERSION" && echo "WARNING: Version mismatch found!" || echo "All nodes on $EXPECTED_VERSION"
echo -e "\n2. Node Status"
NOT_READY=$(kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name')
if [ -n "$NOT_READY" ]; then
echo "WARNING: Nodes not ready: $NOT_READY"
else
echo "All nodes Ready"
fi
echo -e "\n3. System Pods"
FAILING=$(kubectl get pods -n kube-system --field-selector=status.phase!=Running,status.phase!=Succeeded 2>/dev/null)
if [ -n "$FAILING" ]; then
echo "WARNING: Failing system pods:"
echo "$FAILING"
else
echo "All system pods healthy"
fi
echo -e "\n4. API Server Health"
kubectl get --raw='/readyz' && echo " - API server healthy" || echo "WARNING: API server issues"
echo -e "\n5. CoreDNS Test"
# -i without -t so the check also works when no TTY is attached (e.g., from CI)
kubectl run dns-test --image=busybox:1.28 --rm -i --restart=Never -- nslookup kubernetes.default 2>/dev/null && \
echo "DNS working" || echo "WARNING: DNS issues"
echo -e "\n=== Validation Complete ==="Managed Kubernetes Upgrades
EKS Upgrade
# Check available versions
aws eks describe-addon-versions --kubernetes-version 1.29 --query 'addons[].addonVersions[0].addonVersion'
# Update control plane
aws eks update-cluster-version \
--name my-cluster \
--kubernetes-version 1.29
# Wait for update
aws eks wait cluster-active --name my-cluster
# Update node groups
aws eks update-nodegroup-version \
--cluster-name my-cluster \
--nodegroup-name my-nodegroup \
--kubernetes-version 1.29
# Update add-ons
aws eks update-addon --cluster-name my-cluster --addon-name coredns
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni
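If the cluster is managed with eksctl, the control plane step can be driven with one command instead; shown here as an alternative, it does not replace the add-on updates above:

# eksctl moves the control plane up one minor version per run
eksctl upgrade cluster --name my-cluster --version 1.29 --approve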

AKS Upgrade
# Check available upgrades
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table
# Upgrade control plane and nodes
az aks upgrade \
--resource-group myResourceGroup \
--name myAKSCluster \
--kubernetes-version 1.29.0
# Or upgrade control plane only
az aks upgrade \
--resource-group myResourceGroup \
--name myAKSCluster \
--kubernetes-version 1.29.0 \
--control-plane-only
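Between version upgrades, AKS can also refresh only the node OS images, picking up kernel and runtime patches without changing the Kubernetes version:

# Patch node images only; the Kubernetes version stays the same
az aks upgrade \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-image-only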

GKE Upgrade
# Check available versions
gcloud container get-server-config --zone us-central1-a
# Upgrade control plane
gcloud container clusters upgrade my-cluster \
--master \
--cluster-version 1.29 \
--zone us-central1-a
# Upgrade node pools
gcloud container clusters upgrade my-cluster \
--node-pool default-pool \
--cluster-version 1.29 \
--zone us-central1-a
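GKE can also handle this for you: enrolling the cluster in a release channel lets Google schedule control plane and node upgrades, making the manual commands above the exception rather than the routine:

# Enroll the cluster in the 'regular' release channel for managed upgrades
gcloud container clusters update my-cluster \
  --release-channel regular \
  --zone us-central1-a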

Rollback Procedures
Rollback Control Plane (kubeadm)
# Rollback is complex - restore from etcd backup is usually safer
# 1. Stop the API server and etcd static pods (kubelet removes them once the manifests move;
#    etcd must be stopped before its data directory is swapped)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests/etcd.yaml /tmp/
# 2. Restore etcd from backup
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-pre-upgrade.db \
--data-dir=/var/lib/etcd-restored
# 3. Update etcd to use restored data
sudo mv /var/lib/etcd /var/lib/etcd-old
sudo mv /var/lib/etcd-restored /var/lib/etcd
# 4. Downgrade packages (unhold first; --allow-downgrades is required to install an older version)
apt-mark unhold kubeadm kubelet kubectl
apt-get install -y --allow-downgrades kubeadm=1.28.0-1.1 kubelet=1.28.0-1.1 kubectl=1.28.0-1.1
apt-mark hold kubeadm kubelet kubectl
# 5. Restore the static pod manifests
sudo mv /tmp/kube-apiserver.yaml /tmp/etcd.yaml /etc/kubernetes/manifests/
# 6. Restart kubelet
systemctl daemon-reload
systemctl restart kubelet
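After the rollback, verify the control plane answers and that the restored state looks right before declaring the cluster healthy:

# API server healthy and objects restored from the snapshot?
kubectl get --raw='/readyz'
kubectl get nodes
kubectl get deployments -A | head -20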

Rollback Worker Nodes
# Drain node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# SSH and downgrade
ssh worker-1
apt-mark unhold kubeadm kubelet kubectl
apt-get install -y --allow-downgrades kubeadm=1.28.0-1.1 kubelet=1.28.0-1.1 kubectl=1.28.0-1.1
apt-mark hold kubeadm kubelet kubectl
kubeadm upgrade node
systemctl daemon-reload
systemctl restart kubelet
exit
# Uncordon
kubectl uncordon worker-1

Best Practices
1. Upgrade Regularly
Recommended cadence:
- Minor versions: Within 3 months of release
- Patch versions: Within 2 weeks of release (security)
- Never skip minor versions (1.27 → 1.29 is NOT supported)
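A small sketch to see how far the cluster trails the latest stable release, using the public stable-version endpoint (assumes curl and jq are available):

#!/bin/bash
# version-lag.sh - compare the cluster's minor version to the latest stable release
LATEST=$(curl -sL https://dl.k8s.io/release/stable.txt)   # e.g. v1.31.4
CURRENT=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
echo "Latest stable: $LATEST, cluster: $CURRENT"
LAG=$(( $(echo "$LATEST" | cut -d. -f2) - $(echo "$CURRENT" | cut -d. -f2) ))
[ "$LAG" -gt 1 ] && echo "WARNING: $LAG minor versions behind - upgrades must go one minor at a time"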

2. Stage the Upgrade
Environment progression:
Dev → Staging → Pre-prod → Production
Allow 1-2 weeks between each stage for issue discovery.

3. Maintain PodDisruptionBudgets
# Ensure critical workloads have PDBs
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: web

4. Document Everything
## Upgrade Runbook: v1.28 → v1.29
### Pre-requisites Completed
- [ ] etcd backup taken
- [ ] Staging upgrade successful
- [ ] Change window approved
### Control Plane Upgrade
- [ ] cp-1 upgraded at HH:MM
- [ ] cp-2 upgraded at HH:MM
- [ ] cp-3 upgraded at HH:MM
### Worker Node Upgrade
- [ ] worker-1 upgraded
- [ ] worker-2 upgraded
...

Summary
Kubernetes cluster upgrades require careful planning and execution. Always back up etcd, follow the control-plane-first approach, upgrade one node at a time, validate after each step, and keep a tested rollback plan. Regular upgrades keep your cluster secure and supported.
Go Further with Kubernetes Recipes
Love this recipe? There's so much more! This is just one of 100+ hands-on recipes in our comprehensive Kubernetes Recipes book.
Inside the book, you'll master:
- ✓ Production-ready deployment strategies
- ✓ Advanced networking and security patterns
- ✓ Observability, monitoring, and troubleshooting
- ✓ Real-world best practices from industry experts
"The practical, recipe-based approach made complex Kubernetes concepts finally click for me."
Get Your Copy Now → Start building production-grade Kubernetes skills today!