Kubernetes Rolling Update Zero Downtime Guide
Configure Kubernetes rolling updates for zero-downtime deployments. maxSurge, maxUnavailable, readiness probes, preStop hooks, and graceful shutdown strategies.
π‘ Quick Answer: Zero-downtime rolling updates require: (1) readiness probe to delay traffic until ready, (2) `preStop` lifecycle hook to drain connections before termination, (3) proper `maxSurge`/`maxUnavailable` settings, and (4) `terminationGracePeriodSeconds` long enough for in-flight requests to complete.
The Problem
A basic rolling update creates new pods and kills old ones. But without proper configuration, users experience errors during the transition β new pods receive traffic before theyβre ready, old pods are killed while handling requests, or all pods update simultaneously causing a brief outage.
flowchart TB
subgraph BROKEN["β Broken Rolling Update"]
B1["New pod gets traffic<br/>before app is ready"] --> ERR1["502 errors"]
B2["Old pod killed mid-request"] --> ERR2["Connection reset"]
B3["All pods updating at once"] --> ERR3["Downtime"]
end
subgraph FIXED["β
Zero-Downtime Update"]
F1["Readiness probe gates traffic<br/>β New pod ready first"]
F2["preStop hook drains connections<br/>β Old pod finishes requests"]
F3["maxUnavailable: 0<br/>β Always has capacity"]
endThe Solution
Complete Zero-Downtime Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Create 1 extra pod during update
maxUnavailable: 0 # Never remove a pod until new one is ready
template:
spec:
terminationGracePeriodSeconds: 60 # Allow 60s for graceful shutdown
containers:
- name: app
image: myapp:v2.0
ports:
- containerPort: 8080
# 1. READINESS PROBE: Don't send traffic until ready
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
# 2. LIVENESS PROBE: Restart if stuck
livenessProbe:
httpGet:
path: /livez
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 3
# 3. PRESTOP HOOK: Drain connections before termination
lifecycle:
preStop:
exec:
command: ["sh", "-c", "sleep 10"]
# Why: Kubernetes removes pod from endpoints,
# but iptables/IPVS rules take time to propagate.
# sleep gives kube-proxy time to update.Rolling Update Parameters
| Parameter | Default | Recommended | Effect |
|---|---|---|---|
| `maxSurge` | 25% | `1` or `25%` | Extra pods created during update |
| `maxUnavailable` | 25% | `0` | Pods removed before new ones ready |
# Conservative (safest β always at full capacity)
maxSurge: 1
maxUnavailable: 0
# Process: create 1 new β wait until ready β terminate 1 old β repeat
# Balanced (faster, slight capacity reduction)
maxSurge: 1
maxUnavailable: 1
# Process: create 1 new + terminate 1 old simultaneously
# Fast (for non-critical workloads)
maxSurge: "50%"
maxUnavailable: "50%"
# Process: replace half the fleet at onceThe preStop Hook Explained
Timeline of pod termination:
t=0: Pod marked for termination
t=0: preStop hook starts (sleep 10)
t=0: Pod removed from Service endpoints (async)
t=1-5: kube-proxy updates iptables rules on all nodes
t=10: preStop completes β SIGTERM sent to app
t=10-50: App handles SIGTERM, finishes in-flight requests
t=60: terminationGracePeriodSeconds reached β SIGKILL
The sleep(10) in preStop ensures iptables are updated
BEFORE the app starts shutting down.Application-Side Graceful Shutdown
# Python Flask example
import signal
import sys
def graceful_shutdown(signum, frame):
print("SIGTERM received, finishing in-flight requests...")
# Stop accepting new connections
server.shutdown()
# Wait for current requests to complete
sys.exit(0)
signal.signal(signal.SIGTERM, graceful_shutdown)// Go example
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
defer stop()
srv := &http.Server{Addr: ":8080"}
go srv.ListenAndServe()
<-ctx.Done()
// Graceful shutdown with 30s timeout
shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
srv.Shutdown(shutdownCtx)Monitor Rolling Update Progress
# Watch rollout status
kubectl rollout status deployment/web-app
# Waiting for deployment "web-app" rollout to finish: 1 of 3 updated replicas are available...
# deployment "web-app" successfully rolled out
# Watch pods during update
kubectl get pods -w -l app=web-app
# NAME READY STATUS AGE
# web-app-v1-abc12 1/1 Running 1h
# web-app-v1-def34 1/1 Running 1h
# web-app-v1-ghi56 1/1 Running 1h
# web-app-v2-jkl78 0/1 ContainerCreating 0s
# web-app-v2-jkl78 1/1 Running 5s β ready
# web-app-v1-abc12 1/1 Terminating 1h β old pod draining
# Rollback if needed
kubectl rollout undo deployment/web-appCommon Issues
| Issue | Cause | Fix |
|---|---|---|
| 502 during update | No readiness probe β traffic sent to unready pods | Add readiness probe |
| Connection reset on update | No preStop hook β pod killed mid-request | Add `preStop: sleep 10` |
| All pods down briefly | `maxUnavailable: 100%` | Set `maxUnavailable: 0` |
| Update takes too long | `maxSurge: 1` with many replicas | Increase maxSurge to 25-50% |
| Rollout stuck | Readiness probe failing on new version | Check new image, rollback with `kubectl rollout undo` |
| SIGKILL before cleanup done | `terminationGracePeriodSeconds` too short | Increase to cover max request duration |
Best Practices
- Always set `maxUnavailable: 0` for user-facing services β guarantees full capacity
- Always add readiness probe β the single most important zero-downtime setting
- Always add `preStop: sleep 10` β gives iptables time to propagate
- Set `terminationGracePeriodSeconds` to longest expected request + preStop duration
- Test with load testing β run `hey` or `k6` during updates to verify zero errors
- Use `kubectl rollout status` in CI/CD β waits for successful rollout
Key Takeaways
- Zero-downtime = readiness probe + preStop hook + maxUnavailable: 0
- `preStop: sleep 10` is essential β bridges the gap between endpoint removal and actual termination
- `maxSurge` controls speed, `maxUnavailable` controls safety
- Application must handle SIGTERM for graceful shutdown
- `terminationGracePeriodSeconds` must exceed preStop + app shutdown time
- Always verify with load testing during updates β theory isnβt enough

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
