Kubernetes Startup Probes for Slow Containers
Configure Kubernetes startup probes for containers with long initialization. Separate startup checks from liveness checks and tune `failureThreshold` to match your startup time.
💡 Quick Answer: Startup probes protect slow-starting containers from being killed by liveness probes. The liveness and readiness probes are disabled until the startup probe succeeds. Set `failureThreshold × periodSeconds` to cover your worst-case startup time (e.g., `failureThreshold: 30, periodSeconds: 10` = 5-minute startup window).
The Problem
Java applications, ML model loading, or database containers can take 60-300+ seconds to start. If your liveness probe's `initialDelaySeconds` is only 30 seconds, the kubelet kills the container mid-startup and the pod ends up in CrashLoopBackOff. Raising `initialDelaySeconds` to cover the worst case stops the restarts, but it also leaves the container unchecked for that whole window after every start, even when the app comes up quickly. Startup probes solve this by separating startup detection from runtime health checking.
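To make the trade-off concrete, here is a minimal sketch of the workaround described above, with the liveness delay stretched to cover a slow start. The 300-second value and the `/healthz` path are illustrative assumptions, not settings taken from a real workload.

```yaml
# Anti-pattern sketch: covering slow startup with a large initialDelaySeconds.
# The values below are assumptions chosen for illustration.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 300   # no liveness checks for the first 5 minutes after each container start
  periodSeconds: 10
  failureThreshold: 3
```

With this configuration, an app that is up after 60 seconds and hangs at 90 seconds would not be probed until the 300-second mark, which is the gap a startup probe is meant to close.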
```mermaid
flowchart TB
    subgraph WITHOUT["Without Startup Probe"]
        START1["Container starts<br/>(takes 120s)"] -->|"30s"| LIVE1["Liveness probe fires"]
        LIVE1 -->|"App not ready yet"| KILL["❌ Killed!<br/>CrashLoopBackOff"]
    end
    subgraph WITH["With Startup Probe"]
        START2["Container starts<br/>(takes 120s)"] -->|"Startup probe<br/>checks every 10s"| WAIT["Waiting..."]
        WAIT -->|"120s: App ready"| PASS["✅ Startup probe passes"]
        PASS --> LIVE2["Liveness + readiness<br/>probes activate"]
    end
```

The Solution
Basic Startup Probe
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: java-app
spec:
  containers:
    - name: app
      image: spring-boot-app:v1.0
      ports:
        - containerPort: 8080
      # Startup probe: protects during slow startup
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30   # 30 failures × 10s = 300s max startup
        periodSeconds: 10
      # Liveness probe: only runs AFTER startup probe succeeds
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      # Readiness probe: only runs AFTER startup probe succeeds
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
```

How the Timing Works
```
Timeline:
  t=0:       Container starts
  t=0-300s:  Startup probe checks every 10s (up to 30 failures allowed)
  t=120s:    App responds to /healthz → startup probe SUCCEEDS
  t=120s+:   Liveness probe starts (every 10s)
  t=120s+:   Readiness probe starts (every 5s)

If the app never starts within 300s → container killed and restarted per restartPolicy (startup probe failed)
If the app crashes after startup → liveness probe catches it within about 30s (3 failures × 10s)
```

Java Spring Boot Example
```yaml
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30
  periodSeconds: 10      # 30 × 10 = 300s for Spring Boot startup
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3    # Kill after 30s of failures
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```

ML Model Loading Example
```yaml
# NIM or HuggingFace model that takes 2-10 minutes to load
startupProbe:
  httpGet:
    path: /v1/health/ready
    port: 8000
  failureThreshold: 60
  periodSeconds: 10      # 60 × 10 = 600s (10 min) for model loading
livenessProbe:
  httpGet:
    path: /v1/health/live
    port: 8000
  periodSeconds: 15
  failureThreshold: 3
```

TCP Startup Probe
For services that accept connections before HTTP endpoints are ready:
```yaml
startupProbe:
  tcpSocket:
    port: 5432           # PostgreSQL port
  failureThreshold: 30
  periodSeconds: 5       # 30 × 5 = 150s for database startup
```

Exec Startup Probe
For custom readiness checks:
```yaml
startupProbe:
  exec:
    command:
      - sh
      - -c
      - "pg_isready -U postgres -d mydb"
  failureThreshold: 20
  periodSeconds: 5
```

Probe Comparison
| Probe | When It Runs | On Failure | Purpose |
|---|---|---|---|
| Startup | From container start until first success | Container killed and restarted per `restartPolicy` (after `failureThreshold` failures) | Protect slow-starting containers |
| Liveness | After startup succeeds | Container killed and restarted | Detect deadlocked/stuck apps |
| Readiness | After startup succeeds | Remove from Service endpoints | Control traffic routing |
Calculate Your Startup Budget
```
Max startup time = failureThreshold × periodSeconds

Examples:
  30 × 10s = 300s (5 min)   → Java apps, Spring Boot
  60 × 10s = 600s (10 min)  → ML model loading
  12 × 5s  = 60s  (1 min)   → Standard web apps
  90 × 10s = 900s (15 min)  → Large NIM models on slow storage
```

Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Pod killed during startup | Startup budget too short | Increase `failureThreshold` or `periodSeconds` |
| Startup probe never succeeds | App crash, wrong port/path | Check pod logs, verify health endpoint |
| Liveness probe killing healthy pod | No startup probe, `initialDelaySeconds` too short | Add startup probe instead |
| Readiness probe never runs | Startup probe failing | Fix startup probe first |
| False positive startup | Probe endpoint returns 200 before app is truly ready | Use a dedicated startup endpoint that checks all dependencies |
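The last row of the table can be sketched as follows. This is only an illustration: `/startupz` is a hypothetical endpoint that your application would have to implement so that it returns 200 only once initialization (for example migrations or cache warm-up) has finished. The `timeoutSeconds` value is likewise an assumption.

```yaml
# Sketch only: /startupz is a hypothetical endpoint, not a Kubernetes built-in.
startupProbe:
  httpGet:
    path: /startupz      # hypothetical: succeeds only after all init work is done
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 × 10s = 300s startup budget
  timeoutSeconds: 5      # assumption: tolerate slow responses while warming up
livenessProbe:
  httpGet:
    path: /healthz       # kept lightweight: is the process alive?
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

Keeping the dependency checks in the startup endpoint, and out of the liveness endpoint, means a slow dependency delays startup instead of triggering restarts later.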
Best Practices
- Always use startup probes for slow containers: cleaner than large `initialDelaySeconds`
- Set generous startup budgets: 2× your worst observed startup time
- Use different endpoints: `/healthz` for liveness, `/ready` for readiness; the startup probe can share the liveness endpoint
- Don't check dependencies in liveness: liveness should only check if the process is alive
- Monitor startup durations: track P99 startup time to right-size `failureThreshold`
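As a worked example of the sizing rule in the list above, suppose the worst startup you have observed is 140 seconds (an assumed measurement for illustration). Doubling it gives a 280-second budget, which at `periodSeconds: 10` works out to `failureThreshold: 28`.

```yaml
# Assumed measurement: worst observed startup = 140s.
# Target budget = 2 × 140s = 280s.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 28   # 28 × 10s = 280s
```

A shorter `periodSeconds` with a proportionally higher `failureThreshold` gives the same budget and detects a finished startup sooner, at the cost of more frequent probe requests.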
Key Takeaways
- Startup probes disable liveness/readiness probes until the container is ready
- Max startup window = `failureThreshold × periodSeconds`
- Essential for Java, ML models, databases, and any container with >30s startup
- Liveness and readiness probes only activate after startup probe succeeds
- Replaces the anti-pattern of large `initialDelaySeconds` on liveness probes
- Stable (GA) since Kubernetes 1.20; enabled by default as a beta feature since 1.18

