Fix Kubernetes Job Failures and Retries
Debug Kubernetes Jobs stuck in backoff or hitting retry limits. Covers backoffLimit, activeDeadlineSeconds, and CronJob overlap.
💡 Quick Answer: Debug Jobs stuck in backoff, hitting retry limits, or producing wrong completions count. Covers backoffLimit, activeDeadlineSeconds, TTL cleanup, and indexed Jobs.
The Problem
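A Job whose pods keep failing is retried with growing delays until it reaches backoffLimit or activeDeadlineSeconds, at which point Kubernetes marks it failed and stops creating pods. CronJobs add their own failure modes: overlapping runs, skipped schedules, and finished Jobs that pile up.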
The Solution
Step 1: Check Job Status
kubectl describe job my-job | grep -A10 "Pods Statuses\|Events"
# Pods Statuses: 0 Active / 0 Succeeded / 6 Failed
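# Check why pods failed (job/my-job selects one of the Job's pods)
kubectl logs job/my-job
# Logs of the previous container instance; only exists after an in-place restart (restartPolicy: OnFailure)
kubectl logs job/my-job --previous
Step 2: Common Fixes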
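Which fix applies depends on why the Job was marked failed. The Job's Failed condition carries a reason, typically BackoffLimitExceeded or DeadlineExceeded. The following is a minimal sketch for reading it; the helper names job_failure_reason and job_failure_hint are hypothetical, and the commands assume the Job lives in the current namespace.

```shell
# Print the reason of the Job's Failed condition (empty if there is none).
job_failure_reason() {
  kubectl get job "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'
}

# Map the reason to the fix it most likely points at (illustrative wording).
job_failure_hint() {
  case "$(job_failure_reason "$1")" in
    BackoffLimitExceeded) echo "retries exhausted: inspect pod logs, then revisit backoffLimit" ;;
    DeadlineExceeded)     echo "ran past activeDeadlineSeconds: check run time and back-off waits" ;;
    "")                   echo "no Failed condition: Job is still running or succeeded" ;;
    *)                    echo "other failure reason: see kubectl describe job" ;;
  esac
}

# Usage: job_failure_hint my-job
```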
Job hit backoffLimit:
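Retries are not immediate: the Job controller recreates failed pods with an exponential back-off delay (10s, 20s, 40s, ...) capped at six minutes. The sketch below estimates those waits so that backoffLimit and activeDeadlineSeconds can be sized together. The helper names backoff_delay and total_backoff are hypothetical, and the result is a rough estimate that ignores pod run time and scheduling latency.

```shell
# Approximate wait before retry number $1 (0 = first retry), in seconds.
backoff_delay() {
  retry=$1
  delay=10
  i=0
  while [ "$i" -lt "$retry" ] && [ "$delay" -lt 360 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  # The delay is capped at six minutes.
  if [ "$delay" -gt 360 ]; then
    delay=360
  fi
  echo "$delay"
}

# Approximate sum of the waits across the first $1 retries, in seconds.
total_backoff() {
  retries=$1
  total=0
  n=0
  while [ "$n" -lt "$retries" ]; do
    total=$((total + $(backoff_delay "$n")))
    n=$((n + 1))
  done
  echo "$total"
}

# Usage: total_backoff 6
```

Under these assumptions, six retries spend about 630 seconds waiting, so a Job with backoffLimit: 6 and activeDeadlineSeconds: 600 may be stopped by the deadline before it exhausts its retries.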
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  backoffLimit: 6              # Default: 6 retries
  activeDeadlineSeconds: 600   # Kill after 10 minutes total
  template:
    spec:
      restartPolicy: Never     # Never = new pod per retry
                               # OnFailure = restart same pod
CronJob overlap:
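To confirm that runs really overlap, the CronJob's status lists the Jobs it currently considers active; more than one entry means a new run started before the previous one finished. A minimal sketch, with hypothetical helper names cronjob_active_jobs and cronjob_active_count, assuming the CronJob is in the current namespace:

```shell
# Print the names of the Jobs the CronJob currently tracks as active.
cronjob_active_jobs() {
  kubectl get cronjob "$1" -o jsonpath='{.status.active[*].name}'
}

# Count them; a value above 1 indicates overlapping runs.
cronjob_active_count() {
  # Word splitting is intentional: one word per active Job name.
  set -- $(cronjob_active_jobs "$1")
  echo "$#"
}

# Usage: cronjob_active_count my-cronjob
```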
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid        # Skip if previous still running
  startingDeadlineSeconds: 300     # Skip if >5min late
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
Job completed but pods not cleaned up:
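To see which Jobs are affected, one option is to list completed Jobs that have no TTL set. This is a sketch under assumptions: finished_jobs_without_ttl is a hypothetical helper name, it looks only at the current namespace, and it relies on kubectl printing `<none>` for missing fields in custom-columns output. Because status.completionTime is only set on success, failed Jobs are not listed.

```shell
# List Jobs that completed successfully but have no ttlSecondsAfterFinished.
finished_jobs_without_ttl() {
  kubectl get jobs --no-headers \
    -o custom-columns='NAME:.metadata.name,TTL:.spec.ttlSecondsAfterFinished,DONE:.status.completionTime' |
    awk '$2 == "<none>" && $3 != "<none>" { print $1 }'
}

# Usage: finished_jobs_without_ttl
```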
spec:
  ttlSecondsAfterFinished: 3600   # Auto-delete 1h after completion
Job needs to run on specific node:
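Before pinning a Job to nodes, it can help to confirm that at least one node carries the expected label; otherwise the Job's pods stay Pending. A minimal sketch, where has_matching_node and node_taint_keys are hypothetical helper names and the label is passed in key=value form:

```shell
# Succeed if at least one node matches the label selector in $1.
has_matching_node() {
  [ -n "$(kubectl get nodes -l "$1" -o name)" ]
}

# Show the taint keys on the matching nodes, to compare with the tolerations.
node_taint_keys() {
  kubectl get nodes -l "$1" \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Usage: has_matching_node node-type=compute && node_taint_keys node-type=compute
```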
template:
  spec:
    nodeSelector:
      node-type: compute
    tolerations:
    - key: workload
      value: batch
      effect: NoSchedule
Best Practices
- Monitor proactively with Prometheus alerts before issues become incidents
- Document runbooks for your team's most common failure scenarios
- Use kubectl describe and events as your first debugging tool
- Automate recovery where possible with operators or scripts
Key Takeaways
- Always check events and logs first: Kubernetes tells you what's wrong
- Most issues have clear error messages pointing to the root cause
- Prevention through monitoring and proper configuration beats reactive debugging
- Keep this recipe bookmarked for quick reference during incidents
