Kubernetes OOMKilled Troubleshooting and Prevention
Debug and prevent OOMKilled container terminations in Kubernetes. Understand memory limits, diagnose memory leaks, configure resource requests, and implement
💡 Quick Answer:
OOMKilled (exit code 137) means the container exceeded its memory limit and the kernel OOM killer terminated it. Fix by: 1) increasing resources.limits.memory, 2) fixing memory leaks in your application, 3) using VPA to auto-right-size, or 4) reducing the memory footprint (heap size, cache limits). Check current usage with kubectl top pod before adjusting.
The Problem
- Container keeps restarting with OOMKilled reason (exit code 137)
- Application works locally but OOMs in Kubernetes
- Memory limit set too low or application has a memory leak
- Node-level OOM (no container limit) kills random pods
- Java/Python applications consume more memory than expected due to runtime overhead
The Solution
Diagnose OOMKilled
# Check pod status
kubectl get pod <pod-name> -n <namespace>
# NAME READY STATUS RESTARTS AGE
# my-app 0/1 OOMKilled 5 10m
# Get detailed termination info
kubectl describe pod <pod-name> -n <namespace>
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Check previous container's last state
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# {"exitCode":137,"reason":"OOMKilled","startedAt":"...","finishedAt":"..."}
# Check current memory usage (before OOM)
kubectl top pod <pod-name> -n <namespace>
# NAME CPU(cores) MEMORY(bytes)
# my-app 50m 245Mi
# Check memory limit
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# 256Mi: with usage at 245Mi, the container is close to being OOMKilled
Fix: Increase Memory Limit
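The comparison between the kubectl top pod reading and the configured limit can be made explicit as a ratio. The following is a minimal sketch, not part of any Kubernetes client: the helper names parse_memory and usage_ratio are hypothetical, and it only handles the common binary and decimal suffixes rather than the full Kubernetes quantity grammar.

```python
import re

# Suffix multipliers for the common Kubernetes memory quantity forms.
# Assumption: exponent notation and the P/E suffixes are out of scope here.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '256Mi' or '1Gi' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(float(match.group(1)) * _SUFFIXES[match.group(2)])


def usage_ratio(usage: str, limit: str) -> float:
    """Fraction of the memory limit currently in use."""
    return parse_memory(usage) / parse_memory(limit)


# Values taken from the kubectl output shown in the diagnosis steps.
ratio = usage_ratio("245Mi", "256Mi")
print(f"{ratio:.0%} of limit")  # prints "96% of limit"
```

A ratio this close to 1.0 leaves almost no headroom, which is the situation the larger limit in this section is meant to relieve.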
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: registry.example.com/app:v1
          resources:
            requests:
              memory: "256Mi"   # Scheduler guarantee
              cpu: "100m"
            limits:
              memory: "512Mi"   # Hard cap: OOMKilled if exceeded
              cpu: "500m"       # CPU is throttled, not killed
Fix: Java Application Memory
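For a JVM, the container limit has to cover the heap, metaspace, and native overhead together. A rough sketch of that arithmetic follows; the function names and the 128 MiB default overhead are illustrative assumptions, not values any JVM or Kubernetes component prescribes.

```python
def jvm_container_limit_mib(max_heap_mib: int,
                            max_metaspace_mib: int,
                            native_overhead_mib: int = 128) -> int:
    """Smallest container limit (MiB) covering heap + metaspace + native overhead.

    The 128 MiB default overhead is an assumption for illustration;
    measure the real figure for your application.
    """
    return max_heap_mib + max_metaspace_mib + native_overhead_mib


def max_heap_for_limit_mib(limit_mib: int, heap_fraction: float = 0.75) -> int:
    """Heap size (MiB) when -Xmx is set to a fraction of the container limit."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    return int(limit_mib * heap_fraction)


# -Xmx384m and -XX:MaxMetaspaceSize=128m, the flags used in this section
print(jvm_container_limit_mib(384, 128))   # prints 640
# Working backwards from a 640 MiB limit with 75% given to the heap
print(max_heap_for_limit_mib(640))         # prints 480
```

The two functions encode separate rules of thumb, so they will not always agree. The additive rule is the more conservative of the two when metaspace is capped explicitly.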
# Java apps need JVM heap + metaspace + native memory
# Rule of thumb: container limit = 1.5-2x max heap
containers:
  - name: java-app
    image: registry.example.com/java-app:v1
    env:
      # JAVA_OPTS only takes effect if the image's entrypoint passes it to the JVM
      - name: JAVA_OPTS
        value: "-Xms256m -Xmx384m -XX:MaxMetaspaceSize=128m"
    # Container limit should be >= Xmx + Metaspace + ~100MB overhead
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "640Mi"   # 384 heap + 128 metaspace + 128 overhead
Fix: Python Application Memory
# Python: watch for pandas/numpy large datasets, model loading
containers:
  - name: python-app
    env:
      # Tune how Python allocates and releases memory
      - name: PYTHONMALLOC
        value: "malloc"   # Use system malloc instead of pymalloc (more predictable)
      - name: MALLOC_TRIM_THRESHOLD_
        value: "65536"    # glibc setting: release memory back to the OS sooner
    resources:
      limits:
        memory: "1Gi"
Use VPA for Auto-Sizing
# Vertical Pod Autoscaler recommends memory requests from observed usage;
# limits are scaled proportionally
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"   # Auto | Off | Initial
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          memory: "128Mi"
        maxAllowed:
          memory: "4Gi"
Monitor Memory Usage Over Time
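# Prometheus query: containers above 80% of their memory limit
# (the "> 0" filter drops containers with no limit, whose limit metric is 0)
container_memory_working_set_bytes{container!=""}
/
(container_spec_memory_limit_bytes{container!=""} > 0)
> 0.8
# Containers that restarted in the last hour and whose last termination was OOMKilled
# (the last_terminated_reason metric is a 0/1 gauge, so increase() on it is not meaningful)
(increase(kube_pod_container_status_restarts_total[1h]) > 0)
and on (namespace, pod, container)
(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
# Memory usage trend in bytes per second (deriv() rather than rate(), because this metric is a gauge)
deriv(container_memory_working_set_bytes{pod="my-app-xxx", container!=""}[30m])
Common Issues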
OOMKilled immediately on start (exit code 137, 0 restarts then dies)
- Cause: Application startup memory exceeds limit (model loading, large init)
- Fix: Increase limit; or reduce startup memory (lazy loading, streaming)
OOMKilled after running for hours
- Cause: Memory leak; usage climbs gradually until it hits the limit
- Fix: Profile application memory and fix the leak; as a stopgap, add a periodic restart with livenessProbe
Pod killed but no OOMKilled reason shown
- Cause: Node-level OOM; the kernel killed the pod's processes (no container limit set)
- Fix: Always set memory limits; check dmesg on the node for OOM messages
Container shows 137 exit code but reason is blank
- Cause: Pod was evicted by the kubelet (node memory pressure)
- Fix: Set requests properly so the scheduler places the pod on nodes with capacity
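For the slow-leak case above, a linear fit over recent working-set samples gives a rough estimate of how long a container has before it reaches its limit. This is a sketch under stated assumptions: growth is treated as linear, the samples are (timestamp in seconds, bytes) pairs you have already exported from your monitoring system, and both function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, working set in bytes)


def growth_bytes_per_second(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory usage over time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    return cov / var_t


def seconds_until_limit(samples: Sequence[Sample],
                        limit_bytes: float) -> Optional[float]:
    """Estimated seconds until the limit is reached, or None if usage is not growing."""
    slope = growth_bytes_per_second(samples)
    if slope <= 0:
        return None
    latest_usage = samples[-1][1]
    return max(0.0, (limit_bytes - latest_usage) / slope)


MIB = 2**20
# Made-up samples: usage grows 10 MiB every 10 minutes
samples = [(0, 200 * MIB), (600, 210 * MIB), (1200, 220 * MIB), (1800, 230 * MIB)]
eta = seconds_until_limit(samples, 512 * MIB)
print(f"{eta / 3600:.1f} hours until limit")  # prints "4.7 hours until limit"
```

Garbage-collected runtimes produce a sawtooth pattern rather than a straight line, so the estimate is only meaningful over a window that spans several collection cycles.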
Best Practices
- Always set memory limits: prevents runaway containers from killing other pods
- Set requests = typical usage and limits = peak usage (1.5-2x requests)
- Use VPA in recommendation mode: observe before auto-adjusting
- Monitor the memory/limit ratio: alert when it stays above 80%
- JVM: set -Xmx to 70-80% of the container limit and leave room for native memory
- Profile before guessing: use kubectl top, Prometheus, or profilers
- Consider QoS class: Guaranteed (requests=limits) pods are last to be evicted
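The QoS class mentioned in the last practice follows from how requests and limits are set on every container in the pod. The sketch below approximates those rules for illustration; the qos_class function is hypothetical, and it compares quantity strings literally, so "512Mi" and "0.5Gi" would be treated as different even though Kubernetes considers them equal.

```python
from typing import Dict, List

# One container's resources, e.g. {"requests": {"memory": "256Mi"}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Approximate the pod QoS class from its containers' requests and limits."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"  # nothing set on any container
    for c in containers:
        limits = c.get("limits") or {}
        # When only a limit is set, Kubernetes defaults the request to the limit.
        requests = {**limits, **(c.get("requests") or {})}
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"  # every container has cpu and memory with requests == limits


# The Deployment from the memory-limit fix: requests are lower than limits
burstable = [{"requests": {"memory": "256Mi", "cpu": "100m"},
              "limits": {"memory": "512Mi", "cpu": "500m"}}]
print(qos_class(burstable))  # prints "Burstable"

guaranteed = [{"requests": {"memory": "512Mi", "cpu": "500m"},
               "limits": {"memory": "512Mi", "cpu": "500m"}}]
print(qos_class(guaranteed))  # prints "Guaranteed"
```

Checking the class this way before deploying shows whether a pod would actually land in the Guaranteed tier or only appear to.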
Key Takeaways
- OOMKilled = container exceeded resources.limits.memory (exit code 137)
- CPU limits throttle; memory limits kill. This is the critical difference
- Java: container limit ≥ Xmx + Metaspace + ~100-200MB native overhead
- kubectl top pod shows current usage; compare against the limit to predict OOM
- VPA auto-recommends correct memory limits based on actual usage history
- Node-level OOM evicts pods by QoS: BestEffort first, then Burstable, then Guaranteed
- Always set both requests (scheduling) and limits (hard cap)
