Troubleshooting · Intermediate · ⏱ 15 minutes · K8s 1.28+

How to Debug OOMKilled Pods

Troubleshoot Kubernetes pods killed due to Out of Memory (OOM). Learn to identify memory leaks, set proper limits, and prevent OOMKilled errors.

By Luca Berton

OOMKilled occurs when a container exceeds its memory limit or the node runs out of memory. Learn to diagnose, fix, and prevent OOM issues in Kubernetes.

Identifying OOMKilled Pods

# Find OOMKilled pods
kubectl get pods --all-namespaces -o json | jq -r '
  .items[] |
  select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") |
  [.metadata.namespace, .metadata.name] | @tsv'

# Check specific pod status
kubectl describe pod myapp-pod

# Look for this in output:
#   Last State:     Terminated
#     Reason:       OOMKilled
#     Exit Code:    137
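
If you already know which pod is affected, a jsonpath query surfaces the same fields without jq (a sketch; the pod name is illustrative):

# Per-container last termination reason and exit code
kubectl get pod myapp-pod -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'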

Check Current Memory Usage

# Pod memory usage
kubectl top pod myapp-pod

# Container-level usage
kubectl top pod myapp-pod --containers

# Node memory usage
kubectl top nodes

# Detailed node memory
kubectl describe node <node-name> | grep -A5 "Allocated resources"
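
The pod's QoS class also matters here: under node memory pressure, BestEffort pods are killed first and Guaranteed pods last. A quick check (pod name is illustrative):

# Show the QoS class derived from the requests/limits configuration
kubectl get pod myapp-pod -o jsonpath='{.status.qosClass}{"\n"}'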

Analyze Memory with Debug Container

# Attach debug container (Kubernetes 1.25+)
kubectl debug myapp-pod -it --image=busybox --target=myapp

# Inside debug container, check memory
cat /proc/meminfo   # note: /proc/meminfo shows node-wide memory, not the container's
cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # cgroups v1
cat /sys/fs/cgroup/memory/memory.limit_in_bytes

# For cgroups v2
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max
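
Because --target shares the application container's process namespace (where the runtime supports it), per-process memory is also visible through /proc. A sketch, assuming the app's main process shows up as PID 1 (verify with ps first):

# List the target's processes, then check resident memory of the main one
ps
grep VmRSS /proc/1/status   # adjust the PID to match the ps output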

Check Container Memory Limits

# deployment.yaml - Proper memory configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
        - name: myapp
          image: myapp:v1
          resources:
            requests:
              memory: "256Mi"  # Scheduling guarantee
            limits:
              memory: "512Mi"  # Hard limit (OOMKilled if exceeded)

Memory Debugging for Java Applications

# java-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: java-app:v1
          resources:
            requests:
              memory: "1Gi"
            limits:
              memory: "2Gi"
          env:
            # Let JVM respect container limits; note JAVA_OPTS is only applied
            # if the image's start script passes it to the JVM
            - name: JAVA_OPTS
              value: >-
                -XX:+UseContainerSupport
                -XX:MaxRAMPercentage=75.0
                -XX:InitialRAMPercentage=50.0
                -XX:+HeapDumpOnOutOfMemoryError
                -XX:HeapDumpPath=/tmp/heapdump.hprof
            # Or explicit sizing; JAVA_TOOL_OPTIONS is read by the JVM automatically
            # (keep -Xmx comfortably below the container limit)
            - name: JAVA_TOOL_OPTIONS
              value: "-Xmx1536m -Xms512m"
          volumeMounts:
            - name: heap-dumps
              mountPath: /tmp
      volumes:
        - name: heap-dumps
          emptyDir:
            sizeLimit: 2Gi
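
When the JVM writes a heap dump on OOM, copy it off the pod for analysis in a heap analyzer such as Eclipse MAT. The emptyDir volume survives container restarts (though not pod deletion), so the dump is still there after the OOMKill. A sketch, with a placeholder pod name:

# Copy the heap dump locally for offline analysis
kubectl cp <java-app-pod>:/tmp/heapdump.hprof ./heapdump.hprof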

Memory Debugging for Node.js

# nodejs-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodejs-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: nodejs-app:v1
          resources:
            limits:
              memory: "512Mi"
          env:
            # Set the V8 old-space heap limit (in MB) below the container limit
            - name: NODE_OPTIONS
              value: "--max-old-space-size=384"
          # Alternatively, pass the flags on the command line
          # (the NODE_OPTIONS entry above is then redundant)
          command:
            - node
            - --expose-gc
            - --max-old-space-size=384
            - app.js
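
To confirm the flag actually took effect, you can ask V8 for its configured heap limit inside the running container. A sketch, assuming the Deployment above:

# Print V8's effective heap limit in MB (should be roughly 384 here)
kubectl exec deploy/nodejs-app -- node -p "require('v8').getHeapStatistics().heap_size_limit / 1024 / 1024"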

Memory Debugging for Python

# python-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: python-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: python-app:v1
          resources:
            limits:
              memory: "512Mi"
          env:
            # Enable memory profiling
            - name: PYTHONTRACEMALLOC
              value: "1"
            # Reduce memory fragmentation
            - name: MALLOC_TRIM_THRESHOLD_
              value: "65536"

Set Up Memory Monitoring

# memory-monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-alerts
spec:
  groups:
    - name: memory
      rules:
        # Alert before OOM; the extra clause skips containers that report no
        # memory limit (their container_spec_memory_limit_bytes is 0)
        - alert: ContainerMemoryHigh
          expr: |
            (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
              and container_spec_memory_limit_bytes > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} using >90% memory"
            description: "{{ $labels.pod }} is at {{ $value | humanizePercentage }} of limit"
        
        # Track OOMKilled events
        - alert: ContainerOOMKilled
          expr: |
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          labels:
            severity: critical
          annotations:
            summary: "Container {{ $labels.container }} was OOMKilled"

Vertical Pod Autoscaler for Right-Sizing

# vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"  # Just recommend, don't auto-update
  resourcePolicy:
    containerPolicies:
      - containerName: myapp
        minAllowed:
          memory: "128Mi"
        maxAllowed:
          memory: "4Gi"

Check recommendations:

kubectl describe vpa myapp-vpa
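
Once the recommender has enough data, the raw numbers are also available on the VPA status (a sketch, assuming the VPA above):

# Recommended (target) resources per container
kubectl get vpa myapp-vpa -o jsonpath='{.status.recommendation.containerRecommendations[*].target}{"\n"}'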

Investigate with a Dedicated Debug Pod

# debug-pod.yaml - Full debugging capabilities
apiVersion: v1
kind: Pod
metadata:
  name: memory-debug
spec:
  containers:
    - name: debug
      image: alpine:latest
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["SYS_PTRACE"]
      resources:
        limits:
          memory: "256Mi"

# Run memory profiling tools
kubectl exec -it memory-debug -- sh

# Install tools
apk add --no-cache procps htop

# Monitor memory
watch -n 1 'cat /proc/meminfo | grep -E "MemTotal|MemFree|Buffers|Cached"'

Node-Level OOM Investigation

# Check node events for OOM
kubectl get events --field-selector reason=OOMKilling -A

# Check kubelet logs
journalctl -u kubelet | grep -i oom

# Check kernel OOM killer logs
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"

# Check node memory pressure
kubectl describe node <node> | grep -A5 Conditions
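
If you don't have SSH access to the node, kubectl debug can start a pod on it with the node's root filesystem mounted at /host. A sketch (reading the kernel log may require elevated privileges, depending on the node's configuration):

# Start a debugging pod on the node
kubectl debug node/<node-name> -it --image=busybox

# Inside the pod, use chroot to run the host's own tools (systemd-based nodes)
chroot /host journalctl -u kubelet | grep -i oom
chroot /host dmesg | grep -i "killed process"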

Memory Limit Best Practices

# Recommended configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
        - name: myapp
          resources:
            requests:
              # Set to expected average usage
              memory: "256Mi"
            limits:
              # Set 1.5-2x requests for burst headroom
              # But not too high to prevent node-level OOM
              memory: "512Mi"

Quick Fixes

# Increase memory limit temporarily
kubectl set resources deployment myapp \
  --limits=memory=1Gi --requests=memory=512Mi

# Scale down to reduce node pressure
kubectl scale deployment myapp --replicas=2

# Restart pod to clear memory
kubectl rollout restart deployment myapp

Common Causes and Solutions

Cause                      Solution
Memory leak                Profile app, fix leak, implement GC
Limit too low              Increase limit based on profiling
JVM heap misconfigured     Use -XX:MaxRAMPercentage
Large file processing      Stream instead of loading fully
Unbounded caches           Add size limits, use LRU eviction
Node memory exhaustion     Add nodes, use resource quotas (see the sketch below)
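
For the node-exhaustion case, a namespace ResourceQuota caps how much memory all pods in the namespace can claim in total. A minimal sketch; the values are illustrative:

# resourcequota.yaml - cap total memory per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota
spec:
  hard:
    requests.memory: 8Gi
    limits.memory: 16Gi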

Summary

OOMKilled errors indicate that a container exceeded its memory limit or that the node itself ran out of memory. Debug by checking container and node memory usage, analyzing application memory patterns, setting appropriate limits with headroom, and using VPA for recommendations. For production, implement memory alerting to catch issues before they cause OOMKilled events.


📘 Go Further with Kubernetes Recipes

Love this recipe? There’s so much more! This is just one of 100+ hands-on recipes in our comprehensive Kubernetes Recipes book.

Inside the book, you’ll master:

  • ✅ Production-ready deployment strategies
  • ✅ Advanced networking and security patterns
  • ✅ Observability, monitoring, and troubleshooting
  • ✅ Real-world best practices from industry experts

“The practical, recipe-based approach made complex Kubernetes concepts finally click for me.”

👉 Get Your Copy Now — Start building production-grade Kubernetes skills today!

#oom #memory #troubleshooting #debugging #resources
