Troubleshooting advanced ⏱ 15 minutes K8s 1.28+

Thanos Receive OOMKilled CrashLoopBackOff

Debug and fix Thanos Receive StatefulSet OOMKilled CrashLoopBackOff caused by WAL replay exceeding memory limits. Covers ArgoCD conflict resolution, liveness

By Luca Berton • 📖 6 min read

💡 Quick Answer: Thanos Receive OOMKilled during WAL (Write-Ahead Log) replay at startup needs both a liveness probe timeout increase (to survive replay) AND a memory limit increase (to hold all series in memory post-replay). When managed by ArgoCD, you must commit the fix to GitOps; manual patches get reverted.

The Problem

Thanos Receive StatefulSet enters CrashLoopBackOff with these symptoms:

  • Pod starts, begins loading WAL segments (hundreds of segments)
  • Liveness probe fails during replay (default 30s timeout too short)
  • OR: WAL replay completes but steady-state memory exceeds limit → OOMKilled
  • ArgoCD continuously reverts manual patches back to Git state
  • Multi-Attach volume errors when Pod reschedules to different node

The Solution

Diagnose the Root Cause

# Check Pod events
oc describe pod thanos-receive-0 -n monitoring

# Look for:
# - OOMKilled (exit code 137)
# - Liveness probe failed
# - Back-off restarting failed container
# - Multi-Attach error for volume

# Check previous container logs
oc logs thanos-receive-0 -n monitoring --previous | tail -50

# Check current memory usage (if Pod is running)
oc exec thanos-receive-0 -n monitoring -- \
  cat /sys/fs/cgroup/memory.current
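The same information can be pulled without scanning the full describe output. The following is a minimal sketch, assuming the Receive container is the first container in the Pod; the helper names `last_termination` and `exit_signal` are hypothetical and not part of any tool.

```shell
# Print the reason and exit code of the previous container termination.
# Assumption: the Receive container is containers[0] in the Pod.
last_termination() {
  pod="$1"; ns="$2"
  oc get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Exit codes above 128 mean "terminated by signal (code - 128)".
# 137 therefore maps to signal 9 (SIGKILL), which is what the OOM killer sends.
exit_signal() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0
  fi
}
```

Usage: `last_termination thanos-receive-0 monitoring` prints the reason and code on one line, and `exit_signal` turns the code into a signal number. A SIGKILL alone does not prove an OOM kill, so check that the reason field says OOMKilled.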

Identify WAL Replay Duration

# Count WAL segments
oc exec thanos-receive-0 -n monitoring -- \
  ls /var/thanos/receive/wal/ | wc -l

# Example output: 347 segments
# Each segment is at most 128MB; replay rebuilds the in-memory head from all of them
# Rough estimate: ~3KB/series × 1M series ≈ 3GB, so plan for 2-4Gi RAM

Logs During WAL Replay

level=info component=receive component=multi-tsdb tenant=default-tenant
  caller=head.go:825 msg="WAL segment loaded" segment=341
level=info component=receive component=multi-tsdb tenant=default-tenant
  caller=head.go:825 msg="WAL segment loaded" segment=342
...
# Hundreds of these lines before the Pod is ready
# If liveness kills it before completion → infinite CrashLoop
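To see how far replay got before the container was killed, the segment number can be extracted from these log lines and compared with the segment count from the previous step. This is a sketch that assumes the log format shown above; `last_loaded_segment` is a hypothetical helper name.

```shell
# Read Thanos Receive logs on stdin and print the highest WAL segment
# number reported as loaded. Prints nothing if no such line is present.
# Assumption: lines contain msg="WAL segment loaded" followed by segment=<n>.
last_loaded_segment() {
  sed -n 's/.*msg="WAL segment loaded".* segment=\([0-9][0-9]*\).*/\1/p' \
    | sort -n \
    | tail -n 1
}
```

Usage: `oc logs thanos-receive-0 -n monitoring --previous | last_loaded_segment`. If the number stops well short of the segment count, the container was most likely killed during replay rather than after it.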

Fix 1: Liveness Probe Timeout (Survive Startup)

# StatefulSet spec.template.spec.containers[0]
livenessProbe:
  httpGet:
    path: /-/healthy
    port: http
    scheme: HTTP
  initialDelaySeconds: 120    # Give 2 min before first check
  timeoutSeconds: 30          # Individual probe timeout
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 30        # 30 failures × 10s = 5 min grace

Or use a startupProbe (preferred for slow-starting containers):

startupProbe:
  httpGet:
    path: /-/healthy
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 60        # 60 × 10s = 10 min for WAL replay
livenessProbe:
  httpGet:
    path: /-/healthy
    port: http
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 6
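The grace windows in the comments above follow from simple arithmetic. As a rough sketch (it ignores timeoutSeconds and kubelet scheduling jitter, and `probe_budget_seconds` is a hypothetical helper name), the time a container has before the kubelet gives up can be estimated like this:

```shell
# Approximate seconds until the kubelet acts on a failing probe:
#   initialDelaySeconds + failureThreshold * periodSeconds
# This is an estimate; it does not account for timeoutSeconds.
probe_budget_seconds() {
  initial_delay="$1"; period="$2"; failure_threshold="$3"
  echo $((initial_delay + failure_threshold * period))
}
```

With the liveness settings from Fix 1, `probe_budget_seconds 120 10 30` gives 420 seconds (2 min delay plus 5 min grace). With the startupProbe variant, `probe_budget_seconds 30 10 60` gives 630 seconds. Compare the result against the replay duration you observe in the logs.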

Fix 2: Memory Limit (Survive Post-Replay)

containers:
  - name: receive
    resources:
      limits:
        cpu: 800m
        memory: 4Gi      # ← Increase from 1Gi
      requests:
        cpu: 500m
        memory: 2Gi      # ← Increase from 1Gi

Sizing formula:

Required memory ≈
  WAL size on disk × 2-3 (decompressed in-memory)
  + active series × 2KB per series
  + receive buffer (incoming writes)
  + Go runtime overhead (~200MB)

Example:
  WAL: 347 segments ≈ 4GB on disk
  Active series: ~500K × 2KB = 1GB
  Buffer: 500MB
  Runtime: 200MB
  Total: ~3-4Gi minimum → set limit to 4Gi
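The sizing formula can be turned into a small calculator so you can plug in your own measurements. This is a sketch only: the multiplier, the 2KB per series cost and the 200MB runtime figure are the rules of thumb from the formula above, the inputs in the usage line are hypothetical, and `estimate_memory_mib` is a made-up helper name.

```shell
# Estimate required memory in MiB from the sizing formula.
# Arguments (all supplied by you, from your own measurements):
#   $1 WAL size on disk in MiB
#   $2 in-memory multiplier for the WAL (2 or 3)
#   $3 number of active series
#   $4 receive buffer in MiB
estimate_memory_mib() {
  wal_mib="$1"
  wal_factor="$2"
  active_series="$3"
  buffer_mib="$4"
  runtime_mib=200                             # Go runtime overhead (rule of thumb)
  series_mib=$((active_series * 2 / 1024))    # 2KB per series, converted to MiB
  echo $((wal_mib * wal_factor + series_mib + buffer_mib + runtime_mib))
}
```

For a hypothetical 1024 MiB WAL, 500000 active series and a 500 MiB buffer, `estimate_memory_mib 1024 2 500000 500` prints 3724 and the same call with a multiplier of 3 prints 4748. Treat the result as a starting point and confirm it against the memory usage you actually observe after replay.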

Fix 3: Commit to GitOps (Permanent)

# In your Helm values (GitOps repo):
thanos:
  receive:
    tolerations: *tolerations
    resources:
      limits:
        cpu: 800m
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi
    livenessProbe:
      initialDelaySeconds: 120
      failureThreshold: 30
# Commit and push
git add values.yaml
git commit -m "fix: increase thanos-receive memory to 4Gi for WAL replay"
git push origin main

# ArgoCD will auto-sync (if enabled)
# Or manually sync:
argocd app sync monitoring --resource apps:StatefulSet:thanos-receive

Handle ArgoCD Conflict

If ArgoCD keeps reverting your manual patches:

# Option A: Pause auto-sync temporarily
oc patch application monitoring -n argocd --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":null}}}'

# Apply manual fix
oc patch sts thanos-receive -n monitoring --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"4Gi"},
  {"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"2Gi"}
]'

# Delete Pod to pick up new spec
oc delete pod thanos-receive-0 -n monitoring

# Option B: Add ignoreDifferences (not recommended for resources)
# Option C: Commit proper fix to Git (BEST)
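Option A removes automated sync, so it has to be restored once the fix is in Git. A minimal sketch follows, assuming the Application originally used automated sync with `prune` and `selfHeal` set; those values are assumptions, so use whatever your Application manifest in Git declares. The helper names are hypothetical.

```shell
# Build a merge patch that restores automated sync on an ArgoCD Application.
# $1 = prune (true|false), $2 = selfHeal (true|false); both are assumptions
# about your original syncPolicy, not values taken from the cluster.
autosync_patch() {
  prune="$1"; self_heal="$2"
  printf '{"spec":{"syncPolicy":{"automated":{"prune":%s,"selfHeal":%s}}}}\n' \
    "$prune" "$self_heal"
}

# Apply the patch. $1 = Application name, $2 = ArgoCD namespace,
# $3 = prune, $4 = selfHeal.
resume_autosync() {
  oc patch application "$1" -n "$2" --type=merge -p "$(autosync_patch "$3" "$4")"
}
```

Usage: `resume_autosync monitoring argocd true true`. If the Application itself is managed from Git, syncing its parent restores the policy as well, and this patch is only a shortcut.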

Handle Multi-Attach Volume Error

Multi-Attach error for volume "csi-vol-4651b1036..."
Volume is already exclusively attached to one node and can't be attached to another
# This happens when Pod reschedules to a different node
# The PVC (RWO) is still attached to the old node

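# Wait for the old VolumeAttachment to be force-detached (usually 6 min)
# VolumeAttachments are listed by PV name, so look up the PV bound to the PVC first
oc get pvc -n monitoring | grep thanos-receive
oc get volumeattachment | grep <pv-name>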

# Or force-detach (DANGEROUS - data loss risk if old Pod still writing)
oc delete volumeattachment <attachment-name>

# Better: ensure Pod stays on same node via nodeAffinity
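Before waiting or force-detaching, it helps to confirm which node still holds the volume. The sketch below lists VolumeAttachments for one PV together with the node they point at; `attachments_for_pv` is a hypothetical helper, and the PV name is whatever `oc get pvc` reports in the VOLUME column.

```shell
# Print "<attachment-name> <pv-name> <node>" for every VolumeAttachment
# that references the given PV name ($1).
attachments_for_pv() {
  oc get volumeattachment \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.source.persistentVolumeName}{" "}{.spec.nodeName}{"\n"}{end}' \
    | awk -v pv="$1" '$2 == pv'
}

# Print the node a Pod is currently scheduled on ($1 = Pod, $2 = namespace).
pod_node() {
  oc get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}{"\n"}'
}
```

If the node from `attachments_for_pv` differs from the one `pod_node` reports, the Pod has been rescheduled and the old attachment has not been released yet.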

Common Issues

WAL replay takes >10 minutes

  • Cause: Massive WAL accumulation from long downtime
  • Fix: Increase startupProbe failureThreshold; consider compacting WAL manually

Memory keeps growing after replay

  • Cause: High cardinality metrics (too many unique label combinations)
  • Fix: Add relabeling rules to drop high-cardinality series; increase memory limit

ArgoCD shows “OutOfSync” after manual fix

  • Cause: Live state differs from Git
  • Fix: Commit the fix to Git; ArgoCD will show “Synced” again

Pod stuck in Pending after OOM

  • Cause: New memory request exceeds node available memory
  • Fix: Check oc describe node; reduce requests or move to larger node

Best Practices

  1. Use startupProbe for Thanos Receive: separate startup from liveness
  2. Size memory at 3-4x WAL disk size: decompression + runtime overhead
  3. Never fight ArgoCD: always commit the real fix to Git
  4. Monitor WAL size: set alerts when WAL exceeds expected size
  5. Use retention flags: --tsdb.retention=15d limits how much local TSDB data accumulates on disk
  6. Set fsGroup in securityContext: ensures WAL files are writable (fsGroup: 1001)
  7. Pin StatefulSet to node: avoids Multi-Attach errors on RWO volumes

Key Takeaways

  • Thanos Receive OOM has two phases: startup (WAL replay) and steady-state
  • Liveness probe timeout alone doesn't fix OOM: the kernel OOM killer ignores probes
  • startupProbe is the correct K8s primitive for slow-starting containers
  • ArgoCD with auto-sync and self-heal enabled will revert manual patches within its sync interval (default 3 min)
  • Memory sizing: WAL disk × 3 + active series × 2KB + 700MB overhead
  • Multi-Attach errors resolve by waiting for VolumeAttachment timeout (6 min)
  • The permanent fix must go in the GitOps repo, no exceptions
#thanos #oom #crashloopbackoff #statefulset #argocd
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
