Thanos Receive OOMKilled CrashLoopBackOff
Debug and fix Thanos Receive StatefulSet OOMKilled CrashLoopBackOff caused by WAL replay exceeding memory limits. Covers ArgoCD conflict resolution, liveness
💡 Quick Answer: Thanos Receive OOMKilled during WAL (Write-Ahead Log) replay at startup needs both a liveness probe timeout increase (to survive replay) AND a memory limit increase (to hold all series in memory post-replay). When managed by ArgoCD, you must commit the fix to the GitOps repo; manual patches get reverted.
The Problem
Thanos Receive StatefulSet enters CrashLoopBackOff with these symptoms:
- Pod starts, begins loading WAL segments (hundreds of segments)
- Liveness probe fails during replay (the default window of 3 failures × 10s = 30s is too short)
- OR: WAL replay completes but steady-state memory exceeds limit → OOMKilled
- ArgoCD continuously reverts manual patches back to Git state
- Multi-Attach volume errors when Pod reschedules to different node
The Solution
Diagnose the Root Cause
# Check Pod events
oc describe pod thanos-receive-0 -n monitoring
# Look for:
# - OOMKilled (exit code 137)
# - Liveness probe failed
# - Back-off restarting failed container
# - Multi-Attach error for volume
# Check previous container logs
oc logs thanos-receive-0 -n monitoring --previous | tail -50
# Check current memory usage (if Pod is running)
# cgroup v2 path; on cgroup v1 use /sys/fs/cgroup/memory/memory.usage_in_bytes
oc exec thanos-receive-0 -n monitoring -- \
  cat /sys/fs/cgroup/memory.current
Identify WAL Replay Duration
# Count WAL segments
oc exec thanos-receive-0 -n monitoring -- \
  ls /var/thanos/receive/wal/ | wc -l
# Typical output: 347 segments
# Each segment is up to 128MB; replay reads every segment to rebuild the in-memory head
# Rough estimate: ~3KB/series × 1M series ≈ 3GB, so plan for 2-4Gi RAM
Logs During WAL Replay
level=info component=receive component=multi-tsdb tenant=default-tenant
caller=head.go:825 msg="WAL segment loaded" segment=341
level=info component=receive component=multi-tsdb tenant=default-tenant
caller=head.go:825 msg="WAL segment loaded" segment=342
...
# Hundreds of these lines before the Pod is ready
# If liveness kills it before completion → infinite CrashLoop
Fix 1: Liveness Probe Timeout (Survive Startup)
# StatefulSet spec.template.spec.containers[0]
livenessProbe:
  httpGet:
    path: /-/healthy
    port: http
    scheme: HTTP
  initialDelaySeconds: 120   # Give 2 min before first check
  timeoutSeconds: 30         # Individual probe timeout
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 30       # 30 failures × 10s = 5 min grace
Or use a startupProbe (preferred for slow-starting containers):
startupProbe:
  httpGet:
    path: /-/healthy
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 60       # 60 × 10s = 10 min for WAL replay
livenessProbe:
  httpGet:
    path: /-/healthy
    port: http
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 6
Fix 2: Memory Limit (Survive Post-Replay)
containers:
  - name: receive
    resources:
      limits:
        cpu: 800m
        memory: 4Gi    # ← Increase from 1Gi
      requests:
        cpu: 500m
        memory: 2Gi    # ← Increase from 1Gi
Sizing formula:
Required memory ≈
  WAL size on disk × 2-3 (decompressed in-memory)
  + active series × 2KB per series
  + receive buffer (incoming writes)
  + Go runtime overhead (~200MB)
Example:
  WAL: 347 segments ≈ 4GB on disk
  Active series: ~500K × 2KB = 1GB
  Buffer: 500MB
  Runtime: 200MB
  Non-WAL terms: ~1.7GB; the WAL term (4GB × 2-3) is a conservative ceiling, not a measured value
  Starting point: set limit to 4Gi, then adjust from observed usage after replay
Fix 3: Commit to GitOps (Permanent)
# In your Helm values (GitOps repo):
thanos:
  receive:
    tolerations: *tolerations
    resources:
      limits:
        cpu: 800m
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi
    livenessProbe:
      initialDelaySeconds: 120
      failureThreshold: 30
# Commit and push
git add values.yaml
git commit -m "fix: increase thanos-receive memory to 4Gi for WAL replay"
git push origin main
# ArgoCD will auto-sync (if enabled)
# Or manually sync:
argocd app sync monitoring --resource apps:StatefulSet:thanos-receive
Handle ArgoCD Conflict
If ArgoCD keeps reverting your manual patches:
# Option A: Pause auto-sync temporarily
oc patch application monitoring -n argocd --type=merge \
-p '{"spec":{"syncPolicy":{"automated":null}}}'
# Apply manual fix
oc patch sts thanos-receive -n monitoring --type=json -p='[
{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"4Gi"},
{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"2Gi"}
]'
# Delete Pod to pick up new spec
oc delete pod thanos-receive-0 -n monitoring
# Option B: Add ignoreDifferences (not recommended for resources)
# Option C: Commit proper fix to Git (BEST)
Handle Multi-Attach Volume Error
Multi-Attach error for volume "csi-vol-4651b1036..."
Volume is already exclusively attached to one node and can't be attached to another
# This happens when Pod reschedules to a different node
# The PVC (RWO) is still attached to the old node
# Wait for old VolumeAttachment to expire (usually 6 min)
oc get volumeattachment | grep thanos-receive
# Or force-detach (DANGEROUS - data loss risk if old Pod still writing)
oc delete volumeattachment <attachment-name>
# Better: ensure Pod stays on same node via nodeAffinity
Common Issues
WAL replay takes >10 minutes
- Cause: Massive WAL accumulation from long downtime
- Fix: Increase startupProbe failureThreshold; consider compacting WAL manually
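The failureThreshold needed for a given replay time follows from the probe arithmetic used in this article (failures × periodSeconds, plus initialDelaySeconds). Here is a minimal sketch in shell, assuming you have measured or estimated the replay duration yourself. The function names are illustrative and not part of any tool, and the kubelet's real timing is only approximately this:

```shell
#!/bin/sh
# Hypothetical helpers for sizing a startupProbe; names are illustrative.

# Approximate seconds a probe tolerates before the container is restarted:
# initialDelaySeconds + failureThreshold * periodSeconds.
probe_window_seconds() {
  initial_delay=$1; period=$2; failures=$3
  echo $(( initial_delay + failures * period ))
}

# Smallest failureThreshold whose window covers replay_seconds (rounded up).
min_failure_threshold() {
  replay_seconds=$1; initial_delay=$2; period=$3
  remaining=$(( replay_seconds - initial_delay ))
  [ "$remaining" -lt 0 ] && remaining=0
  echo $(( (remaining + period - 1) / period ))
}

probe_window_seconds 30 10 60      # startupProbe values from this article
min_failure_threshold 1200 30 10   # assumed 20-minute replay
```

With these inputs the script prints 630 and 117. In practice, add headroom on top of the computed threshold, since replay time grows with WAL size.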
Memory keeps growing after replay
- Cause: High cardinality metrics (too many unique label combinations)
- Fix: Add relabeling rules to drop high-cardinality series; increase memory limit
ArgoCD shows "OutOfSync" after manual fix
- Cause: Live state differs from Git
- Fix: Commit the fix to Git; ArgoCD will show "Synced" again
Pod stuck in Pending after OOM
- Cause: New memory request exceeds node available memory
- Fix: Check oc describe node; reduce requests or move to larger node
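To compare a new memory request against what a node has left, it helps to normalize the quantities first. oc describe node lists Allocatable memory (typically in Ki) and an "Allocated resources" section with the requests already placed, so the free amount is roughly allocatable minus existing requests. The sketch below is a hypothetical helper that handles only the Ki, Mi, and Gi suffixes. The function names and the 1536Mi figure are assumptions for illustration:

```shell
#!/bin/sh
# Convert a Kubernetes memory quantity (Ki, Mi, Gi suffixes only) to MiB.
to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the request (arg 1) fits in the free memory (arg 2).
request_fits() {
  request_mib=$(to_mib "$1") || return 1
  free_mib=$(to_mib "$2") || return 1
  [ "$request_mib" -le "$free_mib" ]
}

# Example: a 2Gi request against an assumed 1536Mi of unreserved memory.
if request_fits 2Gi 1536Mi; then echo "fits"; else echo "does not fit"; fi
```

The scheduler compares requests, not limits, against node capacity, so it is the 2Gi request from Fix 2 that has to fit.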
Best Practices
- Use startupProbe for Thanos Receive: separate startup from liveness
- Size memory at 3-4x WAL disk size: decompression + runtime overhead
- Never fight ArgoCD: always commit the real fix to Git
- Monitor WAL size: set alerts when WAL exceeds expected size
- Use retention flags: --tsdb.retention=15d bounds how long local TSDB data is kept
- Set fsGroup in securityContext: ensures WAL files are writable (fsGroup: 1001)
- Pin StatefulSet to node: avoids Multi-Attach errors on RWO volumes
Key Takeaways
- Thanos Receive OOM has two phases: startup (WAL replay) and steady-state
- Liveness probe timeout alone doesn't fix OOM: the kernel OOM killer ignores probes
- startupProbe is the correct K8s primitive for slow-starting containers
- ArgoCD will revert manual patches within its sync interval (default 3 min)
- Memory sizing: WAL disk × 3 + active series × 2KB + 700MB overhead
- Multi-Attach errors resolve by waiting for VolumeAttachment timeout (6 min)
- The permanent fix must go in the GitOps repo, no exceptions
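The memory sizing rule above can be turned into a quick calculation. This is a rough sketch using decimal units to match the article's arithmetic. The function name and the optional multiplier argument are illustrative, and the result is best read as a conservative ceiling rather than a measured requirement:

```shell
#!/bin/sh
# Rough memory ceiling in MB (decimal units):
#   WAL size on disk * multiplier + active series * 2KB + 700MB overhead
estimate_mem_mb() {
  wal_mb=$1; active_series=$2; multiplier=${3:-3}
  series_mb=$(( active_series * 2 / 1000 ))   # 2KB per series, in MB
  echo $(( wal_mb * multiplier + series_mb + 700 ))
}

estimate_mem_mb 4000 500000      # multiplier defaults to 3
estimate_mem_mb 4000 500000 2    # lower end of the 2-3x range
```

For a 4GB WAL and 500K active series this prints 13700 and 9700, well above a 4Gi limit. Use the estimate as an upper bound and confirm the real figure against observed usage once replay has finished.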
