High Availability OpenClaw with Kubernetes
Run OpenClaw in a high-availability configuration on Kubernetes with health checks, automatic restarts, backup strategies, and monitoring.
Quick Answer: OpenClaw runs as a single-instance gateway (WhatsApp/Signal require exclusive connections), but you can achieve high availability through Kubernetes restart policies, health probes, PodDisruptionBudgets, PVC backups, and monitoring alerts. The key is fast recovery, not horizontal scaling.
Key concept: HA for OpenClaw = fast automatic recovery + state persistence + monitoring, not multiple replicas (messaging protocols don't allow it).
Gotcha: Don't set replicas: 2. Messaging channels like WhatsApp and Discord only allow one active connection per token/session.
The Problem
- If the OpenClaw pod dies, your AI assistant goes offline
- Pod rescheduling can take minutes if nodes are full
- PVC data loss means re-pairing all messaging channels
- No alerting means you won't know the bot is down
The Solution
Use Kubernetes health probes, priority classes, PDB, and monitoring for fast automatic recovery.
Architecture Overview
flowchart TB
    subgraph cluster["☸️ KUBERNETES CLUSTER"]
        subgraph ha["HA Configuration"]
            DEP["Deployment<br/>replicas: 1<br/>Recreate"]
            HP["Health Probes<br/>Liveness + Readiness"]
            PC["PriorityClass<br/>High Priority"]
            PDB["PodDisruptionBudget<br/>minAvailable: 1"]
        end
        subgraph backup["Backup"]
            PVC["PVC<br/>State"]
            SNAP["VolumeSnapshot<br/>CronJob"]
            S3["S3 Backup"]
        end
        subgraph monitoring["Monitoring"]
            PROM["Prometheus"]
            ALERT["AlertManager"]
        end
    end
    DEP --> HP
    DEP --> PC & PDB
    PVC --> SNAP --> S3
    DEP --> PROM --> ALERT
Step 1: HA Deployment Configuration
# openclaw-ha.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: openclaw-critical
value: 1000000
globalDefault: false
description: "OpenClaw gateway: keep running at all costs"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: openclaw-pdb
  namespace: openclaw
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: openclaw
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-gateway
  namespace: openclaw
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: openclaw
  template:
    metadata:
      labels:
        app: openclaw
    spec:
      priorityClassName: openclaw-critical
      terminationGracePeriodSeconds: 30
      containers:
        - name: openclaw
          image: node:22-slim
          command: ["sh", "-c", "npm i -g openclaw@latest && openclaw gateway"]
          ports:
            - containerPort: 18789
          livenessProbe:
            httpGet:
              path: /
              port: 18789
            initialDelaySeconds: 60
            periodSeconds: 30
            failureThreshold: 3
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /
              port: 18789
            initialDelaySeconds: 30
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /
              port: 18789
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 30 # 2.5 min for npm install + startup
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
Step 2: Automated PVC Backup
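The CronJob in this step prunes old archives by sorting the `aws s3 ls` listing and passing it through GNU `head -n -28`, which drops the newest 28 lines (four backups a day for seven days) and leaves only older keys for deletion. As a rough sketch for checking that selection locally before trusting it with `aws s3 rm`, the helper here makes the same selection with `awk`. The function name `keys_to_prune` and the sample listing are illustrative assumptions, not part of OpenClaw or the AWS CLI.

```shell
#!/bin/sh
# keys_to_prune KEEP
# Reads `aws s3 ls`-style lines (date, time, size, key) on stdin and prints
# the keys that fall outside the newest KEEP entries. Sorting the raw lines
# orders them by the leading date and time columns, oldest first.
keys_to_prune() {
  keep="$1"
  sort | awk -v keep="$keep" '
    { keys[NR] = $4 }                                  # remember each key in sorted order
    END { for (i = 1; i <= NR - keep; i++) print keys[i] }
  '
}

# Made-up listing with three archives; keep the newest two.
printf '%s\n' \
  '2025-01-03 00:00:07  2048 openclaw-backup-20250103-000000.tar.gz' \
  '2025-01-01 00:00:05  2048 openclaw-backup-20250101-000000.tar.gz' \
  '2025-01-02 00:00:06  2048 openclaw-backup-20250102-000000.tar.gz' |
  keys_to_prune 2
# prints: openclaw-backup-20250101-000000.tar.gz
```

When fewer than KEEP archives exist, the helper prints nothing, so no delete commands would run.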
# openclaw-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: openclaw-backup
  namespace: openclaw
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: amazon/aws-cli:latest
              command: ["sh", "-c"]
              args:
                - |
                  set -e # Stop before pruning if the archive or upload fails
                  TIMESTAMP=$(date +%Y%m%d-%H%M%S)
                  cd /backup-source
                  tar czf /tmp/openclaw-backup-${TIMESTAMP}.tar.gz .
                  aws s3 cp /tmp/openclaw-backup-${TIMESTAMP}.tar.gz \
                    s3://my-backups/openclaw/openclaw-backup-${TIMESTAMP}.tar.gz
                  # Keep the newest 28 archives (7 days at 4 backups per day)
                  aws s3 ls s3://my-backups/openclaw/ | \
                    sort | head -n -28 | awk '{print $4}' | \
                    xargs -I{} aws s3 rm s3://my-backups/openclaw/{}
              envFrom:
                - secretRef:
                    name: aws-backup-credentials
              volumeMounts:
                - name: state
                  mountPath: /backup-source
                  readOnly: true
          # With a ReadWriteOnce volume, this pod must run on the same node as the gateway pod
          volumes:
            - name: state
              persistentVolumeClaim:
                claimName: openclaw-state
          restartPolicy: OnFailure
Step 3: Monitoring and Alerts
# openclaw-monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-alerts
  namespace: openclaw
spec:
  groups:
    - name: openclaw
      rules:
        - alert: OpenClawDown
          expr: |
            absent(up{job="openclaw"}) or up{job="openclaw"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "OpenClaw gateway is down"
        - alert: OpenClawHighRestarts
          expr: |
            increase(kube_pod_container_status_restarts_total{container="openclaw"}[1h]) > 3
          labels:
            severity: warning
          annotations:
            summary: "OpenClaw pod restarting frequently"
        - alert: OpenClawPVCNearFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim="openclaw-state"} /
            kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="openclaw-state"} > 0.85
          labels:
            severity: warning
          annotations:
            summary: "OpenClaw PVC is more than 85% full"
Common Issues
Issue 1: Pod stuck in Pending after node failure
# The high-priority PriorityClass lets the scheduler preempt lower-priority pods.
# If the pod is still Pending, check its events and the node resources:
kubectl describe pod -n openclaw -l app=openclaw
kubectl get nodes -o wide
kubectl describe nodes | grep -A 8 "Allocated resources"
Issue 2: Health probe fails during model loading
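How long a probe tolerates failures follows from its settings: roughly `initialDelaySeconds + periodSeconds * failureThreshold`, ignoring per-request timeouts. The helper here is only a sketch of that arithmetic, with the hypothetical name `probe_window`, applied to the probe values used in this recipe.

```shell
#!/bin/sh
# probe_window PERIOD FAILURE_THRESHOLD [INITIAL_DELAY]
# Prints the approximate number of seconds a probe keeps failing before
# Kubernetes acts on it: INITIAL_DELAY + PERIOD * FAILURE_THRESHOLD.
probe_window() {
  period="$1"
  threshold="$2"
  delay="${3:-0}"   # initial delay defaults to 0 when omitted
  echo $(( delay + period * threshold ))
}

probe_window 5 30 10   # startupProbe from Step 1: prints 160
probe_window 5 60      # startupProbe with failureThreshold 60: prints 300
probe_window 30 3      # livenessProbe failure window: prints 90
```

If startup regularly takes longer than the first figure, raising `failureThreshold` on the startupProbe is usually safer than loosening the livenessProbe, because the liveness settings also decide how quickly a hung gateway is restarted.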
# Use a startupProbe with a generous timeout
startupProbe:
  httpGet:
    path: /
    port: 18789
  failureThreshold: 60 # 60 x 5 s = 5 minutes
  periodSeconds: 5
Best Practices
- Use a PriorityClass: lets the scheduler preempt lower-priority pods to place OpenClaw, so it is not displaced by lower-priority workloads
- Use a PodDisruptionBudget: blocks voluntary disruptions (drains, upgrades) from evicting the pod; with a single replica, a node drain waits until the pod is deleted manually
- Back up the PVC every 6 hours: WhatsApp sessions and memory are hard to recreate
- Monitor pod health: alert as soon as the gateway goes down
- Pre-build a custom image: eliminates npm install time from cold starts
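Backups are only useful if they can be restored. The following is a minimal sketch of the unpack step, assuming the archive was created with `tar czf ... .` as in the backup job and has already been downloaded (for example with `aws s3 cp`). The function name `restore_archive` and the target directory are hypothetical: the real target is wherever the `openclaw-state` volume is mounted, which this recipe does not specify.

```shell
#!/bin/sh
# restore_archive ARCHIVE TARGET_DIR
# Unpacks a backup tarball into TARGET_DIR, creating the directory if needed.
# Returns non-zero if the archive is missing or extraction fails.
restore_archive() {
  archive="$1"
  target="$2"
  if [ ! -f "$archive" ]; then
    echo "missing archive: $archive" >&2
    return 1
  fi
  mkdir -p "$target" && tar xzf "$archive" -C "$target"
}

# Example call (paths are placeholders):
#   restore_archive /tmp/openclaw-backup.tar.gz /restore-target
```

Scaling the Deployment to zero first (`kubectl scale deployment/openclaw-gateway -n openclaw --replicas=0`) avoids restoring files underneath a running gateway.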
Key Takeaways
- Single replica with fast recovery is the HA pattern for OpenClaw
- PriorityClass + PDB keep the pod running through cluster disruptions
- PVC backups protect against data loss and enable disaster recovery
- Health probes enable automatic restart on failures
- Monitoring alerts ensure you know when the bot is down