Monitor OpenClaw with Prometheus and Grafana

💡 Quick Answer: OpenClaw exposes a Control UI on port 18789. Monitor availability with blackbox-exporter HTTP probes, pod state with kube-state-metrics, and container resource usage with cAdvisor. Set up alerts for pod restarts, OOM kills, and service downtime.
Key concept: OpenClaw is a single-process gateway — monitoring focuses on availability, resource usage, and restart frequency rather than request throughput.
Gotcha: OpenClaw doesn’t expose a /metrics Prometheus endpoint natively. Use blackbox-exporter for HTTP health checks and kube-state-metrics for pod-level metrics.

The Problem

No visibility into whether the AI assistant is online and responding
Pod crashes go unnoticed until users complain
No resource usage trending to right-size the deployment
Channel disconnections (WhatsApp session expiry) aren’t detected

The Solution

Combine Kubernetes-native metrics (kube-state-metrics, cAdvisor) with HTTP health probes to build a comprehensive monitoring stack.
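
The HTTP probe depends on a module defined in the blackbox-exporter's own configuration file, not in the Probe resource. A minimal sketch of an http_2xx module follows. The timeout and the IPv4 preference are assumptions to adjust for your cluster, and many blackbox-exporter installs already ship an equivalent default module.

```yaml
# blackbox.yml: blackbox-exporter configuration (values below are assumptions)
modules:
  http_2xx:
    prober: http
    timeout: 5s                     # assumed; tune to your latency budget
    http:
      method: GET
      valid_status_codes: []        # empty list means any 2xx response passes
      preferred_ip_protocol: ip4    # assumed; most cluster Services resolve to IPv4
```

If the module name in the Probe resource does not match a key under modules, the exporter rejects the probe request, so keep the two in sync.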

Monitoring Setup

# openclaw-monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: openclaw-health
  namespace: openclaw
spec:
  selector:
    matchLabels:
      app: openclaw
  # No scrape endpoints: OpenClaw does not expose /metrics natively.
  podMetricsEndpoints: []
---
# Blackbox exporter probe for HTTP health
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: openclaw-http
  namespace: openclaw
spec:
  interval: 30s
  module: http_2xx
  prober:
    url: blackbox-exporter.monitoring:9115
  targets:
    staticConfig:
      static:
        - openclaw.openclaw.svc.cluster.local:80  # Service port; assumes the Service forwards to the Control UI on 18789
---
# Alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-alerts
  namespace: openclaw
spec:
  groups:
    - name: openclaw.rules
      rules:
        - alert: OpenClawDown
          expr: probe_success{job="probe/openclaw/openclaw-http"} == 0
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "OpenClaw gateway is unreachable"
            description: "HTTP health check failed for 3+ minutes"
        
        - alert: OpenClawCrashLooping
          expr: increase(kube_pod_container_status_restarts_total{namespace="openclaw",container="openclaw"}[1h]) > 3
          labels:
            severity: warning
          annotations:
            summary: "OpenClaw is crash-looping ({{ $value }} restarts/hour)"
        
        - alert: OpenClawOOM
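          # last_terminated_reason stays at 1 until the next termination, so also
          # require a recent restart to keep the alert from firing indefinitely.
          expr: |
            (kube_pod_container_status_last_terminated_reason{namespace="openclaw",container="openclaw",reason="OOMKilled"} == 1)
            and on (namespace, pod, container)
            (increase(kube_pod_container_status_restarts_total{namespace="openclaw",container="openclaw"}[15m]) > 0)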
          labels:
            severity: warning
          annotations:
            summary: "OpenClaw was OOM killed — increase memory limits"
        
        - alert: OpenClawHighMemory
          expr: |
            container_memory_working_set_bytes{namespace="openclaw",container="openclaw"} /
            (container_spec_memory_limit_bytes{namespace="openclaw",container="openclaw"} > 0) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "OpenClaw memory usage above 85%"
        
        - alert: OpenClawPVCFull
          expr: |
            kubelet_volume_stats_used_bytes{namespace="openclaw",persistentvolumeclaim=~"openclaw.*"} /
            kubelet_volume_stats_capacity_bytes > 0.9
          labels:
            severity: critical
          annotations:
            summary: "OpenClaw PVC 90% full"
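
Alert expressions can be checked before deployment with promtool's rule unit tests. The sketch below assumes the groups from the PrometheusRule have been copied into a plain Prometheus rules file named openclaw-rules.yaml (a hypothetical filename), since promtool reads rule files rather than Kubernetes resources. It simulates two successful probes followed by failures and expects OpenClawDown to be firing at the five-minute mark.

```yaml
# openclaw-alerts-test.yaml (hypothetical filename)
rule_files:
  - openclaw-rules.yaml          # assumed: contains the `groups:` list from the PrometheusRule spec
evaluation_interval: 30s
tests:
  - interval: 30s                # spacing between the simulated samples
    input_series:
      - series: 'probe_success{job="probe/openclaw/openclaw-http"}'
        values: '1 1 0x10'       # two successes, then eleven failed probes
    alert_rule_test:
      - eval_time: 5m
        alertname: OpenClawDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: probe/openclaw/openclaw-http
            exp_annotations:
              summary: "OpenClaw gateway is unreachable"
              description: "HTTP health check failed for 3+ minutes"
```

Run it with promtool test rules openclaw-alerts-test.yaml. The first failed sample lands at 60s, so with the 3m hold period the alert should move from pending to firing at 4m.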

Grafana Dashboard

{
  "title": "OpenClaw Gateway",
  "panels": [
    {
      "title": "Gateway Health",
      "type": "stat",
      "targets": [{"expr": "probe_success{job='probe/openclaw/openclaw-http'}"}]
    },
    {
      "title": "Pod Restarts (24h)",
      "type": "stat",
      "targets": [{"expr": "increase(kube_pod_container_status_restarts_total{namespace='openclaw'}[24h])"}]
    },
    {
      "title": "Memory Usage",
      "type": "timeseries",
      "targets": [{"expr": "container_memory_working_set_bytes{namespace='openclaw',container='openclaw'}"}]
    },
    {
      "title": "CPU Usage",
      "type": "timeseries",
      "targets": [{"expr": "rate(container_cpu_usage_seconds_total{namespace='openclaw',container='openclaw'}[5m])"}]
    },
    {
      "title": "PVC Usage",
      "type": "gauge",
      "targets": [{"expr": "kubelet_volume_stats_used_bytes{namespace='openclaw'} / kubelet_volume_stats_capacity_bytes"}]
    }
  ]
}

Common Issues

Issue 1: Blackbox exporter not reaching OpenClaw

# Verify service DNS resolution
kubectl exec -n monitoring deploy/blackbox-exporter -- \
  wget -qO- http://openclaw.openclaw.svc.cluster.local:80/

# Check NetworkPolicy isn't blocking cross-namespace traffic
kubectl get networkpolicy -n openclaw
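
If a default-deny NetworkPolicy is in place, probe traffic needs an explicit allow rule. The following is a sketch under several assumptions: blackbox-exporter runs in the monitoring namespace (as in the prober URL), OpenClaw pods carry the app: openclaw label, and the container listens on the Control UI port 18789 behind the Service's port 80. The policy name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-blackbox-probe        # hypothetical name
  namespace: openclaw
spec:
  podSelector:
    matchLabels:
      app: openclaw                 # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # label set automatically on namespaces by recent Kubernetes versions
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 18789               # assumed container port, not the Service port
```

NetworkPolicy ports are matched against the pod's port, which is why the rule uses 18789 rather than 80.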

Best Practices

Alert on downtime, not just restarts — Use blackbox-exporter HTTP probes
Track PVC usage — Session data grows over time; alert before it’s full
Monitor memory trends — Right-size limits based on actual usage
Set up PagerDuty/Slack alerts — Critical alerts should reach you immediately
Dashboard rotation — Include OpenClaw panel in your NOC dashboard
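
The severity labels on the alert rules can drive routing in Alertmanager. Here is a minimal sketch of an Alertmanager configuration that sends everything to Slack and escalates critical alerts to PagerDuty. The receiver names, channel, webhook URL, and routing key are all placeholders, and with the Prometheus Operator this configuration usually lives in the Alertmanager secret rather than a standalone file.

```yaml
# alertmanager.yml (all names and credentials below are placeholders)
route:
  receiver: slack-default
  group_by: [alertname, namespace]
  routes:
    - matchers:
        - severity = "critical"     # OpenClawDown and OpenClawPVCFull
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: "#openclaw-alerts"
        send_resolved: true
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME     # PagerDuty Events API v2 integration key
```

Warning-level alerts fall through to the default Slack receiver, so only the two critical rules page someone.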

Key Takeaways

Blackbox-exporter provides HTTP health monitoring for OpenClaw
kube-state-metrics tracks pod restarts, OOM kills, and lifecycle events
cAdvisor provides CPU and memory usage for right-sizing
PVC monitoring prevents session data from filling up storage
Alerting ensures you know when your AI assistant goes offline
