Observability intermediate ⏱ 15 minutes K8s 1.28+

Kubernetes Alerting Best Practices

Design effective Kubernetes alerts that reduce noise and catch real issues. Covers alert severity tiers, golden signals, runbook links, and alert fatigue prevention.

By Luca Berton • 5 min read

Quick Answer: Tier alerts by severity (critical pages on-call, warning goes to Slack, info stays on dashboards), alert on the four golden signals, attach a runbook link to every alert, and use `for` durations and grouping to prevent alert fatigue.

The Problem

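A Kubernetes cluster produces far more signals than an on-call engineer can act on. Alerts without severity tiers, sensible `for` durations, or runbook links create noise that buries real issues. This recipe shows how to tier alerts, route them to the right channel, and keep each one actionable.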

The Solution

Alert Severity Tiers

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
spec:
  groups:
    - name: critical-alerts
      rules:
        # CRITICAL: Pages on-call, needs immediate action
        - alert: PodOOMKilled
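          # last_terminated_reason stays at 1 until the next termination, so pair it
          # with a recent restart to fire only on new OOM kills
          expr: |
            (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
            and on (namespace, pod, container)
            increase(kube_pod_container_status_restarts_total[10m]) > 0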
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Container {{ $labels.container }} OOMKilled in {{ $labels.pod }}"
            runbook: "https://wiki.example.com/runbooks/oomkilled"

        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has been NotReady for 5m"

    - name: warning-alerts
      rules:
        # WARNING: Slack notification, investigate during business hours
        - alert: HighCPUUsage
          expr: |
            # container!="" drops the pod-level cgroup series so usage is not double-counted
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
            / sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, namespace)
            > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} CPU >90% of request for 15m"

        - alert: PersistentVolumeSpaceLow
          expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.1
          for: 5m
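          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} has <10% free space"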
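
The rules above cover the critical and warning tiers. A minimal sketch of a third, info tier might look like the following. The resource name, the alert name, and the threshold of 3 restarts per hour are all hypothetical and only illustrate the pattern; it assumes kube-state-metrics is installed, as the rules above already do.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-info-alerts   # hypothetical name
spec:
  groups:
    - name: info-alerts
      rules:
        # INFO: no notification, meant to be viewed on dashboards only
        - alert: PodRestartingOften   # hypothetical alert name
          # More than 3 restarts in 1h is an assumed threshold; tune per workload
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.pod }} restarted more than 3 times in 1h"
```

An info alert like this only stays off Slack and PagerDuty if the Alertmanager routing tree handles `severity: info` explicitly.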

Alertmanager Routing

# Route critical → PagerDuty, warning → Slack
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-warnings'
  routes:
    # matchers replaces the deprecated match field (Alertmanager 0.22+)
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<key>'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#k8s-alerts'
        api_url: '<webhook-url>'

The Four Golden Signals

| Signal | What to measure | Alert when |
|---|---|---|
| Latency | Request duration p99 | p99 > 1s for 5m |
| Traffic | Requests per second | Drop >50% in 5m |
| Errors | Error rate (5xx) | >1% for 5m |
| Saturation | CPU/memory/disk usage | >85% for 10m |
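
The thresholds in the table can be turned into rules along these lines. This is a sketch under stated assumptions: the metric names `http_request_duration_seconds_bucket` and `http_requests_total`, and the `service` and `status` labels, are common instrumentation conventions but are not defined anywhere in this recipe, so substitute whatever your applications actually export. The alert names and the `warning` severity are also illustrative. Saturation is shown for memory only, using cAdvisor and kube-state-metrics series.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: golden-signal-alerts   # hypothetical name
spec:
  groups:
    - name: golden-signals
      rules:
        # Latency: p99 above 1s for 5m (assumes a request duration histogram)
        - alert: HighRequestLatency
          expr: |
            histogram_quantile(0.99,
              sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
            ) > 1
          for: 5m
          labels:
            severity: warning

        # Traffic: request rate below 50% of its value 5 minutes earlier.
        # The comparison already spans 5m, so `for` is kept short.
        - alert: TrafficDrop
          expr: |
            sum by (service) (rate(http_requests_total[5m]))
            < 0.5 * sum by (service) (rate(http_requests_total[5m] offset 5m))
          for: 1m
          labels:
            severity: warning

        # Errors: 5xx responses above 1% of all requests for 5m
        - alert: HighErrorRate
          expr: |
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
            > 0.01
          for: 5m
          labels:
            severity: warning

        # Saturation: container memory above 85% of its limit for 10m
        - alert: HighMemorySaturation
          expr: |
            sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
            / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
            > 0.85
          for: 10m
          labels:
            severity: warning
```
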
graph TD
    A[Prometheus Alert fires] --> B[Alertmanager]
    B --> C{Severity?}
    C -->|Critical| D[PagerDuty → page on-call]
    C -->|Warning| E[Slack → investigate during business hours]
    C -->|Info| F[Dashboard only]

Frequently Asked Questions

How do I reduce alert fatigue?

Set appropriate `for` durations (don't alert on 1-second spikes), group related alerts, use severity tiers, and add runbook links. Every alert should be actionable.
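
Beyond grouping, Alertmanager inhibition is another option for cutting duplicate notifications. The fragment below is a minimal sketch that assumes Alertmanager 0.22 or later and assumes the same alert name is defined at both the warning and critical tiers, which is not the case for the example rules in this recipe. While the critical alert fires, the matching warning is suppressed.

```yaml
inhibit_rules:
  # Mute a warning when a critical alert with the same alertname
  # and namespace is already firing
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'namespace']
```

The `equal` list matters: without it, any critical alert would silence every warning in the cluster.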

Best Practices

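  • Start with a small set of alerts on the four golden signals and add more only when an incident exposes a gap
  • Test rules and routing thoroughly in staging before production
  • Review which alerts actually fire, then tune thresholds and `for` durations based on real metrics
  • Link a runbook from every alert and document for your team why each threshold was chosen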

Key Takeaways

  • Tier alerts by severity: critical pages on-call, warning goes to Slack, info stays on dashboards
  • Every alert should be actionable and carry a runbook link
  • Use `kubectl describe` and `kubectl logs` for troubleshooting once an alert fires
  • Automate where possible to reduce human error
#alerting #prometheus #alertmanager #sre #on-call