Kubernetes Alerting Best Practices
Design effective Kubernetes alerts that reduce noise and catch real issues. Covers alert severity tiers, golden signals, runbook links, and alert fatigue prevention.
The Problem
Engineers searching for Kubernetes alerting guidance often find scattered, incomplete guides. This recipe is a single production-ready reference covering severity tiers, Alertmanager routing, the four golden signals, and alert-fatigue prevention.
The Solution
Alert Severity Tiers
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
spec:
  groups:
    - name: critical-alerts
      rules:
        # CRITICAL: pages on-call, needs immediate action
        - alert: PodOOMKilled
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Container {{ $labels.container }} OOMKilled in {{ $labels.pod }}"
            runbook: "https://wiki.example.com/runbooks/oomkilled"
        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
    - name: warning-alerts
      rules:
        # WARNING: Slack notification, investigate during business hours
        - alert: HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total[5m])) by (pod, namespace)
              / sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, namespace)
              > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} CPU >90% of request for 15m"
        - alert: PersistentVolumeSpaceLow
          expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.1
          for: 5m
          labels:
            severity: warning
```
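The routing diagram further down also references an info tier that only appears on dashboards. A minimal sketch of what such a group could look like, appended under `spec.groups` of the PrometheusRule above (the alert name and 30m window are illustrative assumptions, not part of the original rule file):

```yaml
# Illustrative info-tier group: dashboard visibility only, never paged or sent to Slack.
# Alert name and threshold are assumptions, not part of the rules above.
- name: info-alerts
  rules:
    - alert: DeploymentReplicasMismatch
      expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
      for: 30m
      labels:
        severity: info
      annotations:
        summary: "Deployment {{ $labels.deployment }} replica count has not matched spec for 30m"
```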
AlertManager Routing
```yaml
# Route critical → PagerDuty, warning → Slack
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-warnings'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<key>'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#k8s-alerts'
        api_url: '<webhook-url>'
```
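To match the info tier shown in the diagram below, one option is a route that swallows info-severity notifications entirely, so they remain queryable in Prometheus and Grafana but never reach a human. A sketch to merge into the config above (the `null` receiver name is a common convention, not something defined earlier):

```yaml
# Sketch: drop info-severity notifications at Alertmanager; the alerts stay
# visible on dashboards via Prometheus queries. Merge the route into route.routes above.
route:
  routes:
    - match:
        severity: info
      receiver: 'null'
receivers:
  - name: 'null'   # receiver with no notification configs
```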
The Four Golden Signals

| Signal | What to measure | Alert when |
|---|---|---|
| Latency | Request duration p99 | p99 > 1s for 5m |
| Traffic | Requests per second | Drop >50% in 5m |
| Errors | Error rate (5xx) | >1% for 5m |
| Saturation | CPU/memory/disk usage | >85% for 10m |
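As a concrete example, the Errors row of the table could be implemented as a rule like the sketch below. The `http_requests_total` metric and its `code` label are assumptions about your application's instrumentation; adjust to whatever your services actually export.

```yaml
# Sketch: error-rate golden signal (>1% 5xx for 5m). Metric and label names
# (http_requests_total, code) are assumptions about app instrumentation.
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{code=~"5.."}[5m])) by (namespace, service)
      / sum(rate(http_requests_total[5m])) by (namespace, service)
      > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.service }} 5xx error rate above 1% for 5m"
```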
```mermaid
graph TD
    A[Prometheus Alert fires] --> B[AlertManager]
    B --> C{Severity?}
    C -->|Critical| D[PagerDuty → page on-call]
    C -->|Warning| E[Slack → investigate in hours]
    C -->|Info| F[Dashboard only]
```
Frequently Asked Questions
How do I reduce alert fatigue?
Set appropriate `for` durations (don't alert on 1-second spikes), group related alerts, use severity tiers, and add runbook links. Every alert should be actionable.
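Inhibition is another lever: when a critical alert is already firing, related warnings for the same alert and namespace add noise without adding information. A sketch of an Alertmanager inhibit rule to merge into the routing config above (newer Alertmanager versions also accept the `source_matchers`/`target_matchers` form):

```yaml
# Sketch: suppress warning-level notifications while a critical alert with the
# same alertname and namespace is firing.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'namespace']
```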
Best Practices
- Start with a small set of alerts that cover your real failure modes, then expand
- Test rules in staging before production, for example with promtool unit tests (see the sketch after this list)
- Monitor and iterate: tune thresholds and `for` durations based on real incident data
- Document severity definitions and runbook links so the team knows how to respond
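For the testing bullet above, Prometheus ships `promtool test rules`, which evaluates rules against synthetic series. A sketch of a unit test for the PodOOMKilled rule, assuming the groups from the PrometheusRule were extracted into a plain rule file (the `kubernetes-alerts.rules.yaml` filename is an assumption):

```yaml
# Sketch: run with `promtool test rules oomkilled_test.yaml`.
# Assumes the rule groups above were saved as kubernetes-alerts.rules.yaml.
rule_files:
  - kubernetes-alerts.rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'kube_pod_container_status_last_terminated_reason{reason="OOMKilled",pod="api-0",container="api"}'
        values: '0 0 1 1 1'
    alert_rule_test:
      - eval_time: 4m
        alertname: PodOOMKilled
        exp_alerts:
          - exp_labels:
              severity: critical
              reason: OOMKilled
              pod: api-0
              container: api
            exp_annotations:
              summary: "Container api OOMKilled in api-0"
              runbook: "https://wiki.example.com/runbooks/oomkilled"
```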
Key Takeaways
- Tiered, actionable alerting is essential Kubernetes operational knowledge
- Production-readiness requires proper configuration and monitoring
- Use `kubectl describe` and logs for troubleshooting
- Automate where possible to reduce human error

