Observability intermediate ⏱ 20 minutes K8s 1.28+

Prometheus Alerting Rules for Kubernetes

Write effective Prometheus alerting rules for Kubernetes. Alertmanager routing, inhibition, silence, and production-ready alert templates for CPU, memory.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Create PrometheusRule resources with meaningful alerts. Use recording rules for expensive queries, route alerts through Alertmanager with severity-based routing, and implement inhibition rules to prevent alert storms. Start with the 5 essential K8s alerts: PodCrashLooping, NodeNotReady, PVCNearlyFull, CertificateExpiringSoon, and HighErrorRate.

The Problem

Default Prometheus installations come with hundreds of alerts, and most teams disable them all because of alert fatigue. The result: no alerts at all, and issues are discovered by users. You need a curated set of actionable alerts with proper routing and severity.

The Solution

Essential Kubernetes Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-essential
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.essential
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            
        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} is not ready"
            
        - alert: PVCNearlyFull
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is {{ $value | humanizePercentage }} full"
            
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.service }} has {{ $value | humanizePercentage }} error rate"
            
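        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 86400
          labels:
            severity: warning
          annotations:
            # Label names assume cert-manager's default metric labels
            summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in less than 7 days"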
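
The Quick Answer recommends recording rules for expensive queries. A minimal sketch of that idea, applied to the HighErrorRate ratio: the resource name, group name, `interval` value, and recorded series names (`service:http_requests:rate5m`, `service:http_requests_errors:rate5m`) are illustrative assumptions, while the `http_requests_total` metric and its `service` and `status` labels come from the alert above.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: http-recording        # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: http.recording
      interval: 1m            # assumed evaluation interval
      rules:
        # Precompute per-service request and 5xx rates once per interval
        - record: service:http_requests:rate5m
          expr: sum by (service) (rate(http_requests_total[5m]))
        - record: service:http_requests_errors:rate5m
          expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
        # Same 5% threshold as HighErrorRate, read from the recorded series.
        # Use this instead of the raw-query version, not alongside it.
        - alert: HighErrorRate
          expr: service:http_requests_errors:rate5m / service:http_requests:rate5m > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.service }} has {{ $value | humanizePercentage }} error rate"
```

Rules within a group are evaluated in order, so the alert reads the values recorded in the same evaluation cycle. Dashboards can query the recorded series too, which is where most of the savings tend to come from.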

Alertmanager Routing

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: routing
spec:
  route:
    groupBy: ['alertname', 'namespace']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: default
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: pagerduty
      - matchers:
          - name: severity
            value: warning
        receiver: slack
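  receivers:
    # Catch-all receiver referenced by the top-level route; with no
    # integrations configured it sends no notifications
    - name: default
    - name: pagerduty
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-secret
            key: routing-key
    - name: slack
      slackConfigs:
        - channel: '#alerts'
          apiURL:
            name: slack-secret
            key: webhook-url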
graph TD
    PROM[Prometheus<br/>Evaluate rules] -->|Firing alerts| AM[Alertmanager<br/>Route by severity]
    AM -->|critical| PD[PagerDuty<br/>Wake someone up]
    AM -->|warning| SLACK[Slack<br/>#alerts channel]
    AM -->|info| LOG[Log only]
    
    AM -->|Inhibition| SUPPRESS[Suppress PodCrash<br/>if NodeNotReady]
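
The diagram shows NodeNotReady suppressing pod crash alerts, but the routing manifest does not configure it. One way this could be sketched is as an `inhibitRules` fragment under `spec` of the AlertmanagerConfig. It assumes both alerts carry a matching `node` label. The PodCrashLooping expression above does not add one, so it would need a join (for example against `kube_pod_info`) before this rule has any effect.

```yaml
# Fragment: add under `spec` of the AlertmanagerConfig above
spec:
  inhibitRules:
    # While NodeNotReady fires for a node...
    - sourceMatch:
        - name: alertname
          value: NodeNotReady
      # ...mute PodCrashLooping notifications...
      targetMatch:
        - name: alertname
          value: PodCrashLooping
      # ...but only when both alerts have the same value for `node`
      # (assumes the pod alert has been given a `node` label)
      equal: ['node']
```

Depending on how the Alertmanager resource is configured, the operator may also restrict AlertmanagerConfig routes and inhibit rules to alerts whose `namespace` label matches the resource's own namespace, so it is worth checking that both alerts fall inside that scope.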

Common Issues

Alert fatigue (too many alerts): Start with 5-10 essential alerts. Every alert must have a clear action. If the response is 'look at it later,' it should be a warning, not critical.

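Alerts firing during maintenance: Use Alertmanager silences: amtool silence add alertname=NodeNotReady --duration=2h --comment="planned maintenance". By default amtool rejects a silence without a comment, and it needs --alertmanager.url (or a config file) to locate Alertmanager.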

Best Practices

  • Every alert must have a runbook: link it in an annotation
  • Critical = wake someone up; use sparingly
  • Warning = investigate during business hours
  • Group by namespace to reduce alert spam
  • Inhibition: NodeNotReady suppresses pod alerts on that node
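
As a sketch of the runbook practice, here is the NodeNotReady rule with a runbook link added. `runbook_url` is a widely used annotation name rather than anything Prometheus requires. The URL and the `description` text are hypothetical placeholders.

```yaml
# Fragment of a PrometheusRule `rules:` list
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is not ready"
    description: "Node {{ $labels.node }} has not been Ready for more than 5 minutes."
    # Hypothetical URL; point this at your own runbook location
    runbook_url: "https://runbooks.example.com/kubernetes/node-not-ready"
```

Notification templates for Slack or PagerDuty can then surface the annotation so the responder gets the link with the page.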

Key Takeaways

  • Start with 5 essential alerts: PodCrashLooping, NodeNotReady, PVCNearlyFull, CertificateExpiringSoon, HighErrorRate
  • Route critical alerts to PagerDuty, warnings to Slack
  • Inhibition rules prevent cascading alert storms
  • Every alert must have a clear action; if you can't act on it, delete it
  • Group alerts by namespace and alertname to reduce noise
