πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
Observability intermediate ⏱ 20 minutes K8s 1.28+

K8s Golden Signals: SLI and SLO Monitoring

Implement Google SRE golden signals on Kubernetes. Define SLIs, set SLO targets, configure error budgets, and build SLO dashboards with Prometheus and Sloth.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: The four golden signals are latency, traffic, errors, and saturation. Define SLIs as Prometheus queries (e.g., http_request_duration_seconds_bucket for latency), set SLO targets (e.g., 99.9% requests under 500ms), calculate error budgets, and alert when budgets are burning too fast. Use Sloth to auto-generate Prometheus recording rules from SLO definitions.

The Problem

Most Kubernetes monitoring alerts on symptoms (CPU > 80%, pod restarting) rather than user impact. A pod using 90% CPU might be fine if users are happy. SLIs/SLOs flip the model: define what β€œgood” means from the user’s perspective, measure it, and only alert when you’re burning through your error budget.

flowchart TB
    subgraph GOLDEN["Four Golden Signals"]
        LAT["πŸ“Š Latency<br/>How long requests take"]
        TRAF["πŸ“ˆ Traffic<br/>How many requests"]
        ERR["❌ Errors<br/>How many failures"]
        SAT["πŸ’Ύ Saturation<br/>How full the system is"]
    end
    
    subgraph SLO_CHAIN["SLI β†’ SLO β†’ Error Budget"]
        SLI["SLI: 98.2% of requests<br/>complete in <500ms"]
        SLO["SLO Target: 99.5%<br/>of requests in <500ms"]
        EB["Error Budget: 0.5%<br/>(~3.6h/month downtime)"]
        ALERT["Alert: Budget burn rate<br/>exceeds threshold"]
    end
    
    GOLDEN --> SLI --> SLO --> EB --> ALERT

The Solution

Define SLIs with PromQL

# SLI: Availability (percentage of successful requests)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# SLI: Latency (percentage of requests under 500ms)
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# SLI: Freshness (data pipeline delivered within SLA)
(time() - max(pipeline_last_success_timestamp)) < 3600

# Golden Signal: Traffic
sum(rate(http_requests_total[5m])) by (service)

# Golden Signal: Saturation
# CPU saturation
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod)

# Memory saturation
container_memory_working_set_bytes / container_spec_memory_limit_bytes

SLO Definition with Sloth

# sloth.yaml β€” generates Prometheus recording rules
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-server
spec:
  service: "api-server"
  labels:
    team: platform
  slos:
    - name: "requests-availability"
      objective: 99.9               # 99.9% SLO
      description: "99.9% of API requests succeed"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{service="api-server",status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{service="api-server"}[{{.window}}]))
      alerting:
        name: APIServerAvailability
        labels:
          severity: critical
        annotations:
          summary: "API server error budget burn rate is too high"
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning

    - name: "requests-latency"
      objective: 99.5               # 99.5% SLO
      description: "99.5% of requests complete in under 500ms"
      sli:
        events:
          errorQuery: |
            sum(rate(http_request_duration_seconds_count{service="api-server"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{service="api-server",le="0.5"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count{service="api-server"}[{{.window}}]))
      alerting:
        name: APIServerLatency
# Generate Prometheus rules from Sloth SLO
sloth generate -i sloth.yaml -o prometheus-rules.yaml
kubectl apply -f prometheus-rules.yaml

Error Budget Calculation

# Error budget remaining (30-day window)
# SLO = 99.9% β†’ Error budget = 0.1% = 43.2 minutes/month

# Budget consumed (percentage)
1 - (
  sum_over_time(slo:sli_error:ratio_rate5m{sloth_service="api-server",sloth_slo="requests-availability"}[30d])
  / (30 * 24 * 12)  # 30 days Γ— 24h Γ— 12 five-minute windows/hour
) / (1 - 0.999)     # 0.999 = SLO target

# Burn rate (how fast are we consuming budget?)
# 1.0 = consuming at exactly the SLO rate
# 2.0 = burning 2Γ— faster than sustainable
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
/
(1 - 0.999)         # Normalize by error budget

Multi-Window Burn Rate Alerts

# Google SRE multi-window burn rate alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate-alerts
spec:
  groups:
    - name: slo.burn-rate
      rules:
        # Page: 2% budget consumed in 1 hour (14.4Γ— burn rate)
        - alert: SLOBurnRateCritical
          expr: |
            slo:sli_error:ratio_rate1h{sloth_service="api-server"} > (14.4 * 0.001)
            and
            slo:sli_error:ratio_rate5m{sloth_service="api-server"} > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "High error budget burn rate β€” page on-call"

        # Ticket: 5% budget consumed in 6 hours (6Γ— burn rate)  
        - alert: SLOBurnRateWarning
          expr: |
            slo:sli_error:ratio_rate6h{sloth_service="api-server"} > (6 * 0.001)
            and
            slo:sli_error:ratio_rate30m{sloth_service="api-server"} > (6 * 0.001)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Elevated error budget burn rate β€” create ticket"

Grafana SLO Dashboard

{
  "title": "SLO Dashboard",
  "panels": [
    {
      "title": "Error Budget Remaining",
      "type": "gauge",
      "thresholds": [
        {"color": "red", "value": 0},
        {"color": "yellow", "value": 25},
        {"color": "green", "value": 50}
      ]
    },
    {
      "title": "SLI: Availability (30d rolling)",
      "type": "stat",
      "targets": [{"expr": "avg_over_time(slo:sli_error:ratio_rate5m{sloth_slo='requests-availability'}[30d])"}]
    },
    {
      "title": "Burn Rate (1h window)",
      "type": "timeseries"
    },
    {
      "title": "Error Budget Consumption Over Time",
      "type": "timeseries"
    }
  ]
}

Common Issues

IssueCauseFix
SLI always 100%Too few requestsIncrease measurement window or lower SLO
Error budget depleted too fastSingle outage consumed all budgetImprove MTTR; don’t change SLO
Alerts too noisyBurn rate threshold too sensitiveUse multi-window approach (1h AND 5m)
No data for SLIWrong metric name or labelsVerify metric exists in Prometheus
SLO too aggressive99.99% unrealistic for your tierStart with 99.5% and tighten over time

Best Practices

  • Start with 3 SLOs per service β€” availability, latency, and one domain-specific
  • Use multi-window burn rates β€” prevents alert fatigue while catching real incidents
  • Review error budgets monthly β€” adjust SLO targets based on business needs
  • Don’t change SLOs after incidents β€” improve reliability, not the target
  • Automate with Sloth β€” generates recording rules from declarative SLO definitions
  • Display SLO dashboards publicly β€” transparency builds trust with stakeholders

Key Takeaways

  • Four golden signals: latency, traffic, errors, saturation β€” measure all four
  • SLIs are the metrics; SLOs are the targets; error budgets are the consequence
  • Multi-window burn rate alerts catch real incidents without false alarms
  • Sloth generates all Prometheus recording rules from a simple YAML SLO spec
  • Error budget policy: when budget is exhausted, prioritize reliability over features
  • SRE approach shifts from β€œis the system healthy?” to β€œare users happy?”
#sli #slo #golden-signals #error-budget #sre
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens