
📊 Observability

Kubernetes observability recipes: Prometheus, Grafana, EFK logging, Jaeger tracing, health probes, custom metrics, and GPU monitoring.

60 recipes 🟢 7 beginner 🟡 41 intermediate 🔴 12 advanced
intermediate ⏱ 15 minutes

Grafana Kubernetes Monitoring Dashboards Guide

Deploy and configure Grafana dashboards for Kubernetes monitoring including dashboard 6417 for pod metrics, dashboard 315 for cluster overview, and custom

grafana prometheus monitoring dashboards
intermediate ⏱ 15 minutes

Kubernetes EFK Stack Centralized Logging

Deploy the EFK stack (Elasticsearch, Fluentd, Kibana) on Kubernetes for centralized log collection, processing, and visualization. DaemonSet log

efk elasticsearch fluentd kibana
advanced ⏱ 15 minutes

NVIDIA CNS with Insight Operator for Network Diagnostics

Deploy NVIDIA Cloud-Native Stack (CNS) with the Insight Operator and NVIDIA Insight tools for deep GPU fabric diagnostics. Collect NIC firmware health, link

nvidia cns insight networking
advanced ⏱ 15 minutes

NVIDIA DOCA Telemetry for Network Monitoring on Kubernetes

Deploy NVIDIA DOCA Telemetry Service (DTS) to collect real-time network metrics from BlueField DPUs and ConnectX NICs. Export RoCE counters, port

nvidia doca telemetry dpus
advanced ⏱ 15 minutes

NVIDIA Nsight Operator for GPU Profiling on Kubernetes

Deploy NVIDIA Nsight Systems and Nsight Compute on Kubernetes for GPU workload profiling. Capture kernel traces, memory bandwidth, SM occupancy, and NCCL

nvidia nsight profiling gpu
advanced ⏱ 15 minutes

Run:ai Observability with OpenTelemetry

Configure Run:ai observability on OpenShift with OpenTelemetry Collector, Prometheus receivers, metrics enrichment, OAuth2 export, and GPU metric collection

runai opentelemetry observability openshift
advanced ⏱ 15 minutes

Thanos Receive Memory Sizing Guide

Calculate correct memory limits for Thanos Receive based on WAL segments, active series, retention, and ingestion rate. Prevent OOMKill crash loops

thanos memory capacity-planning observability
intermediate ⏱ 15 minutes

Kubernetes 1.36 Native Histogram Metrics

Enable Prometheus native histograms in Kubernetes 1.36 for higher-resolution metrics with lower storage cost. Covers all control plane components.

kubernetes-1.36 prometheus metrics observability
intermediate ⏱ 15 minutes

GPU Operator Node Status Exporter Metrics

Monitor NVIDIA GPU Operator node validation with gpu_operator_node_driver_ready and status exporter metrics. Prometheus alerts for GPU node health.

nvidia gpu-operator prometheus metrics
beginner ⏱ 15 minutes

Grafana Dashboard 6417 Kubernetes Pods

Import Grafana dashboard 6417 for Kubernetes pod monitoring. Configure Prometheus data source, visualize CPU, memory, network, and disk usage per pod.

grafana prometheus monitoring dashboards
beginner ⏱ 8 minutes

K8s Metrics Server: kubectl top Guide

Install Kubernetes Metrics Server for kubectl top and HPA. Resource usage monitoring, troubleshooting metrics, and custom metrics integration.

metrics monitoring kubectl hpa
advanced ⏱ 12 minutes

OpenTelemetry in Kubernetes: Traces and Metrics

Deploy OpenTelemetry Collector in Kubernetes for distributed tracing and metrics. Auto-instrumentation, OTLP export, Jaeger integration.

opentelemetry tracing observability metrics
intermediate ⏱ 15 minutes

Prometheus: K8s Monitoring and Alerting

Deploy Prometheus monitoring in Kubernetes with kube-prometheus-stack. ServiceMonitor, PrometheusRule, Grafana dashboards, and alerting for production clusters.

prometheus monitoring alerting observability
intermediate ⏱ 20 minutes

Kubernetes Logging Fluent Bit Guide

Deploy Fluent Bit for centralized Kubernetes logging. DaemonSet configuration, parsing, filtering, and forwarding logs to Elasticsearch, Loki, or S3.

logging fluent-bit observability elasticsearch
intermediate ⏱ 25 minutes

Prometheus Monitoring Kubernetes Guide

Deploy Prometheus for Kubernetes cluster monitoring. ServiceMonitor, PodMonitor, alerting rules, Grafana dashboards, and kube-prometheus-stack Helm install.

prometheus monitoring alerting grafana
advanced ⏱ 30 minutes

DOCA Telemetry BlueField Kubernetes

Collect NVIDIA BlueField DPU telemetry in Kubernetes using DOCA Telemetry libraries. Monitor adaptive retransmission, PCC, diagnostics, and PCI metrics.

nvidia doca bluefield telemetry
intermediate ⏱ 15 minutes

Cilium Hubble Observability Guide

Monitor Kubernetes network flows with Cilium Hubble. CLI usage, Hubble UI, flow filtering, DNS visibility, and L7 HTTP observability.

cilium hubble network-flows observability
intermediate ⏱ 15 minutes

EFK Logging System Principles K8s

EFK logging system principles for Kubernetes. Elasticsearch, Fluentd, Kibana architecture, log pipeline design, parsing, and retention strategies.

efk elasticsearch fluentd kibana
intermediate ⏱ 10 minutes

Grafana Dashboards for Kubernetes Guide

Import and customize Grafana dashboards for Kubernetes monitoring. Dashboard 315, 6417, kube-prometheus-stack, and custom panel creation.

grafana dashboards monitoring prometheus
intermediate ⏱ 15 minutes

NVIDIA DCGM Exporter GPU Monitoring

Monitor GPU metrics with DCGM Exporter on K8s. Prometheus integration, Grafana dashboards, and alerting on utilization and temperature.

dcgm gpu-monitoring prometheus nvidia
intermediate ⏱ 20 minutes

AI Workload Monitoring Kubernetes

Monitor AI and GPU workloads on Kubernetes with DCGM Exporter, Prometheus, and Grafana. GPU utilization, memory usage, inference latency.

gpu-monitoring dcgm prometheus grafana
intermediate ⏱ 15 minutes

Jaeger Tracing Kubernetes Guide

Deploy Jaeger for distributed tracing on Kubernetes. Collector, storage backends, sampling strategies, and trace analysis for microservice debugging.

jaeger tracing distributed-tracing observability
intermediate ⏱ 20 minutes

Loki Log Aggregation Kubernetes

Deploy Grafana Loki for log aggregation on Kubernetes. Promtail DaemonSet, LogQL queries, structured logging, retention policies, and Grafana integration.

loki logging promtail grafana
intermediate ⏱ 20 minutes

OpenTelemetry Collector Kubernetes

Deploy the OpenTelemetry Collector on Kubernetes for unified observability. Traces, metrics, and logs pipeline configuration, auto-instrumentation.

opentelemetry tracing observability collector
intermediate ⏱ 20 minutes

Prometheus Alerting Rules Kubernetes

Write effective Prometheus alerting rules for Kubernetes. Alertmanager routing, inhibition, silence, and production-ready alert templates for CPU, memory.

prometheus alerting alertmanager monitoring
intermediate ⏱ 15 minutes

Grafana Tempo Tracing Kubernetes

Deploy Grafana Tempo for cost-effective distributed tracing on Kubernetes. Object storage backend, TraceQL queries, and Grafana integration.

tempo tracing grafana traceql
advanced ⏱ 15 minutes

Thanos HA Prometheus Kubernetes

Scale Prometheus with Thanos for high availability and long-term storage on Kubernetes. Sidecar, Store, Compactor, and Query frontend for multi-cluster metrics.

thanos prometheus high-availability long-term-storage
intermediate ⏱ 20 minutes

Continuous Profiling with Pyroscope

Deploy Pyroscope on Kubernetes for continuous CPU and memory profiling. Identify performance bottlenecks in production without overhead.

profiling pyroscope performance observability
intermediate ⏱ 15 minutes

OpenTelemetry Auto-Instrumentation

Configure OpenTelemetry Operator auto-instrumentation to inject tracing into pods without code changes. Supports Java, Python, Node.js, .NET, and Go.

opentelemetry tracing auto-instrumentation observability
intermediate ⏱ 20 minutes

Alertmanager Routing, Grouping, and Silences

Configure Alertmanager routing trees, receiver integrations, inhibition rules, silences, and alert grouping for production Kubernetes monitoring stacks.

alertmanager routing silences pagerduty
intermediate ⏱ 20 minutes

K8s Golden Signals: SLI and SLO Monitoring

Implement Google SRE golden signals on Kubernetes. Define SLIs, set SLO targets, configure error budgets, and build SLO dashboards with Prometheus and Sloth.

sli slo golden-signals error-budget
intermediate ⏱ 20 minutes

Kubernetes Log Aggregation with Grafana Loki

Aggregate Kubernetes logs with Grafana Loki and Promtail. Install Loki stack, LogQL queries, label-based filtering, and Grafana log exploration dashboards.

loki logging promtail grafana
beginner ⏱ 15 minutes

K8s Metrics Server: Install and Configure

Install and configure Kubernetes Metrics Server for kubectl top, HPA autoscaling, and resource monitoring. Troubleshoot common metrics-server errors and TL.

metrics-server monitoring kubectl-top hpa
intermediate ⏱ 20 minutes

Network Observability with Cilium Hubble

Monitor Kubernetes network traffic with Cilium Hubble. Service maps, DNS visibility, HTTP flow logs, network policy auditing, and Hubble UI dashboards.

cilium hubble network-observability ebpf
intermediate ⏱ 20 minutes

K8s Pod Resource Monitoring with Grafana

Monitor Kubernetes pod CPU and memory with Grafana dashboards. Prometheus queries for resource usage, request vs limit tracking.

grafana prometheus resource-monitoring dashboards
beginner ⏱ 10 minutes

Grafana Dashboard 6417: Node Exporter Setup

Import Grafana Dashboard 6417 for Kubernetes pod monitoring. Node Exporter Full setup with Prometheus, CPU, memory, disk, and network metrics.

grafana dashboard-6417 node-exporter prometheus
intermediate ⏱ 15 minutes

Kubernetes Alerting Best Practices

Design effective Kubernetes alerts that reduce noise and catch real issues. Covers severity tiers, golden signals, runbook links, and fatigue prevention.

alerting prometheus alertmanager sre
beginner ⏱ 15 minutes

Kubernetes Cost Monitoring with Kubecost

Monitor and optimize Kubernetes costs with Kubecost. Track per-namespace and per-deployment spend with cloud billing integration and savings tips.

kubecost cost-monitoring finops optimization
intermediate ⏱ 15 minutes

EFK Stack: Kubernetes Centralized Logging

Deploy EFK stack for Kubernetes centralized logging. Elasticsearch, Fluentd, Kibana setup, log collection, parsing, and retention policies.

logging elasticsearch fluentd kibana
intermediate ⏱ 15 minutes

K8s Monitoring with Prometheus and Grafana

Set up Kubernetes monitoring with Prometheus and Grafana. Covers kube-prometheus-stack, custom dashboards, alerting rules, and key metrics to monitor.

monitoring prometheus grafana alerting
advanced ⏱ 15 minutes

OpenTelemetry Complete Setup on Kubernetes

Deploy OpenTelemetry Collector, auto-instrumentation, and exporters on Kubernetes. Unified traces, metrics, and logs pipeline to Jaeger, Prometheus, and Loki.

opentelemetry otel tracing metrics
beginner ⏱ 15 minutes

OpenClaw Health Probes on Kubernetes

Configure liveness and readiness probes for OpenClaw on Kubernetes. Custom Node.js health checks against /healthz and /readyz endpoints with proper timing.

openclaw health-probes liveness readiness
intermediate ⏱ 20 minutes

Enable User Workload Monitoring OpenShift

Enable user workload monitoring on OpenShift. Deploy ServiceMonitor, PodMonitor, alerting rules, and Grafana dashboards.

openshift monitoring prometheus servicemonitor
intermediate ⏱ 15 minutes

Per-Tenant GPU Monitoring and Chargeback

Build per-tenant GPU monitoring dashboards with queue time, utilization, thermal metrics, and GPU-hour chargeback on Kubernetes.

monitoring gpu chargeback prometheus
intermediate ⏱ 15 minutes

GPU Tenant SLO Observability on Kubernetes

Define and monitor GPU tenant SLOs for queue time, inference latency, GPU utilization, and job completion rate with Prometheus alerting.

slo gpu observability prometheus
intermediate ⏱ 15 minutes

OpenClaw Logging with EFK Stack

Collect and analyze OpenClaw agent logs using Elasticsearch, Fluent Bit, and Kibana (EFK stack) for debugging and audit trails.

openclaw logging elasticsearch fluent-bit
intermediate ⏱ 20 minutes

Monitor OpenClaw with Prometheus and Grafana

Set up monitoring for OpenClaw AI gateway on Kubernetes with Prometheus metrics, Grafana dashboards, and alerting for uptime, message throughput, and.

openclaw prometheus grafana monitoring
intermediate ⏱ 30 minutes

Monitor NCCL Benchmark Runs Prometheus & Gr...

Track NCCL benchmark outcomes and GPU telemetry over time with Prometheus and Grafana dashboards to detect communication regressions early.

nccl prometheus grafana observability
intermediate ⏱ 20 minutes

How to Set Up Node Problem Detector

Detect and report node-level issues automatically with Node Problem Detector. Learn to identify kernel problems, hardware failures, and container.

node-problem-detector observability monitoring troubleshooting
intermediate ⏱ 15 minutes

How to Set Up Alertmanager for Prometheus

Configure Alertmanager to route and manage Prometheus alerts. Set up notification channels including Slack, PagerDuty, and email with routing rules.

alertmanager prometheus alerts notifications
intermediate ⏱ 15 minutes

How to Implement Container Logging Patterns

Configure logging for Kubernetes applications. Implement sidecar logging, log aggregation, and structured logging best practices.

logging observability sidecar fluentd
advanced ⏱ 15 minutes

Implement Distributed Tracing with Jaeger

Deploy Jaeger for distributed tracing in Kubernetes. Learn to instrument applications, trace requests across services, and identify performance.

tracing jaeger opentelemetry observability
intermediate ⏱ 15 minutes

Grafana Dashboard 6417: K8s Pod Monitoring

Set up Grafana dashboard 6417 for Kubernetes pod monitoring. Import, customize panels, PromQL queries, and cluster-wide resource visualization.

grafana monitoring dashboards prometheus
intermediate ⏱ 15 minutes

Jaeger Distributed Tracing on Kubernetes

Deploy Jaeger for distributed tracing in Kubernetes. Trace requests across microservices to identify latency issues and debug complex systems.

jaeger tracing observability opentelemetry
beginner ⏱ 15 minutes

How to Use Kubernetes Events for Monitoring

Monitor cluster activity through Kubernetes events. Capture, filter, and alert on events for troubleshooting and operational visibility.

events monitoring troubleshooting observability
advanced ⏱ 15 minutes

Set Up Centralized Logging with EFK Stack

Deploy Elasticsearch, Fluentd, and Kibana for centralized Kubernetes logging. Learn to collect, parse, and visualize container logs at scale.

logging elasticsearch fluentd kibana
advanced ⏱ 15 minutes

Collect Metrics with OpenTelemetry Collector

Deploy OpenTelemetry Collector for unified metrics, traces, and logs collection in Kubernetes. Learn pipelines, processors, and exporters configuration.

opentelemetry otel metrics observability
intermediate ⏱ 15 minutes

How to Monitor Kubernetes with Prometheus

Set up Prometheus monitoring for Kubernetes clusters. Configure scraping, alerting rules, and visualize metrics with Grafana dashboards.

prometheus monitoring metrics grafana
intermediate ⏱ 30 minutes

How to Configure Alertmanager for K8s Alerts

Set up Alertmanager to route, group, and deliver Kubernetes alerts. Learn to configure Slack, PagerDuty, and email notifications.

alertmanager monitoring alerts notifications
intermediate ⏱ 35 minutes

Set Up Prometheus Monitoring for Applications

Learn to instrument your Kubernetes applications with Prometheus metrics. Complete guide to ServiceMonitors, scraping configuration, and custom metrics.

prometheus monitoring metrics observability