Observability
Kubernetes observability recipes: Prometheus, Grafana, EFK logging, Jaeger tracing, health probes, custom metrics, and GPU monitoring.
Grafana Kubernetes Monitoring Dashboards Guide
Deploy and configure Grafana dashboards for Kubernetes monitoring including dashboard 6417 for pod metrics, dashboard 315 for cluster overview, and custom
Kubernetes EFK Stack Centralized Logging
Deploy the EFK stack (Elasticsearch, Fluentd, Kibana) on Kubernetes for centralized log collection, processing, and visualization. DaemonSet log
NVIDIA CNS with Insight Operator for Network Diagnostics
Deploy NVIDIA Cloud-Native Stack (CNS) with the Insight Operator and NVIDIA Insight tools for deep GPU fabric diagnostics. Collect NIC firmware health, link
NVIDIA DOCA Telemetry for Network Monitoring on Kubernetes
Deploy NVIDIA DOCA Telemetry Service (DTS) to collect real-time network metrics from BlueField DPUs and ConnectX NICs. Export RoCE counters, port
NVIDIA Nsight Operator for GPU Profiling on Kubernetes
Deploy NVIDIA Nsight Systems and Nsight Compute on Kubernetes for GPU workload profiling. Capture kernel traces, memory bandwidth, SM occupancy, and NCCL
Run:ai Observability with OpenTelemetry
Configure Run:ai observability on OpenShift with OpenTelemetry Collector, Prometheus receivers, metrics enrichment, OAuth2 export, and GPU metric collection
Thanos Receive Memory Sizing Guide
Calculate correct memory limits for Thanos Receive based on WAL segments, active series, retention, and ingestion rate. Prevent OOMKill crash loops
Kubernetes 1.36 Native Histogram Metrics
Enable Prometheus native histograms in Kubernetes 1.36 for higher-resolution metrics with lower storage cost. Covers all control plane components.
GPU Operator Node Status Exporter Metrics
Monitor NVIDIA GPU Operator node validation with gpu_operator_node_driver_ready and status exporter metrics. Prometheus alerts for GPU node health.
Grafana Dashboard 6417 Kubernetes Pods
Import Grafana dashboard 6417 for Kubernetes pod monitoring. Configure Prometheus data source, visualize CPU, memory, network, and disk usage per pod.
K8s Metrics Server: kubectl top Guide
Install Kubernetes Metrics Server for kubectl top and HPA. Resource usage monitoring, troubleshooting metrics, and custom metrics integration.
OpenTelemetry in Kubernetes: Traces and Metrics
Deploy OpenTelemetry Collector in Kubernetes for distributed tracing and metrics. Auto-instrumentation, OTLP export, Jaeger integration.
Prometheus: K8s Monitoring and Alerting
Deploy Prometheus monitoring in Kubernetes with kube-prometheus-stack. ServiceMonitor, PrometheusRule, Grafana dashboards, and alerting for production clusters.
Kubernetes Logging Fluent Bit Guide
Deploy Fluent Bit for centralized Kubernetes logging. DaemonSet configuration, parsing, filtering, and forwarding logs to Elasticsearch, Loki, or S3.
Prometheus Monitoring Kubernetes Guide
Deploy Prometheus for Kubernetes cluster monitoring. ServiceMonitor, PodMonitor, alerting rules, Grafana dashboards, and kube-prometheus-stack Helm install.
DOCA Telemetry BlueField Kubernetes
Collect NVIDIA BlueField DPU telemetry in Kubernetes using DOCA Telemetry libraries. Monitor adaptive retransmission, PCC, diagnostics, and PCI metrics.
Cilium Hubble Observability Guide
Monitor Kubernetes network flows with Cilium Hubble. CLI usage, Hubble UI, flow filtering, DNS visibility, and L7 HTTP observability.
EFK Logging System Principles K8s
EFK logging system principles for Kubernetes. Elasticsearch, Fluentd, Kibana architecture, log pipeline design, parsing, and retention strategies.
Grafana Dashboards for Kubernetes Guide
Import and customize Grafana dashboards for Kubernetes monitoring. Dashboard 315, 6417, kube-prometheus-stack, and custom panel creation.
NVIDIA DCGM Exporter GPU Monitoring
Monitor GPU metrics with DCGM Exporter on K8s. Prometheus integration, Grafana dashboards, and alerting on utilization and temperature.
AI Workload Monitoring Kubernetes
Monitor AI and GPU workloads on Kubernetes with DCGM Exporter, Prometheus, and Grafana. GPU utilization, memory usage, inference latency.
Jaeger Tracing Kubernetes Guide
Deploy Jaeger for distributed tracing on Kubernetes. Collector, storage backends, sampling strategies, and trace analysis for microservice debugging.
Loki Log Aggregation Kubernetes
Deploy Grafana Loki for log aggregation on Kubernetes. Promtail DaemonSet, LogQL queries, structured logging, retention policies, and Grafana integration.
OpenTelemetry Collector Kubernetes
Deploy the OpenTelemetry Collector on Kubernetes for unified observability. Traces, metrics, and logs pipeline configuration, auto-instrumentation.
Prometheus Alerting Rules Kubernetes
Write effective Prometheus alerting rules for Kubernetes. Alertmanager routing, inhibition, silence, and production-ready alert templates for CPU, memory.
Grafana Tempo Tracing Kubernetes
Deploy Grafana Tempo for cost-effective distributed tracing on Kubernetes. Object storage backend, TraceQL queries, and Grafana integration.
Thanos HA Prometheus Kubernetes
Scale Prometheus with Thanos for high availability and long-term storage on Kubernetes. Sidecar, Store, Compactor, and Query frontend for multi-cluster metrics.
Continuous Profiling with Pyroscope
Deploy Pyroscope on Kubernetes for continuous CPU and memory profiling. Identify performance bottlenecks in production with low overhead.
OpenTelemetry Auto-Instrumentation
Configure OpenTelemetry Operator auto-instrumentation to inject tracing into pods without code changes. Supports Java, Python, Node.js, .NET, and Go.
Alertmanager Routing, Grouping, and Silences
Configure Alertmanager routing trees, receiver integrations, inhibition rules, silences, and alert grouping for production Kubernetes monitoring stacks.
K8s Golden Signals: SLI and SLO Monitoring
Implement Google SRE golden signals on Kubernetes. Define SLIs, set SLO targets, configure error budgets, and build SLO dashboards with Prometheus and Sloth.
Kubernetes Log Aggregation with Grafana Loki
Aggregate Kubernetes logs with Grafana Loki and Promtail. Install Loki stack, LogQL queries, label-based filtering, and Grafana log exploration dashboards.
K8s Metrics Server: Install and Configure
Install and configure Kubernetes Metrics Server for kubectl top, HPA autoscaling, and resource monitoring. Troubleshoot common metrics-server errors and TL.
Network Observability with Cilium Hubble
Monitor Kubernetes network traffic with Cilium Hubble. Service maps, DNS visibility, HTTP flow logs, network policy auditing, and Hubble UI dashboards.
K8s Pod Resource Monitoring with Grafana
Monitor Kubernetes pod CPU and memory with Grafana dashboards. Prometheus queries for resource usage, request vs limit tracking.
Grafana Dashboard 6417: Node Exporter Setup
Import Grafana Dashboard 6417 for Kubernetes pod monitoring. Node Exporter Full setup with Prometheus, CPU, memory, disk, and network metrics.
Kubernetes Alerting Best Practices
Design effective Kubernetes alerts that reduce noise and catch real issues. Covers severity tiers, golden signals, runbook links, and fatigue prevention.
Kubernetes Cost Monitoring with Kubecost
Monitor and optimize Kubernetes costs with Kubecost. Track per-namespace and per-deployment spend with cloud billing integration and savings tips.
EFK Stack: Kubernetes Centralized Logging
Deploy EFK stack for Kubernetes centralized logging. Elasticsearch, Fluentd, Kibana setup, log collection, parsing, and retention policies.
K8s Monitoring with Prometheus and Grafana
Set up Kubernetes monitoring with Prometheus and Grafana. Covers kube-prometheus-stack, custom dashboards, alerting rules, and key metrics to monitor.
OpenTelemetry Complete Setup on Kubernetes
Deploy OpenTelemetry Collector, auto-instrumentation, and exporters on Kubernetes. Unified traces, metrics, and logs pipeline to Jaeger, Prometheus, and Loki.
OpenClaw Health Probes on Kubernetes
Configure liveness and readiness probes for OpenClaw on Kubernetes. Custom Node.js health checks against /healthz and /readyz endpoints with proper timing.
Enable User Workload Monitoring OpenShift
Enable user workload monitoring on OpenShift. Deploy ServiceMonitor, PodMonitor, alerting rules, and Grafana dashboards.
Per-Tenant GPU Monitoring and Chargeback
Build per-tenant GPU monitoring dashboards with queue time, utilization, thermal metrics, and GPU-hour chargeback on Kubernetes.
GPU Tenant SLO Observability on Kubernetes
Define and monitor GPU tenant SLOs for queue time, inference latency, GPU utilization, and job completion rate with Prometheus alerting.
OpenClaw Logging with EFK Stack
Collect and analyze OpenClaw agent logs using Elasticsearch, Fluent Bit, and Kibana (EFK stack) for debugging and audit trails.
Monitor OpenClaw with Prometheus and Grafana
Set up monitoring for OpenClaw AI gateway on Kubernetes with Prometheus metrics, Grafana dashboards, and alerting for uptime, message throughput, and.
Monitor NCCL Benchmark Runs Prometheus & Gr...
Track NCCL benchmark outcomes and GPU telemetry over time with Prometheus and Grafana dashboards to detect communication regressions early.
How to Set Up Node Problem Detector
Detect and report node-level issues automatically with Node Problem Detector. Learn to identify kernel problems, hardware failures, and container.
How to Set Up Alertmanager for Prometheus
Configure Alertmanager to route and manage Prometheus alerts. Set up notification channels including Slack, PagerDuty, and email with routing rules.
How to Implement Container Logging Patterns
Configure logging for Kubernetes applications. Implement sidecar logging, log aggregation, and structured logging best practices.
Implement Distributed Tracing with Jaeger
Deploy Jaeger for distributed tracing in Kubernetes. Learn to instrument applications, trace requests across services, and identify performance.
Grafana Dashboard 6417: K8s Pod Monitoring
Set up Grafana dashboard 6417 for Kubernetes pod monitoring. Import, customize panels, PromQL queries, and cluster-wide resource visualization.
Jaeger Distributed Tracing on Kubernetes
Deploy Jaeger for distributed tracing in Kubernetes. Trace requests across microservices to identify latency issues and debug complex systems.
How to Use Kubernetes Events for Monitoring
Monitor cluster activity through Kubernetes events. Capture, filter, and alert on events for troubleshooting and operational visibility.
Set Up Centralized Logging with EFK Stack
Deploy Elasticsearch, Fluentd, and Kibana for centralized Kubernetes logging. Learn to collect, parse, and visualize container logs at scale.
Collect Metrics with OpenTelemetry Collector
Deploy OpenTelemetry Collector for unified metrics, traces, and logs collection in Kubernetes. Learn pipelines, processors, and exporters configuration.
How to Monitor Kubernetes with Prometheus
Set up Prometheus monitoring for Kubernetes clusters. Configure scraping, alerting rules, and visualize metrics with Grafana dashboards.
How to Configure Alertmanager for K8s Alerts
Set up Alertmanager to route, group, and deliver Kubernetes alerts. Learn to configure Slack, PagerDuty, and email notifications.
Set Up Prometheus Monitoring for Applications
Learn to instrument your Kubernetes applications with Prometheus metrics. Complete guide to ServiceMonitors, scraping configuration, and custom metrics.