
🔧 Troubleshooting Recipes

Diagnose and fix common Kubernetes issues including pod failures, networking problems, resource constraints, and cluster issues.

30 recipes available

Intermediate

Troubleshoot CatalogSource and OLM Issues

Debug CatalogSource failures including pod crashes, gRPC errors, stale caches, and operator install problems in OpenShift OLM environments.

⏱ 15 minutes K8s 1.26+

Validate GPU and NIC Topology Before NCCL Benchmarks

Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.

⏱ 15 minutes K8s 1.28+

Check Bonding and Interface Status for SR-IOV

Inspect bond membership, interface state, and link aggregation to confirm which NICs can be correctly targeted by SR-IOV network policies on Kubernetes.

⏱ 15 minutes K8s Any

Identify Mellanox Interface Models from Linux and PCI Data

Map interface names to PCI addresses and Mellanox model generations to build accurate SR-IOV policies and GPU networking configurations on Kubernetes.

⏱ 15 minutes K8s Any

Validate SR-IOV Operator Health Across Multiple Worker Nodes

Run a full checklist to confirm SR-IOV discovery, VF creation, scheduler resources, and pod attachment on multiple nodes.

⏱ 30 minutes K8s 1.28+

How to Troubleshoot Kubernetes Networking

Debug and resolve Kubernetes networking issues systematically. Learn to diagnose DNS problems, service connectivity, network policies, and CNI issues.

⏱ 30 minutes K8s 1.28+

How to Debug Kubernetes Node Issues

Diagnose and troubleshoot node problems in Kubernetes clusters. Identify resource pressure, connectivity issues, and component failures.

⏱ 15 minutes K8s 1.28+

OOMKilled in Kubernetes: How to Debug and Fix

Fix OOMKilled errors in Kubernetes pods. Learn why containers get OOMKilled (exit code 137), how to set memory limits, debug memory leaks, and prevent OOM.

⏱ 15 minutes K8s 1.28+
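
The exit code 137 mentioned above follows the common convention that a process killed by a signal reports 128 plus the signal number, so 137 is 128 + 9 (SIGKILL, which the kernel OOM killer sends). A minimal sketch of that arithmetic follows. The function name `describe_exit_code` and the small signal table are illustrative assumptions, not part of the recipe itself.

```python
# Hypothetical helper: decode a container exit code using the
# "128 + signal number" convention. The table covers only a few common signals.
COMMON_SIGNALS = {
    2: "SIGINT",
    6: "SIGABRT",
    9: "SIGKILL",   # sent by the kernel OOM killer
    11: "SIGSEGV",
    15: "SIGTERM",
}

def describe_exit_code(code: int) -> str:
    """Return a short human-readable explanation of an exit code."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        name = COMMON_SIGNALS.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return f"exited with application error {code}"

print(describe_exit_code(137))
```

An exit code of 137 alone does not prove an out-of-memory kill, since any SIGKILL produces it. The container's last termination reason (`OOMKilled`) is the confirming detail.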

How to Debug Pod Networking Issues

Diagnose and fix Kubernetes networking problems. Troubleshoot connectivity, DNS resolution, service discovery, and network policies with practical tools.

⏱ 15 minutes K8s 1.28+

How to Debug Pod Scheduling Failures

Troubleshoot pods stuck in Pending state due to scheduling issues. Learn to diagnose resource constraints, node affinity, taints, and topology spread.

⏱ 15 minutes K8s 1.28+

How to Use Ephemeral Containers for Debugging

Debug running pods using ephemeral containers without restarting. Learn kubectl debug techniques for troubleshooting production workloads.

⏱ 15 minutes K8s 1.28+

How to Manage Kubernetes Finalizers and Stuck Resources

Understand and manage finalizers for controlled resource deletion. Handle stuck resources and implement custom cleanup logic.

⏱ 15 minutes K8s 1.28+

How to Debug DNS Issues in Kubernetes

Troubleshoot and resolve DNS problems in Kubernetes. Learn to diagnose CoreDNS issues, test resolution, and fix common DNS failures.

⏱ 20 minutes K8s 1.28+
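
As a rough illustration of what "test resolution" can mean in practice, the sketch below builds a Service's fully qualified name and attempts a lookup with the standard library. It assumes the default cluster domain `cluster.local`, which clusters can override. The helper names `service_fqdn` and `can_resolve` are hypothetical.

```python
import socket

def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name for a Service.

    Assumes the conventional <service>.<namespace>.svc.<cluster-domain> layout.
    """
    return f"{service}.{namespace}.svc.{cluster_domain}"

def can_resolve(hostname: str) -> bool:
    """Return True if the system resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

print(service_fqdn("kube-dns", "kube-system"))
```

Calling `can_resolve(service_fqdn("kubernetes"))` from inside a pod gives a quick signal of whether cluster DNS is answering. Outside the cluster the same call is expected to return `False`.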

Troubleshooting Pending PersistentVolumeClaims

Diagnose and fix PVCs stuck in Pending status. Learn common causes including StorageClass issues, capacity problems, and node affinity conflicts.

⏱ 15 minutes K8s 1.25+

Advanced

SR-IOV VF Troubleshooting on Kubernetes

Diagnose and fix SR-IOV Virtual Function issues including VF creation failures, device plugin errors, RDMA problems, and network attachment failures.

⏱ 20 minutes K8s 1.27+

Diagnose NVIDIA Memory-Only Kernel Modules on OpenShift

Understand why lsmod shows NVIDIA modules loaded but modinfo fails, and how the GPU Operator's proprietary driver container inserts modules without.

⏱ 15 minutes K8s 1.28+

Fix NVIDIA Peer Memory Driver Not Detected

Diagnose and resolve the 'NVIDIA peer memory driver not detected' error when running GPU workloads with RDMA on Kubernetes and OpenShift.

⏱ 30 minutes K8s 1.28+

Troubleshoot nvidia-fs Module Conflict on OpenShift

Diagnose and fix the 'insmod: ERROR: could not insert module nvidia-fs.ko: File exists' error when enabling GPUDirect Storage with the NVIDIA GPU Operator.

⏱ 30 minutes K8s 1.28+

Debug NCCL Timeouts and Hangs in Kubernetes

Systematically troubleshoot NCCL runs that stall or time out across multi-GPU and multi-node Kubernetes jobs with step-by-step diagnostic commands.

⏱ 30 minutes K8s 1.28+

Diagnose GPU Peer-to-Peer Latency with NCCL Tests

Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.

⏱ 25 minutes K8s 1.28+

Troubleshoot NVIDIA NIM TensorRT-LLM Initialization Failures

Diagnose and fix common NIM TensorRT-LLM executor failures including DecoderState mismatch, version incompatibilities, and engine build errors.

⏱ 20 minutes K8s 1.28+

Troubleshoot nv-ipam 'Pool Not Found' Errors in Multus

Fix nv-ipam IPPool lookup failures in Multus by aligning SriovNetwork, NetworkAttachmentDefinition, and IPPool names and namespaces correctly.

⏱ 20 minutes K8s 1.28+

Fix 'No Supported NIC Is Selected' in SR-IOV

Diagnose SR-IOV operator webhook rejections by validating node state, label selectors, PF eligibility, and SriovNetworkNodePolicy configuration.

⏱ 30 minutes K8s 1.28+

Want more troubleshooting patterns?

Our book includes an entire chapter dedicated to troubleshooting with dozens more examples.
