Diagnose and fix common Kubernetes issues including pod failures, networking problems, resource constraints, and cluster issues.
30 recipes available
Decode base64-encoded dockerconfigjson secrets to verify registry credentials, troubleshoot ImagePullBackOff errors, and audit pull secret configurations.
Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Learn to diagnose registry authentication, image tags, and network connectivity issues.
Debug and fix container image pull failures. Resolve authentication issues, registry connectivity, and image availability problems.
Create local Kubernetes clusters using kind (Kubernetes in Docker). Set up multi-node clusters, configure networking, and test applications locally.
Master kubectl debugging commands to troubleshoot Kubernetes issues. Learn to inspect pods, view logs, debug networking, and diagnose cluster problems.
Enhance kubectl with custom plugins using Krew. Discover, install, and create plugins to boost Kubernetes productivity.
Fix CrashLoopBackOff in Kubernetes pods. Learn why pods crash loop, systematic debugging with kubectl logs and describe, and solutions for common causes.
Debug CatalogSource failures including pod crashes, gRPC errors, stale caches, and operator install problems in OpenShift OLM environments.
Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.
Inspect bond membership, interface state, and link aggregation to confirm which NICs can be correctly targeted by SR-IOV network policies on Kubernetes.
Map interface names to PCI addresses and Mellanox model generations to build accurate SR-IOV policies and GPU networking configurations on Kubernetes.
Run a full checklist to confirm SR-IOV discovery, VF creation, scheduler resources, and pod attachment on multiple nodes.
Debug and resolve Kubernetes networking issues systematically. Learn to diagnose DNS problems, service connectivity, network policies, and CNI issues.
Diagnose and troubleshoot node problems in Kubernetes clusters. Identify resource pressure, connectivity issues, and component failures.
Fix OOMKilled errors in Kubernetes pods. Learn why containers get OOMKilled (exit code 137), how to set memory limits, debug memory leaks, and prevent OOM.
Diagnose and fix Kubernetes networking problems. Troubleshoot connectivity, DNS resolution, service discovery, and network policies with practical tools.
Troubleshoot pods stuck in Pending state due to scheduling issues. Learn to diagnose resource constraints, node affinity, taints, and topology spread.
Debug running pods using ephemeral containers without restarting. Learn kubectl debug techniques for troubleshooting production workloads.
Understand and manage finalizers for controlled resource deletion. Handle stuck resources and implement custom cleanup logic.
Troubleshoot and resolve DNS problems in Kubernetes. Learn to diagnose CoreDNS issues, test resolution, and fix common DNS failures.
Diagnose and fix PVCs stuck in Pending status. Learn common causes including StorageClass issues, capacity problems, and node affinity conflicts with.
Diagnose and fix SR-IOV Virtual Function issues including VF creation failures, device plugin errors, RDMA problems, and network attachment failures.
Understand why lsmod shows NVIDIA modules loaded but modinfo fails, and how the GPU Operator's proprietary driver container inserts modules without.
Diagnose and resolve the 'NVIDIA peer memory driver not detected' error when running GPU workloads with RDMA on Kubernetes and OpenShift.
Diagnose and fix the 'insmod: ERROR: could not insert module nvidia-fs.ko: File exists' error when enabling GPUDirect Storage with the NVIDIA GPU Operator.
Systematically troubleshoot NCCL runs that stall or timeout across multi-GPU and multi-node Kubernetes jobs with step-by-step diagnostic commands.
Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.
Diagnose and fix common NIM TensorRT-LLM executor failures including DecoderState mismatch, version incompatibilities, and engine build errors.
Fix nv-ipam IPPool lookup failures in Multus by aligning SriovNetwork, NetworkAttachmentDefinition, and IPPool names and namespaces correctly.
Diagnose SR-IOV operator webhook rejections by validating node state, label selectors, PF eligibility, and SriovNetworkNodePolicy configuration.
Our book includes an entire chapter dedicated to troubleshooting with dozens more examples.
📖 Explore All Chapters