
🔧 Troubleshooting

Debug and fix: CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending pods, networking issues, SR-IOV VF troubleshooting, and NFS-oRDMA performance debugging.

75 recipes 🟢 18 beginner 🟡 36 intermediate 🔴 21 advanced
beginner ⏱ 15 minutes

kubectl exec: Run Commands Inside Kubernetes Pods

Use kubectl exec to run commands and open shells inside Kubernetes pods. Covers interactive sessions, multi-container pods, and debugging with ephemeral containers.

kubectl-exec debugging shell troubleshooting
beginner ⏱ 15 minutes

Fix ImagePullBackOff in Kubernetes

Debug and fix ImagePullBackOff errors in Kubernetes. Covers wrong image names, private registry auth, rate limits, and network connectivity issues.

imagepullbackoff troubleshooting registry pull-secret
beginner ⏱ 15 minutes

Fix OOMKilled in Kubernetes Pods

Debug and fix OOMKilled errors in Kubernetes. Find memory leaks, set correct limits, use VPA for right-sizing, and prevent container OOM kills.

oomkilled memory out-of-memory troubleshooting
intermediate ⏱ 15 minutes

Kubernetes Pod Eviction: Causes and Prevention

Understand why Kubernetes evicts pods and how to prevent it. Covers resource pressure, priority classes, PDBs, and eviction policies.

eviction resource-pressure priority-class qos
advanced ⏱ 15 minutes

Fix API Server Timeout and Overload

Debug kubectl timeouts, API server overload, and connection refused errors. Covers etcd latency, webhook timeouts, and rate limiting.

api-server timeout connectivity performance
intermediate ⏱ 15 minutes

Fix CoreDNS Resolution Failures in Kubernetes

Debug DNS resolution failures in Kubernetes pods. Covers CoreDNS crashes, NXDOMAIN errors, ndots configuration, and upstream DNS timeouts.

coredns dns networking resolution
beginner ⏱ 15 minutes

CrashLoopBackOff Fix: Kubernetes Troubleshooting

Fix CrashLoopBackOff in Kubernetes step by step. Debug OOMKilled, missing configs, failed health probes, and image errors causing pod crash loops.

crashloopbackoff pods debugging troubleshooting
advanced ⏱ 15 minutes

Fix etcd High Latency and Slow API Server

Debug etcd performance issues causing slow kubectl responses and API server timeouts. Covers disk I/O, compaction, defragmentation, and leader elections.

etcd performance api-server latency
advanced ⏱ 15 minutes

Fix fio libaio Silent Exit on OpenShift crun Nodes

Debug fio instantly exiting with no output on crun-based OpenShift nodes. The root cause is seccomp blocking libaio syscalls; fix it by switching to the psync ioengine or an unconfined seccomp profile.

fio libaio seccomp crun
beginner ⏱ 15 minutes

Fix ImagePullBackOff in Kubernetes

Debug and resolve ImagePullBackOff errors including auth failures, wrong tags, private registry access, and rate limiting from Docker Hub and Quay.

imagepullbackoff registry pull-secret troubleshooting
intermediate ⏱ 15 minutes

Fix Kubernetes Job Failures and Retries

Debug Kubernetes Jobs stuck in backoff or hitting retry limits. Covers backoffLimit, activeDeadlineSeconds, and CronJob overlap.

jobs cronjob backoff retry
intermediate ⏱ 15 minutes

Fix Kubelet NotReady and Node Pressure Issues

Debug kubelet NotReady status, node pressure conditions, and eviction issues. Covers disk pressure, memory pressure, PID pressure, and network not ready.

kubelet node notready eviction
beginner ⏱ 15 minutes

Kubernetes Debugging Toolkit and Commands

Essential kubectl debugging commands and tools for Kubernetes troubleshooting. Covers ephemeral containers, debug pods, network debugging, and log analysis.

debugging kubectl troubleshooting ephemeral-containers
intermediate ⏱ 15 minutes

Fix OOMKilled Containers in Kubernetes

Debug and resolve OOMKilled container terminations. Understand memory limits, kernel OOM killer behavior, and right-sizing strategies for Kubernetes pods.

oomkilled memory resources troubleshooting
advanced ⏱ 15 minutes

OpenShift crun vs runc Runtime Differences

Understand why pods behave differently on GPU vs CPU nodes in OpenShift. Compare crun and runc container runtimes, seccomp profiles, and syscall filtering.

crun runc openshift container-runtime
intermediate ⏱ 15 minutes

Fix Unexpected Pod Evictions in Kubernetes

Debug pods being evicted due to node pressure, preemption, or taint-based eviction. Understand eviction priorities, QoS classes, and PodDisruptionBudgets.

eviction preemption pdb node-pressure
beginner ⏱ 15 minutes

Fix Pod Stuck in Pending State

Debug pods stuck in Pending status. Covers insufficient resources, node affinity mismatches, taint/toleration issues, and PVC binding failures.

pending scheduling resources troubleshooting
intermediate ⏱ 15 minutes

Fix Podman TLS x509 Certificate Errors Behind Corporate Proxy

Resolve podman pull x509 certificate signed by unknown authority errors caused by corporate TLS-intercepting proxies. Extract and install the proxy CA.

podman tls x509 proxy
intermediate ⏱ 15 minutes

Fix PVC Stuck in Pending State

Debug PersistentVolumeClaims stuck in Pending status. Covers storage class issues, provisioner failures, capacity problems, and access mode mismatches.

pvc storage persistent-volume troubleshooting
advanced ⏱ 15 minutes

Fix Service Mesh Sidecar Injection Failures

Debug Istio and Envoy sidecar injection issues. Covers missing sidecars, port conflicts, init container failures, and mTLS connection errors.

istio envoy sidecar service-mesh
intermediate ⏱ 10 minutes

OpenShift oc debug Mount Limitation

Why NFS and other filesystem mounts made via oc debug node disappear after the debug pod exits. Understand container namespace isolation and use MachineConfig instead.

openshift oc-debug mount troubleshooting
beginner ⏱ 15 minutes

Fix ConfigMap Too Large Error

Resolve the 1MB ConfigMap size limit error. Split configs, use Secrets for binary data, mount volumes, or use external stores.

configmap size-limit configuration troubleshooting
intermediate ⏱ 15 minutes

Debug CRI-O Container Runtime Errors

Troubleshoot CRI-O issues on OpenShift nodes. Fix image pull failures, container start errors, storage driver problems, and CNI networking plugin failures.

cri-o container-runtime openshift troubleshooting
advanced ⏱ 15 minutes

Debug MCP Degraded Nodes

Fix nodes stuck Degraded after MachineConfig updates. Check MCD logs, on-disk validation, and recovery for degraded workers.

openshift machineconfig degraded mcd
intermediate ⏱ 15 minutes

Debug Pod Eviction Reasons

Investigate why pods were evicted. Check node pressure, resource limits, priority classes, and preemption events.

eviction node-pressure resources troubleshooting
intermediate ⏱ 15 minutes

Debug DNS Resolution Failures in Pods

Troubleshoot pods unable to resolve DNS names. Check CoreDNS health, ndots configuration, search domains, and NetworkPolicies blocking UDP port 53 DNS traffic.

dns coredns resolution networking
advanced ⏱ 15 minutes

Debug etcd Performance Issues

Diagnose slow etcd causing API latency and leader election storms. Check disk IOPS, compaction, defrag, and network latency.

etcd performance latency disk-io
advanced ⏱ 15 minutes

Fix Expired Certificates in Kubernetes

Renew expired certificates causing API server failures and kubelet disconnections. Manual and automatic renewal for kubeadm and OpenShift.

certificates tls expiration kubeadm
intermediate ⏱ 15 minutes

Fix OpenShift ImageStream Import Errors

Debug ImageStream import failures in OpenShift. Resolve DNS errors, auth issues, TLS problems, and registry rate limiting.

openshift imagestream import registry
advanced ⏱ 25 minutes

ITMS Race Condition with Ingress Controllers

Resolve the ITMS race condition where ImageTagMirrorSet rollouts deadlock with hostNetwork ingress controllers during MCO drain.

openshift itms ingress machineconfig
advanced ⏱ 20 minutes

Fix Stale MachineConfigPool Updates

Debug and resolve stale OpenShift MachineConfigPool updates. Identify blocked nodes, check MachineConfigDaemon logs, and unblock stuck MCP rollouts.

openshift machineconfig mcp troubleshooting
advanced ⏱ 15 minutes

MCP Drain Blocked by PDB: Workaround

Resolve OpenShift MachineConfigPool drain failures caused by PodDisruptionBudget violations. Scale down and restore after update.

openshift pdb drain machineconfig
beginner ⏱ 15 minutes

Fix Namespace Stuck in Terminating

Remove Kubernetes namespaces stuck in Terminating state. Identify blocking finalizers and orphaned API resources, and safely force namespace cleanup.

namespace terminating finalizer cleanup
intermediate ⏱ 15 minutes

Debug NetworkPolicy Connectivity Issues

Troubleshoot pods unable to communicate despite correct Services. Verify NetworkPolicy rules, label selectors, and default deny.

networkpolicy connectivity debugging firewall
advanced ⏱ 15 minutes

Node Drain Blocked by hostNetwork Port Conflicts

Debug and fix OpenShift node drains that fail because hostNetwork pods cannot schedule replacements due to port exhaustion across the cluster.

openshift hostnetwork drain scheduling
intermediate ⏱ 15 minutes

Debug Node NotReady Status

Diagnose Kubernetes nodes stuck in NotReady state. Check kubelet logs, container runtime, network, disk pressure, and certificates.

node not-ready kubelet troubleshooting
intermediate ⏱ 20 minutes

OpenShift Ingress Router Troubleshooting

Debug OpenShift HAProxy router issues: pods stuck Pending, hostPort conflicts, PDB violations during maintenance, and custom router deployment scaling problems.

openshift ingress haproxy router
intermediate ⏱ 15 minutes

Debug MachineConfigDaemon Logs

Read and interpret OpenShift MachineConfigDaemon logs to diagnose node update failures. Common error patterns, drain issues, and config application problems.

openshift machineconfig mcd debugging
intermediate ⏱ 15 minutes

Debug OpenShift OAuth Login Failures

Troubleshoot OpenShift console and CLI login failures. Check OAuth server pods, identity provider config, and expired tokens.

openshift oauth authentication login
intermediate ⏱ 15 minutes

Fix Stuck OLM Operator Subscriptions

Debug Operator Lifecycle Manager subscriptions stuck in pending or failed state. Resolve catalog source issues, approval policies, and CSV dependency conflicts.

openshift olm operator subscription
intermediate ⏱ 15 minutes

Fix PV Stuck in Terminating State

Resolve PVs and PVCs stuck in Terminating status. Remove finalizers safely, check volume detachment, and handle storage issues.

pv pvc terminating finalizer
intermediate ⏱ 15 minutes

PDB Allowed Disruptions Zero: Debugging

Debug PodDisruptionBudgets stuck at zero allowed disruptions. Understand minAvailable vs maxUnavailable, fix eviction failures, and plan for maintenance.

pdb disruption-budget eviction maintenance
beginner ⏱ 15 minutes

Fix ResourceQuota Exceeded Errors

Debug resource quota violations preventing pod scheduling. Understand LimitRange defaults, ResourceQuota, and namespace management.

resourcequota limitrange scheduling resources
beginner ⏱ 15 minutes

Debug Service with No Ready Endpoints

Troubleshoot Services showing zero endpoints. Verify label selectors, readiness probes, pod status, and port configuration.

service endpoints readiness networking
beginner ⏱ 15 minutes

Debug Taint and Toleration Scheduling

Fix pods stuck Pending due to node taints. Understand NoSchedule, PreferNoSchedule, NoExecute effects and toleration syntax.

taints tolerations scheduling nodes
intermediate ⏱ 15 minutes

Fix Admission Webhook Timeout Errors

Debug admission webhook failures blocking pod creation. Identify failing webhooks, check timeouts, and set failurePolicy.

webhook admission timeout api-server
beginner ⏱ 10 minutes

Decode and Inspect Kubernetes Docker Secrets

Decode base64-encoded dockerconfigjson secrets to verify registry credentials, troubleshoot ImagePullBackOff errors, and audit pull secret configurations.

secrets base64 troubleshooting debugging
intermediate ⏱ 15 minutes

Troubleshoot CatalogSource and OLM Issues

Debug CatalogSource failures including pod crashes, gRPC errors, stale caches, and operator install problems in OpenShift OLM environments.

catalogsource olm troubleshooting openshift
advanced ⏱ 20 minutes

SR-IOV VF Troubleshooting on Kubernetes

Diagnose and fix SR-IOV Virtual Function issues including VF creation failures, device plugin errors, RDMA problems, and network attachment failures.

sriov troubleshooting networking rdma
advanced ⏱ 15 minutes

Diagnose NVIDIA Memory-Only Kernel Modules on OpenShift

Understand why lsmod shows NVIDIA modules loaded but modinfo fails, and how the GPU Operator's proprietary driver container inserts modules without…

nvidia gpu kernel-modules troubleshooting
advanced ⏱ 30 minutes

Fix NVIDIA Peer Memory Driver Not Detected

Diagnose and resolve the 'NVIDIA peer memory driver not detected' error when running GPU workloads with RDMA on Kubernetes and OpenShift.

nvidia gpu rdma peermem
advanced ⏱ 30 minutes

Troubleshoot nvidia-fs Module Conflict on OpenShift

Diagnose and fix the 'insmod: ERROR: could not insert module nvidia-fs.ko: File exists' error when enabling GPUDirect Storage with the NVIDIA GPU Operator.

nvidia gpu gds nvidia-fs
advanced ⏱ 30 minutes

Debug NCCL Timeouts and Hangs in Kubernetes

Systematically troubleshoot NCCL runs that stall or timeout across multi-GPU and multi-node Kubernetes jobs with step-by-step diagnostic commands.

nccl timeout hang troubleshooting
advanced ⏱ 25 minutes

Diagnose GPU Peer-to-Peer Latency with NCCL Tests

Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.

nccl latency p2p gpu
intermediate ⏱ 15 minutes

Validate GPU and NIC Topology Before NCCL Benchmarks

Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.

nccl topology pci gpu
intermediate ⏱ 15 minutes

Check Bonding and Interface Status for SR-IOV

Inspect bond membership, interface state, and link aggregation to confirm which NICs can be correctly targeted by SR-IOV network policies on Kubernetes.

bonding networking sriov linux
intermediate ⏱ 15 minutes

Identify Mellanox Interface Models from Linux and PCI Data

Map interface names to PCI addresses and Mellanox model generations to build accurate SR-IOV policies and GPU networking configurations on Kubernetes.

mellanox connectx pci sriov
advanced ⏱ 20 minutes

Troubleshoot NVIDIA NIM TensorRT-LLM Initialization Failures

Diagnose and fix common NIM TensorRT-LLM executor failures including DecoderState mismatch, version incompatibilities, and engine build errors.

nvidia-nim tensorrt-llm troubleshooting gpu
advanced ⏱ 30 minutes

Fix 'No Supported NIC Is Selected' in SR-IOV

Diagnose SR-IOV operator webhook rejections by validating node state, label selectors, PF eligibility, and SriovNetworkNodePolicy configuration.

sriov troubleshooting webhook openshift
advanced ⏱ 20 minutes

Troubleshoot nv-ipam 'Pool Not Found' Errors in Multus

Fix nv-ipam IPPool lookup failures in Multus by aligning SriovNetwork, NetworkAttachmentDefinition, and IPPool names and namespaces correctly.

nv-ipam multus sriov troubleshooting
intermediate ⏱ 30 minutes

Validate SR-IOV Operator Health Across Multiple Worker Nodes

Run a full checklist to confirm SR-IOV discovery, VF creation, scheduler resources, and pod attachment on multiple nodes.

sriov validation multinode openshift
intermediate ⏱ 30 minutes

How to Troubleshoot Kubernetes Networking

Debug and resolve Kubernetes networking issues systematically. Learn to diagnose DNS problems, service connectivity, network policies, and CNI issues.

networking troubleshooting dns services
beginner ⏱ 15 minutes

How to Debug ImagePullBackOff Errors

Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Learn to diagnose registry authentication, image tags, and network connectivity issues.

imagepull troubleshooting registry authentication
intermediate ⏱ 15 minutes

How to Debug Kubernetes Node Issues

Diagnose and troubleshoot node problems in Kubernetes clusters. Identify resource pressure, connectivity issues, and component failures.

nodes debugging troubleshooting kubelet
intermediate ⏱ 15 minutes

OOMKilled in Kubernetes: How to Debug and Fix

Fix OOMKilled errors in Kubernetes pods. Learn why containers get OOMKilled (exit code 137), how to set memory limits, debug memory leaks, and prevent OOM.

oomkilled oom memory troubleshooting
intermediate ⏱ 15 minutes

How to Debug Pod Networking Issues

Diagnose and fix Kubernetes networking problems. Troubleshoot connectivity, DNS resolution, service discovery, and network policies with practical tools.

networking debugging troubleshooting connectivity
intermediate ⏱ 15 minutes

How to Debug Pod Scheduling Failures

Troubleshoot pods stuck in Pending state due to scheduling issues. Learn to diagnose resource constraints, node affinity, taints, and topology spread.

scheduling pending troubleshooting resources
intermediate ⏱ 15 minutes

How to Use Ephemeral Containers for Debugging

Debug running pods using ephemeral containers without restarting. Learn kubectl debug techniques for troubleshooting production workloads.

debugging ephemeral kubectl troubleshooting
beginner ⏱ 15 minutes

How to Run Kubernetes in Docker (kind)

Create local Kubernetes clusters using kind (Kubernetes in Docker). Set up multi-node clusters, configure networking, and test applications locally.

kind local-development docker testing
beginner ⏱ 15 minutes

Essential kubectl Commands for Debugging

Master kubectl debugging commands to troubleshoot Kubernetes issues. Learn to inspect pods, view logs, debug networking, and diagnose cluster problems.

kubectl debugging troubleshooting cli
beginner ⏱ 15 minutes

How to Extend kubectl with Plugins

Enhance kubectl with custom plugins using Krew package manager. Discover, install, and create plugins to boost K8s productivity.

kubectl krew plugins cli
intermediate ⏱ 15 minutes

How to Manage Kubernetes Finalizers and Stuck Resources

Understand and manage finalizers for controlled resource deletion. Handle stuck resources and implement custom cleanup logic.

finalizers deletion cleanup stuck-resources
beginner ⏱ 15 minutes

CrashLoopBackOff: How to Fix in Kubernetes

Fix CrashLoopBackOff in Kubernetes pods. Learn why pods crash loop, systematic debugging with kubectl logs and describe, and solutions for common causes.

troubleshooting crashloopbackoff debugging logs
intermediate ⏱ 20 minutes

How to Debug DNS Issues in Kubernetes

Troubleshoot and resolve DNS problems in Kubernetes. Learn to diagnose CoreDNS issues, test resolution, and fix common DNS failures.

dns coredns troubleshooting networking
intermediate ⏱ 15 minutes

Troubleshooting Pending PersistentVolumeClaims

Diagnose and fix PVCs stuck in Pending status. Learn common causes including StorageClass issues, capacity problems, and node affinity conflicts with…

troubleshooting pvc storage pending