
🔧 Troubleshooting

Debug and fix: CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending pods, networking issues, SR-IOV VF troubleshooting, and NFS-oRDMA performance debugging.

132 recipes 🟢 39 beginner 🟡 65 intermediate 🔴 28 advanced
intermediate ⏱ 15 minutes

NCCL Debug Subsystems for GPU Network Troubleshooting

Configure NCCL_DEBUG and NCCL_DEBUG_SUBSYS for targeted logging during multi-node GPU training. Covers INIT, NET, GRAPH subsystems, log

nccl troubleshooting observability gpu
advanced ⏱ 15 minutes

NCCL Network Validation Troubleshooting Checklist

Complete troubleshooting checklist for NCCL multi-node GPU bandwidth validation. Covers SR-IOV VF allocation, /dev/infiniband visibility, RoCE GID

nccl troubleshooting rdma networking
intermediate ⏱ 15 minutes

Kubernetes Ephemeral Containers for Debugging

Debug running pods with Kubernetes ephemeral containers. Attach debug containers without restarting pods, troubleshoot distroless images, inspect network

ephemeral-containers debugging kubectl-debug troubleshooting
intermediate ⏱ 15 minutes

Kubernetes Finalizers Explained and Troubleshooting

Understand Kubernetes finalizers for resource cleanup. How finalizers block deletion, common stuck resource scenarios, manual removal

finalizers resource-lifecycle troubleshooting deletion
beginner ⏱ 15 minutes

Kubernetes ImagePullBackOff Troubleshooting Guide

Debug and fix ImagePullBackOff and ErrImagePull errors in Kubernetes. Resolve authentication failures, registry connectivity, image not found, TLS certificate

imagepullbackoff troubleshooting container-registry authentication
beginner ⏱ 15 minutes

Kubernetes OOMKilled Troubleshooting and Prevention

Debug and prevent OOMKilled container terminations in Kubernetes. Understand memory limits, diagnose memory leaks, configure resource requests, and implement

oomkilled troubleshooting memory resource-limits
intermediate ⏱ 15 minutes

Chaos Mesh Fault Injection on Kubernetes

Deploy Chaos Mesh for chaos engineering on Kubernetes. Covers PodChaos, NetworkChaos, IOChaos, StressChaos experiments, scheduling, RBAC

chaos-engineering chaos-mesh fault-injection resilience
intermediate ⏱ 15 minutes

LitmusChaos Engineering on Kubernetes

Deploy LitmusChaos for resilience testing on Kubernetes. Covers ChaosEngine, ChaosExperiment, ChaosResult CRDs, built-in experiments, GameDay planning, Litmus

chaos-engineering litmus resilience testing
intermediate ⏱ 15 minutes

Ephemeral Containers for Live Debugging

Use kubectl debug with ephemeral containers to troubleshoot running Pods without restarting them. Attach debugging tools to distroless containers, inspect

ephemeral-containers debugging kubectl-debug troubleshooting
beginner ⏱ 15 minutes

OpenShift oc cp File Copy Guide

Use oc cp to copy files and directories between local machine and Pods. Covers tar-based transfer, container selection, large file handling, and comparison

openshift oc-cp file-transfer debugging
beginner ⏱ 15 minutes

OpenShift oc rsync File Transfer

Use oc rsync to copy files between local machine and Pods in OpenShift. Covers upload, download, live sync, filtering, and common patterns for debugging

openshift oc-rsync file-transfer debugging
advanced ⏱ 15 minutes

Thanos Receive OOMKilled CrashLoopBackOff

Debug and fix Thanos Receive StatefulSet OOMKilled CrashLoopBackOff caused by WAL replay exceeding memory limits. Covers ArgoCD conflict resolution, liveness

thanos oom crashloopbackoff statefulset
intermediate ⏱ 15 minutes

Fix Thanos Receive OOMKilled in Run:ai

Troubleshoot and fix Thanos Receive OOMKilled (exit code 137) with 143+ restarts in Run:ai backend on OpenShift. Covers memory tuning, TSDB

thanos runai oomkilled troubleshooting
beginner ⏱ 15 minutes

Kubernetes 1.36 Statusz and Flagz Endpoints

Use /statusz and /flagz debug endpoints in Kubernetes 1.36 control plane components. Inspect runtime status and effective flag values without log parsing.

kubernetes-1.36 debugging control-plane operations
beginner ⏱ 8 minutes

kubectl describe: Read Pod Events Guide

Use kubectl describe pod to read events, conditions, and container states. Diagnose scheduling failures, image pulls, crashes, and probe failures.

kubectl troubleshooting events cka
beginner ⏱ 8 minutes

kubectl exec: Run Commands in Pods

Use kubectl exec to run commands inside running pods. Interactive shell, multi-container pods, debugging techniques, and security considerations.

kubectl troubleshooting debugging cka
intermediate ⏱ 10 minutes

K8s CoreDNS: Troubleshoot DNS Issues

Troubleshoot Kubernetes CoreDNS resolution failures. Debug dns pods, ndots settings, search domains, custom Corefile, and forward plugin configuration.

coredns dns troubleshooting networking
beginner ⏱ 8 minutes

Fix CreateContainerError in Kubernetes

Troubleshoot Kubernetes CreateContainerError with step-by-step debugging. ConfigMap mounts, Secret references, volume permissions, and container runtime issues.

troubleshooting containers errors debugging
beginner ⏱ 8 minutes

Troubleshoot ImagePullBackOff and ErrImagePull

Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Private registry auth, image pull secrets, tag verification, and network connectivity fixes.

troubleshooting image-pull containers registry
intermediate ⏱ 10 minutes

kubectl debug: Advanced Pod Debugging

Use kubectl debug for ephemeral containers, node debugging, and pod copy debugging. Debug distroless images, share process namespaces, and node-level access.

kubectl debugging ephemeral-containers troubleshooting
intermediate ⏱ 12 minutes

K8s Network Debugging: Connectivity Guide

Debug Kubernetes network issues with tcpdump, netshoot, and connectivity tests. Pod-to-pod, pod-to-service, DNS, and external connectivity troubleshooting.

networking troubleshooting debugging dns
beginner ⏱ 5 minutes

Fix Untolerated Taint node-role master

Fix 'node untolerated taint node-role.kubernetes.io/master' scheduling error. Remove or tolerate control plane taints to schedule pods on master nodes.

taints tolerations scheduling control-plane
intermediate ⏱ 15 minutes

Journald Verify Config Kubernetes Nodes

Validate journald configuration on Kubernetes nodes. Fix journal corruption, tune storage limits, configure persistence, and troubleshoot systemd-journald.

journald systemd logging troubleshooting
intermediate ⏱ 15 minutes

NXDOMAIN DNS Troubleshooting Kubernetes

Fix NXDOMAIN errors in Kubernetes. Debug CoreDNS failures, ndots configuration, search domain issues, and external DNS lookup problems.

dns nxdomain coredns troubleshooting
advanced ⏱ 20 minutes

oc-mirror Troubleshooting Disconnected

Troubleshoot oc-mirror failures in disconnected OpenShift. Fix archive corruption, registry auth errors, v1/v2 mismatches, and delta mirror issues.

oc-mirror disconnected openshift troubleshooting
advanced ⏱ 20 minutes

OpenShift Cluster Operator Upgrade Debug

Debug degraded cluster operators during OpenShift upgrades. Identify stuck operators, decode status conditions, and unblock stalled rollouts.

openshift cluster-operators upgrades troubleshooting
advanced ⏱ 25 minutes

OpenShift MCP Validation Broken Rules

Validate MachineConfigPool rules before applying in OpenShift. Detect broken MachineConfigs, degraded MCPs, and implement pre-flight checks.

openshift machineconfig mcp validation
intermediate ⏱ 15 minutes

SELinux SSH Login Failure Troubleshoot

Fix SSH login failures caused by SELinux enforcement. Diagnose AVC denials, restore file labels, fix custom SSH ports, and resolve PAM denials.

selinux ssh troubleshooting security
intermediate ⏱ 15 minutes

Cilium Debug Pod Troubleshooting

Debug Kubernetes networking with Cilium debug pods and containers. cilium-dbg, netshoot, hubble observe, and endpoint connectivity troubleshooting.

cilium debug netshoot hubble
intermediate ⏱ 15 minutes

Fix CUDA Out of Memory K8s Pods

Troubleshoot CUDA out of memory errors in Kubernetes GPU pods. Memory fragmentation, batch size tuning, gradient checkpointing, and resource limits.

cuda oom gpu-memory pytorch
intermediate ⏱ 15 minutes

NVIDIA GPU Operator Troubleshooting

Fix common NVIDIA GPU Operator issues on Kubernetes. Driver pod crashes, toolkit failures, device plugin not ready, and validation pod errors.

gpu-operator nvidia driver toolkit
advanced ⏱ 15 minutes

Fix etcd Leader Election Timeout

Troubleshoot etcd leader election timeouts in K8s. Disk latency, network partition, heartbeat interval, and recovery steps.

etcd leader-election timeout cluster
intermediate ⏱ 15 minutes

Fix Certificate Errors Kubernetes

Troubleshoot TLS certificate errors in K8s. x509 unknown authority, expired certs, cert-manager issues, and custom CA bundles.

certificate tls x509 cert-manager
intermediate ⏱ 15 minutes

Fix DNS Resolution Issues in Kubernetes

Troubleshoot Kubernetes DNS resolution failures. ndots, search domains, CoreDNS CrashLoop, and pod-level DNS debugging steps.

dns resolution coredns ndots
advanced ⏱ 15 minutes

Fix Pod cgroup Memory Errors K8s

Fix cgroup memory limit and OOM errors in Kubernetes pods. Covers cgroup v2 migration, memory.max, swap settings, and kernel tuning for stable workloads.

cgroup memory oom kernel
intermediate ⏱ 15 minutes

Fix Service Not Reachable in Kubernetes

Debug Kubernetes Service connectivity issues. Endpoint selection, kube-proxy rules, DNS resolution, and NetworkPolicy blocks.

service connectivity endpoints kube-proxy
beginner ⏱ 15 minutes

Fix 502 Bad Gateway Kubernetes Ingress

Fix 502 Bad Gateway errors in Kubernetes Ingress. Backend not ready, timeout tuning, readiness probes, and NGINX ingress controller troubleshooting.

502-bad-gateway ingress troubleshooting nginx
beginner ⏱ 15 minutes

kubectl exec Into Pods: Complete Guide

Use kubectl exec to debug running pods. Interactive shells, non-interactive commands, multi-container pods, and ephemeral debug containers.

kubectl exec debug shell
beginner ⏱ 15 minutes

Fix Namespace Stuck Terminating K8s

Fix Kubernetes namespaces stuck in Terminating state. Finalizer removal, API resource cleanup, and force deletion of stuck namespaces.

namespace terminating finalizers stuck
intermediate ⏱ 15 minutes

Fix Node NotReady Status in Kubernetes

Troubleshoot Kubernetes nodes in NotReady state. Kubelet issues, disk pressure, network problems, certificate expiration, and recovery procedures.

node-notready kubelet troubleshooting cluster-health
beginner ⏱ 10 minutes

Fix node-role.kubernetes.io/master

Remove the node-role.kubernetes.io/master taint to schedule pods on control plane nodes. Single-node clusters, tolerations, and untolerated taint fix.

taint master control-plane scheduling
beginner ⏱ 15 minutes

Fix OOMKilled Kubernetes Guide

Troubleshoot and fix OOMKilled errors in Kubernetes. Memory limit tuning, Java heap sizing, memory leak detection, and VPA recommendations.

oomkilled memory troubleshooting resources
beginner ⏱ 15 minutes

Fix Pending Pods Kubernetes Guide

Troubleshoot Kubernetes pods stuck in Pending state. Insufficient resources, node selector mismatch, PVC binding, taints, and scheduling failures.

pending scheduling troubleshooting resources
intermediate ⏱ 15 minutes

Fix Pod Eviction Kubernetes Guide

Troubleshoot Kubernetes pod evictions. DiskPressure, MemoryPressure, ephemeral storage limits, and eviction thresholds configuration.

eviction disk-pressure memory-pressure resources
beginner ⏱ 15 minutes

Pod Lifecycle and States Guide

Understand Kubernetes pod lifecycle phases and container states. Pending, Running, Succeeded, Failed, Unknown, and troubleshooting stuck pods.

pod-lifecycle pod-states pending running
intermediate ⏱ 15 minutes

LitmusChaos Chaos Engineering K8s

Run chaos experiments on Kubernetes with LitmusChaos. Pod kill, network latency, disk fill, and CPU stress experiments for resilience testing.

chaos-engineering litmus resilience testing
beginner ⏱ 10 minutes

Debug Containers and Ephemeral Pods

Use kubectl debug with ephemeral containers to troubleshoot running pods without restart. Debug distroless images, node debugging.

debug ephemeral-containers troubleshooting kubectl-debug
intermediate ⏱ 15 minutes

DNS Debugging Kubernetes Guide

Debug Kubernetes DNS issues systematically. CoreDNS troubleshooting, ndots configuration, search domains, and resolving slow DNS lookups.

dns coredns debugging ndots
intermediate ⏱ 15 minutes

Network Debugging Tools Kubernetes

Debug Kubernetes networking with tcpdump, netshoot, iptables tracing, conntrack inspection, and DNS resolution testing techniques.

networking tcpdump debug conntrack
intermediate ⏱ 15 minutes

Fix RBAC Permission Errors K8s

Debug Kubernetes RBAC permission errors. kubectl auth can-i, impersonation testing, ClusterRole aggregation, and common permission mistakes.

rbac troubleshooting permissions authorization
intermediate ⏱ 15 minutes

Fix 502 Bad Gateway in Kubernetes

Troubleshoot and fix 502 Bad Gateway errors in Kubernetes. Causes include pod readiness timing, ingress misconfiguration, upstream timeouts.

502 bad-gateway ingress troubleshooting
beginner ⏱ 5 minutes

kubectl cp Copy Files to and from Pods

Copy files between local machine and Kubernetes pods with kubectl cp. Supports containers, namespaces, tar-based transfer, and common troubleshooting.

kubectl cp copy files
beginner ⏱ 10 minutes

kubectl logs View Pod Logs Guide

View and stream Kubernetes pod logs with kubectl logs. Multi-container pods, previous crashes, label selectors, timestamps, and log aggregation patterns.

kubectl logs debugging troubleshooting
beginner ⏱ 10 minutes

Check Kubernetes Node Status with kubectl

Check and troubleshoot Kubernetes node status with kubectl. Node conditions (Ready, MemoryPressure, DiskPressure), NotReady debugging, and capacity monitoring.

node status kubectl troubleshooting
intermediate ⏱ 20 minutes

Troubleshooting Pods with GPU Devices

Fix GPU device issues in Kubernetes pods. Troubleshoot device plugin errors, DRA claims, CUDA failures, driver mismatches.

gpu-troubleshooting device-plugin nvidia dra
intermediate ⏱ 15 minutes

Debug Kubernetes Pods: Complete Guide

Debug Kubernetes pods with kubectl debug, ephemeral containers, and netshoot. Troubleshoot distroless images, network issues, and crashed pods step by step.

debug kubectl-debug ephemeral-containers netshoot
intermediate ⏱ 15 minutes

Kubernetes Troubleshooting Flowchart

Systematic Kubernetes troubleshooting guide with flowcharts. Debug pods, services, networking, storage, and node issues step by step with kubectl commands.

troubleshooting debugging flowchart kubectl
beginner ⏱ 15 minutes

kubectl exec: Run Commands Inside K8s Pods

Use kubectl exec to run commands inside Kubernetes pods. Covers interactive sessions, multi-container pods, and ephemeral container debugging.

kubectl-exec debugging shell troubleshooting
beginner ⏱ 15 minutes

Fix ImagePullBackOff in Kubernetes

Debug and fix ImagePullBackOff errors in Kubernetes. Covers wrong image names, private registry auth, rate limits, and network connectivity issues.

imagepullbackoff troubleshooting registry pull-secret
beginner ⏱ 15 minutes

Debug and Fix OOMKilled Errors in Kubernetes

Debug and fix OOMKilled errors in Kubernetes. Find memory leaks, set correct limits, use VPA for right-sizing, and prevent container OOM kills.

oomkilled memory out-of-memory troubleshooting
intermediate ⏱ 15 minutes

Kubernetes Pod Eviction: Causes and Prevention

Understand why Kubernetes evicts pods and how to prevent it. Covers resource pressure, priority classes, PDBs, and eviction policies.

eviction resource-pressure priority-class qos
advanced ⏱ 15 minutes

Fix API Server Timeout and Overload

Debug kubectl timeouts, API server overload, and connection refused errors. Covers etcd latency, webhook timeouts, and rate limiting.

api-server timeout connectivity performance
intermediate ⏱ 15 minutes

Fix CoreDNS Resolution Failures in Kubernetes

Debug DNS resolution failures in Kubernetes pods. Covers CoreDNS crashes, NXDOMAIN errors, ndots configuration, and upstream DNS timeouts.

coredns dns networking resolution
beginner ⏱ 15 minutes

How to Fix CrashLoopBackOff in Kubernetes

Fix CrashLoopBackOff in Kubernetes with step-by-step troubleshooting. Debug OOMKilled, failed probes, missing configs, and image errors causing pod crash loops.

crashloopbackoff pods debugging troubleshooting
advanced ⏱ 15 minutes

Fix etcd High Latency and Slow API Server

Debug etcd performance issues causing slow kubectl responses and API server timeouts. Covers disk I/O, compaction, defragmentation, and leader elections.

etcd performance api-server latency
advanced ⏱ 15 minutes

Fix fio libaio Silent Exit on OpenShift cru...

Debug fio instantly exiting with no output on crun-based OpenShift nodes. The root cause is seccomp blocking libaio syscalls; fix with psync or unconfined.

fio libaio seccomp crun
beginner ⏱ 15 minutes

ImagePullBackOff Troubleshooting Guide

Debug and resolve ImagePullBackOff errors including auth failures, wrong tags, private registry access, and rate limiting from Docker Hub and Quay.

imagepullbackoff registry pull-secret troubleshooting
intermediate ⏱ 15 minutes

Fix Kubernetes Job Failures and Retries

Debug Kubernetes Jobs stuck in backoff or hitting retry limits. Covers backoffLimit, activeDeadlineSeconds, and CronJob overlap.

jobs cronjob backoff retry
intermediate ⏱ 15 minutes

Fix Kubelet NotReady and Node Pressure Issues

Debug kubelet NotReady status, node pressure conditions, and eviction issues. Covers disk pressure, memory pressure, PID pressure, and network not ready.

kubelet node notready eviction
beginner ⏱ 15 minutes

Kubernetes Debugging Toolkit and Commands

Essential kubectl debugging commands and tools for Kubernetes troubleshooting. Covers ephemeral containers, debug pods, network debugging, and log analysis.

debugging kubectl troubleshooting ephemeral-containers
intermediate ⏱ 15 minutes

Fix OOMKilled Containers in Kubernetes

Debug and resolve OOMKilled container terminations. Understand memory limits, kernel OOM killer behavior, and right-sizing strategies for Kubernetes pods.

oomkilled memory resources troubleshooting
advanced ⏱ 15 minutes

OpenShift crun vs runc Runtime Differences

Understand why pods behave differently on GPU vs CPU nodes in OpenShift. Compare crun and runc container runtimes, seccomp profiles, and syscall filtering.

crun runc openshift container-runtime
intermediate ⏱ 15 minutes

Fix Unexpected Pod Evictions in Kubernetes

Debug pods being evicted due to node pressure, preemption, or taint-based eviction. Understand eviction priorities, QoS classes, and PodDisruptionBudgets.

eviction preemption pdb node-pressure
beginner ⏱ 15 minutes

Fix Pod Stuck in Pending State

Debug pods stuck in Pending status. Covers insufficient resources, node affinity mismatches, taint/toleration issues, and PVC binding failures.

pending scheduling resources troubleshooting
intermediate ⏱ 15 minutes

Fix Podman TLS x509 Behind Corporate Proxy

Resolve podman pull x509 certificate signed by unknown authority errors caused by corporate TLS-intercepting proxies. Extract and install the proxy CA.

podman tls x509 proxy
intermediate ⏱ 15 minutes

Fix PVC Stuck in Pending State

Debug PersistentVolumeClaims stuck in Pending status. Covers storage class issues, provisioner failures, capacity problems, and access mode mismatches.

pvc storage persistent-volume troubleshooting
advanced ⏱ 15 minutes

Fix Service Mesh Sidecar Injection Failures

Debug Istio and Envoy sidecar injection issues. Covers missing sidecars, port conflicts, init container failures, and mTLS connection errors.

istio envoy sidecar service-mesh
intermediate ⏱ 10 minutes

OpenShift oc debug Mount Limitation

Why NFS and filesystem mounts via oc debug node disappear after the debug pod exits. Understand the container namespace isolation and use MachineConfig instead.

openshift oc-debug mount troubleshooting
beginner ⏱ 15 minutes

Fix the Kubernetes ConfigMap Too Large Error

Resolve the 1MB ConfigMap size limit error. Split configs, use Secrets for binary data, mount volumes, or use external stores.

configmap size-limit configuration troubleshooting
intermediate ⏱ 15 minutes

Debug CRI-O Container Runtime Errors

Troubleshoot CRI-O issues on OpenShift nodes. Fix image pull failures, container start errors, storage driver problems, and CNI networking plugin failures.

cri-o container-runtime openshift troubleshooting
advanced ⏱ 15 minutes

Debug Degraded MachineConfigPool Nodes

Fix nodes stuck Degraded after MachineConfig updates. Check MCD logs, on-disk validation, and recovery for degraded workers.

openshift machineconfig degraded mcd
intermediate ⏱ 15 minutes

Debug Kubernetes Pod Eviction Reasons

Investigate why pods were evicted from Kubernetes nodes. Check node pressure conditions, resource limits, priority classes, and preemption events.

eviction node-pressure resources troubleshooting
intermediate ⏱ 15 minutes

Debug DNS Resolution Failures in Pods

Troubleshoot pods unable to resolve DNS names. Check CoreDNS health, ndots configuration, search domains, and NetworkPolicies blocking UDP port 53 DNS traffic.

dns coredns resolution networking
advanced ⏱ 15 minutes

Debug etcd Performance Issues in Kubernetes

Diagnose slow etcd causing API latency and leader election storms. Check disk IOPS, compaction, defrag, and network latency.

etcd performance latency disk-io
advanced ⏱ 15 minutes

Fix Expired Certificates in Kubernetes

Renew expired certificates causing API server failures and kubelet disconnections. Manual and automatic renewal for kubeadm and OpenShift.

certificates tls expiration kubeadm
intermediate ⏱ 15 minutes

Fix OpenShift ImageStream Import Errors

Debug ImageStream import failures in OpenShift. Resolve DNS errors, auth issues, TLS problems, and registry rate limiting.

openshift imagestream import registry
advanced ⏱ 25 minutes

ITMS Race Condition with Ingress Controllers

Resolve the ITMS race condition where ImageTagMirrorSet rollouts deadlock with hostNetwork ingress controllers during MCO drain.

openshift itms ingress machineconfig
advanced ⏱ 15 minutes

MCP Drain Blocked by PDB: Workaround

Resolve OpenShift MachineConfigPool drain failures caused by PodDisruptionBudget violations. Scale down and restore after update.

openshift pdb drain machineconfig
advanced ⏱ 20 minutes

Fix Stale MachineConfigPool Updates

Debug and resolve stale OpenShift MachineConfigPool updates. Identify blocked nodes, check MachineConfigDaemon logs, and unblock stuck MCP rollouts.

openshift machineconfig mcp troubleshooting
beginner ⏱ 15 minutes

Fix Namespace Stuck in Terminating

Remove Kubernetes namespaces stuck in Terminating state. Identify blocking finalizers, orphaned API resources, and safely force namespace cleanup procedures.

namespace terminating finalizer cleanup
intermediate ⏱ 15 minutes

Debug NetworkPolicy Connectivity Issues

Troubleshoot pods unable to communicate despite correct Services. Verify NetworkPolicy rules, label selectors, and default deny.

networkpolicy connectivity debugging firewall
advanced ⏱ 15 minutes

Node Drain Blocked by hostNetwork Port Conf...

Debug and fix OpenShift node drains that fail because hostNetwork pods cannot schedule replacements due to port exhaustion across the cluster.

openshift hostnetwork drain scheduling
intermediate ⏱ 15 minutes

Debug Node NotReady Status in Kubernetes

Diagnose Kubernetes nodes stuck in NotReady state. Check kubelet logs, container runtime, network, disk pressure, and certificates.

node not-ready kubelet troubleshooting
intermediate ⏱ 20 minutes

OpenShift Ingress Router Troubleshooting

Debug OpenShift HAProxy router issues: pods stuck Pending, hostPort conflicts, PDB violations during maintenance, and custom router deployment scaling problems.

openshift ingress haproxy router
intermediate ⏱ 15 minutes

Debug MachineConfigDaemon Logs

Read and interpret OpenShift MachineConfigDaemon logs to diagnose node update failures. Common error patterns, drain issues, and config application problems.

openshift machineconfig mcd debugging
intermediate ⏱ 15 minutes

Debug OpenShift OAuth Login Failures

Troubleshoot OpenShift console and CLI login failures. Check OAuth server pods, identity provider config, and expired tokens.

openshift oauth authentication login
intermediate ⏱ 15 minutes

Fix Stuck OLM Operator Subscriptions

Debug Operator Lifecycle Manager subscriptions stuck in pending or failed state. Resolve catalog source issues, approval policies, and CSV dependency conflicts.

openshift olm operator subscription
intermediate ⏱ 15 minutes

PDB Allowed Disruptions Zero: Debugging

Debug PodDisruptionBudgets stuck at zero allowed disruptions. Understand minAvailable vs maxUnavailable, fix eviction failures, and plan for maintenance.

pdb disruption-budget eviction maintenance
intermediate ⏱ 15 minutes

Fix PV Stuck in Terminating State

Resolve PVs and PVCs stuck in Terminating status. Remove finalizers safely, check volume detachment, and handle storage issues.

pv pvc terminating finalizer
beginner ⏱ 15 minutes

Fix ResourceQuota Exceeded Errors

Debug resource quota violations preventing pod scheduling. Understand LimitRange defaults, ResourceQuota, and namespace management.

resourcequota limitrange scheduling resources
beginner ⏱ 15 minutes

Debug Service with No Ready Endpoints

Troubleshoot Services showing zero endpoints. Verify label selectors, readiness probes, pod status, and port configuration.

service endpoints readiness networking
beginner ⏱ 15 minutes

Fix Node Untolerated Taint Scheduling Errors

Fix node untolerated taint errors causing pods stuck in Pending. NoSchedule, PreferNoSchedule, NoExecute effects, and toleration syntax guide.

taints tolerations scheduling nodes
intermediate ⏱ 15 minutes

Fix Admission Webhook Timeout Errors

Debug admission webhook failures blocking pod creation. Identify failing webhooks, check timeouts, and set failurePolicy.

webhook admission timeout api-server
beginner ⏱ 10 minutes

Decode and Inspect Kubernetes Docker Secrets

Decode base64-encoded dockerconfigjson secrets to verify registry credentials, troubleshoot ImagePullBackOff errors, and audit pull secret configurations.

secrets base64 troubleshooting debugging
intermediate ⏱ 15 minutes

Troubleshoot CatalogSource and OLM Issues

Debug CatalogSource failures including pod crashes, gRPC errors, stale caches, and operator install problems in OpenShift OLM environments.

catalogsource olm troubleshooting openshift
advanced ⏱ 20 minutes

SR-IOV VF Troubleshooting on Kubernetes

Diagnose and fix SR-IOV Virtual Function issues including VF creation failures, device plugin errors, RDMA problems, and network attachment failures.

sriov troubleshooting networking rdma
advanced ⏱ 15 minutes

Diagnose NVIDIA Memory-Only Kernel Modules ...

Understand why lsmod shows NVIDIA modules loaded but modinfo fails, and how the GPU Operator's proprietary driver container inserts modules without.

nvidia gpu kernel-modules troubleshooting
advanced ⏱ 30 minutes

Fix NVIDIA Peer Memory Driver Not Detected

Diagnose and resolve the 'NVIDIA peer memory driver not detected' error when running GPU workloads with RDMA on Kubernetes and OpenShift.

nvidia gpu rdma peermem
advanced ⏱ 30 minutes

Fix nvidia-fs Module Conflict on OpenShift

Diagnose and fix the 'insmod: ERROR: could not insert module nvidia-fs.ko: File exists' error when enabling GPUDirect Storage with the NVIDIA GPU Operator.

nvidia gpu gds nvidia-fs
advanced ⏱ 30 minutes

Debug NCCL Timeouts and Hangs in Kubernetes

Systematically troubleshoot NCCL runs that stall or timeout across multi-GPU and multi-node Kubernetes jobs with step-by-step diagnostic commands.

nccl timeout hang troubleshooting
advanced ⏱ 25 minutes

Diagnose GPU Peer-to-Peer Latency NCCL Tests

Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.

nccl latency p2p gpu
intermediate ⏱ 15 minutes

Validate GPU & NIC Topology Before NCCL Ben...

Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.

nccl topology pci gpu
intermediate ⏱ 15 minutes

Check Bonding and Interface Status for SR-IOV

Inspect bond membership, interface state, and link aggregation to confirm which NICs can be correctly targeted by SR-IOV network policies on Kubernetes.

bonding networking sriov linux
intermediate ⏱ 15 minutes

Identify Mellanox Interface Models from Lin...

Map interface names to PCI addresses and Mellanox model generations to build accurate SR-IOV policies and GPU networking configurations on Kubernetes.

mellanox connectx pci sriov
advanced ⏱ 20 minutes

Fix NVIDIA NIM TensorRT-LLM Initialization ...

Diagnose and fix common NIM TensorRT-LLM executor failures including DecoderState mismatch, version incompatibilities, and engine build errors.

nvidia-nim tensorrt-llm troubleshooting gpu
advanced ⏱ 30 minutes

Fix 'No Supported NIC Is Selected' in SR-IOV

Diagnose SR-IOV operator webhook rejections by validating node state, label selectors, PF eligibility, and SriovNetworkNodePolicy configuration.

sriov troubleshooting webhook openshift
advanced ⏱ 20 minutes

Fix nv-ipam 'Pool Not Found' Errors in Multus

Fix nv-ipam IPPool lookup failures in Multus by aligning SriovNetwork, NetworkAttachmentDefinition, and IPPool names and namespaces correctly.

nv-ipam multus sriov troubleshooting
intermediate ⏱ 30 minutes

Validate SR-IOV Operator Health Across Mult...

Run a full checklist to confirm SR-IOV discovery, VF creation, scheduler resources, and pod attachment on multiple nodes.

sriov validation multinode openshift
intermediate ⏱ 30 minutes

How to Troubleshoot Kubernetes Networking

Debug and resolve Kubernetes networking issues systematically. Learn to diagnose DNS problems, service connectivity, network policies, and CNI issues.

networking troubleshooting dns services
beginner ⏱ 15 minutes

How to Debug ImagePullBackOff Errors

Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Learn to diagnose registry authentication, image tags, and network connectivity issues.

imagepull troubleshooting registry authentication
intermediate ⏱ 15 minutes

How to Debug Kubernetes Node Issues

Diagnose and troubleshoot node problems in Kubernetes clusters. Identify resource pressure, connectivity issues, and component failures.

nodes debugging troubleshooting kubelet
intermediate ⏱ 15 minutes

Fix OOMKilled in Kubernetes Pods

Fix OOMKilled errors in Kubernetes pods (exit code 137). Debug memory leaks, set correct memory limits, and prevent OOM kills in containers.

oomkilled oom memory troubleshooting
intermediate ⏱ 15 minutes

Debug Pod Scheduling Failures in K8s

Fix pods stuck in Pending from scheduling failures. Diagnose resource constraints, node affinity, taints, tolerations, and topology spread conflicts.

scheduling pending troubleshooting resources
intermediate ⏱ 15 minutes

How to Debug Pod Networking Issues

Diagnose and fix Kubernetes networking problems. Troubleshoot connectivity, DNS resolution, service discovery, and network policies with practical tools.

networking debugging troubleshooting connectivity
intermediate ⏱ 15 minutes

Ephemeral Containers: Debug Running Pods

Debug running pods with ephemeral containers using kubectl debug. Attach debug containers without restart for production troubleshooting on Kubernetes.

debugging ephemeral kubectl troubleshooting
beginner ⏱ 15 minutes

How to Run Kubernetes in Docker (kind)

Create local Kubernetes clusters using kind (Kubernetes in Docker). Set up multi-node clusters, configure networking, and test applications locally.

kind local-development docker testing
beginner ⏱ 15 minutes

Essential kubectl Commands for Debugging

Master kubectl debugging commands to troubleshoot Kubernetes issues. Learn to inspect pods, view logs, debug networking, and diagnose cluster problems.

kubectl debugging troubleshooting cli
beginner ⏱ 15 minutes

How to Extend kubectl with Plugins

Enhance kubectl with custom plugins using Krew package manager. Discover, install, and create plugins to boost K8s productivity.

kubectl krew plugins cli
intermediate ⏱ 15 minutes

Fix K8s Stuck Resources and Finalizers

Fix Kubernetes resources stuck in Terminating state by managing finalizers. Remove stuck namespaces, PVs, and CRDs with force-delete procedures.

finalizers deletion cleanup stuck-resources
beginner ⏱ 15 minutes

CrashLoopBackOff: How to Fix in Kubernetes

Fix CrashLoopBackOff in Kubernetes pods. Learn why pods crash loop, systematic debugging with kubectl logs and describe, and solutions for common causes.

troubleshooting crashloopbackoff debugging logs
intermediate ⏱ 20 minutes

How to Debug DNS Issues in Kubernetes

Troubleshoot and resolve DNS problems in Kubernetes. Learn to diagnose CoreDNS issues, test resolution, and fix common DNS failures.

dns coredns troubleshooting networking
intermediate ⏱ 15 minutes

Fix Pending PVC Status in Kubernetes

Fix PersistentVolumeClaims stuck in Pending status. Diagnose StorageClass issues, capacity problems, node affinity conflicts, and provisioner failures.

troubleshooting pvc storage pending