Troubleshooting
Debug and fix: CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending pods, networking issues, SR-IOV VF troubleshooting, and NFS over RDMA (NFSoRDMA) performance debugging.
kubectl exec: Run Commands Inside Kubernetes Pods
Use kubectl exec to run commands and open shells inside Kubernetes pods. Covers interactive sessions, multi-container pods, and debugging with ephemeral containers.
Fix ImagePullBackOff in Kubernetes
Debug and fix ImagePullBackOff errors in Kubernetes. Covers wrong image names, private registry auth, rate limits, and network connectivity issues.
Fix OOMKilled in Kubernetes Pods
Debug and fix OOMKilled errors in Kubernetes. Find memory leaks, set correct limits, use VPA for right-sizing, and prevent container OOM kills.
Kubernetes Pod Eviction: Causes and Prevention
Understand why Kubernetes evicts pods and how to prevent it. Covers resource pressure, priority classes, PDBs, and eviction policies.
Fix API Server Timeout and Overload
Debug kubectl timeouts, API server overload, and connection refused errors. Covers etcd latency, webhook timeouts, and rate limiting.
Fix CoreDNS Resolution Failures in Kubernetes
Debug DNS resolution failures in Kubernetes pods. Covers CoreDNS crashes, NXDOMAIN errors, ndots configuration, and upstream DNS timeouts.
CrashLoopBackOff Fix: Kubernetes Troubleshooting
Fix CrashLoopBackOff in Kubernetes step by step. Debug OOMKilled, missing configs, failed health probes, and image errors causing pod crash loops.
Fix etcd High Latency and Slow API Server
Debug etcd performance issues causing slow kubectl responses and API server timeouts. Covers disk I/O, compaction, defragmentation, and leader elections.
Fix fio libaio Silent Exit on OpenShift crun Nodes
Debug fio instantly exiting with no output on crun-based OpenShift nodes. The root cause is seccomp blocking libaio syscalls; fix with psync or unconfined.
Fix ImagePullBackOff in Kubernetes
Debug and resolve ImagePullBackOff errors including auth failures, wrong tags, private registry access, and rate limiting from Docker Hub and Quay.
Fix Kubernetes Job Failures and Retries
Debug Kubernetes Jobs stuck in backoff or hitting retry limits. Covers backoffLimit, activeDeadlineSeconds, and CronJob overlap.
Fix Kubelet NotReady and Node Pressure Issues
Debug kubelet NotReady status, node pressure conditions, and eviction issues. Covers disk pressure, memory pressure, PID pressure, and network not ready.
Kubernetes Debugging Toolkit and Commands
Essential kubectl debugging commands and tools for Kubernetes troubleshooting. Covers ephemeral containers, debug pods, network debugging, and log analysis.
Fix OOMKilled Containers in Kubernetes
Debug and resolve OOMKilled container terminations. Understand memory limits, kernel OOM killer behavior, and right-sizing strategies for Kubernetes pods.
OpenShift crun vs runc Runtime Differences
Understand why pods behave differently on GPU vs CPU nodes in OpenShift. Compare crun and runc container runtimes, seccomp profiles, and syscall filtering.
Fix Unexpected Pod Evictions in Kubernetes
Debug pods being evicted due to node pressure, preemption, or taint-based eviction. Understand eviction priorities, QoS classes, and PodDisruptionBudgets.
Fix Pod Stuck in Pending State
Debug pods stuck in Pending status. Covers insufficient resources, node affinity mismatches, taint/toleration issues, and PVC binding failures.
Fix Podman TLS x509 Certificate Errors Behind Corporate Proxy
Resolve podman pull x509 certificate signed by unknown authority errors caused by corporate TLS-intercepting proxies. Extract and install the proxy CA.
Fix PVC Stuck in Pending State
Debug PersistentVolumeClaims stuck in Pending status. Covers storage class issues, provisioner failures, capacity problems, and access mode mismatches.
Fix Service Mesh Sidecar Injection Failures
Debug Istio and Envoy sidecar injection issues. Covers missing sidecars, port conflicts, init container failures, and mTLS connection errors.
OpenShift oc debug Mount Limitation
Why NFS and other filesystem mounts made via oc debug node disappear after the debug pod exits. Understand container namespace isolation and use MachineConfig instead.
Fix ConfigMap Too Large Error
Resolve the 1 MiB ConfigMap size limit error. Split configs, use Secrets for binary data, mount volumes, or use external stores.
Debug CRI-O Container Runtime Errors
Troubleshoot CRI-O issues on OpenShift nodes. Fix image pull failures, container start errors, storage driver problems, and CNI networking plugin failures.
Debug MCP Degraded Nodes
Fix nodes stuck Degraded after MachineConfig updates. Check MCD logs, on-disk validation, and recovery for degraded workers.
Debug Pod Eviction Reasons
Investigate why pods were evicted. Check node pressure, resource limits, priority classes, and preemption events.
Debug DNS Resolution Failures in Pods
Troubleshoot pods unable to resolve DNS names. Check CoreDNS health, ndots configuration, search domains, and NetworkPolicies blocking UDP port 53 DNS traffic.
Debug etcd Performance Issues
Diagnose slow etcd causing API latency and leader election storms. Check disk IOPS, compaction, defrag, and network latency.
Fix Expired Certificates in Kubernetes
Renew expired certificates causing API server failures and kubelet disconnections. Manual and automatic renewal for kubeadm and OpenShift.
Fix OpenShift ImageStream Import Errors
Debug ImageStream import failures in OpenShift. Resolve DNS errors, auth issues, TLS problems, and registry rate limiting.
ITMS Race Condition with Ingress Controllers
Resolve the ITMS race condition where ImageTagMirrorSet rollouts deadlock with hostNetwork ingress controllers during MCO drain.
Fix Stale MachineConfigPool Updates
Debug and resolve stale OpenShift MachineConfigPool updates. Identify blocked nodes, check MachineConfigDaemon logs, and unblock stuck MCP rollouts.
MCP Drain Blocked by PDB: Workaround
Resolve OpenShift MachineConfigPool drain failures caused by PodDisruptionBudget violations. Scale down and restore after update.
Fix Namespace Stuck in Terminating
Remove Kubernetes namespaces stuck in Terminating state. Identify blocking finalizers, orphaned API resources, and safely force namespace cleanup procedures.
Debug NetworkPolicy Connectivity Issues
Troubleshoot pods unable to communicate despite correct Services. Verify NetworkPolicy rules, label selectors, and default deny.
Node Drain Blocked by hostNetwork Port Conflicts
Debug and fix OpenShift node drains that fail because hostNetwork pods cannot schedule replacements due to port exhaustion across the cluster.
Debug Node NotReady Status
Diagnose Kubernetes nodes stuck in NotReady state. Check kubelet logs, container runtime, network, disk pressure, and certificates.
OpenShift Ingress Router Troubleshooting
Debug OpenShift HAProxy router issues: pods stuck Pending, hostPort conflicts, PDB violations during maintenance, and custom router deployment scaling problems.
Debug MachineConfigDaemon Logs
Read and interpret OpenShift MachineConfigDaemon logs to diagnose node update failures. Common error patterns, drain issues, and config application problems.
Debug OpenShift OAuth Login Failures
Troubleshoot OpenShift console and CLI login failures. Check OAuth server pods, identity provider config, and expired tokens.
Fix Stuck OLM Operator Subscriptions
Debug Operator Lifecycle Manager subscriptions stuck in pending or failed state. Resolve catalog source issues, approval policies, and CSV dependency conflicts.
Fix PV Stuck in Terminating State
Resolve PVs and PVCs stuck in Terminating status. Remove finalizers safely, check volume detachment, and handle storage issues.
PDB Allowed Disruptions Zero: Debugging
Debug PodDisruptionBudgets stuck at zero allowed disruptions. Understand minAvailable vs maxUnavailable, fix eviction failures, and plan for maintenance.
Fix ResourceQuota Exceeded Errors
Debug resource quota violations preventing pod scheduling. Understand LimitRange defaults, ResourceQuota, and namespace management.
Debug Service with No Ready Endpoints
Troubleshoot Services showing zero endpoints. Verify label selectors, readiness probes, pod status, and port configuration.
Debug Taint and Toleration Scheduling
Fix pods stuck Pending due to node taints. Understand NoSchedule, PreferNoSchedule, NoExecute effects and toleration syntax.
Fix Admission Webhook Timeout Errors
Debug admission webhook failures blocking pod creation. Identify failing webhooks, check timeouts, and set failurePolicy.
Decode and Inspect Kubernetes Docker Secrets
Decode base64-encoded dockerconfigjson secrets to verify registry credentials, troubleshoot ImagePullBackOff errors, and audit pull secret configurations.
Troubleshoot CatalogSource and OLM Issues
Debug CatalogSource failures including pod crashes, gRPC errors, stale caches, and operator install problems in OpenShift OLM environments.
SR-IOV VF Troubleshooting on Kubernetes
Diagnose and fix SR-IOV Virtual Function issues including VF creation failures, device plugin errors, RDMA problems, and network attachment failures.
Diagnose NVIDIA Memory-Only Kernel Modules on OpenShift
Understand why lsmod shows NVIDIA modules loaded but modinfo fails, and how the GPU Operator's proprietary driver container inserts modules without ...
Fix NVIDIA Peer Memory Driver Not Detected
Diagnose and resolve the 'NVIDIA peer memory driver not detected' error when running GPU workloads with RDMA on Kubernetes and OpenShift.
Troubleshoot nvidia-fs Module Conflict on OpenShift
Diagnose and fix the 'insmod: ERROR: could not insert module nvidia-fs.ko: File exists' error when enabling GPUDirect Storage with the NVIDIA GPU Operator.
Debug NCCL Timeouts and Hangs in Kubernetes
Systematically troubleshoot NCCL runs that stall or timeout across multi-GPU and multi-node Kubernetes jobs with step-by-step diagnostic commands.
Diagnose GPU Peer-to-Peer Latency with NCCL Tests
Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.
Validate GPU and NIC Topology Before NCCL Benchmarks
Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.
Check Bonding and Interface Status for SR-IOV
Inspect bond membership, interface state, and link aggregation to confirm which NICs can be correctly targeted by SR-IOV network policies on Kubernetes.
Identify Mellanox Interface Models from Linux and PCI Data
Map interface names to PCI addresses and Mellanox model generations to build accurate SR-IOV policies and GPU networking configurations on Kubernetes.
Troubleshoot NVIDIA NIM TensorRT-LLM Initialization Failures
Diagnose and fix common NIM TensorRT-LLM executor failures including DecoderState mismatch, version incompatibilities, and engine build errors.
Fix 'No Supported NIC Is Selected' in SR-IOV
Diagnose SR-IOV operator webhook rejections by validating node state, label selectors, PF eligibility, and SriovNetworkNodePolicy configuration.
Troubleshoot nv-ipam 'Pool Not Found' Errors in Multus
Fix nv-ipam IPPool lookup failures in Multus by aligning SriovNetwork, NetworkAttachmentDefinition, and IPPool names and namespaces correctly.
Validate SR-IOV Operator Health Across Multiple Worker Nodes
Run a full checklist to confirm SR-IOV discovery, VF creation, scheduler resources, and pod attachment on multiple nodes.
How to Troubleshoot Kubernetes Networking
Debug and resolve Kubernetes networking issues systematically. Learn to diagnose DNS problems, service connectivity, network policies, and CNI issues.
How to Debug ImagePullBackOff Errors
Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Learn to diagnose registry authentication, image tags, and network connectivity issues.
How to Debug Kubernetes Node Issues
Diagnose and troubleshoot node problems in Kubernetes clusters. Identify resource pressure, connectivity issues, and component failures.
OOMKilled in Kubernetes: How to Debug and Fix
Fix OOMKilled errors in Kubernetes pods. Learn why containers get OOMKilled (exit code 137), how to set memory limits, debug memory leaks, and prevent OOM.
How to Debug Pod Networking Issues
Diagnose and fix Kubernetes networking problems. Troubleshoot connectivity, DNS resolution, service discovery, and network policies with practical tools.
How to Debug Pod Scheduling Failures
Troubleshoot pods stuck in Pending state due to scheduling issues. Learn to diagnose resource constraints, node affinity, taints, and topology spread.
How to Use Ephemeral Containers for Debugging
Debug running pods using ephemeral containers without restarting. Learn kubectl debug techniques for troubleshooting production workloads.
How to Run Kubernetes in Docker (kind)
Create local Kubernetes clusters using kind (Kubernetes in Docker). Set up multi-node clusters, configure networking, and test applications locally.
Essential kubectl Commands for Debugging
Master kubectl debugging commands to troubleshoot Kubernetes issues. Learn to inspect pods, view logs, debug networking, and diagnose cluster problems.
How to Extend kubectl with Plugins
Enhance kubectl with custom plugins using the Krew plugin manager. Discover, install, and create plugins to boost Kubernetes productivity.
How to Manage Kubernetes Finalizers and Stuck Resources
Understand and manage finalizers for controlled resource deletion. Handle stuck resources and implement custom cleanup logic.
CrashLoopBackOff: How to Fix in Kubernetes
Fix CrashLoopBackOff in Kubernetes pods. Learn why pods crash loop, systematic debugging with kubectl logs and describe, and solutions for common causes.
How to Debug DNS Issues in Kubernetes
Troubleshoot and resolve DNS problems in Kubernetes. Learn to diagnose CoreDNS issues, test resolution, and fix common DNS failures.
Troubleshooting Pending PersistentVolumeClaims
Diagnose and fix PVCs stuck in Pending status. Learn common causes including StorageClass issues, capacity problems, and node affinity conflicts.