🔧 Troubleshooting
Debug and fix: CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending pods, networking issues, SR-IOV VF troubleshooting, and NFS over RDMA performance debugging.
NCCL Debug Subsystems for GPU Network Troubleshooting
Configure NCCL_DEBUG and NCCL_DEBUG_SUBSYS for targeted logging during multi-node GPU training. Covers INIT, NET, GRAPH subsystems, log
NCCL Network Validation Troubleshooting Checklist
Complete troubleshooting checklist for NCCL multi-node GPU bandwidth validation. Covers SR-IOV VF allocation, /dev/infiniband visibility, RoCE GID
Kubernetes Ephemeral Containers for Debugging
Debug running pods with Kubernetes ephemeral containers. Attach debug containers without restarting pods, troubleshoot distroless images, inspect network
Kubernetes Finalizers Explained and Troubleshooting
Understand Kubernetes finalizers for resource cleanup. How finalizers block deletion, common stuck resource scenarios, manual removal
Kubernetes ImagePullBackOff Troubleshooting Guide
Debug and fix ImagePullBackOff and ErrImagePull errors in Kubernetes. Resolve authentication failures, registry connectivity, image not found, TLS certificate
Kubernetes OOMKilled Troubleshooting and Prevention
Debug and prevent OOMKilled container terminations in Kubernetes. Understand memory limits, diagnose memory leaks, configure resource requests, and implement
Chaos Mesh Fault Injection on Kubernetes
Deploy Chaos Mesh for chaos engineering on Kubernetes. Covers PodChaos, NetworkChaos, IOChaos, StressChaos experiments, scheduling, RBAC
LitmusChaos Engineering on Kubernetes
Deploy LitmusChaos for resilience testing on Kubernetes. Covers ChaosEngine, ChaosExperiment, ChaosResult CRDs, built-in experiments, GameDay planning, Litmus
Ephemeral Containers for Live Debugging
Use kubectl debug with ephemeral containers to troubleshoot running Pods without restarting them. Attach debugging tools to distroless containers, inspect
OpenShift oc cp File Copy Guide
Use oc cp to copy files and directories between local machine and Pods. Covers tar-based transfer, container selection, large file handling, and comparison
OpenShift oc rsync File Transfer
Use oc rsync to copy files between local machine and Pods in OpenShift. Covers upload, download, live sync, filtering, and common patterns for debugging
Thanos Receive OOMKilled CrashLoopBackOff
Debug and fix Thanos Receive StatefulSet OOMKilled CrashLoopBackOff caused by WAL replay exceeding memory limits. Covers ArgoCD conflict resolution, liveness
Fix Thanos Receive OOMKilled in Run:ai
Troubleshoot and fix Thanos Receive OOMKilled (exit code 137) with 143+ restarts in Run:ai backend on OpenShift. Covers memory tuning, TSDB
Kubernetes 1.36 Statusz and Flagz Endpoints
Use /statusz and /flagz debug endpoints in Kubernetes 1.36 control plane components. Inspect runtime status and effective flag values without log parsing.
kubectl describe: Read Pod Events Guide
Use kubectl describe pod to read events, conditions, and container states. Diagnose scheduling failures, image pulls, crashes, and probe failures.
kubectl exec: Run Commands in Pods
Use kubectl exec to run commands inside running pods. Interactive shell, multi-container pods, debugging techniques, and security considerations.
K8s CoreDNS: Troubleshoot DNS Issues
Troubleshoot Kubernetes CoreDNS resolution failures. Debug DNS pods, ndots settings, search domains, custom Corefile, and forward plugin configuration.
Fix CreateContainerError in Kubernetes
Troubleshoot Kubernetes CreateContainerError with step-by-step debugging. ConfigMap mounts, Secret references, volume permissions, and container runtime issues.
Troubleshoot ImagePullBackOff and ErrImagePull
Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Private registry auth, image pull secrets, tag verification, and network connectivity fixes.
kubectl debug: Advanced Pod Debugging
Use kubectl debug for ephemeral containers, node debugging, and pod copy debugging. Debug distroless images, share process namespaces, and node-level access.
K8s Network Debugging: Connectivity Guide
Debug Kubernetes network issues with tcpdump, netshoot, and connectivity tests. Pod-to-pod, pod-to-service, DNS, and external connectivity troubleshooting.
Fix Untolerated Taint node-role master
Fix 'node untolerated taint node-role.kubernetes.io/master' scheduling error. Remove or tolerate control plane taints to schedule pods on master nodes.
Journald Verify Config Kubernetes Nodes
Validate journald configuration on Kubernetes nodes. Fix journal corruption, tune storage limits, configure persistence, and troubleshoot systemd-journald.
NXDOMAIN DNS Troubleshooting Kubernetes
Fix NXDOMAIN errors in Kubernetes. Debug CoreDNS failures, ndots configuration, search domain issues, and external DNS lookup problems.
oc-mirror Troubleshooting Disconnected
Troubleshoot oc-mirror failures in disconnected OpenShift. Fix archive corruption, registry auth errors, v1/v2 mismatches, and delta mirror issues.
OpenShift Cluster Operator Upgrade Debug
Debug degraded cluster operators during OpenShift upgrades. Identify stuck operators, decode status conditions, and unblock stalled rollouts.
OpenShift MCP Validation Broken Rules
Validate MachineConfigPool rules before applying in OpenShift. Detect broken MachineConfigs, degraded MCPs, and implement pre-flight checks.
SELinux SSH Login Failure Troubleshoot
Fix SSH login failures caused by SELinux enforcement. Diagnose AVC denials, restore file labels, fix custom SSH ports, and resolve PAM denials.
Cilium Debug Pod Troubleshooting
Debug Kubernetes networking with Cilium debug pods and containers. cilium-dbg, netshoot, hubble observe, and endpoint connectivity troubleshooting.
Fix CUDA Out of Memory K8s Pods
Troubleshoot CUDA out of memory errors in Kubernetes GPU pods. Memory fragmentation, batch size tuning, gradient checkpointing, and resource limits.
NVIDIA GPU Operator Troubleshooting
Fix common NVIDIA GPU Operator issues on Kubernetes. Driver pod crashes, toolkit failures, device plugin not ready, and validation pod errors.
Fix etcd Leader Election Timeout
Troubleshoot etcd leader election timeouts in K8s. Disk latency, network partition, heartbeat interval, and recovery steps.
Fix Certificate Errors Kubernetes
Troubleshoot TLS certificate errors in K8s. x509 unknown authority, expired certs, cert-manager issues, and custom CA bundles.
Fix DNS Resolution Issues in Kubernetes
Troubleshoot Kubernetes DNS resolution failures. ndots, search domains, CoreDNS CrashLoop, and pod-level DNS debugging steps.
Fix Pod cgroup Memory Errors K8s
Fix cgroup memory limit and OOM errors in Kubernetes pods. Covers cgroup v2 migration, memory.max, swap settings, and kernel tuning for stable workloads.
Fix Service Not Reachable in Kubernetes
Debug Kubernetes Service connectivity issues. Endpoint selection, kube-proxy rules, DNS resolution, and NetworkPolicy blocks.
Fix 502 Bad Gateway Kubernetes Ingress
Fix 502 Bad Gateway errors in Kubernetes Ingress. Backend not ready, timeout tuning, readiness probes, and NGINX ingress controller troubleshooting.
kubectl exec Into Pods: Complete Guide
Use kubectl exec to debug running pods. Interactive shells, non-interactive commands, multi-container pods, and ephemeral debug containers.
Fix Namespace Stuck Terminating K8s
Fix Kubernetes namespaces stuck in Terminating state. Finalizer removal, API resource cleanup, and force deletion of stuck namespaces.
Fix Node NotReady Status in Kubernetes
Troubleshoot Kubernetes nodes in NotReady state. Kubelet issues, disk pressure, network problems, certificate expiration, and recovery procedures.
Fix node-role.kubernetes.io/master
Remove the node-role.kubernetes.io/master taint to schedule pods on control plane nodes. Single-node clusters, tolerations, and untolerated taint fix.
Fix OOMKilled Kubernetes Guide
Troubleshoot and fix OOMKilled errors in Kubernetes. Memory limit tuning, Java heap sizing, memory leak detection, and VPA recommendations.
Fix Pending Pods Kubernetes Guide
Troubleshoot Kubernetes pods stuck in Pending state. Insufficient resources, node selector mismatch, PVC binding, taints, and scheduling failures.
Fix Pod Eviction Kubernetes Guide
Troubleshoot Kubernetes pod evictions. DiskPressure, MemoryPressure, ephemeral storage limits, and eviction thresholds configuration.
Pod Lifecycle and States Guide
Understand Kubernetes pod lifecycle phases and container states. Pending, Running, Succeeded, Failed, Unknown, and troubleshooting stuck pods.
LitmusChaos Chaos Engineering K8s
Run chaos experiments on Kubernetes with LitmusChaos. Pod kill, network latency, disk fill, and CPU stress experiments for resilience testing.
Debug Containers and Ephemeral Pods
Use kubectl debug with ephemeral containers to troubleshoot running pods without restart. Debug distroless images, node debugging.
DNS Debugging Kubernetes Guide
Debug Kubernetes DNS issues systematically. CoreDNS troubleshooting, ndots configuration, search domains, and resolving slow DNS lookups.
Network Debugging Tools Kubernetes
Debug Kubernetes networking with tcpdump, netshoot, iptables tracing, conntrack inspection, and DNS resolution testing techniques.
Fix RBAC Permission Errors K8s
Debug Kubernetes RBAC permission errors. kubectl auth can-i, impersonation testing, ClusterRole aggregation, and common permission mistakes.
Fix 502 Bad Gateway in Kubernetes
Troubleshoot and fix 502 Bad Gateway errors in Kubernetes. Causes include pod readiness timing, ingress misconfiguration, and upstream timeouts.
kubectl cp Copy Files to and from Pods
Copy files between local machine and Kubernetes pods with kubectl cp. Supports containers, namespaces, tar-based transfer, and common troubleshooting.
kubectl logs View Pod Logs Guide
View and stream Kubernetes pod logs with kubectl logs. Multi-container pods, previous crashes, label selectors, timestamps, and log aggregation patterns.
Check Kubernetes Node Status with kubectl
Check and troubleshoot Kubernetes node status with kubectl. Node conditions (Ready, MemoryPressure, DiskPressure), NotReady debugging, and capacity monitoring.
Troubleshooting Pods with GPU Devices
Fix GPU device issues in Kubernetes pods. Troubleshoot device plugin errors, DRA claims, CUDA failures, driver mismatches.
Debug Kubernetes Pods: Complete Guide
Debug Kubernetes pods with kubectl debug, ephemeral containers, and netshoot. Troubleshoot distroless images, network issues, and crashed pods step by step.
Kubernetes Troubleshooting Flowchart
Systematic Kubernetes troubleshooting guide with flowcharts. Debug pods, services, networking, storage, and node issues step by step with kubectl commands.
kubectl exec: Run Commands Inside K8s Pods
Use kubectl exec to run commands inside Kubernetes pods. Covers interactive sessions, multi-container pods, and ephemeral container debugging.
Fix ImagePullBackOff in Kubernetes
Debug and fix ImagePullBackOff errors in Kubernetes. Covers wrong image names, private registry auth, rate limits, and network connectivity issues.
Debug and Fix OOMKilled Errors in Kubernetes
Debug and fix OOMKilled errors in Kubernetes. Find memory leaks, set correct limits, use VPA for right-sizing, and prevent container OOM kills.
Kubernetes Pod Eviction: Causes and Prevention
Understand why Kubernetes evicts pods and how to prevent it. Covers resource pressure, priority classes, PDBs, and eviction policies.
Fix API Server Timeout and Overload
Debug kubectl timeouts, API server overload, and connection refused errors. Covers etcd latency, webhook timeouts, and rate limiting.
Fix CoreDNS Resolution Failures in Kubernetes
Debug DNS resolution failures in Kubernetes pods. Covers CoreDNS crashes, NXDOMAIN errors, ndots configuration, and upstream DNS timeouts.
How to Fix CrashLoopBackOff in Kubernetes
Fix CrashLoopBackOff in Kubernetes with step-by-step troubleshooting. Debug OOMKilled, failed probes, missing configs, and image errors causing pod crash loops.
Fix etcd High Latency and Slow API Server
Debug etcd performance issues causing slow kubectl responses and API server timeouts. Covers disk I/O, compaction, defragmentation, and leader elections.
Fix fio libaio Silent Exit on OpenShift cru...
Debug fio instantly exiting with no output on crun-based OpenShift nodes. The root cause is seccomp blocking libaio syscalls; fix with psync or unconfined.
ImagePullBackOff Troubleshooting Guide
Debug and resolve ImagePullBackOff errors including auth failures, wrong tags, private registry access, and rate limiting from Docker Hub and Quay.
Fix Kubernetes Job Failures and Retries
Debug Kubernetes Jobs stuck in backoff or hitting retry limits. Covers backoffLimit, activeDeadlineSeconds, and CronJob overlap.
Fix Kubelet NotReady and Node Pressure Issues
Debug kubelet NotReady status, node pressure conditions, and eviction issues. Covers disk pressure, memory pressure, PID pressure, and network not ready.
Kubernetes Debugging Toolkit and Commands
Essential kubectl debugging commands and tools for Kubernetes troubleshooting. Covers ephemeral containers, debug pods, network debugging, and log analysis.
Fix OOMKilled Containers in Kubernetes
Debug and resolve OOMKilled container terminations. Understand memory limits, kernel OOM killer behavior, and right-sizing strategies for Kubernetes pods.
OpenShift crun vs runc Runtime Differences
Understand why pods behave differently on GPU vs CPU nodes in OpenShift. Compare crun and runc container runtimes, seccomp profiles, and syscall filtering.
Fix Unexpected Pod Evictions in Kubernetes
Debug pods being evicted due to node pressure, preemption, or taint-based eviction. Understand eviction priorities, QoS classes, and PodDisruptionBudgets.
Fix Pod Stuck in Pending State
Debug pods stuck in Pending status. Covers insufficient resources, node affinity mismatches, taint/toleration issues, and PVC binding failures.
Fix Podman TLS x509 Behind Corporate Proxy
Resolve podman pull x509 certificate signed by unknown authority errors caused by corporate TLS-intercepting proxies. Extract and install the proxy CA.
Fix PVC Stuck in Pending State
Debug PersistentVolumeClaims stuck in Pending status. Covers storage class issues, provisioner failures, capacity problems, and access mode mismatches.
Fix Service Mesh Sidecar Injection Failures
Debug Istio and Envoy sidecar injection issues. Covers missing sidecars, port conflicts, init container failures, and mTLS connection errors.
OpenShift oc debug Mount Limitation
Why NFS and filesystem mounts via oc debug node disappear after the debug pod exits. Understand the container namespace isolation and use MachineConfig instead.
Fix the Kubernetes ConfigMap Too Large Error
Resolve the 1MB ConfigMap size limit error. Split configs, use Secrets for binary data, mount volumes, or use external stores.
Debug CRI-O Container Runtime Errors
Troubleshoot CRI-O issues on OpenShift nodes. Fix image pull failures, container start errors, storage driver problems, and CNI networking plugin failures.
Debug Degraded MachineConfigPool Nodes
Fix nodes stuck Degraded after MachineConfig updates. Check MCD logs, on-disk validation, and recovery for degraded workers.
Debug Kubernetes Pod Eviction Reasons
Investigate why pods were evicted from Kubernetes nodes. Check node pressure conditions, resource limits, priority classes, and preemption events.
Debug DNS Resolution Failures in Pods
Troubleshoot pods unable to resolve DNS names. Check CoreDNS health, ndots configuration, search domains, and NetworkPolicies blocking UDP port 53 DNS traffic.
Debug etcd Performance Issues in Kubernetes
Diagnose slow etcd causing API latency and leader election storms. Check disk IOPS, compaction, defrag, and network latency.
Fix Expired Certificates in Kubernetes
Renew expired certificates causing API server failures and kubelet disconnections. Manual and automatic renewal for kubeadm and OpenShift.
Fix OpenShift ImageStream Import Errors
Debug ImageStream import failures in OpenShift. Resolve DNS errors, auth issues, TLS problems, and registry rate limiting.
ITMS Race Condition with Ingress Controllers
Resolve the ITMS race condition where ImageTagMirrorSet rollouts deadlock with hostNetwork ingress controllers during MCO drain.
MCP Drain Blocked by PDB: Workaround
Resolve OpenShift MachineConfigPool drain failures caused by PodDisruptionBudget violations. Scale down and restore after update.
Fix Stale MachineConfigPool Updates
Debug and resolve stale OpenShift MachineConfigPool updates. Identify blocked nodes, check MachineConfigDaemon logs, and unblock stuck MCP rollouts.
Fix Namespace Stuck in Terminating
Remove Kubernetes namespaces stuck in Terminating state. Identify blocking finalizers, orphaned API resources, and safely force namespace cleanup procedures.
Debug NetworkPolicy Connectivity Issues
Troubleshoot pods unable to communicate despite correct Services. Verify NetworkPolicy rules, label selectors, and default deny.
Node Drain Blocked by hostNetwork Port Conf...
Debug and fix OpenShift node drains that fail because hostNetwork pods cannot schedule replacements due to port exhaustion across the cluster.
Debug Node NotReady Status in Kubernetes
Diagnose Kubernetes nodes stuck in NotReady state. Check kubelet logs, container runtime, network, disk pressure, and certificates.
OpenShift Ingress Router Troubleshooting
Debug OpenShift HAProxy router issues: pods stuck Pending, hostPort conflicts, PDB violations during maintenance, and custom router deployment scaling problems.
Debug MachineConfigDaemon Logs
Read and interpret OpenShift MachineConfigDaemon logs to diagnose node update failures. Common error patterns, drain issues, and config application problems.
Debug OpenShift OAuth Login Failures
Troubleshoot OpenShift console and CLI login failures. Check OAuth server pods, identity provider config, and expired tokens.
Fix Stuck OLM Operator Subscriptions
Debug Operator Lifecycle Manager subscriptions stuck in pending or failed state. Resolve catalog source issues, approval policies, and CSV dependency conflicts.
PDB Allowed Disruptions Zero: Debugging
Debug PodDisruptionBudgets stuck at zero allowed disruptions. Understand minAvailable vs maxUnavailable, fix eviction failures, and plan for maintenance.
Fix PV Stuck in Terminating State
Resolve PVs and PVCs stuck in Terminating status. Remove finalizers safely, check volume detachment, and handle storage issues.
Fix ResourceQuota Exceeded Errors
Debug resource quota violations preventing pod scheduling. Understand LimitRange defaults, ResourceQuota, and namespace management.
Debug Service with No Ready Endpoints
Troubleshoot Services showing zero endpoints. Verify label selectors, readiness probes, pod status, and port configuration.
Fix Node Untolerated Taint Scheduling Errors
Fix node untolerated taint errors causing pods stuck in Pending. NoSchedule, PreferNoSchedule, NoExecute effects, and toleration syntax guide.
Fix Admission Webhook Timeout Errors
Debug admission webhook failures blocking pod creation. Identify failing webhooks, check timeouts, and set failurePolicy.
Decode and Inspect Kubernetes Docker Secrets
Decode base64-encoded dockerconfigjson secrets to verify registry credentials, troubleshoot ImagePullBackOff errors, and audit pull secret configurations.
Troubleshoot CatalogSource and OLM Issues
Debug CatalogSource failures including pod crashes, gRPC errors, stale caches, and operator install problems in OpenShift OLM environments.
SR-IOV VF Troubleshooting on Kubernetes
Diagnose and fix SR-IOV Virtual Function issues including VF creation failures, device plugin errors, RDMA problems, and network attachment failures.
Diagnose NVIDIA Memory-Only Kernel Modules ...
Understand why lsmod shows NVIDIA modules loaded but modinfo fails, and how the GPU Operator's proprietary driver container inserts modules without.
Fix NVIDIA Peer Memory Driver Not Detected
Diagnose and resolve the 'NVIDIA peer memory driver not detected' error when running GPU workloads with RDMA on Kubernetes and OpenShift.
Fix nvidia-fs Module Conflict on OpenShift
Diagnose and fix the 'insmod: ERROR: could not insert module nvidia-fs.ko: File exists' error when enabling GPUDirect Storage with the NVIDIA GPU Operator.
Debug NCCL Timeouts and Hangs in Kubernetes
Systematically troubleshoot NCCL runs that stall or timeout across multi-GPU and multi-node Kubernetes jobs with step-by-step diagnostic commands.
Diagnose GPU Peer-to-Peer Latency NCCL Tests
Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.
Validate GPU & NIC Topology Before NCCL Ben...
Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.
Check Bonding and Interface Status for SR-IOV
Inspect bond membership, interface state, and link aggregation to confirm which NICs can be correctly targeted by SR-IOV network policies on Kubernetes.
Identify Mellanox Interface Models from Lin...
Map interface names to PCI addresses and Mellanox model generations to build accurate SR-IOV policies and GPU networking configurations on Kubernetes.
Fix NVIDIA NIM TensorRT-LLM Initialization ...
Diagnose and fix common NIM TensorRT-LLM executor failures including DecoderState mismatch, version incompatibilities, and engine build errors.
Fix 'No Supported NIC Is Selected' in SR-IOV
Diagnose SR-IOV operator webhook rejections by validating node state, label selectors, PF eligibility, and SriovNetworkNodePolicy configuration.
Fix nv-ipam 'Pool Not Found' Errors in Multus
Fix nv-ipam IPPool lookup failures in Multus by aligning SriovNetwork, NetworkAttachmentDefinition, and IPPool names and namespaces correctly.
Validate SR-IOV Operator Health Across Mult...
Run a full checklist to confirm SR-IOV discovery, VF creation, scheduler resources, and pod attachment on multiple nodes.
How to Troubleshoot Kubernetes Networking
Debug and resolve Kubernetes networking issues systematically. Learn to diagnose DNS problems, service connectivity, network policies, and CNI issues.
How to Debug ImagePullBackOff Errors
Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Learn to diagnose registry authentication, image tags, and network connectivity issues.
How to Debug Kubernetes Node Issues
Diagnose and troubleshoot node problems in Kubernetes clusters. Identify resource pressure, connectivity issues, and component failures.
Fix OOMKilled in Kubernetes Pods
Fix OOMKilled errors in Kubernetes pods (exit code 137). Debug memory leaks, set correct memory limits, and prevent OOM kills in containers.
Debug Pod Scheduling Failures in K8s
Fix pods stuck in Pending from scheduling failures. Diagnose resource constraints, node affinity, taints, tolerations, and topology spread conflicts.
How to Debug Pod Networking Issues
Diagnose and fix Kubernetes networking problems. Troubleshoot connectivity, DNS resolution, service discovery, and network policies with practical tools.
Ephemeral Containers: Debug Running Pods
Debug running pods with ephemeral containers using kubectl debug. Attach debug containers without restart for production troubleshooting on Kubernetes.
How to Run Kubernetes in Docker (kind)
Create local Kubernetes clusters using kind (Kubernetes in Docker). Set up multi-node clusters, configure networking, and test applications locally.
Essential kubectl Commands for Debugging
Master kubectl debugging commands to troubleshoot Kubernetes issues. Learn to inspect pods, view logs, debug networking, and diagnose cluster problems.
How to Extend kubectl with Plugins
Enhance kubectl with custom plugins using Krew package manager. Discover, install, and create plugins to boost K8s productivity.
Fix K8s Stuck Resources and Finalizers
Fix Kubernetes resources stuck in Terminating state by managing finalizers. Remove stuck namespaces, PVs, and CRDs with force-delete procedures.
CrashLoopBackOff: How to Fix in Kubernetes
Fix CrashLoopBackOff in Kubernetes pods. Learn why pods crash loop, systematic debugging with kubectl logs and describe, and solutions for common causes.
How to Debug DNS Issues in Kubernetes
Troubleshoot and resolve DNS problems in Kubernetes. Learn to diagnose CoreDNS issues, test resolution, and fix common DNS failures.
Fix Pending PVC Status in Kubernetes
Fix PersistentVolumeClaims stuck in Pending status. Diagnose StorageClass issues, capacity problems, node affinity conflicts, and provisioner failures.