Kubernetes Recipes

⏱ 15 minutes · networking, rdma, performance

ib_write_bw RDMA Bandwidth Testing on Kubernetes GPU Nodes

Validate RDMA write bandwidth on Kubernetes GPU nodes with ib_write_bw and SR-IOV. Device selection, RoCE GID index, and ConnectX-7 400G expectations.

⏱ 15 minutes · networking, performance, rdma

NVIDIA DOCA Bench for DPU Performance Testing on Kubernetes

Benchmark NVIDIA BlueField DPU accelerators in Kubernetes with DOCA Bench: throughput/latency modes, RDMA, compression offload, and multi-core scaling.

⏱ 15 minutes · gpu, nccl, performance

H200 NVL 8-GPU Topology Bandwidth Tiers for Kubernetes

Map the three bandwidth tiers of 8× H200 NVL GPU nodes—NVLink (~337 GB/s), PCIe+UPI (~50 GB/s), RoCE (~35 GB/s)—for NCCL topology-aware NUMA scheduling.

⏱ 15 minutes · storage, nfs, networking

Dell PowerScale NFS Access Zones for Kubernetes AI Storage

Configure Dell PowerScale (Isilon) access zones and SmartConnect pools for Kubernetes AI storage with per-environment NFS isolation and IP pool sizing.

⏱ 20 minutes · ansible, automation, deployments

Automate Kubernetes Day-2 Operations with Ansible

Use Ansible to automate Kubernetes day-2 operations — apply manifests, roll out upgrades, and reconcile cluster state with the kubernetes.core collection.

⏱ 15 minutes · iommu, passthrough, gds

Disable GDS and Enable IOMMU Passthrough on K8s GPUs

Disable GPUDirect Storage (GDS) when not needed and configure IOMMU passthrough mode for GPU and NIC device assignment. Kernel parameters, BIOS settings, VFIO

⏱ 15 minutes · gpu-operator, rdma, gds

GPU Operator ClusterPolicy RDMA and GDS Configuration

Configure NVIDIA GPU Operator ClusterPolicy to disable RDMA and enable GPUDirect Storage (GDS). Control nvidia-peermem, nvidia-fs modules, driver

⏱ 15 minutes · gpudirect, rdma, nvidia

GPUDirect RDMA Setup and Verification on Kubernetes

Enable and verify GPUDirect RDMA for GPU-to-NIC direct data transfer on Kubernetes. Install nvidia-peermem, configure DMA-BUF, verify RDMA paths, troubleshoot

⏱ 15 minutes · iommu, kernel, gpu

IOMMU Kernel Parameters for Kubernetes GPU Nodes

Configure IOMMU kernel parameters for optimal GPU and RDMA performance on Kubernetes. Compare intel_iommu, amd_iommu, and iommu settings, passthrough vs off vs

⏱ 15 minutes · mpi, ssh, openshift

Kubeflow MPIJob Worker SSH Setup for GPU Training

Configure SSH daemon in Kubeflow MPIJob worker pods for multi-node GPU training. Covers SSHD setup in containers, host key generation, authorized keys from MPI

⏱ 15 minutes · topology-manager, numa, gpu

Kubernetes Topology Manager for GPU and NUMA Alignment

Configure Kubernetes Topology Manager to align CPU, GPU, and NIC allocations on the same NUMA node. Covers policies, kubelet config, and GPU performance tuning.

⏱ 15 minutes · mpi, dns, networking

MPI DNS Resolution and Hostfile for Kubernetes GPU Jobs

Troubleshoot MPI hostfile DNS resolution in Kubeflow MPIJob on Kubernetes. Covers headless Service creation, subdomain configuration, DNS wait loops, FQDN

⏱ 15 minutes · nccl, benchmarking, all-reduce

NCCL All-Reduce Benchmarking on Multi-Node GPUs

Run and interpret NCCL all_reduce_perf benchmarks on multi-node Kubernetes GPU clusters. Understand bus bandwidth results, expected throughput for H200 NVL

⏱ 15 minutes · nccl, debugging, gpu-communication

NCCL Channel Routing and Transport Path Analysis

Interpret NCCL channel logs to understand GPU communication paths on Kubernetes. Decode P2P/CUMEM, SHM/direct, NET/IB/GDRDMA transport

⏱ 15 minutes · nccl, troubleshooting, observability

NCCL Debug Subsystems for GPU Network Troubleshooting

Configure NCCL_DEBUG and NCCL_DEBUG_SUBSYS for targeted logging during multi-node GPU training. Covers INIT, NET, GRAPH subsystems, log

⏱ 15 minutes · nccl, rdma, gpu

NCCL DMABUF Enable for GPUDirect RDMA on Kubernetes

Enable NCCL DMA-BUF support for GPUDirect RDMA in Kubernetes GPU clusters. Covers NCCL_DMABUF_ENABLE=1, kernel requirements, nvidia-peermem vs dmabuf, GPU

⏱ 15 minutes · nccl, gpudirect, rdma

NCCL GPUDirect RDMA Distance Levels and PIX vs SYS

Understand NCCL GPUDirect RDMA distance-based enablement. When PIX mode disables GDRDMA for distant GPU-HCA pairs (distance 9 > 4) and when SYS mode enables

⏱ 15 minutes · nccl, rdma, gpu

NCCL GPUDirect RDMA Level Tuning PIX PXB PHB SYS

Tune NCCL_NET_GDR_LEVEL for optimal GPUDirect RDMA performance on Kubernetes. Compare PIX, PXB, PHB, and SYS distance thresholds with PCIe topology. Benchmark

⏱ 15 minutes · nccl, rdma, performance

NCCL IB HCA Selection and QPS Tuning for RoCE

Configure NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_IB_QPS_PER_CONNECTION, and NCCL_IB_SPLIT_DATA_ON_QPS for optimal RoCE performance on Kubernetes GPU clusters.

⏱ 15 minutes · nccl, openshift, sr-iov

NCCL Network Validation Script for OpenShift GPU Clusters

Build a comprehensive NCCL network validation script for OpenShift GPU clusters with SR-IOV. Configure NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL=SYS, per-rank HCA

⏱ 15 minutes · nccl, troubleshooting, rdma

NCCL Network Validation Troubleshooting Checklist

Complete troubleshooting checklist for NCCL multi-node GPU bandwidth validation. Covers SR-IOV VF allocation, /dev/infiniband visibility, RoCE GID

⏱ 15 minutes · nccl, mpi, rdma

Production NCCL Network Validator for Kubeflow MPIJob

Deploy a production-ready NCCL network validation framework using Kubeflow MPIJob on OpenShift. Complete validate_network.sh script

⏱ 15 minutes · nccl, mpi, roce

NCCL RoCE Validation MPIJob Complete Reference

Complete nccl-roce-validation.yaml MPIJob reference for OpenShift GPU clusters. Full launcher environment variables, OpenMPI control plane settings, NCCL

⏱ 15 minutes · nccl, mpi, rdma

NCCL RoCE Validation with Kubeflow MPIJob on Kubernetes

Run NCCL all_reduce_perf validation tests using Kubeflow MPIJob on GPU clusters. Configure MPI launcher and workers, NCCL environment variables, test

⏱ 15 minutes · nccl, gpu, performance

Shared Memory Transport for NCCL Intra-Node GPU

Configure NCCL shared memory (SHM) transport for intra-node GPU communication on Kubernetes. Covers /dev/shm sizing with emptyDir and NVLink/PCIe P2P paths.

⏱ 15 minutes · nvidia, gpu-topology, nvidia-smi

NVIDIA GPU Topology Matrix Interpretation on Kubernetes

Read and interpret nvidia-smi topo and nvidia-device-plugin topology matrices on Kubernetes GPU nodes. Understand X, NV, SYS, NODE, PIX, PXB, PHB connection

⏱ 15 minutes · rdma, nvidia, network-operator

RDMA Configuration with NVIDIA Network Operator

Deploy and configure RDMA for GPU clusters using the NVIDIA Network Operator. NicClusterPolicy setup, MLNX_OFED driver container, shared and SR-IOV RDMA device

⏱ 15 minutes · nvlink, gpu-architecture, pcie

NVLink Bridge Architecture for GPU Kubernetes Nodes

Understand NVLink Bridge logical architecture in GPU servers for Kubernetes. Dual-socket PCIe Gen5 topology, NVL4 groups, GPU-NIC-NVMe placement, PCIe switch

⏱ 15 minutes · mpi, nccl, networking

OpenMPI Control Plane Separation for NCCL RDMA

Configure OpenMPI to use eth0 for MPI control traffic while NCCL uses net1 SR-IOV for data. Covers btl_tcp_if_include, pml, routed direct, plm_rsh_agent SSH

⏱ 15 minutes · sriov, openshift, nv-ipam

OpenShift SR-IOV Network with NVIDIA IPAM for GPU Fabric

Configure SriovNetwork resources on OpenShift with nv-ipam for GPU fabric IP allocation. SR-IOV Network Operator setup, Mellanox NIC resource targeting, IPAM

⏱ 15 minutes · gpu, scheduling, openshift

Run:ai GPU Scheduling with Kubeflow MPIJob

Integrate Run:ai GPU scheduler with Kubeflow MPIJob for multi-node NCCL workloads. Covers Run:ai project namespaces, GPU quota annotations, pod group

⏱ 15 minutes · rdma, device-plugin, shared

Shared RDMA Device Plugin for Kubernetes GPU Pods

Configure the RDMA shared device plugin to allow multiple pods to share RDMA-capable NICs on Kubernetes. k8s-rdma-shared-dev-plugin setup, resource

⏱ 15 minutes · networking, sriov, rdma

SR-IOV Multus Network Attachment for GPU RDMA Pods

Configure Multus CNI NetworkAttachmentDefinition for SR-IOV RDMA in Kubernetes GPU workloads. Covers k8s.v1.cni.cncf.io/networks annotation, IPAM

⏱ 15 minutes · cloudnativepg, postgresql, database

CloudNativePG PostgreSQL Operator on Kubernetes

Deploy production PostgreSQL on Kubernetes with CloudNativePG operator. Automated failover, continuous backup to S3, point-in-time recovery, connection

⏱ 15 minutes · crossplane, infrastructure-as-code, multi-cloud

Crossplane Kubernetes Infrastructure Management

Manage cloud infrastructure as Kubernetes resources with Crossplane. Provision AWS, GCP, and Azure resources using custom resource

⏱ 15 minutes · genai-perf, benchmarking, vllm

GenAI-Perf Benchmarking LLM Inference on Kubernetes

Benchmark LLM inference performance with NVIDIA GenAI-Perf on Kubernetes. Profile vLLM, TensorRT-LLM, and Triton endpoints with concurrency sweeps, token

⏱ 15 minutes · grafana, prometheus, monitoring

Grafana Kubernetes Monitoring Dashboards Guide

Deploy and configure Grafana dashboards for Kubernetes monitoring including dashboard 6417 for pod metrics, dashboard 315 for cluster overview, and custom

⏱ 15 minutes · helm, sprig, templates

Helm Sprig Functions Complete Reference

Complete reference for Helm Sprig template functions including cat, print, join, toString, add1, trim, quote, default, and more. Examples and common patterns

⏱ 15 minutes · keda, autoscaling, event-driven

KEDA Event-Driven Autoscaling on Kubernetes

Deploy KEDA for event-driven autoscaling on Kubernetes. Scale deployments to zero based on queue depth, HTTP requests, cron schedules, Prometheus

⏱ 15 minutes · audit-logging, security, compliance

Kubernetes Audit Logging Configuration

Configure Kubernetes audit logging to track API requests. Define audit policies, capture who did what and when, send logs to backends like

⏱ 15 minutes · blue-green, canary, deployment-strategy

Kubernetes Blue-Green and Canary Deployment Strategies

Implement blue-green and canary deployment strategies on Kubernetes. Zero-downtime releases using Service label switching, traffic splitting, progressive

⏱ 15 minutes · cronjob, scheduling, concurrency

Kubernetes CronJob ConcurrencyPolicy Guide

Configure Kubernetes CronJob concurrencyPolicy with Allow, Forbid, and Replace options. Control concurrent job execution, prevent overlapping runs, and handle

⏱ 15 minutes · daemonset, scheduling, node-management

Kubernetes DaemonSet One Pod Per Node Guide

Deploy DaemonSets on Kubernetes to run exactly one pod per node. Configure tolerations, node selectors, affinity rules, and resource management

⏱ 15 minutes · efk, elasticsearch, fluentd

Kubernetes EFK Stack Centralized Logging

Deploy the EFK stack (Elasticsearch, Fluentd, Kibana) on Kubernetes for centralized log collection, processing, and visualization. DaemonSet log

⏱ 15 minutes · configmap, environment-variables, envfrom

Kubernetes EnvFrom ConfigMap Environment Variables

Inject all ConfigMap keys as environment variables using envFrom in Kubernetes pods. Configure configMapRef, secretRef, prefix options, and selective key

⏱ 15 minutes · ephemeral-containers, debugging, kubectl-debug

Kubernetes Ephemeral Containers for Debugging

Debug running pods with Kubernetes ephemeral containers. Attach debug containers without restarting pods, troubleshoot distroless images, inspect network

⏱ 15 minutes · finalizers, resource-lifecycle, troubleshooting

Kubernetes Finalizers Explained and Troubleshooting

Understand Kubernetes finalizers for resource cleanup. How finalizers block deletion, common stuck resource scenarios, manual removal

⏱ 15 minutes · graceful-shutdown, pod-lifecycle, termination

Kubernetes Graceful Shutdown and Pod Termination

Implement graceful shutdown for Kubernetes pods. Configure terminationGracePeriodSeconds, preStop hooks, SIGTERM handling, connection

⏱ 15 minutes · gvisor, kata-containers, runtimeclass

Kubernetes gVisor and Kata Containers RuntimeClass

Deploy sandboxed container runtimes on Kubernetes using RuntimeClass with gVisor (runsc) and Kata Containers. Isolate untrusted workloads with kernel-level

⏱ 15 minutes · hpa, autoscaling, prometheus

Kubernetes HPA Custom Metrics Prometheus Adapter

Configure Kubernetes Horizontal Pod Autoscaler with custom Prometheus metrics via the Prometheus Adapter. Scale on request latency, queue depth, GPU

⏱ 15 minutes · imagepullbackoff, troubleshooting, container-registry

Kubernetes ImagePullBackOff Troubleshooting Guide

Debug and fix ImagePullBackOff and ErrImagePull errors in Kubernetes. Resolve authentication failures, registry connectivity, image not found, TLS certificate

⏱ 15 minutes · cert-manager, tls, certificates

Kubernetes Ingress TLS Certificate with cert-manager

Automate TLS certificate management on Kubernetes with cert-manager. Let's Encrypt integration, ClusterIssuer configuration, automatic renewal, wildcard

⏱ 15 minutes · init-containers, pod-lifecycle, patterns

Kubernetes Init Containers Patterns and Examples

Use Kubernetes init containers for pod initialization. Wait for dependencies, clone Git repos, setup configuration, database migrations, certificate

⏱ 15 minutes · kind, local-development, docker

Kubernetes Kind Local Development Cluster

Create local Kubernetes clusters with kind (Kubernetes in Docker). Multi-node clusters, ingress setup, local registry, port mapping, volume mounts, and CI/CD

⏱ 15 minutes · kustomize, configuration, overlays

Kubernetes Kustomize Configuration Management

Manage Kubernetes configurations with Kustomize. Build overlays for multiple environments, patch resources, generate ConfigMaps and Secrets, and integrate

⏱ 15 minutes · labels, annotations, metadata

Kubernetes Labels and Annotations Best Practices

Implement Kubernetes labels and annotations following best practices. Recommended label keys, organizational conventions, selectors, annotations vs labels

⏱ 15 minutes · sidecar, ambassador, adapter

Kubernetes Multi-Container Pod Patterns

Implement multi-container pod patterns in Kubernetes: sidecar for logging and proxying, ambassador for outbound connections, adapter for format

⏱ 15 minutes · namespaces, multi-tenancy, resource-quotas

Kubernetes Namespace Best Practices

Organize Kubernetes clusters with namespace best practices. Separation strategies, resource quotas, network policies, RBAC per namespace, naming

⏱ 15 minutes · networkpolicy, security, zero-trust

Default Deny NetworkPolicy: Zero-Trust Examples

Implement default deny network policies in Kubernetes for zero-trust pod networking. Block all ingress and egress by default, then allow only required traffic

⏱ 15 minutes · oomkilled, troubleshooting, memory

Kubernetes OOMKilled Troubleshooting and Prevention

Debug and prevent OOMKilled container terminations in Kubernetes. Understand memory limits, diagnose memory leaks, configure resource requests, and implement

⏱ 15 minutes · pdb, high-availability, disruption

Kubernetes Pod Disruption Budget PDB Guide

Protect application availability with Kubernetes PodDisruptionBudgets. Configure minAvailable and maxUnavailable for voluntary disruptions like node

⏱ 15 minutes · priority, preemption, scheduling

Kubernetes Pod Priority and Preemption

Configure pod priority and preemption in Kubernetes for critical workloads. PriorityClass definitions, preemption behavior, protecting system

⏱ 15 minutes · rate-limiting, gateway-api, ingress

Kubernetes Rate Limiting with Gateway API

Implement rate limiting for Kubernetes services using Gateway API, Istio, Kong, NGINX, and Envoy. Protect APIs from abuse

⏱ 15 minutes · secrets, security, external-secrets

Kubernetes Secrets Management Best Practices

Manage Kubernetes Secrets securely with best practices. External Secrets Operator, sealed secrets, RBAC restrictions, encryption at rest, secret

⏱ 15 minutes · services, networking, loadbalancer

Kubernetes Service Types LoadBalancer ClusterIP NodePort

Understand Kubernetes Service types: ClusterIP, NodePort, LoadBalancer, and ExternalName. When to use each type, configuration examples, and traffic routing

⏱ 15 minutes · statefulset, headless-service, persistent-storage

Kubernetes StatefulSet Headless Service Guide

Deploy stateful applications with Kubernetes StatefulSets. Stable network identity, ordered deployment, persistent storage per pod, headless services

Kubernetes Taints and Tolerations Node Scheduling

Control pod scheduling with Kubernetes taints and tolerations. Dedicate nodes to specific workloads, prevent scheduling on control-plane nodes, implement GPU

⏱ 15 minutes · vpa, autoscaling, resource-management

Kubernetes Vertical Pod Autoscaler VPA Guide

Deploy and configure the Vertical Pod Autoscaler (VPA) on Kubernetes. Auto-adjust CPU and memory requests based on actual usage, right-size

⏱ 15 minutes · linkerd, service-mesh, mtls

Kubernetes Linkerd Service Mesh mTLS Guide

Deploy Linkerd service mesh on Kubernetes for automatic mTLS, traffic observability, and reliability features. Zero-config encryption, per-route

⏱ 15 minutes · nccl, gpu, rdma

NCCL Environment Variables Complete Reference

Complete reference for NCCL environment variables on Kubernetes. Configure network transport, InfiniBand, GPUDirect RDMA, socket

⏱ 15 minutes · openshift, lifecycle, support

OpenShift Support Lifecycle and Version Matrix

OpenShift Container Platform support lifecycle, version EOL dates, Kubernetes version mapping, upgrade paths, and Extended Update Support (EUS). Plan upgrades

⏱ 15 minutes · velero, backup, disaster-recovery

Velero Kubernetes Backup and Disaster Recovery

Deploy Velero for Kubernetes cluster backup and disaster recovery. Configure scheduled backups, restore namespaces, migrate workloads between

⏱ 15 minutes · volcano, gang-scheduling, batch

Kubernetes Volcano Batch Scheduler Gang Scheduling

Deploy Volcano batch scheduler for gang scheduling on Kubernetes. Configure minAvailable for all-or-nothing pod group scheduling, queue management, and GPU job

⏱ 15 minutes · nccl, rccl, gpu

NCCL and RCCL Networking Performance on Kubernetes

Optimize NCCL (NVIDIA) and RCCL (AMD) collective communication performance on Kubernetes GPU clusters. Network transport selection, bandwidth tuning, latency

⏱ 15 minutes · wandb, mlops, experiment-tracking

Weights and Biases Experiment Tracking on Kubernetes

Deploy Weights & Biases (W&B) on Kubernetes for ML experiment tracking, model registry, and hyperparameter sweeps. Self-hosted W&B Server, agent-based

⏱ 15 minutes · leaderworkerset, disaggregated-inference, llm-d

Integrate DisaggregatedSet with llm-d on Kubernetes

Deploy disaggregated LLM inference using DisaggregatedSet and llm-d on Kubernetes. Install LWS then DS controller, model prefill/decode roles, wire llm-d

⏱ 15 minutes · leaderworkerset, disaggregated-inference, llm

DisaggregatedSet for Multi-Role LLM Inference

Deploy disaggregated LLM inference on Kubernetes with DisaggregatedSet and LeaderWorkerSet. Separate prefill and decode phases across GPU pools

⏱ 15 minutes · openshift, disconnected, registry

Mirror OpenShift Releases to Disconnected Registry

Mirror OCP release images to an air-gapped Quay registry using oc adm release mirror. Auth setup, proxy config, ImageDigestMirrorSet, and disconnected updates.

⏱ 15 minutes · nccl, gpu, nvidia

NCCL Topology Dump and Tuning on Kubernetes

Use NCCL_TOPO_DUMP_FILE to export and inject GPU topology on Kubernetes for reproducible distributed training performance. Topology XML caching, environment

⏱ 15 minutes · security, container-images, trivy

Container Image Security Scanning on Kubernetes

Implement container image security scanning in Kubernetes CI/CD pipelines. Trivy, Grype, and admission controllers to prevent vulnerable images from running.

⏱ 15 minutes · cosign, sigstore, supply-chain-security

Container Image Signing and Verification on Kubernetes

Sign container images with Sigstore cosign and verify signatures at admission time with Kyverno or Connaisseur. Supply chain security for Kubernetes

⏱ 15 minutes · hermes, ai-agent, nous-research

Hermes Agent Self-Hosted AI on Kubernetes

Deploy Hermes Agent (Nous Research) on Kubernetes as a persistent self-hosted AI agent with memory, automated skill creation, multi-platform

⏱ 15 minutes · container-images, performance, caching

Image Pull Optimization for Kubernetes

Optimize container image pull performance in Kubernetes. Layer caching, pre-pulling with DaemonSets, image streaming, lazy pulling with stargz/nydus, registry

⏱ 15 minutes · multi-arch, container-images, buildx

Multi-Architecture Container Images for Kubernetes

Build and deploy multi-architecture container images for mixed Kubernetes clusters. Docker buildx, manifest lists, image indexes, platform-aware

⏱ 15 minutes · nvidia, cns, insight

NVIDIA CNS with Insight Operator for Network Diagnostics

Deploy NVIDIA Cloud-Native Stack (CNS) with the Insight Operator and NVIDIA Insight tools for deep GPU fabric diagnostics. Collect NIC firmware health, link

⏱ 15 minutes · nvidia, doca, telemetry

NVIDIA DOCA Telemetry for Network Monitoring on Kubernetes

Deploy NVIDIA DOCA Telemetry Service (DTS) to collect real-time network metrics from BlueField DPUs and ConnectX NICs. Export RoCE counters, port

⏱ 15 minutes · nvidia-dynamo, inference-optimization, production

NVIDIA Dynamo Production Tuning on Kubernetes

Tune NVIDIA Dynamo for production LLM inference: prefill/decode pool sizing, KV cache transfer optimization, NCCL backend selection, SLA-driven autoscaling

⏱ 15 minutes · nvidia, openshell, agents

NVIDIA OpenShell Sandboxed AI Agent Runtime on Kubernetes

Deploy NVIDIA OpenShell on Kubernetes for safe, private autonomous AI agent execution. Declarative YAML network policies, sandboxed containers

⏱ 15 minutes · nvidia, nsight, profiling

NVIDIA Nsight Operator for GPU Profiling on Kubernetes

Deploy NVIDIA Nsight Systems and Nsight Compute on Kubernetes for GPU workload profiling. Capture kernel traces, memory bandwidth, SM occupancy, and NCCL

⏱ 15 minutes · oci, container-images, registry

OCI Container Image Internals on Kubernetes

Understand OCI container image internals: layers as tar archive diffs, image configuration JSON, content-addressable storage with SHA-256, multi-platform image

⏱ 15 minutes · openshift, cluster-update, cvo

OpenShift Cluster Update Process Explained

Complete guide to OpenShift Container Platform cluster updates. CVO workflow, Runlevels, Machine Config Operator node updates, update channels

⏱ 15 minutes · poolside, foundation-models, agents

Poolside AI Foundation Models on Kubernetes

Deploy Poolside AI foundation models for enterprise software agents on Kubernetes. On-prem and VPC deployment, multi-agent orchestration, sandboxed

⏱ 15 minutes · registry, oci, container-images

Private Container Registry on Kubernetes

Deploy a private OCI container registry on Kubernetes with persistent storage, TLS, authentication, garbage collection, and high availability. Self-hosted

⏱ 15 minutes · red-hat, openshift, ai-studio

Red Hat AI Studio on OpenShift

Deploy Red Hat AI Studio on OpenShift for end-to-end LLM development. Model catalog, InstructLab fine-tuning, experiment tracking, model

⏱ 15 minutes · tabnine, code-assistant, enterprise-ai

Tabnine AI Code Assistant Self-Hosted on Kubernetes

Deploy Tabnine Enterprise self-hosted on Kubernetes for private AI code completion and chat. On-prem model serving, multi-model support (Tabnine

⏱ 15 minutes · gateway-api, canary, traffic-splitting

Canary Deployment with Gateway API Traffic Splitting

Implement canary deployments using Kubernetes Gateway API HTTPRoute traffic splitting. Gradually shift traffic from stable to canary version with weight-based

⏱ 15 minutes · fio, csi, storage

Validate CSI Storage Performance with FIO Kubernetes Job

Benchmark CSI storage performance using FIO inside a Kubernetes Job. Create a PVC backed by a CSI StorageClass, run sequential/random read/write

⏱ 15 minutes · emptydir, volumes, cka

emptyDir Volumes: Sharing, Lifecycle, and Memory-Backed

Master emptyDir volumes for CKA/CKAD exam prep. Share data between containers, understand volume lifecycle across restarts vs Pod deletion, and configure

⏱ 15 minutes · chaos-engineering, chaos-mesh, fault-injection

Chaos Mesh Fault Injection on Kubernetes

Deploy Chaos Mesh for chaos engineering on Kubernetes. Covers PodChaos, NetworkChaos, IOChaos, StressChaos experiments, scheduling, RBAC

⏱ 15 minutes · gpudirect, storage, nvidia

GPUDirect Storage on Kubernetes

Configure NVIDIA GPUDirect Storage (GDS) for direct data path between NVMe/NFS storage and GPU memory bypassing CPU. Covers Magnum IO, cuFile API, GDS driver

⏱ 15 minutes · infiniband, opensm, subnet-manager

InfiniBand Subnet Manager OpenSM on Kubernetes

Deploy and manage InfiniBand Subnet Manager (OpenSM) on Kubernetes for GPU cluster fabric management. Covers SM architecture, UFM integration, partition

⏱ 15 minutes · chaos-engineering, litmus, resilience

LitmusChaos Engineering on Kubernetes

Deploy LitmusChaos for resilience testing on Kubernetes. Covers ChaosEngine, ChaosExperiment, ChaosResult CRDs, built-in experiments, GameDay planning, Litmus

⏱ 15 minutes · nmstate, bonding, vlan

NMState Network Config for GPU Worker Nodes

Declaratively configure Ethernet bonding, VLANs, MTU, and static routes on GPU worker nodes using NMState on OpenShift. Covers bonding modes, LACP

⏱ 15 minutes · nvidia-peermem, gpu-direct, rdma

NVIDIA PeerMem for GPU-Direct RDMA

Install and configure nvidia_peermem kernel module to enable GPU-Direct RDMA between NVIDIA GPUs and Mellanox RDMA NICs. Covers module

⏱ 15 minutes · multus, cni, openshift

OpenShift Multus CNI Multiple Network Interfaces

Attach multiple network interfaces to Pods using Multus CNI on OpenShift. Covers NetworkAttachmentDefinitions, SR-IOV, macvlan, IPVLAN, traffic separation

⏱ 15 minutes · roce, pfc, ecn

RoCE PFC and ECN Lossless Ethernet for GPU Clusters

Configure RoCE v2 with Priority Flow Control (PFC) and ECN for lossless Ethernet RDMA on GPU clusters. Covers DSCP mapping, switch configuration, NIC

⏱ 15 minutes · strimzi, kafka, operator

Strimzi Kafka Operator on Kubernetes

Deploy Apache Kafka on Kubernetes with Strimzi operator. Covers Kafka CR, KafkaTopic, KafkaUser, KafkaConnect, KafkaBridge, rack awareness, storage

⏱ 15 minutes · acs, pcie, gpu-direct

Disable PCIe ACS for GPU-Direct P2P

Disable PCIe Access Control Services (ACS) to enable GPU-Direct peer-to-peer DMA between GPUs and RDMA NICs. Covers BIOS disable, kernel override, and when

⏱ 15 minutes · infiniband, ethernet, mellanox

Dual-Fabric Mellanox: GPU InfiniBand + Storage Ethernet

Design and configure dual-fabric network architecture with separate Mellanox NICs for GPU communication (InfiniBand) and storage traffic (Ethernet). Covers

⏱ 15 minutes · iommu, nccl, gpu-direct

IOMMU BIOS and Kernel Config for NCCL GPU-Direct

Configure IOMMU at BIOS and kernel level to enable NCCL GPU-Direct RDMA on Kubernetes. Covers Intel VT-d, AMD-Vi, kernel parameters, passthrough

⏱ 15 minutes · nccl, pxn, nvlink

NCCL PXN Cross-NIC Communication via NVLink

Configure NCCL PXN (PCIe cross-NIC via NVLink) for multi-node GPU training where not every GPU has a direct RDMA NIC. Covers topology

⏱ 15 minutes · nv-ipam, ipam, gpu-fabric

NVIDIA IPAM for GPU Fabric IP Address Allocation

Configure nv-ipam (NVIDIA IPAM) to assign IP addresses on GPU fabric SR-IOV networks in Kubernetes. Covers IPPool CRDs, per-node allocation, InfiniBand IPoIB

⏱ 15 minutes · sriov, mmio, mellanox

Fix SR-IOV 'Not Enough MMIO Resources' Error

Resolve the mlx5_core 'not enough MMIO resources for SR-IOV' error on OpenShift nodes with Mellanox ConnectX NICs. Covers BIOS settings, PCIe BAR

⏱ 15 minutes · runai, sriov, rdma

Run:ai Distributed Inference with SR-IOV RDMA

Deploy distributed vLLM inference on Run:ai using SR-IOV RDMA for NCCL inter-node communication. Covers extended-resource for Mellanox VFs, network annotation

⏱ 15 minutes · runai, vllm, nccl

Run:ai Distributed Inference with vLLM and NCCL

Deploy distributed LLM inference on Run:ai with vLLM tensor parallelism across multiple workers. Covers multi-node GPU splitting, NCCL configuration, PVC model

⏱ 15 minutes · sriov, virtual-function, containers

SR-IOV VF to Container Mapping and Lifecycle

How SR-IOV Virtual Functions are mapped to containers in Kubernetes. Covers VF allocation flow, link state management (VFs are down when unassigned), device

⏱ 15 minutes · virtualization, iommu, sriov

VT-x vs VT-d vs SR-IOV Explained

Understand the difference between CPU virtualization (VT-x/SVM), I/O virtualization (VT-d/AMD-Vi/IOMMU), and SR-IOV. Which to enable or disable for GPU

⏱ 15 minutes · vllm, nccl, debugging

Debug Distributed vLLM Inference with NCCL Verbose Logging

Debug distributed vLLM inference using NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL. Covers air-gapped deployment with TRANSFORMERS_OFFLINE, interpreting NCCL

⏱ 15 minutes · ai-infrastructure, scaling, inference

Kubernetes AI Infrastructure Scaling

Scale AI inference infrastructure on Kubernetes from 10K to 100K requests per second. Covers latency optimization, horizontal scaling, caching

⏱ 15 minutes · ai-search, llms-txt, rag

Kubernetes for AI Search and Discoverability

Deploy AI-searchable services on Kubernetes: llms.txt implementation, RAG-optimized APIs, structured data for AI chatbots, and infrastructure patterns

⏱ 15 minutes · serviceaccount, rbac, security

ServiceAccount for Running Pods

Configure Kubernetes ServiceAccounts for Pods: token mounting, RBAC permissions, workload identity, automountServiceAccountToken control, and least-privilege

⏱ 15 minutes · sriov, rdma, infiniband

OpenShift SR-IOV RDMA InfiniBand Device Plugin

Configure and troubleshoot SR-IOV Network Operator with Mellanox ConnectX RDMA InfiniBand devices on OpenShift. Covers SriovNetworkNodePolicy, device

⏱ 15 minutes · openshift, user-management, rbac

OpenShift User Account Management

Manage user accounts in OpenShift: create users, assign roles, configure identity providers, manage groups, and implement RBAC for multi-tenant clusters.

⏱ 15 minutes · cost-optimization, finops, autoscaling

Kubernetes Cost Optimization Strategies

Comprehensive cost reduction strategies for Kubernetes clusters: right-sizing, spot instances, autoscaling, idle resource detection, namespace budgets, and GPU

⏱ 15 minutes · ephemeral-containers, debugging, kubectl-debug

Ephemeral Containers for Live Debugging

Use kubectl debug with ephemeral containers to troubleshoot running Pods without restarting them. Attach debugging tools to distroless containers, inspect

⏱ 15 minutes · goldilocks, vpa, cost-optimization

Goldilocks VPA Dashboard for Resource Optimization

Deploy Goldilocks to visualize VPA recommendations across all workloads and identify over-provisioned or under-provisioned containers with actionable

⏱ 15 minutes · pdb, availability, disruption

Pod Disruption Budget (PDB) Production Guide

Configure Pod Disruption Budgets to protect application availability during voluntary disruptions: node drains, cluster upgrades, and autoscaler scale-downs.

⏱ 15 minutes · vpa, autoscaling, resource-management

Vertical Pod Autoscaler (VPA) Guide

Configure Kubernetes Vertical Pod Autoscaler to automatically right-size container CPU and memory requests based on actual usage. Covers

⏱ 15 minutes · kyverno, supply-chain, ai-security

Kyverno AI Workload Provenance Verification

Use Kyverno to verify software and content provenance for AI workloads: SBOM validation, model signing with Sigstore, dataset integrity, and supply chain

⏱ 15 minutes · kyverno, cel, policy

Kyverno CEL Policy Model Migration

Migrate Kyverno policies from YAML-based rules to CEL expressions for type-safe, performant validation. Covers CEL syntax, migration patterns, and comparison

⏱ 15 minutes · kyverno, gitops, argocd

Kyverno Drift Prevention for GitOps

Prevent configuration drift in GitOps workflows using Kyverno: block manual kubectl edits, enforce ArgoCD/Flux ownership, and detect out-of-band changes

⏱ 15 minutes · kyverno, compliance, iso27001

Kyverno ISO 27001 Compliance Policies

Implement ISO 27001 and BSI IT-Grundschutz security controls in Kubernetes using Kyverno policies: access control, cryptography, operations security, and audit

⏱ 15 minutes · kyverno, llm, inference

Kyverno LLM Inference Cost and Security Guardrails

Implement policy-as-code guardrails for LLM inference workloads with Kyverno: GPU quota enforcement, model size limits, cost controls, prompt injection

⏱ 15 minutes kyverno rbac multi-tenancy

Kyverno ReBAC Multi-Tenant RBAC Automation

Implement Relationship-Based Access Control (ReBAC) with Kyverno to automate multi-tenant RBAC at scale: dynamic RoleBindings, namespace

⏱ 15 minutes kyverno webhook admission-control

Kyverno Webhook Topology and Admission Latency

Optimize Kyverno webhook topology for minimal admission latency: webhook configuration tuning, failure policies, timeout settings, and lessons from migrating

⏱ 15 minutes openshift oc-cp file-transfer

OpenShift oc cp File Copy Guide

Use oc cp to copy files and directories between a local machine and Pods. Covers tar-based transfer, container selection, large file handling, and comparison

⏱ 15 minutes openshift oc-rsync file-transfer

OpenShift oc rsync File Transfer

Use oc rsync to copy files between a local machine and Pods in OpenShift. Covers upload, download, live sync, filtering, and common patterns for debugging

⏱ 15 minutes training datasets storage

Deep Learning with Large Datasets on K8s

Optimize deep learning training with large datasets on Kubernetes. Covers data loading, caching strategies, parallel prefetch, and storage architecture

⏱ 15 minutes inference multi-gpu distributed

Distributed Multi-GPU Inference on Kubernetes

Deploy distributed inference across multiple GPUs and nodes on Kubernetes. Covers tensor parallelism, pipeline parallelism, vLLM, and NIM multi-GPU serving.

⏱ 15 minutes secrets security openshift

External Secrets Operator on OpenShift

Manage Kubernetes secrets from external vaults using External Secrets Operator on OpenShift. Covers ExternalSecret CRD, SecretStore configuration, and GitOps

⏱ 15 minutes benchmarking storage nfs

PScale NFS and SMB Storage Benchmarking

Benchmark NFS and SMB storage performance on Kubernetes using fio clients in Pods. Covers multi-client parallel testing, bandwidth measurement, and IOPS

⏱ 15 minutes fsdp lora fine-tuning

FSDP LoRA Fine-Tuning LLMs on Kubernetes

Fine-tune large language models with FSDP and LoRA on Kubernetes. Covers memory-efficient loading, checkpoint strategies, and multi-node H200 training.

⏱ 15 minutes benchmarking inference nvidia

NVIDIA GenAI-Perf Inference Benchmarking

Benchmark LLM inference throughput and latency on Kubernetes using NVIDIA GenAI-Perf. Covers vLLM, Run:ai, concurrency testing, and multi-location client runs.

⏱ 15 minutes inference distributed lws

LeaderWorkerSet Multi-Node Inference on K8s

Deploy multi-node distributed inference using LeaderWorkerSet (LWS) operator on Kubernetes. Covers vLLM pipeline parallelism across nodes for 405B+ parameter

⏱ 15 minutes fsdp lora mistral

Mistral FSDP LoRA Complete Accelerate Config

Complete accelerate FSDP configuration for fine-tuning Mistral-Small-4 11B with LoRA on multi-GPU H200 clusters. Covers every FSDP2 setting with explanations.

⏱ 15 minutes training distributed multi-node

Multi-Node Distributed Training on Kubernetes

Run distributed deep learning training across multiple GPU nodes on Kubernetes. Covers PyTorch DDP, DeepSpeed, Horovod, and MPI jobs with NCCL optimization.

⏱ 15 minutes benchmarking nvidia gds

NVIDIA GPUDirect Storage Benchmark on K8s

Benchmark NVIDIA GPUDirect Storage (GDS) on Kubernetes for direct NVMe-to-GPU data transfers. Covers gdsio, gds_stats, performance validation, and comparison

⏱ 15 minutes nvidia gpu-operator openshift

NVIDIA GPU Operator GitOps on OpenShift

Deploy NVIDIA GPU Operator on OpenShift via GitOps with ArgoCD. Covers ClusterPolicy configuration, DCGM exporter, drain settings, tolerations, and rolling

⏱ 15 minutes nvidia network-operator rdma

NVIDIA Network Operator NicClusterPolicy

Deploy NVIDIA Network Operator on OpenShift with NicClusterPolicy for DOCA telemetry, NIC feature discovery, RDMA IPAM, and OFED drivers. GitOps-managed

⏱ 15 minutes openshift gpu capacity-planning

OpenShift GPU Node Resource Planning

Plan CPU, memory, and overhead budgets for GPU nodes running NVIDIA GPU Operator, Network Operator, Run:ai, and OpenShift infrastructure Pods. Understand what

⏱ 15 minutes runai openshift architecture

Run:ai Backend Architecture on OpenShift

Understand the full Run:ai backend deployment on OpenShift with 40+ microservices including Keycloak, PostgreSQL, NATS, Thanos, Traefik, and workload

⏱ 15 minutes runai openshift distributed

Run:ai Distributed PyTorch Training on OpenShift

Submit multi-node distributed PyTorch training jobs on OpenShift using Run:ai CLI. Covers DDP, FSDP, RDMA networking, and GPU scheduling.

⏱ 15 minutes runai distributed-training fsdp

FSDP Distributed Training on Run:ai

Run PyTorch FSDP distributed training workloads on Run:ai with GPU scheduling, event tracking, and GPU memory monitoring. Covers Mistral-class model

⏱ 15 minutes runai dcgm thanos

Run:ai GPU Metrics Pipeline with DCGM and Thanos

End-to-end GPU metrics pipeline on Run:ai: DCGM exporter collects GPU utilization, Prometheus scrapes, remote-writes to Thanos Receive, and Grafana dashboards

⏱ 15 minutes runai keycloak sso

Run:ai Keycloak SSO Authentication Setup

Configure Run:ai SSO authentication with Keycloak on OpenShift: OIDC integration, user federation, role mapping, and troubleshooting login failures.

⏱ 15 minutes runai opentelemetry observability

Run:ai Observability with OpenTelemetry

Configure Run:ai observability on OpenShift with OpenTelemetry Collector, Prometheus receivers, metrics enrichment, OAuth2 export, and GPU metric collection

⏱ 15 minutes runai architecture openshift

Run:ai Platform Backend Components

Overview of Run:ai backend StatefulSets and components on OpenShift: Thanos receive/query, Keycloak, NATS, Redis, PostgreSQL, workload controllers, and their

⏱ 15 minutes runai training gpu

Run:ai Training Job Submit Script Pattern

Production pattern for submitting Run:ai training jobs via shell scripts with GPU fractional allocation, NFS mounts, custom Python environments, and private

⏱ 15 minutes runai openshift controllers

Run:ai Workload Controllers on OpenShift

Understand Run:ai cluster-level workload controllers on OpenShift: workload-controller, workload-overseer, workload-exporter, and status-updater components.

⏱ 15 minutes thanos memory capacity-planning

Thanos Receive Memory Sizing Guide

Calculate correct memory limits for Thanos Receive based on WAL segments, active series, retention, and ingestion rate. Prevent OOMKill crash loops

⏱ 15 minutes thanos oom crashloopbackoff

Thanos Receive OOMKilled CrashLoopBackOff

Debug and fix Thanos Receive StatefulSet OOMKilled CrashLoopBackOff caused by WAL replay exceeding memory limits. Covers ArgoCD conflict resolution, liveness

⏱ 15 minutes thanos runai oomkilled

Fix Thanos Receive OOMKilled in Run:ai

Troubleshoot and fix Thanos Receive OOMKilled (exit code 137) with 143+ restarts in Run:ai backend on OpenShift. Covers memory tuning, TSDB

⏱ 15 minutes security cve linux-kernel

CVE-2026-31431 Linux Kernel Crypto Fix

Security advisory for CVE-2026-31431: Linux kernel crypto algif_aead vulnerability. Impact on Kubernetes nodes and how to patch container host kernels.

⏱ 15 minutes kubernetes-1.36 rbac security

Kubernetes 1.36 Constrained Impersonation

Use constrained impersonation in Kubernetes 1.36 to limit which identities a user can impersonate. Tighter RBAC control for multi-tenant clusters.

⏱ 15 minutes kubernetes-1.36 csi snapshots

Kubernetes 1.36 CSI Differential Snapshots

Use CSI differential snapshots in Kubernetes 1.36 to track changed blocks between snapshots. Enables incremental backups and faster disaster recovery.

⏱ 15 minutes kubernetes-1.36 api validation

Kubernetes 1.36 Declarative Type Validation

Kubernetes 1.36 introduces declarative validation for native API types using validation-gen. Replaces hand-written validation code with struct tag annotations.

⏱ 15 minutes kubernetes-1.36 dra gpu

Kubernetes 1.36 DRA for GPU and TPU Management

Use Dynamic Resource Allocation in Kubernetes 1.36 for advanced GPU/TPU management with partitionable devices, device taints, and tolerations.

⏱ 15 minutes kubernetes-1.36 service-accounts security

Kubernetes 1.36 External SA Token Signing

Delegate ServiceAccount token signing to external KMS or HSM systems in Kubernetes 1.36. Improve security with hardware-backed key management.

⏱ 15 minutes kubernetes-1.36 deprecation networking

Migrate from externalIPs in Kubernetes 1.36

Service externalIPs are deprecated in Kubernetes 1.36 due to CVE-2020-8554. Migrate to Gateway API, LoadBalancer services, or MetalLB for external access.

⏱ 15 minutes kubernetes-1.36 scheduling gang-scheduling

Kubernetes 1.36 Gang Scheduling

Use gang scheduling in Kubernetes 1.36 to schedule Pod groups atomically. Essential for distributed ML training, MPI jobs, and Spark workloads.

⏱ 15 minutes kubernetes-1.36 migration volumes

Migrate from gitRepo Volume in Kubernetes 1.36

The gitRepo volume plugin is permanently removed in Kubernetes 1.36. Migrate to init containers or OCI volumes to avoid broken deployments.

⏱ 15 minutes kubernetes-1.36 high-availability control-plane

Kubernetes 1.36 Graceful Leader Transition

Configure graceful leader transitions in Kubernetes 1.36 control plane components. Eliminate brief outages during leader election failovers.

⏱ 15 minutes kubernetes-1.36 cpu-manager performance

Kubernetes 1.36 L3 Cache Topology in CPU Manager

Configure L3 cache topology awareness in Kubernetes 1.36 CPU Manager. Allocate CPUs sharing L3 cache for better performance in latency-sensitive workloads.

⏱ 15 minutes kubernetes-1.36 memory cgroups

Kubernetes 1.36 Memory QoS with cgroups v2

Configure memory quality of service with cgroups v2 in Kubernetes 1.36. Set memory.min and memory.high for guaranteed memory and throttling before OOM kills.

⏱ 15 minutes kubernetes-1.36 api-server upgrades

Kubernetes 1.36 Mixed Version Proxy

Use the Mixed Version Proxy in Kubernetes 1.36 to handle API version skew during rolling upgrades. Ensures API availability across mixed control plane versions.

⏱ 15 minutes kubernetes-1.36 prometheus metrics

Kubernetes 1.36 Native Histogram Metrics

Enable Prometheus native histograms in Kubernetes 1.36 for higher-resolution metrics with lower storage cost. Covers all control plane components.

⏱ 15 minutes kubernetes-1.36 oci volumes

Kubernetes 1.36 OCI Volume Source

Use OCI VolumeSource in Kubernetes 1.36 to pull OCI artifacts directly into Pod volumes. No init containers needed for ML models, configs, or data.

⏱ 15 minutes kubernetes-1.36 security mtls

Kubernetes 1.36 Pod Certificates (mTLS)

Use Pod Certificates in Kubernetes 1.36 to authenticate Pods to the API server via mTLS. Built-in X.509 certificate provisioning without external tools.

⏱ 15 minutes kubernetes-1.36 resources pods

Kubernetes 1.36 Pod-Level Resource Limits

Set resource requests and limits at the Pod level in Kubernetes 1.36 instead of per-container. Simplifies multi-container Pod resource management.

⏱ 15 minutes kubernetes-1.36 machine-learning gpu

Kubernetes 1.36 RestartAllContainers for ML

Use the RestartAllContainers policy in Kubernetes 1.36 to restart all Pod containers in-place when a worker fails, avoiding costly ML training rescheduling.

⏱ 15 minutes kubernetes-1.36 selinux security

Kubernetes 1.36 SELinux Mount-Time Labeling

Configure SELinux mount-time volume labeling in Kubernetes 1.36 to eliminate slow recursive relabeling and speed up Pod startup times dramatically.

⏱ 15 minutes kubernetes-1.36 kubectl websockets

Kubernetes 1.36 SPDY to WebSocket Migration

Kubernetes 1.36 continues migrating kubectl exec/attach/port-forward from SPDY to WebSockets. Understand the changes and troubleshoot connection issues.

⏱ 15 minutes kubernetes-1.36 debugging control-plane

Kubernetes 1.36 Statusz and Flagz Endpoints

Use /statusz and /flagz debug endpoints in Kubernetes 1.36 control plane components. Inspect runtime status and effective flag values without log parsing.

⏱ 15 minutes kubernetes-1.36 scheduling topology

Kubernetes 1.36 Topology-Aware Scheduling

Use topology-aware workload scheduling in Kubernetes 1.36 to place Pods on nodes with optimal GPU, NUMA, and network topology for ML training.

⏱ 15 minutes kubernetes-1.36 storage snapshots

Kubernetes 1.36 VolumeGroupSnapshot GA

Use VolumeGroupSnapshot in Kubernetes 1.36 to take crash-consistent snapshots of multiple volumes atomically. Now GA and production-ready.

⏱ 15 minutes kubernetes-1.36 user-namespaces security

Kubernetes 1.36 User Namespaces in Pods

Enable user namespaces in Kubernetes 1.36 for rootless containers and stronger Pod isolation. Map container root to unprivileged host UIDs.

⏱ 15 minutes cilium ebpf cni

Cilium: eBPF-Powered K8s Networking

Deploy Cilium CNI in Kubernetes for eBPF-based networking, network policies, service mesh, and observability with Hubble.

⏱ 12 minutes keda autoscaling event-driven

KEDA: Event-Driven Autoscaling for K8s

Scale Kubernetes workloads with KEDA based on events from Kafka, RabbitMQ, AWS SQS, Prometheus metrics, and cron schedules.

⏱ 15 minutes knative serverless scale-to-zero

Knative: Serverless Workloads on Kubernetes

Run serverless containers with Knative Serving and Eventing on Kubernetes. Auto-scaling to zero, traffic splitting, revision management.

⏱ 10 minutes nats messaging pub-sub

NATS: Lightweight Messaging for Kubernetes

Deploy NATS messaging in Kubernetes for pub/sub, request/reply, and JetStream persistent streaming. High-performance alternative to Kafka for cloud-native mi...

⏱ 12 minutes spiffe spire identity

SPIFFE/SPIRE: Workload Identity for K8s

Deploy SPIRE for Kubernetes workload identity using SPIFFE standards. Automatic mTLS certificate issuance, cross-cluster identity federation.

⏱ 15 minutes nvidia gpu scheduling

NVIDIA GPU Feature Discovery for Kubernetes

Deploy GPU Feature Discovery (GFD) to auto-label Kubernetes nodes with GPU model, MIG capability, CUDA version, and driver info for intelligent scheduling.

⏱ 15 minutes openshift nvidia mig

OpenShift NVIDIA MIG Reconfiguration Without Reboot

Reconfigure NVIDIA MIG geometry on OpenShift without rebooting nodes. Use nvidia-mig-manager with node labels to dynamically switch GPU partitions.

⏱ 15 minutes talos nvidia mig

Talos Linux MIG Configuration with GPU Operator

Configure NVIDIA MIG on Talos Linux Kubernetes clusters. Install GPU Operator, set MIG strategy, and dynamically partition A100 GPUs without node reboot.

⏱ 15 minutes nvidia dgx h100

DGX H100 nvidia-smi topo -m Guide

Read nvidia-smi topo -m output on DGX H100 systems. Understand NVLink, NVSwitch, PCIe topology, GPU-to-GPU bandwidth, and NUMA affinity for Kubernetes.

⏱ 15 minutes nvidia gpu-operator prometheus

GPU Operator Node Status Exporter Metrics

Monitor NVIDIA GPU Operator node validation with gpu_operator_node_driver_ready and status exporter metrics. Prometheus alerts for GPU node health.

⏱ 15 minutes grafana prometheus monitoring

Grafana Dashboard 6417 Kubernetes Pods

Import Grafana dashboard 6417 for Kubernetes pod monitoring. Configure Prometheus data source, visualize CPU, memory, network, and disk usage per pod.

⏱ 10 minutes helm charts deployment

Helm Install: Deploy Charts Guide

Install Helm charts on Kubernetes with helm install, upgrade, rollback, and values customization. Repository management, OCI registries, and release lifecycle.

⏱ 20 minutes kata-containers runtimeclass security

Kata Containers RuntimeClass Kubernetes

Deploy Kata Containers with Kubernetes RuntimeClass for hardware-isolated pods. VM-based sandboxing, microVM configuration, and multi-runtime clusters.

⏱ 8 minutes kubectl configuration gitops

kubectl apply vs create: Key Differences

Understand when to use kubectl apply vs kubectl create. Declarative vs imperative, last-applied annotation, server-side apply, and GitOps workflows.

⏱ 15 minutes kubectl cheat-sheet cka

kubectl Cheat Sheet: Essential Commands

Complete kubectl cheat sheet with essential commands for pods, deployments, services, debugging, and cluster management. Copy-paste ready examples.

⏱ 8 minutes kubectl troubleshooting events

kubectl describe: Read Pod Events Guide

Use kubectl describe pod to read events, conditions, and container states. Diagnose scheduling failures, image pulls, crashes, and probe failures.

⏱ 8 minutes kubectl troubleshooting debugging

kubectl exec: Run Commands in Pods

Use kubectl exec to run commands inside running pods. Interactive shell, multi-container pods, debugging techniques, and security considerations.

⏱ 10 minutes kubectl pods cka

kubectl get pods: Output Formats Guide

Master kubectl get pods with output formats, label selectors, field selectors, and custom columns. Wide output, JSON, YAML, and jsonpath examples.

⏱ 8 minutes kubectl pods cka

kubectl run: Create Pod from Command Line

Use kubectl run to create pods and deployments from the command line. Dry-run output, resource limits, environment variables, and CKA exam patterns.

⏱ 15 minutes admission-webhooks security policy

K8s Admission Webhooks: Validate and Mutate

Build Kubernetes validating and mutating admission webhooks. Webhook configuration, TLS setup, failure policies, and common patterns for policy enforcement.

⏱ 6 minutes kubectl api reference

kubectl explain: API Resource Reference

Use kubectl explain and api-resources to discover Kubernetes API objects. Field documentation, resource versions, short names, and API group exploration.

⏱ 12 minutes argo-workflows ci-cd pipelines

Argo Workflows: K8s-Native Pipeline Engine

Run CI/CD pipelines and data workflows with Argo Workflows in Kubernetes. DAG workflows, artifact passing, retry strategies.

⏱ 15 minutes argocd gitops ci-cd

ArgoCD GitOps: Declarative Continuous Delivery

Deploy applications with ArgoCD GitOps in Kubernetes. Application sync, auto-heal, multi-cluster management, ApplicationSets, and Helm/Kustomize integration.

⏱ 12 minutes audit security logging

K8s Audit Logging: Track API Activity

Configure Kubernetes audit logging to track API requests. Audit policy levels, log backends, webhook integration, and security compliance monitoring.

⏱ 15 minutes backstage developer-portal platform-engineering

Backstage: K8s Developer Portal and Catalog

Deploy the Backstage developer portal on Kubernetes for a service catalog, API docs, software templates, and TechDocs documentation.

⏱ 12 minutes tls certificates cert-manager

cert-manager: Automated TLS Certificates

Automate TLS certificate management with cert-manager in Kubernetes. Let's Encrypt integration, Issuer configuration, wildcard certificates, and automatic

⏱ 12 minutes certificates tls security

K8s Certificate Rotation and Management

Manage Kubernetes cluster certificates with kubeadm. Check expiration, renew certificates, configure auto-rotation, and troubleshoot TLS errors.

⏱ 15 minutes cluster-api cluster-management infrastructure

Cluster API: Declarative K8s Management

Manage Kubernetes cluster lifecycle with Cluster API. Provision, upgrade, and scale clusters declaratively using management clusters and infrastructure provi...

⏱ 10 minutes configmap configuration volumes

K8s ConfigMap: Create and Mount Guide

Create Kubernetes ConfigMaps from files, literals, and directories. Mount as volumes or environment variables with hot-reload and immutable ConfigMap patterns.

⏱ 10 minutes container-runtime containerd cri-o

K8s Container Runtimes: containerd vs CRI-O

Compare Kubernetes container runtimes containerd and CRI-O. Configuration, crictl debugging, runtime class for gVisor and Kata, and migration from Docker.

⏱ 10 minutes coredns dns troubleshooting

K8s CoreDNS: Troubleshoot DNS Issues

Troubleshoot Kubernetes CoreDNS resolution failures. Debug DNS pods, ndots settings, search domains, custom Corefile, and forward plugin configuration.

⏱ 12 minutes crd custom-resources api

K8s Custom Resources: CRD Development

Create Kubernetes Custom Resource Definitions with schema validation, additional printer columns, subresources, and conversion webhooks.

⏱ 8 minutes troubleshooting containers errors

Fix CreateContainerError in Kubernetes

Troubleshoot Kubernetes CreateContainerError with step-by-step debugging. ConfigMap mounts, Secret references, volume permissions, and container runtime issues.

⏱ 10 minutes cronjob scheduling batch

K8s CronJob: Advanced Scheduling Patterns

Configure Kubernetes CronJobs with concurrency policies, deadlines, history limits, and suspend/resume. Timezone scheduling, failure handling, and monitoring.

⏱ 15 minutes crossplane infrastructure cloud

Crossplane: Provision Cloud from Kubernetes

Manage cloud infrastructure with Crossplane in Kubernetes. Provision AWS RDS, S3, Azure databases, and GCP resources using Kubernetes manifests and compositi...

⏱ 15 minutes csi storage persistent-volumes

K8s CSI Drivers: Container Storage Guide

Install and configure Kubernetes CSI drivers for persistent storage. CSI architecture, StorageClass provisioners, snapshots, and volume expansion patterns.

⏱ 10 minutes daemonset deployments monitoring

K8s DaemonSet: Run Pod on Every Node

Deploy Kubernetes DaemonSets to run one pod per node. Log collectors, monitoring agents, node-level networking, tolerations, and update strategies.

⏱ 12 minutes dapr microservices pub-sub

Dapr: Microservice Building Blocks on K8s

Deploy Dapr in Kubernetes for service invocation, state management, pub/sub messaging, and secrets. Sidecar architecture that works with any language or fram...

⏱ 12 minutes deployments rolling-update rollback

K8s Deployment Rolling Update Strategy

Configure Kubernetes Deployment rolling updates with maxSurge and maxUnavailable. Rollback, revision history, blue-green, and canary deployment patterns.

⏱ 10 minutes dns services networking

K8s DNS for Services: Resolution Guide

Understand Kubernetes DNS for Services and Pods. Service discovery patterns, FQDN format, headless services, DNS policies, ndots configuration.

⏱ 8 minutes volumes storage emptydir

K8s Volumes: emptyDir and hostPath Guide

Configure Kubernetes emptyDir and hostPath volumes for temporary storage and host filesystem access. Memory-backed tmpfs, size limits.

⏱ 10 minutes endpointslice service-discovery dns

K8s EndpointSlice and Service Discovery

Understand Kubernetes EndpointSlice for scalable service discovery. DNS resolution, headless services, external services, and endpoint conditions.

⏱ 15 minutes etcd backup disaster-recovery

K8s etcd Backup and Restore Commands

Back up and restore Kubernetes etcd with etcdctl snapshot save and restore. Automated CronJob backups, verification, and disaster recovery procedures.

⏱ 15 minutes etcd backup cluster-administration

etcd Deep Dive: K8s Data Store Operations

Master etcd operations for Kubernetes. Backup and restore, compaction, defragmentation, health checks, member management, and performance tuning for production.

⏱ 10 minutes secrets vault security

External Secrets Operator: Vault and Cloud

Sync secrets from HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, and GCP Secret Manager into Kubernetes with External Secrets Operator.

⏱ 12 minutes falco runtime-security security

Falco: K8s Runtime Threat Detection

Deploy Falco for Kubernetes runtime security monitoring. Detect suspicious container behavior, privilege escalation, file access.

⏱ 12 minutes flux gitops ci-cd

Flux: GitOps Toolkit for Kubernetes

Deploy Flux GitOps toolkit for Kubernetes continuous delivery. Kustomization, HelmRelease, image automation, and multi-tenant GitOps with source controllers.

⏱ 12 minutes gateway-api networking ingress

Gateway API: Next-Gen K8s Ingress

Replace Kubernetes Ingress with Gateway API. HTTPRoute, GRPCRoute, TLSRoute configuration. Multi-tenant gateways, traffic splitting, and header-based routing.

⏱ 15 minutes graceful-shutdown deployments lifecycle

Kubernetes Graceful Shutdown Guide

Implement graceful shutdown in Kubernetes pods. Configure terminationGracePeriodSeconds, preStop hooks, and SIGTERM handling, and drain connections properly.

⏱ 12 minutes harbor registry security

Harbor: Private Container Registry on K8s

Deploy Harbor container registry in Kubernetes for private image hosting. Vulnerability scanning, image replication, RBAC, Helm chart repository.

⏱ 10 minutes autoscaling hpa scaling

K8s Horizontal Scaling: Manual and Auto

Scale Kubernetes workloads horizontally with kubectl scale, HPA, and KEDA. Covers replica management and event-driven scaling strategies.

⏱ 10 minutes hpa autoscaling cpu

K8s HPA: Autoscale on CPU and Memory

Configure Kubernetes HorizontalPodAutoscaler to scale on CPU and memory utilization. Target utilization, minReplicas, maxReplicas, and scaling behavior.

⏱ 8 minutes troubleshooting image-pull containers

Troubleshoot ImagePullBackOff and ErrImagePull

Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Private registry auth, image pull secrets, tag verification, and network connectivity fixes.

⏱ 12 minutes ingress nginx tls

K8s Ingress NGINX: Routing and TLS

Configure Kubernetes Ingress with NGINX controller. Path-based routing, TLS termination, annotations, rate limiting, and multiple hosts with examples.

⏱ 8 minutes init-containers pods deployments

K8s Init Containers: Setup Before Main

Use Kubernetes init containers to run setup tasks before main containers start. Database migrations, config fetching, dependency checks, and ordering.

⏱ 10 minutes jobs cronjobs batch

K8s Jobs and CronJobs: Complete Guide

Create Kubernetes Jobs and CronJobs for batch processing. Parallelism, backoff limits, completion counts, cron schedules, and failure handling patterns.

⏱ 20 minutes kubeadm cluster-setup installation

kubeadm init: Bootstrap K8s Cluster

Bootstrap a Kubernetes cluster with kubeadm init and join. Control plane setup, worker node joining, pod network installation.

⏱ 20 minutes kubeadm upgrade cluster-management

K8s kubeadm Upgrade: Step-by-Step Guide

Upgrade Kubernetes clusters with kubeadm from one minor version to the next. Control plane upgrade, worker node drain, kubelet upgrade, and rollback procedures.

⏱ 10 minutes kubectl debugging ephemeral-containers

kubectl debug: Advanced Pod Debugging

Use kubectl debug for ephemeral containers, node debugging, and pod copy debugging. Debug distroless images, share process namespaces, and node-level access.

⏱ 8 minutes kubectl plugins krew

kubectl Plugins: Extend with Krew

Install kubectl plugins with Krew package manager. Essential plugins for debugging, resource management, and cluster operations. Build custom kubectl plugins.

⏱ 8 minutes kubectl scripting automation

kubectl wait: Script K8s Operations

Use kubectl wait for scripting Kubernetes operations. Wait for pod ready, job completion, deployment rollout, and custom conditions in CI/CD pipelines.

⏱ 12 minutes kubelet node-management configuration

K8s Kubelet Configuration and Tuning

Configure Kubernetes kubelet with KubeletConfiguration API. Resource reservation, eviction thresholds, image garbage collection, and node allocatable settings.

⏱ 12 minutes kustomize configuration gitops

Kustomize: Customize K8s Manifests

Use Kustomize to customize Kubernetes manifests without templates. Overlays, patches, configMapGenerator, secretGenerator.

⏱ 12 minutes kyverno policy security

Kyverno: K8s Policy Engine Without Code

Enforce Kubernetes policies with Kyverno. Validate, mutate, and generate resources using YAML policies. Image verification, label enforcement.

⏱ 10 minutes labels best-practices configuration

Kubernetes Labels Best Practices

Kubernetes labels best practices for organizing workloads. Recommended label schemas, selector patterns, naming conventions, and operational label strategies.

⏱ 12 minutes linkerd service-mesh networking

Linkerd: Lightweight K8s Service Mesh

Deploy Linkerd service mesh in Kubernetes for mTLS, traffic splitting, retries, and observability. Lighter alternative to Istio with zero-config mTLS and min...

⏱ 8 minutes metrics monitoring kubectl

K8s Metrics Server: kubectl top Guide

Install Kubernetes Metrics Server for kubectl top and HPA. Resource usage monitoring, troubleshooting metrics, and custom metrics integration.

⏱ 10 minutes namespaces multi-tenancy rbac

Kubernetes Namespaces: Complete Guide

Create and manage Kubernetes namespaces for multi-tenant isolation. Resource quotas, RBAC per namespace, network policies, and LimitRange configuration.

⏱ 12 minutes networking troubleshooting debugging

K8s Network Debugging: Connectivity Guide

Debug Kubernetes network issues with tcpdump, netshoot, and connectivity tests. Pod-to-pod, pod-to-service, DNS, and external connectivity troubleshooting.

⏱ 10 minutes networkpolicy security networking

K8s NetworkPolicy: Allow and Deny Rules

Configure Kubernetes NetworkPolicy for pod-to-pod traffic control. Default deny, allow by label, namespace selectors, egress rules, and CIDR blocks.

⏱ 12 minutes scheduling node-affinity pod-affinity

K8s Node Affinity and Pod Scheduling

Configure Kubernetes node affinity, pod affinity, and anti-affinity rules. nodeSelector, requiredDuringScheduling, preferredDuringScheduling, and topology.

⏱ 5 minutes taints tolerations scheduling

Fix Untolerated Taint node-role master

Fix 'node untolerated taint node-role.kubernetes.io/master' scheduling error. Remove or tolerate control plane taints to schedule pods on master nodes.

⏱ 12 minutes opentelemetry tracing observability

OpenTelemetry in Kubernetes: Traces and Metrics

Deploy OpenTelemetry Collector in Kubernetes for distributed tracing and metrics. Auto-instrumentation, OTLP export, Jaeger integration.

⏱ 15 minutes operators controllers crd

K8s Operator Pattern: Build Controllers

Build Kubernetes operators with the controller pattern. Reconciliation loops, watch events, owner references, finalizers, and operator frameworks comparison.

⏱ 12 minutes persistent-volumes storage pvc

K8s PV and PVC: Persistent Storage Guide

Create Kubernetes PersistentVolumes and PersistentVolumeClaims. StorageClass, dynamic provisioning, access modes, reclaim policies, and volume expansion.

⏱ 10 minutes storage pvc persistent-volumes

K8s PersistentVolumeClaimSpec Reference

Complete PersistentVolumeClaimSpec reference for Kubernetes. accessModes, storageClassName, resources, selector, volumeMode, and dataSource explained.

⏱ 8 minutes pdb availability node-drain

K8s PodDisruptionBudget PDB Guide

Configure Kubernetes PodDisruptionBudgets to protect application availability during node drains. minAvailable, maxUnavailable, and drain safety patterns.

⏱ 10 minutes pod-lifecycle termination graceful-shutdown

K8s Pod Lifecycle and Graceful Shutdown

Understand Kubernetes pod lifecycle phases, termination sequence, preStop hooks, SIGTERM handling, and terminationGracePeriodSeconds for zero-downtime shutdo...

⏱ 10 minutes pod-security security admission-controller

K8s Pod Security Admission Standards

Configure Kubernetes Pod Security Admission with enforce, audit, and warn modes. Privileged, baseline, and restricted profiles for namespace-level pod security.

⏱ 8 minutes priority preemption scheduling

K8s PriorityClass: Pod Scheduling Priority

Configure Kubernetes PriorityClass for pod scheduling priority and preemption. System-critical pods, resource guarantees, and preemption policies.

⏱ 10 minutes probes health-checks liveness

Kubernetes Liveness and Readiness Probes Guide

Configure Kubernetes liveness, readiness, and startup probes for health checks. HTTP, TCP, exec probes, timing parameters, and failure threshold tuning.

⏱ 8 minutes volumes projected configuration

K8s Projected Volumes: Combine Sources

Configure Kubernetes projected volumes to combine secrets, configmaps, downward API, and service account tokens into a single mount.

⏱ 15 minutes prometheus monitoring alerting

Prometheus: K8s Monitoring and Alerting

Deploy Prometheus monitoring in Kubernetes with kube-prometheus-stack. ServiceMonitor, PrometheusRule, Grafana dashboards, and alerting for production clusters.

⏱ 8 minutes qos resource-management eviction

K8s QoS Classes: Guaranteed vs Burstable

Understand Kubernetes QoS classes for pod eviction priority. Guaranteed, Burstable, and BestEffort resource configurations and eviction behavior under pressure.

⏱ 15 minutes rate-limitingingressgateway-api

Kubernetes Rate Limiting Guide

Implement rate limiting in Kubernetes with Ingress annotations, Gateway API, Envoy filters, and application-level middleware. Protect APIs from abuse.

⏱ 12 minutes rbacsecurityservice-accounts

K8s RBAC: Role and RoleBinding Guide

Configure Kubernetes RBAC with Role, ClusterRole, RoleBinding, and ClusterRoleBinding. Service account permissions, least privilege, and audit examples.

⏱ 8 minutes replicasetpodsscaling

K8s ReplicaSet: Maintain Pod Replicas

Understand Kubernetes ReplicaSets for maintaining desired pod count. Selector matching, scaling, ownership, and relationship to Deployments.

⏱ 20 minutes resourcesoptimizationcost

Kubernetes Right-Sizing and Cost Optimization

Optimize Kubernetes resource allocation with right-sizing, VPA recommendations, bin packing, request-to-limit ratios, and cost reduction best practices.

⏱ 10 minutes resource-quotaslimitrangemulti-tenancy

K8s ResourceQuota and LimitRange Guide

Configure Kubernetes ResourceQuota and LimitRange for namespace resource management. CPU and memory quotas, pod count limits, and default container limits.

⏱ 10 minutes rolling-updatedeployment-strategydeployments

K8s Rolling Update: Deployment Strategies

Configure Kubernetes Deployment update strategies: RollingUpdate with maxSurge and maxUnavailable, and Recreate. Blue-green, canary patterns, and rollback procedures.

⏱ 10 minutes secretssecurityencryption

K8s Secrets: Types and Usage Guide

Create and manage Kubernetes Secrets: Opaque, docker-registry, TLS, and basic-auth types. Mount as volumes, inject as env vars, and encrypt at rest.

⏱ 10 minutes security-contextsecuritycontainers

K8s SecurityContext: Container Hardening

Configure Kubernetes SecurityContext for pods and containers. runAsNonRoot, readOnlyRootFilesystem, capabilities, seccomp profiles, and privilege escalation.

⏱ 15 minutes istioservice-meshnetworking

Istio Service Mesh: Traffic Management

Deploy Istio service mesh in Kubernetes for traffic management, mTLS, observability, and canary deployments. VirtualService, DestinationRule.

⏱ 10 minutes servicesnetworkingload-balancer

K8s Service Types: ClusterIP NodePort LB

Kubernetes Service types explained: ClusterIP, NodePort, LoadBalancer, and ExternalName. When to use each type with YAML examples and traffic flow diagrams.

⏱ 10 minutes service-accountssecurityrbac

K8s ServiceAccount: Pod Identity Guide

Create Kubernetes ServiceAccounts for pod authentication. Token projection, RBAC binding, workload identity, automountServiceAccountToken, and OIDC federation.

⏱ 10 minutes sidecarcontainerspods

K8s Sidecar Containers: Native Support

Configure Kubernetes native sidecar containers with restartPolicy Always in initContainers. Logging sidecars, service mesh proxies, and lifecycle management.

⏱ 12 minutes statefulsetdeploymentsstorage

K8s StatefulSet: Stable Identity Guide

Deploy stateful applications with Kubernetes StatefulSets. Stable network identity, ordered deployment, persistent storage, and headless service patterns.

⏱ 10 minutes taintstolerationsscheduling

K8s Taints and Tolerations Explained

Configure Kubernetes taints and tolerations for pod scheduling. NoSchedule, PreferNoSchedule, NoExecute effects, GPU node taints, and drain behavior.

⏱ 12 minutes tektonci-cdpipelines

Tekton: Cloud-Native CI/CD Pipelines

Build CI/CD pipelines with Tekton in Kubernetes. Tasks, Pipelines, PipelineRuns, workspaces, and Tekton Hub integration for cloud-native continuous delivery.

⏱ 10 minutes topologyschedulinghigh-availability

K8s Topology Spread: Distribute Pods

Configure Kubernetes topology spread constraints to distribute pods across zones, nodes, and regions. maxSkew, whenUnsatisfiable, and scheduling strategies.

⏱ 10 minutes trivyvulnerability-scanningsecurity

Trivy: K8s Security Scanning and SBOM

Scan Kubernetes clusters with Trivy for vulnerabilities, misconfigurations, and secrets. Trivy Operator for continuous scanning, SBOM generation.

⏱ 12 minutes backupdisaster-recoveryvelero

Velero: K8s Backup and Disaster Recovery

Back up and restore Kubernetes clusters with Velero. Schedule backups, restore namespaces, and migrate workloads between clusters.

⏱ 10 minutes nginxingressrate-limiting

NGINX Ingress limit-burst-multiplier

Configure nginx.ingress.kubernetes.io/limit-burst-multiplier for rate limiting burst control. Tune burst size, rate limits, and 429 response handling.

⏱ 15 minutes nvidiagpuh300

NVIDIA H300 GPU Setup on Kubernetes

Deploy NVIDIA H300 GPUs on Kubernetes. H300 vs H100 vs H200 specs comparison, memory bandwidth, GPU Operator setup, and AI inference optimization.

⏱ 15 minutes nvidiapytorchgpu

NVIDIA PyTorch Container on Kubernetes

Deploy nvcr.io/nvidia/pytorch containers on Kubernetes for GPU training. Version selection, CUDA compatibility, multi-node DDP, and NCCL configuration.

⏱ 10 minutes vpaautoscalinginstallation

Install VPA with hack/vpa-up.sh Script

Install Kubernetes Vertical Pod Autoscaler using hack/vpa-up.sh from the official repository. VPA components, prerequisites, and troubleshooting guide.

⏱ 45 minutes openshiftairgapdisconnected

Air-Gap OpenShift Upgrade oc-mirror OSUS

Upgrade air-gapped OpenShift with oc-mirror and OSUS. Mirror release payloads and Cincinnati graph, configure IDMS, and drive CVO upgrades.

⏱ 15 minutes openshiftcincinnatiupgrades

Cincinnati Graph OpenShift Upgrades

Understand Cincinnati upgrade graph for OpenShift. Query graph endpoints, decode channels, blocked edges, conditional updates, and debug upgrade paths.

⏱ 15 minutes containerdregistrytls

containerd certs.d Registry CA Trust

Configure containerd to trust private registry CAs using /etc/containerd/certs.d. Set up hosts.toml for custom CA certificates and mirror registries.

⏱ 20 minutes genai-perfbenchmarkingllm

GenAI-Perf Benchmark LLM Kubernetes

Benchmark LLM inference with GenAI-Perf on Kubernetes. Use --service-kind openai for vLLM, NIM, and TGI. Measure TTFT, ITL, and throughput.

⏱ 20 minutes gkeoidcworkload-identity

GKE OIDC Issuer Workload Identity

Enable OIDC issuer on GKE with --enable-oidc-issuer. Configure workload identity federation for cross-cloud auth and external IdP integration.

⏱ 15 minutes journaldsystemdlogging

Journald Verify Config Kubernetes Nodes

Validate journald configuration on Kubernetes nodes. Fix journal corruption, tune storage limits, configure persistence, and troubleshoot systemd-journald.

⏱ 10 minutes kubectlsecretsregistry

kubectl create secret docker-registry

Create Kubernetes Docker registry secrets with --docker-password-stdin. Authenticate to private registries and configure imagePullSecrets securely.

⏱ 20 minutes nmstatebondinglacp

NMState Bond LACP Configuration OpenShift

Configure LACP bonding with NMState on OpenShift. NodeNetworkConfigurationPolicy for 802.3ad bonds, VLAN tagging, and storage network bonds.

⏱ 15 minutes dnsnxdomaincoredns

NXDOMAIN DNS Troubleshooting Kubernetes

Fix NXDOMAIN errors in Kubernetes. Debug CoreDNS failures, ndots configuration, search domain issues, and external DNS lookup problems.

⏱ 20 minutes oc-mirrordisconnectedopenshift

oc-mirror Troubleshooting Disconnected

Troubleshoot oc-mirror failures in disconnected OpenShift. Fix archive corruption, registry auth errors, v1/v2 mismatches, and delta mirror issues.

⏱ 20 minutes openshiftcluster-operatorsupgrades

OpenShift Cluster Operator Upgrade Debug

Debug degraded cluster operators during OpenShift upgrades. Identify stuck operators, decode status conditions, and unblock stalled rollouts.

⏱ 15 minutes openshiftidmsitms

OpenShift IDMS ITMS Mirror Rules Guide

Configure IDMS and ITMS mirror rules in OpenShift for disconnected registries. NeverContactSource vs AllowContactingSource and ICSP migration.

⏱ 45 minutes openshiftdisconnectedmigration

Convert Connected to Disconnected OCP

Convert a connected OpenShift cluster to disconnected. Mirror images, configure IDMS, update pull secrets, fix Insights Operator, and verify applications.

⏱ 15 minutes openshiftdisconnectedair-gapped

Disconnected Environments OpenShift

Complete guide to OpenShift disconnected and air-gapped environments. Mirror registry, oc-mirror, OLM, OSUS, IDMS, upgrades, and enclave support overview.

⏱ 20 minutes etcdbackupdisaster-recovery

etcd Backup Restore Kubernetes

Back up and restore etcd in Kubernetes and OpenShift clusters. Automated snapshots, disaster recovery procedures, and cluster state restoration.

⏱ 20 minutes idmsitmsicsp

IDMS ITMS ICSP Disconnected OpenShift

Configure ImageDigestMirrorSet, ImageTagMirrorSet, and ImageContentSourcePolicy for disconnected OpenShift. Redirect image pulls to your mirror registry.

⏱ 25 minutes velerobackupdisaster-recovery

Kubernetes Backup Velero Guide

Set up Velero for Kubernetes cluster backup and restore. Schedule backups, protect namespaces, restore applications, and configure S3 storage backends.

⏱ 15 minutes configmapsecretsconfiguration

Kubernetes ConfigMap Secrets Management

Manage ConfigMaps and Secrets in Kubernetes. Create, mount, update, and secure application configuration and sensitive data effectively.

⏱ 18 minutes deploymentsrolling-updatecanary

Kubernetes Deployment Strategies

Compare rolling update, recreate, blue-green, and canary deployment strategies in Kubernetes. Configuration, trade-offs, and production rollback procedures.

⏱ 18 minutes hpaautoscalingscaling

Kubernetes HPA Autoscaling Guide

Configure Horizontal Pod Autoscaler for automatic scaling based on CPU, memory, and custom metrics. HPA v2 policies, scaling behavior, and production tuning.

⏱ 15 minutes ingressnginxtls

Kubernetes Ingress Fundamentals

Configure Kubernetes Ingress for HTTP routing, TLS termination, and path-based routing. NGINX Ingress Controller setup, annotations, and multi-service routing.

⏱ 20 minutes ippoolipamnetworking

Kubernetes IPPool Management Guide

Configure IP address pools in Kubernetes with Whereabouts, NV-IPAM, MetalLB, and Calico IPPool for secondary networks and LoadBalancer IPs.

⏱ 15 minutes jobscronjobsbatch

Kubernetes Jobs CronJobs Guide

Run batch workloads with Kubernetes Jobs and CronJobs. Parallel execution, completion tracking, failure handling, TTL cleanup, and scheduled tasks.

⏱ 15 minutes probeshealth-checksliveness

Kubernetes Probes Liveness Readiness

Configure liveness, readiness, and startup probes in Kubernetes. HTTP, TCP, exec, and gRPC probe types with real-world tuning for production workloads.

⏱ 20 minutes loggingfluent-bitobservability

Kubernetes Logging Fluent Bit Guide

Deploy Fluent Bit for centralized Kubernetes logging. DaemonSet configuration, parsing, filtering, and forwarding logs to Elasticsearch, Loki, or S3.

⏱ 12 minutes namespacesmulti-tenancyrbac

Kubernetes Namespace Management Guide

Create, manage, and organize Kubernetes namespaces for multi-tenancy. Resource isolation, RBAC scoping, namespace quotas, and lifecycle best practices.

⏱ 18 minutes network-policysecuritynetworking

Kubernetes NetworkPolicy Guide

Secure pod-to-pod traffic with Kubernetes NetworkPolicies. Ingress and egress rules, namespace selectors, deny-all policies, and CNI requirements.

⏱ 12 minutes node-draincordonmaintenance

Kubernetes Node Drain Cordon Guide

Safely drain and cordon Kubernetes nodes for maintenance. Graceful pod eviction, PDB-aware drains, force drain, and maintenance window procedures.

⏱ 18 minutes persistent-volumesstoragepvc

Kubernetes Persistent Volumes Guide

Manage Kubernetes Persistent Volumes with PV, PVC, and StorageClass. Dynamic provisioning, access modes, reclaim policies, and volume expansion.

⏱ 18 minutes rbacsecurityaccess-control

Kubernetes RBAC Role ClusterRole

Configure RBAC in Kubernetes with Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings. Least-privilege access for users, groups, and service accounts.

⏱ 18 minutes resource-quotalimit-rangemulti-tenancy

Kubernetes ResourceQuota LimitRange

Configure ResourceQuota and LimitRange for Kubernetes namespace resource governance. CPU, memory, storage, and object count limits for multi-tenant clusters.

⏱ 30 minutes openshiftdisconnectedmirror-registry

Mirror Registry Disconnected OpenShift

Set up a mirror registry for disconnected OpenShift installations. Deploy mirror-registry for Red Hat OpenShift, configure storage, TLS, and credentials.

⏱ 25 minutes mofedmellanoxnvidia

MOFED Driver for Kubernetes: Setup Guide

Install and manage MOFED drivers in Kubernetes. Network Operator integration, NicClusterPolicy, driver versions, and RDMA troubleshooting.

⏱ 20 minutes mofednvidianetwork-operator

MOFED Driver Operator Build Kubernetes

Let the NVIDIA Network Operator build MOFED drivers on-node via DKMS. Kernel header detection, compile flags, and DTK integration for OpenShift.

⏱ 35 minutes oc-mirroropenshiftdisconnected

oc-mirror Plugin Disconnected OpenShift

Use oc-mirror to mirror OpenShift content for disconnected installations. ImageSetConfiguration, incremental mirrors, and operator catalog mirroring.

⏱ 25 minutes olmoperatorsdisconnected

OLM Disconnected OpenShift Operators

Use Operator Lifecycle Manager in disconnected OpenShift clusters. Mirror catalogs, create CatalogSources, and manage Operators without internet access.

⏱ 25 minutes openshiftmachineconfigmcp

OpenShift MCP Validation Broken Rules

Validate MachineConfigPool rules before applying in OpenShift. Detect broken MachineConfigs, degraded MCPs, and implement pre-flight checks.

⏱ 20 minutes openshiftosusupdate-service

OSUS Direct vs Replicated OpenShift

Choose between direct and replicated OSUS graph data modes in OpenShift. Configure UpdateService for connected and disconnected environments.

⏱ 25 minutes prometheusmonitoringalerting

Prometheus Monitoring Kubernetes Guide

Deploy Prometheus for Kubernetes cluster monitoring. ServiceMonitor, PodMonitor, alerting rules, Grafana dashboards, and kube-prometheus-stack Helm install.

⏱ 30 minutes quayregistryopenshift

Red Hat Quay Registry Kubernetes

Deploy and manage Quay container registry on Kubernetes. Mirror policies, robot accounts, security scanning, and integration with OpenShift.

⏱ 15 minutes selinuxsshtroubleshooting

SELinux SSH Login Failure Troubleshoot

Fix SSH login failures caused by SELinux enforcement. Diagnose AVC denials, restore file labels, fix custom SSH ports, and resolve PAM denials.

⏱ 20 minutes skopeocontainer-imagesregistry

Skopeo Container Image Operations

Use skopeo to inspect, copy, sync, and delete container images across registries. Essential tool for disconnected Kubernetes and OpenShift environments.

⏱ 20 minutes sriovdevice-pluginrdma

SR-IOV Device Plugin PF Flag on Kubernetes

Configure SR-IOV device plugin PF flag in Kubernetes. Expose physical functions as allocatable resources for exclusive RDMA access.

⏱ 15 minutes cert-managercloudflaredns01

cert-manager Cloudflare DNS01 K8s

Configure cert-manager with Cloudflare DNS01 challenge for wildcard TLS certificates on Kubernetes. API token secret, ClusterIssuer, and auto-renewal.

⏱ 15 minutes ciliumdebugnetshoot

Cilium Debug Pod Troubleshooting

Debug Kubernetes networking with Cilium debug pods and containers. cilium-dbg, netshoot, hubble observe, and endpoint connectivity troubleshooting.

⏱ 15 minutes cloudnativepgpostgresqloperator

CloudNativePG PostgreSQL Operator K8s

Deploy PostgreSQL with CloudNativePG operator on Kubernetes. Cluster setup, affinity, replication lag monitoring, backup, and high availability configuration.

⏱ 15 minutes continuous-batchinginferencethroughput

Continuous Batching LLM Inference K8s

Configure continuous batching for LLM inference on Kubernetes. vLLM and TRT-LLM batch scheduling, max-num-seqs tuning, and throughput optimization.

⏱ 15 minutes cudacompatibilitydriver-version

CUDA Version Compatibility K8s Guide

Match CUDA versions with GPU drivers and container images on Kubernetes. Forward compatibility, driver requirements, and container toolkit matrix.

⏱ 15 minutes cudaoomgpu-memory

Fix CUDA Out of Memory K8s Pods

Troubleshoot CUDA out of memory errors in Kubernetes GPU pods. Memory fragmentation, batch size tuning, gradient checkpointing, and resource limits.

⏱ 15 minutes deepspeedzerodistributed-training

DeepSpeed ZeRO Training Kubernetes

Deploy DeepSpeed ZeRO-1/2/3 for large model training on Kubernetes. Multi-node config, NCCL tuning, memory optimization, and 70B+ model training.

⏱ 15 minutes dgxh100gpu-topology

DGX H100 GPU Topology nvidia-smi

Inspect DGX H100 GPU topology with nvidia-smi topo -m. NVSwitch NV18 links, cross-socket detection, PCIe hierarchy, and NCCL performance validation.

⏱ 30 minutes nvidiadocabluefield

DOCA Telemetry BlueField Kubernetes

Collect NVIDIA BlueField DPU telemetry in Kubernetes using DOCA Telemetry libraries. Monitor adaptive retransmission, PCC, diagnostics, and PCI metrics.

⏱ 25 minutes edrflexeracrowdstrike

EDR Flexera Agents Kubernetes Deploy

Deploy EDR and Flexera agents on Kubernetes with DaemonSets. Priority classes, host path access, exclusion paths, and security agent lifecycle.

⏱ 25 minutes flexeralicensingcompliance

Flexera License Management Kubernetes

Manage software licenses in Kubernetes with Flexera. FlexNet Manager, container license tracking, GPU software metering, and compliance for enterprise K8s.

⏱ 15 minutes gpu-feature-discoverynode-labelsscheduling

GPU Feature Discovery Node Labels

Configure NVIDIA GPU Feature Discovery for automatic node labeling on Kubernetes. GPU model, driver version, CUDA, and MIG labels for scheduling.

⏱ 15 minutes node-affinitygpu-schedulingtopology

GPU Node Affinity Scheduling K8s

Schedule GPU workloads with node affinity and topology on Kubernetes. GPU type selection, multi-GPU locality, and NUMA-aware pod placement.

⏱ 15 minutes gpu-limitsresource-requestsnvidia

K8s GPU Limits Requests Configuration

Configure GPU resource limits and requests in Kubernetes pod specs. nvidia.com/gpu resource, fractional GPUs, MIG slices, and multi-GPU allocation.

⏱ 15 minutes hpaprometheuscustom-metrics

HPA Prometheus Custom Metrics K8s

Configure HPA with custom Prometheus metrics using prometheus-adapter on Kubernetes. Custom and external metrics, query mapping, and scaling on business KPIs.

⏱ 15 minutes ingressrate-limitnginx

K8s Ingress Rate Limit NGINX Config

Configure rate limiting on Kubernetes NGINX Ingress. limit-rps, limit-burst-multiplier annotations, per-client limits, and webhook protection patterns.

⏱ 30 minutes lacpbondingstorage

LACP Storage Switch Kubernetes Guide

Configure LACP bond aggregation for NFS and iSCSI storage switches in Kubernetes clusters. 802.3ad setup, hash policies, switch config, and failure handling.

⏱ 15 minutes lorafine-tuningvllm

LoRA Adapter Serving vLLM on K8s

Serve multiple LoRA adapters with a single vLLM base model on Kubernetes. Dynamic loading, per-request routing, and multi-tenant fine-tuned models.

⏱ 15 minutes pytorchddpmulti-gpu

Multi-GPU PyTorch DDP on Kubernetes

Run PyTorch DistributedDataParallel across multiple GPUs on Kubernetes. torchrun, NCCL backend, pod topology, and scaling to multi-node training.

⏱ 35 minutes nfsmulti-tenancystorage

NFS Tenant Segregation Kubernetes

Implement NFS tenant segregation in Kubernetes with six-layer defense-in-depth. Exports, StorageClass, quotas, and admission policies.

⏱ 15 minutes nmstateoperatoropenshift

NMState Operator Install OpenShift K8s

Install and configure the NMState operator on OpenShift and Kubernetes. Enable declarative node networking with NNCP, NodeNetworkState, and enactments.

⏱ 25 minutes nncpnmstateopenshift

NNCP NodeNetworkConfigurationPolicy

Master NodeNetworkConfigurationPolicy (NNCP) on OpenShift and Kubernetes. Configure VLANs, bonds, bridges, SR-IOV, MTU, static IPs, and DNS with NMState.

⏱ 15 minutes nvidia-driverupgraderolling-update

NVIDIA Driver Update K8s Nodes Guide

Safely update NVIDIA GPU drivers on Kubernetes nodes. Rolling updates, drain strategy, driver compatibility matrix, and GPU Operator upgrades.

⏱ 15 minutes gpu-operatornvidiadriver

NVIDIA GPU Operator Troubleshooting

Fix common NVIDIA GPU Operator issues on Kubernetes. Driver pod crashes, toolkit failures, device plugin not ready, and validation pod errors.

⏱ 15 minutes nvidia-peermemgpudirectrdma

NVIDIA PeerMem GPUDirect RDMA K8s

Configure nvidia_peermem and ib_register_peer_memory_client for GPUDirect RDMA on Kubernetes. Module loading and modprobe invalid argument fix.

⏱ 15 minutes nvidia-smigpu-monitoringhealth-check

nvidia-smi Monitoring in K8s Pods

Run nvidia-smi inside Kubernetes pods for GPU monitoring. Memory usage, temperature, utilization, and automated health checks with liveness probes.

OpenShift ACS RHACS Security Guide

Deploy Red Hat Advanced Cluster Security (RHACS/ACS) on OpenShift. Vulnerability scanning, compliance, runtime threat detection, and policy enforcement.

⏱ 45 minutes openshiftdisconnectedair-gapped

OpenShift Upgrade Disconnected Cluster

Step-by-step guide to upgrading OpenShift in a disconnected, air-gapped environment. Mirror releases, configure ICSP/IDMS, validate, and execute the upgrade.

⏱ 20 minutes openshiftupgradecincinnati

OpenShift Upgrade Service Graph Guide

Use the OpenShift Update Service (OSUS) and Cincinnati graph to plan safe upgrade paths. Channel selection, conditional edges, and air-gapped graph data.

⏱ 30 minutes osusopenshiftdisconnected

OSUS Operator Disconnected OpenShift

Deploy the OpenShift Update Service (OSUS) operator for disconnected clusters. Local Cincinnati graph, graph-data image mirroring, and upgrade path serving.

⏱ 15 minutes prefix-cachingkv-cachevllm

Prefix Caching vLLM KV Cache K8s

Enable automatic prefix caching in vLLM on Kubernetes for shared-prompt workloads. KV cache reuse, memory savings, and chatbot latency optimization.

⏱ 15 minutes quantizationawqgptq

Quantize LLMs AWQ GPTQ for K8s Deploy

Deploy AWQ and GPTQ quantized LLMs on Kubernetes. 4-bit inference with vLLM, model conversion, accuracy trade-offs, and GPU memory savings guide.

⏱ 25 minutes rhacsstackroxnfs

RHACS NFS Tenant Security Kubernetes

Enforce NFS tenant isolation with RHACS policies. Detect direct NFS mounts, wrong StorageClass usage, privileged escalation, and cross-tenant violations.

⏱ 15 minutes speculative-decodingvllminference-optimization

Speculative Decoding with vLLM on Kubernetes

Enable speculative decoding in vLLM on Kubernetes for 2-3x faster LLM inference. Draft model selection, acceptance rates, and latency optimization.

⏱ 15 minutes tensorrt-llmvllmbenchmark

TensorRT-LLM vs vLLM Benchmark 2026

Compare TensorRT-LLM vs vLLM for LLM inference on Kubernetes. TTFT, throughput, GPU utilization benchmarks, and when to use each inference engine.

⏱ 15 minutes vllmalternativesinference

vLLM Alternatives LLM Inference K8s

Compare vLLM alternatives for LLM inference on Kubernetes. TensorRT-LLM, SGLang, NVIDIA NIM, Ollama, and text-generation-inference feature comparison.

⏱ 20 minutes ubuntuhardeningsudo-rs

Ubuntu 26.04 LTS K8s Node Hardening

Harden Kubernetes nodes with Ubuntu 26.04 LTS Resolute Raccoon. sudo-rs Rust rewrite, APT rollback, Kernel 7.0 TDX, ROCm GPU, and secure base images.

⏱ 15 minutes ciliumclustermeshmulti-cluster

Cilium ClusterMesh Multi-Cluster

Connect multiple K8s clusters with Cilium ClusterMesh. Shared services, global service discovery, and cross-cluster network policies.

⏱ 15 minutes ciliumhubblenetwork-flows

Cilium Hubble Observability Guide

Monitor Kubernetes network flows with Cilium Hubble. CLI usage, Hubble UI, flow filtering, DNS visibility, and L7 HTTP observability.

⏱ 15 minutes crunrunccontainer-runtime

crun vs runc Container Runtime 2026

Compare crun vs runc container runtimes for Kubernetes. Performance benchmarks, memory usage, cgroup v2 support, and migration from runc to crun guide.

⏱ 15 minutes csisnapshotrestore

CSI Snapshot and Restore K8s Guide

Create and restore volume snapshots with CSI on K8s. VolumeSnapshot, VolumeSnapshotClass, and cross-namespace clone patterns.

⏱ 15 minutes etcdleader-electiontimeout

Fix etcd Leader Election Timeout

Troubleshoot etcd leader election timeouts in K8s. Disk latency, network partition, heartbeat interval, and recovery steps.

⏱ 15 minutes certificatetlsx509

Fix Certificate Errors Kubernetes

Troubleshoot TLS certificate errors in K8s. x509 unknown authority, expired certs, cert-manager issues, and custom CA bundles.

⏱ 15 minutes dnsresolutioncoredns

Fix DNS Resolution Issues in Kubernetes

Troubleshoot Kubernetes DNS resolution failures. ndots, search domains, CoreDNS CrashLoop, and pod-level DNS debugging steps.

⏱ 15 minutes cgroupmemoryoom

Fix Pod cgroup Memory Errors K8s

Fix cgroup memory limit and OOM errors in Kubernetes pods. Covers cgroup v2 migration, memory.max, swap settings, and kernel tuning for stable workloads.

⏱ 15 minutes serviceconnectivityendpoints

Fix Service Not Reachable in Kubernetes

Debug Kubernetes Service connectivity issues. Endpoint selection, kube-proxy rules, DNS resolution, and NetworkPolicy blocks.

⏱ 15 minutes helmdependenciessubcharts

Helm Chart Dependencies: Complete Guide

Manage Helm chart dependencies and subcharts. Condition flags, tags, import-values, alias patterns, and dependency update workflow for K8s.

⏱ 15 minutes helmhookslifecycle

Helm Hooks and Lifecycle Management Guide

Master Helm hooks for Kubernetes deployments. Pre-install, post-install, pre-upgrade, hook weights, deletion policies, and database migration patterns.

⏱ 15 minutes helmrollbackhistory

Helm Rollback and History Guide

Roll back Helm releases and manage revision history. Diagnose failed upgrades, compare revisions, and automate rollback.

⏱ 15 minutes helmvaluesoverride

Helm Values Override Patterns Explained

Master Helm values override patterns. CLI flags, multiple files, JSON values, and precedence rules for complex deployments.

⏱ 15 minutes 502-bad-gatewayingresstroubleshooting

Fix 502 Bad Gateway Kubernetes Ingress

Fix 502 Bad Gateway errors in Kubernetes Ingress. Backend not ready, timeout tuning, readiness probes, and NGINX ingress controller troubleshooting.

⏱ 15 minutes admission-controllerwebhookvalidation

K8s Admission Controllers List Guide

Complete list of Kubernetes admission controllers. Enable and disable controllers, PodSecurity, ResourceQuota, and custom validating webhooks guide.

⏱ 15 minutes api-versionsdeprecationmigration

Kubernetes API Versions Explained

Understand K8s API versions: alpha, beta, stable. API deprecation policy, migration strategy, and kubectl api-versions usage.

⏱ 15 minutes argocdsync-waveshooks

ArgoCD Sync Waves and Hooks Guide

Configure ArgoCD sync waves for ordered deployments. Wave ordering, sync hooks, resource health checks, and dependency management patterns.

⏱ 15 minutes caliconetworkpolicyglobal

Calico NetworkPolicy K8s Guide

Configure Calico NetworkPolicy for K8s. GlobalNetworkPolicy, host endpoints, application layer policies, and DNS policy rules.

⏱ 15 minutes canarydeploymentrollout

Canary Deployment Kubernetes Guide

Implement canary deployments on K8s without service mesh. Native K8s strategy, traffic splitting, and automated rollback.

⏱ 15 minutes certificatesexpirationkubeadm

Certificate Expiration Management K8s

Monitor and manage Kubernetes certificate expiration. kubeadm cert check, cert-manager alerts, auto-renewal, and preventing expired certificate outages.

⏱ 15 minutes cluster-autoscalernode-scalingcloud

Cluster Autoscaler Kubernetes Guide

Configure Kubernetes Cluster Autoscaler for automatic node scaling. Scale-down delay, expanders, priority, and integration with cloud providers.

⏱ 15 minutes cnicalicocilium

CNI Comparison 2026 Kubernetes

Compare Kubernetes CNI plugins: Calico, Cilium, Flannel, Multus, and OVN-Kubernetes. Performance benchmarks, features, and selection guidance.

⏱ 15 minutes configmapsubpathvolume-mount

ConfigMap subPath Update Fix K8s

Handle ConfigMap subPath mount limitations in Kubernetes. Why subPath mounts don't auto-update, workarounds, and alternative patterns.

⏱ 15 minutes corednsdnscustom-config

CoreDNS Custom Config Kubernetes

Customize CoreDNS on Kubernetes for advanced DNS needs. Forward zones, stub domains, custom records, caching tuning, and DNS debugging.

⏱ 15 minutes dnsdns-policycoredns

DNS Policy Configuration Kubernetes

Configure Kubernetes DNS policies: Default, ClusterFirst, ClusterFirstWithHostNet, and None. Custom resolv.conf, ndots tuning, and DNS performance.

⏱ 10 minutes docker-registrysecretauthentication

Docker Registry Secret kubectl

Create Kubernetes docker-registry secrets with kubectl. --docker-password-stdin, .dockerconfigjson format, and automating registry authentication.

⏱ 15 minutes downward-apimetadataenvironment-variables

Kubernetes Downward API: Complete Guide

Expose pod and container metadata to applications using the Downward API. Environment variables, volume files, fieldRef, resourceFieldRef, and common patterns.

⏱ 15 minutes efkelasticsearchfluentd

EFK Logging System Principles K8s

EFK logging system principles for Kubernetes. Elasticsearch, Fluentd, Kibana architecture, log pipeline design, parsing, and retention strategies.

⏱ 15 minutes emptydirtmpfsephemeral-storage

emptyDir tmpfs Kubernetes Guide

Configure emptyDir volumes with memory-backed tmpfs on Kubernetes. Size limits, memory accounting, sidecar sharing, and ephemeral cache patterns.

⏱ 15 minutes environment-variablesconfigmapsecrets

Env Variables from ConfigMap K8s

Inject environment variables from ConfigMaps and Secrets in Kubernetes. envFrom, valueFrom, configMapKeyRef, and secretKeyRef patterns.

⏱ 10 minutes envfromconfigmaprefenvironment-variables

envFrom ConfigMapRef Kubernetes

Inject all ConfigMap keys as environment variables using envFrom configMapRef in Kubernetes. Bulk injection, prefix, and selective key patterns.

⏱ 15 minutes etcdperformancetuning

etcd Performance Tuning Kubernetes

Tune etcd for Kubernetes cluster performance. Disk IOPS requirements, compaction, defragmentation, and monitoring etcd health metrics.

⏱ 15 minutes falcorulesruntime-security

Falco Rules for Kubernetes: Complete Guide

Write custom Falco rules for K8s runtime security. Syscall detection, container escape alerts, and cryptomining detection.

⏱ 10 minutes fsgroupchangepolicyonrootmismatchchown

fsGroupChangePolicy OnRootMismatch

Configure fsGroupChangePolicy OnRootMismatch to skip recursive chown on volume mounts. Fix slow pod startup with large persistent volumes on Kubernetes.

⏱ 15 minutes fluxgitopssources

Flux Sources Config Kubernetes

Configure Flux source controllers for GitOps on Kubernetes. GitRepository, HelmRepository, OCIRepository, and Bucket sources for multi-source deployments.

⏱ 10 minutes grafanadashboardsmonitoring

Grafana Dashboards for Kubernetes Guide

Import and customize Grafana dashboards for Kubernetes monitoring. Dashboard 315, 6417, kube-prometheus-stack, and custom panel creation.

⏱ 15 minutes hostpathpvccomparison

hostPath vs PVC Kubernetes Guide

Compare hostPath and PVC storage options for Kubernetes. Security risks of hostPath, node affinity constraints, and when to use each storage type.

⏱ 10 minutes hpamax-replicasautoscaling

HPA Max Replicas Configuration K8s

Set max replicas for Kubernetes HPA to control autoscaling ceiling. maxReplicas tuning, scaling behavior, stabilization window, and cost protection strategies.

⏱ 10 minutes hpatutorialkubectl

HPA Tutorial for Kubernetes Beginners

Step-by-step HPA tutorial for Kubernetes. Create, monitor, and tune Horizontal Pod Autoscalers with kubectl commands and YAML examples.

⏱ 15 minutes trivyimage-scanningvulnerability

Trivy Image Scanning Kubernetes

Scan container images with Trivy on K8s. Admission webhook, CI/CD integration, CIS benchmarks, and vulnerability reporting.

⏱ 10 minutes imagepullsecretsregistryauthentication

imagePullSecrets Pod Config K8s

Configure imagePullSecrets for pulling from private container registries on Kubernetes. Docker registry secrets, service account default.

⏱ 15 minutes ingressroutingpath-based

Ingress Path Routing Kubernetes

Configure Kubernetes Ingress for path-based and host-based routing. PathType Prefix vs Exact, rewrite rules, and multi-service routing patterns.

⏱ 15 minutes karpenternode-autoscalingcost-optimization

Karpenter Node Autoscaler for Kubernetes

Scale Kubernetes nodes with Karpenter. NodePool configuration, instance selection, consolidation, and cost optimization vs Cluster Autoscaler.

⏱ 15 minutes kedascalersevent-driven

KEDA Scalers Guide for Kubernetes

Configure KEDA scalers for event-driven autoscaling on Kubernetes. Covers Kafka, RabbitMQ, Prometheus, and cron trigger configuration.

⏱ 15 minutes kindlocaldevelopment

KIND Local Kubernetes Dev Guide

Use KIND for local Kubernetes development. Multi-node clusters, ingress setup, load balancer, persistent storage, and CI/CD integration.

⏱ 15 minutes kubectlexecdebug

kubectl exec Into Pods: Complete Guide

Use kubectl exec to debug running pods. Interactive shells, non-interactive commands, multi-container pods, and ephemeral debug containers.

⏱ 15 minutes kubeflowpytorchjobdistributed-training

Kubeflow PyTorchJob Training K8s

Run distributed PyTorch training on Kubernetes with Kubeflow PyTorchJob. ElasticPolicy, nproc_per_node, RDMA configuration, and multi-GPU scaling.

⏱ 15 minutes labelsannotationsmetadata

K8s Labels vs Annotations Explained

Kubernetes labels vs annotations differences explained. When to use each, recommended labels, label selectors, and annotation best practices for K8s.

⏱ 10 minutes letsencryptingresstls

Let's Encrypt Ingress Kubernetes

Set up Let's Encrypt TLS certificates for Kubernetes Ingress with cert-manager. HTTP-01 challenge, automatic renewal, and HTTPS redirect configuration.

⏱ 15 minutes local-pvpersistent-volumenode-affinity

Local Persistent Volumes Kubernetes

Configure local persistent volumes on Kubernetes for high-performance storage. Node affinity, local-path-provisioner, and SSD-backed database workloads.

⏱ 10 minutes multi-clusterfederationfleet

K8s Multi-Cluster Management Guide

Kubernetes multi-cluster management guide. Federation, Cluster API, Rancher, and GitOps patterns for fleet management across production environments.

⏱ 15 minutes namespaceterminatingfinalizers

Fix Namespace Stuck Terminating K8s

Fix Kubernetes namespaces stuck in Terminating state. Finalizer removal, API resource cleanup, and force deletion of stuck namespaces.

⏱ 10 minutes networkpolicyexamplesdeny

NetworkPolicy Examples Cookbook K8s

Copy-paste Kubernetes NetworkPolicy examples. Default deny all, allow DNS, allow specific namespace, database access, and external egress patterns.

⏱ 15 minutes node-notreadykubelettroubleshooting

Fix Node NotReady Status in Kubernetes

Troubleshoot Kubernetes nodes in NotReady state. Kubelet issues, disk pressure, network problems, certificate expiration, and recovery procedures.

⏱ 10 minutes taintmastercontrol-plane

Fix node-role.kubernetes.io/master

Remove the node-role.kubernetes.io/master taint to schedule pods on control plane nodes. Single-node clusters, tolerations, and untolerated taint fix.

⏱ 15 minutes oidcauthenticationsso

K8s OIDC Authentication Login Guide

Configure OIDC authentication for Kubernetes API server. --enable-oidc-issuer with GKE, Keycloak, Dex, kubelogin plugin, and RBAC SSO integration.

⏱ 15 minutes oomkilledmemorytroubleshooting

Fix OOMKilled Kubernetes Guide

Troubleshoot and fix OOMKilled errors in Kubernetes. Memory limit tuning, Java heap sizing, memory leak detection, and VPA recommendations.

⏱ 15 minutes pdbdisruption-budgetavailability

Pod Disruption Budget Best Practices

Configure PodDisruptionBudgets for high availability on Kubernetes. minAvailable vs maxUnavailable, voluntary disruptions, and upgrade coordination.

⏱ 15 minutes pendingschedulingtroubleshooting

Fix Pending Pods Kubernetes Guide

Troubleshoot Kubernetes pods stuck in Pending state. Insufficient resources, node selector mismatch, PVC binding, taints, and scheduling failures.

⏱ 15 minutes pvcpersistent-volumestorage

PersistentVolumeClaim PVC Guide K8s

Create and manage PersistentVolumeClaims on Kubernetes. Access modes, storage classes, volume expansion, and namespace-scoped PVC lifecycle.

⏱ 15 minutes evictiondisk-pressurememory-pressure

Fix Pod Eviction Kubernetes Guide

Troubleshoot Kubernetes pod evictions. DiskPressure, MemoryPressure, ephemeral storage limits, and eviction thresholds configuration.

⏱ 15 minutes pod-lifecyclepod-statespending

Pod Lifecycle and States Guide

Understand Kubernetes pod lifecycle phases and container states. Pending, Running, Succeeded, Failed, Unknown, and troubleshooting stuck pods.

⏱ 15 minutes rbacauditcompliance

RBAC Audit Review Kubernetes Guide

Audit Kubernetes RBAC permissions for security compliance. Identify over-permissioned roles, service account privileges, and least-privilege enforcement.

⏱ 15 minutes probesreadinessliveness

Readiness Liveness Startup Probes

Configure Kubernetes health probes correctly. When to use each probe type, common mistakes, and production-ready probe configurations.

⏱ 10 minutes readiness-probehealth-checkhttp-get

Readiness Probe Kubernetes Guide

Configure readiness probes correctly on Kubernetes. HTTP, TCP, exec probes, failure threshold tuning, and why readiness probes should never check databases.

⏱ 10 minutes resourcescpumemory

Resource Format 200m 256Mi Syntax

Understand Kubernetes resource format: CPU millicores (200m, 500m, 1) and memory units (256Mi, 1Gi). Syntax reference for requests, limits.

⏱ 10 minutes runtimeclassgvisorrunsc

RuntimeClass gVisor Kubernetes

Deploy gVisor as a sandboxed container runtime on Kubernetes using RuntimeClass. Covers installation, runsc configuration, and workload isolation.

⏱ 15 minutes secretsencryptionbest-practices

K8s Secrets Management Best Practices

Kubernetes secrets management best practices. Encryption at rest, external secrets operator, rotation strategies, and RBAC for secure secret handling.

⏱ 15 minutes securitychecklisthardening

K8s Security Checklist 2026 Guide

Complete Kubernetes security checklist for 2026. RBAC audit, network policies, pod security standards, image scanning, and compliance hardening steps.

⏱ 10 minutes dnsservice-discoveryfqdn

Service DNS Discovery Kubernetes

How Kubernetes DNS service discovery works. Service FQDN format, headless services, SRV records, and cross-namespace DNS resolution patterns.

⏱ 15 minutes storageclassprovisioningdynamic

Kubernetes StorageClass Complete Guide

Configure StorageClasses for dynamic provisioning on Kubernetes. Covers reclaim policies, volume binding modes, and cloud provider examples.

⏱ 15 minutes terminationgraceful-shutdownsigterm

terminationGracePeriodSeconds Guide

Configure terminationGracePeriodSeconds for Kubernetes pods. SIGTERM vs SIGKILL timing, connection draining, long-running tasks, and graceful shutdown.

⏱ 15 minutes velerosnapshotsbackup

Velero Snapshot Locations on Kubernetes

Configure Velero snapshot locations for Kubernetes backup. Volume snapshots, file system backup, cross-region copies, and backup verification.

⏱ 15 minutes vparecommenderright-sizing

VPA Recommender Setup Kubernetes

Configure the VPA Recommender for Kubernetes resource right-sizing. Off mode recommendations, memory-only mode, and interpreting VPA suggestions.

⏱ 15 minutes kustomizehelmcomparison

Kustomize vs Helm Comparison Guide

Kustomize vs Helm comparison for Kubernetes. When to use each tool, complexity trade-offs, GitOps compatibility, and combined workflow patterns.

⏱ 15 minutes ncclenvironment-variablesgpu

NCCL Environment Variables Reference

Complete NCCL environment variables reference for Kubernetes GPU training. NCCL_IB_DISABLE, NCCL_SOCKET_IFNAME, NCCL_DEBUG, and network tuning guide.

⏱ 10 minutes ncclbenchmarkgpu

NCCL Test Benchmark Kubernetes

Run NCCL tests on Kubernetes for GPU communication benchmarking. all_reduce_perf, all_gather_perf, multi-node bandwidth, and latency validation.

⏱ 15 minutes dcgmgpu-monitoringprometheus

NVIDIA DCGM Exporter GPU Monitoring

Monitor GPU metrics with DCGM Exporter on K8s. Prometheus integration, Grafana dashboards, and alerting on utilization and temperature.

⏱ 15 minutes gputime-slicingmig

GPU Time-Slicing vs MIG Comparison

Compare NVIDIA GPU time-slicing and MIG for K8s workloads. When to use each, performance trade-offs, and configuration examples.

⏱ 15 minutes openshiftlifecycleversions

OpenShift Lifecycle Versions Guide

OpenShift Container Platform lifecycle, version support, and upgrade planning. EUS versions, support timelines, K8s version mapping, and EOL dates.

⏱ 15 minutes openshiftoauthproxy

OpenShift OAuth Proxy Sidecar Guide

Protect K8s services with OpenShift OAuth proxy sidecar. Authentication, RBAC delegation, and SSO for internal dashboards.

⏱ 15 minutes openshiftroutesingress

OpenShift Routes vs Ingress Guide

Compare OpenShift Routes and Kubernetes Ingress. Covers edge, passthrough, and re-encrypt TLS termination, and when to use each option.

⏱ 15 minutes openshiftsccsecurity-context

OpenShift SCC Security Context Guide

Configure OpenShift Security Context Constraints for pods. Restricted, anyuid, privileged SCCs, custom SCC, and migration to PSA.

⏱ 15 minutes tensorrt-llminferencetriton

TensorRT-LLM Kubernetes Deployment

Deploy TensorRT-LLM on K8s for optimized inference. Engine building, model conversion, and serving with Triton Inference Server.

⏱ 10 minutes vpavertical-pod-autoscalerinstallation

VPA Setup hack/vpa-up.sh Guide

Install Vertical Pod Autoscaler with hack/vpa-up.sh on Kubernetes. Recommender, Updater, Admission Controller components and production configuration.

⏱ 15 minutes vllminferencellm

vLLM Deployment Kubernetes Guide

Deploy vLLM inference engine on K8s. Model loading, tensor parallelism, continuous batching, and OpenAI-compatible API setup.

⏱ 20 minutes ai-securityml-compliancemodel-encryption

AI ML Security and Compliance Kubernetes

Secure AI and ML workloads on Kubernetes with model encryption, data governance, audit logging, network isolation for training jobs.

⏱ 20 minutes gpuresource-optimizationbin-packing

AI Resource Allocation Optimization

Optimize GPU and memory allocation for AI workloads on Kubernetes. Right-size GPU requests, bin-packing strategies, gang scheduling.

⏱ 20 minutes cncfai-landscapecloud-native

CNCF AI Projects Landscape Kubernetes

Navigate the CNCF AI project landscape for Kubernetes. Kubeflow, KServe, KAITO, Volcano, and emerging projects for training, serving, scheduling.

⏱ 20 minutes dellswitchrocev2

Dell Switch RoCEv2 PFC ECN DSCP

Configure Dell OS10 switches for lossless RoCEv2 with PFC, ECN, WRED, and DSCP-to-traffic-class mapping. Priority 3 for RDMA traffic marked DSCP 24 and 26.

⏱ 20 minutes distributed-trainingtensorflowpytorch

Distributed Training TensorFlow PyTorch

Run distributed training jobs on Kubernetes with TensorFlow and PyTorch. Training Operator, multi-worker strategies, NCCL configuration.

⏱ 15 minutes ecnmachineconfigopenshift

ECN MachineConfig OpenShift Nodes

Enable ECN (Explicit Congestion Notification) on OpenShift nodes via MachineConfig for lossless RoCEv2 RDMA networking. Sysctl and Mellanox NIC configuration.

⏱ 20 minutes feastfeature-storeml-features

Feast Feature Store Kubernetes

Deploy Feast feature store on Kubernetes for ML feature management. Offline and online stores, feature serving, point-in-time joins.

⏱ 20 minutes gitlabrunnerhelm

GitLab Runner Helm Kubernetes Executor

Deploy GitLab Runner on Kubernetes with Helm. Configure concurrent jobs, internal registry, PodMonitor metrics, scale-to-zero, security contexts.

⏱ 20 minutes gpumigtime-slicing

GPU Sharing MIG and Time-Slicing Kubernetes

Share GPUs across multiple pods with NVIDIA MIG and time-slicing on Kubernetes. MIG profiles for A100/H100, time-slicing configuration.

⏱ 20 minutes kaitoinferencegpu-provisioning

KAITO AI Model Inference Kubernetes

Deploy AI models with KAITO (Kubernetes AI Toolchain Operator) for automated GPU provisioning, model serving, and inference workload management.

⏱ 20 minutes katibhyperparameterautoml

Katib Hyperparameter Tuning Kubernetes

Automate hyperparameter tuning with Katib on Kubernetes. Bayesian optimization, random search, grid search, early stopping.

⏱ 20 minutes knativeserverlessinference

KnativeServing for AI Inference OpenShift

Configure KnativeServing with scale-to-zero, GPU scheduling features, Kourier ingress, and custom domain templates for AI inference workloads on OpenShift.

⏱ 20 minutes kservemodel-servinginference

KServe Model Serving Kubernetes

Deploy ML models with KServe for serverless inference on Kubernetes. InferenceService, scale-to-zero, canary rollouts, model transformers.

⏱ 20 minutes kubeflowmlopsmachine-learning

Kubeflow ML Platform Setup Kubernetes

Deploy Kubeflow as a production-ready ML platform on Kubernetes. Notebooks, pipelines, training operators, and model serving with KServe for end-to-end MLOps.

⏱ 20 minutes cost-managementgpu-costchargeback

AI Cost Management on Kubernetes

Control AI infrastructure costs on Kubernetes with GPU utilization tracking, chargeback per team, spot instance strategies, right-sizing recommendations.

⏱ 20 minutes inferenceoptimizationbatching

AI Inference Optimization Kubernetes

Optimize AI inference performance on Kubernetes. Request batching, KV cache tuning, speculative decoding, continuous batching.

⏱ 20 minutes gpu-monitoringdcgmprometheus

AI Workload Monitoring Kubernetes

Monitor AI and GPU workloads on Kubernetes with DCGM Exporter, Prometheus, and Grafana. GPU utilization, memory usage, inference latency.

⏱ 15 minutes api-priorityfairnessflow-schema

API Priority and Fairness K8s Guide

Configure Kubernetes API Priority and Fairness to protect the API server. Covers FlowSchemas, PriorityLevelConfigurations, and request concurrency tuning.

⏱ 15 minutes argo-rolloutscanaryprogressive-delivery

Argo Rollouts Canary Blue-Green K8s

Progressive delivery with Argo Rollouts on Kubernetes. Canary, blue-green, analysis templates, and experiment-based promotion for safe deployments.

⏱ 20 minutes canaryflaggerprogressive-delivery

Canary Deployments with Flagger

Automate canary deployments in Kubernetes using Flagger with Istio, Linkerd, or NGINX ingress. Progressive traffic shifting, metric analysis.

⏱ 20 minutes cert-managertlscertificates

cert-manager Advanced Configuration

Advanced cert-manager patterns for Kubernetes. Wildcard certificates, DNS-01 challenges, certificate rotation, cross-namespace sharing.

⏱ 15 minutes chaos-engineeringlitmusresilience

LitmusChaos Chaos Engineering K8s

Run chaos experiments on Kubernetes with LitmusChaos. Pod kill, network latency, disk fill, and CPU stress experiments for resilience testing.

⏱ 20 minutes ciliumnetwork-policyebpf

Cilium Network Policies Kubernetes

Advanced network policies with Cilium on Kubernetes. L7 HTTP-aware policies, DNS-based egress, identity-based security, cluster-wide policies.

⏱ 15 minutes configmapconfigurationbest-practices

ConfigMap Best Practices K8s Guide

ConfigMap best practices for Kubernetes applications. Size limits, binary data, environment variables vs volume mounts, and hot-reload patterns.

⏱ 15 minutes configmapreloadreloader

ConfigMap Reload Patterns Kubernetes

Implement automatic ConfigMap reload in Kubernetes using volume projection, Reloader operator, checksum annotations, and inotify sidecars.

⏱ 15 minutes configmapsecretimmutable

Immutable ConfigMaps and Secrets

Use immutable ConfigMaps and Secrets for performance and safety in Kubernetes. Reduce API server load, prevent accidental changes.

⏱ 15 minutes container-runtimecontainerdcri-o

Container Runtime Comparison K8s

Compare Kubernetes container runtimes: containerd vs CRI-O vs Kata Containers. Performance, security, and use cases for each runtime in production.

⏱ 15 minutes corednsdnsnetworking

CoreDNS Customization Guide Kubernetes

Customize CoreDNS with forward zones, rewrite rules, cache tuning, and stub domains. Troubleshoot DNS resolution failures and optimize query performance.

⏱ 15 minutes cosignsigstoreimage-signing

Cosign Image Signing Kubernetes

Verify container image signatures with Cosign and Sigstore on Kubernetes. Policy enforcement with Kyverno, supply chain security, and SBOM attestation.

⏱ 15 minutes crdcustom-resourcedevelopment

CRD Development Kubernetes Guide

Design and implement Kubernetes Custom Resource Definitions. Schema validation, status subresource, printer columns, conversion webhooks.

⏱ 10 minutes cronjobschedulingbest-practices

CronJob Best Practices Kubernetes

Configure Kubernetes CronJobs with concurrency policies, failure handling, timezone scheduling, resource limits, and job history cleanup.

⏱ 20 minutes crossplaneinfrastructure-as-codecloud

Crossplane Infrastructure as Code

Manage cloud infrastructure from Kubernetes with Crossplane. Covers Composite Resources, Compositions, and provider configuration for AWS and GCP.

⏱ 15 minutes csistorage-driverdevelopment

Build Custom CSI Drivers Kubernetes

Develop custom Container Storage Interface drivers for Kubernetes. CSI spec, controller and node plugins, volume lifecycle, and testing with csi-sanity.

⏱ 20 minutes prometheuscustom-metricshpa

Custom Metrics with Prometheus Adapter

Expose application metrics to Kubernetes HPA via Prometheus Adapter. Configure custom.metrics.k8s.io for HTTP requests per second, queue depth.

⏱ 20 minutes schedulercustom-schedulerscheduling

Custom Scheduler Kubernetes Guide

Build and deploy custom Kubernetes schedulers for specialized workloads. Scheduler profiles, extender webhooks, scoring plugins.

⏱ 15 minutes daemonsetrolling-updateondelete

DaemonSet Update Strategies Kubernetes

Configure DaemonSet rolling updates with maxUnavailable, OnDelete strategy, partition rollouts, and canary updates for node-level workloads like log collectors.

⏱ 10 minutes debugephemeral-containerstroubleshooting

Debug Containers and Ephemeral Pods

Use kubectl debug with ephemeral containers to troubleshoot running pods without restart. Debug distroless images, node debugging.

⏱ 15 minutes dnscorednsdebugging

DNS Debugging Kubernetes Guide

Debug Kubernetes DNS issues systematically. CoreDNS troubleshooting, ndots configuration, search domains, and resolving slow DNS lookups.

⏱ 15 minutes endpointsliceservice-topologyrouting

EndpointSlices and Service Topology

Understand EndpointSlices for scalable service discovery in Kubernetes. Covers topology-aware routing and traffic localization for large clusters.

⏱ 15 minutes ephemeral-storageemptydireviction

Ephemeral Storage Management Guide

Manage ephemeral storage in Kubernetes with emptyDir size limits, ephemeral-storage requests and limits, and eviction thresholds.

⏱ 20 minutes etcdbackuprestore

etcd Backup and Restore Kubernetes

Back up and restore etcd for Kubernetes disaster recovery. Covers automated snapshots, S3 upload, and point-in-time restore procedures.

⏱ 15 minutes etcdmaintenancebackup

etcd Maintenance Operations Kubernetes

Perform etcd maintenance for Kubernetes clusters. Defragmentation, compaction, snapshot backup, member health checks, and performance monitoring with etcdctl.

⏱ 20 minutes external-dnsdnsautomation

ExternalDNS Automation Kubernetes

Automate DNS record management with ExternalDNS on Kubernetes. Route53, CloudDNS, and Azure DNS integration for Ingress, Service, and Gateway resources.

⏱ 15 minutes finalizersowner-referencesgarbage-collection

Finalizers and Ownership Guide

Understand Kubernetes finalizers and owner references for resource lifecycle management. Prevent resource leaks, implement cleanup logic.

⏱ 15 minutes gateway-apihttprouterouting

Gateway API HTTPRoute Kubernetes

Configure HTTPRoute for Kubernetes Gateway API. Path matching, header-based routing, traffic splitting, URL rewriting, and request mirroring.

⏱ 20 minutes gpukarpenterautoscaler

GPU Node Provisioning Kubernetes

Automate GPU node provisioning for Kubernetes with Karpenter, Cluster Autoscaler, and cloud-specific node pools for AI and ML workloads.

⏱ 20 minutes gpu-operatornvidiadriver

GPU Operator Advanced Configuration

Advanced NVIDIA GPU Operator configuration on Kubernetes. Driver containers, CUDA toolkit, GDS, GPUDirect RDMA, MIG manager, DCGM Exporter.

⏱ 15 minutes helmtestingchart-testing

Helm Chart Testing CI/CD Guide

Test Helm charts with helm test, helm lint, chart-testing, and conftest. Unit tests, integration tests, and CI/CD pipeline integration for chart quality.

⏱ 15 minutes helmlibrary-charttemplates

Helm Library Charts Reusable Guide

Create reusable Helm library charts for Kubernetes. Shared templates, named templates, and standardizing deployments across teams with common patterns.

⏱ 15 minutes helmociregistry

Helm OCI Registry Push Pull Guide

Push and pull Helm charts from OCI registries. Harbor, ECR, ACR, and GCR integration for Helm chart distribution and versioning.

⏱ 15 minutes dnscorednsautoscaling

DNS Autoscaling and CoreDNS Scaling

Scale CoreDNS horizontally with dns-autoscaler and proportional autoscaling. Tune cache size, configure node-local DNS cache.

⏱ 15 minutes hpacustom-metricsprometheus-adapter

HPA Custom Metrics Scaling Guide

Scale Kubernetes workloads on custom Prometheus metrics with HPA. Prometheus Adapter, external metrics, and request-rate-based scaling for web services.

⏱ 15 minutes image-pullregistrycache

Image Pull Optimization Kubernetes

Optimize container image pulls with pre-pulling DaemonSets, registry mirrors, image caching, and pull-through proxies for faster pod startup.

⏱ 10 minutes init-containerspatternsdependency

Init Container Patterns Kubernetes

Use init containers for dependency waiting, database migration, config generation, certificate fetching, and permission setup.

⏱ 15 minutes istiotraffic-managementvirtual-service

Istio Traffic Management Kubernetes

Advanced Istio traffic management on Kubernetes. VirtualService routing, DestinationRule load balancing, traffic mirroring, fault injection.

⏱ 15 minutes jaegertracingdistributed-tracing

Jaeger Tracing Kubernetes Guide

Deploy Jaeger for distributed tracing on Kubernetes. Collector, storage backends, sampling strategies, and trace analysis for microservice debugging.

⏱ 15 minutes jobbatchparallel

Job Completion Patterns Kubernetes

Configure Kubernetes Jobs with indexed completions, work queues, parallel processing, backoff limits, and TTL cleanup for batch workloads.

⏱ 15 minutes jobttlcleanup

Job TTL Cleanup Kubernetes Guide

Automate Kubernetes Job cleanup with TTL controller. ttlSecondsAfterFinished, CronJob history limits, and preventing completed Job accumulation.

⏱ 20 minutes kedaautoscalingevent-driven

KEDA Event-Driven Pod Autoscaling Guide

Scale Kubernetes workloads on external events with KEDA. Covers Kafka queue length, Prometheus metrics, and cron schedule trigger patterns.

⏱ 20 minutes kustomizeconfigurationoverlays

Kustomize Advanced Patterns Kubernetes

Advanced Kustomize patterns for Kubernetes configuration management. Strategic merge patches, JSON patches, components, replacements.

⏱ 15 minutes kustomizeoverlaysconfiguration

Kustomize Overlays Guide Kubernetes

Manage Kubernetes manifests with Kustomize overlays. Base and overlay patterns, strategic merge patches, JSON patches, ConfigMap generators.

⏱ 20 minutes lokiloggingpromtail

Loki Log Aggregation Kubernetes

Deploy Grafana Loki for log aggregation on Kubernetes. Promtail DaemonSet, LogQL queries, structured logging, retention policies, and Grafana integration.

⏱ 15 minutes longhorndistributed-storagereplication

Longhorn Distributed Storage K8s

Deploy Longhorn for distributed block storage on Kubernetes. Replicated volumes, snapshots, backups, and disaster recovery for bare-metal clusters.

⏱ 20 minutes metallbload-balancerbare-metal

MetalLB Bare Metal Load Balancer

Deploy MetalLB for LoadBalancer services on bare-metal Kubernetes. L2 mode, BGP mode, IP address pools, and integration with Cilium and Gateway API.

⏱ 20 minutes multi-clusterservice-meshistio

Multi-Cluster Service Mesh Kubernetes

Connect multiple Kubernetes clusters with service mesh federation. Istio multi-cluster, Linkerd multi-cluster, cross-cluster service discovery.

⏱ 20 minutes multi-clusterkubectxfleet

Multi-Cluster K8s Mgmt Patterns

Manage multiple Kubernetes clusters with kubectx, Cluster API, Fleet, and federation patterns. Context switching, workload distribution.

⏱ 15 minutes multi-tenancynamespacesisolation

Multi-Tenancy Namespaces Kubernetes

Implement multi-tenancy on Kubernetes with namespaces. Resource quotas, network policies, RBAC isolation, and hierarchical namespaces for team separation.

⏱ 15 minutes networkingtcpdumpdebug

Network Debugging Tools Kubernetes

Debug Kubernetes networking with tcpdump, netshoot, iptables tracing, conntrack inspection, and DNS resolution testing techniques.

⏱ 15 minutes network-policysecurityfirewall

NetworkPolicy Recipes Cookbook K8s

Common Kubernetes NetworkPolicy recipes. Default deny, allow DNS, namespace isolation, database access, and external egress patterns for zero-trust networking.

⏱ 15 minutes networkpolicyzero-trustsecurity

NetworkPolicy Zero Trust Kubernetes

Implement zero-trust networking with Kubernetes NetworkPolicies. Default-deny ingress and egress, namespace isolation, DNS egress rules, and Cilium L7 policies.

⏱ 15 minutes nfsstoragereadwritemany

NFS Dynamic Provisioner Kubernetes

Deploy NFS dynamic provisioner for ReadWriteMany storage on Kubernetes. NFS CSI driver, StorageClass configuration, and performance tuning with nconnect.

⏱ 15 minutes node-affinityschedulinglabels

Node Affinity Scheduling Kubernetes

Configure node affinity rules for Kubernetes pod scheduling. Required vs preferred affinity, label selectors, and combining with taints and tolerations.

⏱ 15 minutes draincordonmaintenance

Node Maintenance and Drain Operations

Safely drain Kubernetes nodes for maintenance with cordon, drain, and uncordon. Handle PodDisruptionBudgets, DaemonSets, and local storage.

⏱ 20 minutes opagatekeeperpolicy

OPA Gatekeeper Policy Enforcement

Enforce policies with OPA Gatekeeper on Kubernetes. ConstraintTemplates, Constraints, dry-run mode, audit, and common policies for security compliance.

⏱ 20 minutes opentelemetrytracingobservability

OpenTelemetry Collector Kubernetes

Deploy the OpenTelemetry Collector on Kubernetes for unified observability. Traces, metrics, and logs pipeline configuration, auto-instrumentation.

⏱ 15 minutes operatorsdkcontroller

Build Operators with Operator SDK

Build Kubernetes operators with Operator SDK. Controller reconciliation, custom resources, status subresource, leader election, and testing patterns.

⏱ 15 minutes pdbrolling-updatedisruption-budget

PDB Rolling Update Coordination K8s

Coordinate PodDisruptionBudgets with rolling updates on Kubernetes. minAvailable vs maxUnavailable, voluntary disruptions, and upgrade-safe configurations.

⏱ 15 minutes pvcexpansionstorage

Persistent Volume Expansion Kubernetes

Expand PersistentVolumeClaims online without downtime. allowVolumeExpansion, filesystem resize, StatefulSet PVC expansion.

⏱ 15 minutes affinityanti-affinityscheduling

Pod Affinity and Anti-Affinity Guide

Configure pod affinity and anti-affinity rules for Kubernetes scheduling. Co-locate cache with app, spread replicas across nodes.

⏱ 10 minutes pdbdisruptionmaintenance

Pod Disruption Budget Strategies

Configure PodDisruptionBudgets for zero-downtime maintenance. minAvailable vs maxUnavailable strategies for stateful workloads, GPU training.

⏱ 15 minutes pod-securitypsastandards

Kubernetes Pod Security Standards Guide

Implement Pod Security Standards with Pod Security Admission. Privileged, baseline, and restricted profiles, namespace labels.

⏱ 15 minutes topology-spreadschedulinghigh-availability

Pod Topology Spread Advanced Patterns

Advanced topology spread constraints for Kubernetes. Multi-zone HA, GPU rack awareness, combined with affinity rules, and minDomains for scaling clusters.

⏱ 15 minutes prioritypreemptionscheduling

Priority and Preemption Scheduling

Configure PriorityClasses for Kubernetes workload scheduling. System-critical pods, GPU training preemption, and preemptionPolicy Never for batch workloads.

⏱ 20 minutes prometheusalertingalertmanager

Prometheus Alerting Rules Kubernetes

Write effective Prometheus alerting rules for Kubernetes. Alertmanager routing, inhibition, silence, and production-ready alert templates for CPU, memory.

⏱ 15 minutes persistent-volumereclaim-policystorage

PV Reclaim Policy Retain vs Delete

Understand Kubernetes PersistentVolume reclaim policies. Retain vs Delete vs Recycle, recovering data from released PVs.

⏱ 15 minutes rbacsecurityleast-privilege

RBAC Least Privilege Kubernetes

Configure Kubernetes RBAC with least-privilege Roles, ClusterRoles, and service account bindings. Audit permissions, restrict secrets access.

⏱ 15 minutes rbactroubleshootingpermissions

Fix RBAC Permission Errors K8s

Debug Kubernetes RBAC permission errors. kubectl auth can-i, impersonation testing, ClusterRole aggregation, and common permission mistakes.

⏱ 15 minutes resourceslimitsrequests

Resource Limits and Requests Guide

Configure CPU and memory requests and limits for Kubernetes pods. Guaranteed vs Burstable vs BestEffort QoS classes, OOMKill prevention.

⏱ 15 minutes cpumemorycgroups

CPU and Memory Limits Deep Dive

Deep dive into Kubernetes CPU and memory management. CFS bandwidth throttling, OOMKill scoring, cgroup v2 behavior, memory.high vs memory.

⏱ 15 minutes rookcephblock-storage

Rook Ceph Storage Kubernetes Guide

Deploy Rook-Ceph for enterprise storage on Kubernetes. Block, file, and object storage, erasure coding, and multi-site replication for production workloads.

⏱ 20 minutes sealed-secretssecretsencryption

Sealed Secrets Management Kubernetes

Manage secrets securely with Bitnami Sealed Secrets on Kubernetes. Encrypt secrets for Git storage, cluster-scoped and namespace-scoped sealing.

⏱ 15 minutes secretsvaultexternal-secrets

External Secrets Management Kubernetes

Integrate Kubernetes with external secret stores using External Secrets Operator. Sync secrets from HashiCorp Vault, AWS Secrets Manager, Azure Key Vault.

⏱ 15 minutes service-accounttokensauthentication

Service Account Tokens Kubernetes

Manage Kubernetes service account tokens securely. Projected volumes, bound tokens, token request API, and eliminating long-lived tokens for zero-trust authentication.

⏱ 15 minutes service-accountworkload-identityirsa

Service Accounts and Workload Identity

Configure Kubernetes service accounts with cloud workload identity for AWS IRSA, GCP Workload Identity, and Azure AD pod federation.

⏱ 15 minutes service-meshistiolinkerd

Service Mesh Comparison Kubernetes

Compare Istio, Linkerd, and Cilium service mesh for Kubernetes. mTLS, observability, traffic management, resource overhead.

⏱ 15 minutes statefulsetstatefuldatabases

Kubernetes StatefulSet Management Guide

Manage stateful applications on Kubernetes with StatefulSets. Ordered deployment, stable network identity, persistent storage.

⏱ 20 minutes storage-classcsipersistent-volume

Storage Classes and Provisioners

Configure Kubernetes StorageClasses for dynamic volume provisioning. CSI drivers, reclaim policies, volume expansion, topology-aware provisioning.

⏱ 15 minutes tempotracinggrafana

Grafana Tempo Tracing Kubernetes

Deploy Grafana Tempo for cost-effective distributed tracing on Kubernetes. Object storage backend, TraceQL queries, and Grafana integration.

⏱ 15 minutes thanosprometheushigh-availability

Thanos HA Prometheus Kubernetes

Scale Prometheus with Thanos for high availability and long-term storage on Kubernetes. Sidecar, Store, Compactor, and Query frontend for multi-cluster metrics.

⏱ 15 minutes topologyroutingzone-aware

Topology-Aware Routing Kubernetes

Enable topology-aware routing for cost optimization on Kubernetes. Zone-local traffic, EndpointSlice hints, and reducing cross-zone data transfer costs.

⏱ 20 minutes velerobackuprestore

Velero Backup and Restore Kubernetes

Back up and restore Kubernetes applications with Velero. Scheduled backups, cross-cluster migration, selective restore, and disaster recovery workflows.

⏱ 15 minutes vpaautoscalingright-sizing

Vertical Pod Autoscaler Deep Dive

Configure VPA for automatic memory and CPU right-sizing in Kubernetes. Recommendation modes, update policies, VPA with HPA coexistence, and GPU workload tuning.

⏱ 20 minutes vpavertical-pod-autoscalerright-sizing

VPA Resource Right-Sizing Kubernetes

Use Vertical Pod Autoscaler to right-size Kubernetes resource requests and limits. Off mode for recommendations, Auto mode for live adjustment.

⏱ 20 minutes kueuejob-queuingfair-sharing

Kueue Job Queuing Fair Sharing Kubernetes

Implement fair-share GPU job queuing with Kueue on Kubernetes. ClusterQueues, LocalQueues, ResourceFlavors, and cohort-based borrowing for multi-team AI cl.

⏱ 20 minutes llmdeploymentgpu-memory

LLM Deployment Challenges Kubernetes

Address common LLM deployment challenges on Kubernetes. GPU memory management, model loading optimization, inference latency tuning, batch scheduling.

⏱ 15 minutes mellanoxrocedscp

Mellanox RoCE DSCP QoS DaemonSet

Deploy a DaemonSet that configures DSCP trust, PFC priority 3, and RoCE ToS 106 on all Mellanox PFs. Uses DOCA driver image with ibdev2netdev, mlnx_qos.

⏱ 20 minutes ml-pipelinekubeflow-pipelinesargo-workflows

ML Pipeline Automation Kubernetes

Automate ML pipelines on Kubernetes with Kubeflow Pipelines, Argo Workflows, and Tekton. Data preprocessing, training, evaluation, model registration.

⏱ 20 minutes modelmeshmulti-modelinference

ModelMesh Multi-Model Serving Kubernetes

Deploy hundreds of ML models on shared GPU infrastructure with ModelMesh. Intelligent model loading and unloading, memory management, routing.

⏱ 20 minutes multi-cloudgpu-availabilityspot-instances

Multi-Cloud AI Workloads Kubernetes

Run AI workloads across multiple cloud providers with Kubernetes. GPU instance availability, spot pricing arbitrage, model portability.

⏱ 30 minutes ncclsriovgds

NCCL SR-IOV GDS PyTorch Configuration

Configure NCCL with SR-IOV RDMA and GPUDirect Storage on Kubernetes. PyTorch 25.11 container with NCCL 2.28, CUDA 13, MOFED 5.4, GDRCopy 2.

⏱ 20 minutes rdmaqostraffic-class

RDMA Network QoS Traffic Classes DCQCN

Complete RDMA network QoS architecture with traffic classes TC0-TC6, DSCP and dot1p mappings, PFC, ECN, WRED, and DCQCN congestion control for lossless RoC.

⏱ 30 minutes rocev2losslesspfc

RoCEv2 End-to-End Lossless Stack

Complete RoCEv2 lossless fabric configuration from GPU node to switch and back. Dell OS10 switches, Mellanox NICs, OpenShift MachineConfig, PFC, ECN.

⏱ 20 minutes volcanobatch-schedulinggang-scheduling

Volcano Job minAvailable Gang Schedule

Volcano batch scheduling with minAvailable gang scheduling on Kubernetes. Job configuration, queue policies, and AI training workload scheduling.

⏱ 15 minutes aiperfvllmbenchmarking

AIPerf Offline vLLM Benchmarking

Benchmark vLLM inference with AIPerf in air-gapped Kubernetes clusters. Use dummy tokenizers, offline mode, custom endpoints.

⏱ 18 minutes ib-write-bwperftestrdma

ib_write_bw RDMA Bandwidth Testing

Run ib_write_bw from perftest on Kubernetes to measure RDMA write bandwidth between GPU nodes. Full CLI reference, bidirectional tests, HugePages.

⏱ 10 minutes openshiftoperatorhubair-gapped

Disable OperatorHub Default Sources

Disable default OperatorHub catalog sources in OpenShift for air-gapped clusters. Use OperatorHub CR to disable individual or all sources with Ansible auto.

⏱ 25 minutes runaivllmnccl

Run:ai Distributed vLLM with NCCL

Deploy distributed vLLM inference on Run:ai with NCCL over NVLink and RDMA. Tensor parallelism across GPUs with NCCL debug logging, SR-IOV networking.

⏱ 20 minutes aiperfbenchmarkingllm

AIPerf LLM Benchmarking on K8s

Benchmark generative AI inference on Kubernetes with NVIDIA AIPerf. Measure TTFT, ITL, throughput, and latency across vLLM, NIM.

⏱ 18 minutes databasesmemoryovercommit

Databases on K8s: Memory Overcommit

Why vm.overcommit_memory must be disabled for production databases on Kubernetes. Configure guaranteed QoS, disable swap.

⏱ 25 minutes docaperftestrdma

DOCA Perftest RDMA Benchmarking

Run NVIDIA DOCA perftest on Kubernetes to benchmark RDMA bandwidth and latency between GPU nodes. Traffic patterns, GPUDirect memory modes.

⏱ 20 minutes mlnx-qosmofedpfc

mlnx_qos QoS on MOFED Containers

Configure RDMA QoS with mlnx_qos from MOFED containers on Kubernetes. Set PFC, ETS, DSCP trust mode, and validate lossless RoCE traffic classes on ConnectX.

⏱ 20 minutes retinanetgpu-trainingmemlock

RetinaNet GPU Training on Kubernetes

Train RetinaNet object detection models on Kubernetes with unlimited memlock for RDMA, CRI-O ulimit configuration, and multi-GPU distributed training.

⏱ 15 minutes certificatescsrtls

Kubernetes Certificate Signing Requests

Use the Kubernetes CSR API to issue, approve, and manage TLS certificates. Automate certificate workflows for services, users, and kubelet rotation.

⏱ 8 minutes startup-probeprobeshealth-check

Kubernetes startupProbe Configuration Guide

Configure startupProbe for slow-starting containers to prevent premature kills. Understand interaction with liveness and readiness probes.

⏱ 10 minutes daemonsetupdate-strategyrolling-update

Kubernetes DaemonSet Update Strategies

Configure DaemonSet rolling updates with maxUnavailable and maxSurge. Understand OnDelete vs RollingUpdate strategies for node-level workloads.

⏱ 10 minutes endpointsliceservice-discoverynetworking

EndpointSlice Service Discovery

Understand Kubernetes EndpointSlices for scalable service discovery. Compare with legacy Endpoints and configure topology-aware routing.

⏱ 12 minutes graceful-shutdownprestopsigterm

Kubernetes preStop Hooks for Graceful Shutdown

Configure preStop hooks and terminationGracePeriodSeconds for zero-downtime pod termination. Handle SIGTERM correctly in your applications.

⏱ 15 minutes hpaautoscalingmetrics

HPA v2 Multiple Metrics Scaling Guide

Configure HorizontalPodAutoscaler v2 with CPU, memory, custom, and external metrics. Control scaling behavior with stabilization windows.

⏱ 8 minutes image-pull-policycontainer-imagesregistry

Kubernetes imagePullPolicy Guide

Configure imagePullPolicy correctly: Always, Never, and IfNotPresent behavior. Understand digest pinning and tag mutability implications.

⏱ 12 minutes jobsbatchparallelism

Kubernetes Job Parallelism Guide

Configure Kubernetes Jobs with parallelism, completions, and indexed completion mode for efficient batch processing and parallel workloads.

⏱ 8 minutes limitrangeresource-defaultsnamespace

Kubernetes LimitRange Defaults

Set default resource requests and limits per namespace with LimitRange. Enforce min/max constraints and prevent unbounded resource consumption.

⏱ 12 minutes sidecarmulti-containerambassador

Multi-Container Pod Patterns in Kubernetes

Implement sidecar, ambassador, and adapter patterns in Kubernetes pods. Share volumes and network namespace between containers for modular architectures.

⏱ 12 minutes network-policyegresssecurity

Kubernetes Egress Network Policies

Control outbound traffic from pods with egress NetworkPolicies. Allow DNS, block internet access, and restrict pod-to-pod communication by namespace.

⏱ 10 minutes node-affinityschedulingnode-selector

Kubernetes Node Affinity Guide

Schedule pods to specific nodes with requiredDuringScheduling and preferredDuringScheduling node affinity. Control placement with expressions and weights.

⏱ 10 minutes persistent-volumereclaim-policystorage

PersistentVolume Reclaim Policies

Understand Retain, Delete, and Recycle reclaim policies for PersistentVolumes. Manage PV lifecycle after PVC deletion and recover bound volumes.

⏱ 12 minutes prioritypreemptionscheduling

Pod Priority Preemption Kubernetes

Configure PriorityClasses to ensure critical workloads get resources by preempting lower-priority pods. Understand preemption mechanics and safeguards.

⏱ 15 minutes topologyschedulinghigh-availability

Pod Topology Spread Constraints Guide

Use topologySpreadConstraints to distribute pods evenly across zones, nodes, and failure domains for high availability in Kubernetes.

⏱ 10 minutes projected-volumessecretsconfigmap

Kubernetes Projected Volumes Explained

Combine Secrets, ConfigMaps, Downward API, and ServiceAccount tokens into a single projected volume mount for cleaner pod configuration.

⏱ 10 minutes rolling-updatedeployment-strategyrollout

Kubernetes Rolling Update Strategy

Configure rolling update deployments with maxSurge and maxUnavailable to control rollout speed, minimize downtime, and enable safe progressive delivery.

⏱ 12 minutes topology-routingzone-awaretraffic-distribution

Topology-Aware Service Routing

Enable zone-aware traffic routing in Kubernetes to reduce cross-zone latency and egress costs. Configure topology hints and traffic distribution.

⏱ 12 minutes statefulsetheadless-servicedns

StatefulSet Headless Service DNS

Configure StatefulSets with headless services for stable network identities. Understand pod DNS, ordered deployment, and persistent storage patterns.

⏱ 15 minutes admission-policycelpolicy

ValidatingAdmissionPolicy with CEL

Replace admission webhooks with ValidatingAdmissionPolicy and CEL expressions for in-process, low-latency Kubernetes policy enforcement.

⏱ 20 minutes sriovrdmamellanox

SR-IOV NetworkNodePolicy for RDMA

Configure SriovNetworkNodePolicy on OpenShift to create RDMA-capable VFs on Mellanox ConnectX NICs for GPUDirect RDMA and high-performance AI networking.

⏱ 20 minutes cert-managerovhdns-01

cert-manager OVH DNS-01 Wildcard TLS

Configure cert-manager with OVH DNS-01 challenge for automated wildcard TLS certificates on k3s. Let's Encrypt production certificates with zero downtime r.

⏱ 30 minutes ciliumebpfgateway-api

Cilium eBPF Gateway API Hubble k3s

Install Cilium with eBPF dataplane, Gateway API support, and Hubble observability on k3s. Replace kube-proxy with eBPF, configure GatewayClass.

⏱ 20 minutes cloudnativepgpostgresqldatabase

CloudNativePG PostgreSQL on Kubernetes

Deploy PostgreSQL on Kubernetes with CloudNativePG operator. Cluster setup, affinity, backups to S3, connection pooling, and high availability configuration.

⏱ 15 minutes 502bad-gatewayingress

Fix 502 Bad Gateway in Kubernetes

Troubleshoot and fix 502 Bad Gateway errors in Kubernetes. Causes include pod readiness timing, ingress misconfiguration, upstream timeouts.

⏱ 60 minutes gitopsargocdoctopus-deploy

Full GitOps Pipeline k3s to Production

End-to-end GitOps pipeline: git push triggers Gitea Actions build, pushes to quay.io, Octopus Deploy creates release with ephemeral preview.

⏱ 20 minutes gateway-apihttproutetls

Gateway API HTTPRoutes TLS on k3s

Configure Gateway API HTTPRoutes with TLS termination on k3s using Cilium. Route traffic to multiple services with wildcard certificates and HTTP-to-HTTPS .

⏱ 25 minutes giteaactionsci-cd

Gitea Actions Runner Push to Quay

Deploy Gitea Actions runner on k3s to build container images and push to quay.io. DinD-less builds with Kaniko, automated CI pipelines for every git push.

⏱ 30 minutes giteapostgresqlvalkey

Gitea PostgreSQL Valkey on k3s

Deploy self-hosted Gitea with PostgreSQL and Valkey (Redis fork) on k3s. Complete Git forge with Actions CI runner, container registry, and package management.

⏱ 10 minutes helmhooksdelete-policy

Helm Hook Delete Policy Explained

Configure Helm hook delete policies: before-hook-creation, hook-succeeded, hook-failed. Control Job cleanup after install, upgrade, and test hooks.

⏱ 15 minutes helmociregistry

Helm OCI Registry for Charts Explained

Store and manage Helm charts in OCI-compliant registries like GHCR, ECR, ACR, and Quay. Push, pull, and version charts using standard container registries.

⏱ 15 minutes hugonginxstatic-site

Hugo nginx Static Site on a k3s Cluster

Deploy a Hugo static site with nginx on k3s. Multi-stage build, Brotli compression, security headers, and automated redeployment on git push via Gitea Actions.

⏱ 30 minutes fedorakubeadminstall

Install Kubernetes on Fedora with kubeadm

Step-by-step guide to install Kubernetes on Fedora Linux using kubeadm. Disable swap, configure containerd, install kubeadm kubelet kubectl.

⏱ 45 minutes kairosk3shetzner

Kairos k3s on Hetzner CPX42: Immutable Bootstrap

Deploy an immutable Kairos-based k3s cluster on Hetzner Cloud CPX42. Automated provisioning with cloud-init, immutable OS upgrades.

⏱ 5 minutes kubectlcpcopy

kubectl cp Copy Files to and from Pods

Copy files between a local machine and Kubernetes pods with kubectl cp. Supports containers, namespaces, tar-based transfer, and common troubleshooting.

⏱ 10 minutes kubectllogsdebugging

kubectl logs View Pod Logs Guide

View and stream Kubernetes pod logs with kubectl logs. Multi-container pods, previous crashes, label selectors, timestamps, and log aggregation patterns.

⏱ 5 minutes kubectlrolloutrestart

kubectl rollout restart Deployment

Restart Kubernetes Deployments, StatefulSets, and DaemonSets with kubectl rollout restart. Zero-downtime rolling restart without changing pod spec.

⏱ 15 minutes cluster-autoscalerautoscalingnode-scaling

Kubernetes Cluster Autoscaler Configuration

Configure Kubernetes Cluster Autoscaler: scale-down delay, node group settings, priority expander, GPU scaling, and cloud provider integration for EKS, GKE.

⏱ 10 minutes configmapkubectlconfiguration

Create ConfigMap from File in Kubernetes

Create Kubernetes ConfigMaps from files, directories, and env files with kubectl. Mount as volumes or inject as environment variables in pods.

⏱ 20 minutes profilingpyroscopeperformance

Continuous Profiling with Pyroscope

Deploy Pyroscope on Kubernetes for continuous CPU and memory profiling. Identify performance bottlenecks in production with minimal overhead.

⏱ 15 minutes csisnapshotsstorage

CSI Volume Snapshots and Restore

Create and restore volume snapshots using CSI VolumeSnapshot API. Configure VolumeSnapshotClass, take point-in-time backups, and clone PVCs from snapshots.

⏱ 10 minutes dnsdnspolicyhostnetwork

Kubernetes DNS Policy ClusterFirstWithHostNet

Configure Kubernetes DNS policies: ClusterFirst, ClusterFirstWithHostNet, Default, and None. Fix DNS resolution for hostNetwork pods and custom nameservers.

⏱ 10 minutes downward-apienvironment-variablesmetadata

Kubernetes Downward API: Pod Metadata in Env

Expose pod metadata to containers using Kubernetes Downward API. Access pod name, namespace, node name, labels, annotations.

⏱ 10 minutes ephemeral-volumesstoragecsi

Generic Ephemeral Volumes in Kubernetes

Use generic ephemeral volumes for per-pod temporary storage with CSI driver features. Scratch space, caching, and temp data without pre-provisioned PVCs.

⏱ 10 minutes finalizersdeletioncontrollers

Kubernetes Finalizers Explained

How Kubernetes finalizers work: prevent resource deletion until cleanup completes. Custom finalizer patterns, stuck resource recovery.

⏱ 15 minutes gateway-apigrpcnetworking

Gateway API gRPC Routes on Kubernetes

Configure Kubernetes Gateway API GRPCRoute for gRPC traffic routing. Service-level matching, header-based routing, and traffic splitting for gRPC services.

⏱ 10 minutes hostpathvolumesstorage

Kubernetes hostPath Volume Guide

Use hostPath volumes to mount node filesystem paths into pods. Types, security risks, use cases for DaemonSets, and safer alternatives like local PVs.

⏱ 20 minutes hpaautoscalingscaling-policies

HPA Behavior and Scaling Policies

Configure HPA scaling behavior with stabilization windows, scaling policies, and rate limiting. Fine-tune scale-up and scale-down speed.

⏱ 10 minutes hpaautoscalingcontainer-metrics

HPA Container Resource Metrics

Configure HPA to scale based on individual container metrics instead of pod-level averages. Target specific containers in multi-container pods.

⏱ 20 minutes hpaautoscalingprometheus

Kubernetes HPA Custom Metrics with Prometheus

Configure Kubernetes HPA with custom Prometheus metrics. Prometheus Adapter setup, custom and external metrics, scaling on request latency, queue depth.

⏱ 15 minutes kustomizekustomizationconfiguration

Kubernetes kustomization.yaml Guide

Write kustomization.yaml files for Kubernetes resource management. Overlays, patches, generators, transformers, and multi-environment deployment patterns.

⏱ 20 minutes cert-managerletsencrypttls

K8s Let's Encrypt Ingress with cert-manager

Automate TLS certificates for Kubernetes Ingress using cert-manager and Let's Encrypt. ClusterIssuer setup, HTTP-01 and DNS-01 challenges, and auto-renewal.

⏱ 10 minutes livenessprobeshealth-check

Kubernetes Liveness Probe Best Practices

Configure Kubernetes liveness probes correctly. Best practices for httpGet, exec, and tcpSocket probes. Avoid database checks, thundering herd.

⏱ 25 minutes autoscalingmpahpa

Multidimensional Pod Autoscaler (MPA)

Configure Google's Multidimensional Pod Autoscaler to scale both horizontally and vertically simultaneously. Combines HPA and VPA logic in one controller.

⏱ 10 minutes networkpolicyegressdeny

Kubernetes NetworkPolicy Default Deny Egress

Implement Kubernetes NetworkPolicy default deny egress rules. Block all outbound traffic, then allow specific destinations: DNS, external APIs.

⏱ 10 minutes nodestatuskubectl

Check Kubernetes Node Status with kubectl

Check and troubleshoot Kubernetes node status with kubectl. Node conditions (Ready, MemoryPressure, DiskPressure), NotReady debugging, and capacity monitoring.

⏱ 15 minutes opentelemetrytracingauto-instrumentation

OpenTelemetry Auto-Instrumentation

Configure OpenTelemetry Operator auto-instrumentation to inject tracing into pods without code changes. Supports Java, Python, Node.js, .NET, and Go.

⏱ 10 minutes prioritypriorityclassscheduling

K8s PriorityClass and Missing Pod Priority

Fix missing pod priority in Kubernetes. PriorityClass configuration, preemption behavior, system-critical classes, and scheduling order for GPU workloads.

⏱ 10 minutes release-cycleversioningupgrade

Kubernetes Release Cycle and Version Support

Kubernetes release cycle explained: 3 releases per year, 14-month support window, patch cadence, version skew policy, and upgrade planning timeline.

⏱ 15 minutes service-accounttokenrbac

Kubernetes Service Account Token Guide

Create and manage Kubernetes service account tokens. TokenRequest API, projected volumes, long-lived tokens, and RBAC binding for pod-to-API authentication.

⏱ 10 minutes dnsservicecoredns

Kubernetes Service DNS Resolution

How Kubernetes Service DNS works: naming conventions, FQDN format, headless services, cross-namespace resolution, and DNS debugging with nslookup.

⏱ 10 minutes terminationgraceful-shutdownsigterm

terminationGracePeriodSeconds Default

Configure Kubernetes terminationGracePeriodSeconds for graceful pod shutdown. Default 30s, SIGTERM handling, preStop hooks, and per-container settings.

⏱ 20 minutes multi-clusterfederationfleet

Multi-Cluster Fleet Management on Kubernetes

Manage multiple Kubernetes clusters with kubectl contexts, federation, GitOps fleet patterns, and tools like Rancher, ArgoCD, and Cluster API.

⏱ 15 minutes mutagenfile-syncdevelopment

Mutagen Kubernetes File Sync Guide

Sync files between a local machine and Kubernetes pods with Mutagen. Real-time bidirectional sync for development, hot-reload workflows.

⏱ 15 minutes nccltopologygpu

NCCL Topology Dump File for GPU Debugging

Use NCCL_TOPO_DUMP_FILE to capture and analyze GPU interconnect topology in Kubernetes. Debug NVLink, NVSwitch, and PCIe connection paths.

⏱ 40 minutes octopus-deploymssqlrelease-management

Octopus Deploy 2025.4 on Kubernetes

Deploy Octopus Deploy 2025.4 with MSSQL and Kubernetes agent on k3s. Release orchestration with ephemeral preview environments, approval gates.

⏱ 10 minutes kubectlrecordingaudit

Record kubectl Sessions for Kubernetes

Record and replay kubectl sessions for auditing, documentation, and training. Terminal recording with asciinema, script, and kubectl plugins for OpenShift.

⏱ 30 minutes runaivllmdistributed-inference

Run:ai Distrib. vLLM Inference Multimodal LLMs

Deploy multimodal LLMs with Run:ai distributed inference and vLLM on Kubernetes. Tensor parallelism, NCCL over NVLink, GPUDirect RDMA.

⏱ 30 minutes dcbdcbxpfc

DCB on Mellanox ConnectX: Lossless Ethernet...

Configure Data Center Bridging (DCB) on Mellanox ConnectX NICs. DCBX negotiation, PFC, ETS, and CN for lossless RoCE Ethernet in Kubernetes AI clusters.

⏱ 10 minutes etspfcdscp

ETS Queue, PFC, DSCP Trust on Mellanox Quic...

Quick reference for enabling ETS queues, PFC, DSCP trust, and DSCP-to-priority mapping on Mellanox ConnectX NICs. Three commands for lossless RoCE Ethernet.

⏱ 15 minutes day-2-operationsplatform-engineeringautoscaling

Kubernetes Day 2: Where the Leverage Kicks In

Why Kubernetes pays off after initial setup. Day 2 operations leverage: auto-scaling, self-healing, rolling updates, observability.

⏱ 15 minutes quick-startdeploymentdeveloper-experience

Deploy a New App in 5 Minutes on Kubernetes

Deploy a production-ready application in 5 minutes on an existing Kubernetes cluster. Deployment, Service, Ingress, TLS, autoscaling.

⏱ 15 minutes namespacetemplatesonboarding

Namespace Templates: Instant Envs in K8s

Create production-ready namespace templates for instant environment provisioning. One command deploys namespace, RBAC, quotas, network policies, and monitoring.

⏱ 15 minutes platform-engineeringgolden-pathbackstage

Platform Engineering: Golden Paths in K8s

Build golden paths for developers on Kubernetes. Internal developer platform with Backstage, self-service namespaces, pre-built Helm charts.

⏱ 15 minutes cicdpipelinegithub-actions

Reusable CI/CD Pipeline Templates for K8s

Build once, deploy anything. Reusable CI/CD pipeline templates for Kubernetes using GitHub Actions, GitLab CI, and Tekton.

⏱ 20 minutes nmstatenmstatectlnncp

NMState & nmstatectl: Node Network Management

Manage node networking with NMState declarative API and nmstatectl CLI. Create NodeNetworkConfigurationPolicy manifests, verify with nmstatectl.

⏱ 25 minutes pfcmellanoxconnectx

PFC Configuration on Mellanox ConnectX NICs

Enable Priority Flow Control on Mellanox ConnectX-6/7 NICs for lossless RoCE. mlnx_qos, cma_roce_mode, DSCP trust, ECN, and firmware-level PFC verification.

⏱ 30 minutes access-zonesscale-out-naspowerscale

Access Zones on Scale-Out NAS for Kubernetes

Configure access zones on scale-out NAS (Dell PowerScale/Isilon) for Kubernetes persistent storage. Multi-tenant isolation, CSI driver setup.

⏱ 25 minutes extended-resourcesrdmashared-device-plugin

Extended Resources & RDMA Shared Device Plugin

Kubernetes extended resources for RDMA devices using the shared device plugin. Advertise and schedule InfiniBand and RoCE NICs without SR-IOV using k8s-rdm.

⏱ 25 minutes routeingressopenshift-route

Kubernetes Route and Ingress Management Guide

Manage OpenShift Routes and Kubernetes Ingress resources. TLS termination, path-based routing, weighted traffic splitting.

⏱ 25 minutes secret-rotationcert-managerexternal-secrets

Automate Secret and Key Rotation in Kubernetes

Automate TLS certificate and secret key rotation in Kubernetes. CronJob-based rotation, external-secrets-operator, cert-manager auto-renewal.

⏱ 25 minutes rbaconboardingoffboarding

Automate User Onboarding & Offboarding in K8s

Automate Kubernetes user onboarding and offboarding. RBAC provisioning, namespace creation, quota assignment, OIDC group sync, and access revocation scripts.

⏱ 35 minutes iommuvfiogpu-passthrough

IOMMU on K8s: GPU Passthrough and SR-IOV

Enable and configure IOMMU for GPU passthrough, SR-IOV, and VFIO on Kubernetes. Kernel parameters, IOMMU groups, device isolation, and troubleshooting guide.

⏱ 30 minutes major-upgrademinor-upgradeapi-deprecation

Kubernetes and OpenShift Major Version Upgrade

Upgrade Kubernetes minor versions (1.31→1.32) and OpenShift (4.16→4.17, EUS-to-EUS). API deprecation migration, etcd backup.

⏱ 30 minutes patch-updateupgradekubeadm

Kubernetes and OpenShift Patch Updates

Apply patch updates to Kubernetes and OpenShift clusters safely. Patch version upgrades for control plane, kubeadm, kubelet.

⏱ 30 minutes upgradeopenshiftkubernetes

Kubernetes and OpenShift Upgrade Strategy

Complete upgrade strategy for Kubernetes and OpenShift clusters. Understand patch, minor, and major versions, upgrade paths.

⏱ 25 minutes mariadbopenshiftscc

Deploy MariaDB on OpenShift with SCC

Deploy MariaDB on OpenShift with proper Security Context Constraints. Configure anyuid SCC, persistent storage, custom my.

⏱ 20 minutes openshiftopenshift-4.20eus

OpenShift 4.20: New Features and Upgrade Guide

OpenShift 4.20 (EUS) new features, Kubernetes 1.33 alignment, the upgrade path from 4.18, and what administrators need to know before upgrading.

⏱ 20 minutes openshiftopenshift-4.21upgrade

OpenShift 4.21: New Features and Upgrade Guide

OpenShift 4.21 new features, K8s 1.34 alignment, upgrade from 4.20. Non-EUS release with latest innovations: in-place pod resize GA, DRA improvements.

⏱ 25 minutes machineconfigmcpopenshift

OpenShift MachineConfig and MCP Deep Dive

Master MachineConfig and MachineConfigPool on OpenShift. Configure kernel args, files, systemd units, and manage rolling node updates with MCP strategies.

⏱ 25 minutes sccopenshiftsecurity-context

OpenShift SCC: Security Context Constraints

Configure Security Context Constraints on OpenShift. Manage SCCs for pods requiring privileged access, host networking, custom UID/GID, and volume types.

⏱ 40 minutes pfcnmstatenncp

Configure PFC with NMState on Kubernetes

Enable Priority Flow Control (PFC) for lossless RDMA using NMState and NodeNetworkConfigurationPolicy. Configure DSCP-to-priority mapping, ECN, and RoCEv2 QoS.

⏱ 45 minutes tensor-parallelismdistributed-inferencemulti-node

Inter-Node Tensor Parallelism on Kubernetes

Split a single LLM across multiple physical servers using tensor parallelism. Configure vLLM, NIM, and Ray for inter-node TP with NCCL over RDMA or TCP.

⏱ 15 minutes kubectlkubeconfigcontext

kubectl Config: Manage Contexts and Clusters

Manage kubectl contexts with kubectl config commands. Switch clusters, delete contexts, rename entries, and merge multiple kubeconfig files safely.

⏱ 15 minutes imagepullsecretsprivate-registrydocker-registry

K8s imagePullSecrets: Private Registry Auth

Configure imagePullSecrets for pulling container images from private registries. Create docker-registry secrets, attach to pods and ServiceAccounts.

⏱ 15 minutes tritonvllminference

Triton Inference Server vs vLLM: Which to C...

Compare NVIDIA Triton Inference Server vs vLLM for LLM serving on Kubernetes. Performance, multi-model support, batching, GPU utilization.

⏱ 30 minutes ncclrdmainfiniband

Verify NCCL RDMA Traffic with Debug Logging

Prove NCCL uses RDMA for GPU communication on Kubernetes. Use NCCL_DEBUG and NCCL_DEBUG_SUBSYS=ALL to verify InfiniBand, RoCE.

⏱ 25 minutes cluster-apicapiaws

Cluster API on AWS: Provision EKS Clusters

Use Cluster API (CAPI) to provision and manage EKS clusters declaratively. Install clusterctl, configure CAPA provider, and automate cluster lifecycle on AWS.

⏱ 25 minutes cluster-apicapiclusterclass

ClusterClass: Reusable Cluster Templates in...

Define reusable ClusterClass templates in Cluster API for consistent multi-cluster provisioning. Variables, patches, and topology-based cluster creation.

⏱ 25 minutes cluster-apicapivsphere

Cluster API on vSphere: On-Prem K8s Clusters

Provision on-premises Kubernetes clusters on vSphere using Cluster API (CAPV). VM templates, control plane HA, node scaling, and day-2 operations.

⏱ 25 minutes attestationconfidential-computingzero-trust

Hardware Attestation for Kubernetes Workloads

Implement remote attestation for Kubernetes workloads. Verify TEE integrity with attestation services, release secrets to verified enclaves.

⏱ 25 minutes confidential-containerskata-containerstee

Confidential Containers with Kata

Deploy confidential containers using Kata Containers and TEEs on Kubernetes. Hardware attestation, encrypted container images.

⏱ 15 minutes cvecsismb

CVE-2026-3865: CSI SMB Driver Path Traversa...

Fix CVE-2026-3865 Kubernetes CSI SMB driver path traversal vulnerability. Upgrade to v1.20.1, detect malicious PersistentVolumes.

⏱ 20 minutes alertmanagerroutingsilences

Alertmanager Routing, Grouping, and Silences

Configure Alertmanager routing trees, receiver integrations, inhibition rules, silences, and alert grouping for production Kubernetes monitoring stacks.

⏱ 20 minutes slislogolden-signals

K8s Golden Signals: SLI and SLO Monitoring

Implement Google SRE golden signals on Kubernetes. Define SLIs, set SLO targets, configure error budgets, and build SLO dashboards with Prometheus and Sloth.

⏱ 20 minutes gvisorruntimeclasssandbox

gVisor RuntimeClass on K8s: Sandbox Pods

Deploy gVisor sandbox containers on Kubernetes using RuntimeClass. Install runsc, configure containerd, and isolate untrusted workloads with application-le.

⏱ 20 minutes lokiloggingpromtail

Kubernetes Log Aggregation with Grafana Loki

Aggregate Kubernetes logs with Grafana Loki and Promtail. Install Loki stack, LogQL queries, label-based filtering, and Grafana log exploration dashboards.

⏱ 15 minutes metrics-servermonitoringkubectl-top

K8s Metrics Server: Install and Configure

Install and configure Kubernetes Metrics Server for kubectl top, HPA autoscaling, and resource monitoring. Troubleshoot common metrics-server errors and TL.

⏱ 20 minutes ciliumhubblenetwork-observability

Network Observability with Cilium Hubble

Monitor Kubernetes network traffic with Cilium Hubble. Service maps, DNS visibility, HTTP flow logs, network policy auditing, and Hubble UI dashboards.

⏱ 20 minutes grafanaprometheusresource-monitoring

K8s Pod Resource Monitoring with Grafana

Monitor Kubernetes pod CPU and memory with Grafana dashboards. Prometheus queries for resource usage, request vs limit tracking.

⏱ 20 minutes ncclinfinibandrdma

NCCL_IB_DISABLE Environment Variable

NCCL_IB_DISABLE environment variable explained. Set NCCL_IB_DISABLE=1 for Ethernet-only clusters, debug InfiniBand errors, and tune GPU communication.

⏱ 30 minutes vllmascendnpu

vLLM on Huawei Ascend NPU: K8s Deployment

Deploy vLLM inference on Huawei Ascend NPUs in Kubernetes. Atlas 300I/910B device plugin, vllm-ascend container image, tensor parallelism, and model serving.

⏱ 20 minutes vllmopenai-apiinference

Deploy vLLM OpenAI Container on Kubernetes

Deploy the vLLM OpenAI-compatible server container on Kubernetes. Pull ghcr.io/vllm-project/vllm-openai, configure GPU resources, model loading.

⏱ 20 minutes ai-nativedevelopment-platformscopilot

AI-Native Development Platforms on Kubernetes

Build AI-native development platforms on Kubernetes. AI coding agents, automated testing, Copilot infrastructure, dev containers, and AI-driven CI/CD pipelines.

⏱ 25 minutes agentic-aimulti-agentlangchain

Agentic AI and Multi-Agent Systems

Deploy autonomous AI agents and multi-agent orchestration on Kubernetes. LangGraph, CrewAI, AutoGen, tool-calling agents, agent-to-agent communication.

⏱ 20 minutes cost-optimizationgpu-sharingspot-instances

AI Infrastructure Cost Optimization

Optimize AI infrastructure costs on Kubernetes. GPU sharing, spot instances, inference batching, model quantization, token economics.

⏱ 20 minutes watermarkingsynthidai-generated-content

AI Content Watermarking on Kubernetes

Deploy AI-generated content watermarking on Kubernetes. Invisible watermarks, SynthID integration, detection APIs, image and text watermarking pipelines.

⏱ 25 minutes ai-securityllm-securityprompt-injection

AI Security Platforms on Kubernetes

Secure AI workloads on Kubernetes. Model supply chain security, prompt injection defense, LLM output filtering, AI RBAC, GPU isolation.

⏱ 30 minutes supercomputinggpu-clustersnvidia-dgx

AI Supercomputing on Kubernetes GPU Clusters

Build AI supercomputing platforms on Kubernetes. Multi-node GPU training, NVIDIA DGX SuperPOD, InfiniBand RDMA, NCCL tuning, Blackwell clusters.

⏱ 25 minutes industrial-aidigital-twiniot

Autonomous Industrial Systems on Kubernetes

Orchestrate autonomous factories and logistics with Kubernetes. Digital twins, robot fleet coordination, industrial IoT pipelines, predictive maintenance.

⏱ 25 minutes ciliumservice-meshebpf

Cilium Service Mesh: eBPF-Powered Kubernetes

Deploy Cilium service mesh on Kubernetes with eBPF. Sidecar-free mTLS, L7 traffic management, network policies, Hubble observability, and Gateway API support.

⏱ 25 minutes confidential-computingsgxsev-snp

Confidential Computing: SGX and SEV-SNP

Deploy confidential containers on Kubernetes with Intel SGX and AMD SEV-SNP. Encrypted memory, attestation, confidential VMs, Kata Containers.

⏱ 30 minutes crossplaneinfrastructure-as-codecloud-providers

Crossplane K8s Infrastructure Management

Manage cloud infrastructure from Kubernetes with Crossplane. Providers, Compositions, Claims, XRDs, and GitOps-driven infrastructure as code for AWS, GCP.

⏱ 20 minutes data-monetizationdata-meshdata-marketplace

Data Monetization Platforms on Kubernetes

Build data monetization platforms on Kubernetes. Data marketplace APIs, usage-based billing, data mesh architecture, secure data sharing, and catalog services.

⏱ 20 minutes data-sovereigntygeopatriationgdpr

Data Sovereignty and Geopatriation

Implement data sovereignty and geopatriation on Kubernetes. Multi-region clusters, data residency policies, sovereign cloud, GDPR compliance.

⏱ 20 minutes digital-provenancec2pacontent-authenticity

Digital Provenance and Content Authenticity

Implement digital provenance on Kubernetes with C2PA content credentials. Verify AI-generated content, sign media pipelines.

⏱ 25 minutes domain-specific-llmfine-tuninglora

Domain-Specific Language Models on Kubernetes

Deploy and fine-tune domain-specific LLMs on Kubernetes. Legal, healthcare, finance, and code models with LoRA fine-tuning, NIM serving, and RAG pipelines.

⏱ 20 minutes fluxargocdgitops

Flux vs ArgoCD: Kubernetes GitOps Compared

Compare Flux and ArgoCD for Kubernetes GitOps. Architecture, multi-tenancy, Helm support, UI, scalability, and when to choose each for production GitOps de.

⏱ 25 minutes gitopsai-workloadsargocd

GitOps for AI Workloads on Kubernetes

Deploy AI models with GitOps on Kubernetes. Version ML models in Git, ArgoCD for model rollouts, Flux for GPU cluster sync.

⏱ 10 minutes grafanadashboard-6417node-exporter

Grafana Dashboard 6417: Node Exporter Setup

Import Grafana Dashboard 6417 for Kubernetes pod monitoring. Node Exporter Full setup with Prometheus, CPU, memory, disk, and network metrics.

⏱ 10 minutes helmsprigtemplating

Helm Sprig add1 trim merge Functions

The Helm Sprig add1 function increments integers in templates. Also covers trim for whitespace removal and merge for combining dictionaries in Helm charts.

⏱ 10 minutes helmsprigtemplating

Helm Sprig print quote default Functions

Helm Sprig print function concatenates without spaces, quote wraps in double quotes, default provides fallback values. Template examples and patterns.

⏱ 20 minutes kedahpaevent-driven

KEDA vs HPA: Event-Driven Autoscaling Expla...

Compare KEDA and HPA for Kubernetes autoscaling. Scale on Kafka lag, Prometheus metrics, queue depth, cron, and custom events. KEDA vs HPA decision guide.

⏱ 25 minutes kubernetes-upgradedeprecated-apismigration

Kubernetes 1.35 and 1.36 Upgrade Checklist

Kubernetes 1.35 and 1.36 upgrade checklist with deprecated APIs, removed features, new GA capabilities, and step-by-step migration guide for production clu.

⏱ 25 minutes ai-gatewaygateway-apiinference

K8s AI Gateway: Inference Extension Guide

Use the Kubernetes AI Gateway and Inference Extension to route LLM traffic. Model-aware routing, load balancing across inference backends.

⏱ 10 minutes configmaphot-reloadconfiguration

K8s ConfigMap Hot Reload Without Restart

Reload Kubernetes ConfigMaps without pod restarts. Volume-mounted auto-update, Reloader controller, checksum annotations.

⏱ 10 minutes cronjobconcurrencyscheduling

Kubernetes CronJob concurrencyPolicy Explained

Configure Kubernetes CronJob concurrencyPolicy: Allow, Forbid, and Replace. Control overlapping job execution, prevent duplicate runs, and handle slow jobs.

⏱ 10 minutes dnsdnspolicycoredns

Kubernetes dnsPolicy and dnsConfig Explained

Configure Kubernetes dnsPolicy: ClusterFirst, Default, None, ClusterFirstWithHostNet. Custom dnsConfig with nameservers, searches, and ndots options.

⏱ 25 minutes dragpu-schedulingresource-allocation

Dynamic Resource Allocation for GPUs

Use Kubernetes Dynamic Resource Allocation to schedule GPUs. DRA ResourceClaims, partitionable devices, GPU sharing, and structured parameters for accelerators.

⏱ 10 minutes finalizersdeletioncontrollers

K8s Finalizers: Prevent Premature Deletion

How Kubernetes finalizers work to prevent premature resource deletion. Add, remove, and troubleshoot stuck finalizers on PVCs, namespaces, and custom resources.

⏱ 10 minutes fsgroupchangepolicysecurity-contextpersistent-volumes

K8s fsGroupChangePolicy: Fix Slow Mounts

Configure fsGroupChangePolicy OnRootMismatch to skip recursive chown on volume mounts. Fix slow pod starts caused by large persistent volumes with millions.

⏱ 10 minutes jobsbatch-processingparallelism

Kubernetes Job Completions and Parallelism

Configure Kubernetes Job completions, parallelism, backoffLimit, and indexed jobs. Parallel batch processing, work queue patterns, and job failure handling.

⏱ 20 minutes sidecarinit-containersservice-mesh

Native Sidecar Containers in K8s: Complete ...

Use native sidecar containers in Kubernetes v1.33+. InitContainer restartPolicy Always, lifecycle ordering, logging sidecars, service mesh.

⏱ 15 minutes networkpolicydefault-denynetwork-security

Kubernetes NetworkPolicy Default Deny Examples

Create Kubernetes NetworkPolicy default deny rules for ingress and egress. Block all traffic, allow specific pods, DNS exceptions, and namespace isolation.

⏱ 10 minutes prioritypreemptionscheduling

Kubernetes Pod Priority and Preemption Guide

Configure Kubernetes PriorityClasses for pod scheduling priority. Preemption, system-critical pods, resource guarantee hierarchy, and non-preempting priority.

⏱ 15 minutes topology-spreadschedulinghigh-availability

Kubernetes topologySpreadConstraints Guide

Configure pod topology spread constraints for even distribution across zones, nodes, and racks. maxSkew, topologyKey, whenUnsatisfiable.

⏱ 10 minutes poddisruptionbudgetpdbavailability

Kubernetes PodDisruptionBudget (PDB) Guide

Configure PodDisruptionBudgets to protect workloads during node drains, upgrades, and maintenance. minAvailable, maxUnavailable, and eviction policies.

⏱ 10 minutes resource-limitscpumemory

Kubernetes Resource Limits CPU Memory Format

Kubernetes container resource limits and requests syntax. CPU units (200m, 500m, 1), memory units (256Mi, 1Gi), QoS classes, and YAML format examples.

⏱ 15 minutes rolling-updatezero-downtimedeployment-strategy

Kubernetes Rolling Update Zero Downtime Guide

Configure Kubernetes rolling updates for zero-downtime deployments. maxSurge, maxUnavailable, readiness probes, preStop hooks, and graceful shutdown strategies.

⏱ 10 minutes servicesclusteripnodeport

Kubernetes Service Types Comparison

Compare Kubernetes Service types: ClusterIP for internal access, NodePort for direct port exposure, LoadBalancer for external traffic.

⏱ 10 minutes startup-probehealth-checksliveness

Kubernetes Startup Probes for Slow Containers

Configure Kubernetes startup probes for containers with long initialization. Separate startup from liveness checks, failureThreshold tuning.

⏱ 25 minutes kueuebatch-jobsgpu-scheduling

Kueue for Batch Jobs and GPU Queues

Use Kueue to manage batch job queues on Kubernetes. GPU quota, fair sharing, priority queues, ML training workloads, and multi-tenant cluster scheduling.

⏱ 15 minutes llamamodel-sizinggpu-requirements

Llama 2 70B FP16 Model Size 140GB Guide

Llama 2 70B FP16 model size is 140GB. Complete GPU memory requirements for FP16, FP8, INT4 quantization, and multi-GPU tensor parallelism on Kubernetes.

⏱ 15 minutes ncclgpu-trainingdistributed-training

NCCL_SOCKET_IFNAME Environment Variable Guide

Configure NCCL_SOCKET_IFNAME for multi-node GPU training on Kubernetes. Network interface selection, bonding, InfiniBand, and troubleshooting NCCL timeouts.

⏱ 15 minutes openshiftlifecyclesupport-matrix

OpenShift Support Lifecycle: Versions, EOL,...

OpenShift lifecycle: version support matrix, EOL dates for OCP 4.14-4.18, EUS upgrade paths, and end-of-life schedule. Updated for 2026.

⏱ 25 minutes openshiftupgradeseus

OpenShift Upgrade Planning for 2026

Plan OpenShift upgrades for 2026. EUS-to-EUS paths, operator compatibility, pre-upgrade checks, canary node pools, and rollback strategy for OCP 4.14 to 4.18.

⏱ 25 minutes physical-airoboticsros2

Physical AI and Robotics Orchestration

Orchestrate physical AI and robotics fleets with Kubernetes. ROS 2 on K8s, robot fleet management, edge-cloud hybrid, NVIDIA Isaac.

⏱ 30 minutes platform-engineeringdeveloper-experiencebackstage

Platform Engineering on K8s: Build an IDP

Build an internal developer platform on Kubernetes. Backstage, Crossplane, ArgoCD, self-service templates, golden paths.

⏱ 20 minutes post-quantumcryptographypqc

Post-Quantum Cryptography on Kubernetes

Prepare Kubernetes clusters for post-quantum cryptography. NIST PQC standards, hybrid TLS certificates, quantum-safe mTLS, Istio/Cilium integration.

⏱ 20 minutes preemptive-securitythreat-detectioncnapp

Preemptive Cybersecurity on Kubernetes

Implement preemptive cybersecurity on Kubernetes. Threat prediction, automated vulnerability patching, runtime behavior analysis, CNAPP.

⏱ 25 minutes quantum-computingqiskithybrid-workflows

Quantum Computing on K8s: Hybrid Workflows

Run quantum computing workloads on Kubernetes. Qiskit, Cirq, PennyLane hybrid classical-quantum pipelines, quantum job scheduling, and QPU integration patterns.

⏱ 30 minutes air-gappedsovereignoffline

Sovereign Air-Gapped Kubernetes Clusters

Deploy sovereign and air-gapped Kubernetes clusters. Offline installation, private registry mirrors, disconnected GitOps, sovereign cloud.

⏱ 20 minutes gpu-troubleshootingdevice-pluginnvidia

Troubleshooting Pods with GPU Devices

Fix GPU device issues in Kubernetes pods. Troubleshoot device plugin errors, DRA claims, CUDA failures, driver mismatches.

⏱ 30 minutes run-aitopology-awaregang-scheduling

Run:ai Topology-Aware Scheduling Deep Dive

Configure Run:ai topology-aware scheduling for distributed AI workloads. Multi-level hierarchies, required vs preferred placement, LeaderWorkerSet.

⏱ 25 minutes nvidia-nimmodel-profilesgpu

NIM Model Profiles and Selection on Kubernetes

Configure NIM_MODEL_PROFILE for NVIDIA NIM deployments on Kubernetes. List profiles, select by ID or name, tune VRAM, and override with vLLM CLI args.

⏱ 45 minutes nvidia-nimmulti-nodeleaderworkerset

NIM Multi-Node Deployment with Helm on K8s

Deploy NVIDIA NIM across multiple Kubernetes nodes using Helm, LeaderWorkerSet, Ray, and vLLM. Run Llama 405B and DeepSeek-R1 on 16+ GPUs.

⏱ 15 minutes nvidia-nimgpu-compatibilitysupport-matrix

NIM LLM Support Matrix and GPU Compatibility

Complete NVIDIA NIM support matrix for Kubernetes. Supported models, profiles, precision formats, GPU compatibility, and hardware requirements per model.

⏱ 45 minutes nvidia-dynamodistributed-inferencedisaggregated-serving

NVIDIA Dynamo Distributed Inference

Deploy NVIDIA Dynamo on Kubernetes for disaggregated LLM inference. KV-aware routing, prefill/decode splitting, Grove operator, and zero-config deployment.

⏱ 40 minutes nvidia-nimcustom-modelfine-tuning

Rebuild NIM with Custom Model on Kubernetes

Step-by-step guide to deploy custom, fine-tuned, or self-hosted models with NVIDIA NIM on Kubernetes. Model-free NIM from HuggingFace, S3, NGC, or local path.

⏱ 40 minutes nvidia-dynamorun-aigang-scheduling

Run:ai + Dynamo Multi-Node Scheduling on K8s

Deploy NVIDIA Dynamo with Run:ai v2.23 for gang scheduling and topology-aware placement. Atomic pod launches, zone co-location, and disaggregated inference.

⏱ 20 minutes nvidia-nimquay-registrycontainer-images

Copy NVIDIA NIM Images to Internal Quay Reg...

Pull NIM container images from nvcr.io and push to an internal Quay registry. Covers authentication, tagging, air-gapped workflows, and curl token issues.

⏱ 15 minutes cveingress-nginxsecurity

CVE-2026-4342: ingress-nginx Code Execution...

Patch CVE-2026-4342 in ingress-nginx — a CVSS 8.8 configuration injection vulnerability enabling arbitrary code execution. Upgrade to v1.13.9, v1.14.

⏱ 45 minutes nvidia-nimmultinodetensor-parallelism

Deploy Multinode NIM Models on Kubernetes

Run large language models across multiple GPU nodes with NVIDIA NIM. Tensor parallelism, NCCL, InfiniBand, and Kubernetes Job orchestration.

⏱ 40 minutes nvidia-runaidistributed-inferenceknative

Distributed Inference with Run:ai

Deploy distributed AI inference with NVIDIA Run:ai on Kubernetes. Single-node Knative, multinode LeaderWorkerSet, NIM, autoscaling, and observability.

⏱ 15 minutes k8s-iofiohammerdb

K8s-IO Benchmark CLI for fio and HammerDB

Run distributed fio and HammerDB storage benchmarks on Kubernetes with K8s-IO, a lightweight Go CLI tool that replaces heavy benchmark operators.

⏱ 35 minutes audit-loggingcompliancesoc2

K8s Audit Logging for Enterprise Compliance

Configure API server audit logging for SOC2, HIPAA, and PCI-DSS compliance. Structured audit policies, log shipping, and alerting on suspicious activity.

⏱ 35 minutes change-managementitilmaintenance-windows

K8s Change Mgmt for Enterprise Operations

Implement ITIL-aligned change management for Kubernetes with approval gates, maintenance windows, rollback procedures, and change audit trails.

⏱ 60 minutes disaster-recoveryveleroetcd-backup

Kubernetes Disaster Recovery for Enterprise

Kubernetes disaster recovery with Velero backup and restore. Cross-region replication, etcd snapshots, multi-cluster failover, and RTO/RPO strategies.

⏱ 35 minutes capacity-planningresource-optimizationcluster-sizing

K8s Capacity Planning for Enterprise Clusters

Right-size enterprise clusters with data-driven capacity planning. Forecast resource needs, optimize bin-packing, and plan for growth with metrics.

⏱ 45 minutes gitopsargocdfleet-management

Enterprise GitOps at Scale with Fleet Mgmt

Manage hundreds of Kubernetes clusters with ArgoCD ApplicationSets, Flux multi-cluster, and fleet-wide policy enforcement using GitOps principles.

⏱ 40 minutes image-governanceadmission-controllerscosign

Enterprise Container Image Governance

Enforce image policies with admission controllers. Require signed images, block public registries, and automate vulnerability scanning gates.

⏱ 40 minutes secret-rotationvaultexternal-secrets

Automated Secret Rotation on Kubernetes

Implement zero-downtime secret rotation with External Secrets Operator, HashiCorp Vault dynamic secrets, and rolling restarts for enterprise compliance.

⏱ 50 minutes istioservice-meshmtls

Enterprise Service Mesh mTLS & Observability

Deploy Istio service mesh for enterprise mTLS, traffic management, circuit breaking, and distributed tracing across microservices on Kubernetes.

⏱ 50 minutes multi-tenancynamespace-isolationresource-quotas

Kubernetes Multi-Tenancy for Enterprise Teams

Implement secure multi-tenancy with namespace isolation, ResourceQuotas, NetworkPolicies, hierarchical namespaces, and vCluster for strong isolation.

⏱ 45 minutes oidcenterprise-ssokeycloak

K8s OIDC Integration with Enterprise SSO

Configure Kubernetes API server OIDC authentication with Keycloak, Azure AD, or Okta for enterprise single sign-on and group-based RBAC.

⏱ 60 minutes nvidia-runainvidia-nimdistributed-inference

Run:ai NIM Distributed Inference Tutorial

Step-by-step guide to deploy DeepSeek-R1 distributed inference on Run:ai with LeaderWorkerSet, SGLang, PVC caching, and OpenShift security.

⏱ 15 minutes argo-workflowsci-cdpipeline

Argo Workflows on Kubernetes: CI/CD Guide

Run CI/CD pipelines and data workflows with Argo Workflows on Kubernetes. Create DAG-based workflows, parallel steps, artifact passing, and cron workflows.

⏱ 15 minutes fiostorage-benchmarkopenshift

Distributed fio Storage Benchmark K8s

Run distributed fio benchmarks on Kubernetes and OpenShift to test storage performance at scale. Covers fio-distributed with k8s Jobs, Red Hat dbench.

⏱ 15 minutes external-dnsdnsroute53

External DNS for Kubernetes: Setup Guide

Automate DNS record management with ExternalDNS for Kubernetes. Sync Service and Ingress hostnames to Route53, Cloudflare, Google Cloud DNS, and 30+ providers.

⏱ 15 minutes falcoruntime-securitythreat-detection

Falco Runtime Security for Kubernetes

Deploy Falco for Kubernetes runtime threat detection. Detect shell spawns in containers, privilege escalation, sensitive file access, and suspicious network

⏱ 15 minutes helmtestingci-cd

Helm Chart Testing & CI/CD Pipeline Integra...

Test Helm charts automatically with ct (chart-testing), helm unittest, and GitHub Actions. Validate templates, lint values.

⏱ 15 minutes helmhooksdatabase-migration

Helm Hooks Database Migrations & Lifecycle ...

Use Helm hooks to run database migrations, backups, and validation jobs during install, upgrade, and rollback. Control execution order with hook weights an.

⏱ 15 minutes helmlibrary-charttemplates

Helm Library Charts for Reusable Templates

Create Helm library charts to share common templates across multiple charts. DRY up deployments, services, and config patterns with reusable library functions.

⏱ 15 minutes helmociregistry

Helm OCI Registry for Chart Distribution

Store and distribute Helm charts using OCI registries like GHCR, ECR, ACR, and Harbor. Migrate from ChartMuseum to OCI-native chart management.

⏱ 15 minutes helmsecretssops

Helm Secrets Mgmt with SOPS & Age Encryption

Encrypt Helm values files using SOPS with Age or GPG keys. Manage secrets in Git safely with helm-secrets plugin for transparent encrypt/decrypt workflows.

⏱ 15 minutes fioopenshiftstorage-benchmark

OpenShift Storage Benchmark fio Config Prof...

Benchmark OpenShift and Kubernetes storage using fio with reusable YAML config profiles for random and sequential read/write I/O patterns.

⏱ 15 minutes karpenterawseks

Karpenter Node Autoscaling for K8s on AWS

Deploy Karpenter for fast, flexible node autoscaling on AWS EKS. Configure NodePools, EC2NodeClasses, and consolidation for real cost savings.

⏱ 15 minutes kubeflowmlopsoperator

Kubeflow Operator: Full ML Platform

Deploy the complete Kubeflow platform on Kubernetes with the Kubeflow Operator. Covers Pipelines, Notebooks, KServe, Katib, and multi-tenant ML workflows.

⏱ 15 minutes affinityanti-affinityscheduling

Kubernetes Affinity and Anti-Affinity Guide

Schedule pods with Kubernetes node affinity, pod affinity, and anti-affinity rules. Spread across zones, co-locate related services, and optimize

⏱ 15 minutes cluster-autoscalernode-scalinggpu

Advanced Cluster Autoscaler Config & Tuning

Fine-tune the Kubernetes Cluster Autoscaler with expanders, priority-based scaling, mixed instance policies, and GPU node pool autoscaling for production c.

⏱ 15 minutes clusteripserviceinternal

Kubernetes ClusterIP Service Explained

Understand Kubernetes ClusterIP services for internal communication. How kube-proxy routes traffic, DNS resolution, and when ClusterIP is the right service

⏱ 15 minutes kubectlcommandsreference

Essential Kubernetes Commands Reference

Master the most used Kubernetes commands for daily operations. Complete kubectl reference for pods, deployments, services, debugging, and cluster management.

⏱ 15 minutes configmapconfigurationenvironment-variables

ConfigMap Patterns in Kubernetes

Create and use Kubernetes ConfigMaps for application configuration. Mount as files, inject as environment variables, and manage config updates without

⏱ 15 minutes cronjobschedulingcron

Kubernetes CronJob Scheduling Guide

Schedule recurring tasks with Kubernetes CronJobs. Covers cron syntax, timezone support, concurrency policies, job history, manual triggers, and monitoring.

⏱ 15 minutes daemonsetper-nodemonitoring

Kubernetes DaemonSet Complete Guide

Deploy DaemonSets in Kubernetes to run one pod per node. Covers monitoring agents, log collectors, CNI plugins, node affinity, and rolling update strategies.

⏱ 15 minutes dnscorednsservice-discovery

Kubernetes DNS and CoreDNS Guide

Understand Kubernetes DNS resolution with CoreDNS. Debug DNS issues, configure custom DNS, and optimize DNS performance for large clusters.

⏱ 15 minutes ingressnginx-ingresstls

Kubernetes Ingress Complete Guide

Configure Kubernetes Ingress for HTTP routing, TLS termination, and path-based routing. Covers NGINX Ingress Controller, cert-manager, and Ingress vs Gateway

⏱ 15 minutes jobcronjobbatch

Kubernetes Jobs and CronJobs Guide

Run batch workloads with Kubernetes Jobs and CronJobs. Covers one-shot tasks, parallel processing, scheduled jobs, failure handling, and cleanup policies.

⏱ 15 minutes labelsselectorsorganization

Kubernetes Labels and Selectors Explained

Use Kubernetes labels and selectors to organize and query resources. Covers label conventions, selector types, recommended labels, and label-based operations.

⏱ 15 minutes loadbalancerserviceexternal-access

Kubernetes LoadBalancer Service Guide

Expose Kubernetes services with LoadBalancer type for production traffic. Covers cloud providers, MetalLB for bare-metal, health checks, and cost optimization.

⏱ 15 minutes nodeportserviceexternal-access

Kubernetes NodePort Service Explained

Expose Kubernetes services externally with NodePort. Understand port ranges, security implications, and when to use NodePort vs LoadBalancer vs Ingress.

⏱ 15 minutes persistent-volumepvstorage

Persistent Volume NFS iSCSI Guide

Master Kubernetes PersistentVolumes: static and dynamic provisioning, reclaim policies, volume modes, and lifecycle. From PV creation to pod mounting and data

⏱ 15 minutes pod-lifecyclephaseshooks

Kubernetes Pod Lifecycle Explained

Understand the Kubernetes pod lifecycle from creation to termination. Covers pod phases, container states, init containers, hooks, and graceful shutdown

⏱ 15 minutes pvcpersistent-volumestorage

PVC Storage Provisioning in Kubernetes

Create and manage Kubernetes PersistentVolumeClaims and PersistentVolumes. Covers dynamic provisioning, StorageClasses, access modes, volume

⏱ 15 minutes rolling-updatedeployment-strategyzero-downtime

Kubernetes Rolling Update Strategy Guide

Configure Kubernetes rolling update strategy for zero-downtime deployments. Tune maxSurge, maxUnavailable, minReadySeconds, and rollback procedures.

⏱ 15 minutes secretsencryptionsecurity

Secrets Encryption Rotation K8s Guide

Manage Kubernetes Secrets for passwords, tokens, and certificates. Covers creation, encryption at rest, external secret operators, and security best practices.

⏱ 15 minutes service-typesclusteripnodeport

Kubernetes Service Types Explained

Compare all Kubernetes service types: ClusterIP, NodePort, LoadBalancer, ExternalName, and headless. Choose the right type for internal, external, and hybrid

Taints and Tolerations in Kubernetes

Control pod scheduling with Kubernetes taints and tolerations. Dedicate nodes for specific workloads, prevent scheduling on control plane nodes, and handle GPU

⏱ 15 minutes kubevirtvirtual-machinesvm

KubeVirt: Run VMs on Kubernetes

Run virtual machines alongside containers on Kubernetes with KubeVirt. Covers VM creation, live migration, GPU passthrough, and VM-to-container networking.

⏱ 15 minutes tektonci-cdpipeline

Tekton Pipelines on Kubernetes

Build cloud-native CI/CD pipelines with Tekton on Kubernetes. Create reusable Tasks, Pipelines, triggers, and integrate with Git webhooks for automated builds.

⏱ 15 minutes wasmspinkubespin

WebAssembly Runtime with Spin and SpinKube

Deploy WebAssembly workloads on Kubernetes using SpinKube and the Spin Operator. Run Wasm components alongside containers with sub-millisecond cold starts.

⏱ 15 minutes wasmwasicontainerd

WASI and containerd Wasm Shims on Kubernetes

Run WebAssembly workloads using containerd Wasm shims with WASI support on Kubernetes. Configure runwasi, wasmtime, and WasmEdge as container runtimes.

⏱ 15 minutes wasmserverlesskeda

Serverless Functions with WebAssembly

Build serverless functions using WebAssembly on Kubernetes with Fermyon Cloud, KEDA, and SpinKube. Scale to zero and back with sub-millisecond Wasm cold starts.

⏱ 15 minutes cluster-autoscalernode-scalingcloud

Kubernetes Cluster Autoscaler Setup Guide

Configure the Cluster Autoscaler to automatically add and remove nodes based on pod scheduling demands. Covers AWS, GKE, Azure, and bare-metal setups.

⏱ 15 minutes kedaevent-drivenautoscaling

KEDA: Event-Driven Autoscaling for Kubernetes

Scale Kubernetes workloads with KEDA based on external events: queue depth, cron schedules, Prometheus metrics, HTTP traffic, and 60+ event sources.

⏱ 15 minutes alertingprometheusalertmanager

Kubernetes Alerting Best Practices

Design effective Kubernetes alerts that reduce noise and catch real issues. Covers severity tiers, golden signals, runbook links, and fatigue prevention.

⏱ 15 minutes blue-greendeployment-strategyzero-downtime

Blue-Green Deployment in Kubernetes

Implement blue-green deployments in Kubernetes for instant rollback. Covers Service selector switching, Argo Rollouts blue-green, and comparison with canary

⏱ 15 minutes canarydeployment-strategyprogressive-delivery

Canary Deployment in Kubernetes

Implement canary deployments in Kubernetes to gradually roll out changes. Covers native K8s, Argo Rollouts, Istio traffic splitting, and automated analysis.

⏱ 15 minutes cordondrainnode-maintenance

Kubernetes Cordon, Drain, and Uncordon Nodes

Safely manage Kubernetes nodes with cordon, drain, and uncordon. Prepare nodes for maintenance, upgrades, and decommissioning without disrupting workloads.

⏱ 15 minutes kubecostcost-monitoringfinops

Kubernetes Cost Monitoring with Kubecost

Monitor and optimize Kubernetes costs with Kubecost. Track per-namespace and per-deployment spend with cloud billing integration and savings tips.

⏱ 15 minutes custom-metricsprometheus-adapterhpa

Custom Metrics Autoscaling in Kubernetes

Scale Kubernetes pods on custom application metrics with Prometheus Adapter. Configure HPA with custom and external metrics beyond CPU and memory.

⏱ 15 minutes debugkubectl-debugephemeral-containers

Debug Kubernetes Pods: Complete Guide

Debug Kubernetes pods with kubectl debug, ephemeral containers, and netshoot. Troubleshoot distroless images, network issues, and crashed pods step by step.

⏱ 15 minutes endpointslicesendpointsservice-discovery

Kubernetes EndpointSlices Explained

Understand Kubernetes EndpointSlices for scalable service endpoint management. How they improve on Endpoints objects for large clusters with thousands of pods.

⏱ 15 minutes graceful-shutdownsigtermprestop

Graceful Shutdown Pod Termination

Implement graceful shutdown in Kubernetes pods. Handle SIGTERM, drain connections, use preStop hooks, and configure terminationGracePeriodSeconds correctly.

⏱ 15 minutes headless-servicestatefulsetdns

Kubernetes Headless Service Explained

Create Kubernetes headless services for StatefulSet DNS, direct pod addressing, and service discovery. Understand when clusterIP None is the right choice.

⏱ 15 minutes health-checksprobesliveness

Kubernetes Health Checks Best Practices

Design effective Kubernetes health checks with liveness, readiness, and startup probes. Avoid common anti-patterns like database checks in liveness probes.

⏱ 15 minutes init-containersstartupmigrations

Kubernetes Init Containers Guide

Use Kubernetes init containers to run setup tasks before your main application starts. Covers database migrations, config generation, dependency

⏱ 15 minutes limitrangeresourcequotaresource-management

Kubernetes LimitRange and ResourceQuota

Configure LimitRange and ResourceQuota in Kubernetes namespaces. Set default resource requests, enforce limits, and prevent resource exhaustion across teams.

⏱ 15 minutes rookcephdistributed-storage

Rook-Ceph: Distributed Storage for Kubernetes

Deploy Rook-Ceph on Kubernetes for distributed block, file, and object storage. Covers installation, CephCluster configuration, StorageClasses, and monitoring.

⏱ 15 minutes service-accountrbactokens

Kubernetes Service Accounts Guide

Create and manage Kubernetes service accounts for pod identity. Covers RBAC binding, token projection, workload identity, and least-privilege access

⏱ 15 minutes sidecarmulti-containerlogging

Kubernetes Sidecar Containers Pattern

Implement the sidecar pattern in Kubernetes for logging, proxying, syncing, and monitoring alongside your main application container. Covers native K8s 1.28+

⏱ 15 minutes storagebest-practicesproduction

K8s Storage Best Practices for Production

Production storage best practices for Kubernetes. StorageClass selection, backup strategies, volume expansion, data migration, and performance tuning.

⏱ 15 minutes troubleshootingdebuggingflowchart

Kubernetes Troubleshooting Flowchart

Systematic Kubernetes troubleshooting guide with flowcharts. Debug pods, services, networking, storage, and node issues step by step with kubectl commands.

⏱ 15 minutes zero-downtimerolling-updategraceful-shutdown

Zero-Downtime Deployment in Kubernetes

Achieve zero-downtime deployments in Kubernetes. Covers readiness probes, PDBs, preStop hooks, rolling update tuning, and connection draining best practices.

⏱ 15 minutes virtual-kubeletserverlessburst-scaling

Virtual Kubelet for Serverless K8s Scaling

Deploy Virtual Kubelet to burst Kubernetes workloads to serverless backends like Azure ACI, AWS Fargate, and HashiCorp Nomad for on-demand capacity beyond the cluster's own nodes.

⏱ 15 minutes deploymentstatefulsetcomparison

Deployment vs StatefulSet in Kubernetes

Choose between Deployment and StatefulSet for your Kubernetes workloads. Compare identity, storage, ordering, scaling, and use cases for each controller.

⏱ 15 minutes affinityanti-affinityscheduling

Kubernetes Node and Pod Affinity Guide

Configure node affinity, pod affinity, and anti-affinity rules for advanced Kubernetes scheduling. Control pod placement across zones, nodes, and topologies.

⏱ 15 minutes annotationsmetadataingress

Kubernetes Annotations Complete Guide

Use Kubernetes annotations for metadata, automation, and controller config. Common patterns for ingress annotations, Helm labels, and triggers.

⏱ 15 minutes backuprestorevelero

Kubernetes Backup and Restore with Velero

Backup and restore Kubernetes clusters with Velero. Covers namespace backups, scheduled backups, disaster recovery, and migration between clusters.

⏱ 15 minutes ci-cd github-actions pipeline

Kubernetes CI/CD Pipeline with GitHub Actions

Build a complete CI/CD pipeline for Kubernetes with GitHub Actions. Covers Docker build, image push, Helm deploy, and automated rollback on failure.

⏱ 15 minutes upgrade kubeadm cluster-management

Kubernetes Cluster Upgrade Step-by-Step

Upgrade Kubernetes clusters safely with kubeadm. Covers pre-flight checks, control plane upgrade, worker node drain, and rollback procedures.

⏱ 15 minutes deployment replicas rolling-update

Kubernetes Deployment Complete Guide

Create and manage Kubernetes Deployments for stateless applications. Covers replicas, selectors, rolling updates, rollback, and deployment strategies.

⏱ 15 minutes dns coredns service-discovery

Kubernetes DNS: How Service Discovery Works

Understand Kubernetes DNS resolution with CoreDNS. Service discovery, pod DNS, headless services, custom DNS policies, and troubleshooting DNS failures.

⏱ 15 minutes emptydir volumes temporary-storage

Kubernetes emptyDir Volume Explained

Use emptyDir volumes in Kubernetes for temporary storage, shared data between containers, and cache. Covers medium types, size limits, and tmpfs backing.

⏱ 15 minutes environment-variables env configmap

Kubernetes Environment Variables Guide

Set Kubernetes environment variables with envFrom, configMapRef, secretKeyRef, and the Downward API. Variable ordering, fieldRef, and best practices.

⏱ 15 minutes kubectl-exec debugging shell

kubectl exec: Run Commands Inside K8s Pods

Use kubectl exec to run commands inside Kubernetes pods. Covers interactive sessions, multi-container pods, and ephemeral container debugging.

⏱ 15 minutes helm kustomize comparison

Helm vs Kustomize: Which to Use

Compare Helm and Kustomize for Kubernetes configuration management. Covers templating vs overlays, use cases, pros and cons, and when to use both together.

⏱ 15 minutes imagepullbackoff troubleshooting registry

Fix ImagePullBackOff in Kubernetes

Debug and fix ImagePullBackOff errors in Kubernetes. Covers wrong image names, private registry auth, rate limits, and network connectivity issues.

⏱ 15 minutes ingress routing tls

K8s Ingress: Routing, TLS, and Controllers

Configure Kubernetes Ingress for HTTP routing, TLS termination, and path-based routing. Covers NGINX, Traefik, and HAProxy ingress controllers.

⏱ 15 minutes labels selectors organization

Kubernetes Labels and Selectors Guide

Master Kubernetes labels and selectors for organizing and querying resources. Label conventions, equality selectors, set-based selectors, and field selectors.

⏱ 15 minutes load-balancing service ingress

Kubernetes Load Balancing Strategies

Configure Kubernetes load balancing with Services, Ingress, and Gateway API. Round-robin, session affinity, weighted routing, and traffic policy.

⏱ 15 minutes minikube kind k3d

K8s Local Development with Minikube and Kind

Set up local Kubernetes clusters for development with Minikube, Kind, and k3d. Covers installation, configuration, local registries, and hot-reload workflows.

⏱ 15 minutes logging elasticsearch fluentd

EFK Stack: Kubernetes Centralized Logging

Deploy EFK stack for Kubernetes centralized logging. Elasticsearch, Fluentd, Kibana setup, log collection, parsing, and retention policies.

⏱ 15 minutes monitoring prometheus grafana

K8s Monitoring with Prometheus and Grafana

Set up Kubernetes monitoring with Prometheus and Grafana. Covers kube-prometheus-stack, custom dashboards, alerting rules, and key metrics to monitor.

⏱ 15 minutes multi-tenancy namespaces isolation

Kubernetes Multi-Tenancy Patterns

Implement multi-tenancy in Kubernetes with namespaces, RBAC, quotas, network policies, and virtual clusters. Covers soft and hard tenancy models.

⏱ 15 minutes security-checklist hardening production

Kubernetes Security Checklist for Production

Production security checklist for Kubernetes clusters. Covers RBAC, network policies, pod security, secrets encryption, audit logging, and image scanning.

⏱ 15 minutes oomkilled memory out-of-memory

Debug and Fix OOMKilled Errors in Kubernetes

Debug and fix OOMKilled errors in Kubernetes. Find memory leaks, set correct limits, use VPA for right-sizing, and prevent container OOM kills.

⏱ 15 minutes operator crd custom-resource

Kubernetes Operator Pattern Explained

Build and use Kubernetes Operators for automated application management. Covers the operator pattern, CRDs, controller-runtime, and Operator SDK.

⏱ 15 minutes eviction resource-pressure priority-class

Kubernetes Pod Eviction: Causes and Prevention

Understand why Kubernetes evicts pods and how to prevent it. Covers resource pressure, priority classes, PDBs, and eviction policies.

⏱ 15 minutes pod-lifecycle phases graceful-shutdown

Kubernetes Pod Lifecycle and States Explained

Understand the Kubernetes pod lifecycle from Pending to Succeeded or Failed. Covers pod phases, container states, restart policies, graceful shutdown, and preStop hooks.

⏱ 15 minutes port-forward kubectl debugging

kubectl Port-Forward: Access Pods and Services

Use kubectl port-forward to access Kubernetes pods, services, and deployments from your local machine. Debug, test, and access internal services securely.

⏱ 15 minutes rbac roles clusterrole

K8s RBAC: Roles, ClusterRoles, and Bindings

Configure Kubernetes RBAC with Roles, ClusterRoles, RoleBindings, and service accounts. Least privilege access control for users, groups, and applications.

⏱ 15 minutes replicaset replicas scaling

Kubernetes ReplicaSet Explained

Understand ReplicaSets in Kubernetes for maintaining pod replicas. Covers selectors, scaling, ownership, and why you should use Deployments instead.

⏱ 15 minutes resources requests limits

Kubernetes Resource Requests and Limits Guide

Configure CPU and memory requests and limits in Kubernetes. Understand QoS classes, OOMKilled, CPU throttling, and right-sizing with VPA recommendations.

⏱ 15 minutes secrets security encryption

Kubernetes Secrets: Create, Use, and Secure

Create and manage Kubernetes Secrets for sensitive data. Covers types, encoding, mounting, external secrets operators, and encryption at rest best practices.

Kubernetes Taints and Tolerations Guide

Use Kubernetes taints and tolerations to control pod scheduling. Dedicate nodes for GPU workloads, isolate teams, and prevent scheduling on specific nodes.

⏱ 15 minutes volumes emptydir hostpath

Kubernetes Volume Types Explained

Compare all Kubernetes volume types: emptyDir, hostPath, PVC, ConfigMap, Secret, NFS, CSI, and projected volumes. When to use each type with examples.

⏱ 15 minutes air-gapped disconnected podman

Air-Gapped Image Import for OpenShift Clusters

Import container images into disconnected OpenShift clusters. Use podman save/load and internal registries when DNS and TLS block external pulls.

⏱ 15 minutes api-server timeout connectivity

Fix API Server Timeout and Overload

Debug kubectl timeouts, API server overload, and connection refused errors. Covers etcd latency, webhook timeouts, and rate limiting.

⏱ 15 minutes backstage developer-portal idp

Backstage Developer Portal on Kubernetes

Deploy Spotify Backstage on Kubernetes as an internal developer portal. Covers Helm install, PostgreSQL backend, catalog entities, and TechDocs integration.

⏱ 15 minutes certificates tls expiry

Fix Kubernetes Certificate Expiry Issues

Debug and renew expired Kubernetes certificates for API server, kubelet, and etcd. Covers kubeadm cert renewal, OpenShift auto-rotation, and monitoring expiry.

⏱ 15 minutes cluster-api capi infrastructure

Cluster API for K8s Lifecycle Management

Manage Kubernetes cluster lifecycle with Cluster API. Declarative cluster creation, upgrades, scaling, and multi-cloud infrastructure provisioning as code.

⏱ 15 minutes confidential-computing sgx sev-snp

Confidential Computing on Kubernetes

Deploy confidential containers with encrypted memory using Intel SGX, AMD SEV-SNP, and Kata Containers. Protect data in use from even the cluster admin.

⏱ 15 minutes configmap hot-reload volumes

Fix ConfigMap Changes Not Applied to Pods

Debug ConfigMap updates not reflected in running pods. Covers volume mount propagation delays, env var immutability, and sidecar-based reload strategies.

⏱ 15 minutes coredns dns networking

Fix CoreDNS Resolution Failures in Kubernetes

Debug DNS resolution failures in Kubernetes pods. Covers CoreDNS crashes, NXDOMAIN errors, ndots configuration, and upstream DNS timeouts.

⏱ 15 minutes crashloopbackoff pods debugging

How to Fix CrashLoopBackOff in Kubernetes

Fix CrashLoopBackOff in Kubernetes with step-by-step troubleshooting. Debug OOMKilled, failed probes, missing configs, and image errors causing pod crash loops.

⏱ 15 minutes etcd performance api-server

Fix etcd High Latency and Slow API Server

Debug etcd performance issues causing slow kubectl responses and API server timeouts. Covers disk I/O, compaction, defragmentation, and leader elections.

⏱ 15 minutes fio libaio seccomp

Fix fio libaio Silent Exit on OpenShift cru...

Debug fio instantly exiting with no output on crun-based OpenShift nodes. The root cause is seccomp blocking libaio syscalls — fix with psync or unconfined.

⏱ 15 minutes helm chart-development templates

Helm Chart Development from Scratch

Build production-ready Helm charts with templates, values, helpers, hooks, tests, and CI validation. Complete guide from chart create to publishing.

⏱ 15 minutes helm upgrade rollback

Fix Helm Upgrade Failed and Rollback

Debug failed Helm releases stuck in pending-upgrade or failed state. Covers atomic upgrades, manual rollback, secret storage cleanup, and history limits.

⏱ 15 minutes imagepullbackoff registry pull-secret

ImagePullBackOff Troubleshooting Guide

Debug and resolve ImagePullBackOff errors including auth failures, wrong tags, private registry access, and rate limiting from Docker Hub and Quay.

⏱ 15 minutes ingress nginx 502

Fix Ingress 502 and 503 Gateway Errors

Debug 502 Bad Gateway and 503 Service Unavailable from Kubernetes ingress controllers. Fix backend health and timeout issues.

Install ArgoCD on AlmaLinux: Step-by-Step

Deploy ArgoCD on Kubernetes running on AlmaLinux. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on Amazon Linux

Deploy ArgoCD on Kubernetes running on Amazon Linux. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on Arch Linux: Step-by-Step

Deploy ArgoCD on Kubernetes running on Arch Linux. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on CentOS Stream

Deploy ArgoCD on Kubernetes running on CentOS Stream. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on Debian: Step-by-Step Guide

Deploy ArgoCD on Kubernetes running on Debian. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on Fedora: Step-by-Step Guide

Deploy ArgoCD on Kubernetes running on Fedora. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on openSUSE: Step-by-Step

Deploy ArgoCD on Kubernetes running on openSUSE. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on Oracle Linux

Deploy ArgoCD on Kubernetes running on Oracle Linux. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on RHEL: Step-by-Step Guide

Deploy ArgoCD on Kubernetes running on RHEL. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on Rocky Linux: Step-by-Step

Deploy ArgoCD on Kubernetes running on Rocky Linux. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on SUSE SLES: Step-by-Step

Deploy ArgoCD on Kubernetes running on SUSE SLES. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

Install ArgoCD on Ubuntu: Step-by-Step Guide

Deploy ArgoCD on Kubernetes running on Ubuntu. GitOps continuous delivery with automated sync, self-healing, and multi-cluster support.

⏱ 15 minutes helm installation alma-linux

Install Helm on AlmaLinux: Setup Guide

Install Helm 3 on AlmaLinux and configure chart repositories. Covers package manager install, script install, and shell completion for AlmaLinux 8/9.

⏱ 15 minutes helm installation amazon-linux

Install Helm on Amazon Linux: Setup Guide

Install Helm on Amazon Linux 2023 and AL2. Three install methods, chart repository setup, shell completion, and troubleshooting for Amazon Linux environments.

⏱ 15 minutes helm installation arch-linux

Install Helm on Arch Linux: Setup Guide

Install Helm 3 on Arch Linux and configure chart repositories. Covers package manager install, script install, and shell completion for Arch Linux rolling.

⏱ 15 minutes helm installation centos-stream

Install Helm on CentOS Stream: Setup Guide

Install Helm 3 on CentOS Stream and configure chart repositories. Covers package manager install, script install, and shell completion for CentOS Stream 9.

⏱ 15 minutes helm installation debian

Install Helm on Debian: Setup Guide

Install Helm 3 on Debian and configure chart repositories. Covers package manager install, script install, and shell completion for Debian 11/12.

⏱ 15 minutes helm installation fedora

Install Helm on Fedora: Setup Guide

Install Helm 3 on Fedora and configure chart repositories. Covers package manager install, script install, and shell completion for Fedora 39/40.

⏱ 15 minutes helm installation opensuse

Install Helm on openSUSE: Setup Guide

Install Helm 3 on openSUSE with package manager or script. Configure chart repos and shell completion for openSUSE Leap 15 / Tumbleweed.

⏱ 15 minutes helm installation oracle-linux

Install Helm on Oracle Linux: Setup Guide

Install Helm 3 on Oracle Linux and configure chart repositories. Covers package manager install, script install, and shell completion for Oracle Linux 8/9.

⏱ 15 minutes helm installation rhel

Install Helm on RHEL: Complete Setup Guide

Install Helm 3 on RHEL and configure chart repositories. Covers package manager install, script install, and shell completion for RHEL 8/9.

⏱ 15 minutes helm installation rocky-linux

Install Helm on Rocky Linux: Setup Guide

Install Helm 3 on Rocky Linux and configure chart repositories. Covers package manager install, script install, and shell completion for Rocky Linux 8/9.

⏱ 15 minutes helm installation suse-sles

Install Helm on SUSE SLES: Setup Guide

Install Helm 3 on SUSE SLES and configure chart repositories. Covers package manager install, script install, and shell completion for SLES 15.

⏱ 15 minutes helm installation ubuntu

Install Helm on Ubuntu: Setup Guide

Install Helm 3 on Ubuntu and configure chart repositories. Covers package manager install, script install, and shell completion for Ubuntu 22.04/24.04.

⏱ 15 minutes kubernetes installation alma-linux

Install Kubernetes on AlmaLinux

Step-by-step guide to install Kubernetes on AlmaLinux with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for AlmaLinux 8/9.

⏱ 15 minutes kubernetes installation amazon-linux

Install Kubernetes on Amazon Linux

Install Kubernetes on Amazon Linux with kubeadm. Covers containerd setup, kubeadm init, Calico CNI, and worker node joining for Amazon Linux 2023.

⏱ 15 minutes kubernetes installation arch-linux

Install Kubernetes on Arch Linux

Step-by-step guide to install Kubernetes on Arch Linux with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for Arch Linux rolling.

⏱ 15 minutes kubernetes installation centos-stream

Install Kubernetes on CentOS Stream

Step-by-step guide to install Kubernetes on CentOS Stream with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for CentOS Stream 9.

⏱ 15 minutes kubernetes installation debian

Install Kubernetes on Debian: Setup Guide

Step-by-step guide to install Kubernetes on Debian with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for Debian 11/12.

⏱ 15 minutes kubernetes installation fedora

Install Kubernetes on Fedora: Setup Guide

Step-by-step guide to install Kubernetes on Fedora with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for Fedora 39/40.

⏱ 15 minutes kubernetes installation opensuse

Install Kubernetes on openSUSE

Install Kubernetes on openSUSE with kubeadm. Covers containerd setup, kubeadm init, Calico CNI, and worker node joining for openSUSE Leap 15 / Tumbleweed.

⏱ 15 minutes kubernetes installation oracle-linux

Install Kubernetes on Oracle Linux

Step-by-step guide to install Kubernetes on Oracle Linux with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for Oracle Linux 8/9.

⏱ 15 minutes kubernetes installation rhel

Install Kubernetes on RHEL: Setup Guide

Step-by-step guide to install Kubernetes on RHEL with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for RHEL 8/9.

⏱ 15 minutes kubernetes installation rocky-linux

Install Kubernetes on Rocky Linux

Step-by-step guide to install Kubernetes on Rocky Linux with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for Rocky Linux 8/9.

⏱ 15 minutes kubernetes installation suse-sles

Install Kubernetes on SUSE SLES

Step-by-step guide to install Kubernetes on SUSE SLES with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for SLES 15.

⏱ 15 minutes kubernetes installation ubuntu

Install Kubernetes on Ubuntu: Setup Guide

Step-by-step guide to install Kubernetes on Ubuntu with kubeadm. Covers containerd, kubeadm init, CNI setup, and worker node joining for Ubuntu 22.04/24.04.

⏱ 15 minutes jobs cronjob backoff

Fix Kubernetes Job Failures and Retries

Debug Kubernetes Jobs stuck in backoff or hitting retry limits. Covers backoffLimit, activeDeadlineSeconds, and CronJob overlap.

⏱ 15 minutes karpenter autoscaling nodes

Karpenter Node Autoscaling for Kubernetes

Replace Cluster Autoscaler with Karpenter for faster, smarter node provisioning. Right-sized instances, spot fallback, consolidation, and GPU-aware scaling.

⏱ 15 minutes kubelet node notready

Fix Kubelet NotReady and Node Pressure Issues

Debug kubelet NotReady status, node pressure conditions, and eviction issues. Covers disk pressure, memory pressure, PID pressure, and network not ready.

⏱ 15 minutes admission-controllers webhooks opa

Kubernetes Admission Controllers and Webhooks

Build validating and mutating admission webhooks for Kubernetes. Policy enforcement with OPA Gatekeeper, Kyverno, and custom webhooks.

⏱ 15 minutes api-deprecation migration upgrade

Kubernetes API Deprecation Migration Guide

Migrate deprecated Kubernetes APIs before cluster upgrades. Detect deprecated resources with pluto and kubent, then update manifests with kubectl convert.

⏱ 15 minutes cni calico cilium

Kubernetes CNI Plugins Compared

Compare Calico, Cilium, Flannel, and Multus CNI plugins for Kubernetes. Performance benchmarks, features, and selection criteria for your cluster.

⏱ 15 minutes debugging kubectl troubleshooting

Kubernetes Debugging Toolkit and Commands

Essential kubectl debugging commands and tools for Kubernetes troubleshooting. Covers ephemeral containers, debug pods, network debugging, and log analysis.

⏱ 15 minutes disaster-recovery backup velero

Kubernetes Disaster Recovery Planning

Build a Kubernetes disaster recovery plan with etcd backups, Velero, cross-region replication, and RTO/RPO targets for production clusters.

⏱ 15 minutes etcd backup restore

Kubernetes etcd Operations and Maintenance

Manage etcd for Kubernetes: backup, restore, compaction, defragmentation, member management, and disaster recovery procedures.

⏱ 15 minutes gpu-sharing mps mig

GPU Sharing with MPS and MIG on Kubernetes

Share NVIDIA GPUs across multiple pods using MPS concurrent execution and MIG hardware partitioning. Maximize GPU utilization for inference workloads.

⏱ 15 minutes multi-cluster federation gitops

Multi-Cluster Management Strategies for K8s

Manage multiple Kubernetes clusters with federation, service mesh, and GitOps. Covers Admiralty, Liqo, Skupper, and ArgoCD ApplicationSets.

⏱ 15 minutes secrets vault external-secrets

Kubernetes Secrets Management Patterns

Kubernetes secrets management best practices 2026: External Secrets Operator, Vault, Sealed Secrets, SOPS, encryption at rest, and rotation.

⏱ 15 minutes service-accounts tokens oidc

K8s Service Accounts and Token Management

Configure service accounts, bound tokens, OIDC federation, and workload identity for Kubernetes. Migrate from legacy tokens to projected volumes.

⏱ 15 minutes sidecar patterns logging

Kubernetes Sidecar Container Patterns

Implement sidecar containers for logging, proxying, config reload, and security. Built-in sidecar support in Kubernetes 1.28+ with restartPolicy Always.

⏱ 15 minutes statefulset databases ordered-deployment

Kubernetes StatefulSet Advanced Patterns

Advanced StatefulSet patterns for databases, message queues, and distributed systems. Covers ordered deployment, persistent identity, and headless services.

⏱ 15 minutes windows mixed-os node-selector

Run Windows Containers on Kubernetes

Deploy Windows workloads on Kubernetes with mixed Linux and Windows node pools. Covers taints, node selectors, and Windows-specific networking.

⏱ 15 minutes longhorn storage distributed

Longhorn Distributed Storage on Kubernetes

Install Longhorn for distributed block storage on Kubernetes. Replicated volumes, snapshots, backups to S3, and disaster recovery across nodes.

⏱ 15 minutes nfd node-feature-discovery operator

Node Feature Discovery Operator for Kubernetes

Install and configure Node Feature Discovery (NFD) Operator to auto-detect hardware features like GPUs, NICs, CPU flags, and USB devices on Kubernetes nodes.

⏱ 15 minutes oomkilled memory resources

Fix OOMKilled Containers in Kubernetes

Debug and resolve OOMKilled container terminations. Understand memory limits, kernel OOM killer behavior, and right-sizing strategies for Kubernetes pods.

⏱ 15 minutes crun runc openshift

OpenShift crun vs runc Runtime Differences

Understand why pods behave differently on GPU vs CPU nodes in OpenShift. Compare crun and runc container runtimes, seccomp profiles, and syscall filtering.

⏱ 15 minutes opentelemetry otel tracing

OpenTelemetry Complete Setup on Kubernetes

Deploy OpenTelemetry Collector, auto-instrumentation, and exporters on Kubernetes. Unified traces, metrics, and logs pipeline to Jaeger, Prometheus, and Loki.

⏱ 15 minutes pvc resize expansion

Fix PVC Resize Stuck or Failed

Debug PVC expansion failures in Kubernetes. Covers allowVolumeExpansion, filesystem resize, and offline vs online expansion.

⏱ 15 minutes eviction preemption pdb

Fix Unexpected Pod Evictions in Kubernetes

Debug pods being evicted due to node pressure, preemption, or taint-based eviction. Understand eviction priorities, QoS classes, and PodDisruptionBudgets.

⏱ 15 minutes pending scheduling resources

Fix Pod Stuck in Pending State

Debug pods stuck in Pending status. Covers insufficient resources, node affinity mismatches, taint/toleration issues, and PVC binding failures.

⏱ 15 minutes podman tls x509

Fix Podman TLS x509 Behind Corporate Proxy

Resolve podman pull "x509: certificate signed by unknown authority" errors caused by corporate TLS-intercepting proxies. Extract and install the proxy CA.

⏱ 15 minutes pvc storage persistent-volume

Fix PVC Stuck in Pending State

Debug PersistentVolumeClaims stuck in Pending status. Covers storage class issues, provisioner failures, capacity problems, and access mode mismatches.

⏱ 15 minutes rbac forbidden permissions

Fix RBAC Permission Denied Errors

Debug RBAC forbidden and unauthorized errors in Kubernetes. Covers ClusterRole vs Role scope and service account permissions.

⏱ 15 minutes deployment rollout stuck

Fix Deploy Rollout Stuck at Partial Progress

Debug deployments stuck with unavailable replicas during rollout. Covers readiness probes, resource constraints, and rollback.

⏱ 15 minutes rook ceph storage

Rook Ceph Storage Cluster on Kubernetes

Deploy Rook Ceph for enterprise-grade distributed storage on Kubernetes. Block, file, and object storage with self-healing and automatic rebalancing.

⏱ 15 minutes istio envoy sidecar

Fix Service Mesh Sidecar Injection Failures

Debug Istio and Envoy sidecar injection issues. Covers missing sidecars, port conflicts, init container failures, and mTLS connection errors.

⏱ 15 minutes wasm webassembly spinkube

Run WebAssembly Workloads on Kubernetes

Deploy WASM workloads on Kubernetes using SpinKube and containerd-shim. Sub-millisecond cold starts, polyglot runtimes, and sandboxed edge computing.

⏱ 30 minutes fio nfs benchmark

Fio NFS Benchmark on OpenShift Nodes

Run fio NFS storage benchmarks on OpenShift using parallel pods with hostPath mounts. Measure IOPS, bandwidth, and latency across multiple NFS endpoints.

⏱ 25 minutes openshift machineconfig nfs

MachineConfig NFS Mount on OpenShift Nodes

Mount NFS shares on OpenShift worker nodes using MachineConfig systemd mount units. A production-safe way to persist NFS mounts on RHCOS nodes.

⏱ 10 minutes openshift oc-debug mount

OpenShift oc debug Mount Limitation

Why NFS and filesystem mounts via oc debug node disappear after the debug pod exits. Understand the container namespace isolation and use MachineConfig instead.

⏱ 5 minutes kubecon book community

KubeCon EU 2026 Book Giveaway Recap

Recap of the Kubernetes Recipes book giveaway at KubeCon EU 2026 Amsterdam. Photos from the signing sessions, community highlights, and how to get your copy.

⏱ 25 minutes knative ingress kourier

Configure Knative Ingress Networking

Set up Knative Serving ingress with Kourier, Istio, or Contour. Custom domains, TLS, path routing, and external visibility.

⏱ 20 minutes argocd gitops drift-detection

Detect ArgoCD Shadow Updates Out-of-Band

Detect and prevent ArgoCD shadow updates where manual kubectl changes bypass GitOps. Configure self-heal, sync, and drift detection.

⏱ 30 minutes gateway-api ingress migration

Migrate Ingress to Gateway API with ingress2gateway

Migrate Ingress to Gateway API using ingress2gateway. Convert Ingress rules to HTTPRoute and TLSRoute resources with zero-downtime parallel migration.

⏱ 60 minutes operator operator-sdk kubebuilder

Build a K8s Operator with Docker Testing

Build a Kubernetes operator with Operator SDK and Kubebuilder. Test with Docker, Kind, and envtest. Full TDD workflow to OLM bundle.

⏱ 15 minutes configmap size-limit configuration

Fix the Kubernetes ConfigMap Too Large Error

Resolve the 1 MiB ConfigMap size limit error. Split configs, use Secrets for binary data, mount volumes, or use external stores.

⏱ 15 minutes cri-o container-runtime openshift

Debug CRI-O Container Runtime Errors

Troubleshoot CRI-O issues on OpenShift nodes. Fix image pull failures, container start errors, storage driver problems, and CNI networking plugin failures.

⏱ 15 minutes openshift machineconfig degraded

Debug Degraded MachineConfigPool Nodes

Fix nodes stuck Degraded after MachineConfig updates. Check MCD logs, on-disk validation, and recovery for degraded workers.

⏱ 15 minutes eviction node-pressure resources

Debug Kubernetes Pod Eviction Reasons

Investigate why pods were evicted from Kubernetes nodes. Check node pressure conditions, resource limits, priority classes, and preemption events.

⏱ 15 minutes dns coredns resolution

Debug DNS Resolution Failures in Pods

Troubleshoot pods unable to resolve DNS names. Check CoreDNS health, ndots configuration, search domains, and NetworkPolicies blocking UDP port 53 DNS traffic.

⏱ 15 minutes etcd performance latency

Debug etcd Performance Issues in Kubernetes

Diagnose slow etcd causing API latency and leader election storms. Check disk IOPS, compaction, defrag, and network latency.

⏱ 15 minutes certificates tls expiration

Fix Expired Certificates in Kubernetes

Renew expired certificates causing API server failures and kubelet disconnections. Manual and automatic renewal for kubeadm and OpenShift.

⏱ 20 minutes nvidia gds gpu-operator

Enable GPUDirect Storage in ClusterPolicy

Enable NVIDIA GPUDirect Storage (GDS) in the GPU Operator ClusterPolicy for direct GPU-to-NVMe data paths. Driver module configuration and verification.

⏱ 20 minutes nvidia gpu time-slicing

GPU Time-Slicing on Kubernetes

Share GPUs across multiple workloads using NVIDIA time-slicing on Kubernetes. Configure the device plugin, set replica counts, and manage fairness.

⏱ 15 minutes helm hooks before-hook-creation

Helm before-hook-creation Hook

Use Helm before-hook-creation for database migrations and pre-install checks. Complete hook lifecycle, delete policies, and ordering.

⏱ 10 minutes helm sprig cat

Helm Sprig cat Function: Concatenate Strings

Helm Sprig cat function concatenates strings with spaces between arguments. Syntax, why cat inserts spaces, conditionals, and template examples.

⏱ 10 minutes helm sprig join

Helm Sprig join Function: List to String

Helm Sprig join function converts lists to delimited strings. Join list example with CSV output, label values, and multi-value template patterns.

⏱ 10 minutes helm sprig tostring

Helm Sprig toString Function Guide

Helm Sprig toString function converts values to strings in templates. Handle integers, booleans, lists, and nil values safely in Helm charts.

⏱ 15 minutes openshift imagestream import

Fix OpenShift ImageStream Import Errors

Debug ImageStream import failures in OpenShift. Resolve DNS errors, auth issues, TLS problems, and registry rate limiting.

⏱ 25 minutes openshift itms ingress

ITMS Race Condition with Ingress Controllers

Resolve the ITMS race condition where ImageTagMirrorSet rollouts deadlock with hostNetwork ingress controllers during MCO drain.

⏱ 30 minutes resiliency high-availability pdb

Kubernetes Resiliency Patterns Guide

Build resilient Kubernetes apps with PDBs, topology spread, anti-affinity, health probes, and graceful shutdown patterns.

⏱ 30 minutes resources optimization vpa

K8s Resource Optimization Strategies

Kubernetes resource optimization strategies and best practices. Right-size pods with VPA, Goldilocks dashboards, and resource allocation techniques.

⏱ 30 minutes security hardening pss

Harden Kubernetes Security Posture

Kubernetes security hardening: Pod Security Standards, RBAC least-privilege, network policies, secret encryption, and audit logging.

⏱ 15 minutes openshift machineconfig annotations

Inspect MachineConfig Annotations on Nodes

Read and interpret MachineConfig annotations on OpenShift nodes. Check desired vs current config, node state, and rendered config hashes to diagnose MCP issues.

⏱ 15 minutes openshift machineconfig chrony

Configure NTP Chrony via MachineConfig

Set custom NTP servers on OpenShift RHCOS nodes using MachineConfig. Fix time drift, configure chrony, and verify time synchronization across your cluster.

⏱ 15 minutes openshift machineconfig kernel

Set Kernel Parameters via MachineConfig

Tune kernel sysctl parameters on OpenShift nodes using MachineConfig. Set networking, memory, and performance sysctls on RHCOS.

⏱ 15 minutes openshift machineconfig registries

Configure Container Registries via MachineC...

Set up mirror registries and blocked registries on OpenShift nodes using MachineConfig to control CRI-O image pull on RHCOS.

⏱ 20 minutes openshift machineconfig mcp

Fix Stale MachineConfigPool Updates

Debug and resolve stale OpenShift MachineConfigPool updates. Identify blocked nodes, check MachineConfigDaemon logs, and unblock stuck MCP rollouts.

⏱ 15 minutes openshift pdb drain

MCP Drain Blocked by PDB: Workaround

Resolve OpenShift MachineConfigPool drain failures caused by PodDisruptionBudget violations. Scale down the blocking workloads and restore them after the update.

⏱ 15 minutes openshiftmachineconfigmcp

Configure MCP maxUnavailable for Rollouts

Control how many nodes the MachineConfig Operator updates simultaneously. Set maxUnavailable for faster rollouts or safer one-at-a-time updates in production.

⏱ 15 minutes openshiftmachineconfigmcp

Pause and Unpause MCP Rollouts

Temporarily pause MachineConfigPool rollouts to batch multiple MachineConfig changes or coordinate with maintenance windows. Unpause to resume node updates.

⏱ 30 minutes openshiftmachineconfigautomation

Automate MCP Updates with Drain Script

Bash script to automate OpenShift MachineConfigPool updates when drains are blocked by PDB violations. Auto-detects blockers, scales down, drains, and restores.

⏱ 15 minutes openshiftmachineconfigmcp

Separate Worker and Infra MachineConfigPools

Create dedicated MachineConfigPools for infrastructure and GPU nodes. Isolate MCP rollout blast radius and control update order for different node types.

⏱ 15 minutes namespaceterminatingfinalizer

Fix Namespace Stuck in Terminating

Remove Kubernetes namespaces stuck in Terminating state. Identify blocking finalizers, orphaned API resources, and safely force namespace cleanup procedures.

⏱ 15 minutes networkpolicyconnectivitydebugging

Debug NetworkPolicy Connectivity Issues

Troubleshoot pods unable to communicate despite correct Services. Verify NetworkPolicy rules, label selectors, and default deny.

⏱ 15 minutes openshifthostnetworkdrain

Node Drain Blocked by hostNetwork Port Conf...

Debug and fix OpenShift node drains that fail because hostNetwork pods cannot schedule replacements due to port exhaustion across the cluster.

⏱ 15 minutes nodenot-readykubelet

Debug Node NotReady Status in Kubernetes

Diagnose Kubernetes nodes stuck in NotReady state. Check kubelet logs, container runtime, network, disk pressure, and certificates.

⏱ 30 minutes nvidiagpu-operatorgpu

NVIDIA GPU Operator Setup on Kubernetes

Install and configure NVIDIA GPU Operator on Kubernetes. Driver containers, toolkit, device plugin, DCGM monitoring, and ClusterPolicy setup.

⏱ 45 minutes nvidiagpu-operatorgpudirect

NVIDIA Open GPU + GPUDirect RDMA + DOCA-OFE...

Deploy NVIDIA AI networking on Kubernetes: Open GPU driver with DMA-BUF, GPUDirect RDMA, DOCA-OFED, and SR-IOV VF isolation.

⏱ 15 minutes draindry-runmaintenance

Use oc adm drain Dry-Run for Diagnostics

Preview node drain impact without evicting pods. Identify PDB violations, unmanaged pods, and local storage blockers before maintenance.

⏱ 25 minutes openclawargocdgitops

OpenClaw GitOps Deployment with ArgoCD

Deploy OpenClaw on Kubernetes using ArgoCD for GitOps automation. Application definition, sync policies, drift detection, and secrets.

⏱ 30 minutes openclawexternal-secretsvault

OpenClaw API Keys External Secrets Operator

Manage OpenClaw API keys and gateway tokens using External Secrets Operator with AWS Secrets Manager, Vault, or GCP Secret Manager on Kubernetes.

⏱ 15 minutes openclawkindlocal-development

OpenClaw Local Development with Kind

Set up a local Kind cluster for OpenClaw development and testing. Auto-detect Docker or Podman, create a single-node cluster, and deploy OpenClaw in minutes.

⏱ 25 minutes openclawhelmchromium

OpenClaw Helm Chart with Chromium Sidecar

Deploy OpenClaw using the community Helm chart with Chromium browser sidecar for web automation, declarative skill installation, and custom values overlays.

⏱ 25 minutes openclawingresstls

Expose OpenClaw via K8s Ingress with TLS

Configure Kubernetes Ingress with TLS to expose OpenClaw gateway securely. Covers cert-manager, NGINX Ingress, and allowed origins.

⏱ 30 minutes openclawkustomizemulti-environment

OpenClaw Multi-Env Deploy with Kustomize

Deploy OpenClaw across dev, staging, and production Kubernetes environments using Kustomize overlays for configs and secrets.

⏱ 15 minutes openclawhealth-probesliveness

OpenClaw Health Probes on Kubernetes

Configure liveness and readiness probes for OpenClaw on Kubernetes. Custom Node.js health checks against /healthz and /readyz endpoints with proper timing.

⏱ 35 minutes openclawmulti-agentteam

OpenClaw Multi-Agent Team Deployment

Deploy multiple specialized OpenClaw agents as Kubernetes pods. Dedicated DevOps, security, and writing agents with shared workspace.

⏱ 20 minutes openclawai-modelsmulti-provider

OpenClaw Multi-Model Provider Setup

Configure OpenClaw with multiple AI providers on Kubernetes. Anthropic, OpenAI, Gemini, OpenRouter with fallback chains and cost control.

⏱ 30 minutes openclawiotedge

OpenClaw Node Pairing for IoT and Edge Devices

Pair phones, Raspberry Pi, and edge devices with OpenClaw on Kubernetes. Camera, location, screen control, and remote command execution.

⏱ 20 minutes openclawopenshiftscc

OpenClaw on OpenShift with SCCs and Routes

Deploy OpenClaw on OpenShift with Security Context Constraints, Routes for TLS termination, and OpenShift-specific considerations for non-root containers.

⏱ 25 minutes openclawoperatorai-agents

OpenClaw Operator for Kubernetes

Deploy OpenClaw AI agents on Kubernetes using the official operator. CRD-based lifecycle, Chromium sidecar, auto-update, and backup.

⏱ 20 minutes openclawpersistent-volumesstate-management

OpenClaw Persistent State Management

Manage OpenClaw agent state and workspace data with Kubernetes PVCs. Init container config seeding, backups, and storage classes.

⏱ 15 minutes openclawresource-limitstuning

OpenClaw Resource Limits and Tuning

Size CPU, memory, and storage for OpenClaw on Kubernetes. Tuning profiles for light usage, browser automation, and production deployments.

⏱ 20 minutes openclawpod-securityhardening

OpenClaw Pod Security Hardening on Kubernetes

Harden OpenClaw pods with read-only filesystem, dropped capabilities, non-root user, seccomp profiles, and resource limits.

⏱ 35 minutes openclawwebhooksautomation

OpenClaw Webhook Automation on Kubernetes

Configure OpenClaw webhooks on Kubernetes for GitHub, Jira, and PagerDuty event-driven automation. Ingress routing, HMAC validation, and hook handler patterns.

⏱ 20 minutes openshiftingresshaproxy

OpenShift Ingress Router Troubleshooting

Debug OpenShift HAProxy router issues: pods stuck Pending, hostPort conflicts, PDB violations during maintenance, and custom router deployment scaling problems.

⏱ 15 minutes openshiftmachineconfigmcd

Debug MachineConfigDaemon Logs

Read and interpret OpenShift MachineConfigDaemon logs to diagnose node update failures. Common error patterns, drain issues, and config application problems.

⏱ 10 minutes maintenancenode-managementdrain

Cordon, Drain, and Uncordon Nodes

Safely remove workloads from OpenShift and Kubernetes nodes for maintenance. Cordon to prevent scheduling, drain to evict pods, uncordon to restore.

⏱ 15 minutes openshiftoauthauthentication

Debug OpenShift OAuth Login Failures

Troubleshoot OpenShift console and CLI login failures. Check OAuth server pods, identity provider config, and expired tokens.

⏱ 15 minutes openshiftpdbingress

Configure PDBs for OpenShift Routers

Set PodDisruptionBudgets for OpenShift IngressController routers. Balance availability during maintenance with node drain ability.

⏱ 20 minutes openshiftmonitoringprometheus

Enable User Workload Monitoring OpenShift

Enable user workload monitoring on OpenShift. Deploy ServiceMonitor, PodMonitor, alerting rules, and Grafana dashboards.

⏱ 15 minutes openshiftolmoperator

Fix Stuck OLM Operator Subscriptions

Debug Operator Lifecycle Manager subscriptions stuck in pending or failed state. Resolve catalog source issues, approval policies, and CSV dependency conflicts.

⏱ 15 minutes pdbdisruption-budgeteviction

PDB Allowed Disruptions Zero: Debugging

Debug PodDisruptionBudgets stuck at zero allowed disruptions. Understand minAvailable vs maxUnavailable, fix eviction failures, and plan for maintenance.

⏱ 15 minutes pvpvcterminating

Fix PV Stuck in Terminating State

Resolve PVs and PVCs stuck in Terminating status. Remove finalizers safely, check volume detachment, and handle storage issues.

⏱ 15 minutes hostnetworkportsscheduling

Manage hostNetwork Pod Port Allocation

Plan and manage host port usage for hostNetwork pods. Prevent port conflicts, track allocations, and handle port exhaustion.

⏱ 15 minutes resourcequotalimitrangescheduling

Fix ResourceQuota Exceeded Errors

Debug ResourceQuota violations that block pod creation. Understand LimitRange defaults, ResourceQuota, and namespace management.

⏱ 15 minutes scalingrestoremaintenance

Restore Scaled Deployments After Node Drain

Restore deployments scaled down for maintenance. Verify node health, check pod scheduling, and confirm service availability.

⏱ 15 minutes scalingdrainpdb

Scale Deployments to Unblock Node Drains

Safely scale down deployments that block node drains due to PDB violations. Record original replicas, scale to zero, drain, then restore after the node returns.

⏱ 15 minutes serviceendpointsreadiness

Debug Service with No Ready Endpoints

Troubleshoot Services showing zero endpoints. Verify label selectors, readiness probes, pod status, and port configuration.

Fix Node Untolerated Taint Scheduling Errors

Fix node untolerated taint errors that leave pods stuck in Pending. Covers NoSchedule, PreferNoSchedule, and NoExecute effects, plus toleration syntax.

⏱ 15 minutes webhookadmissiontimeout

Fix Admission Webhook Timeout Errors

Debug admission webhook failures blocking pod creation. Identify failing webhooks, check timeouts, and set failurePolicy.

⏱ 20 minutes openshiftitmsimagetagmirrorset

ITMS External-to-External Registry Mirroring

Configure OpenShift ImageTagMirrorSet to map external registries to your private registry. Mirror Docker Hub, GHCR, Quay.io, and NVIDIA NGC.

⏱ 25 minutes openshiftitmsidms

How ITMS Updates registries.conf via Machin...

How ITMS and IDMS update /etc/containers/registries.conf on immutable CoreOS nodes via MCO and MachineConfig. Full chain deep-dive.

⏱ 10 minutes communitymilestonekubernetes

400 Recipes Milestone: What We Built & What...

Kubernetes Recipes reaches 400 articles. Explore new AI/GPU infrastructure, NVIDIA networking, ArgoCD GitOps, OpenShift, and RHACS security recipes.

⏱ 30 minutes model-servingstoragehostpath

AI Model Storage: hostPath vs PVC Inference

Deploy AI models on Kubernetes using hostPath and PVC storage. Compare performance, security trade-offs, and production patterns for model serving.

⏱ 15 minutes quayrobot-accountpermissions

Quay Default Permissions for Robot Accounts

Configure Quay Registry default permissions to auto-grant read access to robot accounts on every new repository. API and team patterns.

⏱ 15 minutes kubeconbookcommunity

KubeCon EU 2026 Book Signing Events

Join Luca Berton at two KubeCon Amsterdam events: Signal Overflow at Booking.com HQ (Mon 23 Mar) and book signing at vCluster booth #521 (Tue 24 Mar).

⏱ 35 minutes volcanobatchgang-scheduling

Volcano Job minAvailable Gang Scheduling

Configure Volcano job minAvailable for gang scheduling on Kubernetes. Batch AI training, fair-share queues, job plugins, and GPU preemption guide.

⏱ 25 minutes sr-iovnetworkingopenshift

Configure SR-IOV agent-config.yaml Device b...

Use agent-config.yaml to select network devices by PCI path for SR-IOV VF creation, ensuring deterministic NIC targeting across OpenShift nodes.

⏱ 20 minutes aiperfbenchmarkingnvidia

AIPerf Benchmark LLMs on Kubernetes

Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, throughput with real-time dashboard and GPU telemetry.

⏱ 30 minutes aiperfbenchmarkingconcurrency

AIPerf Concurrency Sweep on K8s

Run AIPerf concurrency sweeps on Kubernetes to find optimal LLM serving capacity. Automate 1-128 concurrent user benchmarks with batch Jobs.

⏱ 25 minutes aiperfbenchmarkinggoodput

AIPerf Goodput and SLO Benchmarks

Measure LLM goodput with AIPerf on Kubernetes. Define SLOs for TTFT and ITL, calculate effective throughput, and benchmark with timeslice analysis.

⏱ 30 minutes aiperfbenchmarkingcomparison

AIPerf Multi-Model Benchmark on K8s

Compare multiple LLMs and serving backends with AIPerf on Kubernetes. Benchmark vLLM vs TGI vs Triton with automated multi-run confidence intervals.

⏱ 25 minutes aiperfbenchmarkingtrace-replay

AIPerf Trace Replay Benchmarks on K8s

Replay production traffic traces with AIPerf on Kubernetes. Use moon_cake format, ShareGPT datasets, and fixed schedules for realistic LLM benchmarks.

⏱ 15 minutes air-gapopenshiftquay

Air-Gapped OpenShift with Quay Mirror

Deploy OpenShift in air-gapped environments with local Quay registry mirror, ImageDigestMirrorSet, and custom CatalogSources.

⏱ 20 minutes argocdgitopshelm

ArgoCD App of Apps with Helm Values

Use the ArgoCD App of Apps pattern with Helm value overrides per environment, enabling templated Application manifests and DRY multi-environment configurations.

⏱ 20 minutes argocdgitopsapp-of-apps

ArgoCD App of Apps Pattern Explained

Implement the ArgoCD App of Apps pattern to manage multiple applications from a parent Application for cluster bootstrapping.

⏱ 25 minutes argocdgitopsapp-of-apps

ArgoCD App of Apps with Sync Waves

Combine the ArgoCD App of Apps pattern with sync waves to bootstrap entire clusters in dependency order, from CRDs and operators to application workloads.

⏱ 15 minutes argocdapplicationsetsmulti-tenant

ArgoCD ApplicationSets for Multi-Tenant GPUs

Use ArgoCD ApplicationSets to auto-discover and provision GPU tenant overlays from Git directories with per-tenant sync policies.

⏱ 15 minutes argocdgitopsdeclarative

ArgoCD Declarative Application Setup

Define ArgoCD Applications, Projects, and repository credentials declaratively using Kubernetes manifests for reproducible GitOps configuration.

⏱ 25 minutes argocdgitopsmulti-cluster

ArgoCD Multi-Cluster App of Apps

Manage multiple Kubernetes clusters with ArgoCD App of Apps, deploying shared infrastructure and cluster-specific workloads from a single GitOps repository.

⏱ 20 minutes operatorgroupolmargocd

Manage OperatorGroups with ArgoCD

Deploy and manage OLM OperatorGroup resources via ArgoCD for GitOps-driven operator lifecycle management in OpenShift namespaces.

⏱ 15 minutes argocdgitopshooks

ArgoCD PreSync and PostSync Hooks

Use ArgoCD PreSync hooks for database migrations and PostSync hooks for smoke tests, with SyncFail hooks for automated rollback and cleanup.

⏱ 20 minutes argocdgitopscanary

ArgoCD Sync Waves for Canary Deployments

Use ArgoCD sync waves for canary deployments with Istio traffic splitting, automated validation, and progressive rollout strategies.

⏱ 15 minutes argocdgitopscrds

ArgoCD Sync Waves for CRD & Operator Ordering

Use ArgoCD sync waves to deploy Custom Resource Definitions before operators and custom resources, preventing CRD race conditions in GitOps pipelines.

⏱ 15 minutes argocdgitopssync-waves

ArgoCD Sync Waves for Ordered Deployments

Use ArgoCD sync waves to control the order of Kubernetes resource deployment, ensuring dependencies like namespaces and CRDs are created before workloads.

⏱ 20 minutes argocdgitopsdatabase

ArgoCD Sync Waves for Database Migrations

Use ArgoCD sync waves and PreSync hooks to run database migrations before deploying application code, with rollback strategies.

⏱ 20 minutes nvidiagpu-operatormofed

ClusterPolicy MOFED Upgrade Strategy

Configure safe MOFED driver upgrade policies in the NVIDIA GPU Operator ClusterPolicy with rolling updates, node draining, and rollback procedures.

⏱ 15 minutes cnpgpostgresqldisaster-recovery

CNPG Disaster Recovery and Replication

Set up cross-region PostgreSQL disaster recovery with CloudNativePG using replica clusters, WAL shipping, and automated failover.

⏱ 15 minutes cnpgpostgresqldatabase

CloudNativePG PostgreSQL Operator

Deploy highly available PostgreSQL clusters on Kubernetes using CloudNativePG operator with automated failover and backups.

⏱ 15 minutes cnpgpostgresqlscaling

CNPG Cluster Scaling and Upgrades

Scale CloudNativePG clusters, perform rolling PostgreSQL major upgrades, and manage storage expansion without downtime in Kubernetes.

⏱ 20 minutes certificatescatls

Add Custom CA Certificates in Kubernetes

Configure custom Certificate Authority trust in vanilla Kubernetes using ConfigMap mounts, node-level trust stores, and containerd registry configuration.

⏱ 25 minutes certificatescatls

Add Custom CA in OpenShift and Kubernetes

Configure custom Certificate Authority trust in both OpenShift and vanilla Kubernetes for private registries, internal services, and corporate PKI.

⏱ 20 minutes openshiftcertificatesca

Add Custom CA Certificates in OpenShift

Configure custom Certificate Authority trust across an OpenShift cluster using proxy config, image config, and automatic CA bundle injection into pods.

⏱ 10 minutes secretsbase64troubleshooting

Decode and Inspect Kubernetes Docker Secrets

Decode base64-encoded dockerconfigjson secrets to verify registry credentials, troubleshoot ImagePullBackOff errors, and audit pull secret configurations.

⏱ 15 minutes dellpoweredgexe7740

Dell PowerEdge XE7740 GPU Node Setup

Configure Dell PowerEdge XE7740 GPU nodes with H200 GPUs for OpenShift and Kubernetes including BIOS, power, cooling, and network setup.

⏱ 20 minutes fish-audiotext-to-speechtts

Deploy Fish Audio TTS on Kubernetes

Deploy Fish Audio S2-Pro 5B text-to-speech model on Kubernetes for high-quality voice synthesis with multi-speaker support and streaming audio.

⏱ 45 minutes glm-5zhipullm

Deploy GLM-5 754B on Kubernetes

Deploy Zhipu AI GLM-5 754B model on Kubernetes with vLLM. One of the largest open-weight models with multi-node tensor parallelism across 8+ GPUs.

⏱ 15 minutes graniteibmspeech-recognition

Deploy Granite 4.0 Speech on Kubernetes

Deploy IBM Granite 4.0 1B Speech model on Kubernetes for automatic speech recognition. Lightweight 2B model runs on CPU or small GPU for STT workloads.

⏱ 45 minutes kimimoonshotmixture-of-experts

Deploy Kimi K2.5 1.1T MoE on Kubernetes

Deploy Moonshot AI Kimi-K2.5 1.1T MoE multimodal model on Kubernetes. The largest open MoE model with 2.69M downloads for frontier AI tasks.

⏱ 30 minutes llamallmvllm

Deploy Llama 2 70B on Kubernetes

Deploy Meta Llama 2 70B on Kubernetes with multi-GPU tensor parallelism, vLLM serving, and production-ready health checks and resource limits.

⏱ 15 minutes llamallama-3.1meta

Deploy Llama 3.1 8B Instruct on K8s

Deploy Meta Llama 3.1 8B Instruct on Kubernetes with vLLM. Production-ready single-GPU deployment with 128K context, tool calling, and autoscaling.

⏱ 25 minutes ltxvideo-generationimage-to-video

Deploy LTX Video Generation on K8s

Deploy Lightricks LTX-2.3 image-to-video model on Kubernetes for AI video generation with batch processing and S3 output storage.

⏱ 30 minutes minimaxllmmulti-gpu

Deploy MiniMax M2.5 229B on Kubernetes

Deploy MiniMax M2.5 229B model on Kubernetes with vLLM. High-performance LLM with 485K downloads, optimized for multi-turn conversation and long context.

⏱ 25 minutes nemotronnvidiamixture-of-experts

Deploy NVIDIA Nemotron 120B MoE on K8s

Deploy NVIDIA Nemotron-3-Super-120B-A12B MoE model on Kubernetes. 120B total parameters with 12B active for enterprise-grade inference.

⏱ 20 minutes phi-4microsoftsmall-language-model

Deploy Microsoft Phi-4 on Kubernetes

Deploy Microsoft Phi-4 small language model on Kubernetes with vLLM. Efficient 14B model with GPT-4-level reasoning on a single GPU.

⏱ 20 minutes phi-4microsoftreasoning

Deploy Phi-4 Reasoning Vision on K8s

Deploy Microsoft Phi-4-reasoning-vision-15B on Kubernetes for multimodal chain-of-thought reasoning with visual understanding on a single GPU.

⏱ 30 minutes qwen3mixture-of-expertsmoe

Deploy Qwen3 235B MoE on Kubernetes

Deploy Alibaba Qwen3-235B-A22B mixture-of-experts model on Kubernetes. Only 22B parameters active per token for efficient 235B-class inference.

⏱ 25 minutes qwen3code-generationcoding-assistant

Deploy Qwen3 Coder 80B on Kubernetes

Deploy Qwen3-Coder-Next 80B on Kubernetes for code generation, review, and refactoring. Production-ready AI coding assistant with multi-GPU serving.

⏱ 15 minutes qwen3text-to-speechtts

Deploy Qwen3 TTS on Kubernetes

Deploy Qwen3-TTS-12Hz-1.7B-CustomVoice on Kubernetes for text-to-speech with custom voice cloning. 1.13M downloads, lightweight single-GPU deployment.

⏱ 20 minutes qwen3.5mixture-of-expertsmoe

Deploy Qwen3.5 35B MoE on Kubernetes

Deploy Alibaba Qwen3.5-35B-A3B mixture-of-experts multimodal model on Kubernetes. 35B total parameters with only 3B active for ultra-efficient inference.

⏱ 30 minutes qwen3.5mixture-of-expertsmoe

Deploy Qwen3.5 397B MoE on Kubernetes

Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.

⏱ 20 minutes qwen3.5multimodalvision-language

Deploy Qwen3.5 9B Multimodal on K8s

Deploy Alibaba Qwen3.5-9B vision-language model on Kubernetes with vLLM. Process images and text with a single GPU deployment.

⏱ 25 minutes retinanetobject-detectioncomputer-vision

RetinaNet Object Detection on K8s

Deploy RetinaNet object detection model on Kubernetes with Triton Inference Server, TensorRT optimization, and batch processing pipelines.

⏱ 25 minutes sarvammultilingualindic-languages

Deploy Sarvam 105B on Kubernetes

Deploy Sarvam 105B multilingual LLM on Kubernetes with vLLM. India's largest open language model with native support for 10+ Indic languages.

⏱ 30 minutes stable-diffusionsdxlimage-generation

Stable Diffusion XL on Kubernetes

Deploy Stable Diffusion XL for image generation on Kubernetes with TensorRT acceleration, queued batch processing, and S3 output storage.

⏱ 20 minutes whisperspeech-to-texttranscription

Deploy Whisper Speech-to-Text on K8s

Deploy OpenAI Whisper for speech-to-text on Kubernetes with faster-whisper, batch transcription Jobs, and real-time streaming endpoints.

⏱ 15 minutes distributed-inferencetensor-parallelismpipeline-parallelism

Distributed Inference on Kubernetes

Deploy distributed LLM inference with tensor parallelism across multiple GPUs and pipeline parallelism across nodes on Kubernetes.

⏱ 15 minutes nvidiadocardma

NVIDIA DOCA Driver Container in Kubernetes

Deploy and configure NVIDIA DOCA Driver containers via NicClusterPolicy for RDMA, NFS-RDMA, and precompiled driver builds.

⏱ 15 minutes nvidiadocaopenshift

DOCA Driver on OpenShift with DTK

Build and deploy precompiled NVIDIA DOCA Driver containers on OpenShift using DriverToolKit, MachineConfig, and upgrade lifecycle.

⏱ 25 minutes nvidiagdsnvme

GPU Operator GDS with NVMe and NFS RDMA

Configure GPUDirect Storage for local NVMe drives and NFS over RDMA in Kubernetes, including cuFile verification and performance benchmarking.

⏱ 15 minutes genai-perfbenchmarkllm

GenAI-Perf Benchmark LLM Serving

Benchmark LLM inference endpoints with NVIDIA GenAI-Perf for throughput, latency percentiles, time-to-first-token, and ITL metrics.

⏱ 25 minutes genai-perftritonbenchmarking

GenAI-Perf Benchmark Triton on K8s

Benchmark NVIDIA Triton Inference Server performance on Kubernetes using GenAI-Perf. Measure TTFT, inter-token latency, throughput, and GPU telemetry.

⏱ 15 minutes gitopsargocdbare-metal

GitOps Bootstrap for Bare-Metal GPU Clusters

Bootstrap bare-metal GPU clusters with ArgoCD and Kustomize in air-gapped environments with NVIDIA GPU and Network Operators.

⏱ 25 minutes nvidiagpu-operatorgds

GPU Operator GPUDirect Storage GDS Module

Enable the GPUDirect Storage (GDS) module in the NVIDIA GPU Operator ClusterPolicy for direct GPU-to-storage data transfers bypassing CPU and system memory.

⏱ 20 minutes nvidiagpu-operatorclusterpolicy

GPU Operator ClusterPolicy Complete Reference

Complete reference for the NVIDIA GPU Operator ClusterPolicy CRD covering driver, toolkit, device plugin, MOFED, GDS, MIG, and DCGM configuration options.

⏱ 30 minutes nvidiagpu-operatormofed

NVIDIA GPU Operator MOFED Driver Configuration

Configure the NVIDIA GPU Operator to deploy Mellanox OFED drivers for high-performance RDMA networking on Kubernetes GPU nodes with InfiniBand and RoCE support.

⏱ 15 minutes gpu-operatorupgradecanary

GPU Operator Canary Upgrade Strategy

Safely upgrade NVIDIA GPU Operator using canary node pools, 48-hour bake periods, validation gates, and Git-based rollback.

⏱ 15 minutes multi-tenantkustomizegpu

GPU Tenant Bootstrap Bundle for Kubernetes

Provision GPU tenants with a single Kustomize bundle containing namespace, RBAC, NetworkPolicy, quotas, and HAProxy VIP config.

⏱ 15 minutes monitoringgpuchargeback

Per-Tenant GPU Monitoring and Chargeback

Build per-tenant GPU monitoring dashboards with queue time, utilization, thermal metrics, and GPU-hour chargeback on Kubernetes.

⏱ 15 minutes slogpuobservability

GPU Tenant SLO Observability on Kubernetes

Define and monitor GPU tenant SLOs for queue time, inference latency, GPU utilization, and job completion rate with Prometheus alerting.

⏱ 15 minutes upgradeversion-matrixgpu-operator

GPU Cluster Upgrade Version Matrix

Maintain a version compatibility matrix for GPU Operator, Network Operator, drivers, firmware, CUDA, and OpenShift for safe upgrades.

⏱ 15 minutes gpudirectrdmadma-buf

GPUDirect RDMA via DMA-BUF on Kubernetes

Configure GPUDirect RDMA using DMA-BUF kernel subsystem for zero-copy GPU-to-GPU transfers over InfiniBand and RoCE networks.

⏱ 15 minutes haproxykeepalivedmulti-tenant

HAProxy Keepalived Multi-Tenant GPU Ingress

Configure HAProxy with Keepalived VIPs for per-tenant GPU cluster ingress with Jinja2 templates and per-tenant access logging.

⏱ 15 minutes infinibandethernetrdma

InfiniBand vs Ethernet for AI on Kubernetes

Compare InfiniBand and Ethernet networking for GPU AI workloads on Kubernetes, including RDMA, RoCE, latency, and throughput considerations.

⏱ 15 minutes kubeflowdistributed-trainingpytorch

Distrib. Training Kubeflow Training Operator

Run multi-node distributed PyTorch and TensorFlow training jobs using Kubeflow Training Operator with NCCL, RDMA, and shared storage.

⏱ 15 minutes kubeflowtraining-operatordistributed-training

Kubeflow Training Operator on Kubernetes

Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.

⏱ 15 minutes leaderworkersetlwsdistributed-training

LeaderWorkerSet Operator for AI Workloads

Deploy distributed AI training with LeaderWorkerSet Operator on Kubernetes and OpenShift for leader-worker topology with gang scheduling.

⏱ 15 minutes llama-stacknvidia-nimllama

Llama Stack on Kubernetes with NVIDIA NIM

Deploy Meta Llama Stack on Kubernetes for unified inference, RAG, agents, and safety APIs using NVIDIA NIM as the inference backend.

⏱ 15 minutes mariadboperatordatabase

MariaDB Operator on Kubernetes

Deploy highly available MariaDB clusters on Kubernetes using MariaDB Operator with Galera replication, automated backups, and connection pooling.

⏱ 15 minutes mlperfbenchmarkinginference

MLPerf Benchmarking on Kubernetes

Run MLPerf inference and training benchmarks on Kubernetes GPU clusters to validate AI workload performance and compare hardware configurations.

⏱ 25 minutes model-cachingshared-memorypvc

Shared Model Caching Across Pods on Kubernetes

Optimize LLM inference startup and reduce storage costs by sharing model weights across pods using emptyDir, hostPath, ReadWriteMany PVCs, and init.

⏱ 15 minutes mofeddocaopenshift

MOFED and DOCA Driver Building for OpenShift

Build NVIDIA MOFED and DOCA drivers for OpenShift using DriverToolKit, Buildah, and MachineConfig for RDMA and GPU networking.

⏱ 30 minutes mpimpi-operatordistributed-training

MPI Operator for Distributed Training

Deploy MPI Operator on Kubernetes for distributed GPU training with Horovod and NCCL. Run multi-node MPI jobs natively in Kubernetes pods.

⏱ 15 minutes multi-tenantgpunamespace

Multi-Tenant GPU Namespace Isolation

Isolate GPU workloads across tenants using namespaces, RBAC, NetworkPolicy, and ResourceQuotas on OpenShift and Kubernetes.

⏱ 15 minutes networkpolicymulti-tenantgpu

NetworkPolicy Deny-Default for GPU Tenants

Implement deny-by-default NetworkPolicy for GPU tenant namespaces with NCCL port exceptions and DNS egress on Kubernetes.

⏱ 25 minutes nfsordmardmabonding

NFSoRDMA Bond with Access Mode Switch

Configure bonded NICs for NFS over RDMA using switch access mode for VLAN assignment. Aggregation on untagged interfaces for RDMA redundancy.

⏱ 25 minutes nfsordmardmanfs

NFSoRDMA Dedicated NIC Configuration

Configure dedicated NICs for NFS over RDMA on Kubernetes worker nodes. NFSoRDMA requires untagged interfaces — no VLAN tagging supported.

⏱ 15 minutes nfsordmardmamtu

NFSoRDMA Jumbo Frames MTU Configuration

Configure 9000 MTU jumbo frames for NFSoRDMA interfaces using NNCP to maximize RDMA throughput on Kubernetes worker nodes.

⏱ 30 minutes nfsordmardmavlan

NFSoRDMA Multi-VLAN Switch Access Mode

Design multi-VLAN NFSoRDMA networks using switch access mode ports. Separate storage, replication, and backup traffic with dedicated NICs per VLAN.

⏱ 15 minutes nfsordmardmapersistent-volume

NFSoRDMA Persistent Volume for Kubernetes

Create PersistentVolumes and StorageClasses for NFSoRDMA storage with RDMA transport, optimized mount options, and ReadWriteMany access.

⏱ 20 minutes nfsordmardmatroubleshooting

NFSoRDMA Troubleshooting and Performance

Troubleshoot NFS over RDMA connectivity issues, diagnose TCP fallback, tune performance, and benchmark RDMA throughput on Kubernetes workers.

⏱ 30 minutes nfsordmardmanfs

NFSoRDMA Worker Node Setup Guide

Complete worker node setup for NFS over RDMA including kernel modules, NFS client configuration, PersistentVolume mounts, and RDMA transport verification.

⏱ 15 minutes nvidiamofednode-selection

NicClusterPolicy MOFED Affinity & Node Sele...

Configure NicClusterPolicy node selectors and affinity rules to deploy MOFED drivers only on RDMA-capable nodes in Kubernetes clusters.

⏱ 20 minutes nncpnmstatebonding

NNCP Bond Interfaces on Worker Nodes

Create bonded network interfaces on Kubernetes worker nodes using NodeNetworkConfigurationPolicy for NIC redundancy and link aggregation.

⏱ 15 minutes nncpnmstatedns

NNCP DNS and Static Routes on Workers

Configure static routes, DNS servers, and policy-based routing on worker nodes using NodeNetworkConfigurationPolicy for multi-network setups.

⏱ 20 minutes nncpnmstatelinux-bridge

NNCP Linux Bridge on Worker Nodes

Create Linux bridges on Kubernetes worker nodes using NodeNetworkConfigurationPolicy for KubeVirt VM networking and pod bridging.

⏱ 15 minutes nncpnmstatemtu

NNCP MTU and Jumbo Frames on Workers

Set MTU and enable jumbo frames on worker node interfaces using NodeNetworkConfigurationPolicy for high-throughput storage and AI networking.

⏱ 30 minutes nncpnmstatemulti-nic

NNCP Multi-NIC Architecture for Workers

Design a complete multi-NIC worker node architecture with NNCP for separated management, storage, tenant, and GPU traffic using bonds, VLANs, and bridges.

⏱ 25 minutes nncpnmstateovs

NNCP OVS Bridge on Worker Nodes

Configure Open vSwitch bridges on Kubernetes worker nodes using NodeNetworkConfigurationPolicy for advanced SDN and DPDK networking.

⏱ 15 minutes nncpnmstatetroubleshooting

NNCP Rollback and Troubleshooting

Troubleshoot NodeNetworkConfigurationPolicy failures, monitor enactments, configure rollback timeouts, and recover from bad network configurations.

⏱ 25 minutes nncpnmstatesriov

NNCP SR-IOV and Macvlan on Workers

Configure SR-IOV virtual functions and macvlan interfaces on worker nodes using NodeNetworkConfigurationPolicy for high-performance networking.

⏱ 15 minutes nncpnmstatenetworking

NNCP Static IP Assignment on Worker Nodes

Use NodeNetworkConfigurationPolicy to assign static IPv4 and IPv6 addresses to worker node interfaces with nodeSelector targeting.

⏱ 15 minutes nncpnmstatevlan

NNCP VLAN Tagging on Worker Nodes

Configure VLAN interfaces on Kubernetes worker nodes using NodeNetworkConfigurationPolicy for network segmentation and traffic isolation.

⏱ 15 minutes nodeportingressgrpc

NodePort Raw Traffic vs HTTPS Ingress

Route raw GPU inference traffic through NodePort for low-latency gRPC, and serve HTTPS model traffic through the OpenShift ingress controller.

⏱ 30 minutes nvidiaclaramedical-ai

Deploy NVIDIA Clara on Kubernetes

Deploy NVIDIA Clara medical AI and drug discovery platform on Kubernetes. Run digital biology and medtech inference workloads with GPU acceleration.

⏱ 15 minutes nvidiah200gpu

NVIDIA H200 GPU Workloads on Kubernetes

Deploy and optimize AI workloads on NVIDIA H200 GPUs with 141 GB of HBM3e memory for large model inference and training on Kubernetes.

⏱ 15 minutes nvidianemotraining

NVIDIA NeMo Training on Kubernetes

Deploy NVIDIA NeMo framework on Kubernetes for large language model pre-training, fine-tuning, and RLHF with multi-node GPU clusters.

⏱ 15 minutes nvidiamofeddoca

NVIDIA NIC Driver Container Entrypoint

Understand and customize the NVIDIA NIC driver container entrypoint for MOFED and DOCA driver lifecycle on Kubernetes and OpenShift.

⏱ 30 minutes pyxisenrootslurm

NVIDIA Pyxis and Enroot for SLURM

Use NVIDIA Pyxis and Enroot to run GPU containers in SLURM jobs. Bridge SLURM HPC scheduling with container-native AI workloads and NGC images.

⏱ 15 minutes nvidiakernel-modulesdma-buf

Open Kernel Modules and DMA-BUF for GPUs

Migrate from proprietary NVIDIA kernel modules and nvidia-peermem to open kernel modules with DMA-BUF for safer GPU upgrades.

⏱ 15 minutes openclawkedaautoscaling

OpenClaw Auto-Scaling with KEDA

Scale OpenClaw agents based on message queue depth using KEDA event-driven autoscaling for Discord, Telegram, and Slack.

⏱ 20 minutes openclawbackuprestore

OpenClaw Backup Restore Command Guide

OpenClaw backup and restore command guide. VolumeSnapshots, CronJobs to S3, disaster recovery procedures, and session state management on Kubernetes.

⏱ 20 minutes openclawcronheartbeat

OpenClaw Cron Jobs and Heartbeats

Configure OpenClaw's built-in cron scheduling and heartbeat system on Kubernetes for proactive notifications, periodic checks, and automated background.

⏱ 15 minutes openclawblue-greenzero-downtime

OpenClaw Blue-Green Deployment

Implement zero-downtime OpenClaw upgrades using blue-green deployments with traffic switching and rollback in Kubernetes.

⏱ 15 minutes openclawdockercontainer-image

Build a Custom OpenClaw Docker Image for K8s

Create an optimized Docker image for OpenClaw with pre-installed dependencies, custom skills, and workspace files for faster Kubernetes deployments.

⏱ 15 minutes openclawdiscordbot

Run an OpenClaw Discord Bot on Kubernetes

Deploy OpenClaw as a Discord bot on Kubernetes with channel routing, mention handling, group chat rules, and persistent conversation memory.

⏱ 25 minutes openclawhigh-availabilityhealth-checks

High Availability OpenClaw with Kubernetes

Run OpenClaw in a high-availability configuration on Kubernetes with health checks, automatic restarts, backup strategies, and monitoring.

⏱ 25 minutes openclawai-gatewaydeployment

Deploy OpenClaw AI Gateway on Kubernetes

Deploy the OpenClaw multi-channel AI gateway on Kubernetes with persistent storage, TLS ingress, and high availability for WhatsApp, Telegram, Discord.

⏱ 15 minutes openclawloggingelasticsearch

OpenClaw Logging with EFK Stack

Collect and analyze OpenClaw agent logs using Elasticsearch, Fluent Bit, and Kibana (EFK stack) for debugging and audit trails.

⏱ 20 minutes openclawprometheusgrafana

Monitor OpenClaw with Prometheus and Grafana

Set up monitoring for the OpenClaw AI gateway on Kubernetes with Prometheus metrics, Grafana dashboards, and alerting for uptime and message throughput.

⏱ 30 minutes openclawmulti-agentrouting

Multi-Agent Routing with OpenClaw

Configure multiple isolated AI agents in a single OpenClaw gateway on Kubernetes with per-agent workspaces, channel bindings, and session isolation.

⏱ 15 minutes openclawnetwork-policysecurity

Network Policies for OpenClaw on Kubernetes

Secure OpenClaw deployments with Kubernetes NetworkPolicies to restrict egress to messaging APIs, block unauthorized ingress, and isolate the gateway.

⏱ 15 minutes openclawpersistent-storagepvc

OpenClaw with Persistent Storage

Configure persistent storage for OpenClaw workspaces using PVCs, StorageClasses, and backup strategies in Kubernetes clusters.

⏱ 15 minutes openclawrbacmulti-tenancy

OpenClaw RBAC and Multi-Tenant Isolation

Configure OpenClaw RBAC policies and namespace isolation for multi-tenant Kubernetes clusters with per-team agent access controls.

⏱ 20 minutes openclawsecretssecurity

Secure Secrets Management for OpenClaw

Manage API keys, bot tokens, and credentials for OpenClaw on Kubernetes using Kubernetes Secrets, External Secrets Operator, and Sealed Secrets.

⏱ 20 minutes openclawsignalmessaging

Deploy an OpenClaw Signal Messenger Bot

Run OpenClaw as a Signal messenger AI assistant on Kubernetes with linked device pairing, end-to-end encryption, and persistent sessions.

⏱ 20 minutes openclawskillstools

Manage OpenClaw Skills on Kubernetes

Deploy and manage OpenClaw agent skills (tools, automations, integrations) on Kubernetes using ConfigMaps, PVCs, and git-sync for dynamic capability.

⏱ 15 minutes openclawtelegrambot

Deploy an OpenClaw Telegram Bot on Kubernetes

Run OpenClaw as a Telegram bot on Kubernetes with BotFather setup, webhook configuration, inline commands, and persistent conversation history.

⏱ 20 minutes openclawwhatsappai-assistant

Self-Host an OpenClaw WhatsApp AI Assistant

Deploy OpenClaw on Kubernetes to run a personal WhatsApp AI assistant with QR code pairing, persistent sessions, media support, and allow-list security.

⏱ 25 minutes openclawgitopsworkspace

GitOps for OpenClaw Workspaces on Kubernetes

Manage OpenClaw agent workspaces (SOUL.md, skills, memory) with GitOps using Flux or ArgoCD, enabling version-controlled AI persona management.

OpenShift ACS Security for Kubernetes

Deploy and configure Red Hat Advanced Cluster Security (ACS/RHACS) for vulnerability scanning, compliance, network policies, and runtime threat detection.

⏱ 15 minutes openshiftbuildconfigimagestream

OpenShift BuildConfig with ImageStream

Build container images on OpenShift using BuildConfig with ImageStream triggers, pushing to internal registry or local Quay.

⏱ 15 minutes openshiftbuildconfigquay

OpenShift BuildConfig with Local Quay Registry

Build container images on OpenShift and push to a local Quay registry using BuildConfig, ImageStream, and robot account credentials.

⏱ 20 minutes catalogsourceolmoperators

Create Custom CatalogSources for OLM Operators

Configure CatalogSource in OpenShift to serve custom operator catalogs from private registries or air-gapped environments.

⏱ 25 minutes catalogsourceolmoperators

Filter CatalogSource Operators by Package

Curate a minimal CatalogSource with only approved operators using opm index pruning and file-based catalog filtering for security and compliance.

⏱ 15 minutes catalogsourceolmtroubleshooting

Troubleshoot CatalogSource and OLM Issues

Debug CatalogSource failures including pod crashes, gRPC errors, stale caches, and operator install problems in OpenShift OLM environments.

⏱ 20 minutes openshiftquaypull-secret

OpenShift Cluster-Wide Pull Secret Robot Ac...

Replace admin credentials in the OpenShift cluster-wide pull secret with a Quay robot account for secure, auditable container image pulls across all namespaces.

⏱ 15 minutes openshiftcertificatestls

OpenShift Custom CA for Private Registries

Configure OpenShift to trust a custom Certificate Authority for private container registries using additionalTrustedCA and image.config.openshift.io settings.

⏱ 25 minutes kustomizegitopsargocd

Kustomize Deployments with OpenShift GitOps

Use Kustomize overlays with the OpenShift GitOps Operator (ArgoCD) to manage environment-specific configurations across dev, staging, and production clusters.

⏱ 30 minutes openshiftidmsmirror-registry

OpenShift IDMS & install-config.yaml Mirror...

Configure ImageDigestMirrorSet and install-config.yaml imageContentSources for OpenShift disconnected installations with mirror registries.

⏱ 25 minutes openshiftitmsimage-mirroring

OpenShift ITMS ImageTagMirrorSet

Configure ImageTagMirrorSet in OpenShift 4.13+ for tag-based image mirroring. Mirror container images by tag instead of digest for disconnected clusters.

⏱ 15 minutes openshiftlifecycleupgrades

OpenShift Lifecycle and Version Support

OpenShift support lifecycle guide covering version support phases, EUS releases, end-of-life dates, and upgrade planning for production clusters.

⏱ 20 minutes openshiftmachineconfigpoolmcp

OpenShift MachineConfigPool After ITMS

Monitor and manage MachineConfigPool rollouts after applying ImageTagMirrorSet in OpenShift. Handle node restarts, paused pools, and degraded states.

⏱ 15 minutes openshifttemplatesnamespaces

OpenShift Project Request Template Pull Sec...

Configure an OpenShift Project Request Template so every new namespace automatically gets a ServiceAccount with imagePullSecrets for your private Quay registry.

⏱ 15 minutes openshiftserverlessknative

OpenShift Serverless KnativeServing

Deploy and configure OpenShift Serverless Operator with KnativeServing for autoscaling, scale-to-zero, and traffic splitting on Kubernetes.

⏱ 15 minutes priorityclassgpuscheduling

PriorityClasses for GPU Workloads

Configure Kubernetes PriorityClasses for GPU workloads with training, serving, batch, and interactive tiers and preemption policies.

⏱ 20 minutes quaycontainer-registrysecurity

Quay Robot Accounts for Kubernetes Image Pulls

Create Quay robot accounts and configure Kubernetes imagePullSecrets for automated container image pulls from private registries.

⏱ 15 minutes resourcequotalimitrangegpu

ResourceQuota and LimitRange for GPUs

Configure ResourceQuota and LimitRange for GPU workloads with per-tenant caps on GPU, CPU, memory, and object counts in Kubernetes.

RHACS Compliance Scanning in OpenShift

Run CIS, NIST, PCI DSS, and HIPAA compliance scans with Red Hat Advanced Cluster Security and automate reporting for audits.

RHACS Custom Security Policies Guide

Create and manage custom security policies in Red Hat Advanced Cluster Security for image scanning, deployment config, and runtime enforcement.

RHACS Multi-Cluster Management

Manage security across multiple Kubernetes clusters with RHACS Central hub, secured cluster registration, and unified policy enforcement.

RHACS Network Segmentation Policies

Use Red Hat Advanced Cluster Security network graph to discover traffic flows, generate NetworkPolicies, and enforce micro-segmentation.

⏱ 15 minutes openshiftrhcoscoreos

RHCOS Node Management for OpenShift

Understand and manage Red Hat Enterprise Linux CoreOS (RHCOS) for OpenShift nodes including MachineConfig, ignition, OS updates, and node customization.

RHACS CI/CD Pipeline Integration

Integrate Red Hat Advanced Cluster Security into CI/CD pipelines with roxctl for image scanning, policy checks, and deployment validation.

⏱ 15 minutes quaysecuritysecrets

Rotate Quay Robot Tokens in Kubernetes

Automate Quay robot account token rotation across Kubernetes namespaces with zero-downtime credential updates and validation scripts.

⏱ 15 minutes runaigpuquotas

Run:AI GPU Quotas on OpenShift

Configure Run:AI scheduler quotas for fair GPU sharing with guaranteed quotas, over-quota borrowing, and per-tenant GPU allocation policies.

⏱ 45 minutes slurmhpcbatch-scheduling

SLURM and Kubernetes Integration

Integrate SLURM HPC workload manager with Kubernetes for hybrid AI and scientific computing. Bridge HPC batch scheduling with container orchestration.

⏱ 15 minutes sriovconnectx-7connectx-6

SR-IOV Mixed NICs for GPU Nodes

Configure SR-IOV with mixed ConnectX-7 and ConnectX-6 NICs for RDMA data plane and management traffic on GPU worker nodes.

⏱ 25 minutes sriovnetworkingnvidia

SR-IOV NicClusterPolicy for VF Configuration

Configure SR-IOV Virtual Functions on Mellanox ConnectX NICs using the NVIDIA Network Operator NicClusterPolicy for high-performance Kubernetes networking.

⏱ 30 minutes sriovrdmaai

SR-IOV VF Networking for AI Workloads

Deploy SR-IOV Virtual Functions with RDMA support for distributed AI training on Kubernetes, including multi-NIC pod configuration and NCCL tuning.

⏱ 20 minutes sriovtroubleshootingnetworking

SR-IOV VF Troubleshooting on Kubernetes

Diagnose and fix SR-IOV Virtual Function issues including VF creation failures, device plugin errors, RDMA problems, and network attachment failures.

⏱ 15 minutes time-slicingmiggpu-sharing

Time-Slicing vs MIG vs Full GPU Allocation

Compare GPU sharing strategies: time-slicing for notebooks, MIG for isolated inference, and full GPU for training workloads.

⏱ 30 minutes tritonautoscalinggpu-metrics

Triton Autoscaling with GPU Metrics

Autoscale Triton Inference Server on Kubernetes using GPU utilization, request queue depth, and inference latency metrics with KEDA and HPA.

⏱ 35 minutes tritonmulti-modeltensorrt-llm

Triton Multi-Model Serving on Kubernetes

Serve multiple LLMs simultaneously on Triton Inference Server using TensorRT-LLM and vLLM backends with model routing and GPU scheduling.

⏱ 45 minutes tritontensorrt-llmnvidia

Triton TensorRT-LLM on Kubernetes

Deploy NVIDIA Triton Inference Server with TensorRT-LLM backend on Kubernetes for optimized large language model serving with GPU acceleration.

⏱ 20 minutes tritontensorrt-llmvllm

TensorRT-LLM vs vLLM on Triton

Compare TensorRT-LLM and vLLM backends on Triton Inference Server. When to use each, performance benchmarks, and migration strategies.

⏱ 30 minutes tritonvllmnvidia

Triton with vLLM Backend on Kubernetes

Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.

⏱ 45 minutes certificatescatls

Update CA Certificates in Kubernetes

Rotate and update Certificate Authority (CA) certificates in Kubernetes clusters including kube-apiserver, etcd, kubelet, and custom CA bundles for TLS.

⏱ 30 minutes vector-databasemilvusweaviate

Deploying Vector Databases on Kubernetes

Deploy and operate vector databases (Milvus, Weaviate, Qdrant) on Kubernetes for RAG pipelines, semantic search, and AI applications with persistent.

⏱ 20 minutes nvidiagpu-operatorclusterpolicy

Configure ClusterPolicy kernelModuleType GP...

Understand and configure the driver.kernelModuleType field in the NVIDIA GPU Operator ClusterPolicy to choose between auto, open, and proprietary kernel modules.

⏱ 60 minutes nvidiagpurdma

Configure GPUDirect RDMA the NVIDIA GPU Ope...

Set up GPUDirect RDMA on Kubernetes using the NVIDIA GPU Operator with either DMA-BUF or legacy nvidia-peermem, including Network Operator integration.

⏱ 15 minutes nvidiagpukernel-modules

Diagnose NVIDIA Memory-Only Kernel Modules ...

Understand why lsmod shows NVIDIA modules loaded but modinfo fails, and how the GPU Operator's proprietary driver container inserts modules without.

⏱ 45 minutes nvidiagpugds

Enable GPUDirect Storage on OpenShift

Configure GPUDirect Storage (GDS) with the NVIDIA GPU Operator on OpenShift, including the Open Kernel Module requirement and nvidia-fs verification.

⏱ 30 minutes nvidiagpurdma

Fix NVIDIA Peer Memory Driver Not Detected

Diagnose and resolve the 'NVIDIA peer memory driver not detected' error when running GPU workloads with RDMA on Kubernetes and OpenShift.

⏱ 20 minutes nvidiagpu-operatorselinux

SELinux and SCC Config for GPU Operator

Understand SELinux device relabeling and Security Context Constraints (SCC) requirements for the NVIDIA GPU Operator driver pods on OpenShift.

⏱ 45 minutes nvidiagpurdma

Switch GPUDirect RDMA from nvidia-peermem t...

Migrate from the legacy nvidia-peermem kernel module to the recommended DMA-BUF GPUDirect RDMA path using the NVIDIA GPU Operator.

⏱ 60 minutes nvidiagpu-operatorkernel-modules

Switch to Open NVIDIA Kernel Modules on Ope...

Step-by-step guide to migrate the NVIDIA GPU Operator from proprietary to open kernel modules on OpenShift, enabling DMA-BUF and GPUDirect Storage support.

⏱ 30 minutes nvidiagpugds

Fix nvidia-fs Module Conflict on OpenShift

Diagnose and fix the 'insmod: ERROR: could not insert module nvidia-fs.ko: File exists' error when enabling GPUDirect Storage with the NVIDIA GPU Operator.

⏱ 30 minutes nvidiagpurdma

Validate GPUDirect RDMA Performance DMA-BUF

Run ib_write_bw with CUDA DMA-BUF to verify GPUDirect RDMA data transfer rates between GPU pods and validate network operator configuration.

⏱ 30 minutes ncclci-cdpreflight

Automate NCCL Preflight Checks in CI/CD Pipelines

Run NCCL smoke benchmarks automatically in CI/CD pipelines before promoting GPU cluster changes to production, catching regressions early.

⏱ 20 minutes ncclintra-nodeinter-node

Compare NCCL Intra-Node vs Inter-Node Perfo...

Build a repeatable comparison between local and cross-node NCCL throughput to validate GPU cluster interconnect scaling and identify bottlenecks early.

⏱ 30 minutes nccltimeouthang

Debug NCCL Timeouts and Hangs in Kubernetes

Systematically troubleshoot NCCL runs that stall or time out across multi-GPU and multi-node Kubernetes jobs with step-by-step diagnostic commands.

⏱ 30 minutes ncclprometheusgrafana

Monitor NCCL Benchmark Runs Prometheus & Gr...

Track NCCL benchmark outcomes and GPU telemetry over time with Prometheus and Grafana dashboards to detect communication regressions early.

⏱ 20 minutes ncclallgatherai

Run NCCL AllGather Benchmarks Model Paralle...

Use all-gather NCCL tests to evaluate GPU communication behavior and throughput for tensor-parallel and model-parallel distributed AI workloads on Kubernetes.

⏱ 20 minutes ncclallreducegpu

Benchmark NCCL AllReduce Performance

Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.

⏱ 25 minutes nccllatencyp2p

Diagnose GPU Peer-to-Peer Latency NCCL Tests

Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.

⏱ 25 minutes ncclnccl-testsgpu

Run NCCL Tests for GPU Network Validation

Benchmark GPU-to-GPU communication using NVIDIA nccl-tests on Kubernetes or OpenShift to validate bandwidth and latency.

⏱ 35 minutes ncclmpijobkubeflow

Run NCCL Tests with MPIJob on Kubernetes

Launch multi-pod NCCL benchmarks using MPIJob on Kubernetes for repeatable, automated distributed GPU communication testing across nodes.

⏱ 20 minutes ncclrdmaethernet

Tune NCCL Env Variables for RDMA & Ethernet

Apply safe NCCL environment variable profiles for RDMA-capable and Ethernet-only GPU clusters to maximize collective communication throughput.

⏱ 15 minutes nccltopologypci

Validate GPU & NIC Topology Before NCCL Ben...

Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.

⏱ 15 minutes bondingnetworkingsriov

Check Bonding and Interface Status for SR-IOV

Inspect bond membership, interface state, and link aggregation to confirm which NICs can be correctly targeted by SR-IOV network policies on Kubernetes.

⏱ 20 minutes sriovnetworknv-ipammultus

Configure SriovNetwork with NVIDIA nv-ipam

Create a SriovNetwork resource that auto-generates a Multus NetworkAttachmentDefinition using nv-ipam for high-performance SR-IOV secondary interfaces.

⏱ 15 minutes nv-ipamippoolsriov

Create an NVIDIA nv-ipam IPPool SR-IOV Netw...

Define a valid nv-ipam IPPool and node-aware sizing strategy so SR-IOV workloads can reliably obtain secondary interface IP addresses on Kubernetes.

⏱ 30 minutes nvidia-nimtensorrt-llmmistral

Deploy Mistral 7B with NVIDIA NIM

Step-by-step guide to deploy Mistral-7B using NVIDIA NIM with TensorRT-LLM backend on Kubernetes for optimized GPU inference.

⏱ 30 minutes vllmmistralllm

Deploy Mistral 7B with vLLM on Kubernetes

Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.

⏱ 20 minutes nvidianetwork-operatornic-feature-discovery

Enable NIC Feature Discovery in NVIDIA Netw...

Enable NIC Feature Discovery through NicClusterPolicy and verify the node labels required by SR-IOV and RDMA GPU networking workflows on Kubernetes.

⏱ 15 minutes mellanoxconnectxpci

Identify Mellanox Interface Models from Lin...

Map interface names to PCI addresses and Mellanox model generations to build accurate SR-IOV policies and GPU networking configurations on Kubernetes.

⏱ 30 minutes autoscalinghpakeda

Autoscale LLM Inference on Kubernetes

Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.

⏱ 20 minutes quantizationgptqawq

Quantize LLMs for Efficient GPU Inference

Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.

⏱ 15 minutes vllmnvidia-nimtriton

Kubernetes LLM Serving Frameworks Compared

Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes — features, performance, and when to use each.

⏱ 15 minutes quaypodmancontainer-registry

Push a Podman-Saved Image to Local Quay

Load a Podman image tar archive, tag it for your Local Quay registry, authenticate with robot accounts, and push it safely to your private repo.

⏱ 10 minutes quaypodmanretag

Retag and Push an Image in Local Quay

Pull an existing container image from Local Quay, retag it for a new repository path or version, and push the updated tag back to the registry.

⏱ 30 minutes multi-gputensor-parallelismpipeline-parallelism

Multi-GPU and Tensor Parallel LLM Inference

Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.

⏱ 25 minutes nvidiagpu-operatorgpu

Install NVIDIA GPU Operator on Kubernetes

Deploy the NVIDIA GPU Operator to automate GPU driver, container toolkit, and device plugin management across your Kubernetes cluster.

⏱ 30 minutes openshifttlscertificates

Deploy a New Certificate Each OpenShift Tenant

Replace and activate new TLS certificates tenant by tenant in OpenShift IngressController deployments with verification steps and rollback guidance.

⏱ 20 minutes openshiftmulti-tenantingress

OpenShift Multi-Tenant TLS per IngressContr...

Set up tenant-isolated TLS in OpenShift by assigning a dedicated certificate Secret to each IngressController for multi-tenant routing security.

⏱ 25 minutes openshiftsriovvf

Create SR-IOV VFs on OpenShift SriovNetwork...

Use the OpenShift SR-IOV Network Operator to create and manage Virtual Functions from selected Physical Functions on GPU worker nodes.

⏱ 25 minutes openshiftmulti-tenantsecrets

Rotate OpenShift Tenant Secrets Safely

Implement low-risk secret rotation in OpenShift multi-tenant environments using versioned Secrets and controlled rollouts.

⏱ 45 minutes ragretrieval-augmented-generationvector-database

Build a RAG Pipeline on Kubernetes

Deploy a Retrieval-Augmented Generation pipeline on Kubernetes using a vector database, embedding model, and LLM inference server.

⏱ 15 minutes s3storagepermissions

Configure S3 Storage Permissions for ML Models

Set up S3 bucket ACLs, IAM roles, and PVC permissions so Kubernetes inference pods can securely read large ML model weights from object storage.

⏱ 10 minutes llminferencecurl

Test LLM Inference Endpoints with curl

Validate Kubernetes-hosted LLM inference services using curl against OpenAI-compatible /v1/models, /v1/completions, and /v1/chat/completions endpoints.

⏱ 20 minutes nvidia-nimtensorrt-llmtroubleshooting

Fix NVIDIA NIM TensorRT-LLM Initialization ...

Diagnose and fix common NIM TensorRT-LLM executor failures including DecoderState mismatch, version incompatibilities, and engine build errors.

⏱ 30 minutes sriovtroubleshootingwebhook

Fix 'No Supported NIC Is Selected' in SR-IOV

Diagnose SR-IOV operator webhook rejections by validating node state, label selectors, PF eligibility, and SriovNetworkNodePolicy configuration.

⏱ 20 minutes nv-ipammultussriov

Fix nv-ipam 'Pool Not Found' Errors in Multus

Fix nv-ipam IPPool lookup failures in Multus by aligning SriovNetwork, NetworkAttachmentDefinition, and IPPool names and namespaces correctly.

⏱ 30 minutes sriovvalidationmultinode

Validate SR-IOV Operator Health Across Mult...

Run a full checklist to confirm SR-IOV discovery, VF creation, scheduler resources, and pod attachment on multiple nodes.

⏱ 15 minutes ovnunderlayopenshift

Verify Which Interface Carries OVN Underlay...

Confirm the actual OVN underlay network path by checking ovn-encap-ip, bridge port ownership, and physical route associations on Kubernetes nodes.

⏱ 15 minutes cronjobconcurrencyscheduling

How to Configure CronJob Concurrency Policy

Master Kubernetes CronJob concurrency policies to control parallel execution. Learn when to use Allow, Forbid, and Replace with real-world examples.

⏱ 35 minutes argocdgitopscontinuous-deployment

How to Implement GitOps with Argo CD

Deploy and manage Kubernetes applications declaratively with Argo CD GitOps. Learn application deployment, sync strategies, multi-cluster management.

⏱ 55 minutes crossplaneinfrastructure-as-codecloud-resources

Crossplane for Cloud Infrastructure Management

Use Crossplane to provision and manage cloud infrastructure resources like databases, storage, and networking using Kubernetes-native APIs and GitOps.

⏱ 50 minutes dracomputedomainsnvlink

Multi-Node NVLink with ComputeDomains

Configure ComputeDomains for robust and secure Multi-Node NVLink (MNNVL) workloads on NVIDIA GB200 and similar systems using DRA.

⏱ 40 minutes dragpunvidia

Dynamic Resource Allocation GPUs NVIDIA DRA...

Learn to use Kubernetes Dynamic Resource Allocation (DRA) for flexible GPU allocation, sharing, and configuration with the NVIDIA DRA Driver.

⏱ 40 minutes dragpumig

MIG GPU Partitioning with DRA on Kubernetes

Dynamically partition NVIDIA A100 and H100 GPUs using Multi-Instance GPU (MIG) technology with Dynamic Resource Allocation for flexible workload isolation.

⏱ 50 minutes dragputpu

Mixed Accelerator Workloads with DRA

Orchestrate heterogeneous accelerator workloads combining GPUs, TPUs, FPGAs, and custom AI chips using Dynamic Resource Allocation.

⏱ 45 minutes dratpugoogle-cloud

TPU Allocation Dynamic Resource Allocation

Configure Google Cloud TPUs in Kubernetes using DRA for flexible allocation, multi-slice workloads, and optimized machine learning training.

⏱ 30 minutes etcdbackuprestore

How to Backup and Restore etcd

Protect your Kubernetes cluster with etcd backup strategies. Learn to create snapshots, automate backups, and restore etcd data for disaster recovery.

⏱ 45 minutes gitopsfluxcontinuous-delivery

GitOps with Flux CD for Continuous Delivery

Implement GitOps workflows using Flux CD to automate Kubernetes deployments, manage infrastructure as code, and maintain desired cluster state from Git.

⏱ 45 minutes gvisorcontainer-runtimesandbox

gVisor Runtime Sandboxed Containers K8s

Deploy gVisor with Kubernetes RuntimeClass for sandboxed containers. Configure runsc runtime, pod isolation, and security hardening for untrusted code.

⏱ 40 minutes vaultsecretssecurity

How to Integrate HashiCorp Vault with K8s

Securely manage secrets with HashiCorp Vault in Kubernetes. Learn to inject secrets into pods using the Vault Agent Injector and CSI Provider.

⏱ 55 minutes istioservice-meshtraffic-management

Istio Traffic Management and Routing

Implement advanced traffic management with Istio service mesh including traffic splitting, fault injection, circuit breaking, and intelligent routing.

⏱ 35 minutes kai-schedulernvidiagpu

GPU Sharing and Bin Packing with KAI Scheduler

Maximize GPU utilization with KAI Scheduler GPU sharing, fractional GPUs, and bin packing strategies for Kubernetes AI workloads.

⏱ 30 minutes kai-schedulernvidiagpu

Installing NVIDIA KAI Scheduler AI Workloads

Deploy KAI Scheduler for optimized GPU resource allocation in Kubernetes AI/ML clusters with hierarchical queues and batch scheduling.

⏱ 35 minutes kai-schedulernvidiagpu

Hierarchical Queues & Resource Fairness KAI...

Configure hierarchical queues in KAI Scheduler for multi-tenant GPU clusters with quotas, limits, and Dominant Resource Fairness (DRF).

⏱ 40 minutes kai-schedulernvidiagpu

Batch Scheduling PodGroups in KAI Scheduler

Implement gang scheduling for distributed training jobs using KAI Scheduler PodGroups to ensure all-or-nothing pod scheduling.

⏱ 45 minutes kai-schedulernvidiagpu

Topology-Aware Scheduling with KAI Scheduler

Optimize GPU workload placement using KAI Scheduler's Topology-Aware Scheduling (TAS) for NVLink, NVSwitch, and disaggregated serving architectures.

⏱ 60 minutes api-aggregationapi-serverextension-apiserver

Kubernetes API Aggregation Layer

Extend the Kubernetes API with custom API servers using the aggregation layer to add new resource types and functionality without modifying core components.

⏱ 45 minutes upgradecluster-managementmaintenance

How to Upgrade Kubernetes Clusters Safely

Perform Kubernetes cluster upgrades with zero downtime. Learn upgrade strategies, pre-flight checks, rollback procedures, and best practices.

⏱ 30 minutes gateway-apinetworkingingress

Kubernetes Gateway API: HTTPRoute Guide

Deploy Kubernetes Gateway API for HTTP routing. GatewayClass, Gateway, HTTPRoute, TLSRoute, traffic splitting, and migration from Ingress resources.

⏱ 30 minutes networkingtroubleshootingdns

How to Troubleshoot Kubernetes Networking

Debug and resolve Kubernetes networking issues systematically. Learn to diagnose DNS problems, service connectivity, network policies, and CNI issues.

⏱ 45 minutes operatorscontrollerscrd

How to Create and Use Kubernetes Operators

Learn to build Kubernetes Operators for automating application management. Understand custom controllers, the Operator pattern, and frameworks like.

⏱ 45 minutes kyvernopolicy-as-codeadmission-control

Kyverno Policy Management and Enforcement

Implement Kubernetes-native policy management using Kyverno to validate, mutate, and generate resources with declarative policies written in YAML.

⏱ 35 minutes linkerdservice-meshmtls

Linkerd Service Mesh: mTLS and Observability

Deploy Linkerd service mesh on Kubernetes. Automatic mTLS, traffic management, observability dashboards, service profiles, and traffic splitting.

⏱ 25 minutes multi-containersidecarambassador

How to Use Multi-Container Pod Patterns

Master Kubernetes multi-container pod patterns including sidecar, ambassador, and adapter. Learn when and how to use each pattern for microservices.

⏱ 20 minutes node-problem-detectorobservabilitymonitoring

How to Set Up Node Problem Detector

Detect and report node-level issues automatically with Node Problem Detector. Learn to identify kernel problems, hardware failures, and container.

⏱ 50 minutes oidcauthenticationidentity-provider

OIDC Authentication for Kubernetes

Configure OpenID Connect (OIDC) authentication to integrate Kubernetes with identity providers like Keycloak, Okta, Azure AD, and Google for secure user.

⏱ 20 minutes prioritypreemptionscheduling

Pod Priority and Preemption Scheduling Guide

Control Kubernetes scheduling with Pod Priority and Preemption. Learn to prioritize critical workloads and ensure important pods get scheduled first.

⏱ 35 minutes readiness-gatespod-conditionsload-balancer

Pod Readiness Gates for Custom Conditions

Implement Pod Readiness Gates to add custom conditions that must be satisfied before a pod is considered ready for traffic, enabling integration with…

⏱ 20 minutes security-contextsecuritypod-security

Pod Security Context and Admission Standards

Configure Pod Security Context and Admission labels. Privileged, Baseline, Restricted standards, runAsUser, fsGroup, capabilities, and seccomp profiles.

⏱ 50 minutes schedulerscheduling-profilescustom-scheduler

Kubernetes Scheduler Configuration and Tuning

Customize the Kubernetes scheduler with scheduling profiles, plugins, and advanced placement strategies for optimal pod placement and resource utilization.

⏱ 25 minutes sealed-secretsgitopssecurity

How to Use Sealed Secrets for GitOps

Encrypt Kubernetes secrets for safe Git storage with Sealed Secrets. Learn to seal, manage, and rotate secrets in GitOps workflows securely.

⏱ 45 minutes velerobackupdisaster-recovery

K8s Backup and Disaster Recovery with Velero

Implement comprehensive backup and disaster recovery strategies for Kubernetes clusters using Velero to protect workloads, configurations, and…

⏱ 30 minutes workload-identityiamcloud-security

How to Use Workload Identity for Cloud Access

Securely access cloud services from Kubernetes pods without static credentials. Configure Workload Identity for AWS, Azure, and GCP with IRSA, Workload…

⏱ 15 minutes admission-webhookssecurityvalidation

How to Create Admission Webhooks

Build validating and mutating admission webhooks to enforce policies and modify resources. Implement custom admission controllers for Kubernetes.

⏱ 15 minutes a-b-testingtraffic-routingfeature-flags

How to Implement A/B Testing with Kubernetes

Route traffic between application versions for A/B testing. Use service mesh, ingress, and custom routing rules to validate features with real users.

⏱ 15 minutes alertmanagerprometheusalerts

How to Set Up Alertmanager for Prometheus

Configure Alertmanager to route and manage Prometheus alerts. Set up notification channels including Slack, PagerDuty, and email with routing rules.

⏱ 15 minutes api-serverauthenticationauthorization

How to Configure Kubernetes API Access Control

Set up secure API server access with authentication and authorization. Configure RBAC, API groups, and audit logging for cluster security.

⏱ 15 minutes apideprecationmigration

Manage K8s API Versions and Deprecations

Handle Kubernetes API version changes and deprecations. Migrate resources to stable APIs and ensure cluster upgrade compatibility.

⏱ 15 minutes argocdgitopscontinuous-deployment

How to Deploy with Argo CD GitOps

Implement GitOps continuous deployment with Argo CD. Sync Kubernetes manifests from Git repositories automatically with declarative application management.

⏱ 15 minutes canarydeploymentsrollout

How to Implement Canary Deployments

Learn to implement canary deployments in Kubernetes for gradual rollouts. Use native features and Ingress-based traffic splitting for safe releases.

⏱ 15 minutes cert-managertlscertificates

Manage K8s Certificates with cert-manager

Automate TLS certificate management with cert-manager. Configure issuers, request certificates from Let's Encrypt, and enable automatic renewal.

⏱ 15 minutes securityscanningvulnerabilities

How to Implement Container Security Scanning

Scan container images for vulnerabilities before deployment. Integrate Trivy and other tools into CI/CD pipelines and runtime admission control.

⏱ 15 minutes loggingobservabilitysidecar

How to Implement Container Logging Patterns

Configure logging for Kubernetes applications. Implement sidecar logging, log aggregation, and structured logging best practices.

⏱ 15 minutes corednsdnsnetworking

How to Configure Kubernetes Cluster DNS

Customize CoreDNS configuration for your cluster. Add custom DNS entries, configure forwarding, and optimize DNS resolution.

⏱ 15 minutes csistorageebs

How to Configure CSI Drivers for Storage

Install and configure Container Storage Interface (CSI) drivers for cloud and on-premises storage. Set up dynamic provisioning with AWS EBS, GCP PD, and…

⏱ 15 minutes dnscorednsnetworking

How to Customize DNS Configuration in K8s

Configure custom DNS settings in Kubernetes. Learn CoreDNS customization, stub domains, upstream servers, and pod DNS policies.

⏱ 15 minutes crdcustom-resourcesapi

Create Custom Resource Definitions (CRDs)

Extend Kubernetes API with Custom Resource Definitions. Define custom objects, configure validation schemas, and manage CRD lifecycle.

⏱ 15 minutes imagepulltroubleshootingregistry

How to Debug ImagePullBackOff Errors

Troubleshoot Kubernetes ImagePullBackOff and ErrImagePull errors. Learn to diagnose registry authentication, image tags, and network connectivity issues.

⏱ 15 minutes nodesdebuggingtroubleshooting

How to Debug Kubernetes Node Issues

Diagnose and troubleshoot node problems in Kubernetes clusters. Identify resource pressure, connectivity issues, and component failures.

⏱ 15 minutes oomkilledoommemory

Fix OOMKilled in Kubernetes Pods

Fix OOMKilled errors in Kubernetes pods (exit code 137). Debug memory leaks, set correct memory limits, and prevent OOM kills in containers.

⏱ 15 minutes networkingdebuggingtroubleshooting

How to Debug Pod Networking Issues

Diagnose and fix Kubernetes networking problems. Troubleshoot connectivity, DNS resolution, service discovery, and network policies with practical tools.

⏱ 15 minutes schedulingpendingtroubleshooting

Debug Pod Scheduling Failures in K8s

Fix pods stuck in Pending from scheduling failures. Diagnose resource constraints, node affinity, taints, tolerations, and topology spread conflicts.

⏱ 15 minutes blue-greencanarydeployment

Implement Blue-Green and Canary Deployments

Deploy applications with zero downtime using blue-green and canary strategies. Configure traffic splitting, rollbacks, and progressive delivery.

⏱ 15 minutes tracingjaegeropentelemetry

Implement Distributed Tracing with Jaeger

Deploy Jaeger for distributed tracing in Kubernetes. Learn to instrument applications, trace requests across services, and identify performance…

⏱ 15 minutes dnsnetworkingcoredns

How to Configure Kubernetes DNS Policies

Control pod DNS resolution with DNS policies and configs. Configure custom nameservers, search domains, and optimize DNS for your workloads.

⏱ 15 minutes downward-apimetadataenvironment

K8s Downward API: Pod Metadata Access

Use Kubernetes Downward API to expose pod metadata to containers. Access labels, annotations, resource limits, and node information as env vars or files.

⏱ 15 minutes storagepvpvc

How to Configure Dynamic Volume Provisioning

Set up dynamic volume provisioning in Kubernetes with StorageClasses. Learn to configure provisioners for AWS EBS, GCP PD, Azure Disk, and NFS.

⏱ 15 minutes debuggingephemeralkubectl

Ephemeral Containers: Debug Running Pods

Debug running pods with ephemeral containers using kubectl debug. Attach debug containers without restart for production troubleshooting on Kubernetes.

⏱ 15 minutes configmapenvironment-variablesconfiguration

Configure Environment Variables and ConfigMaps

Manage application configuration with environment variables and ConfigMaps. Learn injection methods, mounting as files, and dynamic configuration updates.

⏱ 15 minutes secretsexternal-secretsvault

How to Use External Secrets Operator

Sync secrets from external providers like AWS Secrets Manager, HashiCorp Vault, and Azure Key Vault into Kubernetes using External Secrets Operator.

⏱ 15 minutes fluxgitopscontinuous-deployment

How to Deploy with Flux GitOps

Implement GitOps continuous deployment with Flux CD. Automatically sync Kubernetes manifests and Helm releases from Git repositories.

⏱ 15 minutes graceful-shutdownzero-downtimeSIGTERM

How to Implement Graceful Shutdown

Ensure zero-downtime deployments with proper graceful shutdown. Handle SIGTERM signals, drain connections, and configure termination settings.

⏱ 15 minutes grafanamonitoringdashboards

Grafana Dashboard 6417: K8s Pod Monitoring

Set up Grafana dashboard 6417 for Kubernetes pod monitoring. Import, customize panels, PromQL queries, and cluster-wide resource visualization.

⏱ 15 minutes helmchartspackaging

How to Create Helm Charts from Scratch

Build custom Helm charts for your applications. Learn chart structure, templates, values, dependencies, and best practices for packaging Kubernetes…

⏱ 15 minutes helmrepositorycharts

How to Create Helm Chart Repositories

Set up and manage Helm chart repositories. Learn to host charts on GitHub Pages, S3, GCS, and OCI registries for team distribution.

⏱ 15 minutes helmdependenciessubcharts

How to Manage Helm Chart Dependencies

Learn to manage Helm chart dependencies effectively. Configure subcharts, override values, and build complex applications with reusable components.

⏱ 15 minutes helmhookslifecycle

How to Use Helm Hooks for Lifecycle Management

Master Helm hooks for pre-install, post-install, pre-upgrade, and post-delete operations. Learn to run database migrations, backups, and cleanup tasks.

⏱ 15 minutes helmtemplatingsprig

🎯 Helm advanced

Helm Sprig Functions: cat, print, toString

Master Helm Sprig functions: cat, print, toString, add1, join, and quote. String manipulation, conditionals, and advanced templating patterns.

⏱ 15 minutes hpaautoscalingcustom-metrics

HPA Custom Metrics: Scale on Queue Depth

Configure Kubernetes HPA with custom and external metrics. Scale pods on queue depth, request latency, and Prometheus metrics via autoscaling/v2.

⏱ 15 minutes image-pull-secretsregistriesdocker

How to Configure Image Pull Secrets

Pull container images from private registries using image pull secrets. Configure authentication for Docker Hub, GCR, ECR, ACR, and private registries.

⏱ 15 minutes ingressroutingtraffic

How to Implement Request Routing with Ingress

Configure advanced routing rules with Kubernetes Ingress. Implement path-based routing, host-based routing, and traffic management.

⏱ 15 minutes tlssslcertificates

Secure Ingress with SSL/TLS Certificates

Configure TLS termination for Kubernetes Ingress using cert-manager and Let's Encrypt. Automate certificate issuance and renewal.

⏱ 15 minutes istioservice-meshtraffic

How to Implement Service Mesh with Istio

Deploy Istio service mesh for traffic management, security, and observability. Learn to configure virtual services, destination rules, and mTLS.

⏱ 15 minutes jaegertracingobservability

Jaeger Distributed Tracing on Kubernetes

Deploy Jaeger for distributed tracing in Kubernetes. Trace requests across microservices to identify latency issues and debug complex systems.

⏱ 15 minutes kindlocal-developmentdocker

How to Run Kubernetes in Docker (kind)

Create local Kubernetes clusters using kind (Kubernetes in Docker). Set up multi-node clusters, configure networking, and test applications locally.

⏱ 15 minutes kubeconfigcontextsclusters

How to Manage Kubernetes Contexts and Clusters

Switch between multiple clusters efficiently. Configure kubeconfig, manage contexts, and set up secure multi-cluster access.

⏱ 15 minutes kubectldebuggingtroubleshooting

Essential kubectl Commands for Debugging

Master kubectl debugging commands to troubleshoot Kubernetes issues. Learn to inspect pods, view logs, debug networking, and diagnose cluster problems.

⏱ 15 minutes kubectlkrewplugins

How to Extend kubectl with Plugins

Enhance kubectl with custom plugins using Krew package manager. Discover, install, and create plugins to boost K8s productivity.

⏱ 15 minutes auditloggingsecurity

How to Configure Kubernetes Audit Logging

Enable and configure Kubernetes API audit logging. Track who did what, when, and to which resources for security compliance and troubleshooting.

⏱ 15 minutes costoptimizationresources

How to Optimize Kubernetes Costs

Reduce cloud costs in Kubernetes clusters. Right-size resources, use spot instances, implement autoscaling, and monitor spending effectively.

⏱ 15 minutes dnscorednsnetworking

How to Configure DNS in Kubernetes

Understand and configure Kubernetes DNS with CoreDNS. Customize DNS policies, configure external DNS resolution, and troubleshoot DNS issues.

⏱ 15 minutes endpointslicesservicesnetworking

How to Use Kubernetes EndpointSlices

Understand and manage EndpointSlices for scalable service discovery. Configure endpoint slicing, troubleshoot connectivity, and optimize large clusters.

⏱ 15 minutes eventsmonitoringtroubleshooting

How to Use Kubernetes Events for Monitoring

Monitor cluster activity through Kubernetes events. Capture, filter, and alert on events for troubleshooting and operational visibility.

⏱ 15 minutes finalizerscleanupdeletion

How to Use Kubernetes Finalizers

Manage resource cleanup with Kubernetes finalizers. Implement custom cleanup logic and understand how finalizers prevent premature resource deletion.

⏱ 15 minutes labelsannotationsorganization

How to Use Labels and Annotations Effectively

Organize and manage Kubernetes resources with labels and annotations. Implement labeling strategies for selection, filtering, and metadata.

⏱ 15 minutes leasesleader-electioncoordination

How to Use K8s Leases for Leader Election

Implement distributed coordination with Kubernetes Leases. Configure leader election, distributed locks, and high availability patterns.

⏱ 15 minutes probeshealth-checksliveness

K8s Probes: Liveness, Readiness, Startup

Configure Kubernetes probes for reliable apps. Complete guide to liveness, readiness, and startup probes with httpGet, tcpSocket, exec, and gRPC examples.

⏱ 15 minutes runtimeclassgvisorkata

K8s RuntimeClass: gVisor and Kata Containers

Configure different container runtimes for workloads. Use gVisor, Kata Containers, or other runtimes for enhanced security and isolation.

⏱ 15 minutes kustomizeconfigurationoverlays

Use Kustomize for Configuration Management

Manage Kubernetes configurations with Kustomize overlays. Customize base manifests for different environments without template duplication.

⏱ 15 minutes local-storagepersistent-volumesssd

How to Configure Local Persistent Volumes

Use local persistent volumes for high-performance storage with node-local SSDs. Configure local storage classes and handle node affinity constraints.

⏱ 15 minutes loggingelasticsearchfluentd

Set Up Centralized Logging with EFK Stack

Deploy Elasticsearch, Fluentd, and Kibana for centralized Kubernetes logging. Learn to collect, parse, and visualize container logs at scale.

⏱ 15 minutes networkpolicysecuritynetworking

How to Implement Advanced NetworkPolicies

Master advanced Kubernetes NetworkPolicies for fine-grained traffic control. Learn egress rules, CIDR blocks, namespace isolation, and common security…

⏱ 15 minutes network-policiessecuritynetworking

How to Implement Network Policies

Secure pod-to-pod communication with Kubernetes Network Policies. Learn to create ingress and egress rules, isolate namespaces, and implement zero-trust…

How to Implement K8s Taints and Tolerations

Control pod scheduling with taints and tolerations. Dedicate nodes for specific workloads, handle node conditions, and implement scheduling constraints.

⏱ 15 minutes opentelemetryotelmetrics

Collect Metrics with OpenTelemetry Collector

Deploy OpenTelemetry Collector for unified metrics, traces, and logs collection in Kubernetes. Learn pipelines, processors, and exporters configuration.

⏱ 15 minutes affinityschedulingplacement

Configure Pod Affinity and Anti-Affinity

Control pod placement using affinity and anti-affinity rules. Co-locate related pods or spread them across nodes and zones for high availability.

⏱ 15 minutes pdbavailabilitydisruption

How to Configure Pod Disruption Budgets

Protect application availability during voluntary disruptions. Configure PDBs to ensure minimum replicas during node drains, upgrades, and maintenance.

⏱ 15 minutes pdbdisruptionavailability

How to Implement Pod Disruption Budgets

Configure Pod Disruption Budgets (PDB) for high availability during voluntary disruptions. Ensure minimum availability during node maintenance and…

⏱ 15 minutes lifecyclehookspreStop

How to Configure Pod Lifecycle Hooks

Execute custom actions during pod startup and shutdown with lifecycle hooks. Implement graceful shutdown, initialization tasks, and cleanup operations.

⏱ 15 minutes admission-controllermutationinjection

How to Use Pod Presets and Mutations

Automatically inject configurations into pods using admission controllers. Configure environment variables, volumes, and annotations at deployment time.

⏱ 15 minutes prioritypreemptionscheduling

How to Configure Pod Priority and Preemption

Set pod priorities to ensure critical workloads get scheduled first. Configure preemption to evict lower-priority pods when resources are scarce.

⏱ 15 minutes resourcescpumemory

How to Configure Pod Resource Management

Set CPU and memory requests and limits effectively. Understand QoS classes, resource quotas, and optimize container resource allocation.

⏱ 15 minutes pod-securitypsasecurity

How to Configure Pod Security Admission

Enforce security standards with Pod Security Admission. Configure privileged, baseline, and restricted policies at namespace level for cluster-wide…

⏱ 15 minutes topologyschedulinghigh-availability

How to Use Pod Topology Spread Constraints

Distribute pods evenly across failure domains using topology spread constraints. Ensure high availability across zones, nodes, and custom topologies.

⏱ 15 minutes prometheusmonitoringmetrics

How to Monitor Kubernetes with Prometheus

Set up Prometheus monitoring for Kubernetes clusters. Configure scraping, alerting rules, and visualize metrics with Grafana dashboards.

⏱ 15 minutes rate-limitingingressapi-gateway

Kubernetes Rate Limiting with NGINX and Istio

Implement Kubernetes rate limiting for API protection. Ingress NGINX annotations, Istio rate limits, Kong plugins, and per-service rate limiting patterns.

⏱ 15 minutes resourceslimitsrequests

K8s Resource Limits: CPU 500m Memory 256Mi

Configure Kubernetes container resource limits and requests. CPU 200m/500m, memory 256Mi syntax and format explained with QoS classes and right-sizing.

⏱ 15 minutes resourcequotalimitsnamespaces

How to Configure Resource Quotas per Namespace

Implement resource quotas to limit CPU, memory, and object counts per namespace. Ensure fair resource allocation across teams and environments.

⏱ 15 minutes resource-quotalimitsmulti-tenancy

How to Configure Resource Quotas

Limit resource consumption per namespace with ResourceQuotas. Control CPU, memory, storage, and object counts to ensure fair cluster sharing.

⏱ 15 minutes encryptionkmssecrets

How to Encrypt Secrets at Rest with KMS

Configure Kubernetes secrets encryption at rest using external KMS providers. Learn to set up AWS KMS, GCP KMS, and Azure Key Vault encryption.

⏱ 15 minutes secretssecurityencryption

How to Manage Kubernetes Secrets Securely

Best practices for managing secrets in Kubernetes. Learn encryption at rest, secret rotation, and integration with external secret stores.

⏱ 15 minutes rbacservice-accountssecurity

How to Configure Service Accounts and RBAC

Secure your Kubernetes workloads with service accounts and role-based access control. Create roles, bindings, and implement least-privilege access.

⏱ 15 minutes sidecarpatternscontainers

How to Use Sidecar Containers Effectively

Implement sidecar containers for logging, monitoring, proxying, and configuration management. Learn common sidecar patterns for microservices.

⏱ 15 minutes statefulsetdatabasespersistence

How to Deploy Stateful Applications

Run stateful workloads on Kubernetes with StatefulSets. Manage stable identities, persistent storage, and ordered deployment for databases and caches.

⏱ 15 minutes statefulsetstatefulstorage

How to Manage Kubernetes StatefulSets

Deploy stateful applications with StatefulSets. Configure stable network identities, persistent storage, ordered deployment, and graceful scaling.

⏱ 15 minutes finalizersdeletioncleanup

Fix K8s Stuck Resources and Finalizers

Fix Kubernetes resources stuck in Terminating state by managing finalizers. Remove stuck namespaces, PVs, and CRDs with force-delete procedures.

How to Use Taints and Tolerations

Control pod scheduling with taints and tolerations. Dedicate nodes for specific workloads, handle node conditions, and implement advanced scheduling.

⏱ 15 minutes topologyschedulingavailability

Topology Spread Constraints for HA Workloads

Distribute pods across nodes, zones, and regions using topology spread constraints. Ensure high availability and fault tolerance for your workloads.

⏱ 15 minutes snapshotsbackupstorage

How to Set Up Volume Snapshots

Create and restore volume snapshots for persistent data backup. Learn to configure VolumeSnapshotClass and automate snapshot schedules.

⏱ 30 minutes alertmanagermonitoringalerts

How to Configure Alertmanager for K8s Alerts

Set up Alertmanager to route, group, and deliver Kubernetes alerts. Learn to configure Slack, PagerDuty, and email notifications.

⏱ 25 minutes deploymentblue-greenzero-downtime

How to Implement Blue-Green Deployments

Learn how to implement blue-green deployments in Kubernetes for instant rollbacks and zero-downtime releases. Complete guide with Service switching.

⏱ 30 minutes autoscalingcluster-autoscalernodes

Kubernetes Cluster Autoscaler Setup

Configure Kubernetes Cluster Autoscaler for automatic node scaling. AWS, GCP, and Azure setup, scaling policies, and pod priority integration.

⏱ 20 minutes configmapsecretsconfiguration

Manage ConfigMaps and Secrets Effectively

Master Kubernetes ConfigMaps and Secrets for application configuration. Learn creation methods, mounting strategies, and security best practices.

⏱ 15 minutes troubleshootingcrashloopbackoffdebugging

CrashLoopBackOff: How to Fix in Kubernetes

Fix CrashLoopBackOff in Kubernetes pods. Learn why pods crash loop, systematic debugging with kubectl logs and describe, and solutions for common causes.

⏱ 20 minutes dnscorednstroubleshooting

How to Debug DNS Issues in Kubernetes

Troubleshoot and resolve DNS problems in Kubernetes. Learn to diagnose CoreDNS issues, test resolution, and fix common DNS failures.

⏱ 30 minutes helmchartspackage-manager

How to Create and Use Helm Charts

Master Helm, the Kubernetes package manager. Learn to create charts, manage releases, and template your deployments for reusability.

⏱ 15 minutes init-containersdependenciesstartup

How to Use Init Containers for Dependencies

Master Kubernetes init containers to handle dependencies, setup tasks, and pre-flight checks before your main application starts.

⏱ 20 minutes jobscronjobsbatch

How to Deploy Jobs and CronJobs

Master Kubernetes Jobs and CronJobs for batch processing and scheduled tasks. Learn completion modes, parallelism, and failure handling.

⏱ 20 minutes namespacesmulti-tenancyorganization

How to Manage K8s Namespaces Effectively

Master Kubernetes namespace organization for multi-team environments. Learn resource quotas, network policies, and RBAC per namespace.

⏱ 25 minutes securitypod-securitypss

How to Implement Pod Security Standards

Secure your Kubernetes workloads using Pod Security Standards (PSS). Learn to enforce Privileged, Baseline, and Restricted policies at the namespace level.

⏱ 35 minutes prometheusmonitoringmetrics

Set Up Prometheus Monitoring for Applications

Learn to instrument your Kubernetes applications with Prometheus metrics. Complete guide to ServiceMonitors, scraping configuration, and custom metrics.

⏱ 30 minutes rbacsecurityservice-account

How to Configure RBAC and Service Accounts

Master Kubernetes RBAC (Role-Based Access Control) to secure your cluster. Learn to create Roles, ClusterRoles, and bind them to ServiceAccounts.

⏱ 20 minutes resourcescpumemory

Set Resource Requests and Limits Properly

Master Kubernetes resource management with proper CPU and memory requests and limits. Avoid OOMKills, throttling, and resource contention.

⏱ 15 minutes deploymentrolling-updatezero-downtime

Perform Rolling Updates with Zero Downtime

Master Kubernetes rolling updates to deploy new application versions without service interruption. Learn update strategies, rollback procedures, and…

⏱ 15 minutes serviceloadbalancernodeport

Expose Services with LoadBalancer and NodePort

Learn different ways to expose Kubernetes services externally using LoadBalancer, NodePort, and ExternalIPs. Compare options for various environments.

⏱ 30 minutes statefulsetmysqldatabase

How to Deploy MySQL with StatefulSet

Deploy a production-ready MySQL database on Kubernetes using StatefulSet. Learn persistent storage, headless services, and backup strategies.

⏱ 25 minutes autoscalingvparesources

Kubernetes VPA: Vertical Pod Autoscaler

Install and configure Kubernetes Vertical Pod Autoscaler. VPA updateMode Off, Initial, and Auto with recommendations and HPA coexistence strategies.

⏱ 20 minutes hpaautoscalingmetrics

Kubernetes HPA: Set Max Replicas and Scale

Configure Kubernetes HPA with autoscaling/v2, averageUtilization targets, and max replica settings. CPU, memory, and custom metrics scaling policies.

⏱ 15 minutes probeshealth-checksliveness

K8s Readiness Probe: Complete YAML Guide

Kubernetes readiness probe explained with YAML examples. Configure HTTP, TCP, exec, and gRPC readiness probes with liveness and startup probe comparison.

⏱ 10 minutes networkpolicysecurityzero-trust

K8s NetworkPolicy: Default Deny All Traffic

Implement zero-trust network security in Kubernetes with default deny-all NetworkPolicy. Block all ingress and egress traffic with allow-list rules.

⏱ 20 minutes ingressnginxtls

Configure NGINX Ingress TLS using cert-manager

Learn how to set up NGINX Ingress Controller with automatic TLS certificates from Let's Encrypt using cert-manager. Complete YAML examples and…

⏱ 15 minutes storagepvcpersistentvolume

PersistentVolumeClaims with StorageClasses

Learn how to provision persistent storage for your Kubernetes workloads using PersistentVolumeClaims and StorageClasses. Includes examples for dynamic…