
🤖 AI & GPU

AI/ML on Kubernetes: GPU scheduling, NVIDIA Triton, vLLM, model deployment, distributed training, Kubeflow, and inference optimization.

234 recipes 🟢 7 beginner 🟡 73 intermediate 🔴 154 advanced
advanced ⏱ 15 minutes

H200 NVL 8-GPU Topology Bandwidth Tiers for Kubernetes

Map the three bandwidth tiers of 8× H200 NVL GPU nodes—NVLink (~337 GB/s), PCIe+UPI (~50 GB/s), RoCE (~35 GB/s)—for NCCL topology-aware NUMA scheduling.

gpu, nccl, performance, networking
advanced ⏱ 15 minutes

Disable GDS and Enable IOMMU Passthrough on K8s GPUs

Disable GPUDirect Storage (GDS) when not needed and configure IOMMU passthrough mode for GPU and NIC device assignment. Kernel parameters, BIOS settings, VFIO

iommu, passthrough, gds, gpu
advanced ⏱ 15 minutes

GPU Operator ClusterPolicy RDMA and GDS Configuration

Configure NVIDIA GPU Operator ClusterPolicy to disable RDMA and enable GPUDirect Storage (GDS). Control nvidia-peermem, nvidia-fs modules, driver

gpu-operator, rdma, gds, cluster-policy
advanced ⏱ 15 minutes

GPUDirect RDMA Setup and Verification on Kubernetes

Enable and verify GPUDirect RDMA for GPU-to-NIC direct data transfer on Kubernetes. Install nvidia-peermem, configure DMA-BUF, verify RDMA paths, troubleshoot

gpudirect, rdma, nvidia, nccl
advanced ⏱ 15 minutes

IOMMU Kernel Parameters for Kubernetes GPU Nodes

Configure IOMMU kernel parameters for optimal GPU and RDMA performance on Kubernetes. Compare intel_iommu, amd_iommu, and iommu settings, passthrough vs off vs

iommu, kernel, gpu, performance
intermediate ⏱ 15 minutes

Kubeflow MPIJob Worker SSH Setup for GPU Training

Configure SSH daemon in Kubeflow MPIJob worker pods for multi-node GPU training. Covers SSHD setup in containers, host key generation, authorized keys from MPI

mpi, ssh, openshift, gpu
advanced ⏱ 15 minutes

Kubernetes Topology Manager for GPU and NUMA Alignment

Configure Kubernetes Topology Manager to align CPU, GPU, and NIC allocations on the same NUMA node. Covers policies, kubelet config, and GPU performance tuning.

topology-manager, numa, gpu, performance
intermediate ⏱ 15 minutes

MPI DNS Resolution and Hostfile for Kubernetes GPU Jobs

Troubleshoot MPI hostfile DNS resolution in Kubeflow MPIJob on Kubernetes. Covers headless Service creation, subdomain configuration, DNS wait loops, FQDN

mpi, dns, networking, troubleshooting
advanced ⏱ 15 minutes

NCCL All-Reduce Benchmarking on Multi-Node GPUs

Run and interpret NCCL all_reduce_perf benchmarks on multi-node Kubernetes GPU clusters. Understand bus bandwidth results, expected throughput for H200 NVL

nccl, benchmarking, all-reduce, gpu
advanced ⏱ 15 minutes

NCCL Channel Routing and Transport Path Analysis

Interpret NCCL channel logs to understand GPU communication paths on Kubernetes. Decode P2P/CUMEM, SHM/direct, NET/IB/GDRDMA transport

nccl, debugging, gpu-communication, rdma
advanced ⏱ 15 minutes

NCCL DMABUF Enable for GPUDirect RDMA on Kubernetes

Enable NCCL DMA-BUF support for GPUDirect RDMA in Kubernetes GPU clusters. Covers NCCL_DMABUF_ENABLE=1, kernel requirements, nvidia-peermem vs dmabuf, GPU

nccl, rdma, gpu, performance
advanced ⏱ 15 minutes

NCCL GPUDirect RDMA Distance Levels and PIX vs SYS

Understand NCCL GPU Direct RDMA distance-based enablement. When PIX mode disables GDRDMA for distant GPU-HCA pairs (distance 9 > 4) and when SYS mode enables

nccl, gpudirect, rdma, topology
advanced ⏱ 15 minutes

NCCL GPUDirect RDMA Level Tuning PIX PXB PHB SYS

Tune NCCL_NET_GDR_LEVEL for optimal GPUDirect RDMA performance on Kubernetes. Compare PIX, PXB, PHB, and SYS distance thresholds with PCIe topology. Benchmark

nccl, rdma, gpu, performance
advanced ⏱ 15 minutes

NCCL IB HCA Selection and QPS Tuning for RoCE

Configure NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_IB_QPS_PER_CONNECTION, and NCCL_IB_SPLIT_DATA_ON_QPS for optimal RoCE performance on Kubernetes GPU clusters.

nccl, rdma, performance, networking
advanced ⏱ 15 minutes

NCCL Network Validation Script for OpenShift GPU Clusters

Build a comprehensive NCCL network validation script for OpenShift GPU clusters with SR-IOV. Configure NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL=SYS, per-rank HCA

nccl, openshift, sr-iov, rdma
advanced ⏱ 15 minutes

Production NCCL Network Validator for Kubeflow MPIJob

Deploy a production-ready NCCL network validation framework using Kubeflow MPIJob on OpenShift. Complete validate_network.sh script

nccl, mpi, rdma, openshift
advanced ⏱ 15 minutes

NCCL RoCE Validation MPIJob Complete Reference

Complete nccl-roce-validation.yaml MPIJob reference for OpenShift GPU clusters. Full launcher environment variables, OpenMPI control plane settings, NCCL

nccl, mpi, roce, openshift
advanced ⏱ 15 minutes

NCCL RoCE Validation with Kubeflow MPIJob on Kubernetes

Run NCCL all_reduce_perf validation tests using Kubeflow MPIJob on GPU clusters. Configure MPI launcher and workers, NCCL environment variables, test

nccl, mpi, rdma, roce
intermediate ⏱ 15 minutes

Shared Memory Transport for NCCL Intra-Node GPU

Configure NCCL shared memory (SHM) transport for intra-node GPU communication on Kubernetes. Covers /dev/shm sizing with emptyDir and NVLink/PCIe P2P paths.

nccl, gpu, performance, configuration
advanced ⏱ 15 minutes

NVIDIA GPU Topology Matrix Interpretation on Kubernetes

Read and interpret nvidia-smi topo and nvidia-device-plugin topology matrices on Kubernetes GPU nodes. Understand X, NV, SYS, NODE, PIX, PXB, PHB connection

nvidia, gpu-topology, nvidia-smi, numa
advanced ⏱ 15 minutes

NVLink Bridge Architecture for GPU Kubernetes Nodes

Understand NVLink Bridge logical architecture in GPU servers for Kubernetes. Dual-socket PCIe Gen5 topology, NVL4 groups, GPU-NIC-NVMe placement, PCIe switch

nvlink, gpu-architecture, pcie, nvidia
advanced ⏱ 15 minutes

OpenMPI Control Plane Separation for NCCL RDMA

Configure OpenMPI to use eth0 for MPI control traffic while NCCL uses net1 SR-IOV for data. Covers btl_tcp_if_include, pml, routed direct, plm_rsh_agent SSH

mpi, nccl, networking, rdma
advanced ⏱ 15 minutes

Run:ai GPU Scheduling with Kubeflow MPIJob

Integrate Run:ai GPU scheduler with Kubeflow MPIJob for multi-node NCCL workloads. Covers Run:ai project namespaces, GPU quota annotations, pod group

gpu, scheduling, openshift, ai
intermediate ⏱ 15 minutes

GenAI-Perf Benchmarking LLM Inference on Kubernetes

Benchmark LLM inference performance with NVIDIA GenAI-Perf on Kubernetes. Profile vLLM, TensorRT-LLM, and Triton endpoints with concurrency sweeps, token

genai-perf, benchmarking, vllm, tensorrt-llm
advanced ⏱ 15 minutes

NCCL Environment Variables Complete Reference

Complete reference for NCCL environment variables on Kubernetes. Configure network transport, InfiniBand, GPUDirect RDMA, socket

nccl, gpu, rdma, infiniband
advanced ⏱ 15 minutes

Kubernetes Volcano Batch Scheduler Gang Scheduling

Deploy Volcano batch scheduler for gang scheduling on Kubernetes. Configure minAvailable for all-or-nothing pod group scheduling, queue management, and GPU job

volcano, gang-scheduling, batch, gpu
advanced ⏱ 15 minutes

NCCL and RCCL Networking Performance on Kubernetes

Optimize NCCL (NVIDIA) and RCCL (AMD) collective communication performance on Kubernetes GPU clusters. Network transport selection, bandwidth tuning, latency

nccl, rccl, gpu, networking
intermediate ⏱ 15 minutes

Weights and Biases Experiment Tracking on Kubernetes

Deploy Weights & Biases (W&B) on Kubernetes for ML experiment tracking, model registry, and hyperparameter sweeps. Self-hosted W&B Server, agent-based

wandb, mlops, experiment-tracking, model-registry
advanced ⏱ 15 minutes

Integrate DisaggregatedSet with llm-d on Kubernetes

Deploy disaggregated LLM inference using DisaggregatedSet and llm-d on Kubernetes. Install LWS then DS controller, model prefill/decode roles, wire llm-d

leaderworkerset, disaggregated-inference, llm-d, vllm
advanced ⏱ 15 minutes

DisaggregatedSet for Multi-Role LLM Inference

Deploy disaggregated LLM inference on Kubernetes with DisaggregatedSet and LeaderWorkerSet. Separate prefill and decode phases across GPU pools

leaderworkerset, disaggregated-inference, llm, vllm
advanced ⏱ 15 minutes

NCCL Topology Dump and Tuning on Kubernetes

Use NCCL_TOPO_DUMP_FILE to export and inject GPU topology on Kubernetes for reproducible distributed training performance. Topology XML caching, environment

nccl, gpu, nvidia, distributed-training
intermediate ⏱ 15 minutes

Hermes Agent Self-Hosted AI on Kubernetes

Deploy Hermes Agent (Nous Research) on Kubernetes as a persistent self-hosted AI agent with memory, automated skill creation, multi-platform

hermes, ai-agent, nous-research, self-hosted
advanced ⏱ 15 minutes

NVIDIA Dynamo Production Tuning on Kubernetes

Tune NVIDIA Dynamo for production LLM inference: prefill/decode pool sizing, KV cache transfer optimization, NCCL backend selection, SLA-driven autoscaling

nvidia-dynamo, inference-optimization, production, autoscaling
intermediate ⏱ 15 minutes

NVIDIA OpenShell Sandboxed AI Agent Runtime on Kubernetes

Deploy NVIDIA OpenShell on Kubernetes for safe, private autonomous AI agent execution. Declarative YAML network policies, sandboxed containers

nvidia, openshell, agents, sandbox
advanced ⏱ 15 minutes

Poolside AI Foundation Models on Kubernetes

Deploy Poolside AI foundation models for enterprise software agents on Kubernetes. On-prem and VPC deployment, multi-agent orchestration, sandboxed

poolside, foundation-models, agents, enterprise-ai
intermediate ⏱ 15 minutes

Red Hat AI Studio on OpenShift

Deploy Red Hat AI Studio on OpenShift for end-to-end LLM development. Model catalog, InstructLab fine-tuning, experiment tracking, model

red-hat, openshift, ai-studio, instructlab
intermediate ⏱ 15 minutes

Tabnine AI Code Assistant Self-Hosted on Kubernetes

Deploy Tabnine Enterprise self-hosted on Kubernetes for private AI code completion and chat. On-prem model serving, multi-model support (Tabnine

tabnine, code-assistant, enterprise-ai, self-hosted
advanced ⏱ 15 minutes

GPUDirect Storage on Kubernetes

Configure NVIDIA GPUDirect Storage (GDS) for direct data path between NVMe/NFS storage and GPU memory bypassing CPU. Covers Magnum IO, cuFile API, GDS driver

gpudirect, storage, nvidia, nvme
advanced ⏱ 15 minutes

NVIDIA PeerMem for GPU-Direct RDMA

Install and configure nvidia_peermem kernel module to enable GPU-Direct RDMA between NVIDIA GPUs and Mellanox RDMA NICs. Covers module

nvidia-peermem, gpu-direct, rdma, kernel-module
advanced ⏱ 15 minutes

Disable PCIe ACS for GPU-Direct P2P

Disable PCIe Access Control Services (ACS) to enable GPU-Direct peer-to-peer DMA between GPUs and RDMA NICs. Covers BIOS disable, kernel override, and when

acs, pcie, gpu-direct, nccl
advanced ⏱ 15 minutes

IOMMU BIOS and Kernel Config for NCCL GPU-Direct

Configure IOMMU at BIOS and kernel level to enable NCCL GPU-Direct RDMA on Kubernetes. Covers Intel VT-d, AMD-Vi, kernel parameters, passthrough

iommu, nccl, gpu-direct, rdma
advanced ⏱ 15 minutes

NCCL PXN Cross-NIC Communication via NVLink

Configure NCCL PXN (PCIe cross-NIC via NVLink) for multi-node GPU training where not every GPU has a direct RDMA NIC. Covers topology

nccl, pxn, nvlink, gpu-direct
advanced ⏱ 15 minutes

Run:ai Distributed Inference with SR-IOV RDMA

Deploy distributed vLLM inference on Run:ai using SR-IOV RDMA for NCCL inter-node communication. Covers extended-resource for Mellanox VFs, network annotation

runai, sriov, rdma, vllm
advanced ⏱ 15 minutes

Run:ai Distributed Inference with vLLM and NCCL

Deploy distributed LLM inference on Run:ai with vLLM tensor parallelism across multiple workers. Covers multi-node GPU splitting, NCCL configuration, PVC model

runai, vllm, nccl, distributed-inference
advanced ⏱ 15 minutes

Debug Distributed vLLM Inference with NCCL Verbose Logging

Debug distributed vLLM inference using NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL. Covers air-gapped deployment with TRANSFORMERS_OFFLINE, interpreting NCCL

vllm, nccl, debugging, distributed-inference
advanced ⏱ 15 minutes

Kubernetes AI Infrastructure Scaling

Scale AI inference infrastructure on Kubernetes from 10K to 100K requests per second. Covers latency optimization, horizontal scaling, caching

ai-infrastructure, scaling, inference, performance
intermediate ⏱ 15 minutes

Kubernetes for AI Search and Discoverability

Deploy AI-searchable services on Kubernetes: llms.txt implementation, RAG-optimized APIs, structured data for AI chatbots, and infrastructure patterns

ai-search, llms-txt, rag, api-design
advanced ⏱ 15 minutes

Deep Learning with Large Datasets on K8s

Optimize deep learning training with large datasets on Kubernetes. Covers data loading, caching strategies, parallel prefetch, and storage architecture

training, datasets, storage, performance
advanced ⏱ 15 minutes

Distributed Multi-GPU Inference on Kubernetes

Deploy distributed inference across multiple GPUs and nodes on Kubernetes. Covers tensor parallelism, pipeline parallelism, vLLM, and NIM multi-GPU serving.

inference, multi-gpu, distributed, vllm
advanced ⏱ 15 minutes

FSDP LoRA Fine-Tuning LLMs on Kubernetes

Fine-tune large language models with FSDP and LoRA on Kubernetes. Covers memory-efficient loading, checkpoint strategies, and multi-node H200 training.

fsdp, lora, fine-tuning, distributed
advanced ⏱ 15 minutes

NVIDIA GenAI-Perf Inference Benchmarking

Benchmark LLM inference throughput and latency on Kubernetes using NVIDIA GenAI-Perf. Covers vLLM, Run:ai, concurrency testing, and multi-location client runs.

benchmarking, inference, nvidia, vllm
advanced ⏱ 15 minutes

LeaderWorkerSet Multi-Node Inference on K8s

Deploy multi-node distributed inference using LeaderWorkerSet (LWS) operator on Kubernetes. Covers vLLM pipeline parallelism across nodes for 405B+ parameter

inference, distributed, lws, vllm
advanced ⏱ 15 minutes

Mistral FSDP LoRA Complete Accelerate Config

Complete accelerate FSDP configuration for fine-tuning Mistral-Small-4 11B with LoRA on multi-GPU H200 clusters. Covers every FSDP2 setting with explanations.

fsdp, lora, mistral, accelerate
advanced ⏱ 15 minutes

Multi-Node Distributed Training on Kubernetes

Run distributed deep learning training across multiple GPU nodes on Kubernetes. Covers PyTorch DDP, DeepSpeed, Horovod, and MPI jobs with NCCL optimization.

training, distributed, multi-node, pytorch
advanced ⏱ 15 minutes

NVIDIA GPUDirect Storage Benchmark on K8s

Benchmark NVIDIA GPUDirect Storage (GDS) on Kubernetes for direct NVMe-to-GPU data transfers. Covers gdsio, gds_stats, performance validation, and comparison

benchmarking, nvidia, gds, storage
advanced ⏱ 15 minutes

NVIDIA GPU Operator GitOps on OpenShift

Deploy NVIDIA GPU Operator on OpenShift via GitOps with ArgoCD. Covers ClusterPolicy configuration, DCGM exporter, drain settings, tolerations, and rolling

nvidia, gpu-operator, openshift, gitops
advanced ⏱ 15 minutes

OpenShift GPU Node Resource Planning

Plan CPU, memory, and overhead budgets for GPU nodes running NVIDIA GPU Operator, Network Operator, Run:ai, and OpenShift infrastructure Pods. Understand what

openshift, gpu, capacity-planning, resource-management
advanced ⏱ 15 minutes

Run:ai Backend Architecture on OpenShift

Understand the full Run:ai backend deployment on OpenShift with 40+ microservices including Keycloak, PostgreSQL, NATS, Thanos, Traefik, and workload

runai, openshift, architecture, platform-engineering
advanced ⏱ 15 minutes

Run:ai Distributed PyTorch Training on OpenShift

Submit multi-node distributed PyTorch training jobs on OpenShift using Run:ai CLI. Covers DDP, FSDP, RDMA networking, and GPU scheduling.

runai, openshift, distributed, training
advanced ⏱ 15 minutes

FSDP Distributed Training on Run:ai

Run PyTorch FSDP distributed training workloads on Run:ai with GPU scheduling, event tracking, and GPU memory monitoring. Covers Mistral-class model

runai, distributed-training, fsdp, pytorch
intermediate ⏱ 15 minutes

Run:ai GPU Metrics Pipeline with DCGM and Thanos

End-to-end GPU metrics pipeline on Run:ai: DCGM exporter collects GPU utilization, Prometheus scrapes, remote-writes to Thanos Receive, and Grafana dashboards

runai, dcgm, thanos, grafana
intermediate ⏱ 15 minutes

Run:ai Platform Backend Components

Overview of Run:ai backend StatefulSets and components on OpenShift: Thanos receive/query, Keycloak, NATS, Redis, PostgreSQL, workload controllers, and their

runai, architecture, openshift, statefulset
intermediate ⏱ 15 minutes

Run:ai Training Job Submit Script Pattern

Production pattern for submitting Run:ai training jobs via shell scripts with GPU fractional allocation, NFS mounts, custom Python environments, and private

runai, training, gpu, finetuning
advanced ⏱ 15 minutes

Run:ai Workload Controllers on OpenShift

Understand Run:ai cluster-level workload controllers on OpenShift: workload-controller, workload-overseer, workload-exporter, and status-updater components.

runai, openshift, controllers, scheduling
advanced ⏱ 15 minutes

Kubernetes 1.36 DRA for GPU and TPU Management

Use Dynamic Resource Allocation in Kubernetes 1.36 for advanced GPU/TPU management with partitionable devices, device taints, and tolerations.

kubernetes-1.36, dra, gpu, tpu
advanced ⏱ 15 minutes

Kubernetes 1.36 Gang Scheduling

Use gang scheduling in Kubernetes 1.36 to schedule Pod groups atomically. Essential for distributed ML training, MPI jobs, and Spark workloads.

kubernetes-1.36, scheduling, gang-scheduling, machine-learning
advanced ⏱ 15 minutes

Kubernetes 1.36 RestartAllContainers for ML

Use the RestartAllContainers policy in Kubernetes 1.36 to restart all Pod containers in-place when a worker fails, avoiding costly ML training rescheduling.

kubernetes-1.36, machine-learning, gpu, restart-policy
advanced ⏱ 15 minutes

Kubernetes 1.36 Topology-Aware Scheduling

Use topology-aware workload scheduling in Kubernetes 1.36 to place Pods on nodes with optimal GPU, NUMA, and network topology for ML training.

kubernetes-1.36, scheduling, topology, gpu
intermediate ⏱ 15 minutes

NVIDIA GPU Feature Discovery for Kubernetes

Deploy GPU Feature Discovery (GFD) to auto-label Kubernetes nodes with GPU model, MIG capability, CUDA version, and driver info for intelligent scheduling.

nvidia, gpu, scheduling, node-labels
advanced ⏱ 15 minutes

OpenShift NVIDIA MIG Reconfiguration Without Reboot

Reconfigure NVIDIA MIG geometry on OpenShift without rebooting nodes. Use nvidia-mig-manager with node labels to dynamically switch GPU partitions.

openshift, nvidia, mig, gpu
advanced ⏱ 15 minutes

Talos Linux MIG Configuration with GPU Operator

Configure NVIDIA MIG on Talos Linux Kubernetes clusters. Install GPU Operator, set MIG strategy, and dynamically partition A100 GPUs without node reboot.

talos, nvidia, mig, gpu
advanced ⏱ 15 minutes

DGX H100 nvidia-smi topo -m Guide

Read nvidia-smi topo -m output on DGX H100 systems. Understand NVLink, NVSwitch, PCIe topology, GPU-to-GPU bandwidth, and NUMA affinity for Kubernetes.

nvidia, dgx, h100, topology
intermediate ⏱ 15 minutes

NVIDIA H300 GPU Setup on Kubernetes

Deploy NVIDIA H300 GPUs on Kubernetes. H300 vs H100 vs H200 specs comparison, memory bandwidth, GPU Operator setup, and AI inference optimization.

nvidia, gpu, h300, h100
intermediate ⏱ 15 minutes

NVIDIA PyTorch Container on Kubernetes

Deploy nvcr.io/nvidia/pytorch containers on Kubernetes for GPU training. Version selection, CUDA compatibility, multi-node DDP, and NCCL configuration.

nvidia, pytorch, gpu, training
intermediate ⏱ 20 minutes

GenAI-Perf Benchmark LLM Kubernetes

Benchmark LLM inference with GenAI-Perf on Kubernetes. Use --service-kind openai for vLLM, NIM, and TGI. Measure TTFT, ITL, and throughput.

genai-perf, benchmarking, llm, nvidia
intermediate ⏱ 15 minutes

Continuous Batching LLM Inference K8s

Configure continuous batching for LLM inference on Kubernetes. vLLM and TRT-LLM batch scheduling, max-num-seqs tuning, and throughput optimization.

continuous-batching, inference, throughput
beginner ⏱ 15 minutes

CUDA Version Compatibility K8s Guide

Match CUDA versions with GPU drivers and container images on Kubernetes. Forward compatibility, driver requirements, and container toolkit matrix.

cuda, compatibility, driver-version, container
advanced ⏱ 15 minutes

DeepSpeed ZeRO Training Kubernetes

Deploy DeepSpeed ZeRO-1/2/3 for large model training on Kubernetes. Multi-node config, NCCL tuning, memory optimization, and 70B+ model training.

deepspeed, zero, distributed-training, large-models
advanced ⏱ 15 minutes

DGX H100 GPU Topology nvidia-smi

Inspect DGX H100 GPU topology with nvidia-smi topo -m. NVSwitch NV18 links, cross-socket detection, PCIe hierarchy, and NCCL performance validation.

dgx, h100, gpu-topology, nvidia-smi
beginner ⏱ 15 minutes

GPU Feature Discovery Node Labels

Configure NVIDIA GPU Feature Discovery for automatic node labeling on Kubernetes. GPU model, driver version, CUDA, and MIG labels for scheduling.

gpu-feature-discovery, node-labels, scheduling
intermediate ⏱ 15 minutes

GPU Node Affinity Scheduling K8s

Schedule GPU workloads with node affinity and topology on Kubernetes. GPU type selection, multi-GPU locality, and NUMA-aware pod placement.

node-affinity, gpu-scheduling, topology, numa
beginner ⏱ 15 minutes

K8s GPU Limits Requests Configuration

Configure GPU resource limits and requests in Kubernetes pod specs. nvidia.com/gpu resource, fractional GPUs, MIG slices, and multi-GPU allocation.

gpu-limits, resource-requests, nvidia, pod-spec
advanced ⏱ 15 minutes

LoRA Adapter Serving vLLM on K8s

Serve multiple LoRA adapters with a single vLLM base model on Kubernetes. Dynamic loading, per-request routing, and multi-tenant fine-tuned models.

lora, fine-tuning, vllm, multi-tenant
advanced ⏱ 15 minutes

Multi-GPU PyTorch DDP on Kubernetes

Run PyTorch DistributedDataParallel across multiple GPUs on Kubernetes. torchrun, NCCL backend, pod topology, and scaling to multi-node training.

pytorch, ddp, multi-gpu, distributed-training
intermediate ⏱ 15 minutes

NVIDIA Driver Update K8s Nodes Guide

Safely update NVIDIA GPU drivers on Kubernetes nodes. Rolling updates, drain strategy, driver compatibility matrix, and GPU Operator upgrades.

nvidia-driver, upgrade, rolling-update, gpu-operator
advanced ⏱ 15 minutes

NVIDIA PeerMem GPUDirect RDMA K8s

Configure nvidia_peermem and ib_register_peer_memory_client for GPUDirect RDMA on Kubernetes. Module loading and modprobe invalid argument fix.

nvidia-peermem, gpudirect, rdma, ib-register-peer-memory
beginner ⏱ 15 minutes

nvidia-smi Monitoring in K8s Pods

Run nvidia-smi inside Kubernetes pods for GPU monitoring. Memory usage, temperature, utilization, and automated health checks with liveness probes.

nvidia-smi, gpu-monitoring, health-check
intermediate ⏱ 15 minutes

Prefix Caching vLLM KV Cache K8s

Enable automatic prefix caching in vLLM on Kubernetes for shared-prompt workloads. KV cache reuse, memory savings, and chatbot latency optimization.

prefix-caching, kv-cache, vllm, latency
intermediate ⏱ 15 minutes

Quantize LLMs AWQ GPTQ for K8s Deploy

Deploy AWQ and GPTQ quantized LLMs on Kubernetes. 4-bit inference with vLLM, model conversion, accuracy trade-offs, and GPU memory savings guide.

quantization, awq, gptq, vllm
advanced ⏱ 15 minutes

Speculative Decoding with vLLM on Kubernetes

Enable speculative decoding in vLLM on Kubernetes for 2-3x faster LLM inference. Draft model selection, acceptance rates, and latency optimization.

speculative-decoding, vllm, inference-optimization
intermediate ⏱ 15 minutes

TensorRT-LLM vs vLLM Benchmark 2026

Compare TensorRT-LLM vs vLLM for LLM inference on Kubernetes. TTFT, throughput, GPU utilization benchmarks, and when to use each inference engine.

tensorrt-llm, vllm, benchmark, inference
intermediate ⏱ 15 minutes

vLLM Alternatives LLM Inference K8s

Compare vLLM alternatives for LLM inference on Kubernetes. TensorRT-LLM, SGLang, NVIDIA NIM, Ollama, and text-generation-inference feature comparison.

vllm, alternatives, inference, comparison
advanced ⏱ 15 minutes

Kubeflow PyTorchJob Training K8s

Run distributed PyTorch training on Kubernetes with Kubeflow PyTorchJob. ElasticPolicy, nproc_per_node, RDMA configuration, and multi-GPU scaling.

kubeflow, pytorchjob, distributed-training, pytorch
advanced ⏱ 15 minutes

NCCL Environment Variables Reference

Complete NCCL environment variables reference for Kubernetes GPU training. NCCL_IB_DISABLE, NCCL_SOCKET_IFNAME, NCCL_DEBUG, and network tuning guide.

nccl, environment-variables, gpu, infiniband
advanced ⏱ 10 minutes

NCCL Test Benchmark Kubernetes

Run NCCL tests on Kubernetes for GPU communication benchmarking. all_reduce_perf, all_gather_perf, multi-node bandwidth, and latency validation.

nccl, benchmark, gpu, all-reduce
intermediate ⏱ 15 minutes

GPU Time-Slicing vs MIG Comparison

Compare NVIDIA GPU time-slicing and MIG for K8s workloads. When to use each, performance trade-offs, and configuration examples.

gpu, time-slicing, mig, nvidia
advanced ⏱ 15 minutes

TensorRT-LLM Kubernetes Deployment

Deploy TensorRT-LLM on K8s for optimized inference. Engine building, model conversion, and serving with Triton Inference Server.

tensorrt-llm, inference, triton, optimization
intermediate ⏱ 15 minutes

vLLM Deployment Kubernetes Guide

Deploy vLLM inference engine on K8s. Model loading, tensor parallelism, continuous batching, and OpenAI-compatible API setup.

vllm, inference, llm, deployment
intermediate ⏱ 20 minutes

AI Resource Allocation Optimization

Optimize GPU and memory allocation for AI workloads on Kubernetes. Right-size GPU requests, bin-packing strategies, gang scheduling.

gpu, resource-optimization, bin-packing, gang-scheduling
beginner ⏱ 20 minutes

CNCF AI Projects Landscape Kubernetes

Navigate the CNCF AI project landscape for Kubernetes. Kubeflow, KServe, KAITO, Volcano, and emerging projects for training, serving, scheduling.

cncf, ai-landscape, cloud-native, ecosystem
advanced ⏱ 20 minutes

Distributed Training TensorFlow PyTorch

Run distributed training jobs on Kubernetes with TensorFlow and PyTorch. Training Operator, multi-worker strategies, NCCL configuration.

distributed-training, tensorflow, pytorch, training-operator
intermediate ⏱ 20 minutes

Feast Feature Store Kubernetes

Deploy Feast feature store on Kubernetes for ML feature management. Offline and online stores, feature serving, point-in-time joins.

feast, feature-store, ml-features, data-engineering
intermediate ⏱ 20 minutes

GPU Sharing MIG and Time-Slicing Kubernetes

Share GPUs across multiple pods with NVIDIA MIG and time-slicing on Kubernetes. MIG profiles for A100/H100, time-slicing configuration.

gpu, mig, time-slicing, nvidia
intermediate ⏱ 20 minutes

KAITO AI Model Inference Kubernetes

Deploy AI models with KAITO (Kubernetes AI Toolchain Operator) for automated GPU provisioning, model serving, and inference workload management.

kaito, inference, gpu-provisioning, model-serving
advanced ⏱ 20 minutes

Katib Hyperparameter Tuning Kubernetes

Automate hyperparameter tuning with Katib on Kubernetes. Bayesian optimization, random search, grid search, early stopping.

katib, hyperparameter, automl, tuning
advanced ⏱ 20 minutes

KnativeServing for AI Inference OpenShift

Configure KnativeServing with scale-to-zero, GPU scheduling features, Kourier ingress, and custom domain templates for AI inference workloads on OpenShift.

knative, serverless, inference, scale-to-zero
intermediate ⏱ 20 minutes

KServe Model Serving Kubernetes

Deploy ML models with KServe for serverless inference on Kubernetes. InferenceService, scale-to-zero, canary rollouts, model transformers.

kserve, model-serving, inference, serverless
advanced ⏱ 20 minutes

Kubeflow ML Platform Setup Kubernetes

Deploy Kubeflow as a production-ready ML platform on Kubernetes. Notebooks, pipelines, training operators, and model serving with KServe for end-to-end MLO.

kubeflow, mlops, machine-learning, platform
intermediate ⏱ 20 minutes

AI Cost Management on Kubernetes

Control AI infrastructure costs on Kubernetes with GPU utilization tracking, chargeback per team, spot instance strategies, right-sizing recommendations.

cost-management, gpu-cost, chargeback, right-sizing
advanced ⏱ 20 minutes

AI Inference Optimization Kubernetes

Optimize AI inference performance on Kubernetes. Request batching, KV cache tuning, speculative decoding, continuous batching.

inference, optimization, batching, kv-cache
intermediate ⏱ 20 minutes

GPU Node Provisioning Kubernetes

Automate GPU node provisioning for Kubernetes with Karpenter, Cluster Autoscaler, and cloud-specific node pools for AI and ML workloads.

gpu, karpenter, autoscaler, provisioning
advanced ⏱ 20 minutes

GPU Operator Advanced Configuration

Advanced NVIDIA GPU Operator configuration on Kubernetes. Driver containers, CUDA toolkit, GDS, GPUDirect RDMA, MIG manager, DCGM Exporter.

gpu-operator, nvidia, driver, cuda
intermediate ⏱ 20 minutes

Kueue Job Queuing Fair Sharing Kubernetes

Implement fair-share GPU job queuing with Kueue on Kubernetes. ClusterQueues, LocalQueues, ResourceFlavors, and cohort-based borrowing for multi-team AI cl.

kueue, job-queuing, fair-sharing, gpu-quota
advanced ⏱ 20 minutes

LLM Deployment Challenges Kubernetes

Address common LLM deployment challenges on Kubernetes. GPU memory management, model loading optimization, inference latency tuning, batch scheduling.

llm, deployment, gpu-memory, inference
intermediate ⏱ 20 minutes

ML Pipeline Automation Kubernetes

Automate ML pipelines on Kubernetes with Kubeflow Pipelines, Argo Workflows, and Tekton. Data preprocessing, training, evaluation, model registration.

ml-pipeline, kubeflow-pipelines, argo-workflows, automation
advanced ⏱ 20 minutes

ModelMesh Multi-Model Serving Kubernetes

Deploy hundreds of ML models on shared GPU infrastructure with ModelMesh. Intelligent model loading and unloading, memory management, routing.

modelmesh, multi-model, inference, gpu-sharing
advanced ⏱ 20 minutes

Multi-Cloud AI Workloads Kubernetes

Run AI workloads across multiple cloud providers with Kubernetes. GPU instance availability, spot pricing arbitrage, model portability.

multi-cloud, gpu-availability, spot-instances, cloud-agnostic
advanced ⏱ 30 minutes

NCCL SR-IOV GDS PyTorch Configuration

Configure NCCL with SR-IOV RDMA and GPUDirect Storage on Kubernetes. PyTorch 25.11 container with NCCL 2.28, CUDA 13, MOFED 5.4, GDRCopy 2.

nccl, sriov, gds, gpudirect
advanced ⏱ 20 minutes

Volcano Job minAvailable Gang Schedule

Volcano batch scheduling with minAvailable gang scheduling on Kubernetes. Job configuration, queue policies, and AI training workload scheduling.

volcano, batch-scheduling, gang-scheduling, queue
advanced ⏱ 15 minutes

AIPerf Offline vLLM Benchmarking

Benchmark vLLM inference with AIPerf in air-gapped Kubernetes clusters. Use dummy tokenizers, offline mode, custom endpoints.

aiperf, vllm, benchmarking, offline
advanced ⏱ 25 minutes

Run:ai Distributed vLLM with NCCL

Deploy distributed vLLM inference on Run:ai with NCCL over NVLink and RDMA. Tensor parallelism across GPUs with NCCL debug logging, SR-IOV networking.

runai, vllm, nccl, rdma
advanced ⏱ 20 minutes

AIPerf LLM Benchmarking on K8s

Benchmark generative AI inference on Kubernetes with NVIDIA AIPerf. Measure TTFT, ITL, throughput, and latency across vLLM, NIM.

aiperf, benchmarking, llm, inference
advanced ⏱ 25 minutes

DOCA Perftest RDMA Benchmarking

Run NVIDIA DOCA perftest on Kubernetes to benchmark RDMA bandwidth and latency between GPU nodes. Traffic patterns, GPUDirect memory modes.

doca, perftest, rdma, gpudirect
advanced ⏱ 20 minutes

RetinaNet GPU Training on Kubernetes

Train RetinaNet object detection models on Kubernetes with unlimited memlock for RDMA, CRI-O ulimit configuration, and multi-GPU distributed training.

retinanet, gpu-training, memlock, crio
advanced ⏱ 15 minutes

NCCL Topology Dump File for GPU Debugging

Use NCCL_TOPO_DUMP_FILE to capture and analyze GPU interconnect topology in Kubernetes. Debug NVLink, NVSwitch, and PCIe connection paths.

nccl, topology, gpu, nvlink
advanced ⏱ 30 minutes

Run:ai Distrib. vLLM Inference Multimodal LLMs

Deploy multimodal LLMs with Run:ai distributed inference and vLLM on Kubernetes. Tensor parallelism, NCCL over NVLink, GPUDirect RDMA.

runai, vllm, distributed-inference, tensor-parallelism
advanced ⏱ 45 minutes

Inter-Node Tensor Parallelism on Kubernetes

Split a single LLM across multiple physical servers using tensor parallelism. Configure vLLM, NIM, and Ray for inter-node TP with NCCL over RDMA or TCP.

tensor-parallelismdistributed-inferencemulti-nodenccl
intermediate ⏱ 15 minutes

Triton Inference Server vs vLLM: Which to C...

Compare NVIDIA Triton Inference Server vs vLLM for LLM serving on Kubernetes. Performance, multi-model support, batching, GPU utilization.

tritonvllminferencecomparison
advanced ⏱ 30 minutes

Verify NCCL RDMA Traffic with Debug Logging

Prove NCCL uses RDMA for GPU communication on Kubernetes. Use NCCL_DEBUG and NCCL_DEBUG_SUBSYS=ALL to verify InfiniBand, RoCE.

ncclrdmainfinibandgpu-networking
intermediate ⏱ 20 minutes

NCCL_IB_DISABLE Environment Variable

NCCL_IB_DISABLE environment variable explained. Set NCCL_IB_DISABLE=1 for Ethernet-only clusters, debug InfiniBand errors, and tune GPU communication.

ncclinfinibandrdmagpu-networking
advanced ⏱ 30 minutes

vLLM on Huawei Ascend NPU: K8s Deployment

Deploy vLLM inference on Huawei Ascend NPUs in Kubernetes. Atlas 300I/910B device plugin, vllm-ascend container image, tensor parallelism, and model serving.

vllmascendnpuhuawei
intermediate ⏱ 20 minutes

Deploy vLLM OpenAI Container on Kubernetes

Deploy the vLLM OpenAI-compatible server container on Kubernetes. Pull ghcr.io/vllm-project/vllm-openai, configure GPU resources, model loading.

vllmopenai-apiinferencegpu
advanced ⏱ 20 minutes

AI-Native Development Platforms on Kubernetes

Build AI-native development platforms on Kubernetes. AI coding agents, automated testing, Copilot infrastructure, dev containers, and AI-driven CI/CD pipelines.

ai-nativedevelopment-platformscopilotci-cd
advanced ⏱ 25 minutes

Agentic AI and Multi-Agent Systems

Deploy autonomous AI agents and multi-agent orchestration on Kubernetes. LangGraph, CrewAI, AutoGen, tool-calling agents, agent-to-agent communication.

agentic-aimulti-agentlangchaincrewai
intermediate ⏱ 20 minutes

AI Infrastructure Cost Optimization

Optimize AI infrastructure costs on Kubernetes. GPU sharing, spot instances, inference batching, model quantization, token economics.

cost-optimizationgpu-sharingspot-instancesquantization
advanced ⏱ 20 minutes

AI Content Watermarking on Kubernetes

Deploy AI-generated content watermarking on Kubernetes. Invisible watermarks, SynthID integration, detection APIs, image and text watermarking pipelines.

watermarkingsynthidai-generated-contenttrust-safety
advanced ⏱ 30 minutes

AI Supercomputing on Kubernetes GPU Clusters

Build AI supercomputing platforms on Kubernetes. Multi-node GPU training, NVIDIA DGX SuperPOD, InfiniBand RDMA, NCCL tuning, Blackwell clusters.

supercomputinggpu-clustersnvidia-dgxinfiniband
advanced ⏱ 25 minutes

Autonomous Industrial Systems on Kubernetes

Orchestrate autonomous factories and logistics with Kubernetes. Digital twins, robot fleet coordination, industrial IoT pipelines, predictive maintenance.

industrial-aidigital-twiniotpredictive-maintenance
advanced ⏱ 25 minutes

Domain-Specific Language Models on Kubernetes

Deploy and fine-tune domain-specific LLMs on Kubernetes. Legal, healthcare, finance, and code models with LoRA fine-tuning, NIM serving, and RAG pipelines.

domain-specific-llmfine-tuninglorarag
advanced ⏱ 25 minutes

GitOps for AI Workloads on Kubernetes

Deploy AI models with GitOps on Kubernetes. Version ML models in Git, ArgoCD for model rollouts, Flux for GPU cluster sync.

gitopsai-workloadsargocdflux
advanced ⏱ 25 minutes

K8s AI Gateway: Inference Extension Guide

Use the Kubernetes AI Gateway and Inference Extension to route LLM traffic. Model-aware routing, load balancing across inference backends.

ai-gatewaygateway-apiinferencellm-routing
advanced ⏱ 25 minutes

Dynamic Resource Allocation for GPUs

Use Kubernetes Dynamic Resource Allocation to schedule GPUs. DRA ResourceClaims, partitionable devices, GPU sharing, and structured parameters for accelerators.

dragpu-schedulingresource-allocationdevice-plugin
advanced ⏱ 25 minutes

Kueue for Batch Jobs and GPU Queues

Use Kueue to manage batch job queues on Kubernetes. GPU quota, fair sharing, priority queues, ML training workloads, and multi-tenant cluster scheduling.

kueuebatch-jobsgpu-schedulingml-training
intermediate ⏱ 15 minutes

Llama 2 70B FP16 Model Size 140GB Guide

Llama 2 70B FP16 model size is 140GB. Complete GPU memory requirements for FP16, FP8, INT4 quantization, and multi-GPU tensor parallelism on Kubernetes.

llamamodel-sizinggpu-requirementsquantization
advanced ⏱ 25 minutes

Physical AI and Robotics Orchestration

Orchestrate physical AI and robotics fleets with Kubernetes. ROS 2 on K8s, robot fleet management, edge-cloud hybrid, NVIDIA Isaac.

physical-airoboticsros2edge-computing
advanced ⏱ 25 minutes

Quantum Computing on K8s: Hybrid Workflows

Run quantum computing workloads on Kubernetes. Qiskit, Cirq, PennyLane hybrid classical-quantum pipelines, quantum job scheduling, and QPU integration patterns.

quantum-computingqiskithybrid-workflowshpc
advanced ⏱ 30 minutes

Run:ai Topology-Aware Scheduling Deep Dive

Configure Run:ai topology-aware scheduling for distributed AI workloads. Multi-level hierarchies, required vs preferred placement, LeaderWorkerSet.

run-aitopology-awaregang-schedulingdistributed-workloads
intermediate ⏱ 25 minutes

NIM Model Profiles and Selection on Kubernetes

Configure NIM_MODEL_PROFILE for NVIDIA NIM deployments on Kubernetes. List profiles, select by ID or name, tune VRAM, and override with vLLM CLI args.

nvidia-nimmodel-profilesgpuvllm
advanced ⏱ 45 minutes

NIM Multi-Node Deployment with Helm on K8s

Deploy NVIDIA NIM across multiple Kubernetes nodes using Helm, LeaderWorkerSet, Ray, and vLLM. Run Llama 405B and DeepSeek-R1 on 16+ GPUs.

nvidia-nimmulti-nodeleaderworkersetray
intermediate ⏱ 15 minutes

NIM LLM Support Matrix and GPU Compatibility

Complete NVIDIA NIM support matrix for Kubernetes. Supported models, profiles, precision formats, GPU compatibility, and hardware requirements per model.

nvidia-nimgpu-compatibilitysupport-matrixmodel-profiles
advanced ⏱ 45 minutes

NVIDIA Dynamo Distributed Inference

Deploy NVIDIA Dynamo on Kubernetes for disaggregated LLM inference. KV-aware routing, prefill/decode splitting, Grove operator, and zero-config deployment.

nvidia-dynamodistributed-inferencedisaggregated-servingkv-cache
advanced ⏱ 40 minutes

Rebuild NIM with Custom Model on Kubernetes

Step-by-step guide to deploy custom, fine-tuned, or self-hosted models with NVIDIA NIM on Kubernetes. Model-free NIM from HuggingFace, S3, NGC, or local path.

nvidia-nimcustom-modelfine-tuningmodel-free
advanced ⏱ 40 minutes

Run:ai + Dynamo Multi-Node Scheduling on K8s

Deploy NVIDIA Dynamo with Run:ai v2.23 for gang scheduling and topology-aware placement. Atomic pod launches, zone co-location, and disaggregated inference.

nvidia-dynamorun-aigang-schedulingtopology-aware
intermediate ⏱ 20 minutes

Copy NVIDIA NIM Images to Internal Quay Reg...

Pull NIM container images from nvcr.io and push to an internal Quay registry. Covers authentication, tagging, air-gapped workflows, and curl token issues.

nvidia-nimquay-registrycontainer-imagesair-gapped
advanced ⏱ 45 minutes

Deploy Multinode NIM Models on Kubernetes

Run large language models across multiple GPU nodes with NVIDIA NIM. Tensor parallelism, NCCL, InfiniBand, and Kubernetes Job orchestration.

nvidia-nimmultinodetensor-parallelismnccl
advanced ⏱ 40 minutes

Distributed Inference with Run:ai

Deploy distributed AI inference with NVIDIA Run:ai on Kubernetes. Single-node Knative, multinode LeaderWorkerSet, NIM, autoscaling, and observability.

nvidia-runaidistributed-inferenceknativeleader-worker-set
advanced ⏱ 60 minutes

Run:ai NIM Distributed Inference Tutorial

Step-by-step guide to deploy DeepSeek-R1 distributed inference on Run:ai with LeaderWorkerSet, SGLang, PVC caching, and OpenShift security.

nvidia-runainvidia-nimdistributed-inferencedeepseek-r1
advanced ⏱ 15 minutes

Kubeflow Operator: Full ML Platform

Deploy the complete Kubeflow platform on Kubernetes with the Kubeflow Operator. Covers Pipelines, Notebooks, KServe, Katib, and multi-tenant ML workflows.

kubeflowmlopsoperatorml-platform
advanced ⏱ 15 minutes

GPU Sharing with MPS and MIG on Kubernetes

Share NVIDIA GPUs across multiple pods using MPS (Multi-Process Service) and MIG hardware partitioning. Maximize GPU utilization for inference workloads.

gpu-sharingmpsmignvidia
intermediate ⏱ 15 minutes

Node Feature Discovery Operator for Kubernetes

Install and configure Node Feature Discovery (NFD) Operator to auto-detect hardware features like GPUs, NICs, CPU flags, and USB devices on Kubernetes nodes.

nfdnode-feature-discoveryoperatorgpu
advanced ⏱ 20 minutes

Enable GPUDirect Storage in ClusterPolicy

Enable NVIDIA GPUDirect Storage (GDS) in the GPU Operator ClusterPolicy for direct GPU-to-NVMe data paths. Driver module configuration and verification.

nvidiagdsgpu-operatorclusterpolicy
intermediate ⏱ 20 minutes

GPU Time-Slicing on Kubernetes

Share GPUs across multiple workloads using NVIDIA time-slicing on Kubernetes. Configure the device plugin, set replica counts, and manage fairness.

nvidiagputime-slicinggpu-sharing
intermediate ⏱ 30 minutes

NVIDIA GPU Operator Setup on Kubernetes

Install and configure NVIDIA GPU Operator on Kubernetes. Driver containers, toolkit, device plugin, DCGM monitoring, and ClusterPolicy setup.

nvidiagpu-operatorgpukubernetes
advanced ⏱ 45 minutes

NVIDIA Open GPU + GPUDirect RDMA + DOCA-OFE...

Deploy NVIDIA AI networking on Kubernetes: Open GPU driver with DMA-BUF, GPUDirect RDMA, DOCA-OFED, and SR-IOV VF isolation.

nvidiagpu-operatorgpudirectrdma
intermediate ⏱ 30 minutes

AI Model Storage: hostPath vs PVC for Inference

Deploy AI models on Kubernetes using hostPath and PVC storage. Compare performance, security trade-offs, and production patterns for model serving.

model-servingstoragehostpathpvc
advanced ⏱ 35 minutes

Volcano Job minAvailable Gang Scheduling

Configure Volcano job minAvailable for gang scheduling on Kubernetes. Batch AI training, fair-share queues, job plugins, and GPU preemption guide.

volcanobatchgang-schedulingai-workloads
intermediate ⏱ 20 minutes

AIPerf Benchmark LLMs on Kubernetes

Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, throughput with real-time dashboard and GPU telemetry.

aiperfbenchmarkingnvidiainference
advanced ⏱ 30 minutes

AIPerf Concurrency Sweep on K8s

Run AIPerf concurrency sweeps on Kubernetes to find optimal LLM serving capacity. Automate 1-128 concurrent user benchmarks with batch Jobs.

aiperfbenchmarkingconcurrencyautoscaling
advanced ⏱ 25 minutes

AIPerf Goodput and SLO Benchmarks

Measure LLM goodput with AIPerf on Kubernetes. Define SLOs for TTFT and ITL, calculate effective throughput, and benchmark with timeslice analysis.

aiperfbenchmarkinggoodputslo
advanced ⏱ 30 minutes

AIPerf Multi-Model Benchmark on K8s

Compare multiple LLM models and backends with AIPerf on Kubernetes. Benchmark vLLM vs TGI vs Triton with automated multi-run confidence intervals.

aiperfbenchmarkingcomparisonvllm
advanced ⏱ 25 minutes

AIPerf Trace Replay Benchmarks on K8s

Replay production traffic traces with AIPerf on Kubernetes. Use moon_cake format, ShareGPT datasets, and fixed schedules for realistic LLM benchmarks.

aiperfbenchmarkingtrace-replaysharegpt
advanced ⏱ 15 minutes

Dell PowerEdge XE7740 GPU Node Setup

Configure Dell PowerEdge XE7740 GPU nodes with H200 GPUs for OpenShift and Kubernetes including BIOS, power, cooling, and network setup.

dellpoweredgexe7740h200
intermediate ⏱ 20 minutes

Deploy Fish Audio TTS on Kubernetes

Deploy Fish Audio S2-Pro 5B text-to-speech model on Kubernetes for high-quality voice synthesis with multi-speaker support and streaming audio.

fish-audiotext-to-speechttsvoice-synthesis
advanced ⏱ 45 minutes

Deploy GLM-5 754B on Kubernetes

Deploy Zhipu AI GLM-5 754B model on Kubernetes with vLLM. One of the largest open-weight models with multi-node tensor parallelism across 8+ GPUs.

glm-5zhipullmultra-large
beginner ⏱ 15 minutes

Deploy Granite 4.0 Speech on Kubernetes

Deploy IBM Granite 4.0 1B Speech model on Kubernetes for automatic speech recognition. Lightweight model runs on CPU or small GPU for STT workloads.

graniteibmspeech-recognitionstt
advanced ⏱ 45 minutes

Deploy Kimi K2.5 1.1T MoE on Kubernetes

Deploy Moonshot AI Kimi-K2.5 1.1T MoE multimodal model on Kubernetes. The largest open MoE model with 2.69M downloads for frontier AI tasks.

kimimoonshotmixture-of-expertsmoe
advanced ⏱ 30 minutes

Deploy Llama 2 70B on Kubernetes

Deploy Meta Llama 2 70B on Kubernetes with multi-GPU tensor parallelism, vLLM serving, and production-ready health checks and resource limits.

llamallmvllmmulti-gpu
intermediate ⏱ 15 minutes

Deploy Llama 3.1 8B Instruct on K8s

Deploy Meta Llama 3.1 8B Instruct on Kubernetes with vLLM. Production-ready single-GPU deployment with 128K context, tool calling, and autoscaling.

llamallama-3.1metallm
advanced ⏱ 25 minutes

Deploy LTX Video Generation on K8s

Deploy Lightricks LTX-2.3 image-to-video model on Kubernetes for AI video generation with batch processing and S3 output storage.

ltxvideo-generationimage-to-videolightricks
advanced ⏱ 30 minutes

Deploy MiniMax M2.5 229B on Kubernetes

Deploy MiniMax M2.5 229B model on Kubernetes with vLLM. High-performance LLM with 485K downloads, optimized for multi-turn conversation and long context.

minimaxllmmulti-gputensor-parallelism
advanced ⏱ 25 minutes

Deploy NVIDIA Nemotron 120B MoE on K8s

Deploy NVIDIA Nemotron-3-Super-120B-A12B MoE model on Kubernetes. 120B total parameters with 12B active for enterprise-grade inference.

nemotronnvidiamixture-of-expertsmoe
intermediate ⏱ 20 minutes

Deploy Microsoft Phi-4 on Kubernetes

Deploy Microsoft Phi-4 small language model on Kubernetes with vLLM. Efficient 14B model with GPT-4 level reasoning on a single GPU.

phi-4microsoftsmall-language-modelvllm
intermediate ⏱ 20 minutes

Deploy Phi-4 Reasoning Vision on K8s

Deploy Microsoft Phi-4-reasoning-vision-15B on Kubernetes for multimodal chain-of-thought reasoning with visual understanding on a single GPU.

phi-4microsoftreasoningmultimodal
advanced ⏱ 30 minutes

Deploy Qwen3 235B MoE on Kubernetes

Deploy Alibaba Qwen3-235B-A22B mixture-of-experts model on Kubernetes. Only 22B parameters active per token for efficient 235B-class inference.

qwen3mixture-of-expertsmoellm
advanced ⏱ 25 minutes

Deploy Qwen3 Coder 80B on Kubernetes

Deploy Qwen3-Coder-Next 80B on Kubernetes for code generation, review, and refactoring. Production-ready AI coding assistant with multi-GPU serving.

qwen3code-generationcoding-assistantllm
intermediate ⏱ 15 minutes

Deploy Qwen3 TTS on Kubernetes

Deploy Qwen3-TTS-12Hz-1.7B-CustomVoice on Kubernetes for text-to-speech with custom voice cloning. 1.13M downloads, lightweight single-GPU deployment.

qwen3text-to-speechttsvoice-cloning
intermediate ⏱ 20 minutes

Deploy Qwen3.5 35B MoE on Kubernetes

Deploy Alibaba Qwen3.5-35B-A3B mixture-of-experts multimodal model on Kubernetes. 35B total parameters with only 3B active for ultra-efficient inference.

qwen3.5mixture-of-expertsmoemultimodal
advanced ⏱ 30 minutes

Deploy Qwen3.5 397B MoE on Kubernetes

Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.

qwen3.5mixture-of-expertsmoemultimodal
intermediate ⏱ 20 minutes

Deploy Qwen3.5 9B Multimodal on K8s

Deploy Alibaba Qwen3.5-9B vision-language model on Kubernetes with vLLM. Process images and text with a single GPU deployment.

qwen3.5multimodalvision-languagevllm
advanced ⏱ 25 minutes

RetinaNet Object Detection on K8s

Deploy RetinaNet object detection model on Kubernetes with Triton Inference Server, TensorRT optimization, and batch processing pipelines.

retinanetobject-detectioncomputer-visiontriton
advanced ⏱ 25 minutes

Deploy Sarvam 105B on Kubernetes

Deploy Sarvam 105B multilingual LLM on Kubernetes with vLLM. India's largest open language model with native support for 10+ Indic languages.

sarvammultilingualindic-languagesllm
advanced ⏱ 30 minutes

Stable Diffusion XL on Kubernetes

Deploy Stable Diffusion XL for image generation on Kubernetes with TensorRT acceleration, queued batch processing, and S3 output storage.

stable-diffusionsdxlimage-generationdiffusion
intermediate ⏱ 20 minutes

Deploy Whisper Speech-to-Text on K8s

Deploy OpenAI Whisper for speech-to-text on Kubernetes with faster-whisper, batch transcription Jobs, and real-time streaming endpoints.

whisperspeech-to-texttranscriptionaudio
advanced ⏱ 15 minutes

Distributed Inference on Kubernetes

Deploy distributed LLM inference with tensor parallelism across multiple GPUs and pipeline parallelism across nodes on Kubernetes.

distributed-inferencetensor-parallelismpipeline-parallelismvllm
intermediate ⏱ 15 minutes

GenAI-Perf Benchmark LLM Serving

Benchmark LLM inference endpoints with NVIDIA GenAI-Perf for throughput, latency percentiles, time-to-first-token, and ITL metrics.

genai-perfbenchmarkllminference
intermediate ⏱ 25 minutes

GenAI-Perf Benchmark Triton on K8s

Benchmark NVIDIA Triton Inference Server performance on Kubernetes using GenAI-Perf. Measure TTFT, inter-token latency, throughput, and GPU telemetry.

genai-perftritonbenchmarkingnvidia
advanced ⏱ 15 minutes

Distributed Training with Kubeflow Training Operator

Run multi-node distributed PyTorch and TensorFlow training jobs using Kubeflow Training Operator with NCCL, RDMA, and shared storage.

kubeflowdistributed-trainingpytorchnccl
intermediate ⏱ 15 minutes

Kubeflow Training Operator on Kubernetes

Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.

kubeflowtraining-operatordistributed-trainingpytorch
advanced ⏱ 15 minutes

LeaderWorkerSet Operator for AI Workloads

Deploy distributed AI training with LeaderWorkerSet Operator on Kubernetes and OpenShift for leader-worker topology with gang scheduling.

leaderworkersetlwsdistributed-trainingopenshift
advanced ⏱ 15 minutes

Llama Stack on Kubernetes with NVIDIA NIM

Deploy Meta Llama Stack on Kubernetes for unified inference, RAG, agents, and safety APIs using NVIDIA NIM as the inference backend.

llama-stacknvidia-nimllamainference
advanced ⏱ 15 minutes

MLPerf Benchmarking on Kubernetes

Run MLPerf inference and training benchmarks on Kubernetes GPU clusters to validate AI workload performance and compare hardware configurations.

mlperfbenchmarkinginferencetraining
intermediate ⏱ 25 minutes

Shared Model Caching Across Pods on Kubernetes

Optimize LLM inference startup and reduce storage costs by sharing model weights across pods using emptyDir, hostPath, ReadWriteMany PVCs, and init containers.

model-cachingshared-memorypvcinit-containers
advanced ⏱ 30 minutes

MPI Operator for Distributed Training

Deploy MPI Operator on Kubernetes for distributed GPU training with Horovod and NCCL. Run multi-node MPI jobs natively in Kubernetes pods.

mpimpi-operatordistributed-traininghorovod
advanced ⏱ 30 minutes

Deploy NVIDIA Clara on Kubernetes

Deploy NVIDIA Clara medical AI and drug discovery platform on Kubernetes. Run digital biology and medtech inference workloads with GPU acceleration.

nvidiaclaramedical-aidrug-discovery
advanced ⏱ 15 minutes

NVIDIA H200 GPU Workloads on Kubernetes

Deploy and optimize AI workloads on NVIDIA H200 GPUs with 141GB HBM3e memory for large model inference and training on Kubernetes.

nvidiah200gpuhbm3e
advanced ⏱ 15 minutes

NVIDIA NeMo Training on Kubernetes

Deploy NVIDIA NeMo framework on Kubernetes for large language model pre-training, fine-tuning, and RLHF with multi-node GPU clusters.

nvidianemotrainingllm
advanced ⏱ 30 minutes

NVIDIA Pyxis and Enroot for SLURM

Use NVIDIA Pyxis and Enroot to run GPU containers in SLURM jobs. Bridge SLURM HPC scheduling with container-native AI workloads and NGC images.

pyxisenrootslurmnvidia
advanced ⏱ 15 minutes

Run:ai GPU Quotas on OpenShift

Configure Run:ai scheduler quotas for fair GPU sharing with guaranteed quotas, over-quota borrowing, and per-tenant GPU allocation policies.

runaigpuquotasscheduling
advanced ⏱ 45 minutes

SLURM and Kubernetes Integration

Integrate SLURM HPC workload manager with Kubernetes for hybrid AI and scientific computing. Bridge HPC batch scheduling with container orchestration.

slurmhpcbatch-schedulinggpu
intermediate ⏱ 15 minutes

Time-Slicing vs MIG vs Full GPU Allocation

Compare GPU sharing strategies: time-slicing for notebooks, MIG for isolated inference, and full GPU for training workloads.

time-slicingmiggpu-sharingmulti-tenant
advanced ⏱ 30 minutes

Triton Autoscaling with GPU Metrics

Autoscale Triton Inference Server on Kubernetes using GPU utilization, request queue depth, and inference latency metrics with KEDA and HPA.

tritonautoscalinggpu-metricskeda
advanced ⏱ 35 minutes

Triton Multi-Model Serving on Kubernetes

Serve multiple LLMs simultaneously on Triton Inference Server using TensorRT-LLM and vLLM backends with model routing and GPU scheduling.

tritonmulti-modeltensorrt-llmvllm
advanced ⏱ 45 minutes

Triton TensorRT-LLM on Kubernetes

Deploy NVIDIA Triton Inference Server with TensorRT-LLM backend on Kubernetes for optimized large language model serving with GPU acceleration.

tritontensorrt-llmnvidiainference
intermediate ⏱ 20 minutes

TensorRT-LLM vs vLLM on Triton

Compare TensorRT-LLM and vLLM backends on Triton Inference Server. When to use each, performance benchmarks, and migration strategies.

tritontensorrt-llmvllmcomparison
advanced ⏱ 30 minutes

Triton with vLLM Backend on Kubernetes

Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.

tritonvllmnvidiainference
intermediate ⏱ 30 minutes

Deploying Vector Databases on Kubernetes

Deploy and operate vector databases (Milvus, Weaviate, Qdrant) on Kubernetes for RAG pipelines, semantic search, and AI applications with persistent.

vector-databasemilvusweaviateqdrant
intermediate ⏱ 20 minutes

Compare NCCL Intra-Node vs Inter-Node Perfo...

Build a repeatable comparison between local and cross-node NCCL throughput to validate GPU cluster interconnect scaling and identify bottlenecks early.

ncclintra-nodeinter-nodebenchmarking
intermediate ⏱ 20 minutes

Run NCCL AllGather Benchmarks Model Paralle...

Use all-gather NCCL tests to evaluate GPU communication behavior and throughput for tensor-parallel and model-parallel distributed AI workloads on Kubernetes.

ncclallgatheraimodel-parallel
intermediate ⏱ 20 minutes

Benchmark NCCL AllReduce Performance

Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.

ncclallreducegpubenchmark
intermediate ⏱ 25 minutes

Run NCCL Tests for GPU Network Validation

Benchmark GPU-to-GPU communication using NVIDIA nccl-tests on Kubernetes or OpenShift to validate bandwidth and latency.

ncclnccl-testsgpukubernetes
advanced ⏱ 30 minutes

Deploy Mistral 7B with NVIDIA NIM

Step-by-step guide to deploy Mistral-7B using NVIDIA NIM with TensorRT-LLM backend on Kubernetes for optimized GPU inference.

nvidia-nimtensorrt-llmmistralllm
intermediate ⏱ 30 minutes

Deploy Mistral 7B with vLLM on Kubernetes

Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.

vllmmistralllminference
advanced ⏱ 30 minutes

Autoscale LLM Inference on Kubernetes

Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.

autoscalinghpakedallm
intermediate ⏱ 20 minutes

Quantize LLMs for Efficient GPU Inference

Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.

quantizationgptqawqgguf
intermediate ⏱ 15 minutes

Kubernetes LLM Serving Frameworks Compared

Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes — features, performance, and when to use each.

vllmnvidia-nimtritonollama
advanced ⏱ 30 minutes

Multi-GPU and Tensor Parallel LLM Inference

Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.

multi-gputensor-parallelismpipeline-parallelismllm
intermediate ⏱ 25 minutes

Install NVIDIA GPU Operator on Kubernetes

Deploy the NVIDIA GPU Operator to automate GPU driver, container toolkit, and device plugin management across your Kubernetes cluster.

nvidiagpu-operatorgpudrivers
advanced ⏱ 45 minutes

Build a RAG Pipeline on Kubernetes

Deploy a Retrieval-Augmented Generation pipeline on Kubernetes using a vector database, embedding model, and LLM inference server.

ragretrieval-augmented-generationvector-databaseembeddings
beginner ⏱ 10 minutes

Test LLM Inference Endpoints with curl

Validate Kubernetes-hosted LLM inference services using curl against OpenAI-compatible /v1/models, /v1/completions, and /v1/chat/completions endpoints.

llminferencecurlopenai-api
advanced ⏱ 35 minutes

GPU Sharing and Bin Packing with KAI Scheduler

Maximize GPU utilization with KAI Scheduler GPU sharing, fractional GPUs, and bin packing strategies for Kubernetes AI workloads.

kai-schedulernvidiagpugpu-sharing
intermediate ⏱ 30 minutes

Installing NVIDIA KAI Scheduler for AI Workloads

Deploy KAI Scheduler for optimized GPU resource allocation in Kubernetes AI/ML clusters with hierarchical queues and batch scheduling.

kai-schedulernvidiagpuscheduling
intermediate ⏱ 35 minutes

Hierarchical Queues & Resource Fairness KAI...

Configure hierarchical queues in KAI Scheduler for multi-tenant GPU clusters with quotas, limits, and Dominant Resource Fairness (DRF).

kai-schedulernvidiagpuqueues
advanced ⏱ 40 minutes

Batch Scheduling with PodGroups in KAI Scheduler

Implement gang scheduling for distributed training jobs using KAI Scheduler PodGroups to ensure all-or-nothing pod scheduling.

kai-schedulernvidiagpupodgroups
advanced ⏱ 45 minutes

Topology-Aware Scheduling with KAI Scheduler

Optimize GPU workload placement using KAI Scheduler's Topology-Aware Scheduling (TAS) for NVLink, NVSwitch, and disaggregated serving architectures.

kai-schedulernvidiagputopology
Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens