🤖 AI & GPU
AI/ML on Kubernetes: GPU scheduling, NVIDIA Triton, vLLM, model deployment (Llama, Qwen, Phi-4), distributed training, Kubeflow, NeMo, and inference optimization.
GPU Sharing with MPS and MIG on Kubernetes
Share NVIDIA GPUs across multiple pods using MPS time-slicing and MIG hardware partitioning. Maximize GPU utilization for inference workloads.
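As a minimal sketch of the MIG side of this setup: the pod below requests one MIG instance rather than a whole GPU, assuming the GPU Operator's MIG manager runs in the mixed strategy and the node advertises 1g.10gb slices (profile names vary by GPU model, so adjust the resource name accordingly).

```yaml
# Sketch: request one hardware-isolated MIG slice instead of a full GPU.
# Assumes the "mixed" MIG strategy with 1g.10gb instances exposed on the node.
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-demo
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]     # lists only the MIG instance visible to this pod
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1     # one MIG slice, not nvidia.com/gpu
```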
Node Feature Discovery Operator for Kubernetes
Install and configure Node Feature Discovery (NFD) Operator to auto-detect hardware features like GPUs, NICs, CPU flags, and USB devices on Kubernetes nodes.
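To illustrate what NFD enables beyond its built-in labels, here is a hedged sketch of a custom NodeFeatureRule that labels nodes exposing an NVIDIA PCI device (vendor ID 10de); the rule and label names are examples, not part of a default install.

```yaml
# Sketch: a custom NFD rule that labels nodes with an NVIDIA PCI device present.
# Rule name and label key are illustrative placeholders.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nvidia-gpu-present
spec:
  rules:
  - name: "nvidia pci device"
    labels:
      "feature.node.kubernetes.io/nvidia-gpu.present": "true"
    matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}   # NVIDIA PCI vendor ID
```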
Enable GPUDirect Storage in ClusterPolicy
Enable NVIDIA GPUDirect Storage (GDS) in the GPU Operator ClusterPolicy for direct GPU-to-NVMe data paths. Driver module configuration and verification.
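A minimal sketch of the relevant ClusterPolicy fragment follows; it assumes the GPU Operator's default resource name of cluster-policy and that the node prerequisites for GDS (open kernel modules, compatible NVMe/filesystem stack) are already in place. Merge this into the existing ClusterPolicy rather than creating a new one.

```yaml
# Sketch: enable GPUDirect Storage in the GPU Operator ClusterPolicy.
# gds.enabled deploys the nvidia-fs driver container alongside the GPU driver.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  gds:
    enabled: true
```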
GPU Time-Slicing on Kubernetes
Share GPUs across multiple workloads using NVIDIA time-slicing on Kubernetes. Configure the device plugin, set replica counts, and manage fairness.
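For reference, a minimal time-slicing configuration looks like the sketch below: a ConfigMap consumed by the device plugin (referenced from the ClusterPolicy via devicePlugin.config). The replica count of 4 is an illustrative value, not a recommendation.

```yaml
# Sketch: NVIDIA device plugin time-slicing config.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```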
NVIDIA GPU Operator Setup on Kubernetes
Install and configure NVIDIA GPU Operator on Kubernetes. Driver containers, toolkit, device plugin, DCGM monitoring, and ClusterPolicy setup.
NVIDIA Open GPU + GPUDirect RDMA + DOCA-OFED + SR-IOV Stack
Deploy NVIDIA AI networking on Kubernetes: Open GPU driver with DMA-BUF, GPUDirect RDMA, DOCA-OFED, and SR-IOV VF isolation.
AI Model Storage: hostPath vs PVC for Inference
Deploy AI models on Kubernetes using hostPath and PVC storage. Compare performance, security trade-offs, and production patterns for model serving.
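As a sketch of the PVC side of that comparison: a ReadWriteMany claim that several inference pods can mount read-only. The storage class name is a placeholder for whatever RWX-capable class your cluster provides.

```yaml
# Sketch: shared RWX volume for model weights; size and class are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs-models      # placeholder RWX storage class
  resources:
    requests:
      storage: 200Gi
```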
AIPerf Benchmark LLMs on Kubernetes
Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, and throughput with a real-time dashboard and GPU telemetry.
AIPerf Concurrency Sweep on K8s
Run AIPerf concurrency sweeps on Kubernetes to find optimal LLM serving capacity. Automate 1-128 concurrent user benchmarks with batch Jobs.
AIPerf Multi-Model Benchmark on K8s
Compare multiple LLM models and backends with AIPerf on Kubernetes. Benchmark vLLM vs TGI vs Triton with automated multi-run confidence intervals.
AIPerf Goodput and SLO Benchmarks
Measure LLM goodput with AIPerf on Kubernetes. Define SLOs for TTFT and ITL, calculate effective throughput, and benchmark with timeslice analysis.
Batch AI Workloads with Volcano Scheduler on Kubernetes
Schedule and manage batch AI training and inference jobs using Volcano scheduler with gang scheduling, fair-share queues, job plugins, and preemption on Kubernetes.
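A minimal gang-scheduling sketch: a Volcano Job whose minAvailable equals its replica count, so no worker starts until both can be placed. The image and training command are placeholders.

```yaml
# Sketch: Volcano Job gang-scheduling two GPU workers as a single unit.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-training-demo
spec:
  schedulerName: volcano
  minAvailable: 2              # all-or-nothing: both pods or neither
  queue: default
  tasks:
  - name: worker
    replicas: 2
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.08-py3   # placeholder training image
          command: ["python", "train.py"]            # placeholder entrypoint
          resources:
            limits:
              nvidia.com/gpu: 1
```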
AIPerf Trace Replay Benchmarks on K8s
Replay production traffic traces with AIPerf on Kubernetes. Use moon_cake format, ShareGPT datasets, and fixed schedules for realistic LLM benchmarks.
Dell PowerEdge XE7740 GPU Node Setup
Configure Dell PowerEdge XE7740 GPU nodes with H200 GPUs for OpenShift and Kubernetes, including BIOS, power, cooling, and network setup.
Deploy Fish Audio TTS on Kubernetes
Deploy Fish Audio S2-Pro 5B text-to-speech model on Kubernetes for high-quality voice synthesis with multi-speaker support and streaming audio.
Deploy GLM-5 754B on Kubernetes
Deploy Zhipu AI GLM-5 754B model on Kubernetes with vLLM. One of the largest open-weight models with multi-node tensor parallelism across 8+ GPUs.
Deploy Granite 4.0 Speech on Kubernetes
Deploy the IBM Granite 4.0 1B Speech model on Kubernetes for automatic speech recognition. This lightweight model runs on CPU or a small GPU for STT workloads.
Deploy Kimi K2.5 1.1T MoE on Kubernetes
Deploy Moonshot AI Kimi-K2.5 1.1T MoE multimodal model on Kubernetes. The largest open MoE model with 2.69M downloads for frontier AI tasks.
Deploy Llama 2 70B on Kubernetes
Deploy Meta Llama 2 70B on Kubernetes with multi-GPU tensor parallelism, vLLM serving, and production-ready health checks and resource limits.
Deploy Llama 3.1 8B Instruct on K8s
Deploy Meta Llama 3.1 8B Instruct on Kubernetes with vLLM. Production-ready single-GPU deployment with 128K context, tool calling, and autoscaling.
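As a hedged sketch of the core deployment: a single-GPU vLLM server for the gated Llama 3.1 8B Instruct checkpoint. The Secret name, probe timing, reduced context length, and image tag are assumptions to adjust for your cluster.

```yaml
# Sketch: single-GPU vLLM deployment for Llama 3.1 8B Instruct.
# Assumes an hf-token Secret holding a Hugging Face token for the gated model.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama31-8b
spec:
  replicas: 1
  selector:
    matchLabels: { app: llama31-8b }
  template:
    metadata:
      labels: { app: llama31-8b }
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --max-model-len=16384            # trim context to fit smaller GPUs
        ports:
        - containerPort: 8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef: { name: hf-token, key: token }
        resources:
          limits:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet: { path: /health, port: 8000 }
          initialDelaySeconds: 60
```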
Deploy LTX Video Generation on K8s
Deploy Lightricks LTX-2.3 image-to-video model on Kubernetes for AI video generation with batch processing and S3 output storage.
Deploy MiniMax M2.5 229B on Kubernetes
Deploy MiniMax M2.5 229B model on Kubernetes with vLLM. High-performance LLM with 485K downloads, optimized for multi-turn conversation and long context.
Deploy NVIDIA Nemotron 120B MoE on K8s
Deploy NVIDIA Nemotron-3-Super-120B-A12B MoE model on Kubernetes. 120B total parameters with 12B active for enterprise-grade inference.
Deploy Microsoft Phi-4 on Kubernetes
Deploy Microsoft Phi-4 small language model on Kubernetes with vLLM. Efficient 14B model with GPT-4 level reasoning on a single GPU.
Deploy Phi-4 Reasoning Vision on K8s
Deploy Microsoft Phi-4-reasoning-vision-15B on Kubernetes for multimodal chain-of-thought reasoning with visual understanding on a single GPU.
Deploy Qwen3 235B MoE on Kubernetes
Deploy Alibaba Qwen3-235B-A22B mixture-of-experts model on Kubernetes. Only 22B parameters active per token for efficient 235B-class inference.
Deploy Qwen3 Coder 80B on Kubernetes
Deploy Qwen3-Coder-Next 80B on Kubernetes for code generation, review, and refactoring. Production-ready AI coding assistant with multi-GPU serving.
Deploy Qwen3 TTS on Kubernetes
Deploy Qwen3-TTS-12Hz-1.7B-CustomVoice on Kubernetes for text-to-speech with custom voice cloning. 1.13M downloads, lightweight single-GPU deployment.
Deploy Qwen3.5 35B MoE on Kubernetes
Deploy Alibaba Qwen3.5-35B-A3B mixture-of-experts multimodal model on Kubernetes. 35B total parameters with only 3B active for ultra-efficient inference.
Deploy Qwen3.5 397B MoE on Kubernetes
Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.
Deploy Qwen3.5 9B Multimodal on K8s
Deploy Alibaba Qwen3.5-9B vision-language model on Kubernetes with vLLM. Process images and text with a single GPU deployment.
RetinaNet Object Detection on K8s
Deploy RetinaNet object detection model on Kubernetes with Triton Inference Server, TensorRT optimization, and batch processing pipelines.
Deploy Sarvam 105B on Kubernetes
Deploy Sarvam 105B multilingual LLM on Kubernetes with vLLM. India's largest open language model with native support for 10+ Indic languages.
Stable Diffusion XL on Kubernetes
Deploy Stable Diffusion XL for image generation on Kubernetes with TensorRT acceleration, queued batch processing, and S3 output storage.
Deploy Whisper Speech-to-Text on K8s
Deploy OpenAI Whisper for speech-to-text on Kubernetes with faster-whisper, batch transcription Jobs, and real-time streaming endpoints.
Distributed Inference on Kubernetes
Deploy distributed LLM inference with tensor parallelism across multiple GPUs and pipeline parallelism across nodes on Kubernetes.
GenAI-Perf Benchmark LLM Serving
Benchmark LLM inference endpoints with NVIDIA GenAI-Perf for throughput, latency percentiles, time-to-first-token, and ITL metrics.
GenAI-Perf Benchmark Triton on K8s
Benchmark NVIDIA Triton Inference Server performance on Kubernetes using GenAI-Perf. Measure TTFT, inter-token latency, throughput, and GPU telemetry.
Distributed Training with Kubeflow Training Operator
Run multi-node distributed PyTorch and TensorFlow training jobs using Kubeflow Training Operator with NCCL, RDMA, and shared storage.
Kubeflow Training Operator on Kubernetes
Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.
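A minimal sketch of what the operator runs: a two-pod PyTorchJob in which the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK for torch.distributed initialization. The image and train.py script are placeholders.

```yaml
# Sketch: one-master, one-worker distributed PyTorch job.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch                      # container must be named "pytorch"
            image: nvcr.io/nvidia/pytorch:24.08-py3
            command: ["python", "train.py"]    # placeholder training script
            resources:
              limits: { nvidia.com/gpu: 1 }
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: nvcr.io/nvidia/pytorch:24.08-py3
            command: ["python", "train.py"]
            resources:
              limits: { nvidia.com/gpu: 1 }
```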
LeaderWorkerSet Operator for AI Workloads
Deploy distributed AI training with LeaderWorkerSet Operator on Kubernetes and OpenShift for leader-worker topology with gang scheduling.
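As a sketch of the leader-worker topology: one replica group of size 4 (one leader plus three workers) scheduled and restarted as a unit. Images and commands are placeholders for your own training or serving processes.

```yaml
# Sketch: LeaderWorkerSet with a 4-pod group (1 leader + 3 workers).
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: lws-demo
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 4                                  # total pods per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: nvcr.io/nvidia/pytorch:24.08-py3
          command: ["python", "leader.py"]   # placeholder
          resources:
            limits: { nvidia.com/gpu: 1 }
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: nvcr.io/nvidia/pytorch:24.08-py3
          command: ["python", "worker.py"]   # placeholder
          resources:
            limits: { nvidia.com/gpu: 1 }
```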
Llama Stack on Kubernetes with NVIDIA NIM
Deploy Meta Llama Stack on Kubernetes for unified inference, RAG, agents, and safety APIs using NVIDIA NIM as the inference backend.
MLPerf Benchmarking on Kubernetes
Run MLPerf inference and training benchmarks on Kubernetes GPU clusters to validate AI workload performance and compare hardware configurations.
Shared Model Caching Across Pods on Kubernetes
Optimize LLM inference startup and reduce storage costs by sharing model weights across pods using emptyDir, hostPath, ReadWriteMany PVCs, and init containers.
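One variant of the PVC approach, sketched below: all replicas mount a shared ReadWriteMany claim at the default Hugging Face cache path, so weights downloaded by the first pod are reused by the rest. The claim name and model ID are placeholders, and the RWX PVC is assumed to exist already.

```yaml
# Sketch: vLLM replicas sharing one RWX model cache.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-shared-cache
spec:
  replicas: 3
  selector:
    matchLabels: { app: vllm-shared-cache }
  template:
    metadata:
      labels: { app: vllm-shared-cache }
    spec:
      volumes:
      - name: hf-cache
        persistentVolumeClaim:
          claimName: shared-model-cache          # pre-created RWX PVC
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model=mistralai/Mistral-7B-Instruct-v0.3"]   # placeholder model
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface    # default HF hub cache location
        resources:
          limits: { nvidia.com/gpu: 1 }
```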
MPI Operator for Distributed Training
Deploy MPI Operator on Kubernetes for distributed GPU training with Horovod and NCCL. Run multi-node MPI jobs natively in Kubernetes pods.
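A minimal MPIJob sketch follows: one launcher driving two GPU workers. The image and the mpirun command line are placeholders for your own Horovod or NCCL-based training script.

```yaml
# Sketch: MPIJob with a launcher and two single-GPU workers.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpi-demo
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: nvcr.io/nvidia/pytorch:24.08-py3
            command: ["mpirun", "-np", "2", "python", "train_hvd.py"]   # placeholder
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: nvcr.io/nvidia/pytorch:24.08-py3
            resources:
              limits: { nvidia.com/gpu: 1 }
```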
Deploy NVIDIA Clara on Kubernetes
Deploy NVIDIA Clara medical AI and drug discovery platform on Kubernetes. Run digital biology and medtech inference workloads with GPU acceleration.
NVIDIA H200 GPU Workloads on Kubernetes
Deploy and optimize AI workloads on NVIDIA H200 GPUs with 141GB HBM3e memory for large model inference and training on Kubernetes.
NVIDIA H300 GPU Workloads on Kubernetes
Prepare for NVIDIA H300 Blackwell-Next GPUs on Kubernetes with next-gen HBM3e memory, NVLink 5.0, and FP4 inference capabilities.
NVIDIA NeMo Training on Kubernetes
Deploy NVIDIA NeMo framework on Kubernetes for large language model pre-training, fine-tuning, and RLHF with multi-node GPU clusters.
NVIDIA Pyxis and Enroot for SLURM
Use NVIDIA Pyxis and Enroot to run GPU containers in SLURM jobs. Bridge SLURM HPC scheduling with container-native AI workloads and NGC images.
Run:AI GPU Quotas on OpenShift
Configure Run:AI scheduler quotas for fair GPU sharing with guaranteed quotas, over-quota borrowing, and per-tenant GPU allocation policies.
SLURM and Kubernetes Integration
Integrate SLURM HPC workload manager with Kubernetes for hybrid AI and scientific computing. Bridge HPC batch scheduling with container orchestration.
Time-Slicing vs MIG vs Full GPU Allocation
Compare GPU sharing strategies: time-slicing for notebooks, MIG for isolated inference, and full GPU for training workloads.
Triton Autoscaling with GPU Metrics
Autoscale Triton Inference Server on Kubernetes using GPU utilization, request queue depth, and inference latency metrics with KEDA and HPA.
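A sketch of the KEDA piece: scale an existing Triton Deployment on mean GPU utilization reported by the DCGM exporter through Prometheus. The Prometheus address, Deployment name, and 70% threshold are assumptions for your environment; a queue-depth or latency query can be substituted in the same trigger.

```yaml
# Sketch: KEDA ScaledObject scaling Triton on DCGM GPU utilization.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-gpu-scaler
spec:
  scaleTargetRef:
    name: triton-server                # existing Triton Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-server.*"})
      threshold: "70"
```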
Triton Multi-Model Serving on Kubernetes
Serve multiple LLMs simultaneously on Triton Inference Server using TensorRT-LLM and vLLM backends with model routing and GPU scheduling.
Triton TensorRT-LLM on Kubernetes
Deploy NVIDIA Triton Inference Server with TensorRT-LLM backend on Kubernetes for optimized large language model serving with GPU acceleration.
TensorRT-LLM vs vLLM on Triton
Compare TensorRT-LLM and vLLM backends on Triton Inference Server. When to use each, performance benchmarks, and migration strategies.
Triton with vLLM Backend on Kubernetes
Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.
Deploying Vector Databases on Kubernetes
Deploy and operate vector databases (Milvus, Weaviate, Qdrant) on Kubernetes for RAG pipelines, semantic search, and AI applications with persistent storage.
Compare NCCL Intra-Node vs Inter-Node Performance
Build a repeatable comparison between local and cross-node NCCL throughput to validate GPU cluster interconnect scaling and identify bottlenecks early.
Run NCCL AllGather Benchmarks for Model Parallel Validation
Use all-gather NCCL tests to evaluate GPU communication behavior and throughput for tensor-parallel and model-parallel distributed AI workloads on Kubernetes.
Benchmark NCCL AllReduce Performance on Kubernetes
Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.
Run NCCL Tests on Kubernetes for GPU Network Validation
Benchmark GPU-to-GPU communication using NVIDIA nccl-tests on Kubernetes or OpenShift to validate bandwidth and latency.
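As a single-node sketch: a Job running all_reduce_perf across 8 GPUs, sweeping message sizes from 8 bytes to 8 GB. nccl-tests does not ship as an official image, so nccl-tests:latest stands in for an image you build from github.com/NVIDIA/nccl-tests.

```yaml
# Sketch: intra-node NCCL AllReduce benchmark on 8 GPUs.
apiVersion: batch/v1
kind: Job
metadata:
  name: nccl-allreduce
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: nccl
        image: nccl-tests:latest            # placeholder, build from nccl-tests repo
        command: ["all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", "8"]
        resources:
          limits:
            nvidia.com/gpu: 8
```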
Deploy Mistral 7B with NVIDIA NIM on Kubernetes
Step-by-step guide to deploy Mistral-7B using NVIDIA NIM with TensorRT-LLM backend on Kubernetes for optimized GPU inference.
Deploy Mistral 7B with vLLM on Kubernetes
Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.
Autoscale LLM Inference on Kubernetes
Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.
Quantize LLMs for Efficient GPU Inference on Kubernetes
Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.
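A hedged sketch of the AWQ case: vLLM loading a community INT4 export so a 7B model fits comfortably on a small GPU. The model ID is an example quantized artifact; substitute the checkpoint you actually use.

```yaml
# Sketch: serving an AWQ-quantized checkpoint with vLLM on one small GPU.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-awq-demo
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
    - --model=TheBloke/Mistral-7B-Instruct-v0.2-AWQ   # example AWQ export
    - --quantization=awq                               # load INT4 AWQ weights
    ports:
    - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
```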
Kubernetes LLM Serving Frameworks Compared
Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes: features, performance, and when to use each.
Multi-GPU and Tensor Parallel LLM Inference on Kubernetes
Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.
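The vLLM half of this pattern, sketched below: a 4-way tensor-parallel pod whose GPUs must all land on one node, with a generous /dev/shm for NCCL. The model ID and shared-memory size are placeholders.

```yaml
# Sketch: 4-way tensor-parallel vLLM serving on a single multi-GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-tp4
spec:
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 16Gi
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
    - --model=meta-llama/Llama-3.1-70B-Instruct   # placeholder large model
    - --tensor-parallel-size=4                     # shard each layer across 4 GPUs
    volumeMounts:
    - name: shm
      mountPath: /dev/shm                          # NCCL needs ample shared memory
    resources:
      limits:
        nvidia.com/gpu: 4
```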
Install NVIDIA GPU Operator on Kubernetes
Deploy the NVIDIA GPU Operator to automate GPU driver, container toolkit, and device plugin management across your Kubernetes cluster.
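Once the operator is installed, a throwaway pod like the sketch below confirms the stack end to end: if the driver and device plugin are healthy, the pod schedules and its logs show the node's GPUs.

```yaml
# Sketch: smoke-test pod that runs nvidia-smi on one allocated GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```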
Build a RAG Pipeline on Kubernetes
Deploy a Retrieval-Augmented Generation pipeline on Kubernetes using a vector database, embedding model, and LLM inference server.
Test LLM Inference Endpoints with curl
Validate Kubernetes-hosted LLM inference services using curl against OpenAI-compatible /v1/models, /v1/completions, and /v1/chat/completions endpoints.
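An in-cluster variant of this check, sketched as a Job so it runs against the Service DNS name directly; the service name, port, model ID, and prompt are placeholders.

```yaml
# Sketch: in-cluster curl smoke test of an OpenAI-compatible LLM service.
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-endpoint-check
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: curl
        image: curlimages/curl:latest
        command:
        - sh
        - -c
        - |
          curl -sf http://llama31-8b:8000/v1/models && \
          curl -sf http://llama31-8b:8000/v1/chat/completions \
            -H 'Content-Type: application/json' \
            -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Say hello"}],"max_tokens":16}'
```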
GPU Sharing and Bin Packing with KAI Scheduler
Maximize GPU utilization with KAI Scheduler GPU sharing, fractional GPUs, and bin packing strategies for Kubernetes AI workloads.
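As a heavily hedged sketch of fractional allocation: a pod that asks the KAI Scheduler for half a GPU via a gpu-fraction annotation and references a queue through a label. The scheduler name, annotation key, and queue label key are assumptions about a typical KAI install; verify them against the release you deploy.

```yaml
# Sketch (assumed keys): fractional-GPU pod scheduled by KAI Scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-demo
  labels:
    kai.scheduler/queue: team-a        # assumed queue label key and queue name
  annotations:
    gpu-fraction: "0.5"                # assumed fractional-GPU annotation
spec:
  schedulerName: kai-scheduler         # assumed scheduler name
  containers:
  - name: app
    image: vllm/vllm-openai:latest     # placeholder inference image
    args:
    - --model=mistralai/Mistral-7B-Instruct-v0.3
    - --gpu-memory-utilization=0.45    # stay within the half-GPU share
```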
Installing NVIDIA KAI Scheduler for AI Workloads
Deploy KAI Scheduler for optimized GPU resource allocation in Kubernetes AI/ML clusters with hierarchical queues and batch scheduling.
Batch Scheduling with PodGroups in KAI Scheduler
Implement gang scheduling for distributed training jobs using KAI Scheduler PodGroups to ensure all-or-nothing pod scheduling.
Hierarchical Queues and Resource Fairness with KAI Scheduler
Configure hierarchical queues in KAI Scheduler for multi-tenant GPU clusters with quotas, limits, and Dominant Resource Fairness (DRF).
Topology-Aware Scheduling with KAI Scheduler
Optimize GPU workload placement using KAI Scheduler's Topology-Aware Scheduling (TAS) for NVLink, NVSwitch, and disaggregated serving architectures.