📚 Book Signing at KubeCon EU 2026 — meet us at Booking.com HQ (Mon, 18:30–21:00) and at the vCluster booth #521 (Tue 24 Mar, 12:30–13:30). Free book giveaway! RSVP: Booking.com event

🤖 AI & ML Recipes

Run AI and ML workloads on Kubernetes with GPU scheduling, NVIDIA KAI Scheduler, model serving frameworks, and distributed training patterns.

71 recipes available

Intermediate

AI Model Storage: hostPath vs PVC for Inference

Deploy AI models on Kubernetes using hostPath and PersistentVolumeClaim storage. Compare performance, security trade-offs, and production patterns for model serving.

⏱ 30 minutes K8s 1.28+

AIPerf Benchmark LLMs on Kubernetes

Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, throughput with real-time dashboard and GPU telemetry.

⏱ 20 minutes K8s 1.28+

Deploy Fish Audio TTS on Kubernetes

Deploy Fish Audio S2-Pro 5B text-to-speech model on Kubernetes for high-quality voice synthesis with multi-speaker support and streaming audio.

⏱ 20 minutes K8s 1.28+

Deploy Llama 3.1 8B Instruct on K8s

Deploy Meta Llama 3.1 8B Instruct on Kubernetes with vLLM. Production-ready single-GPU deployment with 128K context, tool calling, and autoscaling.

⏱ 15 minutes K8s 1.28+
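As a rough sketch of what a recipe like this builds toward, a single-GPU vLLM deployment typically starts from a manifest along these lines (the image tag, model ID, and resource values here are illustrative, not the recipe's exact manifest):

```yaml
# Sketch: single-GPU vLLM serving an instruct model via the OpenAI-compatible API.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama31-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama31-8b
  template:
    metadata:
      labels:
        app: llama31-8b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest        # pin a specific tag in production
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
        ports:
        - containerPort: 8000                 # OpenAI-compatible HTTP endpoint
        resources:
          limits:
            nvidia.com/gpu: 1                 # requires the NVIDIA device plugin
```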

Deploy Microsoft Phi-4 on Kubernetes

Deploy Microsoft Phi-4 small language model on Kubernetes with vLLM. Efficient 14B model with GPT-4 level reasoning on a single GPU.

⏱ 20 minutes K8s 1.28+

Deploy Phi-4 Reasoning Vision on K8s

Deploy Microsoft Phi-4-reasoning-vision-15B on Kubernetes for multimodal chain-of-thought reasoning with visual understanding on a single GPU.

⏱ 20 minutes K8s 1.28+

Deploy Qwen3 TTS on Kubernetes

Deploy Qwen3-TTS-12Hz-1.7B-CustomVoice on Kubernetes for text-to-speech with custom voice cloning. 1.13M downloads, lightweight single-GPU deployment.

⏱ 15 minutes K8s 1.28+

Deploy Qwen3.5 35B MoE on Kubernetes

Deploy Alibaba Qwen3.5-35B-A3B mixture-of-experts multimodal model on Kubernetes. 35B total parameters with only 3B active for ultra-efficient inference.

⏱ 20 minutes K8s 1.28+

Deploy Qwen3.5 9B Multimodal on K8s

Deploy Alibaba Qwen3.5-9B vision-language model on Kubernetes with vLLM. Process images and text with a single GPU deployment.

⏱ 20 minutes K8s 1.28+

Deploy Whisper Speech-to-Text on K8s

Deploy OpenAI Whisper for speech-to-text on Kubernetes with faster-whisper, batch transcription Jobs, and real-time streaming endpoints.

⏱ 20 minutes K8s 1.28+

GenAI-Perf Benchmark LLM Serving

Benchmark LLM inference endpoints with NVIDIA GenAI-Perf for throughput, latency percentiles, time-to-first-token, and ITL metrics.

⏱ 15 minutes K8s 1.28+

GenAI-Perf Benchmark Triton on K8s

Benchmark NVIDIA Triton Inference Server performance on Kubernetes using GenAI-Perf. Measure TTFT, inter-token latency, throughput, and GPU telemetry.

⏱ 25 minutes K8s 1.28+

Kubeflow Training Operator on Kubernetes

Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.

⏱ 15 minutes K8s 1.28+

Shared Model Caching Across Pods on Kubernetes

Optimize LLM inference startup and reduce storage costs by sharing model weights across pods using emptyDir, hostPath, ReadWriteMany PVCs, and init containers.

⏱ 25 minutes K8s 1.31+
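The ReadWriteMany approach in this recipe hinges on a shared claim that many pods mount read-only. A minimal sketch, assuming your cluster has an RWX-capable storage class (the class name and size below are placeholders):

```yaml
# Sketch: a shared PVC holding model weights, mountable by many inference pods at once.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteMany"]     # requires an RWX-capable provisioner (NFS, CephFS, ...)
  storageClassName: nfs-client       # assumption: substitute your RWX storage class
  resources:
    requests:
      storage: 50Gi                  # size to fit your model weights
```

Each serving pod then mounts `model-cache` (typically `readOnly: true`) instead of downloading weights on every cold start.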

Time-Slicing vs MIG vs Full GPU Allocation

Compare GPU sharing strategies: time-slicing for notebooks, MIG for isolated inference, and full GPU for training workloads.

⏱ 15 minutes K8s 1.28+
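Of the three strategies, time-slicing is the one driven purely by configuration. With the GPU Operator, it is commonly enabled through a device-plugin ConfigMap shaped roughly like this (the `replicas` count and config key are illustrative; check your operator version's docs for the exact schema):

```yaml
# Sketch: GPU Operator time-slicing config — each physical GPU is advertised
# as 4 schedulable nvidia.com/gpu resources (no memory isolation between tenants).
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4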

TensorRT-LLM vs vLLM on Triton

Compare TensorRT-LLM and vLLM backends on Triton Inference Server. When to use each, performance benchmarks, and migration strategies.

⏱ 20 minutes K8s 1.28+

Deploying Vector Databases on Kubernetes

Deploy and operate vector databases (Milvus, Weaviate, Qdrant) on Kubernetes for RAG pipelines, semantic search, and AI applications with persistent storage.

⏱ 30 minutes K8s 1.31+

Compare NCCL Intra-Node vs Inter-Node Performance

Build a repeatable comparison between local and cross-node NCCL throughput to validate GPU cluster interconnect scaling and identify bottlenecks early.

⏱ 20 minutes K8s 1.28+

Run NCCL AllGather Benchmarks for Model Parallel Validation

Use all-gather NCCL tests to evaluate GPU communication behavior and throughput for tensor-parallel and model-parallel distributed AI workloads on Kubernetes.

⏱ 20 minutes K8s 1.28+

Benchmark NCCL AllReduce Performance on Kubernetes

Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.

⏱ 20 minutes K8s 1.28+

Run NCCL Tests on Kubernetes for GPU Network Validation

Benchmark GPU-to-GPU communication using NVIDIA nccl-tests on Kubernetes or OpenShift to validate bandwidth and latency.

⏱ 25 minutes K8s 1.28+

Deploy Mistral 7B with vLLM on Kubernetes

Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.

⏱ 30 minutes K8s 1.28+

Quantize LLMs for Efficient GPU Inference on Kubernetes

Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.

⏱ 20 minutes K8s 1.28+
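In practice, switching to a quantized model is mostly a change to the serving container's arguments. A hedged fragment, assuming vLLM and an AWQ checkpoint (the model ID below is a placeholder for whichever quantized repo you use):

```yaml
# Fragment: container args for serving an AWQ-quantized model with vLLM.
args:
  - "--model"
  - "someorg/Mistral-7B-Instruct-AWQ"   # illustrative quantized model ID
  - "--quantization"
  - "awq"                               # or gptq, depending on the checkpoint
  - "--gpu-memory-utilization"
  - "0.90"
```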

Kubernetes LLM Serving Frameworks Compared

Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes — features, performance, and when to use each.

⏱ 15 minutes K8s 1.28+

Install NVIDIA GPU Operator on Kubernetes

Deploy the NVIDIA GPU Operator to automate GPU driver, container toolkit, and device plugin management across your Kubernetes cluster.

⏱ 25 minutes K8s 1.28+

Installing NVIDIA KAI Scheduler for AI Workloads

Deploy KAI Scheduler for optimized GPU resource allocation in Kubernetes AI/ML clusters with hierarchical queues and batch scheduling.

⏱ 30 minutes K8s 1.28+

Hierarchical Queues and Resource Fairness with KAI Scheduler

Configure hierarchical queues in KAI Scheduler for multi-tenant GPU clusters with quotas, limits, and Dominant Resource Fairness (DRF).

⏱ 35 minutes K8s 1.28+

Advanced

AIPerf Concurrency Sweep on K8s

Run AIPerf concurrency sweeps on Kubernetes to find optimal LLM serving capacity. Automate 1-128 concurrent user benchmarks with batch Jobs.

⏱ 30 minutes K8s 1.28+

AIPerf Multi-Model Benchmark on K8s

Compare multiple LLM models and backends with AIPerf on Kubernetes. Benchmark vLLM vs TGI vs Triton with automated multi-run confidence intervals.

⏱ 30 minutes K8s 1.28+

AIPerf Goodput and SLO Benchmarks

Measure LLM goodput with AIPerf on Kubernetes. Define SLOs for TTFT and ITL, calculate effective throughput, and benchmark with timeslice analysis.

⏱ 25 minutes K8s 1.28+

Batch AI Workloads with Volcano Scheduler on Kubernetes

Schedule and manage batch AI training and inference jobs using Volcano scheduler with gang scheduling, fair-share queues, job plugins, and preemption on Kubernetes.

⏱ 35 minutes K8s 1.31+

AIPerf Trace Replay Benchmarks on K8s

Replay production traffic traces with AIPerf on Kubernetes. Use moon_cake format, ShareGPT datasets, and fixed schedules for realistic LLM benchmarks.

⏱ 25 minutes K8s 1.28+

Dell PowerEdge XE7740 GPU Node Setup

Configure Dell PowerEdge XE7740 GPU nodes with H200 GPUs for OpenShift and Kubernetes including BIOS, power, cooling, and network setup.

⏱ 15 minutes K8s 1.28+

Deploy GLM-5 754B on Kubernetes

Deploy Zhipu AI GLM-5 754B model on Kubernetes with vLLM. One of the largest open-weight models with multi-node tensor parallelism across 8+ GPUs.

⏱ 45 minutes K8s 1.28+

Deploy Llama 2 70B on Kubernetes

Deploy Meta Llama 2 70B on Kubernetes with multi-GPU tensor parallelism, vLLM serving, and production-ready health checks and resource limits.

⏱ 30 minutes K8s 1.28+

Deploy Kimi K2.5 1.1T MoE on Kubernetes

Deploy Moonshot AI Kimi-K2.5 1.1T MoE multimodal model on Kubernetes. The largest open MoE model with 2.69M downloads for frontier AI tasks.

⏱ 45 minutes K8s 1.28+

Deploy LTX Video Generation on K8s

Deploy Lightricks LTX-2.3 image-to-video model on Kubernetes for AI video generation with batch processing and S3 output storage.

⏱ 25 minutes K8s 1.28+

Deploy MiniMax M2.5 229B on Kubernetes

Deploy MiniMax M2.5 229B model on Kubernetes with vLLM. High-performance LLM with 485K downloads, optimized for multi-turn conversation and long context.

⏱ 30 minutes K8s 1.28+

Deploy NVIDIA Nemotron 120B MoE on K8s

Deploy NVIDIA Nemotron-3-Super-120B-A12B MoE model on Kubernetes. 120B total parameters with 12B active for enterprise-grade inference.

⏱ 25 minutes K8s 1.28+

Deploy Qwen3 235B MoE on Kubernetes

Deploy Alibaba Qwen3-235B-A22B mixture-of-experts model on Kubernetes. Only 22B parameters active per token for efficient 235B-class inference.

⏱ 30 minutes K8s 1.28+

Deploy Qwen3 Coder 80B on Kubernetes

Deploy Qwen3-Coder-Next 80B on Kubernetes for code generation, review, and refactoring. Production-ready AI coding assistant with multi-GPU serving.

⏱ 25 minutes K8s 1.28+

Deploy Qwen3.5 397B MoE on Kubernetes

Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.

⏱ 30 minutes K8s 1.28+

RetinaNet Object Detection on K8s

Deploy RetinaNet object detection model on Kubernetes with Triton Inference Server, TensorRT optimization, and batch processing pipelines.

⏱ 25 minutes K8s 1.28+

Deploy Sarvam 105B on Kubernetes

Deploy Sarvam 105B multilingual LLM on Kubernetes with vLLM. India's largest open language model with native support for 10+ Indic languages.

⏱ 25 minutes K8s 1.28+

Stable Diffusion XL on Kubernetes

Deploy Stable Diffusion XL for image generation on Kubernetes with TensorRT acceleration, queued batch processing, and S3 output storage.

⏱ 30 minutes K8s 1.28+

Distributed Inference on Kubernetes

Deploy distributed LLM inference with tensor parallelism across multiple GPUs and pipeline parallelism across nodes on Kubernetes.

⏱ 15 minutes K8s 1.28+
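Tensor parallelism within a node is usually just a matter of requesting the GPUs and telling the engine how to shard. A minimal fragment, assuming vLLM (model ID and GPU count are illustrative):

```yaml
# Fragment: vLLM container sharding one model across 4 GPUs on the same node.
args:
  - "--model"
  - "meta-llama/Llama-2-70b-chat-hf"   # illustrative large model
  - "--tensor-parallel-size"
  - "4"                                # must match the GPU limit below
resources:
  limits:
    nvidia.com/gpu: 4
```

Pipeline parallelism across nodes adds a launcher layer (e.g. Ray or LeaderWorkerSet) on top of this, which the recipe covers.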

Distributed Training with Kubeflow Training Operator

Run multi-node distributed PyTorch and TensorFlow training jobs using Kubeflow Training Operator with NCCL, RDMA, and shared storage.

⏱ 15 minutes K8s 1.28+

LeaderWorkerSet Operator for AI Workloads

Deploy distributed AI training with LeaderWorkerSet Operator on Kubernetes and OpenShift for leader-worker topology with gang scheduling.

⏱ 15 minutes K8s 1.28+

Llama Stack on Kubernetes with NVIDIA NIM

Deploy Meta Llama Stack on Kubernetes for unified inference, RAG, agents, and safety APIs using NVIDIA NIM as the inference backend.

⏱ 15 minutes K8s 1.28+

MLPerf Benchmarking on Kubernetes

Run MLPerf inference and training benchmarks on Kubernetes GPU clusters to validate AI workload performance and compare hardware configurations.

⏱ 15 minutes K8s 1.28+

MPI Operator for Distributed Training

Deploy MPI Operator on Kubernetes for distributed GPU training with Horovod and NCCL. Run multi-node MPI jobs natively in Kubernetes pods.

⏱ 30 minutes K8s 1.28+

Deploy NVIDIA Clara on Kubernetes

Deploy NVIDIA Clara medical AI and drug discovery platform on Kubernetes. Run digital biology and medtech inference workloads with GPU acceleration.

⏱ 30 minutes K8s 1.28+

NVIDIA H200 GPU Workloads on Kubernetes

Deploy and optimize AI workloads on NVIDIA H200 GPUs with 141GB HBM3e memory for large model inference and training on Kubernetes.

⏱ 15 minutes K8s 1.28+

NVIDIA H300 GPU Workloads on Kubernetes

Prepare for NVIDIA H300 Blackwell-Next GPUs on Kubernetes with next-gen HBM3e memory, NVLink 5.0, and FP4 inference capabilities.

⏱ 15 minutes K8s 1.28+

NVIDIA NeMo Training on Kubernetes

Deploy NVIDIA NeMo framework on Kubernetes for large language model pre-training, fine-tuning, and RLHF with multi-node GPU clusters.

⏱ 15 minutes K8s 1.28+

NVIDIA Pyxis and Enroot for SLURM

Use NVIDIA Pyxis and Enroot to run GPU containers in SLURM jobs. Bridge SLURM HPC scheduling with container-native AI workloads and NGC images.

⏱ 30 minutes K8s 1.28+

Run:AI GPU Quotas on OpenShift

Configure Run:AI scheduler quotas for fair GPU sharing with guaranteed, over-quota borrowing, and per-tenant GPU allocation policies.

⏱ 15 minutes K8s 1.28+

SLURM and Kubernetes Integration

Integrate SLURM HPC workload manager with Kubernetes for hybrid AI and scientific computing. Bridge HPC batch scheduling with container orchestration.

⏱ 45 minutes K8s 1.28+

Triton Autoscaling with GPU Metrics

Autoscale Triton Inference Server on Kubernetes using GPU utilization, request queue depth, and inference latency metrics with KEDA and HPA.

⏱ 30 minutes K8s 1.28+

Triton Multi-Model Serving on Kubernetes

Serve multiple LLMs simultaneously on Triton Inference Server using TensorRT-LLM and vLLM backends with model routing and GPU scheduling.

⏱ 35 minutes K8s 1.28+

Triton TensorRT-LLM on Kubernetes

Deploy NVIDIA Triton Inference Server with TensorRT-LLM backend on Kubernetes for optimized large language model serving with GPU acceleration.

⏱ 45 minutes K8s 1.28+

Triton with vLLM Backend on Kubernetes

Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.

⏱ 30 minutes K8s 1.28+

Deploy Mistral 7B with NVIDIA NIM on Kubernetes

Step-by-step guide to deploy Mistral-7B using NVIDIA NIM with TensorRT-LLM backend on Kubernetes for optimized GPU inference.

⏱ 30 minutes K8s 1.28+

Autoscale LLM Inference on Kubernetes

Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.

⏱ 30 minutes K8s 1.28+
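GPU utilization alone is a poor scaling signal for LLM serving (a busy GPU can still have an empty queue), so queue depth is the more common trigger. A sketch of an `autoscaling/v2` HPA on a custom per-pod metric — the target Deployment name and metric name are assumptions, and the metric must be exposed through a metrics adapter such as prometheus-adapter:

```yaml
# Sketch: scale out when the average per-pod request queue exceeds 10.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server                       # illustrative target Deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting    # assumption: exposed via a metrics adapter
      target:
        type: AverageValue
        averageValue: "10"
```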

Multi-GPU and Tensor Parallel LLM Inference on Kubernetes

Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.

⏱ 30 minutes K8s 1.28+

Build a RAG Pipeline on Kubernetes

Deploy a Retrieval-Augmented Generation pipeline on Kubernetes using a vector database, embedding model, and LLM inference server.

⏱ 45 minutes K8s 1.28+

GPU Sharing and Bin Packing with KAI Scheduler

Maximize GPU utilization with KAI Scheduler GPU sharing, fractional GPUs, and bin packing strategies for Kubernetes AI workloads.

⏱ 35 minutes K8s 1.28+

Batch Scheduling with PodGroups in KAI Scheduler

Implement gang scheduling for distributed training jobs using KAI Scheduler PodGroups to ensure all-or-nothing pod scheduling.

⏱ 40 minutes K8s 1.28+

Topology-Aware Scheduling with KAI Scheduler

Optimize GPU workload placement using KAI Scheduler's Topology-Aware Scheduling (TAS) for NVLink, NVSwitch, and disaggregated serving architectures.

⏱ 45 minutes K8s 1.28+

Want more AI & ML patterns?

Our book includes an entire chapter dedicated to AI & ML, with dozens more examples.

📖 Explore All Chapters