Run AI and ML workloads on Kubernetes with GPU scheduling, NVIDIA KAI Scheduler, model serving frameworks, and distributed training patterns.
71 recipes available
Deploy IBM Granite 4.0 1B Speech model on Kubernetes for automatic speech recognition. Lightweight 2B model runs on CPU or small GPU for STT workloads.
Validate Kubernetes-hosted LLM inference services using curl against OpenAI-compatible /v1/models, /v1/completions, and /v1/chat/completions endpoints.
Deploy AI models on Kubernetes using hostPath and PersistentVolumeClaim storage. Compare performance, security trade-offs, and production patterns for model serving.
Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, throughput with real-time dashboard and GPU telemetry.
Deploy Fish Audio S2-Pro 5B text-to-speech model on Kubernetes for high-quality voice synthesis with multi-speaker support and streaming audio.
Deploy Meta Llama 3.1 8B Instruct on Kubernetes with vLLM. Production-ready single-GPU deployment with 128K context, tool calling, and autoscaling.
Deploy Microsoft Phi-4 small language model on Kubernetes with vLLM. Efficient 14B model with GPT-4 level reasoning on a single GPU.
Deploy Microsoft Phi-4-reasoning-vision-15B on Kubernetes for multimodal chain-of-thought reasoning with visual understanding on a single GPU.
Deploy Qwen3-TTS-12Hz-1.7B-CustomVoice on Kubernetes for text-to-speech with custom voice cloning. 1.13M downloads, lightweight single-GPU deployment.
Deploy Alibaba Qwen3.5-35B-A3B mixture-of-experts multimodal model on Kubernetes. 35B total parameters with only 3B active for ultra-efficient inference.
Deploy Alibaba Qwen3.5-9B vision-language model on Kubernetes with vLLM. Process images and text with a single GPU deployment.
Deploy OpenAI Whisper for speech-to-text on Kubernetes with faster-whisper, batch transcription Jobs, and real-time streaming endpoints.
Benchmark LLM inference endpoints with NVIDIA GenAI-Perf for throughput, latency percentiles, time-to-first-token, and ITL metrics.
Benchmark NVIDIA Triton Inference Server performance on Kubernetes using GenAI-Perf. Measure TTFT, inter-token latency, throughput, and GPU telemetry.
Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.
Optimize LLM inference startup and reduce storage costs by sharing model weights across pods using emptyDir, hostPath, ReadWriteMany PVCs, and init.
Compare GPU sharing strategies: time-slicing for notebooks, MIG for isolated inference, and full GPU for training workloads.
Compare TensorRT-LLM and vLLM backends on Triton Inference Server. When to use each, performance benchmarks, and migration strategies.
Deploy and operate vector databases (Milvus, Weaviate, Qdrant) on Kubernetes for RAG pipelines, semantic search, and AI applications with persistent.
Build a repeatable comparison between local and cross-node NCCL throughput to validate GPU cluster interconnect scaling and identify bottlenecks early.
Use all-gather NCCL tests to evaluate GPU communication behavior and throughput for tensor-parallel and model-parallel distributed AI workloads on Kubernetes.
Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.
Benchmark GPU-to-GPU communication using NVIDIA nccl-tests on Kubernetes or OpenShift to validate bandwidth and latency.
Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.
Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.
Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes — features, performance, and when to use each.
Deploy the NVIDIA GPU Operator to automate GPU driver, container toolkit, and device plugin management across your Kubernetes cluster.
Deploy KAI Scheduler for optimized GPU resource allocation in Kubernetes AI/ML clusters with hierarchical queues and batch scheduling
Configure hierarchical queues in KAI Scheduler for multi-tenant GPU clusters with quotas, limits, and Dominant Resource Fairness (DRF)
Run AIPerf concurrency sweeps on Kubernetes to find optimal LLM serving capacity. Automate 1-128 concurrent user benchmarks with batch Jobs.
Compare multiple LLM models and backends with AIPerf on Kubernetes. Benchmark vLLM vs TGI vs Triton with automated multi-run confidence intervals.
Measure LLM goodput with AIPerf on Kubernetes. Define SLOs for TTFT and ITL, calculate effective throughput, and benchmark with timeslice analysis.
Schedule and manage batch AI training and inference jobs using Volcano scheduler with gang scheduling, fair-share queues, job plugins, and preemption on.
Replay production traffic traces with AIPerf on Kubernetes. Use moon_cake format, ShareGPT datasets, and fixed schedules for realistic LLM benchmarks.
Configure Dell PowerEdge XE7740 GPU nodes with H200 GPUs for OpenShift and Kubernetes including BIOS, power, cooling, and network setup.
Deploy Zhipu AI GLM-5 754B model on Kubernetes with vLLM. One of the largest open-weight models with multi-node tensor parallelism across 8+ GPUs.
Deploy Meta Llama 2 70B on Kubernetes with multi-GPU tensor parallelism, vLLM serving, and production-ready health checks and resource limits.
Deploy Moonshot AI Kimi-K2.5 1.1T MoE multimodal model on Kubernetes. The largest open MoE model with 2.69M downloads for frontier AI tasks.
Deploy Lightricks LTX-2.3 image-to-video model on Kubernetes for AI video generation with batch processing and S3 output storage.
Deploy MiniMax M2.5 229B model on Kubernetes with vLLM. High-performance LLM with 485K downloads, optimized for multi-turn conversation and long context.
Deploy NVIDIA Nemotron-3-Super-120B-A12B MoE model on Kubernetes. 120B total parameters with 12B active for enterprise-grade inference.
Deploy Alibaba Qwen3-235B-A22B mixture-of-experts model on Kubernetes. Only 22B parameters active per token for efficient 235B-class inference.
Deploy Qwen3-Coder-Next 80B on Kubernetes for code generation, review, and refactoring. Production-ready AI coding assistant with multi-GPU serving.
Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.
Deploy RetinaNet object detection model on Kubernetes with Triton Inference Server, TensorRT optimization, and batch processing pipelines.
Deploy Sarvam 105B multilingual LLM on Kubernetes with vLLM. India's largest open language model with native support for 10+ Indic languages.
Deploy Stable Diffusion XL for image generation on Kubernetes with TensorRT acceleration, queued batch processing, and S3 output storage.
Deploy distributed LLM inference with tensor parallelism across multiple GPUs and pipeline parallelism across nodes on Kubernetes.
Run multi-node distributed PyTorch and TensorFlow training jobs using Kubeflow Training Operator with NCCL, RDMA, and shared storage.
Deploy distributed AI training with LeaderWorkerSet Operator on Kubernetes and OpenShift for leader-worker topology with gang scheduling.
Deploy Meta Llama Stack on Kubernetes for unified inference, RAG, agents, and safety APIs using NVIDIA NIM as the inference backend.
Run MLPerf inference and training benchmarks on Kubernetes GPU clusters to validate AI workload performance and compare hardware configurations.
Deploy MPI Operator on Kubernetes for distributed GPU training with Horovod and NCCL. Run multi-node MPI jobs natively in Kubernetes pods.
Deploy NVIDIA Clara medical AI and drug discovery platform on Kubernetes. Run digital biology and medtech inference workloads with GPU acceleration.
Deploy and optimize AI workloads on NVIDIA H200 GPUs with 141GB HBM3e memory for large model inference and training on Kubernetes.
Prepare for NVIDIA H300 Blackwell-Next GPUs on Kubernetes with next-gen HBM3e memory, NVLink 5.0, and FP4 inference capabilities.
Deploy NVIDIA NeMo framework on Kubernetes for large language model pre-training, fine-tuning, and RLHF with multi-node GPU clusters.
Use NVIDIA Pyxis and Enroot to run GPU containers in SLURM jobs. Bridge SLURM HPC scheduling with container-native AI workloads and NGC images.
Configure Run:AI scheduler quotas for fair GPU sharing with guaranteed, over-quota borrowing, and per-tenant GPU allocation policies.
Integrate SLURM HPC workload manager with Kubernetes for hybrid AI and scientific computing. Bridge HPC batch scheduling with container orchestration.
Autoscale Triton Inference Server on Kubernetes using GPU utilization, request queue depth, and inference latency metrics with KEDA and HPA.
Serve multiple LLMs simultaneously on Triton Inference Server using TensorRT-LLM and vLLM backends with model routing and GPU scheduling.
Deploy NVIDIA Triton Inference Server with TensorRT-LLM backend on Kubernetes for optimized large language model serving with GPU acceleration.
Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.
Step-by-step guide to deploy Mistral-7B using NVIDIA NIM with TensorRT-LLM backend on Kubernetes for optimized GPU inference.
Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.
Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.
Deploy a Retrieval-Augmented Generation pipeline on Kubernetes using a vector database, embedding model, and LLM inference server.
Maximize GPU utilization with KAI Scheduler GPU sharing, fractional GPUs, and bin packing strategies for Kubernetes AI workloads.
Implement gang scheduling for distributed training jobs using KAI Scheduler PodGroups to ensure all-or-nothing pod scheduling
Optimize GPU workload placement using KAI Scheduler's Topology-Aware Scheduling (TAS) for NVLink, NVSwitch, and disaggregated serving architectures
Our book includes an entire chapter dedicated to ai & ml with dozens more examples.
📖 Explore All Chapters