🤖 AI & GPU
AI/ML on Kubernetes: GPU scheduling, NVIDIA Triton, vLLM, model deployment (Llama, Qwen, Phi-4), distributed training, Kubeflow, NeMo, and inference optimization.
GPU Sharing with MPS and MIG on Kubernetes
Share NVIDIA GPUs across multiple pods using MPS time-slicing and MIG hardware partitioning. Maximize GPU utilization for inference workloads.
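As a minimal sketch of the MIG side of this setup: the pod below requests one MIG instance rather than a whole GPU, assuming the GPU Operator's MIG manager runs in the mixed strategy and the node advertises 1g.10gb slices (profile names vary by GPU model, so adjust the resource name accordingly).

```yaml
# Sketch: request one hardware-isolated MIG slice instead of a full GPU.
# Assumes the "mixed" MIG strategy with 1g.10gb instances exposed on the node.
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-demo
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]     # lists only the MIG instance visible to this pod
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1     # one MIG slice, not nvidia.com/gpu
```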
Node Feature Discovery Operator for Kubernetes
Install and configure Node Feature Discovery (NFD) Operator to auto-detect hardware features like GPUs, NICs, CPU flags, and USB devices on Kubernetes nodes.
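To illustrate what NFD enables beyond its built-in labels, here is a hedged sketch of a custom NodeFeatureRule that labels nodes exposing an NVIDIA PCI device (vendor ID 10de); the rule and label names are examples, not part of a default install.

```yaml
# Sketch: a custom NFD rule that labels nodes with an NVIDIA PCI device present.
# Rule name and label key are illustrative placeholders.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nvidia-gpu-present
spec:
  rules:
  - name: "nvidia pci device"
    labels:
      "feature.node.kubernetes.io/nvidia-gpu.present": "true"
    matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}   # NVIDIA PCI vendor ID
```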
Enable GPUDirect Storage in ClusterPolicy
Enable NVIDIA GPUDirect Storage (GDS) in the GPU Operator ClusterPolicy for direct GPU-to-NVMe data paths. Driver module configuration and verification.
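A minimal sketch of the relevant ClusterPolicy fragment follows; it assumes the GPU Operator's default resource name of cluster-policy and that the node prerequisites for GDS (open kernel modules, compatible NVMe/filesystem stack) are already in place. Merge this into the existing ClusterPolicy rather than creating a new one.

```yaml
# Sketch: enable GPUDirect Storage in the GPU Operator ClusterPolicy.
# gds.enabled deploys the nvidia-fs driver container alongside the GPU driver.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  gds:
    enabled: true
```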
GPU Time-Slicing on Kubernetes
Share GPUs across multiple workloads using NVIDIA time-slicing on Kubernetes. Configure the device plugin, set replica counts, and manage fairness.
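For reference, a minimal time-slicing configuration looks like the sketch below: a ConfigMap consumed by the device plugin (referenced from the ClusterPolicy via devicePlugin.config). The replica count of 4 is an illustrative value, not a recommendation.

```yaml
# Sketch: NVIDIA device plugin time-slicing config.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```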
NVIDIA GPU Operator Setup on Kubernetes
Install and configure NVIDIA GPU Operator on Kubernetes. Driver containers, toolkit, device plugin, DCGM monitoring, and ClusterPolicy setup.
NVIDIA Open GPU + GPUDirect RDMA + DOCA-OFED + SR-IOV Stack
Deploy NVIDIA AI networking on Kubernetes: Open GPU driver with DMA-BUF, GPUDirect RDMA, DOCA-OFED, and SR-IOV VF isolation.
AI Model Storage: hostPath vs PVC for Inference
Deploy AI models on Kubernetes using hostPath and PVC storage. Compare performance, security trade-offs, and production patterns for model serving.
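As a sketch of the PVC side of that comparison: a ReadWriteMany claim that several inference pods can mount read-only. The storage class name is a placeholder for whatever RWX-capable class your cluster provides.

```yaml
# Sketch: shared RWX volume for model weights; size and class are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs-models      # placeholder RWX storage class
  resources:
    requests:
      storage: 200Gi
```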
AIPerf Benchmark LLMs on Kubernetes
Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, and throughput with a real-time dashboard and GPU telemetry.
AIPerf Concurrency Sweep on K8s
Run AIPerf concurrency sweeps on Kubernetes to find optimal LLM serving capacity. Automate 1-128 concurrent user benchmarks with batch Jobs.
AIPerf Multi-Model Benchmark on K8s
Compare multiple LLM models and backends with AIPerf on Kubernetes. Benchmark vLLM vs TGI vs Triton with automated multi-run confidence intervals.
AIPerf Goodput and SLO Benchmarks
Measure LLM goodput with AIPerf on Kubernetes. Define SLOs for TTFT and ITL, calculate effective throughput, and benchmark with timeslice analysis.
Batch AI Workloads with Volcano Scheduler on Kubernetes
Schedule and manage batch AI training and inference jobs using Volcano scheduler with gang scheduling, fair-share queues, job plugins, and preemption on Kubernetes.
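A minimal gang-scheduling sketch: a Volcano Job whose minAvailable equals its replica count, so no worker starts until both can be placed. The image and training command are placeholders.

```yaml
# Sketch: Volcano Job gang-scheduling two GPU workers as a single unit.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-training-demo
spec:
  schedulerName: volcano
  minAvailable: 2              # all-or-nothing: both pods or neither
  queue: default
  tasks:
  - name: worker
    replicas: 2
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.08-py3   # placeholder training image
          command: ["python", "train.py"]            # placeholder entrypoint
          resources:
            limits:
              nvidia.com/gpu: 1
```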
AIPerf Trace Replay Benchmarks on K8s
Replay production traffic traces with AIPerf on Kubernetes. Use moon_cake format, ShareGPT datasets, and fixed schedules for realistic LLM benchmarks.
Dell PowerEdge XE7740 GPU Node Setup
Configure Dell PowerEdge XE7740 GPU nodes with H200 GPUs for OpenShift and Kubernetes, including BIOS, power, cooling, and network setup.
Deploy Fish Audio TTS on Kubernetes
Deploy Fish Audio S2-Pro 5B text-to-speech model on Kubernetes for high-quality voice synthesis with multi-speaker support and streaming audio.
Deploy GLM-5 754B on Kubernetes
Deploy Zhipu AI GLM-5 754B model on Kubernetes with vLLM. One of the largest open-weight models with multi-node tensor parallelism across 8+ GPUs.
Deploy Granite 4.0 Speech on Kubernetes
Deploy the IBM Granite 4.0 1B Speech model on Kubernetes for automatic speech recognition. This lightweight model runs on CPU or a small GPU for STT workloads.
Deploy Kimi K2.5 1.1T MoE on Kubernetes
Deploy Moonshot AI Kimi-K2.5 1.1T MoE multimodal model on Kubernetes. The largest open MoE model with 2.69M downloads for frontier AI tasks.
Deploy Llama 2 70B on Kubernetes
Deploy Meta Llama 2 70B on Kubernetes with multi-GPU tensor parallelism, vLLM serving, and production-ready health checks and resource limits.
Deploy Llama 3.1 8B Instruct on K8s
Deploy Meta Llama 3.1 8B Instruct on Kubernetes with vLLM. Production-ready single-GPU deployment with 128K context, tool calling, and autoscaling.
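As a hedged sketch of the core deployment: a single-GPU vLLM server for the gated Llama 3.1 8B Instruct checkpoint. The Secret name, probe timing, reduced context length, and image tag are assumptions to adjust for your cluster.

```yaml
# Sketch: single-GPU vLLM deployment for Llama 3.1 8B Instruct.
# Assumes an hf-token Secret holding a Hugging Face token for the gated model.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama31-8b
spec:
  replicas: 1
  selector:
    matchLabels: { app: llama31-8b }
  template:
    metadata:
      labels: { app: llama31-8b }
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --max-model-len=16384            # trim context to fit smaller GPUs
        ports:
        - containerPort: 8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef: { name: hf-token, key: token }
        resources:
          limits:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet: { path: /health, port: 8000 }
          initialDelaySeconds: 60
```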
Deploy LTX Video Generation on K8s
Deploy Lightricks LTX-2.3 image-to-video model on Kubernetes for AI video generation with batch processing and S3 output storage.
Deploy MiniMax M2.5 229B on Kubernetes
Deploy MiniMax M2.5 229B model on Kubernetes with vLLM. High-performance LLM with 485K downloads, optimized for multi-turn conversation and long context.
Deploy NVIDIA Nemotron 120B MoE on K8s
Deploy NVIDIA Nemotron-3-Super-120B-A12B MoE model on Kubernetes. 120B total parameters with 12B active for enterprise-grade inference.
Deploy Microsoft Phi-4 on Kubernetes
Deploy Microsoft Phi-4 small language model on Kubernetes with vLLM. Efficient 14B model with GPT-4 level reasoning on a single GPU.
Deploy Phi-4 Reasoning Vision on K8s
Deploy Microsoft Phi-4-reasoning-vision-15B on Kubernetes for multimodal chain-of-thought reasoning with visual understanding on a single GPU.
Deploy Qwen3 235B MoE on Kubernetes
Deploy Alibaba Qwen3-235B-A22B mixture-of-experts model on Kubernetes. Only 22B parameters active per token for efficient 235B-class inference.
Deploy Qwen3 Coder 80B on Kubernetes
Deploy Qwen3-Coder-Next 80B on Kubernetes for code generation, review, and refactoring. Production-ready AI coding assistant with multi-GPU serving.
Deploy Qwen3 TTS on Kubernetes
Deploy Qwen3-TTS-12Hz-1.7B-CustomVoice on Kubernetes for text-to-speech with custom voice cloning. 1.13M downloads, lightweight single-GPU deployment.
Deploy Qwen3.5 35B MoE on Kubernetes
Deploy Alibaba Qwen3.5-35B-A3B mixture-of-experts multimodal model on Kubernetes. 35B total parameters with only 3B active for ultra-efficient inference.
Deploy Qwen3.5 397B MoE on Kubernetes
Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.
Deploy Qwen3.5 9B Multimodal on K8s
Deploy Alibaba Qwen3.5-9B vision-language model on Kubernetes with vLLM. Process images and text with a single GPU deployment.
RetinaNet Object Detection on K8s
Deploy RetinaNet object detection model on Kubernetes with Triton Inference Server, TensorRT optimization, and batch processing pipelines.
Deploy Sarvam 105B on Kubernetes
Deploy Sarvam 105B multilingual LLM on Kubernetes with vLLM. India's largest open language model with native support for 10+ Indic languages.
Stable Diffusion XL on Kubernetes
Deploy Stable Diffusion XL for image generation on Kubernetes with TensorRT acceleration, queued batch processing, and S3 output storage.
Deploy Whisper Speech-to-Text on K8s
Deploy OpenAI Whisper for speech-to-text on Kubernetes with faster-whisper, batch transcription Jobs, and real-time streaming endpoints.
Distributed Inference on Kubernetes
Deploy distributed LLM inference with tensor parallelism across multiple GPUs and pipeline parallelism across nodes on Kubernetes.
GenAI-Perf Benchmark LLM Serving
Benchmark LLM inference endpoints with NVIDIA GenAI-Perf for throughput, latency percentiles, time-to-first-token, and ITL metrics.
GenAI-Perf Benchmark Triton on K8s
Benchmark NVIDIA Triton Inference Server performance on Kubernetes using GenAI-Perf. Measure TTFT, inter-token latency, throughput, and GPU telemetry.
Distributed Training with Kubeflow Training Operator
Run multi-node distributed PyTorch and TensorFlow training jobs using Kubeflow Training Operator with NCCL, RDMA, and shared storage.
Kubeflow Training Operator on Kubernetes
Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.
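A minimal sketch of what the operator runs: a two-pod PyTorchJob in which the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK for torch.distributed initialization. The image and train.py script are placeholders.

```yaml
# Sketch: one-master, one-worker distributed PyTorch job.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch                      # container must be named "pytorch"
            image: nvcr.io/nvidia/pytorch:24.08-py3
            command: ["python", "train.py"]    # placeholder training script
            resources:
              limits: { nvidia.com/gpu: 1 }
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: nvcr.io/nvidia/pytorch:24.08-py3
            command: ["python", "train.py"]
            resources:
              limits: { nvidia.com/gpu: 1 }
```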
LeaderWorkerSet Operator for AI Workloads
Deploy distributed AI training with LeaderWorkerSet Operator on Kubernetes and OpenShift for leader-worker topology with gang scheduling.
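As a sketch of the leader-worker topology: one replica group of size 4 (one leader plus three workers) scheduled and restarted as a unit. Images and commands are placeholders for your own training or serving processes.

```yaml
# Sketch: LeaderWorkerSet with a 4-pod group (1 leader + 3 workers).
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: lws-demo
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 4                                  # total pods per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: nvcr.io/nvidia/pytorch:24.08-py3
          command: ["python", "leader.py"]   # placeholder
          resources:
            limits: { nvidia.com/gpu: 1 }
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: nvcr.io/nvidia/pytorch:24.08-py3
          command: ["python", "worker.py"]   # placeholder
          resources:
            limits: { nvidia.com/gpu: 1 }
```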
Llama Stack on Kubernetes with NVIDIA NIM
Deploy Meta Llama Stack on Kubernetes for unified inference, RAG, agents, and safety APIs using NVIDIA NIM as the inference backend.
MLPerf Benchmarking on Kubernetes
Run MLPerf inference and training benchmarks on Kubernetes GPU clusters to validate AI workload performance and compare hardware configurations.
Shared Model Caching Across Pods on Kubernetes
Optimize LLM inference startup and reduce storage costs by sharing model weights across pods using emptyDir, hostPath, ReadWriteMany PVCs, and init containers.
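One variant of the PVC approach, sketched below: all replicas mount a shared ReadWriteMany claim at the default Hugging Face cache path, so weights downloaded by the first pod are reused by the rest. The claim name and model ID are placeholders, and the RWX PVC is assumed to exist already.

```yaml
# Sketch: vLLM replicas sharing one RWX model cache.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-shared-cache
spec:
  replicas: 3
  selector:
    matchLabels: { app: vllm-shared-cache }
  template:
    metadata:
      labels: { app: vllm-shared-cache }
    spec:
      volumes:
      - name: hf-cache
        persistentVolumeClaim:
          claimName: shared-model-cache          # pre-created RWX PVC
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model=mistralai/Mistral-7B-Instruct-v0.3"]   # placeholder model
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface    # default HF hub cache location
        resources:
          limits: { nvidia.com/gpu: 1 }
```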
MPI Operator for Distributed Training
Deploy MPI Operator on Kubernetes for distributed GPU training with Horovod and NCCL. Run multi-node MPI jobs natively in Kubernetes pods.
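A minimal MPIJob sketch follows: one launcher driving two GPU workers. The image and the mpirun command line are placeholders for your own Horovod or NCCL-based training script.

```yaml
# Sketch: MPIJob with a launcher and two single-GPU workers.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpi-demo
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: nvcr.io/nvidia/pytorch:24.08-py3
            command: ["mpirun", "-np", "2", "python", "train_hvd.py"]   # placeholder
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: nvcr.io/nvidia/pytorch:24.08-py3
            resources:
              limits: { nvidia.com/gpu: 1 }
```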
Deploy NVIDIA Clara on Kubernetes
Deploy NVIDIA Clara medical AI and drug discovery platform on Kubernetes. Run digital biology and medtech inference workloads with GPU acceleration.
NVIDIA H200 GPU Workloads on Kubernetes
Deploy and optimize AI workloads on NVIDIA H200 GPUs with 141GB HBM3e memory for large model inference and training on Kubernetes.
NVIDIA H300 GPU Workloads on Kubernetes
Prepare for NVIDIA H300 Blackwell-Next GPUs on Kubernetes with next-gen HBM3e memory, NVLink 5.0, and FP4 inference capabilities.
NVIDIA NeMo Training on Kubernetes
Deploy NVIDIA NeMo framework on Kubernetes for large language model pre-training, fine-tuning, and RLHF with multi-node GPU clusters.
NVIDIA Pyxis and Enroot for SLURM
Use NVIDIA Pyxis and Enroot to run GPU containers in SLURM jobs. Bridge SLURM HPC scheduling with container-native AI workloads and NGC images.
Run:AI GPU Quotas on OpenShift
Configure Run:AI scheduler quotas for fair GPU sharing with guaranteed quotas, over-quota borrowing, and per-tenant GPU allocation policies.
SLURM and Kubernetes Integration
Integrate SLURM HPC workload manager with Kubernetes for hybrid AI and scientific computing. Bridge HPC batch scheduling with container orchestration.
Time-Slicing vs MIG vs Full GPU Allocation
Compare GPU sharing strategies: time-slicing for notebooks, MIG for isolated inference, and full GPU for training workloads.
Triton Autoscaling with GPU Metrics
Autoscale Triton Inference Server on Kubernetes using GPU utilization, request queue depth, and inference latency metrics with KEDA and HPA.
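A sketch of the KEDA piece: scale an existing Triton Deployment on mean GPU utilization reported by the DCGM exporter through Prometheus. The Prometheus address, Deployment name, and 70% threshold are assumptions for your environment; a queue-depth or latency query can be substituted in the same trigger.

```yaml
# Sketch: KEDA ScaledObject scaling Triton on DCGM GPU utilization.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-gpu-scaler
spec:
  scaleTargetRef:
    name: triton-server                # existing Triton Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"triton-server.*"})
      threshold: "70"
```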
Triton Multi-Model Serving on Kubernetes
Serve multiple LLMs simultaneously on Triton Inference Server using TensorRT-LLM and vLLM backends with model routing and GPU scheduling.
Triton TensorRT-LLM on Kubernetes
Deploy NVIDIA Triton Inference Server with TensorRT-LLM backend on Kubernetes for optimized large language model serving with GPU acceleration.
TensorRT-LLM vs vLLM on Triton
Compare TensorRT-LLM and vLLM backends on Triton Inference Server. When to use each, performance benchmarks, and migration strategies.
Triton with vLLM Backend on Kubernetes
Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.
Deploying Vector Databases on Kubernetes
Deploy and operate vector databases (Milvus, Weaviate, Qdrant) on Kubernetes for RAG pipelines, semantic search, and AI applications with persistent storage.
Compare NCCL Intra-Node vs Inter-Node Performance
Build a repeatable comparison between local and cross-node NCCL throughput to validate GPU cluster interconnect scaling and identify bottlenecks early.
Run NCCL AllGather Benchmarks for Model Parallel Validation
Use all-gather NCCL tests to evaluate GPU communication behavior and throughput for tensor-parallel and model-parallel distributed AI workloads on Kubernetes.
Benchmark NCCL AllReduce Performance on Kubernetes
Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.
Run NCCL Tests on Kubernetes for GPU Network Validation
Benchmark GPU-to-GPU communication using NVIDIA nccl-tests on Kubernetes or OpenShift to validate bandwidth and latency.
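As a single-node sketch: a Job running all_reduce_perf across 8 GPUs, sweeping message sizes from 8 bytes to 8 GB. nccl-tests does not ship as an official image, so nccl-tests:latest stands in for an image you build from github.com/NVIDIA/nccl-tests.

```yaml
# Sketch: intra-node NCCL AllReduce benchmark on 8 GPUs.
apiVersion: batch/v1
kind: Job
metadata:
  name: nccl-allreduce
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: nccl
        image: nccl-tests:latest            # placeholder, build from nccl-tests repo
        command: ["all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", "8"]
        resources:
          limits:
            nvidia.com/gpu: 8
```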
Deploy Mistral 7B with NVIDIA NIM on Kubernetes
Step-by-step guide to deploy Mistral-7B using NVIDIA NIM with TensorRT-LLM backend on Kubernetes for optimized GPU inference.
Deploy Mistral 7B with vLLM on Kubernetes
Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.
Autoscale LLM Inference on Kubernetes
Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.
Quantize LLMs for Efficient GPU Inference on Kubernetes
Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.
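A hedged sketch of the AWQ case: vLLM loading a community INT4 export so a 7B model fits comfortably on a small GPU. The model ID is an example quantized artifact; substitute the checkpoint you actually use.

```yaml
# Sketch: serving an AWQ-quantized checkpoint with vLLM on one small GPU.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-awq-demo
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
    - --model=TheBloke/Mistral-7B-Instruct-v0.2-AWQ   # example AWQ export
    - --quantization=awq                               # load INT4 AWQ weights
    ports:
    - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
```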
Kubernetes LLM Serving Frameworks Compared
Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes: features, performance, and when to use each.
Multi-GPU and Tensor Parallel LLM Inference on Kubernetes
Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.
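The vLLM half of this pattern, sketched below: a 4-way tensor-parallel pod whose GPUs must all land on one node, with a generous /dev/shm for NCCL. The model ID and shared-memory size are placeholders.

```yaml
# Sketch: 4-way tensor-parallel vLLM serving on a single multi-GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-tp4
spec:
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 16Gi
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
    - --model=meta-llama/Llama-3.1-70B-Instruct   # placeholder large model
    - --tensor-parallel-size=4                     # shard each layer across 4 GPUs
    volumeMounts:
    - name: shm
      mountPath: /dev/shm                          # NCCL needs ample shared memory
    resources:
      limits:
        nvidia.com/gpu: 4
```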
Install NVIDIA GPU Operator on Kubernetes
Deploy the NVIDIA GPU Operator to automate GPU driver, container toolkit, and device plugin management across your Kubernetes cluster.
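Once the operator is installed, a throwaway pod like the sketch below confirms the stack end to end: if the driver and device plugin are healthy, the pod schedules and its logs show the node's GPUs.

```yaml
# Sketch: smoke-test pod that runs nvidia-smi on one allocated GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```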
Build a RAG Pipeline on Kubernetes
Deploy a Retrieval-Augmented Generation pipeline on Kubernetes using a vector database, embedding model, and LLM inference server.
Test LLM Inference Endpoints with curl
Validate Kubernetes-hosted LLM inference services using curl against OpenAI-compatible /v1/models, /v1/completions, and /v1/chat/completions endpoints.
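An in-cluster variant of this check, sketched as a Job so it runs against the Service DNS name directly; the service name, port, model ID, and prompt are placeholders.

```yaml
# Sketch: in-cluster curl smoke test of an OpenAI-compatible LLM service.
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-endpoint-check
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: curl
        image: curlimages/curl:latest
        command:
        - sh
        - -c
        - |
          curl -sf http://llama31-8b:8000/v1/models && \
          curl -sf http://llama31-8b:8000/v1/chat/completions \
            -H 'Content-Type: application/json' \
            -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Say hello"}],"max_tokens":16}'
```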
GPU Sharing and Bin Packing with KAI Scheduler
Maximize GPU utilization with KAI Scheduler GPU sharing, fractional GPUs, and bin packing strategies for Kubernetes AI workloads.
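As a heavily hedged sketch of fractional allocation: a pod that asks the KAI Scheduler for half a GPU via a gpu-fraction annotation and references a queue through a label. The scheduler name, annotation key, and queue label key are assumptions about a typical KAI install; verify them against the release you deploy.

```yaml
# Sketch (assumed keys): fractional-GPU pod scheduled by KAI Scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-demo
  labels:
    kai.scheduler/queue: team-a        # assumed queue label key and queue name
  annotations:
    gpu-fraction: "0.5"                # assumed fractional-GPU annotation
spec:
  schedulerName: kai-scheduler         # assumed scheduler name
  containers:
  - name: app
    image: vllm/vllm-openai:latest     # placeholder inference image
    args:
    - --model=mistralai/Mistral-7B-Instruct-v0.3
    - --gpu-memory-utilization=0.45    # stay within the half-GPU share
```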
Installing NVIDIA KAI Scheduler for AI Workloads
Deploy KAI Scheduler for optimized GPU resource allocation in Kubernetes AI/ML clusters with hierarchical queues and batch scheduling.
Batch Scheduling with PodGroups in KAI Scheduler
Implement gang scheduling for distributed training jobs using KAI Scheduler PodGroups to ensure all-or-nothing pod scheduling.
Hierarchical Queues and Resource Fairness with KAI Scheduler
Configure hierarchical queues in KAI Scheduler for multi-tenant GPU clusters with quotas, limits, and Dominant Resource Fairness (DRF).
Topology-Aware Scheduling with KAI Scheduler
Optimize GPU workload placement using KAI Scheduler's Topology-Aware Scheduling (TAS) for NVLink, NVSwitch, and disaggregated serving architectures.