🤖 AI & GPU
AI/ML on Kubernetes: GPU scheduling, NVIDIA Triton, vLLM, model deployment, distributed training, Kubeflow, and inference optimization.
H200 NVL 8-GPU Topology Bandwidth Tiers for Kubernetes
Map the three bandwidth tiers of 8× H200 NVL GPU nodes—NVLink (~337 GB/s), PCIe+UPI (~50 GB/s), RoCE (~35 GB/s)—for NCCL topology-aware NUMA scheduling.
Disable GDS and Enable IOMMU Passthrough on K8s GPUs
Disable GPUDirect Storage (GDS) when not needed and configure IOMMU passthrough mode for GPU and NIC device assignment. Kernel parameters, BIOS settings, VFIO
GPU Operator ClusterPolicy RDMA and GDS Configuration
Configure NVIDIA GPU Operator ClusterPolicy to disable RDMA and enable GPUDirect Storage (GDS). Control nvidia-peermem, nvidia-fs modules, driver
GPUDirect RDMA Setup and Verification on Kubernetes
Enable and verify GPUDirect RDMA for GPU-to-NIC direct data transfer on Kubernetes. Install nvidia-peermem, configure DMA-BUF, verify RDMA paths, troubleshoot
IOMMU Kernel Parameters for Kubernetes GPU Nodes
Configure IOMMU kernel parameters for optimal GPU and RDMA performance on Kubernetes. Compare intel_iommu, amd_iommu, and iommu settings, passthrough vs off vs
Kubeflow MPIJob Worker SSH Setup for GPU Training
Configure SSH daemon in Kubeflow MPIJob worker pods for multi-node GPU training. Covers SSHD setup in containers, host key generation, authorized keys from MPI
Kubernetes Topology Manager for GPU and NUMA Alignment
Configure Kubernetes Topology Manager to align CPU, GPU, and NIC allocations on the same NUMA node. Covers policies, kubelet config, and GPU performance tuning.
MPI DNS Resolution and Hostfile for Kubernetes GPU Jobs
Troubleshoot MPI hostfile DNS resolution in Kubeflow MPIJob on Kubernetes. Covers headless Service creation, subdomain configuration, DNS wait loops, FQDN
NCCL All-Reduce Benchmarking on Multi-Node GPUs
Run and interpret NCCL all_reduce_perf benchmarks on multi-node Kubernetes GPU clusters. Understand bus bandwidth results, expected throughput for H200 NVL
NCCL Channel Routing and Transport Path Analysis
Interpret NCCL channel logs to understand GPU communication paths on Kubernetes. Decode P2P/CUMEM, SHM/direct, NET/IB/GDRDMA transport
NCCL DMABUF Enable for GPUDirect RDMA on Kubernetes
Enable NCCL DMA-BUF support for GPUDirect RDMA in Kubernetes GPU clusters. Covers NCCL_DMABUF_ENABLE=1, kernel requirements, nvidia-peermem vs dmabuf, GPU
NCCL GPUDirect RDMA Distance Levels and PIX vs SYS
Understand NCCL GPUDirect RDMA distance-based enablement. When PIX mode disables GDRDMA for distant GPU-HCA pairs (distance 9 > 4) and when SYS mode enables
NCCL GPUDirect RDMA Level Tuning PIX PXB PHB SYS
Tune NCCL_NET_GDR_LEVEL for optimal GPUDirect RDMA performance on Kubernetes. Compare PIX, PXB, PHB, and SYS distance thresholds with PCIe topology. Benchmark
NCCL IB HCA Selection and QPS Tuning for RoCE
Configure NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_IB_QPS_PER_CONNECTION, and NCCL_IB_SPLIT_DATA_ON_QPS for optimal RoCE performance on Kubernetes GPU clusters.
NCCL Network Validation Script for OpenShift GPU Clusters
Build a comprehensive NCCL network validation script for OpenShift GPU clusters with SR-IOV. Configure NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL=SYS, per-rank HCA
Production NCCL Network Validator for Kubeflow MPIJob
Deploy a production-ready NCCL network validation framework using Kubeflow MPIJob on OpenShift. Complete validate_network.sh script
NCCL RoCE Validation MPIJob Complete Reference
Complete nccl-roce-validation.yaml MPIJob reference for OpenShift GPU clusters. Full launcher environment variables, OpenMPI control plane settings, NCCL
NCCL RoCE Validation with Kubeflow MPIJob on Kubernetes
Run NCCL all_reduce_perf validation tests using Kubeflow MPIJob on GPU clusters. Configure MPI launcher and workers, NCCL environment variables, test
Shared Memory Transport for NCCL Intra-Node GPU
Configure NCCL shared memory (SHM) transport for intra-node GPU communication on Kubernetes. Covers /dev/shm sizing with emptyDir and NVLink/PCIe P2P paths.
NVIDIA GPU Topology Matrix Interpretation on Kubernetes
Read and interpret nvidia-smi topo and nvidia-device-plugin topology matrices on Kubernetes GPU nodes. Understand X, NV, SYS, NODE, PIX, PXB, PHB connection
NVLink Bridge Architecture for GPU Kubernetes Nodes
Understand NVLink Bridge logical architecture in GPU servers for Kubernetes. Dual-socket PCIe Gen5 topology, NVL4 groups, GPU-NIC-NVMe placement, PCIe switch
OpenMPI Control Plane Separation for NCCL RDMA
Configure OpenMPI to use eth0 for MPI control traffic while NCCL uses net1 SR-IOV for data. Covers btl_tcp_if_include, pml, routed direct, plm_rsh_agent SSH
Run:ai GPU Scheduling with Kubeflow MPIJob
Integrate Run:ai GPU scheduler with Kubeflow MPIJob for multi-node NCCL workloads. Covers Run:ai project namespaces, GPU quota annotations, pod group
GenAI-Perf Benchmarking LLM Inference on Kubernetes
Benchmark LLM inference performance with NVIDIA GenAI-Perf on Kubernetes. Profile vLLM, TensorRT-LLM, and Triton endpoints with concurrency sweeps, token
NCCL Environment Variables Complete Reference
Complete reference for NCCL environment variables on Kubernetes. Configure network transport, InfiniBand, GPUDirect RDMA, socket
Kubernetes Volcano Batch Scheduler Gang Scheduling
Deploy Volcano batch scheduler for gang scheduling on Kubernetes. Configure minAvailable for all-or-nothing pod group scheduling, queue management, and GPU job
NCCL and RCCL Networking Performance on Kubernetes
Optimize NCCL (NVIDIA) and RCCL (AMD) collective communication performance on Kubernetes GPU clusters. Network transport selection, bandwidth tuning, latency
Weights and Biases Experiment Tracking on Kubernetes
Deploy Weights & Biases (W&B) on Kubernetes for ML experiment tracking, model registry, and hyperparameter sweeps. Self-hosted W&B Server, agent-based
Integrate DisaggregatedSet with llm-d on Kubernetes
Deploy disaggregated LLM inference using DisaggregatedSet and llm-d on Kubernetes. Install LWS then DS controller, model prefill/decode roles, wire llm-d
DisaggregatedSet for Multi-Role LLM Inference
Deploy disaggregated LLM inference on Kubernetes with DisaggregatedSet and LeaderWorkerSet. Separate prefill and decode phases across GPU pools
NCCL Topology Dump and Tuning on Kubernetes
Use NCCL_TOPO_DUMP_FILE to export and inject GPU topology on Kubernetes for reproducible distributed training performance. Topology XML caching, environment
Hermes Agent Self-Hosted AI on Kubernetes
Deploy Hermes Agent (Nous Research) on Kubernetes as a persistent self-hosted AI agent with memory, automated skill creation, multi-platform
NVIDIA Dynamo Production Tuning on Kubernetes
Tune NVIDIA Dynamo for production LLM inference: prefill/decode pool sizing, KV cache transfer optimization, NCCL backend selection, SLA-driven autoscaling
NVIDIA OpenShell Sandboxed AI Agent Runtime on Kubernetes
Deploy NVIDIA OpenShell on Kubernetes for safe, private autonomous AI agent execution. Declarative YAML network policies, sandboxed containers
Poolside AI Foundation Models on Kubernetes
Deploy Poolside AI foundation models for enterprise software agents on Kubernetes. On-prem and VPC deployment, multi-agent orchestration, sandboxed
Red Hat AI Studio on OpenShift
Deploy Red Hat AI Studio on OpenShift for end-to-end LLM development. Model catalog, InstructLab fine-tuning, experiment tracking, model
Tabnine AI Code Assistant Self-Hosted on Kubernetes
Deploy Tabnine Enterprise self-hosted on Kubernetes for private AI code completion and chat. On-prem model serving, multi-model support (Tabnine
GPUDirect Storage on Kubernetes
Configure NVIDIA GPUDirect Storage (GDS) for direct data path between NVMe/NFS storage and GPU memory bypassing CPU. Covers Magnum IO, cuFile API, GDS driver
NVIDIA PeerMem for GPU-Direct RDMA
Install and configure nvidia_peermem kernel module to enable GPU-Direct RDMA between NVIDIA GPUs and Mellanox RDMA NICs. Covers module
Disable PCIe ACS for GPU-Direct P2P
Disable PCIe Access Control Services (ACS) to enable GPU-Direct peer-to-peer DMA between GPUs and RDMA NICs. Covers BIOS disable, kernel override, and when
IOMMU BIOS and Kernel Config for NCCL GPU-Direct
Configure IOMMU at BIOS and kernel level to enable NCCL GPU-Direct RDMA on Kubernetes. Covers Intel VT-d, AMD-Vi, kernel parameters, passthrough
NCCL PXN Cross-NIC Communication via NVLink
Configure NCCL PXN (PCIe cross-NIC via NVLink) for multi-node GPU training where not every GPU has a direct RDMA NIC. Covers topology
Run:ai Distributed Inference with SR-IOV RDMA
Deploy distributed vLLM inference on Run:ai using SR-IOV RDMA for NCCL inter-node communication. Covers extended-resource for Mellanox VFs, network annotation
Run:ai Distributed Inference with vLLM and NCCL
Deploy distributed LLM inference on Run:ai with vLLM tensor parallelism across multiple workers. Covers multi-node GPU splitting, NCCL configuration, PVC model
Debug Distributed vLLM Inference with NCCL Verbose Logging
Debug distributed vLLM inference using NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL. Covers air-gapped deployment with TRANSFORMERS_OFFLINE, interpreting NCCL
Kubernetes AI Infrastructure Scaling
Scale AI inference infrastructure on Kubernetes from 10K to 100K requests per second. Covers latency optimization, horizontal scaling, caching
Kubernetes for AI Search and Discoverability
Deploy AI-searchable services on Kubernetes: llms.txt implementation, RAG-optimized APIs, structured data for AI chatbots, and infrastructure patterns
Deep Learning with Large Datasets on K8s
Optimize deep learning training with large datasets on Kubernetes. Covers data loading, caching strategies, parallel prefetch, and storage architecture
Distributed Multi-GPU Inference on Kubernetes
Deploy distributed inference across multiple GPUs and nodes on Kubernetes. Covers tensor parallelism, pipeline parallelism, vLLM, and NIM multi-GPU serving.
FSDP LoRA Fine-Tuning LLMs on Kubernetes
Fine-tune large language models with FSDP and LoRA on Kubernetes. Covers memory-efficient loading, checkpoint strategies, and multi-node H200 training.
NVIDIA GenAI-Perf Inference Benchmarking
Benchmark LLM inference throughput and latency on Kubernetes using NVIDIA GenAI-Perf. Covers vLLM, Run:ai, concurrency testing, and multi-location client runs.
LeaderWorkerSet Multi-Node Inference on K8s
Deploy multi-node distributed inference using LeaderWorkerSet (LWS) operator on Kubernetes. Covers vLLM pipeline parallelism across nodes for 405B+ parameter
Mistral FSDP LoRA Complete Accelerate Config
Complete accelerate FSDP configuration for fine-tuning Mistral-Small-4 11B with LoRA on multi-GPU H200 clusters. Covers every FSDP2 setting with explanations.
Multi-Node Distributed Training on Kubernetes
Run distributed deep learning training across multiple GPU nodes on Kubernetes. Covers PyTorch DDP, DeepSpeed, Horovod, and MPI jobs with NCCL optimization.
NVIDIA GPUDirect Storage Benchmark on K8s
Benchmark NVIDIA GPUDirect Storage (GDS) on Kubernetes for direct NVMe-to-GPU data transfers. Covers gdsio, gds_stats, performance validation, and comparison
NVIDIA GPU Operator GitOps on OpenShift
Deploy NVIDIA GPU Operator on OpenShift via GitOps with ArgoCD. Covers ClusterPolicy configuration, DCGM exporter, drain settings, tolerations, and rolling
OpenShift GPU Node Resource Planning
Plan CPU, memory, and overhead budgets for GPU nodes running NVIDIA GPU Operator, Network Operator, Run:ai, and OpenShift infrastructure Pods. Understand what
Run:ai Backend Architecture on OpenShift
Understand the full Run:ai backend deployment on OpenShift with 40+ microservices including Keycloak, PostgreSQL, NATS, Thanos, Traefik, and workload
Run:ai Distributed PyTorch Training on OpenShift
Submit multi-node distributed PyTorch training jobs on OpenShift using Run:ai CLI. Covers DDP, FSDP, RDMA networking, and GPU scheduling.
FSDP Distributed Training on Run:ai
Run PyTorch FSDP distributed training workloads on Run:ai with GPU scheduling, event tracking, and GPU memory monitoring. Covers Mistral-class model
Run:ai GPU Metrics Pipeline with DCGM and Thanos
End-to-end GPU metrics pipeline on Run:ai: DCGM exporter collects GPU utilization, Prometheus scrapes, remote-writes to Thanos Receive, and Grafana dashboards
Run:ai Platform Backend Components
Overview of Run:ai backend StatefulSets and components on OpenShift: Thanos receive/query, Keycloak, NATS, Redis, PostgreSQL, workload controllers, and their
Run:ai Training Job Submit Script Pattern
Production pattern for submitting Run:ai training jobs via shell scripts with GPU fractional allocation, NFS mounts, custom Python environments, and private
Run:ai Workload Controllers on OpenShift
Understand Run:ai cluster-level workload controllers on OpenShift: workload-controller, workload-overseer, workload-exporter, and status-updater components.
Kubernetes 1.36 DRA for GPU and TPU Management
Use Dynamic Resource Allocation in Kubernetes 1.36 for advanced GPU/TPU management with partitionable devices, device taints, and tolerations.
Kubernetes 1.36 Gang Scheduling
Use gang scheduling in Kubernetes 1.36 to schedule Pod groups atomically. Essential for distributed ML training, MPI jobs, and Spark workloads.
Kubernetes 1.36 RestartAllContainers for ML
Use the RestartAllContainers policy in Kubernetes 1.36 to restart all Pod containers in-place when a worker fails, avoiding costly ML training rescheduling.
Kubernetes 1.36 Topology-Aware Scheduling
Use topology-aware workload scheduling in Kubernetes 1.36 to place Pods on nodes with optimal GPU, NUMA, and network topology for ML training.
NVIDIA GPU Feature Discovery for Kubernetes
Deploy GPU Feature Discovery (GFD) to auto-label Kubernetes nodes with GPU model, MIG capability, CUDA version, and driver info for intelligent scheduling.
OpenShift NVIDIA MIG Reconfiguration Without Reboot
Reconfigure NVIDIA MIG geometry on OpenShift without rebooting nodes. Use nvidia-mig-manager with node labels to dynamically switch GPU partitions.
Talos Linux MIG Configuration with GPU Operator
Configure NVIDIA MIG on Talos Linux Kubernetes clusters. Install GPU Operator, set MIG strategy, and dynamically partition A100 GPUs without node reboot.
DGX H100 nvidia-smi topo -m Guide
Read nvidia-smi topo -m output on DGX H100 systems. Understand NVLink, NVSwitch, PCIe topology, GPU-to-GPU bandwidth, and NUMA affinity for Kubernetes.
NVIDIA H300 GPU Setup on Kubernetes
Deploy NVIDIA H300 GPUs on Kubernetes. H300 vs H100 vs H200 specs comparison, memory bandwidth, GPU Operator setup, and AI inference optimization.
NVIDIA PyTorch Container on Kubernetes
Deploy nvcr.io/nvidia/pytorch containers on Kubernetes for GPU training. Version selection, CUDA compatibility, multi-node DDP, and NCCL configuration.
GenAI-Perf Benchmark LLM Kubernetes
Benchmark LLM inference with GenAI-Perf on Kubernetes. Use --service-kind openai for vLLM, NIM, and TGI. Measure TTFT, ITL, and throughput.
Continuous Batching LLM Inference K8s
Configure continuous batching for LLM inference on Kubernetes. vLLM and TRT-LLM batch scheduling, max-num-seqs tuning, and throughput optimization.
CUDA Version Compatibility K8s Guide
Match CUDA versions with GPU drivers and container images on Kubernetes. Forward compatibility, driver requirements, and container toolkit matrix.
DeepSpeed ZeRO Training Kubernetes
Deploy DeepSpeed ZeRO-1/2/3 for large model training on Kubernetes. Multi-node config, NCCL tuning, memory optimization, and 70B+ model training.
DGX H100 GPU Topology nvidia-smi
Inspect DGX H100 GPU topology with nvidia-smi topo -m. NVSwitch NV18 links, cross-socket detection, PCIe hierarchy, and NCCL performance validation.
GPU Feature Discovery Node Labels
Configure NVIDIA GPU Feature Discovery for automatic node labeling on Kubernetes. GPU model, driver version, CUDA, and MIG labels for scheduling.
GPU Node Affinity Scheduling K8s
Schedule GPU workloads with node affinity and topology on Kubernetes. GPU type selection, multi-GPU locality, and NUMA-aware pod placement.
K8s GPU Limits Requests Configuration
Configure GPU resource limits and requests in Kubernetes pod specs. nvidia.com/gpu resource, fractional GPUs, MIG slices, and multi-GPU allocation.
LoRA Adapter Serving vLLM on K8s
Serve multiple LoRA adapters with a single vLLM base model on Kubernetes. Dynamic loading, per-request routing, and multi-tenant fine-tuned models.
Multi-GPU PyTorch DDP on Kubernetes
Run PyTorch DistributedDataParallel across multiple GPUs on Kubernetes. torchrun, NCCL backend, pod topology, and scaling to multi-node training.
NVIDIA Driver Update K8s Nodes Guide
Safely update NVIDIA GPU drivers on Kubernetes nodes. Rolling updates, drain strategy, driver compatibility matrix, and GPU Operator upgrades.
NVIDIA PeerMem GPUDirect RDMA K8s
Configure nvidia_peermem and ib_register_peer_memory_client for GPUDirect RDMA on Kubernetes. Module loading and modprobe invalid argument fix.
nvidia-smi Monitoring in K8s Pods
Run nvidia-smi inside Kubernetes pods for GPU monitoring. Memory usage, temperature, utilization, and automated health checks with liveness probes.
Prefix Caching vLLM KV Cache K8s
Enable automatic prefix caching in vLLM on Kubernetes for shared-prompt workloads. KV cache reuse, memory savings, and chatbot latency optimization.
Quantize LLMs AWQ GPTQ for K8s Deploy
Deploy AWQ and GPTQ quantized LLMs on Kubernetes. 4-bit inference with vLLM, model conversion, accuracy trade-offs, and GPU memory savings guide.
Speculative Decoding with vLLM on Kubernetes
Enable speculative decoding in vLLM on Kubernetes for 2-3x faster LLM inference. Draft model selection, acceptance rates, and latency optimization.
TensorRT-LLM vs vLLM Benchmark 2026
Compare TensorRT-LLM vs vLLM for LLM inference on Kubernetes. TTFT, throughput, GPU utilization benchmarks, and when to use each inference engine.
vLLM Alternatives LLM Inference K8s
Compare vLLM alternatives for LLM inference on Kubernetes. TensorRT-LLM, SGLang, NVIDIA NIM, Ollama, and text-generation-inference feature comparison.
Kubeflow PyTorchJob Training K8s
Run distributed PyTorch training on Kubernetes with Kubeflow PyTorchJob. ElasticPolicy, nproc_per_node, RDMA configuration, and multi-GPU scaling.
NCCL Environment Variables Reference
Complete NCCL environment variables reference for Kubernetes GPU training. NCCL_IB_DISABLE, NCCL_SOCKET_IFNAME, NCCL_DEBUG, and network tuning guide.
NCCL Test Benchmark Kubernetes
Run NCCL tests on Kubernetes for GPU communication benchmarking. all_reduce_perf, all_gather_perf, multi-node bandwidth, and latency validation.
GPU Time-Slicing vs MIG Comparison
Compare NVIDIA GPU time-slicing and MIG for K8s workloads. When to use each, performance trade-offs, and configuration examples.
TensorRT-LLM Kubernetes Deployment
Deploy TensorRT-LLM on K8s for optimized inference. Engine building, model conversion, and serving with Triton Inference Server.
vLLM Deployment Kubernetes Guide
Deploy vLLM inference engine on K8s. Model loading, tensor parallelism, continuous batching, and OpenAI-compatible API setup.
AI Resource Allocation Optimization
Optimize GPU and memory allocation for AI workloads on Kubernetes. Right-size GPU requests, bin-packing strategies, gang scheduling.
CNCF AI Projects Landscape Kubernetes
Navigate the CNCF AI project landscape for Kubernetes. Kubeflow, KServe, KAITO, Volcano, and emerging projects for training, serving, scheduling.
Distributed Training TensorFlow PyTorch
Run distributed training jobs on Kubernetes with TensorFlow and PyTorch. Training Operator, multi-worker strategies, NCCL configuration.
Feast Feature Store Kubernetes
Deploy Feast feature store on Kubernetes for ML feature management. Offline and online stores, feature serving, point-in-time joins.
GPU Sharing MIG and Time-Slicing Kubernetes
Share GPUs across multiple pods with NVIDIA MIG and time-slicing on Kubernetes. MIG profiles for A100/H100, time-slicing configuration.
KAITO AI Model Inference Kubernetes
Deploy AI models with KAITO (Kubernetes AI Toolchain Operator) for automated GPU provisioning, model serving, and inference workload management.
Katib Hyperparameter Tuning Kubernetes
Automate hyperparameter tuning with Katib on Kubernetes. Bayesian optimization, random search, grid search, early stopping.
KnativeServing for AI Inference OpenShift
Configure KnativeServing with scale-to-zero, GPU scheduling features, Kourier ingress, and custom domain templates for AI inference workloads on OpenShift.
KServe Model Serving Kubernetes
Deploy ML models with KServe for serverless inference on Kubernetes. InferenceService, scale-to-zero, canary rollouts, model transformers.
Kubeflow ML Platform Setup Kubernetes
Deploy Kubeflow as a production-ready ML platform on Kubernetes. Notebooks, pipelines, training operators, and model serving with KServe for end-to-end MLO.
AI Cost Management on Kubernetes
Control AI infrastructure costs on Kubernetes with GPU utilization tracking, chargeback per team, spot instance strategies, right-sizing recommendations.
AI Inference Optimization Kubernetes
Optimize AI inference performance on Kubernetes. Request batching, KV cache tuning, speculative decoding, continuous batching.
GPU Node Provisioning Kubernetes
Automate GPU node provisioning for Kubernetes with Karpenter, Cluster Autoscaler, and cloud-specific node pools for AI and ML workloads.
GPU Operator Advanced Configuration
Advanced NVIDIA GPU Operator configuration on Kubernetes. Driver containers, CUDA toolkit, GDS, GPUDirect RDMA, MIG manager, DCGM Exporter.
Kueue Job Queuing Fair Sharing Kubernetes
Implement fair-share GPU job queuing with Kueue on Kubernetes. ClusterQueues, LocalQueues, ResourceFlavors, and cohort-based borrowing for multi-team AI cl.
LLM Deployment Challenges Kubernetes
Address common LLM deployment challenges on Kubernetes. GPU memory management, model loading optimization, inference latency tuning, batch scheduling.
ML Pipeline Automation Kubernetes
Automate ML pipelines on Kubernetes with Kubeflow Pipelines, Argo Workflows, and Tekton. Data preprocessing, training, evaluation, model registration.
ModelMesh Multi-Model Serving Kubernetes
Deploy hundreds of ML models on shared GPU infrastructure with ModelMesh. Intelligent model loading and unloading, memory management, routing.
Multi-Cloud AI Workloads Kubernetes
Run AI workloads across multiple cloud providers with Kubernetes. GPU instance availability, spot pricing arbitrage, model portability.
NCCL SR-IOV GDS PyTorch Configuration
Configure NCCL with SR-IOV RDMA and GPUDirect Storage on Kubernetes. PyTorch 25.11 container with NCCL 2.28, CUDA 13, MOFED 5.4, GDRCopy 2.
Volcano Job minAvailable Gang Schedule
Volcano batch scheduling with minAvailable gang scheduling on Kubernetes. Job configuration, queue policies, and AI training workload scheduling.
AIPerf Offline vLLM Benchmarking
Benchmark vLLM inference with AIPerf in air-gapped Kubernetes clusters. Use dummy tokenizers, offline mode, custom endpoints.
Run:ai Distributed vLLM with NCCL
Deploy distributed vLLM inference on Run:ai with NCCL over NVLink and RDMA. Tensor parallelism across GPUs with NCCL debug logging, SR-IOV networking.
AIPerf LLM Benchmarking on K8s
Benchmark generative AI inference on Kubernetes with NVIDIA AIPerf. Measure TTFT, ITL, throughput, and latency across vLLM, NIM.
DOCA Perftest RDMA Benchmarking
Run NVIDIA DOCA perftest on Kubernetes to benchmark RDMA bandwidth and latency between GPU nodes. Traffic patterns, GPUDirect memory modes.
RetinaNet GPU Training on Kubernetes
Train RetinaNet object detection models on Kubernetes with unlimited memlock for RDMA, CRI-O ulimit configuration, and multi-GPU distributed training.
NCCL Topology Dump File for GPU Debugging
Use NCCL_TOPO_DUMP_FILE to capture and analyze GPU interconnect topology in Kubernetes. Debug NVLink, NVSwitch, and PCIe connection paths.
Run:ai Distributed vLLM Inference for Multimodal LLMs
Deploy multimodal LLMs with Run:ai distributed inference and vLLM on Kubernetes. Tensor parallelism, NCCL over NVLink, GPUDirect RDMA.
Inter-Node Tensor Parallelism on Kubernetes
Split a single LLM across multiple physical servers using tensor parallelism. Configure vLLM, NIM, and Ray for inter-node TP with NCCL over RDMA or TCP.
Triton Inference Server vs vLLM: Which to C...
Compare NVIDIA Triton Inference Server vs vLLM for LLM serving on Kubernetes. Performance, multi-model support, batching, GPU utilization.
Verify NCCL RDMA Traffic with Debug Logging
Prove NCCL uses RDMA for GPU communication on Kubernetes. Use NCCL_DEBUG and NCCL_DEBUG_SUBSYS=ALL to verify InfiniBand, RoCE.
NCCL_IB_DISABLE Environment Variable
NCCL_IB_DISABLE environment variable explained. Set NCCL_IB_DISABLE=1 for Ethernet-only clusters, debug InfiniBand errors, and tune GPU communication.
vLLM on Huawei Ascend NPU: K8s Deployment
Deploy vLLM inference on Huawei Ascend NPUs in Kubernetes. Atlas 300I/910B device plugin, vllm-ascend container image, tensor parallelism, and model serving.
Deploy vLLM OpenAI Container on Kubernetes
Deploy the vLLM OpenAI-compatible server container on Kubernetes. Pull ghcr.io/vllm-project/vllm-openai, configure GPU resources, model loading.
AI-Native Development Platforms on Kubernetes
Build AI-native development platforms on Kubernetes. AI coding agents, automated testing, Copilot infrastructure, dev containers, and AI-driven CI/CD pipelines.
Agentic AI and Multi-Agent Systems
Deploy autonomous AI agents and multi-agent orchestration on Kubernetes. LangGraph, CrewAI, AutoGen, tool-calling agents, agent-to-agent communication.
AI Infrastructure Cost Optimization
Optimize AI infrastructure costs on Kubernetes. GPU sharing, spot instances, inference batching, model quantization, token economics.
AI Content Watermarking on Kubernetes
Deploy AI-generated content watermarking on Kubernetes. Invisible watermarks, SynthID integration, detection APIs, image and text watermarking pipelines.
AI Supercomputing on Kubernetes GPU Clusters
Build AI supercomputing platforms on Kubernetes. Multi-node GPU training, NVIDIA DGX SuperPOD, InfiniBand RDMA, NCCL tuning, Blackwell clusters.
Autonomous Industrial Systems on Kubernetes
Orchestrate autonomous factories and logistics with Kubernetes. Digital twins, robot fleet coordination, industrial IoT pipelines, predictive maintenance.
Domain-Specific Language Models on Kubernetes
Deploy and fine-tune domain-specific LLMs on Kubernetes. Legal, healthcare, finance, and code models with LoRA fine-tuning, NIM serving, and RAG pipelines.
GitOps for AI Workloads on Kubernetes
Deploy AI models with GitOps on Kubernetes. Version ML models in Git, ArgoCD for model rollouts, Flux for GPU cluster sync.
K8s AI Gateway: Inference Extension Guide
Use the Kubernetes AI Gateway and Inference Extension to route LLM traffic. Model-aware routing, load balancing across inference backends.
Dynamic Resource Allocation for GPUs
Use Kubernetes Dynamic Resource Allocation to schedule GPUs. DRA ResourceClaims, partitionable devices, GPU sharing, and structured parameters for accelerators.
Kueue for Batch Jobs and GPU Queues
Use Kueue to manage batch job queues on Kubernetes. GPU quota, fair sharing, priority queues, ML training workloads, and multi-tenant cluster scheduling.
Llama 2 70B FP16 Model Size 140GB Guide
Llama 2 70B FP16 model size is 140GB. Complete GPU memory requirements for FP16, FP8, INT4 quantization, and multi-GPU tensor parallelism on Kubernetes.
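The 140GB figure follows from simple arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming a nominal 70e9 parameters, decimal gigabytes (1 GB = 1e9 bytes), and weights only (KV cache, activations, and runtime overhead are not counted). The function names and the tensor-parallel size of 8 are illustrative, not taken from the recipe.

```python
# Weight-only memory estimate: parameters x bytes per parameter.
# Assumption: KV cache, activations, and framework overhead are ignored.

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_size_gb(num_params: float, precision: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def per_gpu_weight_gb(num_params: float, precision: str, tp_size: int) -> float:
    """Return the per-GPU share when weights are split evenly across tp_size GPUs."""
    return weight_size_gb(num_params, precision) / tp_size


if __name__ == "__main__":
    params = 70e9  # nominal parameter count for a "70B" model (assumption)
    for precision in BYTES_PER_PARAM:
        total = weight_size_gb(params, precision)
        shard = per_gpu_weight_gb(params, precision, tp_size=8)
        print(f"{precision}: {total:.1f} GB total, {shard:.2f} GB per GPU at TP=8")
```

Real deployments need headroom beyond these numbers, so treat the result as a lower bound when sizing GPU memory.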
Physical AI and Robotics Orchestration
Orchestrate physical AI and robotics fleets with Kubernetes. ROS 2 on K8s, robot fleet management, edge-cloud hybrid, NVIDIA Isaac.
Quantum Computing on K8s: Hybrid Workflows
Run quantum computing workloads on Kubernetes. Qiskit, Cirq, PennyLane hybrid classical-quantum pipelines, quantum job scheduling, and QPU integration patterns.
Run:ai Topology-Aware Scheduling Deep Dive
Configure Run:ai topology-aware scheduling for distributed AI workloads. Multi-level hierarchies, required vs preferred placement, LeaderWorkerSet.
NIM Model Profiles and Selection on Kubernetes
Configure NIM_MODEL_PROFILE for NVIDIA NIM deployments on Kubernetes. List profiles, select by ID or name, tune VRAM, and override with vLLM CLI args.
NIM Multi-Node Deployment with Helm on K8s
Deploy NVIDIA NIM across multiple Kubernetes nodes using Helm, LeaderWorkerSet, Ray, and vLLM. Run Llama 405B and DeepSeek-R1 on 16+ GPUs.
NIM LLM Support Matrix and GPU Compatibility
Complete NVIDIA NIM support matrix for Kubernetes. Supported models, profiles, precision formats, GPU compatibility, and hardware requirements per model.
NVIDIA Dynamo Distributed Inference
Deploy NVIDIA Dynamo on Kubernetes for disaggregated LLM inference. KV-aware routing, prefill/decode splitting, Grove operator, and zero-config deployment.
Rebuild NIM with Custom Model on Kubernetes
Step-by-step guide to deploy custom, fine-tuned, or self-hosted models with NVIDIA NIM on Kubernetes. Model-free NIM from HuggingFace, S3, NGC, or local path.
Run:ai + Dynamo Multi-Node Scheduling on K8s
Deploy NVIDIA Dynamo with Run:ai v2.23 for gang scheduling and topology-aware placement. Atomic pod launches, zone co-location, and disaggregated inference.
Copy NVIDIA NIM Images to Internal Quay Reg...
Pull NIM container images from nvcr.io and push to an internal Quay registry. Covers authentication, tagging, air-gapped workflows, and curl token issues.
Deploy Multinode NIM Models on Kubernetes
Run large language models across multiple GPU nodes with NVIDIA NIM. Tensor parallelism, NCCL, InfiniBand, and Kubernetes Job orchestration.
Distributed Inference with Run:ai
Deploy distributed AI inference with NVIDIA Run:ai on Kubernetes. Single-node Knative, multinode LeaderWorkerSet, NIM, autoscaling, and observability.
Run:ai NIM Distributed Inference Tutorial
Step-by-step guide to deploy DeepSeek-R1 distributed inference on Run:ai with LeaderWorkerSet, SGLang, PVC caching, and OpenShift security.
Kubeflow Operator: Full ML Platform
Deploy the complete Kubeflow platform on Kubernetes with the Kubeflow Operator. Covers Pipelines, Notebooks, KServe, Katib, and multi-tenant ML workflows.
GPU Sharing with MPS and MIG on Kubernetes
Share NVIDIA GPUs across multiple pods using MPS (Multi-Process Service) and MIG hardware partitioning. Maximize GPU utilization for inference workloads.
Node Feature Discovery Operator for Kubernetes
Install and configure Node Feature Discovery (NFD) Operator to auto-detect hardware features like GPUs, NICs, CPU flags, and USB devices on Kubernetes nodes.
Enable GPUDirect Storage in ClusterPolicy
Enable NVIDIA GPUDirect Storage (GDS) in the GPU Operator ClusterPolicy for direct GPU-to-NVMe data paths. Driver module configuration and verification.
GPU Time-Slicing on Kubernetes
Share GPUs across multiple workloads using NVIDIA time-slicing on Kubernetes. Configure the device plugin, set replica counts, and manage fairness.
NVIDIA GPU Operator Setup on Kubernetes
Install and configure NVIDIA GPU Operator on Kubernetes. Driver containers, toolkit, device plugin, DCGM monitoring, and ClusterPolicy setup.
NVIDIA Open GPU + GPUDirect RDMA + DOCA-OFE...
Deploy NVIDIA AI networking on Kubernetes: Open GPU driver with DMA-BUF, GPUDirect RDMA, DOCA-OFED, and SR-IOV VF isolation.
AI Model Storage: hostPath vs PVC Inference
Deploy AI models on Kubernetes using hostPath and PVC storage. Compare performance, security trade-offs, and production patterns for model serving.
Volcano Job minAvailable Gang Scheduling
Configure Volcano job minAvailable for gang scheduling on Kubernetes. Batch AI training, fair-share queues, job plugins, and GPU preemption guide.
AIPerf Benchmark LLMs on Kubernetes
Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, throughput with real-time dashboard and GPU telemetry.
AIPerf Concurrency Sweep on K8s
Run AIPerf concurrency sweeps on Kubernetes to find optimal LLM serving capacity. Automate 1-128 concurrent user benchmarks with batch Jobs.
AIPerf Goodput and SLO Benchmarks
Measure LLM goodput with AIPerf on Kubernetes. Define SLOs for TTFT and ITL, calculate effective throughput, and benchmark with timeslice analysis.
AIPerf Multi-Model Benchmark on K8s
Compare multiple LLM models and backends with AIPerf on Kubernetes. Benchmark vLLM vs TGI vs Triton with automated multi-run confidence intervals.
AIPerf Trace Replay Benchmarks on K8s
Replay production traffic traces with AIPerf on Kubernetes. Use moon_cake format, ShareGPT datasets, and fixed schedules for realistic LLM benchmarks.
Dell PowerEdge XE7740 GPU Node Setup
Configure Dell PowerEdge XE7740 GPU nodes with H200 GPUs for OpenShift and Kubernetes, including BIOS, power, cooling, and network setup.
Deploy Fish Audio TTS on Kubernetes
Deploy Fish Audio S2-Pro 5B text-to-speech model on Kubernetes for high-quality voice synthesis with multi-speaker support and streaming audio.
Deploy GLM-5 754B on Kubernetes
Deploy Zhipu AI GLM-5 754B model on Kubernetes with vLLM. One of the largest open-weight models with multi-node tensor parallelism across 8+ GPUs.
Deploy Granite 4.0 Speech on Kubernetes
Deploy IBM Granite 4.0 1B Speech model on Kubernetes for automatic speech recognition. Lightweight model runs on CPU or small GPU for STT workloads.
Deploy Kimi K2.5 1.1T MoE on Kubernetes
Deploy Moonshot AI Kimi-K2.5 1.1T MoE multimodal model on Kubernetes. The largest open MoE model with 2.69M downloads for frontier AI tasks.
Deploy Llama 2 70B on Kubernetes
Deploy Meta Llama 2 70B on Kubernetes with multi-GPU tensor parallelism, vLLM serving, and production-ready health checks and resource limits.
Deploy Llama 3.1 8B Instruct on K8s
Deploy Meta Llama 3.1 8B Instruct on Kubernetes with vLLM. Production-ready single-GPU deployment with 128K context, tool calling, and autoscaling.
Deploy LTX Video Generation on K8s
Deploy Lightricks LTX-2.3 image-to-video model on Kubernetes for AI video generation with batch processing and S3 output storage.
Deploy MiniMax M2.5 229B on Kubernetes
Deploy MiniMax M2.5 229B model on Kubernetes with vLLM. High-performance LLM with 485K downloads, optimized for multi-turn conversation and long context.
Deploy NVIDIA Nemotron 120B MoE on K8s
Deploy NVIDIA Nemotron-3-Super-120B-A12B MoE model on Kubernetes. 120B total parameters with 12B active for enterprise-grade inference.
Deploy Microsoft Phi-4 on Kubernetes
Deploy Microsoft Phi-4 small language model on Kubernetes with vLLM. Efficient 14B model with GPT-4-level reasoning on a single GPU.
Deploy Phi-4 Reasoning Vision on K8s
Deploy Microsoft Phi-4-reasoning-vision-15B on Kubernetes for multimodal chain-of-thought reasoning with visual understanding on a single GPU.
Deploy Qwen3 235B MoE on Kubernetes
Deploy Alibaba Qwen3-235B-A22B mixture-of-experts model on Kubernetes. Only 22B parameters active per token for efficient 235B-class inference.
Deploy Qwen3 Coder 80B on Kubernetes
Deploy Qwen3-Coder-Next 80B on Kubernetes for code generation, review, and refactoring. Production-ready AI coding assistant with multi-GPU serving.
Deploy Qwen3 TTS on Kubernetes
Deploy Qwen3-TTS-12Hz-1.7B-CustomVoice on Kubernetes for text-to-speech with custom voice cloning. 1.13M downloads, lightweight single-GPU deployment.
Deploy Qwen3.5 35B MoE on Kubernetes
Deploy Alibaba Qwen3.5-35B-A3B mixture-of-experts multimodal model on Kubernetes. 35B total parameters with only 3B active for ultra-efficient inference.
Deploy Qwen3.5 397B MoE on Kubernetes
Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.
Deploy Qwen3.5 9B Multimodal on K8s
Deploy Alibaba Qwen3.5-9B vision-language model on Kubernetes with vLLM. Process images and text with a single GPU deployment.
RetinaNet Object Detection on K8s
Deploy RetinaNet object detection model on Kubernetes with Triton Inference Server, TensorRT optimization, and batch processing pipelines.
Deploy Sarvam 105B on Kubernetes
Deploy Sarvam 105B multilingual LLM on Kubernetes with vLLM. India's largest open language model with native support for 10+ Indic languages.
Stable Diffusion XL on Kubernetes
Deploy Stable Diffusion XL for image generation on Kubernetes with TensorRT acceleration, queued batch processing, and S3 output storage.
Deploy Whisper Speech-to-Text on K8s
Deploy OpenAI Whisper for speech-to-text on Kubernetes with faster-whisper, batch transcription Jobs, and real-time streaming endpoints.
Distributed Inference on Kubernetes
Deploy distributed LLM inference with tensor parallelism across multiple GPUs and pipeline parallelism across nodes on Kubernetes.
GenAI-Perf Benchmark LLM Serving
Benchmark LLM inference endpoints with NVIDIA GenAI-Perf for throughput, latency percentiles, time-to-first-token, and ITL metrics.
GenAI-Perf Benchmark Triton on K8s
Benchmark NVIDIA Triton Inference Server performance on Kubernetes using GenAI-Perf. Measure TTFT, inter-token latency, throughput, and GPU telemetry.
Distributed Training with Kubeflow Training Operator
Run multi-node distributed PyTorch and TensorFlow training jobs using Kubeflow Training Operator with NCCL, RDMA, and shared storage.
Kubeflow Training Operator on Kubernetes
Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.
LeaderWorkerSet Operator for AI Workloads
Deploy distributed AI training with LeaderWorkerSet Operator on Kubernetes and OpenShift for leader-worker topology with gang scheduling.
Llama Stack on Kubernetes with NVIDIA NIM
Deploy Meta Llama Stack on Kubernetes for unified inference, RAG, agents, and safety APIs using NVIDIA NIM as the inference backend.
MLPerf Benchmarking on Kubernetes
Run MLPerf inference and training benchmarks on Kubernetes GPU clusters to validate AI workload performance and compare hardware configurations.
Shared Model Caching Across Pods on Kubernetes
Optimize LLM inference startup and reduce storage costs by sharing model weights across pods using emptyDir, hostPath, ReadWriteMany PVCs, and init.
MPI Operator for Distributed Training
Deploy MPI Operator on Kubernetes for distributed GPU training with Horovod and NCCL. Run multi-node MPI jobs natively in Kubernetes pods.
Deploy NVIDIA Clara on Kubernetes
Deploy NVIDIA Clara medical AI and drug discovery platform on Kubernetes. Run digital biology and medtech inference workloads with GPU acceleration.
NVIDIA H200 GPU Workloads on Kubernetes
Deploy and optimize AI workloads on NVIDIA H200 GPUs with 141GB HBM3e memory for large model inference and training on Kubernetes.
NVIDIA NeMo Training on Kubernetes
Deploy NVIDIA NeMo framework on Kubernetes for large language model pre-training, fine-tuning, and RLHF with multi-node GPU clusters.
NVIDIA Pyxis and Enroot for SLURM
Use NVIDIA Pyxis and Enroot to run GPU containers in SLURM jobs. Bridge SLURM HPC scheduling with container-native AI workloads and NGC images.
Run:AI GPU Quotas on OpenShift
Configure Run:AI scheduler quotas for fair GPU sharing with guaranteed quotas, over-quota borrowing, and per-tenant GPU allocation policies.
SLURM and Kubernetes Integration
Integrate SLURM HPC workload manager with Kubernetes for hybrid AI and scientific computing. Bridge HPC batch scheduling with container orchestration.
Time-Slicing vs MIG vs Full GPU Allocation
Compare GPU sharing strategies: time-slicing for notebooks, MIG for isolated inference, and full GPU for training workloads.
Triton Autoscaling with GPU Metrics
Autoscale Triton Inference Server on Kubernetes using GPU utilization, request queue depth, and inference latency metrics with KEDA and HPA.
Triton Multi-Model Serving on Kubernetes
Serve multiple LLMs simultaneously on Triton Inference Server using TensorRT-LLM and vLLM backends with model routing and GPU scheduling.
Triton TensorRT-LLM on Kubernetes
Deploy NVIDIA Triton Inference Server with TensorRT-LLM backend on Kubernetes for optimized large language model serving with GPU acceleration.
TensorRT-LLM vs vLLM on Triton
Compare TensorRT-LLM and vLLM backends on Triton Inference Server. When to use each, performance benchmarks, and migration strategies.
Triton with vLLM Backend on Kubernetes
Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.
Deploying Vector Databases on Kubernetes
Deploy and operate vector databases (Milvus, Weaviate, Qdrant) on Kubernetes for RAG pipelines, semantic search, and AI applications with persistent.
Compare NCCL Intra-Node vs Inter-Node Perfo...
Build a repeatable comparison between local and cross-node NCCL throughput to validate GPU cluster interconnect scaling and identify bottlenecks early.
Run NCCL AllGather Benchmarks Model Paralle...
Use all-gather NCCL tests to evaluate GPU communication behavior and throughput for tensor-parallel and model-parallel distributed AI workloads on Kubernetes.
Benchmark NCCL AllReduce Performance
Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.
Run NCCL Tests for GPU Network Validation
Benchmark GPU-to-GPU communication using NVIDIA nccl-tests on Kubernetes or OpenShift to validate bandwidth and latency.
Deploy Mistral 7B with NVIDIA NIM
Step-by-step guide to deploy Mistral-7B using NVIDIA NIM with TensorRT-LLM backend on Kubernetes for optimized GPU inference.
Deploy Mistral 7B with vLLM on Kubernetes
Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.
Autoscale LLM Inference on Kubernetes
Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.
Quantize LLMs for Efficient GPU Inference
Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.
Kubernetes LLM Serving Frameworks Compared
Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes — features, performance, and when to use each.
Multi-GPU and Tensor Parallel LLM Inference
Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.
Install NVIDIA GPU Operator on Kubernetes
Deploy the NVIDIA GPU Operator to automate GPU driver, container toolkit, and device plugin management across your Kubernetes cluster.
Build a RAG Pipeline on Kubernetes
Deploy a Retrieval-Augmented Generation pipeline on Kubernetes using a vector database, embedding model, and LLM inference server.
Test LLM Inference Endpoints with curl
Validate Kubernetes-hosted LLM inference services using curl against OpenAI-compatible /v1/models, /v1/completions, and /v1/chat/completions endpoints.
GPU Sharing and Bin Packing with KAI Scheduler
Maximize GPU utilization with KAI Scheduler GPU sharing, fractional GPUs, and bin packing strategies for Kubernetes AI workloads.
Installing NVIDIA KAI Scheduler for AI Workloads
Deploy KAI Scheduler for optimized GPU resource allocation in Kubernetes AI/ML clusters with hierarchical queues and batch scheduling.
Hierarchical Queues & Resource Fairness KAI...
Configure hierarchical queues in KAI Scheduler for multi-tenant GPU clusters with quotas, limits, and Dominant Resource Fairness (DRF).
Batch Scheduling PodGroups in KAI Scheduler
Implement gang scheduling for distributed training jobs using KAI Scheduler PodGroups to ensure all-or-nothing pod scheduling.
Topology-Aware Scheduling with KAI Scheduler
Optimize GPU workload placement using KAI Scheduler's Topology-Aware Scheduling (TAS) for NVLink, NVSwitch, and disaggregated serving architectures.