ai advanced ⏱ 30 minutes K8s 1.28+

AI Supercomputing on Kubernetes GPU Clusters

Build AI supercomputing platforms on Kubernetes. Multi-node GPU training, NVIDIA DGX SuperPOD, InfiniBand RDMA, NCCL tuning, Blackwell clusters.

By Luca Berton • 📖 5 min read

💡 Quick Answer: AI supercomputing on Kubernetes combines the GPU Operator, the Network Operator (for InfiniBand/RDMA), and topology-aware scheduling to run distributed training across hundreds of GPUs. Key components: LeaderWorkerSet for multi-node jobs, NCCL for GPU communication, GPUDirect RDMA for bypassing the CPU, and Run:ai or Volcano for fair scheduling.

The Problem

Training frontier AI models requires thousands of GPUs working in concert on systems such as NVIDIA's DGX SuperPOD, Blackwell GB200 NVL72 racks, and custom HPC clusters. Kubernetes is becoming the orchestration layer for these supercomputers, but it needs specialized networking (InfiniBand), topology-aware placement (NVLink domains), and HPC-grade job scheduling that vanilla Kubernetes doesn't provide.

flowchart TB
    subgraph SUPERPOD["AI Supercomputer (Kubernetes)"]
        subgraph RACK1["Rack 1 (NVLink Domain)"]
            DGX1["DGX H100 #1<br/>8× H100 GPUs"]
            DGX2["DGX H100 #2<br/>8× H100 GPUs"]
        end
        subgraph RACK2["Rack 2 (NVLink Domain)"]
            DGX3["DGX H100 #3<br/>8× H100 GPUs"]
            DGX4["DGX H100 #4<br/>8× H100 GPUs"]
        end
        IB["InfiniBand Fabric<br/>400 Gb/s RDMA"]
        RACK1 & RACK2 --> IB
    end
    
    SCHED["Topology-Aware<br/>Gang Scheduler"] --> SUPERPOD
    TRAIN["Distributed Training Job<br/>(PyTorch DDP / DeepSpeed)"] --> SCHED

The Solution

GPU Cluster Stack

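# Install order matters: each component depends on the one before it.
# Assumes the chart repos are already added, for example:
#   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
#   helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
# 1. Node Feature Discovery (upstream chart)
helm install nfd nfd/node-feature-discovery \
  --namespace gpu-operator --create-namespace

# 2. GPU Operator (NFD is installed in step 1, so disable the bundled copy)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set nfd.enabled=false \
  --set driver.enabled=true \
  --set driver.useOpenKernelModules=true \
  --set toolkit.enabled=true \
  --set mig.strategy=mixed

# 3. Network Operator (InfiniBand / RDMA); value names vary between chart releases
helm install network-operator nvidia/network-operator \
  --namespace nvidia-network-operator --create-namespace \
  --set nfd.enabled=false \
  --set deployCR=true \
  --set sriovDevicePlugin.deploy=true \
  --set rdmaSharedDevicePlugin.deploy=true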

# 4. Topology-aware scheduler (Run:ai or Volcano)
helm install runai runai/runai-cluster \
  --namespace runai --create-namespace \
  --set scheduler.topology.enabled=true
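
Once the stack is installed, it helps to confirm that each GPU node actually advertises the resources the training job will request. The sketch below is one possible check, not part of the original recipe: it takes the parsed output of `kubectl get nodes -o json` and reports nodes whose allocatable counts fall short. The function name `nodes_missing_resources` and the `REQUIRED` table are hypothetical; the resource names are the ones used in the manifests in this recipe.

```python
# Resource names and counts taken from the training manifest in this recipe.
REQUIRED = {"nvidia.com/gpu": 8, "rdma/rdma_shared_device_a": 1}


def nodes_missing_resources(node_list: dict, required: dict = REQUIRED) -> dict:
    """Map node name -> list of resources whose allocatable count is too low.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    """
    problems = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        short = [
            resource
            for resource, wanted in required.items()
            # Extended resources are reported as integer strings such as "8".
            if int(allocatable.get(resource, "0")) < wanted
        ]
        if short:
            problems[name] = short
    return problems
```

A typical use would be to pipe `kubectl get nodes -l nvidia.com/gpu.present=true -o json` into a small script that calls `json.load(sys.stdin)` and passes the result to this function, so that control-plane and CPU-only nodes are not flagged.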

Multi-Node Training Job (LeaderWorkerSet)

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-training
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 4                          # 4 nodes × 8 GPUs = 32 GPUs
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: trainer
            image: nvcr.io/nvidia/pytorch:24.04-py3
            command: ["torchrun"]
            args:
              - "--nnodes=4"
              - "--nproc_per_node=8"
              - "--rdzv_backend=c10d"
              - "--rdzv_endpoint=$(LWS_LEADER_ADDRESS):29500"
              - "train.py"
              - "--model=llama-3.1-70b"
              - "--batch-size=32"
              - "--gradient-accumulation=4"
            env:
              - name: NCCL_IB_HCA
                value: "mlx5"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              - name: NCCL_DEBUG
                value: "INFO"
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/rdma_shared_device_a: 1
            volumeMounts:
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
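    workerTemplate:
      spec:
        containers:
          - name: trainer
            image: nvcr.io/nvidia/pytorch:24.04-py3
            command: ["torchrun"]
            args:                    # must match the leader: every rank runs the same script and flags
              - "--nnodes=4"
              - "--nproc_per_node=8"
              - "--rdzv_backend=c10d"
              - "--rdzv_endpoint=$(LWS_LEADER_ADDRESS):29500"
              - "train.py"
              - "--model=llama-3.1-70b"
              - "--batch-size=32"
              - "--gradient-accumulation=4"
            env:
              - name: NCCL_IB_HCA
                value: "mlx5"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/rdma_shared_device_a: 1
            volumeMounts:            # workers need the same /dev/shm as the leader
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi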
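
Inside each pod, `torchrun` starts one process per GPU and exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to it. The following sketch shows how a script like `train.py` might read those variables and work out the effective global batch size. The recipe does not show `train.py`, so the `RankInfo` type, both helper names, and the assumption that `--batch-size` is a per-GPU value are illustrative only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    rank: int         # global rank, 0 .. world_size - 1
    local_rank: int   # GPU index on this node
    world_size: int   # total processes across all nodes


def read_rank_info(env=os.environ) -> RankInfo:
    """Read the variables torchrun exports to every worker process."""
    return RankInfo(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )


def global_batch_size(per_gpu_batch: int, grad_accum: int, world_size: int) -> int:
    """Samples per optimizer step, assuming the batch size flag is per GPU."""
    return per_gpu_batch * grad_accum * world_size
```

With the values in the manifest (4 nodes × 8 GPUs, `--batch-size=32`, `--gradient-accumulation=4`), `global_batch_size(32, 4, 32)` gives 4096 samples per optimizer step under that per-GPU assumption.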

NCCL Performance Tuning

env:
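  # InfiniBand configuration
  - name: NCCL_IB_HCA
    value: "mlx5"                    # Match Mellanox/NVIDIA HCAs by name prefix
  - name: NCCL_IB_GID_INDEX
    value: "3"                       # RoCEv2 GID index; only relevant on RoCE, not native InfiniBand
  - name: NCCL_NET_GDR_LEVEL
    value: "5"                       # GPUDirect RDMA distance threshold (legacy numeric form)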
  
  # Performance tuning
  - name: NCCL_ALGO
    value: "Ring,Tree"               # AllReduce algorithms
  - name: NCCL_PROTO
    value: "Simple,LL,LL128"         # Protocols
  - name: NCCL_MIN_NCHANNELS
    value: "4"
  - name: NCCL_MAX_NCHANNELS
    value: "32"
  
  # Debugging
  - name: NCCL_DEBUG
    value: "WARN"                    # INFO for troubleshooting
  - name: NCCL_DEBUG_SUBSYS
    value: "INIT,NET"
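
To judge whether tuning is paying off, it helps to know roughly what the fabric can deliver. The sketch below estimates the bandwidth-bound time of a ring all-reduce, in which each rank sends and receives 2(N-1)/N times the message size, and converts a measured algorithm bandwidth into the "bus bandwidth" figure that nccl-tests reports. It ignores latency, overlap with compute, and intra-node NVLink traffic, so treat the result as a rough lower bound. The function names are hypothetical.

```python
def ring_allreduce_seconds(message_bytes: float, ranks: int, link_gbps: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce over `ranks` participants.

    `link_gbps` is the per-rank link speed in gigabits per second.
    """
    if ranks < 2:
        return 0.0  # nothing to exchange with a single rank
    bytes_per_second = link_gbps * 1e9 / 8
    traffic_per_rank = 2 * (ranks - 1) / ranks * message_bytes
    return traffic_per_rank / bytes_per_second


def bus_bandwidth(alg_bandwidth: float, ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth (nccl-tests convention)."""
    return alg_bandwidth * 2 * (ranks - 1) / ranks
```

For example, reducing a 1 GB buffer across 4 ranks on the 400 Gb/s fabric from the diagram gives `ring_allreduce_seconds(1e9, 4, 400)`, about 0.03 seconds. Measured times far above such an estimate suggest traffic is falling back to TCP or that GPUDirect RDMA is not active.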

Topology-Aware Scheduling

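# Keep all training pods in the same topology domain.
# topologySpreadConstraints would spread pods across domains, so use pod affinity to co-locate them.
# The kai.scheduler annotations only take effect with a scheduler that reads them;
# their values depend on how topology is configured in your cluster.
metadata:
  labels:
    job: llama-training
  annotations:
    kai.scheduler/topology: "nvidia.com/gpu.product"
    kai.scheduler/topology-preferred-placement: "rack"
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              job: llama-training
          topologyKey: topology.kubernetes.io/zone   # swap in your rack or NVLink-domain node label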
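
The placement decision a topology-aware scheduler makes can be illustrated with a small model. The sketch below is not how Run:ai, KAI, or Volcano are implemented: it groups fully free nodes by a topology label and picks the smallest domain that still fits the whole job, which limits fragmentation of larger domains. The node dictionary shape and the name `pick_domain` are assumptions made for the example.

```python
from collections import defaultdict


def pick_domain(nodes, label_key, nodes_needed, gpus_per_node=8):
    """Choose one topology domain that can host every pod of the job.

    `nodes` is a list of dicts with "name", "labels", and "free_gpus".
    Returns (domain, [node names]) or None if no single domain fits.
    """
    free_by_domain = defaultdict(list)
    for node in nodes:
        domain = node["labels"].get(label_key)
        # Only count nodes that can host a full 8-GPU pod.
        if domain is not None and node["free_gpus"] >= gpus_per_node:
            free_by_domain[domain].append(node["name"])

    candidates = [
        (len(names), domain)
        for domain, names in free_by_domain.items()
        if len(names) >= nodes_needed
    ]
    if not candidates:
        return None
    # Best fit: the domain with the fewest free nodes that still suffices.
    _, best = min(candidates)
    return best, sorted(free_by_domain[best])[:nodes_needed]
```

Returning `None` rather than a partial placement matters here: splitting a job across domains pushes its collective traffic over slower links.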

GPU Health Monitoring

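# DCGM Exporter for GPU metrics
# The GPU Operator deploys DCGM Exporter by default; use a standalone DaemonSet only if that component is disabled.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:                          # required: must match the pod template labels
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
          env:
            - name: DCGM_EXPORTER_COLLECTORS
              value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
          ports:
            - containerPort: 9400
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]
          resources:
            limits:
              nvidia.com/gpu: 0      # Monitor only, don't consume GPUs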
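
The exporter serves Prometheus text on port 9400, so a quick health check can be scripted without a full monitoring stack. The sketch below scans that text for `DCGM_FI_DEV_XID_ERRORS`, which reports the last XID error seen on each GPU, and lists the GPUs with a nonzero value. It assumes the metric is enabled in the collectors CSV and that samples carry `Hostname` and `gpu` labels; the helper name is hypothetical and the parser is deliberately minimal.

```python
import re

# Matches lines such as: METRIC_NAME{label="value",...} 42
_SAMPLE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def gpus_with_xid_errors(metrics_text: str):
    """Return (hostname, gpu index, xid) for every GPU reporting a nonzero XID."""
    flagged = []
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line)  # comment lines start with '#' and never match
        if not match or match.group("name") != "DCGM_FI_DEV_XID_ERRORS":
            continue
        xid = int(float(match.group("value")))
        if xid != 0:
            labels = dict(_LABEL.findall(match.group("labels")))
            flagged.append((labels.get("Hostname", "?"), labels.get("gpu", "?"), xid))
    return flagged
```

In production the same condition is better expressed as a Prometheus alert rule; a script like this is mainly useful for spot checks before launching a long job.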

Storage for Training Data

# High-performance shared storage for datasets
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany                  # Shared across all training nodes
  resources:
    requests:
      storage: 10Ti
  storageClassName: lustre-fast      # Lustre, GPFS, or WekaFS
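
Whether a storage class is fast enough depends on how quickly the GPUs consume samples. A back-of-the-envelope estimate, sketched below, multiplies GPU count, per-GPU sample rate, and sample size to get the aggregate read throughput the filesystem must sustain. The function name and every number in the example are hypothetical; measure your own data loader to get real inputs.

```python
def required_read_gbytes_per_s(num_gpus: int,
                               samples_per_s_per_gpu: float,
                               bytes_per_sample: float) -> float:
    """Aggregate read throughput (GB/s) needed to keep all GPUs fed."""
    return num_gpus * samples_per_s_per_gpu * bytes_per_sample / 1e9
```

For instance, 32 GPUs each consuming 500 samples per second at 200 KB per sample need `required_read_gbytes_per_s(32, 500, 200_000)`, or 3.2 GB/s, before any caching or prefetching is taken into account.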

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| NCCL timeout | Network misconfiguration | Check `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME` |
| GPUDirect RDMA not working | Missing open kernel modules | Set `useOpenKernelModules: true` |
| Training hangs at barrier | One node slower/failed | Enable NCCL watchdog timeout |
| Uneven GPU utilization | Poor topology placement | Use topology-aware scheduler |
| OOM on /dev/shm | Shared memory too small | Set `sizeLimit: 64Gi` on emptyDir |
| Slow data loading | Storage bandwidth bottleneck | Use Lustre/WekaFS, increase `num_workers` |

Best Practices

  • Use InfiniBand/RDMA for multi-node training: Ethernet adds 2-5× latency
  • Enable GPUDirect RDMA: bypasses CPU for GPU-to-GPU communication
  • Topology-aware scheduling: keep training pods in the same NVLink domain
  • Shared memory must be large: PyTorch DataLoader needs /dev/shm
  • Monitor with DCGM: detect GPU errors before they stall training
  • Use gang scheduling: all-or-nothing, so either all N nodes start or none do
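
The all-or-nothing rule behind gang scheduling can be stated in a few lines. The sketch below is a simplified model, not the logic of any particular scheduler: it admits a job only when enough nodes have a full set of free GPUs, and otherwise admits nothing, so a half-placed job never sits on GPUs while waiting for the rest. The name `gang_admit` and the input shape are assumptions for illustration.

```python
def gang_admit(free_gpus_by_node: dict, nodes_needed: int, gpus_per_node: int = 8):
    """Return the nodes to bind for the whole job, or None to keep it queued."""
    eligible = sorted(
        name for name, free in free_gpus_by_node.items() if free >= gpus_per_node
    )
    if len(eligible) < nodes_needed:
        # Admit nothing: a partial placement would hold GPUs idle at the first barrier.
        return None
    return eligible[:nodes_needed]
```

With `{"n1": 8, "n2": 8, "n3": 4, "n4": 8}`, a 3-node job is admitted on `n1`, `n2`, and `n4`, while a 4-node job stays queued until `n3` frees up.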

Key Takeaways

  • AI supercomputing on K8s: GPU Operator + Network Operator + topology-aware scheduler
  • LeaderWorkerSet manages multi-node training jobs (leader + workers)
  • NCCL environment variables are critical for communication performance
  • GPUDirect RDMA over DMA-BUF requires open kernel modules (`useOpenKernelModules: true`); the legacy path relies on the `nvidia-peermem` module instead
  • Topology-aware placement keeps pods near each other (same rack, same NVLink domain)
  • 2026 trend: Kubernetes replacing SLURM as the AI supercomputing orchestrator
#supercomputing #gpu-clusters #nvidia-dgx #infiniband #distributed-training