AI · Advanced · ⏱ 45 minutes · Kubernetes 1.28+

Topology-Aware Scheduling with KAI Scheduler

Optimize GPU workload placement using KAI Scheduler's Topology-Aware Scheduling (TAS) for NVLink, NVSwitch, and disaggregated serving architectures

By Luca Berton

The Problem

Modern GPU servers have complex topologies with NVLink, NVSwitch, and PCIe interconnects. Placing workloads without considering topology leads to suboptimal performance—GPUs communicating over slow PCIe instead of fast NVLink can drastically slow distributed training.

The Solution

Use KAI Scheduler’s Topology-Aware Scheduling (TAS) to place workloads on GPUs with optimal interconnect topology. TAS understands NVLink domains and NVSwitch fabrics, and it ensures that distributed training jobs get GPUs with the fastest communication paths.

GPU Topology Architecture

flowchart TB
    subgraph node["🖥️ DGX H100 NODE"]
        subgraph nvswitch["🔀 NVSwitch Fabric"]
            SW1["NVSwitch 1"]
            SW2["NVSwitch 2"]
            SW3["NVSwitch 3"]
            SW4["NVSwitch 4"]
        end
        
        subgraph gpus["🎮 8x H100 GPUs"]
            direction LR
            G0["GPU 0"]
            G1["GPU 1"]
            G2["GPU 2"]
            G3["GPU 3"]
            G4["GPU 4"]
            G5["GPU 5"]
            G6["GPU 6"]
            G7["GPU 7"]
        end
        
        subgraph domains["📦 NVLink Domains"]
            D1["Domain A<br/>GPU 0-3"]
            D2["Domain B<br/>GPU 4-7"]
        end
    end
    
    subgraph kai["🎯 KAI TAS"]
        TAS["Topology-Aware<br/>Scheduler"]
        TOPO["Topology<br/>Discovery"]
    end
    
    gpus <--> nvswitch
    domains --> kai
    kai --> gpus

Step 1: Enable Topology-Aware Scheduling

# Install KAI with TAS enabled
helm upgrade -i kai-scheduler \
  oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler \
  -n kai-scheduler \
  --version v0.12.10 \
  --set topologyAwareScheduling.enabled=true \
  --set topologyAwareScheduling.nvlinkAware=true
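
A quick sanity check after the install; the release name and namespace below match the Helm command above:

# Confirm the scheduler pods are running
kubectl get pods -n kai-scheduler

# Confirm the TAS values were applied to the release
helm get values kai-scheduler -n kai-scheduler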

Step 2: View GPU Topology

# Check GPU topology on a node
nvidia-smi topo -m

# Example output for 8-GPU system:
#         GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
# GPU0     X    NV18  NV18  NV18  NV18  NV18  NV18  NV18
# GPU1    NV18   X    NV18  NV18  NV18  NV18  NV18  NV18
# ...

# NV18 = 18 bonded NVLink links (fast)
# SYS = System/PCIe (slow)
# PHB = PCIe Host Bridge
# NODE = Same NUMA node
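
To inspect topology across every GPU node without SSH access, one option is to exec into the NVIDIA driver daemonset pods; the namespace and label selector below assume a default GPU Operator install and may differ in your cluster:

# Run the topology query on each GPU node via the driver daemonset pods
for pod in $(kubectl get pods -n gpu-operator \
    -l app=nvidia-driver-daemonset -o name); do
  echo "=== ${pod} ==="
  kubectl exec -n gpu-operator "${pod}" -- nvidia-smi topo -m
done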

Step 3: Label Nodes with Topology Information

# topology-labels.yaml
apiVersion: v1
kind: Node
metadata:
  name: dgx-h100-01
  labels:
    nvidia.com/gpu.count: "8"
    nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
    nvidia.com/nvswitch.present: "true"
    nvidia.com/nvlink.version: "4"
    topology.kubernetes.io/zone: "gpu-zone-1"
    # Custom topology labels
    gpu.topology/nvlink-domain-size: "8"
    gpu.topology/interconnect: "nvswitch"

# Or apply the key labels imperatively:
kubectl label node dgx-h100-01 \
  nvidia.com/nvswitch.present=true \
  gpu.topology/interconnect=nvswitch
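
Maintaining these labels by hand is error-prone. Below is a rough sketch of deriving them from nvidia-smi output; the driver-pod lookup assumes a GPU Operator install, and the NVSwitch detection heuristic is an assumption rather than an official mechanism:

# Sketch: derive custom topology labels from nvidia-smi on one node
NODE=dgx-h100-01
POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName=${NODE} -o name | head -n1)

# Count GPUs on the node
GPU_COUNT=$(kubectl exec -n gpu-operator "${POD}" -- nvidia-smi -L | wc -l)

# Heuristic: any NV<n> entries in the topology matrix -> treat as NVLink/NVSwitch
if kubectl exec -n gpu-operator "${POD}" -- nvidia-smi topo -m | grep -Eq "NV[0-9]+"; then
  INTERCONNECT=nvswitch
else
  INTERCONNECT=pcie
fi

kubectl label node "${NODE}" --overwrite \
  gpu.topology/nvlink-domain-size="${GPU_COUNT}" \
  gpu.topology/interconnect="${INTERCONNECT}"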

Step 4: Configure TAS Policies

# tas-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kai-scheduler-config
  namespace: kai-scheduler
data:
  config.yaml: |
    topologyAwareScheduling:
      enabled: true
      policies:
        # Prefer GPUs within same NVLink domain
        - name: nvlink-locality
          weight: 100
          topology: nvlink
          strategy: pack  # Pack within NVLink domain
        
        # Consider NUMA locality
        - name: numa-locality  
          weight: 50
          topology: numa
          strategy: pack
        
        # Prefer nodes with NVSwitch for multi-node
        - name: nvswitch-preference
          weight: 80
          topology: nvswitch
          strategy: prefer
      
      # Topology hierarchy
      levels:
        - name: gpu
          label: nvidia.com/gpu
        - name: nvlink-domain
          label: gpu.topology/nvlink-domain
        - name: numa-node
          label: topology.kubernetes.io/numa-node
        - name: node
          label: kubernetes.io/hostname
kubectl apply -f tas-config.yaml

# Restart the scheduler so it picks up the new policies
kubectl rollout restart statefulset kai-scheduler -n kai-scheduler
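
After the restart, confirm the scheduler actually loaded the topology policies; the label selector below is the same one used later in the Troubleshooting section:

# Wait for the rollout, then look for topology plugin activity in the logs
kubectl rollout status statefulset kai-scheduler -n kai-scheduler
kubectl logs -n kai-scheduler -l app=kai-scheduler --tail=100 | grep -i topology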

Step 5: Request Topology-Aware Placement

# topology-aware-training.yaml
apiVersion: scheduling.run.ai/v2
kind: PodGroup
metadata:
  name: ddp-training-tas
  namespace: ml-training
spec:
  minMember: 4
  queue: training
  topologyPolicy:
    # All pods should be on same NVLink domain
    scope: nvlink-domain
    # Or: same node, same rack, etc.
---
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-nvlink-training
  namespace: ml-training
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      annotations:
        scheduling.run.ai/pod-group: ddp-training-tas
        # Request topology-aware placement
        scheduling.run.ai/topology-policy: nvlink-locality
    spec:
      schedulerName: kai-scheduler
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        command:
        - python
        - -c
        - |
          import torch
          import torch.distributed as dist
          import os
          
          dist.init_process_group(backend='nccl')
          rank = dist.get_rank()
          
          # Test NVLink bandwidth
          tensor = torch.randn(1024, 1024, 1024).cuda()
          
          import time
          start = time.time()
          for _ in range(100):
              dist.all_reduce(tensor)
          torch.cuda.synchronize()
          elapsed = time.time() - start
          
          if rank == 0:
              # Rough estimate: ~2x the tensor's bytes move per all-reduce iteration
              bandwidth = (tensor.numel() * 4 * 100 * 2) / elapsed / 1e9
              print(f"All-reduce bandwidth: {bandwidth:.2f} GB/s")
        resources:
          limits:
            nvidia.com/gpu: 2
        env:
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_TOPO_DUMP_FILE
          value: "/tmp/nccl_topo.xml"

Step 6: Multi-Node Topology-Aware Scheduling

# multi-node-tas.yaml
apiVersion: scheduling.run.ai/v2
kind: PodGroup
metadata:
  name: multi-node-training
  namespace: ml-training
spec:
  minMember: 2  # 2 nodes
  queue: training
  topologyPolicy:
    scope: rack  # Place on nodes in same rack for low latency
    affinity:
      required:
        - key: nvidia.com/nvswitch.present
          operator: In
          values: ["true"]
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-node-0
  namespace: ml-training
  annotations:
    scheduling.run.ai/pod-group: multi-node-training
spec:
  schedulerName: kai-scheduler
  nodeSelector:
    nvidia.com/nvswitch.present: "true"
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 8  # All GPUs on node
    env:
    - name: NCCL_IB_DISABLE
      value: "0"
    - name: NCCL_NET_GDR_LEVEL
      value: "5"

Step 7: Disaggregated Serving with TAS

# disaggregated-serving.yaml
# For architectures like Dynamo/Grove where components need topology awareness
apiVersion: scheduling.run.ai/v2
kind: PodGroup
metadata:
  name: llm-serving-pipeline
  namespace: ml-inference
spec:
  minMember: 3  # Router + Prefill + Decode
  queue: inference
  topologyPolicy:
    scope: node  # All components on same node for low latency
---
# Prefill worker - needs high memory bandwidth
apiVersion: v1
kind: Pod
metadata:
  name: prefill-worker
  namespace: ml-inference
  annotations:
    scheduling.run.ai/pod-group: llm-serving-pipeline
    scheduling.run.ai/topology-role: prefill
spec:
  schedulerName: kai-scheduler
  containers:
  - name: prefill
    image: vllm/vllm-openai:latest
    args:
    - --model=/models/llama-70b
    - --tensor-parallel-size=4
    - --pipeline-parallel-size=1
    resources:
      limits:
        nvidia.com/gpu: 4
---
# Decode worker - needs high compute
apiVersion: v1
kind: Pod
metadata:
  name: decode-worker
  namespace: ml-inference
  annotations:
    scheduling.run.ai/pod-group: llm-serving-pipeline
    scheduling.run.ai/topology-role: decode
spec:
  schedulerName: kai-scheduler
  containers:
  - name: decode
    image: vllm/vllm-openai:latest
    args:
    - --model=/models/llama-70b
    - --tensor-parallel-size=4
    resources:
      limits:
        nvidia.com/gpu: 4
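
With scope: node, every member of the serving gang should end up on the same node. Also note that the PodGroup expects three members (the router is not shown here), so the group stays pending until that third component is created. A quick check once everything is running:

# All pipeline components should report the same NODE
kubectl get pods -n ml-inference \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase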

Step 8: Monitor Topology Placement

# View pod placement with GPU allocation
kubectl get pods -n ml-training -o wide

# Check NCCL topology detection
kubectl exec -it <pod-name> -n ml-training -- \
  cat /tmp/nccl_topo.xml

# View GPU assignments
kubectl exec -it <pod-name> -n ml-training -- \
  nvidia-smi -L

# Check NVLink status
kubectl exec -it <pod-name> -n ml-training -- \
  nvidia-smi nvlink -s

# Verify topology-aware scheduling worked
kubectl get events -n ml-training | grep -i topology
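
If the device plugin injects GPU assignments through environment variables (an assumption; CDI-based setups expose devices differently), the allocated GPU UUIDs can also be read straight from the container environment:

# GPU UUIDs allocated to this container (assumes env-var based device injection)
kubectl exec -it <pod-name> -n ml-training -- \
  sh -c 'echo "${NVIDIA_VISIBLE_DEVICES:-not set}"'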

Topology Scope Options

| Scope | Description | Use Case |
| --- | --- | --- |
| gpu | Single GPU | Interactive notebooks |
| nvlink-domain | GPUs connected via NVLink | Single-node distributed training |
| node | All GPUs on a node | Large model training |
| rack | Nodes in same rack | Multi-node with low latency |
| zone | Availability zone | Geographic distribution |

Step 9: Performance Comparison

# benchmark-topology.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nccl-benchmark
  namespace: ml-training
spec:
  template:
    metadata:
      annotations:
        scheduling.run.ai/topology-policy: nvlink-locality
    spec:
      schedulerName: kai-scheduler
      restartPolicy: Never
      containers:
      - name: benchmark
        image: nvcr.io/nvidia/pytorch:24.01-py3
        command:
        - bash
        - -c
        - |
          # Run NCCL tests
          cd /opt/nccl-tests/build
          
          echo "=== All-Reduce Bandwidth ==="
          ./all_reduce_perf -b 8 -e 1G -f 2 -g 4
          
          echo "=== All-Gather Bandwidth ==="
          ./all_gather_perf -b 8 -e 1G -f 2 -g 4
        resources:
          limits:
            nvidia.com/gpu: 4

Expected results with proper topology:

  • NVLink: 400-600 GB/s
  • PCIe: 20-30 GB/s
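
To pull the headline number out of the benchmark job's output (nccl-tests prints an average bus bandwidth summary at the end of each test):

# Extract the summary line from the benchmark job's logs
kubectl logs -n ml-training job/nccl-benchmark | grep -i "avg bus bandwidth"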

Troubleshooting

# Check if TAS is enabled
kubectl get configmap kai-scheduler-config -n kai-scheduler -o yaml | grep topology

# View scheduler logs
kubectl logs -n kai-scheduler -l app=kai-scheduler | grep -i topology

# Check node labels
kubectl get nodes -L nvidia.com/nvswitch.present,gpu.topology/interconnect

Poor NCCL performance despite TAS

# Check NCCL topology
kubectl exec -it <pod> -- nvidia-smi topo -m

# Verify NCCL is using NVLink
kubectl exec -it <pod> -- bash -c 'NCCL_DEBUG=INFO python -c "import torch.distributed"' 2>&1 | grep -i nvlink

# Check for PCIe fallback
kubectl exec -it <pod> -- bash -c 'NCCL_DEBUG=INFO python train.py' 2>&1 | grep -i "using network"
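
The per-pair transport lines in NCCL's INFO output are the most direct evidence of which path was chosen. Since the training pods in Step 5 already set NCCL_DEBUG=INFO, grepping their logs is a reasonable way to spot a fallback:

# Transport chosen per GPU pair; within a node, "via SHM" usually means NVLink/P2P was not used
kubectl logs -n ml-training <pod> | grep -E "via (P2P|SHM|NET)"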

Best Practices

| Practice | Description |
| --- | --- |
| Use NVLink for DDP | Always request topology-aware placement for distributed training |
| Match topology to workload | Small jobs don’t need NVSwitch nodes |
| Test actual bandwidth | Verify NCCL performance after placement |
| Label nodes accurately | Maintain correct topology labels |
| Consider multi-node latency | Use same-rack placement for multi-node jobs |

Summary

Topology-Aware Scheduling in KAI Scheduler ensures distributed training and inference workloads are placed on GPUs with optimal interconnect topology. By leveraging NVLink and NVSwitch awareness, you can achieve maximum GPU communication bandwidth and significantly faster training times.


📘 Go Further with Kubernetes Recipes

Love this recipe? There’s so much more! This is just one of 100+ hands-on recipes in our comprehensive Kubernetes Recipes book.

Inside the book, you’ll master:

  • ✅ Production-ready deployment strategies
  • ✅ Advanced networking and security patterns
  • ✅ Observability, monitoring, and troubleshooting
  • ✅ Real-world best practices from industry experts

“The practical, recipe-based approach made complex Kubernetes concepts finally click for me.”

👉 Get Your Copy Now — Start building production-grade Kubernetes skills today!

#kai-scheduler #nvidia #gpu #topology #nvlink #nvswitch #distributed-training
