ai advanced ⏱ 15 minutes K8s 1.28+

Kubernetes 1.36 Topology-Aware Scheduling

Use topology-aware workload scheduling in Kubernetes 1.36 to place Pods on nodes with optimal GPU, NUMA, and network topology for ML training.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Kubernetes 1.36 introduces Topology-Aware Workload Scheduling (KEP-5732). The scheduler considers GPU interconnect topology, NUMA zones, and network fabric when placing distributed workloads, reducing cross-node communication overhead by up to 10x.

The Problem

Distributed ML training performance depends heavily on how Pods are placed relative to each other:

  • NVLink vs PCIe: GPUs in the same NVLink domain communicate at up to 900 GB/s; over PCIe this drops to about 32 GB/s
  • Same switch vs cross-switch: Network latency doubles when Pods land on nodes connected through different top-of-rack switches
  • NUMA distance: Memory access latency varies 2-3x across NUMA zones
  • Default scheduler ignores topology: it optimizes for resource utilization, not communication patterns
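
To make the bandwidth gap concrete, here is a back-of-the-envelope sketch in Python. It uses the standard ring all-reduce cost model, in which each worker sends and receives 2(N-1)/N times the payload, and it ignores latency and any overlap with compute. The 10 GB payload and the function name are assumptions chosen for illustration; the two bandwidths are the figures from the list above.

```python
def ring_allreduce_seconds(payload_gb: float, workers: int, link_gb_per_s: float) -> float:
    """Estimate ring all-reduce time from bandwidth alone (latency ignored)."""
    if workers < 2:
        return 0.0  # nothing to reduce across a single worker
    # Each worker moves 2 * (N - 1) / N times the payload over its link.
    traffic_gb = 2 * (workers - 1) / workers * payload_gb
    return traffic_gb / link_gb_per_s


if __name__ == "__main__":
    payload_gb = 10.0  # assumed gradient size per step, for illustration only
    for name, bandwidth in [("NVLink", 900.0), ("PCIe", 32.0)]:
        seconds = ring_allreduce_seconds(payload_gb, workers=8, link_gb_per_s=bandwidth)
        print(f"{name}: {seconds * 1000:.1f} ms per all-reduce")
```

Under these assumptions the estimate is roughly 19 ms per all-reduce over NVLink versus about 547 ms over PCIe, which is why placement matters more as the number of synchronization steps grows.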

The Solution

Topology-aware scheduling places workload Pods on nodes that minimize communication latency based on the cluster's physical topology.

Define Topology Domains

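# Node labels define the topology hierarchy
# Level 0: GPU domain (NVLink)
# Level 1: Node (PCIe bus)
# Level 2: Rack (top-of-rack switch)
# Level 3: Cluster (spine switch)
#
# Of the topology.kubernetes.io keys below, only zone is a well-known
# Kubernetes label; rack and switch are custom labels that a cluster
# administrator or labeling tool must apply.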

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    topology.kubernetes.io/zone: "us-east-1a"
    topology.kubernetes.io/rack: "rack-a1"
    topology.kubernetes.io/switch: "tor-a1"
    nvidia.com/gpu-fabric: "nvswitch-domain-1"
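
One way to reason about these labels is as a distance between two nodes: the closest level at which they share a domain. The sketch below is a minimal illustration, not part of any Kubernetes API. The ordering in `HIERARCHY` and the function name are assumptions; the label keys are the ones from the Node manifest above.

```python
# Label keys ordered from the closest domain to the farthest (assumed ordering).
HIERARCHY = [
    "nvidia.com/gpu-fabric",
    "topology.kubernetes.io/rack",
    "topology.kubernetes.io/zone",
]


def topology_distance(labels_a: dict, labels_b: dict) -> int:
    """Return the index of the closest shared level, or len(HIERARCHY) if none is shared."""
    for level, key in enumerate(HIERARCHY):
        value = labels_a.get(key)
        if value is not None and value == labels_b.get(key):
            return level
    return len(HIERARCHY)


if __name__ == "__main__":
    node_01 = {
        "nvidia.com/gpu-fabric": "nvswitch-domain-1",
        "topology.kubernetes.io/rack": "rack-a1",
        "topology.kubernetes.io/zone": "us-east-1a",
    }
    # Hypothetical second node: same rack, different NVLink domain.
    node_02 = dict(node_01, **{"nvidia.com/gpu-fabric": "nvswitch-domain-2"})
    print(topology_distance(node_01, node_02))
```

A lower value means the two nodes share a closer domain; a missing label is treated as "not shared" at that level.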

Topology-Aware Training Job

apiVersion: scheduling.k8s.io/v1alpha1
kind: TopologyPolicy
metadata:
  name: ml-training-topology
spec:
  levels:
    - name: gpu-domain
      labelKey: nvidia.com/gpu-fabric
      weight: 100    # Highest priority: same NVLink domain
    - name: rack
      labelKey: topology.kubernetes.io/rack
      weight: 50     # Second priority: same rack
    - name: zone
      labelKey: topology.kubernetes.io/zone
      weight: 10     # Third priority: same zone
  optimization: MinimizeSpread    # Pack Pods as close as possible
---
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
  annotations:
    scheduling.k8s.io/topology-policy: ml-training-topology
spec:
  parallelism: 8
  completions: 8
  template:
    spec:
      containers:
        - name: trainer
          image: registry.example.com/training:v2.0
          resources:
            limits:
              nvidia.com/gpu: 8

Pod Topology Spread with GPU Awareness

apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  labels:
    app: llm-training
    training-group: "finetune-llama"
spec:
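  topologySpreadConstraints:
    # Hard constraint: per-domain worker counts may differ by at most 1.
    # Note that this balances workers across NVLink domains; it does not
    # pack them into a single domain.
    - maxSkew: 1
      topologyKey: nvidia.com/gpu-fabric
      whenUnsatisfiable: DoNotSchedule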
      labelSelector:
        matchLabels:
          training-group: "finetune-llama"
    - maxSkew: 2
      topologyKey: topology.kubernetes.io/rack
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          training-group: "finetune-llama"
  containers:
    - name: trainer
      image: registry.example.com/training:v2.0
      resources:
        limits:
          nvidia.com/gpu: 8
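
The effect of `maxSkew` is easier to see with numbers. The sketch below models the check for a single constraint: the skew of a placement is the target domain's matching Pod count after placement minus the smallest count across all eligible domains. The function names are hypothetical, and `counts` is assumed to list every eligible domain, including those with zero matching Pods.

```python
def skew_after_placement(counts: dict, domain: str) -> int:
    """Skew if one more matching Pod lands in `domain`: its new count minus the global minimum."""
    return counts.get(domain, 0) + 1 - min(counts.values(), default=0)


def placement_allowed(counts: dict, domain: str, max_skew: int) -> bool:
    """True when placing one more Pod in `domain` keeps the skew within max_skew."""
    return skew_after_placement(counts, domain) <= max_skew


if __name__ == "__main__":
    # Matching Pods already running per NVLink domain (illustrative values).
    counts = {"nvswitch-domain-1": 2, "nvswitch-domain-2": 1}
    for domain in counts:
        print(domain, placement_allowed(counts, domain, max_skew=1))
```

With `maxSkew: 1`, a third worker cannot join `nvswitch-domain-1` while `nvswitch-domain-2` has only one, so with `DoNotSchedule` it either lands in the other domain or stays Pending.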

Verify Topology Placement

# Check which nodes the training Pods landed on
kubectl get pods -l training-group=finetune-llama \
  -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,HOST_IP:.status.hostIP'

# Print a Pod -> node mapping
kubectl get pods -l training-group=finetune-llama -o json | \
  jq -r '.items[] | "\(.metadata.name) -> \(.spec.nodeName)"'

# Show the topology labels of nodes in the NVLink domain as columns
kubectl get nodes -l nvidia.com/gpu-fabric=nvswitch-domain-1 \
  -L topology.kubernetes.io/rack,topology.kubernetes.io/zone,nvidia.com/gpu-fabric
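
The commands above show Pods and node labels separately. To see at a glance how many workers share each domain, the two JSON outputs can be joined. This is a minimal sketch, assuming the JSON comes from `kubectl get pods -l training-group=finetune-llama -o json` and `kubectl get nodes -o json`; the function name and the placeholder strings are illustrative choices.

```python
import json
from collections import defaultdict


def pods_by_domain(pods_json: str, nodes_json: str, label_key: str) -> dict:
    """Map each value of `label_key` to the sorted names of Pods on nodes in that domain."""
    node_domain = {
        node["metadata"]["name"]: node["metadata"].get("labels", {}).get(label_key, "<unlabeled>")
        for node in json.loads(nodes_json)["items"]
    }
    grouped = defaultdict(list)
    for pod in json.loads(pods_json)["items"]:
        node_name = pod["spec"].get("nodeName")  # absent while the Pod is Pending
        if node_name is None:
            domain = "<unscheduled>"
        else:
            domain = node_domain.get(node_name, "<unknown node>")
        grouped[domain].append(pod["metadata"]["name"])
    return {domain: sorted(names) for domain, names in grouped.items()}
```

Calling it with `label_key="nvidia.com/gpu-fabric"` groups workers by NVLink domain; a result with a single key means all workers were packed into one domain.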

Impact on NCCL Performance

# Without topology-aware scheduling:
# Pods scattered across racks
# NCCL all-reduce: ~15 GB/s effective bandwidth

# With topology-aware scheduling:
# Pods packed on same NVLink domain
# NCCL all-reduce: ~150 GB/s effective bandwidth

# Roughly 10x higher effective bandwidth for collective communication
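
To compare runs yourself, effective bandwidth can be derived from a measured all-reduce time. The sketch below follows the convention used by the nccl-tests benchmarks, where algorithm bandwidth is message size divided by time and bus bandwidth multiplies that by 2(n-1)/n for all-reduce. The function name and the sample measurement are assumptions for illustration, not results from a real cluster.

```python
def allreduce_bus_bandwidth(size_bytes: int, seconds: float, ranks: int) -> float:
    """Return all-reduce bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0 or ranks < 2:
        raise ValueError("need a positive duration and at least two ranks")
    algorithm_bw = size_bytes / seconds / 1e9  # GB/s as seen by the application
    return algorithm_bw * 2 * (ranks - 1) / ranks  # correction factor for all-reduce


if __name__ == "__main__":
    # Hypothetical measurement: 8 GB reduced across 8 ranks in 0.1 s.
    print(f"{allreduce_bus_bandwidth(8_000_000_000, 0.1, 8):.1f} GB/s")
```

Measuring this before and after changing placement gives a direct check on whether the scheduling change helped.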

Common Issues

Pods stuck in Pending with topology constraints

  • Cause: Not enough nodes in the preferred topology domain
  • Fix: Use whenUnsatisfiable: ScheduleAnyway for soft constraints

Training performance not improving

  • Cause: The network bottleneck is not the topology; it could be the NCCL configuration
  • Fix: Verify NCCL_SOCKET_IFNAME and NCCL_IB_HCA settings match your network

Topology labels missing on nodes

  • Cause: Node Feature Discovery not deployed or configured
  • Fix: Deploy NFD and GPU Feature Discovery to auto-label nodes
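
Missing labels can be spotted before a job is submitted. The sketch below is one possible check, assuming the input is the output of `kubectl get nodes -o json`; the `REQUIRED_KEYS` tuple and the function name are illustrative and should be adapted to the labels your cluster actually uses.

```python
import json

# Label keys the scheduling examples in this recipe rely on (adjust as needed).
REQUIRED_KEYS = (
    "nvidia.com/gpu-fabric",
    "topology.kubernetes.io/rack",
    "topology.kubernetes.io/zone",
)


def nodes_missing_labels(nodes_json: str, required=REQUIRED_KEYS) -> dict:
    """Return {node name: [missing label keys]} for nodes lacking any required key."""
    missing = {}
    for node in json.loads(nodes_json)["items"]:
        labels = node["metadata"].get("labels", {})
        absent = [key for key in required if key not in labels]
        if absent:
            missing[node["metadata"]["name"]] = absent
    return missing
```

An empty result means every node carries the full set of labels; any node listed would be ignored or mis-scored by constraints that use the missing key.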

Best Practices

  1. Label all nodes with the topology hierarchy: rack, switch, zone, GPU fabric
  2. Deploy GPU Feature Discovery: auto-detects NVLink domains and GPU topology
  3. Use hard constraints for NVLink: DoNotSchedule for GPU domain matching
  4. Use soft constraints for rack: ScheduleAnyway to avoid blocking
  5. Monitor NCCL bandwidth: verify topology placement improves training speed

Key Takeaways

  • Topology-aware scheduling is new in Kubernetes 1.36 (KEP-5732)
  • Places distributed workloads considering GPU, NUMA, and network topology
  • Up to 10x improvement in collective communication bandwidth
  • Essential for large-scale ML training with NVLink/NVSwitch clusters
  • Combines with gang scheduling for optimal distributed workload placement
#kubernetes-1.36 #scheduling #topology #gpu #numa
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.