Kubernetes 1.36 Topology-Aware Scheduling
Use topology-aware workload scheduling in Kubernetes 1.36 to place Pods on nodes with optimal GPU, NUMA, and network topology for ML training.
💡 Quick Answer: Kubernetes 1.36 introduces Topology-Aware Workload Scheduling (KEP-5732). The scheduler considers GPU interconnect topology, NUMA zones, and network fabric when placing distributed workloads, which can reduce cross-node communication overhead by up to 10x.
The Problem
Distributed ML training performance depends heavily on how Pods are placed relative to each other:
- NVLink vs PCIe: GPUs on the same NVLink domain communicate at 900 GB/s; across PCIe it drops to 32 GB/s
- Same switch vs cross-switch: Network latency doubles when Pods land on nodes connected through different top-of-rack switches
- NUMA distance: Memory access latency varies 2-3x across NUMA zones
- Default scheduler ignores topology: it optimizes for resource utilization, not communication patterns
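To get a feel for what the bandwidth gap above means in practice, the sketch below divides a payload size by the link bandwidth. The 16 GB payload is a hypothetical figure chosen for illustration, and the model ignores latency and protocol overhead, so treat the results as rough lower bounds rather than measurements.

```shell
#!/bin/sh
# transfer_seconds SIZE_GB BANDWIDTH_GB_PER_S
# Prints the idealized time (in seconds) to move SIZE_GB gigabytes over a
# link with the given bandwidth. Latency and protocol overhead are ignored.
transfer_seconds() {
  awk -v size="$1" -v bw="$2" 'BEGIN { printf "%.3f\n", size / bw }'
}

# Hypothetical 16 GB payload, using the bandwidth figures quoted in the list.
echo "NVLink (900 GB/s): $(transfer_seconds 16 900) s"
echo "PCIe   (32 GB/s):  $(transfer_seconds 16 32) s"
```

Under these assumptions the same payload takes roughly 28 times longer over PCIe, which is simply the ratio of the two bandwidth figures.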
The Solution
Topology-aware scheduling places workload Pods on nodes that minimize communication latency based on the cluster's physical topology.
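One way to picture this is as a weighted score: a candidate node earns points for every topology level it shares with a node that already runs a Pod from the same workload. The sketch below is an illustration only, not the scheduler's actual algorithm. The function name `score_node` is hypothetical, and the weights (100, 50, 10) mirror the ones used in this article's policy example.

```shell
#!/bin/sh
# score_node CAND_FABRIC CAND_RACK CAND_ZONE REF_FABRIC REF_RACK REF_ZONE
# Prints an illustrative closeness score for a candidate node relative to a
# reference node that already hosts a Pod of the same workload.
# Weights are assumptions taken from the example policy: 100 / 50 / 10.
score_node() {
  score=0
  if [ "$1" = "$4" ]; then score=$((score + 100)); fi  # same NVLink domain
  if [ "$2" = "$5" ]; then score=$((score + 50)); fi   # same rack
  if [ "$3" = "$6" ]; then score=$((score + 10)); fi   # same zone
  echo "$score"
}

# Reference node: nvswitch-domain-1 / rack-a1 / us-east-1a
score_node nvswitch-domain-1 rack-a1 us-east-1a nvswitch-domain-1 rack-a1 us-east-1a
score_node nvswitch-domain-2 rack-a1 us-east-1a nvswitch-domain-1 rack-a1 us-east-1a
score_node nvswitch-domain-9 rack-b4 us-east-1a nvswitch-domain-1 rack-a1 us-east-1a
```

A node in the same NVLink domain, rack, and zone scores highest, so a scheduler that prefers higher scores would pack Pods into the closest available domain first.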
Define Topology Domains
# Node labels define the topology hierarchy
# Level 0: GPU domain (NVLink)
# Level 1: Node (PCIe bus)
# Level 2: Rack (top-of-rack switch)
# Level 3: Cluster (spine switch)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    topology.kubernetes.io/zone: "us-east-1a"
    topology.kubernetes.io/rack: "rack-a1"
    topology.kubernetes.io/switch: "tor-a1"
    nvidia.com/gpu-fabric: "nvswitch-domain-1"
Topology-Aware Training Job
apiVersion: scheduling.k8s.io/v1alpha1
kind: TopologyPolicy
metadata:
  name: ml-training-topology
spec:
  levels:
    - name: gpu-domain
      labelKey: nvidia.com/gpu-fabric
      weight: 100  # Highest priority: same NVLink domain
    - name: rack
      labelKey: topology.kubernetes.io/rack
      weight: 50   # Second priority: same rack
    - name: zone
      labelKey: topology.kubernetes.io/zone
      weight: 10   # Third priority: same zone
  optimization: MinimizeSpread  # Pack Pods as close as possible
---
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
  annotations:
    scheduling.k8s.io/topology-policy: ml-training-topology
spec:
  parallelism: 8
  completions: 8
  template:
    spec:
      restartPolicy: Never  # Job Pod templates must set Never or OnFailure
      containers:
        - name: trainer
          image: registry.example.com/training:v2.0
          resources:
            limits:
              nvidia.com/gpu: 8
Pod Topology Spread with GPU Awareness
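apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  labels:
    app: llm-training
    training-group: "finetune-llama"
spec:
  # Note: spread constraints balance matching Pods across domains; they do
  # not pack them into a single domain. Use podAffinity with the same
  # topologyKey when all Pods must share one domain.
  topologySpreadConstraints:
    # Hard rule: Pod counts per NVLink domain may differ by at most 1
    - maxSkew: 1
      topologyKey: nvidia.com/gpu-fabric
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          training-group: "finetune-llama"
    # Soft rule: prefer a per-rack difference of at most 2
    - maxSkew: 2
      topologyKey: topology.kubernetes.io/rack
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          training-group: "finetune-llama"
  containers:
    - name: trainer
      image: registry.example.com/training:v2.0
      resources:
        limits:
          nvidia.com/gpu: 8
Verify Topology Placement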
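A quick placement check is to count how many distinct GPU fabric domains the training Pods ended up in. The sketch below is one possible approach, assuming the `nvidia.com/gpu-fabric` node label and the `training-group=finetune-llama` Pod label used in this article. The helper names `distinct_domains` and `pod_fabrics` are hypothetical, and `pod_fabrics` needs `kubectl` access to a cluster, so it is defined here but not run.

```shell
#!/bin/sh
# distinct_domains: reads "pod<TAB>domain" lines on stdin and prints the
# number of distinct, non-empty domain values.
distinct_domains() {
  awk -F '\t' 'NF >= 2 && $2 != "" { seen[$2] = 1 }
       END { n = 0; for (d in seen) n++; print n }'
}

# pod_fabrics SELECTOR: prints "pod<TAB>fabric" for each scheduled Pod that
# matches SELECTOR. Requires kubectl and cluster access; not invoked here.
pod_fabrics() {
  kubectl get pods -l "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' |
  while IFS="$(printf '\t')" read -r pod node; do
    [ -n "$node" ] || continue   # skip Pods that are still Pending
    fabric=$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.nvidia\.com/gpu-fabric}')
    printf '%s\t%s\n' "$pod" "$fabric"
  done
}

# Offline example with made-up data: three workers across two domains.
printf 'worker-0\tnvswitch-domain-1\nworker-1\tnvswitch-domain-1\nworker-2\tnvswitch-domain-2\n' |
  distinct_domains
```

Against a live cluster, `pod_fabrics training-group=finetune-llama | distinct_domains` would report the same count for the real Pods. A result of 1 means every scheduled Pod shares one NVLink domain.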
# Check which nodes training Pods landed on
kubectl get pods -l training-group=finetune-llama \
  -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,HOST_IP:.status.hostIP'
# Print a Pod-to-node mapping
kubectl get pods -l training-group=finetune-llama -o json | \
  jq -r '.items[] | "\(.metadata.name) -> \(.spec.nodeName)"'
# Check node labels for topology
kubectl get nodes -l nvidia.com/gpu-fabric=nvswitch-domain-1 \
  --show-labels | grep topology
Impact on NCCL Performance
# Without topology-aware scheduling:
#   Pods scattered across racks
#   NCCL all-reduce: ~15 GB/s effective bandwidth
# With topology-aware scheduling:
#   Pods packed on same NVLink domain
#   NCCL all-reduce: ~150 GB/s effective bandwidth
# 10x improvement in collective communication!
Common Issues
Pods stuck in Pending with topology constraints
- Cause: Not enough nodes in the preferred topology domain
- Fix: Use whenUnsatisfiable: ScheduleAnyway for soft constraints
Training performance not improving
- Cause: The bottleneck may be NCCL configuration rather than topology
- Fix: Verify NCCL_SOCKET_IFNAME and NCCL_IB_HCA settings match your network
Topology labels missing on nodes
- Cause: Node Feature Discovery not deployed or configured
- Fix: Deploy NFD and GPU Feature Discovery to auto-label nodes
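Rack and switch placement is usually site-specific knowledge, so those labels may still need to be applied by hand or from an inventory. The sketch below prints one `kubectl label` command per node for review. The inventory contents and the helper name `label_cmd` are hypothetical, and the label keys are the ones used in the Node example in this article.

```shell
#!/bin/sh
# label_cmd NODE RACK SWITCH
# Prints the kubectl command that would apply rack and switch labels to NODE.
# Nothing is executed against the cluster; the output is meant for review.
label_cmd() {
  printf 'kubectl label node %s topology.kubernetes.io/rack=%s topology.kubernetes.io/switch=%s --overwrite\n' \
    "$1" "$2" "$3"
}

# Hypothetical inventory: one "node rack switch" triple per line.
inventory='gpu-node-01 rack-a1 tor-a1
gpu-node-02 rack-a1 tor-a1'

echo "$inventory" | while read -r node rack switch; do
  label_cmd "$node" "$rack" "$switch"
done
```

Once the printed commands look right, piping the output to `sh` applies them.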
Best Practices
- Label all nodes with topology hierarchy: rack, switch, zone, GPU fabric
- Deploy GPU Feature Discovery: auto-detects NVLink domains and GPU topology
- Use hard constraints for NVLink: DoNotSchedule for GPU domain matching
- Use soft constraints for rack: ScheduleAnyway to avoid blocking
- Monitor NCCL bandwidth: verify topology placement improves training speed
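Before attributing a bandwidth number to placement, it helps to confirm that the NCCL settings named under Common Issues are actually set inside the trainer container. The sketch below is a minimal check; the helper name `check_nccl_env` is hypothetical, and whether both variables are needed depends on your network setup.

```shell
#!/bin/sh
# check_nccl_env: prints the value of each NCCL network variable, or a note
# when it is unset. Returns 1 if any variable is missing, 0 otherwise.
check_nccl_env() {
  status=0
  for var in NCCL_SOCKET_IFNAME NCCL_IB_HCA; do
    eval "value=\${$var:-}"   # POSIX-compatible indirect lookup
    if [ -n "$value" ]; then
      echo "$var=$value"
    else
      echo "$var is unset"
      status=1
    fi
  done
  return "$status"
}

check_nccl_env || echo "Set the missing variables in the trainer container spec."
```

Run it inside a training Pod, for example through `kubectl exec`, so that it reads the environment the trainer process actually sees.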
Key Takeaways
- Topology-aware scheduling is new in Kubernetes 1.36 (KEP-5732)
- Places distributed workloads considering GPU, NUMA, and network topology
- Up to 10x improvement in collective communication bandwidth
- Essential for large-scale ML training with NVLink/NVSwitch clusters
- Combines with gang scheduling for optimal distributed workload placement
