Kubernetes Topology Manager for GPU and NUMA Alignment
Configure Kubernetes Topology Manager to align CPU, GPU, and NIC allocations on the same NUMA node. Covers policies, kubelet config, and GPU performance tuning.
💡 Quick Answer: Topology Manager is a kubelet component that coordinates the CPU Manager, Device Manager (GPUs), and Memory Manager so that resources are allocated from the same NUMA node. Set topologyManagerPolicy: single-numa-node in the kubelet config to ensure GPUs, CPUs, and NICs are all co-located on one NUMA node. This is critical for GPU workloads, where cross-NUMA memory access adds a 30-50% latency penalty.
The Problem
- GPU allocated from NUMA 0 but CPUs from NUMA 1 → cross-NUMA memory access kills performance
- NIC on a different NUMA node than the GPU → GPUDirect RDMA crosses the QPI/UPI interconnect
- Data loading from CPU to GPU traverses an extra hop when NUMA-misaligned
- Default Kubernetes scheduling ignores hardware topology entirely
- Multi-GPU pods get GPUs from different NUMA nodes unnecessarily
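Whether two devices really sit on different NUMA nodes can be checked on the node itself, since Linux exposes each PCI device's NUMA node in sysfs. The following is a minimal Python sketch, not part of any Kubernetes tooling: the function names are made up for illustration, and it assumes only the standard files /sys/bus/pci/devices/<address>/numa_node and /sys/devices/system/node/node<N>/cpulist.

```python
from pathlib import Path


def pci_numa_node(pci_addr, sysfs_root="/sys"):
    """Return the NUMA node the kernel reports for a PCI device (-1 if unknown)."""
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr / "numa_node"
    return int(path.read_text().strip())


def numa_cpulist(node, sysfs_root="/sys"):
    """Return the CPU list string (for example "0-63") of one NUMA node."""
    path = Path(sysfs_root) / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return path.read_text().strip()


def same_numa(pci_addrs, sysfs_root="/sys"):
    """True when all devices report the same, known NUMA node."""
    nodes = {pci_numa_node(addr, sysfs_root) for addr in pci_addrs}
    return len(nodes) == 1 and -1 not in nodes
```

On a GPU node, calling same_numa() with the PCI addresses of a GPU and its RDMA NIC (taken from lspci or nvidia-smi) returns False in the misaligned cases listed above. A value of -1 from pci_numa_node() means the platform does not report NUMA affinity for that device.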
The Solution
Topology Manager Policies
Policy            │ Behavior                                    │ Use Case
──────────────────┼─────────────────────────────────────────────┼──────────────────
none              │ No topology alignment (default)             │ General workloads
best-effort       │ Try to align, admit pod anyway if it can't  │ Mixed clusters
restricted        │ Align or reject pod (fail admission)        │ GPU/HPC nodes
single-numa-node  │ ALL resources must come from ONE NUMA node  │ Strict GPU/RDMA
──────────────────┴─────────────────────────────────────────────┴──────────────────
Configure Kubelet
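# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: "single-numa-node"
topologyManagerScope: "pod"      # or "container" (per-container alignment)
cpuManagerPolicy: "static"       # Required for CPU pinning
memoryManagerPolicy: "Static"    # NUMA-aware memory allocation (also needs reservedMemory, see the full configuration)
reservedSystemCPUs: "0-3"        # Reserve CPUs for system

# Restart kubelet after config change.
# When switching cpuManagerPolicy or memoryManagerPolicy on an existing node,
# drain it, stop kubelet, and remove /var/lib/kubelet/cpu_manager_state and
# /var/lib/kubelet/memory_manager_state before starting kubelet again.
systemctl restart kubelet

# Verify: read the live kubelet configuration through the API server
kubectl get --raw "/api/v1/nodes/gpu-node-1/proxy/configz" | grep -o '"topologyManagerPolicy":"[^"]*"'
Topology Manager Scope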
Scope      │ Alignment Granularity
───────────┼────────────────────────────────────────────────────────────────────
pod        │ All containers in the pod must fit one NUMA node
           │ (stricter: pod rejected if any container can't align)
───────────┼────────────────────────────────────────────────────────────────────
container  │ Each container independently aligned to a NUMA node
           │ (more flexible: different containers can use different NUMA nodes)
───────────┴────────────────────────────────────────────────────────────────────
Full Configuration for GPU Nodes
# /var/lib/kubelet/config.yaml: optimized for 8-GPU dual-socket
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Topology alignment
topologyManagerPolicy: "single-numa-node"
topologyManagerScope: "pod"
# CPU pinning (exclusive cores for guaranteed QoS)
cpuManagerPolicy: "static"
cpuManagerPolicyOptions:
  full-pcpus-only: "true"                # Allocate full physical cores only
  distribute-cpus-across-numa: "false"   # Keep CPUs on one NUMA
# Memory management
# With the Static policy, the reservedMemory total must equal
# systemReserved + kubeReserved + the hard eviction threshold for memory
# (8Gi + 4Gi + 100Mi default = 12388Mi here), or kubelet refuses to start.
memoryManagerPolicy: "Static"
reservedMemory:
  - numaNode: 0
    limits:
      memory: "6244Mi"   # Reserve for system on NUMA 0
  - numaNode: 1
    limits:
      memory: "6144Mi"   # Reserve for system on NUMA 1
# System reservation
reservedSystemCPUs: "0-3,64-67"   # First 4 cores per socket for system
systemReserved:
  cpu: "4000m"
  memory: "8Gi"
kubeReserved:
  cpu: "2000m"
  memory: "4Gi"
Pod Requesting NUMA-Aligned Resources
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  containers:
    - name: trainer
      image: registry.example.com/training:v1
      resources:
        # Guaranteed QoS required for topology alignment
        requests:
          cpu: "16"                        # 16 exclusive CPUs
          memory: "64Gi"                   # NUMA-local memory
          nvidia.com/gpu: "4"              # 4 GPUs (one NVL4 group)
          rdma/rdma_shared_device_a: "1"   # RDMA NIC
        limits:
          cpu: "16"
          memory: "64Gi"
          nvidia.com/gpu: "4"
          rdma/rdma_shared_device_a: "1"
        # requests == limits → Guaranteed QoS → topology manager applies
Verify NUMA Alignment
# Check which NUMA node resources came from
kubectl exec gpu-training -- bash -c '
echo "=== GPU NUMA Affinity ==="
nvidia-smi topo -m | head -20
echo "=== CPU Affinity ==="
taskset -p 1
cat /proc/self/status | grep Cpus_allowed_list
echo "=== Memory NUMA ==="
numactl --show
cat /proc/self/numa_maps | head -10
'
# From node: check kubelet topology decisions
journalctl -u kubelet | grep -i "topology"
# "Topology Admit Handler" messages show alignment decisions
What Happens on Admission Failure
Policy: single-numa-node
Pod requests: 4 GPUs + 32 CPUs + 128Gi memory
If NUMA 0 has: 4 GPUs available, 28 CPUs free, 200Gi memory
→ Pod REJECTED (only 28 CPUs on NUMA 0, need 32)
→ Pod status: "TopologyAffinityError"
kubectl describe pod gpu-training:
Events:
  Type     Reason                 Message
  Warning  TopologyAffinityError  Resources cannot be allocated with Topology locality
Fix: reduce CPU request to fit one NUMA node, or use "restricted" policy
Topology Manager with GPU Operator
# GPU Operator Helm values for topology-aware deployment
# The GPU Operator automatically integrates with Topology Manager
# when it detects the kubelet policy is set
# ClusterPolicy (GPU Operator)
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  devicePlugin:
    config:
      name: device-plugin-config
---
# Device plugin ConfigMap for topology awareness
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 1   # No time-slicing (full GPU per request)
    flags:
      migStrategy: none
# Device plugin reports topology hints to Topology Manager
# automatically when kubelet topologyManagerPolicy != "none"
MachineConfig for OpenShift (Topology Manager)
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: gpu-topology-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/gpu-worker: ""
  kubeletConfig:
    topologyManagerPolicy: "single-numa-node"
    topologyManagerScope: "pod"
    cpuManagerPolicy: "static"
    cpuManagerPolicyOptions:
      full-pcpus-only: "true"
    memoryManagerPolicy: "Static"
    reservedSystemCPUs: "0-3,64-67"
    systemReserved:
      cpu: "4000m"
      memory: "8Gi"
    kubeReserved:
      cpu: "2000m"
      memory: "4Gi"
    # reservedMemory total must equal systemReserved + kubeReserved +
    # the hard eviction threshold for memory (8Gi + 4Gi + 100Mi default = 12388Mi)
    reservedMemory:
      - numaNode: 0
        limits:
          memory: "6244Mi"
      - numaNode: 1
        limits:
          memory: "6144Mi"
Performance Impact
Workload: LLM training (4x H100, 64 CPUs, 256Gi RAM)
Without Topology Manager (NUMA misaligned):
- GPU memory bandwidth: ~3.2 TB/s (local)
- CPU→GPU data loading: ~40 GB/s (cross-NUMA via QPI/UPI)
- GPUDirect RDMA: ~35 GB/s (NIC on wrong socket)
- Training throughput: baseline
With single-numa-node policy (NUMA aligned):
- GPU memory bandwidth: ~3.2 TB/s (same)
- CPU→GPU data loading: ~64 GB/s (NUMA-local PCIe)
- GPUDirect RDMA: ~50 GB/s (NIC co-located with GPU)
- Training throughput: +15-30% improvement
The gain comes from:
- Eliminating QPI/UPI hops for memory access (+60% bandwidth)
- RDMA NIC using shortest PCIe path (+40% RDMA throughput)
- CPU data preprocessing on NUMA-local memory (reduced latency)
Common Issues
Pods fail admission with TopologyAffinityError
- Cause: Resources can't fit on a single NUMA node (too many CPUs/GPUs requested)
- Fix: Reduce the resource request; use restricted instead of single-numa-node; or add nodes with larger NUMA domains
Topology Manager has no effect on pod
- Cause: Pod QoS is not Guaranteed (requests != limits)
- Fix: Set requests == limits for all containers (exclusive CPUs and NUMA-pinned memory are only allocated to Guaranteed QoS pods)
Only one NUMA node utilized (other idle)
- Cause: Under single-numa-node, pods fill NUMA 0 first; NUMA 1 resources are stranded
- Fix: Use the restricted policy on nodes that run smaller pods; or request full-node resources (8 GPUs), which also needs restricted because 8 GPUs span both NUMA nodes
CPU Manager not pinning CPUs
- Cause: cpuManagerPolicy: static requires Guaranteed QoS AND integer CPU requests
- Fix: Request whole CPUs (e.g., cpu: "16" not cpu: "15500m")
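The last two issues both come down to properties of the pod spec that can be checked before the pod is submitted. Below is a rough pre-flight sketch in Python, assuming the manifest has already been loaded into a plain dict (for example with a YAML parser). The helper names cpu_to_millicores and pinning_blockers are hypothetical, and the check is deliberately simplified: it ignores init containers and compares memory quantities as strings.

```python
from decimal import Decimal


def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "16", "0.5" or "15500m" to millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def pinning_blockers(pod):
    """Return reasons why containers would not get exclusive CPUs (empty list: none found)."""
    problems = []
    for container in pod["spec"]["containers"]:
        name = container["name"]
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})

        for resource in ("cpu", "memory"):
            if resource not in limits:
                problems.append(f"{name}: no {resource} limit, so QoS is not Guaranteed")
                continue
            # A missing request defaults to the limit.
            request = requests.get(resource, limits[resource])
            if resource == "cpu":
                equal = cpu_to_millicores(request) == cpu_to_millicores(limits["cpu"])
            else:
                # Simplification: "64Gi" and "65536Mi" would be treated as different.
                equal = str(request) == str(limits[resource])
            if not equal:
                problems.append(f"{name}: {resource} request != limit, so QoS is not Guaranteed")

        if "cpu" in limits and cpu_to_millicores(limits["cpu"]) % 1000 != 0:
            problems.append(f"{name}: fractional CPU limit, so CPUs come from the shared pool")
    return problems
```

For the gpu-training pod shown earlier (cpu: "16", memory: "64Gi", requests equal to limits) the function returns an empty list. Changing the CPU values to "15500m" makes it report a fractional CPU limit.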
Best Practices
- single-numa-node for GPU nodes: ensures GPU, CPU, NIC, and memory are all co-located
- Guaranteed QoS required: set requests == limits for topology alignment to apply
- Integer CPU requests: cpu: "16" (not millicores) for CPU pinning
- Reserve system CPUs: reservedSystemCPUs prevents workloads from using core 0
- full-pcpus-only: allocate complete physical cores (avoids SMT sibling sharing)
- Size pods to fit NUMA: don't request more CPUs than one NUMA node has
- Monitor NUMA utilization: prevent one NUMA node filling while the other is idle
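The "size pods to fit NUMA" rule can be sanity-checked before a pod is submitted. The Python sketch below assumes you already know the free resources per NUMA node (for example from node monitoring). NumaFree and fitting_numa_nodes are illustrative names, not kubelet APIs, and the real Topology Manager merges hints from each resource manager rather than comparing simple counts. NUMA 0 uses the numbers from the admission-failure example earlier in this article; the NUMA 1 figures are made up.

```python
from dataclasses import dataclass


@dataclass
class NumaFree:
    """Free resources on one NUMA node (illustrative model, not a kubelet type)."""
    cpus: int
    gpus: int
    memory_gib: int


def fitting_numa_nodes(cpus, gpus, memory_gib, free):
    """Return the IDs of NUMA nodes that could hold the whole request alone."""
    return [
        node_id
        for node_id, node in sorted(free.items())
        if node.cpus >= cpus and node.gpus >= gpus and node.memory_gib >= memory_gib
    ]


# NUMA 0 mirrors the admission-failure example; NUMA 1 is a made-up state
# in which all of its GPUs are already allocated.
free = {
    0: NumaFree(cpus=28, gpus=4, memory_gib=200),
    1: NumaFree(cpus=60, gpus=0, memory_gib=230),
}

print(fitting_numa_nodes(32, 4, 128, free))  # [] : no single node fits
print(fitting_numa_nodes(28, 4, 128, free))  # [0] : fits on NUMA 0
```

An empty result corresponds to the case where single-numa-node rejects the pod. Trimming the CPU request from 32 to 28 lets the same pod fit on NUMA 0.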
Key Takeaways
- Topology Manager coordinates CPU, GPU, memory, and device allocation for NUMA alignment
- single-numa-node policy: all resources from ONE NUMA node or pod rejected
- Only applies to Guaranteed QoS pods (requests == limits)
- Eliminates cross-NUMA penalties: +15-30% GPU training throughput improvement
- Requires cpuManagerPolicy: static for CPU pinning to work
- GPU Operator automatically provides topology hints to Topology Manager
- OpenShift: configure via KubeletConfig CR targeting GPU worker MachineConfigPool
