NVIDIA GPU Topology Matrix Interpretation on Kubernetes
Read and interpret nvidia-smi topo and nvidia-device-plugin topology matrices on Kubernetes GPU nodes. Understand the X, NV#, SYS, NODE, PIX, PXB, and PHB connection types.
Quick Answer: The GPU topology matrix (from nvidia-smi topo -m or the nvidia-device-plugin logs) shows the interconnect type between every pair of GPUs and NICs. NV# = NVLink (fastest), PIX = same PCIe switch, PHB = same PCIe Host Bridge (CPU), NODE = same NUMA node but different PCIe host bridges, SYS = crosses CPU sockets (QPI/UPI). Use this matrix to ensure that GPUs communicating via NCCL share NVLink and that NICs are co-located with their assigned GPUs.
The Problem
- Multi-GPU training is slow but you don't know why; it could be a topology mismatch
- NCCL picks suboptimal communication paths because GPU-NIC affinity is wrong
- Need to verify that NVLink is actually connecting the expected GPU pairs
- Kubernetes schedules workloads without considering PCIe topology
- Don't know which NIC to use for GPUDirect RDMA with a specific GPU
The Solution
Read the Topology Matrix
# On a GPU node (or via kubectl exec into nvidia-device-plugin pod)
nvidia-smi topo -m
Example output for an 8-GPU server with NVLink Bridge (NVL4):
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3
GPU0    X     NV6   NV6   NV6   SYS   SYS   SYS   SYS   PIX   NODE  NODE  SYS
GPU1    NV6   X     NV6   NV6   SYS   SYS   SYS   SYS   NODE  PIX   NODE  SYS
GPU2    NV6   NV6   X     NV6   SYS   SYS   SYS   SYS   NODE  NODE  PIX   SYS
GPU3    NV6   NV6   NV6   X     SYS   SYS   SYS   SYS   NODE  NODE  NODE  SYS
GPU4    SYS   SYS   SYS   SYS   X     NV6   NV6   NV6   SYS   SYS   SYS   PIX
GPU5    SYS   SYS   SYS   SYS   NV6   X     NV6   NV6   SYS   SYS   SYS   NODE
GPU6    SYS   SYS   SYS   SYS   NV6   NV6   X     NV6   SYS   SYS   SYS   NODE
GPU7    SYS   SYS   SYS   SYS   NV6   NV6   NV6   X     SYS   SYS   SYS   NODE
CPU Affinity:
GPU0-3: NUMA 0 (CPUs 0,2,4,6,8,10...)
GPU4-7: NUMA 1 (CPUs 1,3,5,7,9,11...)
Connection Type Legend
Type │ Meaning                                                  │ Bandwidth    │ Latency
─────┼──────────────────────────────────────────────────────────┼──────────────┼────────
X    │ Self                                                     │ N/A          │ N/A
NV#  │ Bonded set of # NVLinks                                  │ 50-900 GB/s  │ Lowest
PIX  │ At most a single PCIe bridge (same PCIe switch)          │ ~32 GB/s     │ Low
PXB  │ Multiple PCIe bridges (no Host Bridge)                   │ ~32 GB/s     │ Medium
PHB  │ Traverses a PCIe Host Bridge (typically the CPU)         │ ~32 GB/s     │ Medium
NODE │ Same NUMA node, across PCIe Host Bridges                 │ ~32 GB/s     │ Higher
SYS  │ Crosses the inter-socket link (QPI/UPI) between NUMA nodes │ ~20-40 GB/s  │ Highest
─────┴──────────────────────────────────────────────────────────┴──────────────┴────────
Performance ranking (best to worst):
NV# >> PIX > PXB > PHB > NODE > SYS
Interpret NVLink Groups
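As a cross-check on a manual reading, the NVLink grouping can also be derived programmatically. The following is a minimal Python sketch, assuming the matrix has been saved as plain whitespace-separated text shaped like the example output in this article (real nvidia-smi output may contain extra columns or terminal control codes that need stripping first). The names parse_matrix and nvlink_groups are illustrative and not part of any NVIDIA tool.

```python
import re

# Example matrix from this article (GPU rows, with NIC columns included).
SAMPLE = """\
        GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3
GPU0    X    NV6  NV6  NV6  SYS  SYS  SYS  SYS  PIX  NODE NODE SYS
GPU1    NV6  X    NV6  NV6  SYS  SYS  SYS  SYS  NODE PIX  NODE SYS
GPU2    NV6  NV6  X    NV6  SYS  SYS  SYS  SYS  NODE NODE PIX  SYS
GPU3    NV6  NV6  NV6  X    SYS  SYS  SYS  SYS  NODE NODE NODE SYS
GPU4    SYS  SYS  SYS  SYS  X    NV6  NV6  NV6  SYS  SYS  SYS  PIX
GPU5    SYS  SYS  SYS  SYS  NV6  X    NV6  NV6  SYS  SYS  SYS  NODE
GPU6    SYS  SYS  SYS  SYS  NV6  NV6  X    NV6  SYS  SYS  SYS  NODE
GPU7    SYS  SYS  SYS  SYS  NV6  NV6  NV6  X    SYS  SYS  SYS  NODE
"""

DEVICE = re.compile(r"(GPU|NIC)\d+")


def parse_matrix(text):
    """Return {row_device: {column_device: link_type}} from matrix text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Keep only header tokens that look like device names (GPU0, NIC1, ...).
    cols = [tok for tok in lines[0].split() if DEVICE.fullmatch(tok)]
    matrix = {}
    for ln in lines[1:]:
        toks = ln.split()
        if not DEVICE.fullmatch(toks[0]):
            continue  # skip legend or affinity lines
        matrix[toks[0]] = dict(zip(cols, toks[1:1 + len(cols)]))
    return matrix


def nvlink_groups(matrix):
    """Group GPUs that reach each other over NV# links (connected components)."""
    gpus = [dev for dev in matrix if dev.startswith("GPU")]
    groups, seen = [], set()
    for start in gpus:
        if start in seen:
            continue
        seen.add(start)
        group, stack = [], [start]
        while stack:
            cur = stack.pop()
            group.append(cur)
            for other in gpus:
                link = matrix[cur].get(other, "")
                if other not in seen and re.fullmatch(r"NV\d+", link):
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(group, key=lambda d: int(d[3:])))
    return groups


if __name__ == "__main__":
    for group in nvlink_groups(parse_matrix(SAMPLE)):
        print(group)
```

Treating groups as connected components is a simplification: it reports GPUs that can reach each other over NVLink, which matches the fully connected NVL4 groups in this example but does not by itself prove that every pair has a direct link.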
From the matrix above, identify NVLink groups:
NVL4 Group 1: GPU0, GPU1, GPU2, GPU3 (all NV6 to each other)
├── NUMA Node 0, CPUs 0,2,4,6,8,10
└── Local NICs: NIC0 (PIX to GPU0), NIC1 (PIX to GPU1), NIC2 (PIX to GPU2)
NVL4 Group 2: GPU4, GPU5, GPU6, GPU7 (all NV6 to each other)
├── NUMA Node 1, CPUs 1,3,5,7,9,11
└── Local NICs: NIC3 (PIX to GPU4)
Cross-group: GPU0↔GPU4 = SYS (crosses CPU socket via QPI/UPI)
└── 5-10x slower than NVLink for collective ops
GPU-NIC Affinity for GPUDirect RDMA
Best NIC for each GPU (PIX = same PCIe switch = optimal for RDMA):
GPU0 → NIC0 (PIX)    GPU4 → NIC3 (PIX)
GPU1 → NIC1 (PIX)    GPU5 → NIC3 (NODE)
GPU2 → NIC2 (PIX)    GPU6 → NIC3 (NODE)
GPU3 → NIC2 (NODE)   GPU7 → NIC3 (NODE)
Rule: Prefer the NIC with a PIX relationship to the GPU for RDMA.
NODE is acceptable. SYS means crossing sockets; avoid it for RDMA if possible.
Kubernetes NCCL Topology Configuration
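Before hard-coding a NIC list for NCCL, it can help to derive it from the matrix. The following is a rough Python sketch, not an NVIDIA utility: it picks, for each assigned GPU, the NICs at its closest link type and joins their device names into a comma-separated string of the kind NCCL_IB_HCA expects. The NIC_LEGEND mapping from NICn to mlx5_n is purely hypothetical; on a real node, take the mapping from the NIC legend that nvidia-smi prints. The function names are illustrative as well.

```python
import re

# Matrix from this article, keyed by GPU row.
COLS = "GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3".split()
ROWS = {
    "GPU0": "X NV6 NV6 NV6 SYS SYS SYS SYS PIX NODE NODE SYS",
    "GPU1": "NV6 X NV6 NV6 SYS SYS SYS SYS NODE PIX NODE SYS",
    "GPU2": "NV6 NV6 X NV6 SYS SYS SYS SYS NODE NODE PIX SYS",
    "GPU3": "NV6 NV6 NV6 X SYS SYS SYS SYS NODE NODE NODE SYS",
    "GPU4": "SYS SYS SYS SYS X NV6 NV6 NV6 SYS SYS SYS PIX",
    "GPU5": "SYS SYS SYS SYS NV6 X NV6 NV6 SYS SYS SYS NODE",
    "GPU6": "SYS SYS SYS SYS NV6 NV6 X NV6 SYS SYS SYS NODE",
    "GPU7": "SYS SYS SYS SYS NV6 NV6 NV6 X SYS SYS SYS NODE",
}
MATRIX = {gpu: dict(zip(COLS, cells.split())) for gpu, cells in ROWS.items()}

# Lower rank = closer. Order follows the performance ranking in this article.
RANK = {"NV": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

# HYPOTHETICAL mapping; read the real one from the node's NIC legend.
NIC_LEGEND = {"NIC0": "mlx5_0", "NIC1": "mlx5_1", "NIC2": "mlx5_2", "NIC3": "mlx5_3"}


def link_rank(cell):
    """Rank a matrix cell such as 'NV6', 'PIX', or 'SYS'."""
    return RANK["NV"] if re.fullmatch(r"NV\d+", cell) else RANK[cell]


def local_nics(matrix, gpus, worst_allowed="PIX"):
    """For each GPU, list the NICs at its best link type.

    A GPU gets an empty list if its best NIC is farther than worst_allowed.
    """
    limit = RANK[worst_allowed]
    result = {}
    for gpu in gpus:
        ranked = {dev: link_rank(cell)
                  for dev, cell in matrix[gpu].items() if dev.startswith("NIC")}
        best = min(ranked.values(), default=None)
        if best is None or best > limit:
            result[gpu] = []
        else:
            result[gpu] = [nic for nic, rank in ranked.items() if rank == best]
    return result


def ib_hca_value(nics_by_gpu, nic_legend):
    """Join the selected NICs' device names into a comma-separated string."""
    nics = sorted({n for ns in nics_by_gpu.values() for n in ns},
                  key=lambda n: int(n[3:]))
    return ",".join(nic_legend[n] for n in nics)


if __name__ == "__main__":
    chosen = local_nics(MATRIX, ["GPU0", "GPU1", "GPU2", "GPU3"])
    print(chosen)
    print(ib_hca_value(chosen, NIC_LEGEND))
```

With the default worst_allowed="PIX", GPU3 gets no NIC because its closest NICs are NODE; passing worst_allowed="NODE" relaxes the selection in line with the rule that NODE is acceptable.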
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distributed-training
spec:
  # A Deployment requires a selector matching the pod labels; the label value is illustrative
  selector:
    matchLabels:
      app: distributed-training
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      containers:
        - name: trainer
          image: registry.example.com/training:v1
          env:
            # Dump the topology NCCL detects to an XML file for inspection
            - name: NCCL_TOPO_DUMP_FILE
              value: "/var/run/nvidia/topo.xml"
            # Use GPUDirect RDMA only when the NIC is PIX-local to the GPU
            - name: NCCL_NET_GDR_LEVEL
              value: "PIX"
            # Use GPU peer-to-peer only between NVLink-connected GPUs
            - name: NCCL_P2P_LEVEL
              value: "NVL"
            # Cross-node via RDMA
            - name: NCCL_IB_HCA
              value: "mlx5_0,mlx5_1,mlx5_4,mlx5_6"  # Only local NICs
          resources:
            limits:
              nvidia.com/gpu: 4  # Request one NVL4 group
Verify Topology in Running Pod
# From inside a GPU pod
nvidia-smi topo -m
# Check which GPUs are NVLink-connected
nvidia-smi nvlink --status
# Per-GPU NVLink data throughput counters
nvidia-smi nvlink -gt d
# Most direct path type between a pair of GPUs (here GPU 0 and GPU 1)
nvidia-smi topo -p -i 0,1
# NUMA placement per GPU is listed in the CPU Affinity / NUMA Affinity columns of topo -m
# Inspect GPU PCI bus IDs and InfiniBand port state to cross-check GPU-NIC locality
cat /proc/driver/nvidia/gpus/*/information
ibstat | grep -A5 "Port 1"
Large-Scale Topology (200+ NICs)
On HPC nodes with many NICs (InfiniBand + Ethernet + management):
206 NICs found in the topology, only displaying 56 in the matrix.
The full matrix shows connectivity between:
- 8 GPUs (GPU0-GPU7)
- 10+ InfiniBand NICs (mlx5_0 through mlx5_17)
- 40+ virtual/sub-functions
- CPU affinity and NUMA ID per device
Key insight: Only ~8-10 NICs are relevant for NCCL traffic.
Filter by looking for PIX relationships to GPUs.
# Show only the GPU and NIC rows of the matrix
nvidia-smi topo -m | grep -E "^(GPU|NIC)" | head -20
# Or print the PCIe-only matrix (NVLink paths ignored)
nvidia-smi topo -mp
Topology-Aware Scheduling on Kubernetes
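# Use GPU Feature Discovery labels for topology-aware placement
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training  # illustrative name
spec:
  nodeSelector:
    # Hopper-generation GPUs (the label alone does not guarantee NVLink)
    nvidia.com/gpu.family: hopper
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: feature.node.kubernetes.io/pci-10de.present  # NVIDIA GPU
                operator: In
                values: ["true"]
              - key: nvidia.com/gpu.count
                operator: In
                values: ["8"]  # Full 8-GPU node only
  # A Pod needs at least one container; this reuses the trainer from the manifest above
  containers:
    - name: trainer
      image: registry.example.com/training:v1
      resources:
        limits:
          nvidia.com/gpu: 4
Common Issues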
NCCL using SYS path instead of NVLink
- Cause: NCCL topology detection failed, or GPUs from different NVL groups were assigned
- Fix: Request GPUs in multiples matching the NVL group size (4 for NVL4); set NCCL_TOPO_DUMP_FILE to inspect the topology NCCL detected
GPUDirect RDMA slow: crossing CPU sockets
- Cause: NIC has a SYS relationship to the GPU (wrong NUMA node)
- Fix: Pin NCCL_IB_HCA to only the NICs with a PIX/NODE relationship to the assigned GPUs
"NODE" instead of "PIX" for GPU-NIC pair
- Cause: NIC and GPU are on the same NUMA node but different PCIe switches
- Fix: Acceptable for RDMA (small penalty). PIX is ideal, but NODE works well
Inconsistent topology after GPU reset
- Cause: nvidia-smi topo reads live PCIe state; device errors can change the reported topology
- Fix: Run nvidia-smi -r (reset), or reboot the node if the topology still looks wrong
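The first issue above (GPUs from different NVL groups assigned to one workload) can be detected before training starts. The following is a minimal Python sketch, assuming the GPU-to-GPU part of the matrix is already available as a nested dictionary; here it is filled in from the example matrix in this article. The function name non_nvlink_pairs is illustrative.

```python
import re
from itertools import combinations

# GPU-to-GPU part of the example matrix from this article.
NAMES = ["GPU0", "GPU1", "GPU2", "GPU3", "GPU4", "GPU5", "GPU6", "GPU7"]
ROWS = [
    "X NV6 NV6 NV6 SYS SYS SYS SYS",
    "NV6 X NV6 NV6 SYS SYS SYS SYS",
    "NV6 NV6 X NV6 SYS SYS SYS SYS",
    "NV6 NV6 NV6 X SYS SYS SYS SYS",
    "SYS SYS SYS SYS X NV6 NV6 NV6",
    "SYS SYS SYS SYS NV6 X NV6 NV6",
    "SYS SYS SYS SYS NV6 NV6 X NV6",
    "SYS SYS SYS SYS NV6 NV6 NV6 X",
]
MATRIX = {name: dict(zip(NAMES, row.split())) for name, row in zip(NAMES, ROWS)}


def non_nvlink_pairs(matrix, gpus):
    """Return (gpu_a, gpu_b, link_type) for every assigned pair not joined by NV#."""
    return [
        (a, b, matrix[a][b])
        for a, b in combinations(gpus, 2)
        if not re.fullmatch(r"NV\d+", matrix[a][b])
    ]


if __name__ == "__main__":
    # A set that straddles both NVL4 groups.
    for a, b, link in non_nvlink_pairs(MATRIX, ["GPU2", "GPU3", "GPU4", "GPU5"]):
        print(f"{a} <-> {b}: {link}")
```

An empty result means every assigned pair shares NVLink; any SYS entries point to an allocation that spans CPU sockets.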
Best Practices
- Request full NVL groups: 4 GPUs for NVL4, 8 for NVL8 (avoid splitting groups)
- Map NICs to GPUs: use PIX-local NICs for GPUDirect RDMA (highest throughput)
- Set NCCL_NET_GDR_LEVEL=PIX: prevents NCCL from using distant NICs for GPUDirect RDMA
- Dump topology at pod start: NCCL_TOPO_DUMP_FILE lets you verify paths
- NUMA-pin application threads: match CPU affinity to the GPU's NUMA node
- Monitor NVLink utilization: run nvidia-smi nvlink -gt d during training
- Use GFD labels: GPU Feature Discovery exposes GPU family and count as node labels for node selection
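The NUMA-pinning practice above can be sketched in Python as follows. This is one possible approach under stated assumptions, not a prescribed method: it assumes Linux, and it takes a CPU list string in the comma and range form used for CPU affinity (for example "0-3,8-11" or "0,2,4,6"). The helper names parse_cpu_list and pin_to_cpus are hypothetical. The target set is intersected with the CPUs the process may already use, so a container's existing CPU restrictions are respected.

```python
import os


def parse_cpu_list(spec):
    """Turn a CPU list such as '0-3,8-11' or '0,2,4' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_cpus(cpus):
    """Pin the calling process to the given CPUs (Linux only).

    Only CPUs the process is already allowed to run on are used. If none
    of the requested CPUs are allowed, the affinity is left unchanged.
    Returns the set of CPUs that was applied (possibly empty).
    """
    allowed = os.sched_getaffinity(0)
    target = set(cpus) & allowed
    if target:
        os.sched_setaffinity(0, target)
    return target


# Example: pin to the NUMA 0 CPUs listed for GPU0-3 in this article.
# applied = pin_to_cpus(parse_cpu_list("0,2,4,6,8,10"))
```

Child processes and threads started afterwards inherit the affinity, so calling this early in the training entrypoint is usually enough.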
Key Takeaways
- Topology matrix shows interconnect type between every GPU, NIC, and device pair
- NV# (NVLink) is far faster than PCIe paths (PIX/SYS) for GPU-to-GPU communication: up to 900 GB/s versus roughly 32 GB/s in the legend above
- NVL4 = 4 GPUs fully connected via NVLink; SYS between groups means crossing CPU sockets
- GPU-NIC affinity critical for GPUDirect RDMA: always use PIX-local NIC
- NUMA affinity: GPU0-3 on NUMA 0, GPU4-7 on NUMA 1 (typical 8-GPU dual-socket)
- Request GPUs in NVLink group multiples (4 or 8) to avoid cross-socket communication
- nvidia-smi topo -m is your first diagnostic tool for GPU interconnect performance

