DGX H100 nvidia-smi topo -m Guide
Read nvidia-smi topo -m output on DGX H100 systems. Understand NVLink, NVSwitch, PCIe topology, GPU-to-GPU bandwidth, and NUMA affinity for Kubernetes.
π‘ Quick Answer: Run
nvidia-smi topo -mon a DGX H100 to see the GPU interconnect topology matrix. NVLink connections showNV18(18 NVLinks = full NVSwitch bandwidth ~900 GB/s bidirectional), PCIe showsPIX/PXB/PHB, andSYSmeans cross-NUMA. For Kubernetes, this topology determines GPU scheduling β always co-locate tensor-parallel GPUs on the same NVSwitch domain.
The Problem
Multi-GPU workloads perform differently depending on GPU placement:
- 2 GPUs connected via NVLink: ~900 GB/s bandwidth
- 2 GPUs on same PCIe switch: ~64 GB/s
- 2 GPUs across NUMA nodes: ~32 GB/s + latency penalty
Understanding nvidia-smi topo -m output is essential for optimal GPU scheduling.
The Solution
Read the Topology Matrix
# Run on any GPU node
nvidia-smi topo -mDGX H100 (8Γ H100 SXM) output:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-51 0
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-51 0
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-51 0
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-51 0
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 52-103 1
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 52-103 1
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 52-103 1
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 52-103 1
Legend:
NV# = Connected via # NVLinks
PIX = Same PCIe switch
PXB = PCIe switches connected via same host bridge
PHB = Across host bridges (same NUMA)
SYS = Cross-NUMA socket (QPI/UPI)
NODE = Same NUMA nodeIn a DGX H100, all 8 GPUs are fully connected via NVSwitch (NV18 = 18 NVLinks each), giving ~900 GB/s bidirectional bandwidth between any GPU pair.
Topology Connection Types
| Code | Meaning | Bandwidth | Latency |
|---|---|---|---|
| NV18 | 18 NVLinks (NVSwitch) | ~900 GB/s | ~1 ΞΌs |
| NV12 | 12 NVLinks | ~600 GB/s | ~1 ΞΌs |
| NV4 | 4 NVLinks (A100 peer) | ~200 GB/s | ~2 ΞΌs |
| PIX | Same PCIe switch | ~64 GB/s (Gen5) | ~5 ΞΌs |
| PXB | Same host bridge | ~32 GB/s | ~10 ΞΌs |
| PHB | Same NUMA, different bridge | ~32 GB/s | ~15 ΞΌs |
| SYS | Cross-NUMA (QPI/UPI) | ~25 GB/s | ~20 ΞΌs |
DGX H100 vs A100 Topology
# DGX A100 (8Γ A100 SXM) β NVSwitch v2
# GPU0-GPU3: NV12 (within baseboard)
# GPU0-GPU4: NV12 (across baseboards via NVSwitch)
# All-to-all: NV12
# DGX H100 (8Γ H100 SXM) β NVSwitch v3
# All-to-all: NV18 (full NVSwitch bandwidth)
# Each GPU: 18 NVLinks Γ 50 GB/s = 900 GB/s per GPU
# HGX B200 (8Γ B200) β NVSwitch v4
# All-to-all: NV72 (NVLink 5)
# Each GPU: 1.8 TB/s per GPUCheck GPU-NIC Affinity (RDMA)
# Critical for multi-node training β GPU and NIC should be on same NUMA
nvidia-smi topo -m | grep -E "GPU|mlx"
# Check NIC NUMA affinity
cat /sys/class/infiniband/mlx5_0/device/numa_node
# 0 β NIC is on NUMA 0
# GPUs on NUMA 0: GPU0-GPU3
# GPUs on NUMA 1: GPU4-GPU7
# Best: GPU0 β mlx5_0, GPU4 β mlx5_1 (same NUMA as NIC)Kubernetes Topology-Aware Scheduling
# GPU Operator with topology-aware scheduling
apiVersion: v1
kind: Pod
metadata:
name: multi-gpu-training
spec:
containers:
- name: trainer
image: nvcr.io/nvidia/pytorch:24.07-py3
resources:
limits:
nvidia.com/gpu: 4 # Request 4 GPUs
env:
- name: CUDA_VISIBLE_DEVICES
value: "0,1,2,3" # Same NUMA group
- name: NCCL_SOCKET_IFNAME
value: "eth0"
- name: NCCL_DEBUG
value: "INFO" # Shows which transport NCCL picks# NCCL will log the transport used:
# NCCL INFO Channel 00/02 : 0[0] -> 1[1] via NVLink/NVSwitch
# If you see "via NET" instead, GPUs aren't on NVSwitch β check topo!Verify NVLink Bandwidth
# Run NCCL allreduce benchmark on DGX H100
kubectl exec -it gpu-pod -- \
/usr/bin/all_reduce_perf -b 8 -e 1G -f 2 -g 8
# Expected DGX H100 (8Γ H100, NVSwitch):
# size(B) busbw(GB/s)
# 1048576 280
# 67108864 430
# 1073741824 450 β Peak ~450 GB/s bus bandwidth
# If busbw < 200 GB/s, NVLink is not being used β check topologyCommon Issues
NCCL falling back to PCIe/NET despite NVLink
NCCL_P2P_DISABLE=1 or NCCL_P2P_LEVEL set too restrictively. Remove these env vars to let NCCL auto-detect NVLink.
Cross-NUMA GPU scheduling hurts performance
Request GPUs in multiples matching NUMA groups (4 GPUs per NUMA on DGX). Or use topology-aware scheduler (GPU Operator + NUMA-aware scheduling).
βtopo -mβ shows SYS between GPU and NIC
GPU and NIC on different NUMA nodes. Set NCCL_NET_GDR_LEVEL=SYS to allow GPUDirect RDMA across NUMA, or pin workloads to correct NUMA.
Best Practices
- Always check
nvidia-smi topo -mbefore running multi-GPU workloads - Co-locate tensor-parallel GPUs on same NVSwitch domain
- Match GPU-NIC NUMA affinity for GPUDirect RDMA
- Use NCCL_DEBUG=INFO to verify NVLink is actually being used
- Request GPUs in NUMA-aligned groups (4 per NUMA on DGX H100)
Key Takeaways
nvidia-smi topo -mshows GPU interconnect topology: NVLink, PCIe, NUMA- DGX H100: NV18 = 18 NVLinks per GPU pair via NVSwitch (~900 GB/s)
- GPU-NIC NUMA affinity is critical for multi-node training performance
- NCCL auto-detects topology but verify with NCCL_DEBUG=INFO logs
- Always schedule multi-GPU workloads within the same NUMA domain

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
