H200 NVL 8-GPU Topology Bandwidth Tiers for Kubernetes
Understand the three bandwidth tiers in 8Γ H200 NVL dual-socket GPU nodes: NVLink intra-domain (~337 GB/s), PCIe Gen5+UPI cross-domain (~50 GB/s), and RoCE inter-node (~35 GB/s). Plan NCCL topology-aware scheduling and NUMA placement for optimal distributed training performance.
π‘ Quick Answer: An 8Γ H200 NVL node has three distinct bandwidth tiers: (1) NVLink within each 4-GPU NVL4 bridge domain at ~337 GB/s, (2) PCIe Gen5 x16 + UPI cross-socket between the two 4-GPU halves at ~50 GB/s, and (3) RoCE 400G inter-node at ~35 GB/s. The two NVLink domains are NOT connected β cross-socket traffic uses PCIe/UPI, which is 7Γ slower than NVLink. NUMA-aware scheduling is critical to avoid the PCIe/UPI bottleneck.
The Problem
- 8-GPU H200 NVL nodes use NVL4 bridges (4-way NVLink) β NOT NVSwitch
- The two 4-GPU halves are on separate NUMA zones with no NVLink connection
- Cross-socket GPU communication falls from 337 GB/s to 50 GB/s (7Γ penalty)
- Default Kubernetes schedulers (including Run:ai) are not NUMA-aware for GPU+NIC placement
- Inter-node RoCE bandwidth (~35 GB/s) is actually close to cross-socket bandwidth
The Solution
Node Topology: Dual-Socket 8Γ H200 NVL
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Dual-Socket GPU Node β
β β
β NUMA Zone 0 UPI NUMA Zone 1 β
β ββββββββββββββββββββββββ ββββββββΊ ββββββββββββββββββββββββ β
β β CPU 0 β (~50 GB/s) β CPU 1 β β
β β β β β β
β β PCIe Switch Γ2 β β PCIe Switch Γ2 β β
β β (Gen5 x16) β β (Gen5 x16) β β
β β β β β β
β β ββββββββββββββββββ β β ββββββββββββββββββ β β
β β β NVL4 Bridge β β β β NVL4 Bridge β β β
β β β β β β β β β β
β β β GPU0 GPU1 β β β β GPU4 GPU5 β β β
β β β GPU2 GPU3 β β β β GPU6 GPU7 β β β
β β β (~337 GB/s) β β β β (~337 GB/s) β β β
β β ββββββββββββββββββ β β ββββββββββββββββββ β β
β β β β β β
β β NIC 400G (mlx5_0) β β NIC 400G (mlx5_1) β β
β ββββββββββββββββββββββββ ββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Bandwidth Tiers:
β NVLink intra-domain (4 GPU): ~337 GB/s (full P2P within NVL4)
β‘ PCIe Gen5 + UPI cross-domain: ~50 GB/s (two halves NOT NVLink-connected)
β’ RoCE 400G inter-node: ~35 GB/s (GPU/NIC on cross-NUMA zone)Bandwidth Tier Comparison
Path β Technology β Bandwidth β Latency β Notes
βββββββββββββββββββββββββββββΌββββββββββββββββββΌββββββββββββΌββββββββββΌββββββββββββββ
GPU0 β GPU1 (same NVL4) β NVLink 4 β ~337 GB/s β ~1 ΞΌs β Full mesh P2P
GPU0 β GPU2 (same NVL4) β NVLink 4 β ~337 GB/s β ~1 ΞΌs β Full mesh P2P
GPU0 β GPU4 (cross-socket) β PCIe Gen5 + UPI β ~50 GB/s β ~5 ΞΌs β 7Γ slower than NVL
GPU0 β Remote GPU (RoCE) β 400G RoCE β ~35 GB/s β ~10 ΞΌs β Inter-node RDMA
βββββββββββββββββββββββββββββ΄ββββββββββββββββββ΄ββββββββββββ΄ββββββββββ΄ββββββββββββββ
Key insight: Cross-socket (~50 GB/s) is only 1.4Γ faster than inter-node (~35 GB/s)
β For communication-heavy workloads, treating cross-socket as "slow" is valid
β Optimal placement keeps tensor parallelism WITHIN one NVL4 domainNVL4 Bridge vs NVSwitch
Architecture β GPU Connectivity β Intra-node BW β Typical Systems
βββββββββββββββββββββΌββββββββββββββββββββββββββΌββββββββββββββββΌββββββββββββββββ
NVL4 Bridge (this) β 4 GPUs fully connected β 337 GB/s Γ2 β Dell XE7740, etc.
β Two separate 4-GPU groupsβ (within each) β
NVSwitch (DGX) β All 8 GPUs fully meshed β 900 GB/s β DGX H100/H200
β via NVSwitch fabric β (all-to-all) β
βββββββββββββββββββββ΄ββββββββββββββββββββββββββ΄ββββββββββββββββ΄ββββββββββββββββ
NVL4 tradeoff: cheaper, fewer components, but creates the cross-socket bottleneck.
DGX NVSwitch: all 8 GPUs at full NVLink speed, no NUMA penalty for GPU-GPU traffic.NCCL Topology Impact
# nvidia-smi topo -m on 8Γ H200 NVL (NVL4 bridge):
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NV4 NV4 NV4 SYS SYS SYS SYS
GPU1 NV4 X NV4 NV4 SYS SYS SYS SYS
GPU2 NV4 NV4 X NV4 SYS SYS SYS SYS
GPU3 NV4 NV4 NV4 X SYS SYS SYS SYS
GPU4 SYS SYS SYS SYS X NV4 NV4 NV4
GPU5 SYS SYS SYS SYS NV4 X NV4 NV4
GPU6 SYS SYS SYS SYS NV4 NV4 X NV4
GPU7 SYS SYS SYS SYS NV4 NV4 NV4 X
NV4 = NVLink 4 (intra-domain, ~337 GB/s)
SYS = Cross-socket via UPI (~50 GB/s, NO NVLink path)
# NCCL builds rings that cross the SYS boundary β this is the bottleneckScheduling Strategies
# Strategy 1: Keep tensor parallelism within 4 GPUs (NVL4 domain)
# Best for: inference, small models that fit in 4 GPUs
env:
- name: CUDA_VISIBLE_DEVICES
value: "0,1,2,3" # Stay within NUMA Zone 0 NVL4
# Strategy 2: Use all 8 GPUs but accept cross-socket penalty
# Best for: large models requiring >4 GPUs per node
# NCCL will use NVLink for intra-domain, PCIe/UPI for cross-domain
env:
- name: NCCL_NVLS_ENABLE
value: "0" # Disable NVLink SHARP (no NVSwitch)
# Strategy 3: Prefer inter-node over cross-socket
# For 2-node jobs needing 8 GPUs total: use 4+4 (one NVL4 per node)
# Instead of 8 GPUs on one node with cross-socket penalty
resources:
limits:
nvidia.com/gpu: 4 # Half the node, all within one NVL4NUMA-Aware NIC Placement
Critical: Each 400G NIC is attached to ONE NUMA zone.
NIC (mlx5_0) β NUMA Zone 0 β PCIe Switch β GPU 0,1,2,3
NIC (mlx5_1) β NUMA Zone 1 β PCIe Switch β GPU 4,5,6,7
For inter-node RDMA (RoCE):
GPU 0 β mlx5_0: PIX/PXB distance β GPUDirect RDMA β (~35 GB/s)
GPU 0 β mlx5_1: SYS distance β Cross-socket DMA (extra UPI hop)
GPU 4 β mlx5_1: PIX/PXB distance β GPUDirect RDMA β (~35 GB/s)
GPU 4 β mlx5_0: SYS distance β Cross-socket DMA (extra UPI hop)
Problem: OpenShift/Run:ai cannot guarantee GPU-NIC NUMA affinity.
The SR-IOV VF allocated to a pod may be from the "wrong" NIC.
Mitigation: NCCL_NET_GDR_LEVEL=SYS allows all pairs, but cross-socket
GPUDirect adds latency. Topology-aware VF allocation is the real fix.NCCL Ring Algorithm on NVL4 Topology
# NCCL builds channels (rings/trees) that account for topology:
# Optimal ring for 8 GPUs across 2 NVL4 domains:
Ring: 0β1β2β3β[UPI]β4β5β6β7β[RoCE]β(next node)β[RoCE]β0
# The UPI crossing happens once per ring traversal
# Minimizing cross-domain hops is NCCL's graph search goal
# NCCL tuning for NVL4 topology:
env:
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: NCCL_MAX_NCHANNELS
value: "16"
# More channels = more parallelism, amortizes UPI latencyBenchmark Expected Results
Test Configuration β Expected busbw β Bottleneck
ββββββββββββββββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββββββ
4 GPU intra-NVL4 (same node) β ~300-337 GB/s β NVLink bandwidth
8 GPU single node (all GPUs) β ~50 GB/s β UPI cross-socket
4+4 GPU (2 nodes, NVL4 each) β ~35 GB/s β RoCE 400G
8+8 GPU (2 nodes, all GPUs) β ~35 GB/s β RoCE (UPI hidden in pipeline)
ββββββββββββββββββββββββββββββ΄βββββββββββββββββ΄βββββββββββββββββββββ
Key takeaway: For 8-GPU all_reduce on NVL4 nodes, the UPI bottleneck
dominates. Single-node 8-GPU is only ~50 GB/s, while cross-node is ~35 GB/s.
The incremental penalty of going multi-node is small (~30% less than cross-socket).Implications for Model Parallelism
Parallelism Strategy β Optimal Placement on NVL4 Nodes
βββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββ
Tensor Parallelism (TP=4) β Within ONE NVL4 domain (GPU 0-3 or 4-7)
Tensor Parallelism (TP=8) β Full node β accepts UPI penalty
Pipeline Parallelism (PP) β Across nodes β uses RoCE, less BW-sensitive
Data Parallelism (DP) β Across nodes β gradient allreduce over RoCE
TP=4 + PP=2 β TP within NVL4, PP across nodes β OPTIMAL
TP=4 + DP=N β TP within NVL4, DP across all nodes
βββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββ
Rule: Keep TP within NVLink domain. Use PP/DP for cross-socket and inter-node.Common Issues
All_reduce busbw only ~50 GB/s with 8 GPUs on one node
- Cause: UPI cross-socket bottleneck between NVL4 domains (expected)
- Fix: Not fixable on NVL4 hardware. Use TP=4 within domain, or accept penalty.
Inter-node worse than expected (~20 GB/s instead of ~35 GB/s)
- Cause: GPU and NIC on different NUMA zones (cross-socket GPUDirect)
- Fix: Ensure
NCCL_NET_GDR_LEVEL=SYSand check GPU-NIC affinity
NCCL hangs during 8-GPU ring formation
- Cause: SYS-level P2P disabled by kernel (IOMMU restriction)
- Fix: Verify
iommu=ptin kernel args; checknvidia-smi topo -p2p
Scheduler places 4 GPUs from each NUMA zone
- Cause: GPU device plugin doesnβt respect NUMA topology by default
- Fix: Enable topology manager in kubelet:
topologyManagerPolicy: best-effort
Best Practices
- Keep tensor parallelism β€ 4 on NVL4 nodes β stay within one NVLink domain
- Use pipeline parallelism for cross-socket and cross-node β less BW-sensitive
- Size NIC per NUMA zone β one 400G NIC per CPU socket for locality
- Measure all three tiers β validate NVLink, UPI, and RoCE independently
- Consider 2-node 4+4 over single-node 8 GPU β only 30% less BW, double memory
- Set
NCCL_NET_GDR_LEVEL=SYSβ even cross-socket RDMA is better than CPU bounce - Enable topology manager in kubelet for NUMA-aware GPU+NIC placement
Key Takeaways
- NVL4 bridge creates TWO separate 4-GPU NVLink domains β not one 8-GPU mesh
- Cross-socket (UPI) is 7Γ slower than intra-NVLink: 50 vs 337 GB/s
- Inter-node RoCE (~35 GB/s) is surprisingly close to cross-socket (~50 GB/s)
- Optimal strategy: TP=4 within NVL4, PP/DP for anything beyond
- NUMA-aware scheduling is critical but NOT default in OpenShift/Run:ai
- For models needing >4 GPUs: compare 8-GPU-single-node vs 2Γ4-GPU-multi-node
- NVSwitch systems (DGX) eliminate this problem β all 8 GPUs at ~900 GB/s

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
