ai · advanced · 15 minutes · K8s 1.28+

H200 NVL 8-GPU Topology Bandwidth Tiers for Kubernetes

Understand the three bandwidth tiers in 8× H200 NVL dual-socket GPU nodes: NVLink intra-domain (~337 GB/s), PCIe Gen5+UPI cross-domain (~50 GB/s), and RoCE inter-node (~35 GB/s). Plan NCCL topology-aware scheduling and NUMA placement for optimal distributed training performance.

By Luca Berton • 8 min read

💡 Quick Answer: An 8× H200 NVL node has three distinct bandwidth tiers: (1) NVLink within each 4-GPU NVL4 bridge domain at ~337 GB/s, (2) PCIe Gen5 x16 + UPI cross-socket between the two 4-GPU halves at ~50 GB/s, and (3) RoCE 400G inter-node at ~35 GB/s. The two NVLink domains are NOT connected: cross-socket traffic uses PCIe/UPI, which is roughly 7× slower than NVLink. NUMA-aware scheduling is critical to avoid the PCIe/UPI bottleneck.

The Problem

  • 8-GPU H200 NVL nodes use NVL4 bridges (4-way NVLink), NOT NVSwitch
  • The two 4-GPU halves are on separate NUMA zones with no NVLink connection
  • Cross-socket GPU communication falls from 337 GB/s to 50 GB/s (roughly a 7× penalty)
  • Default Kubernetes schedulers (including Run:ai) are not NUMA-aware for GPU+NIC placement
  • Inter-node RoCE bandwidth (~35 GB/s) is actually close to cross-socket bandwidth

The Solution

Node Topology: Dual-Socket 8× H200 NVL

┌─────────────────────────────────────────────────────────────────────┐
│                        Dual-Socket GPU Node                         │
│                                                                     │
│  NUMA Zone 0                   UPI      NUMA Zone 1                 │
│  ┌──────────────────────┐   ◄───────►   ┌──────────────────────┐    │
│  │        CPU 0         │  (~50 GB/s)   │        CPU 1         │    │
│  │                      │               │                      │    │
│  │  PCIe Switch ×2      │               │  PCIe Switch ×2      │    │
│  │  (Gen5 x16)          │               │  (Gen5 x16)          │    │
│  │                      │               │                      │    │
│  │  ┌────────────────┐  │               │  ┌────────────────┐  │    │
│  │  │ NVL4 Bridge    │  │               │  │ NVL4 Bridge    │  │    │
│  │  │                │  │               │  │                │  │    │
│  │  │ GPU0 GPU1      │  │               │  │ GPU4 GPU5      │  │    │
│  │  │ GPU2 GPU3      │  │               │  │ GPU6 GPU7      │  │    │
│  │  │  (~337 GB/s)   │  │               │  │  (~337 GB/s)   │  │    │
│  │  └────────────────┘  │               │  └────────────────┘  │    │
│  │                      │               │                      │    │
│  │  NIC 400G (mlx5_0)   │               │  NIC 400G (mlx5_1)   │    │
│  └──────────────────────┘               └──────────────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Bandwidth Tiers:
  ① NVLink intra-domain (4 GPU):  ~337 GB/s  (full P2P within NVL4)
  ② PCIe Gen5 + UPI cross-domain: ~50 GB/s   (two halves NOT NVLink-connected)
  ③ RoCE 400G inter-node:         ~35 GB/s   (GPUDirect RDMA via the NUMA-local NIC)

Bandwidth Tier Comparison

Path                        │ Technology      │ Bandwidth │ Latency │ Notes
────────────────────────────┼─────────────────┼───────────┼─────────┼────────────────────────
GPU0 ↔ GPU1 (same NVL4)     │ NVLink 4        │ ~337 GB/s │ ~1 μs   │ Full mesh P2P
GPU0 ↔ GPU2 (same NVL4)     │ NVLink 4        │ ~337 GB/s │ ~1 μs   │ Full mesh P2P
GPU0 ↔ GPU4 (cross-socket)  │ PCIe Gen5 + UPI │ ~50 GB/s  │ ~5 μs   │ ~7× slower than NVLink
GPU0 ↔ Remote GPU (RoCE)    │ 400G RoCE       │ ~35 GB/s  │ ~10 μs  │ Inter-node RDMA
────────────────────────────┴─────────────────┴───────────┴─────────┴────────────────────────

Key insight: Cross-socket (~50 GB/s) is only 1.4× faster than inter-node (~35 GB/s)
→ For communication-heavy workloads, treating cross-socket as "slow" is valid
→ Optimal placement keeps tensor parallelism WITHIN one NVL4 domain
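The ratios quoted above follow directly from the three tier figures. The sketch below recomputes them in Python using only the approximate bandwidths from this recipe; the helper names and the 1 GiB payload are illustrative assumptions, and latency is ignored.

```python
# Approximate per-tier bandwidths quoted in this recipe, in GB/s.
TIERS_GBPS = {
    "nvlink_intra_domain": 337.0,
    "pcie_upi_cross_socket": 50.0,
    "roce_inter_node": 35.0,
}


def ratio(fast: str, slow: str) -> float:
    """How many times faster tier `fast` is than tier `slow`."""
    return TIERS_GBPS[fast] / TIERS_GBPS[slow]


def transfer_ms(payload_bytes: int, tier: str) -> float:
    """Idealized transfer time in milliseconds (bandwidth only, no latency)."""
    return payload_bytes / (TIERS_GBPS[tier] * 1e9) * 1e3


if __name__ == "__main__":
    nv_vs_upi = ratio("nvlink_intra_domain", "pcie_upi_cross_socket")
    upi_vs_roce = ratio("pcie_upi_cross_socket", "roce_inter_node")
    print(f"NVLink vs cross-socket: {nv_vs_upi:.1f}x")
    print(f"cross-socket vs RoCE:   {upi_vs_roce:.1f}x")
    payload = 1 << 30  # hypothetical 1 GiB message
    for tier in TIERS_GBPS:
        print(f"{tier:24s} {transfer_ms(payload, tier):7.1f} ms")
```

The "7×" figure is a rounding of 337 / 50 ≈ 6.7, and the gap between cross-socket and inter-node is much smaller than the gap between NVLink and everything else.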

NVL4 Bridge vs NVSwitch

Architecture        │ GPU Connectivity          │ Intra-node BW │ Typical Systems
────────────────────┼───────────────────────────┼───────────────┼──────────────────
NVL4 Bridge (this)  │ 4 GPUs fully connected    │ 337 GB/s ×2   │ Dell XE7740, etc.
                    │ Two separate 4-GPU groups │ (within each) │
NVSwitch (DGX)      │ All 8 GPUs fully meshed   │ 900 GB/s      │ DGX H100/H200
                    │ via NVSwitch fabric       │ (all-to-all)  │
────────────────────┴───────────────────────────┴───────────────┴──────────────────

NVL4 tradeoff: cheaper, fewer components, but creates the cross-socket bottleneck.
DGX NVSwitch: all 8 GPUs at full NVLink speed, no NUMA penalty for GPU-GPU traffic.

NCCL Topology Impact

# nvidia-smi topo -m on 8× H200 NVL (NVL4 bridge):

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0     X    NV4   NV4   NV4   SYS   SYS   SYS   SYS
GPU1    NV4    X    NV4   NV4   SYS   SYS   SYS   SYS
GPU2    NV4   NV4    X    NV4   SYS   SYS   SYS   SYS
GPU3    NV4   NV4   NV4    X    SYS   SYS   SYS   SYS
GPU4    SYS   SYS   SYS   SYS    X    NV4   NV4   NV4
GPU5    SYS   SYS   SYS   SYS   NV4    X    NV4   NV4
GPU6    SYS   SYS   SYS   SYS   NV4   NV4    X    NV4
GPU7    SYS   SYS   SYS   SYS   NV4   NV4   NV4    X

NV4 = Connection over a bonded set of 4 NVLinks (intra-domain, ~337 GB/s)
SYS = Cross-socket via PCIe and UPI (~50 GB/s, NO NVLink path)

# NCCL builds rings that cross the SYS boundary, and that crossing is the bottleneck
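To make the two-domain structure explicit, the matrix can be parsed and grouped into NVLink-connected components. The following is a minimal Python sketch, assuming a GPU-only matrix like the one shown here; real `nvidia-smi topo -m` output also carries NIC and affinity columns plus a legend, which this parser simply skips. The function names are hypothetical.

```python
# Sample matrix copied from the nvidia-smi topo -m listing in this recipe.
SAMPLE_TOPO = """\
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0     X    NV4   NV4   NV4   SYS   SYS   SYS   SYS
GPU1    NV4    X    NV4   NV4   SYS   SYS   SYS   SYS
GPU2    NV4   NV4    X    NV4   SYS   SYS   SYS   SYS
GPU3    NV4   NV4   NV4    X    SYS   SYS   SYS   SYS
GPU4    SYS   SYS   SYS   SYS    X    NV4   NV4   NV4
GPU5    SYS   SYS   SYS   SYS   NV4    X    NV4   NV4
GPU6    SYS   SYS   SYS   SYS   NV4   NV4    X    NV4
GPU7    SYS   SYS   SYS   SYS   NV4   NV4   NV4    X
"""


def parse_topo(text: str) -> list[list[str]]:
    """Return the GPU-to-GPU link matrix; cell [i][j] is the GPU i to GPU j link."""
    rows = [line.split() for line in text.strip().splitlines()]
    # Count only GPU columns in the header; extra columns are ignored.
    n_gpus = sum(1 for name in rows[0] if name.startswith("GPU"))
    body = [row for row in rows[1:] if row and row[0].startswith("GPU")]
    # Each body row starts with its GPU label, followed by one cell per GPU.
    return [row[1:1 + n_gpus] for row in body]


def nvlink_domains(matrix: list[list[str]]) -> list[list[int]]:
    """Group GPU indices into connected components over NV* links."""
    n = len(matrix)
    seen: set[int] = set()
    domains: list[list[int]] = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            gpu = stack.pop()
            component.append(gpu)
            for peer in range(n):
                if peer not in seen and matrix[gpu][peer].startswith("NV"):
                    seen.add(peer)
                    stack.append(peer)
        domains.append(sorted(component))
    return domains


if __name__ == "__main__":
    print(nvlink_domains(parse_topo(SAMPLE_TOPO)))
```

For the sample matrix this prints `[[0, 1, 2, 3], [4, 5, 6, 7]]`: two NVLink islands, with every pair across them reported as SYS.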

Scheduling Strategies

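# Strategy 1: Keep tensor parallelism within 4 GPUs (NVL4 domain)
# Best for: inference, small models that fit in 4 GPUs
# Note: indices are relative to the GPUs exposed to the container, so this
# selects the node's GPU0-3 only when all 8 GPUs are allocated to the pod
env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0,1,2,3"        # Stay within NUMA Zone 0 NVL4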

# Strategy 2: Use all 8 GPUs but accept cross-socket penalty
# Best for: large models requiring >4 GPUs per node
# NCCL will use NVLink for intra-domain, PCIe/UPI for cross-domain
env:
  - name: NCCL_NVLS_ENABLE
    value: "0"              # Disable NVLink SHARP (no NVSwitch)

# Strategy 3: Prefer inter-node over cross-socket
# For 2-node jobs needing 8 GPUs total: use 4+4 (one NVL4 per node)
# Instead of 8 GPUs on one node with cross-socket penalty
resources:
  limits:
    nvidia.com/gpu: 4       # Half the node, all within one NVL4
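Strategy 3 amounts to splitting a GPU request into pods that each fit inside one NVL4 domain. A small Python sketch of that split follows; the function names are hypothetical, and it assumes the 4-GPU domain size described in this recipe.

```python
NVL4_DOMAIN_SIZE = 4  # GPUs per NVLink domain on the nodes described here


def split_into_domain_pods(total_gpus: int,
                           domain_size: int = NVL4_DOMAIN_SIZE) -> list[int]:
    """Split a GPU request into per-pod requests that each fit one NVLink domain."""
    if total_gpus <= 0:
        raise ValueError("total_gpus must be positive")
    full, rest = divmod(total_gpus, domain_size)
    return [domain_size] * full + ([rest] if rest else [])


def pod_resources(gpus: int) -> dict:
    """Build the `resources` stanza for one pod, mirroring Strategy 3."""
    return {"limits": {"nvidia.com/gpu": gpus}}


if __name__ == "__main__":
    for gpus in split_into_domain_pods(8):
        print(pod_resources(gpus))
```

Requesting 4 GPUs does not by itself guarantee that all four come from the same NVL4 domain; that still depends on NUMA-aware allocation on the node.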

NUMA-Aware NIC Placement

Critical: Each 400G NIC is attached to ONE NUMA zone.

  NIC (mlx5_0) → NUMA Zone 0 → PCIe Switch → GPU 0,1,2,3
  NIC (mlx5_1) → NUMA Zone 1 → PCIe Switch → GPU 4,5,6,7

For inter-node RDMA (RoCE):
  GPU 0 → mlx5_0: PIX/PXB distance → GPUDirect RDMA ✓ (~35 GB/s)
  GPU 0 → mlx5_1: SYS distance     → Cross-socket DMA (extra UPI hop)
  GPU 4 → mlx5_1: PIX/PXB distance → GPUDirect RDMA ✓ (~35 GB/s)
  GPU 4 → mlx5_0: SYS distance     → Cross-socket DMA (extra UPI hop)

Problem: OpenShift/Run:ai cannot guarantee GPU-NIC NUMA affinity.
The SR-IOV VF allocated to a pod may be from the "wrong" NIC.

Mitigation: NCCL_NET_GDR_LEVEL=SYS allows all pairs, but cross-socket
GPUDirect adds latency. Topology-aware VF allocation is the real fix.
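The GPU-to-NIC pairing above can be expressed as a lookup that flags the "wrong NIC" case. This Python sketch hard-codes the NUMA layout drawn in this recipe as an assumption; on a real node the zones would have to be discovered (for example from `nvidia-smi topo -m`) rather than assumed, and the function names are hypothetical.

```python
# NUMA zones as drawn in this recipe (assumed layout, not discovered at runtime).
GPU_NUMA = {gpu: 0 if gpu < 4 else 1 for gpu in range(8)}
NIC_NUMA = {"mlx5_0": 0, "mlx5_1": 1}


def local_nic(gpu: int) -> str:
    """Return the NIC that sits on the same NUMA zone as `gpu`."""
    zone = GPU_NUMA[gpu]
    for nic, nic_zone in NIC_NUMA.items():
        if nic_zone == zone:
            return nic
    raise LookupError(f"no NIC on NUMA zone {zone}")


def is_cross_socket(gpu: int, nic: str) -> bool:
    """True when GPU-to-NIC traffic would need the extra UPI hop."""
    return GPU_NUMA[gpu] != NIC_NUMA[nic]


if __name__ == "__main__":
    for gpu in (0, 4):
        for nic in NIC_NUMA:
            path = "SYS (extra UPI hop)" if is_cross_socket(gpu, nic) else "PIX/PXB"
            print(f"GPU {gpu} -> {nic}: {path}")
    print("preferred NIC for GPU 5:", local_nic(5))
```

A check like `is_cross_socket` is the kind of test a pod could run at startup to detect that its allocated VF belongs to the NIC on the other socket.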

NCCL Ring Algorithm on NVL4 Topology

# NCCL builds channels (rings/trees) that account for topology:

# Optimal ring for 8 GPUs across 2 NVL4 domains:
Ring: 0→1→2→3→[UPI]→4→5→6→7→[RoCE]→(next node)→[RoCE]→0

# The UPI crossing happens once per ring traversal
# Minimizing cross-domain hops is NCCL's graph search goal

# NCCL tuning for NVL4 topology:
env:
  - name: NCCL_MIN_NCHANNELS
    value: "4"
  - name: NCCL_MAX_NCHANNELS
    value: "16"
  # More channels = more parallelism, amortizes UPI latency
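Why ring ordering matters can be shown by counting cross-domain hops for a given GPU order. The Python sketch below is an illustration only, not NCCL's actual graph search; it assumes a closed single-node ring and the 4-GPU domain size from this recipe, and the function names are hypothetical.

```python
def domain_of(gpu: int, domain_size: int = 4) -> int:
    """NVLink domain index of a GPU, assuming contiguous 4-GPU domains."""
    return gpu // domain_size


def cross_domain_hops(ring: list[int], domain_size: int = 4) -> int:
    """Count hops in a closed ring whose two endpoints sit in different domains."""
    hops = 0
    for i, src in enumerate(ring):
        dst = ring[(i + 1) % len(ring)]  # wrap around to close the ring
        if domain_of(src, domain_size) != domain_of(dst, domain_size):
            hops += 1
    return hops


if __name__ == "__main__":
    grouped = [0, 1, 2, 3, 4, 5, 6, 7]      # visit one domain, then the other
    interleaved = [0, 4, 1, 5, 2, 6, 3, 7]  # alternate domains on every hop
    print("grouped:    ", cross_domain_hops(grouped))
    print("interleaved:", cross_domain_hops(interleaved))
```

A closed single-node ring needs two crossings at minimum (out and back), which the grouped order achieves; the interleaved order crosses on all eight hops. In the multi-node ring shown above, the return leg runs over RoCE instead of a second UPI crossing.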

Benchmark Expected Results

Test Configuration            │ Expected busbw │ Bottleneck
──────────────────────────────┼────────────────┼──────────────────────────────
4 GPU intra-NVL4 (same node)  │ ~300-337 GB/s  │ NVLink bandwidth
8 GPU single node (all GPUs)  │ ~50 GB/s       │ UPI cross-socket
4+4 GPU (2 nodes, NVL4 each)  │ ~35 GB/s       │ RoCE 400G
8+8 GPU (2 nodes, all GPUs)   │ ~35 GB/s       │ RoCE (UPI hidden in pipeline)
──────────────────────────────┴────────────────┴──────────────────────────────

Key takeaway: For 8-GPU all_reduce on NVL4 nodes, the UPI bottleneck
dominates. Single-node 8-GPU is only ~50 GB/s, while cross-node is ~35 GB/s.
The incremental penalty of going multi-node is small (~30% less than cross-socket).
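The busbw column is the bus bandwidth reported by nccl-tests, which for all_reduce scales the algorithm bandwidth (bytes / time) by 2(n-1)/n for n ranks. A short Python sketch of that conversion follows; the message size and timing in the example are made-up illustrative values, not measurements.

```python
def algbw_gbps(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: payload size divided by elapsed time, in GB/s."""
    return size_bytes / seconds / 1e9


def allreduce_busbw_gbps(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth for all_reduce: algbw scaled by 2 * (n - 1) / n."""
    if n_ranks < 2:
        raise ValueError("all_reduce needs at least 2 ranks")
    return algbw_gbps(size_bytes, seconds) * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Hypothetical run: a 1 GB all_reduce across 8 ranks taking 35 ms.
    size, elapsed, ranks = 1e9, 0.035, 8
    print(f"algbw: {algbw_gbps(size, elapsed):.1f} GB/s")
    print(f"busbw: {allreduce_busbw_gbps(size, elapsed, ranks):.1f} GB/s")
```

Because busbw is normalized by rank count, it can be compared directly against the link figures in the table, whereas algbw for the same run is lower and changes with the number of ranks.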

Implications for Model Parallelism

Parallelism Strategy          │ Optimal Placement on NVL4 Nodes
──────────────────────────────┼───────────────────────────────────────────
Tensor Parallelism (TP=4)     │ Within ONE NVL4 domain (GPU 0-3 or 4-7)
Tensor Parallelism (TP=8)     │ Full node; accepts UPI penalty
Pipeline Parallelism (PP)     │ Across nodes; uses RoCE, less BW-sensitive
Data Parallelism (DP)         │ Across nodes; gradient allreduce over RoCE
TP=4 + PP=2                   │ TP within NVL4, PP across nodes ← OPTIMAL
TP=4 + DP=N                   │ TP within NVL4, DP across all nodes
──────────────────────────────┴───────────────────────────────────────────

Rule: Keep TP within NVLink domain. Use PP/DP for cross-socket and inter-node.
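The rule can be checked mechanically for a given rank layout. The Python sketch below assumes ranks are assigned to GPUs in order (8 per node) and that tensor-parallel groups are contiguous rank blocks, which is a common convention but an assumption here; frameworks may lay ranks out differently, and the function names are hypothetical.

```python
GPUS_PER_NODE = 8
DOMAIN_SIZE = 4  # GPUs per NVL4 domain


def tp_groups(world_size: int, tp: int) -> list[list[int]]:
    """Contiguous rank blocks of size `tp` (assumed TP rank layout)."""
    if world_size % tp:
        raise ValueError("world_size must be a multiple of tp")
    return [list(range(start, start + tp)) for start in range(0, world_size, tp)]


def nvlink_domain(rank: int) -> tuple[int, int]:
    """(node, domain) for a global rank, assuming ranks fill nodes GPU by GPU."""
    node, local_gpu = divmod(rank, GPUS_PER_NODE)
    return node, local_gpu // DOMAIN_SIZE


def stays_in_one_domain(group: list[int]) -> bool:
    """True when every rank of the group shares one NVLink domain."""
    return len({nvlink_domain(rank) for rank in group}) == 1


if __name__ == "__main__":
    for tp in (4, 8):
        ok = all(stays_in_one_domain(g) for g in tp_groups(16, tp))
        print(f"TP={tp} on 2 nodes: all groups NVLink-local = {ok}")
```

With this layout, TP=4 groups never leave an NVL4 domain, while every TP=8 group spans both domains of a node and therefore crosses UPI.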

Common Issues

All_reduce busbw only ~50 GB/s with 8 GPUs on one node

  • Cause: UPI cross-socket bottleneck between NVL4 domains (expected)
  • Fix: Not fixable on NVL4 hardware. Use TP=4 within domain, or accept penalty.

Inter-node worse than expected (~20 GB/s instead of ~35 GB/s)

  • Cause: GPU and NIC on different NUMA zones (cross-socket GPUDirect)
  • Fix: Ensure NCCL_NET_GDR_LEVEL=SYS and check GPU-NIC affinity

NCCL hangs during 8-GPU ring formation

  • Cause: SYS-level P2P disabled by kernel (IOMMU restriction)
  • Fix: Verify iommu=pt in kernel args; check P2P read capability with nvidia-smi topo -p2p r

Scheduler places 4 GPUs from each NUMA zone

  • Cause: GPU device plugin doesn't respect NUMA topology by default
  • Fix: Enable topology manager in kubelet: topologyManagerPolicy: best-effort

Best Practices

  1. Keep tensor parallelism ≤ 4 on NVL4 nodes: stay within one NVLink domain
  2. Use pipeline parallelism for cross-socket and cross-node: less BW-sensitive
  3. Size NIC per NUMA zone: one 400G NIC per CPU socket for locality
  4. Measure all three tiers: validate NVLink, UPI, and RoCE independently
  5. Consider 2-node 4+4 over single-node 8 GPU: only 30% less BW, double memory
  6. Set NCCL_NET_GDR_LEVEL=SYS: even cross-socket RDMA is better than CPU bounce
  7. Enable topology manager in kubelet for NUMA-aware GPU+NIC placement

Key Takeaways

  • NVL4 bridge creates TWO separate 4-GPU NVLink domains, not one 8-GPU mesh
  • Cross-socket (UPI) is roughly 7× slower than intra-NVLink: 50 vs 337 GB/s
  • Inter-node RoCE (~35 GB/s) is surprisingly close to cross-socket (~50 GB/s)
  • Optimal strategy: TP=4 within NVL4, PP/DP for anything beyond
  • NUMA-aware scheduling is critical but NOT default in OpenShift/Run:ai
  • For models needing >4 GPUs: compare 8-GPU-single-node vs 2×4-GPU-multi-node
  • NVSwitch systems (DGX) eliminate this problem: all 8 GPUs at ~900 GB/s
#gpu #nccl #performance #networking #architecture
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
