ai advanced ⏱ 15 minutes K8s 1.28+

NVLink Bridge Architecture for GPU Kubernetes Nodes

Understand NVLink Bridge logical architecture in GPU servers for Kubernetes. Dual-socket PCIe Gen5 topology, NVL4 groups, GPU-NIC-NVMe placement, PCIe switch

By Luca Berton • 📖 7 min read

💡 Quick Answer: An NVLink Bridge connects groups of 4 GPUs (NVL4) with high-bandwidth NVLink for direct GPU-to-GPU communication that bypasses PCIe. In a typical 8-GPU dual-socket server the path is CPU → PCIe Gen5 x16 → PCIe Switch → GPUs + NICs. Each CPU socket owns 4 GPUs + 2 NICs, giving one NVL4 group per socket and two per node. NVLink provides 900 GB/s of total bidirectional bandwidth per GPU (H100) between grouped GPUs, versus ~64 GB/s per direction (~128 GB/s bidirectional) for PCIe Gen5 x16, which makes NVLink group sizing critical for distributed training performance.

The Problem

  • Multi-GPU training performance varies wildly depending on which GPUs are assigned
  • Cross-socket GPU communication is 10x slower than intra-NVLink-group
  • Need to understand the physical topology to properly size GPU requests
  • NIC placement relative to GPUs matters for GPUDirect RDMA performance
  • PCIe switch hierarchy creates bandwidth bottlenecks if not understood

The Solution

┌─────────────────────────────────────────────────────────────────────────────────┐
│                         DUAL-SOCKET GPU SERVER (8x GPU)                         │
├────────────────────────────────────────┬────────────────────────────────────────┤
│           SOCKET 0 (NUMA 0)            │           SOCKET 1 (NUMA 1)            │
│                                        │                                        │
│          ┌─────────────────┐           │          ┌─────────────────┐           │
│          │  System Memory  │           │          │  System Memory  │           │
│          └────────┬────────┘           │          └────────┬────────┘           │
│                   │                    │                   │                    │
│        ┌──────────┴──────────┐         │        ┌──────────┴──────────┐         │
│        │        CPU 0        │◄─── QPI/UPI ────►│        CPU 1        │         │
│        └─┬─────────────────┬─┘         │        └─┬─────────────────┬─┘         │
│          │ Gen5 x16        │ Gen5 x16  │          │ Gen5 x16        │ Gen5 x16  │
│  ┌───────┴───────┐ ┌───────┴───────┐   │  ┌───────┴───────┐ ┌───────┴───────┐   │
│  │  PCIe Switch  │ │  PCIe Switch  │   │  │  PCIe Switch  │ │  PCIe Switch  │   │
│  └─┬─────┬─────┬─┘ └─┬─────┬─────┬─┘   │  └─┬─────┬─────┬─┘ └─┬─────┬─────┬─┘   │
│    │     │     │     │     │     │     │    │     │     │     │     │     │     │
│  Gen5  Gen5  Gen5  Gen5  Gen5  Gen5    │  Gen5  Gen5  Gen5  Gen5  Gen5  Gen5    │
│   x16   x16   x16   x16   x16   x16    │   x16   x16   x16   x16   x16   x16    │
│    │     │     │     │     │     │     │    │     │     │     │     │     │     │
│  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐   │  ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐   │
│  │NIC│ │GPU│ │GPU│ │GPU│ │GPU│ │NIC│   │  │NIC│ │GPU│ │GPU│ │GPU│ │GPU│ │NIC│   │
│  │ 0 │ │ 0 │ │ 1 │ │ 2 │ │ 3 │ │ 1 │   │  │ 2 │ │ 4 │ │ 5 │ │ 6 │ │ 7 │ │ 3 │   │
│  └───┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └───┘   │  └───┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └───┘   │
│          └─────┴─────┴─────┘           │          └─────┴─────┴─────┘           │
│             NVL4 Group 0               │             NVL4 Group 1               │
│       (NVLink, 900 GB/s per GPU)       │       (NVLink, 900 GB/s per GPU)       │
│                                        │                                        │
│  ┌────┐                                │                                ┌────┐  │
│  │NVMe│ ← Gen4 x4                      │                      Gen4 x4 → │NVMe│  │
│  └────┘                                │                                └────┘  │
└────────────────────────────────────────┴────────────────────────────────────────┘

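Bandwidth comparison:
  NVLink (NVL4, H100):    900 GB/s total bidirectional per GPU (~450 GB/s per direction)
  PCIe Gen5 x16:          ~64 GB/s per direction (~128 GB/s bidirectional)
  QPI/UPI (cross-socket): ~40 GB/s

  Compared like for like, NVLink is roughly 7x faster than PCIe Gen5 x16 for GPU-to-GPU traffic.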
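The PCIe figures above can be derived from the per-lane transfer rate. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes 128b/130b encoding, ignores packet and protocol overhead, and takes the 900 GB/s NVLink number from this article as a total across both directions. The function and constant names are illustrative.

```python
def pcie_gb_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction PCIe bandwidth in GB/s, assuming 128b/130b encoding."""
    return gt_per_s * lanes * (128 / 130) / 8


# Per-lane transfer rates: 16 GT/s for Gen4, 32 GT/s for Gen5.
PCIE_GEN4_X16 = pcie_gb_per_s(16.0, 16)
PCIE_GEN5_X16 = pcie_gb_per_s(32.0, 16)

# NVLink figure quoted in this article, treated as total over both directions.
NVLINK_H100_BIDIR = 900.0

# Like-for-like comparison: bidirectional NVLink vs bidirectional PCIe.
nvlink_vs_pcie = NVLINK_H100_BIDIR / (2 * PCIE_GEN5_X16)

print(f"PCIe Gen4 x16: {PCIE_GEN4_X16:.1f} GB/s per direction")
print(f"PCIe Gen5 x16: {PCIE_GEN5_X16:.1f} GB/s per direction")
print(f"NVLink vs PCIe Gen5 x16: {nvlink_vs_pcie:.1f}x")
```

Comparing a bidirectional NVLink number against a one-direction PCIe number doubles the apparent gap, so it helps to state which convention a ratio uses.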

Bandwidth Hierarchy

Connection Path                   │ Bandwidth              │ Use Case
──────────────────────────────────┼────────────────────────┼────────────────────────────
GPU↔GPU (NVLink, same NVL4 group) │ 900 GB/s bidir (H100)  │ Tensor parallelism
                                  │ 600 GB/s bidir (A100)  │ All-reduce within node
──────────────────────────────────┼────────────────────────┼────────────────────────────
GPU↔GPU (PCIe, cross NVL4 group)  │ ~64 GB/s/dir Gen5      │ Avoid if possible
                                  │ ~32 GB/s/dir Gen4      │ (~7x less than NVLink)
──────────────────────────────────┼────────────────────────┼────────────────────────────
GPU↔NIC (GPUDirect RDMA, PIX)     │ ~50 GB/s (400G)        │ Cross-node all-reduce
                                  │ ~25 GB/s (200G)        │ Data parallel gradient sync
──────────────────────────────────┼────────────────────────┼────────────────────────────
GPU↔CPU Memory (PCIe)             │ ~64 GB/s/dir Gen5      │ Data loading, preprocessing
──────────────────────────────────┼────────────────────────┼────────────────────────────
CPU↔CPU (QPI/UPI)                 │ ~40 GB/s               │ Cross-socket access
──────────────────────────────────┴────────────────────────┴────────────────────────────
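To see what the hierarchy means for collective operations, the sketch below estimates ring all-reduce time on each path using the standard bandwidth-only cost model, in which each rank moves 2(n-1)/n of the payload over its slowest link. It ignores latency and overlap, and the per-direction NVLink value (450 GB/s, half of the 900 GB/s total) is an assumption made for a like-for-like comparison. All names are illustrative.

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time (latency ignored)."""
    if n_gpus < 2:
        return 0.0  # nothing to reduce across a single GPU
    return 2 * (n_gpus - 1) / n_gpus * payload_gb / link_gb_per_s


# Per-direction bandwidths in GB/s; the NVLink value is an assumption (900 / 2).
LINKS = {
    "NVLink (H100, per direction)": 450.0,
    "PCIe Gen5 x16": 64.0,
    "QPI/UPI": 40.0,
}

for name, bandwidth in LINKS.items():
    ms = ring_allreduce_seconds(1.0, 4, bandwidth) * 1000
    print(f"{name:30s} 1 GB all-reduce on 4 GPUs: {ms:6.2f} ms")
```

Because a ring runs at the speed of its slowest hop, a single PCIe or cross-socket link in the ring sets the time for every GPU in the job.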

Kubernetes Scheduling Implications

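# CORRECT: Request 4 GPUs (fills one NVL4 group)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensor-parallel-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensor-parallel-inference   # label is illustrative; a Deployment requires a selector
  template:
    metadata:
      labels:
        app: tensor-parallel-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumption: substitute your own image and tag
          resources:
            limits:
              nvidia.com/gpu: 4    # One full NVL4 group
          env:
            # Use GPU peer-to-peer only between NVLink-connected GPUs,
            # so all 4 GPUs communicate over NVLink
            - name: NCCL_P2P_LEVEL
              value: "NVL"

# SUBOPTIMAL: Request 5 GPUs (splits across NVL4 groups)
# Result: 4 GPUs at NVLink speed + 1 GPU at PCIe speed (roughly 7x less bandwidth)
# The 5th GPU becomes a bottleneck for all-reduce operations
spec:
  containers:
    - name: training
      resources:
        limits:
          nvidia.com/gpu: 5    # Avoid: crosses the NVL4 boundary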

Optimal GPU Request Sizes

NVL4 Architecture (4 GPUs per NVLink group):
  ✅ Request 1 GPU:  single-GPU workload
  ✅ Request 2 GPUs: same NVL4 group (requires a topology-aware scheduler)
  ✅ Request 4 GPUs: full NVL4 group (optimal for TP=4)
  ✅ Request 8 GPUs: full node (both NVL4 groups; cross-group traffic uses PCIe/UPI)
  ❌ Request 3 GPUs: wastes 1 NVLink slot
  ❌ Request 5 GPUs: one GPU on the wrong socket
  ❌ Request 6 GPUs: 4+2 split, 2 GPUs slower

NVL8 Architecture (8 GPUs fully NVLink-connected, e.g., DGX H100):
  ✅ Request 1, 2, 4, or 8 GPUs
  ❌ Request 3, 5, 6, 7: partial NVLink utilization
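The sizing rules above can be expressed as a small check, for example in an admission script or CI lint for manifests. This is a minimal sketch: `classify_request` and its return labels are hypothetical names, and the rule simply encodes the lists above (a request must divide the NVLink group size, or be a whole multiple of it).

```python
def classify_request(gpus: int, group_size: int = 4, node_gpus: int = 8) -> str:
    """Classify a GPU request against the NVLink group size of a node."""
    if not 1 <= gpus <= node_gpus:
        raise ValueError(f"request must be between 1 and {node_gpus} GPUs")
    if gpus <= group_size:
        # Fits in one group: fine if it packs evenly, otherwise slots are stranded.
        return "aligned" if group_size % gpus == 0 else "wastes-slots"
    # Larger than one group: only whole groups avoid a slow straggler.
    return "aligned" if gpus % group_size == 0 else "crosses-group"


for n in range(1, 9):
    print(f"NVL4 request {n}: {classify_request(n)}")
```

Passing `group_size=8` reproduces the NVL8 list: 1, 2, 4, and 8 are aligned, while 3, 5, 6, and 7 are flagged.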

NIC Placement and GPUDirect RDMA

Each CPU socket's PCIe switches host:
  - 4 GPUs (NVL4 group)
  - 1-2 NICs (ConnectX-7 / BlueField-3)
  - Each NIC is "PIX" to the GPUs behind the same PCIe switch

For GPUDirect RDMA:
  GPU0 ←PIX→ NIC0: Data flows GPU → PCIe switch → NIC (single hop)
  GPU0 ←SYS→ NIC3: Data flows GPU → PCIe switch → CPU0 → QPI/UPI → CPU1 → PCIe switch → NIC
                    (4 hops, 2x latency, reduced throughput)

NCCL detects the PCIe topology and selects the nearest NIC automatically. Set NCCL_TOPO_DUMP_FILE to write the detected topology to an XML file for inspection.
Force with: NCCL_IB_HCA=mlx5_0:1,mlx5_1:1  (only local NICs)
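As an illustration of how the NCCL_IB_HCA value could be derived, the sketch below filters NICs by their link type to the GPUs assigned to a pod. The `LINKS` table is hypothetical sample data shaped like the matrix printed by `nvidia-smi topo -m` (PIX, NODE, SYS); the device names, GPU indices, and helper name are assumptions, not output from a real system.

```python
# Hypothetical NIC -> {GPU index: link type} table for socket 0 and socket 1.
LINKS = {
    "mlx5_0": {0: "PIX", 1: "PIX", 2: "NODE", 3: "NODE"},
    "mlx5_1": {0: "NODE", 1: "NODE", 2: "PIX", 3: "PIX"},
    "mlx5_2": {4: "PIX", 5: "PIX", 6: "NODE", 7: "NODE"},
    "mlx5_3": {4: "NODE", 5: "NODE", 6: "PIX", 7: "PIX"},
}


def nccl_ib_hca(links, gpus, port=1, allowed=frozenset({"PIX"})):
    """Build an NCCL_IB_HCA value listing NICs that are local to the given GPUs."""
    local = [
        nic
        for nic, per_gpu in sorted(links.items())
        # GPUs missing from a NIC's row are treated as cross-socket (SYS).
        if any(per_gpu.get(gpu, "SYS") in allowed for gpu in gpus)
    ]
    return ",".join(f"{nic}:{port}" for nic in local)


print(nccl_ib_hca(LINKS, [0, 1, 2, 3]))  # NICs local to NVL4 group 0
```

With this sample table, a pod holding GPUs 0-3 gets the same two NICs as the NCCL_IB_HCA example above, and the cross-socket NICs are left out.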

NVMe Placement

NVMe drives connect via Gen4 x4 to the outermost PCIe switch port:
  - One NVMe per socket (or shared)
  - Used for checkpoint storage, dataset caching
  - Gen4 x4 = ~8 GB/s (sufficient for checkpoint writes)
  - Ensure checkpoint writes go to NUMA-local NVMe
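To sanity-check whether ~8 GB/s is enough for a given model, a rough estimate of checkpoint write time helps. The sketch below assumes the checkpoint holds only the weights, uses the Gen4 x4 link ceiling as the write speed (real drives sustain less), and uses a 70B-parameter model in 16-bit precision purely as an illustrative input.

```python
def checkpoint_write_seconds(params_billion: float, bytes_per_param: int,
                             nvme_gb_per_s: float = 8.0) -> float:
    """Lower-bound write time for a weights-only checkpoint on one NVMe drive."""
    size_gb = params_billion * bytes_per_param  # 1e9 params x bytes, expressed in GB
    return size_gb / nvme_gb_per_s


seconds = checkpoint_write_seconds(70, 2)  # illustrative: 70B parameters, 2 bytes each
print(f"Weights-only checkpoint: about {seconds:.1f} s at the Gen4 x4 ceiling")
```

Optimizer state can multiply the checkpoint size several times over, so treat the result as a floor rather than an expected value.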

Cross-Node Communication (NCCL PXN)

For 2+ node training with NVL4 architecture:

Without PXN:
  GPU0 (Node A) → NIC0 (Node A) → Network → NIC0 (Node B) → GPU0 (Node B)
  Only 1 NIC per direction (bottleneck: 50 GB/s)

With NCCL PXN (PCI × NVLink):
  GPU0 (Node A) → NVLink → GPU1 (Node A) → NIC1 (Node A) → Network
  GPU0 (Node A) → NVLink → GPU2 (Node A) → NIC2 (Node A) → Network
  Multiple NICs are saturated simultaneously via NVLink proxying.
  Effective: up to 4x NIC bandwidth = 200 GB/s cross-node

Enable: NCCL_PXN_DISABLE=0 (enabled by default on modern NCCL)
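The 50 GB/s and 200 GB/s figures follow from the NIC line rate. The sketch below shows the arithmetic under the assumption of 400 Gb/s NICs and four NICs per node, as in the diagram; it gives an upper bound and ignores protocol overhead. Function names are illustrative.

```python
def nic_gb_per_s(line_rate_gbit: float) -> float:
    """Convert a NIC line rate in Gb/s to GB/s."""
    return line_rate_gbit / 8


def cross_node_gb_per_s(line_rate_gbit: float, nics_used: int) -> float:
    """Upper bound on cross-node bandwidth when traffic is spread over several NICs."""
    return nic_gb_per_s(line_rate_gbit) * nics_used


print(f"Without PXN (1 NIC):  {cross_node_gb_per_s(400, 1):.0f} GB/s")
print(f"With PXN    (4 NICs): {cross_node_gb_per_s(400, 4):.0f} GB/s")
```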

Common Issues

Training with 8 GPUs scales worse than expected compared to 4 GPUs

  • Cause: 8 GPUs span two NVL4 groups; cross-group communication via PCIe/SYS
  • Fix: Keep tensor parallelism inside one NVL4 group and use pipeline or data parallelism across groups; or accept ~80% scaling for 8 GPU vs 4 GPU jobs

GPUDirect RDMA throughput lower than expected

  • Cause: NIC on wrong socket (SYS path to GPU instead of PIX)
  • Fix: Pin NCCL to PIX-local NICs: NCCL_IB_HCA with only socket-local interfaces
  • Cause: GPUs from different NVL4 groups assigned; or NVLink disabled
  • Fix: Request GPUs in NVL4-aligned quantities; check nvidia-smi nvlink --status

vLLM tensor parallelism slow at TP=8

  • Cause: TP=8 spans both sockets, so half the all-reduce traffic goes over PCIe
  • Fix: Use TP=4 (one NVL4 group) + PP=2; or accept cross-socket penalty on NVL4 systems
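The TP=4 + PP=2 fix generalizes to "tensor parallel inside one NVLink group, pipeline parallel across groups". The sketch below picks that split and renders it as vLLM command-line flags. The helper names are hypothetical; `--tensor-parallel-size` and `--pipeline-parallel-size` are the vLLM options this layout maps to, but check the flags against your vLLM version.

```python
from __future__ import annotations


def parallel_layout(total_gpus: int, nvlink_group: int) -> tuple[int, int]:
    """Return (tensor_parallel, pipeline_parallel) with TP confined to one NVLink group."""
    tp = min(total_gpus, nvlink_group)
    if total_gpus % tp != 0:
        raise ValueError("total_gpus must be a multiple of the tensor-parallel size")
    return tp, total_gpus // tp


def vllm_args(total_gpus: int, nvlink_group: int) -> list[str]:
    """Render the layout as vLLM CLI flags."""
    tp, pp = parallel_layout(total_gpus, nvlink_group)
    return ["--tensor-parallel-size", str(tp), "--pipeline-parallel-size", str(pp)]


print(" ".join(vllm_args(8, 4)))  # NVL4 node: TP=4, PP=2
print(" ".join(vllm_args(8, 8)))  # NVL8 node: TP=8, PP=1
```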

Best Practices

  1. Align GPU requests to NVL group size: 4 for NVL4, 8 for DGX/NVL8
  2. Use topology-aware scheduling: Run:ai, Volcano, or NVIDIA DRA plugin
  3. Pin NICs to GPU groups: ensures GPUDirect RDMA uses the shortest PCIe path
  4. Set NCCL_TOPO_DUMP_FILE: dumps the topology NCCL detected, so you can verify it before tuning ring/tree algorithms
  5. Enable PXN for cross-node: multiplies effective network bandwidth via NVLink proxy
  6. TP within NVLink group, DP across nodes: minimizes cross-socket traffic
  7. Benchmark before production: all_reduce_perf from nccl-tests validates topology
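When reading all_reduce_perf results, the number to compare against link speed is the bus bandwidth, which nccl-tests derives from the measured algorithm bandwidth by the factor 2(n-1)/n for all-reduce. The sketch below reproduces that conversion and adds a simple pass/fail check; the 0.8 tolerance and the sample timing are assumptions for illustration, not recommended thresholds.

```python
def allreduce_bus_bandwidth(bytes_moved: int, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce, as defined by nccl-tests."""
    alg_bw = bytes_moved / seconds / 1e9  # algorithm bandwidth in GB/s
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


def meets_expectation(bus_bw: float, expected_gb_per_s: float, tolerance: float = 0.8) -> bool:
    """True if measured bus bandwidth reaches the chosen fraction of the expected link speed."""
    return bus_bw >= tolerance * expected_gb_per_s


# Illustrative numbers: 1 GB reduced across 4 ranks in 5 ms.
bus_bw = allreduce_bus_bandwidth(10**9, 0.005, 4)
print(f"busbw = {bus_bw:.0f} GB/s, OK vs 450 GB/s target: {meets_expectation(bus_bw, 450.0)}")
```

A bus bandwidth close to PCIe speed on a job that should be NVLink-only is a strong hint that the assigned GPUs span two NVL4 groups.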

Key Takeaways

  • NVLink Bridge connects 4 GPUs (NVL4) at 900 GB/s bidirectional per GPU, roughly 7x the bandwidth of PCIe Gen5 x16
  • Dual-socket = two independent NVL4 groups; cross-group = PCIe/QPI bottleneck
  • Architecture: CPU → Gen5 x16 → PCIe Switch → (GPUs + NICs); NVLink between GPUs
  • Request GPUs in NVL4-aligned quantities (1, 2, 4, or 8; never 3, 5, 6)
  • NIC-GPU PIX locality is critical for GPUDirect RDMA: same PCIe switch = best
  • PXN proxies traffic through NVLink to saturate multiple NICs simultaneously
  • NVMe on Gen4 x4 for checkpoint/data: sufficient throughput for storage operations
#nvlink #gpu-architecture #pcie #nvidia #hpc
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
