NVLink Bridge Architecture for GPU Kubernetes Nodes
Understand NVLink Bridge logical architecture in GPU servers for Kubernetes. Dual-socket PCIe Gen5 topology, NVL4 groups, GPU-NIC-NVMe placement, PCIe switch
💡 Quick Answer: NVLink Bridge connects groups of 4 GPUs (NVL4) with high-bandwidth NVLink for direct GPU-to-GPU communication that bypasses PCIe. In a typical 8-GPU dual-socket server the path is CPU → PCIe Gen5 x16 → PCIe Switch → GPUs + NICs. Each CPU socket owns one NVL4 group (4 GPUs) plus 2 NICs, giving two NVL4 groups per server. NVLink provides 900 GB/s bidirectional (H100) between grouped GPUs versus ~64 GB/s per direction (~128 GB/s bidirectional) for PCIe Gen5 x16, which makes NVLink group sizing critical for distributed training performance.
The Problem
- Multi-GPU training performance varies wildly depending on which GPUs are assigned
- Cross-socket GPU communication is 10x slower than intra-NVLink-group
- Need to understand the physical topology to properly size GPU requests
- NIC placement relative to GPUs matters for GPUDirect RDMA performance
- PCIe switch hierarchy creates bandwidth bottlenecks if not understood
The Solution
NVLink Bridge Logical Architecture
┌───────────────────────────────────────────────────────────────────────────────┐
│                        DUAL-SOCKET GPU SERVER (8x GPU)                        │
├───────────────────────────────────────┬───────────────────────────────────────┤
│           SOCKET 0 (NUMA 0)           │           SOCKET 1 (NUMA 1)           │
│                                       │                                       │
│           ┌───────────────┐           │           ┌───────────────┐           │
│           │ System Memory │           │           │ System Memory │           │
│           └───────┬───────┘           │           └───────┬───────┘           │
│                   │                   │                   │                   │
│             ┌─────┴─────┐             │             ┌─────┴─────┐             │
│             │   CPU 0   │◄── QPI/UPI ─┼────────────►│   CPU 1   │             │
│             └──┬─────┬──┘             │             └──┬─────┬──┘             │
│           Gen5 │     │ Gen5           │           Gen5 │     │ Gen5           │
│            x16 │     │ x16            │            x16 │     │ x16            │
│  ┌─────────────┴─┐ ┌─┴─────────────┐  │  ┌─────────────┴─┐ ┌─┴─────────────┐  │
│  │  PCIe Switch  │ │  PCIe Switch  │  │  │  PCIe Switch  │ │  PCIe Switch  │  │
│  └─┬─────┬─────┬─┘ └─┬─────┬─────┬─┘  │  └─┬─────┬─────┬─┘ └─┬─────┬─────┬─┘  │
│    │     │     │     │     │     │    │    │     │     │     │     │     │    │
│  Gen5  Gen5  Gen5  Gen5  Gen5  Gen5   │  Gen5  Gen5  Gen5  Gen5  Gen5  Gen5   │
│   x16   x16   x16   x16   x16   x16   │   x16   x16   x16   x16   x16   x16   │
│    │     │     │     │     │     │    │    │     │     │     │     │     │    │
│ ┌──┴─┐┌──┴─┐┌──┴─┐┌──┴─┐┌──┴─┐┌──┴─┐  │ ┌──┴─┐┌──┴─┐┌──┴─┐┌──┴─┐┌──┴─┐┌──┴─┐  │
│ │NIC0││GPU0││GPU1││GPU2││GPU3││NIC1│  │ │NIC2││GPU4││GPU5││GPU6││GPU7││NIC3│  │
│ └────┘└──┬─┘└──┬─┘└──┬─┘└──┬─┘└────┘  │ └────┘└──┬─┘└──┬─┘└──┬─┘└──┬─┘└────┘  │
│          └─────┴─────┴─────┘          │          └─────┴─────┴─────┘          │
│             NVL4 Group 0              │             NVL4 Group 1              │
│       (900 GB/s bidirectional)        │       (900 GB/s bidirectional)        │
│                                       │                                       │
│ ┌────┐                                │ ┌────┐                                │
│ │NVMe│ Gen4 x4                        │ │NVMe│ Gen4 x4                        │
│ └────┘                                │ └────┘                                │
└───────────────────────────────────────┴───────────────────────────────────────┘
Bandwidth comparison:
NVLink (NVL4, H100): 900 GB/s bidirectional
PCIe Gen5 x16: ~64 GB/s per direction (~128 GB/s bidirectional)
QPI/UPI (cross-socket): ~40 GB/s
Like for like, NVLink is roughly 7x faster than PCIe Gen5 x16 for GPU-to-GPU traffic.
Bandwidth Hierarchy
Connection Path                     │ Bandwidth                     │ Use Case
────────────────────────────────────┼───────────────────────────────┼──────────────────────────────
GPU↔GPU (NVLink, same NVL4 group)   │ 900 GB/s bidirectional (H100) │ Tensor parallelism
                                    │ 600 GB/s bidirectional (A100) │ All-reduce within node
────────────────────────────────────┼───────────────────────────────┼──────────────────────────────
GPU↔GPU (PCIe, cross NVL4 group)    │ ~64 GB/s per direction (Gen5) │ Avoid if possible
                                    │ ~32 GB/s per direction (Gen4) │ (~7x slower than NVLink)
────────────────────────────────────┼───────────────────────────────┼──────────────────────────────
GPU↔NIC (GPUDirect RDMA, PIX)       │ ~50 GB/s (400G)               │ Cross-node all-reduce
                                    │ ~25 GB/s (200G)               │ Data parallel gradient sync
────────────────────────────────────┼───────────────────────────────┼──────────────────────────────
GPU↔CPU Memory (PCIe)               │ ~64 GB/s per direction (Gen5) │ Data loading, preprocessing
────────────────────────────────────┼───────────────────────────────┼──────────────────────────────
CPU↔CPU (QPI/UPI)                   │ ~40 GB/s                      │ Cross-socket access
────────────────────────────────────┴───────────────────────────────┴──────────────────────────────
Kubernetes Scheduling Implications
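# CORRECT: Request 4 GPUs (fills one NVL4 group)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensor-parallel-inference
spec:
  # selector and labels added so the manifest validates; the label value is illustrative
  selector:
    matchLabels:
      app: tensor-parallel-inference
  template:
    metadata:
      labels:
        app: tensor-parallel-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest  # placeholder image; pin your own tag
          resources:
            limits:
              nvidia.com/gpu: 4  # One full NVL4 group
          env:
            - name: NCCL_P2P_LEVEL
              value: "NVL"
# All 4 GPUs communicate at 900 GB/s via NVLink
# (provided they are allocated from the same NVL4 group)

# SUBOPTIMAL: Request 5 GPUs (splits across NVL4 groups)
# Result: 4 GPUs at NVLink speed + 1 GPU at PCIe speed (roughly 7x slower)
# The 5th GPU becomes a bottleneck for all-reduce operations
spec:
  containers:
    - name: training
      resources:
        limits:
          nvidia.com/gpu: 5  # Avoid: crosses the NVL4 boundary
Optimal GPU Request Sizes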
NVL4 Architecture (4 GPUs per NVLink group):
✅ Request 1 GPU → single GPU workload
✅ Request 2 GPUs → same NVL4 group (if topology-aware scheduler)
✅ Request 4 GPUs → full NVL4 group (optimal for TP=4)
✅ Request 8 GPUs → full node (both NVL4 groups; cross-group traffic crosses PCIe/UPI)
❌ Request 3 GPUs → wastes 1 NVLink slot
❌ Request 5 GPUs → one GPU on wrong socket
❌ Request 6 GPUs → 4+2 split, 2 GPUs slower
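The sizing rules above can be expressed as a small check, for example in an admission script or CI lint. This is only a sketch: the function name `classify_gpu_request` and its three labels are hypothetical, and it assumes a node with equally sized NVLink groups.

```python
def classify_gpu_request(num_gpus: int, group_size: int = 4, node_gpus: int = 8) -> str:
    """Classify a GPU request against NVLink group boundaries.

    Returns "aligned", "wasteful" (fits in one group but strands NVLink
    peers), or "split" (spills past a group boundary onto PCIe).
    """
    if not 1 <= num_gpus <= node_gpus:
        raise ValueError(f"request must be between 1 and {node_gpus} GPUs")
    if num_gpus <= group_size:
        # Sizes that divide the group evenly pack without leftovers.
        return "aligned" if group_size % num_gpus == 0 else "wasteful"
    # Beyond one group, only whole multiples of the group size stay aligned.
    return "aligned" if num_gpus % group_size == 0 else "split"


for n in range(1, 9):
    print(n, classify_gpu_request(n))
```

Passing `group_size=8` models a fully NVLink-connected node, where 3, 5, 6, and 7 come back as "wasteful" rather than "split".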
NVL8 Architecture (8 GPUs fully NVLink-connected, e.g., DGX H100):
✅ Request 1, 2, 4, or 8 GPUs
❌ Request 3, 5, 6, 7 → partial NVLink utilization
NIC Placement and GPUDirect RDMA
Each PCIe switch hosts:
- 4 GPUs (NVL4 group)
- 1-2 NICs (ConnectX-7 / BlueField-3)
- Each NIC is "PIX" to its co-located GPUs
For GPUDirect RDMA:
GPU0 ↔PIX↔ NIC0: Data flows GPU → PCIe switch → NIC (single hop)
GPU0 ↔SYS↔ NIC3: Data flows GPU → PCIe switch → CPU0 → QPI → CPU1 → PCIe switch → NIC
(4 hops, 2x latency, reduced throughput)
NCCL detects the topology and selects the nearest NIC automatically; set NCCL_TOPO_DUMP_FILE to write the detected topology to an XML file for inspection.
Force with: NCCL_IB_HCA=mlx5_0:1,mlx5_1:1 (only local NICs)
NVMe Placement
NVMe drives connect via Gen4 x4 to the outermost PCIe switch port:
- One NVMe per socket (or shared)
- Used for checkpoint storage, dataset caching
- Gen4 x4 = ~8 GB/s (sufficient for checkpoint writes)
- Ensure checkpoint writes go to NUMA-local NVMe
Cross-Node Communication (NCCL PXN)
For 2+ node training with NVL4 architecture:
Without PXN:
GPU0 (Node A) → NIC0 (Node A) → Network → NIC0 (Node B) → GPU0 (Node B)
Only 1 NIC per direction (bottleneck: 50 GB/s)
With NCCL PXN (Proxy via NVLink):
GPU0 (Node A) → NVLink → GPU1 (Node A) → NIC1 (Node A) → Network
GPU0 (Node A) → NVLink → GPU2 (Node A) → NIC2 (Node A) → Network
Multiple NICs saturated simultaneously via NVLink proxying!
Effective: 4x NIC bandwidth = 200 GB/s cross-node
Enable: NCCL_PXN_DISABLE=0 (enabled by default on modern NCCL)
Common Issues
Training with 8 GPUs scales worse than expected compared with 4 GPUs
- Cause: 8 GPUs span two NVL4 groups; cross-group communication goes via PCIe/SYS
- Fix: Keep tensor-parallel traffic inside one NVL4 group (for example TP=4 per group); otherwise accept ~80% scaling for 8-GPU vs 4-GPU jobs
GPUDirect RDMA throughput lower than expected
- Cause: NIC on wrong socket (SYS path to GPU instead of PIX)
- Fix: Pin NCCL to PIX-local NICs by setting NCCL_IB_HCA to only the socket-local interfaces
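As a rough illustration of that fix, the sketch below builds an `NCCL_IB_HCA` value from a NIC-to-socket map. The `NIC_SOCKET` inventory and the `local_ib_hca` helper are hypothetical; on a real node the mapping would have to come from the system itself (for example the `numa_node` file under each device in `/sys/class/infiniband`), which this sketch does not read.

```python
# Hypothetical inventory: InfiniBand device name -> CPU socket (NUMA node).
NIC_SOCKET = {"mlx5_0": 0, "mlx5_1": 0, "mlx5_2": 1, "mlx5_3": 1}


def local_ib_hca(socket: int, nic_socket: dict = NIC_SOCKET, port: int = 1) -> str:
    """Return an NCCL_IB_HCA value listing only the NICs on the given socket."""
    names = sorted(name for name, s in nic_socket.items() if s == socket)
    if not names:
        raise ValueError(f"no NICs found on socket {socket}")
    # NCCL_IB_HCA accepts a comma-separated list of device:port entries.
    return ",".join(f"{name}:{port}" for name in names)


print(f"NCCL_IB_HCA={local_ib_hca(0)}")  # NCCL_IB_HCA=mlx5_0:1,mlx5_1:1
```

The resulting string can be set as a container environment variable for pods whose GPUs sit on that socket.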
NCCL reporting "Using PCIe" instead of "NVLink"
- Cause: GPUs from different NVL4 groups assigned; or NVLink disabled
- Fix: Request GPUs in NVL4-aligned quantities; check nvidia-smi nvlink --status
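A related check is to confirm that every pair of GPUs handed to a job is NVLink-connected. The sketch below parses a simplified matrix in the style of `nvidia-smi topo -m`, where `NV#` marks an NVLink connection and `SYS` a cross-socket path. The sample text is made up for illustration (including the link count in `NV6`), and real output carries extra NIC and affinity columns that this parser does not handle.

```python
# Simplified, hypothetical matrix in the style of `nvidia-smi topo -m`.
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3    GPU4
GPU0     X      NV6     NV6     NV6     SYS
GPU1    NV6      X      NV6     NV6     SYS
GPU2    NV6     NV6      X      NV6     SYS
GPU3    NV6     NV6     NV6      X      SYS
GPU4    SYS     SYS     SYS     SYS      X
"""


def parse_topo(text: str) -> dict:
    """Parse the matrix into {row_gpu: {col_gpu: link_type}}."""
    rows = [line.split() for line in text.strip().splitlines()]
    header = rows[0]
    return {row[0]: dict(zip(header, row[1:])) for row in rows[1:]}


def non_nvlink_pairs(matrix: dict, gpus: list) -> list:
    """List GPU pairs in `gpus` whose link type is not NV#."""
    bad = []
    for i, a in enumerate(gpus):
        for b in gpus[i + 1:]:
            if not matrix[a][b].startswith("NV"):
                bad.append((a, b, matrix[a][b]))
    return bad


topo = parse_topo(SAMPLE_TOPO)
print(non_nvlink_pairs(topo, ["GPU0", "GPU1", "GPU2", "GPU3"]))  # []
print(non_nvlink_pairs(topo, ["GPU3", "GPU4"]))  # [('GPU3', 'GPU4', 'SYS')]
```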
vLLM tensor parallelism slow at TP=8
- Cause: TP=8 spans both sockets, so half the all-reduce traffic goes over PCIe
- Fix: Use TP=4 (one NVL4 group) + PP=2; or accept cross-socket penalty on NVL4 systems
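The TP=4 + PP=2 split generalizes to a simple rule: cap tensor parallelism at the NVLink group size and use pipeline stages across groups. The helper below is a hypothetical sketch of that rule; in vLLM the two values would correspond to the `--tensor-parallel-size` and `--pipeline-parallel-size` options.

```python
def plan_parallelism(num_gpus: int, nvlink_group_size: int = 4) -> tuple:
    """Return (tensor_parallel, pipeline_parallel) that keeps TP inside one NVLink group."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus <= nvlink_group_size:
        # Everything fits inside one NVLink group: pure tensor parallelism.
        return num_gpus, 1
    if num_gpus % nvlink_group_size != 0:
        raise ValueError("num_gpus must be a multiple of the NVLink group size")
    # One pipeline stage per NVLink group, tensor parallel within each group.
    return nvlink_group_size, num_gpus // nvlink_group_size


print(plan_parallelism(8))  # (4, 2)
```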
Best Practices
- Align GPU requests to NVL group size: 4 for NVL4, 8 for DGX/NVL8
- Use topology-aware scheduling: Run:ai, Volcano, or NVIDIA DRA plugin
- Pin NICs to GPU groups: ensures GPUDirect RDMA uses the shortest PCIe path
- Set NCCL_TOPO_DUMP_FILE: writes the topology NCCL detected to an XML file, so you can confirm the GPU, NIC, and NVLink layout its ring/tree algorithms are built on
- Enable PXN for cross-node: multiplies effective network bandwidth via NVLink proxy
- TP within NVLink group, DP across nodes: minimize cross-socket traffic
- Benchmark before production: all_reduce_perf from nccl-tests validates topology
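When reading all_reduce_perf results, it helps to know how the reported bus bandwidth relates to algorithm bandwidth. nccl-tests defines algorithm bandwidth as message size divided by time, and for all-reduce scales it by 2(n-1)/n to get bus bandwidth, which is the figure to compare against link speeds. The function name below is hypothetical and the input numbers are made up for illustration.

```python
def allreduce_bus_bandwidth(algbw_gb_s: float, num_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth (GB/s)."""
    if num_ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    # Each rank sends and receives 2*(n-1)/n of the buffer in an all-reduce.
    return algbw_gb_s * 2 * (num_ranks - 1) / num_ranks


# Illustrative input: 200 GB/s algorithm bandwidth measured across 4 ranks.
print(allreduce_bus_bandwidth(200.0, 4))  # 300.0
```

A bus bandwidth close to the PCIe figures rather than the NVLink figures for a 4-GPU run suggests the GPUs were not all taken from one NVL4 group.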
Key Takeaways
- NVLink Bridge connects 4 GPUs (NVL4) at 900 GB/s bidirectional, roughly 7x the bidirectional bandwidth of PCIe Gen5 x16
- Dual-socket = two independent NVL4 groups; cross-group = PCIe/QPI bottleneck
- Architecture: CPU → Gen5 x16 → PCIe Switch → (GPUs + NICs); NVLink between GPUs
- Request GPUs in NVL4-aligned quantities (1, 2, 4, or 8; never 3, 5, 6, or 7)
- NIC-GPU PIX locality is critical for GPUDirect RDMA: same PCIe switch = best
- PXN proxies traffic through NVLink to saturate multiple NICs simultaneously
- NVMe on Gen4 x4 for checkpoint/data: sufficient throughput for storage operations

