NCCL GPUDirect RDMA Distance Levels and PIX vs SYS
Understand NCCL GPU Direct RDMA distance-based enablement. When PIX mode disables GDRDMA for distant GPU-HCA pairs (distance 9 > 4) and when SYS mode enables
💡 Quick Answer: When NCCL_NET_GDR_LEVEL=PIX, NCCL only enables GPUDirect RDMA if the GPU and HCA are within PCIe distance ≤ 4 (same switch). If distance > 4 (e.g., distance 9 = cross-socket), NCCL logs "GPU Direct RDMA Disabled for GPU X / HCA Y (distance 9 > 4)" and falls back to host-staged transfers. Switch to NCCL_NET_GDR_LEVEL=SYS to enable GDRDMA regardless of distance; the log then shows "GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 0 mode Default".
The Problem
- With NCCL_NET_GDR_LEVEL=PIX, some GPU-HCA pairs get GDRDMA disabled due to topology distance
- SR-IOV VF assignment is non-deterministic: some ranks get close HCAs, others get far ones
- Need to understand NCCL's distance calculation and when GDRDMA gets disabled vs enabled
- Inconsistent performance across ranks because some use GDRDMA and others fall back
The Solution
NCCL Distance Calculation
NCCL measures PCIe topology distance between GPU and HCA:
Distance │ Meaning                          │ Level Name
─────────┼──────────────────────────────────┼───────────
1        │ Same PCIe switch (PIX)           │ PIX
2        │ Same PCIe root complex           │ PIX
3        │ Through PCIe Host Bridge (PHB)   │ PHB
4        │ Same NUMA node (NODE)            │ NODE
5        │ Cross-NUMA, same machine         │ SYS
6-9      │ Further cross-socket paths       │ SYS
─────────┴──────────────────────────────────┴───────────
NCCL_NET_GDR_LEVEL controls the maximum distance threshold:
PIX  → threshold = 4 (only same NUMA or closer)
PHB  → threshold = 4 (same as PIX in practice)
NODE → threshold = 4 (same NUMA node)
SYS  → threshold = 9+ (always enable, any distance)
Log Output: PIX Mode (Distance Check Fails)
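The distance check that produces the log lines in this article can be sketched in a few lines of Python. This is not NCCL's source code: the threshold values are the ones listed in this article (with SYS taken as 9), and the names GDR_THRESHOLDS, gdr_enabled, and format_gdr_log are hypothetical helpers introduced only for illustration.

```python
# Thresholds as listed in this article (assumption: they match your NCCL build).
GDR_THRESHOLDS = {"PIX": 4, "PHB": 4, "NODE": 4, "SYS": 9}


def gdr_enabled(distance, level):
    """Return True when the GPU-HCA distance is within the level's threshold."""
    return distance <= GDR_THRESHOLDS[level]


def format_gdr_log(gpu, hca, distance, level):
    """Build a message shaped like the NCCL INFO lines quoted in this article."""
    threshold = GDR_THRESHOLDS[level]
    if gdr_enabled(distance, level):
        return (f"GPU Direct RDMA Enabled for GPU {gpu} / HCA {hca} "
                f"(distance {distance} <= {threshold})")
    return (f"GPU Direct RDMA Disabled for GPU {gpu} / HCA {hca} "
            f"(distance {distance} > {threshold})")


if __name__ == "__main__":
    # Same GPU-HCA pair, evaluated under both levels.
    print(format_gdr_log(2, 0, 9, "PIX"))
    print(format_gdr_log(2, 0, 9, "SYS"))
```

The same pair (distance 9) flips from Disabled to Enabled only because the threshold changes; the topology itself is untouched.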
# NCCL_NET_GDR_LEVEL=PIX in mpijob.yaml worker env:
# Line 187-188: NCCL_NET_GDR_LEVEL: "PIX"
# GPU 2 is far from HCA 0 (distance 9 = cross-socket):
NCCL INFO GPU Direct RDMA Disabled for GPU 2 / HCA 0 (distance 9 > 4)
# But GPU 0 is close to HCA 0 (distance 4 = same NUMA):
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 0 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 1 mode Default
# Result: mixed. Some channels use GDRDMA, others fall back
# Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0 ← NO GDRDMA (GPU 2 too far)
# Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IB/0/GDRDMA ← HAS GDRDMA (GPU 0 is close)
Log Output: SYS Mode (Always Enabled)
# NCCL_NET_GDR_LEVEL=SYS (from validate_network.sh):
# All GPU-HCA pairs get GDRDMA regardless of distance:
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 0 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 1 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 2 / HCA 0 (distance 9 <= 9), read 0 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 2 / HCA 0 (distance 9 <= 9), read 1 mode Default
# All channels show /GDRDMA:
# Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0/GDRDMA
# Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IB/0/GDRDMA
Read Modes in GDRDMA
From logs: "read 0 mode Default" / "read 1 mode Default"
read 0 = NIC writes into GPU buffer (receive path)
read 1 = NIC reads from GPU buffer (send path)
mode Default = using default DMA-BUF or peermem method
Other modes you might see:
mode Default → standard nvidia-peermem / DMA-BUF
mode DMABUF  → explicitly using DMA-BUF interface (kernel 5.12+)
mode PEERMEM → using legacy nvidia-peermem interface
Why PIX Disables Some Pairs
In your Dell XE7745 (2-socket, 8 GPUs):
Socket 0 (PCIe domain 0000):
  GPU 0 [0000:18:00] ─┐
  GPU 1 [0000:67:00] ─┤── PCIe Switch A ── HCA 0 (mlx5_0)
  GPU 2 [0000:b2:00] ─┤                    HCA 1 (mlx5_3)
  GPU 3 [0000:d8:00] ─┘
Socket 1 (PCIe domain 0001):
  GPU 4 [0001:18:00] ─┐
  GPU 5 [0001:69:00] ─┤── PCIe Switch B ── HCA 2 (mlx5_5)
  GPU 6 [0001:8f:00] ─┤                    HCA 3 (mlx5_6)
  GPU 7 [0001:b3:00] ─┘
With SR-IOV shared device plugin:
  Pod gets ONE VF, which could come from ANY of the 4 PFs (mlx5_0, mlx5_3, mlx5_5, mlx5_6)
  If pod's GPU is on Socket 1 but VF is from Socket 0 HCA:
    Distance = 9 (cross-socket) → PIX disables GDRDMA!
With NCCL_NET_GDR_LEVEL=SYS:
  Distance = 9 but threshold = 9+ → GDRDMA still enabled
  Performance: slightly worse than PIX-local, but much better than no GDRDMA
Performance Impact
Scenario                              │ Effective Bandwidth │ Latency
──────────────────────────────────────┼─────────────────────┼────────
GDRDMA enabled, PIX-local (dist ≤ 2)  │ 48-50 GB/s          │ ~1 µs
GDRDMA enabled, SYS (dist 9)          │ 38-42 GB/s          │ ~3 µs
GDRDMA disabled (host staging)        │ 25-30 GB/s          │ ~8 µs
──────────────────────────────────────┴─────────────────────┴────────
Cross-socket GDRDMA (SYS): ~20% less than PIX-local
No GDRDMA (host staging): ~40-50% less than PIX-local
Conclusion: SYS mode with cross-socket GDRDMA is ALWAYS better than no GDRDMA.
Use SYS for SR-IOV (non-deterministic placement).
Use PIX only when VF-to-GPU affinity is guaranteed (dedicated NICs, no SR-IOV).
Proxy Progress and Transport Details
From logs:
NCCL INFO [Proxy Progress] Device 0 CPU core 127
├── Network proxy thread for GPU 0 pinned to core 127
└── Should be on same NUMA as GPU 0 for optimal proxy performance
NCCL INFO New proxy send connection 4 from local rank 0, transport 2
NCCL INFO New proxy recv connection 2 from local rank 0, transport 2
├── transport 2 = NET (network)
├── transport 0 = P2P (NVLink or PCIe peer-to-peer)
└── transport 1 = SHM (shared memory)
NCCL INFO Connected to proxy localRank 0 -> connection 0x7fd020000f00
└── Connection handle allocated for rank 0's proxy thread
Configuration Comparison in mpijob.yaml
# Test with PIX (restrictive: disables far GPU-HCA pairs):
env:
  - name: NCCL_IB_DISABLE
    value: "0"
  - name: NCCL_COLLNET_ENABLE
    value: "0"
  - name: NCCL_NET_GDR_LEVEL
    value: "PIX"   # ← Only same-switch GDRDMA
  - name: NCCL_DMABUF_ENABLE
    value: "1"
  - name: NCCL_SHM_DISABLE
    value: "0"

# Test with SYS (permissive: GDRDMA for all pairs):
env:
  - name: NCCL_IB_DISABLE
    value: "0"
  - name: NCCL_COLLNET_ENABLE
    value: "0"
  - name: NCCL_NET_GDR_LEVEL
    value: "SYS"   # ← GDRDMA regardless of distance
  - name: NCCL_DMABUF_ENABLE
    value: "1"
  - name: NCCL_SHM_DISABLE
    value: "0"
Interpreting Mixed GDRDMA Logs
When some pairs are enabled and others disabled (PIX mode):
GPU Direct RDMA Disabled for GPU 2 / HCA 0 (distance 9 > 4)
└── GPU 2 and HCA 0 are on different sockets → too far for PIX
GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 0 mode Default
└── GPU 0 on socket 0, HCA 0 on socket 0 → close enough
This means:
- Channels FROM GPU 0 → remote: GDRDMA send (fast)
- Channels TO GPU 0 ← remote: GDRDMA receive (fast)
- Channels FROM GPU 2 → remote: HOST-STAGED send (slow)
- Channels TO GPU 2 ← remote: depends on remote GPU's proximity
Result: Bottleneck on the slowest path (GPU 2's host-staged transfers)
Fix: Use NCCL_NET_GDR_LEVEL=SYS
Common Issues
“GPU Direct RDMA Disabled for GPU X / HCA Y (distance 9 > 4)”
- Cause: NCCL_NET_GDR_LEVEL=PIX and the GPU is on a different socket than the HCA
- Fix: Set NCCL_NET_GDR_LEVEL=SYS, which enables GDRDMA at any distance
GDRDMA enabled for some GPUs but not others (inconsistent perf)
- Cause: SR-IOV VF from far PF assigned to pod; some GPUs close, others far
- Fix: Use SYS level; or implement topology-aware SR-IOV (pin VFs to local GPUs)
“read 0 mode Default” but bandwidth still low
- Cause: Cross-socket GDRDMA is slower than PIX-local (~20% less)
- Fix: This is expected. For optimal: ensure VF is from PF on same socket as GPU
Distance always shows 9 (even for seemingly local pairs)
- Cause: SR-IOV VF may report different PCIe topology than parent PF
- Fix: Verify with nvidia-smi topo -m and ibdev2netdev on host (not in pod)
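Alongside nvidia-smi topo -m and ibdev2netdev, a quick way to test whether a GPU and an HCA share a socket is to compare the numa_node values Linux exposes in sysfs. The sketch below assumes the standard sysfs layout (/sys/bus/pci/devices/<BDF>/numa_node and /sys/class/infiniband/<device>/device/numa_node); read_numa_node and same_numa are hypothetical helper names, and the sysfs_root parameter exists only so the logic can be exercised without real hardware. It reports NUMA locality only, not NCCL's own distance value.

```python
from pathlib import Path


def read_numa_node(device_dir):
    """Return the NUMA node recorded in <device_dir>/numa_node, or None if unknown."""
    try:
        value = int((Path(device_dir) / "numa_node").read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when it has no NUMA information for the device.
    return value if value >= 0 else None


def same_numa(gpu_bdf, hca_name, sysfs_root="/sys"):
    """True/False if both NUMA nodes are known; None if either is unknown."""
    root = Path(sysfs_root)
    gpu_node = read_numa_node(root / "bus" / "pci" / "devices" / gpu_bdf)
    hca_node = read_numa_node(root / "class" / "infiniband" / hca_name / "device")
    if gpu_node is None or hca_node is None:
        return None
    return gpu_node == hca_node


if __name__ == "__main__":
    # Assumption: the GPU listed as 0000:18:00 in this article is PCI function 0.
    print(same_numa("0000:18:00.0", "mlx5_0"))
```

Run it on the host rather than inside the pod, for the same reason given above: a VF seen from inside the pod may not report the parent PF's topology.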
Best Practices
- Use NCCL_NET_GDR_LEVEL=SYS for SR-IOV: consistent GDRDMA for all ranks
- Use NCCL_NET_GDR_LEVEL=PIX only with dedicated NICs: when GPU-HCA locality is guaranteed
- Check logs for “Disabled” messages: any disabled pair becomes the bottleneck
- Compare PIX vs SYS benchmark results: quantify topology impact for your hardware
- Pin proxy threads to GPU-local NUMA: reduces proxy latency for network operations
- Monitor per-rank bandwidth: identify if specific ranks underperform due to distance
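The log check recommended above can be automated. The following is a minimal sketch that assumes the NCCL INFO lines have exactly the shape quoted in this article; GDR_LINE, summarize_gdr, and disabled_pairs are hypothetical names, and the regular expression may need adjusting for other NCCL versions.

```python
import re

# Matches the Enabled/Disabled lines as they are quoted in this article.
GDR_LINE = re.compile(
    r"GPU Direct RDMA (Enabled|Disabled) for GPU (\d+) / HCA (\d+) "
    r"\(distance (\d+) (?:<=|>) (\d+)\)"
)


def summarize_gdr(log_text):
    """Map (gpu, hca) -> dict with the enabled flag, distance, and threshold."""
    pairs = {}
    for match in GDR_LINE.finditer(log_text):
        state, gpu, hca, distance, threshold = match.groups()
        pairs[(int(gpu), int(hca))] = {
            "enabled": state == "Enabled",
            "distance": int(distance),
            "threshold": int(threshold),
        }
    return pairs


def disabled_pairs(pairs):
    """Return the sorted (gpu, hca) pairs that fell back to host staging."""
    return sorted(key for key, info in pairs.items() if not info["enabled"])


if __name__ == "__main__":
    sample = (
        "NCCL INFO GPU Direct RDMA Disabled for GPU 2 / HCA 0 (distance 9 > 4)\n"
        "NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 "
        "(distance 4 <= 4), read 0 mode Default\n"
    )
    print(disabled_pairs(summarize_gdr(sample)))
```

A non-empty result from disabled_pairs on any rank's log is the signal to revisit NCCL_NET_GDR_LEVEL or the VF placement for that rank.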
Key Takeaways
- NCCL_NET_GDR_LEVEL=PIX: threshold 4 → disables GDRDMA when GPU-HCA distance > 4
- NCCL_NET_GDR_LEVEL=SYS: threshold 9+ → always enables GDRDMA regardless of distance
- Log message: distance 9 > 4 = cross-socket GPU-HCA pair, GDRDMA disabled
- Log message: distance 4 <= 4 = same-NUMA GPU-HCA pair, GDRDMA enabled
- Cross-socket GDRDMA (SYS): ~20% less than PIX-local, but 40-50% better than no GDRDMA
- SR-IOV makes VF placement non-deterministic → always use SYS
- Mixed enabled/disabled creates bottleneck on slowest rank → avoid with SYS
- “read 0/1 mode Default” = DMA-BUF or peermem method for receive/send paths
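The percentages quoted in this article follow from the bandwidth ranges in the Performance Impact table. As a rough check, the sketch below recomputes them from the midpoint of each range; the midpoint choice and the names BANDWIDTH_GBPS, midpoint, percent_less, and percent_more are assumptions made for illustration, not part of any benchmark tool.

```python
# Bandwidth ranges (GB/s) copied from the Performance Impact table.
BANDWIDTH_GBPS = {
    "pix_local": (48, 50),
    "sys_cross_socket": (38, 42),
    "host_staged": (25, 30),
}


def midpoint(bounds):
    """Midpoint of a (low, high) range."""
    low, high = bounds
    return (low + high) / 2


def percent_less(baseline, other):
    """How much lower `other` is than `baseline`, as a percentage of baseline."""
    return 100 * (baseline - other) / baseline


def percent_more(baseline, other):
    """How much higher `other` is than `baseline`, as a percentage of baseline."""
    return 100 * (other - baseline) / baseline


if __name__ == "__main__":
    pix = midpoint(BANDWIDTH_GBPS["pix_local"])
    sys_gdr = midpoint(BANDWIDTH_GBPS["sys_cross_socket"])
    host = midpoint(BANDWIDTH_GBPS["host_staged"])
    print(f"SYS vs PIX-local:  {percent_less(pix, sys_gdr):.1f}% less")   # 18.4% less
    print(f"Host vs PIX-local: {percent_less(pix, host):.1f}% less")      # 43.9% less
    print(f"SYS vs host:       {percent_more(host, sys_gdr):.1f}% more")  # 45.5% more
```

Substituting your own measured ranges for the table values gives the same comparison for your hardware.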

