ai | advanced | 15 minutes | K8s 1.28+

NCCL GPUDirect RDMA Distance Levels and PIX vs SYS

Understand how NCCL enables GPUDirect RDMA (GDRDMA) based on topology distance: when PIX mode disables GDRDMA for distant GPU-HCA pairs (distance 9 > 4) and when SYS mode enables it.

By Luca Berton • 8 min read

💡 Quick Answer: With NCCL_NET_GDR_LEVEL=PIX, NCCL enables GPUDirect RDMA only if the GPU and HCA are within PCIe distance ≤ 4 (same switch). If the distance is greater (e.g., distance 9 = cross-socket), NCCL logs GPU Direct RDMA Disabled for GPU X / HCA Y (distance 9 > 4) and falls back to host-staged transfers. Setting NCCL_NET_GDR_LEVEL=SYS raises the threshold to 9, so GDRDMA is enabled regardless of distance and the far pair logs GPU Direct RDMA Enabled for GPU 2 / HCA 0 (distance 9 <= 9), read 0 mode Default.

The Problem

  • With NCCL_NET_GDR_LEVEL=PIX, GDRDMA is disabled for GPU-HCA pairs that are too far apart in the topology
  • SR-IOV VF assignment is non-deterministic: some ranks get a nearby HCA, others a distant one
  • It is not obvious how NCCL calculates distance or when it disables GDRDMA
  • Performance is inconsistent across ranks because some use GDRDMA and others fall back to host staging

The Solution

NCCL Distance Calculation

NCCL classifies the path between each GPU and HCA by type and logs that type's internal index as the "distance". The indices depend on the NCCL release; the values below are the ones that appear in this article's logs:

Level │ Path between GPU and HCA                         │ Distance in logs
──────┼──────────────────────────────────────────────────┼─────────────────
PIX   │ Same PCIe switch                                 │ 4
PXB   │ Multiple PCIe switches, no host bridge           │ (not shown)
PHB   │ Through the PCIe host bridge, same NUMA node     │ (not shown)
SYS   │ Across the inter-socket link between NUMA nodes  │ 9
──────┴──────────────────────────────────────────────────┴─────────────────

NCCL_NET_GDR_LEVEL sets the farthest path type for which GDRDMA is still used:
  PIX → threshold 4 (same PCIe switch only)
  PXB → also allows paths through multiple PCIe switches
  PHB → also allows paths through the host bridge on the same NUMA node
  SYS → threshold 9 (always enable, any distance)
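
The threshold rule can be written down in a few lines. This is a minimal sketch and not NCCL's actual code: the function name `gdr_decision` is made up for illustration, and the numeric thresholds (PIX = 4, SYS = 9) are simply the values printed in this article's logs, which may differ in other NCCL releases.

```python
# Thresholds as printed in this article's logs (assumption: they are
# NCCL-internal indices and can change between NCCL releases).
LEVEL_THRESHOLD = {"PIX": 4, "SYS": 9}


def gdr_decision(distance: int, level: str) -> str:
    """Return a verdict for one GPU/HCA pair, mirroring the log wording."""
    threshold = LEVEL_THRESHOLD[level]
    if distance > threshold:
        return f"Disabled (distance {distance} > {threshold})"
    return f"Enabled (distance {distance} <= {threshold})"


# Compare a near pair (distance 4) and a far pair (distance 9) under both levels.
for level in ("PIX", "SYS"):
    for distance in (4, 9):
        print(f"{level}: distance {distance} -> {gdr_decision(distance, level)}")
```

Only the far pair under PIX comes out as disabled, which matches the log lines discussed in the following sections.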

Log Output: PIX Mode (Distance Check Fails)

# NCCL_NET_GDR_LEVEL=PIX in mpijob.yaml worker env:
# Line 187-188: NCCL_NET_GDR_LEVEL: "PIX"

# GPU 2 is far from HCA 0 (distance 9 = cross-socket):
NCCL INFO GPU Direct RDMA Disabled for GPU 2 / HCA 0 (distance 9 > 4)

# But GPU 0 is close to HCA 0 (distance 4, within the PIX threshold):
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 0 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 1 mode Default

# Result: mixed. Some channels use GDRDMA, others fall back
# Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0          ← NO GDRDMA (GPU 2 too far)
# Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IB/0/GDRDMA ← HAS GDRDMA (GPU 0 is close)
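
Scanning a long NCCL_DEBUG=INFO log by eye is error-prone, so a small parser can list the affected pairs. The sketch below assumes the log lines have exactly the shape shown above (decimal GPU and HCA indices); the regular expression and the helper name `summarize_gdr` are illustrative and would need adjusting if your NCCL release formats the line differently.

```python
import re

# Matches the "GPU Direct RDMA Enabled/Disabled" lines as quoted in this article.
GDR_LINE = re.compile(
    r"GPU Direct RDMA (Enabled|Disabled) for GPU (\d+) / HCA (\d+) "
    r"\(distance (\d+) (?:<=|>) (\d+)\)"
)


def summarize_gdr(log_text):
    """Map (gpu, hca) -> dict with the enabled flag, distance, and threshold."""
    pairs = {}
    for match in GDR_LINE.finditer(log_text):
        state, gpu, hca, distance, threshold = match.groups()
        pairs[(int(gpu), int(hca))] = {
            "enabled": state == "Enabled",
            "distance": int(distance),
            "threshold": int(threshold),
        }
    return pairs


sample = """\
NCCL INFO GPU Direct RDMA Disabled for GPU 2 / HCA 0 (distance 9 > 4)
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 0 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 1 mode Default
"""

summary = summarize_gdr(sample)
disabled = sorted(pair for pair, info in summary.items() if not info["enabled"])
print("GPU/HCA pairs without GDRDMA:", disabled)
```

Feeding it the concatenated worker logs gives one entry per GPU/HCA pair, so a non-empty `disabled` list is a quick signal that some ranks are host-staging.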

Log Output: SYS Mode (Always Enabled)

# NCCL_NET_GDR_LEVEL=SYS (from validate_network.sh):
# All GPU-HCA pairs get GDRDMA regardless of distance:

NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 9), read 0 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 9), read 1 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 2 / HCA 0 (distance 9 <= 9), read 0 mode Default
NCCL INFO GPU Direct RDMA Enabled for GPU 2 / HCA 0 (distance 9 <= 9), read 1 mode Default

# All channels show /GDRDMA:
# Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0/GDRDMA
# Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IB/0/GDRDMA
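
To confirm that every network channel really carries the /GDRDMA suffix, the channel lines can be checked mechanically. This is a sketch under the assumption that channel lines look like the ones quoted in this article; `channels_without_gdr` is a hypothetical helper, not part of NCCL.

```python
import re

# Matches lines such as:
#   Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0/GDRDMA
CHANNEL_LINE = re.compile(
    r"Channel (\d+)/(\d+) : (\d+)\[\d+\] -> (\d+)\[\d+\] "
    r"\[(send|receive)\] via (\S+)"
)


def channels_without_gdr(log_text):
    """List network channels whose transport string lacks the /GDRDMA suffix."""
    missing = []
    for match in CHANNEL_LINE.finditer(log_text):
        channel, sub, src, dst, direction, transport = match.groups()
        if transport.startswith("NET/") and not transport.endswith("/GDRDMA"):
            missing.append((f"{channel}/{sub}", f"{src}->{dst}", direction, transport))
    return missing


pix_sample = """\
Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0
Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IB/0/GDRDMA
"""
sys_sample = """\
Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0/GDRDMA
Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IB/0/GDRDMA
"""

pix_missing = channels_without_gdr(pix_sample)
sys_missing = channels_without_gdr(sys_sample)
print("PIX run, channels without GDRDMA:", pix_missing)
print("SYS run, channels without GDRDMA:", sys_missing)
```

An empty list for the SYS run is the expected outcome; any entry points to a channel that is still host-staged.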

Read Modes in GDRDMA

From logs: "read 0 mode Default" / "read 1 mode Default"

read 0 = receive path: the NIC writes incoming data into GPU memory
read 1 = send path: the NIC reads outgoing data from GPU memory
mode Default = using default DMA-BUF or peermem method

Other modes you might see (names can differ between NCCL releases, so verify against yours):
  mode Default: standard nvidia-peermem / DMA-BUF
  mode DMABUF:  explicitly using DMA-BUF interface (kernel 5.12+)
  mode PEERMEM: using legacy nvidia-peermem interface

Why PIX Disables Some Pairs

On a Dell XE7745 (2-socket, 8 GPUs):

Socket 0 (PCIe domain 0000):
  GPU 0 [0000:18:00] ─┐
  GPU 1 [0000:67:00] ─┼── PCIe Switch A ── HCA 0 (mlx5_0)
  GPU 2 [0000:b2:00] ─┤                    HCA 1 (mlx5_3)
  GPU 3 [0000:d8:00] ─┘

Socket 1 (PCIe domain 0001):
  GPU 4 [0001:18:00] ─┐
  GPU 5 [0001:69:00] ─┼── PCIe Switch B ── HCA 2 (mlx5_5)
  GPU 6 [0001:8f:00] ─┤                    HCA 3 (mlx5_6)
  GPU 7 [0001:b3:00] ─┘

With SR-IOV shared device plugin:
  Pod gets ONE VF, which can come from ANY of the 4 PFs (mlx5_0, mlx5_3, mlx5_5, mlx5_6)

If pod's GPU is on Socket 1 but VF is from Socket 0 HCA:
  Distance = 9 (cross-socket) → PIX disables GDRDMA!

With NCCL_NET_GDR_LEVEL=SYS:
  Distance = 9 and threshold = 9 → GDRDMA still enabled
  Performance: slightly worse than PIX-local, but much better than no GDRDMA
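
Whether a given GPU and NIC share a NUMA node can be checked on the host through the `numa_node` attribute that Linux exposes under `/sys/bus/pci/devices/<address>/`. The sketch below is an assumption-laden illustration: the helper names are made up, the PCI addresses in the demo are fake, and the demo runs against a temporary directory that imitates sysfs so it works on any machine. Note that NUMA node and socket are not always the same thing (for example with sub-NUMA clustering).

```python
import tempfile
from pathlib import Path

DEFAULT_SYSFS = "/sys/bus/pci/devices"


def pci_numa_node(bdf, sysfs_root=DEFAULT_SYSFS):
    """Return the NUMA node Linux reports for a PCI device (-1 means unknown)."""
    return int((Path(sysfs_root) / bdf / "numa_node").read_text().strip())


def same_numa_node(gpu_bdf, nic_bdf, sysfs_root=DEFAULT_SYSFS):
    """True only when both devices report the same, known NUMA node."""
    gpu_node = pci_numa_node(gpu_bdf, sysfs_root)
    nic_node = pci_numa_node(nic_bdf, sysfs_root)
    return gpu_node >= 0 and gpu_node == nic_node


# Demo against a fake sysfs tree; the PCI addresses below are invented.
with tempfile.TemporaryDirectory() as fake_sysfs:
    for bdf, node in (("0000:aa:00.0", 0), ("0000:ab:00.0", 0), ("0001:aa:00.0", 1)):
        device_dir = Path(fake_sysfs) / bdf
        device_dir.mkdir()
        (device_dir / "numa_node").write_text(f"{node}\n")
    local_pair = same_numa_node("0000:aa:00.0", "0000:ab:00.0", fake_sysfs)
    remote_pair = same_numa_node("0001:aa:00.0", "0000:ab:00.0", fake_sysfs)

print("GPU and NIC on the same NUMA node (local pair):", local_pair)
print("GPU and NIC on the same NUMA node (cross-socket pair):", remote_pair)
```

On a real host, call `same_numa_node` with the GPU's address and the VF's (or its parent PF's) address and leave `sysfs_root` at its default.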

Performance Impact

Scenario                              │ Effective Bandwidth │ Latency
──────────────────────────────────────┼─────────────────────┼────────
GDRDMA enabled, PIX-local (dist ≤ 4)  │ 48-50 GB/s          │ ~1 µs
GDRDMA enabled, SYS (dist 9)          │ 38-42 GB/s          │ ~3 µs
GDRDMA disabled (host staging)        │ 25-30 GB/s          │ ~8 µs
──────────────────────────────────────┴─────────────────────┴────────

Cross-socket GDRDMA (SYS): ~20% less bandwidth than PIX-local
No GDRDMA (host staging):  ~40-50% less bandwidth than PIX-local

Conclusion: in these measurements, cross-socket GDRDMA in SYS mode beats host staging on both bandwidth and latency.
Use SYS for SR-IOV (non-deterministic placement).
Use PIX only when VF-to-GPU affinity is guaranteed (dedicated NICs, no SR-IOV).
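
The percentages follow directly from the bandwidth ranges in the table. As a quick worked check, the sketch below takes the midpoint of each range (a simplifying assumption, since the article gives ranges and not single figures) and computes how far each scenario falls below the PIX-local case.

```python
# Bandwidth ranges in GB/s, copied from the table above.
RANGES = {
    "pix_local": (48, 50),
    "sys_cross_socket": (38, 42),
    "host_staged": (25, 30),
}


def midpoint(low_high):
    """Midpoint of a (low, high) range."""
    low, high = low_high
    return (low + high) / 2


def percent_below(value, reference):
    """How many percent `value` sits below `reference`."""
    return 100 * (reference - value) / reference


baseline = midpoint(RANGES["pix_local"])
sys_drop = percent_below(midpoint(RANGES["sys_cross_socket"]), baseline)
staged_drop = percent_below(midpoint(RANGES["host_staged"]), baseline)

print(f"Cross-socket GDRDMA: {sys_drop:.0f}% below PIX-local")
print(f"Host staging:        {staged_drop:.0f}% below PIX-local")
```

With midpoints of 49, 40, and 27.5 GB/s this gives roughly 18% and 44%, consistent with the "~20%" and "~40-50%" figures quoted above.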

Proxy Progress and Transport Details

From logs:
  NCCL INFO [Proxy Progress] Device 0 CPU core 127
  └── Network proxy thread for GPU 0 pinned to core 127
  └── Should be on same NUMA as GPU 0 for optimal proxy performance

  NCCL INFO New proxy send connection 4 from local rank 0, transport 2
  NCCL INFO New proxy recv connection 2 from local rank 0, transport 2
  └── transport 0 = P2P (NVLink or PCIe peer-to-peer)
  └── transport 1 = SHM (shared memory)
  └── transport 2 = NET (network)

  NCCL INFO Connected to proxy localRank 0 -> connection 0x7fd020000f00
  └── Connection handle allocated for rank 0's proxy thread
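
The transport index in the proxy lines can be decoded automatically when reading many logs. The following is a small sketch, assuming the line format quoted above and the index-to-name mapping given in this article; `decode_proxy_connections` is a hypothetical helper name.

```python
import re

# Transport indices as listed in this article; check them against your NCCL release.
TRANSPORT_NAMES = {0: "P2P", 1: "SHM", 2: "NET"}

PROXY_CONN = re.compile(
    r"New proxy (send|recv) connection (\d+) from local rank (\d+), transport (\d+)"
)


def decode_proxy_connections(log_text):
    """Return one dict per proxy connection line, with the transport named."""
    connections = []
    for match in PROXY_CONN.finditer(log_text):
        direction, conn_id, local_rank, transport = match.groups()
        connections.append({
            "direction": direction,
            "connection": int(conn_id),
            "local_rank": int(local_rank),
            "transport": TRANSPORT_NAMES.get(int(transport), f"unknown({transport})"),
        })
    return connections


sample = """\
NCCL INFO New proxy send connection 4 from local rank 0, transport 2
NCCL INFO New proxy recv connection 2 from local rank 0, transport 2
"""

decoded = decode_proxy_connections(sample)
for conn in decoded:
    print(conn)
```

Both sample lines decode to NET, which confirms that these proxy connections carry inter-node traffic.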

Configuration Comparison in mpijob.yaml

# Test with PIX (restrictive: disables far GPU-HCA pairs):
env:
  - name: NCCL_IB_DISABLE
    value: "0"
  - name: NCCL_COLLNET_ENABLE
    value: "0"
  - name: NCCL_NET_GDR_LEVEL
    value: "PIX"              # ← Only same-switch GDRDMA
  - name: NCCL_DMABUF_ENABLE
    value: "1"
  - name: NCCL_SHM_DISABLE
    value: "0"

# Test with SYS (permissive: GDRDMA for all pairs):
env:
  - name: NCCL_IB_DISABLE
    value: "0"
  - name: NCCL_COLLNET_ENABLE
    value: "0"
  - name: NCCL_NET_GDR_LEVEL
    value: "SYS"              # ← GDRDMA regardless of distance
  - name: NCCL_DMABUF_ENABLE
    value: "1"
  - name: NCCL_SHM_DISABLE
    value: "0"
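
For a fair A/B comparison the two runs should differ in NCCL_NET_GDR_LEVEL and nothing else. One way to guard against accidental drift is to generate both env lists from a single function and diff them. This is an illustrative sketch: `nccl_env` and `env_diff` are hypothetical helpers, and the variable set is the one shown in the YAML above.

```python
def nccl_env(gdr_level):
    """Build the worker env list from the YAML above for a given GDR level."""
    return [
        {"name": "NCCL_IB_DISABLE", "value": "0"},
        {"name": "NCCL_COLLNET_ENABLE", "value": "0"},
        {"name": "NCCL_NET_GDR_LEVEL", "value": gdr_level},
        {"name": "NCCL_DMABUF_ENABLE", "value": "1"},
        {"name": "NCCL_SHM_DISABLE", "value": "0"},
    ]


def env_diff(env_a, env_b):
    """Return {name: (value_a, value_b)} for every variable that differs."""
    a = {item["name"]: item["value"] for item in env_a}
    b = {item["name"]: item["value"] for item in env_b}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


difference = env_diff(nccl_env("PIX"), nccl_env("SYS"))
print(difference)
```

If the diff ever contains more than the GDR level, the benchmark comparison is no longer isolating the topology effect.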

Interpreting Mixed GDRDMA Logs

When some pairs are enabled and others disabled (PIX mode):

GPU Direct RDMA Disabled for GPU 2 / HCA 0 (distance 9 > 4)
  └── GPU 2 and HCA 0 are on different sockets → too far for PIX

GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 4), read 0 mode Default
  └── GPU 0 and HCA 0 are on the same socket → close enough

This means:
  - Channels FROM GPU 0 → remote: GDRDMA send (fast)
  - Channels TO GPU 0 ← remote: GDRDMA receive (fast)
  - Channels FROM GPU 2 → remote: HOST-STAGED send (slow)
  - Channels TO GPU 2 ← remote: HOST-STAGED receive (slow); the sending side depends on its own GPU-HCA distance

Result: Bottleneck on the slowest path (GPU 2's host-staged transfers)
Fix: Use NCCL_NET_GDR_LEVEL=SYS
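
The bottleneck argument can be made concrete with a toy model: a collective advances at the pace of its slowest participant, so the effective bandwidth is the minimum over ranks. The sketch below is a deliberate simplification, not a model of NCCL's algorithms; the per-path figures are midpoints of the ranges quoted in this article, and `collective_bound` is a made-up helper.

```python
# Illustrative per-path bandwidth in GB/s (midpoints of this article's ranges).
PATH_BANDWIDTH = {
    "gdr_local": 49.0,
    "gdr_cross_socket": 40.0,
    "host_staged": 27.5,
}


def collective_bound(rank_paths):
    """Return (slowest_rank, bandwidth) for a mapping of rank -> path kind."""
    slowest = min(rank_paths, key=lambda rank: PATH_BANDWIDTH[rank_paths[rank]])
    return slowest, PATH_BANDWIDTH[rank_paths[slowest]]


# GPU 0 is near its HCA; GPU 2 is across the socket boundary.
pix_run = {0: "gdr_local", 2: "host_staged"}       # PIX: far pair falls back
sys_run = {0: "gdr_local", 2: "gdr_cross_socket"}  # SYS: far pair keeps GDRDMA

pix_bound = collective_bound(pix_run)
sys_bound = collective_bound(sys_run)
print("PIX bottleneck (rank, GB/s):", pix_bound)
print("SYS bottleneck (rank, GB/s):", sys_bound)
```

In both cases the far GPU sets the pace, but under SYS its floor is the cross-socket GDRDMA figure instead of the host-staged one.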

Common Issues

"GPU Direct RDMA Disabled for GPU X / HCA Y (distance 9 > 4)"

  • Cause: NCCL_NET_GDR_LEVEL=PIX and GPU is on different socket than HCA
  • Fix: Set NCCL_NET_GDR_LEVEL=SYS, which enables GDRDMA at any distance

GDRDMA enabled for some GPUs but not others (inconsistent perf)

  • Cause: SR-IOV VF from far PF assigned to pod; some GPUs close, others far
  • Fix: Use SYS level; or implement topology-aware SR-IOV (pin VFs to local GPUs)

"read 0 mode Default" but bandwidth still low

  • Cause: Cross-socket GDRDMA is slower than PIX-local (~20% less)
  • Fix: This is expected. For optimal: ensure VF is from PF on same socket as GPU

Distance always shows 9 (even for seemingly local pairs)

  • Cause: SR-IOV VF may report different PCIe topology than parent PF
  • Fix: Verify with nvidia-smi topo -m and ibdev2netdev on host (not in pod)

Best Practices

  1. Use NCCL_NET_GDR_LEVEL=SYS for SR-IOV: consistent GDRDMA for all ranks
  2. Use NCCL_NET_GDR_LEVEL=PIX only with dedicated NICs, when GPU-HCA locality is guaranteed
  3. Check logs for "Disabled" messages: any disabled pair becomes the bottleneck
  4. Compare PIX vs SYS benchmark results to quantify topology impact on your hardware
  5. Pin proxy threads to the GPU-local NUMA node to reduce proxy latency for network operations
  6. Monitor per-rank bandwidth to identify ranks that underperform due to distance
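
Per-rank monitoring can be as simple as flagging ranks that fall well below the median. The sketch below is one possible approach, not an established tool: `slow_ranks` is a hypothetical helper, the 0.85 tolerance is an arbitrary starting point, and the sample figures are invented for illustration.

```python
from statistics import median


def slow_ranks(bandwidth_by_rank, tolerance=0.85):
    """Return ranks whose bandwidth is below `tolerance` times the median."""
    cutoff = tolerance * median(bandwidth_by_rank.values())
    return sorted(rank for rank, bw in bandwidth_by_rank.items() if bw < cutoff)


# Made-up per-rank bandwidth in GB/s; rank 2 mimics a host-staged rank.
measured = {0: 48.9, 1: 49.3, 2: 27.1, 3: 48.5}

outliers = slow_ranks(measured)
print("ranks below 85% of median bandwidth:", outliers)
```

A rank that shows up here is a good candidate for a closer look at its "GPU Direct RDMA Disabled" log lines.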

Key Takeaways

  • NCCL_NET_GDR_LEVEL=PIX: threshold 4; GDRDMA is disabled when GPU-HCA distance > 4
  • NCCL_NET_GDR_LEVEL=SYS: threshold 9; GDRDMA is enabled regardless of distance
  • Log message: distance 9 > 4 = cross-socket GPU-HCA pair, GDRDMA disabled
  • Log message: distance 4 <= 4 = GPU-HCA pair within the PIX threshold, GDRDMA enabled
  • Cross-socket GDRDMA (SYS): ~20% less than PIX-local, but 40-50% better than no GDRDMA
  • SR-IOV makes VF placement non-deterministic → use SYS
  • Mixed enabled/disabled pairs create a bottleneck on the slowest rank; avoid it with SYS
  • "read 0/1 mode Default" = DMA-BUF or peermem method for receive/send paths
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
