NCCL GPUDirect RDMA Level Tuning PIX PXB PHB SYS
Tune NCCL_NET_GDR_LEVEL for optimal GPUDirect RDMA performance on Kubernetes. Compare PIX, PXB, PHB, and SYS distance thresholds with PCIe topology. Benchmark
💡 Quick Answer:
NCCL_NET_GDR_LEVEL controls the maximum PCIe distance at which NCCL enables GPUDirect RDMA (GPU memory ↔ NIC without CPU copy). Values from most restrictive to least: LOC (same device) → PIX (same PCIe switch) → PXB (same root complex) → PHB (same NUMA/host bridge) → SYS (any distance, crosses sockets). Start with PHB for safety, test SYS for maximum bandwidth, and fall back to PXB if you see errors.
The Problem
- GPUDirect RDMA performance depends on PCIe topology between GPU and NIC
- Wrong GDR level either disables RDMA for valid pairs or enables it for unstable paths
- SR-IOV VFs may have different effective PCIe distances than physical functions
- No clear guidance on which level works best for specific hardware configurations
- Need systematic testing methodology to find optimal setting
The Solution
Understanding GDR Levels
Level │ PCIe Distance │ Meaning                           │ Risk
──────┼───────────────┼───────────────────────────────────┼─────────
LOC   │ ≤ 3           │ Same device (loopback only)       │ None
PIX   │ ≤ 4           │ Same PCIe switch                  │ None
PXB   │ ≤ 5           │ Same PCIe root complex            │ Low
PHB   │ ≤ 6           │ Same CPU socket / host bridge     │ Low
SYS   │ ≤ 9           │ Cross-socket, any path            │ Medium
──────┴───────────────┴───────────────────────────────────┴─────────
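The thresholds in the table amount to a single comparison: GPUDirect RDMA is allowed for a GPU-NIC pair when the pair's distance is at or below the threshold of the configured level. A minimal shell sketch of that rule follows. The function names gdr_threshold and gdr_allowed are hypothetical helpers, and the numbers are copied from the table rather than read from NCCL, so they may differ between NCCL versions.

```shell
#!/bin/sh
# Map a level name to the distance threshold listed in the table.
# Assumption: these values mirror the table; they are not queried from NCCL.
gdr_threshold() {
  case "$1" in
    LOC) echo 3 ;;
    PIX) echo 4 ;;
    PXB) echo 5 ;;
    PHB) echo 6 ;;
    SYS) echo 9 ;;
    *)   echo "unknown level: $1" >&2; return 1 ;;
  esac
}

# Exit 0 when a pair at distance $2 may use GDR under level $1,
# 1 when it may not, and 2 when the level name is not recognized.
gdr_allowed() {
  threshold=$(gdr_threshold "$1") || return 2
  [ "$2" -le "$threshold" ]
}
```

For example, gdr_allowed PHB 5 succeeds because 5 is at or below the PHB threshold of 6, while gdr_allowed PXB 9 fails because a cross-socket pair exceeds the PXB threshold.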
Higher level = more GPU-NIC pairs can use RDMA = more bandwidth potential
But: cross-socket RDMA may add latency or cause stability issues on some platforms
PCIe Topology Example
Socket 0                          Socket 1
┌─────────────────────────────┐   ┌─────────────────────────────┐
│ Root Complex 0              │   │ Root Complex 1              │
│ ├── PCIe Switch A           │   │ ├── PCIe Switch C           │
│ │   ├── GPU 0 [0000:42:00]  │   │ │   ├── GPU 2 [0000:8c:00]  │
│ │   └── NIC 0 (mlx5_0)      │   │ │   └── NIC 2 (mlx5_5)      │
│ └── PCIe Switch B           │   │ └── PCIe Switch D           │
│     ├── GPU 1 [0000:5e:00]  │   │     ├── GPU 3 [0000:c7:00]  │
│     └── NIC 1 (mlx5_3)      │   │     └── NIC 3 (mlx5_6)      │
└─────────────────────────────┘   └─────────────────────────────┘
GPU 0 ↔ NIC 0: PIX (same switch)        → Always works
GPU 0 ↔ NIC 1: PXB (same root complex)  → Works with PXB+
GPU 0 ↔ NIC 2: SYS (cross-socket)       → Only works with SYS
Testing Each Level with MPIJob
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-gdr-level-test
  namespace: gpu-benchmark
spec:
  slotsPerWorker: 2
  runPolicy:
    cleanPodPolicy: None
    backoffLimit: 0
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: launcher
              image: registry.example.com/nccl-validator:v6
              args: ["mpi-job"]
              env:
                - name: MPI_NP
                  value: "4"
                - name: GPUS_PER_MPI_PROCESS
                  value: "1"
                - name: NCCL_SOCKET_IFNAME
                  value: "net1"
                - name: NCCL_NET_GDR_LEVEL
                  value: "PXB" # Change per test run
                - name: NCCL_DMABUF_ENABLE
                  value: "1"
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_DEBUG_SUBSYS
                  value: "INIT,NET,GRAPH"
                - name: NCCL_TEST_MIN_BYTES
                  value: "1G"
                - name: NCCL_TEST_MAX_BYTES
                  value: "16G"
                - name: OMPI_MCA_btl_tcp_if_include
                  value: "eth0"
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: sriov-rdma-net
        spec:
          subdomain: nccl-gdr-level-test
          containers:
            - name: worker
              image: registry.example.com/nccl-validator:v6
              args: ["shell"]
              env:
                - name: START_SSHD
                  value: "true"
                - name: NCCL_NET_GDR_LEVEL
                  value: "PXB" # Must match launcher
                - name: NCCL_DMABUF_ENABLE
                  value: "1"
              resources:
                requests:
                  nvidia.com/gpu: 2
                  openshift.io/mellanoxnics: 1
                limits:
                  nvidia.com/gpu: 2
                  openshift.io/mellanoxnics: 1
Automated Comparison Script
#!/bin/bash
# Run GDR level comparison across all levels
NAMESPACE="gpu-benchmark"
LEVELS=("PIX" "PXB" "PHB" "SYS")
RESULTS_FILE="/tmp/gdr-comparison.csv"
echo "level,min_busbw,max_busbw,avg_busbw" > "${RESULTS_FILE}"
for level in "${LEVELS[@]}"; do
  echo "=== Testing NCCL_NET_GDR_LEVEL=${level} ==="
  # Update the MPIJob YAML: set the level on both the launcher and the worker line
  sed -E "s/value: \".*\" # (Change per test run|Must match launcher)/value: \"${level}\" # \1/" \
    nccl-gdr-test.yaml | kubectl apply -n "${NAMESPACE}" -f -
  # Wait for completion; record NA if the job does not succeed in time
  if kubectl wait --for=condition=Succeeded mpijob/nccl-gdr-level-test \
      -n "${NAMESPACE}" --timeout=600s; then
    # Extract busbw from launcher logs (the launcher runs as a Job named <mpijob>-launcher).
    # Assumes standard nccl-tests result rows, where out-of-place busbw is column 8;
    # adjust the column if your image prints a different layout.
    STATS=$(kubectl logs -n "${NAMESPACE}" job/nccl-gdr-level-test-launcher 2>/dev/null | \
      grep -E '^[[:space:]]+[0-9]' | \
      awk '{ v = $8 + 0
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             sum += v; n++ }
           END { if (n) { printf "%.2f,%.2f,%.2f", min, max, sum / n } else { printf "NA,NA,NA" } }')
  else
    STATS="NA,NA,NA"
  fi
  echo "${level},${STATS}" >> "${RESULTS_FILE}"
  # Cleanup
  kubectl delete mpijob nccl-gdr-level-test -n "${NAMESPACE}"
  sleep 30 # Wait for pods to terminate
done
echo ""
echo "=== Results ==="
cat "${RESULTS_FILE}"
Interpreting NCCL Logs for GDR Status
# GDR ENABLED: look for "GPU Direct RDMA Enabled" in logs:
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 9 <= 9), read 1 mode Default
# The "distance X <= Y" shows:
#   X = actual PCIe distance between GPU and HCA
#   Y = threshold from NCCL_NET_GDR_LEVEL setting
#
# distance 4 = PIX (same switch)
# distance 5 = PXB (same root complex)
# distance 6 = PHB (same host bridge)
# distance 9 = SYS (cross-socket QPI/UPI)
# GDR DISABLED: the channel still uses NET/IB but has no GDRDMA suffix (data is staged through host memory):
NCCL INFO Channel 0/0 : 0[0] -> 2[0] [send] via NET/IB/0 # No GDRDMA suffix
# vs enabled:
NCCL INFO Channel 0/0 : 0[0] -> 2[0] [send] via NET/IB/0/GDRDMA
When to Use Each Level
# PIX: Ultra-conservative, only same-switch GPU-NIC pairs
# Use when: Debugging RDMA errors, isolating topology issues
# Expect: Some pairs lose GDR and stage data through host memory instead
env:
  - name: NCCL_NET_GDR_LEVEL
    value: "PIX"
# PXB: Safe for most single-socket configurations
# Use when: GPU and NIC on different switches but same CPU
# Expect: All intra-socket pairs use RDMA
env:
  - name: NCCL_NET_GDR_LEVEL
    value: "PXB"
# PHB: Recommended starting point (script default)
# Use when: Standard dual-socket with NUMA-local NICs
# Expect: All same-NUMA pairs use RDMA
env:
  - name: NCCL_NET_GDR_LEVEL
    value: "PHB"
# SYS: Maximum performance, all pairs use RDMA
# Use when: Platform validated, IOMMU enabled, stable
# Expect: Cross-socket RDMA enabled, highest bandwidth
env:
  - name: NCCL_NET_GDR_LEVEL
    value: "SYS"
Common Issues
GDR enabled but bandwidth lower than expected
- Cause: Cross-socket RDMA adds QPI/UPI hop latency
- Fix: Compare SYS vs PHB results. If PHB is faster, cross-socket overhead dominates. Use PHB + topology-aware scheduling.
"GPU Direct RDMA Enabled" not appearing in logs
- Cause: GDR level too restrictive for your topology
- Fix: Increase level (PIX → PXB → PHB → SYS) or check NCCL_DMABUF_ENABLE=1
Inconsistent bandwidth across runs
- Cause: SR-IOV VF assignment is non-deterministic; different VFs have different PCIe distances
- Fix: Pin VFs to specific NUMA nodes via the SriovNetworkNodePolicy priority field
IOMMU errors with SYS level
- Cause: Cross-socket DMA requires IOMMU passthrough or permissive mode
- Fix: Verify intel_iommu=on iommu=pt in kernel args; check dmesg | grep -i iommu
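Several of these issues come down to reading NCCL's debug output. As a rough aid, the sketch below summarizes a saved launcher log: it prints the GPU, HCA, actual distance, and threshold from each "GPU Direct RDMA Enabled" line, and counts NET/IB channel lines with and without the GDRDMA suffix. The function names are hypothetical, and the patterns assume the log line formats shown earlier in this article; other NCCL versions may format these lines differently.

```shell
#!/bin/sh
# Print "gpu hca distance threshold" for each "GPU Direct RDMA Enabled" line.
# Assumed format: ... for GPU <g> / HCA <h> (distance <x> <= <y>) ...
gdr_enabled_pairs() {
  sed -n 's|.*GPU Direct RDMA Enabled for GPU \([0-9][0-9]*\) / HCA \([0-9][0-9]*\) (distance \([0-9][0-9]*\) <= \([0-9][0-9]*\)).*|\1 \2 \3 \4|p' "$1"
}

# Print "gdr nogdr": how many NET/IB channel lines do and do not
# carry the GDRDMA suffix.
gdr_channel_counts() {
  awk '/via NET\/IB\// { if ($0 ~ /\/GDRDMA/) gdr++; else nogdr++ }
       END { printf "%d %d\n", gdr, nogdr }' "$1"
}
```

One way to use it is to save the launcher log to a file (launcher.log is a placeholder name) and run gdr_enabled_pairs launcher.log followed by gdr_channel_counts launcher.log. A non-zero second count means some channels are running without GPUDirect RDMA.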
Best Practices
- Always test incrementally: PIX → PXB → PHB → SYS, comparing busbw at each level
- Check IOMMU first: SYS requires proper IOMMU configuration
- Match launcher and worker env: Both must set the same NCCL_NET_GDR_LEVEL
- Use NCCL_DMABUF_ENABLE=1: Required for modern GPUDirect RDMA with DMA-BUF
- Log the distance: NCCL_DEBUG=INFO shows the actual PCIe distance in "Enabled" messages
- Validate per-rank HCA selection: Each rank should use the NIC closest to its GPU
- PHB is the safe production default: Enables RDMA for all same-NUMA pairs without cross-socket risk
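The "match launcher and worker env" practice is easy to break when editing a manifest by hand. One way to catch a mismatch before applying it is a small check like the sketch below, which collects the value that follows each "name: NCCL_NET_GDR_LEVEL" line and fails if the values differ. The function names are hypothetical, and the parsing assumes the name/value layout used in the MPIJob manifest in this article; it is a plain text scan, not a YAML parser.

```shell
#!/bin/sh
# List the distinct NCCL_NET_GDR_LEVEL values in a manifest, one per line.
# Only the "value:" line directly after "name: NCCL_NET_GDR_LEVEL" is read.
gdr_levels_in_manifest() {
  awk '
    want && /value:/ {
      v = $0
      sub(/.*value:[ \t]*/, "", v)   # drop everything up to the value
      sub(/[ \t]*#.*/, "", v)        # drop a trailing comment
      gsub(/"/, "", v)               # drop surrounding quotes
      print v
    }
    { want = ($0 ~ /name:[ \t]*NCCL_NET_GDR_LEVEL[ \t]*$/) }
  ' "$1" | sort -u
}

# Exit 0 only when the manifest sets exactly one distinct level.
check_gdr_level_consistent() {
  count=$(gdr_levels_in_manifest "$1" | wc -l | tr -d ' ')
  [ "$count" -eq 1 ]
}
```

Running check_gdr_level_consistent nccl-gdr-test.yaml before kubectl apply would flag a manifest where the launcher was changed to SYS but the worker still says PXB.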
Key Takeaways
- NCCL_NET_GDR_LEVEL is the primary knob for GPUDirect RDMA enable/disable per pair
- Higher levels enable more GPU-NIC pairs but may cross NUMA boundaries
- PHB (the script default) is optimal for most configurations: same-NUMA RDMA without crossing sockets
- SYS gives maximum bandwidth when the platform supports cross-socket DMA reliably
- Always verify with NCCL_DEBUG=INFO: look for "GPU Direct RDMA Enabled (distance X <= Y)"
- SR-IOV VF placement affects effective distance; topology-aware scheduling helps
- Test with NCCL_NET_PLUGIN=none first (socket baseline) then with IB plugin (RDMA)
