GPUDirect RDMA Setup and Verification on Kubernetes
Enable and verify GPUDirect RDMA for GPU-to-NIC direct data transfer on Kubernetes. Install nvidia-peermem, configure DMA-BUF, verify RDMA paths, troubleshoot
💡 Quick Answer: GPUDirect RDMA lets the NIC read and write GPU memory directly without CPU involvement, reducing latency by ~50% and increasing bandwidth by 30-50% for inter-node GPU communication. Enable it with modprobe nvidia-peermem, verify with cat /sys/module/nvidia_peermem/version, and confirm in NCCL logs by checking for the /GDRDMA suffix on NET/IB channels.
The Problem
- Inter-node GPU communication routes data GPU → CPU memory → NIC (two extra copies)
- Without GDRDMA, NCCL falls back to host staging, which wastes PCIe bandwidth and adds latency
- Need to verify GPUDirect RDMA is actually active (not just configured)
- nvidia-peermem module may not load automatically after driver install
- DMA-BUF kernel support required but may be missing on older kernels
The Solution
Enable nvidia-peermem
# Load the nvidia-peermem kernel module
modprobe nvidia-peermem
# Verify loaded
lsmod | grep nvidia_peermem
# nvidia_peermem 16384 0
# Check version
cat /sys/module/nvidia_peermem/version
# 2.0
# Make persistent across reboots
echo "nvidia-peermem" >> /etc/modules-load.d/nvidia-peermem.conf
Verify DMA-BUF Support
# DMA-BUF is required for modern GPUDirect RDMA (kernel 5.12+)
# Check kernel support
grep CONFIG_DMA_SHARED_BUFFER /boot/config-$(uname -r)
# CONFIG_DMA_SHARED_BUFFER=y
# Verify NVIDIA driver exposes DMA-BUF per GPU
ls /sys/bus/pci/devices/0000:*/dma_buf_supported 2>/dev/null
# If exists, DMA-BUF is available
# In NCCL logs, look for:
# NCCL INFO DMA-BUF is available on GPU device 0
# NCCL INFO DMA-BUF is available on GPU device 1
# ... (must appear for EACH GPU)
GPU Operator Configuration for GDRDMA
# ClusterPolicy with GPUDirect RDMA enabled
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    rdma:
      enabled: true       # Loads nvidia-peermem automatically
      useHostMofed: true  # Use host-installed MLNX_OFED
  devicePlugin:
    enabled: true
  gfd:
    enabled: true
OpenShift MachineConfig for nvidia-peermem
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-nvidia-peermem
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/modules-load.d/nvidia-peermem.conf
          mode: 0644
          contents:
            source: data:,nvidia-peermem
Verify GDRDMA in NCCL
# Run with NCCL_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_NET_GDR_LEVEL=5 # Use GDRDMA at all topology distances
# In output, look for:
# ✅ GOOD: GDRDMA active
# NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
# ❌ BAD: GDRDMA not active
# NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0
# (no /GDRDMA suffix = data stages through CPU memory)
NCCL_NET_GDR_LEVEL Explained
Level │ Meaning                                   │ GDRDMA Used When
──────┼───────────────────────────────────────────┼────────────────────────
0     │ Disabled                                  │ Never
1     │ Only when GPU and NIC on same PCIe switch │ PIX only
2     │ Same PCIe tree (through Host Bridge)      │ PIX + PHB
3     │ Same NUMA node                            │ PIX + PHB + NODE
4     │ Same machine (may cross sockets)          │ PIX + PHB + NODE + SYS
5     │ Always use GDRDMA regardless of distance  │ All (recommended)
──────┴───────────────────────────────────────────┴────────────────────────
Recommendation: NCCL_NET_GDR_LEVEL=5
Even cross-socket GDRDMA is faster than CPU staging for large messages.
NCCL automatically selects the nearest NIC anyway via topology detection.
Performance Comparison
Transfer Type            │ Latency  │ Bandwidth  │ CPU Load
─────────────────────────┼──────────┼────────────┼─────────
GPU → NIC (GDRDMA)       │ ~1-2 µs  │ 48-50 GB/s │ ~0%
GPU → CPU → NIC (staged) │ ~5-10 µs │ 25-35 GB/s │ High
─────────────────────────┴──────────┴────────────┴─────────
For 8-GPU all-reduce across 2 nodes:
With GDRDMA: ~35 GB/s bus bandwidth
Without GDRDMA: ~20-25 GB/s bus bandwidth
Difference: 30-50% throughput gain
Test GDRDMA Directly (without NCCL)
# Use perftest tools with GPU memory flag
# Server (node 1):
ib_write_bw -d mlx5_0 --use_cuda=0 --report_gbits -s 1048576
# Client (node 2):
ib_write_bw -d mlx5_0 --use_cuda=0 --report_gbits -s 1048576 10.10.0.1
# Expected output with GDRDMA:
# #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec]
# 1048576 5000 395.2 393.8
# Without --use_cuda (CPU memory):
# 1048576 5000 380.1 378.5
# If --use_cuda shows significantly LESS than without, GDRDMA is broken
Kubernetes Pod with GDRDMA
apiVersion: v1
kind: Pod
metadata:
  name: gdrdma-test
spec:
  containers:
    - name: nccl-test
      image: nvcr.io/nvidia/pytorch:24.04-py3
      env:
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
        - name: NCCL_IB_HCA
          value: "mlx5_0,mlx5_3,mlx5_5,mlx5_6"
        - name: NCCL_DEBUG
          value: "INFO"
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]  # Required for RDMA memory registration
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "32Gi"
Common Issues
"nvidia_peermem: module not found"
- Cause: NVIDIA driver version too old, or module not built
- Fix: Upgrade to NVIDIA driver ≥ 520, or install the nvidia-peermem package from the CUDA repo
GDRDMA active but bandwidth lower than expected
- Cause: NIC on a different NUMA node (SYS path), or a PCIe Gen4 vs Gen5 limitation
- Fix: Use PIX-local NICs (select them with NCCL_IB_HCA); verify PCIe link speed with lspci -vvv
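The lspci check in the fix above can be scripted. The following is a minimal sketch, assuming the usual lspci -vvv layout in which each device prints a LnkCap: line (what the link supports) and a LnkSta: line (what it negotiated). The function name pcie_link_downgraded is hypothetical and not part of pciutils or any NVIDIA tooling.

```shell
# Hypothetical helper (not part of pciutils): reads `lspci -vvv` output on
# stdin and prints each device whose negotiated link (LnkSta) differs from
# what the device advertises (LnkCap).
pcie_link_downgraded() {
  awk '
    # Device header lines start in column 0, e.g. "17:00.0 Infiniband controller: ..."
    /^[0-9a-fA-F]/ { dev = $1; cap_speed = ""; cap_width = "" }

    # Advertised capability, e.g. "LnkCap: Port #0, Speed 32GT/s, Width x16, ..."
    /LnkCap:/ {
      if (match($0, /Speed [0-9.]+GT\/s/)) cap_speed = substr($0, RSTART + 6, RLENGTH - 6)
      if (match($0, /Width x[0-9]+/))      cap_width = substr($0, RSTART + 6, RLENGTH - 6)
    }

    # Negotiated state, e.g. "LnkSta: Speed 16GT/s (downgraded), Width x16 (ok)"
    /LnkSta:/ {
      sta_speed = ""; sta_width = ""
      if (match($0, /Speed [0-9.]+GT\/s/)) sta_speed = substr($0, RSTART + 6, RLENGTH - 6)
      if (match($0, /Width x[0-9]+/))      sta_width = substr($0, RSTART + 6, RLENGTH - 6)
      if (cap_speed != "" && sta_speed != "" &&
          (sta_speed != cap_speed || sta_width != cap_width))
        printf "%s: running %s %s, capable of %s %s\n",
               dev, sta_speed, sta_width, cap_speed, cap_width
    }
  '
}
```

Usage might look like lspci -vvv | pcie_link_downgraded, run as root so the capability lines are visible. An idle GPU may report a reduced link speed because of power management, so it is worth re-checking under load before concluding that a slot is misconfigured.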
"DMA-BUF is NOT available" in NCCL logs
- Cause: Kernel < 5.12, or NVIDIA driver built without DMA-BUF support
- Fix: Upgrade kernel to ≥ 5.12; rebuild NVIDIA driver with DMA-BUF; or fall back to nvidia-peermem legacy mode
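Since the DMA-BUF message must appear for each GPU, counting the distinct device indices in a node's NCCL log is a quick sanity check. This is a minimal sketch that assumes the log line format quoted earlier in this article (NCCL INFO DMA-BUF is available on GPU device N); the helper names dmabuf_gpu_count and check_dmabuf are hypothetical.

```shell
# Hypothetical helper: count the distinct GPU indices that NCCL reported as
# DMA-BUF capable. Reads an NCCL_DEBUG=INFO log on stdin.
# Assumes the line format: "NCCL INFO DMA-BUF is available on GPU device <N>"
dmabuf_gpu_count() {
  sed -n 's/.*DMA-BUF is available on GPU device \([0-9][0-9]*\).*/\1/p' \
    | sort -u | wc -l | tr -d ' '
}

# Hypothetical wrapper: compare the count against the number of GPUs expected
# on the node. Returns non-zero when some GPUs are missing.
check_dmabuf() {
  expected=$1
  found=$(dmabuf_gpu_count)
  if [ "$found" -eq "$expected" ]; then
    echo "OK: DMA-BUF available on $found/$expected GPUs"
  else
    echo "FAIL: DMA-BUF available on only $found/$expected GPUs"
    return 1
  fi
}
```

For an 8-GPU node this could be run as check_dmabuf 8 < nccl.log, where nccl.log is a placeholder for wherever the job's output was captured. Device indices are per node, so a log that mixes several nodes should be split first.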
GDRDMA works for some GPUs but not others
- Cause: nvidia-peermem not registered for all GPUs, or some GPUs on an unsupported PCIe topology
- Fix: Check /sys/module/nvidia_peermem/; restart driver; verify each GPU shows DMA-BUF available
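To find which channels are affected in a partial failure, the NCCL log can be filtered for NET/IB channel lines that lack the /GDRDMA suffix. The sketch below assumes the channel line format shown earlier in this article; both function names are hypothetical.

```shell
# Hypothetical helper: print NET/IB channel lines that are NOT using GDRDMA.
# Reads an NCCL_DEBUG=INFO log on stdin; prints nothing if all channels are fine.
channels_without_gdrdma() {
  grep 'via NET/IB/' | grep -v '/GDRDMA' || true
}

# Hypothetical helper: summarize GDRDMA coverage. Exits 0 only when at least
# one NET/IB channel exists and every one of them carries the /GDRDMA suffix.
gdrdma_summary() {
  awk '
    /via NET\/IB\// { total++; if ($0 ~ /\/GDRDMA/) gdr++ }
    END {
      printf "%d of %d NET/IB channels use GDRDMA\n", gdr, total
      exit ((total > 0 && gdr == total) ? 0 : 1)
    }
  '
}
```

The rank and GPU index on each offending line (for example 1[1] -> 9[1]) point at the GPUs whose traffic is still staged through host memory.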
Best Practices
- Always set NCCL_NET_GDR_LEVEL=5: let NCCL use GDRDMA regardless of topology distance
- Verify the /GDRDMA suffix in channel logs: confirms the GPU-direct path is active
- Load nvidia-peermem at boot: don't rely on manual modprobe
- Test with ib_write_bw --use_cuda: validates GDRDMA independently of NCCL
- Use PIX-local NICs: best GDRDMA throughput when the NIC shares a PCIe switch with the GPU
- IPC_LOCK capability required: for RDMA memory registration in containers
- Large /dev/shm: NCCL needs shared memory for internal buffers
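For the ib_write_bw test, the rule of thumb in this article is that a GPU-memory run falling significantly below the host-memory run points to a broken GDRDMA path. The sketch below turns that into a pass/fail check. The function name gdr_bw_check is hypothetical, and the 10% default tolerance is an assumption chosen for illustration, not a value from perftest or NVIDIA documentation.

```shell
# Hypothetical helper: compare the ib_write_bw average bandwidth measured with
# GPU memory (--use_cuda) against the host-memory run. The 10% default
# tolerance is an assumed threshold for "significantly less".
gdr_bw_check() {
  gpu_bw=$1           # BW average [Gb/sec] with --use_cuda
  host_bw=$2          # BW average [Gb/sec] without --use_cuda
  tolerance=${3:-10}  # allowed shortfall in percent (assumption)
  awk -v g="$gpu_bw" -v h="$host_bw" -v t="$tolerance" 'BEGIN {
    min_ok = h * (1 - t / 100)
    if (g >= min_ok) {
      printf "OK: GPU-memory BW %.1f Gb/s is within %d%% of host-memory BW %.1f Gb/s\n", g, t, h
      exit 0
    }
    printf "SUSPECT: GPU-memory BW %.1f Gb/s is more than %d%% below host-memory BW %.1f Gb/s\n", g, t, h
    exit 1
  }'
}
```

With the example figures from the perftest section, gdr_bw_check 393.8 378.5 reports OK. A stricter or looser threshold can be passed as the third argument.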
Key Takeaways
- GPUDirect RDMA: NIC reads/writes GPU memory directly (no CPU copies)
- Enable: modprobe nvidia-peermem + NCCL_NET_GDR_LEVEL=5
- Verify: NCCL channel logs must show the /GDRDMA suffix on all NET/IB channels
- DMA-BUF (kernel ≥ 5.12) is the modern interface; nvidia-peermem is the legacy path
- Performance gain: 30-50% more bandwidth, 50% less latency vs CPU staging
- GPU Operator: set driver.rdma.enabled=true for automatic nvidia-peermem
- Container needs: IPC_LOCK capability + RDMA device access + large shared memory
