NVIDIA PeerMem for GPU-Direct RDMA
Install and configure nvidia_peermem kernel module to enable GPU-Direct RDMA between NVIDIA GPUs and Mellanox RDMA NICs. Covers module
Quick Answer:
nvidia_peermem is the kernel module that bridges NVIDIA GPU memory to the Linux RDMA subsystem, enabling GPU-Direct RDMA: the NIC reads and writes GPU VRAM directly, without CPU copies. Without it, NCCL falls back to CPU-staged transfers (2-5x slower for inter-node communication).
The Problem
GPU-Direct RDMA requires a bridge between two kernel subsystems:
- NVIDIA GPU driver manages GPU memory (VRAM)
- RDMA/InfiniBand subsystem manages NIC DMA operations
- Neither knows about the other's memory regions
- nvidia_peermem registers GPU memory with the RDMA stack
- Without it: NIC → CPU RAM → GPU (two copies, high latency)
- With it: NIC → GPU VRAM (one DMA, zero CPU involvement)
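The bridge only exists when all three pieces are loaded: the NVIDIA driver, the RDMA stack, and nvidia_peermem itself. A minimal sketch of a check, assuming a hypothetical helper named missing_gdr_modules that reads lsmod-style output on standard input. The module list is taken from this article and may differ with other NIC drivers.

```shell
# Hypothetical helper: print every required module that is absent from
# lsmod-style input (module name in the first column).
missing_gdr_modules() {
    local required="ib_core mlx5_core mlx5_ib nvidia nvidia_uvm nvidia_peermem"
    local loaded mod
    # Drop the "Module Size Used by" header and keep only the name column.
    loaded=$(awk '$1 != "Module" { print $1 }')
    for mod in $required; do
        # -x matches the whole line, so "nvidia" does not match "nvidia_uvm".
        printf '%s\n' "$loaded" | grep -qxF "$mod" || printf '%s\n' "$mod"
    done
}

# Usage on a real node: lsmod | missing_gdr_modules
```

Empty output means every module in the list is loaded; any name printed is a gap in the bridge.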
The Solution
Data Path Comparison
Without nvidia_peermem (CPU-staged):
  Remote GPU → RDMA NIC → Host RAM → PCIe → Local GPU
  Bandwidth: ~12 GB/s (limited by CPU memory copy)
  Latency: ~15 μs
With nvidia_peermem (GPU-Direct RDMA):
  Remote GPU → RDMA NIC → PCIe → Local GPU
  Bandwidth: ~24-48 GB/s (limited by PCIe/NIC speed)
  Latency: ~3 μs
Load nvidia_peermem
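Because >> appends on every run, repeating the persistence step in this section would duplicate the entry in the modules-load.d file. A minimal sketch of an idempotent variant, assuming a hypothetical helper named persist_module (not part of any NVIDIA or systemd tooling):

```shell
# Hypothetical helper: append a module name to a modules-load.d file only
# if an identical line is not already present.
persist_module() {
    local conf="$1" mod="$2"
    # Create the file if needed so grep does not fail on a missing path.
    touch "$conf"
    grep -qxF "$mod" "$conf" || printf '%s\n' "$mod" >> "$conf"
}

# Usage (as root):
# persist_module /etc/modules-load.d/gpu-rdma.conf nvidia_peermem
```

Running it twice leaves a single nvidia_peermem line, which keeps provisioning scripts safe to re-run.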
# Check if already loaded
lsmod | grep nvidia_peermem
# nvidia_peermem 16384 0
# If not loaded:
modprobe nvidia_peermem
# Verify registration
dmesg | grep -i peermem
# Expected: "nvidia peermem loaded successfully"
# Or: "nvidia peermem registered"
# Make persistent across reboots
echo "nvidia_peermem" >> /etc/modules-load.d/gpu-rdma.conf
Prerequisites (Load Order)
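The load order listed in this section can be wrapped in a loop that stops at the first failure, so a missing RDMA or NVIDIA module is reported before nvidia_peermem is attempted. A sketch under stated assumptions: load_in_order is a hypothetical helper, and the MODPROBE variable is an assumed override hook for dry runs that defaults to the real modprobe.

```shell
# Hypothetical helper: load modules in the order given and stop at the
# first failure. MODPROBE is an assumed override (useful for dry runs);
# when unset, the real modprobe is used.
load_in_order() {
    local mod
    for mod in "$@"; do
        if ! "${MODPROBE:-modprobe}" "$mod"; then
            printf 'failed to load %s\n' "$mod" >&2
            return 1
        fi
        printf 'loaded %s\n' "$mod"
    done
}

# Usage (as root):
# load_in_order ib_core mlx5_core mlx5_ib nvidia nvidia_uvm nvidia_peermem
```

Stopping early matters here because a later module failing with an unknown-symbol error is harder to diagnose than the earlier module that never loaded.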
# nvidia_peermem depends on both:
# 1. NVIDIA driver (nvidia, nvidia_uvm)
# 2. RDMA core (ib_core, mlx5_ib)
# Correct load order:
modprobe ib_core
modprobe mlx5_core
modprobe mlx5_ib
modprobe nvidia
modprobe nvidia_uvm
modprobe nvidia_peermem # Must be LAST
# Check dependencies
modinfo nvidia_peermem
# depends: ib_core, nvidia
OpenShift GPU Operator (Automatic)
# GPU Operator handles nvidia_peermem automatically when:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    # nvidia_peermem loaded as part of driver stack
    rdma:
      enabled: true        # This enables nvidia_peermem
      useHostMofed: true   # Use host-installed MLNX_OFED

# Or with containerized MOFED:
  driver:
    rdma:
      enabled: true
      useHostMofed: false  # GPU Operator manages MOFED too
OpenShift MachineConfig
# If not using GPU Operator for driver management:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-gpu-worker-peermem
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/modules-load.d/gpu-rdma.conf
          mode: 0644
          contents:
            # Ignition takes file contents as a data URL ("inline" is Butane syntax);
            # %0A encodes the newline after each module name.
            source: data:,ib_core%0Amlx5_core%0Amlx5_ib%0Anvidia%0Anvidia_uvm%0Anvidia_peermem%0A
Talos Linux
machine:
  kernel:
    modules:
      - name: ib_core
      - name: mlx5_core
      - name: mlx5_ib
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_peermem
Verify GPU-Direct RDMA is Active
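The NCCL log strings quoted in this section can be checked mechanically instead of by eye. A minimal sketch, assuming a hypothetical helper named gdr_status that reads NCCL_DEBUG=INFO output on standard input. It matches only the strings quoted in this article, which is not an exhaustive list of NCCL messages.

```shell
# Hypothetical helper: classify NCCL_DEBUG=INFO output read from stdin.
# Prints "enabled", "disabled", or "unknown". If the log contains both an
# enabled and a disabled message, "enabled" wins.
gdr_status() {
    local log
    log=$(cat)
    if printf '%s\n' "$log" | grep -qi 'GPU Direct RDMA Enabled'; then
        echo enabled
    elif printf '%s\n' "$log" | grep -qi 'GPU Direct RDMA Disabled'; then
        echo disabled
    else
        echo unknown
    fi
}

# Usage: pipe the output of any NCCL job run with NCCL_DEBUG=INFO
# <your NCCL job> 2>&1 | gdr_status
```

A result of unknown usually means the job was not run with NCCL_DEBUG=INFO or did not use the IB transport at all.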
# Check peermem is registered with RDMA subsystem
cat /sys/module/nvidia_peermem/refcnt
# > 0 means actively in use
# Check NVIDIA driver sees peermem
nvidia-smi -q | grep -i "peer"
# Or check nvidia-persistenced log
# Test with NCCL
export NCCL_DEBUG=INFO
export NCCL_NET_GDR_LEVEL=5
# In NCCL output, look for:
# "GPU Direct RDMA Enabled for ..."
# "NET/IB : GPU Direct RDMA enabled"
# If you see:
# "GPU Direct RDMA Disabled"
# → nvidia_peermem not loaded or not registered
# Test with perftest
# Server:
ib_write_bw --use_cuda=0 -d mlx5_0 -a
# Client:
ib_write_bw --use_cuda=0 -d mlx5_0 -a <server-ip>
# --use_cuda=0 tests GPU 0 RDMA directly
# Should show ~24 GB/s for a 200 Gb/s NIC
NCCL_NET_GDR_LEVEL Values
GDR Level   Path                                           When to Use
────────────────────────────────────────────────────────────────────────────
0 (LOC)     Disabled                                       Debugging only
1 (PIX)     Same PCIe switch                               Conservative
2 (PXB)     Across PCIe switches (multiple hops)           Safe default
3 (PHB)     Same NUMA node                                 Recommended minimum
4 (SYS)     Cross-NUMA (via QPI/UPI)                       Large systems
5           Any path (NCCL treats values above 4 as SYS)   Maximum performance
Recommendation: Use 5 for GPU clusters with proper IOMMU/ACS config
Troubleshooting: peermem Won't Load
# Error: modprobe: FATAL: Module nvidia_peermem not found
# Cause: nvidia driver version doesn't include peermem
# Fix: Update to NVIDIA driver 470+ (peermem included since 470)
# Error: "nvidia_peermem: Unknown symbol ib_register_peer_memory_client"
# Cause: ib_core not loaded or MOFED version mismatch
# Fix: Load ib_core first; ensure MOFED matches kernel version
# Error: nvidia_peermem loads but refcnt stays 0
# Cause: No RDMA traffic using GPU memory yet (normal if idle)
# Fix: Run an NCCL test; refcnt will increase during GPU-Direct transfers
# Error: "GPU Direct RDMA Disabled" in NCCL
# Cause: NCCL_NET_GDR_LEVEL=0 or peermem not registered
# Fix: Set NCCL_NET_GDR_LEVEL=5; check dmesg for peermem registration
Common Issues
peermem loaded but NCCL doesn't use GPU-Direct
- Cause: NCCL_NET_GDR_LEVEL not set or set too low
- Fix: Set NCCL_NET_GDR_LEVEL=5
peermem registration fails after driver update
- Cause: NVIDIA driver and MOFED version incompatibility
- Fix: Rebuild peermem against current kernel; or update both together
Performance same with and without peermem
- Cause: GPU and NIC on different NUMA nodes; data crosses QPI anyway
- Fix: Check nvidia-smi topo -m; schedule on nodes with GPU-NIC NUMA locality
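For the NUMA locality issue above, the topology matrix can be reduced to just the GPU-to-NIC entries. A sketch under stated assumptions: gpu_nic_links is a hypothetical helper, and it assumes NIC columns are named NIC<n> or mlx5_<n>. The matrix layout differs between driver versions, so treat this as a starting point only.

```shell
# Hypothetical helper: read a connectivity matrix in the style of
# "nvidia-smi topo -m" and print each GPU/NIC pair with its link type.
gpu_nic_links() {
    awk '
        # Header row: remember which columns are NICs. Data rows carry a
        # row label in field 1, so header column i maps to data field i + 1.
        !seen_header && /GPU0/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^(NIC[0-9]+|mlx5_[0-9]+)$/) nic[i + 1] = $i
            seen_header = 1
            next
        }
        # GPU rows: print the link type found in every NIC column.
        seen_header && $1 ~ /^GPU[0-9]+$/ {
            for (i = 2; i <= NF; i++)
                if (i in nic) print $1, nic[i], $i
        }
    '
}

# Usage on a real node: nvidia-smi topo -m | gpu_nic_links
```

In the nvidia-smi legend, SYS means the path crosses the interconnect between NUMA nodes, so a GPU/NIC pair reported as SYS is a likely candidate for the flat performance described above, while PIX or PXB indicates a shorter PCIe path.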
Best Practices
- Use GPU Operator rdma.enabled: true (manages peermem automatically)
- Set NCCL_NET_GDR_LEVEL=5 (enables GPU-Direct on all paths)
- Verify with dmesg | grep peermem after every node boot
- Load order matters: ib_core → mlx5 → nvidia → peermem
- Test with ib_write_bw --use_cuda (validates the GPU memory RDMA path)
- Match MOFED + NVIDIA driver versions (incompatibility = silent failure)
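For the ib_write_bw check in the list above, it helps to know the ceiling a given NIC can reach. A minimal arithmetic sketch, assuming a hypothetical helper named bw_efficiency that takes the line rate in Gb/s and a measured result already converted to GB/s. It divides by 8 bits per byte and ignores protocol overhead.

```shell
# Hypothetical helper: convert a NIC line rate in Gb/s to GB/s (8 bits per
# byte, protocol overhead ignored) and report a measured bandwidth as a
# percentage of that ceiling.
bw_efficiency() {
    local line_rate_gbit="$1" measured_gbyte="$2"
    awk -v r="$line_rate_gbit" -v m="$measured_gbyte" 'BEGIN {
        ceiling = r / 8
        printf "ceiling=%.1f GB/s efficiency=%.0f%%\n", ceiling, 100 * m / ceiling
    }'
}

# Example: a 200 Gb/s NIC measured at 24 GB/s
# bw_efficiency 200 24
# prints: ceiling=25.0 GB/s efficiency=96%
```

A 200 Gb/s link tops out at 25 GB/s, which is why roughly 24 GB/s counts as a healthy GPU-Direct result, while a figure near the CPU-staged ~12 GB/s suggests the direct path is not in use.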
Key Takeaways
- nvidia_peermem bridges NVIDIA GPU memory to the RDMA subsystem
- Without it: NIC → CPU → GPU (2 copies). With it: NIC → GPU (zero-copy)
- 2-4x bandwidth improvement for inter-node GPU communication
- GPU Operator loads it automatically with rdma.enabled: true
- Must load AFTER both ib_core and nvidia modules
- NCCL_NET_GDR_LEVEL=5 tells NCCL to use GPU-Direct RDMA on all paths
- Verify: dmesg | grep peermem + NCCL_DEBUG=INFO shows "GPU Direct RDMA Enabled"
- Included in NVIDIA driver 470+; no separate package needed

