Troubleshooting Β· Advanced Β· ⏱ 30 minutes Β· K8s 1.28+

Fix NVIDIA Peer Memory Driver Not Detected

Diagnose and resolve the 'NVIDIA peer memory driver not detected' error when running GPU workloads with RDMA on Kubernetes and OpenShift.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: The nvidia-peermem module fails to load when it was compiled against the wrong RDMA stack. Reinstall the NVIDIA driver after MLNX_OFED, or on OpenShift force a rebuild by deleting the driver DaemonSet pods.

GPU workloads using NCCL or MPI may log NVIDIA peer memory driver not detected or GPU Direct RDMA Disabled when the nvidia-peermem kernel module cannot load.

Symptoms

Common error messages include:

modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
dmesg: Unknown symbol ib_register_peer_memory_client
NVIDIA peer memory driver not detected
GPU Direct RDMA Disabled

Root Cause

The nvidia-peermem module was compiled without (or against the wrong version of) the RDMA peer-memory symbols that MLNX_OFED's ib_core exports. This typically happens when the GPU driver is installed before MOFED: the module and the running RDMA stack then disagree at the ABI level, and the insert fails with the unknown-symbol error above.
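A quick way to confirm this is to check whether the running kernel actually exports the peer-memory hook that nvidia-peermem links against. A minimal sketch; the symbol name comes from the dmesg error above, and the kallsyms path is parameterized only so the check can be exercised against a captured copy:

```shell
#!/usr/bin/env bash
# Sketch: does the running RDMA stack export the peer-memory hook that
# nvidia-peermem links against? Note the symbol only appears once ib_core
# (or MOFED's replacement for it) is actually loaded.
has_peer_memory_symbol() {
  local kallsyms="${1:-/proc/kallsyms}"   # parameterized for testing
  grep -q 'ib_register_peer_memory_client' "$kallsyms"
}

if has_peer_memory_symbol; then
  echo "RDMA stack exports the peer-memory interface"
else
  echo "symbol missing: nvidia-peermem cannot load against this stack"
fi
```

If the symbol is missing even though MLNX_OFED is installed, the wrong ib_core is loaded and a reboot (or module reload) is needed before the driver reinstall below will help.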

Diagnose

Check module status and kernel messages:

sudo modprobe nvidia-peermem
sudo dmesg | tail
lsmod | grep peermem

If dmesg shows Unknown symbol ib_register_peer_memory_client, the RDMA stack and driver are mismatched.
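The triage above can be wrapped in a small helper that classifies the dmesg tail automatically. A sketch; the match strings are simply the errors listed under Symptoms:

```shell
#!/usr/bin/env bash
# Sketch: map the dmesg tail after a failed modprobe to a likely cause.
classify_peermem_error() {
  case "$1" in
    *'Unknown symbol ib_register_peer_memory_client'*)
      echo 'abi-mismatch: reinstall the NVIDIA driver after MLNX_OFED' ;;
    *'Invalid argument'*)
      echo 'load-failure: check dmesg for the underlying missing symbol' ;;
    *)
      echo 'unknown: inspect dmesg manually' ;;
  esac
}

# Classify whatever the kernel logged last (may need root to read dmesg).
classify_peermem_error "$(dmesg 2>/dev/null | tail)"
```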

Fix on Bare Metal Kubernetes

Reinstall the NVIDIA driver after MLNX_OFED:

# Verify MLNX_OFED is present
ofed_info -s

# Uninstall GPU driver
sudo systemctl stop nvidia-persistenced
sudo apt purge -y nvidia-driver-<version>
sudo reboot

# Reinstall GPU driver (now compiles against MOFED symbols)
sudo apt install nvidia-driver-<version>
sudo reboot

# Verify
sudo modprobe nvidia-peermem
lsmod | grep peermem

Fix on OpenShift

On OpenShift, do not manually install drivers. Force the GPU Operator to rebuild:

# Delete driver pods to trigger rebuild
oc delete pod -n gpu-operator -l app=nvidia-driver-daemonset

# Verify module loads in driver pod
oc logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr

The GPU Operator rebuilds nvidia-peermem.ko against the host kernel and MOFED symbols.
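To confirm the rebuilt module actually loaded on every node, capture lsmod from inside each driver pod (e.g. oc exec -n gpu-operator <driver-pod> -- lsmod) and check the output with a small helper. A sketch, assuming the default GPU Operator labels used in the commands above:

```shell
#!/usr/bin/env bash
# Sketch: check captured lsmod output (e.g. piped from
#   oc exec -n gpu-operator <driver-pod> -- lsmod
# ) for the nvidia_peermem module.
peermem_loaded() {
  grep -q '^nvidia_peermem' <<<"$1" && echo loaded || echo missing
}
```

Run it once per driver pod; "missing" on any node means that node's rebuild did not pick up the MOFED symbols and its dmesg should be inspected as in the Diagnose section.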

Validate

Run an NCCL test with debug logging (all_reduce_perf is the all-reduce benchmark from the nccl-tests suite):

NCCL_DEBUG=INFO ./all_reduce_perf -b 8 -e 128M -f 2 -g 2

Look for NET/IB: GPU Direct RDMA enabled in the output.
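Scanning the log by hand is error-prone on large jobs, so a helper can assert on it instead. A sketch that matches the NET/IB line case-insensitively, since NCCL's exact capitalization has varied across versions:

```shell
#!/usr/bin/env bash
# Sketch: succeed only if an NCCL debug log reports GPU Direct RDMA.
gdr_enabled() {
  grep -qi 'GPU Direct RDMA enabled' <<<"$1"
}
```

Usage: tee the benchmark output to a file, then gate your job script on gdr_enabled "$(cat nccl.log)" so a silent fallback to CPU-staged copies fails fast.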

Why This Matters

Without nvidia-peermem, GPU Direct RDMA is disabled and all GPU-to-GPU communication over the network falls back to CPU-staged copies, severely degrading multi-node training performance.

#nvidia #gpu #rdma #peermem #troubleshooting #openshift
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
