GPUDirect RDMA via DMA-BUF
Configure GPUDirect RDMA using the DMA-BUF kernel subsystem for zero-copy GPU-to-GPU transfers over InfiniBand and RoCE networks.
💡 Quick Answer: With kernel ≥ 6.x and the open GPU kernel modules, GPUDirect RDMA uses DMA-BUF instead of nvidia-peermem. Set `NCCL_NET_GDR_LEVEL=5` and ensure `gdrcopy` is enabled in the ClusterPolicy. Data flows directly from GPU memory to the NIC via DMA, bypassing the CPU entirely.
The Problem
Multi-node GPU training requires gradient synchronization across nodes. Without GPUDirect RDMA, data follows: GPU → CPU memory → kernel → NIC → network → NIC → kernel → CPU → GPU. This path adds latency and consumes CPU cycles. The legacy nvidia-peermem driver was a fragile out-of-tree module that broke on kernel updates.
The Solution
DMA-BUF Data Path
```shell
# Legacy path (without GPUDirect RDMA):
# GPU → PCIe → CPU Memory → Kernel TCP/IP → NIC → Network
# Latency: ~50µs, CPU overhead: high

# GPUDirect RDMA via nvidia-peermem (legacy):
# GPU → PCIe → NIC → Network (direct, out-of-tree kernel module)
# Latency: ~2µs, but nvidia-peermem breaks on kernel updates

# GPUDirect RDMA via DMA-BUF (current):
# GPU → PCIe → NIC → Network (direct, in-tree kernel subsystem)
# Latency: ~2µs, stable across kernel updates
```

Enable in ClusterPolicy
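Before enabling the DMA-BUF path, it is worth confirming the node is actually running the open kernel modules. A minimal sketch: the open-flavor nvidia module reports a "Dual MIT/GPL" license while the proprietary one reports "NVIDIA" (license strings assumed from common driver packaging; `check_open_modules` is a hypothetical helper):

```shell
# Decide which driver flavor is loaded from the module license string.
# On a real node, feed it the output of: modinfo -F license nvidia
check_open_modules() {
  case "$1" in
    *MIT*|*GPL*) echo "open kernel modules: DMA-BUF path available" ;;
    *)           echo "proprietary modules: DMA-BUF path unavailable" ;;
  esac
}

# Real usage:
#   check_open_modules "$(modinfo -F license nvidia)"
check_open_modules "Dual MIT/GPL"
```

On a matching node this prints the "open kernel modules" line; the proprietary driver would need nvidia-peermem instead of DMA-BUF.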
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    useOpenKernelModules: true
  gdrcopy:
    enabled: true  # GPUDirect RDMA Copy library
  dcgm:
    enabled: true
  devicePlugin:
    enabled: true
```

NCCL Configuration for GPUDirect RDMA
```yaml
env:
  # Enable GPUDirect RDMA
  - name: NCCL_NET_GDR_LEVEL
    value: "5"  # legacy integers: 0=LOC, 1=PIX, 2=PXB, 3=PHB, 4=SYS; 5 permits GDR at any distance (most aggressive)
  - name: NCCL_IB_DISABLE
    value: "0"
  - name: NCCL_IB_HCA
    value: "mlx5_0,mlx5_1"
  # Verify GDR is active in debug logs
  - name: NCCL_DEBUG
    value: "INFO"
  - name: NCCL_DEBUG_SUBSYS
    value: "NET"
```

Verify GPUDirect RDMA
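With `NCCL_DEBUG=INFO` and `NCCL_DEBUG_SUBSYS=NET`, GDR usage shows up on the transport lines of the rank logs. A hedged sketch of the check (the sample log lines below are illustrative; the exact format varies by NCCL version, and `gdr_active` is a hypothetical helper):

```shell
# NCCL appends "/GDRDMA" to the transport description when GPUDirect RDMA
# is actually in use; its absence means NCCL fell back to bounce buffers.
gdr_active() {
  grep -q 'GDRDMA' && echo "GDR active" \
    || echo "GDR NOT active (check topology and NCCL_NET_GDR_LEVEL)"
}

# Illustrative log lines (format assumed from typical NCCL INFO output):
printf '%s\n' 'node0:12:34 [0] NCCL INFO Channel 00 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA' | gdr_active
printf '%s\n' 'node0:12:34 [0] NCCL INFO Channel 00 : 0[0] -> 8[0] [send] via NET/IB/0' | gdr_active
```

In practice you would pipe the training job's logs into the helper rather than sample strings.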
```shell
# Check DMA-BUF support (built into the kernel, not a loadable module)
grep CONFIG_DMA_SHARED_BUFFER /boot/config-$(uname -r)
# Expected: CONFIG_DMA_SHARED_BUFFER=y

# Verify nvidia-peermem is NOT loaded (replaced by DMA-BUF)
lsmod | grep nvidia_peermem
# Should be empty

# Check the gdrcopy character device
ls /dev/gdrdrv
nvidia-smi topo -m
# GPU-NIC pairs showing PIX (same PCIe switch) or PXB give the best GDR path

# Run GDR bandwidth test
gdrcopy_copybw
# Expected: roughly 10-12 GB/s on PCIe Gen4

# NCCL test with GDR
all_reduce_perf -b 8 -e 2G -f 2 -g 8
# Look for "GDRDMA" in the NCCL INFO output
```

```mermaid
graph TD
    A[GPU 0 Memory] -->|PCIe Direct| B[ConnectX-7 NIC]
    B -->|InfiniBand RDMA| C[Network Switch]
    C -->|InfiniBand RDMA| D[Remote ConnectX-7 NIC]
    D -->|PCIe Direct| E[GPU 0 Memory on Remote Node]
    F[DMA-BUF Kernel Subsystem] -->|Manages| A
    F -->|Manages| B
    G[No CPU Copy] --> H[Zero-copy transfer]
    G --> I[Sub-2µs latency]
    G --> J[No CPU overhead]
```

Common Issues
- NCCL falls back to non-GDR path → check `NCCL_NET_GDR_LEVEL=5`; verify GPU and NIC are on the same PCIe switch (check `nvidia-smi topo -m`)
- nvidia-peermem still loading → remove the old MachineConfig that loads nvidia-peermem; the open modules use DMA-BUF automatically
- GDR performance worse than expected → GPU and NIC must be on the same NUMA node; check `numactl -H` and `nvidia-smi topo -m`
- gdrcopy test fails → ensure `gdrcopy` is enabled in ClusterPolicy and kernel ≥ 6.x
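The NUMA-affinity issue above can be checked from sysfs. A minimal sketch (the sysfs paths are standard Linux locations, but `same_numa` and the example device names are hypothetical; substitute your NIC name and GPU PCI address):

```shell
# Compare the NUMA node of a GPU and a NIC; cross-NUMA placement forces
# GDR traffic across the inter-socket link and cuts bandwidth.
same_numa() {  # usage: same_numa <gpu_numa_node> <nic_numa_node>
  [ "$1" = "$2" ] && echo "same NUMA node" \
    || echo "cross-NUMA: expect reduced GDR bandwidth"
}

# On a real node the inputs would come from sysfs, e.g.:
#   cat /sys/class/infiniband/mlx5_0/device/numa_node
#   cat /sys/bus/pci/devices/0000:17:00.0/numa_node   # GPU PCI address
same_numa 0 0
same_numa 0 1
```

The first call reports a good pairing; the second flags the mismatch described in the troubleshooting bullet.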
Best Practices
- Use open kernel modules + DMA-BUF → eliminates nvidia-peermem upgrade fragility
- Verify GPU-NIC affinity with `nvidia-smi topo -m` → best performance when on the same PCIe switch
- Set `NCCL_NET_GDR_LEVEL=5` for the most aggressive GDR usage (allows GDR regardless of PCIe path distance)
- Run `all_reduce_perf` to verify GDR is active before production training
- Enable `gdrcopy` in ClusterPolicy for optimized GPU memory copy operations
Key Takeaways
- GPUDirect RDMA enables zero-copy GPU-to-NIC data transfer for NCCL
- DMA-BUF (kernel ≥ 6.x) replaces nvidia-peermem with a stable in-tree subsystem
- Reduces inter-node communication latency to ~2µs with zero CPU overhead
- Critical for multi-node distributed training performance
- GPU-NIC PCIe affinity directly impacts GDR bandwidth
