πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
Networking advanced ⏱ 15 minutes K8s 1.28+

GPUDirect RDMA via DMA-BUF

Configure GPUDirect RDMA using DMA-BUF kernel subsystem for zero-copy GPU-to-GPU transfers over InfiniBand and RoCE networks.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: With kernel β‰₯ 6.x and open GPU kernel modules, GPUDirect RDMA uses DMA-BUF instead of nvidia-peermem. Set NCCL_NET_GDR_LEVEL=5 and ensure gdrcopy is enabled in ClusterPolicy. Data flows directly from GPU memory to NIC via DMA, bypassing CPU entirely.

The Problem

Multi-node GPU training requires gradient synchronization across nodes. Without GPUDirect RDMA, data follows: GPU β†’ CPU memory β†’ kernel β†’ NIC β†’ network β†’ NIC β†’ kernel β†’ CPU β†’ GPU. This path adds latency and consumes CPU. Legacy nvidia-peermem was a fragile out-of-tree module that broke on kernel updates.

The Solution

DMA-BUF Data Path

# Legacy path (without GPUDirect RDMA):
# GPU β†’ PCIe β†’ CPU Memory β†’ Kernel TCP/IP β†’ NIC β†’ Network
# Latency: ~50ΞΌs, CPU overhead: high

# GPUDirect RDMA via nvidia-peermem (legacy):
# GPU β†’ PCIe β†’ NIC β†’ Network (direct, kernel module)
# Latency: ~2ΞΌs, but nvidia-peermem breaks on kernel updates

# GPUDirect RDMA via DMA-BUF (current):
# GPU β†’ PCIe β†’ NIC β†’ Network (direct, in-tree kernel subsystem)
# Latency: ~2ΞΌs, stable across kernel updates

Enable in ClusterPolicy

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    useOpenKernelModules: true
  gdrcopy:
    enabled: true        # GPUDirect RDMA Copy library
  dcgm:
    enabled: true
  devicePlugin:
    enabled: true

NCCL Configuration for GPUDirect RDMA

env:
  # Enable GPUDirect RDMA
  - name: NCCL_NET_GDR_LEVEL
    value: "5"          # LOC=0, SYS=3, PHB=4, PIX=5 (most aggressive)
  - name: NCCL_IB_DISABLE
    value: "0"
  - name: NCCL_IB_HCA
    value: "mlx5_0,mlx5_1"
  # Verify GDR is active in debug logs
  - name: NCCL_DEBUG
    value: "INFO"
  - name: NCCL_DEBUG_SUBSYS
    value: "NET"

Verify GPUDirect RDMA

# Check DMA-BUF support
cat /proc/modules | grep dma_buf

# Verify nvidia-peermem is NOT loaded (replaced by DMA-BUF)
lsmod | grep nvidia_peermem
# Should be empty

# Check GDR devices
ls /dev/nvidia-uvm*
nvidia-smi topo -m
# Look for "NV#" (NVLink) and "SYS" with PIX/PHB connections

# Run GDR bandwidth test
gdrcopy_copybw
# Expected: 10-12 GB/s for GPU↔NIC path

# NCCL test with GDR
all_reduce_perf -b 8 -e 2G -f 2 -g 8
# Look for "Using GPUDirect RDMA" in output
graph TD
    A[GPU 0 Memory] -->|PCIe Direct| B[ConnectX-7 NIC]
    B -->|InfiniBand RDMA| C[Network Switch]
    C -->|InfiniBand RDMA| D[Remote ConnectX-7 NIC]
    D -->|PCIe Direct| E[GPU 0 Memory on Remote Node]
    
    F[DMA-BUF Kernel Subsystem] -->|Manages| A
    F -->|Manages| B
    
    G[No CPU Copy] --> H[Zero-copy transfer]
    G --> I[Sub-2ΞΌs latency]
    G --> J[No CPU overhead]

Common Issues

  • NCCL falls back to non-GDR path β€” check NCCL_NET_GDR_LEVEL=5; verify GPU and NIC are on same PCIe switch (check nvidia-smi topo -m)
  • nvidia-peermem still loading β€” remove old MachineConfig that loads nvidia-peermem; open modules use DMA-BUF automatically
  • GDR performance worse than expected β€” GPU and NIC must be on same NUMA node; check numactl -H and nvidia-smi topo -m
  • gdrcopy test fails β€” ensure gdrcopy is enabled in ClusterPolicy and kernel β‰₯ 6.x

Best Practices

  • Use open kernel modules + DMA-BUF β€” eliminates nvidia-peermem upgrade fragility
  • Verify GPU-NIC affinity with nvidia-smi topo -m β€” best performance when on same PCIe switch
  • Set NCCL_NET_GDR_LEVEL=5 (PIX) for most aggressive GDR usage
  • Run all_reduce_perf to verify GDR is active before production training
  • Enable gdrcopy in ClusterPolicy for optimized GPU memory copy operations

Key Takeaways

  • GPUDirect RDMA enables zero-copy GPU-to-NIC data transfer for NCCL
  • DMA-BUF (kernel β‰₯ 6.x) replaces nvidia-peermem with stable in-tree subsystem
  • Reduces inter-node communication latency to ~2ΞΌs with zero CPU overhead
  • Critical for multi-node distributed training performance
  • GPU-NIC PCIe affinity directly impacts GDR bandwidth
#gpudirect #rdma #dma-buf #infiniband #nccl #gpu
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens