AI β€’ Advanced β€’ ⏱ 45 minutes β€’ K8s 1.28+

NVIDIA Open GPU + GPUDirect RDMA + DOCA-OFED + SR-IOV Stack

Deploy NVIDIA AI networking on Kubernetes: Open GPU driver with DMA-BUF, GPUDirect RDMA, DOCA-OFED, and SR-IOV VF isolation.

By Luca Berton β€’ πŸ“– 15 min read

πŸ’‘ Quick Answer: The full stack is: (1) GPU Operator with useOpenKernelModules: true for the open-source GPU kernel driver exposing DMA-BUF, (2) DOCA-OFED (Network Operator) with nfsrdma module for RDMA transport + NFSoRDMA, (3) nvidia-peermem module bridging GPU memory ↔ RDMA NIC for GPUDirect RDMA, and (4) SR-IOV VFs for per-pod dedicated NIC bandwidth. This enables zero-copy data paths from storage β†’ NIC β†’ GPU with no CPU involvement.

The Problem

AI training at scale requires moving massive datasets (hundreds of GB) from distributed storage into GPU memory across a cluster. The default path β€” storage β†’ NIC β†’ CPU β†’ system RAM β†’ PCIe β†’ GPU β€” creates three bottleneck hops. Each hop adds latency and consumes CPU cycles that should be running training code. At 8-GPU-per-node scale with 400Gbps networking, the CPU becomes the bottleneck, capping effective throughput at 40-60% of wire speed.

You need the full zero-copy data path: storage β†’ RDMA NIC β†’ PCIe β†’ GPU memory, bypassing the CPU entirely.

The Architecture

graph TB
    subgraph "The Full Stack"
        A[Open GPU Kernel Driver<br>DMA-BUF export of GPU memory] 
        B[nvidia-peermem<br>GPUDirect RDMA bridge]
        C[DOCA-OFED Driver<br>RDMA + NFSoRDMA modules]
        D[SR-IOV Virtual Functions<br>Per-pod NIC isolation]
    end
    
    subgraph "Data Path"
        E[NFS/Lustre Storage] -->|RDMA transport| F[ConnectX NIC]
        F -->|DMA via peermem| G[GPU HBM Memory]
    end
    
    A --> B
    C --> B
    C --> D
    
    style A fill:#76b900,color:#000
    style B fill:#ff6f00,color:#000
    style C fill:#0077b6,color:#fff
    style D fill:#7b2cbf,color:#fff

How the Layers Connect

Layer     | Component                    | Role
GPU       | Open Kernel Driver + DMA-BUF | Exposes GPU memory pages to the Linux DMA-BUF subsystem so other devices (NICs) can address them
Bridge    | nvidia-peermem               | Registers GPU memory with the InfiniBand/RDMA core so NICs can DMA directly into GPU HBM
Network   | DOCA-OFED + nfsrdma          | Provides the RDMA verbs stack plus the NFSoRDMA kernel modules for RDMA-based NFS mounts
Isolation | SR-IOV VFs                   | Gives each pod its own hardware NIC slice with dedicated queues, bypassing the kernel network stack

Without this stack: Storage β†’ NIC β†’ CPU memcpy β†’ System RAM β†’ CPU memcpy β†’ GPU (3-5ms per transfer, CPU at 100%)

With this stack: Storage β†’ NIC β†’ GPU (0.5ms per transfer, CPU at ~5%)
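Before working through the install steps, it helps to know what a correctly assembled node looks like. Here is a minimal host-side checklist (a sketch β€” node and interface names follow this article's examples) mapping one command to each layer in the table above:

# Run from a debug shell on a GPU node: oc debug node/<node> -- chroot /host bash

# GPU layer: open kernel driver loaded (license shows "Dual MIT/GPL")
modinfo nvidia | grep ^license

# Bridge layer: nvidia-peermem registered
lsmod | grep nvidia_peermem

# Network layer: RDMA core + NFSoRDMA transport modules
lsmod | grep -E "^mlx5_ib|^ib_core|^xprtrdma"

# Isolation layer: SR-IOV VFs configured on the ConnectX PF (ens8f0 in these examples)
cat /sys/class/net/ens8f0/device/sriov_numvfs

If any of these come back empty, the corresponding step below is the one to revisit.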

The Solution

Prerequisites

  • NVIDIA GPUs: A100, H100, H200, or later (PCIe or SXM)
  • NVIDIA ConnectX-6 Dx or later NICs (ConnectX-7 recommended)
  • Firmware: ConnectX NIC firmware with SRIOV and RDMA enabled
  • Kubernetes: 1.28+ with GPU Operator and Network Operator
  • OpenShift: 4.14+ (if using OpenShift)
  • Storage: NFS server with RDMA support (e.g., NetApp ONTAP, DDN EXA5) or Lustre with LNet RDMA

Step 1: Deploy GPU Operator with Open Kernel Modules

The open-source GPU kernel driver (nvidia-open) is required for DMA-BUF support. The proprietary driver does NOT export DMA-BUF interfaces.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: crio    # containerd for non-OpenShift
  driver:
    enabled: true
    # === KEY: Use open kernel modules ===
    useOpenKernelModules: true
    version: "560.35.05"
    # Driver image name (registry path comes from repository below);
    # the operator installs the open kernel modules when useOpenKernelModules is true
    image: driver
    repository: nvcr.io/nvidia
    licensingConfig:
      nlsEnabled: false
    rdma:
      enabled: true          # Load nvidia-peermem module
      useHostMofed: false     # MOFED managed by Network Operator
  # GPUDirect RDMA peer memory module
  gdrcopy:
    enabled: true
  # DCGM for monitoring
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
    config:
      name: dcgm-exporter-config
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
    config:
      name: device-plugin-config
  # GDS for GPUDirect Storage (NVMe β†’ GPU)
  gds:
    enabled: true
  # Node Feature Discovery
  nfd:
    enabled: true
  # MIG manager (for A100/H100 MIG partitioning)
  migManager:
    enabled: true

Why open kernel modules?

The open-source NVIDIA kernel driver (nvidia-open) implements the Linux DMA-BUF standard interface (dma_buf_ops). This allows:

  • Other kernel subsystems to import GPU memory as DMA-BUF file descriptors
  • The RDMA stack to register GPU pages for peer-to-peer DMA via nvidia-peermem
  • Future kernel features (for example, P2PDMA) to interoperate with GPU memory
# Verify open driver is loaded (not proprietary)
oc debug node/gpu-worker-1 -- chroot /host bash -c '
  modinfo nvidia | grep -E "^filename|^description|^license"
'
# filename:       /lib/modules/.../nvidia.ko
# description:    NVIDIA Unified Memory
# license:        Dual MIT/GPL    ← Open driver (proprietary shows "NVIDIA")

# Verify DMA-BUF support
oc debug node/gpu-worker-1 -- chroot /host bash -c '
  ls /sys/kernel/dmabuf/buffers/ 2>/dev/null && echo "DMA-BUF active" || echo "No DMA-BUF"
'
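Beyond the driver itself, the running kernel's RDMA core must also support registering DMA-BUF-backed memory regions. A quick, hedged check (the ib_umem_dmabuf_get symbol comes from upstream kernels 5.12+; whether it appears depends on your kernel build):

oc debug node/gpu-worker-1 -- chroot /host bash -c '
  grep -q ib_umem_dmabuf_get /proc/kallsyms \
    && echo "RDMA DMA-BUF MR support present" \
    || echo "ib_core lacks DMA-BUF MR support in this kernel"
'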

Step 2: Deploy Network Operator with DOCA-OFED + NFSoRDMA

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    # === DOCA-OFED driver (includes the NFSoRDMA modules) ===
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: "24.10-0.7.0.0-0"
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    env:
      # === Enable the NFSoRDMA kernel modules (xprtrdma/svcrdma) ===
      - name: NFSRDMA_ENABLE
        value: "true"
      - name: RESTORE_DRIVER_ON_POD_TERMINATION
        value: "true"
      # Create persistent interface names via udev rules
      - name: CREATE_IFNAMES_UDEV
        value: "true"
    # Kernel modules loaded by DOCA-OFED: mlx5_core, mlx5_ib, ib_core, rdma_cm, ib_uverbs
    # With NFSRDMA_ENABLE: also xprtrdma, svcrdma (NFSoRDMA transport)
    # nvidia-peermem is loaded by the GPU Operator when driver.rdma.enabled=true

  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: "v1.5.1"
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"],
              "deviceIDs": ["101d", "101e", "a2dc"],
              "ifNames": ["ens8f0", "ens8f1"]
            }
          }
        ]
      }

  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: "v3.7.0"
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "sriov_rdma_vf",
            "selectors": {
              "vendors": ["15b3"],
              "devices": ["101e"],
              "drivers": ["mlx5_core"],
              "isRdma": true
            }
          }
        ]
      }

Verify DOCA-OFED and NFSoRDMA:

# Check OFED driver pods are running
oc get pods -n nvidia-network-operator -l app=mofed

# Verify kernel modules on a GPU node
oc debug node/gpu-worker-1 -- chroot /host bash -c '
echo "=== RDMA Core ==="
lsmod | grep -E "^mlx5_core|^mlx5_ib|^ib_core|^rdma_cm"

echo ""
echo "=== NFSoRDMA ==="
lsmod | grep -E "^xprtrdma|^svcrdma|^rpcrdma"

echo ""
echo "=== GPUDirect RDMA (peermem) ==="
lsmod | grep nvidia_peermem

echo ""
echo "=== DMA-BUF (built into the kernel, not a loadable module) ==="
ls /sys/kernel/dmabuf/buffers >/dev/null 2>&1 && echo "DMA-BUF sysfs stats present" || echo "DMA-BUF sysfs stats not enabled"
'

# Expected output:
# === RDMA Core ===
# mlx5_core           2097152  1 mlx5_ib
# mlx5_ib              409600  0
# ib_core              524288  7 mlx5_ib,ib_uverbs,rdma_cm,...
# rdma_cm              131072  1 ...
#
# === NFSoRDMA ===
# xprtrdma              90112  0     ← NFS client RDMA transport
# svcrdma               77824  0     ← NFS server RDMA transport (if serving)
# rpcrdma_core          45056  2 xprtrdma,svcrdma
#
# === GPUDirect RDMA (peermem) ===
# nvidia_peermem         16384  0     ← GPU↔NIC bridge
#
# === DMA-BUF (built into the kernel, not a loadable module) ===
# DMA-BUF sysfs stats present         ← requires CONFIG_DMABUF_SYSFS_STATS

Step 3: Configure SR-IOV Virtual Functions

Create VFs on the ConnectX NICs so each pod gets its own hardware-isolated NIC slice with dedicated RDMA resources:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: gpu-rdma-vfs
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
    feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: sriov_rdma_vf
  numVfs: 8                     # 8 VFs = 1 per GPU (for 8-GPU nodes)
  nicSelector:
    vendor: "15b3"              # Mellanox/NVIDIA
    deviceID: "101d"            # ConnectX-6 Dx (use 101e for CX-7)
    pfNames: ["ens8f0"]         # Physical function name
  deviceType: netdevice          # Use netdevice for RDMA (not vfio-pci)
  isRdma: true                  # Enable RDMA on VFs
  linkType: ETH                 # Ethernet (use IB for InfiniBand)
  mtu: 9000                     # Jumbo frames for RDMA performance
  # === Important: Access mode for RDMA NICs ===
  # Switch ports must be in ACCESS mode (untagged)
  # NFSoRDMA does NOT support VLAN tagging

Why deviceType: netdevice and not vfio-pci?

For GPUDirect RDMA, the VF must use the kernel mlx5_core driver (netdevice mode) so that:

  • The ib_core RDMA subsystem can register the device
  • nvidia-peermem can map GPU pages to the NIC’s RDMA context
  • NFSoRDMA can use the RDMA transport layer

vfio-pci bypasses the kernel entirely (for DPDK/userspace drivers) β€” no RDMA verbs, no peermem, no NFSoRDMA.
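A quick way to confirm the VFs really are bound to the kernel mlx5_core driver rather than vfio-pci β€” a small sketch, assuming the PF name ens8f0 from the policy above:

oc debug node/gpu-worker-1 -- chroot /host bash -c '
for vf in /sys/class/net/ens8f0/device/virtfn*; do
  echo "$(basename $vf) -> $(basename $(readlink $vf/driver))"
done
'
# virtfn0 -> mlx5_core
# virtfn1 -> mlx5_core
# ...
# Any VF showing vfio-pci here will not work for GPUDirect RDMA or NFSoRDMA.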

# Verify VFs are created
oc debug node/gpu-worker-1 -- chroot /host bash -c '
echo "=== SR-IOV VFs ==="
ip link show ens8f0
# Should show: vf 0, vf 1, ... vf 7

echo ""
echo "=== RDMA devices ==="
rdma link show
# Should show mlx5_0, mlx5_1, ... for each VF with RDMA capability

echo ""
echo "=== Allocatable resources ==="
'
oc get node gpu-worker-1 -o json | jq '.status.allocatable | with_entries(select(.key | contains("sriov")))'
# "nvidia.com/sriov_rdma_vf": "8"

Step 4: Create SR-IOV Network Attachment

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: gpu-rdma-net
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriov_rdma_vf
  networkNamespace: ai-training
  ipam: |
    {
      "type": "host-local",
      "subnet": "192.168.100.0/24",
      "rangeStart": "192.168.100.10",
      "rangeEnd": "192.168.100.200"
    }
  # Configure for RDMA
  capabilities: '{ "rdma": true }'
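The SR-IOV operator renders this SriovNetwork into a NetworkAttachmentDefinition in the target namespace; checking that it exists before referencing it from pods avoids the common "network not found" error. A minimal verification (resource names follow the examples above; the resourceName annotation key is standard for the SR-IOV CNI):

oc get network-attachment-definitions -n ai-training gpu-rdma-net -o yaml
# Expect the sriov CNI config plus the annotation:
#   k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_rdma_vf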

Step 5: Mount NFSoRDMA Storage

On each GPU node, mount the NFS export with RDMA transport:

# MachineConfig for NFSoRDMA mount (OpenShift)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-gpu-worker-nfsordma-mount
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: mnt-ai\x2ddata.mount
          enabled: true
          contents: |
            [Unit]
            Description=NFSoRDMA AI Training Data
            After=network-online.target openibd.service
            Wants=network-online.target
            
            [Mount]
            What=nfs-server.internal.example.com:/exports/ai-data
            Where=/mnt/ai-data
            Type=nfs
            Options=rdma,port=20049,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,nconnect=16
            
            [Install]
            WantedBy=multi-user.target

Mount options explained:

  • rdma β€” use RDMA transport instead of TCP
  • port=20049 β€” NFS server’s RDMA listen port (standard NFSoRDMA port)
  • nconnect=16 β€” 16 parallel RDMA connections for aggregate bandwidth
  • rsize=1048576,wsize=1048576 β€” 1MB read/write blocks for large sequential I/O
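Before rolling the MachineConfig out to every node, you can confirm the export actually negotiates RDMA with a one-off manual mount from a node debug shell β€” a sketch using the server name and export path from the example above:

oc debug node/gpu-worker-1 -- chroot /host bash -c '
mkdir -p /tmp/rdma-test
mount -t nfs -o rdma,port=20049,vers=4.1 nfs-server.internal.example.com:/exports/ai-data /tmp/rdma-test
nfsstat -m | grep -A2 rdma-test    # expect proto=rdma
umount /tmp/rdma-test
'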
# Verify NFSoRDMA mount is active
oc debug node/gpu-worker-1 -- chroot /host bash -c '
mount | grep rdma
# nfs-server:/exports/ai-data on /mnt/ai-data type nfs4 (rdma,...)

# Check RDMA transport is active (not falling back to TCP)
cat /proc/mounts | grep ai-data
nfsstat -m | grep ai-data
# Shows: proto=rdma
'

Step 6: Deploy AI Training Pod with Full Stack

apiVersion: v1
kind: Pod
metadata:
  name: gpu-rdma-training
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: gpu-rdma-net    # SR-IOV VF attachment
spec:
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:24.12-py3
      command: ["bash", "-c"]
      args:
        - |
          echo "=== Verifying Full Stack ==="
          
          # 1. Check GPU (open driver)
          nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
          
          # 2. Check RDMA device (SR-IOV VF)
          ibv_devinfo 2>/dev/null || echo "Install rdma-core for ibv_devinfo"
          
          # 3. Check GPUDirect RDMA (peermem)
          cat /proc/driver/nvidia-peermem/version 2>/dev/null || \
            echo "nvidia-peermem not visible in container β€” check host"
          
          # 4. Check NFSoRDMA mount
          ls -la /data/
          dd if=/data/training-set/shard-000.tar of=/dev/null bs=1M count=1024 2>&1 | tail -1
          
          # 5. Run NCCL test with GPUDirect
          # NCCL will automatically use GPUDirect RDMA if available
          export NCCL_NET_GDR_LEVEL=5       # Enable GPUDirect RDMA in NCCL
          export NCCL_IB_DISABLE=0           # Use InfiniBand/RoCE
          export NCCL_SOCKET_IFNAME=net1     # SR-IOV VF interface
          
          echo "Stack verified. Starting training..."
          python train.py --data-dir /data/training-set/
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/sriov_rdma_vf: 1       # One SR-IOV VF per GPU
        requests:
          nvidia.com/gpu: 1
          nvidia.com/sriov_rdma_vf: 1
      volumeMounts:
        - name: ai-data
          mountPath: /data
          readOnly: true
        - name: shm
          mountPath: /dev/shm
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]     # Required for RDMA memory registration
  volumes:
    - name: ai-data
      hostPath:
        path: /mnt/ai-data     # NFSoRDMA mount point
        type: Directory
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "32Gi"       # Large SHM for PyTorch DataLoader workers

Verifying the Full Data Path

Test 1: GPUDirect RDMA (peermem) Active

# On the host node
oc debug node/gpu-worker-1 -- chroot /host bash -c '
# Check peermem is registered with InfiniBand core
cat /sys/module/nvidia_peermem/parameters/peermem_enabled 2>/dev/null
# 1 = active

# Check RDMA device capabilities
rdma link show | head -5
# link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens8f0v0

# Verify peer memory is registered
dmesg | grep -i "nvidia peermem"
# nvidia-peermem registered successfully
'

Test 2: NFSoRDMA Throughput

# From inside a GPU pod with the NFSoRDMA mount
dd if=/data/large-file.bin of=/dev/null bs=4M count=4096
# With RDMA: ~12-24 GB/s (100-200 Gbps) depending on NIC
# Without RDMA (TCP): ~3-6 GB/s

# Check NFS stats to confirm RDMA transport
nfsstat -m
# /data from nfs-server:/exports/ai-data
#  Flags: rdma,rw,vers=4.1
#  Proto: rdma     ← Confirmed RDMA, not TCP
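dd measures a single stream, while the aggregate figures quoted in this article assume parallel readers. A fio sketch for multi-stream sequential reads (assumes fio is installed in the pod image; the directory path is illustrative):

fio --name=seqread --directory=/data/training-set \
    --rw=read --bs=1M --size=4G --numjobs=8 \
    --ioengine=libaio --direct=1 --group_reporting
# Aggregate bandwidth should approach NIC line rate when the RDMA transport is active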

Test 3: End-to-End GPU Bandwidth

# Python test: measure storage β†’ GPU load throughput over the NFSoRDMA mount
import torch
import time
import numpy as np

# Read from NFSoRDMA mount directly into GPU
gpu = torch.device('cuda:0')
data = np.memmap('/data/training-set/shard-000.bin', dtype='float32', mode='r')

start = time.time()
tensor = torch.from_numpy(data[:256_000_000]).to(gpu)  # 1GB
elapsed = time.time() - start

print(f"Transferred 1GB to GPU in {elapsed:.3f}s = {1/elapsed:.1f} GB/s")
# With full stack: 8-15 GB/s
# Without GPUDirect: 2-4 GB/s

Test 4: NCCL All-Reduce with GPUDirect RDMA

# Multi-node NCCL test
export NCCL_NET_GDR_LEVEL=5
export NCCL_IB_DISABLE=0
export NCCL_DEBUG=INFO

mpirun -np 16 -hostfile hosts \
  --mca btl_tcp_if_include net1 \
  /opt/nccl-tests/build/all_reduce_perf -b 1M -e 4G -f 2

# Look for "NET/IB" and "GDR" in NCCL debug output:
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE
# NCCL INFO NET/IB : GPU Direct RDMA Enabled for ...

The Full Module Dependency Chain

graph BT
    subgraph "GPU Stack (GPU Operator)"
        NVIDIA[nvidia.ko<br>Open Kernel Module<br>DMA-BUF export]
        UVM[nvidia-uvm.ko<br>Unified Virtual Memory]
        PEERMEM[nvidia-peermem.ko<br>GPU↔RDMA bridge]
    end
    
    subgraph "RDMA Stack (Network Operator / DOCA-OFED)"
        MLX5[mlx5_core.ko<br>ConnectX driver]
        MLX5IB[mlx5_ib.ko<br>IB/RoCE verbs]
        IBCORE[ib_core.ko<br>RDMA subsystem]
        RDMACM[rdma_cm.ko<br>Connection manager]
        XPRTRDMA[xprtrdma.ko<br>NFS client RDMA]
        SVCRDMA[svcrdma.ko<br>NFS server RDMA]
        SUNRPC[sunrpc.ko<br>RPC layer]
    end
    
    subgraph "SR-IOV"
        VF0[VF 0 β€” Pod A]
        VF1[VF 1 β€” Pod B]
        VFn[VF N β€” Pod N]
    end
    
    NVIDIA --> PEERMEM
    IBCORE --> PEERMEM
    MLX5 --> MLX5IB
    MLX5IB --> IBCORE
    IBCORE --> RDMACM
    RDMACM --> XPRTRDMA
    SUNRPC --> XPRTRDMA
    SUNRPC --> SVCRDMA
    RDMACM --> SVCRDMA
    MLX5 --> VF0
    MLX5 --> VF1
    MLX5 --> VFn
    
    style NVIDIA fill:#76b900,color:#000
    style PEERMEM fill:#ff6f00,color:#000
    style IBCORE fill:#0077b6,color:#fff
    style XPRTRDMA fill:#e63946,color:#fff

Common Issues

nvidia-peermem Not Loading

# Check if GPU Operator has rdma enabled
oc get clusterpolicy -o jsonpath='{.items[0].spec.driver.rdma}'
# {"enabled": true}

# Check driver pod logs
oc logs -n gpu-operator -l app=nvidia-driver-daemonset -c nvidia-driver | grep peermem
# If "MOFED driver not found" β†’ Network Operator must deploy DOCA-OFED first

# Correct startup order:
# 1. Network Operator deploys DOCA-OFED β†’ mlx5_core, ib_core loaded
# 2. GPU Operator deploys open driver β†’ nvidia.ko loaded
# 3. GPU Operator loads nvidia-peermem β†’ bridges the two
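If the driver pod came up before DOCA-OFED finished loading, one hedged recovery is to restart the driver DaemonSet pods once the OFED modules are present, so the peermem load is retried (namespace and label follow the commands above):

# After confirming mlx5_ib and ib_core are loaded on the node:
oc delete pod -n gpu-operator -l app=nvidia-driver-daemonset
# The recreated driver pod re-runs module installation, including nvidia-peermem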

Open Driver Not Installed (Proprietary Loaded Instead)

# Check which driver is loaded
cat /proc/driver/nvidia/version
# If it shows "NVIDIA UNIX Open Kernel Module" β†’ correct
# If it shows "NVIDIA UNIX x86_64 Kernel Module" β†’ proprietary, no DMA-BUF

# Fix: ensure ClusterPolicy has useOpenKernelModules: true
oc patch clusterpolicy gpu-cluster-policy --type merge -p '
  {"spec":{"driver":{"useOpenKernelModules":true}}}'
# GPU Operator will redeploy driver DaemonSet with open modules

NFSoRDMA Falling Back to TCP

# Check if xprtrdma module is loaded
lsmod | grep xprtrdma
# If missing, NFSRDMA_ENABLE wasn't set in NicClusterPolicy

# Check mount transport
cat /proc/mounts | grep nfs
# If "proto=tcp" instead of "proto=rdma":
# 1. NFS server may not support RDMA β€” check server config
# 2. Port 20049 may be blocked β€” check firewall
# 3. RDMA device may not be on the right subnet β€” check IP routing

# Test RDMA connectivity to NFS server
rdma_client -s nfs-server.internal.example.com -p 20049

SR-IOV VFs Not RDMA-Capable

# Check VF RDMA capability
rdma link show
# If VFs don't show RDMA devices:
# 1. Verify isRdma: true in SriovNetworkNodePolicy
# 2. Verify deviceType: netdevice (not vfio-pci)
# 3. Check NIC firmware supports RDMA on VFs:
mlxconfig -d /dev/mst/mt4125_pciconf0 query | grep SRIOV_EN
# SRIOV_EN = True
mlxconfig -d /dev/mst/mt4125_pciconf0 query | grep RDMA
# RDMA_ENABLED = True

Switch Port Configuration for RDMA

Critical: NFSoRDMA does not support 802.1Q VLAN tagging. Switch ports connected to RDMA NICs must be in access mode (untagged). Each VLAN requires a dedicated physical NIC.

# Switch configuration example (Arista/Cisco style)
interface Ethernet1/1
  description gpu-worker-1-rdma
  switchport mode access
  switchport access vlan 100
  mtu 9216
  no shutdown
  # DO NOT use: switchport mode trunk

Performance Tuning

Optimal Settings

# On each GPU node (via MachineConfig):

# 1. Enable adaptive RX coalescing for RDMA
ethtool -C ens8f0 adaptive-rx on

# 2. Set ring buffer to maximum
ethtool -G ens8f0 rx 8192 tx 8192

# 3. Enable PCI relaxed ordering (GPUDirect optimization)
setpci -s <NIC_PCI_BUS>:00.0 CAP_EXP+10.w=0020

# 4. Jumbo frames (must match switch MTU)
ip link set ens8f0 mtu 9000

# 5. NUMA-aware GPU↔NIC pairing
# GPU 0 on NUMA 0 should use NIC on NUMA 0
nvidia-smi topo -m
# Shows GPU-NIC affinity β€” schedule pods to match
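To pair each GPU with a NIC on the same NUMA node, compare the NIC's NUMA node with each GPU's. A small sketch (PF name from the earlier examples; nvidia-smi prints an 8-digit PCI domain while sysfs uses 4, hence the prefix strip and lowercasing; -1 means the platform does not report locality):

# NUMA node of the ConnectX PF
cat /sys/class/net/ens8f0/device/numa_node

# NUMA node of each GPU
for bdf in $(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader); do
  dev=$(echo "${bdf#0000}" | tr '[:upper:]' '[:lower:]')
  echo "GPU ${bdf}: NUMA $(cat /sys/bus/pci/devices/${dev}/numa_node)"
done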

Expected Performance

Configuration                 | NFS Read Throughput | GPU Load Time (10 GB) | CPU Usage
TCP NFS + CPU copy            | 3-6 GB/s            | 1.7-3.3 s             | 80-100%
RDMA NFS + CPU copy           | 10-20 GB/s          | 0.5-1.0 s             | 20-40%
RDMA NFS + GPUDirect          | 15-25 GB/s          | 0.4-0.7 s             | 3-8%
RDMA NFS + GPUDirect + SR-IOV | 20-28 GB/s          | 0.35-0.5 s            | 2-5%

SR-IOV adds ~10-15% throughput by eliminating kernel networking overhead and providing dedicated hardware queues per pod.

Best Practices

  • Match GPU and NIC NUMA nodes β€” cross-NUMA transfers add 30-50% latency
  • Use open kernel modules β€” required for DMA-BUF; also preferred by NVIDIA for datacenter GPUs going forward
  • Deploy Network Operator before GPU Operator β€” nvidia-peermem needs ib_core to be loaded first
  • One SR-IOV VF per GPU β€” matches the 1:1 GPU:NIC bandwidth ratio for maximum throughput
  • Dedicated NICs for RDMA β€” don’t share RDMA NICs with cluster management traffic
  • Access mode switch ports β€” NFSoRDMA cannot do VLAN tagging; each VLAN = dedicated NIC
  • Test with ib_write_bw before deploying workloads β€” verify raw RDMA bandwidth between nodes (see the sketch after this list)
  • Monitor with DCGM + rdma metrics β€” track PCIe throughput and RDMA counters
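A minimal ib_write_bw run between two GPU nodes (from the perftest package; the device name, the peer IP, and the --use_cuda flag depend on your build and addressing, so treat this as a sketch):

# On node A (server side)
ib_write_bw -d mlx5_0 --report_gbits

# On node B (client side), pointing at node A's RDMA interface IP
ib_write_bw -d mlx5_0 --report_gbits 192.168.100.10

# With a CUDA-enabled perftest build, add --use_cuda=0 to place the buffer
# in GPU 0 memory and exercise the GPUDirect RDMA path as well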

Key Takeaways

  • The full stack is: Open GPU driver (DMA-BUF) β†’ nvidia-peermem (bridge) β†’ DOCA-OFED (RDMA) β†’ SR-IOV (isolation)
  • Open kernel modules (useOpenKernelModules: true) are required β€” proprietary driver lacks DMA-BUF
  • nvidia-peermem is the glue: it registers GPU memory with the RDMA subsystem
  • NFSoRDMA requires NFSRDMA_ENABLE=true in NicClusterPolicy and access-mode switch ports
  • SR-IOV VFs with isRdma: true + deviceType: netdevice give per-pod RDMA hardware isolation
  • Deploy order matters: Network Operator β†’ GPU Operator β†’ verify peermem β†’ mount NFSoRDMA β†’ deploy workloads
  • This stack eliminates CPU from the storage β†’ GPU data path, achieving 20-28 GB/s per NIC
#nvidia #gpu-operator #gpudirect #rdma #doca #sriov #nfsordma #dma-buf #networking #ai
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
