NVIDIA Open GPU + GPUDirect RDMA + DOCA-OFED + SR-IOV Stack
Deploy NVIDIA AI networking on Kubernetes: Open GPU driver with DMA-BUF, GPUDirect RDMA, DOCA-OFED, and SR-IOV VF isolation.
💡 Quick Answer: The full stack is: (1) the GPU Operator with useOpenKernelModules: true for the open-source GPU kernel driver exposing DMA-BUF, (2) DOCA-OFED (Network Operator) with the nfsrdma module for RDMA transport + NFSoRDMA, (3) the nvidia-peermem module bridging GPU memory → RDMA NIC for GPUDirect RDMA, and (4) SR-IOV VFs for per-pod dedicated NIC bandwidth. This enables zero-copy data paths from storage → NIC → GPU with no CPU involvement.
The Problem
AI training at scale requires moving massive datasets (hundreds of GB) from distributed storage into GPU memory across a cluster. The default path (storage → NIC → CPU → system RAM → PCIe → GPU) creates three bottleneck hops. Each hop adds latency and consumes CPU cycles that should be running training code. At 8-GPU-per-node scale with 400 Gbps networking, the CPU becomes the bottleneck, capping effective throughput at 40-60% of wire speed.
You need the full zero-copy data path: storage → RDMA NIC → PCIe → GPU memory, bypassing the CPU entirely.
The Architecture
graph TB
subgraph "The Full Stack"
A[Open GPU Kernel Driver<br>DMA-BUF export of GPU memory]
B[nvidia-peermem<br>GPUDirect RDMA bridge]
C[DOCA-OFED Driver<br>RDMA + NFSoRDMA modules]
D[SR-IOV Virtual Functions<br>Per-pod NIC isolation]
end
subgraph "Data Path"
E[NFS/Lustre Storage] -->|RDMA transport| F[ConnectX NIC]
F -->|DMA via peermem| G[GPU HBM Memory]
end
A --> B
C --> B
C --> D
style A fill:#76b900,color:#000
style B fill:#ff6f00,color:#000
style C fill:#0077b6,color:#fff
style D fill:#7b2cbf,color:#fff
How the Layers Connect
| Layer | Component | Role |
|---|---|---|
| GPU | Open Kernel Driver + DMA-BUF | Exposes GPU memory pages to the Linux DMA-BUF subsystem so other devices (NICs) can address them |
| Bridge | nvidia-peermem | Registers GPU memory with the InfiniBand/RDMA core so NICs can DMA directly into GPU HBM |
| Network | DOCA-OFED + nfsrdma | Provides RDMA verbs stack + NFSoRDMA kernel module for RDMA-based NFS mounts |
| Isolation | SR-IOV VFs | Gives each pod its own hardware NIC slice with dedicated queues, bypassing the kernel network stack |
Without this stack: Storage → NIC → CPU memcpy → System RAM → CPU memcpy → GPU (3-5ms per transfer, CPU at 100%)
With this stack: Storage → NIC → GPU (0.5ms per transfer, CPU at ~5%)
The Solution
Prerequisites
- NVIDIA GPUs: A100, H100, H200, or later (PCIe or SXM)
- NVIDIA ConnectX-6 Dx or later NICs (ConnectX-7 recommended)
- Firmware: ConnectX NIC firmware with SRIOV and RDMA enabled
- Kubernetes: 1.28+ with GPU Operator and Network Operator
- OpenShift: 4.14+ (if using OpenShift)
- Storage: NFS server with RDMA support (e.g., NetApp ONTAP, DDN EXA5) or Lustre with LNet RDMA
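A quick pre-flight check along these lines can confirm the hardware and operators are in place before you start (the node name gpu-worker-1 is an example):
# ConnectX NICs present? (PCI vendor 15b3 = Mellanox/NVIDIA)
oc debug node/gpu-worker-1 -- chroot /host bash -c 'lspci -nn | grep 15b3'
# NVIDIA GPUs present?
oc debug node/gpu-worker-1 -- chroot /host bash -c 'lspci -nn | grep -i nvidia'
# Cluster version and installed operators
oc version
oc get csv -A | grep -Ei "gpu-operator|network-operator|sriov"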
Step 1: Deploy GPU Operator with Open Kernel Modules
The open-source GPU kernel driver (nvidia-open) is required for DMA-BUF support. The proprietary driver does NOT export DMA-BUF interfaces.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
name: gpu-cluster-policy
spec:
operator:
defaultRuntime: crio # containerd for non-OpenShift
driver:
enabled: true
# === KEY: Use open kernel modules ===
useOpenKernelModules: true
version: "560.35.05"
# Open driver image (note: -open suffix)
image: nvcr.io/nvidia/driver
repository: nvcr.io/nvidia
licensingConfig:
nlsEnabled: false
rdma:
enabled: true # Load nvidia-peermem module
useHostMofed: false # MOFED managed by Network Operator
# GPUDirect RDMA peer memory module
gdrcopy:
enabled: true
# DCGM for monitoring
dcgm:
enabled: true
dcgmExporter:
enabled: true
config:
name: dcgm-exporter-config
toolkit:
enabled: true
devicePlugin:
enabled: true
config:
name: device-plugin-config
# GDS for GPUDirect Storage (NVMe → GPU)
gds:
enabled: true
# Node Feature Discovery
nfd:
enabled: true
# MIG manager (for A100/H100 MIG partitioning)
migManager:
enabled: true
Why open kernel modules?
The open-source NVIDIA kernel driver (nvidia-open) implements the Linux DMA-BUF standard interface (dma_buf_ops). This allows:
- Other kernel subsystems to import GPU memory as DMA-BUF file descriptors
- The RDMA stack to register GPU pages for peer-to-peer DMA via nvidia-peermem
- Future kernel features (KFD, P2PDMA) to interoperate with GPU memory
# Verify open driver is loaded (not proprietary)
oc debug node/gpu-worker-1 -- chroot /host bash -c '
modinfo nvidia | grep -E "^filename|^description|^license"
'
# filename: /lib/modules/.../nvidia.ko
# description: NVIDIA Unified Memory
# license: Dual MIT/GPL ← Open driver (proprietary shows "NVIDIA")
# Verify DMA-BUF support
oc debug node/gpu-worker-1 -- chroot /host bash -c '
ls /sys/kernel/dmabuf/buffers/ 2>/dev/null && echo "DMA-BUF active" || echo "No DMA-BUF"
'
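Since the ClusterPolicy above also enables GDS (gds.enabled: true), it is worth a quick look for the GPUDirect Storage kernel module as well (a sketch; nvidia_fs only appears on nodes where the nvidia-fs driver container deployed successfully):
# Verify the GPUDirect Storage (nvidia-fs) module
oc debug node/gpu-worker-1 -- chroot /host bash -c 'lsmod | grep nvidia_fs'
# nvidia_fs ... ← GDS kernel module loaded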
Step 2: Deploy Network Operator with DOCA-OFED + NFSoRDMA
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
# === DOCA-OFED driver with NFSoRDMA ===
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: "24.10-0.7.0.0-0"
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
rdmaSharedDevicePlugin:
image: k8s-rdma-shared-dev-plugin
repository: ghcr.io/mellanox
version: "v1.5.1"
config: |
{
"configList": [
{
"resourceName": "rdma_shared_device_a",
"rdmaHcaMax": 63,
"selectors": {
"vendors": ["15b3"],
"deviceIDs": ["101d", "101e", "a2dc"],
"ifNames": ["ens8f0", "ens8f1"]
}
}
]
}
env:
# === Enable NFSoRDMA kernel module ===
- name: NFSRDMA_ENABLE
value: "true"
- name: RESTORE_DRIVER_ON_POD_TERMINATION
value: "true"
# Load nvidia-peermem for GPUDirect RDMA
- name: CREATE_IFNAMES_UDEV
value: "true"
# Kernel modules to load
# DOCA-OFED loads: mlx5_core, mlx5_ib, ib_core, rdma_cm, ib_uverbs
# With NFSRDMA_ENABLE: also loads xprtrdma, svcrdma (NFSoRDMA transport)
# nvidia-peermem: loaded by GPU Operator when rdma.enabled=true
sriovDevicePlugin:
image: sriov-network-device-plugin
repository: ghcr.io/k8snetworkplumbingwg
version: "v3.7.0"
config: |
{
"resourceList": [
{
"resourcePrefix": "nvidia.com",
"resourceName": "sriov_rdma_vf",
"selectors": {
"vendors": ["15b3"],
"devices": ["101e"],
"drivers": ["mlx5_core"],
"isRdma": true
}
}
]
}
Verify DOCA-OFED and NFSoRDMA:
# Check OFED driver pods are running
oc get pods -n nvidia-network-operator -l app=mofed
# Verify kernel modules on a GPU node
oc debug node/gpu-worker-1 -- chroot /host bash -c '
echo "=== RDMA Core ==="
lsmod | grep -E "^mlx5_core|^mlx5_ib|^ib_core|^rdma_cm"
echo ""
echo "=== NFSoRDMA ==="
lsmod | grep -E "^xprtrdma|^svcrdma|^rpcrdma"
echo ""
echo "=== GPUDirect RDMA (peermem) ==="
lsmod | grep nvidia_peermem
echo ""
echo "=== DMA-BUF ==="
lsmod | grep dmabuf
'
# Expected output:
# === RDMA Core ===
# mlx5_core 2097152 1 mlx5_ib
# mlx5_ib 409600 0
# ib_core 524288 7 mlx5_ib,ib_uverbs,rdma_cm,...
# rdma_cm 131072 1 ...
#
# === NFSoRDMA ===
# xprtrdma 90112 0 ← NFS client RDMA transport
# svcrdma 77824 0 ← NFS server RDMA transport (if serving)
# rpcrdma 45056 2 xprtrdma,svcrdma
#
# === GPUDirect RDMA (peermem) ===
# nvidia_peermem 16384 0 ← GPU↔NIC bridge
#
# === DMA-BUF ===
# dmabuf ... ← DMA-BUF subsystem
Step 3: Configure SR-IOV Virtual Functions
Create VFs on the ConnectX NICs so each pod gets its own hardware-isolated NIC slice with dedicated RDMA resources:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: gpu-rdma-vfs
namespace: openshift-sriov-network-operator
spec:
nodeSelector:
node-role.kubernetes.io/gpu-worker: ""
feature.node.kubernetes.io/network-sriov.capable: "true"
resourceName: sriov_rdma_vf
numVfs: 8 # 8 VFs = 1 per GPU (for 8-GPU nodes)
nicSelector:
vendor: "15b3" # Mellanox/NVIDIA
deviceID: "101d" # ConnectX-6 Dx (use 101e for CX-7)
pfNames: ["ens8f0"] # Physical function name
deviceType: netdevice # Use netdevice for RDMA (not vfio-pci)
isRdma: true # Enable RDMA on VFs
linkType: ETH # Ethernet (use IB for InfiniBand)
mtu: 9000 # Jumbo frames for RDMA performance
# === Important: Access mode for RDMA NICs ===
# Switch ports must be in ACCESS mode (untagged)
# NFSoRDMA does NOT support VLAN tagging
Why deviceType: netdevice and not vfio-pci?
For GPUDirect RDMA, the VF must use the kernel mlx5_core driver (netdevice mode) so that:
- The ib_core RDMA subsystem can register the device
- nvidia-peermem can map GPU pages to the NIC's RDMA context
- NFSoRDMA can use the RDMA transport layer
vfio-pci bypasses the kernel entirely (for DPDK/userspace drivers): no RDMA verbs, no peermem, no NFSoRDMA.
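A quick way to double-check that the VFs stayed in netdevice mode is to read each VF's driver symlink on the host (a sketch, assuming the PF name ens8f0 from the policy above):
# Each VF should be bound to mlx5_core, not vfio-pci
oc debug node/gpu-worker-1 -- chroot /host bash -c '
for vf in /sys/class/net/ens8f0/device/virtfn*; do
  echo "$(basename "$vf") -> $(basename "$(readlink -f "$vf/driver")")"
done
'
# virtfn0 -> mlx5_core
# virtfn1 -> mlx5_core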
# Verify VFs are created
oc debug node/gpu-worker-1 -- chroot /host bash -c '
echo "=== SR-IOV VFs ==="
ip link show ens8f0
# Should show: vf 0, vf 1, ... vf 7
echo ""
echo "=== RDMA devices ==="
rdma link show
# Should show mlx5_0, mlx5_1, ... for each VF with RDMA capability
echo ""
echo "=== Allocatable resources ==="
'
oc get node gpu-worker-1 -o json | jq '.status.allocatable | with_entries(select(.key | contains("sriov")))'
# "nvidia.com/sriov_rdma_vf": "8"Step 4: Create SR-IOV Network Attachment
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
name: gpu-rdma-net
namespace: openshift-sriov-network-operator
spec:
resourceName: sriov_rdma_vf
networkNamespace: ai-training
ipam: |
{
"type": "host-local",
"subnet": "192.168.100.0/24",
"rangeStart": "192.168.100.10",
"rangeEnd": "192.168.100.200"
}
# Configure for RDMA
capabilities: '{ "rdma": true }'
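The SR-IOV Operator renders this SriovNetwork into a NetworkAttachmentDefinition in the target namespace; a quick check that it exists (namespace and name match the resource above):
oc get network-attachment-definitions -n ai-training gpu-rdma-net -o yaml
# Should reference the nvidia.com/sriov_rdma_vf resource via the
# k8s.v1.cni.cncf.io/resourceName annotation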
Step 5: Mount NFSoRDMA Storage
On each GPU node, mount the NFS export with RDMA transport:
# MachineConfig for NFSoRDMA mount (OpenShift)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
name: 99-gpu-worker-nfsordma-mount
labels:
machineconfiguration.openshift.io/role: gpu-worker
spec:
config:
ignition:
version: 3.2.0
systemd:
units:
- name: mnt-ai\x2ddata.mount
enabled: true
contents: |
[Unit]
Description=NFSoRDMA AI Training Data
After=network-online.target openibd.service
Wants=network-online.target
[Mount]
What=nfs-server.internal.example.com:/exports/ai-data
Where=/mnt/ai-data
Type=nfs
Options=rdma,port=20049,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,nconnect=16
[Install]
WantedBy=multi-user.target
Mount options explained:
- rdma – use the RDMA transport instead of TCP
- port=20049 – the NFS server's RDMA listen port (the standard NFSoRDMA port)
- nconnect=16 – 16 parallel RDMA connections for aggregate bandwidth
- rsize=1048576,wsize=1048576 – 1 MB read/write blocks for large sequential I/O
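Before rolling the MachineConfig out, the options can be tested interactively on one node (a sketch using the same server and export as above):
# One-off test mount on a single node
oc debug node/gpu-worker-1 -- chroot /host bash -c '
mkdir -p /mnt/rdma-test
mount -t nfs -o rdma,port=20049,vers=4.1 \
  nfs-server.internal.example.com:/exports/ai-data /mnt/rdma-test
nfsstat -m | grep -A2 rdma-test
umount /mnt/rdma-test
'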
# Verify NFSoRDMA mount is active
oc debug node/gpu-worker-1 -- chroot /host bash -c '
mount | grep rdma
# nfs-server:/exports/ai-data on /mnt/ai-data type nfs4 (rdma,...)
# Check RDMA transport is active (not falling back to TCP)
cat /proc/mounts | grep ai-data
nfsstat -m | grep ai-data
# Shows: proto=rdma
'
Step 6: Deploy AI Training Pod with Full Stack
apiVersion: v1
kind: Pod
metadata:
name: gpu-rdma-training
namespace: ai-training
annotations:
k8s.v1.cni.cncf.io/networks: gpu-rdma-net # SR-IOV VF attachment
spec:
containers:
- name: training
image: nvcr.io/nvidia/pytorch:24.12-py3
command: ["bash", "-c"]
args:
- |
echo "=== Verifying Full Stack ==="
# 1. Check GPU (open driver)
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# 2. Check RDMA device (SR-IOV VF)
ibv_devinfo 2>/dev/null || echo "Install rdma-core for ibv_devinfo"
# 3. Check GPUDirect RDMA (peermem)
cat /proc/driver/nvidia-peermem/version 2>/dev/null || \
echo "nvidia-peermem not visible in container β check host"
# 4. Check NFSoRDMA mount
ls -la /data/
dd if=/data/training-set/shard-000.tar of=/dev/null bs=1M count=1024 2>&1 | tail -1
# 5. Run NCCL test with GPUDirect
# NCCL will automatically use GPUDirect RDMA if available
export NCCL_NET_GDR_LEVEL=5 # Enable GPUDirect RDMA in NCCL
export NCCL_IB_DISABLE=0 # Use InfiniBand/RoCE
export NCCL_SOCKET_IFNAME=net1 # SR-IOV VF interface
echo "Stack verified. Starting training..."
python train.py --data-dir /data/training-set/
resources:
limits:
nvidia.com/gpu: 1
nvidia.com/sriov_rdma_vf: 1 # One SR-IOV VF per GPU
requests:
nvidia.com/gpu: 1
nvidia.com/sriov_rdma_vf: 1
volumeMounts:
- name: ai-data
mountPath: /data
readOnly: true
- name: shm
mountPath: /dev/shm
securityContext:
capabilities:
add: ["IPC_LOCK"] # Required for RDMA memory registration
volumes:
- name: ai-data
hostPath:
path: /mnt/ai-data # NFSoRDMA mount point
type: Directory
- name: shm
emptyDir:
medium: Memory
sizeLimit: "32Gi" # Large SHM for PyTorch DataLoader workersVerifying the Full Data Path
Test 1: GPUDirect RDMA (peermem) Active
# On the host node
oc debug node/gpu-worker-1 -- chroot /host bash -c '
# Check peermem is registered with InfiniBand core
cat /sys/module/nvidia_peermem/parameters/peermem_enabled 2>/dev/null
# 1 = active
# Check RDMA device capabilities
rdma link show | head -5
# link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens8f0v0
# Verify peer memory is registered
dmesg | grep -i "nvidia peermem"
# nvidia-peermem registered successfully
'
Test 2: NFSoRDMA Throughput
# From inside a GPU pod with the NFSoRDMA mount
dd if=/data/large-file.bin of=/dev/null bs=4M count=4096
# With RDMA: ~12-24 GB/s (100-200 Gbps) depending on NIC
# Without RDMA (TCP): ~3-6 GB/s
# Check NFS stats to confirm RDMA transport
nfsstat -m
# /data from nfs-server:/exports/ai-data
# Flags: rdma,rw,vers=4.1
# Proto: rdma ← Confirmed RDMA, not TCP
Test 3: End-to-End GPU Bandwidth
# Python test: measure storage β GPU throughput via GPUDirect RDMA
import torch
import time
import numpy as np
# Read from NFSoRDMA mount directly into GPU
gpu = torch.device('cuda:0')
data = np.memmap('/data/training-set/shard-000.bin', dtype='float32', mode='r')
start = time.time()
tensor = torch.from_numpy(data[:256_000_000]).to(gpu) # 1GB
elapsed = time.time() - start
print(f"Transferred 1GB to GPU in {elapsed:.3f}s = {1/elapsed:.1f} GB/s")
# With full stack: 8-15 GB/s
# Without GPUDirect: 2-4 GB/s
Test 4: NCCL All-Reduce with GPUDirect RDMA
# Multi-node NCCL test
export NCCL_NET_GDR_LEVEL=5
export NCCL_IB_DISABLE=0
export NCCL_DEBUG=INFO
mpirun -np 16 -hostfile hosts \
--mca btl_tcp_if_include net1 \
/opt/nccl-tests/build/all_reduce_perf -b 1M -e 4G -f 2
# Look for "NET/IB" and "GDR" in NCCL debug output:
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE
# NCCL INFO NET/IB : GPU Direct RDMA Enabled for ...
The Full Module Dependency Chain
graph BT
subgraph "GPU Stack (GPU Operator)"
NVIDIA[nvidia.ko<br>Open Kernel Module<br>DMA-BUF export]
UVM[nvidia-uvm.ko<br>Unified Virtual Memory]
PEERMEM[nvidia-peermem.ko<br>GPU↔RDMA bridge]
end
subgraph "RDMA Stack (Network Operator / DOCA-OFED)"
MLX5[mlx5_core.ko<br>ConnectX driver]
MLX5IB[mlx5_ib.ko<br>IB/RoCE verbs]
IBCORE[ib_core.ko<br>RDMA subsystem]
RDMACM[rdma_cm.ko<br>Connection manager]
XPRTRDMA[xprtrdma.ko<br>NFS client RDMA]
SVCRDMA[svcrdma.ko<br>NFS server RDMA]
SUNRPC[sunrpc.ko<br>RPC layer]
end
subgraph "SR-IOV"
VF0[VF 0 → Pod A]
VF1[VF 1 → Pod B]
VFn[VF N → Pod N]
end
NVIDIA --> PEERMEM
IBCORE --> PEERMEM
MLX5 --> MLX5IB
MLX5IB --> IBCORE
IBCORE --> RDMACM
RDMACM --> XPRTRDMA
SUNRPC --> XPRTRDMA
SUNRPC --> SVCRDMA
RDMACM --> SVCRDMA
MLX5 --> VF0
MLX5 --> VF1
MLX5 --> VFn
style NVIDIA fill:#76b900,color:#000
style PEERMEM fill:#ff6f00,color:#000
style IBCORE fill:#0077b6,color:#fff
style XPRTRDMA fill:#e63946,color:#fff
Common Issues
nvidia-peermem Not Loading
# Check if GPU Operator has rdma enabled
oc get clusterpolicy -o jsonpath='{.items[0].spec.driver.rdma}'
# {"enabled": true}
# Check driver pod logs
oc logs -n gpu-operator -l app=nvidia-driver-daemonset -c nvidia-driver | grep peermem
# If "MOFED driver not found" β Network Operator must deploy DOCA-OFED first
# Correct startup order:
# 1. Network Operator deploys DOCA-OFED → mlx5_core, ib_core loaded
# 2. GPU Operator deploys open driver → nvidia.ko loaded
# 3. GPU Operator loads nvidia-peermem → bridges the two
Open Driver Not Installed (Proprietary Loaded Instead)
# Check which driver is loaded
cat /proc/driver/nvidia/version
# If it shows "NVIDIA UNIX Open Kernel Module" β correct
# If it shows "NVIDIA UNIX x86_64 Kernel Module" β proprietary, no DMA-BUF
# Fix: ensure ClusterPolicy has useOpenKernelModules: true
oc patch clusterpolicy gpu-cluster-policy --type merge -p '
{"spec":{"driver":{"useOpenKernelModules":true}}}'
# GPU Operator will redeploy driver DaemonSet with open modules
NFSoRDMA Falling Back to TCP
# Check if xprtrdma module is loaded
lsmod | grep xprtrdma
# If missing, NFSRDMA_ENABLE wasn't set in NicClusterPolicy
# Check mount transport
cat /proc/mounts | grep nfs
# If "proto=tcp" instead of "proto=rdma":
# 1. NFS server may not support RDMA → check server config
# 2. Port 20049 may be blocked → check firewall
# 3. RDMA device may not be on the right subnet → check IP routing
# Test RDMA connectivity to NFS server
rdma_client -s nfs-server.internal.example.com -p 20049
SR-IOV VFs Not RDMA-Capable
# Check VF RDMA capability
rdma link show
# If VFs don't show RDMA devices:
# 1. Verify isRdma: true in SriovNetworkNodePolicy
# 2. Verify deviceType: netdevice (not vfio-pci)
# 3. Check NIC firmware supports RDMA on VFs:
mlxconfig -d /dev/mst/mt4125_pciconf0 query | grep SRIOV_EN
# SRIOV_EN = True
mlxconfig -d /dev/mst/mt4125_pciconf0 query | grep RDMA
# RDMA_ENABLED = True
Switch Port Configuration for RDMA
Critical: NFSoRDMA does not support 802.1Q VLAN tagging. Switch ports connected to RDMA NICs must be in access mode (untagged). Each VLAN requires a dedicated physical NIC.
# Switch configuration example (Arista/Cisco style)
interface Ethernet1/1
description gpu-worker-1-rdma
switchport mode access
switchport access vlan 100
mtu 9216
no shutdown
# DO NOT use: switchport mode trunk
Performance Tuning
Optimal Settings
# On each GPU node (via MachineConfig):
# 1. Enable adaptive RX coalescing for RDMA
ethtool -C ens8f0 adaptive-rx on
# 2. Set ring buffer to maximum
ethtool -G ens8f0 rx 8192 tx 8192
# 3. Enable PCI relaxed ordering (GPUDirect optimization)
setpci -s <NIC_PCI_BUS>:00.0 CAP_EXP+10.w=0020
# 4. Jumbo frames (must match switch MTU)
ip link set ens8f0 mtu 9000
# 5. NUMA-aware GPU↔NIC pairing
# GPU 0 on NUMA 0 should use NIC on NUMA 0
nvidia-smi topo -m
# Shows GPU↔NIC affinity – schedule pods to match
Expected Performance
| Configuration | NFS Read Throughput | GPU Load Time (10GB) | CPU Usage |
|---|---|---|---|
| TCP NFS + CPU copy | 3-6 GB/s | 1.7-3.3s | 80-100% |
| RDMA NFS + CPU copy | 10-20 GB/s | 0.5-1.0s | 20-40% |
| RDMA NFS + GPUDirect | 15-25 GB/s | 0.4-0.7s | 3-8% |
| RDMA NFS + GPUDirect + SR-IOV | 20-28 GB/s | 0.35-0.5s | 2-5% |
SR-IOV adds ~10-15% throughput by eliminating kernel networking overhead and providing dedicated hardware queues per pod.
Best Practices
- Match GPU and NIC NUMA nodes – cross-NUMA transfers add 30-50% latency
- Use open kernel modules – required for DMA-BUF; also preferred by NVIDIA for datacenter GPUs going forward
- Deploy Network Operator before GPU Operator – nvidia-peermem needs ib_core to be loaded first
- One SR-IOV VF per GPU – matches the 1:1 GPU:NIC bandwidth ratio for maximum throughput
- Dedicated NICs for RDMA – don't share RDMA NICs with cluster management traffic
- Access-mode switch ports – NFSoRDMA cannot do VLAN tagging; each VLAN = dedicated NIC
- Test with ib_write_bw before deploying workloads – verify raw RDMA bandwidth between nodes (see the sketch after this list)
- Monitor with DCGM + rdma metrics – track PCIe throughput and RDMA counters
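A minimal ib_write_bw sketch (from the perftest package) for the raw-bandwidth check mentioned above; the device name and client-side IP are examples:
# On node/pod A (server side):
ib_write_bw -d mlx5_0 --report_gbits
# On node/pod B (client side), pointing at A's RDMA interface IP:
ib_write_bw -d mlx5_0 --report_gbits 192.168.100.10
# Expect results near line rate before layering NFSoRDMA or NCCL on top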
Key Takeaways
- The full stack is: Open GPU driver (DMA-BUF) → nvidia-peermem (bridge) → DOCA-OFED (RDMA) → SR-IOV (isolation)
- Open kernel modules (useOpenKernelModules: true) are required – the proprietary driver lacks DMA-BUF
- nvidia-peermem is the glue: it registers GPU memory with the RDMA subsystem
- NFSoRDMA requires NFSRDMA_ENABLE=true in the NicClusterPolicy and access-mode switch ports
- SR-IOV VFs with isRdma: true + deviceType: netdevice give per-pod RDMA hardware isolation
- Deploy order matters: Network Operator → GPU Operator → verify peermem → mount NFSoRDMA → deploy workloads
- This stack eliminates the CPU from the storage → GPU data path, achieving 20-28 GB/s per NIC

