RDMA Configuration with NVIDIA Network Operator
Deploy and configure RDMA for GPU clusters using the NVIDIA Network Operator. NicClusterPolicy setup, MLNX_OFED driver container, shared and SR-IOV RDMA device
💡 Quick Answer: The NVIDIA Network Operator deploys the RDMA stack on Kubernetes through a
NicClusterPolicy: the MLNX_OFED driver container (or host MOFED), the RDMA shared device plugin, the SR-IOV device plugin, and a secondary network (Multus + Whereabouts/nv-ipam). nvidia-peermem for GPUDirect RDMA is loaded by the GPU Operator. Install via Helm (network-operator chart), configure NicClusterPolicy with your NIC selectors, and pods get RDMA access by requesting rdma/rdma_shared_device_a: 1.
The Problem
- Need full RDMA stack (driver, device plugin, IPAM, CNI) deployed consistently across GPU nodes
- Manual MLNX_OFED installation is fragile and version-specific
- Must coordinate RDMA device plugin, secondary networks, and GPUDirect integration
- SR-IOV and shared RDMA modes need different plugin configurations
- RDMA setup must integrate with GPU Operator for GPUDirect RDMA
The Solution
Install NVIDIA Network Operator
# Add Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install Network Operator
helm install network-operator nvidia/network-operator \
  --namespace nvidia-network-operator \
  --create-namespace \
  --version 25.1.0 \
  --set deployCR=false  # Deploy NicClusterPolicy separately
NicClusterPolicy: Full RDMA Stack
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  # MLNX_OFED Driver Container (containerized OFED)
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: "25.04-0.7.0.0-0"
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    rdmaSubsystemNamespace: "shared"  # Enable shared RDMA namespace mode
  # RDMA Shared Device Plugin
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: "1.5.1"
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"],
              "deviceIDs": ["101d", "101e", "a2dc"],
              "drivers": ["mlx5_core"]
            }
          }
        ]
      }
  # Secondary Network (Multus + IPAM + CNI)
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: "v1.5.0"
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: "v4.1.0"
    ipamPlugin:
      image: whereabouts
      repository: ghcr.io/k8snetworkplumbingwg
      version: "v0.7.0"
  # nv-ipam (NVIDIA IPAM for GPU fabric, alternative to whereabouts)
  nvIpam:
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: "0.2.0"
    enableWebhook: true
NicClusterPolicy: With SR-IOV (Exclusive RDMA)
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: "25.04-0.7.0.0-0"
  # SR-IOV Device Plugin (exclusive VF per pod)
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: "v3.7.0"
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "rdma_vf",
            "selectors": {
              "vendors": ["15b3"],
              "devices": ["101e"],
              "drivers": ["mlx5_core"],
              "isRdma": true
            }
          }
        ]
      }
  # SR-IOV Network Operator integration
  sriovNetworkOperator:
    enabled: true
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: "v1.5.0"
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: "v4.1.0"
Use Host MOFED Instead of Container
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: "25.04-0.7.0.0-0"
  ofedDriverParams:
    # Use host-installed MLNX_OFED instead of containerized
    useHostOfed: true  # Skips driver container, uses host MOFED
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: "1.5.1"
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"]
            }
          }
        ]
      }
GPU Operator + Network Operator Integration
# GPU Operator ClusterPolicy: RDMA settings that pair with the Network Operator
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true
    rdma:
      enabled: true       # GPU Operator loads nvidia-peermem
      useHostMofed: true  # MOFED is pre-installed on the host; use false with the Network Operator driver container
# Network Operator manages:
# - MLNX_OFED driver (containerized or host)
# - RDMA device plugins (shared or SR-IOV)
# - Secondary networks (Multus, IPAM, CNI)
#
# GPU Operator manages:
# - nvidia-peermem (GPUDirect RDMA bridge between GPU and NIC)
# - GPU driver, device-plugin, toolkit, DCGM
#
# Together: full GPUDirect RDMA stack
Secondary Network for RDMA Pods
# MacvlanNetwork: automatic NetworkAttachmentDefinition creation
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: gpu-rdma-net
spec:
  networkNamespace: "default"
  master: "ens8f0"  # RDMA-capable interface
  mode: "bridge"
  mtu: 9000         # Jumbo frames for RDMA
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "gpu-fabric-pool"
    }
---
# IPPool for nv-ipam
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: gpu-fabric-pool
  namespace: nvidia-network-operator
spec:
  subnet: "10.10.0.0/16"
  perNodeBlockSize: 64
  gateway: "10.10.0.1"
Pod Consuming RDMA via Network Operator
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
  annotations:
    k8s.v1.cni.cncf.io/networks: gpu-rdma-net
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.04-py3
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      env:
        - name: NCCL_IB_HCA
          value: "mlx5_0,mlx5_3,mlx5_5,mlx5_6"
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "64Gi"
Verify Network Operator Deployment
# Check operator pod
kubectl get pods -n nvidia-network-operator
# NAME READY STATUS
# network-operator-controller-manager-xxx 1/1 Running
# mofed-ubuntu22.04-ds-xxxxx 1/1 Running (per node)
# rdma-shared-dp-ds-xxxxx 1/1 Running (per node)
# multus-ds-xxxxx 1/1 Running (per node)
# whereabouts-ds-xxxxx 1/1 Running (per node)
# Check NicClusterPolicy status
kubectl get nicclusterpolicy -o yaml | grep -A20 "status:"
# status:
# appliedStates:
# - name: state-OFED
# state: ready
# - name: state-RDMA-device-plugin
# state: ready
# - name: state-Multus
# state: ready
# Check RDMA resources on nodes
kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | startswith("rdma")))'
# { "rdma/rdma_shared_device_a": "63" }
# Check OFED version in driver container
kubectl exec -n nvidia-network-operator mofed-ubuntu22.04-ds-xxxxx -- ofed_info -s
# MLNX_OFED_LINUX-25.04-0.7.0.0:
Network Operator Components
Component                    | Deployed By                  | Function
-----------------------------+------------------------------+---------------------------------
DOCA/MOFED Driver Container  | ofedDriver                   | Containerized MLNX_OFED
RDMA Shared Device Plugin    | rdmaSharedDevicePlugin       | Shared /dev/infiniband access
SR-IOV Device Plugin         | sriovDevicePlugin            | Exclusive VF per pod
Multus CNI                   | secondaryNetwork.multus      | Multiple network interfaces
Whereabouts IPAM             | secondaryNetwork.ipamPlugin  | IP allocation for secondary nets
nv-ipam                      | nvIpam                       | NVIDIA IPAM (GPU fabric pools)
CNI Plugins                  | secondaryNetwork.cniPlugins  | macvlan, ipvlan, bridge CNIs
IB-Kubernetes                | ibKubernetes                 | InfiniBand partition management
nvidia-peermem               | GPU Operator (separate)      | GPUDirect RDMA bridge
Firmware Configuration via Network Operator
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: "25.04-0.7.0.0-0"
  # NIC firmware configuration
  nicConfigurationOperator:
    enabled: true
---
# NicConfigurationTemplate: configure NIC firmware settings
apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NicConfigurationTemplate
metadata:
  name: rdma-optimized
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  nicSelector:
    vendor: "15b3"
    deviceID: "101d"  # ConnectX-6 Dx (ConnectX-7 reports PCI device ID 1021)
  template:
    parameters:
      # Enable RoCE
      ROCE_MODE: "2"
      # Enable GPUDirect
      ATS_ENABLED: "true"
      # PFC (Priority Flow Control)
      PFC_ENABLED: "true"
      PRIO_TC_MAP: "0,0,0,3,0,0,0,0"
RoCE vs InfiniBand Configuration
# For RoCE (RDMA over Converged Ethernet): select Ethernet ports via linkTypes.
# Comments stay outside the config string so the embedded JSON remains valid.
rdmaSharedDevicePlugin:
  config: |
    {
      "configList": [{
        "resourceName": "rdma_shared_device_a",
        "rdmaHcaMax": 63,
        "selectors": {
          "vendors": ["15b3"],
          "linkTypes": ["ETH"]
        }
      }]
    }
# For InfiniBand: select InfiniBand ports via linkTypes.
rdmaSharedDevicePlugin:
  config: |
    {
      "configList": [{
        "resourceName": "rdma_shared_device_ib",
        "rdmaHcaMax": 63,
        "selectors": {
          "vendors": ["15b3"],
          "linkTypes": ["IB"]
        }
      }]
    }
Common Issues
MOFED driver container stuck in Init
- Cause: Host kernel headers are not available, or an existing host MOFED installation conflicts
- Fix: Install kernel-devel matching the running kernel, or remove host MOFED and let the container manage it
"No RDMA resources" after NicClusterPolicy applied
- Cause: The device plugin selector doesn't match any NICs, or the plugin is not scheduled on the node
- Fix: Check ibstat on the node for the actual devices (lspci -nn lists the PCI vendor:device IDs); verify the node selector matches the GPU workers
Network Operator conflicts with SR-IOV Network Operator
- Cause: Both are trying to manage SR-IOV, or there is a Multus conflict
- Fix: Use one or the other for SR-IOV. Network Operator can embed SR-IOV; don't install both standalone
OFED driver container version mismatch with host kernel
- Cause: The containerized MOFED was built for a different kernel
- Fix: Use useHostOfed: true if the host already has MOFED, or match the container image to the kernel version
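When the device plugin reports no resources, it can help to compare the selector's deviceIDs with what the node's PCI bus actually exposes. The following is a minimal sketch and not part of the Network Operator: mlx_device_ids is a hypothetical helper name, and it assumes lspci -nn output in its usual form, where each device line ends with a [vendor:device] pair.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -nn` output on stdin and print the unique
# PCI device IDs of NVIDIA/Mellanox NICs (vendor ID 15b3), one per line.
mlx_device_ids() {
    grep -o '\[15b3:[0-9a-f]\{4\}\]' \
        | tr -d '[]' \
        | cut -d: -f2 \
        | sort -u
}

# Usage on a GPU node (requires pciutils):
#   lspci -nn | mlx_device_ids
```

Any ID printed here that is missing from the plugin's deviceIDs list (or an empty result) points to a selector that cannot match the node's NICs.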
Best Practices
- Network Operator for NIC stack, GPU Operator for GPU stack: clear separation
- useHostMofed: true in GPU Operator: set it when MOFED is pre-installed on the host; leave it false when the Network Operator driver container provides MOFED
- Pin DOCA/MOFED versions: avoid surprise driver updates breaking RDMA
- Use nv-ipam over whereabouts: better integration with GPU fabric topology
- Separate resource names per fabric: rdma_gpu_fabric vs rdma_storage_fabric
- Monitor NicClusterPolicy status: all states should show "ready"
- Jumbo frames (MTU 9000): can noticeably improve RDMA throughput, provided every NIC and switch port on the path uses the same MTU
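The status-monitoring practice in the list above can be scripted. This is a minimal sketch, assuming the appliedStates layout shown in the verification output (a name and a state per component). not_ready_states is a hypothetical helper name, and treating "ignore" as healthy is an assumption for components that are left out of the policy.

```shell
#!/bin/sh
# Hypothetical helper: read "name<TAB>state" lines on stdin, print the names
# of components that are not healthy, and return non-zero if any were found.
# Assumption: "ignore" marks a component that is not configured in the policy,
# so it is treated as healthy alongside "ready".
not_ready_states() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NF < 2 { next }                       # skip blank or malformed lines
        $2 != "ready" && $2 != "ignore" { print $1; bad = 1 }
        END { exit bad }
    '
}

# Usage against a live cluster:
#   kubectl get nicclusterpolicy nic-cluster-policy \
#     -o jsonpath='{range .status.appliedStates[*]}{.name}{"\t"}{.state}{"\n"}{end}' \
#     | not_ready_states || echo "NicClusterPolicy is not fully ready"
```

Because the function only parses text, it can run in a cron job or CI step that already has kubectl access, with the exit code driving the alert.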
Key Takeaways
- NVIDIA Network Operator: single CR (NicClusterPolicy) deploys entire RDMA stack
- Components: MOFED driver + device plugins + Multus + IPAM + CNI plugins
- Shared RDMA: rdmaSharedDevicePlugin, many pods share one PF (training clusters)
- SR-IOV RDMA: sriovDevicePlugin, exclusive VF per pod (multi-tenant)
- GPU Operator handles nvidia-peermem; Network Operator handles everything NIC-side
- useHostMofed: true in GPU Operator signals that MOFED is pre-installed on the host; leave it false when the Network Operator driver container provides MOFED
- Secondary networks (MacvlanNetwork + IPPool) give pods fabric IPs automatically
- RoCE vs IB: use linkTypes selector in device plugin config

