Networking advanced ⏱ 20 minutes K8s 1.28+

SR-IOV NetworkNodePolicy for RDMA

Configure SriovNetworkNodePolicy on OpenShift to create RDMA-capable VFs on Mellanox ConnectX NICs for GPUDirect RDMA and high-performance AI networking.

By Luca Berton • 📖 7 min read

💡 Quick Answer: Create a SriovNetworkNodePolicy with deviceType: netdevice and isRdma: true to provision RDMA-capable Virtual Functions on Mellanox ConnectX adapters. This is required for GPUDirect RDMA in multi-node AI training.

The Problem

Multi-node GPU workloads (distributed training, tensor parallelism) need low-latency, high-bandwidth pod-to-pod communication. Standard Kubernetes networking adds kernel overhead and can't provide the 100-400 Gb/s throughput that NCCL expects. You need SR-IOV Virtual Functions with RDMA verbs so GPUs can transfer data directly over InfiniBand or RoCE without CPU involvement.

Key challenges:

  • VFs must be created with the correct deviceType: vfio-pci bypasses the kernel and breaks RDMA verbs
  • The NIC vendor/device IDs must match your exact hardware
  • Priority ordering matters when multiple policies target the same NIC
  • RDMA must be explicitly enabled (isRdma: true)

The Solution

Prerequisites

Install the SR-IOV Network Operator and verify NIC discovery:

# Check operator is running
oc get pods -n openshift-sriov-network-operator

# Verify node NIC discovery
oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml

Look for your Mellanox NICs in the output and note the vendor, deviceID, PF names, and PCI addresses.

Identify Your Hardware

# On a worker node (via debug pod)
oc debug node/worker-gpu-01 -- chroot /host lspci -nn | grep Mellanox

# Example output:
# 3b:00.0 Ethernet controller [0200]: Mellanox Technologies ConnectX-6 Dx [15b3:101d]
# 3b:00.1 Ethernet controller [0200]: Mellanox Technologies ConnectX-6 Dx [15b3:101d]

Common Mellanox device IDs:

| NIC Model | Vendor | Device ID |
|---|---|---|
| ConnectX-5 | 15b3 | 1017 |
| ConnectX-5 Ex | 15b3 | 1019 |
| ConnectX-6 | 15b3 | 101b |
| ConnectX-6 Dx | 15b3 | 101d |
| ConnectX-6 Lx | 15b3 | 101f |
| ConnectX-7 | 15b3 | 1021 |
| BlueField-2 | 15b3 | a2d6 |
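
To avoid copying IDs by hand, the vendor and device ID can be pulled out of the lspci -nn output with a small filter. This is a minimal sketch: extract_pci_ids is a hypothetical helper name, and it assumes the ID pair is the last bracketed xxxx:xxxx token on each line, as in the example output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -nn` lines on stdin and print one
# "vendor device" pair per line, de-duplicated.
extract_pci_ids() {
  # The greedy leading .* selects the LAST [xxxx:xxxx] token on the line.
  # The class code (e.g. [0200]) has no colon, so it never matches.
  sed -n 's/.*\[\([0-9a-f]\{4\}\):\([0-9a-f]\{4\}\)\].*/\1 \2/p' | sort -u
}

# Sample input copied from the example output above
sample='3b:00.0 Ethernet controller [0200]: Mellanox Technologies ConnectX-6 Dx [15b3:101d]
3b:00.1 Ethernet controller [0200]: Mellanox Technologies ConnectX-6 Dx [15b3:101d]'

printf '%s\n' "$sample" | extract_pci_ids
# prints: 15b3 101d
```

On a live node, the same filter can be appended to the debug command: oc debug node/worker-gpu-01 -- chroot /host lspci -nn | grep Mellanox | extract_pci_ids. The two values map to the vendor and deviceID fields of the nicSelector.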

SriovNetworkNodePolicy

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-rdma-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxrdma
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: 10
  numVfs: 8
  nicSelector:
    vendor: "15b3"
    deviceID: "101d"
    pfNames: ["ens8f0", "ens8f1"]
    rootDevices: ["0000:3b:00.0", "0000:3b:00.1"]
  deviceType: netdevice
  isRdma: true

Field Reference

| Field | Value | Why |
|---|---|---|
| deviceType | netdevice | Must be netdevice for RDMA. vfio-pci passes the VF to userspace via VFIO, so there is no kernel network stack and no RDMA verbs. |
| isRdma | true | Mounts RDMA device files (/dev/infiniband/) into pods requesting this resource. |
| priority | 10 | Lower number = higher priority. Use when multiple policies could match the same NIC. Range: 0-99. |
| numVfs | 8 | Number of VFs to create per PF. Match to max pods-per-node needing RDMA. Don't exceed NIC max (typically 127 for ConnectX-6+). |
| vendor | 15b3 | Mellanox/NVIDIA PCI vendor ID. |
| deviceID | 101d | ConnectX-6 Dx. Change to match your NIC model (see table above). |
| pfNames | ["ens8f0"] | Physical Function interface names. Use to target specific ports. |
| rootDevices | ["0000:3b:00.0"] | PCI bus addresses. Alternative to pfNames for precise targeting. |

Create the SriovNetwork

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rdma-net
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxrdma
  networkNamespace: ai-training
  ipam: |-
    {
      "type": "whereabouts",
      "range": "10.56.0.0/16",
      "exclude": ["10.56.0.0/24"]
    }
  capabilities: '{ "rdma": true }'

Pod Requesting RDMA VF

apiVersion: v1
kind: Pod
metadata:
  name: nccl-test
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  containers:
    - name: nccl
      image: nvcr.io/nvidia/pytorch:24.05-py3
      resources:
        requests:
          nvidia.com/gpu: 1
          openshift.io/mlxrdma: "1"
        limits:
          nvidia.com/gpu: 1
          openshift.io/mlxrdma: "1"
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      env:
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_IB_HCA
          value: "mlx5"

Verify VFs Are Created

# Check VFs on worker node
oc debug node/worker-gpu-01 -- chroot /host ip link show ens8f0

# Should show VF entries:
#   vf 0 ... , link-state auto
#   vf 1 ... , link-state auto
#   ...

# Check RDMA devices
oc debug node/worker-gpu-01 -- chroot /host rdma link show

# Verify SR-IOV operator allocated resources
oc get sriovnetworknodestates -n openshift-sriov-network-operator \
  -o jsonpath='{.items[0].status.syncStatus}'
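
The syncStatus query above only inspects the first node. A polling loop over all nodes can be sketched as follows. The function names, attempt count, and interval are illustrative, and the sketch assumes the operator reports the literal value Succeeded once a node has finished applying the policy.

```shell
#!/usr/bin/env bash
# Return 0 only if at least one status was given and every one is "Succeeded".
all_synced() {
  local s seen=0
  for s in $1; do            # intentional word splitting on whitespace
    [ "$s" = "Succeeded" ] || return 1
    seen=1
  done
  [ "$seen" -eq 1 ]
}

# Poll all SriovNetworkNodeStates until they report Succeeded.
# Defaults (60 attempts x 30 s) are arbitrary; tune them for your cluster.
wait_for_sriov_sync() {
  local attempts="${1:-60}" interval="${2:-30}" statuses i=1
  while [ "$i" -le "$attempts" ]; do
    statuses="$(oc get sriovnetworknodestates -n openshift-sriov-network-operator \
      -o jsonpath='{.items[*].status.syncStatus}')"
    if all_synced "$statuses"; then
      echo "all nodes synced"
      return 0
    fi
    echo "attempt $i/$attempts: syncStatus=${statuses:-none}"
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for SR-IOV sync" >&2
  return 1
}

# Usage (needs cluster access):
#   wait_for_sriov_sync 60 30
```

Keeping the status check in its own function makes it easy to exercise without a cluster, for example all_synced "Succeeded InProgress" returns non-zero.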

Verify RDMA Inside Pod

# Exec into the pod
oc exec -it nccl-test -n ai-training -- bash

# Check RDMA devices are visible
ibv_devinfo
rdma link show

# Run a quick bandwidth test
ib_write_bw -d mlx5_2 &   # On one pod
ib_write_bw -d mlx5_2 <peer-ip>  # On the other

graph TD
    POLICY[SriovNetworkNodePolicy<br/>deviceType: netdevice<br/>isRdma: true] -->|Creates| VF[Virtual Functions<br/>on ConnectX NIC]
    NET[SriovNetwork<br/>capabilities: rdma] -->|Provides| NAD[NetworkAttachmentDefinition]
    VF --> POD[AI Training Pod]
    NAD --> POD
    POD -->|RDMA verbs| GPU[GPU<br/>GPUDirect RDMA]
    GPU -->|DMA-BUF| VF
    VF -->|Wire| SWITCH[Network Switch<br/>PFC enabled]

Common Issues

VFs created but no RDMA devices in pod

Ensure isRdma: true is set. Without it, the RDMA device files aren't mounted:

# Check if /dev/infiniband/ exists in the pod
oc exec nccl-test -n ai-training -- ls /dev/infiniband/
# Should list: rdma_cm, uverbs0, etc.

Using deviceType: vfio-pci breaks RDMA

vfio-pci passes the VF as a raw PCIe device to userspace via VFIO. This is for DPDK or KubeVirt VM passthrough. It bypasses the kernel network stack entirely, so RDMA verbs (ibv_*) don't work. Always use netdevice for RDMA.

Node reboots after applying policy

The SR-IOV operator drains and reboots nodes to apply firmware-level VF changes. This is expected. Control the blast radius:

# Check which nodes will be affected
oc get sriovnetworknodestates -n openshift-sriov-network-operator

# Apply during maintenance windows
# Or use node selectors to target one node at a time

numVfs exceeds NIC maximum

Check your NIC's VF limit:

oc debug node/worker-gpu-01 -- chroot /host \
  cat /sys/class/net/ens8f0/device/sriov_totalvfs
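
The comparison itself is simple enough to script as a pre-flight check before applying a policy. The sketch below uses a hypothetical check_numvfs helper; the requested value 8 mirrors the numVfs in the example policy, and the limit 127 is only a placeholder for whatever sriov_totalvfs reports on your NIC.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: does the requested numVfs fit the NIC limit?
# $1 = numVfs from the policy, $2 = value read from sriov_totalvfs
check_numvfs() {
  local requested="$1" total="$2"
  if [ "$requested" -le "$total" ]; then
    echo "ok: numVfs=$requested fits (sriov_totalvfs=$total)"
  else
    echo "error: numVfs=$requested exceeds sriov_totalvfs=$total" >&2
    return 1
  fi
}

# Demo with placeholder numbers
check_numvfs 8 127

# Against a live node (needs cluster access; tr strips stray whitespace):
#   total="$(oc debug node/worker-gpu-01 -- chroot /host \
#     cat /sys/class/net/ens8f0/device/sriov_totalvfs | tr -d '[:space:]')"
#   check_numvfs 8 "$total"
```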

Policy priority conflicts

When two policies match the same NIC, the policy with the lower priority number wins. If the priorities are equal, behavior is undefined. Always use distinct priorities:

# GPU nodes: 8 VFs for RDMA
priority: 10
# Storage nodes: 4 VFs for NVMe-oF
priority: 20

NCCL falls back to TCP instead of RDMA

Check NCCL debug output for NET/IB (RDMA) vs NET/Socket (TCP):

# Good: NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE
# Bad:  NCCL INFO NET/Socket : Using [0]eth0

If TCP fallback occurs:

  1. Verify IPC_LOCK capability is granted
  2. Check NCCL_IB_HCA=mlx5 environment variable
  3. Confirm RDMA device is visible: ibv_devinfo
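
Scanning the debug output by eye gets tedious across many ranks, so the IB-versus-socket check can be scripted. This is a rough sketch: nccl_transport is a hypothetical helper, and it assumes the log lines follow the two sample formats shown above (NET/IB : Using ... and NET/Socket : Using ...).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read NCCL_DEBUG=INFO output on stdin and print
# "rdma", "tcp", or "unknown" depending on which transport NCCL selected.
nccl_transport() {
  local log
  log="$(cat)"
  # Match only "... : Using" lines so that unrelated NET/IB messages
  # do not count as a successful RDMA selection.
  if printf '%s\n' "$log" | grep -Eq 'NET/IB ?: ?Using'; then
    echo "rdma"
  elif printf '%s\n' "$log" | grep -Eq 'NET/Socket ?: ?Using'; then
    echo "tcp"
  else
    echo "unknown"
  fi
}

# Demo with the sample line from above
echo 'NCCL INFO NET/Socket : Using [0]eth0' | nccl_transport
# prints: tcp
```

Against a running pod, the log can be piped straight in: oc logs nccl-test -n ai-training | nccl_transport. A result of tcp or unknown is the cue to work through the three checks above.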

Best Practices

  • Always use deviceType: netdevice for RDMA workloads; vfio-pci is for DPDK/VM passthrough only
  • Set numVfs to match your max pods-per-node, not the NIC maximum, because unused VFs waste resources
  • Use pfNames OR rootDevices to target specific ports; using both is redundant but harmless
  • Grant IPC_LOCK capability (via SCC or securityContext) for RDMA memory registration
  • Enable open GPU kernel modules (useOpenKernelModules: true in ClusterPolicy) for DMA-BUF / GPUDirect RDMA
  • Install NFD before the GPU Operator: NFD → GPU Operator → Network Operator → SR-IOV Operator
  • Configure PFC on the switch for lossless RoCE traffic (priority 3)
  • Test with ib_write_bw before running NCCL to isolate networking issues from GPU issues
  • Use the Shared RDMA Device Plugin for single-tenant clusters; SR-IOV for multi-tenant isolation

Key Takeaways

  • SriovNetworkNodePolicy creates VFs at the firmware level, so nodes reboot during apply
  • deviceType: netdevice + isRdma: true is the only valid combination for RDMA
  • vfio-pci bypasses kernel networking: no RDMA verbs, no ibv_devinfo
  • The SriovNetwork resource creates a NetworkAttachmentDefinition for Multus
  • Pods request VFs via resource name (openshift.io/<resourceName>) and network annotation
  • priority field resolves conflicts when multiple policies match the same NIC
  • NCCL needs IPC_LOCK, NCCL_IB_HCA=mlx5, and visible /dev/infiniband/ devices