SR-IOV Multus Network Attachment for GPU RDMA Pods
Configure Multus CNI NetworkAttachmentDefinition for SR-IOV RDMA in Kubernetes GPU workloads. Covers k8s.v1.cni.cncf.io/networks annotation, IPAM
Quick Answer: Add the k8s.v1.cni.cncf.io/networks: <network-name> annotation to worker pods requesting SR-IOV RDMA interfaces. Combined with openshift.io/mellanoxnics: 1 in resource limits, this gives the pod a net1 interface backed by a Mellanox VF with RDMA capabilities and /dev/infiniband device access.
The Problem
- GPU pods need a secondary RDMA-capable network interface for NCCL data plane
- Default pod network (eth0) doesn't support RDMA
- Must coordinate between Multus annotation and SR-IOV device plugin resource
- IPAM must assign IPs on the correct subnet for inter-node RDMA communication
- Need to verify end-to-end: annotation → VF allocation → net1 → RDMA functionality
The Solution
NetworkAttachmentDefinition
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rdma-net
  namespace: gpu-benchmark
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/mellanoxnics
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "sriov-rdma-net",
      "type": "sriov",
      "vlan": 0,
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24",
        "exclude": ["192.168.100.0/32", "192.168.100.255/32"]
      }
    }
Pod Annotation
metadata:
  annotations:
    # Request one SR-IOV interface from the named network
    k8s.v1.cni.cncf.io/networks: sriov-rdma-net
    # Multiple interfaces (if needed):
    # k8s.v1.cni.cncf.io/networks: sriov-rdma-net, sriov-rdma-net
    # With explicit interface name:
    # k8s.v1.cni.cncf.io/networks: |
    #   [{"name": "sriov-rdma-net", "interface": "net1"}]
Resource Request
resources:
  requests:
    nvidia.com/gpu: 2
    openshift.io/mellanoxnics: 1  # Allocates 1 SR-IOV VF
  limits:
    nvidia.com/gpu: 2
    openshift.io/mellanoxnics: 1
What Gets Injected Into the Pod
When both annotation AND resource are present:
1. Multus reads the annotation → calls SR-IOV CNI plugin
2. SR-IOV device plugin allocates a VF from the pool
3. CNI plugin:
   - Moves VF netdev into pod network namespace
   - Names it "net1" (second interface after eth0)
   - Applies IPAM (assigns IP from whereabouts range)
4. Device plugin provides:
   - /dev/infiniband/uverbs0 (RDMA user verbs)
   - /dev/infiniband/rdma_cm (connection manager)
Result inside pod:
  eth0: 10.128.4.15/23 (Kubernetes pod network, DNS)
  net1: 192.168.100.5/24 (SR-IOV VF, RDMA-capable)
  /dev/infiniband/uverbs0 (RDMA device)
Namespace-Scoped Network Names
# The annotation references a NetworkAttachmentDefinition in the SAME namespace:
annotations:
  k8s.v1.cni.cncf.io/networks: sriov-rdma-net
# Looks for: NetworkAttachmentDefinition "sriov-rdma-net" in pod's namespace
# Cross-namespace reference (if allowed by policy):
annotations:
  k8s.v1.cni.cncf.io/networks: gpu-infra/sriov-rdma-net
# Looks for: NetworkAttachmentDefinition "sriov-rdma-net" in namespace "gpu-infra"
Verification Commands
# Inside the pod:
# Check net1 exists with IP
ip addr show net1
# Expected: inet 192.168.100.X/24
# Check RDMA device
ibv_devinfo
# Expected: state: PORT_ACTIVE, link_layer: InfiniBand or Ethernet
# Check /dev/infiniband
ls -la /dev/infiniband/
# Expected: uverbs0, rdma_cm
# Verify VF driver
ethtool -i net1
# Expected: driver: mlx5_core
# Ping another worker's net1 (RDMA subnet)
ping -I net1 192.168.100.6 -c 3
Common IPAM Options
# Option 1: Whereabouts (distributed IPAM)
"ipam": {
  "type": "whereabouts",
  "range": "192.168.100.0/24"
}
# Option 2: NVIDIA nv-ipam (GPU-fabric aware)
"ipam": {
  "type": "nv-ipam",
  "poolName": "gpu-rdma-pool"
}
# Option 3: Static (for testing)
"ipam": {
  "type": "static",
  "addresses": [{"address": "192.168.100.10/24"}]
}
# Option 4: Host-local (single-node only)
"ipam": {
  "type": "host-local",
  "subnet": "192.168.100.0/24"
}
Common Issues
net1 not appearing in pod
- Cause: Annotation name doesn't match NetworkAttachmentDefinition name
- Fix: Verify NAD exists in same namespace; check spelling exactly
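When the name check is not obvious by eye, it can help to resolve the annotation value into the NAD names and namespaces it points at. The following is a minimal sketch, assuming the two annotation forms shown earlier (comma-separated names and a JSON list) plus the short form [<namespace>/]<name>[@<interface>]. The NetworkRef type and parse_networks_annotation function are hypothetical helpers for illustration, not part of Multus.

```python
import json
from typing import List, NamedTuple, Optional


class NetworkRef(NamedTuple):
    namespace: Optional[str]  # None means "look in the pod's own namespace"
    name: str                 # NetworkAttachmentDefinition name
    interface: Optional[str]  # None means the default name (net1, net2, ...)


def parse_networks_annotation(value: str) -> List[NetworkRef]:
    """Parse a k8s.v1.cni.cncf.io/networks value into network references."""
    value = value.strip()
    if value.startswith("["):
        # JSON list form: [{"name": ..., "namespace": ..., "interface": ...}]
        return [
            NetworkRef(item.get("namespace"), item["name"], item.get("interface"))
            for item in json.loads(value)
        ]
    refs = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Short form: [<namespace>/]<name>[@<interface>]
        entry, _, interface = entry.partition("@")
        namespace, slash, name = entry.rpartition("/")
        refs.append(NetworkRef(namespace if slash else None, name, interface or None))
    return refs


for ref in parse_networks_annotation("gpu-infra/sriov-rdma-net, sriov-rdma-net"):
    print(ref.namespace or "<pod namespace>", ref.name)
```

Each printed pair is a namespace and name that must match an existing NetworkAttachmentDefinition exactly, including case.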
net1 exists but no IP assigned
- Cause: IPAM exhausted or misconfigured
- Fix: Check whereabouts IP pool; verify range has available addresses
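To judge whether a range can be exhausted, a rough count of the addresses it offers is useful. This sketch counts every address in the range that is not covered by an exclude entry. It assumes the pool is simply "range minus exclude"; the IPAM plugin may reserve additional addresses, so treat the result as an upper bound. The usable_addresses function is a hypothetical helper.

```python
import ipaddress
from typing import Iterable


def usable_addresses(range_cidr: str, exclude: Iterable[str] = ()) -> int:
    """Count addresses in range_cidr that no exclude CIDR covers."""
    pool = ipaddress.ip_network(range_cidr)
    excluded = [ipaddress.ip_network(cidr) for cidr in exclude]
    # Iterating a network yields every address, including the first and last.
    return sum(
        1 for addr in pool
        if not any(addr in net for net in excluded)
    )


total = usable_addresses(
    "192.168.100.0/24",
    exclude=["192.168.100.0/32", "192.168.100.255/32"],
)
print(total)  # 254
```

If the number of pods that can hold a net1 interface at the same time approaches this figure, a wider range is the safer choice.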
/dev/infiniband missing despite net1 present
- Cause: openshift.io/mellanoxnics not in resource request, or VF not RDMA-capable
- Fix: Add resource request; verify SriovNetworkNodePolicy has isRdma: true
"Failed to allocate SR-IOV VF" in events
- Cause: All VFs on the node are in use
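- Fix: Run kubectl get node <node-name> -o json | jq '.status.allocatable' to see how many VFs the node advertises; without a node name the output is a list, so use '.items[].status.allocatable' instead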
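For clusters with many nodes, the same lookup can be scripted over the JSON that kubectl prints. This is a minimal sketch, assuming the JSON comes from kubectl get node -o json and was captured into a string. The allocatable_vfs function and the node names worker-0 and worker-1 are hypothetical.

```python
import json
from typing import Dict

RESOURCE = "openshift.io/mellanoxnics"


def allocatable_vfs(node_json: str, resource: str = RESOURCE) -> Dict[str, int]:
    """Map node name -> allocatable count of `resource` from kubectl JSON output."""
    doc = json.loads(node_json)
    # Without a node name kubectl returns a List; with one it returns a single Node.
    nodes = doc["items"] if doc.get("kind") == "List" else [doc]
    return {
        node["metadata"]["name"]: int(node["status"]["allocatable"].get(resource, "0"))
        for node in nodes
    }


# Hypothetical sample data standing in for real kubectl output.
sample = json.dumps({
    "kind": "List",
    "items": [
        {"metadata": {"name": "worker-0"},
         "status": {"allocatable": {RESOURCE: "8", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "worker-1"},
         "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
    ],
})
print(allocatable_vfs(sample))  # {'worker-0': 8, 'worker-1': 0}
```

Note that allocatable is the total a node advertises, not what is currently free, so the figure still has to be compared with the VFs already requested by pods running on that node.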
Multiple pods get same IP
- Cause: Whereabouts leader election failure or stale IP leases
- Fix: Delete stale whereabouts IP allocations; restart whereabouts pods
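Before deleting allocations, it may be worth confirming which pods actually share an address. Multus records the attached interfaces in each pod's k8s.v1.cni.cncf.io/network-status annotation as a JSON list. The sketch below assumes those annotation strings have already been collected into a dict keyed by pod name; the duplicate_ips function and the pod names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List


def duplicate_ips(pods: Dict[str, str], interface: str = "net1") -> Dict[str, List[str]]:
    """Return {ip: [pod names]} for every IP reported by more than one pod.

    `pods` maps a pod name to the raw JSON string of its
    k8s.v1.cni.cncf.io/network-status annotation.
    """
    owners = defaultdict(list)
    for pod_name, status_json in pods.items():
        for attachment in json.loads(status_json):
            if attachment.get("interface") != interface:
                continue
            for ip in attachment.get("ips", []):
                owners[ip].append(pod_name)
    return {ip: names for ip, names in owners.items() if len(names) > 1}


# Hypothetical annotation values for three worker pods.
statuses = {
    "worker-a": '[{"name": "gpu-benchmark/sriov-rdma-net", "interface": "net1", "ips": ["192.168.100.5"]}]',
    "worker-b": '[{"name": "gpu-benchmark/sriov-rdma-net", "interface": "net1", "ips": ["192.168.100.5"]}]',
    "worker-c": '[{"name": "gpu-benchmark/sriov-rdma-net", "interface": "net1", "ips": ["192.168.100.6"]}]',
}
print(duplicate_ips(statuses))  # {'192.168.100.5': ['worker-a', 'worker-b']}
```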
Best Practices
- Match annotation name to NAD name exactly: the match is case-sensitive
- Always request the device plugin resource: the annotation alone is insufficient
- Use whereabouts or nv-ipam for multi-node: host-local causes IP conflicts
- One VF per pod is typical: openshift.io/mellanoxnics: 1
- Verify with ibv_devinfo inside the pod: this confirms the RDMA device is functional
- Size the IP pool for maximum concurrent pods: each worker needs one IP
- Use VLAN 0 unless switch requires tagged frames for RDMA traffic
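The first two practices can be checked mechanically before a pod is submitted. The following is a minimal sketch, assuming the pod manifest has already been loaded into a dict (for example from YAML). The check_pod function is a hypothetical helper and only tests that both pieces are present, not that the names resolve.

```python
from typing import List

NETWORKS_KEY = "k8s.v1.cni.cncf.io/networks"
VF_RESOURCE = "openshift.io/mellanoxnics"


def check_pod(pod: dict) -> List[str]:
    """Return a list of problems; an empty list means both pieces are present."""
    problems = []
    annotations = pod.get("metadata", {}).get("annotations") or {}
    if NETWORKS_KEY not in annotations:
        problems.append(f"missing annotation {NETWORKS_KEY}")
    containers = pod.get("spec", {}).get("containers", [])
    requested = any(
        VF_RESOURCE in (c.get("resources", {}).get("limits") or {})
        or VF_RESOURCE in (c.get("resources", {}).get("requests") or {})
        for c in containers
    )
    if not requested:
        problems.append(f"no container requests {VF_RESOURCE}")
    return problems


# A pod with the annotation but without the VF resource request.
pod = {
    "metadata": {"annotations": {NETWORKS_KEY: "sriov-rdma-net"}},
    "spec": {"containers": [{"name": "worker",
                             "resources": {"limits": {"nvidia.com/gpu": 2}}}]},
}
print(check_pod(pod))  # ['no container requests openshift.io/mellanoxnics']
```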
Key Takeaways
- Two pieces needed: Multus annotation (which network) + resource request (which device)
- Pod gets net1 (RDMA interface) + /dev/infiniband (verbs device) + IP from IPAM
- eth0 = pod network (DNS, SSH, MPI control) | net1 = RDMA (NCCL data)
- NetworkAttachmentDefinition must exist in the pod's namespace
- SR-IOV device plugin manages VF pool; Multus/CNI handles network namespace moves
- IPAM choice matters: whereabouts for multi-node, nv-ipam for GPU-fabric awareness
