OpenShift SR-IOV Network with NVIDIA IPAM for GPU Fabric
Configure SriovNetwork resources on OpenShift with nv-ipam for GPU fabric IP allocation. SR-IOV Network Operator setup, Mellanox NIC resource targeting, IPAM
💡 Quick Answer: Create a SriovNetwork CR in the openshift-sriov-network-operator namespace to define a GPU fabric network that uses nv-ipam for IP allocation. The CR specifies the SR-IOV resource name (Mellanox NICs), the IPAM configuration with its pool name, and the target namespace. The operator automatically generates a NetworkAttachmentDefinition that pods reference via the k8s.v1.cni.cncf.io/networks annotation.
The Problem
- GPU nodes have dedicated Mellanox NICs for RDMA fabric but need automated IP management
- Manual IP assignment doesn’t scale across hundreds of GPU pods
- Need SR-IOV VFs attached to pods for GPUDirect RDMA with proper IPAM
- Standard DHCP doesn’t integrate with GPU topology awareness
- Must align network resources with GPU NUMA locality
The Solution
SriovNetwork with nv-ipam
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: gpu-fabric
  namespace: openshift-sriov-network-operator
  finalizers:
    - netattdef.finalizers.sriovnetwork.openshift.io
spec:
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "gpu-fabric"
    }
  logLevel: info
  networkNamespace: gpu-workloads
  resourceName: mellanoxnics

Prerequisites: SriovNetworkNodePolicy
# First define which NICs to use and how many VFs to create
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mellanox-gpu-fabric
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mellanoxnics
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  numVfs: 8
  nicSelector:
    vendor: "15b3"      # Mellanox
    deviceID: "1021"    # ConnectX-7 (101d would be ConnectX-6 Dx)
    pfNames: ["ens8f0", "ens8f1"]
  deviceType: netdevice # Mellanox NICs use netdevice, including for DPDK
  isRdma: true          # Enable RDMA on VFs
  linkType: IB          # InfiniBand (or eth for RoCE)

nv-ipam IPPool Configuration
# Define IP pool for the GPU fabric
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: gpu-fabric
  namespace: nvidia-network-operator
spec:
  subnet: "10.10.0.0/16"
  perNodeBlockSize: 64 # 64 IPs per node
  gateway: "10.10.0.1"
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/gpu-worker
            operator: Exists

Pod Using the SR-IOV Network
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
  namespace: gpu-workloads
  annotations:
    k8s.v1.cni.cncf.io/networks: gpu-fabric
spec:
  containers:
    - name: trainer
      image: registry.example.com/training:v1
      resources:
        requests:
          nvidia.com/gpu: "4"
          openshift.io/mellanoxnics: "1" # Request one SR-IOV VF
        limits:
          nvidia.com/gpu: "4"
          openshift.io/mellanoxnics: "1"
      env:
        - name: NCCL_IB_HCA
          value: "mlx5"
        - name: NCCL_NET_GDR_LEVEL
          value: "PIX"

Multiple Networks (Storage + GPU Fabric)
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: gpu-rdma-fabric
  namespace: openshift-sriov-network-operator
spec:
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "rdma-fabric"
    }
  logLevel: info
  networkNamespace: gpu-workloads
  resourceName: mellanoxnics-ib
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: storage-network
  namespace: openshift-sriov-network-operator
spec:
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "storage-net"
    }
  networkNamespace: gpu-workloads
  resourceName: mellanoxnics-eth

# Pod with dual networks
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "gpu-rdma-fabric", "namespace": "gpu-workloads"},
        {"name": "storage-network", "namespace": "gpu-workloads"}
      ]

Verify the Generated NetworkAttachmentDefinition
# SriovNetwork operator auto-creates NAD in target namespace
oc get net-attach-def -n gpu-workloads
# NAME AGE
# gpu-fabric 5d
oc get net-attach-def gpu-fabric -n gpu-workloads -o yaml
# Shows the generated CNI config with SR-IOV device plugin + nv-ipam
# Check SR-IOV VF allocation
oc get sriovnetworknodestates -n openshift-sriov-network-operator
oc describe sriovnetworknodestate gpu-worker-0 -n openshift-sriov-network-operator

ArgoCD Integration
# The SriovNetwork CR works with ArgoCD GitOps
# Annotations show tracking:
metadata:
  annotations:
    argocd.argoproj.io/tracking-id: "openshift-sriov-network:sriovnetwork.openshift.io/SriovNetwork:openshift-sriov-network-operator/gpu-fabric"

Verify RDMA Connectivity in Pod
# Inside the pod — check SR-IOV interface
ip addr show # net1 = SR-IOV VF with nv-ipam assigned IP
# Test RDMA
ibv_devinfo # Should show VF device
ib_write_bw -d mlx5_0 --report_gbits # Server
ib_write_bw -d mlx5_0 --report_gbits <peer-ip> # Client
# Verify GPUDirect RDMA path
nvidia-smi topo -m | grep -E "NIC|mlx5"
# NIC should show PIX to local GPUs

Common Issues
NetworkAttachmentDefinition not created in target namespace
- Cause: networkNamespace doesn’t exist, or the operator lacks permissions
- Fix: Create the target namespace first; verify that the operator is permitted to create NetworkAttachmentDefinitions in that namespace
Pod stuck Pending — “insufficient SR-IOV resources”
- Cause: All VFs are allocated, or the SriovNetworkNodePolicy has not been applied yet
- Fix: Check oc get sriovnetworknodestates; increase numVfs; wait for the node drain/reboot that follows a policy change
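To see whether a node actually advertises VFs, the allocatable count of the extended resource can be read from the Node object. The following is a minimal sketch, assuming the openshift.io resource prefix used in the pod spec and the node name gpu-worker-0 from the verification commands. The helper names vf_resource and vf_jsonpath are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: build the extended resource name and a jsonpath
# expression that reads its allocatable count from a Node object.

# resourceName from the SriovNetworkNodePolicy -> resource name pods request
vf_resource() {
  printf 'openshift.io/%s\n' "$1"
}

# Dots inside the resource key must be escaped in a jsonpath expression
vf_jsonpath() {
  printf '{.status.allocatable.%s}\n' "$(vf_resource "$1" | sed 's/\./\\./g')"
}

vf_resource mellanoxnics
vf_jsonpath mellanoxnics

# On a cluster (not run here):
#   oc get node gpu-worker-0 -o jsonpath="$(vf_jsonpath mellanoxnics)"
```

An empty or zero result from the oc command would suggest that the policy has not finished applying on that node or that the resourceName differs from the one the pod requests.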
nv-ipam not assigning IPs
- Cause: IPPool not created, pool name mismatch, or nv-ipam controller not running
- Fix: Verify that an IPPool CR exists with a matching poolName; check the nv-ipam-controller logs
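A pool name mismatch is easy to miss because the name lives inside a JSON string in the SriovNetwork spec. The sketch below extracts poolName from an IPAM snippet and compares it with the expected IPPool name. It assumes poolName is a plain quoted string; the function names pool_name_of and pool_matches are hypothetical, and the oc commands in the comments are only one way the inputs might be fetched.

```shell
#!/usr/bin/env bash
# Extract the "poolName" value from an nv-ipam IPAM JSON snippet.
# Uses sed only, so it assumes poolName is a simple quoted string.
pool_name_of() {
  printf '%s' "$1" | tr -d '\n' |
    sed -n 's/.*"poolName"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed when the IPAM config references the expected IPPool name.
pool_matches() {
  [ "$(pool_name_of "$1")" = "$2" ]
}

# On a cluster, the inputs could be fetched like this (not run here):
#   ipam=$(oc get sriovnetwork gpu-fabric -n openshift-sriov-network-operator \
#            -o jsonpath='{.spec.ipam}')
#   oc get ippools.nv-ipam.nvidia.com -n nvidia-network-operator
ipam='{ "type": "nv-ipam", "poolName": "gpu-fabric" }'

if pool_matches "$ipam" "gpu-fabric"; then
  echo "poolName matches IPPool gpu-fabric"
else
  echo "poolName mismatch: $(pool_name_of "$ipam")"
fi
```

If jq is available on the workstation, it would be a more robust way to parse the JSON than sed.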
RDMA not working on VF
- Cause: isRdma: true not set in SriovNetworkNodePolicy, or wrong deviceType
- Fix: Set isRdma: true; use deviceType: netdevice for RDMA (not vfio-pci)
Best Practices
- Use nv-ipam over DHCP: purpose-built for GPU fabric, topology-aware pools
- Set isRdma: true, which is required for GPUDirect RDMA on SR-IOV VFs
- Match resourceName across SriovNetworkNodePolicy and SriovNetwork
- Separate InfiniBand and Ethernet: different SriovNetworkNodePolicies per link type
- Size perNodeBlockSize in the IPPool to allocate enough IPs for the maximum number of pods per node
- Finalizers protect cleanup: don’t force-delete SriovNetwork CRs
- GitOps-friendly: SriovNetwork CRs work well with ArgoCD tracking
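The perNodeBlockSize advice can be checked with simple arithmetic. This sketch uses the values from the IPPool example (a /16 subnet and a block size of 64) to estimate how many node blocks the subnet can hold. It ignores any addresses that nv-ipam may reserve, such as the gateway, so treat the result as an upper bound. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Number of addresses in an IPv4 subnet with the given prefix length.
subnet_size() {
  echo $(( 1 << (32 - $1) ))
}

# How many per-node blocks of the given size fit into the subnet.
node_blocks() {
  echo $(( $(subnet_size "$1") / $2 ))
}

prefix=16       # from subnet: "10.10.0.0/16"
block_size=64   # from perNodeBlockSize: 64

echo "addresses in subnet: $(subnet_size "$prefix")"
echo "node blocks available: $(node_blocks "$prefix" "$block_size")"
```

With these example values the subnet holds 65536 addresses, or 1024 blocks of 64. A block of 64 comfortably covers a node whose policy creates 8 VFs on each of two PFs, assuming one IP per VF-attached pod.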
Key Takeaways
- SriovNetwork CR defines the network; the operator auto-generates the NetworkAttachmentDefinition
- nv-ipam provides GPU-fabric-aware IP allocation with per-node pools
- resourceName links SriovNetwork to SriovNetworkNodePolicy (which defines the VFs)
- networkNamespace determines where the NAD is created (where pods consume it)
- Pods request VFs via resource limits (openshift.io/<resourceName>: "1")
- isRdma: true + deviceType: netdevice = RDMA-capable SR-IOV VFs
- Finalizer netattdef.finalizers.sriovnetwork.openshift.io ensures clean NAD deletion

