SR-IOV Device Plugin PF Flag on Kubernetes
Configure SR-IOV device plugin PF flag in Kubernetes. Expose physical functions as allocatable resources for exclusive RDMA access.
💡 Quick Answer: The SR-IOV Network Device Plugin can flag Physical Functions (PFs) as allocatable resources alongside or instead of VFs. Set "isRdma": true on the PF resource pool to expose the PF’s RDMA device directly, or use the "pfNames" selector to target specific PFs. This is critical for single-tenant GPU clusters where pods need the full PF bandwidth for GPUDirect RDMA without VF overhead.
The Problem
By default, the SR-IOV device plugin only exposes Virtual Functions (VFs) as Kubernetes resources. But some workloads need direct Physical Function access:
- GPUDirect RDMA at full line rate — VFs add overhead and limit MTU/features
- DPDK applications that need PF-level control
- Single-tenant GPU nodes where VF splitting is unnecessary
- RDMA verbs on the PF — some NICs don’t support RDMA on VFs without extra config
- Monitoring and diagnostics — PF exposes full hardware counters
- NFSoRDMA or storage-path networking — PF provides maximum throughput
The Solution
Device Plugin ConfigMap — Flagging PFs
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourceName": "mlnx_rdma_pf",
          "resourcePrefix": "nvidia.com",
          "selectors": {
            "vendors": ["15b3"],
            "devices": ["101b"],
            "pfNames": ["ens8f0", "ens8f1"],
            "isRdma": true,
            "needVhostNet": false
          }
        },
        {
          "resourceName": "mlnx_rdma_vf",
          "resourcePrefix": "nvidia.com",
          "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["mlx5_core"],
            "isRdma": true
          }
        }
      ]
    }
Key: the "pfNames" selector combined with the PF’s PCI device ID (e.g., 101b for ConnectX-6) tells the device plugin to register the PF itself as an allocatable resource, not its VFs.
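To make the selection logic concrete, here is a simplified Python model of how a resource pool's selectors narrow down devices. It is an illustrative sketch, not the device plugin's actual code: the `Device` class, the `matches` function, and the sample inventory are hypothetical, and the model assumes only that a device must satisfy every selector that is present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical, simplified view of a PCI network device."""
    vendor: str      # PCI vendor ID, e.g. "15b3"
    device_id: str   # PCI device ID, e.g. "101b" (PF) or "101e" (VF)
    pf_name: str     # name of the PF netdev this device is or belongs to


def matches(device: Device, selectors: dict) -> bool:
    """Return True if the device satisfies every list selector that is set.

    Assumption: selectors are ANDed together, and a missing selector
    places no restriction on the device.
    """
    checks = {
        "vendors": device.vendor,
        "devices": device.device_id,
        "pfNames": device.pf_name,
    }
    return all(
        value in selectors[key]
        for key, value in checks.items()
        if key in selectors
    )


# Selectors copied from the PF pool in the ConfigMap above.
PF_POOL = {
    "vendors": ["15b3"],
    "devices": ["101b"],
    "pfNames": ["ens8f0", "ens8f1"],
}

# Made-up inventory: one PF and one of its VFs.
INVENTORY = [
    Device("15b3", "101b", "ens8f0"),  # the PF itself
    Device("15b3", "101e", "ens8f0"),  # a VF under that PF
]

selected = [d for d in INVENTORY if matches(d, PF_POOL)]
print(selected)
```

In this model only the PF entry survives: the VF shares the vendor and the PF name, but its device ID is not in the pool's "devices" list. That is why the device ID is the part of the selector that separates PFs from VFs.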
How PF Flagging Works
graph TD
subgraph NIC: ConnectX-6
PF0[PF0: ens8f0<br/>PCI 101b<br/>200Gb/s full BW]
VF0[VF0: ens8f0v0]
VF1[VF1: ens8f0v1]
VF2[VF2: ens8f0v2]
PF0 --> VF0
PF0 --> VF1
PF0 --> VF2
end
subgraph Device Plugin
DP[SR-IOV Device Plugin]
DP --> |pfNames selector<br/>flags PF| RES_PF[nvidia.com/mlnx_rdma_pf: 1]
DP --> |VF device IDs| RES_VF[nvidia.com/mlnx_rdma_vf: 3]
end
style PF0 fill:#76B900,color:white
style RES_PF fill:#4CAF50,color:white
style RES_VF fill:#2196F3,color:white
Pod Requesting PF Directly
apiVersion: v1
kind: Pod
metadata:
  name: rdma-gpu-training
spec:
  containers:
    - name: pytorch
      image: nvcr.io/nvidia/pytorch:25.11-py3
      resources:
        limits:
          nvidia.com/gpu: 8
          nvidia.com/mlnx_rdma_pf: 1 # Full PF, not a VF
      env:
        - name: NCCL_IB_HCA
          value: "mlx5_0" # PF RDMA device
        - name: NCCL_NET_GDR_LEVEL
          value: "SYS"
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
PF vs VF: When to Use Each
| Aspect | PF (Physical Function) | VF (Virtual Function) |
|---|---|---|
| Bandwidth | Full line rate (200/400Gb/s) | Shared, limited by VF count |
| Features | All NIC features available | Subset of PF features |
| RDMA | Always supported | Requires isRdma: true + driver |
| Isolation | No hardware isolation | Hardware-level isolation |
| Use case | Single-tenant GPU, max BW | Multi-tenant, shared nodes |
| Device type | netdevice | netdevice or vfio-pci |
| GPUDirect | Best perf, direct DMA path | Slight overhead |
Verifying PF Allocation
# Check node allocatable resources
kubectl get node gpu-node-1 -o json | jq '.status.allocatable' | grep mlnx
# "nvidia.com/mlnx_rdma_pf": "2" ← 2 PFs available
# Check device plugin pods
kubectl get pods -n kube-system -l app=sriovdp
# Verify RDMA device inside pod
kubectl exec -it rdma-gpu-training -- ibv_devinfo
# hca_id: mlx5_0
# transport: InfiniBand (0)
# fw_ver: 28.39.1002
# node_guid: 0x...
# sys_image_guid: 0x...
# phys_port_cnt: 1
# port: 1
# state: PORT_ACTIVE (4)
# max_mtu: 4096 (5)
# active_mtu: 4096 (5)
# link_layer: Ethernet
# Verify PF bandwidth (not VF-limited)
kubectl exec -it rdma-gpu-training -- ibv_devinfo -v | grep active_speed
# active_speed: 200 Gb/sec (128)
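# Check NCCL picks up RDMA (env passes the variables to the command inside the container)
kubectl exec -it rdma-gpu-training -- \
  env NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=NET python -c "import torch.distributed"
# NCCL logs its transport when a communicator is created, so look for this line in a job that initializes a process group:
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ← RDMA confirmed
Shared RDMA Device Plugin (Alternative)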
For simpler setups where you just need RDMA access without SR-IOV VF management:
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [
        {
          "resourceName": "rdma_shared_device",
          "resourcePrefix": "nvidia.com",
          "rdmaHcaMax": 100,
          "devices": ["ens8f0", "ens8f1"]
        }
      ]
    }
The shared RDMA device plugin exposes PF RDMA devices as shared resources (many pods share one PF). Use this on single-tenant nodes where pods can share the PF; use SR-IOV PF flagging when you need the device plugin to manage exclusive PF access.
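The difference between exclusive and shared allocation can be sketched with a toy model. The `Pool` class below is hypothetical and only counts allocations; it assumes that an exclusive PF pool advertises one unit per PF, while the shared plugin advertises `rdmaHcaMax` units for the same hardware.

```python
class Pool:
    """Toy allocator: hands out units until capacity is exhausted."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.allocated = 0

    def allocate(self, pod: str) -> bool:
        """Return True if the pod received a unit, False if the pool is full."""
        if self.allocated >= self.capacity:
            return False
        self.allocated += 1
        return True


# Assumption: two PFs (ens8f0, ens8f1) flagged exclusively, so capacity is 2.
exclusive = Pool("nvidia.com/mlnx_rdma_pf", capacity=2)
# Assumption: the shared plugin advertises rdmaHcaMax units, so capacity is 100.
shared = Pool("nvidia.com/rdma_shared_device", capacity=100)

pods = ["train-0", "train-1", "train-2"]
exclusive_results = [exclusive.allocate(p) for p in pods]
shared_results = [shared.allocate(p) for p in pods]

print(exclusive_results)  # [True, True, False]
print(shared_results)     # [True, True, True]
```

In a real cluster the scheduler does not return False; a pod that requests an exhausted extended resource stays Pending until a unit is released.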
OpenShift SR-IOV Operator — PF Policy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: pf-rdma-policy
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: mlnxRdmaPf
  numVfs: 0 # Zero VFs = expose PF only
  nicSelector:
    pfNames:
      - ens8f0
      - ens8f1
    vendor: "15b3"
    deviceID: "101b"
  deviceType: netdevice # Must be netdevice for RDMA
  isRdma: true
Setting numVfs: 0 with isRdma: true tells the SR-IOV operator to expose the PF as an RDMA resource without creating any VFs.
Network Attachment for PF
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-pf-net
  namespace: gpu-workloads
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/mlnx_rdma_pf
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "rdma-pf-network",
      "type": "host-device",
      "device": "ens8f0",
      "ipam": {
        "type": "whereabouts",
        "range": "10.10.10.0/24"
      }
    }
Pod annotation to attach:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-pf-net
Common Issues
PF not showing as allocatable resource
The device plugin selector doesn’t match the PF’s PCI device ID. Check lspci -nn | grep Mellanox: PF and VF have different device IDs (e.g., 101b for a ConnectX-6 PF; VF IDs such as 101c or 101e depend on the NIC generation).
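This check can be scripted. The sketch below filters a node's allocatable resources by name prefix; the helper `allocatable_with_prefix` and the sample document are hypothetical, and the input is assumed to have the shape produced by `kubectl get node gpu-node-1 -o json`.

```python
import json


def allocatable_with_prefix(node_json: str, prefix: str) -> dict:
    """Return allocatable resources whose name starts with the prefix."""
    node = json.loads(node_json)
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(count)
        for name, count in allocatable.items()
        if name.startswith(prefix)
    }


# Made-up sample shaped like `kubectl get node gpu-node-1 -o json`.
SAMPLE = json.dumps({
    "status": {
        "allocatable": {
            "cpu": "96",
            "nvidia.com/gpu": "8",
            "nvidia.com/mlnx_rdma_pf": "2",
        }
    }
})

found = allocatable_with_prefix(SAMPLE, "nvidia.com/mlnx_")
print(found)
if not found:
    print("No matching resource: re-check the selectors against lspci -nn")
```

An empty result for the PF resource name points back at the selectors, since the plugin only advertises devices that match every selector in the pool.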
Pod gets PF but no RDMA device inside
isRdma: true missing in the device plugin config, or the container runtime doesn’t mount /dev/infiniband/ devices. Ensure the RDMA subsystem is in shared mode: rdma system set netns shared.
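To automate the in-pod check, the output of ibv_devinfo can be parsed for the port state. This is a minimal sketch: `parse_ibv_devinfo` is a hypothetical helper, the sample text is abbreviated from the output shown under Verifying PF Allocation, and the parser is only meant for single-device, single-port output.

```python
def parse_ibv_devinfo(text: str) -> dict:
    """Collect `key: value` pairs from ibv_devinfo-style output.

    Simplification: later keys overwrite earlier ones, so this only
    suits single-device, single-port output like the sample below.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """\
hca_id: mlx5_0
        transport:      InfiniBand (0)
        phys_port_cnt:  1
                port:   1
                        state:          PORT_ACTIVE (4)
                        active_mtu:     4096 (5)
                        link_layer:     Ethernet
"""

info = parse_ibv_devinfo(SAMPLE)
rdma_ready = info.get("state", "").startswith("PORT_ACTIVE")
print(info["hca_id"], info["link_layer"], rdma_ready)
```

If the command prints no devices at all, the parsed dictionary is empty and `rdma_ready` is False, which corresponds to the missing-device case described above.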
PF allocated but bandwidth is VF-limited
VFs are still created and consuming PF bandwidth. Set numVfs: 0 in the SR-IOV policy to ensure no VFs exist, giving the PF full line rate.
“resource already allocated” error
PF is a single resource — only one pod can claim it exclusively. For shared access, use the RDMA shared device plugin instead.
Best Practices
- PF for single-tenant GPU nodes: maximum bandwidth, no VF overhead
- VFs for multi-tenant: hardware isolation between tenants
- Set numVfs: 0 when exposing the PF, because VFs take bandwidth and features
- isRdma: true is mandatory; without it, RDMA devices aren’t mounted
- deviceType: netdevice for RDMA; vfio-pci bypasses the kernel, so no RDMA verbs
- Use the shared RDMA plugin for simplicity when you don’t need exclusive PF allocation
- Verify with ibv_devinfo: confirm the RDMA device is visible and active inside the pod
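The policy-related practices above can be turned into a small lint. This is a hedged sketch rather than an operator feature: `lint_pf_policy` is a hypothetical helper that inspects the `spec` of a SriovNetworkNodePolicy already loaded as a Python dict, and it assumes that a missing deviceType means netdevice.

```python
def lint_pf_policy(spec: dict) -> list:
    """Return a list of human-readable problems; empty means no findings."""
    problems = []
    # A missing numVfs is flagged too, since the value must be explicit.
    if spec.get("numVfs") != 0:
        problems.append("numVfs should be 0 when exposing the PF only")
    if spec.get("isRdma") is not True:
        problems.append("isRdma must be true for RDMA devices to be mounted")
    # Assumption: an absent deviceType is treated as netdevice.
    if spec.get("deviceType", "netdevice") != "netdevice":
        problems.append("deviceType must be netdevice for RDMA verbs")
    return problems


# Spec fields taken from the PF policy in this article.
GOOD = {
    "resourceName": "mlnxRdmaPf",
    "numVfs": 0,
    "deviceType": "netdevice",
    "isRdma": True,
}
# Made-up counterexample that violates all three rules.
BAD = {"resourceName": "mlnxRdmaPf", "numVfs": 8, "deviceType": "vfio-pci"}

print(lint_pf_policy(GOOD))      # []
print(len(lint_pf_policy(BAD)))  # 3
```

Running a check like this in CI before applying a policy catches the most common reasons a PF ends up without an RDMA device.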
Key Takeaways
- SR-IOV device plugin can flag PFs as allocatable resources using the pfNames selector
- PF flagging gives pods full NIC bandwidth without VF overhead
- Set numVfs: 0 in the SR-IOV policy to expose the PF only (no VFs)
- Critical for GPUDirect RDMA at maximum line rate in single-tenant GPU clusters
- isRdma: true + deviceType: netdevice are mandatory for RDMA on PF
- Shared RDMA device plugin is the simpler alternative for non-exclusive PF access
- PF is a single allocatable resource: one pod per PF for exclusive access

