SR-IOV VF to Container Mapping and Lifecycle
How SR-IOV Virtual Functions are mapped to containers in Kubernetes. Covers VF allocation flow, link state management (VFs are down when unassigned), device
💡 Quick Answer: SR-IOV VFs are mapped to containers via the device plugin and Multus CNI. VFs show link state "down" when not assigned to any Pod; this is normal. When a Pod requests a VF, the device plugin assigns it, Multus moves it into the Pod's network namespace, and the link comes up. On Pod deletion, the VF returns to the pool with link state down.
The Problem
Questions that arise when managing SR-IOV VFs on GPU nodes:
- "Why are VFs showing link state DOWN?" Normal: they're not assigned to a Pod
- How does a VF get inside a container's network namespace?
- What happens to the VF when the Pod dies?
- How does Kubernetes know which VF to assign to which Pod?
The Solution
VF Lifecycle: From Creation to Container
VF Lifecycle:
----------------------------------------------------------------
1. NODE BOOT -> SR-IOV Config Daemon creates VFs
   +-------------------------------------------------+
   | PF: mlx5_0 (ens1f0np0)                          |
   |  |-- VF0: ens1f0v0  state: DOWN  --+            |
   |  |-- VF1: ens1f0v1  state: DOWN    |            |
   |  |-- VF2: ens1f0v2  state: DOWN    | Normal!    |
   |  |-- VF3: ens1f0v3  state: DOWN    | Not in use |
   |  `-- ...                         --+            |
   +-------------------------------------------------+
2. DEVICE PLUGIN registers VFs as allocatable resources
   Node capacity: openshift.io/mellanoxnics: 16
3. POD SCHEDULED -> kubelet calls device plugin Allocate()
   Device plugin returns:
   * VF PCI address (e.g., 0000:ca:00.7)
   * Device mounts (/dev/infiniband/uverbs31, rdma_cm)
   * Environment (PCIDEVICE_OPENSHIFT_IO_MELLANOXNICS_INFO)
4. MULTUS CNI moves VF into Pod network namespace
   +------------------------+
   | Pod netns              |
   |  |-- eth0 (default)    |  <- OVN/Calico veth
   |  |-- rdma0 (VF)        |  <- SR-IOV VF moved here
   |  |     state: UP       |
   |  |     IP: 10.100.0.17 |  <- Assigned by nv-ipam
   |  `-- /dev/infiniband/  |  <- RDMA devices mounted
   +------------------------+
5. POD DELETED -> CNI DEL moves VF back to host namespace
   VF returns to state: DOWN (available for next Pod)

VF Link State Explained
# On the host: checking VF states
ip link show ens1f0np0
# Output (simplified):
# ens1f0np0: <BROADCAST,MULTICAST,UP> ... state UP
#   vf 0 ... link-state auto (state: down)   <- Not assigned
#   vf 1 ... link-state auto (state: down)   <- Not assigned
#   vf 2 ... link-state auto (state: up)     <- In use by Pod
#   vf 3 ... link-state auto (state: down)   <- Not assigned
# This is NORMAL:
# DOWN = VF is idle, not assigned to any Pod
# UP = VF is inside a Pod's network namespace
# AUTO = link state follows actual link (up when connected)

VF Link States:
----------------------------------------------------------------
State        Meaning                     Action Needed?
----------------------------------------------------------------
down         Not assigned to Pod         No (normal, idle VF)
up           Assigned to running Pod     No (normal, in use)
auto (down)  Auto mode, no Pod           No (normal, waiting)
disable      Administratively disabled   Check policy
error        Hardware/driver issue       Investigate

Key insight: VFs SHOULD be down when not in use.
An "up" VF without a Pod means something leaked.

The Full Allocation Flow
Pod Request -> Device Plugin -> Multus -> Container
----------------------------------------------------------------
Step 1: Pod spec requests SR-IOV resource
  resources:
    requests:
      openshift.io/mellanoxnics: "1"
Step 2: Scheduler finds node with available VF
  Node gpu-worker-01:
    Allocatable: openshift.io/mellanoxnics: 16
    Allocated:   openshift.io/mellanoxnics: 3
    Available:   13
Step 3: kubelet calls device plugin gRPC Allocate()
  -> Device plugin picks next free VF: 0000:ca:00.7
  -> Returns AllocateResponse:
     {
       envs: {
         "PCIDEVICE_OPENSHIFT_IO_MELLANOXNICS": "0000:ca:00.7"
       },
       mounts: [
         {containerPath: "/dev/infiniband/uverbs31", hostPath: "..."},
         {containerPath: "/dev/infiniband/rdma_cm", hostPath: "..."}
       ]
     }
Step 4: Container runtime creates Pod sandbox
Step 5: Multus CNI is called (ADD)
  -> Reads network annotation: k8s.v1.cni.cncf.io/networks: gpu-rdma
  -> Calls SR-IOV CNI plugin
  -> SR-IOV CNI:
     a) Finds VF by PCI address (0000:ca:00.7)
     b) Gets VF netdev name (ens1f0v7)
     c) Moves VF to Pod network namespace
     d) Renames to requested interface (rdma0)
     e) Sets link UP
     f) Calls IPAM (nv-ipam) -> assigns 10.100.0.X
     g) Configures IP on interface
Step 6: Pod is running with VF
  -> VF is in Pod's netns, state UP, IP assigned
  -> RDMA devices mounted at /dev/infiniband/
  -> NCCL can use it for GPU-Direct RDMA
Step 7: Pod terminates
  -> Multus CNI DEL called
  -> SR-IOV CNI moves VF back to host namespace
  -> VF state returns to DOWN
  -> Device plugin marks VF as available
  -> IP returned to nv-ipam pool

Inspect VF-to-Pod Mapping
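The mapping can also be read from inside a container. Step 3 of the allocation flow returns an environment variable holding the VF's PCI address. The sketch below assumes the SR-IOV device plugin naming convention (PCIDEVICE_ followed by the resource name, uppercased, with punctuation replaced by underscores) and assumes that multiple addresses arrive comma-separated. The helper names resource_to_env and split_pci_addrs, and the second PCI address in the demo, are hypothetical and used only for illustration.

```shell
# Hypothetical helper: build the env var name for an SR-IOV resource name.
resource_to_env() {
  # Uppercase, then turn every character outside A-Z0-9 into "_"
  printf 'PCIDEVICE_%s\n' \
    "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -c 'A-Z0-9' '_')"
}

# Hypothetical helper: split a comma-separated PCI address list, one per line.
split_pci_addrs() {
  printf '%s\n' "$1" | tr ',' '\n' | sed '/^$/d'
}

# Demo with the resource name and first PCI address used in this article;
# the second address is made up to show the multi-VF case.
var=$(resource_to_env "openshift.io/mellanoxnics")
echo "$var"
split_pci_addrs "0000:ca:00.7,0000:ca:01.0"
```

Inside a Pod, passing the value of that variable (for example via printenv "$var") to split_pci_addrs would list the VFs handed to the container, which can then be cross-checked against the host-side methods in this section.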
# Which Pods are using which VFs?
# Method 1: Pods requesting the resource, with their Multus network-status annotation
kubectl get pods -n ai-training -o json | jq -r '
  .items[] | select(.spec.containers[].resources.requests["openshift.io/mellanoxnics"]) |
  "\(.metadata.name): \(.metadata.annotations["k8s.v1.cni.cncf.io/network-status"])"'
# Method 2: From the node, find VFs in non-default namespaces
ip netns list
# Each Pod has a netns; VFs moved there are "missing" from host
# Method 3: Check PCI device assignment
ls /sys/bus/pci/devices/0000:ca:00.7/net/
# Empty = VF is in a Pod's netns
# Shows interface name = VF is on host (available)
# Method 4: kubelet device checkpoint, read through the device plugin Pod
kubectl exec -n openshift-sriov-network-operator sriov-device-plugin-<hash> -- \
  cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
# Shows allocated devices per Pod
# Method 5: Check from inside the Pod
kubectl exec -it nccl-training -- ip link show rdma0
# Shows the VF interface with its state and MAC

VF Configuration by Device Plugin
# Per-VF settings applied on the PF before the Pod uses the VF.
# (With the SR-IOV operator these usually come from the SriovNetwork /
# network attachment settings and are applied by the SR-IOV CNI at ADD time.)
# Spoofcheck: usually disabled for RDMA
ip link set ens1f0np0 vf 7 spoofchk off
# Trust: required for RDMA QP operations
ip link set ens1f0np0 vf 7 trust on
# VLAN (if configured)
ip link set ens1f0np0 vf 7 vlan 100
# MAC (if specified, otherwise auto)
ip link set ens1f0np0 vf 7 mac 00:11:22:33:44:55
# Link state auto (comes up when moved to netns)
ip link set ens1f0np0 vf 7 state auto
# Rate limiting (if QoS configured); the value is in Mbps
ip link set ens1f0np0 vf 7 max_tx_rate 50000   # 50 Gbps

Verify VF Health
# Quick health check: all VFs accounted for
PF=ens1f0np0
TOTAL_VFS=$(cat /sys/class/net/${PF}/device/sriov_numvfs)
IN_USE=0
# Walk every VF. A VF with no netdev visible on the host has been moved
# into a Pod netns; otherwise report its name and operstate.
# A VF that is on the host but reports state "up" is worth investigating.
for i in $(seq 0 $((TOTAL_VFS-1))); do
  vf_dir=$(readlink -f "/sys/class/net/${PF}/device/virtfn${i}")
  vf_name=$(ls "${vf_dir}/net/" 2>/dev/null)
  if [ -z "${vf_name}" ]; then
    IN_USE=$((IN_USE+1))
    echo "VF $i: in Pod netns (IN USE)"
  else
    state=$(cat "${vf_dir}/net/${vf_name}/operstate")
    echo "VF $i: on host as ${vf_name} (state: ${state})"
  fi
done
echo "Total VFs: $TOTAL_VFS"
echo "In use: $IN_USE"
echo "Available: $((TOTAL_VFS - IN_USE))"

What Happens on Pod Crash/Eviction
Scenario                    VF Behavior
----------------------------------------------------------------
Normal Pod delete           CNI DEL -> VF back to host -> DOWN -> available
Pod crash (OOM/segfault)    CRI cleanup -> CNI DEL -> VF back -> available
Node reboot                 All VFs recreated by config daemon -> DOWN
kubelet restart             Existing Pods keep VFs; no change
Pod eviction                Same as normal delete (graceful)
Force delete (no grace)     CNI DEL still called by CRI -> VF recovered

Edge case: CNI DEL fails
  -> VF stuck in dead netns
  -> Fix: restart sriov-device-plugin Pod on that node
  -> Or: reboot node (nuclear option)

Common Issues
All VFs showing "down": is something broken?
- Cause: No Pods requesting SR-IOV resources on this node
- Fix: This is normal! VFs are down when not assigned. They come up when a Pod uses them.
VF stuck "up" but Pod is gone
- Cause: CNI DEL failed during Pod cleanup (rare race condition)
- Fix: Restart sriov-device-plugin Pod; it re-syncs state
Pod can't get VF: "insufficient resources"
- Cause: All VFs allocated to other Pods (or stuck)
- Fix: Check kubectl describe node | grep mellanox; free leaked VFs
VF in Pod shows "NO-CARRIER"
- Cause: Physical link on PF is down (cable, switch port)
- Fix: Check ip link show ens1f0np0 on host; PF must be UP first
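For the "insufficient resources" case, the grep output is easier to act on when reduced to three numbers. The following is a minimal sketch, assuming the usual kubectl describe node layout in which the resource appears as "name: N" under Capacity and then Allocatable, and as "name requests limits" under Allocated resources. The function name vf_pool_summary is hypothetical, and the sample text is hand-written to mirror the numbers used in the allocation flow, not captured from a real cluster.

```shell
# Hypothetical helper: summarize one extended resource from
# `kubectl describe node <node>` text supplied on stdin (single node assumed).
vf_pool_summary() {
  awk -v res="$1" '
    # "name:  N" lines appear under Capacity first, then under Allocatable
    $1 == (res ":") { seen++; if (seen == 2) alloc = $2 }
    # "name  requests  limits" line appears under "Allocated resources"
    $1 == res       { used = $2 }
    END { printf "allocatable=%d requested=%d free=%d\n", alloc, used, alloc - used }
  '
}

# Illustrative input only; real output contains many more lines.
sample='Capacity:
  cpu:                        64
  openshift.io/mellanoxnics:  16
Allocatable:
  cpu:                        63500m
  openshift.io/mellanoxnics:  16
Allocated resources:
  Resource                   Requests  Limits
  --------                   --------  ------
  cpu                        12 (18%)  20 (31%)
  openshift.io/mellanoxnics  3         3'

printf '%s\n' "$sample" | vf_pool_summary "openshift.io/mellanoxnics"
```

On a live cluster the same function could be fed with kubectl describe node gpu-worker-01 piped into vf_pool_summary openshift.io/mellanoxnics. A free count of zero with few running SR-IOV Pods would point toward leaked VFs rather than genuine exhaustion.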
Best Practices
- Don't panic at DOWN VFs: idle VFs should be down
- Monitor allocated vs total: alert at >80% utilization
- One VF per GPU for RDMA workloads: matches traffic pattern
- Set trust on + spoofchk off for RDMA VFs
- Check sriov_numvfs after reboot: config daemon should restore it
- Label nodes with VF count for scheduler awareness
- Test VF recovery: delete Pods and verify VFs return to pool
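The ">80% utilization" alert from the list above can be sketched as a small threshold check. This is only an illustration: the function names vf_utilization_pct and check_vf_pool are hypothetical, the percentage is rounded down with integer arithmetic, and how the total and allocated counts are collected (node status, metrics, or the health script) is left open.

```shell
# Hypothetical helper: percentage of the VF pool in use (integer, rounded down).
vf_utilization_pct() {
  # $1 = total VFs, $2 = allocated VFs
  if [ "$1" -gt 0 ]; then
    echo $(( $2 * 100 / $1 ))
  else
    echo 0
  fi
}

# Hypothetical helper: print ALERT when utilization exceeds the threshold.
check_vf_pool() {
  # $3 = alert threshold in percent (defaults to 80, as suggested above)
  pct=$(vf_utilization_pct "$1" "$2")
  if [ "$pct" -gt "${3:-80}" ]; then
    echo "ALERT: ${pct}% of VFs allocated"
  else
    echo "OK: ${pct}% of VFs allocated"
  fi
}

check_vf_pool 16 3     # numbers from the allocation flow example
check_vf_pool 16 14    # made-up value to show the alert path
```

With 16 VFs, 3 allocated gives 18% and passes, while 14 allocated gives 87% and trips the alert. The comparison is strictly greater-than, so exactly 80% does not alert.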
Key Takeaways
- VFs are supposed to be DOWN when not in use; this is the healthy idle state
- Device plugin assigns VFs -> Multus/SR-IOV CNI moves VF into Pod netns -> link goes UP
- On Pod delete: VF returns to host namespace -> link goes DOWN -> available for next Pod
- The full chain: Pod spec -> scheduler -> device plugin -> CRI -> Multus -> SR-IOV CNI -> IPAM
- Stuck VFs (up without Pod) are rare; restart the device plugin to re-sync
- Each VF gets: network namespace move, IP from IPAM, RDMA device mounts, trust/spoofchk config
- Monitor Allocatable vs Allocated on nodes to track VF pool health
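The takeaways about idle, in-use, and stuck VFs reduce to one decision per VF. The sketch below is an assumption-laden illustration, not part of any SR-IOV tool: classify_vf is a hypothetical name, its input is the host netdev name and operstate as printed by a sysfs walk ("-" standing in for "not visible on the host"), and the sample rows are invented.

```shell
# Hypothetical helper: classify one VF from what the host can see.
classify_vf() {
  # $1 = host netdev name ("-" if none visible), $2 = operstate ("-" if unknown)
  if [ "$1" = "-" ]; then
    echo "in-pod"   # netdev moved into a Pod netns
  elif [ "$2" = "up" ]; then
    echo "check"    # on the host but up: possible leak, investigate
  else
    echo "idle"     # on the host and down: healthy, available
  fi
}

# Invented sample rows: VF index, host netdev, operstate
while read -r idx netdev state; do
  printf 'VF %s: %s\n' "$idx" "$(classify_vf "$netdev" "$state")"
done <<'EOF'
0 ens1f0v0 down
1 - -
2 ens1f0v2 up
EOF
```

A VF reported as in-pod still needs a matching running Pod; if none exists, it falls under the "CNI DEL fails" edge case and the device plugin restart described earlier applies.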

