Shared RDMA Device Plugin for Kubernetes GPU Pods
Configure the RDMA shared device plugin to allow multiple pods to share RDMA-capable NICs on Kubernetes. K8s-rdma-shared-dev-plugin setup, resource
💡 Quick Answer: The RDMA shared device plugin (k8s-rdma-shared-dev-plugin) exposes RDMA-capable NICs as shared Kubernetes resources that multiple pods can use simultaneously. Unlike SR-IOV (exclusive VF per pod), shared RDMA gives all pods access to the same physical NIC's RDMA capabilities via /dev/infiniband/ devices. Configure it with a ConfigMap specifying the resource name and device selectors, then have pods request rdma/rdma_shared_device_a: 1.
The Problem
- SR-IOV gives an exclusive VF per pod, so pod count is limited by the number of VFs (typically 8-128)
- Many GPU training pods need RDMA but don't need exclusive NIC access
- Running out of SR-IOV VFs on large multi-tenant GPU clusters
- Need GPUDirect RDMA for all pods without dedicating a VF to each
- Simple shared access to InfiniBand/RoCE NICs for NCCL multi-node training
The Solution
Deploy RDMA Shared Device Plugin
# ConfigMap defining shared RDMA resources
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [
        {
          "resourceName": "rdma_shared_device_a",
          "rdmaHcaMax": 100,
          "selectors": {
            "vendors": ["15b3"],
            "deviceIDs": ["101d", "101e"],
            "ifNames": ["ens8f0", "ens8f1", "ens9f0", "ens9f1"]
          }
        },
        {
          "resourceName": "rdma_shared_device_b",
          "rdmaHcaMax": 100,
          "selectors": {
            "vendors": ["15b3"],
            "ifNames": ["ens10f0", "ens10f1"]
          }
        }
      ]
    }
---
# DaemonSet deploying the plugin
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: rdma-shared-dp
  template:
    metadata:
      labels:
        app: rdma-shared-dp
    spec:
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/gpu-worker: ""
      containers:
        - name: rdma-shared-dp
          image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:latest
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: config
              mountPath: /k8s-rdma-shared-dev-plugin
            - name: devinfiniband
              mountPath: /dev/infiniband
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: config
          configMap:
            name: rdma-devices
        - name: devinfiniband
          hostPath:
            path: /dev/infiniband
Via NVIDIA Network Operator (Recommended)
# NicClusterPolicy with shared RDMA device plugin
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: latest
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 100,
            "selectors": {
              "vendors": ["15b3"],
              "deviceIDs": ["101d"]
            }
          }
        ]
      }
Verify Resources on Nodes
# Check node allocatable resources
kubectl describe node gpu-worker-0 | grep rdma
# rdma/rdma_shared_device_a: 100
# rdma/rdma_shared_device_b: 100
# The "100" is rdmaHcaMax: the max number of concurrent pods sharing this resource on the node
# It is not an actual hardware count; it's a soft limit
kubectl get node gpu-worker-0 -o json | jq '.status.allocatable' | grep rdma
# "rdma/rdma_shared_device_a": "100"
Pod Using Shared RDMA
apiVersion: v1
kind: Pod
metadata:
  name: nccl-training
  namespace: gpu-workloads
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.04-py3
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"  # Request shared RDMA access
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]  # Required for RDMA memory registration
      env:
        - name: NCCL_IB_HCA
          value: "mlx5_0,mlx5_3,mlx5_5,mlx5_6"
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "64Gi"
Shared vs Exclusive RDMA
Approach           │ Plugin                     │ Pods/NIC  │ Isolation │ Use Case
───────────────────┼────────────────────────────┼───────────┼───────────┼──────────────────
Shared RDMA        │ k8s-rdma-shared-dev-plugin │ Up to 100 │ None      │ Training clusters
SR-IOV (exclusive) │ sriov-device-plugin        │ 1 per VF  │ Full      │ Multi-tenant
Host device        │ None (hostNetwork)         │ All pods  │ None      │ Simple/testing
───────────────────┴────────────────────────────┴───────────┴───────────┴──────────────────
Shared RDMA advantages:
✅ No VF limit (100+ pods per NIC)
✅ Simpler config (no SR-IOV policy, no VF creation)
✅ GPUDirect RDMA works (nvidia-peermem + /dev/infiniband)
✅ Lower overhead (no virtual function management)
Shared RDMA limitations:
❌ No network isolation between pods (shared PF)
❌ No per-pod bandwidth guarantee
❌ No separate IP per pod (use overlay + secondary network)
❌ All pods see all RDMA traffic on the interface
ConfigMap Selectors
Annotated example (the // comments are explanatory only; JSON does not allow comments, so remove them in a real config):
{
  "configList": [
    {
      "resourceName": "rdma_shared_device_a",
      "rdmaHcaMax": 100,
      "selectors": {
        "vendors": ["15b3"],                  // Mellanox/NVIDIA PCI vendor ID
        "deviceIDs": ["101d", "101e"],        // ConnectX-6 Dx, ConnectX family mlx5Gen VF
        "drivers": ["mlx5_core"],             // Driver name
        "ifNames": ["ens8f0"],                // Interface name (exact match)
        "linkTypes": ["infiniband", "ether"]  // InfiniBand or Ethernet (RoCE)
      }
    }
  ]
}
Multiple Resource Pools (Fabric Separation)
{
  "configList": [
    {
      "resourceName": "rdma_gpu_fabric",
      "rdmaHcaMax": 50,
      "selectors": {
        "ifNames": ["ens8f0", "ens8f1", "ens9f0", "ens9f1"]
      }
    },
    {
      "resourceName": "rdma_storage_fabric",
      "rdmaHcaMax": 50,
      "selectors": {
        "ifNames": ["ens10f0", "ens10f1"]
      }
    }
  ]
}
# Pod requesting both fabrics
resources:
  limits:
    nvidia.com/gpu: "4"
    rdma/rdma_gpu_fabric: "1"      # GPU interconnect NICs
    rdma/rdma_storage_fabric: "1"  # Storage NICs
What the Plugin Mounts in Pod
# Inside a pod with rdma/rdma_shared_device_a: 1
ls /dev/infiniband/
# rdma_cm uverbs0 uverbs1 uverbs2 uverbs3
# These are the RDMA character devices:
# rdma_cm: connection manager (for RC/UC connections)
# uverbs0-3: user-space verbs devices (one per RDMA device, e.g. mlx5_0)
# Verify RDMA devices
ibv_devinfo
# hca_id: mlx5_0
#   port: 1
#     state: PORT_ACTIVE
#     link_layer: Ethernet (RoCE)
# Test bandwidth (this starts the server side; run the same command plus the server's address on the client)
ib_write_bw -d mlx5_0 --report_gbits
Combining with Secondary Network (Multus)
# NetworkAttachmentDefinition for RDMA pods
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
  namespace: gpu-workloads
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "type": "macvlan",
      "master": "ens8f0",
      "mode": "bridge",
      "ipam": {
        "type": "nv-ipam",
        "poolName": "gpu-fabric"
      }
    }
---
# Pod with shared RDMA + secondary network IP
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  namespace: gpu-workloads  # same namespace as the NetworkAttachmentDefinition
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.04-py3  # same image as the earlier training pod
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"
Common Issues
"insufficient rdma/rdma_shared_device_a" despite rdmaHcaMax=100
- Cause: 100 pods already allocated; or device plugin not running on this node
- Fix: Increase rdmaHcaMax; verify the DaemonSet pod is Running; check kubelet logs
Pod has RDMA devices but GPUDirect RDMA doesn't work
- Cause: nvidia-peermem not loaded; or missing IPC_LOCK capability
- Fix: modprobe nvidia-peermem; add capabilities: add: ["IPC_LOCK"] to the pod spec
"Permission denied" accessing /dev/infiniband
- Cause: Security context too restrictive; or SELinux blocking
- Fix: Add IPC_LOCK capability; or run with privileged: true for testing
Selector matches no devices
- Cause: Wrong vendor ID, device ID, or interface name in config
- Fix: Check with ibstat, lspci -nn | grep Mellanox, ip link on the node
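When a selector matches no devices, it can help to test the selector lists offline against an inventory collected by hand from lspci -nn and ip link. The sketch below is not the plugin's code: it assumes a device must match every non-empty selector list (AND across keys, OR within a list), and the NetDevice type, the helper names, and the sample inventory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NetDevice:
    """One NIC as seen on the node (values copied by hand from lspci / ip link)."""
    vendor: str      # PCI vendor ID, e.g. "15b3"
    device_id: str   # PCI device ID, e.g. "101d"
    driver: str      # kernel driver, e.g. "mlx5_core"
    if_name: str     # interface name, e.g. "ens8f0"


# Maps selector keys from config.json to NetDevice attribute names.
SELECTOR_FIELDS = {
    "vendors": "vendor",
    "deviceIDs": "device_id",
    "drivers": "driver",
    "ifNames": "if_name",
}


def matches(device: NetDevice, selectors: dict) -> bool:
    """Assumed semantics: every non-empty selector list must contain the device's value."""
    for key, attr in SELECTOR_FIELDS.items():
        allowed = selectors.get(key)
        if allowed and getattr(device, attr) not in allowed:
            return False
    return True


def explain(devices: list, selectors: dict) -> dict:
    """For each selector key in use, list the interfaces rejected by that key alone."""
    report = {}
    for key, attr in SELECTOR_FIELDS.items():
        allowed = selectors.get(key)
        if allowed:
            report[key] = [d.if_name for d in devices if getattr(d, attr) not in allowed]
    return report


# Hypothetical inventory for one node.
inventory = [
    NetDevice("15b3", "101d", "mlx5_core", "ens8f0"),
    NetDevice("15b3", "101d", "mlx5_core", "ens8f1"),
    NetDevice("8086", "1593", "ice", "eno1"),
]
selectors = {"vendors": ["15b3"], "deviceIDs": ["101d"], "ifNames": ["ens8f0", "ens9f0"]}

matched = [d.if_name for d in inventory if matches(d, selectors)]
print("matched:", matched)
print("rejected per key:", explain(inventory, selectors))
```

The per-key report shows which list is excluding a NIC you expected to match, for example an ifNames entry that does not exist on this node.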
Best Practices
- Use shared RDMA for training clusters: simpler than SR-IOV, no VF limit
- Set rdmaHcaMax to the expected max concurrent pods: it acts as an admission limit
- Separate resource pools per fabric: GPU interconnect vs storage traffic
- Always add the IPC_LOCK capability: required for RDMA memory registration
- Combine with Multus + IPAM: gives pods unique IPs on the RDMA fabric
- Large /dev/shm: NCCL uses shared memory for intra-node communication
- Use Network Operator for lifecycle: manages plugin DaemonSet + config updates
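Several of these practices can be checked mechanically before a pod is submitted. The following is a minimal sketch, assuming the manifest has already been loaded into a plain Python dict (for instance from kubectl get pod -o json); the function name check_rdma_pod and the sample manifests are hypothetical, and the checks only cover the rdma/* limit, IPC_LOCK, and a memory-backed /dev/shm.

```python
def check_rdma_pod(pod: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    spec = pod.get("spec", {})
    # Names of emptyDir volumes backed by memory (tmpfs).
    memory_volumes = {
        v["name"]
        for v in spec.get("volumes", [])
        if (v.get("emptyDir") or {}).get("medium") == "Memory"
    }
    for c in spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        limits = (c.get("resources") or {}).get("limits") or {}
        if not any(key.startswith("rdma/") for key in limits):
            problems.append(f"{name}: no rdma/* resource in limits")
        caps = ((c.get("securityContext") or {}).get("capabilities") or {}).get("add") or []
        if "IPC_LOCK" not in caps:
            problems.append(f"{name}: IPC_LOCK capability missing")
        shm_ok = any(
            m.get("mountPath") == "/dev/shm" and m.get("name") in memory_volumes
            for m in c.get("volumeMounts", [])
        )
        if not shm_ok:
            problems.append(f"{name}: no memory-backed volume mounted at /dev/shm")
    return problems


# Hypothetical manifest that follows the practices listed above.
good_pod = {
    "spec": {
        "containers": [{
            "name": "trainer",
            "resources": {"limits": {"nvidia.com/gpu": "8", "rdma/rdma_shared_device_a": "1"}},
            "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
            "volumeMounts": [{"name": "shm", "mountPath": "/dev/shm"}],
        }],
        "volumes": [{"name": "shm", "emptyDir": {"medium": "Memory", "sizeLimit": "64Gi"}}],
    }
}
# Hypothetical manifest that requests GPUs only.
bad_pod = {"spec": {"containers": [{"name": "trainer", "resources": {"limits": {"nvidia.com/gpu": "8"}}}]}}

print(check_rdma_pod(good_pod))
for problem in check_rdma_pod(bad_pod):
    print(problem)
```

A privileged container already holds every capability, so the IPC_LOCK warning can be ignored in that case; the sketch does not try to detect it.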
Key Takeaways
- Shared RDMA plugin: multiple pods share the same physical NIC's RDMA capabilities
- Resource: rdma/rdma_shared_device_a: 1 requests shared access (not exclusive)
- rdmaHcaMax: soft limit on concurrent pods (not a hardware limit); set to 50-100
- Mounts /dev/infiniband/* into the pod: user-space verbs + connection manager
- No isolation between pods: all share PF bandwidth (fine for training clusters)
- Combine with SR-IOV when tenant isolation is needed; use shared when not
- Works with GPUDirect RDMA (nvidia-peermem): GPU memory -> shared NIC -> wire
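To see the soft limit each node actually advertises, the allocatable map from kubectl get node -o json can be filtered for rdma/ keys. This is a small sketch under the assumption that the node object has been parsed into a dict; the helper name rdma_allocatable is hypothetical and the sample JSON is a trimmed, made-up node status.

```python
import json


def rdma_allocatable(node: dict) -> dict:
    """Return {resource name: count} for every rdma/* entry in status.allocatable."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(count)  # extended resources are reported as integer strings
        for name, count in allocatable.items()
        if name.startswith("rdma/")
    }


# Trimmed, hypothetical output of: kubectl get node gpu-worker-0 -o json
sample_node = json.loads("""
{
  "metadata": {"name": "gpu-worker-0"},
  "status": {
    "allocatable": {
      "cpu": "96",
      "nvidia.com/gpu": "8",
      "rdma/rdma_shared_device_a": "100",
      "rdma/rdma_shared_device_b": "100"
    }
  }
}
""")

for resource, count in rdma_allocatable(sample_node).items():
    print(f"{resource}: {count}")
```

An empty result for a GPU worker usually points back to the plugin not running on that node or to selectors that match nothing.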

