Extended Resources & RDMA Shared Device Plugin
Kubernetes extended resources for RDMA devices using the shared device plugin. Advertise and schedule InfiniBand and RoCE NICs without SR-IOV.
💡 Quick Answer: The `k8s-rdma-shared-dev-plugin` advertises RDMA devices (InfiniBand/RoCE) as Kubernetes extended resources without SR-IOV. Multiple pods share the same physical RDMA NIC via the host network namespace. Pods request `rdma/hca_shared_devices_a: 1` in their resource limits. This is the simplest way to enable GPUDirect RDMA for AI training workloads when SR-IOV VFs aren't needed.
The Problem
AI training jobs need RDMA (Remote Direct Memory Access) for high-bandwidth, low-latency GPU-to-GPU communication across nodes. Kubernetes doesn't natively understand RDMA devices. You have two options:
- SR-IOV: Each pod gets a dedicated Virtual Function (VF). Full isolation, but complex setup.
- Shared device plugin: Multiple pods share the same physical RDMA NIC. Simpler, and sufficient when pods are trusted (same tenant).
The shared device plugin is the right choice for most AI/HPC clusters where all pods on a node belong to the same training job.
```mermaid
flowchart TB
    subgraph NODE["Worker Node"]
        NIC["Physical RDMA NIC<br/>mlx5_0 (200Gb/s IB)"]
        DP["k8s-rdma-shared-dev-plugin<br/>DaemonSet"]
        subgraph PODS["Pods (shared access)"]
            P1["GPU Pod 1<br/>rdma/hca_shared_devices_a: 1"]
            P2["GPU Pod 2<br/>rdma/hca_shared_devices_a: 1"]
            P3["GPU Pod 3<br/>rdma/hca_shared_devices_a: 1"]
        end
        DP -->|"Advertises<br/>extended resource"| NIC
        P1 --> NIC
        P2 --> NIC
        P3 --> NIC
    end
    KUBELET["kubelet"] -->|"Allocates<br/>rdma/hca_shared_devices_a"| DP
```
How Extended Resources Work
Kubernetes extended resources are a mechanism for advertising custom resources beyond CPU and memory. Device plugins register with the kubelet to advertise them:
```bash
# After deploying the RDMA shared device plugin, the node advertises:
kubectl describe node gpu-worker-0 | grep -A5 "Allocatable:"
# Allocatable:
#   cpu: 64
#   memory: 512Gi
#   nvidia.com/gpu: 8
#   rdma/hca_shared_devices_a: 63   # ← RDMA extended resource
#   rdma/hca_shared_devices_b: 63   # ← Second RDMA NIC (if configured)

# Pods request them like any other resource:
# resources:
#   limits:
#     nvidia.com/gpu: 1
#     rdma/hca_shared_devices_a: 1
```
Extended Resource Properties
| Property | Behavior |
|---|---|
| Integer only | Cannot request 0.5; quantities are always whole numbers |
| No overcommit | Unlike CPU, extended resources cannot be overcommitted; if both request and limit are set, they must be equal |
| Scheduler-aware | Pods only schedule on nodes with available capacity |
| Opaque to Kubernetes | K8s doesn't know what the resource is; it just counts units |
| Device plugin managed | Plugin handles allocation and health checks |
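The integer-only and no-overcommit rules in the table above can be checked before a manifest is applied. The following is a minimal Python sketch, not part of the plugin or of Kubernetes; the function names are hypothetical, and it only understands plain integers (Kubernetes quantity suffixes such as `1k` are deliberately not handled).

```python
def _is_whole_number(value) -> bool:
    """True for non-negative ints or digit-only strings such as "1"."""
    return str(value).isdigit()


def check_extended_resources(resources: dict, prefix: str = "rdma/") -> list:
    """Return a list of problems found in one container's `resources` mapping.

    Hypothetical helper: it mirrors the two table rules (whole numbers only,
    request == limit when both are present) for resources under `prefix`.
    """
    problems = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    names = {n for n in (*requests, *limits) if n.startswith(prefix)}

    for name in sorted(names):
        for section, values in (("requests", requests), ("limits", limits)):
            if name in values and not _is_whole_number(values[name]):
                problems.append(
                    f"{section}.{name}: must be a whole number, got {values[name]!r}"
                )
        if name in requests and name in limits:
            if str(requests[name]) != str(limits[name]):
                problems.append(f"{name}: request and limit must be equal")
    return problems


if __name__ == "__main__":
    spec = {"limits": {"nvidia.com/gpu": 1, "rdma/hca_shared_devices_a": "0.5"}}
    for problem in check_extended_resources(spec):
        print(problem)
```

An empty list means the RDMA entries pass both checks; other resources such as `nvidia.com/gpu` are ignored unless a different `prefix` is passed.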
The Solution
Deploy k8s-rdma-shared-dev-plugin
Step 1: Create the ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [
        {
          "resourceName": "hca_shared_devices_a",
          "rdmaHcaMax": 63,
          "selectors": {
            "ifNames": ["ens8f0np0"]
          }
        },
        {
          "resourceName": "hca_shared_devices_b",
          "rdmaHcaMax": 63,
          "selectors": {
            "ifNames": ["ens8f1np1"]
          }
        }
      ]
    }
```
Configuration fields:
| Field | Description |
|---|---|
| `resourceName` | Name exposed as `rdma/<resourceName>` in K8s |
| `rdmaHcaMax` | Max number of pods that can share this device (set to max expected pods per node) |
| `selectors.ifNames` | Network interface names to match |
| `selectors.vendors` | PCI vendor IDs (e.g., `["15b3"]` for Mellanox) |
| `selectors.deviceIDs` | PCI device IDs |
| `selectors.drivers` | Kernel driver names (e.g., `["mlx5_core"]`) |
| `periodicUpdateInterval` | Seconds between device re-scan (0 = disabled) |
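Because a malformed `config.json` shows up only later as a crashing plugin pod, it can help to sanity-check the file before creating the ConfigMap. The sketch below is an illustration in Python, not the plugin's own validation: the rules it applies (non-empty `configList`, unique `resourceName`, positive integer `rdmaHcaMax`, at least one selector from the table above) are assumptions drawn from the field descriptions, and `validate_plugin_config` is a hypothetical name.

```python
import json

# Selector keys taken from the configuration table above.
SELECTOR_KEYS = ("ifNames", "vendors", "deviceIDs", "drivers")


def validate_plugin_config(text: str) -> list:
    """Return a list of problems found in a config.json string (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]

    entries = config.get("configList") if isinstance(config, dict) else None
    if not isinstance(entries, list) or not entries:
        return ["configList must be a non-empty list"]

    problems = []
    seen = set()
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"configList[{index}]: must be an object")
            continue
        name = entry.get("resourceName")
        if not name:
            problems.append(f"configList[{index}]: resourceName is missing")
        elif name in seen:
            problems.append(f"configList[{index}]: duplicate resourceName {name!r}")
        seen.add(name)

        hca_max = entry.get("rdmaHcaMax")
        if isinstance(hca_max, bool) or not isinstance(hca_max, int) or hca_max < 1:
            problems.append(f"configList[{index}]: rdmaHcaMax must be a positive integer")

        selectors = entry.get("selectors") or {}
        if not any(selectors.get(key) for key in SELECTOR_KEYS):
            problems.append(f"configList[{index}]: no selector is set")
    return problems
```

Calling it on the `config.json` body from Step 1 should return an empty list; a truncated or mistyped file returns human-readable messages instead.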
Step 2: Selector strategies
```json
// Option A: Match by interface name (most precise)
{
  "resourceName": "hca_shared_devices_a",
  "rdmaHcaMax": 63,
  "selectors": {
    "ifNames": ["ens8f0np0"]
  }
}

// Option B: Match by vendor (all Mellanox/NVIDIA NICs)
{
  "resourceName": "hca_shared_devices_all",
  "rdmaHcaMax": 63,
  "selectors": {
    "vendors": ["15b3"]
  }
}

// Option C: Match by driver
{
  "resourceName": "hca_shared_devices_mlx5",
  "rdmaHcaMax": 63,
  "selectors": {
    "drivers": ["mlx5_core"]
  }
}

// Option D: Match by PCI device ID (specific NIC model)
{
  "resourceName": "hca_shared_devices_cx7",
  "rdmaHcaMax": 63,
  "selectors": {
    "vendors": ["15b3"],
    "deviceIDs": ["1021"]
  }
}
```
Step 3: Deploy the DaemonSet
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp-ds
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: rdma-shared-dp-ds
  template:
    metadata:
      labels:
        name: rdma-shared-dp-ds
    spec:
      hostNetwork: true
      priorityClassName: system-node-critical
      nodeSelector:
        feature.node.kubernetes.io/rdma.capable: "true"  # NFD label
      tolerations:
        - operator: Exists  # Schedule on all matching nodes
      containers:
        - name: k8s-rdma-shared-dp-ds
          image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:latest
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: 100m
              memory: 50Mi
            limits:
              cpu: 300m
              memory: 150Mi
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
              readOnly: false
            - name: config
              mountPath: /k8s-rdma-shared-dev-plugin
            - name: devs
              mountPath: /dev
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: config
          configMap:
            name: rdma-devices
            items:
              - key: config.json
                path: config.json
        - name: devs
          hostPath:
            path: /dev
```
Step 4: Verify
```bash
# Check DaemonSet is running
kubectl -n kube-system get ds rdma-shared-dp-ds

# Check device plugin registration
kubectl -n kube-system logs -l name=rdma-shared-dp-ds | grep -i "device\|resource"
# "RDMA shared device plugin started"
# "Resource: rdma/hca_shared_devices_a, device count: 63"

# Verify extended resource on nodes
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, rdma: .status.allocatable | to_entries | map(select(.key | startswith("rdma/")))}'
# "rdma/hca_shared_devices_a": "63"
```
Use RDMA in Pods
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nccl-test
spec:
  hostNetwork: true  # Required for shared RDMA (no SR-IOV)
  containers:
    - name: nccl
      image: nvcr.io/nvidia/pytorch:24.07-py3
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/hca_shared_devices_a: 1  # Request RDMA access
      env:
        - name: NCCL_IB_DISABLE
          value: "0"  # Enable InfiniBand
        - name: NCCL_DEBUG
          value: "INFO"  # Verify RDMA is used
        - name: NCCL_IB_HCA
          value: "mlx5"  # Target Mellanox HCAs
      command: ["sleep", "infinity"]
```
Multi-NIC RDMA for AI Training
For 8-GPU nodes with 8 RDMA NICs (1:1 GPU-to-NIC mapping):
```json
{
  "configList": [
    {"resourceName": "hca_shared_devices_0", "rdmaHcaMax": 63, "selectors": {"ifNames": ["ens8f0np0"]}},
    {"resourceName": "hca_shared_devices_1", "rdmaHcaMax": 63, "selectors": {"ifNames": ["ens9f0np0"]}},
    {"resourceName": "hca_shared_devices_2", "rdmaHcaMax": 63, "selectors": {"ifNames": ["ens10f0np0"]}},
    {"resourceName": "hca_shared_devices_3", "rdmaHcaMax": 63, "selectors": {"ifNames": ["ens11f0np0"]}},
    {"resourceName": "hca_shared_devices_4", "rdmaHcaMax": 63, "selectors": {"ifNames": ["ens12f0np0"]}},
    {"resourceName": "hca_shared_devices_5", "rdmaHcaMax": 63, "selectors": {"ifNames": ["ens13f0np0"]}},
    {"resourceName": "hca_shared_devices_6", "rdmaHcaMax": 63, "selectors": {"ifNames": ["ens14f0np0"]}},
    {"resourceName": "hca_shared_devices_7", "rdmaHcaMax": 63, "selectors": {"ifNames": ["ens15f0np0"]}}
  ]
}
```
```yaml
# Pod requesting all 8 RDMA NICs
resources:
  limits:
    nvidia.com/gpu: 8
    rdma/hca_shared_devices_0: 1
    rdma/hca_shared_devices_1: 1
    rdma/hca_shared_devices_2: 1
    rdma/hca_shared_devices_3: 1
    rdma/hca_shared_devices_4: 1
    rdma/hca_shared_devices_5: 1
    rdma/hca_shared_devices_6: 1
    rdma/hca_shared_devices_7: 1
```
Shared vs SR-IOV: When to Use Which
| Criteria | Shared Device Plugin | SR-IOV |
|---|---|---|
| Isolation | None: pods share the NIC | Full: each pod gets a VF |
| Setup complexity | Low: just a DaemonSet | High: VF config, IOMMU, PF driver |
| Performance | Native (no virtualization) | Near-native (hardware VF) |
| Multi-tenant | ❌ Not safe | ✅ Isolated |
| AI training (single tenant) | ✅ Perfect fit | Overkill |
| Requires hostNetwork | Yes | No (VF injected into pod netns) |
| Max pods per NIC | Configurable (`rdmaHcaMax`) | Limited by VF count (usually 8-128) |
| GPUDirect RDMA | ✅ Works with hostNetwork | ✅ Works with `deviceType: netdevice` |
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| `rdma/hca_shared_devices_a: 0` on node | Device plugin not finding NIC | Check that `ifNames` in the ConfigMap matches `ip link` output |
| Pod stuck Pending | No RDMA resource available | Check `kubectl describe node` for allocatable count |
| NCCL falls back to TCP | Missing `hostNetwork: true` | Shared plugin requires hostNetwork for RDMA verbs |
| Permission denied on `/dev/infiniband` | Missing privileges | Add `privileged: true` or `IPC_LOCK` + `/dev` volume |
| Plugin CrashLoopBackOff | Wrong config.json format | Validate JSON, check plugin logs |
| Multiple NICs not detected | Selector too narrow | Use vendors selector or list all ifNames |
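The first and last rows of the table above both come down to `ifNames` entries that do not exist on the node. A small Python sketch of that comparison follows; it assumes it runs on the worker node itself (or in a hostNetwork pod) where `/sys/class/net` lists the host interfaces, and the function names are hypothetical.

```python
import os


def list_host_interfaces(sys_class_net: str = "/sys/class/net") -> list:
    """List interface names the way `ip link` would see them (Linux only)."""
    if not os.path.isdir(sys_class_net):
        return []
    return sorted(os.listdir(sys_class_net))


def unmatched_ifnames(config: dict, host_interfaces) -> dict:
    """Map each resourceName to the ifNames selectors not present on the host."""
    present = set(host_interfaces)
    result = {}
    for entry in config.get("configList", []):
        wanted = entry.get("selectors", {}).get("ifNames", [])
        missing = [name for name in wanted if name not in present]
        if missing:
            result[entry.get("resourceName", "<unnamed>")] = missing
    return result


if __name__ == "__main__":
    # Example config using the interface names from Step 1.
    example = {
        "configList": [
            {"resourceName": "hca_shared_devices_a", "selectors": {"ifNames": ["ens8f0np0"]}},
            {"resourceName": "hca_shared_devices_b", "selectors": {"ifNames": ["ens8f1np1"]}},
        ]
    }
    print(unmatched_ifnames(example, list_host_interfaces()))
```

Any resource name that appears in the result has at least one interface name the node does not expose, which would explain a zero count for that resource.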
OpenShift Deployment
On OpenShift, use the NVIDIA Network Operator, which deploys the shared device plugin automatically:
```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: latest
    config: |
      {
        "configList": [
          {
            "resourceName": "hca_shared_devices_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"]
            }
          }
        ]
      }
```
```bash
# Verify on OpenShift
oc get nicclusterpolicy
oc get ds -n nvidia-network-operator | grep rdma
oc describe node gpu-worker-0 | grep rdma
```
Best Practices
- Use the shared plugin for single-tenant AI clusters; it is simpler than SR-IOV with the same performance
- Set `rdmaHcaMax` to your max pods per node; 63 is a safe default
- Use the `ifNames` selector for precision; `vendors` is too broad if you have management NICs
- Always set `hostNetwork: true`; shared RDMA requires the host network namespace
- Deploy with `system-node-critical` priority; the device plugin must run before AI workloads
- Use NFD labels for node selection: `feature.node.kubernetes.io/rdma.capable: "true"`
- One resource per NIC for GPU affinity: map each RDMA NIC to a separate extended resource for NUMA-aware scheduling
- Monitor with `rdma statistic`; check for dropped packets or congestion
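Two of the practices above (requesting an `rdma/*` resource and setting `hostNetwork: true`) are easy to forget in a pod manifest. The following Python sketch shows one way to lint a pod that has already been loaded into a dict, for example from YAML; `lint_rdma_pod` is a hypothetical helper and covers only these two checks.

```python
def lint_rdma_pod(pod: dict, prefix: str = "rdma/") -> list:
    """Return warnings for a pod dict that should use the shared RDMA plugin."""
    spec = pod.get("spec", {})
    warnings = []

    # Containers that list at least one rdma/* resource under limits.
    rdma_containers = [
        container.get("name", "<unnamed>")
        for container in spec.get("containers", [])
        if any(
            key.startswith(prefix)
            for key in container.get("resources", {}).get("limits", {})
        )
    ]

    if not rdma_containers:
        warnings.append("no container requests an rdma/* resource in limits")
    elif spec.get("hostNetwork") is not True:
        warnings.append(
            "hostNetwork is not true, but these containers request RDMA: "
            + ", ".join(rdma_containers)
        )
    return warnings
```

For the `nccl-test` pod shown earlier, the function returns an empty list; dropping `hostNetwork: true` from that manifest yields a single warning naming the `nccl` container.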
Key Takeaways
- Extended resources let Kubernetes schedule arbitrary hardware (GPUs, RDMA NICs, FPGAs)
- The k8s-rdma-shared-dev-plugin is the simplest way to expose RDMA to pods
- Shared = multiple pods use the same physical NIC (requires hostNetwork)
- SR-IOV = each pod gets an isolated VF (more complex, for multi-tenant)
- For AI training: shared plugin + hostNetwork + GPUDirect RDMA = optimal
- Configure one extended resource per NIC for 1:1 GPU-to-NIC affinity
- OpenShift: use NicClusterPolicy CRD (NVIDIA Network Operator deploys it automatically)
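The one-resource-per-NIC layout from the multi-NIC section is repetitive enough to generate. Below is a minimal Python sketch that builds both the `configList` and the matching pod limits from a list of interface names; the function names and the `hca_shared_devices_` prefix default are illustrative choices, and the interface names must be replaced with the ones on your nodes.

```python
import json


def build_multi_nic_config(if_names, rdma_hca_max: int = 63,
                           prefix: str = "hca_shared_devices_") -> dict:
    """Build a config.json dict with one extended resource per interface."""
    return {
        "configList": [
            {
                "resourceName": f"{prefix}{index}",
                "rdmaHcaMax": rdma_hca_max,
                "selectors": {"ifNames": [name]},
            }
            for index, name in enumerate(if_names)
        ]
    }


def pod_limits(config: dict, gpus: int = 8) -> dict:
    """Build the `resources.limits` mapping that requests every NIC once."""
    limits = {"nvidia.com/gpu": gpus}
    for entry in config["configList"]:
        limits[f"rdma/{entry['resourceName']}"] = 1
    return limits


if __name__ == "__main__":
    # Interface names follow the pattern used in the multi-NIC example.
    names = [f"ens{8 + i}f0np0" for i in range(8)]
    config = build_multi_nic_config(names)
    print(json.dumps(config, indent=2))
    print(json.dumps(pod_limits(config), indent=2))
```

The printed JSON can be pasted into the ConfigMap's `config.json`, and the limits mapping into the pod spec, keeping the two in sync when NICs are added or renamed.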
