Networking advanced ⏱ 15 minutes K8s 1.28+

Shared RDMA Device Plugin for Kubernetes GPU Pods

Configure the RDMA shared device plugin so that multiple pods can share RDMA-capable NICs on Kubernetes.

By Luca Berton • 6 min read

💡 Quick Answer: The RDMA shared device plugin (k8s-rdma-shared-dev-plugin) exposes RDMA-capable NICs as shared Kubernetes resources that multiple pods can use simultaneously. Unlike SR-IOV, which gives each pod an exclusive VF, shared RDMA gives all pods access to the same physical NIC's RDMA capabilities via the /dev/infiniband/ devices. Configure it with a ConfigMap that specifies a resource name and device selectors; pods then request rdma/rdma_shared_device_a: 1.

The Problem

  • SR-IOV gives each pod an exclusive VF, so pod count is limited by the number of VFs (typically 8-128)
  • Many GPU training pods need RDMA but don't need exclusive NIC access
  • Running out of SR-IOV VFs on large multi-tenant GPU clusters
  • Need GPUDirect RDMA for all pods without dedicating a VF to each
  • Simple shared access to InfiniBand/RoCE NICs for NCCL multi-node training

The Solution

Deploy RDMA Shared Device Plugin

# ConfigMap defining shared RDMA resources
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [
        {
          "resourceName": "rdma_shared_device_a",
          "rdmaHcaMax": 100,
          "selectors": {
            "vendors": ["15b3"],
            "deviceIDs": ["101d", "101e"],
            "ifNames": ["ens8f0", "ens8f1", "ens9f0", "ens9f1"]
          }
        },
        {
          "resourceName": "rdma_shared_device_b",
          "rdmaHcaMax": 100,
          "selectors": {
            "vendors": ["15b3"],
            "ifNames": ["ens10f0", "ens10f1"]
          }
        }
      ]
    }
---
# DaemonSet deploying the plugin
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: rdma-shared-dp
  template:
    metadata:
      labels:
        app: rdma-shared-dp
    spec:
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/gpu-worker: ""
      containers:
        - name: rdma-shared-dp
          image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:latest
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: config
              mountPath: /k8s-rdma-shared-dev-plugin
            - name: devinfiniband
              mountPath: /dev/infiniband
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: config
          configMap:
            name: rdma-devices
        - name: devinfiniband
          hostPath:
            path: /dev/infiniband
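
Alternatively, when the NVIDIA Network Operator manages the cluster, enable the plugin through a NicClusterPolicy instead of deploying the ConfigMap and DaemonSet by hand:

# NicClusterPolicy with shared RDMA device plugin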
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: latest
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 100,
            "selectors": {
              "vendors": ["15b3"],
              "deviceIDs": ["101d"]
            }
          }
        ]
      }
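
A malformed config.json is easy to miss until the plugin fails to advertise anything. The following is a minimal Python sketch of a pre-flight check, not part of the plugin itself. The key names come from the configs above; the rules (positive rdmaHcaMax, unique resource names, and the set of allowed selector keys) are assumptions made for illustration.

```python
import json

# Selector keys that appear in this recipe; treating them as the full
# allowed set is an assumption of this sketch.
KNOWN_SELECTORS = {"vendors", "deviceIDs", "drivers", "ifNames", "linkTypes"}


def validate_config(raw: str) -> list[str]:
    """Return a list of problems found in a config.json string (empty = OK)."""
    problems = []
    config = json.loads(raw)
    entries = config.get("configList", [])
    if not entries:
        problems.append("configList is empty")
    seen = set()
    for i, entry in enumerate(entries):
        name = entry.get("resourceName")
        if not name:
            problems.append(f"entry {i}: missing resourceName")
        elif name in seen:
            problems.append(f"entry {i}: duplicate resourceName {name!r}")
        seen.add(name)
        hca_max = entry.get("rdmaHcaMax")
        if not isinstance(hca_max, int) or hca_max < 1:
            problems.append(f"entry {i}: rdmaHcaMax must be a positive integer")
        unknown = set(entry.get("selectors", {})) - KNOWN_SELECTORS
        if unknown:
            problems.append(f"entry {i}: unknown selectors {sorted(unknown)}")
    return problems


if __name__ == "__main__":
    sample = (
        '{"configList": [{"resourceName": "rdma_shared_device_a", '
        '"rdmaHcaMax": 100, "selectors": {"vendors": ["15b3"]}}]}'
    )
    print(validate_config(sample))
```

Running it against the JSON before `kubectl apply` catches typos such as `ifName` instead of `ifNames`, which would otherwise surface only as a selector that matches no devices.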

Verify Resources on Nodes

# Check node allocatable resources
kubectl describe node gpu-worker-0 | grep rdma
#   rdma/rdma_shared_device_a:  100
#   rdma/rdma_shared_device_b:  100

# The "100" is rdmaHcaMax: the maximum number of concurrent pods sharing this resource
# It is a soft limit, not an actual hardware count

kubectl get node gpu-worker-0 -o json | jq '.status.allocatable' | grep rdma
# "rdma/rdma_shared_device_a": "100"
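
The same check can be scripted. This is a small Python sketch, assuming the Node object has already been fetched as JSON (for example with the kubectl command above); the sample node data in the usage section is hypothetical.

```python
import json


def rdma_allocatable(node: dict) -> dict[str, int]:
    """Extract the rdma/* entries from a Node object's status.allocatable."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(value)
        for name, value in allocatable.items()
        if name.startswith("rdma/")
    }


if __name__ == "__main__":
    # Hypothetical excerpt of `kubectl get node gpu-worker-0 -o json`
    node_json = """
    {"status": {"allocatable": {
        "cpu": "96",
        "nvidia.com/gpu": "8",
        "rdma/rdma_shared_device_a": "100",
        "rdma/rdma_shared_device_b": "100"
    }}}
    """
    print(rdma_allocatable(json.loads(node_json)))
```

An empty result for a GPU worker suggests the plugin pod is not running there or its selectors matched nothing.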

Pod Using Shared RDMA

apiVersion: v1
kind: Pod
metadata:
  name: nccl-training
  namespace: gpu-workloads
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.04-py3
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"    # Request shared RDMA access
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]     # Required for RDMA memory registration
      env:
        - name: NCCL_IB_HCA
          value: "mlx5_0,mlx5_3,mlx5_5,mlx5_6"
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "64Gi"

Shared vs Exclusive RDMA

Approach           │ Plugin                     │ Pods/NIC  │ Isolation │ Use Case
───────────────────┼────────────────────────────┼───────────┼───────────┼──────────────────
Shared RDMA        │ k8s-rdma-shared-dev-plugin │ Up to 100 │ None      │ Training clusters
SR-IOV (exclusive) │ sriov-device-plugin        │ 1 per VF  │ Full      │ Multi-tenant
Host device        │ None (hostNetwork)         │ All pods  │ None      │ Simple/testing
───────────────────┴────────────────────────────┴───────────┴───────────┴──────────────────

Shared RDMA advantages:
  ✅ No VF limit (100+ pods per NIC)
  ✅ Simpler config (no SR-IOV policy, no VF creation)
  ✅ GPUDirect RDMA works (nvidia-peermem + /dev/infiniband)
  ✅ Lower overhead (no virtual function management)

Shared RDMA limitations:
  ❌ No network isolation between pods (shared PF)
  ❌ No per-pod bandwidth guarantee
  ❌ No separate IP per pod (use overlay + secondary network)
  ❌ All pods can open the same RDMA devices on the interface
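
To make the VF limit concrete, here is a back-of-the-envelope Python sketch of how many pods fit on one node when the scarcest extended resource decides. It is a simplification that ignores CPU, memory, and scheduler policies; the node sizes and the `example.com/sriov_vf` resource name are hypothetical.

```python
def max_pods(allocatable: dict[str, int], per_pod: dict[str, int]) -> int:
    """Pods that fit on one node; per_pod must be non-empty with positive counts."""
    return min(allocatable.get(name, 0) // count for name, count in per_pod.items())


if __name__ == "__main__":
    # Hypothetical node with 8 GPUs and only 4 SR-IOV VFs configured
    sriov = max_pods(
        {"nvidia.com/gpu": 8, "example.com/sriov_vf": 4},
        {"nvidia.com/gpu": 1, "example.com/sriov_vf": 1},
    )
    # Same node using the shared pool with rdmaHcaMax = 100
    shared = max_pods(
        {"nvidia.com/gpu": 8, "rdma/rdma_shared_device_a": 100},
        {"nvidia.com/gpu": 1, "rdma/rdma_shared_device_a": 1},
    )
    print(f"SR-IOV: {sriov} pods, shared RDMA: {shared} pods")
```

In this made-up case the VF count caps the node at 4 single-GPU pods, while with the shared pool the GPUs become the limit.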

ConfigMap Selectors

The // comments below are annotations only. JSON does not allow comments, so strip them before placing the config in the ConfigMap.

{
  "configList": [
    {
      "resourceName": "rdma_shared_device_a",
      "rdmaHcaMax": 100,
      "selectors": {
        "vendors": ["15b3"],                  // Mellanox/NVIDIA
        "deviceIDs": ["101d", "101e"],        // ConnectX-6 Dx, ConnectX family VF (ConnectX-7 is 1021)
        "drivers": ["mlx5_core"],             // Driver name
        "ifNames": ["ens8f0"],                // Interface name (exact match)
        "linkTypes": ["infiniband", "ether"]  // InfiniBand or Ethernet (RoCE)
      }
    }
  ]
}
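
One way to reason about how selectors combine is the following Python sketch. It assumes a device must satisfy every selector field that is present, and that within a field any listed value is enough. That rule is an assumption about the plugin's behaviour, and the `Nic` class is a hypothetical stand-in for the attributes the plugin reads from the host.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    """Attributes of one NIC; field names are hypothetical and mirror the selectors."""
    vendor: str
    device_id: str
    driver: str
    if_name: str


# Maps each selector key to the Nic attribute it is compared against.
SELECTOR_FIELDS = {
    "vendors": "vendor",
    "deviceIDs": "device_id",
    "drivers": "driver",
    "ifNames": "if_name",
}


def matches(nic: Nic, selectors: dict[str, list[str]]) -> bool:
    """True if the NIC satisfies every selector list that is present and non-empty."""
    for key, allowed in selectors.items():
        if allowed and getattr(nic, SELECTOR_FIELDS[key]) not in allowed:
            return False
    return True


if __name__ == "__main__":
    nic = Nic(vendor="15b3", device_id="101d", driver="mlx5_core", if_name="ens8f0")
    print(matches(nic, {"vendors": ["15b3"], "ifNames": ["ens8f0", "ens8f1"]}))
    print(matches(nic, {"vendors": ["15b3"], "ifNames": ["ens10f0"]}))
```

Under this model, adding more selector fields can only narrow the set of matched devices, which is why an overly specific config often ends up matching nothing.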

Multiple Resource Pools (Fabric Separation)

{
  "configList": [
    {
      "resourceName": "rdma_gpu_fabric",
      "rdmaHcaMax": 50,
      "selectors": {
        "ifNames": ["ens8f0", "ens8f1", "ens9f0", "ens9f1"]
      }
    },
    {
      "resourceName": "rdma_storage_fabric",
      "rdmaHcaMax": 50,
      "selectors": {
        "ifNames": ["ens10f0", "ens10f1"]
      }
    }
  ]
}

# Pod requesting both fabrics
resources:
  limits:
    nvidia.com/gpu: "4"
    rdma/rdma_gpu_fabric: "1"         # GPU interconnect NICs
    rdma/rdma_storage_fabric: "1"     # Storage NICs
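
Fabric separation only holds if the pools select disjoint interfaces. The Python sketch below checks a parsed config for interface names that appear in more than one pool. Treating such overlap as a mistake is an assumption of this example, not something the plugin is known to enforce.

```python
from itertools import combinations


def overlapping_ifnames(config: dict) -> dict[tuple[str, str], set[str]]:
    """Map each pair of resource names to the interface names both pools select."""
    pools = {
        entry["resourceName"]: set(entry.get("selectors", {}).get("ifNames", []))
        for entry in config.get("configList", [])
    }
    overlaps = {}
    for (name_a, ifs_a), (name_b, ifs_b) in combinations(pools.items(), 2):
        shared = ifs_a & ifs_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


if __name__ == "__main__":
    config = {
        "configList": [
            {"resourceName": "rdma_gpu_fabric", "rdmaHcaMax": 50,
             "selectors": {"ifNames": ["ens8f0", "ens8f1", "ens9f0", "ens9f1"]}},
            {"resourceName": "rdma_storage_fabric", "rdmaHcaMax": 50,
             "selectors": {"ifNames": ["ens10f0", "ens10f1"]}},
        ]
    }
    print(overlapping_ifnames(config))
```

An empty result means each interface belongs to at most one pool, as in the two-fabric config above.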

What the Plugin Mounts in Pod

# Inside a pod with rdma/rdma_shared_device_a: 1
ls /dev/infiniband/
# rdma_cm  uverbs0  uverbs1  uverbs2  uverbs3

# These are the RDMA character devices:
# rdma_cm    - connection manager (used to set up RDMA connections)
# uverbs0-3  - user-space verbs devices (one per RDMA device, e.g. mlx5_0)

# Verify RDMA devices
ibv_devinfo
# hca_id: mlx5_0
#   port: 1
#     state: PORT_ACTIVE
#     link_layer: Ethernet  (RoCE)

# Test bandwidth
ib_write_bw -d mlx5_0 --report_gbits
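
Where the ibverbs tools are not installed in the image, a quick look at the device nodes is still possible. This Python sketch only groups the entry names found under /dev/infiniband; it does not open the devices, and the "other" bucket for anything unrecognised is a choice made for this example.

```python
import os
import re


def classify_rdma_nodes(names: list[str]) -> dict[str, list[str]]:
    """Group /dev/infiniband entry names into verbs devices, rdma_cm, and others."""
    groups = {"uverbs": [], "rdma_cm": [], "other": []}
    for name in sorted(names):
        if re.fullmatch(r"uverbs\d+", name):
            groups["uverbs"].append(name)
        elif name == "rdma_cm":
            groups["rdma_cm"].append(name)
        else:
            groups["other"].append(name)
    return groups


if __name__ == "__main__":
    path = "/dev/infiniband"
    # Fall back to an empty list when the directory is absent (no RDMA devices mounted)
    entries = os.listdir(path) if os.path.isdir(path) else []
    print(classify_rdma_nodes(entries))
```

If the "uverbs" list is empty inside a pod that requested the resource, the plugin did not mount any verbs device and NCCL is likely to fall back to TCP sockets.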

Combining with Secondary Network (Multus)

# NetworkAttachmentDefinition for RDMA pods
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
  namespace: gpu-workloads
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "type": "macvlan",
      "master": "ens8f0",
      "mode": "bridge",
      "ipam": {
        "type": "nv-ipam",
        "poolName": "gpu-fabric"
      }
    }
---
# Pod with shared RDMA + secondary network IP
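apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  namespace: gpu-workloads      # Same namespace as the NetworkAttachmentDefinition
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.04-py3
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]     # Required for RDMA memory registration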
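
When one attachment is needed per interface, generating the NetworkAttachmentDefinition avoids hand-editing JSON embedded in YAML. The Python sketch below reuses the values from the manifest above. The function name and parameters are hypothetical, and it relies only on the fact that spec.config is a JSON string.

```python
import json


def macvlan_nad(name: str, namespace: str, master: str, pool: str) -> dict:
    """Build a NetworkAttachmentDefinition whose spec.config is a JSON string."""
    cni_config = {
        "cniVersion": "0.4.0",
        "type": "macvlan",
        "master": master,
        "mode": "bridge",
        "ipam": {"type": "nv-ipam", "poolName": pool},
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"config": json.dumps(cni_config)},
    }


if __name__ == "__main__":
    # The printed JSON can be piped to `kubectl apply -f -`
    nad = macvlan_nad("rdma-net", "gpu-workloads", "ens8f0", "gpu-fabric")
    print(json.dumps(nad, indent=2))
```

Calling it once per master interface, with a distinct name each time, yields one attachment per NIC.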

Common Issues

"insufficient rdma/rdma_shared_device_a" despite rdmaHcaMax=100

  • Cause: 100 pods already allocated; or device plugin not running on this node
  • Fix: Increase rdmaHcaMax; verify DaemonSet pod is Running; check kubelet logs

Pod has RDMA devices but GPUDirect RDMA doesn't work

  • Cause: nvidia-peermem not loaded; or missing IPC_LOCK capability
  • Fix: modprobe nvidia-peermem; add capabilities: add: ["IPC_LOCK"] to pod spec

"Permission denied" accessing /dev/infiniband

  • Cause: Security context too restrictive; or SELinux blocking
  • Fix: Add IPC_LOCK capability; or run with privileged: true for testing

Selector matches no devices

  • Cause: Wrong vendor ID, device ID, or interface name in config
  • Fix: Check with ibstat, lspci -nn | grep Mellanox, ip link on the node
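
The missing-capability case can be caught before a pod is submitted. This is a minimal Python lint sketch over a Pod manifest loaded as a dict. It only covers the IPC_LOCK check from the list above, and treating privileged containers as acceptable is an assumption of the example.

```python
def lint_rdma_pod(pod: dict) -> list[str]:
    """Return warnings for containers that request rdma/* without IPC_LOCK."""
    warnings = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if not any(key.startswith("rdma/") for key in limits):
            continue  # container does not ask for shared RDMA
        security = container.get("securityContext", {})
        caps = security.get("capabilities", {}).get("add", [])
        if "IPC_LOCK" not in caps and not security.get("privileged", False):
            warnings.append(f"{container.get('name')}: missing IPC_LOCK capability")
    return warnings


if __name__ == "__main__":
    pod = {
        "spec": {
            "containers": [
                {
                    "name": "trainer",
                    "resources": {"limits": {"rdma/rdma_shared_device_a": "1"}},
                }
            ]
        }
    }
    print(lint_rdma_pod(pod))
```

A check like this could run in CI against rendered manifests, so the problem shows up before a training job fails at memory registration.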

Best Practices

  1. Use shared RDMA for training clusters: simpler than SR-IOV, no VF limit
  2. Set rdmaHcaMax to the expected maximum number of concurrent pods: it acts as an admission limit
  3. Separate resource pools per fabric: GPU interconnect vs storage traffic
  4. Always add the IPC_LOCK capability: required for RDMA memory registration
  5. Combine with Multus + IPAM: gives pods unique IPs on the RDMA fabric
  6. Use a large /dev/shm: NCCL uses shared memory for intra-node communication
  7. Use the Network Operator for lifecycle: it manages the plugin DaemonSet and config updates

Key Takeaways

  • Shared RDMA plugin: multiple pods share the same physical NIC's RDMA capabilities
  • Resource: rdma/rdma_shared_device_a: 1 requests shared (not exclusive) access
  • rdmaHcaMax: soft limit on concurrent pods (not a hardware limit); set to 50-100
  • Mounts /dev/infiniband/* into the pod: user-space verbs + connection manager
  • No isolation between pods: all share PF bandwidth (fine for training clusters)
  • Combine with SR-IOV when tenant isolation is needed; use shared when not
  • Works with GPUDirect RDMA (nvidia-peermem): GPU memory → shared NIC → wire