InfiniBand Subnet Manager OpenSM on Kubernetes
Deploy and manage InfiniBand Subnet Manager (OpenSM) on Kubernetes for GPU cluster fabric management. Covers SM architecture, UFM integration, partition
💡 Quick Answer: InfiniBand requires a Subnet Manager (SM) to initialize the fabric; without it, no IB communication happens. For small GPU clusters, run OpenSM on a management node. For production, use NVIDIA UFM (Unified Fabric Manager) for centralized IB management, monitoring, and adaptive routing.
The Problem
InfiniBand is not plug-and-play like Ethernet:
- Every IB fabric needs at least one Subnet Manager running
- SM assigns LIDs (Local IDs), configures routing, manages partitions
- Without SM, IB ports stay in "Initializing" state: no RDMA, no NCCL
- Need to choose: switch-based SM, host-based OpenSM, or NVIDIA UFM
- Partition keys (P_Keys) control which hosts can communicate
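To make the SM's job more concrete, here is a toy Python model of a single sweep: discover the nodes, hand out LIDs, and compute a forwarding table for one switch. It is only a sketch under simplifying assumptions (one port per node, hosts with a single link, shortest-path routing by breadth-first search). The function names and the two-switch topology are hypothetical, and this is not how OpenSM is implemented.

```python
from collections import deque

def discover(links, start):
    """Walk the fabric breadth-first from the SM's node; return nodes in discovery order."""
    order, queue = [start], deque([start])
    while queue:
        node = queue.popleft()
        for nbr in links[node]:
            if nbr not in order:
                order.append(nbr)
                queue.append(nbr)
    return order

def assign_lids(nodes, first_lid=1):
    """Give every discovered node a unique LID (toy model: one port per node)."""
    return {name: first_lid + i for i, name in enumerate(nodes)}

def forwarding_table(links, lids, switch):
    """Map destination LID -> neighbor to forward to, along a shortest path from `switch`.

    Real switches map LIDs to output port numbers; neighbor names stand in for ports here.
    """
    first_hop = {nbr: nbr for nbr in links[switch]}
    seen = {switch, *first_hop}
    queue = deque(first_hop)
    while queue:
        node = queue.popleft()
        for nbr in links[node]:
            if nbr not in seen:
                seen.add(nbr)
                first_hop[nbr] = first_hop[node]  # inherit the first hop of the path
                queue.append(nbr)
    return {lids[dest]: hop for dest, hop in first_hop.items()}

# Hypothetical two-switch fabric: gpu-a on sw1, gpu-b on sw2, one inter-switch link.
links = {
    "sw1": ["gpu-a", "sw2"],
    "sw2": ["gpu-b", "sw1"],
    "gpu-a": ["sw1"],
    "gpu-b": ["sw2"],
}
lids = assign_lids(discover(links, "sw1"))
table = forwarding_table(links, lids, "sw1")
print(lids)   # {'sw1': 1, 'gpu-a': 2, 'sw2': 3, 'gpu-b': 4}
print(table)  # {2: 'gpu-a', 3: 'sw2', 4: 'sw2'}
```

Until something performs these steps, no port has a LID and no switch knows where to forward packets, which is why ports sit in "Initializing" on a fabric without an SM.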
The Solution
InfiniBand SM Architecture
Subnet Manager Responsibilities:
────────────────────────────────────────────────────────────────
1. Discovery: Find all nodes, switches, links in the fabric
2. LID Assignment: Assign Local IDs to each port
3. Routing: Compute forwarding tables for switches
4. Monitoring: Detect topology changes, link failures
5. Partitioning: Enforce P_Key isolation between tenants
6. QoS: Service Level (SL) assignment for traffic classes

Deploy OpenSM on Kubernetes
# OpenSM DaemonSet: runs on the IB management node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: opensm
  namespace: infiniband-mgmt
spec:
  selector:
    matchLabels:
      app: opensm
  template:
    metadata:
      labels:
        app: opensm
    spec:
      nodeSelector:
        infiniband/subnet-manager: "true"
      hostNetwork: true
      containers:
        - name: opensm
          image: registry.example.com/opensm:5.18
          command: ["/usr/sbin/opensm"]
          args:
            - "-g"                  # Local port GUID that OpenSM binds to
            - "0x0000000000000001"  # Port GUID (from ibstat)
            - "-p"                  # SM priority, 0-15 (higher = preferred master)
            - "15"
            - "--log_file"
            - "/var/log/opensm/opensm.log"
            - "-D"                  # Log verbosity flags (bitmask)
            - "0xFF"
          securityContext:
            privileged: true
          volumeMounts:
            - name: infiniband
              mountPath: /dev/infiniband
            - name: log
              mountPath: /var/log/opensm
      volumes:
        - name: infiniband
          hostPath:
            path: /dev/infiniband
        - name: log
          hostPath:
            path: /var/log/opensm

Check IB Fabric Health
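The port-state check in this section can also be scripted. The sketch below parses `ibstat`-style text and reports ports whose link is physically up but still in "Initializing", which usually means they are waiting for an SM. The sample text is a trimmed, illustrative excerpt rather than captured output, and the function names are hypothetical.

```python
import re

def parse_ibstat(text):
    """Return one dict per port: ca, port, state, physical_state."""
    ports, ca, current = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'", line)
        if m:
            ca = m.group(1)
            continue
        m = re.match(r"Port (\d+):", line)
        if m:
            current = {"ca": ca, "port": int(m.group(1)),
                       "state": None, "physical_state": None}
            ports.append(current)
            continue
        if current is None:
            continue
        if line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif line.startswith("Physical state:"):
            current["physical_state"] = line.split(":", 1)[1].strip()
    return ports

def waiting_for_sm(ports):
    """Ports with a live link that no SM has configured yet."""
    return [p for p in ports
            if p["state"] == "Initializing" and p["physical_state"] == "LinkUp"]

# Illustrative sample only (assumed layout, most fields omitted).
SAMPLE = """\
CA 'mlx5_0'
    CA type: MT4123
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Port GUID: 0x0000000000000001
CA 'mlx5_1'
    CA type: MT4123
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 200
        Port GUID: 0x0000000000000002
"""

for p in waiting_for_sm(parse_ibstat(SAMPLE)):
    print(p["ca"], "port", p["port"])  # prints: mlx5_1 port 1
```

On a real node, the text would come from running `ibstat` and passing its standard output to `parse_ibstat`.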
# Port status (should be Active, not Initializing)
ibstat
# Expected:
# State: Active
# Physical state: LinkUp
# Rate: 200 (HDR) or 400 (NDR)
# List all nodes in the fabric
ibnetdiscover
# Show switch topology
iblinkinfo
# Check SM status
sminfo
# Expected: SM running, priority, GUID
# List nodes (CAs, switches, routers); switch entries include their LID
ibnodes | head -20
# Check for errors on all ports
ibdiagnet --ls 10 --lw 4x
# Scans all links for errors, speed mismatches, symbol errors
# Per-port error counters
perfquery -x <lid> <port>

Partition Keys (P_Keys) for Multi-Tenant
# /etc/opensm/partitions.conf
# Default partition: all nodes
Default=0x7FFF, ipoib : ALL=full ;

# GPU training partition: isolated fabric segment
# Members are GPU worker nodes, listed by port GUID
GPUFabric=0x0001, ipoib :
    0x0002c903000001=full,
    0x0002c903000002=full,
    0x0002c903000003=full,
    0x0002c903000004=full ;

# Storage partition: NFS/Lustre servers
StorageFabric=0x0002, ipoib :
    0x0002c903000010=full,
    0x0002c903000011=full ;

# SR-IOV policy with P_Key for GPU fabric partition
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: gpu-rdma-pkey
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: ai-training
  resourceName: gpu-rdma
  capabilities: '{"rdma": true}'
  linkState: auto
  vlan: 0
  # P_Key for GPU fabric partition
  # networkDeviceType: "ib" # InfiniBand mode
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "gpu-ib-fabric"
    }

Switch-Based SM vs Host-Based SM
Option              Pros                          Cons
──────────────────────────────────────────────────────────────────────────
Switch SM           No host resources needed      Limited configuration
(built into IB      Survives host failures        Basic routing only
switch firmware)    Always available              No advanced features

Host OpenSM         Full control, P_Keys, QoS     Needs dedicated host
                    Custom routing algorithms     Host failure = fabric down
                    Open source, free             Must manage HA manually

NVIDIA UFM          Enterprise management         Licensed, cost
                    Adaptive routing              Requires UFM appliance
                    Health monitoring dashboard   Additional infrastructure
                    Telemetry, SHARP support

Verify NCCL Uses IB After SM Setup
# After SM is running, ports should be Active:
ibstat | grep -E "State|Rate"
# State: Active
# Rate: 200
# NCCL should now show NET/IB:
export NCCL_DEBUG=INFO
# "NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB"
# If still showing "Initializing":
# 1. Check SM is running: sminfo
# 2. Check cable: ibstat (Physical state: LinkUp?)
# 3. Check switch port: iblinkinfo | grep Down

Common Issues
IB ports stuck in "Initializing"
- Cause: No Subnet Manager running on the fabric
- Fix: Start OpenSM or enable SM on the IB switch
SM failover takes too long
- Cause: Single SM with no standby; failover requires new SM election
- Fix: Run standby SM on second node with lower priority
"No path to destination" RDMA errors
- Cause: SM routing tables not yet computed, or P_Key mismatch
- Fix: Wait for SM sweep (check opensm.log); verify P_Key membership
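When checking P_Key membership, it helps to remember how the 16-bit value is laid out: the low 15 bits name the partition and the top bit marks full membership. Two ports can talk only if they share a partition and at least one of them is a full member. The sketch below encodes that rule; the helper names are hypothetical.

```python
FULL_MEMBER_BIT = 0x8000  # top bit of the 16-bit P_Key marks full membership
BASE_MASK = 0x7FFF        # low 15 bits identify the partition

def pkey_value(base, full):
    """Combine a 15-bit partition number with the membership bit."""
    if not 0 <= base <= BASE_MASK:
        raise ValueError("partition base must fit in 15 bits")
    return base | FULL_MEMBER_BIT if full else base

def can_communicate(pkey_a, pkey_b):
    """True if both keys are in the same partition and at least one is a full member."""
    same_partition = (pkey_a & BASE_MASK) == (pkey_b & BASE_MASK)
    one_is_full = bool((pkey_a | pkey_b) & FULL_MEMBER_BIT)
    return same_partition and one_is_full

gpu_full = pkey_value(0x0001, full=True)       # GPUFabric, full member
gpu_limited = pkey_value(0x0001, full=False)   # GPUFabric, limited member
storage_full = pkey_value(0x0002, full=True)   # StorageFabric, full member

print(hex(gpu_full))                           # 0x8001
print(can_communicate(gpu_full, gpu_limited))  # True
print(can_communicate(gpu_full, storage_full)) # False
```

This is why a full member of the default partition 0x7FFF shows up as 0xffff in a port's P_Key table, which on Linux can typically be read under /sys/class/infiniband/<device>/ports/<port>/pkeys/.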
Best Practices
- Always run a standby SM: two OpenSM instances with different priorities
- Use switch SM for small clusters (<16 nodes): simpler, no host dependency
- UFM for large clusters (50+ nodes): adaptive routing, telemetry, health monitoring
- P_Keys for multi-tenant: isolate GPU fabric from storage traffic at IB level
- Monitor with ibdiagnet: catches cable issues, speed mismatches, error counters
- Log SM events: topology changes, port state changes, rerouting events
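To see why the two instances should have different priorities, here is a simplified sketch of how a master is chosen among running SMs: the highest priority wins, and a tie goes to the lowest GUID. It ignores the SM state machine and handover details, and the names, GUIDs, and priorities are hypothetical.

```python
from typing import NamedTuple

class SubnetManager(NamedTuple):
    name: str
    guid: int
    priority: int  # 0-15, the value passed to opensm with -p

def elect_master(candidates):
    """Pick the master: highest priority first, lowest GUID breaks ties."""
    if not candidates:
        raise ValueError("no Subnet Manager is running on the fabric")
    for sm in candidates:
        if not 0 <= sm.priority <= 15:
            raise ValueError(f"{sm.name}: priority must be between 0 and 15")
    return max(candidates, key=lambda sm: (sm.priority, -sm.guid))

primary = SubnetManager("mgmt-node-1", guid=0x10, priority=15)
standby = SubnetManager("mgmt-node-2", guid=0x20, priority=14)

print(elect_master([primary, standby]).name)  # mgmt-node-1
print(elect_master([standby]).name)           # mgmt-node-2 takes over if the primary is gone
```

With distinct priorities, which instance is master is predictable rather than decided by GUID order.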
Key Takeaways
- InfiniBand requires a Subnet Manager: no SM means no communication
- OpenSM is free and runs as a DaemonSet on a management node
- SM assigns LIDs, computes routing, manages P_Keys for isolation
- IB ports show "Initializing" without SM, "Active" with SM
- P_Keys partition the fabric (GPU vs storage vs management)
- NVIDIA UFM for production (adaptive routing, monitoring, SHARP)
- Always run standby SM for high availability
- Verify with ibstat, sminfo, ibnetdiscover after setup
