NVIDIA GPU Operator MOFED Driver Configuration
Configure the NVIDIA GPU Operator to deploy Mellanox OFED drivers for high-performance RDMA networking on Kubernetes GPU nodes with InfiniBand and RoCE support.
💡 Quick Answer: Enable MOFED in the GPU Operator ClusterPolicy with `driver.rdma.enabled=true` and `driver.rdma.useHostMofed=false` to deploy containerized Mellanox OFED drivers, or set `useHostMofed=true` to use pre-installed host drivers.
The Problem
AI and HPC workloads running on Kubernetes need high-bandwidth, low-latency networking between GPU nodes. Standard TCP/IP networking adds unacceptable overhead for distributed training; you need RDMA (Remote Direct Memory Access) via InfiniBand or RoCE.
The Mellanox OFED (MOFED) driver stack enables RDMA on ConnectX NICs, but installing and managing these drivers across a fleet of GPU nodes is complex:
- Driver version alignment: the MOFED version must match the kernel and GPU driver
- Node lifecycle: drivers must survive reboots, upgrades, and node replacements
- Containerized vs. host drivers: the deployment model you choose affects maintenance
The NVIDIA GPU Operator automates MOFED driver deployment through the ClusterPolicy CRD.
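Before editing anything, it can help to confirm the operator's CRD is actually registered in your cluster. A minimal sketch, assuming `kubectl` is on the PATH and pointed at the right cluster (the CRD name is the one the GPU Operator registers):

```bash
# Probe for the GPU Operator's ClusterPolicy CRD.
# Prints one of two fixed messages, so it is safe to run anywhere.
if command -v kubectl >/dev/null 2>&1 \
   && kubectl get crd clusterpolicies.nvidia.com >/dev/null 2>&1; then
  echo "ClusterPolicy CRD present"
else
  echo "ClusterPolicy CRD not found (install the GPU Operator first)"
fi
```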
The Solution
Step 1: Install the GPU Operator with MOFED Enabled
```bash
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install with MOFED driver support
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.rdma.enabled=true \
  --set driver.rdma.useHostMofed=false
```

Step 2: Configure ClusterPolicy for MOFED
For fine-grained control, create or patch the ClusterPolicy:
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
    version: "550.127.08"
    rdma:
      enabled: true
      useHostMofed: false
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      drain:
        enable: true
        force: true
        timeoutSeconds: 300
  mofed:
    enabled: true
    image: mofed
    repository: nvcr.io/nvidia/mellanox
    version: "24.07-0.6.1.0"
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
      - name: UNLOAD_STORAGE_MODULES
        value: "true"
      - name: ENABLE_NFSRDMA
        value: "false"
      - name: RESTORE_DRIVER_ON_POD_TERMINATION
        value: "true"
  devicePlugin:
    enabled: true
  toolkit:
    enabled: true
```

```bash
kubectl apply -f cluster-policy.yaml
```

Step 3: Verify MOFED Driver Deployment
```bash
# Check MOFED driver pods are running on GPU nodes
kubectl get pods -n gpu-operator -l app=mofed-ubuntu -o wide

# Verify the MOFED driver version
kubectl exec -n gpu-operator -it $(kubectl get pod -n gpu-operator \
  -l app=mofed-ubuntu -o jsonpath='{.items[0].metadata.name}') \
  -- ofed_info -s
# Expected output: MLNX_OFED_LINUX-24.07-0.6.1.0

# Check RDMA devices are visible
kubectl exec -n gpu-operator -it $(kubectl get pod -n gpu-operator \
  -l app=mofed-ubuntu -o jsonpath='{.items[0].metadata.name}') \
  -- ibstat
```

Step 4: Host MOFED vs Containerized MOFED
Containerized MOFED (default, `useHostMofed: false`):
- The GPU Operator deploys MOFED as a DaemonSet
- Automatic updates via the ClusterPolicy
- Easier lifecycle management

Host MOFED (`useHostMofed: true`):
- Pre-install MOFED on nodes before installing the GPU Operator
- The Operator skips MOFED deployment and uses the existing drivers
- Required when a specific MOFED build or patches are needed
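One way to pick between the two models is to check whether a node already has an OFED install. A rough sketch; it assumes the presence of the `ofed_info` tool is a good enough signal for a host install, so adapt it to your provisioning setup:

```bash
# Suggest a useHostMofed value based on whether OFED tooling
# is already installed on this node.
if command -v ofed_info >/dev/null 2>&1; then
  echo "useHostMofed=true"   # host already has MOFED
else
  echo "useHostMofed=false"  # let the operator deploy containerized MOFED
fi
```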
```yaml
# Use pre-installed host MOFED drivers
spec:
  driver:
    rdma:
      enabled: true
      useHostMofed: true
  mofed:
    enabled: false  # Don't deploy containerized MOFED
```

Step 5: Configure MOFED Environment Variables
Key environment variables for MOFED driver pods:
```yaml
spec:
  mofed:
    enabled: true
    env:
      # Unload storage modules to avoid conflicts
      - name: UNLOAD_STORAGE_MODULES
        value: "true"
      # Enable NFS over RDMA (set to true for storage workloads)
      - name: ENABLE_NFSRDMA
        value: "false"
      # Restore the node's drivers when the MOFED pod terminates
      - name: RESTORE_DRIVER_ON_POD_TERMINATION
        value: "true"
      # Force a specific firmware version (optional)
      - name: FORCE_FW_UPDATE
        value: "false"
```

```mermaid
flowchart TD
    A[GPU Operator Helm Install] --> B[ClusterPolicy CRD]
    B --> C{useHostMofed?}
    C -->|false| D[Containerized MOFED DaemonSet]
    C -->|true| E[Use Pre-installed Host Drivers]
    D --> F[MOFED Pod per GPU Node]
    E --> F
    F --> G[RDMA Devices Available]
    G --> H[GPUDirect RDMA Ready]
    G --> I[SR-IOV VFs Ready]
```

Common Issues
MOFED Pod Stuck in Init
```bash
# Check MOFED pod logs
kubectl logs -n gpu-operator -l app=mofed-ubuntu --tail=50

# Common cause: kernel headers not available
# Fix: ensure kernel-devel packages match the running kernel
ls -d "/usr/src/linux-headers-$(uname -r)" "/usr/src/kernels/$(uname -r)" 2>/dev/null
```

MOFED and Secure Boot Conflict
MOFED kernel modules are unsigned by default, so nodes with Secure Boot enabled will refuse to load them:

```yaml
spec:
  mofed:
    enabled: true
    env:
      - name: CREATE_IFNAMES_UDEV
        value: "true"
    # Use pre-signed drivers or disable Secure Boot
```

MOFED Version Compatibility
| MOFED Version | GPU Driver | Kubernetes | Notes |
|---|---|---|---|
| 24.07-0.6.1.0 | 550.x | 1.27+ | Current recommended |
| 23.10-x | 545.x | 1.25+ | Previous LTS |
| 24.01-x | 550.x | 1.27+ | Intermediate release |
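The table above can be encoded as a quick pre-flight check before an upgrade. A minimal sketch; the pairings are transcribed from the table, and the function name `mofed_compatible` is made up for illustration:

```bash
# Return success if a MOFED/GPU-driver pairing matches the table above.
mofed_compatible() {
  local mofed="$1" driver="$2"
  case "$mofed" in
    24.07-*|24.01-*) case "$driver" in 550.*) return 0 ;; esac ;;
    23.10-*)         case "$driver" in 545.*) return 0 ;; esac ;;
  esac
  return 1
}

mofed_compatible "24.07-0.6.1.0" "550.127.08" && echo "compatible"
```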
Best Practices
- Pin MOFED versions: don't use `latest`; match the MOFED version to your GPU driver version
- Use `autoUpgrade` carefully: test MOFED upgrades in staging before production
- Enable drain on upgrade: `drain.enable: true` prevents workload disruption
- Monitor with `ibstat`: regularly check link state and speed
- Use containerized MOFED unless you have specific host-level requirements
- Set `RESTORE_DRIVER_ON_POD_TERMINATION: "true"`: keeps the node's drivers in a working state when the MOFED pod terminates or restarts
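The `ibstat` monitoring practice above can be automated. A small sketch: the hypothetical helper `check_links` reads `ibstat`-style output on stdin and fails if any port's `State:` line is not `Active`; in a real cluster you would pipe `kubectl exec ... -- ibstat` into it, the sample input here is illustrative:

```bash
# Succeed only if every "State:" line in ibstat output reads Active.
check_links() {
  awk '/State:/ && !/Active/ { bad = 1 } END { exit bad }'
}

# Illustrative sample input; prints "degraded RDMA link"
printf 'State: Active\nState: Down\n' | check_links || echo "degraded RDMA link"
```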
Key Takeaways
- The GPU Operator ClusterPolicy manages MOFED driver lifecycle as a DaemonSet
- Choose between containerized MOFED (automated) or host MOFED (pre-installed) based on your needs
- MOFED enables RDMA networking required for GPUDirect and high-performance distributed training
- Always pin MOFED versions and test upgrades in staging before rolling out to production
