Open Kernel Modules and DMA-BUF for GPUs
Migrate from proprietary NVIDIA kernel modules and nvidia-peermem to open kernel modules with DMA-BUF for safer GPU upgrades.
💡 Quick Answer: Enable open kernel modules in the GPU Operator ClusterPolicy with useOpenKernelModules: true and switch GPUDirect RDMA from nvidia-peermem to DMA-BUF (kernel ≥ 6.x). This decouples GPU drivers from the kernel, reducing upgrade fragility.
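If a ClusterPolicy already exists, the flag can be flipped in place. A minimal sketch, assuming the policy is named gpu-cluster-policy as in the full example below:

# Merge-patch the existing ClusterPolicy to enable open kernel modules.
kubectl patch clusterpolicy gpu-cluster-policy --type merge \
  -p '{"spec":{"driver":{"useOpenKernelModules":true}}}'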
The Problem
The legacy NVIDIA GPU stack uses proprietary .ko kernel modules tightly coupled to specific kernel versions, plus nvidia-peermem for GPUDirect RDMA. Every kernel update risks breaking the GPU driver, and upgrade failures cascade: proprietary module mismatch → GPU unavailable → training jobs killed → teams blocked.
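The coupling is visible directly on a node. A quick check, assuming shell access to a GPU node:

# Proprietary modules report license "NVIDIA"; the open modules report "Dual MIT/GPL".
modinfo -F license nvidia
# The legacy GPUDirect RDMA path loads an extra out-of-tree module:
lsmod | grep nvidia_peermem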
The Solution
Open kernel modules (open-source and in-tree compatible) decouple the driver from kernel internals. DMA-BUF (an upstream kernel subsystem, kernel ≥ 6.x) replaces nvidia-peermem with a standard mechanism for GPU memory sharing, making upgrades predictable and safe.
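To confirm the kernel side is ready, check the DMA-BUF config symbol and the kernel version; a sketch assuming a standard distro kernel config under /boot:

# DMA-BUF is compiled in on modern kernels:
grep CONFIG_DMA_SHARED_BUFFER /boot/config-$(uname -r)
# Expected: CONFIG_DMA_SHARED_BUFFER=y
uname -r   # 6.x or newer is required for the DMA-BUF RDMA path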
Before vs After
# ❌ BEFORE (Legacy Stack)
legacy:
  kernel_modules: "Proprietary .ko (nvidia.ko, nvidia-modeset.ko)"
  gpudirect_rdma: "nvidia-peermem (out-of-tree module)"
  coupling: "Tight → kernel update breaks GPU driver"
  upgrade_risk: "High → driver rebuild per kernel version"

# ✅ AFTER (Current Stack)
current:
  kernel_modules: "Open kernel modules (in-tree compatible)"
  gpudirect_rdma: "DMA-BUF (upstream kernel subsystem, ≥ 6.x)"
  coupling: "Loose → kernel and driver independent"
  upgrade_risk: "Low → standard kernel interfaces"
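Since open modules require Turing or newer (see Common Issues below), confirm GPU architecture before switching. A sketch assuming GPU Feature Discovery labels are present on the nodes:

# List GPU models per node; T4/A100/H100-class GPUs support open modules, V100 and older do not.
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o custom-columns='NODE:.metadata.name,GPU:.metadata.labels.nvidia\.com/gpu\.product'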
Enable Open Kernel Modules

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: crio
  driver:
    enabled: true
    # Enable open kernel modules
    useOpenKernelModules: true
    version: "560.35.03"
    repository: nvcr.io/nvidia
    image: driver
    licensingConfig:
      nlsEnabled: false
    # Kernel module parameters
    kernelModuleConfig:
      name: nvidia-module-params
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
  gdrcopy:
    enabled: true
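Rolling the change out then looks like this; a sketch where the manifest file name is a placeholder and the daemonset name follows GPU Operator defaults:

# Apply the updated policy and wait for the driver daemonset to re-roll.
kubectl apply -f gpu-cluster-policy.yaml
kubectl -n gpu-operator rollout status ds/nvidia-driver-daemonset --timeout=15m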
Verify Open Modules

# Check that the open modules are loaded
kubectl exec -it nvidia-driver-daemonset-xxxx -n gpu-operator -- \
  cat /proc/driver/nvidia/version
# Should show "Open Kernel Module" in the NVRM version string

# Verify the loaded NVIDIA modules via their taint flags
kubectl exec -it gpu-pod -- \
  cat /proc/modules | grep nvidia
# nvidia ... (OE)          <- "O" = out-of-tree, no "P" (proprietary) taint
# nvidia_modeset ... (OE)
# nvidia_uvm ... (OE)
# Note: dma_buf is built into 6.x kernels, so it won't appear in the module list

# Check GPUDirect RDMA via DMA-BUF (not nvidia-peermem)
kubectl exec -it gpu-pod -- \
  lsmod | grep nvidia_peermem
# Should return empty → DMA-BUF replaces it

# Verify kernel version ≥ 6.x
kubectl exec -it gpu-pod -- uname -r
# 6.x.y required for DMA-BUF
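For CI or upgrade pipelines, the same check can be made to fail fast; a sketch reusing the placeholder pod name from above:

# Abort if the node is still running the proprietary stack.
if kubectl exec gpu-pod -- cat /proc/driver/nvidia/version | grep -q "Open Kernel Module"; then
  echo "OK: open kernel modules active"
else
  echo "ERROR: proprietary modules still loaded" >&2
  exit 1
fi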
MachineConfig for DMA-BUF Prerequisites

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-gpu-dma-buf
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/modprobe.d/nvidia-open.conf
          mode: 0644
          contents:
            inline: |
              # Use open kernel modules
              options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
        - path: /etc/modules-load.d/dma-buf.conf
          mode: 0644
          contents:
            inline: |
              # Ensure DMA-BUF is available
              # Usually built-in on kernel 6.x+
              # nvidia-peermem NOT loaded
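Applying the MachineConfig triggers a Machine Config Operator rollout of the gpu-worker pool; a typical flow (the file name is a placeholder):

oc apply -f 99-gpu-dma-buf.yaml
# MCO drains and reboots pool nodes one at a time; wait for UPDATED=True.
oc get mcp gpu-worker -w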
Upgrade Flow Comparison

# Legacy upgrade (proprietary modules):
# 1. New kernel released
# 2. Rebuild proprietary nvidia.ko for the new kernel
# 3. Rebuild nvidia-peermem for the new kernel
# 4. Test on a canary node
# 5. Roll out (high risk of mismatch)
# Risk: 2 out-of-tree modules to rebuild per kernel update

# Open modules + DMA-BUF upgrade:
# 1. New kernel released
# 2. Open modules use stable kernel interfaces (usually compatible)
# 3. DMA-BUF is in-tree (the kernel handles it)
# 4. Test on a canary node
# 5. Roll out (low risk)
# Risk: only GPU userspace compatibility to verify
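A minimal canary step for either flow might look like this; the node name is a placeholder and the kernel update itself happens through your usual mechanism (MCO on OpenShift):

# Cordon the canary node, update the kernel, then sanity-check the GPU on it.
kubectl cordon gpu-node-01
# ... kernel update + reboot ...
kubectl uncordon gpu-node-01
# Find the driver pod on the canary node and verify GPU initialization:
pod=$(kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName=gpu-node-01 -o name | head -n1)
kubectl -n gpu-operator exec "$pod" -- nvidia-smi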
graph TD
    A[Legacy Stack] --> B[Proprietary nvidia.ko]
    A --> C[nvidia-peermem module]
    B --> D[Tight kernel coupling]
    C --> D
    D --> E[High upgrade risk]
    F[Current Stack] --> G[Open kernel modules]
    F --> H[DMA-BUF in-tree]
    G --> I[Loose kernel coupling]
    H --> I
    I --> J[Low upgrade risk]
    K[Benefit] --> L[Fewer rebuilds per upgrade]
    K --> M[Standard kernel interfaces]
    K --> N[Upstream maintained]

Common Issues
- Open modules not supported on older GPUs → open kernel modules require Turing (T4) or newer architectures; older GPUs such as V100 need the proprietary modules
- DMA-BUF not available → requires kernel 6.x+; RHEL 8 and older kernels don't support it
- GPUDirect performance regression → rare; verify DMA-BUF is being used for RDMA with ibv_devinfo and NCCL debug logs (see the sketch after this list)
- Module parameter not applied → the MachineConfig needs an MCO rollout; check oc get mcp gpu-worker
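To chase a suspected GPUDirect regression, NCCL's debug output shows which transport it picked at init time. A sketch using the all_reduce_perf binary from nccl-tests (binary path and pod name are placeholders, and the exact log wording varies by NCCL version):

# Confirm the RDMA NICs are visible from the pod:
kubectl exec -it gpu-pod -- ibv_devinfo | grep -E "hca_id|state"
# Run a small all-reduce with NCCL debug logging and inspect the transport lines:
kubectl exec -it gpu-pod -- env NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
  ./all_reduce_perf -b 64M -e 64M -g 2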
Best Practices
- Enable open kernel modules for all new GPU deployments on Turing+ hardware
- Verify kernel ≥ 6.x before disabling nvidia-peermem
- Test open modules on canary nodes before cluster-wide rollout
- Store module configuration in Git (MachineConfig), not manual modprobe
- Monitor nvidia-smi after kernel upgrades to verify GPU initialization (see the sketch after this list)
- Combine with a canary upgrade strategy for safe GPU driver transitions
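A minimal sketch of that post-upgrade check, querying every driver pod (label and namespace follow GPU Operator defaults and may differ in your cluster):

# Report GPU model and driver version from each node's driver pod.
for pod in $(kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset \
    -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $pod =="
  kubectl -n gpu-operator exec "$pod" -- \
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
done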
Key Takeaways
- Open kernel modules replace proprietary .ko files with in-tree compatible modules
- DMA-BUF replaces nvidia-peermem for GPUDirect RDMA (kernel ≥ 6.x)
- Decoupling GPU drivers from kernel reduces upgrade fragility
- Both changes are configured via ClusterPolicy and MachineConfig
- Requires Turing+ GPU architecture and kernel 6.x+
- Upgrade failure rate drops significantly → standard kernel interfaces don't break on updates

