Configuration · Advanced · ⏱ 15 minutes · K8s 1.28+

Open Kernel Modules and DMA-BUF for GPUs

Migrate from proprietary NVIDIA kernel modules and nvidia-peermem to open kernel modules with DMA-BUF for safer GPU upgrades.

By Luca Berton • πŸ“– 5 min read

πŸ’‘ Quick Answer: Enable open kernel modules in GPU Operator ClusterPolicy with useOpenKernelModules: true and switch GPUDirect RDMA from nvidia-peermem to DMA-BUF (kernel β‰₯ 6.x). This decouples GPU drivers from the kernel, reducing upgrade fragility.

The Problem

The legacy NVIDIA GPU stack uses proprietary .ko kernel modules tightly coupled to specific kernel versions, plus nvidia-peermem for GPUDirect RDMA. Every kernel update risks breaking the GPU driver, and failures cascade: proprietary module mismatch β†’ GPU unavailable β†’ training jobs killed β†’ teams blocked.

The Solution

NVIDIA's open kernel modules (open-source, built against stable kernel interfaces) decouple the driver from kernel internals. DMA-BUF (an upstream kernel subsystem, used here on kernels β‰₯ 6.x) replaces nvidia-peermem for GPU memory sharing, making upgrades predictable and safe.
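A quick way to tell which flavor a node is running is the version banner in /proc/driver/nvidia/version: open kernel module builds announce themselves there. A minimal sketch β€” the sample banner strings below are illustrative, not captured from a real node:

```shell
# Classify a driver build from its /proc/driver/nvidia/version banner.
# Open builds contain the phrase "Open Kernel Module"; proprietary builds do not.
driver_flavor() {
  case "$1" in
    *"Open Kernel Module"*) echo "open" ;;
    *)                      echo "proprietary" ;;
  esac
}

# Illustrative banners (exact formats vary across driver releases):
driver_flavor "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  560.35.03"  # β†’ open
driver_flavor "NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.104.05"          # β†’ proprietary
```

On a live node: `driver_flavor "$(cat /proc/driver/nvidia/version)"`.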

Before vs After

# ❌ BEFORE (Legacy Stack)
legacy:
  kernel_modules: "Proprietary .ko (nvidia.ko, nvidia-modeset.ko)"
  gpudirect_rdma: "nvidia-peermem (out-of-tree module)"
  coupling: "Tight β€” kernel update breaks GPU driver"
  upgrade_risk: "High β€” driver rebuild per kernel version"

# βœ… AFTER (Current Stack)
current:
  kernel_modules: "Open kernel modules (in-tree compatible)"
  gpudirect_rdma: "DMA-BUF (upstream kernel subsystem, β‰₯ 6.x)"
  coupling: "Loose β€” kernel and driver independent"
  upgrade_risk: "Low β€” standard kernel interfaces"

Enable Open Kernel Modules

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: crio
  driver:
    enabled: true
    # Enable open kernel modules
    useOpenKernelModules: true
    version: "560.35.03"
    repository: nvcr.io/nvidia
    image: driver
    licensingConfig:
      nlsEnabled: false
    # Kernel module parameters
    kernelModuleConfig:
      name: nvidia-module-params
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
  gdrcopy:
    enabled: true
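If the GPU Operator is already installed, the same flag can be flipped without re-applying the whole policy. A sketch, assuming the ClusterPolicy name used above (gpu-cluster-policy) and the standard nvidia/gpu-operator Helm chart; the cluster-facing commands are left commented since they need a live cluster:

```shell
# Merge patch that flips the open-modules flag on the live ClusterPolicy.
PATCH='{"spec":{"driver":{"useOpenKernelModules":true}}}'
echo "$PATCH"

# kubectl patch clusterpolicy gpu-cluster-policy --type merge -p "$PATCH"

# Equivalent Helm value on upgrade:
# helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
#   --set driver.useOpenKernelModules=true
```

Either path restarts the driver daemonset, so schedule it like any other driver upgrade.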

Verify Open Modules

# Check if open modules are loaded
kubectl exec -it nvidia-driver-daemonset-xxxx -n gpu-operator -- \
  cat /proc/driver/nvidia/version
# Should show: "Open Kernel Module"

# Verify DMA-BUF support
kubectl exec -it gpu-pod -- \
  cat /proc/modules | grep -E "nvidia|dma_buf"
# nvidia               ... (Open)
# nvidia_modeset       ... (Open)
# nvidia_uvm           ... (Open)
# dma_buf              ... (kernel built-in)

# Check GPUDirect RDMA via DMA-BUF (not nvidia-peermem)
kubectl exec -it gpu-pod -- \
  lsmod | grep nvidia_peermem
# Should return empty β€” DMA-BUF replaces it

# Verify kernel version β‰₯ 6.x
kubectl exec -it gpu-pod -- uname -r
# 6.x.y required for DMA-BUF
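Because the DMA-BUF path here assumes kernel 6.x+, it is worth gating automation on the kernel version instead of eyeballing uname -r. A small helper sketch (the version strings in the example are arbitrary):

```shell
# Return success only when the kernel major version meets the DMA-BUF
# floor this article assumes (6.x+).
kernel_ok_for_dmabuf() {
  major="${1%%.*}"          # "6.8.0-40-generic" -> "6"
  [ "$major" -ge 6 ]
}

kernel_ok_for_dmabuf "6.8.0-40-generic"      && echo "ok"       # β†’ ok
kernel_ok_for_dmabuf "5.14.0-362.el9.x86_64" || echo "too old"  # β†’ too old
```

In a node preflight script: `kernel_ok_for_dmabuf "$(uname -r)" || exit 1`.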

MachineConfig for DMA-BUF Prerequisites

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-gpu-dma-buf
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/modprobe.d/nvidia-open.conf
          mode: 0644
          contents:
            inline: |
              # Use open kernel modules
              options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
        - path: /etc/modprobe.d/blacklist-nvidia-peermem.conf
          mode: 0644
          contents:
            inline: |
              # DMA-BUF is built into kernel 6.x+, so there is no extra
              # module to load; just keep legacy nvidia-peermem out.
              blacklist nvidia_peermem
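Applying this MachineConfig triggers a Machine Config Operator rollout, which reboots the gpu-worker pool node by node. A sketch of the apply-and-wait flow; the local filename is an assumption, and the oc calls are commented because they need a live cluster:

```shell
# Save the manifest above locally, then roll it out via the MCO.
MC_FILE="99-gpu-dma-buf.yaml"
echo "applying $MC_FILE"

# oc apply -f "$MC_FILE"
# oc wait mcp/gpu-worker --for=condition=Updated --timeout=30m

# Spot-check one node after its reboot:
# oc debug node/<gpu-node> -- chroot /host cat /etc/modprobe.d/nvidia-open.conf
```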

Upgrade Flow Comparison

# Legacy upgrade (proprietary modules):
# 1. New kernel released
# 2. Rebuild proprietary nvidia.ko for new kernel
# 3. Rebuild nvidia-peermem for new kernel
# 4. Test on canary node
# 5. Roll out (high risk of mismatch)
# Risk: 2 out-of-tree modules to rebuild per kernel update

# Open modules + DMA-BUF upgrade:
# 1. New kernel released
# 2. Open modules use stable kernel interfaces (usually compatible)
# 3. DMA-BUF is in-tree (kernel handles it)
# 4. Test on canary node
# 5. Roll out (low risk)
# Risk: Only GPU userspace compatibility to verify

graph TD
    A[Legacy Stack] --> B[Proprietary nvidia.ko]
    A --> C[nvidia-peermem module]
    B --> D[Tight kernel coupling]
    C --> D
    D --> E[High upgrade risk]
    
    F[Current Stack] --> G[Open kernel modules]
    F --> H[DMA-BUF in-tree]
    G --> I[Loose kernel coupling]
    H --> I
    I --> J[Low upgrade risk]
    
    K[Benefit] --> L[Fewer rebuilds per upgrade]
    K --> M[Standard kernel interfaces]
    K --> N[Upstream maintained]

Common Issues

  • Open modules not supported on older GPUs β€” open kernel modules require Turing (T4) or newer architectures; older GPUs (V100) need proprietary modules
  • DMA-BUF not available β€” requires kernel 6.x+; RHEL 8 / older kernels don’t support it
  • GPUDirect performance regression β€” rare; verify DMA-BUF is being used for RDMA with ibv_devinfo and NCCL debug logs
  • Module parameter not applied β€” MachineConfig needs MCO rollout; check oc get mcp gpu-worker

Best Practices

  • Enable open kernel modules for all new GPU deployments on Turing+ hardware
  • Verify kernel β‰₯ 6.x before disabling nvidia-peermem
  • Test open modules on canary nodes before cluster-wide rollout
  • Store module configuration in Git (MachineConfig) β€” not manual modprobe
  • Monitor nvidia-smi after kernel upgrades to verify GPU initialization
  • Combine with canary upgrade strategy for safe GPU driver transitions

Key Takeaways

  • Open kernel modules replace proprietary .ko files with in-tree compatible modules
  • DMA-BUF replaces nvidia-peermem for GPUDirect RDMA (kernel β‰₯ 6.x)
  • Decoupling GPU drivers from kernel reduces upgrade fragility
  • Both changes are configured via ClusterPolicy and MachineConfig
  • Requires Turing+ GPU architecture and kernel 6.x+
  • Upgrade failure rate drops significantly β€” standard kernel interfaces don’t break on updates
#nvidia #kernel-modules #dma-buf #gpudirect #open-source #upgrades
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
