ai advanced ⏱ 15 minutes K8s 1.28+

IOMMU Kernel Parameters for Kubernetes GPU Nodes

Configure IOMMU kernel parameters for optimal GPU and RDMA performance on Kubernetes. Compare intel_iommu, amd_iommu, and iommu settings, passthrough vs off vs

By Luca Berton • 📖 7 min read

💡 Quick Answer: For GPU/RDMA nodes use iommu=pt (passthrough): the IOMMU hardware stays enabled for device isolation, but DMA bypasses the translation tables and runs at native speed. For environments where you need the generic IOMMU layer without vendor-specific drivers, intel_iommu=off amd_iommu=off iommu=on activates the generic IOMMU subsystem only. For maximum bare-metal performance without SR-IOV, iommu=off removes all IOMMU overhead.

The Problem

  • Different IOMMU parameter combinations have drastically different performance and feature impacts
  • SR-IOV and VFIO require IOMMU groups, but full DMA translation cuts RDMA bandwidth (about 14% in the benchmark below)
  • It is easy to confuse the vendor-specific IOMMU drivers (VT-d / AMD-Vi) with the generic IOMMU subsystem
  • Need to balance device security isolation with DMA throughput for GPUs
  • Wrong IOMMU settings can break GPUDirect RDMA or prevent SR-IOV device assignment

The Solution

All IOMMU Parameter Combinations

Parameters                                    │ Effect                          │ Use Case
──────────────────────────────────────────────┼─────────────────────────────────┼─────────────────
intel_iommu=on iommu=pt                       │ VT-d ON, passthrough DMA        │ GPU+SR-IOV nodes ✅
amd_iommu=on iommu=pt                         │ AMD-Vi ON, passthrough DMA      │ AMD GPU nodes ✅
intel_iommu=off amd_iommu=off iommu=on        │ Generic IOMMU only (no VT-d)    │ Specific drivers
iommu=pt                                      │ Platform IOMMU, passthrough     │ Auto-detect vendor
iommu=off                                     │ All IOMMU disabled              │ Bare-metal, no SR-IOV
intel_iommu=on iommu=strict                   │ VT-d ON, full DMA remapping     │ VMs, security-first
(no params)                                   │ Platform default (varies)       │ Not recommended
──────────────────────────────────────────────┴─────────────────────────────────┴─────────────────
Note: iommu=strict is used in this recipe as shorthand for full DMA translation (vendor IOMMU on, no iommu=pt). The kernel's documented switch for strict TLB invalidation is iommu.strict=1.

# /etc/default/grub (Intel platform)
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

# /etc/default/grub (AMD platform)
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"

# After editing, regenerate the GRUB configuration with your distribution's
# tool (for example update-grub on Debian/Ubuntu) and reboot.

# What this does:
# 1. Enables hardware IOMMU (VT-d or AMD-Vi)
# 2. Creates IOMMU groups (required for VFIO/SR-IOV)
# 3. Sets DMA domain to "passthrough" (no address translation)
# 4. Result: native DMA speed + device isolation capability
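
The choice between the Intel and AMD lines above can be scripted when one image serves mixed hardware. The following is a minimal sketch, not part of the original recipe: cpu_vendor and iommu_args_for_vendor are hypothetical helper names, and the fallback to plain iommu=pt for an unrecognised vendor is an assumption taken from the "Auto-detect vendor" row of the table.

```shell
# Read the vendor_id of the first CPU from a cpuinfo-style file.
# The optional argument defaults to /proc/cpuinfo.
cpu_vendor() {
  awk -F': *' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Map a vendor string to the kernel arguments recommended above.
iommu_args_for_vendor() {
  case "$1" in
    GenuineIntel) echo "intel_iommu=on iommu=pt" ;;
    AuthenticAMD) echo "amd_iommu=on iommu=pt" ;;
    *)            echo "iommu=pt" ;;  # assumption: let the platform pick the driver
  esac
}
```

On a node, iommu_args_for_vendor "$(cpu_vendor)" prints the string to place in GRUB_CMDLINE_LINUX.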

Alternative: Generic IOMMU Without Vendor Drivers

# /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=off amd_iommu=off iommu=on"

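# What this is intended to do:
# 1. Disables vendor-specific IOMMU drivers (VT-d DMA remapping engine OFF)
# 2. Leaves the generic Linux IOMMU subsystem (iommu core) in place
# 3. No DMA remapping overhead (vendor engine disabled)
# 4. Device isolation relies on firmware-reported topology (ACPI DMAR/IVRS) only
# Caveats:
# - IOMMU groups are registered by the vendor driver, so with both drivers off
#   /sys/kernel/iommu_groups may be empty. Confirm with
#   ls /sys/kernel/iommu_groups/ | wc -l before relying on VFIO or SR-IOV here.
# - iommu=on is not among the documented x86 iommu= values, so the kernel
#   may ignore it.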

# When to use:
# - Vendor IOMMU driver causes issues (rare VT-d bugs with specific hardware)
# - Want IOMMU group info without DMA remapping
# - Platform firmware provides adequate isolation
# - Debugging: isolate whether vendor driver or generic layer causes issues

Bare-Metal Without SR-IOV (iommu=off)

# /etc/default/grub
GRUB_CMDLINE_LINUX="iommu=off"
# or explicitly:
GRUB_CMDLINE_LINUX="intel_iommu=off iommu=off"

# What this does:
# 1. Completely disables all IOMMU functionality
# 2. No IOMMU groups created
# 3. No DMA translation (maximum raw performance)
# 4. BREAKS: SR-IOV, VFIO device assignment, secure device isolation

# When to use:
# - Bare-metal GPU nodes without SR-IOV NICs
# - All NICs used as whole PFs (not virtualized)
# - Maximum possible DMA performance (marginal gain over iommu=pt)
# - No virtualization or device passthrough needed

OpenShift MachineConfig Examples

# Option 1: iommu=pt (recommended for GPU + SR-IOV)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-iommu-passthrough-intel
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  kernelArguments:
    - "intel_iommu=on"
    - "iommu=pt"
---
# Option 2: Generic IOMMU only
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-iommu-generic
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  kernelArguments:
    - "intel_iommu=off"
    - "amd_iommu=off"
    - "iommu=on"

Verify Current IOMMU Configuration

# Check active kernel parameters
cat /proc/cmdline | tr ' ' '\n' | grep -i iommu

# Check IOMMU status in dmesg
dmesg | grep -i "iommu\|dmar\|amd-vi"
# Key lines to look for:
#   "DMAR: IOMMU enabled"                    → VT-d active
#   "Default domain type: Passthrough"       → passthrough mode ✅
#   "Default domain type: Translated"        → full DMA translation (slower)
#   "AMD-Vi: IOMMU performance counters..."  → AMD-Vi active

# Check IOMMU domain type per group
cat /sys/kernel/iommu_groups/*/type 2>/dev/null | sort | uniq -c
#  128 identity    (passthrough: devices use identity mapping)
#  DMA or DMA-FQ lines would appear here for groups using translation

# List IOMMU groups
ls /sys/kernel/iommu_groups/ | wc -l
# 128 (or similar β€” should be > 0 if IOMMU enabled)

# Find GPU's IOMMU group
lspci -nn | grep NVIDIA
# 41:00.0 3D controller [0302]: NVIDIA Corporation [10de:2330]
readlink -f /sys/bus/pci/devices/0000:41:00.0/iommu_group
# /sys/kernel/iommu_groups/45
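
To see which devices share a group with the GPU, the same sysfs tree can be walked in one pass. This is a small sketch rather than a tool from the original recipe: list_iommu_group_devices is a hypothetical name, and the optional directory argument exists only so the function can be pointed at a copy of the tree.

```shell
# Print "group <n>: <pci-address>" for every device under an iommu_groups
# directory, sorted by group number. Defaults to the real sysfs location.
list_iommu_group_devices() {
  root="${1:-/sys/kernel/iommu_groups}"
  for dev in "$root"/*/devices/*; do
    [ -e "$dev" ] || continue      # glob did not match: no groups present
    group="${dev%/devices/*}"      # strip "/devices/<address>"
    group="${group##*/}"           # keep only the group number
    echo "group $group: ${dev##*/}"
  done | sort -t' ' -k2,2n
}
```

Piping the output through grep for the GPU's group number shows every device that would have to be assigned together with it.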

Feature Compatibility Matrix

Feature                    │ iommu=off │ iommu=pt │ iommu=strict │ iommu=on (no vendor)
───────────────────────────┼───────────┼──────────┼──────────────┼─────────────────────
GPUDirect RDMA             │ ✅ Fast    │ ✅ Fast   │ ⚠️ Slower     │ ✅ Fast
SR-IOV VF assignment       │ ❌ Broken  │ ✅ Works  │ ✅ Works      │ ⚠️ May work
VFIO device passthrough    │ ❌ Broken  │ ✅ Works  │ ✅ Works      │ ⚠️ Limited
DMA performance            │ 100%      │ ~100%    │ 85-90%       │ ~100%
Device isolation           │ None      │ Groups   │ Full remap   │ Groups (FW-based)
NVIDIA GPU Operator        │ ✅         │ ✅        │ ✅            │ ✅
nvidia-peermem (RDMA)      │ ✅         │ ✅        │ ⚠️ May fail   │ ✅
───────────────────────────┴───────────┴──────────┴──────────────┴─────────────────────

BIOS Settings Required

Setting (Intel)           │ Required For
──────────────────────────┼────────────────────────────
VT-d (Virtualization)     │ intel_iommu=on / iommu=pt
ACS (Access Control)      │ Fine-grained IOMMU groups
SR-IOV                    │ Virtual Functions on NICs
Above 4G Decoding         │ Large BAR GPUs (A100/H100)
──────────────────────────┼────────────────────────────
Setting (AMD)             │ Required For
──────────────────────────┼────────────────────────────
AMD-Vi / IOMMU            │ amd_iommu=on / iommu=pt
ACS                       │ Fine-grained IOMMU groups
SR-IOV                    │ Virtual Functions on NICs
──────────────────────────┴────────────────────────────

If BIOS VT-d is OFF:
  - intel_iommu=on has no effect (hardware not available)
  - No IOMMU groups created
  - SR-IOV/VFIO will fail
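
Whether the firmware exposes an IOMMU at all can be checked from the running OS before touching kernel parameters, because the ACPI table (DMAR for VT-d, IVRS for AMD-Vi) is only present when the BIOS option is enabled. A minimal sketch, assuming the usual /sys/firmware/acpi/tables location; firmware_iommu_table is a hypothetical helper name.

```shell
# Report which IOMMU ACPI table the firmware exposes: DMAR, IVRS, or none.
# Returns non-zero when neither table is present.
firmware_iommu_table() {
  tables="${1:-/sys/firmware/acpi/tables}"
  if [ -e "$tables/DMAR" ]; then
    echo "DMAR"
  elif [ -e "$tables/IVRS" ]; then
    echo "IVRS"
  else
    echo "none"
    return 1
  fi
}
```

A result of none on a node that should support SR-IOV points at the BIOS settings above rather than at the kernel command line.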

Performance Benchmark Comparison

Test: ib_write_bw --use_cuda=0 -s 4194304 (4MB GPUDirect RDMA write)

Configuration                              │ Bandwidth    │ Relative
───────────────────────────────────────────┼──────────────┼──────────
iommu=off                                  │ 396.8 Gb/s   │ 100%
intel_iommu=on iommu=pt                    │ 395.2 Gb/s   │ 99.6%
intel_iommu=off amd_iommu=off iommu=on     │ 394.5 Gb/s   │ 99.4%
intel_iommu=on iommu=strict                │ 340.1 Gb/s   │ 85.7%
───────────────────────────────────────────┴──────────────┴──────────

Key insight: passthrough and generic-only are both ~100% native speed.
Only full strict translation has measurable overhead (14% loss).
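
The Relative column is each bandwidth divided by the iommu=off baseline. When repeating the test on your own hardware, the same arithmetic can be scripted; relative_pct below is a hypothetical helper, shown only as a sketch of the calculation.

```shell
# Express a measured bandwidth as a percentage of a baseline,
# rounded to one decimal place.
relative_pct() {
  awk -v m="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * m / b }'
}
```

For example, relative_pct 340.1 396.8 reproduces the 85.7 in the last row of the table.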

Transition Between Modes

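# Switching a single group from translated to passthrough at runtime (kernel 5.15+):
# the sysfs value for passthrough is "identity", and the write is only
# accepted while the group's devices are not bound to a driver.
echo identity > /sys/kernel/iommu_groups/45/type
# May work for individual groups on newer kernels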

# But generally: requires reboot with new kernel parameters
# Safe transition procedure:
# 1. Cordon node: kubectl cordon gpu-node-1
# 2. Drain workloads: kubectl drain gpu-node-1 --ignore-daemonsets
# 3. Apply MachineConfig (OpenShift) or edit grub (bare-metal)
# 4. Reboot
# 5. Verify: dmesg | grep "Default domain type"
# 6. Uncordon: kubectl uncordon gpu-node-1
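
The procedure above can be wrapped in small functions so that it is rehearsed before it is run. This is a sketch under stated assumptions: run, prepare_node, release_node and domain_type_from_dmesg are hypothetical names, DRY_RUN defaults to 1 so nothing is executed unless you opt in, and the reboot itself (steps 3 and 4) is left out because it depends on the platform.

```shell
# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 1-2: stop scheduling on the node and evict its workloads.
prepare_node() {
  run kubectl cordon "$1" &&
  run kubectl drain "$1" --ignore-daemonsets
}

# Step 5: extract the default domain type from dmesg output on stdin.
domain_type_from_dmesg() {
  sed -n 's/.*Default domain type: \([A-Za-z]*\).*/\1/p' | head -n 1
}

# Step 6: allow scheduling again once the node is verified.
release_node() {
  run kubectl uncordon "$1"
}
```

After the reboot, dmesg | domain_type_from_dmesg should print Passthrough on a node booted with iommu=pt; only then call release_node with DRY_RUN=0.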

Common Issues

SR-IOV VF creation fails with iommu=off

  • Cause: VFIO needs IOMMU groups for device isolation
  • Fix: Switch to iommu=pt, which gives both performance and SR-IOV support

“DMAR: IOMMU disabled” despite kernel params

  • Cause: VT-d disabled in BIOS
  • Fix: Enable VT-d / AMD-Vi in BIOS, reboot, then verify with dmesg | grep DMAR

GPUDirect RDMA bandwidth drops after enabling iommu=strict

  • Cause: Full DMA address translation for every transfer
  • Fix: Switch to iommu=pt, since passthrough gives native speed with isolation

“No IOMMU group” when binding device to VFIO

  • Cause: IOMMU not enabled or not detecting device
  • Fix: Verify intel_iommu=on in cmdline AND VT-d enabled in BIOS; check DMAR ACPI table exists
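
The "No IOMMU group" case can be caught before attempting a VFIO bind by checking for the device's iommu_group entry in sysfs, the same path used in the verification section. A minimal sketch; has_iommu_group is a hypothetical helper name and the second argument exists only to allow testing against a copy of the tree.

```shell
# Succeed only if the PCI device (given as domain:bus:dev.fn) has an
# iommu_group entry. Defaults to the real /sys/bus/pci/devices tree.
has_iommu_group() {
  bdf="$1"
  root="${2:-/sys/bus/pci/devices}"
  if [ -e "$root/$bdf/iommu_group" ]; then
    echo "$bdf: IOMMU group present"
  else
    echo "$bdf: no IOMMU group (check kernel parameters and BIOS settings)"
    return 1
  fi
}
```

For the GPU in the earlier example the call would be has_iommu_group 0000:41:00.0.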

Best Practices

  1. iommu=pt is the default recommendation: it covers 95% of GPU/RDMA use cases
  2. Don’t use iommu=strict for GPU nodes: 14% bandwidth loss, justified only when full DMA remapping is a security requirement
  3. iommu=off only if absolutely no SR-IOV: saves IOMMU group overhead but breaks VFIO
  4. Always enable VT-d/AMD-Vi in BIOS, even if you plan to use passthrough
  5. Test RDMA bandwidth after any IOMMU change to verify there is no regression
  6. Use MachineConfig for fleet consistency: don’t rely on manual grub edits
  7. Document your choice: future operators need to know why params were set

Key Takeaways

  • iommu=pt: IOMMU hardware ON + passthrough DMA = best for GPU + SR-IOV (recommended)
  • intel_iommu=off amd_iommu=off iommu=on: generic IOMMU subsystem only (no vendor driver)
  • iommu=off: everything disabled (max perf, breaks SR-IOV/VFIO)
  • iommu=strict: full DMA remapping (14% bandwidth loss; avoid for GPU nodes)
  • Passthrough mode: native DMA speed (~100%) with IOMMU groups for isolation
  • BIOS VT-d/AMD-Vi must be enabled for any IOMMU kernel param to take effect
  • SR-IOV requires IOMMU groups β€” can’t use iommu=off with SR-IOV NICs
#iommu #kernel #gpu #performance #sr-iov
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
