IOMMU BIOS and Kernel Config for NCCL GPU-Direct
Configure IOMMU at BIOS and kernel level to enable NCCL GPU-Direct RDMA on Kubernetes. Covers Intel VT-d, AMD-Vi, kernel parameters, passthrough
💡 Quick Answer: NCCL GPU-Direct RDMA requires IOMMU enabled in BIOS (VT-d/AMD-Vi) and configured in the kernel with iommu=pt (passthrough mode) so GPUs and RDMA NICs can perform peer-to-peer DMA without CPU involvement, achieving maximum inter-node communication bandwidth.
The Problem
Multi-GPU distributed training with NCCL needs:
- GPU-to-GPU direct memory access (GPUDirect P2P within a node)
- GPU-to-NIC direct memory access (GPUDirect RDMA across nodes)
- IOMMU must be enabled (required for SR-IOV and device passthrough)
- But IOMMU in strict mode adds DMA translation overhead (kills performance)
- Misconfigured IOMMU = NCCL falls back to CPU-copied transfers (10x slower)
The Solution
BIOS Configuration
Required BIOS Settings for NCCL GPU-Direct:
------------------------------------------------------------------
Setting                  Intel              AMD
------------------------------------------------------------------
IOMMU                    VT-d: Enabled      AMD-Vi: Enabled
SR-IOV                   Enabled            Enabled
Above 4G Decoding        Enabled            Enabled
ACS Override             Disabled*          Disabled*
PCIe ARI                 Enabled            Enabled
PCIe Relaxed Ordering    Enabled            Enabled
PCIe Max Payload Size    Auto/256B          Auto/256B
PCIe Max Read Request    Auto/4096B         Auto/4096B
NUMA                     Enabled            Enabled
NPS (NUMA Per Socket)    N/A                NPS1 or NPS4**
*  ACS (Access Control Services) must be disabled or overridden
   for GPU-Direct P2P within PCIe switch groups
** NPS4 for best NUMA locality to GPUs; NPS1 for simplicity
Kernel Parameters
# Required kernel boot parameters for NCCL GPU-Direct RDMA:
# Intel systems:
intel_iommu=on iommu=pt pci=realloc pci=assign-busses
# AMD systems:
amd_iommu=on iommu=pt pci=realloc pci=assign-busses
# Explanation:
# intel_iommu=on / amd_iommu=on → Enable IOMMU hardware
# iommu=pt → Passthrough mode (CRITICAL for performance)
# pci=realloc → Re-allocate PCIe resources (helps MMIO)
# pci=assign-busses → Reassign PCI bus numbers (multi-root systems)
Why iommu=pt (Passthrough) is Critical
IOMMU Modes:
----------------------------------------------------------------------------
Mode          DMA Path                          Performance     Security
----------------------------------------------------------------------------
Disabled      Direct DMA (no protection)        Best            None
Strict        All DMA through IOMMU             Worst (-30%)    Full
Passthrough   Bypass for host-owned devices     Best            Selective
(iommu=pt)
With iommu=pt:
- Devices assigned to VMs/containers go through IOMMU (security ✓)
- Devices used directly by host bypass IOMMU (performance ✓)
- GPU-to-GPU P2P DMA bypasses translation (GPUDirect P2P ✓)
- GPU-to-NIC DMA bypasses translation (GPUDirect RDMA ✓)
Without iommu=pt (strict mode):
- Every DMA transaction goes through IOMMU page table walk
- GPU-Direct RDMA latency increases 2-5x
- NCCL bandwidth drops 20-30%
- Training throughput degrades significantly
OpenShift MachineConfig for Kernel Parameters
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-gpu-worker-iommu
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  kernelArguments:
    - intel_iommu=on
    - iommu=pt
    - pci=realloc
    - pci=assign-busses
    - rdma_ucm.disable_raw_qp_enforcement=1
    - nvidia.NVreg_RegisterForACPIEvents=1
    - nvidia.NVreg_EnablePCIeRelaxedOrderingMode=1
Talos Linux Kernel Parameters
# Talos machine config patch
machine:
  install:
    extraKernelArgs:
      - intel_iommu=on
      - iommu=pt
      - pci=realloc
      - pci=assign-busses
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_peermem # Required for GPUDirect RDMA
      - name: ib_core
      - name: mlx5_core
      - name: mlx5_ib
Standard Linux (GRUB)
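As a scripted alternative to editing /etc/default/grub by hand, the following is a minimal sketch that appends only the missing arguments to GRUB_CMDLINE_LINUX. The function name add_grub_args is hypothetical, and the sketch assumes the value is double-quoted on a single line; it refuses to touch a file that uses any other format. Regenerating the GRUB configuration and rebooting are still required afterwards.

```bash
# Append any missing kernel arguments to GRUB_CMDLINE_LINUX in a grub defaults file.
# Usage: add_grub_args FILE ARG...
# Assumption: the existing value is double-quoted on one line.
add_grub_args() (
    file=$1; shift
    if grep -q '^GRUB_CMDLINE_LINUX=' "$file"; then
        line=$(grep '^GRUB_CMDLINE_LINUX="[^"]*"$' "$file" | tail -n 1)
        if [ -z "$line" ]; then
            echo "unsupported GRUB_CMDLINE_LINUX format in $file" >&2
            exit 1
        fi
        current=${line#GRUB_CMDLINE_LINUX=\"}
        current=${current%\"}
        has_line=1
    else
        current=""
        has_line=0
    fi

    updated=$current
    for arg in "$@"; do
        case " $updated " in
            *" $arg "*) ;;                                  # already present
            *) updated="${updated:+$updated }$arg" ;;       # append once
        esac
    done

    tmp=$(mktemp)
    if [ "$has_line" -eq 1 ]; then
        # Replace the existing line; ENVIRON avoids awk escape processing.
        NEW_LINE="GRUB_CMDLINE_LINUX=\"$updated\"" \
            awk '/^GRUB_CMDLINE_LINUX=/ {print ENVIRON["NEW_LINE"]; next} {print}' \
            "$file" > "$tmp"
    else
        cat "$file" > "$tmp"
        printf 'GRUB_CMDLINE_LINUX="%s"\n' "$updated" >> "$tmp"
    fi
    # Write back in place so ownership and permissions are preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
)

# Example (run as root on an Intel host):
#   add_grub_args /etc/default/grub intel_iommu=on iommu=pt pci=realloc pci=assign-busses
```

Running the function twice leaves the file unchanged the second time, which makes it safe to call from provisioning scripts.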
# Edit GRUB configuration
sudo vim /etc/default/grub
# Add to GRUB_CMDLINE_LINUX:
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt pci=realloc pci=assign-busses"
# Regenerate GRUB
sudo grub2-mkconfig -o /boot/grub2/grub.cfg # RHEL/Rocky
sudo update-grub # Ubuntu/Debian
# Reboot
sudo reboot
Verify IOMMU Configuration
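The kernel command-line check in this section can also be scripted so that it reports exactly which arguments are absent. This is a small sketch under that assumption; missing_kernel_args is a hypothetical helper name, and it matches whole space-separated arguments so that intel_iommu=on is not mistaken for iommu=on.

```bash
# Print the arguments that are missing from a kernel command-line string.
# Returns 0 when nothing is missing, 1 otherwise.
# Usage: missing_kernel_args "CMDLINE" ARG...
missing_kernel_args() (
    cmdline=$1; shift
    missing=""
    for arg in "$@"; do
        case " $cmdline " in
            *" $arg "*) ;;                                  # found as a whole word
            *) missing="${missing:+$missing }$arg" ;;
        esac
    done
    printf '%s\n' "$missing"
    [ -z "$missing" ]
)

# On a live host:
#   missing_kernel_args "$(cat /proc/cmdline)" intel_iommu=on iommu=pt
```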
# Check IOMMU is enabled
dmesg | grep -i iommu
# Expected: "DMAR: IOMMU enabled"
# or: "AMD-Vi: IOMMU performance counters supported"
# Verify passthrough mode
dmesg | grep -i "iommu.*passthrough\|DMA.*passthrough"
# Expected: "iommu: Default domain type: Passthrough"
# Check kernel command line
grep -ow "iommu=[^ ]*" /proc/cmdline   # -w keeps intel_iommu=on from matching as iommu=on
# Expected: iommu=pt
# List the IOMMU groups that contain GPUs and RDMA NICs
for d in /sys/kernel/iommu_groups/*/devices/*; do
group=$(basename "$(dirname "$(dirname "$d")")")
echo "IOMMU Group $group: $(lspci -nns "$(basename "$d")")"
done | grep -E "NVIDIA|Mellanox"
# Verify GPU-Direct RDMA module loaded
lsmod | grep nvidia_peermem
# If not loaded:
modprobe nvidia_peermem
# Check peermem registration
dmesg | grep -i peermem
# Expected: "nvidia peermem registered"
Verify NCCL Can Use GPU-Direct RDMA
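When NCCL_DEBUG=INFO output is captured to a file, the transport decision can be classified automatically. The sketch below relies only on the log strings quoted in this section ("GPU Direct RDMA Enabled", "NET/IB", "NET/Socket"); exact wording can vary between NCCL versions, so treat the patterns as assumptions to adjust. The function name classify_nccl_transport is hypothetical.

```bash
# Classify an NCCL_DEBUG=INFO log read from stdin.
# Prints one of: gdr, ib-no-gdr, socket, unknown.
classify_nccl_transport() (
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'GPU Direct RDMA Enabled'; then
        echo gdr            # IB transport with GPU-Direct RDMA active
    elif printf '%s\n' "$log" | grep -q 'NET/IB'; then
        echo ib-no-gdr      # IB transport selected, but no GDR confirmation
    elif printf '%s\n' "$log" | grep -q 'NET/Socket'; then
        echo socket         # TCP fallback
    else
        echo unknown
    fi
)

# Example:
#   all_reduce_perf -b 8 -e 256M -f 2 -g 1 2>&1 | tee nccl.log
#   classify_nccl_transport < nccl.log
```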
# Inside a Pod with GPUs + RDMA VFs:
# Set NCCL debug to see transport selection
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_NET_GDR_LEVEL=5 # GPU-Direct RDMA level
export NCCL_IB_GID_INDEX=3 # RoCE v2 GID index
export NCCL_CROSS_NIC=1 # Allow cross-NIC communication
export NCCL_IB_QPS_PER_CONNECTION=4 # QPs per connection
# Run NCCL test
/usr/bin/all_reduce_perf -b 8 -e 256M -f 2 -g 1
# Look for these lines in the output:
# "NET/IB : Using [0]mlx5_0:1/GID ..." ← IB transport selected
# "GPU Direct RDMA Enabled" ← GDR active
# "Channel [0] ... GPU Direct RDMA" ← P2P DMA path confirmed
# If you see instead:
# "NET/Socket" ← Fallback to TCP (BAD)
# "Could not enable GPU Direct RDMA" ← IOMMU or peermem issue
NCCL Environment Variables for GPU-Direct
# Pod spec with full NCCL GPU-Direct configuration
apiVersion: v1
kind: Pod
metadata:
  name: nccl-benchmark
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-network
spec:
  containers:
    - name: nccl
      image: nvcr.io/nvidia/pytorch:24.07-py3
      env:
        # NCCL Transport
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_IB_HCA
          value: "mlx5"
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
        # GPU-Direct RDMA
        - name: NCCL_IB_CUDA_SUPPORT
          value: "1"
        - name: NCCL_IB_GID_INDEX
          value: "3"
        # Performance tuning
        - name: NCCL_IB_QPS_PER_CONNECTION
          value: "4"
        - name: NCCL_IB_TIMEOUT
          value: "22"
        - name: NCCL_IB_RETRY_CNT
          value: "7"
        - name: NCCL_CROSS_NIC
          value: "1"
        # Keep SHM enabled for intra-node traffic (0 = not disabled); inter-node traffic uses RDMA
        - name: NCCL_SHM_DISABLE
          value: "0"
        - name: NCCL_P2P_LEVEL
          value: "NVL"
      resources:
        requests:
          nvidia.com/gpu: "8"
          openshift.io/mellanoxnics: "4"
        limits:
          nvidia.com/gpu: "8"
          openshift.io/mellanoxnics: "4"
ACS (Access Control Services) Handling
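The bridge loop in this section prints the raw ACS Control register word (for example 001d). A small decoder makes that value easier to read. The sketch below assumes the standard PCIe ACS Control bit order and uses the same abbreviations that lspci -vvv prints on its ACSCtl line; decode_acs_ctl is a hypothetical helper name. The request and completion redirect bits (ReqRedir, CmpltRedir) are the ones that route peer-to-peer traffic upstream instead of letting the switch forward it directly.

```bash
# Decode a 16-bit ACS Control register value (hex, as printed by setpci)
# into the names of the enabled ACS features.
# Bit order: 0 SrcValid, 1 TransBlk, 2 ReqRedir, 3 CmpltRedir,
#            4 UpstreamFwd, 5 EgressCtrl, 6 DirectTrans
decode_acs_ctl() (
    value=$((0x$1))
    names="SrcValid TransBlk ReqRedir CmpltRedir UpstreamFwd EgressCtrl DirectTrans"
    bit=0
    enabled=""
    for name in $names; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            enabled="${enabled:+$enabled }$name"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "${enabled:-none}"
)

# Example:
#   decode_acs_ctl "$(setpci -s 0000:00:01.0 ECAP_ACS+6.w)"   # bridge address is a placeholder
```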
# ACS can block GPU-Direct P2P between devices behind the same PCIe switch
# Check if ACS is enabled on PCIe bridges
# Find PCIe bridges above GPUs
lspci -tv | grep -A2 "NVIDIA"
# Check ACS status
for bridge in $(lspci -d ::0604 | awk '{print $1}'); do
acs=$(setpci -s $bridge ECAP_ACS+6.w 2>/dev/null)
if [ -n "$acs" ] && [ "$acs" != "0000" ]; then
echo "ACS active on bridge $bridge: $acs"
fi
done
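# Disable ACS if blocking P2P (kernel parameter)
# Add to kernel args: pcie_acs_override=downstream,multifunction
# Note: pcie_acs_override comes from an out-of-tree kernel patch and changes how
# the kernel forms IOMMU groups; kernels built without that patch ignore it
# OpenShift MachineConfig:
spec:
  kernelArguments:
    - pcie_acs_override=downstream,multifunction
NUMA Topology Verification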
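As a scripted variant of the NUMA checks in this section, the sketch below compares the numa_node value that sysfs exposes for two PCI devices. The helper name same_numa_node and the SYSFS_ROOT override (useful for trying the function against a fake directory tree) are assumptions introduced for illustration.

```bash
# Compare the NUMA node of two PCI devices by reading sysfs.
# Returns 0 when both devices report the same, known NUMA node.
# Usage: same_numa_node GPU_PCI_ADDR NIC_PCI_ADDR
same_numa_node() (
    root=${SYSFS_ROOT:-/sys}
    a=$(cat "$root/bus/pci/devices/$1/numa_node") || exit 2
    b=$(cat "$root/bus/pci/devices/$2/numa_node") || exit 2
    echo "$1: node $a, $2: node $b"
    # -1 means the platform did not report a node, so locality is unknown.
    [ "$a" != "-1" ] && [ "$a" = "$b" ]
)

# Example (the NIC address is a placeholder):
#   same_numa_node 0000:06:00.0 0000:07:00.0 || echo "GPU and NIC are not NUMA-local"
```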
# GPU-Direct RDMA is fastest when GPU and NIC are on same NUMA node
# Verify topology
# Check GPU NUMA node
nvidia-smi topo -m
# Shows GPU<->NIC affinity matrix
# Check NIC NUMA node
cat /sys/class/net/ens1f0np0/device/numa_node
# Should match GPU NUMA node
# Check PCI device NUMA
lspci -vvv -s 0000:06:00.0 | grep "NUMA node"
# If GPU and NIC on different NUMA nodes:
# Performance penalty ~10-15% due to cross-NUMA memory access
# Solution: Pin workloads to NUMA node with topology-aware scheduling
Performance Validation
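The per-NIC figures in this section follow from dividing the link rate in Gb/s by eight bits per byte, so a 400 Gb/s port tops out at 50 GB/s before protocol overhead. A minimal sketch for turning a measured result into a percentage of line rate is shown here; link_efficiency is a hypothetical helper name.

```bash
# Convert a link rate in Gb/s to GB/s and express a measured GB/s value
# as a percentage of that line rate.
# Usage: link_efficiency LINK_GBPS MEASURED_GBYTES_PER_SEC
link_efficiency() {
    awk -v gbps="$1" -v measured="$2" 'BEGIN {
        line = gbps / 8    # 8 bits per byte
        printf "line rate %.1f GB/s, measured %.1f GB/s, %.0f%% of line rate\n",
               line, measured, 100 * measured / line
    }'
}

# Example:
#   link_efficiency 400 48
```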
# Expected GPU-Direct RDMA bandwidth (ConnectX-7, 400Gb/s)
# Single direction: ~48 GB/s per NIC
# Bidirectional: ~96 GB/s per NIC
# Test with ib_write_bw (raw RDMA bandwidth)
# Server:
ib_write_bw --use_cuda=0 -d mlx5_0
# Client:
ib_write_bw --use_cuda=0 -d mlx5_0 <server-ip>
# NCCL all-reduce benchmark (multi-node)
# Expected: ~380 Gb/s bus bandwidth with 8x GPUs + 4x ConnectX-7
# If bandwidth is significantly lower:
# 1. Check iommu=pt is set (cat /proc/cmdline)
# 2. Check nvidia_peermem is loaded (lsmod | grep peermem)
# 3. Check ACS not blocking P2P
# 4. Check NUMA locality (GPU and NIC same NUMA node)
# 5. Check NCCL_NET_GDR_LEVEL=5
Common Issues
NCCL falls back to NET/Socket (TCP)
- Cause: RDMA devices not visible to Pod, or nvidia_peermem not loaded
- Fix: Verify that openshift.io/mellanoxnics is allocated; load the nvidia_peermem module
"GPU Direct RDMA disabled" in NCCL logs
- Cause: iommu=pt not set (strict mode blocks GPU-NIC DMA)
- Fix: Add iommu=pt to kernel parameters; cold reboot
Low bandwidth despite GPU-Direct RDMA active
- Cause: GPU and NIC on different NUMA nodes; ACS blocking P2P path
- Fix: Check nvidia-smi topo -m; add pcie_acs_override if needed
nvidia_peermem fails to load
- Cause: nvidia driver version mismatch or ib_core not loaded
- Fix: Load ib_core first; ensure NVIDIA driver matches kernel module version
IOMMU groups too large (all devices in one group)
- Cause: ACS not supported on PCIe bridge; kernel groups all downstream devices
- Fix: pcie_acs_override splits groups; or accept shared group (less isolation)
Best Practices
- Always iommu=pt: passthrough mode is mandatory for GPU-Direct performance
- Load nvidia_peermem at boot: add to kernel module autoload
- Verify NUMA locality: schedule GPU workloads on nodes where the GPU and NIC share a NUMA node
- Use NCCL_NET_GDR_LEVEL=5: enables full GPU-Direct RDMA path
- Cold reboot after IOMMU changes: a warm reboot doesn't re-enumerate PCIe
- Test with all_reduce_perf: validates full NCCL stack end-to-end
- Monitor NCCL_DEBUG=INFO: confirms transport selection (IB vs Socket)
- Match VFs to GPUs 1:1: one RDMA VF per GPU for optimal topology
Key Takeaways
- IOMMU must be enabled (VT-d/AMD-Vi) for SR-IOV device passthrough
- iommu=pt (passthrough) is critical: strict mode adds 20-30% overhead to DMA
- nvidia_peermem module bridges NVIDIA GPU memory to RDMA subsystem
- NCCL selects GPU-Direct RDMA automatically when: IOMMU=pt + peermem + VF allocated
- ACS on PCIe bridges can block P2P β override with kernel parameter if needed
- NUMA topology matters: GPU and NIC on same NUMA node = best latency
- Verify with NCCL_DEBUG=INFO: look for "GPU Direct RDMA Enabled"
- Full stack: BIOS (VT-d + Above 4G) → Kernel (iommu=pt + peermem) → SR-IOV (VFs) → NCCL (GDR_LEVEL=5)

