ai • advanced • 15 minutes • K8s 1.28+

Disable PCIe ACS for GPU-Direct P2P

Disable PCIe Access Control Services (ACS) to enable GPU-Direct peer-to-peer DMA between GPUs and RDMA NICs. Covers BIOS disable, kernel override, and when

By Luca Berton • 7 min read

💡 Quick Answer: For bare-metal GPU clusters running only AI training (no VMs, no multi-tenant isolation), the simplest path is to disable VT-d/AMD-Vi entirely in BIOS. If you need SR-IOV (which requires the IOMMU), add pcie_acs_override=downstream,multifunction to the kernel args to allow GPU-Direct P2P across PCIe switches. This parameter comes from an out-of-tree kernel patch, so it only takes effect on kernels built with that patch.

The Problem

When the IOMMU is enabled, ACS (Access Control Services) on PCIe bridges redirects peer-to-peer traffic through the root complex, which blocks direct GPU-to-GPU and GPU-to-NIC DMA:

  • GPUs behind the same PCIe switch can’t do P2P transfers
  • NCCL falls back to CPU-staged copies (10-30x slower)
  • GPU-Direct RDMA path broken between NIC and GPU on same root complex
  • IOMMU groups become too large or too restrictive

The Solution

Decision: Do You Need IOMMU at All?

Question                                          → Action
──────────────────────────────────────────────────────────────────
Running VMs on this node?                         → Keep IOMMU enabled
Running SR-IOV (VFs for Pods)?                    → Keep IOMMU enabled
Multi-tenant with device isolation?               → Keep IOMMU enabled
Bare-metal, single-tenant, GPUs only?             → DISABLE IOMMU entirely
Need SR-IOV + GPU-Direct P2P?                     → IOMMU on + ACS override
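
The decision table above can be encoded as a small shell helper, which is handy in provisioning scripts that label nodes by role. This is a minimal sketch: the function name choose_iommu_mode and its yes/no argument convention are hypothetical, and the returned strings simply repeat the Action column.

```shell
# choose_iommu_mode VMS SRIOV MULTI_TENANT NEED_P2P   (each argument: yes|no)
# Prints the action from the decision table for a node with those needs.
choose_iommu_mode() {
  vms=$1; sriov=$2; multi=$3; p2p=$4
  if [ "$sriov" = yes ] && [ "$p2p" = yes ]; then
    # SR-IOV needs the IOMMU, GPU-Direct P2P needs ACS out of the way
    echo "IOMMU on + ACS override"
  elif [ "$vms" = yes ] || [ "$sriov" = yes ] || [ "$multi" = yes ]; then
    echo "Keep IOMMU enabled"
  else
    # Bare-metal, single-tenant, GPUs only
    echo "DISABLE IOMMU entirely"
  fi
}
```

For example, choose_iommu_mode no no no yes describes a dedicated training node and prints the "DISABLE IOMMU entirely" action.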

Option 1: Disable Virtualization Technology Entirely (Simplest)

If the node is bare-metal, dedicated to GPU training, no SR-IOV needed:

BIOS Settings: Disable I/O Virtualization (IOMMU)
────────────────────────────────────────────────────────────────
Intel:
  • VT-d (Directed I/O):          DISABLED
  • VT-x (Virtualization Tech):   Can stay enabled (CPU virtualization; containers do not need it)
  • SR-IOV:                        DISABLED (if not using VFs)
  • ACS:                           N/A (no IOMMU = no ACS enforcement)

AMD:
  • AMD-Vi (IOMMU):               DISABLED
  • SVM (Secure Virtual Machine): Can stay enabled (CPU virtualization; containers do not need it)
  • SR-IOV:                        DISABLED
  • ACS:                           N/A

Result: All DMA is direct, no translation, no ACS enforcement.
GPUDirect P2P and RDMA work at full speed immediately.

# Kernel parameters (no IOMMU at all):
# Simply omit intel_iommu/amd_iommu parameters, or disable explicitly:
GRUB_CMDLINE_LINUX="intel_iommu=off"   # Intel hosts
# GRUB_CMDLINE_LINUX="amd_iommu=off"   # AMD hosts
# or just don't set any iommu parameter

# Verify after reboot:
dmesg | grep -i iommu
# Should show: nothing, or "DMAR: IOMMU disabled"

cat /proc/cmdline
# No iommu parameters present
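
The /proc/cmdline check above can be scripted so that it returns one word per node. The sketch below is an illustration only: iommu_mode_from_cmdline is a hypothetical helper, and it inspects only the intel_iommu, amd_iommu, and iommu parameters. A result of "unset" means the kernel default applies, which depends on how the kernel was built.

```shell
# iommu_mode_from_cmdline "CMDLINE"
# Classifies a kernel command line as: off | pt | on | unset
iommu_mode_from_cmdline() {
  # Pad with spaces so each parameter can be matched as a whole word
  case " $1 " in
    *" intel_iommu=off "*|*" amd_iommu=off "*|*" iommu=off "*) echo off ;;
    *" iommu=pt "*)                                            echo pt ;;
    *" intel_iommu=on "*)                                      echo on ;;
    *)                                                         echo unset ;;
  esac
}
```

On a live node, call it as iommu_mode_from_cmdline "$(cat /proc/cmdline)".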

OpenShift MachineConfig (disable IOMMU):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-gpu-worker-no-iommu
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  kernelArguments:
    - intel_iommu=off
    - pci=realloc

Talos Linux:

machine:
  install:
    extraKernelArgs:
      - intel_iommu=off
      - pci=realloc

Option 2: Keep IOMMU + Disable ACS (Need SR-IOV + GPU-Direct)

When you need both SR-IOV (for RDMA VFs) and GPU-Direct P2P:

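# Kernel parameter to override ACS on all PCIe bridges.
# Note: pcie_acs_override comes from an out-of-tree patch; kernels built without it ignore the parameter.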
pcie_acs_override=downstream,multifunction

# Full kernel args for SR-IOV + GPU-Direct:
intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction pci=realloc
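
The kernel arguments for Option 1 and Option 2 differ by CPU vendor, which is easy to get wrong in a mixed fleet. The following sketch picks the argument string from the vendor_id value in /proc/cpuinfo. The helper name kernel_args_for and its mode keywords are hypothetical. It also assumes the AMD IOMMU driver is on by default, so the AMD variant of Option 2 adds no explicit enable switch.

```shell
# kernel_args_for VENDOR MODE
#   VENDOR: GenuineIntel | AuthenticAMD  (vendor_id string from /proc/cpuinfo)
#   MODE:   off (Option 1) | override (Option 2)
kernel_args_for() {
  case "$1:$2" in
    GenuineIntel:off)
      echo "intel_iommu=off pci=realloc" ;;
    AuthenticAMD:off)
      echo "amd_iommu=off pci=realloc" ;;
    GenuineIntel:override)
      echo "intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction pci=realloc" ;;
    AuthenticAMD:override)
      # Assumption: AMD IOMMU is enabled by default, so only pass-through mode is requested
      echo "iommu=pt pcie_acs_override=downstream,multifunction pci=realloc" ;;
    *)
      echo "unsupported: $1 $2" >&2
      return 1 ;;
  esac
}
```

A possible call on a live node: kernel_args_for "$(awk -F': ' '/^vendor_id/ {print $2; exit}' /proc/cpuinfo)" off.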

OpenShift MachineConfig:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-gpu-worker-acs-override
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
spec:
  kernelArguments:
    - intel_iommu=on
    - iommu=pt
    - pcie_acs_override=downstream,multifunction
    - pci=realloc

Talos Linux:

machine:
  install:
    extraKernelArgs:
      - intel_iommu=on
      - iommu=pt
      - pcie_acs_override=downstream,multifunction
      - pci=realloc

Option 3: Disable ACS in BIOS Only (Some Vendors)

Some server BIOS expose ACS as a toggle:

BIOS Location (vendor-specific):
────────────────────────────────────────────────────────────────
Dell:
  System BIOS → Integrated Devices → PCIe ACS: Disabled

HPE:
  System Configuration → BIOS → PCIe → ACS Control: Disabled

Supermicro:
  Advanced → PCIe/PCI/PnP → Access Control Services: Disabled

Lenovo:
  UEFI → Devices and I/O → PCIe ACS: Disabled

Note: Not all BIOS versions expose this. If not available,
use kernel parameter override instead.
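
Where neither a BIOS toggle nor a patched kernel is available, another route described in NVIDIA's NCCL troubleshooting guidance is to clear the ACS control register at runtime with setpci. The sketch below only prints the commands so they can be reviewed first; acs_clear_cmds is a hypothetical helper name, and the change it describes does not survive a reboot.

```shell
# acs_clear_cmds: read PCIe bridge addresses (one per line) on stdin and
# print, without executing, the setpci command that zeroes each bridge's
# ACS control register.
acs_clear_cmds() {
  while IFS= read -r bdf; do
    [ -n "$bdf" ] || continue   # skip blank lines
    printf 'setpci -s %s ECAP_ACS+0x6.w=0000\n' "$bdf"
  done
}
```

A typical dry run would be lspci -d ::0604 | awk '{print $1}' | acs_clear_cmds. Bridges without an ACS capability make setpci report an error, which can be ignored.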

Verify ACS Status

# Check if ACS is active on PCIe bridges (run as root; setpci needs extended config-space access)
for bridge in $(lspci -d ::0604 | awk '{print $1}'); do
  acs=$(setpci -s "$bridge" ECAP_ACS+6.w 2>/dev/null)
  if [ -n "$acs" ] && [ "$acs" != "0000" ]; then
    echo "⚠️  ACS ACTIVE on bridge $bridge: control=$acs"
    lspci -s "$bridge"
  fi
done

# If no output → ACS control bits are clear on every bridge ✅
# If bridges listed → ACS still blocking P2P ❌
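
The control value printed by the loop above is a bit field, and only some bits affect P2P routing. The sketch below decodes it using the flag names that lspci -vvv prints on its ACSCtl line; ReqRedir and CmpltRedir are the ones that send peer-to-peer traffic up to the root complex. The helper name acs_decode is hypothetical.

```shell
# acs_decode HEX: list the ACS control bits set in a 16-bit hex value
# (bit 0 = SrcValid ... bit 6 = DirectTrans).
acs_decode() {
  val=$((0x$1))
  out=""
  bit=0
  for name in SrcValid TransBlk ReqRedir CmpltRedir UpstreamFwd EgressCtrl DirectTrans; do
    if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
      out="$out${out:+ }$name"
    fi
    bit=$((bit + 1))
  done
  echo "${out:-none}"
}
```

For example, acs_decode 001d prints "SrcValid ReqRedir CmpltRedir UpstreamFwd", so both redirect bits are still set on that bridge.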

Verify GPU-Direct P2P Works

# Check P2P connectivity matrix
nvidia-smi topo -m

# Expected output (P2P enabled):
#         GPU0  GPU1  GPU2  GPU3  mlx5_0
# GPU0     X    NV12  NV12  NV12  SYS
# GPU1    NV12   X    NV12  NV12  SYS
# GPU2    NV12  NV12   X    NV12  NODE
# GPU3    NV12  NV12  NV12   X    NODE

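# Legend:
# NV12 = NVLink, 12 bonded links (best)
# PIX  = PCIe switch, at most one bridge between devices (good for P2P)
# NODE = Same NUMA node, crossing PCIe host bridges (good)
# SYS  = Cross-NUMA over the CPU interconnect (works but slower)
# X    = Same device

# The matrix shows topology only. To check P2P capability per GPU pair:
nvidia-smi topo -p2p r
# Any entry other than "OK" means P2P reads are unavailable for that pair; ACS is one possible cause

# Test P2P bandwidth directly
/usr/local/cuda/samples/bin/p2pBandwidthLatencyTest
# or build p2pBandwidthLatencyTest from the NVIDIA/cuda-samples repository and run it

# Expected: P2P bandwidth ~25 GB/s per direction (PCIe 4.0 x16)
# If P2P disabled: shows 0 or "P2P not supported"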
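
On nodes with many GPUs the topology matrix is easier to audit as a flat list of pairs. The sketch below turns a matrix like the expected output above into one line per GPU pair. It assumes plain-text input whose first line is the header and whose GPU columns are named GPU0, GPU1, and so on; the helper name topo_pairs is hypothetical, and any terminal colour codes would need stripping first.

```shell
# topo_pairs: read an "nvidia-smi topo -m" style matrix on stdin and print
# "GPUa GPUb LINK" once per GPU pair (a < b). NIC and affinity columns are skipped.
topo_pairs() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[i] = $i; ncol = NF; next }
    $1 ~ /^GPU[0-9]+$/ {
      for (i = 1; i <= ncol; i++) {
        if (col[i] !~ /^GPU[0-9]+$/) continue                      # not a GPU column
        if (substr($1, 4) + 0 >= substr(col[i], 4) + 0) continue   # keep upper triangle
        print $1, col[i], $(i + 1)                                 # row values start at field 2
      }
    }'
}
```

Piping the result through grep -v ' NV' leaves only the pairs that depend on PCIe, which are the ones ACS can affect.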

NCCL Transport Verification After ACS Disable

export NCCL_DEBUG=INFO
export NCCL_P2P_LEVEL=NVL    # P2P only between NVLink-connected GPUs; use PXB or PHB to allow PCIe P2P
export NCCL_NET_GDR_LEVEL=5  # GPU-Direct RDMA for inter-node
export NCCL_IB_HCA=mlx5

# Run all_reduce benchmark
all_reduce_perf -b 8 -e 1G -f 2 -g 8

# Look for:
# "P2P/CUMEM" or "P2P/IPC" in channel info → P2P active ✅
# "SHM" → Shared memory (fallback, slower) ⚠️
# "NET/Socket" → TCP (worst case, ACS or RDMA broken) ❌
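
Scanning a long NCCL_DEBUG=INFO log by eye is error-prone, so the markers listed above can be counted instead. This sketch assumes channel lines end in "via TRANSPORT" (for example "via P2P/CUMEM"); the exact log format varies between NCCL versions, and the helper name nccl_transport_summary is hypothetical.

```shell
# nccl_transport_summary: read an NCCL debug log on stdin and count how many
# channel lines use each transport family.
nccl_transport_summary() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($i != "via") continue
        t = $(i + 1)                       # token right after "via"
        if      (t ~ /^P2P/)          p2p++
        else if (t ~ /^SHM/)          shm++
        else if (t ~ /^NET\/IB/)      ib++
        else if (t ~ /^NET\/Socket/)  sock++
      }
    }
    END { printf "P2P=%d SHM=%d NET_IB=%d NET_SOCKET=%d\n", p2p, shm, ib, sock }'
}
```

A run such as all_reduce_perf -b 8 -e 1G -f 2 -g 8 2>&1 | nccl_transport_summary that reports P2P=0 on a single node suggests intra-node traffic is falling back to a slower path.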

Comparison: Performance Impact

Configuration                              All-Reduce BW    Impact
──────────────────────────────────────────────────────────────────
IOMMU off + no ACS                         ~380 Gb/s        Baseline (best)
IOMMU pt + ACS override                    ~370 Gb/s        -3% (negligible)
IOMMU pt + ACS enabled                     ~180 Gb/s        -53% (P2P blocked)
IOMMU strict + ACS enabled                 ~120 Gb/s        -68% (worst)

(8× A100/H100 + 4× ConnectX-7, all-reduce across 2 nodes)
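
The Impact column is the percent change of each row against the 380 Gb/s baseline, rounded to a whole number. The sketch below reproduces that arithmetic so the same comparison can be applied to your own all_reduce_perf results; the helper name bw_delta_pct is hypothetical.

```shell
# bw_delta_pct BASELINE MEASURED: percent change relative to BASELINE,
# rounded to the nearest whole percent.
bw_delta_pct() {
  awk -v base="$1" -v got="$2" 'BEGIN { printf "%.0f\n", (got - base) / base * 100 }'
}
```

With the table's figures, bw_delta_pct 380 370 gives -3, bw_delta_pct 380 180 gives -53, and bw_delta_pct 380 120 gives -68.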

Quick Decision Flowchart

Do you run VMs or need device isolation?
├── YES → Keep IOMMU on
│         Do you need SR-IOV?
│         ├── YES → iommu=pt + pcie_acs_override=downstream,multifunction
│         └── NO  → iommu=pt (then confirm ACS is off; see Verify ACS Status)
│
└── NO (bare-metal GPU training only)
          → DISABLE VT-d/AMD-Vi in BIOS
            Simplest. Best performance. No ACS issues.
            (You lose: SR-IOV VFs, VM device passthrough)

Common Issues

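“P2P not supported” in nvidia-smi topo after ACS override

  • Cause: pcie_acs_override is provided by an out-of-tree patch; kernels built without it (including mainline) silently ignore the parameter
  • Fix: Confirm the parameter took effect (a patched kernel typically logs a PCIe ACS override warning in dmesg); otherwise disable ACS in BIOS instead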

SR-IOV fails after disabling IOMMU

  • Cause: SR-IOV VFs require IOMMU for address translation
  • Fix: Can’t use SR-IOV without IOMMU; use Option 2 (IOMMU + ACS override)

ACS override in kernel but setpci still shows active

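  • Cause: pcie_acs_override doesn’t change the hardware register; it only changes how the kernel treats ACS when forming IOMMU groups
  • Fix: This is expected for IOMMU grouping. If P2P is still slow, the redirect bits are still set in hardware; clear them in BIOS (Option 3)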

Node won’t boot after removing IOMMU

  • Cause: Some hyperconverged setups depend on IOMMU for storage
  • Fix: Only disable IOMMU on dedicated GPU compute nodes, not infra nodes

Best Practices

  1. Bare-metal AI clusters: just disable VT-d (simplest, fastest, no ACS issues)
  2. Mixed clusters: use one MachineConfigPool per role, so the gpu-worker pool gets its own kernel args
  3. Document the decision: record why IOMMU is off (the team will forget in 6 months)
  4. Test after every BIOS update: updates can reset VT-d to Enabled
  5. Verify with nvidia-smi topo -m plus p2pBandwidthLatencyTest: topology and measured P2P bandwidth together
  6. One config per node role: don’t apply GPU kernel args to infra nodes
  7. Cold reboot after BIOS changes: PCIe topology is enumerated at POST only
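
The advice to re-test after every BIOS update can be automated with a drift check. One signal is the number of entries under /sys/kernel/iommu_groups, which is zero when no IOMMU is active. The sketch below compares that count against the documented expectation for the node; both helper names are hypothetical, and the directory is passed as an argument so the check can be exercised without real hardware.

```shell
# iommu_groups_count DIR: number of IOMMU groups under DIR
# (defaults to /sys/kernel/iommu_groups; 0 means no active IOMMU).
iommu_groups_count() {
  dir=${1:-/sys/kernel/iommu_groups}
  [ -d "$dir" ] || { echo 0; return; }
  ls -1 "$dir" | wc -l | tr -d ' '
}

# check_iommu_expectation EXPECTED [DIR]: EXPECTED is "off" or "on".
# Returns 0 when the node matches, 1 when it has drifted, 2 on bad input.
check_iommu_expectation() {
  n=$(iommu_groups_count "$2")
  case "$1" in
    off) [ "$n" -eq 0 ] ;;
    on)  [ "$n" -gt 0 ] ;;
    *)   return 2 ;;
  esac
}
```

On a node documented as IOMMU-off, check_iommu_expectation off || echo "IOMMU is back on" flags a BIOS update that reset VT-d to Enabled.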

Key Takeaways

  • Simplest fix: Disable VT-d/AMD-Vi in BIOS entirely (if no VMs, no SR-IOV needed)
  • If SR-IOV required: keep IOMMU on + iommu=pt + pcie_acs_override=downstream,multifunction
  • ACS blocks GPU-to-GPU and GPU-to-NIC peer-to-peer DMA (53%+ bandwidth loss)
  • pcie_acs_override tells kernel to ignore ACS on bridges (hardware unchanged)
  • Some BIOS have explicit ACS toggle (Dell, HPE, Supermicro): disable there
  • Verify with: nvidia-smi topo -m (P2P matrix) + NCCL_DEBUG=INFO (transport selection)
  • Performance: IOMMU off ≈ IOMMU pt + ACS override >> ACS enabled (-53%)
  • Decision: bare-metal single-tenant → disable VT-d; multi-tenant/SR-IOV → keep + override
#acs #pcie #gpu-direct #nccl #performance


This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
