ai advanced ⏱ 30 minutes K8s 1.28+

NCCL SR-IOV GDS PyTorch Configuration

Configure NCCL with SR-IOV RDMA and GPUDirect Storage on Kubernetes. PyTorch 25.11 container with NCCL 2.28, CUDA 13, MOFED 5.4, GDRCopy 2.

By Luca Berton • 📖 10 min read

💡 Quick Answer: Set NCCL_DEBUG=INFO, NCCL_SOCKET_IFNAME=net1 (SR-IOV secondary interface), enable IOMMU passthrough (iommu=pt) for GDS/GPUDirect RDMA, and run torchrun with the PyTorch 25.11 NGC container (NCCL 2.28.8, CUDA 13.0, MOFED 5.4). Verify NCCL logs show NET/IB transport and GPU Direct RDMA enabled.

The Problem

Multi-node GPU training requires NCCL (NVIDIA Collective Communications Library) to move tensors between GPUs across nodes. For maximum performance:

  • SR-IOV provides dedicated virtual network functions per pod, so there is no shared NIC contention
  • GPUDirect RDMA (GDR) transfers data directly between GPU memory and the NIC, bypassing the CPU
  • GPUDirect Storage (GDS) reads training data directly from NVMe/NFS into GPU memory, bypassing the page cache
  • IOMMU passthrough is required for DMA between GPU and NIC PCI devices

Without proper configuration, NCCL falls back to TCP over the primary CNI interface, which is 10-100× slower than RDMA.

The Solution

Infrastructure Prerequisites

1. IOMMU Configuration

IOMMU must be enabled in passthrough mode for GPUDirect RDMA and GDS:

# OpenShift MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: gpu-worker
  name: 99-iommu-passthrough
spec:
  kernelArguments:
    - intel_iommu=on
    - iommu=pt

For AMD systems:

  kernelArguments:
    - amd_iommu=on
    - iommu=pt

Why iommu=pt (passthrough)?

  • Without iommu=pt: All DMA goes through IOMMU translation, which adds latency
  • With iommu=pt: Devices assigned to the host kernel bypass IOMMU translation; only devices assigned to VMs/containers use IOMMU isolation
  • Required for GDR: GPU-to-NIC DMA needs direct physical address access

Verify after reboot:

# Check IOMMU is enabled
dmesg | grep -i iommu
# Intel-IOMMU: enabled
# DMAR: IOMMU enabled

# Check passthrough mode
cat /proc/cmdline | grep iommu=pt
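
The same check can be scripted for a fleet of nodes. The following is a minimal sketch, assuming the kernel command line has already been collected as a string (for example from /proc/cmdline); the function name iommu_status is hypothetical and only looks for the flags used in this recipe.

```python
def iommu_status(cmdline: str) -> dict:
    """Report which IOMMU flags from this recipe appear in a kernel command line."""
    args = cmdline.split()
    return {
        # Either vendor flag counts as "IOMMU requested on the command line".
        "enabled": "intel_iommu=on" in args or "amd_iommu=on" in args,
        # Passthrough mode is what GDR and GDS rely on in this setup.
        "passthrough": "iommu=pt" in args,
    }


if __name__ == "__main__":
    # Sample string for illustration; on a node, pass the contents of /proc/cmdline.
    sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 intel_iommu=on iommu=pt"
    print(iommu_status(sample))
```

A node that reports passthrough as False has not picked up the MachineConfig yet, or was not rebooted after it was applied.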

2. SR-IOV Network Configuration

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rdma-sriov-policy
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  resourceName: mellanoxrdma
  numVfs: 8
  nicSelector:
    vendor: "15b3"
    deviceID: "101d"    # ConnectX-7
    pfNames: ["ens2f0"]
  deviceType: netdevice   # Required for RDMA verbs (not vfio-pci)
  isRdma: true
  linkType: ETH
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rdma-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mellanoxrdma
  networkNamespace: ai-training
  ipam: |
    {
      "type": "whereabouts",
      "range": "10.0.100.0/24"
    }

⚠️ deviceType: netdevice is mandatory for RDMA. vfio-pci bypasses the kernel network stack β€” no RDMA verbs available.

3. GPU Operator with Open Kernel Modules

GPUDirect RDMA requires open kernel modules (DMA-BUF support):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    useOpenKernelModules: true
  gdrcopy:
    enabled: true
  gds:
    enabled: true

PyTorch 25.11 Container Stack

The NGC PyTorch 25.11 container bundles the complete RDMA/GDS stack:

| Component | Version | Purpose |
|---|---|---|
| CUDA | 13.0.2 | GPU compute runtime |
| PyTorch | 2.10.0a0 | Training framework |
| NCCL | 2.28.8 | GPU collective communications |
| MOFED | 5.4 (rdmacore 56.0) | Mellanox RDMA driver userspace |
| HPC-X | 2.24.1 | MPI + RDMA collective acceleration |
| OpenUCX | 1.19.0 | Unified communication framework |
| GDRCopy | 2.5.1 | Low-latency GPU memory copy |
| cuFile | 1.15.1.6 | GPUDirect Storage API |
| OpenMPI | 4.1.7 | Multi-node process management |
| NVSHMEM | 3.4.5 | NVIDIA symmetric memory |
| AWS OFI NCCL | 1.17.0 | EFA/libfabric NCCL plugin |
| TensorRT | 10.14.1 | Inference optimization |
| Transformer Engine | 2.9 | FP8 training support |
| DALI | 1.52.0 | GPU data loading pipeline |
| DOCA | 3.1.0 | DPU/NIC offload SDK |

Pod Manifest

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-nccl-training
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-network
spec:
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:25.11-py3
      command:
        - /bin/bash
        - -c
        - |
          # NCCL configuration
          export NCCL_DEBUG=INFO
          export NCCL_SOCKET_IFNAME=net1
          export NCCL_NET=IB

          # GPUDirect RDMA
          export NCCL_NET_GDR_LEVEL=SYS
          export NCCL_NET_GDR_READ=1

          # Performance tuning
          export NCCL_IB_QPS_PER_CONNECTION=4
          export NCCL_IB_GID_INDEX=3

          # GDS for data loading (if cuFile/GDS enabled)
          export CUFILE_ENV_PATH_JSON=/etc/cufile.json

          # Run training
          torchrun \
            --nproc_per_node=$NUM_GPUS \
            --nnodes=$WORLD_SIZE \
            --node_rank=$RANK \
            --master_addr=$MASTER_ADDR \
            --master_port=29500 \
            multinode.py --batch_size 32 1000 25
      env:
        - name: NUM_GPUS
          value: "8"
        - name: WORLD_SIZE
          value: "2"
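        # The job-completion-index annotation is set only on pods created by an Indexed Job,
        # so run this pod spec as the template of an Indexed Job for RANK to resolve.
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']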
        - name: MASTER_ADDR
          value: "pytorch-nccl-training-0.training-headless"
      resources:
        limits:
          nvidia.com/gpu: "8"
          openshift.io/mellanoxrdma: "1"
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
        - name: training-data
          mountPath: /data
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 64Gi
    - name: training-data
      persistentVolumeClaim:
        claimName: training-data-pvc
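
Note that the WORLD_SIZE variable in this manifest carries the node count passed to --nnodes; torchrun then derives the per-process world size and ranks itself. The arithmetic for a static (non-elastic) launch can be sketched as follows; the helper names are hypothetical.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of one process in a static launch with equal-sized nodes."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    # Two nodes with eight GPUs each, as in the manifest above.
    print(world_size(2, 8))      # 16 processes in total
    print(global_rank(1, 8, 0))  # first GPU on the second node is rank 8
```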

NCCL Environment Variables Explained

Core Settings

| Variable | Value | Purpose |
|---|---|---|
| NCCL_DEBUG | INFO | Show transport selection and connection details |
| NCCL_SOCKET_IFNAME | net1 | Use the SR-IOV secondary interface (not pod eth0) for NCCL bootstrap and socket traffic |
| NCCL_NET | IB | Force InfiniBand/RoCE transport (not TCP) |

GPUDirect RDMA

| Variable | Value | Purpose |
|---|---|---|
| NCCL_NET_GDR_LEVEL | SYS | Enable GDR across PCIe switches/NUMA nodes |
| NCCL_NET_GDR_READ | 1 | Use GDR when sending as well (the NIC reads directly from GPU memory), not just when receiving |

GDR levels (from most restrictive to least):

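  • LOC: never use GPUDirect RDMA (always disabled)
  • PIX: GPU and NIC on the same PCIe switch
  • PXB: GPU and NIC connected through PCIe switches (possibly multiple hops)
  • PHB: GPU and NIC on the same NUMA node (traffic crosses the CPU host bridge)
  • SYS: anywhere in the system, including across the inter-socket link between NUMA nodes (most permissive, required for cross-NUMA)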
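
The level acts as a maximum allowed GPU-to-NIC distance. The sketch below is a simplified model of that rule as described in the NCCL environment-variable documentation (where LOC means GDR is never used); it is not NCCL's topology code, and GdrLevel and gdr_allowed are hypothetical names.

```python
from enum import IntEnum


class GdrLevel(IntEnum):
    """Ordered from most restrictive to most permissive."""
    LOC = 0  # never use GPUDirect RDMA
    PIX = 1  # GPU and NIC on the same PCIe switch
    PXB = 2  # connected through PCIe switches, possibly several hops
    PHB = 3  # same NUMA node, traffic crosses the CPU host bridge
    SYS = 4  # across the inter-socket link between NUMA nodes


def gdr_allowed(path: GdrLevel, limit: GdrLevel) -> bool:
    """True when the GPU-NIC path is no farther than the configured limit."""
    if limit is GdrLevel.LOC:
        return False  # LOC disables GDR regardless of the path
    return path <= limit


if __name__ == "__main__":
    # A cross-NUMA GPU-NIC pair only gets GDR when the limit is SYS.
    print(gdr_allowed(GdrLevel.SYS, GdrLevel.PHB))  # False
    print(gdr_allowed(GdrLevel.SYS, GdrLevel.SYS))  # True
```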

Performance Tuning

| Variable | Value | Purpose |
|---|---|---|
| NCCL_IB_QPS_PER_CONNECTION | 4 | Queue pairs per connection (more = higher throughput) |
| NCCL_IB_GID_INDEX | 3 | RoCEv2 GID index (IPv4 RoCEv2 = index 3 typically) |
| NCCL_IB_TC | 106 | Traffic class (DSCP 26/AF31 for lossless queue) |
| NCCL_ALGO | Ring,Tree | Algorithm selection (Ring for bandwidth, Tree for latency) |
| NCCL_PROTO | Simple,LL,LL128 | Protocol selection (LL128 for small messages) |
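
The traffic class value 106 follows from the layout of the 8-bit IP traffic class field: the upper six bits carry the DSCP and the lower two bits carry ECN. A small sketch of that arithmetic, with hypothetical helper names:

```python
def split_traffic_class(tc: int) -> tuple[int, int]:
    """Split an 8-bit traffic class into (DSCP, ECN)."""
    if not 0 <= tc <= 255:
        raise ValueError("traffic class must fit in 8 bits")
    return tc >> 2, tc & 0b11


def make_traffic_class(dscp: int, ecn: int = 0) -> int:
    """Combine a 6-bit DSCP and a 2-bit ECN value into a traffic class."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("dscp must be 0-63 and ecn 0-3")
    return (dscp << 2) | ecn


if __name__ == "__main__":
    print(split_traffic_class(106))   # (26, 2): DSCP 26 with ECN bits 0b10
    print(make_traffic_class(26, 2))  # 106
```

If the switch QoS policy matches on a different DSCP, recompute the value with make_traffic_class rather than reusing 106.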

Multi-Node with LeaderWorkerSet

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: pytorch-training
  namespace: ai-training
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: rdma-network
      spec:
        containers:
          - name: training
            image: nvcr.io/nvidia/pytorch:25.11-py3
            command: ["/bin/bash", "-c"]
            args:
              - |
                export NCCL_DEBUG=INFO
                export NCCL_SOCKET_IFNAME=net1
                export NCCL_NET=IB
                export NCCL_NET_GDR_LEVEL=SYS
                export NCCL_NET_GDR_READ=1
                
                torchrun \
                  --nproc_per_node=8 \
                  --nnodes=2 \
                  --node_rank=0 \
                  --master_addr=$(hostname) \
                  --master_port=29500 \
                  multinode.py --batch_size 32 1000 25
            resources:
              limits:
                nvidia.com/gpu: "8"
                openshift.io/mellanoxrdma: "1"
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]
            volumeMounts:
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
    workerTemplate:
      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: rdma-network
      spec:
        containers:
          - name: training
            image: nvcr.io/nvidia/pytorch:25.11-py3
            command: ["/bin/bash", "-c"]
            args:
              - |
                export NCCL_DEBUG=INFO
                export NCCL_SOCKET_IFNAME=net1
                export NCCL_NET=IB
                export NCCL_NET_GDR_LEVEL=SYS
                export NCCL_NET_GDR_READ=1
                
                torchrun \
                  --nproc_per_node=8 \
                  --nnodes=2 \
                  --node_rank=1 \
                  --master_addr=${LWS_LEADER_ADDRESS} \
                  --master_port=29500 \
                  multinode.py --batch_size 32 1000 25
            resources:
              limits:
                nvidia.com/gpu: "8"
                openshift.io/mellanoxrdma: "1"
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]
            volumeMounts:
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
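
The leader and worker templates differ only in --node_rank and --master_addr, so a single entrypoint could derive both from the environment. The sketch below assumes that LeaderWorkerSet injects LWS_GROUP_SIZE and LWS_WORKER_INDEX alongside the LWS_LEADER_ADDRESS variable used above; treat those two names as assumptions to verify against your LWS version. torchrun_command is a hypothetical helper.

```python
def torchrun_command(env: dict[str, str], nproc_per_node: int = 8, port: int = 29500) -> list[str]:
    """Build the torchrun argument list for one pod of the group."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={env['LWS_GROUP_SIZE']}",          # assumed: pods per group
        f"--node_rank={env['LWS_WORKER_INDEX']}",     # assumed: 0 for the leader
        f"--master_addr={env['LWS_LEADER_ADDRESS']}",
        f"--master_port={port}",
        "multinode.py", "--batch_size", "32", "1000", "25",
    ]


if __name__ == "__main__":
    # Illustrative values; in a pod, pass dict(os.environ) instead.
    sample_env = {
        "LWS_GROUP_SIZE": "2",
        "LWS_WORKER_INDEX": "1",
        "LWS_LEADER_ADDRESS": "leader.example",
    }
    print(" ".join(torchrun_command(sample_env)))
```

With an entrypoint like this, the leader and worker templates can share one command, which avoids the two copies drifting apart.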

GPUDirect Storage (GDS) Configuration

GDS allows training data to flow directly from NVMe/NFS storage into GPU memory:

{
  "logging": {
    "level": 2
  },
  "profile": {
    "nvtx": false
  },
  "properties": {
    "max_direct_io_size_kb": 16384,
    "max_device_cache_size_kb": 131072,
    "max_device_pinned_mem_size_kb": 33554432,
    "max_batch_io_timeout_msecs": 5,
    "max_batch_io_size": 128
  },
  "fs": {
    "generic": {
      "posix_pool_size": 1024,
      "posix_unaligned_writes": false
    },
    "lustre": {
      "mount_table": []
    },
    "nfs": {
      "mount_table": [
        { "mountpoint": "/data", "servers": ["nfs.example.com"] }
      ]
    }
  }
}

Save as /etc/cufile.json and set:

export CUFILE_ENV_PATH_JSON=/etc/cufile.json
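
Before mounting the file into pods, it can help to confirm that it parses and contains the sections you expect. This sketch uses only the standard json module and assumes the file is strict JSON, as in the snippet above (a file containing // comments would need them stripped first); cufile_summary is a hypothetical helper.

```python
import json


def cufile_summary(text: str) -> dict:
    """Parse cufile.json text and report a few fields worth eyeballing."""
    config = json.loads(text)  # raises json.JSONDecodeError on malformed input
    properties = config.get("properties", {})
    return {
        "max_direct_io_size_kb": properties.get("max_direct_io_size_kb"),
        "filesystems": sorted(config.get("fs", {})),
    }


if __name__ == "__main__":
    # Trimmed-down sample that follows the layout used in this recipe.
    sample = """
    {
      "properties": {"max_direct_io_size_kb": 16384},
      "fs": {"generic": {}, "lustre": {}, "nfs": {}}
    }
    """
    print(cufile_summary(sample))
```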

GDS requires:

  • nvidia-fs kernel module loaded (GPU Operator handles this with gds.enabled: true)
  • NFS server with localio support or NVMe storage
  • cuFile API in the application (PyTorch DataLoader doesn't use GDS by default; use DALI with GDS backend)

Verification

NCCL Transport Check

# Look for these lines in NCCL_DEBUG=INFO output:

# ✅ RDMA detected
# NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB net1:10.0.100.10<0>

# ✅ GPUDirect RDMA enabled
# NCCL INFO GPU Direct RDMA Enabled for HCA 0 (PCI 0000:ca:00.0)

# ✅ GDRCopy available
# NCCL INFO GDRCOPY : Enabled gdrcopy 2.5

# ❌ TCP fallback (bad: RDMA not working)
# NCCL INFO NET/Socket : Using [0]eth0:10.128.0.15<0>
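
Scanning logs by eye does not scale past a few pods. The sketch below classifies captured log lines using the marker strings quoted in this recipe; it assumes those substrings appear as shown, which may vary between NCCL releases, and classify_nccl_log is a hypothetical helper.

```python
def classify_nccl_log(lines) -> dict:
    """Summarize transport and GPUDirect RDMA status from NCCL_DEBUG=INFO lines."""
    status = {"transport": None, "gdr": False}
    for line in lines:
        if "NET/IB" in line and "Using" in line:
            status["transport"] = "IB"  # RDMA transport wins over any socket line
        elif "NET/Socket" in line and "Using" in line and status["transport"] is None:
            status["transport"] = "Socket"  # TCP fallback
        if "GPU Direct RDMA Enabled" in line:
            status["gdr"] = True
    return status


if __name__ == "__main__":
    # Sample lines copied from the check above.
    good = [
        "NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB net1:10.0.100.10<0>",
        "NCCL INFO GPU Direct RDMA Enabled for HCA 0 (PCI 0000:ca:00.0)",
    ]
    print(classify_nccl_log(good))
```

A result with transport set to Socket, or gdr left False, is a signal to stop and fix the configuration before starting a long training run.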

Bandwidth Verification

# NCCL all-reduce benchmark (inside the container)
cd /opt/nccl-tests
mpirun -np 16 --host node0:8,node1:8 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_SOCKET_IFNAME=net1 \
  -x NCCL_NET=IB \
  -x NCCL_NET_GDR_LEVEL=SYS \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

# Expected: ~380 GB/s bus bandwidth for 8×H100 + HDR200 IB
# If seeing <50 GB/s, RDMA is not active

GDS Verification

# Check nvidia-fs module
lsmod | grep nvidia_fs

# GDS stats
cat /proc/driver/nvidia-fs/stats
# Should show non-zero reads/writes after running GDS-enabled workload

# cuFile test
/usr/local/cuda/gds/tools/gdsio -f /data/testfile -d 0 -w 4 -s 1G -x 0 -I 1

graph TD
    subgraph Application Layer
        PT[PyTorch 2.10<br/>torchrun] --> NCCL[NCCL 2.28.8]
        PT --> DALI[DALI 1.52<br/>GDS DataLoader]
    end
    
    subgraph NCCL Transport
        NCCL --> |NCCL_NET=IB| IB[libibverbs<br/>rdmacore 56.0]
        NCCL --> |GDR| GDRCOPY[GDRCopy 2.5<br/>GPU memory access]
    end
    
    subgraph Network Stack
        IB --> UCX[OpenUCX 1.19]
        UCX --> HPCX[HPC-X 2.24<br/>SHARP / collective offload]
        HPCX --> SRIOV[SR-IOV VF<br/>net1 interface]
    end
    
    subgraph Storage Stack
        DALI --> CUFILE[cuFile 1.15<br/>GDS API]
        CUFILE --> NVFS[nvidia-fs<br/>kernel module]
        NVFS --> |DMA| NFS[NFS / NVMe]
    end
    
    subgraph Hardware
        SRIOV --> NIC[ConnectX-7<br/>RDMA]
        NIC --> |RoCEv2 DSCP 26| FABRIC[Network Fabric]
        GPU[H100 GPU] --> |PCIe/NVLink| NIC
        GPU --> |DMA-BUF| NVFS
    end

Common Issues

NCCL falls back to NET/Socket despite SR-IOV

Check NCCL_SOCKET_IFNAME=net1 matches the actual SR-IOV interface name. Verify with:

ip addr show net1
ibv_devinfo  # Should show mlx5_X device in PORT_ACTIVE state

"NCCL WARN IB : Unable to open GID index 3"

GID index varies by RoCE version and IP configuration:

show_gids  # List all GID table entries
# Use the index corresponding to RoCEv2 + IPv4

Set NCCL_IB_GID_INDEX to match.

GPUDirect RDMA not available: "GPU Direct RDMA Disabled"

Three requirements:

  1. Open kernel modules: useOpenKernelModules: true in GPU Operator
  2. IOMMU passthrough: iommu=pt kernel parameter
  3. nvidia-peermem module loaded: lsmod | grep nvidia_peermem

cuFile errors: "nvidia-fs module not loaded"

GDS requires the nvidia-fs kernel module:

lsmod | grep nvidia_fs
# If missing, check GPU Operator ClusterPolicy gds.enabled: true

OOM during NCCL initialization: "/dev/shm too small"

NCCL uses shared memory for intra-node communication. Size /dev/shm appropriately:

volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 64Gi  # 8 GPUs × 8GB per GPU

IPC_LOCK capability denied

RDMA memory registration requires IPC_LOCK:

securityContext:
  capabilities:
    add: ["IPC_LOCK"]

On OpenShift, this requires a custom SCC (not restricted-v2).

Best Practices

  • Always verify NCCL transport with NCCL_DEBUG=INFO before running production training: one misconfigured variable can silently fall back to TCP
  • Pin NCCL_SOCKET_IFNAME=net1 to the SR-IOV interface, because NCCL may auto-detect the wrong interface
  • Set NCCL_NET_GDR_LEVEL=SYS for multi-socket systems; restrictive levels cause fallback on cross-NUMA GPU-NIC pairs
  • Size /dev/shm at 8GB per GPU minimum, since NCCL shared memory buffers scale with GPU count
  • Use IPC_LOCK capability, not full privileged: it is the minimum privilege for RDMA memory pinning
  • IOMMU passthrough (iommu=pt) is non-negotiable for GDR and GDS
  • Run NCCL all-reduce benchmark first to establish baseline bandwidth before training
  • Pin container image versions (e.g., pytorch:25.11-py3), because NCCL behavior can change between releases
  • Match MOFED versions between host driver and container userspace; mismatches cause libibverbs errors
  • Use LeaderWorkerSet for multi-node training: it handles leader election, worker discovery, and group restart

Key Takeaways

  • NCCL 2.28.8 in PyTorch 25.11 supports RDMA, GDR, GDS, and SHARP collective offload out of the box
  • NCCL_SOCKET_IFNAME=net1 + NCCL_NET=IB forces RDMA over the SR-IOV interface
  • GPUDirect RDMA needs three things: open kernel modules, iommu=pt, and nvidia-peermem
  • GPUDirect Storage bypasses CPU and page cache for training data; it requires the nvidia-fs module and the cuFile API
  • SR-IOV deviceType: netdevice (not vfio-pci) is mandatory for RDMA verbs
  • IPC_LOCK capability is required for RDMA memory registration; use a custom SCC on OpenShift
  • Verify with NCCL_DEBUG=INFO: look for NET/IB (good) not NET/Socket (TCP fallback)
  • The complete stack: IOMMU → GPU Operator (open modules + GDS) → SR-IOV → PFC/ECN → NCCL env vars → torchrun