Networking advanced ⏱ 15 minutes K8s 1.28+

Dual-Fabric Mellanox: GPU InfiniBand + Storage Ethernet

Design and configure a dual-fabric network architecture with separate Mellanox NICs for GPU communication (InfiniBand) and storage traffic (Ethernet).

By Luca Berton • 📖 9 min read

💡 Quick Answer: GPU clusters use separate physical fabrics: InfiniBand NICs for GPU-to-GPU NCCL traffic (highest bandwidth, lowest latency) and Ethernet NICs for storage (NFS/Ceph), management, and Pod networking. Never mix GPU RDMA and storage traffic on the same fabric, because congestion from one starves the other.

The Problem

A GPU node typically has multiple Mellanox ConnectX NICs serving different purposes:

  • GPU training needs dedicated low-latency InfiniBand for NCCL all-reduce
  • Storage (NFS, Lustre, GPFS) needs reliable high-throughput Ethernet or separate IB subnet
  • Management/Pod networking needs standard Ethernet
  • Mixing traffic on one fabric causes head-of-line blocking and NCCL timeouts

The Solution

Dual-Fabric Architecture

GPU Compute Node:
──────────────────────────────────────────────────────────────────

  ┌───────────────────────────────────────────────────────────┐
  │                         GPU Node                          │
  │                                                           │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
  │  │  GPU 0   │  │  GPU 1   │  │  GPU 2   │  │  GPU 3   │   │
  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
  │       │ NVLink      │             │             │         │
  │  ┌────┴─────────────┴─────────────┴─────────────┴─────┐   │
  │  │                 NVLink / NVSwitch                  │   │
  │  └────┬─────────────┬─────────────┬─────────────┬─────┘   │
  │       │             │             │             │         │
  │  ┌────┴─────┐  ┌────┴─────┐  ┌────┴─────┐  ┌────┴─────┐   │
  │  │ConnectX-7│  │ConnectX-7│  │ConnectX-6│  │ConnectX-6│   │
  │  │  IB HDR  │  │  IB HDR  │  │  25GbE   │  │  25GbE   │   │
  │  │GPU Fabric│  │GPU Fabric│  │ Stor Fab │  │ Mgmt/Pod │   │
  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
  └───────┼─────────────┼─────────────┼─────────────┼─────────┘
          │             │             │             │
          ▼             ▼             ▼             ▼
  ┌──────────────┐  ┌──────────┐  ┌───────────────────────────┐
  │ IB Switch    │  │ IB Switch│  │ Ethernet SW               │
  │ (GPU Fabric) │  │(GPU Fab) │  │ (Storage+Mgmt)            │
  │ Leaf/Spine   │  │          │  │                           │
  └──────────────┘  └──────────┘  └───────────────────────────┘

Physical NIC Assignment

NIC Assignment (typical 4-NIC GPU node):
──────────────────────────────────────────────────────────────────
NIC        Type              Fabric         Purpose
──────────────────────────────────────────────────────────────────
mlx5_0     ConnectX-7 IB    GPU Fabric     NCCL inter-node (GPUs 0-3)
mlx5_1     ConnectX-7 IB    GPU Fabric     NCCL inter-node (GPUs 4-7)
mlx5_2     ConnectX-6 Eth   Storage        NFS/Lustre/Ceph (RoCE or TCP)
mlx5_3     ConnectX-6 Eth   Management     Pod network, API, SSH

Alternative (6-NIC for large clusters):
──────────────────────────────────────────────────────────────────
mlx5_0-3   ConnectX-7 IB    GPU Fabric     4× NCCL (1 per GPU pair)
mlx5_4     ConnectX-6 Eth   Storage        NFS/GPFS
mlx5_5     ConnectX-6 Eth   Management     OVN/Calico Pod network
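
Before committing a NIC table like the one above to documentation, it helps to confirm which mlx5_N device actually sits on which fabric. The sketch below reads the kernel's sysfs tree, assuming the usual /sys/class/infiniband/<device>/ports/<port>/link_layer layout. The function name list_rdma_link_layers and its optional root argument are illustrative, not part of any tool.

```shell
#!/bin/sh
# Print "<device> port <n>: <link layer>" for every RDMA device port.
# The optional first argument overrides the sysfs root (handy for dry runs).
list_rdma_link_layers() (
  root="${1:-/sys/class/infiniband}"
  for f in "$root"/*/ports/*/link_layer; do
    [ -r "$f" ] || continue                 # glob matched nothing
    port_dir=$(dirname "$f")                # <root>/<dev>/ports/<port>
    port=$(basename "$port_dir")
    dev=$(basename "$(dirname "$(dirname "$port_dir")")")
    printf '%s port %s: %s\n' "$dev" "$port" "$(cat "$f")"
  done
)

# Usage on a GPU node:
#   list_rdma_link_layers
```

Devices reporting InfiniBand belong in the GPU fabric rows; devices reporting Ethernet are candidates for the storage or management rows.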

SR-IOV Policies Per Fabric

# Policy 1: GPU Fabric (InfiniBand), for NCCL RDMA
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: gpu-fabric-ib
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  numVfs: 8
  priority: 98
  resourceName: gpu-rdma
  deviceType: netdevice
  linkType: ib                # PFs run in InfiniBand mode (the default is eth)
  isRdma: true
  nicSelector:
    vendor: "15b3"
    deviceID: "1021"          # ConnectX-7 (confirm with: lspci -nn | grep -i mellanox)
    # Or by PF name:
    # pfNames:
    #   - "ibp65s0f0"
    #   - "ibp65s0f1"
---
# Policy 2: Storage Fabric (Ethernet), for NFS/Ceph
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: storage-fabric-eth
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  numVfs: 4
  priority: 99
  resourceName: storage-net
  deviceType: netdevice
  isRdma: false              # No RDMA needed for NFS over TCP
  nicSelector:
    vendor: "15b3"
    deviceID: "101f"          # ConnectX-6 Lx (confirm with: lspci -nn | grep -i mellanox)
    # Or by PF name:
    # pfNames:
    #   - "ens3f0np0"

SriovNetwork Definitions Per Fabric

# GPU RDMA network (InfiniBand)
# InfiniBand VFs are attached through the SriovIBNetwork kind, not SriovNetwork.
# RDMA itself is enabled by isRdma: true in the node policy.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: gpu-rdma-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: ai-training
  resourceName: gpu-rdma
  ipam: |
    {
      "type": "whereabouts",
      "range": "10.0.100.0/24"
    }
---
# Storage network (Ethernet)
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: storage-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: ai-training
  resourceName: storage-net
  ipam: |
    {
      "type": "whereabouts",
      "range": "10.0.200.0/24"
    }
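
Each whereabouts range above is a /24, which caps how many secondary interfaces the fabric can hand out cluster-wide. A rough sizing sketch follows, assuming the network and broadcast addresses are not allocated and that prefixes are /30 or shorter. The helper names usable_hosts and max_pods are hypothetical.

```shell
#!/bin/sh
# Usable IPv4 host addresses in a prefix, excluding network and broadcast.
usable_hosts() { echo $(( (1 << (32 - $1)) - 2 )); }

# Upper bound on Pods a range can serve, given interfaces per Pod on that range.
max_pods() { echo $(( $(usable_hosts "$1") / $2 )); }

# Usage:
#   max_pods 24 2   # GPU fabric: two RDMA interfaces per Pod
#   max_pods 24 1   # storage fabric: one interface per Pod
```

With two RDMA interfaces per Pod, 10.0.100.0/24 covers at most 127 training Pods. A larger cluster would need a wider range than the ones shown here.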

Pod with Dual-Fabric Attachment

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "gpu-rdma-network", "interface": "rdma0"},
        {"name": "gpu-rdma-network", "interface": "rdma1"},
        {"name": "storage-network", "interface": "stor0"}
      ]
spec:
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:24.07-py3
      env:
        # NCCL: use ONLY GPU fabric NICs
        - name: NCCL_IB_HCA
          value: "mlx5_0,mlx5_1"    # GPU fabric only
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
        - name: NCCL_CROSS_NIC
          value: "1"
        # The storage NIC (mlx5_2) is excluded simply by not listing it in
        # NCCL_IB_HCA. NCCL_IB_DISABLE is a 0/1 switch that turns off the
        # whole IB/RoCE transport, so it must not be set to a device name.
        # Socket interface for NCCL bootstrap (uses management network)
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"             # Pod default interface
      volumeMounts:
        - name: training-data
          mountPath: /data
        - name: checkpoints
          mountPath: /checkpoints
      resources:
        limits:                            # Extended resources must be set under limits
          nvidia.com/gpu: "8"
          openshift.io/gpu-rdma: "2"      # 2 IB VFs for NCCL
          openshift.io/storage-net: "1"    # 1 Eth VF for storage
  volumes:
    - name: training-data
      nfs:                                 # Mounted by the kubelet: the node's route to the server must use the storage NIC
        server: nfs.storage.example.com
        path: /datasets
    - name: checkpoints
      persistentVolumeClaim:
        claimName: checkpoint-pvc          # Also on storage fabric
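
Once the Pod is running, a quick sanity check is to confirm that the interfaces named in the annotation (rdma0, rdma1, stor0) exist next to the default eth0. This is a minimal sketch, assuming the VFs appear as network devices under /sys/class/net inside the container. The function name check_ifaces is illustrative.

```shell
#!/bin/sh
# Return 0 only if every named interface exists under the given net directory.
# Missing interfaces are reported on stderr.
check_ifaces() (
  root="$1"; shift
  missing=0
  for ifc in "$@"; do
    if [ ! -e "$root/$ifc" ]; then
      echo "missing interface: $ifc" >&2
      missing=1
    fi
  done
  exit "$missing"
)

# Usage inside the training container:
#   check_ifaces /sys/class/net eth0 rdma0 rdma1 stor0 && echo "all fabrics attached"
```

A missing rdmaN or stor0 usually points back at the SR-IOV resource counts in the Pod spec or at the network attachment names in the annotation.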

NCCL NIC Binding (Prevent Fabric Crosstalk)

# Critical: tell NCCL exactly which NICs to use
# Otherwise NCCL may pick storage NICs and congest that fabric

# Option 1: Whitelist GPU fabric NICs
export NCCL_IB_HCA="mlx5_0,mlx5_1"

# Option 2: Exclude storage/management NICs (a leading ^ negates the list)
export NCCL_IB_HCA="^mlx5_2,mlx5_3"
# Note: NCCL_IB_DISABLE is a 0/1 switch that disables the IB/RoCE transport
# entirely; it does not accept device names.

# Option 3: Exact match (most precise)
export NCCL_IB_HCA="=mlx5_0,mlx5_1"     # without "=", names are prefix-matched

# To pin a specific port on each device:
export NCCL_IB_HCA="mlx5_0:1,mlx5_1:1"  # device:port

# Bootstrap socket (TCP control plane): use management network
export NCCL_SOCKET_IFNAME="eth0"          # NOT the IB interface
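
Hard-coding device names works until a node enumerates its NICs in a different order. One alternative, sketched here under the assumption that every InfiniBand-mode device on the node belongs to the GPU fabric, is to derive the list from sysfs at container start. The function name ib_hca_list is hypothetical, and device names seen inside a Pod with VFs may differ from those on the host.

```shell
#!/bin/sh
# Print an exact-match NCCL_IB_HCA value ("=dev1,dev2") containing only
# devices that have at least one port whose link layer is InfiniBand.
# Returns non-zero and prints nothing when no such device exists.
ib_hca_list() (
  root="${1:-/sys/class/infiniband}"
  list=""
  for d in "$root"/*; do
    [ -d "$d/ports" ] || continue
    if grep -qx "InfiniBand" "$d"/ports/*/link_layer 2>/dev/null; then
      list="${list:+$list,}$(basename "$d")"
    fi
  done
  [ -n "$list" ] && printf '=%s\n' "$list"
)

# Usage in an entrypoint script:
#   NCCL_IB_HCA=$(ib_hca_list) && export NCCL_IB_HCA
```

Ethernet-mode devices (including RoCE storage NICs) are skipped, so they never reach NCCL. If storage runs on a separate InfiniBand subnet instead, this heuristic no longer separates the fabrics and an explicit list is the safer choice.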

InfiniBand vs Ethernet: When to Use Each

Traffic Type       Protocol          Fabric          Why
──────────────────────────────────────────────────────────────────
GPU NCCL           IB Verbs/RDMA     InfiniBand      Lowest latency, highest BW
                                                      No TCP overhead
                                                      GPU-Direct RDMA capable

Storage (NFS)      TCP or NFS/RDMA   Ethernet        Commodity switches
                                     or IB            TCP works fine for sequential I/O
                                                      RoCE if ultra-low latency needed

Storage (Lustre)   LNET over IB      InfiniBand      Native IB support
                   or TCP/Ethernet   or Ethernet     Depends on cluster size

Storage (Ceph)     TCP/msgr2         Ethernet        Ceph doesn't need RDMA
                                                      Standard 25GbE sufficient

Management         TCP               Ethernet        API server, SSH, monitoring
Pod Network        OVN/Calico        Ethernet        Standard container networking

NCCL Bootstrap     TCP               Ethernet        Initial rank discovery only
                                                      Low bandwidth, use mgmt net

Storage over RoCE (Ethernet RDMA)

# If storage needs RDMA (NFS over RDMA, NVMe-oF):
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: storage-roce
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  numVfs: 4
  priority: 99
  resourceName: storage-roce
  deviceType: netdevice
  isRdma: true               # Enable RDMA for RoCE
  nicSelector:
    vendor: "15b3"
    pfNames:
      - "ens3f0np0"          # Storage Ethernet NIC
# RoCE requires proper flow control (PFC) on the Ethernet switch
# Without PFC, RoCE performance degrades under congestion

# Verify RoCE is working:
ibv_devinfo -d mlx5_2
# transport reads InfiniBand for both native IB and RoCE devices;
# link_layer: Ethernet is what confirms RoCE mode

# Test RoCE bandwidth:
ib_write_bw -d mlx5_2 --rdma_cm    # Server
ib_write_bw -d mlx5_2 --rdma_cm <server-ip>  # Client

Network Separation at Switch Level

Physical Switch Topology:
──────────────────────────────────────────────────────────────────

GPU Fabric (InfiniBand):
  ┌──────────────────────────────────────────┐
  │  IB Leaf Switch 1       IB Leaf Switch 2 │
  │  (HDR 200Gb/s)          (HDR 200Gb/s)    │
  │     │ │ │ │                │ │ │ │       │
  │     ▼ ▼ ▼ ▼                ▼ ▼ ▼ ▼       │
  │  Node1  Node2           Node3  Node4     │
  │  GPU NICs               GPU NICs         │
  └─────────────┬────────────────────┬───────┘
                │      IB Spine      │
                └────────────────────┘

Storage Fabric (Ethernet):
  ┌──────────────────────────────────────────┐
  │  Eth Switch 1 (25/100GbE)                │
  │     │ │ │ │                              │
  │     ▼ ▼ ▼ ▼                              │
  │  Node1  Node2  Node3  Node4              │
  │  Stor NICs                               │
  │         │                                │
  │         ▼                                │
  │  NFS Server / Ceph OSD / Lustre MDS      │
  └──────────────────────────────────────────┘

Management (Ethernet):
  ┌──────────────────────────────────────────┐
  │  Mgmt Switch (10/25GbE)                  │
  │     │ │ │ │                              │
  │     ▼ ▼ ▼ ▼                              │
  │  All Nodes (BMC + OS mgmt)               │
  │  API Server, Monitoring, SSH             │
  └──────────────────────────────────────────┘

Rules:
  • GPU fabric: ONLY NCCL/MPI traffic. No storage. No management.
  • Storage fabric: ONLY storage I/O. No GPU training traffic.
  • Management: Everything else (API, SSH, monitoring, Pod network)
  • NEVER cross-connect fabrics at switch level

Verifying Fabric Separation

# Confirm which NIC handles which traffic:

# GPU fabric: RDMA counters should climb during training
rdma statistic show link mlx5_0/1
# rx_write_requests should be high during training

# Storage fabric: should see TCP/NFS traffic during data load
ethtool -S ens3f0np0 | grep -E "rx_bytes|tx_bytes"

# Check no NCCL traffic on storage NIC (RDMA counters should stay flat)
rdma statistic show link mlx5_2/1
# rx_write_requests should be 0 if NCCL correctly uses mlx5_0 and mlx5_1 only

# Monitor during training:
watch -n1 "rdma statistic show link mlx5_0/1 | grep write; echo '---'; \
           rdma statistic show link mlx5_2/1 | grep write"
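
The same check can be scripted so that it feeds an alert instead of a terminal. The sketch below samples one hardware counter twice and prints the difference, assuming the mlx5 driver exposes it under /sys/class/infiniband/<device>/ports/<port>/hw_counters/. The function names and the optional root argument are illustrative, and some counters may be absent on VFs.

```shell
#!/bin/sh
# Read one hardware counter for a device/port; prints 0 if it is not exposed.
# Args: device port counter [sysfs-root]
read_hw_counter() (
  root="${4:-/sys/class/infiniband}"
  f="$root/$1/ports/$2/hw_counters/$3"
  if [ -r "$f" ]; then cat "$f"; else echo 0; fi
)

# Print how much a counter grew over an interval.
# Args: device port counter seconds [sysfs-root]
counter_delta() (
  before=$(read_hw_counter "$1" "$2" "$3" "$5")
  sleep "$4"
  after=$(read_hw_counter "$1" "$2" "$3" "$5")
  echo $(( $after - $before ))
)

# Usage: warn if RDMA writes land on the storage NIC during a 10 s window
#   [ "$(counter_delta mlx5_2 1 rx_write_requests 10)" -eq 0 ] || echo "RDMA on storage NIC"
```

A non-zero delta on mlx5_2 while a job runs suggests NCCL is not honouring the intended NIC binding.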

Common Issues

NCCL uses storage NIC, congests NFS

  • Cause: NCCL_IB_HCA not set; NCCL auto-discovers all Mellanox NICs
  • Fix: Explicitly set NCCL_IB_HCA=mlx5_0,mlx5_1 (GPU fabric only)

NFS timeouts during training

  • Cause: NCCL traffic leaking to storage fabric, or storage NIC saturated
  • Fix: Verify fabric separation; add dedicated NFS NIC; check switch PFC config

InfiniBand port down on GPU fabric

  • Cause: Cable issue, switch port config, or subnet manager not running
  • Fix: ibstat to check port state; verify OpenSM or UFM is managing the IB fabric

RoCE storage drops under GPU training load

  • Cause: ECN/PFC not configured on Ethernet switch; RoCE needs lossless Ethernet
  • Fix: Configure PFC (Priority Flow Control) on storage switch ports

Best Practices

  1. Physical separation: different switches for GPU and storage fabrics
  2. Explicit NCCL NIC binding: always set NCCL_IB_HCA to GPU fabric NICs
  3. InfiniBand for GPU, Ethernet for storage: unless storage is Lustre (native IB)
  4. Separate SR-IOV policies per fabric: different resourceNames
  5. PFC for RoCE: if storage uses Ethernet RDMA, configure lossless Ethernet
  6. Monitor per-NIC: alert if RDMA traffic appears on storage NICs
  7. Document the cable map: which port on which switch for each NIC

Key Takeaways

  • GPU clusters need physically separate fabrics: IB for NCCL, Ethernet for storage
  • Never let NCCL auto-discover NICs β€” explicitly bind with NCCL_IB_HCA
  • InfiniBand = lowest latency + GPU-Direct RDMA for training traffic
  • Ethernet = commodity, cost-effective, sufficient for NFS/Ceph sequential I/O
  • SR-IOV policies should be per-fabric (separate resourceNames)
  • RoCE (Ethernet RDMA) needs PFC β€” without it, performance collapses under congestion
  • Physical switch separation prevents one fabric’s congestion from affecting the other
  • NCCL_SOCKET_IFNAME=eth0 β€” bootstrap over management, not GPU fabric