Dual-Fabric Mellanox: GPU InfiniBand + Storage Ethernet
Design and configure a dual-fabric network architecture with separate Mellanox NICs for GPU communication (InfiniBand) and storage traffic (Ethernet).
💡 Quick Answer: GPU clusters use separate physical fabrics: InfiniBand NICs for GPU-to-GPU NCCL traffic (highest bandwidth, lowest latency) and Ethernet NICs for storage (NFS/Ceph), management, and Pod networking. Never mix GPU RDMA and storage traffic on the same fabric, because congestion on one starves the other.
The Problem
A GPU node typically has multiple Mellanox ConnectX NICs serving different purposes:
- GPU training needs dedicated low-latency InfiniBand for NCCL all-reduce
- Storage (NFS, Lustre, GPFS) needs reliable high-throughput Ethernet or separate IB subnet
- Management/Pod networking needs standard Ethernet
- Mixing traffic on one fabric causes head-of-line blocking and NCCL timeouts
The Solution
Dual-Fabric Architecture
GPU Compute Node:
┌──────────────────────────────────────────────────────────────┐
│                           GPU Node                           │
│                                                              │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│  │  GPU 0   │   │  GPU 1   │   │  GPU 2   │   │  GPU 3   │   │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘   │
│       │ NVLink       │              │              │         │
│  ┌────┴──────────────┴──────────────┴──────────────┴─────┐   │
│  │                   NVLink / NVSwitch                   │   │
│  └────┬──────────────┬──────────────┬──────────────┬─────┘   │
│       │              │              │              │         │
│  ┌────┴─────┐   ┌────┴─────┐   ┌────┴─────┐   ┌────┴─────┐   │
│  │ConnectX-7│   │ConnectX-7│   │ConnectX-6│   │ConnectX-6│   │
│  │  IB HDR  │   │  IB HDR  │   │  25GbE   │   │  25GbE   │   │
│  │GPU Fabric│   │GPU Fabric│   │ Stor Fab │   │ Mgmt/Pod │   │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘   │
└───────┼──────────────┼──────────────┼──────────────┼─────────┘
        │              │              │              │
        ▼              ▼              ▼              ▼
┌──────────────┐ ┌───────────┐   ┌────────────────────────┐
│  IB Switch   │ │ IB Switch │   │      Ethernet SW       │
│ (GPU Fabric) │ │ (GPU Fab) │   │     (Storage+Mgmt)     │
│  Leaf/Spine  │ │           │   │                        │
└──────────────┘ └───────────┘   └────────────────────────┘

Physical NIC Assignment
NIC Assignment (typical 4-NIC GPU node):
──────────────────────────────────────────────────────────────────
NIC        Type            Fabric       Purpose
──────────────────────────────────────────────────────────────────
mlx5_0     ConnectX-7 IB   GPU Fabric   NCCL inter-node (GPUs 0-3)
mlx5_1     ConnectX-7 IB   GPU Fabric   NCCL inter-node (GPUs 4-7)
mlx5_2     ConnectX-6 Eth  Storage      NFS/Lustre/Ceph (RoCE or TCP)
mlx5_3     ConnectX-6 Eth  Management   Pod network, API, SSH

Alternative (6-NIC for large clusters):
──────────────────────────────────────────────────────────────────
mlx5_0-3   ConnectX-7 IB   GPU Fabric   4× NCCL (1 per GPU pair)
mlx5_4     ConnectX-6 Eth  Storage      NFS/GPFS
mlx5_5     ConnectX-6 Eth  Management   OVN/Calico Pod network

SR-IOV Policies Per Fabric
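The policies in this section select NICs by PCI device ID or PF name, so it can help to confirm those values on the node before writing them into YAML. The following is a minimal sketch rather than part of the original recipe: list_rdma_nics is a hypothetical helper that reads the standard sysfs layout under /sys/class/infiniband, and it assumes port 1 is the port of interest on every device.

```shell
#!/usr/bin/env bash
# list_rdma_nics is a hypothetical helper, not part of any Mellanox tooling.
# Prints one line per RDMA device:
#   <rdma-device> <vendor>:<device-id> <link-layer> <netdev>
list_rdma_nics() {
  local root="${1:-/sys/class/infiniband}"   # sysfs root, overridable for testing
  local dev name vendor device layer netdev
  for dev in "$root"/*; do
    [ -d "$dev" ] || continue                # skip when the glob matches nothing
    name="${dev##*/}"
    vendor="$(cat "$dev/device/vendor" 2>/dev/null || true)"
    device="$(cat "$dev/device/device" 2>/dev/null || true)"
    # Assumption: port 1 is the port in use on each device.
    layer="$(cat "$dev/ports/1/link_layer" 2>/dev/null || true)"
    # First netdev bound to the same PCI function (the PF name for pfNames).
    netdev="$(ls "$dev/device/net" 2>/dev/null | head -n1 || true)"
    printf '%s %s:%s %s %s\n' "$name" "${vendor#0x}" "${device#0x}" \
      "${layer:-unknown}" "${netdev:-none}"
  done
}
```

Running list_rdma_nics with no arguments on a GPU node prints one line per device. The second column supplies the vendor and deviceID values for nicSelector, and the last column supplies candidate pfNames entries.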
# Policy 1: GPU fabric (InfiniBand), for NCCL RDMA
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: gpu-fabric-ib
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  numVfs: 8
  priority: 98
  resourceName: gpu-rdma
  deviceType: netdevice
  linkType: ib              # InfiniBand link layer (the default is eth)
  isRdma: true
  nicSelector:
    vendor: "15b3"
    deviceID: "1021"        # ConnectX-7; confirm on the node with lspci -nn
    # Or by PF name:
    # pfNames:
    #   - "ibp65s0f0"
    #   - "ibp65s0f1"
---
# Policy 2: Storage fabric (Ethernet), for NFS/Ceph
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: storage-fabric-eth
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  numVfs: 4
  priority: 99
  resourceName: storage-net
  deviceType: netdevice
  isRdma: false             # No RDMA needed for NFS over TCP
  nicSelector:
    vendor: "15b3"
    deviceID: "101f"        # ConnectX-6 Lx (Ethernet); confirm with lspci -nn
    # Or by PF name:
    # pfNames:
    #   - "ens3f0np0"

SriovNetwork Definitions Per Fabric
# GPU RDMA network (InfiniBand): IB VFs are attached with the SriovIBNetwork kind
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: gpu-rdma-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: ai-training
  resourceName: gpu-rdma
  ipam: |
    {
      "type": "whereabouts",
      "range": "10.0.100.0/24"
    }
---
# Storage network (Ethernet)
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: storage-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: ai-training
  resourceName: storage-net
  ipam: |
    {
      "type": "whereabouts",
      "range": "10.0.200.0/24"
    }

Pod with Dual-Fabric Attachment
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "gpu-rdma-network", "interface": "rdma0"},
        {"name": "gpu-rdma-network", "interface": "rdma1"},
        {"name": "storage-network", "interface": "stor0"}
      ]
spec:
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:24.07-py3
      env:
        # NCCL: use ONLY GPU fabric NICs.
        # RDMA device names seen inside the Pod can differ from the host's
        # PF names; confirm them with ibv_devices before hard-coding.
        - name: NCCL_IB_HCA
          value: "mlx5_0,mlx5_1"   # GPU fabric only
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
        - name: NCCL_CROSS_NIC
          value: "1"
        # The allow-list above already keeps NCCL off the storage NIC.
        # Do not set NCCL_IB_DISABLE to a device name: it is a 0/1 switch
        # that turns off the whole IB transport, not a device list.
        # Socket interface for NCCL bootstrap (uses management network)
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"            # Pod default interface
      volumeMounts:
        - name: training-data
          mountPath: /data
        - name: checkpoints
          mountPath: /checkpoints
      resources:
        limits:                    # extended resources must be set under limits
          nvidia.com/gpu: "8"
          openshift.io/gpu-rdma: "2"      # 2 IB VFs for NCCL
          openshift.io/storage-net: "1"   # 1 Eth VF for storage
  volumes:
    - name: training-data
      nfs:                         # NFS goes over storage fabric
        server: nfs.storage.example.com
        path: /datasets
    - name: checkpoints
      persistentVolumeClaim:
        claimName: checkpoint-pvc  # Also on storage fabric

NCCL NIC Binding (Prevent Fabric Crosstalk)
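Before hard-coding device names in NCCL_IB_HCA, it may be worth deriving the allow-list from the node itself. The sketch below is illustrative only: nccl_ib_hca_from_sysfs is a hypothetical helper, it assumes every device with an InfiniBand link layer on port 1 belongs to the GPU fabric (which does not hold if storage also runs on a separate IB subnet), and it assumes the device names visible where it runs are the ones NCCL will see.

```shell
#!/usr/bin/env bash
# nccl_ib_hca_from_sysfs is a hypothetical helper for illustration.
# It emits an NCCL_IB_HCA value listing port 1 of every RDMA device whose
# link layer is InfiniBand, so Ethernet (storage/management) NICs are left out.
nccl_ib_hca_from_sysfs() {
  local root="${1:-/sys/class/infiniband}"   # sysfs root, overridable for testing
  local dev list=""
  for dev in "$root"/*; do
    [ -r "$dev/ports/1/link_layer" ] || continue
    if [ "$(cat "$dev/ports/1/link_layer")" = "InfiniBand" ]; then
      # Append "<device>:1", comma-separated.
      list="${list:+$list,}${dev##*/}:1"
    fi
  done
  if [ -n "$list" ]; then
    # A leading "=" asks NCCL for exact name matches instead of prefix matches.
    printf '=%s\n' "$list"
  fi
}
```

A launcher script could then run export NCCL_IB_HCA="$(nccl_ib_hca_from_sysfs)" and log the resulting value so the binding is visible in the job output.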
# Critical: tell NCCL exactly which NICs to use.
# Otherwise NCCL may pick storage NICs and congest that fabric.

# Option 1: Allow-list the GPU fabric NICs (prefix match on device name)
export NCCL_IB_HCA="mlx5_0,mlx5_1"

# Option 2: Exclude the storage/management NICs with a leading ^
# (NCCL_IB_DISABLE is a 0/1 switch for the whole IB transport, not a device list)
export NCCL_IB_HCA="^mlx5_2,mlx5_3"

# Option 3: Exact device and port match with a leading = (most precise)
# NCCL_IB_HCA matches RDMA device names, not PCI bus IDs.
export NCCL_IB_HCA="=mlx5_0:1,mlx5_1:1"   # device:port

# Bootstrap socket (TCP control plane): use the management network
export NCCL_SOCKET_IFNAME="eth0"          # NOT the IB interface

InfiniBand vs Ethernet: When to Use Each
Traffic Type       Protocol          Fabric        Why
──────────────────────────────────────────────────────────────────────────────────
GPU NCCL           IB Verbs/RDMA     InfiniBand    Lowest latency, highest BW
                                                   No TCP overhead
                                                   GPU-Direct RDMA capable
Storage (NFS)      TCP or NFS/RDMA   Ethernet      Commodity switches
                                     or IB         TCP works fine for sequential I/O
                                                   RoCE if ultra-low latency needed
Storage (Lustre)   LNET over IB      InfiniBand    Native IB support
                   or TCP/Ethernet   or Ethernet   Depends on cluster size
Storage (Ceph)     TCP/msgr2         Ethernet      Ceph doesn't need RDMA
                                                   Standard 25GbE sufficient
Management         TCP               Ethernet      API server, SSH, monitoring
Pod Network        OVN/Calico        Ethernet      Standard container networking
NCCL Bootstrap     TCP               Ethernet      Initial rank discovery only
                                                   Low bandwidth, use mgmt net

Storage over RoCE (Ethernet RDMA)
# If storage needs RDMA (NFS over RDMA, NVMe-oF):
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: storage-roce
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu-worker: ""
  numVfs: 4
  priority: 99
  resourceName: storage-roce
  deviceType: netdevice
  isRdma: true                # Enable RDMA for RoCE
  nicSelector:
    pfNames:
      - "ens3f0np0"           # Storage Ethernet NIC

# RoCE requires proper flow control (PFC) on the Ethernet switch.
# Without PFC, RoCE performance degrades under congestion.

# Verify RoCE is working:
ibv_devinfo -d mlx5_2
# transport reads InfiniBand for both native IB and RoCE devices;
# link_layer: Ethernet is what confirms RoCE mode

# Test RoCE bandwidth:
ib_write_bw -d mlx5_2 --rdma_cm                # Server
ib_write_bw -d mlx5_2 --rdma_cm <server-ip>    # Client

Network Separation at Switch Level
Physical Switch Topology:
──────────────────────────────────────────────────────────────────
GPU Fabric (InfiniBand):
┌─────────────────────────────────────────┐
│  IB Leaf Switch 1     IB Leaf Switch 2  │
│  (HDR 200Gb/s)        (HDR 200Gb/s)     │
│   │ │ │ │              │ │ │ │          │
│   ▼ ▼ ▼ ▼              ▼ ▼ ▼ ▼          │
│  Node1 Node2          Node3 Node4       │
│  GPU NICs             GPU NICs          │
└─────────────┬───────────────────┬───────┘
              │     IB Spine      │
              └───────────────────┘
Storage Fabric (Ethernet):
┌─────────────────────────────────────────┐
│  Eth Switch 1 (25/100GbE)               │
│   │     │     │     │                   │
│   ▼     ▼     ▼     ▼                   │
│  Node1 Node2 Node3 Node4                │
│  Stor NICs                              │
│   │                                     │
│   ▼                                     │
│  NFS Server / Ceph OSD / Lustre MDS     │
└─────────────────────────────────────────┘
Management (Ethernet):
┌─────────────────────────────────────────┐
│  Mgmt Switch (10/25GbE)                 │
│   │     │     │     │                   │
│   ▼     ▼     ▼     ▼                   │
│  All Nodes (BMC + OS mgmt)              │
│  API Server, Monitoring, SSH            │
└─────────────────────────────────────────┘
Rules:
• GPU fabric: ONLY NCCL/MPI traffic. No storage. No management.
• Storage fabric: ONLY storage I/O. No GPU training traffic.
• Management: Everything else (API, SSH, monitoring, Pod network)
• NEVER cross-connect fabrics at switch level

Verifying Fabric Separation
# Confirm which NIC handles which traffic:

# GPU fabric: RDMA counters should climb during training
rdma statistic show link mlx5_0/1
# rx_write_requests should increase steadily during training

# Storage fabric: TCP/NFS traffic should appear during data load
ethtool -S ens3f0np0 | grep -E "rx_bytes|tx_bytes"

# Check that no NCCL traffic reaches the storage NIC
rdma statistic show link mlx5_2/1
# rx_write_requests should stay flat if NCCL uses only mlx5_0 and mlx5_1

# Monitor during training:
watch -n1 "rdma statistic show link mlx5_0/1 | grep write; echo '---'; \
  rdma statistic show link mlx5_2/1 | grep write"

Common Issues
NCCL uses storage NIC, congests NFS
- Cause: NCCL_IB_HCA not set; NCCL auto-discovers all Mellanox NICs
- Fix: Explicitly set NCCL_IB_HCA=mlx5_0,mlx5_1 (GPU fabric only)
NFS timeouts during training
- Cause: NCCL traffic leaking to storage fabric, or storage NIC saturated
- Fix: Verify fabric separation; add dedicated NFS NIC; check switch PFC config
InfiniBand port down on GPU fabric
- Cause: Cable issue, switch port config, or subnet manager not running
- Fix: Run ibstat to check port state; verify OpenSM or UFM is managing the IB fabric
RoCE storage drops under GPU training load
- Cause: ECN/PFC not configured on Ethernet switch; RoCE needs lossless Ethernet
- Fix: Configure PFC (Priority Flow Control) on storage switch ports
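For the port-down case, the same state that ibstat reports is also exposed in sysfs, which makes it easy to script a pre-flight check. This is a minimal sketch under stated assumptions: ib_ports_not_active is a hypothetical helper, and it relies on the kernel's state file reading "4: ACTIVE" for a healthy port. A port stuck in "2: INIT" commonly means the link is up but no subnet manager has configured it yet.

```shell
#!/usr/bin/env bash
# ib_ports_not_active is a hypothetical helper for illustration.
# Prints every RDMA port whose state is not ACTIVE and returns 1 if any
# such port exists, 0 otherwise.
ib_ports_not_active() {
  local root="${1:-/sys/class/infiniband}"   # sysfs root, overridable for testing
  local state_file state dev port rc=0
  for state_file in "$root"/*/ports/*/state; do
    [ -r "$state_file" ] || continue         # skip when the glob matches nothing
    state="$(cat "$state_file")"
    if [ "$state" != "4: ACTIVE" ]; then
      port="${state_file%/state}"; port="${port##*/}"   # .../ports/<port>
      dev="${state_file%/ports/*}"; dev="${dev##*/}"    # .../<device>
      printf '%s port %s is %s\n' "$dev" "$port" "$state"
      rc=1
    fi
  done
  return "$rc"
}
```

A job launcher could call ib_ports_not_active and refuse to start training when it returns non-zero, which surfaces cabling or subnet manager problems before NCCL times out.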
Best Practices
- Physical separation: different switches for GPU and storage fabrics
- Explicit NCCL NIC binding: always set NCCL_IB_HCA to GPU fabric NICs
- InfiniBand for GPU, Ethernet for storage: unless storage is Lustre (native IB)
- Separate SR-IOV policies per fabric: different resourceNames
- PFC for RoCE: if storage uses Ethernet RDMA, configure lossless Ethernet
- Monitor per-NIC: alert if RDMA traffic appears on storage NICs
- Document the cable map: which port on which switch for each NIC
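The per-NIC monitoring practice can be approximated with a small counter-delta check. The sketch below is an assumption-laden illustration: rdma_write_delta is a hypothetical helper, it reads the mlx5 rx_write_requests hardware counter from sysfs for port 1, and it is only meaningful as an alert on a storage NIC that carries TCP traffic (a RoCE storage NIC legitimately receives RDMA writes).

```shell
#!/usr/bin/env bash
# rdma_write_delta is a hypothetical helper for illustration.
# Prints how much rx_write_requests grew on <device> port 1 over <interval>
# seconds. Returns 2 if the counter file is not readable.
rdma_write_delta() {
  local dev="$1" interval="${2:-5}" root="${3:-/sys/class/infiniband}"
  local counter="$root/$dev/ports/1/hw_counters/rx_write_requests"
  local before after
  if [ ! -r "$counter" ]; then
    echo "no rx_write_requests counter for $dev" >&2
    return 2
  fi
  before="$(cat "$counter")"
  sleep "$interval"
  after="$(cat "$counter")"
  echo $(( after - before ))
}
```

For example, a cron job or sidecar could run rdma_write_delta mlx5_2 5 and raise an alert whenever the printed value is greater than zero on a TCP-only storage NIC.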
Key Takeaways
- GPU clusters need physically separate fabrics: IB for NCCL, Ethernet for storage
- Never let NCCL auto-discover NICs: explicitly bind with NCCL_IB_HCA
- InfiniBand = lowest latency + GPU-Direct RDMA for training traffic
- Ethernet = commodity, cost-effective, sufficient for NFS/Ceph sequential I/O
- SR-IOV policies should be per-fabric (separate resourceNames)
- RoCE (Ethernet RDMA) needs PFC: without it, performance collapses under congestion
- Physical switch separation prevents one fabric's congestion from affecting the other
- NCCL_SOCKET_IFNAME=eth0: bootstrap over management, not GPU fabric

