πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
Networking advanced ⏱ 15 minutes K8s 1.28+

InfiniBand vs Ethernet for AI on Kubernetes

Compare InfiniBand and Ethernet networking for GPU AI workloads on Kubernetes, including RDMA, RoCE, latency, and throughput considerations.

By Luca Berton β€’ β€’ πŸ“– 6 min read

πŸ’‘ Quick Answer: InfiniBand delivers 1-2ΞΌs latency and 400+ Gb/s RDMA with zero CPU overhead β€” ideal for multi-node distributed training. RoCE (RDMA over Converged Ethernet) provides RDMA over Ethernet infrastructure but requires PFC/ECN configuration. Pure Ethernet (TCP/IP) adds 10-30% training overhead vs RDMA.

The Problem

Multi-node GPU training requires all-reduce operations across nodes, where gradient synchronization speed directly impacts training throughput. Network choice β€” InfiniBand, RoCE Ethernet, or standard TCP Ethernet β€” can mean a 30% performance difference. Choosing wrong wastes GPU hours and money.

The Solution

Understand the tradeoffs between InfiniBand, RoCE, and TCP Ethernet for NCCL communication in AI workloads. Each has different requirements for Kubernetes configuration, switch infrastructure, and driver setup.

Network Technology Comparison

InfiniBand_NDR:
  speed: "400 Gb/s per port"
  latency: "0.5-1 ΞΌs"
  rdma: "Native RDMA (zero-copy, kernel bypass)"
  cpu_overhead: "Near zero"
  requires:
    - "InfiniBand switches (Quantum-2)"
    - "ConnectX-7 HCAs (or newer)"
    - "MOFED/DOCA drivers"
    - "Subnet Manager (OpenSM or UFM)"
  cost: "High (dedicated IB fabric)"
  best_for: "Large-scale training (64+ GPUs)"
  nccl_env:
    NCCL_IB_DISABLE: "0"
    NCCL_IB_HCA: "mlx5"
    NCCL_NET_GDR_LEVEL: "5"    # GPU Direct RDMA

RoCE_v2:
  speed: "100-400 Gb/s per port"
  latency: "2-5 ΞΌs"
  rdma: "RDMA over Ethernet (needs PFC/ECN)"
  cpu_overhead: "Near zero (with proper config)"
  requires:
    - "Ethernet switches with PFC and ECN support"
    - "ConnectX-6/7 NICs"
    - "MOFED/DOCA drivers"
    - "DCB (Data Center Bridging) configuration"
    - "Lossless Ethernet fabric"
  cost: "Medium (enterprise Ethernet switches)"
  best_for: "Converged infrastructure, mid-scale training"
  nccl_env:
    NCCL_IB_DISABLE: "0"
    NCCL_IB_HCA: "mlx5"
    NCCL_IB_GID_INDEX: "3"
    NCCL_NET_GDR_LEVEL: "5"

TCP_Ethernet:
  speed: "25-100 Gb/s per port"
  latency: "10-50 ΞΌs"
  rdma: "None (kernel TCP/IP stack)"
  cpu_overhead: "Significant (kernel copies)"
  requires:
    - "Standard Ethernet switches"
    - "Any NIC"
    - "No special drivers"
  cost: "Low (standard infrastructure)"
  best_for: "Inference, small-scale training (<16 GPUs)"
  nccl_env:
    NCCL_IB_DISABLE: "1"
    NCCL_SOCKET_IFNAME: "eth0"

NCCL Configuration for InfiniBand

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-infiniband
  namespace: ai-training
spec:
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: ib-sriov-net
    spec:
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.03-py3
          env:
            # InfiniBand NCCL settings
            - name: NCCL_IB_DISABLE
              value: "0"
            - name: NCCL_IB_HCA
              value: "mlx5_0,mlx5_1,mlx5_2,mlx5_3"
            - name: NCCL_IB_GID_INDEX
              value: "3"
            - name: NCCL_NET_GDR_LEVEL
              value: "5"          # GPU Direct RDMA
            - name: NCCL_IB_QPS_PER_CONNECTION
              value: "4"
            - name: NCCL_IB_TC
              value: "136"        # Traffic class for DSCP
            - name: NCCL_ALGO
              value: "Ring,Tree"
            - name: NCCL_DEBUG
              value: "INFO"
          resources:
            limits:
              nvidia.com/gpu: 8
              rdma/rdma_shared_device_a: 1

NCCL Configuration for RoCE

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-roce
  namespace: ai-training
spec:
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: roce-sriov-net
    spec:
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.03-py3
          env:
            # RoCE NCCL settings
            - name: NCCL_IB_DISABLE
              value: "0"
            - name: NCCL_IB_HCA
              value: "mlx5_0"
            - name: NCCL_IB_GID_INDEX
              value: "3"          # RoCEv2 GID
            - name: NCCL_NET_GDR_LEVEL
              value: "5"
            - name: NCCL_IB_ROCE_VERSION_NUM
              value: "2"
            # RoCE-specific tuning
            - name: NCCL_IB_SL
              value: "0"
            - name: NCCL_IB_TC
              value: "136"
            - name: NCCL_IB_TIMEOUT
              value: "22"
            - name: NCCL_IB_RETRY_CNT
              value: "13"
            # PFC must be configured on switches
          resources:
            limits:
              nvidia.com/gpu: 8
              rdma/rdma_shared_device_a: 1

NCCL Configuration for TCP Ethernet

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-tcp
  namespace: ai-training
spec:
  template:
    spec:
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.03-py3
          env:
            # TCP fallback (no RDMA)
            - name: NCCL_IB_DISABLE
              value: "1"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
            - name: NCCL_SOCKET_NTHREADS
              value: "4"
            - name: NCCL_NSOCKS_PERTHREAD
              value: "8"
            # TCP tuning
            - name: NCCL_BUFFSIZE
              value: "8388608"    # 8MB buffer
            - name: NCCL_P2P_LEVEL
              value: "NVL"        # NVLink within node
          resources:
            limits:
              nvidia.com/gpu: 8

Switch Configuration Reference

# InfiniBand switch (Quantum-2):
InfiniBand:
  subnet_manager: "UFM or OpenSM"
  partitioning: "Optional P_Key isolation"
  adaptive_routing: "Enabled for large fabrics"
  port_config: "Auto-negotiate NDR (400 Gb/s)"

# Ethernet switch for RoCE:
RoCE_Switch:
  pfc:
    enabled: true
    priority: 3              # Priority for RoCE traffic
    # Without PFC, RoCE drops packets and stalls training
  ecn:
    enabled: true
    threshold: 150000        # ECN marking threshold (bytes)
  dcbx:
    mode: "ieee"
    willing: false
  mtu: 9216                  # Jumbo frames
  # Example (Cumulus/SONIC):
  # nv set qos roce mode lossless
  # nv set interface swp1-48 link mtu 9216

# Standard Ethernet (TCP):
TCP_Switch:
  mtu: 9000                  # Jumbo frames recommended
  # No PFC/ECN needed
  # Standard L2/L3 switching

Performance Benchmarking

# NCCL all-reduce benchmark
kubectl exec -it training-pod -- \
  /usr/local/bin/all_reduce_perf \
  -b 8 -e 2G -f 2 -g 8

# Expected results (8x A100/H100, message size 1GB):
# InfiniBand NDR:  ~380 Gb/s busbw, ~1.0 ms latency
# RoCE 100G:       ~90 Gb/s busbw,  ~2.5 ms latency
# TCP 100G:        ~60 Gb/s busbw,  ~15 ms latency

# ib_write_bw for raw RDMA bandwidth
kubectl exec -it training-pod -- \
  ib_write_bw --size=65536 --duration=10

# Check RDMA devices
kubectl exec -it training-pod -- ibstat
kubectl exec -it training-pod -- rdma link show

# Verify GPU Direct RDMA
kubectl exec -it training-pod -- \
  nvidia-smi topo -m
# Look for "NV#" (NVLink) and "SYS" connections

Decision Matrix

graph TD
    A[Choose Network for AI] --> B{Scale?}
    
    B -->|64+ GPUs, multi-node training| C[InfiniBand NDR]
    B -->|16-64 GPUs| D{Existing infrastructure?}
    B -->|Less than 16 GPUs or inference only| E[TCP Ethernet is fine]
    
    D -->|Ethernet switches with PFC and ECN| F[RoCE v2]
    D -->|No RDMA-capable switches| G[Upgrade switches or use IB]
    D -->|Budget for new fabric| C
    
    C --> H[Best: lowest latency, highest BW]
    F --> I[Good: RDMA over existing Ethernet]
    E --> J[OK: 10-30% overhead vs RDMA]
    
    K[Key Factors] --> L[Latency: IB 1us vs RoCE 3us vs TCP 20us]
    K --> M[Bandwidth: IB 400G vs RoCE 100-400G vs TCP 25-100G]
    K --> N[Complexity: IB needs SM, RoCE needs PFC and ECN]
    K --> O[Cost: IB highest, TCP lowest]

Common Issues

  • RoCE PFC storms β€” misconfigured PFC causes network-wide pause frames; verify ECN is enabled alongside PFC
  • NCCL falls back to TCP despite IB available β€” check NCCL_IB_DISABLE=0; verify ibstat shows Active ports; check RDMA device plugin
  • GPU Direct RDMA not working β€” requires NCCL_NET_GDR_LEVEL=5 and nvidia-peermem kernel module loaded
  • InfiniBand subnet manager missing β€” IB fabric needs exactly one SM; deploy OpenSM or use NVIDIA UFM
  • RoCE performance worse than expected β€” verify PFC is not dropping frames (ethtool -S | grep pause); check ECN marks

Best Practices

  • InfiniBand for large-scale training (64+ GPUs) β€” lowest latency, highest bandwidth
  • RoCE v2 for converged infrastructure β€” RDMA over existing Ethernet with proper PFC/ECN
  • TCP Ethernet is acceptable for inference and small training jobs (<16 GPUs)
  • Always enable GPU Direct RDMA (NCCL_NET_GDR_LEVEL=5) with IB or RoCE
  • Use jumbo frames (MTU 9000+) for all AI network interfaces
  • Run all_reduce_perf benchmarks before production training to validate network
  • Monitor NCCL debug logs during initial runs to verify transport selection
  • Use multiple HCAs (NCCL_IB_HCA=mlx5_0,mlx5_1,...) for multi-rail bandwidth

Key Takeaways

  • InfiniBand: 400 Gb/s, <1ΞΌs latency, native RDMA β€” best for large-scale training
  • RoCE v2: RDMA over Ethernet, needs PFC/ECN on switches β€” good for converged networks
  • TCP Ethernet: 10-30% slower than RDMA β€” acceptable for inference and small training
  • GPU Direct RDMA bypasses CPU for GPU-to-GPU transfers across nodes
  • NCCL auto-selects transport but needs correct environment variables
  • Network choice has diminishing returns for inference (compute-bound) vs training (communication-bound)
  • Cost-performance sweet spot depends on scale: TCP for <16 GPUs, RDMA for 16+
#infiniband #ethernet #rdma #roce #nccl #gpu #networking
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens