ai intermediate ⏱ 20 minutes K8s 1.28+

NCCL_IB_DISABLE Environment Variable

NCCL_IB_DISABLE environment variable explained. Set NCCL_IB_DISABLE=1 for Ethernet-only clusters, debug InfiniBand errors, and tune GPU communication.

By Luca Berton • 5 min read

💡 Quick Answer: Set NCCL_IB_DISABLE=1 when NCCL detects InfiniBand hardware on your cluster but you want it to use TCP/Ethernet instead. Common scenarios: IB drivers installed but no IB fabric, mixed IB/Ethernet nodes, or IB errors during training. Without this flag, NCCL auto-detects IB and attempts to use it, which can hang or fail with little diagnostic output if the fabric isn't properly configured.

The Problem

NCCL (NVIDIA Collective Communications Library) auto-detects network interfaces. When InfiniBand hardware or drivers are present, even if the node is not connected to a working IB fabric, NCCL can select IB as the transport. This causes hangs, timeouts, or "unhandled system error" failures during distributed training. Setting NCCL_IB_DISABLE=1 forces NCCL to skip the IB/RoCE transport and fall back to TCP sockets.

flowchart TB
    NCCL["NCCL Init"] -->|"Auto-detect"| CHECK{"IB hardware<br/>present?"}
    CHECK -->|"Yes (default)"| IB["Use InfiniBand<br/>⚠️ May fail if<br/>no IB fabric"]
    CHECK -->|"No"| TCP["Use TCP/Socket"]
    
    NCCL2["NCCL Init<br/>NCCL_IB_DISABLE=1"] --> FORCE_TCP["Force TCP/Socket<br/>✅ Skip IB detection"]
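
The decision in the flowchart can also be made explicitly before a job starts. The following is a minimal sketch, assuming a Linux node where the kernel lists RDMA ports under /sys/class/infiniband and each port has a state file with text such as "4: ACTIVE". The helper names (active_ib_ports, recommended_nccl_env) are hypothetical, and this is not how NCCL itself detects devices. An ACTIVE port also does not prove the fabric is healthy, so treat the result as a hint.

```python
from pathlib import Path


def active_ib_ports(sysfs_root="/sys/class/infiniband"):
    """Return 'device:port' names whose sysfs state file reports ACTIVE."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # no RDMA devices are registered with the kernel
    active = []
    for state_file in sorted(root.glob("*/ports/*/state")):
        # The state file holds text such as "4: ACTIVE" or "1: DOWN".
        state = state_file.read_text().split(":", 1)[-1].strip().upper()
        if state == "ACTIVE":
            device = state_file.parts[-4]  # e.g. mlx5_0
            port = state_file.parts[-2]    # e.g. 1
            active.append(f"{device}:{port}")
    return active


def recommended_nccl_env(sysfs_root="/sys/class/infiniband", ifname="eth0"):
    """Suggest NCCL variables: leave IB alone if a port is ACTIVE, else force sockets."""
    if active_ib_ports(sysfs_root):
        return {}
    return {"NCCL_IB_DISABLE": "1", "NCCL_SOCKET_IFNAME": ifname}
```

Calling recommended_nccl_env() on a node and merging the result into the launcher's environment keeps the override limited to nodes that actually need it.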

The Solution

Set in Kubernetes Pod

apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.07-py3
      env:
        # Disable InfiniBand; use TCP/Ethernet
        - name: NCCL_IB_DISABLE
          value: "1"
        # Specify which Ethernet interface to use
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        # Optional: improve TCP performance
        - name: NCCL_SOCKET_NTHREADS
          value: "4"
        - name: NCCL_NSOCKS_PERTHREAD
          value: "4"
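
The two optional socket-tuning variables interact, so it can help to validate them before they reach a manifest. The sketch below is illustrative only: the function names are hypothetical, and the limits it enforces (1 to 16 helper threads, and a product of threads and sockets per thread of at most 64) are taken from the NCCL environment-variable documentation as I understand it, so check them against the NCCL version you run.

```python
def ethernet_nccl_env(ifname="eth0", nthreads=4, nsocks_per_thread=4):
    """Return NCCL variables for a TCP-only setup, rejecting out-of-range socket tuning."""
    if not 1 <= nthreads <= 16:
        raise ValueError("NCCL_SOCKET_NTHREADS must be between 1 and 16")
    if nsocks_per_thread < 1 or nthreads * nsocks_per_thread > 64:
        raise ValueError(
            "NCCL_SOCKET_NTHREADS * NCCL_NSOCKS_PERTHREAD must not exceed 64"
        )
    return {
        "NCCL_IB_DISABLE": "1",
        "NCCL_SOCKET_IFNAME": ifname,
        "NCCL_SOCKET_NTHREADS": str(nthreads),
        "NCCL_NSOCKS_PERTHREAD": str(nsocks_per_thread),
    }


def as_container_env(env):
    """Convert a dict into the name/value list used under a container's `env:` key."""
    return [{"name": key, "value": value} for key, value in env.items()]


print(as_container_env(ethernet_nccl_env()))
```

The defaults mirror the Pod manifest above (4 threads, 4 sockets per thread); all values are emitted as strings because Kubernetes requires string env values.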

Set in PyTorchJob

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              # Same image as the Pod example above
              image: nvcr.io/nvidia/pytorch:24.07-py3
              env:
                - name: NCCL_IB_DISABLE
                  value: "1"
                - name: NCCL_SOCKET_IFNAME
                  value: "eth0"
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.07-py3
              env:
                - name: NCCL_IB_DISABLE
                  value: "1"
                - name: NCCL_SOCKET_IFNAME
                  value: "eth0"

When to Use NCCL_IB_DISABLE=1

| Scenario | Set NCCL_IB_DISABLE=1? | Why |
| --- | --- | --- |
| Cloud VMs (AWS, GCP, Azure) without IB | Yes | IB drivers are often pre-installed but there is no fabric |
| On-prem with InfiniBand fabric | No | IB provides 200-400 Gb/s vs ~25 Gb/s TCP |
| Mixed IB + Ethernet nodes | Yes, on Ethernet nodes | Prevents IB selection on non-IB nodes |
| IB errors during NCCL init | Yes (temporary) | Unblocks training while you debug IB |
| Single-node multi-GPU | No | NCCL uses NVLink/PCIe, not the network |
| EFA on AWS (Elastic Fabric Adapter) | Yes | Use FI_PROVIDER=efa instead of IB |

Diagnose IB Issues

# Check if IB devices are detected
ibstat 2>/dev/null || echo "No IB tools installed"

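# Check NCCL's transport selection (single-process smoke test).
# The env:// rendezvous needs MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# and NCCL initializes lazily, so run one collective to trigger the NET/* lines.
MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 RANK=0 WORLD_SIZE=1 NCCL_DEBUG=INFO python -c "
import torch
import torch.distributed as dist
dist.init_process_group('nccl')
dist.all_reduce(torch.ones(1, device='cuda'))
dist.destroy_process_group()
" 2>&1 | grep -i "ib\|socket\|net\|transport"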

# Expected output with IB disabled:
# NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
# (no IB/RDMA lines)

# Expected output with IB enabled:
# NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB
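
When the same check runs across many pods, reading logs by eye gets tedious. The following is a rough sketch that classifies captured NCCL_DEBUG=INFO output, assuming the log lines look like the two samples above; the exact wording varies between NCCL versions, and detect_network_transport is a hypothetical helper name.

```python
import re


def detect_network_transport(log_text):
    """Classify NCCL_DEBUG=INFO output as 'ib', 'socket', or 'unknown'."""
    # IB is checked first because a run that uses IB may still log
    # socket information for its out-of-band bootstrap connection.
    if re.search(r"NET/IB\s*:\s*Using\b", log_text):
        return "ib"
    if re.search(r"NET/Socket\s*:\s*Using\b", log_text):
        return "socket"
    return "unknown"


sample = "NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>"
print(detect_network_transport(sample))
```

A result of "unknown" usually means the log was captured before NCCL initialized its network layer or without NCCL_DEBUG=INFO set.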

Performance Comparison

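# Run NCCL all-reduce benchmark with IB
# (for multi-node numbers, start the ranks on every node with your MPI launcher;
#  a single-node run stays on NVLink/PCIe and never exercises the network)
NCCL_IB_DISABLE=0 \
  /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1

# Run with TCP/Socket
NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME=eth0 \
  /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1

# Typical results (8 GPUs across 2 nodes). nccl-tests prints bus bandwidth in GB/s,
# and one inter-node link cannot exceed its line rate (Gb/s divided by 8):
# IB HDR (200 Gb/s):   ~180 Gb/s (~22 GB/s) bus bandwidth
# TCP 25GbE:           ~20 Gb/s (~2.5 GB/s) bus bandwidth
# TCP 100GbE:          ~80 Gb/s (~10 GB/s) bus bandwidth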
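
To sanity-check benchmark output, it helps to know how the bus-bandwidth column is derived and what a link can deliver at most. The sketch below uses the all-reduce formula from the nccl-tests performance notes (bus bandwidth = algorithm bandwidth × 2(n-1)/n) and a plain bits-to-bytes conversion; the function names are hypothetical and the ceiling ignores protocol overhead.

```python
def link_ceiling_gbytes_per_s(link_gbits_per_s):
    """Upper bound in GB/s for one link whose line rate is given in Gb/s."""
    return link_gbits_per_s / 8.0


def allreduce_bus_bandwidth(size_bytes, seconds, n_ranks):
    """Bus bandwidth in GB/s for an all-reduce, following the nccl-tests definition."""
    algbw = size_bytes / seconds / 1e9          # algorithm bandwidth in GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks  # all-reduce correction factor


print(link_ceiling_gbytes_per_s(200))            # HDR InfiniBand link
print(link_ceiling_gbytes_per_s(25))             # 25GbE link
print(allreduce_bus_bandwidth(1e9, 0.5, 8))      # 1 GB reduced in 0.5 s on 8 ranks
```

If a measured inter-node bus bandwidth is far below the ceiling for the link in use, the wrong transport or interface is a likely cause.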
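
Related NCCL Environment Variables

| Variable | Purpose | Default |
| --- | --- | --- |
| NCCL_IB_DISABLE=1 | Skip InfiniBand, use TCP | 0 (IB auto-detected) |
| NCCL_SOCKET_IFNAME=eth0 | Which interface for TCP | Auto-detect |
| NCCL_IB_HCA=mlx5_0 | Which IB device to use | All detected |
| NCCL_IB_GID_INDEX=3 | RoCE GID index | 0 (newer NCCL releases select it automatically) |
| NCCL_NET_GDR_LEVEL=5 | GPUDirect RDMA level | Auto |
| NCCL_DEBUG=INFO | Enable debug logging | Unset (no debug output) |
| NCCL_P2P_DISABLE=1 | Disable GPU peer-to-peer | 0 |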

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| NCCL hang on init | IB detected but no fabric | Set NCCL_IB_DISABLE=1 |
| unhandled system error | IB driver version mismatch | Update MLNX_OFED or disable IB |
| Slow training with IB disabled | TCP is much slower than IB | Fix IB fabric or upgrade to 100GbE |
| ib_register_peer_memory_client error | nvidia-peermem module issue | modprobe nvidia-peermem or disable GPUDirect |
| Wrong interface selected | Multiple NICs present | Set NCCL_SOCKET_IFNAME explicitly |

Best Practices

  • Don't disable IB if you have working InfiniBand: it is typically 5-10× faster than TCP
  • Always set NCCL_SOCKET_IFNAME with IB disabled: prevents NCCL from picking the wrong interface on multi-NIC nodes
  • Use NCCL_DEBUG=INFO to diagnose: shows exactly which transport NCCL selects
  • Set in ConfigMap for consistency: all training pods get the same NCCL config
  • Test with nccl-tests before training: verify bandwidth before wasting GPU hours
  • Consider EFA on AWS: better than TCP, different from IB (NCCL_IB_DISABLE=1 + FI_PROVIDER=efa)
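
The ConfigMap practice above can be sketched as a small generator. This is one possible approach rather than a prescribed one: the ConfigMap name nccl-config, the namespace, and both function names are hypothetical, and the manifest is emitted as JSON, which kubectl accepts alongside YAML.

```python
import json


def nccl_configmap(name="nccl-config", namespace="default", ifname="eth0"):
    """Build a ConfigMap manifest (as a dict) holding Ethernet-only NCCL settings."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # ConfigMap data values must be strings, hence "1" rather than 1.
        "data": {
            "NCCL_IB_DISABLE": "1",
            "NCCL_SOCKET_IFNAME": ifname,
        },
    }


def env_from_configmap(name="nccl-config"):
    """Container-level fragment that imports every ConfigMap key as an env var."""
    return {"envFrom": [{"configMapRef": {"name": name}}]}


print(json.dumps(nccl_configmap(), indent=2))
print(json.dumps(env_from_configmap(), indent=2))
```

The first document can be piped to kubectl apply -f -, and the envFrom fragment belongs inside each trainer container spec so that every pod reads the same values.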

Key Takeaways

  • NCCL_IB_DISABLE=1 forces NCCL to use TCP instead of InfiniBand
  • Needed when IB hardware/drivers exist but no IB fabric is connected
  • Common in cloud VMs where MLNX_OFED is pre-installed
  • Always pair with NCCL_SOCKET_IFNAME to select the right interface
  • IB is 5-10× faster than TCP, so only disable it when IB isn't available
  • Use NCCL_DEBUG=INFO to verify which transport NCCL actually uses
#nccl #infiniband #rdma #gpu-networking #distributed-training
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
