ai advanced ⏱ 15 minutes K8s 1.28+

NCCL Environment Variables Complete Reference

Complete reference for NCCL environment variables on Kubernetes. Configure network transport, InfiniBand, GPUDirect RDMA, socket

By Luca Berton • 📖 11 min read

💡 Quick Answer: NCCL environment variables control network transport selection, InfiniBand configuration, GPUDirect RDMA, TCP socket tuning, algorithm selection, and debugging output. Set them in your Pod spec env section. Key variables: NCCL_SOCKET_IFNAME (network interface), NCCL_IB_HCA (IB devices), NCCL_NET_GDR_LEVEL (GPUDirect RDMA), NCCL_DEBUG (logging), NCCL_IB_DISABLE (disable IB).

The Problem

  • NCCL has 50+ environment variables with no single reference page
  • Wrong network configuration silently degrades performance by 10-100x
  • Debugging distributed training failures requires knowing which debug variables to set
  • Kubernetes pods often need explicit env vars, because auto-detection inside a container can pick the wrong interface or see only part of the host topology
  • InfiniBand, RoCE, and TCP each need different tuning variables

The Solution

Network Interface Selection

env:
  # NCCL_SOCKET_IFNAME: select network interface for TCP/socket communication
  # Prefix with = for exact match, ^ for exclusion
  - name: NCCL_SOCKET_IFNAME
    value: "=eth0"           # Use exactly eth0
    # value: "eth"           # Any interface starting with "eth"
    # value: "^docker0,lo"   # Exclude docker0 and loopback
    # value: "=ib0"          # Use InfiniBand interface

  # NCCL_NET: force network transport type
  - name: NCCL_NET
    value: "IB"              # Force InfiniBand (IB | Socket)
    # value: "Socket"        # Force TCP sockets (disable IB/RDMA)
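
The prefix rules above (no prefix = name prefix match, = for exact match, ^ for exclusion) can be sketched in a few lines of Python. This is only an illustrative approximation of the matching rules as described here, not NCCL's own implementation; the function name select_interfaces and the sample interface list are hypothetical.

```python
def select_interfaces(spec: str, interfaces: list) -> list:
    """Approximate NCCL_SOCKET_IFNAME filtering (illustrative, not NCCL's code)."""
    exclude = spec.startswith("^")   # ^ inverts the match
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")     # = requires full-name equality
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def matches(name: str) -> bool:
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep a name when "matches" and "exclude" disagree.
    return [name for name in interfaces if matches(name) != exclude]


nics = ["lo", "eth0", "eth1", "docker0", "ib0"]  # hypothetical interface names
print(select_interfaces("=eth0", nics))          # exact match
print(select_interfaces("eth", nics))            # prefix match
print(select_interfaces("^docker0,lo", nics))    # exclusion
```

Running a pattern through a helper like this before deploying is a cheap way to see which interfaces a value would leave eligible on a given node.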

InfiniBand Configuration

env:
  # NCCL_IB_DISABLE: disable the InfiniBand/RoCE verbs transport
  - name: NCCL_IB_DISABLE
    value: "0"               # 0=enable (default), 1=disable IB entirely
    # Set to "1" to force TCP even when IB is available

  # NCCL_IB_HCA: select specific InfiniBand HCA devices
  - name: NCCL_IB_HCA
    value: "=mlx5_0,mlx5_1,mlx5_2,mlx5_3"
    # = prefix: exact device names
    # ^ prefix: exclude devices
    # No prefix: match prefix (mlx5 matches all mlx5_*)
    # value: "^mlx5_bond0"   # Exclude bonded device

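  # NCCL_IB_GID_INDEX: GID table index used for RoCE
  - name: NCCL_IB_GID_INDEX
    value: "3"               # Typically 3 for RoCE v2 (IPv4)
    # The GID table layout is host-specific; check it before hardcoding an index.
    # A common layout on RoCE NICs:
    # 0 = RoCE v1 (link-local IPv6)
    # 1 = RoCE v2 (link-local IPv6)
    # 2 = RoCE v1 (IPv4)
    # 3 = RoCE v2 (IPv4) ← most common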

  # NCCL_IB_TIMEOUT: IB transport timeout exponent
  - name: NCCL_IB_TIMEOUT
    value: "23"              # Timeout = 4.096 µs × 2^value
    # 14 = ~67ms (default in older NCCL releases; newer releases default higher)
    # 22 = ~17s
    # 23 = ~34s (recommended for large clusters)

  # NCCL_IB_RETRY_CNT: IB retry count
  - name: NCCL_IB_RETRY_CNT
    value: "7"               # Max retries (default: 7, max: 7)

  # NCCL_IB_SL: InfiniBand Service Level (QoS)
  - name: NCCL_IB_SL
    value: "0"               # Service Level 0-15 (maps to VL)

  # NCCL_IB_TC: Traffic Class (for DSCP/ECN marking)
  - name: NCCL_IB_TC
    value: "106"             # Traffic class value
    # 106 = DSCP 26 (AF31), common for GPU traffic with PFC

  # NCCL_IB_QPS_PER_CONNECTION: Queue Pairs per connection
  - name: NCCL_IB_QPS_PER_CONNECTION
    value: "4"               # Default: 1. Higher = more IB bandwidth per peer

  # NCCL_IB_ADAPTIVE_ROUTING: enable IB adaptive routing
  - name: NCCL_IB_ADAPTIVE_ROUTING
    value: "1"               # 0=disable, 1=enable (requires switch support)

  # NCCL_IB_AR_THRESHOLD: adaptive routing message size threshold
  - name: NCCL_IB_AR_THRESHOLD
    value: "8192"            # Only use AR for messages > this size (bytes)
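
Two of the values above are easier to reason about with the arithmetic written out: the NCCL_IB_TIMEOUT exponent (4.096 µs × 2^value) and the split of NCCL_IB_TC=106 into DSCP and ECN bits. The sketch below is a plain calculation under those stated formulas; the function names are hypothetical helpers, not part of any NCCL API.

```python
def ib_timeout_seconds(value: int) -> float:
    """Timeout in seconds for an NCCL_IB_TIMEOUT exponent: 4.096 µs × 2^value."""
    return 4.096e-6 * (2 ** value)


def split_traffic_class(tc: int) -> tuple:
    """Split an 8-bit traffic class into (DSCP, ECN): upper 6 bits, lower 2 bits."""
    return tc >> 2, tc & 0b11


for exponent in (14, 22, 23):
    print(f"NCCL_IB_TIMEOUT={exponent}: {ib_timeout_seconds(exponent):.3f} s")

dscp, ecn = split_traffic_class(106)
print(f"NCCL_IB_TC=106: DSCP {dscp}, ECN bits {ecn}")
```

Because each step of the exponent doubles the timeout, moving from 22 to 23 takes the value from roughly 17 s to roughly 34 s.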

GPUDirect RDMA

env:
  # NCCL_NET_GDR_LEVEL: maximum NIC-to-GPU topology distance for GPUDirect RDMA
  - name: NCCL_NET_GDR_LEVEL
    value: "5"
    # Legacy integer values (the names LOC/PIX/PXB/PHB/SYS are also accepted):
    # 0 = LOC: disabled (no GDR)
    # 1 = PIX: NIC and GPU on the same PCIe switch
    # 2 = PXB: connected through multiple PCIe switches
    # 3 = PHB: same NUMA node, traffic crosses the CPU host bridge
    # 4 = SYS: any distance, including across NUMA nodes
    # Values above 4 (such as 5 here) are treated as SYS ← allows cross-NUMA GDR

  # NCCL_NET_GDR_READ: enable GPUDirect RDMA for read operations
  - name: NCCL_NET_GDR_READ
    value: "1"               # 0=disable, 1=enable
    # Allows NIC to read directly from GPU memory
    # Requires NVIDIA peer memory module (nvidia_peermem)

  # NCCL_P2P_DISABLE: disable the GPU peer-to-peer (P2P) transport over NVLink or PCIe
  - name: NCCL_P2P_DISABLE
    value: "0"               # 0=enable P2P (default), 1=disable
    # Disable if seeing GPU errors on some PCIe topologies

  # NCCL_P2P_LEVEL: P2P topology level
  - name: NCCL_P2P_LEVEL
    value: "5"               # Same scale as GDR_LEVEL
    # Controls intra-node GPU-to-GPU PCIe P2P

  # NCCL_SHM_DISABLE: disable shared memory transport
  - name: NCCL_SHM_DISABLE
    value: "0"               # 0=enable (default), 1=disable
    # SHM used for intra-node when P2P not available
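
NCCL_NET_GDR_LEVEL and NCCL_P2P_LEVEL accept either a topology name or a legacy integer, which makes manifests hard to compare across teams. The sketch below normalizes a value to its name, assuming the legacy mapping as I understand it from the NCCL documentation (0 LOC, 1 PIX, 2 PXB, 3 PHB, 4 SYS, larger integers read as SYS). Treat that mapping as an assumption to verify against your NCCL version; the helper name is hypothetical.

```python
# Assumed legacy integer mapping; verify against the NCCL docs for your version.
LEGACY_LEVELS = {0: "LOC", 1: "PIX", 2: "PXB", 3: "PHB", 4: "SYS"}
NAMED_LEVELS = set(LEGACY_LEVELS.values())


def normalize_gdr_level(value: str) -> str:
    """Return the named level for a GDR level given as a name or legacy integer."""
    text = value.strip().upper()
    if text in NAMED_LEVELS:
        return text
    level = int(text)  # raises ValueError if it is neither a name nor an integer
    if level < 0:
        raise ValueError(f"negative level: {value}")
    return LEGACY_LEVELS.get(level, "SYS")  # integers above 4 are read as SYS


for raw in ("0", "3", "5", "pix"):
    print(raw, "->", normalize_gdr_level(raw))
```

Writing the name (for example SYS) in the Pod spec instead of an integer avoids depending on the legacy numbering at all.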

TCP/Socket Tuning

env:
  # NCCL_SOCKET_NTHREADS: number of threads per socket connection
  - name: NCCL_SOCKET_NTHREADS
    value: "4"               # Default: 1. Range: 1-16
    # More threads = higher TCP bandwidth (at CPU cost)

  # NCCL_NSOCKS_PERTHREAD: sockets per thread
  - name: NCCL_NSOCKS_PERTHREAD
    value: "4"               # Default: 1. Range: 1-16
    # Total sockets = NTHREADS × NSOCKS_PERTHREAD (max 64)
    # 4 × 4 = 16 sockets per peer connection

  # NCCL_BUFFSIZE: communication buffer size
  - name: NCCL_BUFFSIZE
    value: "8388608"         # 8MB (default: 4MB)
    # Larger = better bandwidth for large messages
    # Uses GPU memory, so don't set too high

  # NCCL_SOCKET_FAMILY: IP version for socket connections
  - name: NCCL_SOCKET_FAMILY
    value: "AF_INET"         # AF_INET (IPv4) or AF_INET6 (IPv6)
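
The socket budget above (each value in the range 1-16, product at most 64) is easy to get wrong when both knobs are raised together. A minimal check, using only the limits stated in this section, might look like the following; the function name is a hypothetical helper.

```python
MAX_TOTAL_SOCKETS = 64  # limit on NTHREADS × NSOCKS_PERTHREAD stated above


def total_sockets(nthreads: int, nsocks_per_thread: int) -> int:
    """Return sockets per peer connection, or raise if the combination is out of range."""
    for name, value in (("NCCL_SOCKET_NTHREADS", nthreads),
                        ("NCCL_NSOCKS_PERTHREAD", nsocks_per_thread)):
        if not 1 <= value <= 16:
            raise ValueError(f"{name}={value} is outside the range 1-16")
    total = nthreads * nsocks_per_thread
    if total > MAX_TOTAL_SOCKETS:
        raise ValueError(f"{total} sockets exceeds the limit of {MAX_TOTAL_SOCKETS}")
    return total


print(total_sockets(4, 4))  # the 4 × 4 example above
print(total_sockets(8, 4))
```

A check like this fits naturally into whatever templating or CI step generates the Pod manifests.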

Algorithm and Protocol Selection

env:
  # NCCL_ALGO: collective algorithm (usually let NCCL auto-select)
  - name: NCCL_ALGO
    value: "Ring,Tree"       # Comma-separated allowed algorithms
    # Ring: good for large messages, predictable bandwidth
    # Tree: good for small messages, lower latency
    # CollnetDirect: InfiniBand SHARP (requires switch support)
    # CollnetChain: InfiniBand SHARP chained
    # NVLS: NVLink SHARP (H100+ NVSwitch)
    # ⚠️ Usually best to NOT set this (auto-select is optimal)

  # NCCL_PROTO: wire protocol
  - name: NCCL_PROTO
    value: "Simple,LL,LL128" # Comma-separated allowed protocols
    # LL: Low Latency (8-byte packets, good for <256KB)
    # LL128: Low Latency 128-byte (good for <1MB)
    # Simple: high bandwidth (good for >1MB)
    # ⚠️ Usually best to NOT set this

  # NCCL_MIN_NCHANNELS: minimum communication channels
  - name: NCCL_MIN_NCHANNELS
    value: "4"               # Default varies by GPU
    # More channels = more parallelism = more GPU memory

  # NCCL_MAX_NCHANNELS: maximum communication channels
  - name: NCCL_MAX_NCHANNELS
    value: "32"              # Default varies by GPU
    # H100: default max 32
    # A100: default max 16

  # NCCL_NTHREADS: GPU threads per channel
  - name: NCCL_NTHREADS
    value: "512"             # Allowed: 64, 128, 256, 512 (default 512 on recent GPUs)
    # Higher = more GPU resources for communication

  # NCCL_CROSS_NIC: allow cross-NIC (non-rail) communication
  - name: NCCL_CROSS_NIC
    value: "0"               # 0=same rail only, 1=cross-NIC allowed, 2=auto
    # Rail-optimized networks: set to 0
    # Full-mesh networks: set to 1 or 2

Topology and Tuning

env:
  # NCCL_TOPO_FILE: custom topology XML file
  - name: NCCL_TOPO_FILE
    value: "/etc/nccl/topo.xml"
    # Override auto-detected topology
    # Useful when running in containers with limited /sys access

  # NCCL_TOPO_DUMP_FILE: dump detected topology to file
  - name: NCCL_TOPO_DUMP_FILE
    value: "/tmp/nccl-topo.xml"
    # Saves detected topology on first run
    # Use as NCCL_TOPO_FILE for subsequent runs (skips detection)

  # NCCL_GRAPH_FILE: communication graph file
  - name: NCCL_GRAPH_FILE
    value: "/etc/nccl/graph.xml"
    # Custom channel/ring configuration

  # NCCL_GRAPH_DUMP_FILE: dump communication graph
  - name: NCCL_GRAPH_DUMP_FILE
    value: "/tmp/nccl-graph.xml"

  # NCCL_COLLNET_ENABLE: enable collective network offload (SHARP)
  - name: NCCL_COLLNET_ENABLE
    value: "0"               # 0=disable (default), 1=enable
    # Requires InfiniBand SHARP support on switches

  # NCCL_LAUNCH_MODE: how NCCL launches CUDA kernels
  - name: NCCL_LAUNCH_MODE
    value: "GROUP"           # PARALLEL | GROUP
    # GROUP: cooperative-group launch, relevant when one process manages several GPUs

Debugging and Logging

env:
  # NCCL_DEBUG: debug output verbosity
  - name: NCCL_DEBUG
    value: "INFO"
    # WARN: warnings only (production)
    # INFO: initialization + transport selection
    # TRACE: all operations (very verbose, impacts performance)
    # VERSION: just print NCCL version

  # NCCL_DEBUG_SUBSYS: filter debug by subsystem
  - name: NCCL_DEBUG_SUBSYS
    value: "INIT,NET,GRAPH"
    # INIT: initialization
    # NET: network operations
    # GRAPH: topology graph
    # COLL: collectives
    # P2P: peer-to-peer
    # SHM: shared memory
    # NVLS: NVLink SHARP
    # ALL: everything

  # NCCL_DEBUG_FILE: redirect debug to file (per-rank)
  - name: NCCL_DEBUG_FILE
    value: "/tmp/nccl-debug-%h-%p.log"
    # %h = hostname, %p = PID
    # Useful for multi-GPU debugging without interleaved output
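
To predict which log file a given rank will write, the %h and %p placeholders can be expanded by hand. The sketch below mimics that substitution in Python for illustration only; it is not how NCCL performs the expansion internally, and the function name and sample hostname are hypothetical.

```python
import os
import socket
from typing import Optional


def expand_debug_file(pattern: str,
                      hostname: Optional[str] = None,
                      pid: Optional[int] = None) -> str:
    """Expand %h (hostname) and %p (PID) in an NCCL_DEBUG_FILE-style pattern."""
    host = hostname if hostname is not None else socket.gethostname()
    process_id = pid if pid is not None else os.getpid()
    return pattern.replace("%h", host).replace("%p", str(process_id))


# Explicit values make the result reproducible; omit them to use the local host and PID.
print(expand_debug_file("/tmp/nccl-debug-%h-%p.log", hostname="gpu-node-1", pid=4242))
```

With hostNetwork: true, %h resolves to the node name rather than the Pod name, which matters when collecting logs from several Pods on one node.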

Complete Pod Example

apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
  namespace: ml-workloads
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.05-py3
      env:
        # === Network Selection ===
        - name: NCCL_SOCKET_IFNAME
          value: "=eth0"
        - name: NCCL_NET
          value: "IB"
        - name: NCCL_IB_DISABLE
          value: "0"

        # === InfiniBand ===
        - name: NCCL_IB_HCA
          value: "=mlx5_0,mlx5_1,mlx5_2,mlx5_3"
        - name: NCCL_IB_GID_INDEX
          value: "3"
        - name: NCCL_IB_TIMEOUT
          value: "23"
        - name: NCCL_IB_RETRY_CNT
          value: "7"
        - name: NCCL_IB_QPS_PER_CONNECTION
          value: "4"

        # === GPUDirect RDMA ===
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
        - name: NCCL_NET_GDR_READ
          value: "1"

        # === Topology ===
        - name: NCCL_TOPO_FILE
          value: "/etc/nccl/topo.xml"
        - name: NCCL_CROSS_NIC
          value: "0"

        # === Performance ===
        - name: NCCL_BUFFSIZE
          value: "8388608"
        - name: NCCL_MIN_NCHANNELS
          value: "4"

        # === Debugging (remove in production) ===
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_DEBUG_SUBSYS
          value: "INIT,NET"

      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
        - name: nccl-topo
          mountPath: /etc/nccl
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 64Gi
    - name: nccl-topo
      configMap:
        name: nccl-topology
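
Every value in the env list above is quoted because Kubernetes expects environment variable values to be strings; an unquoted 0 or 23 in YAML is parsed as an integer and rejected. When manifests are generated programmatically, a small helper can enforce that. This is a minimal sketch with a hypothetical function name, not part of any Kubernetes client library.

```python
import json


def to_pod_env(settings: dict) -> list:
    """Convert a mapping of NCCL settings into Pod env entries with string values."""
    return [{"name": name, "value": str(value)} for name, value in settings.items()]


settings = {
    "NCCL_SOCKET_IFNAME": "=eth0",
    "NCCL_IB_DISABLE": 0,    # integers are converted to "0"
    "NCCL_IB_TIMEOUT": 23,
}
print(json.dumps(to_pod_env(settings), indent=2))
```

The JSON output can be pasted under a container's env key as-is, since JSON is valid YAML.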

Quick Reference Table

Variable                     │ Default  │ Values           │ Purpose
─────────────────────────────┼──────────┼──────────────────┼────────────────────────
NCCL_SOCKET_IFNAME           │ auto     │ =eth0, ^lo       │ Network interface
NCCL_NET                     │ auto     │ IB, Socket       │ Force transport
NCCL_IB_DISABLE              │ 0        │ 0, 1             │ Disable InfiniBand
NCCL_IB_HCA                  │ auto     │ =mlx5_0,...      │ Select IB devices
NCCL_IB_GID_INDEX            │ 0        │ GID table index  │ RoCE GID index
NCCL_IB_TIMEOUT              │ varies   │ 1-31             │ IB timeout exponent
NCCL_IB_RETRY_CNT            │ 7        │ 0-7              │ IB retries
NCCL_IB_SL                   │ 0        │ 0-15             │ Service level
NCCL_IB_TC                   │ 0        │ 0-255            │ Traffic class
NCCL_IB_QPS_PER_CONNECTION   │ 1        │ 1-128            │ QPs per conn
NCCL_IB_ADAPTIVE_ROUTING     │ 0        │ 0, 1             │ Adaptive routing
NCCL_NET_GDR_LEVEL           │ auto     │ LOC-SYS (or int) │ GPUDirect RDMA distance
NCCL_NET_GDR_READ            │ 0        │ 0, 1             │ GDR read enable
NCCL_P2P_DISABLE             │ 0        │ 0, 1             │ Disable GPU P2P
NCCL_P2P_LEVEL               │ auto     │ LOC-SYS (or int) │ P2P topology level
NCCL_SHM_DISABLE             │ 0        │ 0, 1             │ Disable shared mem
NCCL_SOCKET_NTHREADS         │ 1        │ 1-16             │ TCP threads
NCCL_NSOCKS_PERTHREAD        │ 1        │ 1-16             │ Sockets per thread
NCCL_BUFFSIZE                │ 4194304  │ bytes            │ Buffer size
NCCL_ALGO                    │ auto     │ Ring,Tree,...    │ Algorithm
NCCL_PROTO                   │ auto     │ LL,LL128,Simple  │ Protocol
NCCL_MIN_NCHANNELS           │ varies   │ 1-32             │ Min channels
NCCL_MAX_NCHANNELS           │ varies   │ 1-32             │ Max channels
NCCL_NTHREADS                │ 512      │ 64,128,256,512   │ GPU threads/channel
NCCL_CROSS_NIC               │ 2        │ 0, 1, 2          │ Cross-NIC policy
NCCL_TOPO_FILE               │ none     │ path             │ Topology XML
NCCL_TOPO_DUMP_FILE          │ none     │ path             │ Dump topology
NCCL_COLLNET_ENABLE          │ 0        │ 0, 1             │ SHARP offload
NCCL_DEBUG                   │ unset    │ WARN,INFO,TRACE  │ Log level
NCCL_DEBUG_SUBSYS            │ INIT,ENV │ INIT,NET,...     │ Log filter
NCCL_DEBUG_FILE              │ stderr   │ path (%h,%p)     │ Log file
─────────────────────────────┴──────────┴──────────────────┴────────────────────────

Common Issues

NCCL_IB_DISABLE=1 but performance is bad

  • Cause: Forcing TCP when IB hardware is available
  • Fix: Only disable IB if hardware is broken; set NCCL_SOCKET_NTHREADS=8 and NCCL_NSOCKS_PERTHREAD=4 for TCP

"Invalid argument" on modprobe nvidia_peermem

  • Cause: Driver version mismatch between nvidia.ko and nvidia_peermem.ko
  • Fix: Ensure GPU Operator installs matching driver + peermem versions; check dmesg for details

NCCL_NET_GDR_LEVEL set but GDR not active

  • Cause: nvidia_peermem module not loaded, or NIC not RDMA-capable
  • Fix: Verify lsmod | grep nvidia_peermem; check ibv_devinfo shows active port

NCCL_SOCKET_IFNAME wrong interface selected

  • Cause: Multiple interfaces match prefix pattern
  • Fix: Use the = prefix for an exact match: set the value to "=eth0" (written NCCL_SOCKET_IFNAME==eth0 on a command line)

High latency despite IB being enabled

  • Cause: NCCL_IB_GID_INDEX wrong for RoCE setup (using IB native on RoCE fabric)
  • Fix: Set GID index to 3 for RoCE v2; verify with ibv_devinfo -d mlx5_0 -v | grep GID

Best Practices

  1. Don't set NCCL_ALGO/NCCL_PROTO: auto-selection is correct 95% of the time
  2. Always set NCCL_SOCKET_IFNAME: Kubernetes pods may have multiple interfaces
  3. Use NCCL_TOPO_FILE in containers: avoids 10-30s topology detection on every start
  4. Set NCCL_DEBUG=INFO for initial runs: verify transport selection, then reduce to WARN
  5. NCCL_IB_TIMEOUT=23 for large clusters: prevents spurious timeout failures
  6. NCCL_CROSS_NIC=0 for rail-optimized networks: avoids suboptimal cross-switch paths
  7. Match NCCL_IB_HCA to GPU affinity: ensure each GPU uses its nearest NIC
  8. NCCL_BUFFSIZE=8388608 for large models: improves bandwidth for multi-GB transfers
  9. Use NCCL_DEBUG_FILE in multi-GPU jobs: prevents interleaved log output
  10. Test changes with nccl-tests: measure all_reduce_perf before and after tuning

Key Takeaways

  • NCCL environment variables control all aspects of GPU collective communication
  • NCCL_IB_DISABLE=1 forces TCP, which is 5-10x slower than IB/RDMA (use only for debugging)
  • NCCL_NET_GDR_LEVEL=5 + NCCL_NET_GDR_READ=1 enables GPUDirect RDMA at any PCIe distance
  • NCCL_IB_GID_INDEX=3 is typically the right index for RoCE v2 (IPv4); a wrong index causes connection failures
  • TCP tuning: NCCL_SOCKET_NTHREADS × NCCL_NSOCKS_PERTHREAD = total sockets (max 64)
  • NCCL_TOPO_FILE eliminates topology detection overhead in containers
  • NCCL_DEBUG=INFO + NCCL_DEBUG_SUBSYS=INIT,NET shows transport selection without noise
  • Don't manually set algorithms/protocols unless benchmarking proves improvement
  • All variables are set via the Pod env section; no config files are needed