NCCL Environment Variables Complete Reference
Complete reference for NCCL environment variables on Kubernetes. Configure network transport, InfiniBand, GPUDirect RDMA, socket
💡 Quick Answer: NCCL environment variables control network transport selection, InfiniBand configuration, GPUDirect RDMA, TCP socket tuning, algorithm selection, and debugging output. Set them in your Pod spec env section. Key variables: NCCL_SOCKET_IFNAME (network interface), NCCL_IB_HCA (IB devices), NCCL_NET_GDR_LEVEL (GPUDirect RDMA), NCCL_DEBUG (logging), NCCL_IB_DISABLE (disable IB).
The Problem
- NCCL has 50+ environment variables with no single reference page
- Wrong network configuration silently degrades performance by 10-100x
- Debugging distributed training failures requires knowing which debug variables to set
- Kubernetes pods need explicit env vars because NCCL can't auto-detect across containers
- InfiniBand, RoCE, and TCP each need different tuning variables
The Solution
Network Interface Selection
env:
  # NCCL_SOCKET_IFNAME: select the network interface for TCP/socket communication
  # Prefix with = for an exact match, ^ for exclusion
  - name: NCCL_SOCKET_IFNAME
    value: "=eth0"          # Use exactly eth0
    # value: "eth"          # Any interface starting with "eth"
    # value: "^docker0,lo"  # Exclude docker0 and loopback
    # value: "=ib0"         # Use InfiniBand interface
  # NCCL_NET: force the network transport type
  - name: NCCL_NET
    value: "IB"             # Force InfiniBand (IB | Socket)
    # value: "Socket"       # Force TCP sockets (disable IB/RDMA)
InfiniBand Configuration
env:
  # NCCL_IB_DISABLE: completely disable InfiniBand
  - name: NCCL_IB_DISABLE
    value: "0"    # 0=enable (default), 1=disable IB entirely
    # Set to "1" to force TCP even when IB is available
  # NCCL_IB_HCA: select specific InfiniBand HCA devices
  - name: NCCL_IB_HCA
    value: "=mlx5_0,mlx5_1,mlx5_2,mlx5_3"
    # = prefix: exact device names
    # ^ prefix: exclude devices
    # No prefix: match by prefix (mlx5 matches all mlx5_*)
    # value: "^mlx5_bond0"   # Exclude bonded device
  # NCCL_IB_GID_INDEX: GID table index to use on RoCE
  - name: NCCL_IB_GID_INDEX
    value: "3"    # Often 3 for RoCE v2 (IPv4); verify on your hosts
    # The GID table layout depends on the NIC, driver, and configured addresses.
    # A common layout on mlx5 devices:
    # 0 = RoCE v1 (link-local IPv6)
    # 1 = RoCE v2 (link-local IPv6)
    # 2 = RoCE v1 (IPv4)
    # 3 = RoCE v2 (IPv4), the most common choice
    # Native InfiniBand fabrics normally do not need this variable
  # NCCL_IB_TIMEOUT: IB transport timeout
  - name: NCCL_IB_TIMEOUT
    value: "23"   # Timeout = 4.096 µs × 2^value
    # 14 = ~67 ms (default before NCCL 2.14; later releases use a larger default)
    # 22 = ~17 s
    # 23 = ~34 s (recommended for large clusters)
  # NCCL_IB_RETRY_CNT: IB retry count
  - name: NCCL_IB_RETRY_CNT
    value: "7"    # Max retries (default: 7, max: 7)
  # NCCL_IB_SL: InfiniBand Service Level (QoS)
  - name: NCCL_IB_SL
    value: "0"    # Service Level 0-15 (maps to VL)
  # NCCL_IB_TC: Traffic Class (for DSCP/ECN marking)
  - name: NCCL_IB_TC
    value: "106"  # Traffic class value
    # 106 = DSCP 26 (AF31) plus ECN bits 10; common for GPU traffic with PFC
  # NCCL_IB_QPS_PER_CONNECTION: Queue Pairs per connection
  - name: NCCL_IB_QPS_PER_CONNECTION
    value: "4"    # Default: 1. Higher = more IB bandwidth per peer
  # NCCL_IB_ADAPTIVE_ROUTING: enable IB adaptive routing
  - name: NCCL_IB_ADAPTIVE_ROUTING
    value: "1"    # 0=disable, 1=enable (requires switch support)
  # NCCL_IB_AR_THRESHOLD: adaptive routing message size threshold
  - name: NCCL_IB_AR_THRESHOLD
    value: "8192" # Only use AR for messages > this size (bytes)
GPUDirect RDMA
env:
  # NCCL_NET_GDR_LEVEL: GPUDirect RDMA topology level
  - name: NCCL_NET_GDR_LEVEL
    value: "5"
    # Controls the max topological distance between NIC and GPU for GPUDirect RDMA.
    # Legacy integer values (string names such as PIX, PXB, PHB, SYS are also accepted):
    # 0 = disabled (no GDR)
    # 1 = same PCIe switch (PIX)
    # 2 = across multiple PCIe switches (PXB)
    # 3 = same PCIe host bridge / root complex (PHB)
    # 4 = same NUMA node (NODE)
    # 5 = any distance (SYS); allows cross-NUMA GDR
  # NCCL_NET_GDR_READ: enable GPUDirect RDMA for read operations
  - name: NCCL_NET_GDR_READ
    value: "1"    # 0=disable, 1=enable
    # Allows the NIC to read directly from GPU memory when sending
    # Requires NVIDIA peer memory module (nvidia_peermem)
  # NCCL_P2P_DISABLE: disable PCIe peer-to-peer
  - name: NCCL_P2P_DISABLE
    value: "0"    # 0=enable P2P (default), 1=disable
    # Disable if seeing GPU errors on some PCIe topologies
  # NCCL_P2P_LEVEL: PCIe P2P topology level
  - name: NCCL_P2P_LEVEL
    value: "5"    # Same scale as GDR_LEVEL
    # Controls intra-node GPU-to-GPU PCIe P2P
  # NCCL_SHM_DISABLE: disable shared memory transport
  - name: NCCL_SHM_DISABLE
    value: "0"    # 0=enable (default), 1=disable
    # SHM is used for intra-node traffic when P2P is not available
TCP/Socket Tuning
env:
  # NCCL_SOCKET_NTHREADS: number of threads per socket connection
  - name: NCCL_SOCKET_NTHREADS
    value: "4"        # Default: 1. Range: 1-16
    # More threads = higher TCP bandwidth (at CPU cost)
  # NCCL_NSOCKS_PERTHREAD: sockets per thread
  - name: NCCL_NSOCKS_PERTHREAD
    value: "4"        # Default: 1. Range: 1-16
    # Total sockets = NTHREADS × NSOCKS_PERTHREAD (max 64)
    # 4 × 4 = 16 sockets per peer connection
  # NCCL_BUFFSIZE: communication buffer size
  - name: NCCL_BUFFSIZE
    value: "8388608"  # 8 MiB (default: 4 MiB)
    # Larger = better bandwidth for large messages
    # Uses GPU memory, so don't set too high
  # NCCL_SOCKET_FAMILY: IP version for socket connections
  - name: NCCL_SOCKET_FAMILY
    value: "AF_INET"  # AF_INET (IPv4) or AF_INET6 (IPv6)
Algorithm and Protocol Selection
env:
  # NCCL_ALGO: collective algorithm (usually let NCCL auto-select)
  - name: NCCL_ALGO
    value: "Ring,Tree"        # Comma-separated allowed algorithms
    # Ring: good for large messages, predictable bandwidth
    # Tree: good for small messages, lower latency
    # CollnetDirect: InfiniBand SHARP (requires switch support)
    # CollnetChain: InfiniBand SHARP chained
    # NVLS: NVLink SHARP (H100+ NVSwitch)
    # ⚠️ Usually best to NOT set this (auto-select is optimal)
  # NCCL_PROTO: wire protocol
  - name: NCCL_PROTO
    value: "Simple,LL,LL128"  # Comma-separated allowed protocols
    # LL: Low Latency (8-byte packets, good for <256KB)
    # LL128: Low Latency 128-byte (good for <1MB)
    # Simple: high bandwidth (good for >1MB)
    # ⚠️ Usually best to NOT set this
  # NCCL_MIN_NCHANNELS: minimum communication channels
  - name: NCCL_MIN_NCHANNELS
    value: "4"    # Default varies by GPU
    # More channels = more parallelism = more GPU memory
  # NCCL_MAX_NCHANNELS: maximum communication channels
  - name: NCCL_MAX_NCHANNELS
    value: "32"   # Default varies by GPU
    # H100: default max 32
    # A100: default max 16
  # NCCL_NTHREADS: GPU threads per channel
  - name: NCCL_NTHREADS
    value: "512"  # Default: 512 on recent GPUs. Accepted values: 64, 128, 256, 512
    # Higher = more GPU resources for communication
  # NCCL_CROSS_NIC: allow cross-NIC (non-rail) communication
  - name: NCCL_CROSS_NIC
    value: "0"    # 0=same rail only, 1=cross-NIC allowed, 2=auto (default)
    # Rail-optimized networks: set to 0
    # Full-mesh networks: set to 1 or 2
Topology and Tuning
env:
  # NCCL_TOPO_FILE: custom topology XML file
  - name: NCCL_TOPO_FILE
    value: "/etc/nccl/topo.xml"
    # Override auto-detected topology
    # Useful when running in containers with limited /sys access
  # NCCL_TOPO_DUMP_FILE: dump detected topology to file
  - name: NCCL_TOPO_DUMP_FILE
    value: "/tmp/nccl-topo.xml"
    # Saves detected topology on first run
    # Use as NCCL_TOPO_FILE for subsequent runs (skips detection)
  # NCCL_GRAPH_FILE: communication graph file
  - name: NCCL_GRAPH_FILE
    value: "/etc/nccl/graph.xml"
    # Custom channel/ring configuration
  # NCCL_GRAPH_DUMP_FILE: dump communication graph
  - name: NCCL_GRAPH_DUMP_FILE
    value: "/tmp/nccl-graph.xml"
  # NCCL_COLLNET_ENABLE: enable collective network offload (SHARP)
  - name: NCCL_COLLNET_ENABLE
    value: "0"      # 0=disable (default), 1=enable
    # Requires InfiniBand SHARP support on switches
  # NCCL_LAUNCH_MODE: how NCCL launches its CUDA kernels
  - name: NCCL_LAUNCH_MODE
    value: "GROUP"  # PARALLEL | GROUP
    # GROUP: kernels for all GPUs managed by a process are launched together
Debugging and Logging
env:
  # NCCL_DEBUG: debug output verbosity
  - name: NCCL_DEBUG
    value: "INFO"
    # WARN: warnings only (production)
    # INFO: initialization + transport selection
    # TRACE: all operations (very verbose, impacts performance)
    # VERSION: just print NCCL version
  # NCCL_DEBUG_SUBSYS: filter debug output by subsystem
  - name: NCCL_DEBUG_SUBSYS
    value: "INIT,NET,GRAPH"
    # INIT: initialization
    # NET: network operations
    # GRAPH: topology graph
    # COLL: collectives
    # P2P: peer-to-peer
    # SHM: shared memory
    # NVLS: NVLink SHARP
    # ALL: everything
  # NCCL_DEBUG_FILE: redirect debug output to a file (per rank)
  - name: NCCL_DEBUG_FILE
    value: "/tmp/nccl-debug-%h-%p.log"
    # %h = hostname, %p = PID
    # Useful for multi-GPU debugging without interleaved output
Complete Pod Example
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
  namespace: ml-workloads
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.05-py3
      env:
        # === Network Selection ===
        - name: NCCL_SOCKET_IFNAME
          value: "=eth0"
        - name: NCCL_NET
          value: "IB"
        - name: NCCL_IB_DISABLE
          value: "0"
        # === InfiniBand ===
        - name: NCCL_IB_HCA
          value: "=mlx5_0,mlx5_1,mlx5_2,mlx5_3"
        - name: NCCL_IB_GID_INDEX
          value: "3"
        - name: NCCL_IB_TIMEOUT
          value: "23"
        - name: NCCL_IB_RETRY_CNT
          value: "7"
        - name: NCCL_IB_QPS_PER_CONNECTION
          value: "4"
        # === GPUDirect RDMA ===
        - name: NCCL_NET_GDR_LEVEL
          value: "5"
        - name: NCCL_NET_GDR_READ
          value: "1"
        # === Topology ===
        - name: NCCL_TOPO_FILE
          value: "/etc/nccl/topo.xml"
        - name: NCCL_CROSS_NIC
          value: "0"
        # === Performance ===
        - name: NCCL_BUFFSIZE
          value: "8388608"
        - name: NCCL_MIN_NCHANNELS
          value: "4"
        # === Debugging (remove in production) ===
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_DEBUG_SUBSYS
          value: "INIT,NET"
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/rdma_shared_device_a: "1"
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
        - name: nccl-topo
          mountPath: /etc/nccl
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 64Gi
    - name: nccl-topo
      configMap:
        name: nccl-topology
Quick Reference Table
Variable                   │ Default            │ Values            │ Purpose
───────────────────────────┼────────────────────┼───────────────────┼────────────────────────
NCCL_SOCKET_IFNAME         │ auto               │ =eth0, ^lo        │ Network interface
NCCL_NET                   │ auto               │ IB, Socket        │ Force transport
NCCL_IB_DISABLE            │ 0                  │ 0, 1              │ Disable InfiniBand
NCCL_IB_HCA                │ auto               │ =mlx5_0,...       │ Select IB devices
NCCL_IB_GID_INDEX          │ 0                  │ GID table index   │ RoCE GID index
NCCL_IB_TIMEOUT            │ varies by version  │ 1-31              │ IB timeout exponent
NCCL_IB_RETRY_CNT          │ 7                  │ 0-7               │ IB retries
NCCL_IB_SL                 │ 0                  │ 0-15              │ Service level
NCCL_IB_TC                 │ 0                  │ 0-255             │ Traffic class
NCCL_IB_QPS_PER_CONNECTION │ 1                  │ 1-128             │ QPs per conn
NCCL_IB_ADAPTIVE_ROUTING   │ 1 on IB, 0 on RoCE │ 0, 1              │ Adaptive routing
NCCL_NET_GDR_LEVEL         │ auto               │ 0-5               │ GPUDirect RDMA distance
NCCL_NET_GDR_READ          │ 0                  │ 0, 1              │ GDR read enable
NCCL_P2P_DISABLE           │ 0                  │ 0, 1              │ Disable PCIe P2P
NCCL_P2P_LEVEL             │ auto               │ 0-5               │ P2P topology level
NCCL_SHM_DISABLE           │ 0                  │ 0, 1              │ Disable shared mem
NCCL_SOCKET_NTHREADS       │ 1                  │ 1-16              │ TCP threads
NCCL_NSOCKS_PERTHREAD      │ 1                  │ 1-16              │ Sockets per thread
NCCL_BUFFSIZE              │ 4194304            │ bytes             │ Buffer size
NCCL_ALGO                  │ auto               │ Ring,Tree,...     │ Algorithm
NCCL_PROTO                 │ auto               │ LL,LL128,Simple   │ Protocol
NCCL_MIN_NCHANNELS         │ varies             │ 1-32              │ Min channels
NCCL_MAX_NCHANNELS         │ varies             │ 1-32              │ Max channels
NCCL_NTHREADS              │ 512                │ 64,128,256,512    │ GPU threads/channel
NCCL_CROSS_NIC             │ 2                  │ 0, 1, 2           │ Cross-NIC policy
NCCL_TOPO_FILE             │ none               │ path              │ Topology XML
NCCL_TOPO_DUMP_FILE        │ none               │ path              │ Dump topology
NCCL_COLLNET_ENABLE        │ 0                  │ 0, 1              │ SHARP offload
NCCL_DEBUG                 │ unset (no output)  │ WARN,INFO,TRACE   │ Log level
NCCL_DEBUG_SUBSYS          │ INIT,BOOTSTRAP,ENV │ INIT,NET,...      │ Log filter
NCCL_DEBUG_FILE            │ stdout             │ path (%h,%p)      │ Log file
───────────────────────────┴────────────────────┴───────────────────┴────────────────────────
Common Issues
NCCL_IB_DISABLE=1 but performance is bad
- Cause: Forcing TCP when IB hardware is available
- Fix: Only disable IB if hardware is broken; set NCCL_SOCKET_NTHREADS=8 and NCCL_NSOCKS_PERTHREAD=4 for TCP
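As a quick sanity check on the TCP settings suggested above, the following sketch multiplies the two values and checks them against the limits quoted in this reference (1-16 for each variable, at most 64 sockets in total). The function name `total_sockets` is hypothetical, and the limits are taken from this page rather than read from NCCL itself.

```python
def total_sockets(nthreads: int, nsocks_per_thread: int) -> int:
    """Return NCCL_SOCKET_NTHREADS x NCCL_NSOCKS_PERTHREAD.

    Raises ValueError when a value falls outside the limits quoted in
    this reference (1-16 each, product at most 64).
    """
    for label, value in (
        ("NCCL_SOCKET_NTHREADS", nthreads),
        ("NCCL_NSOCKS_PERTHREAD", nsocks_per_thread),
    ):
        if not 1 <= value <= 16:
            raise ValueError(f"{label}={value} is outside the 1-16 range")
    total = nthreads * nsocks_per_thread
    if total > 64:
        raise ValueError(f"{total} sockets per peer exceeds the limit of 64")
    return total


if __name__ == "__main__":
    # The fix suggested above: 8 threads with 4 sockets each.
    print(total_sockets(8, 4))
```

With the suggested values the product is 32, which stays under the 64-socket ceiling; a combination such as 16 and 16 would be rejected.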
"Invalid argument" on modprobe nvidia_peermem
- Cause: Driver version mismatch between nvidia.ko and nvidia_peermem.ko
- Fix: Ensure GPU Operator installs matching driver + peermem versions; check dmesg for details
NCCL_NET_GDR_LEVEL set but GDR not active
- Cause: nvidia_peermem module not loaded, or NIC not RDMA-capable
- Fix: Verify lsmod | grep nvidia_peermem; check that ibv_devinfo shows an active port
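The `lsmod` check above can also be done from inside a training script. The sketch below is a minimal illustration that parses the text of `/proc/modules` (the file `lsmod` itself reads on Linux); the helper name `module_loaded` is hypothetical, and it assumes `/proc/modules` is readable from within the container.

```python
from pathlib import Path


def module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is listed as a module in /proc/modules content.

    Each line of /proc/modules starts with the module name, followed by
    its size, use count, and other fields.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


if __name__ == "__main__":
    path = Path("/proc/modules")
    # Fall back to an empty listing on systems without /proc/modules.
    text = path.read_text() if path.exists() else ""
    print("nvidia_peermem loaded:", module_loaded("nvidia_peermem", text))
```

A `False` result only says the module is missing; it does not confirm that GPUDirect RDMA is working when the module is present, so the NCCL_DEBUG=INFO output remains the authoritative check.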
NCCL_SOCKET_IFNAME wrong interface selected
- Cause: Multiple interfaces match prefix pattern
- Fix: Use the = prefix for an exact match: NCCL_SOCKET_IFNAME==eth0 (that is, the value "=eth0")
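To see why a bare prefix can match more interfaces than intended, the sketch below models the matching rules described in this reference: no prefix means "starts with", `=` means exact match, and `^` inverts the selection. It is a simplified illustration only, not NCCL's own code; `select_interfaces` is a hypothetical helper, and it does not model which of several matching interfaces NCCL finally picks.

```python
from __future__ import annotations


def select_interfaces(spec: str, interfaces: list[str]) -> list[str]:
    """Filter interface names with an NCCL_SOCKET_IFNAME-style spec.

    Simplified model: "eth" matches by prefix, "=eth0" matches exactly,
    and a leading "^" excludes the matching interfaces instead.
    """
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def matches(name: str) -> bool:
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep matching names normally, non-matching names when excluding.
    return [name for name in interfaces if matches(name) != exclude]


if __name__ == "__main__":
    names = ["lo", "eth0", "eth1", "docker0"]
    for spec in ("eth", "=eth0", "^docker0,lo"):
        print(spec, "->", select_interfaces(spec, names))
```

With the prefix form `eth`, both `eth0` and `eth1` remain candidates, which is the ambiguity that the `=` prefix removes.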
High latency despite IB being enabled
- Cause: NCCL_IB_GID_INDEX wrong for RoCE setup (using IB native on RoCE fabric)
- Fix: Set the GID index to the RoCE v2 (IPv4) entry, usually 3; verify with ibv_devinfo -d mlx5_0 -v | grep GID
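When reading the GID list from the command above, it helps to tell IPv4-based entries from link-local ones. The sketch below is a minimal helper, assuming each GID is given as a colon-separated IPv6-style string; `describe_gid` is a hypothetical name. It cannot tell RoCE v1 from RoCE v2, because the RoCE version is not encoded in the address and still has to be read from the device's GID table.

```python
import ipaddress


def describe_gid(gid: str) -> str:
    """Classify a GID string by the kind of address it carries."""
    addr = ipaddress.IPv6Address(gid)
    if addr.is_unspecified:
        return "empty"
    if addr.ipv4_mapped is not None:
        # IPv4-mapped form (::ffff:a.b.c.d), used for IPv4-based RoCE GIDs.
        return f"IPv4 {addr.ipv4_mapped}"
    if addr.is_link_local:
        return "link-local IPv6"
    return "IPv6"


if __name__ == "__main__":
    for gid in (
        "fe80:0000:0000:0000:0ac0:ebff:fe12:3456",
        "0000:0000:0000:0000:0000:ffff:c0a8:0101",
    ):
        print(gid, "->", describe_gid(gid))
```

For a RoCE v2 over IPv4 setup, the index passed to NCCL_IB_GID_INDEX should point at an entry that this helper reports as IPv4 and that the GID table lists as RoCE v2.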
Best Practices
- Don't set NCCL_ALGO/NCCL_PROTO: auto-selection is the right choice in most cases
- Always set NCCL_SOCKET_IFNAME: Kubernetes pods may have multiple interfaces
- Use NCCL_TOPO_FILE in containers: avoids 10-30s topology detection on every start
- Set NCCL_DEBUG=INFO for initial runs: verify transport selection, then reduce to WARN
- NCCL_IB_TIMEOUT=23 for large clusters: prevents spurious timeout failures
- NCCL_CROSS_NIC=0 for rail-optimized networks: avoids suboptimal cross-switch paths
- Match NCCL_IB_HCA to GPU affinity: ensure each GPU uses its nearest NIC
- NCCL_BUFFSIZE=8388608 for large models: improves bandwidth for multi-GB transfers
- Use NCCL_DEBUG_FILE in multi-GPU jobs: prevents interleaved log output
- Test changes with nccl-tests: measure all_reduce_perf before and after tuning
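The NCCL_IB_TIMEOUT recommendation above is easier to judge in seconds. This reference gives the timeout as 4.096 µs × 2^value, and the sketch below simply evaluates that formula; the function name `ib_timeout_seconds` is hypothetical.

```python
def ib_timeout_seconds(value: int) -> float:
    """Convert an NCCL_IB_TIMEOUT exponent to seconds: 4.096 us * 2**value."""
    return 4.096e-6 * (2 ** value)


if __name__ == "__main__":
    for value in (14, 22, 23):
        print(f"NCCL_IB_TIMEOUT={value}: {ib_timeout_seconds(value):.3f} s")
```

Each step of the exponent doubles the timeout, so moving from 22 to 23 raises it from roughly 17 s to roughly 34 s.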
Key Takeaways
- NCCL environment variables control all aspects of GPU collective communication
- NCCL_IB_DISABLE=1 forces TCP, which is 5-10x slower than IB/RDMA (use only for debugging)
- NCCL_NET_GDR_LEVEL=5 + NCCL_NET_GDR_READ=1 enables GPUDirect RDMA at any PCIe distance
- NCCL_IB_GID_INDEX=3 is the usual setting for RoCE v2 (IPv4); a wrong index causes connection failures
- TCP tuning: NCCL_SOCKET_NTHREADS × NCCL_NSOCKS_PERTHREAD = total sockets (max 64)
- NCCL_TOPO_FILE eliminates topology detection overhead in containers
- NCCL_DEBUG=INFO + NCCL_DEBUG_SUBSYS=INIT,NET shows transport selection without noise
- Don't manually set algorithms/protocols unless benchmarking proves improvement
- All variables are set via the Pod env section; no config files needed
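Because every setting ends up as a name/value pair in the Pod `env` list, the list can be generated from a plain dictionary. The sketch below is one possible way to do that; `to_pod_env` is a hypothetical helper. It converts every value to a string, since Kubernetes rejects unquoted numbers in `env` values, and prints JSON, which is also valid YAML.

```python
import json


def to_pod_env(settings: dict) -> list:
    """Turn {name: value} into the Kubernetes env list form.

    Values are converted with str() because env values must be strings.
    """
    return [{"name": name, "value": str(value)} for name, value in settings.items()]


if __name__ == "__main__":
    env = to_pod_env(
        {
            "NCCL_SOCKET_IFNAME": "=eth0",
            "NCCL_IB_GID_INDEX": 3,
            "NCCL_DEBUG": "INFO",
        }
    )
    # JSON output can be pasted into a manifest as-is.
    print(json.dumps({"env": env}, indent=2))
```

Keeping the settings in one dictionary also makes it simple to drop the debugging variables for production runs.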

