NCCL Debug Subsystems for GPU Network Troubleshooting
Configure NCCL_DEBUG and NCCL_DEBUG_SUBSYS for targeted logging during multi-node GPU training. Covers INIT, NET, GRAPH subsystems, log
💡 Quick Answer: Set NCCL_DEBUG=INFO with NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH for targeted debugging without flooding logs. INIT shows device discovery and topology, NET shows network transport selection and RDMA setup, GRAPH shows channel/ring topology decisions. For production, use NCCL_DEBUG=WARN to minimize log volume.
The Problem
- NCCL full debug (NCCL_DEBUG=TRACE) produces gigabytes of logs
- Need targeted subsystem logging to diagnose specific issues
- Must understand what each subsystem reveals for efficient troubleshooting
- Production workloads need minimal logging overhead
- Multi-rank jobs produce interleaved logs that are hard to parse
The Solution
Debug Levels
# NCCL_DEBUG controls overall verbosity:
export NCCL_DEBUG=WARN # Production: only warnings and errors
export NCCL_DEBUG=INFO # Debugging: initialization and key events
export NCCL_DEBUG=TRACE # Deep debug: every operation (very verbose)
Debug Subsystems
# NCCL_DEBUG_SUBSYS filters which components log at the selected level:
export NCCL_DEBUG_SUBSYS="INIT,NET,GRAPH"
# Available subsystems:
# INIT  - Device discovery, topology detection, version info
# NET   - Network transport selection (IB, socket, RDMA)
# GRAPH - Channel topology, ring/tree algorithm decisions
# COLL  - Collective operation scheduling
# P2P   - Peer-to-peer GPU communication
# SHM   - Shared memory transport
# NVLS  - NVLink SHARP (NVSwitch multicast)
# REG   - Memory registration (DMA-BUF, peermem)
# PROXY - Network proxy thread operations
# ALL   - Every subsystem (when NCCL_DEBUG_SUBSYS is unset, NCCL logs only its default subset: INIT,BOOTSTRAP,ENV)
Recommended Configurations
# === Validation/Benchmarking ===
# See device selection, network setup, and topology
env:
  - name: NCCL_DEBUG
    value: "INFO"
  - name: NCCL_DEBUG_SUBSYS
    value: "INIT,NET,GRAPH"
# === Network Debugging ===
# Focus on RDMA connections and transport
env:
  - name: NCCL_DEBUG
    value: "INFO"
  - name: NCCL_DEBUG_SUBSYS
    value: "NET,PROXY"
# === Topology Issues ===
# Channel algorithm and ring formation
env:
  - name: NCCL_DEBUG
    value: "INFO"
  - name: NCCL_DEBUG_SUBSYS
    value: "GRAPH,INIT"
# === Production ===
# Minimal logging, only errors
env:
  - name: NCCL_DEBUG
    value: "WARN"
# No NCCL_DEBUG_SUBSYS needed at WARN level
# === Full Trace (last resort) ===
# WARNING: Produces GB of logs, slows execution
env:
  - name: NCCL_DEBUG
    value: "TRACE"
  - name: NCCL_DEBUG_SUBSYS
    value: "ALL"
Interpreting Key Log Lines
# === INIT subsystem ===
# Version and library info:
NCCL INFO nccl-tests version 2.18.3 nccl-headers=22808 nccl-library=22808
# Confirms the NCCL headers nccl-tests was built against match the loaded NCCL library (22808 = NCCL 2.28.8)
# Device discovery:
NCCL INFO Using devices: Rank 0 Group 0 Pid 95 on worker-0 device 0 [0000:42:00] NVIDIA H200 NVL
# Shows: rank assignment, node, PCIe bus ID, GPU model
# CUDA driver:
NCCL INFO cudaDriverVersion 13000
# CUDA driver version (13000 = CUDA 13.0)
# === NET subsystem ===
# Transport selection:
NCCL INFO Channel 0/0 : 0[0] -> 2[0] [send] via NET/IB/0/GDRDMA
# Channel/subchannel : src_rank[gpu] -> dst_rank[gpu] direction transport
# NET/IB/0/GDRDMA = InfiniBand device 0 with GPUDirect RDMA (good)
# NET/IB/0 = InfiniBand device 0 without GDRDMA (CPU bounce)
# NET/Socket/0 = TCP socket fallback (bad performance)
# GPUDirect RDMA enabled:
NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 9 <= 9), read 1 mode Default
# distance 9 <= 9: the GPU-to-NIC path distance (9) is within the allowed GDR level (9, here SYS), so this pair may use GPUDirect RDMA
# IB device setup:
NCCL INFO NET/IB: Dev 0 IBDev 0 Port 1 qpn 364 mtu 5 GID 3
# Dev=network device, IBDev=IB device, qpn=queue pair number, mtu 5=4096 B, GID 3=GID table index 3 (often RoCEv2 IPv4, but the GID table is NIC-specific)
# Socket interface:
NCCL INFO NCCL_SOCKET_IFNAME set to net1
# Confirms NCCL is using the correct SR-IOV interface
# Plugin search:
NCCL INFO ENV/Plugin: Could not find: libnccl-env.so
# Informational only: no env plugin loaded (normal)
# === GRAPH subsystem ===
# Topology search:
NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
# Shows tree algorithm channels and connections
# Ring formation:
NCCL INFO Ring 0: 0->1->2->3->0
# Shows rank ordering in the allreduce ring
# Connection status:
NCCL INFO Connected all trees
NCCL INFO Connected all rings
# Confirms all channels established successfully
Per-Rank Log Filtering
# NCCL log lines are prefixed with host, PID, thread ID, and CUDA device index:
# nccl-roce-validation-worker-0:95:95 [0] NCCL INFO ...
# ^host                         ^pid   ^CUDA device index (local to the node, not the global rank)
# The bracketed number repeats on every node, so match the host as well to isolate one process:
kubectl logs nccl-validation-launcher | grep "worker-0:.*\[0\] NCCL" > worker0_gpu0.log
kubectl logs nccl-validation-launcher | grep "worker-1:.*\[0\] NCCL" > worker1_gpu0.log
# Filter for specific subsystem in logs:
grep "NET/IB" worker0_gpu0.log # Network device setup
grep "GPU Direct" worker0_gpu0.log # GDRDMA status
grep "Channel" worker0_gpu0.log # Transport selection per channel
grep "Connected" worker0_gpu0.log # Ring/tree establishment
Log Volume Estimates
Configuration                           │ Log Size (4 ranks, 1G-16G test)
────────────────────────────────────────┼─────────────────────────────────
NCCL_DEBUG=WARN                         │ ~1 KB (errors only)
NCCL_DEBUG=INFO, SUBSYS=INIT            │ ~50 KB
NCCL_DEBUG=INFO, SUBSYS=NET             │ ~200 KB
NCCL_DEBUG=INFO, SUBSYS=INIT,NET,GRAPH  │ ~500 KB
NCCL_DEBUG=INFO (no subsys filter)      │ ~2 MB
NCCL_DEBUG=TRACE, SUBSYS=ALL            │ ~500 MB - 2 GB
────────────────────────────────────────┴─────────────────────────────────
Common Issues
Logs show "via NET/Socket" instead of "via NET/IB"
- Cause: IB disabled or plugin not loaded
- Fix: Check NCCL_IB_DISABLE=0, remove NCCL_NET_PLUGIN=none
No "GPU Direct RDMA Enabled" message
- Cause: GDR level too restrictive or DMABUF not enabled
- Fix: Set NCCL_NET_GDR_LEVEL=SYS and NCCL_DMABUF_ENABLE=1
"Connected all trees/rings" never appears
- Cause: NCCL hanging on connection setup, likely a DNS or network issue
- Fix: Check MPI hostfile DNS resolution and inter-pod connectivity
Logs truncated (pod OOM)
- Cause: TRACE logging consuming too much memory for log buffers
- Fix: Use targeted SUBSYS instead of ALL; increase pod memory
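The first three symptoms above can be checked mechanically on a saved log. The following is a minimal sketch, assuming the log was captured with NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH so that the relevant lines are present at all. nccl_log_check is a hypothetical helper name, and the patterns are simply the log strings quoted in this article, not an official NCCL interface.

```shell
# Sketch: scan one NCCL log file for the symptoms listed under Common Issues.
# nccl_log_check is a hypothetical helper; patterns come from the log lines
# quoted in this article and assume INFO-level logging with INIT,NET,GRAPH.
nccl_log_check() {
  local log="$1" status=0

  # Symptom 1: at least one channel fell back to TCP sockets.
  if grep -q 'via NET/Socket' "$log"; then
    echo "WARN: socket fallback detected (check NCCL_IB_DISABLE, NCCL_NET_PLUGIN)"
    status=1
  fi

  # Symptom 2: no GPUDirect RDMA confirmation was logged.
  if ! grep -q 'GPU Direct RDMA Enabled' "$log"; then
    echo "WARN: no GPU Direct RDMA message (check NCCL_NET_GDR_LEVEL, NCCL_DMABUF_ENABLE)"
    status=1
  fi

  # Symptom 3: ring or tree connection setup never completed.
  if ! grep -q 'Connected all rings' "$log" || ! grep -q 'Connected all trees' "$log"; then
    echo "WARN: ring/tree setup incomplete (check DNS and inter-pod connectivity)"
    status=1
  fi

  if [ "$status" -eq 0 ]; then
    echo "OK: no known symptoms found"
  fi
  return "$status"
}
```

The function returns a non-zero status when any symptom is found, so it can gate a validation step. A clean result only means none of these three patterns matched; it does not prove the fabric is healthy.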
Best Practices
- Use INFO + subsystem filter for targeted debugging, never blind TRACE
- INIT,NET,GRAPH covers most debugging scenarios
- Save logs before pod cleanup: cleanPodPolicy: None helps
- Filter by rank for multi-node debugging, since each rank's perspective differs
- Switch to WARN in production, since INFO adds latency at scale
- Redirect to file for large tests: NCCL_DEBUG_FILE=/tmp/nccl_%h_%p.log (%h expands to the hostname, %p to the PID)
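When a run was captured as one interleaved stream instead of per-process files, the log can still be split afterwards. This is a minimal sketch, assuming NCCL's default line prefix (host:pid:tid followed by a bracketed number, which in that format is the local CUDA device index) and no extra per-line prefix added by the launcher. nccl_split_log and the output file naming are hypothetical.

```shell
# Sketch: split an interleaved NCCL log into one file per process.
# nccl_split_log is a hypothetical helper; it assumes lines that look like
#   host:pid:tid [dev] NCCL INFO ...
# and silently skips anything else (banners, benchmark tables).
nccl_split_log() {
  local log="$1" outdir="$2"
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    $3 == "NCCL" && $2 ~ /^\[[0-9]+\]$/ {
      split($1, id, ":")                    # id[1]=host, id[2]=pid, id[3]=tid
      dev = substr($2, 2, length($2) - 2)   # strip the surrounding brackets
      file = dir "/" id[1] "_pid" id[2] "_gpu" dev ".log"
      print > file
    }
  ' "$log"
}
```

For example, `nccl_split_log all.log split/` would write files such as split/worker-0_pid95_gpu0.log, which can then be searched with the same grep patterns used for a single rank.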
Key Takeaways
- NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH is the recommended validation configuration
- INIT: device discovery + topology | NET: transport + RDMA | GRAPH: channels + rings
- Look for "via NET/IB/0/GDRDMA" (good) vs "via NET/Socket" (bad)
- "Could not find: libnccl-env.so" is informational, not an error
- Per-rank log filtering is essential for multi-node debugging
- Production: NCCL_DEBUG=WARN only; debugging: INFO with subsystem filter
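For jobs launched from a shell rather than a pod spec, the same configurations can be kept in one place. A minimal sketch follows; nccl_debug_profile and the profile names (validation, network, topology, production) are hypothetical labels for the settings recommended in this article.

```shell
# Sketch: select one of the article's recommended logging configurations.
# nccl_debug_profile and the profile names are hypothetical; the variable
# values are the ones listed under Recommended Configurations.
nccl_debug_profile() {
  case "$1" in
    validation) export NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH ;;
    network)    export NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=NET,PROXY ;;
    topology)   export NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT ;;
    production) export NCCL_DEBUG=WARN; unset NCCL_DEBUG_SUBSYS ;;
    *) echo "unknown profile: $1" >&2; return 1 ;;
  esac
}
```

Call it in the launching shell before starting the job, for example `nccl_debug_profile validation`. The variables still have to reach every rank, which depends on the launcher in use.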

