NCCL Channel Routing and Transport Path Analysis
Interpret NCCL channel logs to understand GPU communication paths on Kubernetes. Decode the P2P/CUMEM, SHM/direct, and NET/IB/GDRDMA transports.
💡 Quick Answer: NCCL INFO channel logs show exactly how each GPU communicates with its peers.
P2P/CUMEM = direct GPU peer-to-peer over NVLink (fastest, intra-node), SHM/direct/direct = shared host memory (cross-NVLink-group on the same node), NET/IB/X(Y)/GDRDMA = network RDMA with GPUDirect (inter-node). Each channel line maps a source rank → destination rank and names the transport used. Verify that all inter-node channels carry the /GDRDMA suffix for optimal performance.
The Problem
- Need to verify NCCL is using the optimal transport path for each GPU pair
- Can't tell if GPUDirect RDMA is actually engaged or falling back to host staging
- Distributed training is slow; need to identify which GPU-to-GPU links are bottlenecked
- Unknown which NICs are handling inter-node traffic for each GPU rank
- Proxy threads may be misaligned with GPU NUMA nodes
The Solution
Channel Log Format
NCCL INFO Channel <chan_id>/<subchan> : <src_rank>[<local_dev>] -> <dst_rank>[<local_dev>] [<direction>] via <transport>
Intra-Node Transports
# P2P/CUMEM: direct GPU peer-to-peer access (over NVLink here), with buffers allocated through the CUDA cuMem* virtual memory API (not Unified Memory)
Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
...
Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
├── Rank 2 (GPU 2) → Rank 1 (GPU 1) on same node
├── P2P/CUMEM = NVLink direct GPU memory access
├── 10 channels (00-09) = parallel communication paths
└── This is the FASTEST intra-node transport
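The channel line format described under "Channel Log Format" lends itself to a regular expression. The following is a minimal sketch of one way to extract the channel id, ranks, direction, and transport from a log line in Python. `ChannelLine` and `parse_channel_line` are illustrative names, not part of NCCL, and the pattern is derived only from the sample lines in this recipe, so other NCCL versions may need adjustments.

```python
import re
from typing import NamedTuple, Optional

# Pattern built from the sample lines in this recipe (an assumption, not an
# official NCCL grammar). The sub-channel and direction fields are optional
# because some sample lines omit them.
CHANNEL_RE = re.compile(
    r"Channel (?P<chan>\d+)(?:/(?P<sub>\d+))? : "
    r"(?P<src>\d+)\[(?P<src_dev>\w+)\] -> "
    r"(?P<dst>\d+)\[(?P<dst_dev>\w+)\]"
    r"(?: \[(?P<direction>\w+)\])? via (?P<transport>\S+)"
)


class ChannelLine(NamedTuple):
    channel: int
    subchannel: Optional[int]
    src_rank: int
    dst_rank: int
    direction: Optional[str]
    transport: str


def parse_channel_line(line: str) -> Optional[ChannelLine]:
    """Return the parsed fields of a channel line, or None if it is not one."""
    m = CHANNEL_RE.search(line)
    if m is None:
        return None
    sub = m["sub"]
    return ChannelLine(
        channel=int(m["chan"]),
        subchannel=int(sub) if sub is not None else None,
        src_rank=int(m["src"]),
        dst_rank=int(m["dst"]),
        direction=m["direction"],
        transport=m["transport"],
    )


if __name__ == "__main__":
    print(parse_channel_line("NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM"))
```

Lines that are not channel lines return None, so the function can be applied to every line of a captured log and the None results filtered out.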
# SHM/direct/direct: shared host memory (NVLink not available between these GPUs)
Channel 07 : 4[4] -> 2[2] via SHM/direct/direct
Channel 08 : 4[4] -> 2[2] via SHM/direct/direct
├── GPU 4 → GPU 2 = cross-NVLink-group (different NVL4 domains)
├── SHM = traffic goes GPU → host memory → GPU (PCIe path)
├── Slower than P2P/CUMEM but still intra-node
└── "direct/direct" means both sides use direct GPU memory access to SHM
Inter-Node Transports
# NET/IB/X(Y)/GDRDMA: network with GPUDirect RDMA
Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/2(3)/GDRDMA
Channel 01/0 : 0[0] -> 8[0] [send] via NET/IB/2(3)/GDRDMA
Channel 02/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
Channel 03/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
Channel 04/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
Channel 05/0 : 0[0] -> 8[0] [send] via NET/IB/2(3)/GDRDMA
Channel 06/0 : 0[0] -> 8[0] [send] via NET/IB/2(3)/GDRDMA
Channel 07/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
Channel 08/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
Channel 09/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
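Breakdown:
0[0] → 8[0]: Rank 0 (local GPU 0) → Rank 8 (remote GPU 0 on node 2)
[send]: This is the send direction
NET/IB: Using InfiniBand/RoCE network transport
/2(3): Using NIC index 2, with rank 3 acting as proxy for the send (PXN: data first hops over NVLink to the GPU local to that NIC); the number in parentheses is a rank, not a port
/0: Using NIC index 0, driven by the sending rank itself (no proxy rank shown)
/GDRDMA: GPUDirect RDMA ACTIVE: GPU memory → NIC → wire directly
(no CPU staging = optimal)
Transport WITHOUT GDRDMA (Degraded)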
# If you see this, GDRDMA is NOT working:
Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0
├── No /GDRDMA suffix = data goes GPU → CPU memory → NIC (extra copy!)
└── Expect 30-50% bandwidth loss vs GDRDMA
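The degraded pattern above can be detected mechanically. The sketch below decodes the NET portion of a channel line (plugin, NIC index, optional proxy rank, GDRDMA flag) and lists the lines that lack the suffix. `NET_RE` and `channels_without_gdrdma` are hypothetical helper names, and the pattern assumes the `NET/<plugin>/<dev>[(<proxy>)][/GDRDMA]` shape seen in this recipe's samples.

```python
import re
from typing import Iterable, List

# Assumed shape, taken from the sample lines in this recipe:
#   via NET/<plugin>/<dev>            e.g. NET/IB/0
#   via NET/<plugin>/<dev>(<proxy>)   e.g. NET/IB/2(3)
# optionally followed by /GDRDMA when GPUDirect RDMA is in use.
NET_RE = re.compile(
    r"via NET/(?P<plugin>\w+)/(?P<dev>\d+)"
    r"(?:\((?P<proxy>\d+)\))?(?P<gdr>/GDRDMA)?"
)


def channels_without_gdrdma(lines: Iterable[str]) -> List[str]:
    """Return the network channel lines that do not carry the /GDRDMA suffix."""
    degraded = []
    for line in lines:
        m = NET_RE.search(line)
        if m is not None and m["gdr"] is None:
            degraded.append(line.strip())
    return degraded


if __name__ == "__main__":
    sample = [
        "Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/2(3)/GDRDMA",
        "Channel 01/0 : 0[0] -> 8[0] [send] via NET/IB/0",
        "Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM",
    ]
    for line in channels_without_gdrdma(sample):
        print("no GPUDirect RDMA:", line)
```

Intra-node lines (P2P, SHM) never match the pattern, so only network channels are reported.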
Causes:
- nvidia-peermem module not loaded
- DMA-BUF not available for this GPU
- NCCL_NET_GDR_LEVEL set to a restrictive value such as 0 or LOC (when unset, NCCL chooses based on the detected topology)
- NIC not PIX-local to GPU (falls back if topology too distant)
NIC Distribution Across Channels
From the logs, NCCL distributes channels across available NICs:
Channels 00,01,05,06 → NET/IB/2(3) (NIC index 2, via proxy rank 3)
Channels 02,03,04,07,08,09 → NET/IB/0 (NIC index 0)
Ideal: even distribution across all NICs for maximum aggregate bandwidth.
If one NIC handles too many channels → bottleneck on that NIC.
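To make the balance check concrete, the sketch below counts how many channels land on each NIC index and reports a simple imbalance ratio. `nic_channel_counts` and `imbalance_ratio` are illustrative names, and the ratio (busiest NIC divided by least busy NIC) is just one possible metric, not something NCCL reports.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# NIC index is the first number after NET/<plugin>/ in the sample lines.
NIC_RE = re.compile(r"via NET/\w+/(\d+)")


def nic_channel_counts(lines: Iterable[str]) -> Counter:
    """Count network channels per NIC index."""
    counts = Counter()
    for line in lines:
        m = NIC_RE.search(line)
        if m is not None:
            counts[int(m.group(1))] += 1
    return counts


def imbalance_ratio(counts: Counter) -> Optional[float]:
    """Busiest NIC count divided by least busy NIC count (1.0 = balanced).

    NICs that carry no channels never appear in the log, so they are not
    reflected here; compare the keys against the NICs you expect to see.
    """
    if not counts:
        return None
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    # Rebuild the ten sample lines from this recipe.
    lines = [f"Channel {c:02d}/0 : 0[0] -> 8[0] [send] via NET/IB/2(3)/GDRDMA"
             for c in (0, 1, 5, 6)]
    lines += [f"Channel {c:02d}/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA"
              for c in (2, 3, 4, 7, 8, 9)]
    counts = nic_channel_counts(lines)
    print(dict(counts), imbalance_ratio(counts))
```

For the ten sample channels this gives four channels on NIC 2 and six on NIC 0, and only two of the four NICs listed in NCCL_IB_HCA appear at all for this rank.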
Fix: ensure NCCL_IB_HCA lists all available NICs:
NCCL_IB_HCA=mlx5_0,mlx5_3,mlx5_5,mlx5_6
Proxy Progress Thread
NCCL INFO [Proxy Progress] Device 3 CPU core 289
├── Network proxy thread for GPU 3 is pinned to CPU core 289
├── Should be on the same NUMA node as GPU 3
└── If on wrong NUMA: increased latency for network operations
Verify NUMA alignment:
GPU 3 on NUMA 0 → core 289 should be on NUMA 0
Check: ls -d /sys/devices/system/cpu/cpu289/node*
(topology/physical_package_id reports the socket, which differs from the NUMA node when a socket exposes several NUMA domains)
Network Plugin Assignment
NCCL INFO Assigned NET plugin IB to comm
└── IB (InfiniBand) network plugin handles data transfer
NCCL INFO Assigned GIN plugin GIN_IB_GDAKT to comm
├── GIN = GPU-Initiated Networking
└── GIN_IB_GDAKT = GPU initiates DMA transfer directly via IB
NCCL INFO Assigned RMA plugin GIN_IB_PROXY to comm
├── RMA = Remote Memory Access
└── GIN_IB_PROXY = GPU-initiated with proxy assistance for complex operations
NCCL INFO Using network IB
└── Confirms IB/RoCE is the active network stack
Full Channel Map Visualization
2-Node, 8 GPUs/node, 16 total ranks:
Node 1 (Ranks 0-7):              Node 2 (Ranks 8-15):
┌───────────────────────────┐    ┌─────────────────────────────┐
│ GPU0(R0) ──NVL── GPU1(R1) │    │ GPU0(R8)  ──NVL── GPU1(R9)  │
│ GPU2(R2) ──NVL── GPU3(R3) │    │ GPU2(R10) ──NVL── GPU3(R11) │
│       NVL4 Group 0        │    │        NVL4 Group 0         │
│ GPU4(R4) ──NVL── GPU5(R5) │    │ GPU4(R12) ──NVL── GPU5(R13) │
│ GPU6(R6) ──NVL── GPU7(R7) │    │ GPU6(R14) ──NVL── GPU7(R15) │
│       NVL4 Group 1        │    │        NVL4 Group 1         │
└─────────────┬─────────────┘    └──────────────┬──────────────┘
              │                                 │
     NIC0,NIC1,NIC2,NIC3               NIC0,NIC1,NIC2,NIC3
              │          RoCE/IB RDMA           │
              └─────────────────────────────────┘
Intra-node paths (P2P/CUMEM):
R0↔R1, R0↔R2, R0↔R3 (same NVL4)
R4↔R5, R4↔R6, R4↔R7 (same NVL4)
Cross-NVL4 paths (SHM/direct):
R0↔R4, R1↔R5, R2↔R6, R3↔R7 (different NVL4 groups, same node)
Inter-node paths (NET/IB/GDRDMA):
R0↔R8, R1↔R9, ... (all cross-node pairs)
Troubleshooting Channel Output
# Enable verbose channel info
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH
# In logs, look for:
# 1. All inter-node channels should show /GDRDMA
grep "via NET" nccl.log | grep -v GDRDMA
# If any results → those channels lack GPUDirect RDMA
# 2. NIC distribution should be balanced
grep "via NET" nccl.log | grep -oP 'NET/IB/\d+' | sort | uniq -c
# 5 NET/IB/0
# 5 NET/IB/2
# Balanced = good. All on one NIC = bottleneck.
# 3. Verify all GPUs use P2P for intra-node
grep "via P2P" nccl.log | wc -l
# Should be (ranks_per_node - 1) × channels × 2 (send+recv)
Common Issues
Some channels show NET/IB without /GDRDMA
- Cause: nvidia-peermem not loaded; or NIC too far from GPU (SYS topology)
- Fix: modprobe nvidia-peermem; verify with cat /sys/module/nvidia_peermem/version; use PIX-local NICs
All channels use same NIC (unbalanced)
- Cause: NCCL_IB_HCA not set or lists only one NIC
- Fix: Set NCCL_IB_HCA=mlx5_0,mlx5_3,mlx5_5,mlx5_6 (all fabric NICs)
SHM/direct paths where P2P/CUMEM expected
- Cause: GPUs not in same NVLink group; or CUDA P2P access not enabled
- Fix: Check nvidia-smi topo -m. SHM is correct for cross-NVL4 GPUs; P2P/CUMEM appears only between NVLink-connected GPUs
Proxy thread on wrong NUMA node
- Cause: Default proxy thread placement doesn't consider GPU locality
- Fix: Set NCCL_PROXY_AFFINITY=1 (NCCL 2.21+; check that your NCCL release documents this variable); or pin manually with taskset
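For the proxy-thread case, NUMA alignment can be checked from Linux sysfs. The sketch below is one possible approach, assuming the standard layout where each CPU directory contains a `node<N>` entry and each PCI device exposes a `numa_node` file. `cpu_numa_node`, `gpu_numa_node`, and the `sysfs_root` parameter are hypothetical names introduced here (the parameter exists only to make the functions testable), and the PCI address in the example is a placeholder.

```python
import os
import re
from typing import Optional


def cpu_numa_node(cpu: int, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the NUMA node of a CPU core, or None if it cannot be determined."""
    cpu_dir = os.path.join(sysfs_root, "devices/system/cpu", f"cpu{cpu}")
    try:
        entries = os.listdir(cpu_dir)
    except FileNotFoundError:
        return None
    for name in entries:
        m = re.fullmatch(r"node(\d+)", name)
        if m is not None:
            return int(m.group(1))
    return None


def gpu_numa_node(pci_bdf: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the NUMA node of a PCI device (e.g. a GPU), or None if unknown.

    pci_bdf is the address as it appears under /sys/bus/pci/devices,
    for example "0000:3b:00.0" (placeholder value).
    """
    path = os.path.join(sysfs_root, "bus/pci/devices", pci_bdf, "numa_node")
    try:
        with open(path) as f:
            node = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    # The kernel reports -1 when no NUMA affinity is known.
    return node if node >= 0 else None


if __name__ == "__main__":
    core, gpu = 289, "0000:3b:00.0"  # core from the log line; GPU address is a placeholder
    print("core NUMA:", cpu_numa_node(core), "GPU NUMA:", gpu_numa_node(gpu))
```

If the two values differ, the proxy thread reported in the `[Proxy Progress]` line is running on a core that is remote to the GPU.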
Best Practices
- Every inter-node channel should show /GDRDMA: if not, fix nvidia-peermem
- Balance NIC usage across channels: set NCCL_IB_HCA with all fabric NICs
- P2P/CUMEM within NVL group, SHM across groups: this is correct behavior
- Pin proxy threads to GPU-local NUMA: reduces network operation latency
- Use NCCL_DEBUG=INFO to capture channel map at initialization
- More channels = more parallelism: raise NCCL_MIN_NCHANNELS (and NCCL_MAX_NCHANNELS if it caps the count) when NICs are underutilized
- Monitor per-NIC bandwidth: ensure no single NIC is saturated
Key Takeaways
- NCCL channel logs reveal exact transport path between every GPU pair
- P2P/CUMEM: NVLink direct, fastest intra-node (same NVL group)
- SHM/direct/direct: host memory relay, cross-NVL4 groups on same node
- NET/IB/X(Y)/GDRDMA: network RDMA with GPUDirect, optimal inter-node
- Missing /GDRDMA suffix = degraded path (30-50% bandwidth loss)
- NIC index in NET/IB/2(3) maps to physical mlx5 devices: verify balance
- Proxy Progress shows CPU core for network proxy: should be NUMA-aligned with GPU
- GIN (GPU-Initiated Networking) + RMA plugins = latest NCCL optimization stack
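The transport rules summarized above can be turned into a small lookup for the example layout in this recipe (2 nodes, 8 GPUs per node, NVL4 groups of 4). The sketch below assumes ranks are numbered node by node, so rank = node × 8 + local GPU index. `expected_transport` is an illustrative helper for comparing against the logs, not a description of how NCCL itself selects a transport.

```python
def expected_transport(src: int, dst: int,
                       gpus_per_node: int = 8,
                       nvl_group_size: int = 4) -> str:
    """Transport this recipe expects between two ranks in the example layout.

    Assumes rank = node * gpus_per_node + local GPU index, and that GPUs
    0..3 and 4..7 of each node form separate NVLink groups.
    """
    if src == dst:
        raise ValueError("src and dst must be different ranks")
    # Different nodes: traffic goes over the network.
    if src // gpus_per_node != dst // gpus_per_node:
        return "NET/IB/GDRDMA"
    src_group = (src % gpus_per_node) // nvl_group_size
    dst_group = (dst % gpus_per_node) // nvl_group_size
    # Same node and same NVLink group: direct peer-to-peer.
    if src_group == dst_group:
        return "P2P/CUMEM"
    # Same node, different NVLink groups: shared host memory.
    return "SHM/direct/direct"


if __name__ == "__main__":
    for pair in [(2, 1), (4, 2), (0, 8)]:
        print(pair, "->", expected_transport(*pair))
```

Comparing this expectation with the transport actually logged for each pair highlights channels that fell back to a slower path.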

