Verify NCCL RDMA Traffic with Debug Logging
Prove that NCCL uses RDMA for GPU communication on Kubernetes. Use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL to verify that traffic runs over InfiniBand or RoCE.
💡 Quick Answer: Set NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL to see exactly which transport NCCL selects. If RDMA is active, logs show NET/IB with mlx5_X device names. If it falls back to TCP, you'll see NET/Socket. For GPUDirect RDMA proof, look for GPU Direct RDMA Enabled and NET/IB/0/GDRDMA. Combine with perfquery, rdma statistic, and ibstat for wire-level verification.
The Problem
You configured InfiniBand or RoCE for your GPU cluster, but how do you prove that NCCL is actually using RDMA and not silently falling back to TCP sockets? Without verification, you might be running distributed training over TCP at a tenth of the bandwidth and never know it. NCCL auto-detects transports and doesn't warn loudly when it falls back.
flowchart TB
    NCCL["NCCL Init"] --> DETECT{"Auto-detect<br/>transports"}
    DETECT -->|"IB devices found"| IB_CHECK{"IB fabric<br/>healthy?"}
    IB_CHECK -->|"Yes"| RDMA["✅ NET/IB (RDMA)<br/>200-400 Gb/s"]
    IB_CHECK -->|"No (silent fail)"| TCP["⚠️ NET/Socket (TCP)<br/>10-25 Gb/s"]
    DETECT -->|"No IB"| TCP
    RDMA --> GDR_CHECK{"GPUDirect<br/>RDMA?"}
    GDR_CHECK -->|"nvidia-peermem loaded"| GDRDMA["✅ GDRDMA<br/>GPU↔NIC direct"]
    GDR_CHECK -->|"Not available"| STAGED["⚠️ Staged via CPU<br/>GPU→CPU→NIC"]

The Solution
Step 1: Enable Full NCCL Debug Logging
apiVersion: v1
kind: Pod
metadata:
  name: nccl-debug-test
spec:
  containers:
  - name: test
    image: nvcr.io/nvidia/pytorch:24.07-py3
    env:
    # Full debug output: shows every transport decision
    - name: NCCL_DEBUG
      value: "INFO"    # Options: WARN, INFO, TRACE
    - name: NCCL_DEBUG_SUBSYS
      value: "ALL"     # Log ALL subsystems
    # Other useful subsystem filters:
    #   "NET"       network transport only
    #   "INIT,NET"  initialization + network
    #   "ALL"       everything (verbose, use for debugging)
    # Don't force-disable IB; let NCCL auto-detect
    # - name: NCCL_IB_DISABLE
    #   value: "0"
    command: ["/bin/bash", "-c"]
    args:
    - |
      # Simple all-reduce test that triggers NCCL init.
      # With a single rank, expect only the init-time NET lines;
      # "Channel ... via NET/..." lines need ranks on at least two nodes.
      python -c "
      import torch
      import torch.distributed as dist
      import os
      os.environ['MASTER_ADDR'] = 'localhost'
      os.environ['MASTER_PORT'] = '29500'
      dist.init_process_group('nccl', rank=0, world_size=1)
      t = torch.ones(1024, device='cuda')
      dist.all_reduce(t)
      print(f'SUCCESS: all-reduce result = {t[0].item()}')
      dist.destroy_process_group()
      " 2>&1 | tee /tmp/nccl-debug.log
    resources:
      limits:
        nvidia.com/gpu: 1

Step 2: Read the NCCL Logs (What to Look For)
✅ RDMA is active (InfiniBand):
NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB
NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 0 : mlx5_0
NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 1 : mlx5_1
NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/IB/0/GDRDMA
NCCL INFO Channel 01 : 0[0] -> 1[1] via NET/IB/1/GDRDMA

✅ RDMA is active (RoCE):
NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE
NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 0 : mlx5_0
NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/IB/0/GDRDMA

⚠️ RDMA without GPUDirect (staged through CPU):
NCCL INFO NET/IB : Using [0]mlx5_0:1/IB
NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 : mlx5_0
NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/IB/0

(Still RDMA, but data goes GPU → CPU → NIC instead of GPU → NIC directly)

❌ Fell back to TCP sockets:
NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/Socket/0

ℹ️ No network at all (single node NVLink/PCIe only):
NCCL INFO Channel 00 : 0[0] -> 1[1] via P2P/IPC/read
NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM

Step 3: Decode the Log Lines
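The channel patterns from Step 2 can also be checked programmatically. The following is a minimal sketch, assuming the log lines look like the samples above (real NCCL output can differ between versions); the function names `classify_channel` and `summarize` and the labels they return are illustrative choices, not part of NCCL.

```python
import re

# Patterns are taken from the sample lines in Step 2. The GDRDMA form is
# listed first so it wins over the shorter NET/IB/<n> form.
CHANNEL_RE = re.compile(
    r"via (NET/IB/\d+/GDRDMA|NET/IB/\d+|NET/Socket/\d+|P2P\S*|SHM\S*)"
)

def classify_channel(line):
    """Return a transport label for one 'Channel ... via ...' line, or None."""
    m = CHANNEL_RE.search(line)
    if not m:
        return None
    via = m.group(1)
    if via.startswith("NET/IB") and via.endswith("GDRDMA"):
        return "rdma+gdr"      # RDMA with GPUDirect
    if via.startswith("NET/IB"):
        return "rdma"          # RDMA, staged through CPU memory
    if via.startswith("NET/Socket"):
        return "tcp"           # TCP fallback
    return "intra-node"        # P2P or SHM, no network involved

def summarize(log_text):
    """Count channels per transport label across a whole log."""
    counts = {}
    for line in log_text.splitlines():
        label = classify_channel(line)
        if label:
            counts[label] = counts.get(label, 0) + 1
    return counts

sample = """\
NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/IB/0/GDRDMA
NCCL INFO Channel 01 : 0[0] -> 1[1] via NET/IB/1/GDRDMA
NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/Socket/0
"""
print(summarize(sample))  # {'rdma+gdr': 2, 'tcp': 1}
```

Any non-zero `tcp` count in a multi-node job is the signal to investigate; the grep commands that follow show the same information interactively.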
# Key lines to grep from NCCL logs:

# 1. Which transport was selected?
grep "NET/" nccl-debug.log
# NET/IB     = InfiniBand/RoCE RDMA ✅
# NET/Socket = TCP fallback ❌

# 2. Is GPUDirect RDMA active?
grep "GPU Direct RDMA" nccl-debug.log
# "Enabled"  = GPU↔NIC direct transfer ✅
# "Disabled" = Staged through CPU memory ⚠️

# 3. Which HCA (network device) is used?
grep "Using \[" nccl-debug.log
# Shows mlx5_X devices and port type (IB vs RoCE)

# 4. Channel routing
grep "via NET" nccl-debug.log
# NET/IB/0/GDRDMA = RDMA with GPUDirect ✅✅
# NET/IB/0        = RDMA without GPUDirect ✅
# NET/Socket/0    = TCP ❌

# 5. Bandwidth achieved
grep "Bandwidth" nccl-debug.log
# Or run nccl-tests for measured bandwidth

NCCL_DEBUG_SUBSYS Options
| Subsystem | Shows | When to Use |
|---|---|---|
| INIT | Initialization, topology detection | Verify GPU/NIC discovery |
| NET | Network transport selection, connections | Prove RDMA vs TCP |
| COLL | Collective operations (all-reduce, etc.) | Debug hangs during training |
| P2P | Peer-to-peer GPU transfers (NVLink/PCIe) | Verify intra-node P2P |
| SHM | Shared memory transport | Debug single-node issues |
| GRAPH | Channel/ring topology | Optimize multi-rail configs |
| TUNING | Algorithm selection (ring, tree, etc.) | Performance tuning |
| ALL | Everything | Full debugging (verbose!) |
# Targeted debugging examples:

# Just network transport (most useful for RDMA verification)
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=NET

# Network + initialization
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET

# Maximum verbosity (generates lots of output)
NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL

# Save to file (TRACE is extremely verbose)
NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL NCCL_DEBUG_FILE=/tmp/nccl-trace-%h-%p.log

Step 4: Wire-Level RDMA Verification
NCCL logs prove the software chose RDMA. These tools prove packets actually flow over RDMA on the wire:
# 1. Check IB port counters BEFORE and AFTER a training step
# Run on worker node:
perfquery -x    # Extended counters
# Key counters:
#   PortXmitData / PortRcvData: data sent/received (counted in 4-byte words)
#   PortXmitPkts / PortRcvPkts: packets sent/received
# If these increase during training → RDMA traffic confirmed

# 2. Snapshot counters before training
ibstat mlx5_0 | grep -E "Rate|State"
# Rate: 200 (HDR)
# State: Active
# -C picks the local HCA and -P its port; bare positional arguments are read as LID and port
perfquery -x -C mlx5_0 -P 1 | grep -E "XmitData|RcvData" > /tmp/before.txt
# ... run training step ...
perfquery -x -C mlx5_0 -P 1 | grep -E "XmitData|RcvData" > /tmp/after.txt
diff /tmp/before.txt /tmp/after.txt
# If counters increased → RDMA traffic on the wire ✅

# 3. Real-time RDMA traffic monitoring
watch -n 1 'perfquery -x -C mlx5_0 -P 1 | grep -E "XmitData|RcvData"'

# 4. Check RDMA device statistics
rdma statistic show link mlx5_0/1

# 5. Verify no TCP traffic on the Ethernet interface (if using IB)
# During NCCL all-reduce, Ethernet counters should NOT increase
watch -n 1 'cat /sys/class/net/eth0/statistics/tx_bytes'
# Stable = traffic is going over IB, not Ethernet ✅
# Increasing = something is using TCP ❌

Step 5: NCCL-Tests with Full Debug
# Run nccl-tests with RDMA debug: the definitive proof
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-rdma-verify
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: nvcr.io/nvidia/pytorch:24.07-py3
            command: ["/bin/bash", "-c"]
            args:
            - |
              mpirun \
                -np 16 --npernode 8 \
                -x NCCL_DEBUG=INFO \
                -x NCCL_DEBUG_SUBSYS=ALL \
                -x NCCL_IB_DISABLE=0 \
                -x NCCL_NET_GDR_LEVEL=5 \
                /opt/nccl-tests/build/all_reduce_perf \
                -b 1M -e 1G -f 2 -g 1 2>&1 | tee /results/nccl-rdma-test.log
              echo "=== TRANSPORT SUMMARY ==="
              grep "NET/" /results/nccl-rdma-test.log | sort -u
              echo "=== GPUDirect RDMA ==="
              grep "GPU Direct" /results/nccl-rdma-test.log | sort -u
              echo "=== BANDWIDTH ==="
              grep -E "^\s+[0-9]" /results/nccl-rdma-test.log | tail -5
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: nvcr.io/nvidia/pytorch:24.07-py3
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/rdma_shared_device_a: 1
            volumeMounts:
            - name: shm
              mountPath: /dev/shm
          volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 16Gi

Expected output proving RDMA:
=== TRANSPORT SUMMARY ===
NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB
=== GPUDirect RDMA ===
NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 0 : mlx5_0
NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 1 : mlx5_1
NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 2 : mlx5_2
NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 3 : mlx5_3
=== BANDWIDTH ===
#       size       count   type  redop  root    time   algbw   busbw
    67108864    16777216  float    sum    -1    1.23   54.56  102.30
   134217728    33554432  float    sum    -1    2.41   55.69  104.42
   268435456    67108864  float    sum    -1    4.78   56.16  105.30
   536870912   134217728  float    sum    -1    9.52   56.39  105.73
  1073741824   268435456  float    sum    -1   18.94   56.69  106.30

Bus bandwidth reference:
- 100+ GB/s → 4× HDR IB (200 Gb/s each) with RDMA ✅
- 40-80 GB/s → 2× HDR IB or 4× 100 Gb/s RoCE ✅
- 5-15 GB/s → Fell back to TCP ❌
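To relate the algbw and busbw columns to these bands, the sketch below applies the all-reduce bus-bandwidth formula used by nccl-tests, busbw = algbw × 2(n−1)/n, and maps the result onto the reference ranges above. The function names are hypothetical, and the band labels simply restate the list; values that fall between bands are reported as "unclear" rather than guessed.

```python
def allreduce_busbw(algbw_gbs, n_ranks):
    """Bus bandwidth for all-reduce as nccl-tests defines it: algbw * 2*(n-1)/n."""
    if n_ranks < 2:
        raise ValueError("all-reduce bus bandwidth needs at least 2 ranks")
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks

def classify_busbw(busbw_gbs):
    """Map a bus bandwidth (GB/s) onto the reference bands from the list above."""
    if busbw_gbs >= 100:
        return "4x HDR IB class (RDMA)"
    if 40 <= busbw_gbs <= 80:
        return "2x HDR IB or 4x 100 Gb/s RoCE class (RDMA)"
    if 5 <= busbw_gbs <= 15:
        return "TCP fallback"
    return "unclear"

# First row of the sample output: algbw 54.56 GB/s across 16 ranks.
busbw = round(allreduce_busbw(54.56, 16), 2)
print(busbw, classify_busbw(busbw))  # 102.3 4x HDR IB class (RDMA)
```

With 16 ranks the factor is 1.875, which is why the sample busbw column is a little under twice the algbw column.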
Step 6: GPUDirect RDMA Verification
# Verify nvidia-peermem module is loaded (required for GPUDirect RDMA)
lsmod | grep nvidia_peermem
# nvidia_peermem    16384    0
# If empty → GPUDirect RDMA is NOT active

# Load it
sudo modprobe nvidia-peermem

# Verify GPU-NIC affinity (GPU and NIC should be on the same PCIe root)
nvidia-smi topo -m
# Expected: GPU0 ↔ mlx5_0 = PIX or PXB (same PCIe switch or bridges); PHB (via the host bridge) is usually acceptable
# Bad:      GPU0 ↔ mlx5_0 = SYS (crosses CPU sockets, slower)

# Check GDR (GPUDirect RDMA) capability
cat /sys/kernel/mm/memory_peers/nv_mem/version
# 2.0 = nvidia-peermem v2 ✅

# Verify with NCCL env
NCCL_NET_GDR_LEVEL=5 NCCL_DEBUG=INFO python -c "
import torch.distributed as dist
# ... init and run all-reduce
" 2>&1 | grep "GPU Direct RDMA"
# "Enabled"  = GPUDirect active ✅
# "Disabled" = Falling back to staged (CPU bounce) ⚠️

Complete Verification Checklist
#!/bin/bash
# rdma-verify.sh: run on each GPU node

echo "=== 1. IB Hardware ==="
ibstat 2>/dev/null || echo "❌ No IB tools (install rdma-core)"

echo -e "\n=== 2. IB Device State ==="
ibstatus 2>/dev/null | grep -E "state:|rate:" || echo "❌ No active IB ports"

echo -e "\n=== 3. RDMA Devices ==="
rdma link show 2>/dev/null || echo "❌ No RDMA devices"

echo -e "\n=== 4. nvidia-peermem (GPUDirect RDMA) ==="
if lsmod | grep -q nvidia_peermem; then
  echo "✅ nvidia-peermem loaded"
else
  echo "❌ nvidia-peermem NOT loaded, so no GPUDirect RDMA"
  echo "   Fix: modprobe nvidia-peermem"
fi

echo -e "\n=== 5. GPU-NIC Topology ==="
nvidia-smi topo -m 2>/dev/null | head -20 || echo "❌ nvidia-smi not available"

echo -e "\n=== 6. IB Port Counters ==="
for dev in $(ls /sys/class/infiniband/ 2>/dev/null); do
  echo "Device: $dev"
  # -C selects the HCA by name, -P the port
  perfquery -x -C "$dev" -P 1 2>/dev/null | grep -E "XmitData|RcvData|XmitPkts|RcvPkts" || echo "  (no counters)"
done

echo -e "\n=== 7. NCCL Transport Test ==="
echo "Run with: NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL <your-training-command>"
echo "Look for: NET/IB (RDMA) vs NET/Socket (TCP)"

Common Issues
| Issue | Cause | Fix |
|---|---|---|
| NET/Socket instead of NET/IB | IB not detected or unhealthy | Check ibstat, verify port is Active |
| GPU Direct RDMA Disabled | nvidia-peermem not loaded | modprobe nvidia-peermem |
| ib_register_peer_memory_client error | nvidia-peermem version mismatch | Reinstall GPU driver with --peermem |
| Low bandwidth despite RDMA | Wrong GID index for RoCE | Set NCCL_IB_GID_INDEX=3 |
| NCCL WARN IB : ib_cmd error | IB port in Init state, not Active | Check subnet manager (opensm) |
| Counters not increasing on IB | Traffic using different HCA | Check NCCL_IB_HCA setting |
| GDRDMA not appearing | GPU and NIC on different NUMA nodes | Verify with nvidia-smi topo -m |
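Diagnosing the "Counters not increasing on IB" row comes down to comparing two counter snapshots. The sketch below parses saved perfquery output and reports growth in bytes. It assumes perfquery prints each counter as a name, a run of dots, and a decimal value on one line, and it relies on PortXmitData and PortRcvData being counted in 4-byte words; the function names and the sample snapshots are illustrative, not real measurements.

```python
import re

# Matches lines such as "PortXmitData:....................1234".
# Hex fields and the "# Port extended counters" header are skipped.
COUNTER_RE = re.compile(r"^(\w+):\.*(\d+)\s*$")

def parse_perfquery(text):
    """Parse 'Name:....value' lines from perfquery output into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

def data_delta_bytes(before, after, keys=("PortXmitData", "PortRcvData")):
    """Counter growth between two snapshots, converted from 4-byte words to bytes."""
    return {
        k: (after[k] - before[k]) * 4
        for k in keys
        if k in before and k in after
    }

# Made-up snapshots in the assumed layout.
before_text = """\
# Port extended counters: Lid 1 port 1 (CapMask: 0x5A00)
PortSelect:......................1
CounterSelect:...................0x0000
PortXmitData:....................1000
PortRcvData:.....................500
"""
after_text = before_text.replace("1000", "2000").replace("500", "1000")

delta = data_delta_bytes(parse_perfquery(before_text), parse_perfquery(after_text))
print(delta)  # {'PortXmitData': 4000, 'PortRcvData': 2000}
```

A delta of zero on the HCA you expected, while the job is moving data, suggests NCCL picked a different device or fell back to TCP.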
Best Practices
- Always verify after infrastructure changes: a driver update, firmware update, or node reimage can break RDMA
- Use NCCL_DEBUG_SUBSYS=ALL for initial verification, then narrow to NET for routine checks, since ALL is very verbose
- Save debug logs to a file with NCCL_DEBUG_FILE: easier to analyze than mixed stdout
- Compare IB counter deltas: they are the ground truth for wire-level RDMA traffic
- Check nvidia-smi topo -m: GPUDirect RDMA works best when GPU and NIC share a PCIe root complex
- Run rdma-verify.sh on every node: one misconfigured node can silently degrade the entire job
- Remove NCCL_DEBUG=INFO in production: debug logging adds latency (~5% throughput hit)
Key Takeaways
- NCCL_DEBUG=INFO + NCCL_DEBUG_SUBSYS=ALL is the definitive way to verify RDMA
- NET/IB = RDMA active, NET/Socket = TCP fallback
- GDRDMA = GPUDirect RDMA (GPU↔NIC direct, bypasses CPU)
- perfquery counters prove packets flow on the wire, not just software selection
- nvidia-peermem module is required for GPUDirect RDMA
- Always verify after cluster changes: RDMA can silently fall back to TCP
- Expected bandwidth: 100+ GB/s bus bandwidth with 4× HDR IB, vs 5-15 GB/s on TCP

