NCCL All-Reduce Benchmarking on Multi-Node GPUs
Run and interpret NCCL all_reduce_perf benchmarks on multi-node Kubernetes GPU clusters. Understand bus bandwidth results, expected throughput for H200 NVL
💡 Quick Answer: Run all_reduce_perf from nccl-tests across nodes to measure collective communication bandwidth. On 2-node H200 NVL (8 GPUs/node, 4x ConnectX-7 RoCE), expect ~35 GB/s peak bus bandwidth for large messages (≥1 GB) and ~13-18 GB/s average across all sizes. Results below 30 GB/s at large sizes indicate network misconfiguration, missing GDRDMA, or NIC-GPU topology mismatch.
The Problem
- Need to validate GPU cluster interconnect performance before running production training
- Don't know if NCCL is achieving theoretical bandwidth on your fabric
- Can't tell if GPUDirect RDMA is actually working vs falling back to host memory staging
- Need baseline numbers to compare against after config changes
- Multi-node all-reduce is the critical path for data-parallel training throughput
The Solution
Run all_reduce_perf on Kubernetes
apiVersion: batch/v1
kind: Job
metadata:
  name: nccl-allreduce-bench
  namespace: gpu-workloads
spec:
  parallelism: 2  # 2 nodes
  completions: 2
  template:
    spec:
      hostNetwork: true
      restartPolicy: Never  # Jobs require Never or OnFailure
      containers:
        - name: nccl-test
          image: nvcr.io/nvidia/pytorch:24.04-py3
          command:
            - bash
            - -c
            - |
              # As written, each pod runs its own 8-GPU test; spanning both
              # nodes needs a launcher such as mpirun (see Using MPI Launcher).
              /build/all_reduce_perf \
                -b 8 \
                -e 8G \
                -f 2 \
                -g 8 \
                -n 20 \
                -w 5
          env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: NCCL_IB_HCA
              value: "mlx5_0,mlx5_3,mlx5_5,mlx5_6"
            - name: NCCL_NET_GDR_LEVEL
              value: "5"
            - name: MASTER_ADDR
              value: "10.10.13.10"
            - name: MASTER_PORT
              value: "29500"
            - name: NCCL_NVLS_ENABLE
              value: "1"
          resources:
            limits:
              nvidia.com/gpu: "8"
              rdma/rdma_shared_device_a: "1"
Using MPI Launcher
# From a launcher pod with SSH access to both nodes
mpirun --np 16 --npernode 8 \
--host node1:8,node2:8 \
--mca btl_tcp_if_include eth0 \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_HCA=mlx5_0,mlx5_3,mlx5_5,mlx5_6 \
-x NCCL_NET_GDR_LEVEL=5 \
-x LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu \
  /build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -n 20 -w 5
Interpreting Results
# all_reduce_perf output format:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 27.80 0.00 0.00 0 27.83 0.00 0.00 0
128 32 float sum -1 28.24 0.00 0.01 0 27.96 0.00 0.01 0
1024 256 float sum -1 30.31 0.03 0.06 0 29.97 0.03 0.06 0
8192 2048 float sum -1 38.24 0.21 0.40 0 37.78 0.22 0.41 0
65536 16384 float sum -1 63.41 1.03 1.94 0 62.52 1.05 1.97 0
524288 131072 float sum -1 256.85 2.04 3.83 0 249.74 2.10 3.94 0
4194304 1048576 float sum -1 437.59 9.59 17.97 0 436.09 9.62 18.03 0
33554432 8388608 float sum -1 531.96 15.77 29.57 0 530.99 15.80 29.62 0
268435456 67108864 float sum -1 14314.7 18.75 35.16 0 14972.1 17.93 33.62 0
1073741824 2.68e+8 float sum -1 59530.2 18.04 33.82 0 57634.7 18.63 34.93 0
8589934592 2.15e+9 float sum -1 458955 18.72 35.09 0 459021 18.71 35.09 0
# Avg bus bandwidth: 13.49 GB/s
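Peak and average bus bandwidth can be pulled out of output like this with a short script. The following is a minimal sketch, assuming the 13-field row layout shown above (out-of-place and in-place columns both present, numeric #wrong values). The names parse_rows, summarize, and SAMPLE are illustrative and not part of nccl-tests.

```python
def parse_rows(text):
    """Parse all_reduce_perf data rows (13 whitespace-separated fields)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip headers, comments, and anything that is not a full data row.
        if len(parts) != 13 or parts[0].startswith("#"):
            continue
        try:
            rows.append({
                "size": int(parts[0]),
                "oop_busbw": float(parts[7]),   # out-of-place busbw (GB/s)
                "oop_wrong": int(parts[8]),
                "ip_busbw": float(parts[11]),   # in-place busbw (GB/s)
                "ip_wrong": int(parts[12]),
            })
        except ValueError:
            # Row did not have numeric values where expected; ignore it.
            continue
    return rows


def summarize(rows):
    """Return peak busbw, mean busbw over both modes, and total #wrong."""
    values = [r[k] for r in rows for k in ("oop_busbw", "ip_busbw")]
    return {
        "peak_busbw": max(values),
        "avg_busbw": sum(values) / len(values),
        "wrong": sum(r["oop_wrong"] + r["ip_wrong"] for r in rows),
    }


# Three rows copied from the sample run above.
SAMPLE = """\
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
8 2 float sum -1 27.80 0.00 0.00 0 27.83 0.00 0.00 0
268435456 67108864 float sum -1 14314.7 18.75 35.16 0 14972.1 17.93 33.62 0
1073741824 2.68e+8 float sum -1 59530.2 18.04 33.82 0 57634.7 18.63 34.93 0
"""

if __name__ == "__main__":
    print(summarize(parse_rows(SAMPLE)))
```

The average computed here covers only the rows passed in, so it will not match the "Avg bus bandwidth" line unless every row of the run is included.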
Key columns:
size = message size in bytes
algbw = algorithm bandwidth (size/time)
busbw = bus bandwidth (corrected for algorithm; the meaningful metric)
#wrong = data verification errors (should be 0)
Bus Bandwidth Expectations
Configuration                         │ Expected Peak busbw │ Notes
──────────────────────────────────────┼─────────────────────┼─────────────────────
2-node, 4x ConnectX-7 400G RoCE       │ 35-50 GB/s          │ 4 NICs × ~12.5 GB/s
2-node, 4x ConnectX-7 400G IB         │ 40-55 GB/s          │ IB slightly better
2-node, 8x ConnectX-7 400G IB (DGX)   │ 80-100 GB/s         │ Full NIC count
1-node only (NVLink H200)             │ 400-450 GB/s        │ NVLink-only
──────────────────────────────────────┴─────────────────────┴─────────────────────
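The numbers in this section can be cross-checked with a little arithmetic. The sketch below assumes the nccl-tests definition of all-reduce bus bandwidth, busbw = algbw × 2(N-1)/N for N ranks, and treats the 0.85 link efficiency as an illustrative assumption rather than a measured value. The function names are hypothetical.

```python
def busbw_factor(n_ranks):
    """All-reduce correction factor that turns algbw into busbw."""
    return 2 * (n_ranks - 1) / n_ranks


def nic_ceiling_gbytes(n_nics, link_gbps, efficiency):
    """Aggregate NIC bandwidth in GB/s: an upper bound, not an expectation."""
    return n_nics * (link_gbps / 8) * efficiency


if __name__ == "__main__":
    factor = busbw_factor(16)            # 2 nodes x 8 GPUs
    print("factor:", factor)
    # algbw of 18.75 GB/s is taken from the 256 MiB row of the sample run.
    print("busbw:", round(18.75 * factor, 2), "GB/s")
    # 4 NICs at 400 Gbps each, assumed 85% efficiency.
    print("NIC ceiling:", nic_ceiling_gbytes(4, 400, 0.85), "GB/s")
```

Comparing the computed busbw against the NIC ceiling shows how much headroom remains between a measured run and the raw link rate.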
Formula for bus bandwidth (all-reduce, N ranks):
busbw = algbw × 2(N-1)/N
      = algbw × (2×15/16)      for N = 16 (2 nodes × 8 GPUs)
      = algbw × 1.875
Example from the run above: 18.75 GB/s × 1.875 ≈ 35.16 GB/s
Cross-node upper bound from NIC line rate:
4 NICs × 50 GB/s (400 Gbps each) × efficiency (0.85) ≈ 170 GB/s
This is a ceiling on aggregate link bandwidth, not an expected result. The measured ~35 GB/s peak and the 35-50 GB/s range in the table above sit well below it.
Understanding NCCL Transport Selection
From NCCL INFO logs:
Intra-node (GPU-to-GPU on same server):
Channel X : 2[2] -> 1[1] via P2P/CUMEM
└── P2P/CUMEM = NVLink direct memory access (fastest)
Channel X : 4[4] -> 2[2] via SHM/direct/direct
└── SHM = shared memory (cross-NVLink-group, still intra-node)
Inter-node (GPU-to-GPU across network):
Channel X : 0[0] -> 8[0] [send] via NET/IB/2(3)/GDRDMA
├── NET/IB = InfiniBand/RoCE network
├── /2(3) = network device index 2; in recent NCCL versions the number in parentheses is the rank whose proxy handles the send (PXN), not a port
└── /GDRDMA = GPUDirect RDMA enabled (GPU memory → NIC → wire directly)
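Counting how many channels use each transport is a quick way to spot a fallback. The sketch below assumes channel lines contain "via P2P/...", "via SHM/...", or "via NET/..." as in the examples above. The function name classify_transports and the category labels are illustrative.

```python
import re
from collections import Counter

# Matches the transport part of an NCCL INFO channel line.
CHANNEL_RE = re.compile(r"via (P2P|SHM|NET)(/\S*)?")


def classify_transports(log_text):
    """Count channel lines per transport, splitting NET by GDRDMA use."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CHANNEL_RE.search(line)
        if not m:
            continue
        kind, detail = m.group(1), m.group(2) or ""
        if kind == "NET":
            key = "NET+GDRDMA" if "GDRDMA" in detail else "NET (no GDRDMA)"
            counts[key] += 1
        else:
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    log = """\
Channel X : 2[2] -> 1[1] via P2P/CUMEM
Channel X : 4[4] -> 2[2] via SHM/direct/direct
Channel X : 0[0] -> 8[0] [send] via NET/IB/2(3)/GDRDMA
"""
    for name, n in classify_transports(log).items():
        print(name, n)
```

Any non-zero "NET (no GDRDMA)" count suggests inter-node traffic is being staged through host memory.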
Proxy Progress:
[Proxy Progress] Device 3 CPU core 289
├── CPU core handling network proxy for GPU 3
└── Should be on same NUMA node as GPU for best latency
NCCL Initialization Decoded
NCCL INFO NCCL version 2.29.3+cuda13.1
NCCL INFO NCCL git version stable dcf2a2fbe
NCCL INFO cudaDriverVersion 13010
NCCL INFO Bootstrap: Using enol7195np0:10.10.13.10<0>
NCCL INFO 10 coll channels, 10 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
├── coll channels: rings/trees for collective ops
├── p2p channels: point-to-point (send/recv)
└── nvls channels: 0 means NVLS (NVLink SHARP) not used for this config
NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
└── Thread thresholds for different protocol sizes
NCCL INFO CC Off, workFifoBytes 1048576
└── CC = Confidential Computing mode (off here); work FIFO size 1 MB
NCCL TUNER/Plugin: Could not find: libnccl-tuner.so
└── No external tuner plugin; using built-in algorithms (fine)
ncclCommInitRankConfig Decoded
ncclCommInitRankConfig comm 0x5ac47f7d1bd0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId 1c8000 commId 0x61ff503aed8cflcc - Init COMPLETE
├── rank 7 of 16 total ranks
├── cudaDev 7 = CUDA device index 7
├── nvmlDev 7 = NVML device index 7
└── busId 1c8000 = PCIe bus ID (domain:bus:device)
Init timings - total 16.08 (kernels 0.15, alloc 15.58, bootstrap 0.21, allgathers 0.00, topo 0.05, graphs ...)
├── total 16.08s initialization
├── alloc 15.58s = memory allocation (dominant; pre-allocating NCCL buffers)
├── bootstrap 0.21s = establishing connections between ranks
└── topo 0.05s = topology detection
Network Plugin Stack
NCCL INFO Assigned NET plugin IB to comm ← InfiniBand/RoCE network
NCCL INFO Assigned GIN plugin GIN_IB_GDAKT to comm ← GPU-Initiated Network (DMA)
NCCL INFO Assigned RMA plugin GIN_IB_PROXY to comm ← Remote Memory Access via proxy
NCCL INFO Using network IB ← Confirmed: IB transport active
NCCL INFO NET/IB: Using [0]mlx5_0:1/RoCE [1]mlx5_3:1/RoCE [2]mlx5_5:1/RoCE [3]mlx5_6:1/RoCE [4]mlx5...
├── 4 Mellanox NICs active for NCCL traffic
└── :1/RoCE = port 1, RoCE mode (not native IB)
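The "NET/IB: Using" line can be checked against NCCL_IB_HCA automatically. The sketch below assumes entries of the form [index]device:port/link as shown above and a plain comma-separated NCCL_IB_HCA value with no "^" or "=" prefix and no ":port" suffix. The helper names parse_nics and missing_hcas are hypothetical.

```python
import re

# One entry looks like: [0]mlx5_0:1/RoCE
NIC_RE = re.compile(r"\[(\d+)\](\w+):(\d+)/(\w+)")


def parse_nics(line):
    """Extract the NICs listed on a 'NET/IB: Using' log line."""
    return [
        {"index": int(i), "device": dev, "port": int(port), "link": link}
        for i, dev, port, link in NIC_RE.findall(line)
    ]


def missing_hcas(line, nccl_ib_hca):
    """Return devices named in NCCL_IB_HCA that the log line does not list."""
    expected = [d.strip() for d in nccl_ib_hca.split(",") if d.strip()]
    found = {n["device"] for n in parse_nics(line)}
    return [d for d in expected if d not in found]


if __name__ == "__main__":
    line = ("NCCL INFO NET/IB: Using [0]mlx5_0:1/RoCE [1]mlx5_3:1/RoCE "
            "[2]mlx5_5:1/RoCE [3]mlx5_6:1/RoCE")
    print([n["device"] for n in parse_nics(line)])
    print(missing_hcas(line, "mlx5_0,mlx5_3,mlx5_5,mlx5_6"))
```

Truncated entries such as a trailing "[4]mlx5..." do not match the pattern and are ignored rather than guessed at.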
NCCL INFO DMA-BUF is available on GPU device 7
├── DMA-BUF = kernel interface for GPUDirect RDMA
└── Should appear for each GPU; actual GDRDMA use is confirmed by /GDRDMA on the NET/IB channel lines
Tuning for Better Results
env:
  # Enable NVLink SHARP (if supported)
  - name: NCCL_NVLS_ENABLE
    value: "1"
  # NVLink-centric scheduling (H200/H100)
  - name: NCCL_NVLINK_CENTRIC_SCHED
    value: "1"
  # Use all available NICs
  - name: NCCL_IB_HCA
    value: "mlx5_0,mlx5_3,mlx5_5,mlx5_6"
  # Max channels (more parallelism)
  - name: NCCL_MAX_NCHANNELS
    value: "16"
  # Min channels (avoid idle NICs)
  - name: NCCL_MIN_NCHANNELS
    value: "8"
  # Max NIC-to-GPU topology distance at which GPUDirect RDMA is used
  # (a distance level, not a message-size threshold)
  - name: NCCL_NET_GDR_LEVEL
    value: "5"
  # Protocol selection
  - name: NCCL_PROTO
    value: "Simple,LL,LL128"
  # Buffer sizes
  - name: NCCL_BUFFSIZE
    value: "8388608"
Common Issues
Peak busbw much lower than expected (< 20 GB/s with 4x 400G NICs)
- Cause: GDRDMA not active (missing DMA-BUF); or only 1-2 NICs used instead of 4
- Fix: Check for "DMA-BUF is available" per GPU; verify NCCL_IB_HCA lists all NICs; check the "NET/IB: Using" line
"Could not find: libnccl-tuner.so" warning
- Cause: Optional tuner plugin not installed; NOT an error
- Fix: Ignore. NCCL uses built-in algorithms. Install tuner plugin only for specific fabric optimizations
High latency for small messages (> 50 µs)
- Cause: Normal for cross-node (network RTT); or proxy thread on wrong NUMA
- Fix: Small message latency dominated by network RTT (~5-15 µs per hop). For optimization: pin proxy cores to GPU-local NUMA
#wrong > 0 (data corruption)
- Cause: Hardware error (NIC, cable, switch); or software bug (rare)
- Fix: Critical: indicates data corruption. Check switch error counters, cable CRC errors, replace hardware
Bandwidth plateaus below theoretical at large sizes
- Cause: Switch congestion; PFC pauses; or insufficient channels
- Fix: Check switch port counters for pause frames; increase NCCL_MAX_NCHANNELS; verify ECN/PFC config
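Several of the fixes above begin with reading the NCCL INFO log, and the DMA-BUF check in particular is easy to automate. The sketch below assumes the log of one node has been captured to a string and that GPUs are numbered 0 to 7. The function name gpus_missing_dmabuf is hypothetical.

```python
import re

DMABUF_RE = re.compile(r"DMA-BUF is available on GPU device (\d+)")


def gpus_missing_dmabuf(log_text, gpus_per_node=8):
    """Return GPU indices that never logged 'DMA-BUF is available'."""
    seen = {int(dev) for dev in DMABUF_RE.findall(log_text)}
    return [gpu for gpu in range(gpus_per_node) if gpu not in seen]


if __name__ == "__main__":
    # Illustrative log fragment: GPUs 0-6 report DMA-BUF, GPU 7 does not.
    log = "\n".join(
        f"NCCL INFO DMA-BUF is available on GPU device {i}" for i in range(7)
    )
    print("GPUs without DMA-BUF:", gpus_missing_dmabuf(log))
```

An empty result means every GPU reported DMA-BUF support; it does not by itself prove that traffic used GPUDirect RDMA.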
Best Practices
- Always run all_reduce_perf before production training: establishes baseline
- Test increasing message sizes (-b 8 -e 8G -f 2): reveals bandwidth vs latency behavior
- Verify GDRDMA per GPU: every GPU should show "DMA-BUF is available"
- Check all NICs are active: "NET/IB: Using [0]... [1]... [2]... [3]..."
- Compare in-place vs out-of-place: should be similar (if not, memory contention)
- Run with NCCL_DEBUG=INFO: confirms transport paths and plugin assignments
- Record baselines: compare after any network/driver/firmware changes
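Recording baselines and comparing in-place against out-of-place results can be scripted. The sketch below assumes busbw values have already been collected into dictionaries keyed by message size. The 10% regression tolerance is an arbitrary illustrative threshold, and the function names are hypothetical.

```python
def regressions(baseline, current, tolerance=0.10):
    """Return (size, fractional_drop) for sizes whose busbw fell by more
    than `tolerance` relative to the baseline."""
    flagged = []
    for size, base_bw in sorted(baseline.items()):
        cur_bw = current.get(size)
        if cur_bw is None or base_bw <= 0:
            continue  # size not measured this time, or no usable baseline
        drop = (base_bw - cur_bw) / base_bw
        if drop > tolerance:
            flagged.append((size, round(drop, 3)))
    return flagged


def inplace_gap(oop_busbw, ip_busbw):
    """Relative gap between out-of-place and in-place busbw (0.0 to 1.0)."""
    hi = max(oop_busbw, ip_busbw)
    return 0.0 if hi == 0 else abs(oop_busbw - ip_busbw) / hi


if __name__ == "__main__":
    # Baseline values come from the sample run; "current" is made up.
    baseline = {268435456: 35.16, 8589934592: 35.09}
    current = {268435456: 28.0, 8589934592: 34.9}
    print(regressions(baseline, current))
    print(round(inplace_gap(35.16, 33.62), 3))
```

Keeping the baseline dictionary in version control next to the cluster config makes it simple to rerun the comparison after a driver or firmware change.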
Key Takeaways
- all_reduce_perf is the standard NCCL collective benchmark: measures actual cross-node bandwidth
- busbw is the meaningful metric (corrected for algorithm overhead)
- 2-node H200 NVL with 4x 400G RoCE: expect ~35 GB/s peak busbw, ~13-18 GB/s average
- P2P/CUMEM = NVLink intra-node; NET/IB/GDRDMA = RDMA inter-node (both are optimal paths)
- DMA-BUF required per GPU for GPUDirect RDMA: verify in NCCL INFO output
- nvlinkCentricSched=1 enables NVLink-aware communication scheduling (H100/H200)
- #wrong must always be 0: non-zero means data corruption, usually at the hardware level

