NCCL and RCCL Networking Performance on Kubernetes
Optimize NCCL (NVIDIA) and RCCL (AMD) collective communication performance on Kubernetes GPU clusters. Network transport selection, bandwidth tuning, latency
💡 Quick Answer: NCCL (NVIDIA) and RCCL (AMD) are the collective communication libraries for distributed GPU workloads. Peak networking performance requires: GPUDirect RDMA for zero-copy GPU-to-GPU transfers over InfiniBand/RoCE, correct NIC-to-GPU affinity (same PCIe/NUMA), tuned socket threads for TCP fallback, and rail-optimized topology matching. Benchmark with all_reduce_perf and target >90% of theoretical link bandwidth.
The Problem
- Distributed training/inference spends 30-60% of time in communication (all-reduce, all-gather)
- Default NCCL/RCCL settings leave significant bandwidth on the table
- Mismatched NIC-GPU affinity routes traffic through CPU, halving throughput
- TCP fallback (no RDMA) can be 5-10x slower than InfiniBand/RoCE
- AMD GPU clusters need RCCL-specific tuning that differs from NVIDIA's NCCL
- Kubernetes pod networking adds latency unless bypassed with host networking or SR-IOV
The Solution
NCCL vs RCCL Overview
NCCL (NVIDIA Collective Communications Library)
  • For NVIDIA GPUs (CUDA)
  • Transports: NVLink, PCIe P2P, InfiniBand (Verbs), RoCE, TCP
  • GPUDirect RDMA: GPU memory ↔ NIC without CPU copy
  • GPUDirect P2P: GPU ↔ GPU via NVLink/NVSwitch
  • Version: 2.30.x (latest)

RCCL (ROCm Collective Communications Library)
  • For AMD GPUs (ROCm/HIP)
  • Fork of NCCL with a similar API; mostly the same NCCL_* env vars, plus some RCCL_* additions
  • Transports: XGMI (Infinity Fabric), PCIe P2P, RoCE, TCP
  • GPU RDMA via ROCm SMI + Mellanox/Broadcom NICs
  • Version: 2.20.x (ROCm 6.x)

Transport hierarchy (fastest → slowest):
NVIDIA: NVSwitch > NVLink > PCIe P2P > GPUDirect RDMA > IB Verbs > RoCE > TCP
AMD: XGMI > PCIe P2P > GPU RDMA > RoCE > TCP
Network Transport Selection
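The inter-node figures in this section are quoted in Gbps, while NCCL benchmarks report GB/s. The following is a minimal sketch of that conversion, including the multi-rail aggregate. The function names are illustrative, and the 0.96 efficiency factor is an assumption chosen only to reproduce the rounded ~24 and ~48 GB/s per-port figures; it is not a measured value.

```python
def gbps_to_gigabytes_per_s(gbps: float, efficiency: float = 1.0) -> float:
    """Convert a link rate in gigabits/s to gigabytes/s, scaled by an efficiency factor."""
    return gbps / 8.0 * efficiency


def aggregate_gigabytes_per_s(num_nics: int, gbps_per_nic: float,
                              efficiency: float = 1.0) -> float:
    """Aggregate bandwidth of several identical NICs (multi-rail)."""
    return num_nics * gbps_to_gigabytes_per_s(gbps_per_nic, efficiency)


if __name__ == "__main__":
    # (label, number of NICs, Gbps per NIC); the 0.96 factor is an assumption.
    for label, nics, gbps in [("HDR", 1, 200), ("NDR", 1, 400),
                              ("4x NDR", 4, 400), ("8x HDR", 8, 200)]:
        raw = aggregate_gigabytes_per_s(nics, gbps)
        usable = aggregate_gigabytes_per_s(nics, gbps, efficiency=0.96)
        print(f"{label}: raw {raw:.0f} GB/s, about {usable:.0f} GB/s at 96% efficiency")
```

Both multi-rail layouts (4x NDR and 8x HDR) come out at the same aggregate, which is why the choice between them is usually about switch ports and rail count rather than raw bandwidth.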
NCCL automatically selects the best available transport:
Intra-node (same server):
├── NVLink/NVSwitch: 900 GB/s (H100), 600 GB/s (A100)
├── PCIe P2P: ~32 GB/s (Gen4 x16), ~64 GB/s (Gen5 x16)
└── Shared Memory (SHM): ~20 GB/s (CPU-mediated)
Inter-node (across servers):
├── InfiniBand HDR: 200 Gbps (~24 GB/s) per port
├── InfiniBand NDR: 400 Gbps (~48 GB/s) per port
├── RoCE v2: 100-400 Gbps (depends on NIC)
└── TCP/IP: 10-100 Gbps (depends on NIC, high CPU overhead)
Multi-rail (multiple NICs per node):
├── 4x NDR 400G = 1.6 Tbps aggregate (~192 GB/s)
├── 8x HDR 200G = 1.6 Tbps aggregate (~192 GB/s)
└── Rail-optimized: each NIC connects to dedicated switch
NCCL Performance Tuning for Kubernetes
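Two of the settings in the manifest in this section have simple arithmetic behind them: NCCL_IB_TIMEOUT is an exponent (the timeout is 4.096 µs × 2^value), and the product of NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD must not exceed 64. A small sketch for sanity-checking values before putting them in a Job spec; the helper names are hypothetical and not part of NCCL.

```python
def ib_timeout_seconds(timeout_exponent: int) -> float:
    """Timeout implied by NCCL_IB_TIMEOUT: 4.096 microseconds * 2**exponent."""
    return 4.096e-6 * (2 ** timeout_exponent)


def socket_settings_ok(nthreads: int, nsocks_per_thread: int, limit: int = 64) -> bool:
    """Check that NTHREADS * NSOCKS_PERTHREAD stays within the documented limit."""
    if nthreads < 1 or nsocks_per_thread < 1:
        return False
    return nthreads * nsocks_per_thread <= limit


if __name__ == "__main__":
    for exponent in (22, 23):
        print(f"NCCL_IB_TIMEOUT={exponent}: {ib_timeout_seconds(exponent):.1f} s")
    print("4 threads x 4 sockets ok:", socket_settings_ok(4, 4))
    print("16 threads x 8 sockets ok:", socket_settings_ok(16, 8))
```

Each step of the exponent doubles the timeout, so moving from 22 to 23 is a much larger change than it looks.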
apiVersion: batch/v1
kind: Job
metadata:
  name: nccl-benchmark
  namespace: gpu-workloads
spec:
  parallelism: 2
  completions: 2
  template:
    spec:
      hostNetwork: true # Bypass pod network overhead
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: nccl-test
          image: nvcr.io/nvidia/pytorch:24.05-py3
          command:
            - bash
            - -c
            - |
              # Run all_reduce benchmark
              /usr/local/bin/all_reduce_perf \
                -b 8 -e 4G -f 2 -g 8 \
                -n 100 -w 50
          env:
            # === Transport Selection ===
            - name: NCCL_NET
              value: "IB" # Force InfiniBand (IB|Socket)
            # === InfiniBand / RDMA ===
            - name: NCCL_IB_HCA
              value: "=mlx5_0,mlx5_1,mlx5_2,mlx5_3" # Select specific HCAs
            - name: NCCL_IB_GID_INDEX
              value: "3" # RoCE v2 GID index
            - name: NCCL_IB_TIMEOUT
              value: "23" # IB timeout (2^23 * 4.096µs ≈ 34s)
            - name: NCCL_IB_RETRY_CNT
              value: "7" # Max IB retries
            # === GPUDirect RDMA ===
            - name: NCCL_NET_GDR_LEVEL
              value: "5" # 5 = allow GDR across any PCIe distance
            - name: NCCL_NET_GDR_READ
              value: "1" # Enable GDR for read operations
            # === Multi-NIC / Rail ===
            - name: NCCL_CROSS_NIC
              value: "0" # 0 = same rail only (rail-optimized)
            - name: NCCL_IB_QPS_PER_CONNECTION
              value: "4" # QPs per IB connection
            # === TCP Fallback (if no RDMA) ===
            - name: NCCL_SOCKET_IFNAME
              value: "=eth0"
            - name: NCCL_SOCKET_NTHREADS
              value: "4" # CPU threads per socket connection
            - name: NCCL_NSOCKS_PERTHREAD
              value: "4" # Sockets per thread (max 64 total)
            - name: NCCL_BUFFSIZE
              value: "8388608" # 8MB send/recv buffer
            # === Algorithm / Protocol ===
            # DO NOT set in production; let NCCL auto-select
            # - name: NCCL_ALGO
            #   value: "Ring" # Ring|Tree|CollnetDirect|CollnetChain
            # - name: NCCL_PROTO
            #   value: "Simple" # LL|LL128|Simple
            # === Debugging (remove in production) ===
            - name: NCCL_DEBUG
              value: "INFO"
            - name: NCCL_DEBUG_SUBSYS
              value: "INIT,NET,GRAPH"
          resources:
            limits:
              nvidia.com/gpu: "8"
              rdma/rdma_shared_device_a: "1"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
      restartPolicy: Never
RCCL Performance Tuning for AMD GPUs
apiVersion: batch/v1
kind: Job
metadata:
  name: rccl-benchmark
  namespace: gpu-workloads
spec:
  parallelism: 2
  completions: 2
  template:
    spec:
      hostNetwork: true
      containers:
        - name: rccl-test
          image: rocm/pytorch:rocm6.2-ubuntu22.04-py3.10
          command:
            - bash
            - -c
            - |
              /opt/rocm/bin/all_reduce_perf \
                -b 8 -e 4G -f 2 -g 8 \
                -n 100 -w 50
          env:
            # RCCL uses same env var names as NCCL (mostly)
            # Some are prefixed RCCL_ instead of NCCL_
            # === Network Selection ===
            - name: NCCL_SOCKET_IFNAME
              value: "=eth0"
            - name: NCCL_IB_HCA
              value: "=mlx5_0,mlx5_1,mlx5_2,mlx5_3"
            # === RCCL-Specific ===
            - name: RCCL_MSCCL_ENABLE
              value: "1" # Enable MSCCL algorithms
            - name: HSA_FORCE_FINE_GRAIN_PCIE
              value: "1" # Fine-grain PCIe for P2P
            - name: NCCL_MIN_NCHANNELS
              value: "32" # Min channels (MI300X: 32)
            - name: NCCL_MAX_NCHANNELS
              value: "32" # Max channels
            # === AMD Infinity Fabric (XGMI) ===
            # Automatic for MI250X/MI300X intra-node
            # No env var needed; detected via topology
            # === RoCE v2 (inter-node) ===
            - name: NCCL_IB_GID_INDEX
              value: "3"
            - name: NCCL_NET_GDR_LEVEL
              value: "3" # AMD GDR level (check ROCm docs)
            # === Debugging ===
            - name: NCCL_DEBUG
              value: "INFO"
            - name: RCCL_KERNEL_DEBUG
              value: "0"
          resources:
            limits:
              amd.com/gpu: "8"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
      restartPolicy: Never
Benchmarking with nccl-tests
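When reading all_reduce_perf results, the two bandwidth columns are related by a fixed factor: for all_reduce, busbw = algbw × 2(n-1)/n with n ranks. A minimal sketch of that relationship and of the ">90% of theoretical" check follows. The function names are illustrative; the 96.47 GB/s sample and the 4×48 GB/s theoretical figure are taken from this section, and 16 ranks is assumed to match a 2-node, 8-GPU-per-node run.

```python
def allreduce_busbw(algbw_gbs: float, n_ranks: int) -> float:
    """Bus bandwidth for all_reduce: algbw * 2 * (n - 1) / n."""
    if n_ranks < 2:
        raise ValueError("all_reduce bus bandwidth needs at least 2 ranks")
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks


def fraction_of_theoretical(busbw_gbs: float, theoretical_gbs: float) -> float:
    """Measured bus bandwidth as a fraction of the theoretical link bandwidth."""
    return busbw_gbs / theoretical_gbs


if __name__ == "__main__":
    busbw = allreduce_busbw(96.47, n_ranks=16)  # largest message size in the sample
    share = fraction_of_theoretical(busbw, theoretical_gbs=4 * 48)
    print(f"busbw = {busbw:.2f} GB/s, {share:.0%} of theoretical")
    print("meets 90% target:", share > 0.90)
```

Because the factor approaches 2 as n grows, busbw is the number to compare against link speed; algbw alone understates how busy the links are.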
# Build nccl-tests (if not in container image)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu
# Single-node: 8 GPUs, message sizes 8B to 4GB
./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8
# Multi-node via MPI (2 nodes × 8 GPUs): 16 ranks, one GPU per rank (-g 1)
mpirun -np 16 --hostfile hosts \
  -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
  -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
# Expected output interpretation:
#
# size count type redop root time algbw busbw
# (B) (elements) (us) (GB/s) (GB/s)
#
# 8 2 float sum - 25.3 0.00 0.00
# 256 64 float sum - 26.1 0.01 0.02
# 4096 1024 float sum - 28.4 0.14 0.27
# 65536 16384 float sum - 32.1 2.04 3.83
# 1048576 262144 float sum - 48.2 21.76 40.80
# 16777216 4194304 float sum - 215.3 77.93 146.12
# 268435456 67108864 float sum - 2891 92.84 174.08
#4294967296 1073741824 float sum - 44521 96.47 180.88
#
# Key metrics:
# algbw = algorithm bandwidth (data_size / time)
# busbw = bus bandwidth (accounts for collective factor)
#         = algbw × 2(n-1)/n for all_reduce with n GPUs
#
# Targets (inter-node, 4x NDR 400G):
#   busbw ≈ 173-190 GB/s (>90% of 4×48 GB/s theoretical)
#
# Targets (intra-node, NVSwitch H100):
#   busbw ≈ 400-450 GB/s (the 900 GB/s NVLink figure is bidirectional, so ~450 GB/s per direction is the practical ceiling)
Performance Comparison: Transport Impact
Transport              | Latency (µs) | Bandwidth (GB/s) | CPU Overhead
───────────────────────┼──────────────┼──────────────────┼─────────────
NVLink/NVSwitch (H100) | 1-3          | 800-900          | None
XGMI (MI300X)          | 2-5          | 400-500          | None
GPUDirect RDMA (IB)    | 3-8          | 45-48            | Minimal
IB Verbs (host copy)   | 10-20        | 20-24            | Moderate
RoCE v2 + GDR          | 5-12         | 40-45            | Minimal
RoCE v2 (host copy)    | 15-30        | 15-20            | Moderate
TCP (tuned)            | 50-200       | 8-12             | High
TCP (default)          | 100-500      | 2-5              | Very High
GPUDirect RDMA eliminates CPU from the data path:
Without GDR: GPU → CPU memory → NIC → wire → NIC → CPU memory → GPU
With GDR: GPU → NIC → wire → NIC → GPU (zero CPU copies)
GPUDirect RDMA Verification
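Checking a long NCCL_DEBUG=INFO log by eye is error-prone, so the following sketch counts how many network channels were set up with and without GPUDirect RDMA. It assumes channel lines of the form "... via NET/IB/0/GDRDMA" (with the "/GDRDMA" suffix absent when GDR is not used), which is how NCCL typically logs inter-node channels; treat the pattern as an assumption and adjust it to the output of your NCCL version. The function name is hypothetical.

```python
import re

# Assumed log shape: "... [send] via NET/IB/0/GDRDMA" or "... via NET/Socket/0".
_CHANNEL = re.compile(r"via NET/(?P<net>\w+)/(?P<dev>\d+)(?P<gdr>/GDRDMA)?")


def count_gdr_channels(log_text: str) -> tuple[int, int]:
    """Return (channels_with_gdr, channels_without_gdr) found in an NCCL debug log."""
    with_gdr = 0
    without_gdr = 0
    for line in log_text.splitlines():
        match = _CHANNEL.search(line)
        if match is None:
            continue  # not a network channel line
        if match.group("gdr"):
            with_gdr += 1
        else:
            without_gdr += 1
    return with_gdr, without_gdr


if __name__ == "__main__":
    sample = "\n".join([
        "node1:101:201 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA",
        "node1:101:201 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/IB/1",
        "node1:101:201 [0] NCCL INFO Connected all rings",
    ])
    print(count_gdr_channels(sample))
```

A non-zero second number on an RDMA-capable cluster is a hint that some GPU-NIC pairs are too far apart for the configured NCCL_NET_GDR_LEVEL, or that the peer-memory module is not loaded.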
# Verify GDR is enabled (inside GPU pod)
# Check for nvidia_peermem module
lsmod | grep nvidia_peermem
# nvidia_peermem 16384 0
# Or check NCCL debug output for GPUDirect RDMA messages
export NCCL_DEBUG=INFO
python3 -c "
import os
import torch
import torch.distributed as dist
os.environ.update({'MASTER_ADDR':'localhost','MASTER_PORT':'29500','RANK':'0','WORLD_SIZE':'1'})
dist.init_process_group('nccl')
t = torch.zeros(1024*1024).cuda()
dist.all_reduce(t)
dist.destroy_process_group()
" 2>&1 | grep -i "gdr\|gpu direct\|NET/"
# Device list typically looks like: "NET/IB : Using [0]mlx5_0:1/IB ; ..." (suffix is /IB or /RoCE)
# With GDR, multi-node channel lines end in "via NET/IB/0/GDRDMA"
# Without GDR the suffix is missing: "via NET/IB/0"
# A single-process run only exercises init; channel lines need a multi-node job
# Verify peer memory registered (path used by the legacy nv_peer_mem module)
cat /sys/kernel/mm/memory_peers/nv_mem/version
# 2.0
# With nvidia_peermem, check /sys/module/nvidia_peermem/version instead
# Inspect IB device firmware version and port count
ibv_devinfo -d mlx5_0 | grep -i "fw_ver\|phys_port"
NIC-GPU Affinity (Critical for Performance)
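Choosing the NIC for each GPU comes down to ranking the link types that nvidia-smi topo -m prints and taking the closest one. The sketch below does that for an already-parsed matrix; the ranking follows the nvidia-smi legend (PIX, PXB, PHB, NODE, SYS from closest to farthest), and the function name and the dictionary input format are assumptions made for illustration.

```python
# Lower rank means a shorter path between GPU and NIC (order from the nvidia-smi legend).
LINK_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}


def best_nic_per_gpu(topo: dict[str, dict[str, str]]) -> dict[str, tuple[str, str]]:
    """Map each GPU to its closest NIC and the link type between them.

    `topo` maps GPU name -> {NIC name -> link type}, e.g. parsed by hand
    from the GPU rows of `nvidia-smi topo -m`.
    """
    best = {}
    for gpu, links in topo.items():
        # sorted() makes ties deterministic; unknown link types rank last.
        nic = min(sorted(links), key=lambda name: LINK_RANK.get(links[name], 99))
        best[gpu] = (nic, links[nic])
    return best


if __name__ == "__main__":
    # GPU rows of the example matrix in this section.
    topo = {
        "GPU0": {"mlx5_0": "PXB", "mlx5_1": "SYS"},
        "GPU1": {"mlx5_0": "SYS", "mlx5_1": "PXB"},
    }
    for gpu, (nic, link) in best_nic_per_gpu(topo).items():
        print(f"{gpu} -> {nic} ({link})")
```

Any GPU whose best link is SYS is a candidate for the cross-NUMA penalty described in this section, so it is worth flagging those before setting NCCL_IB_HCA.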
Optimal: NIC and GPU on same PCIe root complex / NUMA node
GPU0 ←PCIe→ NIC0 (same NUMA 0) → 48 GB/s with GDR ✓
GPU4 ←PCIe→ NIC2 (same NUMA 1) → 48 GB/s with GDR ✓
Suboptimal: NIC and GPU on different NUMA nodes
GPU0 (NUMA 0) → QPI/UPI → NIC2 (NUMA 1) → ~30 GB/s (30-40% loss)
# Check GPU-NIC affinity
nvidia-smi topo -m
#         GPU0  GPU1  GPU2  GPU3  mlx5_0  mlx5_1  CPU Affinity  NUMA
# GPU0    X     NV18  NV18  NV18  PXB     SYS     0-63          0
# GPU1    NV18  X     NV18  NV18  SYS     PXB     0-63          0
# mlx5_0  PXB   SYS   SYS   SYS   X       SYS     0-63          0
# mlx5_1  SYS   PXB   SYS   SYS   SYS     X       0-63          0
#
# PIX = at most one PCIe bridge (best for GDR)
# PXB = multiple PCIe bridges, without crossing the CPU host bridge (good for GDR)
# SYS = crosses the inter-socket link (QPI/UPI); suboptimal
# NV# = bonded set of # NVLinks
# For AMD GPUs
rocm-smi --showtopo
Kubernetes Network Configurations
# Option 1: Host Network (best performance, least isolation)
apiVersion: v1
kind: Pod
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: trainer
      env:
        - name: NCCL_SOCKET_IFNAME
          value: "=ib0" # Use host IB interface directly
---
# Option 2: SR-IOV VF (near-host performance + isolation)
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: trainer
      resources:
        limits:
          nvidia.com/gpu: "8"
          openshift.io/mlx5-rdma: "1" # SR-IOV VF with RDMA
      env:
        - name: NCCL_IB_HCA
          value: "=mlx5_2" # VF device name in pod
---
# Option 3: Macvlan/IPVLAN (decent performance, simpler setup)
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  containers:
    - name: trainer
      env:
        - name: NCCL_SOCKET_IFNAME
          value: "=net1" # Secondary network interface
---
# Option 4: Pod network only (worst for NCCL, fine for small scale)
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: trainer
      env:
        - name: NCCL_SOCKET_IFNAME
          value: "=eth0"
        - name: NCCL_SOCKET_NTHREADS
          value: "8" # More threads to compensate
        - name: NCCL_NSOCKS_PERTHREAD
          value: "4"
RoCE v2 Tuning (Ethernet-Based RDMA)
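The traffic class value used for RoCE in this section (NCCL_IB_TC=106) packs two fields into one byte: the upper six bits are the DSCP value and the lower two are the ECN bits. A short sketch of the conversion in both directions, with hypothetical helper names:

```python
def split_traffic_class(tc: int) -> tuple[int, int]:
    """Split an 8-bit traffic class into (dscp, ecn)."""
    if not 0 <= tc <= 255:
        raise ValueError("traffic class must fit in 8 bits")
    return tc >> 2, tc & 0b11


def traffic_class_for_dscp(dscp: int, ecn: int = 0) -> int:
    """Build the traffic class byte from a DSCP value (0-63) and ECN bits (0-3)."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("dscp must be 0-63 and ecn 0-3")
    return (dscp << 2) | ecn


if __name__ == "__main__":
    dscp, ecn = split_traffic_class(106)
    print(f"TC 106 -> DSCP {dscp}, ECN bits {ecn:02b}")
    print("DSCP 26 with ECN 10 ->", traffic_class_for_dscp(26, 0b10))
```

The DSCP value has to match the class on which the switches apply PFC and ECN, so it is worth deriving the traffic class from the switch configuration rather than copying a number.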
# RoCE requires careful network configuration
env:
  # RoCE GID index (typically 3 for RoCEv2)
  - name: NCCL_IB_GID_INDEX
    value: "3"
  # Disable adaptive routing if causing issues
  - name: NCCL_IB_ADAPTIVE_ROUTING
    value: "0"
  # Traffic class for DSCP marking
  - name: NCCL_IB_TC
    value: "106" # Maps to DSCP 26 (AF31) for PFC
  # Increase timeout for lossy networks
  - name: NCCL_IB_TIMEOUT
    value: "22" # Higher = more tolerant of drops
  # Service level (priority)
  - name: NCCL_IB_SL
    value: "0"
RoCE v2 switch requirements:
├── PFC (Priority Flow Control) enabled on GPU traffic class
├── ECN (Explicit Congestion Notification) enabled
├── Large buffers for bursty all-reduce traffic
├── Jumbo frames (MTU 9000) recommended
└── DCQCN or DCTCP congestion control
Performance Optimization Checklist
Category     │ Check                                   │ Impact
─────────────┼─────────────────────────────────────────┼──────────
Transport    │ GPUDirect RDMA enabled (nvidia_peermem) │ 2-3x
             │ NVLink/NVSwitch for intra-node          │ 10-30x
             │ InfiniBand > RoCE > TCP                 │ 5-10x
─────────────┼─────────────────────────────────────────┼──────────
Topology     │ NIC-GPU same NUMA/PCIe root             │ 30-40%
             │ NCCL_CROSS_NIC=0 (rail-optimized)       │ 10-20%
             │ Correct NCCL_IB_HCA selection           │ 20-50%
─────────────┼─────────────────────────────────────────┼──────────
Kubernetes   │ hostNetwork or SR-IOV (bypass CNI)      │ 2-5x
             │ /dev/shm large enough (≥ model size)    │ avoid OOM
             │ NUMA-aware scheduling                   │ 10-20%
─────────────┼─────────────────────────────────────────┼──────────
TCP fallback │ SOCKET_NTHREADS × NSOCKS_PERTHREAD ≤ 64 │ 2-4x
             │ BUFFSIZE=8388608 (8MB)                  │ 10-30%
             │ Jumbo frames (MTU 9000)                 │ 10-15%
─────────────┼─────────────────────────────────────────┼──────────
Protocol     │ Let NCCL auto-select (don't force)      │ varies
             │ LL128 for small messages (<256KB)       │ latency
             │ Simple for large messages (>1MB)        │ bandwidth
Monitoring NCCL/RCCL Performance in Production
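The IB port counters used in this section (PortXmitData, PortRcvData) are cumulative, so a throughput figure needs two samples and the time between them. The sketch below assumes the InfiniBand convention that these data counters count 4-byte units rather than bytes; the function name is hypothetical, and counter resets are treated as an error rather than corrected.

```python
def ib_throughput_gbs(first_count: int, second_count: int, interval_s: float,
                      bytes_per_unit: int = 4) -> float:
    """Throughput in GB/s from two PortXmitData/PortRcvData samples.

    Assumes each counter unit is 4 bytes (InfiniBand data counters are
    expressed in 32-bit words).
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = second_count - first_count
    if delta < 0:
        raise ValueError("counter went backwards (reset or wrap)")
    return delta * bytes_per_unit / interval_s / 1e9


if __name__ == "__main__":
    # Two hypothetical samples taken 1 second apart on one NDR port.
    rate = ib_throughput_gbs(1_000_000_000, 13_000_000_000, interval_s=1.0)
    print(f"port throughput: {rate:.1f} GB/s")
```

Comparing this per-port rate against the per-port theoretical figure shows quickly whether one rail is idle while the others are saturated.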
# DCGM metrics for NCCL monitoring
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: gpu-operator
data:
  custom-metrics.csv: |
    # NVLink bandwidth
    DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, NVLink TX bytes
    DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, NVLink RX bytes
    # PCIe bandwidth (for non-NVLink transfers)
    DCGM_FI_PROF_PCIE_TX_BYTES, gauge, PCIe TX bytes
    DCGM_FI_PROF_PCIE_RX_BYTES, gauge, PCIe RX bytes
# Real-time NVLink utilization
nvidia-smi nvlink -gt d -i 0
# GPU 0: NVLink throughput: TX: 45 GB/s, RX: 45 GB/s
# IB port counters (extended, for local device mlx5_0 port 1)
perfquery -x -C mlx5_0 -P 1
# PortXmitData:..............1234567890
# PortRcvData:...............1234567890
# Check for RDMA errors / retransmits
rdma stat show link mlx5_0/1
Common Issues
busbw much lower than expected (50% or less of theoretical)
- Cause: NIC-GPU affinity mismatch (cross-NUMA traffic)
- Fix: Verify with nvidia-smi topo -m; select NICs on the same PCIe root as the GPUs
"NET/IB : No device found" in NCCL debug
- Cause: RDMA device not exposed to container
- Fix: Add rdma/rdma_shared_device_a: "1" to resource limits; verify device plugin
High latency for small messages (>100µs)
- Cause: Using TCP instead of RDMA; or NCCL falling back to shared memory
- Fix: Verify IB/RoCE is active in NCCL_DEBUG=INFO output; check NCCL_NET=IB
RCCL hangs on multi-node all_reduce
- Cause: XGMI detected but no inter-node RDMA configured
- Fix: Set NCCL_IB_HCA explicitly; verify RoCE GID index with ibv_devinfo
Performance degrades at scale (>32 GPUs)
- Cause: Tree algorithm hitting network bottlenecks; or congestion without PFC/ECN
- Fix: Verify switch PFC configuration; check for packet drops with ethtool -S
"Connection refused" between nodes
- Cause: Firewall blocking NCCL ports (random high ports) or IB subnet manager down
- Fix: Use hostNetwork: true; or open port range 40000-50000; verify SM with ibstat
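The issues above can be turned into a first-pass log triage. The following sketch scans NCCL/RCCL debug output for a few of the symptoms listed here and prints the matching hint. The marker strings "No device found" and "Connection refused" come from this section; treating a "NET/Socket" line as a sign of TCP fallback is an assumption about NCCL's log format, and the function name is hypothetical.

```python
# (marker substring, hint) pairs; hints paraphrase the fixes listed in this section.
SYMPTOMS = [
    ("No device found",
     "RDMA device not exposed to the container: check the RDMA resource limit and device plugin"),
    ("Connection refused",
     "Nodes cannot reach each other: check firewall rules, hostNetwork, and the IB subnet manager"),
    ("NET/Socket",
     "NCCL is using TCP sockets: verify IB/RoCE is available and NCCL_NET=IB"),
]


def triage(log_text: str) -> list[str]:
    """Return the hints whose marker appears in the log, in SYMPTOMS order, without duplicates."""
    hints = []
    for marker, hint in SYMPTOMS:
        if marker in log_text and hint not in hints:
            hints.append(hint)
    return hints


if __name__ == "__main__":
    sample = (
        "node1:77:77 [0] NCCL WARN NET/IB : No device found.\n"
        "node1:77:77 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>\n"
    )
    for hint in triage(sample):
        print("-", hint)
```

This only narrows down where to look; affinity and congestion problems do not leave a single log line and still need the topology and switch checks described above.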
Best Practices
- Always use GPUDirect RDMA: it eliminates 2 CPU memory copies per transfer
- Match NIC-GPU NUMA affinity: verify with nvidia-smi topo -m before deploying
- Use hostNetwork or SR-IOV: pod CNI adds 10-50µs latency per transfer
- Benchmark before production: run all_reduce_perf on every new cluster
- Don't force NCCL_ALGO/NCCL_PROTO: auto-selection is optimal 95% of the time
- Size /dev/shm adequately: at least 1GB per GPU for NCCL shared memory
- Enable PFC/ECN for RoCE: without flow control, RoCE drops packets under load
- Use NCCL_TOPO_DUMP_FILE to save the detected topology and NCCL_TOPO_FILE to preload it: this can shorten the 10-30s detection on container start
- Monitor NVLink/PCIe counters: DCGM exposes per-GPU link utilization
- RCCL on AMD: set NCCL_MIN/MAX_NCHANNELS; MI300X benefits from 32 channels
Key Takeaways
- NCCL (NVIDIA) and RCCL (AMD) handle all GPU collective communication; optimizing them is critical for distributed workloads
- GPUDirect RDMA gives 2-3x bandwidth improvement over CPU-mediated transfers
- NIC-GPU PCIe/NUMA affinity is the #1 source of unexpected performance loss
- InfiniBand > RoCE v2 > TCP: each step down is 2-5x slower
- Kubernetes networking (CNI) adds overhead; use hostNetwork or SR-IOV for GPU traffic
- RCCL is API-compatible with NCCL but needs AMD-specific tuning (MSCCL, channel count, XGMI)
- Benchmark target: >90% of theoretical link bandwidth on large messages (≥256MB)
- Auto-selection beats manual algorithm/protocol forcing in almost all cases
- Production monitoring: DCGM NVLink/PCIe counters + IB port perfquery + error rates
