Benchmark NCCL AllReduce Performance on Kubernetes
Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.
💡 Quick Answer: Run `all_reduce_perf -b 8 -e 2G -f 2 -g 1` in GPU pods and track `algbw`/`busbw` over message sizes to validate real cluster throughput.
All-reduce is the key communication primitive for data-parallel training. This test gives a fast signal on cluster readiness.
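In practice, the quickest way to get this signal is to run the sweep inside a GPU pod that already has the NVIDIA nccl-tests binaries. A minimal sketch, assuming a pod named `gpu-worker-0` (a placeholder — substitute your own pod) with `all_reduce_perf` on its `PATH`:

```shell
# Run the baseline all-reduce sweep inside an existing GPU pod.
# Pod name and binary location are assumptions -- adapt to your cluster.
kubectl exec -it gpu-worker-0 -- \
  all_reduce_perf -b 8 -e 2G -f 2 -g 1
```

The `-b 8 -e 2G -f 2` flags sweep message sizes from 8 bytes to 2 GiB, doubling each step, so you see both latency-bound small messages and bandwidth-bound large ones.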
Baseline Command
```shell
all_reduce_perf -b 8 -e 2G -f 2 -g 1
```

Recommended Matrix
- Single node, 2 GPUs
- Single node, all local GPUs
- Two nodes, 1 GPU per node
- Two nodes, multiple GPUs per node
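For the multi-node rows of the matrix, nccl-tests is typically launched through MPI, one rank per GPU. A hedged sketch of the two-node, one-GPU-per-node case — hostnames and slot counts are placeholders for your own cluster:

```shell
# Two nodes, 1 GPU per node: one MPI rank per GPU.
# node-a/node-b are placeholder hostnames; requires an MPI-enabled
# build of nccl-tests and MPI connectivity between the pods/nodes.
mpirun -np 2 -H node-a:1,node-b:1 \
  -x NCCL_DEBUG=WARN \
  all_reduce_perf -b 8 -e 2G -f 2 -g 1
```

`NCCL_DEBUG=WARN` keeps the log quiet but still surfaces the transport warnings the pass criteria below care about.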
What to Capture
- `algbw` and `busbw`
- GPU model and driver version
- Node pair tested
- CNI/network path used
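To capture `algbw`/`busbw` per message size from the test output, a small parser helps. The field positions below are taken from a recent nccl-tests log layout (size, count, type, redop, root, then out-of-place and in-place time/algbw/busbw columns) — verify them against the `#` header line your version prints:

```shell
# Extract size, out-of-place algbw, and busbw from all_reduce_perf output.
# Field indices assume a recent nccl-tests column layout -- check the
# '#' header line of your log before relying on them.
parse_nccl() {
  awk '!/^#/ && NF >= 12 { printf "%s %s %s\n", $1, $(NF-6), $(NF-5) }'
}

# Illustrative sample line (not real measurements):
parse_nccl <<'EOF'
#    size  count  type redop root  time algbw busbw #wrong  time algbw busbw #wrong
 134217728 33554432 float sum  -1 2100.5 63.90 111.82    0 2098.7 63.96 111.93     0
EOF
# prints: 134217728 63.90 111.82
```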
Pass Criteria
- Stable bandwidth at medium and large message sizes
- No repeated NCCL transport warnings
- Inter-node results align with link capabilities
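When checking inter-node results against link capabilities, compare `busbw`, not `algbw`: for all-reduce, nccl-tests reports `busbw = algbw × 2(n−1)/n`, where `n` is the number of ranks, and it is `busbw` that should approach the physical link bandwidth. A quick sanity calculation (the values are illustrative):

```shell
# Ring all-reduce bus bandwidth from algorithm bandwidth:
# busbw = algbw * 2*(n-1)/n, where n is the number of ranks.
n=8; algbw=42.0   # illustrative numbers, not a measurement
busbw=$(awk -v n="$n" -v a="$algbw" 'BEGIN { printf "%.2f", a * 2 * (n - 1) / n }')
echo "expected busbw: $busbw GB/s"
# prints: expected busbw: 73.50 GB/s
```

If the measured `busbw` at large message sizes lands well below your NIC or NVLink line rate, suspect the network path before suspecting NCCL.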
