ai · intermediate · ⏱ 20 minutes · K8s 1.28+

Benchmark NCCL AllReduce Performance on Kubernetes

Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Run all_reduce_perf -b 8 -e 2G -f 2 -g 1 in GPU pods and track algbw/busbw across message sizes to validate real cluster throughput.

All-reduce is the dominant communication primitive in data-parallel training: every step synchronizes gradients across all GPUs, so all-reduce throughput bounds scaling efficiency. Benchmarking it gives a fast signal on cluster readiness before you commit to a long training run.

Baseline Command

all_reduce_perf -b 8 -e 2G -f 2 -g 1

Here -b 8 starts the sweep at 8 bytes, -e 2G ends it at 2 GiB, -f 2 doubles the message size at each step, and -g 1 uses one GPU per process. Run the same sweep across these topologies, from simplest to most representative:
  • Single node, 2 GPUs
  • Single node, all local GPUs
  • Two nodes, 1 GPU per node
  • Two nodes, multiple GPUs per node
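For the single-node cases, a throwaway pod that requests GPUs is enough. This is a minimal sketch: the image name and registry are assumptions (use whatever image in your environment ships the nccl-tests binaries), and the GPU count matches the two-GPU case above.

```shell
# One-off benchmark pod. The image is a placeholder; point it at a build
# of nccl-tests available in your registry. Requests 2 GPUs via the
# NVIDIA device plugin resource name.
kubectl run nccl-allreduce --rm -it --restart=Never \
  --image=ghcr.io/example/nccl-tests:latest \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "nccl-allreduce",
      "image": "ghcr.io/example/nccl-tests:latest",
      "command": ["all_reduce_perf", "-b", "8", "-e", "2G", "-f", "2", "-g", "2"],
      "resources": {"limits": {"nvidia.com/gpu": 2}}
    }]
  }
}'
```

For the multi-node cases you need one process per node with a shared rendezvous, typically via an MPI or training operator rather than a bare pod.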

What to Capture

  • algbw (algorithm bandwidth) and busbw (bus bandwidth) at each message size
  • GPU model and driver version
  • Node pair tested
  • CNI/network path used

Pass Criteria

  • Stable bandwidth at medium and large message sizes
  • No repeated NCCL transport warnings
  • Inter-node results align with link capabilities
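To judge whether results "align with link capabilities," use the relationship nccl-tests itself applies for all-reduce: a ring all-reduce moves 2(n-1)/n of the data per rank, so busbw = algbw × 2(n-1)/n, and busbw at large messages should approach the physical link bandwidth. A small sketch of the arithmetic, with an illustrative algbw value:

```shell
# Sanity-check the algbw -> busbw conversion for ring all-reduce:
# busbw = algbw * 2*(n-1)/n, where n is the number of ranks.
# Example values are illustrative: algbw = 26.84 GB/s across n = 8 GPUs.
awk -v algbw=26.84 -v n=8 'BEGIN { printf "busbw = %.2f GB/s\n", algbw * 2 * (n - 1) / n }'
```

If the computed busbw sits well below your NVLink or NIC line rate at 256 MiB+ messages, suspect the transport (e.g. NCCL falling back to TCP instead of RDMA) before suspecting the GPUs.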
#nccl #allreduce #gpu #benchmark #kubernetes
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
