ai · intermediate · ⏱ 20 minutes · K8s 1.28+

Benchmark NCCL AllReduce Performance on Kubernetes

Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Run all_reduce_perf -b 8 -e 2G -f 2 -g 1 in GPU pods and track algbw/busbw across message sizes to validate real cluster throughput.

All-reduce is the dominant communication primitive in data-parallel training: every step synchronizes gradients across all GPUs, so all-reduce throughput bounds scaling efficiency. Benchmarking it gives a fast signal on cluster readiness before you commit to a long training run.

Baseline Command

all_reduce_perf -b 8 -e 2G -f 2 -g 1

Here -b 8 starts the sweep at 8 bytes, -e 2G ends it at 2 GiB, -f 2 doubles the message size at each step, and -g 1 uses one GPU per process. Run the same sweep across these topologies, from simplest to most representative:
  • Single node, 2 GPUs
  • Single node, all local GPUs
  • Two nodes, 1 GPU per node
  • Two nodes, multiple GPUs per node
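For the single-node cases, a throwaway pod that requests GPUs is enough. This is a minimal sketch: the image name and registry are assumptions (use whatever image in your environment ships the nccl-tests binaries), and the GPU count matches the two-GPU case above.

```shell
# One-off benchmark pod. The image is a placeholder; point it at a build
# of nccl-tests available in your registry. Requests 2 GPUs via the
# NVIDIA device plugin resource name.
kubectl run nccl-allreduce --rm -it --restart=Never \
  --image=ghcr.io/example/nccl-tests:latest \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "nccl-allreduce",
      "image": "ghcr.io/example/nccl-tests:latest",
      "command": ["all_reduce_perf", "-b", "8", "-e", "2G", "-f", "2", "-g", "2"],
      "resources": {"limits": {"nvidia.com/gpu": 2}}
    }]
  }
}'
```

For the multi-node cases you need one process per node with a shared rendezvous, typically via an MPI or training operator rather than a bare pod.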

What to Capture

  • algbw (algorithm bandwidth) and busbw (bus bandwidth) at each message size
  • GPU model and driver version
  • Node pair tested
  • CNI/network path used

Pass Criteria

  • Stable bandwidth at medium and large message sizes
  • No repeated NCCL transport warnings
  • Inter-node results align with link capabilities
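To judge whether results "align with link capabilities," use the relationship nccl-tests itself applies for all-reduce: a ring all-reduce moves 2(n-1)/n of the data per rank, so busbw = algbw × 2(n-1)/n, and busbw at large messages should approach the physical link bandwidth. A small sketch of the arithmetic, with an illustrative algbw value:

```shell
# Sanity-check the algbw -> busbw conversion for ring all-reduce:
# busbw = algbw * 2*(n-1)/n, where n is the number of ranks.
# Example values are illustrative: algbw = 26.84 GB/s across n = 8 GPUs.
awk -v algbw=26.84 -v n=8 'BEGIN { printf "busbw = %.2f GB/s\n", algbw * 2 * (n - 1) / n }'
```

If the computed busbw sits well below your NVLink or NIC line rate at 256 MiB+ messages, suspect the transport (e.g. NCCL falling back to TCP instead of RDMA) before suspecting the GPUs.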
#nccl #allreduce #gpu #benchmark #kubernetes
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
