AI • Intermediate • ⏱ 20 minutes • K8s 1.28+

Benchmark NCCL AllReduce Performance on Kubernetes

Measure NCCL AllReduce bandwidth and latency on Kubernetes to validate distributed training network performance across multi-GPU clusters.

By Luca Berton • πŸ“– 5 min read

πŸ’‘ Quick Answer: Run all_reduce_perf -b 8 -e 2G -f 2 -g 1 in GPU pods and track algbw/busbw over message sizes to validate real cluster throughput.

All-reduce is the key communication primitive for data-parallel training. This test gives a fast signal on cluster readiness.

Baseline Command

all_reduce_perf -b 8 -e 2G -f 2 -g 1

This sweeps message sizes from 8 bytes (-b 8) to 2 GB (-e 2G), doubling the size at each step (-f 2), with one GPU per process (-g 1). Run the sweep across these topologies:
  • Single node, 2 GPUs
  • Single node, all local GPUs
  • Two nodes, 1 GPU per node
  • Two nodes, multiple GPUs per node

What to Capture

  • algbw and busbw
  • GPU model and driver version
  • Node pair tested
  • CNI/network path used

Pass Criteria

  • Stable bandwidth at medium and large message sizes
  • No repeated NCCL transport warnings
  • Inter-node results align with link capabilities
#nccl #allreduce #gpu #benchmark #kubernetes
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
