Deployments · Advanced · ⏱ 35 minutes · K8s 1.28+

Run NCCL Tests with MPIJob on Kubernetes

Launch multi-pod NCCL benchmarks using MPIJob on Kubernetes for repeatable, automated distributed GPU communication testing across nodes.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Use an MPIJob with one launcher and N workers, then execute all_reduce_perf through mpirun to test real multi-pod communication paths.

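MPIJob, the custom resource provided by the Kubeflow MPI Operator, gives you a repeatable way to run multi-process NCCL tests across pods and nodes: the launcher pod starts mpirun, and each worker pod hosts the ranks that exercise the GPU-to-GPU communication path.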

Minimal Flow

  1. Create an MPIJob with launcher and worker replicas.
  2. Request one GPU per worker pod.
  3. Run mpirun ... all_reduce_perf from the launcher pod.
  4. Collect logs from launcher and workers.
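
The flow above could be sketched as a small script that writes an MPIJob manifest to disk. This is a minimal sketch under several assumptions that are not part of the recipe: the Kubeflow MPI Operator v2beta1 API is installed, the job name nccl-allreduce and the image name are hypothetical placeholders, and the image ships Open MPI and the nccl-tests binaries and starts sshd by default on the workers. Depending on the image, mpirun may need extra flags (for example to run as root).

```shell
#!/bin/sh
# Sketch: write a minimal MPIJob manifest for a 4-worker NCCL all-reduce test.
# Assumptions: MPI Operator v2beta1 API; IMAGE is a hypothetical placeholder
# that must contain Open MPI, sshd, and the nccl-tests binaries.
IMAGE="example.registry/nccl-tests:latest"   # hypothetical image name
WORKERS=4

cat > nccl-mpijob.yaml <<EOF
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-allreduce          # hypothetical job name
spec:
  slotsPerWorker: 1             # one MPI slot (one GPU) per worker pod
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:                   # step 1 and 3: launcher runs mpirun
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: ${IMAGE}
            command: ["mpirun"]
            args: ["-np", "${WORKERS}", "-N", "1",
                   "all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "1"]
    Worker:
      replicas: ${WORKERS}
      template:
        spec:
          containers:
          - name: worker
            image: ${IMAGE}
            resources:
              limits:
                nvidia.com/gpu: 1   # step 2: one GPU per worker pod
EOF
echo "wrote nccl-mpijob.yaml"
```

The generated file could then be submitted with kubectl apply -f nccl-mpijob.yaml, and step 4 amounts to running kubectl logs against the launcher pod (and the worker pods if a rank fails to join).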

Suggested Command

mpirun -np 4 -N 1 all_reduce_perf -b 8 -e 1G -f 2 -g 1
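
To make the relationship between the job shape and the flags explicit, the command can be derived from the worker count. The sketch below assumes Open MPI's mpirun (where -N sets ranks per node) and the standard nccl-tests options; build_nccl_cmd is an illustrative helper name, not part of any tool.

```shell
#!/bin/sh
# Build the mpirun command line from the job shape.
# build_nccl_cmd is a hypothetical helper used only for illustration.
build_nccl_cmd() {
  workers=$1            # number of worker pods
  procs_per_worker=$2   # MPI ranks started in each worker pod
  np=$((workers * procs_per_worker))   # total MPI ranks

  # -np: total ranks            -N: ranks per node (Open MPI)
  # -b / -e: smallest / largest message size
  # -f: multiplication factor between message sizes
  # -g: GPUs per thread (nccl-tests runs one thread per rank by default)
  echo "mpirun -np ${np} -N ${procs_per_worker} all_reduce_perf -b 8 -e 1G -f 2 -g 1"
}

# Four workers with one rank each reproduces the suggested command.
build_nccl_cmd 4 1
```

With four single-GPU workers this prints the command shown above; scaling to more workers only changes the -np value.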

Validation

  • All workers join the run successfully.
  • No transport or rendezvous failures.
  • Bandwidth trends are consistent across repeated runs.
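
These checks could be partly automated against saved launcher logs. The following is a rough sketch, assuming the logs contain the "Avg bus bandwidth" summary line that nccl-tests prints at the end of a run and that NCCL problems surface as "NCCL WARN" lines (which depends on the NCCL_DEBUG level in use). The function names are illustrative, and any acceptable spread threshold is left for you to choose.

```shell
#!/bin/sh
# Print the average bus bandwidth (GB/s) reported at the end of an nccl-tests log.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Succeed (exit 0) only if the log contains no NCCL warning lines.
log_is_clean() {
  ! grep -q 'NCCL WARN' "$1"
}

# Relative spread (max - min) / max of the average bus bandwidth across
# several run logs, printed as a percentage with one decimal place.
busbw_spread_pct() {
  for f in "$@"; do
    avg_busbw "$f"
  done | awk '
    { v = $1 + 0 }                       # force numeric comparison
    NR == 1 { min = v; max = v }
    { if (v < min) min = v; if (v > max) max = v }
    END { if (NR > 0 && max > 0) printf "%.1f\n", (max - min) / max * 100 }'
}
```

A typical use would be to save the launcher output of each repeated run to its own file, confirm log_is_clean succeeds for every file, and then compare the value from busbw_spread_pct against whatever tolerance suits your cluster.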

When to Use

  • Before enabling distributed training in production
  • After network changes on GPU nodes
  • As a periodic cluster health check