Run NCCL Tests with MPIJob on Kubernetes
Launch multi-pod NCCL benchmarks using MPIJob on Kubernetes for repeatable, automated distributed GPU communication testing across nodes.
💡 Quick Answer: Use an MPIJob with one launcher and N workers, then execute all_reduce_perf through mpirun to test real multi-pod communication paths.
MPIJob provides a repeatable way to run multi-process NCCL tests across pods and nodes.
Minimal Flow
- Create an MPIJob with launcher and worker replicas.
- Request one GPU per worker pod.
- Run mpirun ... all_reduce_perf from the launcher.
- Collect logs from launcher and workers.
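The flow above can be sketched as a manifest written from the shell. This is a minimal sketch, not a definitive configuration. It assumes the Kubeflow MPI Operator (API version kubeflow.org/v2beta1) is installed, that GPUs are exposed as nvidia.com/gpu, and that the container image bundles Open MPI, an SSH server, and the nccl-tests binaries on PATH. The job name nccl-allreduce, the file name, and the image example.com/nccl-tests:latest are placeholders, not values from this recipe.

```shell
# Write a minimal MPIJob manifest: one launcher, four workers, one GPU each.
# The image and job name below are hypothetical placeholders.
cat > nccl-mpijob.yaml <<'EOF'
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-allreduce
spec:
  slotsPerWorker: 1            # one MPI rank per worker pod
  runPolicy:
    cleanPodPolicy: Running    # remove still-running pods once the job ends
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example.com/nccl-tests:latest   # placeholder image
            command: ["mpirun"]
            args: ["-np", "4", "-N", "1",
                   "all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "1"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker
            image: example.com/nccl-tests:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1                  # one GPU per worker pod
EOF

# Submit the job and collect logs (needs kubectl and cluster access):
#   kubectl apply -f nccl-mpijob.yaml
#   kubectl logs -f job/nccl-allreduce-launcher
#   kubectl logs nccl-allreduce-worker-0
```

The launcher requests no GPU because it only starts the ranks on the workers. Keep the -np value equal to the worker replica count multiplied by slotsPerWorker, otherwise mpirun will either leave workers idle or fail to place all ranks.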
Suggested Command
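mpirun -np 4 -N 1 all_reduce_perf -b 8 -e 1G -f 2 -g 1
With Open MPI, -np 4 starts four ranks in total and -N 1 places one rank per worker pod. The all_reduce_perf flags -b 8 -e 1G -f 2 sweep message sizes from 8 bytes to 1 GiB, doubling at each step, and -g 1 uses one GPU per rank.
Validation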
- All workers join the run successfully.
- No transport or rendezvous failures.
- Bandwidth trends are consistent across repeated runs.
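These checks can be partly automated against saved launcher logs. The helpers below are a rough sketch under two assumptions about log format: that all_reduce_perf ends its output with a summary line containing "Avg bus bandwidth", and that NCCL reports problems on lines containing "NCCL WARN". The function names are hypothetical, and the patterns may need adjusting for your nccl-tests and NCCL versions.

```shell
# Print the average bus bandwidth (GB/s) from an all_reduce_perf log.
# Assumes a summary line such as "# Avg bus bandwidth    : 12.5".
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $NF); print $NF }' "$1"
}

# Succeed only if the log has a bandwidth summary and no NCCL warnings.
check_run() {
  log="$1"
  if grep -q 'NCCL WARN' "$log"; then
    echo "FAIL: NCCL warnings in $log"
    return 1
  fi
  bw=$(avg_busbw "$log")
  if [ -z "$bw" ]; then
    echo "FAIL: no bandwidth summary in $log"
    return 1
  fi
  echo "OK: $log avg busbw $bw GB/s"
}

# Print the absolute difference between two runs as a percentage of the first.
# Both logs must contain a bandwidth summary.
busbw_delta_pct() {
  a=$(avg_busbw "$1")
  b=$(avg_busbw "$2")
  awk -v a="$a" -v b="$b" \
    'BEGIN { d = (a > b) ? a - b : b - a; printf "%.1f\n", 100 * d / a }'
}
```

A possible usage, assuming a job named nccl-allreduce (a placeholder name): save each run with kubectl logs job/nccl-allreduce-launcher > run1.log, call check_run run1.log, and compare repeated runs with busbw_delta_pct run1.log run2.log. What counts as an acceptable difference depends on your fabric, so pick a threshold from your own baseline runs.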
When to Use
- Before enabling distributed training in production
- After network changes on GPU nodes
- As a periodic cluster health check

