Validate GPU and NIC Topology Before NCCL Benchmarks
Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.
💡 Quick Answer: Run nvidia-smi topo -m and lspci mapping checks first; poor physical topology often explains low NCCL bandwidth without any software bug.
Topology awareness prevents false conclusions during NCCL troubleshooting.
Commands to Run
nvidia-smi topo -m
lspci | grep -Ei 'NVIDIA|Mellanox|Ethernet|Infiniband'
What to Confirm
- GPUs used by your pod are local to the expected PCI root complex.
- High-speed NICs sit on the same NUMA node/PCI root complex as the GPUs they serve.
- Node hardware is homogeneous across benchmark participants.
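The SYS entries in the nvidia-smi topo -m matrix mark GPU pairs whose traffic crosses the CPU interconnect instead of NVLink or a PCIe switch, the usual culprit for low intra-node bandwidth. A minimal sketch for flagging such pairs; the sample_topo heredoc below is hypothetical output for illustration, so on a real node pipe the actual nvidia-smi topo -m through flag_sys_pairs instead.

```shell
#!/bin/sh
# Hypothetical sample of `nvidia-smi topo -m` output for a 4-GPU node;
# on a real node, replace this function with: nvidia-smi topo -m
sample_topo() {
cat <<'EOF'
	GPU0	GPU1	GPU2	GPU3
GPU0	X	NV2	SYS	SYS
GPU1	NV2	X	SYS	SYS
GPU2	SYS	SYS	X	NV2
GPU3	SYS	SYS	NV2	X
EOF
}

# Print every GPU pair whose link type is SYS (cross-socket path).
flag_sys_pairs() {
  awk '
    NR == 1 { for (c = 1; c <= NF; c++) hdr[c] = $c; next }
    /^GPU/ {
      for (c = 2; c <= NF; c++)
        if ($c == "SYS")
          print $1, "<->", hdr[c-1], ": SYS path, expect reduced bandwidth"
    }'
}

sample_topo | flag_sys_pairs
```

Real nvidia-smi topo -m output adds NIC and CPU-affinity columns plus a legend, so treat this as a starting point rather than a robust parser.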
Practical Outcome
Use topology results to define realistic performance targets for intra-node and inter-node NCCL tests.
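One quick heuristic for the NIC-placement check: on many dual-socket servers, PCIe bus numbers below 0x80 hang off socket 0 and those at or above 0x80 off socket 1. That split is an assumption that varies by platform; the authoritative answer comes from /sys/bus/pci/devices/0000:<addr>/numa_node. The sample_lspci heredoc below is hypothetical lspci output used only to make the sketch self-contained.

```shell
#!/bin/sh
# Hypothetical `lspci` excerpt for a dual-socket node; on a real node use:
#   lspci | grep -Ei 'NVIDIA|Mellanox|Ethernet|Infiniband'
sample_lspci() {
cat <<'EOF'
17:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB]
31:00.0 Ethernet controller: Mellanox Technologies MT28908 [ConnectX-6]
b1:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB]
ca:00.0 Ethernet controller: Mellanox Technologies MT28908 [ConnectX-6]
EOF
}

# Guess the socket from the PCI bus number (assumed 0x00-0x7f -> socket 0,
# 0x80-0xff -> socket 1; platform-specific, so confirm against
# /sys/bus/pci/devices/0000:<addr>/numa_node on real hardware).
guess_socket() {
  while read -r addr rest; do
    bus=$(( 0x${addr%%:*} ))
    printf '%s socket %d: %s\n' "$addr" $(( bus < 128 ? 0 : 1 )) "$rest"
  done
}

sample_lspci | guess_socket
```

GPUs and NICs that land on the same socket can use short PCI paths for inter-node traffic; a GPU whose nearest NIC sits on the other socket pays a cross-socket penalty in inter-node NCCL tests.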
