Troubleshooting · Intermediate · ⏱ 15 minutes · K8s 1.28+

Validate GPU and NIC Topology Before NCCL Benchmarks

Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.

By Luca Berton • 📖 5 min read

πŸ’‘ Quick Answer: Run nvidia-smi topo -m and lspci mapping checks first; poor physical topology often explains low NCCL bandwidth without any software bug.

Topology awareness prevents false conclusions during NCCL troubleshooting.

Commands to Run

nvidia-smi topo -m
lspci | grep -Ei 'NVIDIA|Mellanox|Ethernet|Infiniband'
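The value of `nvidia-smi topo -m` is its link-type legend: NV# means NVLink, PIX a path through a single PCIe bridge, PXB multiple bridges, PHB the PCIe host bridge, and NODE/SYS paths that cross NUMA nodes or CPU sockets. As a rough sketch of how to spot the slow paths (the matrix below is an illustrative sample, not output from a real node), you can count the cross-socket SYS entries:

```shell
#!/bin/sh
# Illustrative sample of an `nvidia-smi topo -m` matrix (hypothetical node
# with 2 GPUs and 1 NIC; on a real node, pipe the actual command output).
sample='        GPU0    GPU1    NIC0
GPU0     X      NV12    PXB
GPU1    NV12     X      SYS
NIC0    PXB     SYS      X'

# SYS entries mean traffic crosses the inter-socket interconnect, the
# slowest path in the matrix and a common cause of low NCCL bus bandwidth.
sys_links=$(printf '%s\n' "$sample" | grep -c 'SYS')
echo "SYS (cross-socket) link rows: $sys_links"
```

Any nonzero count here is worth explaining before you blame NCCL itself.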

What to Confirm

  • GPUs used by your pod are local to the expected PCI root complex.
  • High-speed NICs are attached to suitable CPU/PCI paths.
  • Node hardware is homogeneous across benchmark participants.

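One way to confirm GPU/NIC locality without parsing the matrix is to read each device's NUMA node directly from sysfs (`/sys/bus/pci/devices/<addr>/numa_node` is standard Linux; the helper name and PCI address below are hypothetical examples for this sketch):

```shell
#!/bin/sh
# numa_of: print the NUMA node of a PCI device given a sysfs root and a
# PCI address, or "unknown" if the attribute is missing/unreadable.
# (Hypothetical helper; addresses come from `lspci -D` on the real node.)
numa_of() {
  root="$1"; dev="$2"
  f="$root/$dev/numa_node"
  [ -r "$f" ] && cat "$f" || echo "unknown"
}

# On a real node, for each GPU and NIC address from lspci, e.g.:
#   numa_of /sys/bus/pci/devices 0000:17:00.0
```

A GPU and its intended NIC reporting different NUMA nodes predicts degraded inter-node NCCL bandwidth on that pairing.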
Practical Outcome

Use topology results to define realistic performance targets for intra-node and inter-node NCCL tests.
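When setting those targets, note that the nccl-tests binaries report a bus bandwidth (busBW) derived from the measured algorithm bandwidth; for all-reduce the correction factor is 2(n-1)/n, where n is the number of ranks. A minimal sketch of that arithmetic (the numbers are illustrative, not measurements):

```shell
#!/bin/sh
# Sketch of the bus-bandwidth correction nccl-tests applies for all-reduce:
#   busBW = algBW * 2*(n-1)/n   (n = number of ranks)
algbw=100   # GB/s, hypothetical measured algorithm bandwidth
n=8         # hypothetical rank count (e.g. 8 GPUs in one node)
busbw=$(awk -v a="$algbw" -v n="$n" 'BEGIN { printf "%.1f", a * 2*(n-1)/n }')
echo "busBW: $busbw GB/s"
```

Compare the resulting busBW against the slowest link the topology check revealed; if they match, the "slow" benchmark is behaving exactly as the hardware allows.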

#nccl #topology #pci #gpu #troubleshooting
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
