Troubleshooting intermediate ⏱ 15 minutes K8s 1.28+

Validate GPU and NIC Topology Before NCCL Benchmarks

Inspect node-level GPU, NIC, and PCI topology on Kubernetes workers to predict and explain NCCL benchmark performance before running tests.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Run nvidia-smi topo -m and map GPUs and NICs with lspci before benchmarking; poor physical topology often explains low NCCL bandwidth even when there is no software bug.

Knowing the physical topology up front keeps you from blaming NCCL, drivers, or the network stack for bandwidth limits that the hardware layout imposes.

Commands to Run

nvidia-smi topo -m
lspci | grep -Ei 'NVIDIA|Mellanox|Ethernet|Infiniband'
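
lspci shows which devices exist, but not which NUMA node each one is attached to. The following is a minimal sketch, assuming a Linux worker where sysfs exposes `vendor` and `numa_node` files under /sys/bus/pci/devices. The helper name `pci_numa_map` and its optional directory argument are hypothetical, introduced only for illustration. It lists NVIDIA (vendor ID 0x10de) and Mellanox (vendor ID 0x15b3) PCI functions with their NUMA node.

```shell
# Hypothetical helper: print "<pci-address> <kind> numa=<node>" for NVIDIA and
# Mellanox PCI functions. $1 optionally overrides the sysfs directory (an
# assumption, handy for testing); the default is the standard Linux location.
pci_numa_map() {
  pnm_root="${1:-/sys/bus/pci/devices}"
  for pnm_dev in "$pnm_root"/*; do
    [ -r "$pnm_dev/vendor" ] || continue
    case "$(cat "$pnm_dev/vendor")" in
      0x10de) pnm_kind="nvidia" ;;    # GPUs, but also NVSwitch or audio functions
      0x15b3) pnm_kind="mellanox" ;;  # ConnectX / BlueField adapters
      *) continue ;;
    esac
    # numa_node reads -1 when the platform reports no NUMA affinity
    pnm_numa="$(cat "$pnm_dev/numa_node" 2>/dev/null || echo unknown)"
    printf '%s %s numa=%s\n' "${pnm_dev##*/}" "$pnm_kind" "$pnm_numa"
  done
}
```

Calling `pci_numa_map` with no argument on the worker node prints one line per matching device. A GPU and the NIC it should use reporting different NUMA nodes is a hint that traffic between them crosses the inter-socket link.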

What to Confirm

  • GPUs used by your pod are local to the expected PCI root complex.
  • High-speed NICs sit on the same NUMA node or PCIe switch as the GPUs they serve, rather than across the inter-socket link.
  • Node hardware is homogeneous across benchmark participants.
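
The first two checks can be partly automated by scanning the nvidia-smi topo -m matrix for pairs connected by an unwanted link type. In that tool's legend, SYS marks a path that crosses the interconnect between NUMA nodes, which is usually the slowest option. The sketch below is a hypothetical helper, not part of nvidia-smi: the name `topo_pairs_with_link` is made up, and it assumes the matrix arrives as plain text without terminal escape codes in the header row.

```shell
# Hypothetical helper: read `nvidia-smi topo -m` output on stdin and print each
# device pair whose link type equals $1 (default SYS), one pair per line.
# Assumes a plain-text matrix with no terminal escape codes in the header.
topo_pairs_with_link() {
  awk -v link="${1:-SYS}" '
    BEGIN { dev = "^(GPU|NIC)[0-9]+$" }
    # The first line starting with a device name is the column header.
    !hdr && $1 ~ dev {
      for (i = 1; i <= NF && $i ~ dev; i++) col[i] = $i
      n = i - 1; hdr = 1; next
    }
    # Matrix rows: field i+1 holds the link to column i. Only the upper
    # triangle is scanned so each symmetric pair is reported once.
    hdr && $1 ~ dev {
      r++
      for (i = r + 1; i <= n; i++) if ($(i + 1) == link) print $1, col[i]
    }
  '
}
```

Typical use is `nvidia-smi topo -m | topo_pairs_with_link SYS`. Saving the raw matrix from every participating node and diffing the files is a simple way to approach the third check, hardware homogeneity.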

Practical Outcome

Use topology results to define realistic performance targets for intra-node and inter-node NCCL tests.
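
As one rough way to set an inter-node target, the sketch below converts NIC line rate into a per-node ceiling in GB/s. This is an assumption-laden upper bound rather than a prediction: it ignores protocol overhead and assumes every NIC is usable by NCCL and reachable over a good PCI path. The helper name `nic_ceiling_gbytes` and its arguments are hypothetical.

```shell
# Hypothetical helper: convert NIC line rate (Gbit/s, $1) times NIC count
# ($2, default 1) into a GB/s ceiling. Ignores protocol overhead, so
# measured bandwidth is expected to land below this number.
nic_ceiling_gbytes() {
  LC_ALL=C awk -v rate="$1" -v nics="${2:-1}" \
    'BEGIN { printf "%.1f\n", rate * nics / 8 }'
}
```

For example, `nic_ceiling_gbytes 100` prints 12.5, the most a single 100 Gbit/s link could carry in GB/s. If the topology check shows GPU to NIC paths crossing sockets, expect results further below the ceiling.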

#nccl #topology #pci #gpu #troubleshooting
