Diagnose GPU Peer-to-Peer Latency with NCCL Tests
Use NCCL point-to-point and collective tests to isolate latency issues between GPU pairs in multi-node Kubernetes clusters.
💡 Quick Answer: Compare latency with small-message runs such as all_reduce_perf -b 8 -e 8M -f 2 -g 1 across different GPU pairs and nodes to identify outliers.
High latency usually points to topology or transport path issues.
Fast Latency Test
all_reduce_perf -b 8 -e 8M -f 2 -g 1
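To exercise a specific GPU pair inside one pod, a variant of the same run can pin two GPUs with CUDA_VISIBLE_DEVICES; sendrecv_perf from the same nccl-tests suite gives a pure point-to-point number. A minimal sketch, assuming a pod named nccl-test with the nccl-tests binaries on its PATH (both names are placeholders):

```bash
# All-reduce latency across two GPUs in one pod (-g 2 = two GPUs in one process).
kubectl exec nccl-test -- all_reduce_perf -b 8 -e 8M -f 2 -g 2

# Point-to-point latency for one specific GPU pair (here GPU 0 and GPU 3).
kubectl exec nccl-test -- env CUDA_VISIBLE_DEVICES=0,3 sendrecv_perf -b 8 -e 8M -f 2 -g 2

# NCCL_DEBUG=INFO shows which transport (NVLink/P2P, SHM, NET) each pair ends up on.
kubectl exec nccl-test -- env NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,3 sendrecv_perf -b 8 -e 8M -f 2 -g 2
```

Repeating the sendrecv_perf run for each pair and comparing the small-message latency column makes outliers obvious.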
Isolation Strategy
- Test within one node first.
- Test cross-node with same pod specs.
- Repeat with pinned nodes and interfaces (see the sketch below).
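A minimal sketch of the three steps, assuming an MPI-built nccl-tests image and a launcher that can reach both worker pods (for example via the MPI Operator); the pod names, the eth0 interface, and the mlx5_0 device are placeholders:

```bash
# 1. Intra-node baseline: two GPUs inside a single pod.
kubectl exec nccl-a -- all_reduce_perf -b 8 -e 8M -f 2 -g 2

# 2. Cross-node: one GPU per pod, identical pod specs on each node.
mpirun -np 2 -H nccl-a,nccl-b \
  -x NCCL_DEBUG=INFO \
  all_reduce_perf -b 8 -e 8M -f 2 -g 1

# 3. Same run with the data interface and RDMA device pinned so repeats are
#    comparable (node pinning itself happens in the pod spec via nodeName or
#    nodeSelector, not on the command line).
mpirun -np 2 -H nccl-a,nccl-b \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_IB_HCA=mlx5_0 \
  all_reduce_perf -b 8 -e 8M -f 2 -g 1
```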
Correlate With Topology
Inside each pod:
nvidia-smi topo -m
Use topology distance to explain expected latency differences.
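As a rule of thumb from the matrix legend: NV# (NVLink) pairs should be fastest, PIX/PXB (PCIe switch paths) intermediate, and SYS (across the NUMA/SMP interconnect) slowest, so latency differences that line up with those labels are expected rather than faults. A small sketch for collecting the matrix from every test pod at once (the app=nccl-test label is an assumption):

```bash
# Save one topology matrix per test pod so node-to-node differences are easy to diff.
for pod in $(kubectl get pods -l app=nccl-test -o name); do
  name=${pod#pod/}
  kubectl exec "$name" -- nvidia-smi topo -m > "topo-${name}.txt"
done
```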
Common Root Causes
- Wrong data interface selected
- RDMA disabled or unavailable
- Mixed firmware/driver versions across nodes (quick checks sketched below)
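Each of these causes has a quick check using standard NCCL environment variables and node-level queries; in the sketch below the pod names, the app=nccl-test label, and device names are examples:

```bash
# Which interface and transport did NCCL actually pick? INFO logs contain
# lines such as "NET/IB" or "NET/Socket" plus the selected interface.
kubectl exec nccl-a -- env NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 8M -f 2 -g 2 | grep -i 'net/'

# Is RDMA visible inside the pod? (requires the InfiniBand userspace tools)
kubectl exec nccl-a -- ibv_devinfo

# Force-disable IB to confirm whether the slow path is the TCP fallback.
kubectl exec nccl-a -- env NCCL_IB_DISABLE=1 all_reduce_perf -b 8 -e 8M -f 2 -g 2

# Compare driver versions across nodes; mismatches show up immediately.
for pod in $(kubectl get pods -l app=nccl-test -o name); do
  echo "== ${pod#pod/}"
  kubectl exec "${pod#pod/}" -- nvidia-smi --query-gpu=driver_version --format=csv,noheader | sort -u
done
```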
