Troubleshooting · Advanced · ⏱ 25 minutes · K8s 1.28+

Diagnose GPU Peer-to-Peer Latency with NCCL Tests

Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Compare latency with small-message runs such as all_reduce_perf -b 8 -e 8M -f 2 -g 1 across different GPU pairs and nodes to identify outliers.

Consistently high small-message latency usually points to a topology or transport-path issue (e.g., PCIe instead of NVLink, or TCP instead of RDMA).
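To compare pairs numerically, the small-message latency can be pulled out of the nccl-tests output with a one-liner. A minimal sketch: the sample row piped in below is hypothetical, but follows the nccl-tests table layout (size, count, type, redop, root, then out-of-place time in microseconds as column 6).

```shell
# Print the 8-byte out-of-place latency (us) from an all_reduce_perf run;
# column 6 is the out-of-place time in the nccl-tests output layout.
# The sample row here is hypothetical.
awk '$1 == 8 { print $6 }' <<'EOF'
       8             2     float     sum      -1    24.31    0.00    0.00      0    23.98    0.00    0.00      0
EOF
```

Running this over each GPU pair's output turns the comparison into a short list of numbers, where outliers stand out immediately.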

Fast Latency Test

all_reduce_perf -b 8 -e 8M -f 2 -g 1
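Because the goal is pairwise latency, the point-to-point test from nccl-tests is a useful complement to all_reduce_perf. A sketch, assuming the nccl-tests binaries are on the pod's PATH; the device indices are assumptions:

```shell
# Point-to-point latency between two GPUs in one pod.
# -b 8 -e 8M -f 2 : sweep message sizes from 8 B to 8 MB, doubling each step
# -g 2            : drive two GPUs from one process, exercising the P2P path
sendrecv_perf -b 8 -e 8M -f 2 -g 2

# Restrict which physical GPUs form the pair (indices 0 and 3 are examples):
CUDA_VISIBLE_DEVICES=0,3 sendrecv_perf -b 8 -e 8M -f 2 -g 2
```

Iterating CUDA_VISIBLE_DEVICES over the possible pairs isolates a single slow link rather than averaging it into a collective.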

Isolation Strategy

  1. Test within one node first.
  2. Test cross-node with the same pod specs.
  3. Repeat with pinned nodes and interfaces.
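Step 3 can be made concrete by fixing both the scheduling and the transport. A sketch, where the pod name and interface name are assumptions for your cluster:

```shell
# Confirm which node the pod landed on (hypothetical pod name):
kubectl get pod nccl-worker-0 -o wide

# Re-run the test with the data interface pinned and NCCL's choices logged:
kubectl exec nccl-worker-0 -- env \
  NCCL_SOCKET_IFNAME=eth1 \
  NCCL_DEBUG=INFO \
  all_reduce_perf -b 8 -e 8M -f 2 -g 1
```

If the numbers change when the interface is pinned, NCCL's default interface selection was part of the problem.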

Correlate With Topology

Inside each pod:

nvidia-smi topo -m

Use topology distance to explain expected latency differences.
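A sketch of reading the matrix from inside a pod (the pod name is an assumption); the legend values below are the standard nvidia-smi path types, roughly ordered from fastest to slowest:

```shell
# Show the GPU connectivity matrix inside the pod.
# Legend, roughly fastest to slowest:
#   NV#       - NVLink
#   PIX       - single PCIe bridge
#   PXB / PHB - multiple PCIe bridges / PCIe host bridge
#   NODE      - same NUMA node, crossing PCIe host bridges
#   SYS       - crosses the inter-socket (SMP) interconnect
kubectl exec gpu-pod -- nvidia-smi topo -m
```

A pair connected via SYS is expected to be slower than an NV# pair; only latency that deviates from what the matrix predicts needs further digging.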

Common Root Causes

  • Wrong data interface selected
  • RDMA disabled or unavailable
  • Mixed firmware/driver versions across nodes
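Each of these causes has a quick check. A sketch, with hypothetical pod names; the NCCL_DEBUG output format and tool invocations are standard, but adapt names to your cluster:

```shell
# 1. Which interface/transport did NCCL actually pick? (NET lines in the log)
kubectl exec nccl-worker-0 -- env NCCL_DEBUG=INFO \
  all_reduce_perf -b 8 -e 8 -g 1 2>&1 | grep 'NCCL INFO NET'

# 2. Is RDMA present and usable inside the pod?
kubectl exec nccl-worker-0 -- ibv_devinfo

# 3. Any driver skew across nodes?
for p in nccl-worker-0 nccl-worker-1; do
  kubectl exec "$p" -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
done
```

Seeing "NET/Socket" where you expected "NET/IB" in step 1 confirms the first two causes at once: NCCL fell back to TCP because RDMA was not usable.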
#nccl #latency #p2p #gpu #troubleshooting
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
