Troubleshooting • advanced • ⏱ 25 minutes • K8s 1.28+

Diagnose GPU Peer-to-Peer Latency with NCCL Tests

Use NCCL point-to-point and collective tests to isolate GPU peer-to-peer latency issues between GPU pairs in multi-node Kubernetes clusters.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Run small-message tests such as all_reduce_perf -b 8 -e 8M -f 2 -g 1 (with one rank per GPU) against different GPU pairs and nodes, then compare the reported latencies to identify outlier pairs.

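Small-message results are dominated by latency rather than bandwidth, so a pair that is consistently slower than its peers usually points to a topology or transport-path issue.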

Fast Latency Test

all_reduce_perf -b 8 -e 8M -f 2 -g 1
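To compare runs, it helps to reduce each test's output to a few numbers. The following is a minimal sketch, assuming the usual nccl-tests column layout for all_reduce_perf (size, count, type, redop, root, then out-of-place time in microseconds as the sixth column). The function names are hypothetical, and the column position should be checked against the header your build prints.

```shell
# Print "size_bytes out_of_place_time_us" for each data row read from stdin.
# Assumption: data rows start with a numeric size and carry the out-of-place
# time (us) in column 6; header and comment rows start with '#'.
nccl_latency_rows() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 8 { print $1, $6 }'
}

# Print the lowest out-of-place time (us) seen across all message sizes.
nccl_min_latency() {
  nccl_latency_rows |
    awk 'NR == 1 || $2 + 0 < min { min = $2 + 0 } END { if (NR > 0) print min }'
}
```

For example, all_reduce_perf -b 8 -e 8M -f 2 -g 1 | nccl_min_latency yields one figure per run that can be recorded for each GPU pair.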

Isolation Strategy

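  1. Test within one node first to establish an intra-node baseline.
  2. Test across nodes using identical pod specs, so that only the network path changes.
  3. Repeat with pods pinned to specific nodes and NCCL pinned to specific interfaces (for example via NCCL_SOCKET_IFNAME or NCCL_IB_HCA).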
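The steps above can be turned into a pairwise sweep. This sketch only prints the commands it would run, one per node pair. It assumes an MPI-enabled nccl-tests build, Open MPI's mpirun flags (-np, -H, -x), and worker hostnames that the launcher can reach; the function names and the interface argument are hypothetical.

```shell
# Print every unordered pair of the given names, one "a b" pair per line.
# Assumes the names are unique.
node_pairs() {
  for a in "$@"; do
    seen=0
    for b in "$@"; do
      if [ "$seen" -eq 1 ]; then
        printf '%s %s\n' "$a" "$b"
      fi
      if [ "$a" = "$b" ]; then
        seen=1
      fi
    done
  done
}

# Print (do not execute) one two-rank launch command per node pair.
# $1 is the interface NCCL should be pinned to; the rest are hostnames.
print_pair_commands() {
  iface=$1
  shift
  node_pairs "$@" | while read -r a b; do
    printf 'mpirun -np 2 -H %s,%s -x NCCL_SOCKET_IFNAME=%s all_reduce_perf -b 8 -e 8M -f 2 -g 1\n' \
      "$a" "$b" "$iface"
  done
}
```

Printing the commands first makes it easy to review the pinning before anything is launched, and the same pair list can be reused when results are tabulated.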

Correlate With Topology

Inside each pod:

nvidia-smi topo -m

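Use the connection type reported for each GPU pair (NV# for NVLink; PIX, PXB, PHB, NODE, or SYS for progressively longer PCIe and CPU-interconnect paths) to explain expected latency differences. A pair is only suspicious if it is slower than other pairs with the same connection type.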
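To line the topology up with measured latencies, the matrix can be flattened into one row per GPU pair. This is a rough sketch that assumes the plain-text matrix layout (a header row of GPU labels followed by one row per GPU, with no terminal escape codes); the function name is hypothetical, and non-GPU columns such as NICs are ignored.

```shell
# Read `nvidia-smi topo -m` output on stdin and print "GPUi GPUj LINK"
# for each unordered GPU pair.
topo_pairs() {
  awk '
    # Header row: the first line whose first field is a GPU label.
    !have_hdr && $1 ~ /^GPU[0-9]+$/ {
      for (j = 1; j <= NF && $j ~ /^GPU[0-9]+$/; j++) hdr[j] = $j
      ngpu = j - 1
      have_hdr = 1
      next
    }
    # Data rows: field 1 is the row label, field j+1 is the link to hdr[j].
    have_hdr && $1 ~ /^GPU[0-9]+$/ {
      row++
      for (j = row + 1; j <= ngpu; j++) print $1, hdr[j], $(j + 1)
    }
  '
}
```

Running nvidia-smi topo -m | topo_pairs inside each pod gives a list that can be joined with the per-pair latency figures.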

Common Root Causes

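  • Wrong network interface selected for NCCL traffic (for example, a management NIC instead of the high-speed fabric)
  • RDMA disabled or unavailable, so NCCL falls back to TCP sockets
  • Mixed firmware or driver versions across nodes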
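Two of these causes can be checked from collected text output. The sketch below assumes logs captured with NCCL_DEBUG=INFO, in which network channel lines mention "via NET/IB" or "via NET/Socket" (exact wording can vary between NCCL versions), and a hypothetical input of "node driver_version" lines gathered with nvidia-smi --query-gpu=driver_version --format=csv,noheader on each node. Function names are illustrative.

```shell
# Count network channel connections per transport in an NCCL INFO log (stdin).
# Intra-node paths (P2P, shared memory) are not counted.
nccl_transport_counts() {
  awk '
    /via NET\/IB/     { ib++ }
    /via NET\/Socket/ { sock++ }
    END { printf "ib=%d socket=%d\n", ib, sock }
  '
}

# Read "node driver_version" lines on stdin and print how many distinct
# driver versions appear; anything above 1 means the nodes are mixed.
distinct_driver_versions() {
  awk 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d ' '
}
```

A result such as socket connections on a cluster that is expected to use RDMA, or more than one driver version, narrows down which root cause to investigate first.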