Configuration advanced ⏱ 20 minutes K8s 1.28+

Tune NCCL Environment Variables for RDMA and Ethernet

Apply safe NCCL environment variable profiles for RDMA-capable and Ethernet-only GPU clusters to maximize collective communication throughput.

By Luca Berton • 📖 5 min read

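💡 Quick Answer: Start with NCCL_DEBUG=INFO so NCCL logs which transport it selects, set NCCL_SOCKET_IFNAME to the correct data interface, and enable or disable the InfiniBand/RoCE transport explicitly with NCCL_IB_DISABLE (0 keeps it enabled, 1 forces a fallback to IP sockets).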

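Set the NCCL environment variables explicitly instead of relying on auto-detection. This removes ambiguity about which transport NCCL picks and makes benchmark results repeatable across runs and nodes.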

RDMA-Oriented Profile

NCCL_DEBUG=INFO
NCCL_IB_DISABLE=0
NCCL_SOCKET_IFNAME=eth0

Ethernet-Only Profile

NCCL_DEBUG=INFO
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eth0
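
On Kubernetes, these variables usually reach the workload through the container's env field. The following is a minimal sketch, assuming you generate that list from code; the PROFILES dictionary and the k8s_env helper are hypothetical names introduced for illustration, and eth0 is only the placeholder interface used in the profiles above.

```python
import json

# Values copied from the two profiles above. "eth0" is a placeholder:
# substitute the data interface your nodes actually expose.
PROFILES = {
    "rdma": {
        "NCCL_DEBUG": "INFO",
        "NCCL_IB_DISABLE": "0",
        "NCCL_SOCKET_IFNAME": "eth0",
    },
    "ethernet": {
        "NCCL_DEBUG": "INFO",
        "NCCL_IB_DISABLE": "1",
        "NCCL_SOCKET_IFNAME": "eth0",
    },
}


def k8s_env(profile_name):
    """Return a profile as name/value entries for a container's env field."""
    profile = PROFILES[profile_name]
    # Sort by variable name so the generated manifest is stable under version control.
    return [{"name": key, "value": value} for key, value in sorted(profile.items())]


# Print the Ethernet-only profile as JSON, ready to paste into a pod spec.
print(json.dumps(k8s_env("ethernet"), indent=2))
```

Values stay as strings because Kubernetes rejects unquoted numbers in env entries. Keeping both profiles in one structure also makes it obvious that they differ only in NCCL_IB_DISABLE.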

Validation Loop

  1. Apply one profile.
  2. Run all_reduce_perf (from nccl-tests) and keep the logs.
  3. Compare bandwidth and error rates against the other profile or your previous baseline.
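
The loop above can be sketched in Python as follows. This is an illustration under assumptions, not a definitive harness: run_all_reduce and summarize are hypothetical helper names, the benchmark arguments are just one common nccl-tests sweep, and the two log patterns ("Avg bus bandwidth" and "Using network") reflect output I expect from nccl-tests and NCCL_DEBUG=INFO but may differ between versions, so check them against your own logs.

```python
import os
import re
import subprocess

# Summary line printed at the end of an nccl-tests run (value in GB/s).
AVG_BUSBW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9.]+)")
# Transport line emitted with NCCL_DEBUG=INFO; wording may vary by NCCL version.
NETWORK = re.compile(r"Using network (\S+)")


def run_all_reduce(profile, binary="all_reduce_perf",
                   args=("-b", "8", "-e", "128M", "-f", "2", "-g", "1")):
    """Run the benchmark with the profile layered over the current environment."""
    env = {**os.environ, **profile}
    result = subprocess.run([binary, *args], env=env,
                            capture_output=True, text=True, check=False)
    # NCCL debug output and benchmark results may land on different streams.
    return result.stdout + result.stderr


def summarize(log_text):
    """Extract the figures worth comparing between two profile runs."""
    busbw = AVG_BUSBW.search(log_text)
    network = NETWORK.search(log_text)
    return {
        "avg_busbw_gbs": float(busbw.group(1)) if busbw else None,
        "network": network.group(1) if network else None,
        "warnings": sum(1 for line in log_text.splitlines() if "NCCL WARN" in line),
    }
```

Calling summarize(run_all_reduce(profile)) once per profile gives two small dictionaries to compare. If the reported network does not match the profile you applied (for example, Socket while NCCL_IB_DISABLE=0), NCCL most likely found no usable RDMA device and fell back.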

Best Practices

  • Change one variable at a time when troubleshooting.
  • Keep per-cluster baseline profiles under version control.
  • Re-test after CNI, firmware, or driver upgrades.
#nccl #rdma #ethernet #tuning #configuration
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
