Tune NCCL Environment Variables for RDMA and Ethernet
Apply safe NCCL environment variable profiles for RDMA-capable and Ethernet-only GPU clusters to maximize collective communication throughput.
💡 Quick Answer: Start with NCCL_DEBUG=INFO, set NCCL_SOCKET_IFNAME to the correct data interface, and enable or disable InfiniBand explicitly using NCCL_IB_DISABLE.
Use explicit NCCL environment configuration to reduce transport ambiguity and improve repeatability.
RDMA-Oriented Profile
NCCL_DEBUG=INFO
NCCL_IB_DISABLE=0
NCCL_SOCKET_IFNAME=eth0
Ethernet-Only Profile
NCCL_DEBUG=INFO
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eth0
Validation Loop
- Apply one profile.
- Run all_reduce_perf and keep logs.
- Compare bandwidth and error rates.
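The loop above can be sketched as a small shell helper. This is a minimal sketch, not a definitive implementation: the function names apply_profile and run_validation, the DATA_IFNAME override, and the log file naming are hypothetical, and it assumes all_reduce_perf from the nccl-tests project is on PATH. The message-size flags are only an example sweep. A single-node run like this does not exercise inter-node transport, so on a multi-node cluster the same command would typically be started through your MPI or job launcher.

```shell
#!/bin/sh
# Sketch: apply one NCCL profile, run all_reduce_perf, and keep the log.
# Hypothetical helper names; eth0 is assumed unless DATA_IFNAME is set.

apply_profile() {
  # $1 = "rdma" or "ethernet"
  case "$1" in
    rdma)     export NCCL_IB_DISABLE=0 ;;
    ethernet) export NCCL_IB_DISABLE=1 ;;
    *) echo "unknown profile: $1" >&2; return 1 ;;
  esac
  export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME="${DATA_IFNAME:-eth0}"
}

run_validation() {
  # $1 = profile name; writes a timestamped log in the current directory.
  apply_profile "$1" || return 1
  if ! command -v all_reduce_perf >/dev/null 2>&1; then
    echo "all_reduce_perf not found on PATH" >&2
    return 127
  fi
  nccl_log="nccl-$1-$(date +%Y%m%d-%H%M%S).log"
  # Sweep message sizes from 8 bytes to 128 MB, doubling each step, on 1 GPU.
  all_reduce_perf -b 8 -e 128M -f 2 -g 1 >"$nccl_log" 2>&1
  nccl_status=$?
  echo "log written to $nccl_log (exit status $nccl_status)"
  return "$nccl_status"
}
```

For example, `run_validation rdma` followed by `run_validation ethernet` would leave one log per profile to compare.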
Best Practices
- Change one variable at a time when troubleshooting.
- Keep per-cluster baseline profiles under version control.
- Re-test after CNI, firmware, or driver upgrades.
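Re-testing after an upgrade is easier when the new log can be checked against a stored baseline log. The following is a rough sketch under stated assumptions: the helper names avg_busbw, count_warnings, and check_regression are hypothetical, the 10 percent default tolerance is arbitrary, and the parsing assumes the log contains the "Avg bus bandwidth" summary line that all_reduce_perf prints at the end of a run and that NCCL warnings appear as lines containing "NCCL WARN". Verify both patterns against your own logs before relying on it.

```shell
#!/bin/sh
# Sketch: compare a new all_reduce_perf log with a baseline log.
# Hypothetical helper names; the log line formats are assumptions.

avg_busbw() {
  # Print the number from the last "Avg bus bandwidth : <value>" line in $1.
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1" | tail -n 1
}

count_warnings() {
  # Print how many lines in $1 contain "NCCL WARN" (prints 0 if none).
  grep -c 'NCCL WARN' "$1" || true
}

check_regression() {
  # $1 = baseline log, $2 = new log, $3 = allowed drop in percent (default 10).
  # Succeeds when the new bandwidth is within the allowed drop of the baseline.
  awk -v base="$(avg_busbw "$1")" -v new="$(avg_busbw "$2")" -v tol="${3:-10}" \
    'BEGIN { exit !((new + 0) >= (base + 0) * (1 - tol / 100)) }'
}
```

A typical use would be `check_regression baseline.log after-upgrade.log 5 || echo "bandwidth dropped more than 5%"`, with the baseline log kept next to the profile in version control.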

