NCCL IB HCA Selection and QPS Tuning for RoCE
Configure NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_IB_QPS_PER_CONNECTION, and NCCL_IB_SPLIT_DATA_ON_QPS for optimal RoCE performance on Kubernetes GPU clusters.
💡 Quick Answer: Set NCCL_IB_HCA=mlx5 to use all Mellanox ConnectX devices, NCCL_IB_GID_INDEX=3 for RoCEv2 over IPv4, NCCL_IB_QPS_PER_CONNECTION=1 for a single QP per peer, and NCCL_IB_SPLIT_DATA_ON_QPS=1 to distribute data across QPs (this only has an effect once more than one QP per connection is configured). These defaults work for most SR-IOV RoCE deployments on OpenShift.
The Problem
- Multiple mlx5 devices are visible in the pod (SR-IOV VFs), so NCCL must pick the right ones
- The wrong GID index causes connection failures on RoCE (unlike InfiniBand)
- A single queue pair can become a bottleneck for large messages
- QP-based transfers need a balance between parallelism and overhead
The Solution
NCCL_IB_HCA: Device Selection
# Wildcard: use all mlx5 devices (most common for SR-IOV)
export NCCL_IB_HCA="mlx5"
# Matches: mlx5_0, mlx5_1, mlx5_2, ... mlx5_25
# Specific devices (pin to known-good NICs):
export NCCL_IB_HCA="mlx5_0,mlx5_3"
# Only uses these two HCAs
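# Exclude specific devices (prefix with ^):
export NCCL_IB_HCA="^mlx5_1"
# Uses all mlx5 devices except those whose name starts with mlx5_1
# (entries are prefix matches, so mlx5_10 to mlx5_19 are excluded too if present;
# use "^=mlx5_1" to exclude only the device named exactly mlx5_1)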
# Per-NIC port selection:
export NCCL_IB_HCA="mlx5_0:1"
# Only port 1 of mlx5_0

Scenario                       │ Recommended NCCL_IB_HCA
───────────────────────────────┼──────────────────────────────────
SR-IOV (1 VF per pod)          │ mlx5 (wildcard, auto-select)
SR-IOV (multiple VFs per pod)  │ mlx5 (let NCCL pick by topology)
Shared RDMA (26 VFs visible)   │ mlx5 (NCCL filters by distance)
Dedicated NIC (bare metal)     │ mlx5_0,mlx5_3 (explicit)
InfiniBand (not RoCE)          │ mlx5 (same wildcard works)
───────────────────────────────┴──────────────────────────────────

NCCL_IB_GID_INDEX: Address Selection
# Typical GID table for a RoCE port with one IPv4 address
# (the order can differ per host, so verify it as shown next):
# Index 0: RoCEv1, IPv6 link-local
# Index 1: RoCEv2, IPv6 link-local
# Index 2: RoCEv1, IPv4-mapped
# Index 3: RoCEv2, IPv4-mapped (standard for Kubernetes)
export NCCL_IB_GID_INDEX=3 # RoCEv2 over IPv4; use this for K8s

# Verify GID table contents:
for i in $(seq 0 7); do
  gid=$(cat /sys/class/infiniband/mlx5_0/ports/1/gids/$i 2>/dev/null)
  gid_type=$(cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/$i 2>/dev/null)
  echo "GID[$i]: ${gid} (${gid_type})"
done
# Typical output (addresses and ordering vary per host):
# GID[0]: fe80:0000:0000:... (IB/RoCE v1)
# GID[1]: fe80:0000:0000:... (RoCE v2)
# GID[2]: 0000:0000:0000:... (IB/RoCE v1)
# GID[3]: 0000:0000:0000:0000:0000:ffff:c0a8:0101 (RoCE v2)
# The last entry is the IPv4-mapped form of 192.168.1.1, i.e. RoCEv2 over IPv4

NCCL_IB_QPS_PER_CONNECTION: Queue Pair Scaling
# Default: 1 QP per connection
export NCCL_IB_QPS_PER_CONNECTION=1
# Higher values: multiple QPs per peer connection
# Benefit: more hardware parallelism for large messages
# Cost: more memory, more CQ processing overhead
export NCCL_IB_QPS_PER_CONNECTION=4 # Use for high-bandwidth NICs (400G)

QPs/Connection │ Best For                    │ Trade-off
───────────────┼─────────────────────────────┼─────────────────────────
1              │ Most deployments            │ Simple, low overhead
2              │ 200G NICs with large msgs   │ Moderate improvement
4              │ 400G NICs, 8+ GPUs/node     │ Maximum NIC utilization
8+             │ Rarely needed               │ Diminishing returns
───────────────┴─────────────────────────────┴─────────────────────────

NCCL_IB_SPLIT_DATA_ON_QPS: Data Distribution
# When QPS_PER_CONNECTION > 1:
export NCCL_IB_SPLIT_DATA_ON_QPS=1 # Split large messages across QPs
# 0 = send entire message on one QP (round-robin between messages)
# 1 = split each message across all QPs (better latency for large msgs)
# With SPLIT=1 and QPS=4:
# A 4GB message becomes 4× 1GB transfers in parallel across 4 QPs
# Result: up to ~4× throughput on that connection, but only if a single QP
# was the bottleneck and the NIC has spare capacity

Complete Configuration Block
env:
  # Device selection: wildcard for all mlx5 SR-IOV VFs
  - name: NCCL_IB_HCA
    value: "mlx5"
  # RoCEv2 over IPv4 addressing
  - name: NCCL_IB_GID_INDEX
    value: "3"
  # Enable IB transport (0 = enabled, 1 = disabled)
  - name: NCCL_IB_DISABLE
    value: "0"
  # Queue pair tuning
  - name: NCCL_IB_QPS_PER_CONNECTION
    value: "1"
  # Split data across QPs for large messages
  - name: NCCL_IB_SPLIT_DATA_ON_QPS
    value: "1"

Verifying HCA Selection in Logs
# NCCL_DEBUG=INFO shows which devices are selected:
NCCL INFO NET/IB: Dev 0 IBDev 0 Port 1 qpn 364 mtu 5 GID 3 \
(0/B9D4E80AFFFF0000) fifoRkey=0x41200 fifoLkey=0x41200
NCCL INFO NET/IB: Dev 0 IBDev 0 Port 1 qpn 236 mtu 5 GID 3 \
(0/B5D4E80AFFEF0000) fifoRkey=0x21300 fifoLkey=0x21300
# Decode:
# Dev 0 = first network device
# IBDev 0 = first IB device (mlx5_0 or first VF)
# Port 1 = physical port
# qpn 364 = queue pair number
# mtu 5 = 4096 bytes (IB MTU encoding: 1=256, 2=512, 3=1024, 4=2048, 5=4096)
# GID 3 = using GID index 3 (RoCEv2 IPv4) ✓

Common Issues
“Transport retry count exceeded”
- Cause: Wrong GID index, so packets are routed incorrectly or dropped
- Fix: Verify that NCCL_IB_GID_INDEX=3 matches the actual RoCEv2 IPv4 GID on the device
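
One way to carry out that check without reading hex by eye is to decode the GID table programmatically. The sketch below is a minimal example and not part of NCCL: `gid_to_ipv4` and `find_rocev2_ipv4_gids` are hypothetical helper names, and the third argument (a sysfs root) is an assumption added only so the function can be exercised against a fake directory tree. It assumes the same sysfs layout as the verification loop earlier (`gids/` and `gid_attrs/types/` under each port).

```shell
#!/usr/bin/env bash
# Decode an IPv4-mapped GID (as printed by sysfs) into dotted-quad form.
# Prints nothing and returns 1 if the GID is not IPv4-mapped.
gid_to_ipv4() {
  local gid="$1"
  local prefix="0000:0000:0000:0000:0000:ffff:"
  # IPv4-mapped GIDs are ::ffff: followed by two 16-bit hex groups.
  [[ "$gid" == "$prefix"????:???? ]] || return 1
  local hex="${gid#"$prefix"}"   # e.g. c0a8:0101
  hex="${hex/:/}"                # e.g. c0a80101
  printf '%d.%d.%d.%d\n' \
    "0x${hex:0:2}" "0x${hex:2:2}" "0x${hex:4:2}" "0x${hex:6:2}"
}

# Print "<index> <ipv4>" for every GID entry on a port whose type is
# "RoCE v2" and whose address is IPv4-mapped.
# Arguments: device (default mlx5_0), port (default 1), sysfs root
# (default /sys/class/infiniband; overridable for testing only).
find_rocev2_ipv4_gids() {
  local dev="${1:-mlx5_0}" port="${2:-1}" root="${3:-/sys/class/infiniband}"
  local base="$root/$dev/ports/$port" f i gid type ip
  for f in "$base"/gids/*; do
    [[ -e "$f" ]] || continue
    i="${f##*/}"
    # Unpopulated entries can fail to read; skip them.
    gid="$(cat "$f" 2>/dev/null)" || continue
    type="$(cat "$base/gid_attrs/types/$i" 2>/dev/null)" || continue
    [[ "$type" == "RoCE v2" ]] || continue
    if ip="$(gid_to_ipv4 "$gid")"; then
      echo "$i $ip"
    fi
  done
  return 0
}

# Example (run inside the pod):
#   find_rocev2_ipv4_gids mlx5_0 1
```

If this prints an index other than 3 next to the pod's RDMA IPv4 address, that index is the one to try for NCCL_IB_GID_INDEX on that host.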
All ranks use same IBDev (bandwidth halved)
- Cause: Only one VF allocated, or topology makes NCCL pick the same device
- Fix: Request more openshift.io/mellanoxnics or use an explicit HCA list
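
Before committing to an explicit HCA list, it can help to preview which device names a given value would select. The following is a rough sketch only: `hca_select` is a hypothetical helper that approximates the matching rules described in NCCL's documentation (entries are name prefixes, a leading `^` inverts the list, a leading `=` requests exact matching), and it ignores any `:port` suffix because it compares names only. It is not NCCL's own code, so treat its output as a hint and confirm against the NCCL logs.

```shell
#!/usr/bin/env bash
# Preview which device names an NCCL_IB_HCA-style value would select.
# Usage: hca_select <spec> <device> [<device> ...]
hca_select() {
  local spec="$1"; shift
  local invert=0 exact=0 dev entry name hit
  local -a entries
  # A leading ^ turns the list into an exclusion list.
  if [[ "${spec:0:1}" == "^" ]]; then invert=1; spec="${spec:1}"; fi
  # A leading = (after any ^) switches from prefix to exact matching.
  if [[ "${spec:0:1}" == "=" ]]; then exact=1; spec="${spec:1}"; fi
  IFS=',' read -r -a entries <<< "$spec"
  for dev in "$@"; do
    hit=0
    for entry in "${entries[@]}"; do
      name="${entry%%:*}"   # drop any :port suffix
      if (( exact )); then
        [[ "$dev" == "$name" ]] && hit=1
      else
        [[ "$dev" == "$name"* ]] && hit=1
      fi
    done
    # Keep the device when "matched" and "inverted" disagree.
    if (( hit != invert )); then echo "$dev"; fi
  done
}

# Example (run inside the pod):
#   hca_select "^=mlx5_1" $(ls /sys/class/infiniband)
```

With devices mlx5_0, mlx5_1, mlx5_3 and mlx5_10, the value `^mlx5_1` keeps only mlx5_0 and mlx5_3 in this sketch, while `^=mlx5_1` also keeps mlx5_10. That difference matters when many VFs are visible in the pod.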
“No IB device found” despite /dev/infiniband existing
- Cause: NCCL_IB_DISABLE=1 or NCCL_NET_PLUGIN=none
- Fix: Set NCCL_IB_DISABLE=0 and remove NCCL_NET_PLUGIN entirely
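
A quick pre-flight check in the container entrypoint can catch these two settings before a job starts. This is a minimal sketch: `check_nccl_ib_env` is a hypothetical helper name, and it only inspects the two variables named above, not whether RDMA devices are actually present.

```shell
#!/usr/bin/env bash
# Warn about the two environment settings associated above with
# "No IB device found". Prints one line per problem and returns
# non-zero if any problem was found.
check_nccl_ib_env() {
  local rc=0
  if [[ "${NCCL_IB_DISABLE:-0}" == "1" ]]; then
    echo "NCCL_IB_DISABLE=1: IB/RoCE transport is disabled; set it to 0"
    rc=1
  fi
  if [[ "${NCCL_NET_PLUGIN:-}" == "none" ]]; then
    echo "NCCL_NET_PLUGIN=none: remove this variable"
    rc=1
  fi
  return "$rc"
}

# Example: abort early in an entrypoint script.
#   check_nccl_ib_env || exit 1
```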
QPS_PER_CONNECTION > 1 causes OOM
- Cause: Each QP allocates send/receive buffers (typically 64KB-1MB each)
- Fix: Reduce QPS or increase pod memory limit
Best Practices
- Start with the mlx5 wildcard and let NCCL auto-select by PCIe topology
- Always use GID_INDEX=3 for RoCE on Kubernetes (IPv4)
- Keep QPS_PER_CONNECTION=1 unless you’ve verified that a higher value helps
- Enable SPLIT_DATA_ON_QPS=1: low cost, potential benefit for large messages
- Check logs for qpn and GID to confirm correct device and addressing
- Never set NCCL_IB_DISABLE=1 in production RDMA workloads
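
The log check in particular is easy to script. As a rough sketch, the hypothetical helper `summarize_nccl_ib` below reads NCCL_DEBUG=INFO output on stdin and counts connections per IBDev, port and GID index. It assumes the `NET/IB` line format shown in the log excerpt earlier; other NCCL versions may print these lines differently, in which case the pattern needs adjusting.

```shell
#!/usr/bin/env bash
# Summarise NET/IB connection lines from NCCL_DEBUG=INFO output (stdin).
# Prints one "IBDev=<n> Port=<n> GID=<n> count=<n>" line per combination.
summarize_nccl_ib() {
  sed -n -E \
    's/.*NET\/IB: .*IBDev ([0-9]+) Port ([0-9]+) qpn [0-9]+ mtu [0-9]+ GID ([0-9]+).*/IBDev=\1 Port=\2 GID=\3/p' \
    | sort | uniq -c \
    | awk '{ print $2, $3, $4, "count=" $1 }'
}

# Example (job.log is a placeholder file name):
#   summarize_nccl_ib < job.log
```

If every line reports the same IBDev while several VFs were requested, or a GID index other than the configured one, that points back to the issues listed above.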
Key Takeaways
- NCCL_IB_HCA=mlx5 wildcard is sufficient for most SR-IOV deployments
- GID index 3 = RoCEv2 IPv4, the standard for Kubernetes GPU clusters
- QPS tuning provides marginal gains; topology and GDR level matter more
- SPLIT_DATA_ON_QPS=1 is safe to enable by default (splits large messages)
- Verify in NCCL logs: check IBDev, Port, GID index, and qpn values
- Multiple visible mlx5 devices (e.g., 26) is normal with shared RDMA plugin

