NCCL Network Validation Script for OpenShift GPU Clusters
Build a comprehensive NCCL network validation script for OpenShift GPU clusters with SR-IOV. Configure NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL=SYS, per-rank HCA
Quick Answer: Create a validate_network.sh script that sets NCCL defaults for SR-IOV environments: don't export NCCL_IB_HCA globally (let each MPI rank auto-detect its own HCA via net1), set NCCL_IB_GID_INDEX=3 for RoCEv2, NCCL_NET_GDR_LEVEL=SYS (because SR-IOV VF-to-GPU locality is non-deterministic), and NCCL_SOCKET_IFNAME=net1. If bandwidth is lower than expected, check: SR-IOV VF allocation, /dev/infiniband visibility, RoCE GID index, MTU, PFC/ECN, GPUDirect RDMA, PCIe/NUMA locality, and per-rank HCA selection.
The Problem
- Multi-node NCCL tests need correct environment variables for SR-IOV RoCE fabrics
- Setting NCCL_IB_HCA globally breaks MPI mode (each rank may have different VFs)
- RoCE GID index must match network configuration (wrong index = connection failures)
- GPUDirect RDMA level must account for non-deterministic SR-IOV VF placement
- Need a repeatable validation script with built-in troubleshooting guidance
The Solution
validate_network.sh: Complete Script
#!/bin/bash
# validate_network.sh - NCCL network validation for SR-IOV GPU clusters
#
# Usage: source validate_network.sh
# (then run all_reduce_perf via MPI)
# ============================================================
# NCCL Defaults
# ============================================================
#
# IMPORTANT:
# Do not set NCCL_IB_HCA globally in MPI mode.
# Each MPI rank will detect the HCA backing its own NCCL_SOCKET_IFNAME/net1.
#
# You may still override NCCL_IB_HCA manually for single-node debugging, but
# mpi-job mode intentionally does NOT export NCCL_IB_HCA through mpirun.
export NCCL_IB_HCA="${NCCL_IB_HCA:-}"
export NCCL_IB_GID_INDEX="${NCCL_IB_GID_INDEX:-3}"
export NCCL_IB_DISABLE="${NCCL_IB_DISABLE:-0}"
# For your SR-IOV Multus interface, this should usually be net1.
export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-net1}"
# GPUDirect RDMA level.
# SYS is useful when GPU/HCA locality is not deterministic because the generic
# SR-IOV resource may attach different HCAs to different pods.
export NCCL_NET_GDR_LEVEL="${NCCL_NET_GDR_LEVEL:-SYS}"
export NCCL_DEBUG="${NCCL_DEBUG:-INFO}"
export NCCL_DEBUG_SUBSYS="${NCCL_DEBUG_SUBSYS:-INIT,NET,GRAPH}"
# Optional NCCL tuning.
# For initial validation, keep QP usage simple.
export NCCL_IB_OPS_PER_CONNECTION="${NCCL_IB_OPS_PER_CONNECTION:-1}"
export NCCL_IB_SPLIT_DATA_ON_QPS="${NCCL_IB_SPLIT_DATA_ON_QPS:-0}"
echo "=== NCCL Environment ==="
env | grep ^NCCL_ | sort
echo "========================"
Why NOT Set NCCL_IB_HCA Globally
Problem: In SR-IOV mode, each pod gets a different Virtual Function (VF).
Node 1, Pod A: gets mlx5_2 (VF from PF mlx5_0)
Node 1, Pod B: gets mlx5_5 (VF from PF mlx5_1)
Node 2, Pod C: gets mlx5_3 (VF from different PF)
If you set NCCL_IB_HCA=mlx5_0 globally:
→ Rank in Pod B tries to use mlx5_0 (doesn't have access!) → FAIL
Solution: Leave NCCL_IB_HCA empty.
NCCL auto-detects which HCA backs the NCCL_SOCKET_IFNAME (net1) interface.
Each rank independently finds its own VF.
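One way a per-rank wrapper could look up the device behind net1 is to read sysfs. The sketch below is illustrative and not taken from validate_network.sh: hca_for_iface is a hypothetical helper, and it assumes the mlx5 driver exposes the RDMA device under /sys/class/net/<iface>/device/infiniband/.

```shell
#!/bin/bash
# Hypothetical helper (not part of validate_network.sh).
# Prints the RDMA device name (e.g. mlx5_2) behind a network interface.
# Assumes the sysfs layout /sys/class/net/<iface>/device/infiniband/<hca>.
hca_for_iface() {
  local iface="$1"
  local sysfs_root="${2:-/sys}"   # overridable so the lookup can be tested
  local dir="${sysfs_root}/class/net/${iface}/device/infiniband"
  local entry

  [ -d "$dir" ] || return 1
  for entry in "$dir"/*; do
    [ -e "$entry" ] || continue   # unmatched glob: the directory is empty
    basename "$entry"
    return 0
  done
  return 1
}

# Per-rank usage: only pin NCCL_IB_HCA when a device was actually found.
if hca="$(hca_for_iface "${NCCL_SOCKET_IFNAME:-net1}")"; then
  export NCCL_IB_HCA="$hca"
fi
```

Each rank would run this inside its own pod, for example from a small wrapper that mpirun launches in front of all_reduce_perf, so the value is never shared between ranks.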
For single-node debugging ONLY, you may temporarily set:
export NCCL_IB_HCA=mlx5_0,mlx5_3
(when you know all GPUs on that node share those HCAs)
NCCL_IB_GID_INDEX Explained
GID Index │ Type        │ Use Case
──────────┼─────────────┼──────────────────────────────────────────
0         │ RoCEv1 IPv6 │ Legacy, link-local only
1         │ RoCEv2 IPv6 │ IPv6 link-local
2         │ RoCEv1 IPv4 │ Routable IPv4 address, RoCEv1 encapsulation
3         │ RoCEv2 IPv4 │ Standard routable IPv4 GID ✓
──────────┴─────────────┴──────────────────────────────────────────
The layout above matches the sample output below; the order can differ between systems.
To check your GID table:
show_gids (ships with MLNX_OFED / mlnx-tools)
DEV PORT INDEX GID IPv4 VER DEV
mlx5_0 1 0 fe80:0000:... --- v1 net1
mlx5_0 1 1 fe80:0000:... --- v2 net1
mlx5_0 1 2 0000:0000:... 10.10.0.5 v1 net1
mlx5_0 1 3 0000:0000:... 10.10.0.5 v2 net1 ← Use this
Default: NCCL_IB_GID_INDEX=3 (RoCEv2 with routable IPv4)
If your fabric uses different indexing, check show_gids and adjust.
NCCL_NET_GDR_LEVEL=SYS for SR-IOV
Why SYS instead of PIX or PHB?
With SR-IOV, the device plugin assigns VFs from a pool.
GPU 0 might get VF from mlx5_0 (same PCIe switch = PIX)
GPU 0 might get VF from mlx5_3 (different socket = SYS)
The assignment is non-deterministic: it depends on which VFs are available.
If NCCL_NET_GDR_LEVEL=PIX:
→ NCCL only uses GDRDMA when GPU and HCA share PCIe switch
→ Some ranks fall back to host staging (inconsistent performance)
If NCCL_NET_GDR_LEVEL=SYS:
→ NCCL uses GDRDMA even when GPU and HCA are on different sockets
→ Consistent behavior regardless of VF assignment
→ Small bandwidth penalty for cross-socket GDRDMA, but still better than no GDRDMA
Recommendation for SR-IOV: NCCL_NET_GDR_LEVEL=SYS (or 4/5)
Recommendation for dedicated NICs: NCCL_NET_GDR_LEVEL=PIX (optimal)
MPIJob YAML (nccl_prod.yaml)
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-roce-validation
  namespace: gpu-workloads
spec:
  launcherCreationPolicy: AtStartup
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            app: nccl-roce-validation
        spec:
          containers:
            - name: mpi-job
              image: nvcr.io/nvidia/pytorch:24.04-py3
              command: ["/bin/bash", "-c"]
              args:
                - |
                  source /workspace/validate_network.sh
                  mpirun --allow-run-as-root \
                    -np ${MPI_NP:-4} \
                    --bind-to none \
                    -x NCCL_IB_GID_INDEX \
                    -x NCCL_IB_DISABLE \
                    -x NCCL_SOCKET_IFNAME \
                    -x NCCL_NET_GDR_LEVEL \
                    -x NCCL_DEBUG \
                    -x NCCL_DEBUG_SUBSYS \
                    -x NCCL_IB_OPS_PER_CONNECTION \
                    -x NCCL_IB_SPLIT_DATA_ON_QPS \
                    -x NCCL_DMABUF_ENABLE=1 \
                    /opt/nccl-tests/build/all_reduce_perf \
                    -b 8 -e 8G -f 2 -g 1
              env:
                - name: MPI_NP
                  value: "4"
                - name: GPUS_PER_MPI_PROCESS
                  value: "1"
                - name: MPI_HOSTFILE
                  value: /etc/mpi/hostfile
                - name: MPI_DNS_WAIT_SECONDS
                  value: "120"
                - name: MPI_DNS_WAIT_INTERVAL
                  value: "3"
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: sriov-rdma-net
        spec:
          containers:
            - name: worker
              image: nvcr.io/nvidia/pytorch:24.04-py3
              resources:
                limits:
                  nvidia.com/gpu: "2"
                  rdma/rdma_shared_device_a: "1"
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              volumeMounts:
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "32Gi"
Headless Service for MPI DNS
# nccl-roce-validation-headless-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: nccl-roce-validation-worker
  namespace: gpu-workloads
spec:
  clusterIP: None
  selector:
    app: nccl-roce-validation
    training.kubeflow.org/replica-type: worker
  ports:
    - port: 22
      targetPort: 22
Troubleshooting Checklist
If observed bandwidth is much lower than expected, investigate:

□ SR-IOV VF allocation
  → oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml
  → Verify VFs are created and available

□ /dev/infiniband visibility
  → oc exec -n gpu-workloads nccl-roce-validation-worker-0 -- ls /dev/infiniband/
  → Must show a uverbsN device and rdma_cm

□ RoCE GID index
  → oc exec -n gpu-workloads nccl-roce-validation-worker-0 -- show_gids
  → Verify NCCL_IB_GID_INDEX matches the routable IP GID

□ MTU
  → ip link show net1 | grep mtu
  → Should be 9000 (jumbo) for optimal RDMA throughput

□ PFC / ECN
  → ethtool -S <pf-netdev> | grep -i pause
    (run on the host; ethtool expects the network interface name, not the RDMA device name such as mlx5_0)
  → PFC must be enabled on the RDMA priority (typically priority 3)

□ GPUDirect RDMA
  → Check NCCL logs for the /GDRDMA suffix
  → lsmod | grep nvidia_peermem on the host

□ NCCL_SOCKET_IFNAME
  → Must point to the SR-IOV secondary network interface (net1)
  → NOT eth0 (pod network) or lo

□ PCIe / NUMA locality
  → nvidia-smi topo -m
  → Check whether the GPU and its assigned HCA are on the same NUMA node

□ Whether each MPI rank selected the HCA backing its own net1
  → NCCL INFO logs show "Using network IB" + device name
  → Each rank should use the VF attached to its pod
Verify Pod Placement
# Check where pods landed
oc get pods -o wide -n gpu-workloads
# NAME READY STATUS AGE IP NODE
# nccl-roce-validation-launcher-9br76 0/1 Completed 2m 10.128.2.255 worker-w01
# nccl-roce-validation-worker-0 1/1 Running 2m 10.131.0.149 worker-w02
# nccl-roce-validation-worker-1 1/1 Running 2m 10.128.2.254 worker-w01
# Workers on different nodes = tests cross-node network ✓
# Workers on same node = only tests NVLink/SHM (not useful for network validation)
# Force different nodes with anti-affinity:
# spec.template.spec.affinity.podAntiAffinity...
NCCL_IB_OPS_PER_CONNECTION and NCCL_IB_SPLIT_DATA_ON_QPS
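NCCL_IB_OPS_PER_CONNECTION (default: 1)
Number of outstanding RDMA operations per QP connection.
Higher = more pipelining = better bandwidth (but more QP memory).
For validation: keep at 1 (simple, predictable).
For production: try 4-8 for better throughput.
Note: the variable NCCL documents for queue pairs per connection is NCCL_IB_QPS_PER_CONNECTION (default: 1). NCCL ignores environment variables it does not recognize, so confirm the name against the documentation for your NCCL version.
NCCL_IB_SPLIT_DATA_ON_QPS (default: 0 in recent NCCL releases)
0 = round-robin: each message is sent on one QP, rotating across the connection's QPs
1 = split each message evenly across the connection's QPs
Both modes only matter when a connection has more than one QP.
For validation: keep at 0.
For production with multiple rails: set to 1 with NCCL_IB_QPS_PER_CONNECTION>1.
Project File Structure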
ocp_validate_nccl/
├── validate_network.sh                     # NCCL environment setup script
├── validate_network_v4.sh                  # Version 4 with latest tuning
├── mpijob.yaml                             # Generic MPIJob template
├── nccl_prod.yaml                          # Production validation config
├── nccl-roce-validation.yaml               # RoCE multi-node test
├── nccl-roce-validation-headless-svc.yaml  # DNS service for MPI
├── shell-nccl-roce-validation.yaml         # Interactive debug shell
├── single-nccl-roce-validation.yaml        # Single-node variant
├── single.log                              # Single-node results
├── nv5.log                                 # NVLink 5-way test
├── sys_v5.log                              # SYS-level GDR test
├── pix.log                                 # PIX-level GDR test
├── phb_v5.log                              # PHB-level test
├── phb_v5_1805.log                         # PHB test variant
├── 4q_phb.log                              # 4-QP PHB test
├── Dockerfile                              # Custom NCCL test image
└── .dockerignore
Common Issues
Wrong GID index → "Connection refused" or timeout
- Cause: NCCL_IB_GID_INDEX doesn't match routable GID in switch fabric
- Fix: Run show_gids in worker pod; find RoCEv2 index with routable IP; set accordingly
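The lookup described in this fix can be scripted. The following is a minimal sketch, assuming show_gids prints the column order shown earlier (DEV PORT INDEX GID IPv4 VER DEV); find_rocev2_ipv4_gid is a hypothetical helper name, not an NCCL or OFED tool.

```shell
#!/bin/bash
# Hypothetical helper: read show_gids-style text on stdin and print the first
# GID index that is RoCEv2 ("v2") and carries an IPv4 address for the device.
# Exits non-zero when no such entry exists.
find_rocev2_ipv4_gid() {
  local dev="$1"
  awk -v dev="$dev" '
    $1 == dev && $5 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $6 == "v2" {
      print $3
      found = 1
      exit
    }
    END { exit !found }
  '
}

# Example, run inside a worker pod (mlx5_2 stands in for that pod's VF):
#   export NCCL_IB_GID_INDEX="$(show_gids | find_rocev2_ipv4_gid mlx5_2)"
```

Matching on the IPv4 and version columns rather than on a fixed index keeps the check valid on systems whose GID table is ordered differently.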
Rank uses wrong HCA (not backing net1)
- Cause: NCCL picks first available HCA instead of one behind net1
- Fix: Ensure NCCL_SOCKET_IFNAME=net1; NCCL resolves which HCA owns that interface
Inconsistent bandwidth across runs
- Cause: SR-IOV VF assignment varies; some VFs closer to GPU than others
- Fix: Use NCCL_NET_GDR_LEVEL=SYS to ensure GDRDMA regardless of topology; or pin VFs with topology-aware scheduling
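To see whether some ranks fell back to host staging, the NCCL INFO log can be summarized per run. This is a rough sketch: gdr_summary is a hypothetical helper, and it assumes channel lines contain "via NET/IB/<n>" with GDR-enabled ones ending in "/GDRDMA", which may vary across NCCL versions.

```shell
#!/bin/bash
# Hypothetical helper: count NET/IB channel lines in an NCCL INFO log and
# report how many carry the GDRDMA suffix versus how many do not.
gdr_summary() {
  local log="$1"
  local total gdr

  [ -r "$log" ] || return 1
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  total="$(grep -c 'via NET/IB/' "$log" || true)"
  gdr="$(grep -c 'via NET/IB/.*GDRDMA' "$log" || true)"
  echo "gdrdma=${gdr} staged=$((total - gdr))"
}
```

A non-zero staged count with NCCL_NET_GDR_LEVEL=PIX that drops to zero with SYS would be consistent with the VF placement explanation above.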
Workers placed on same node (no network test)
- Cause: Scheduler placed both workers on same node
- Fix: Add pod anti-affinity on hostname; or use topology spread constraints
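Before trusting a bandwidth number, the placement check can be automated. A minimal sketch, where workers_on_distinct_nodes is a hypothetical helper and the label selector in the example assumes the worker pods carry the same replica-type label that the headless Service selects on:

```shell
#!/bin/bash
# Hypothetical helper: read one node name per line on stdin and succeed only
# if no node appears twice (every worker landed on a different node).
workers_on_distinct_nodes() {
  local dupes
  dupes="$(grep -v '^$' | sort | uniq -d)"
  if [ -n "$dupes" ]; then
    echo "multiple workers share node(s): $(echo "$dupes" | tr '\n' ' ')" >&2
    return 1
  fi
  return 0
}

# Example:
#   oc get pods -n gpu-workloads \
#     -l training.kubeflow.org/replica-type=worker \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
#     | workers_on_distinct_nodes
```

The function only reads node names, so it works equally with kubectl or with a saved pod listing.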
Best Practices
- Never set NCCL_IB_HCA globally in MPI mode: let each rank auto-detect
- Use NCCL_IB_GID_INDEX=3 for standard RoCEv2 with IPv4
- Use NCCL_NET_GDR_LEVEL=SYS with SR-IOV (non-deterministic VF placement)
- Source validate_network.sh: consistent environment across all test variants
- Export NCCL vars through mpirun -x: ensures workers inherit settings
- Keep OPS_PER_CONNECTION=1 for validation: increase for production tuning
- Save all log variants: compare PIX vs PHB vs SYS to quantify topology impact
- Check pod placement: workers must be on different nodes for network tests
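Comparing the saved log variants is easier with one number per run. The sketch below pulls the "Avg bus bandwidth" summary line that nccl-tests prints at the end of a run; avg_busbw is a hypothetical helper, and it assumes each log holds a single test run.

```shell
#!/bin/bash
# Hypothetical helper: print "<logfile> <avg busbw>" for each nccl-tests log,
# taken from the "# Avg bus bandwidth : <value>" summary line.
# Logs without that line produce no output.
avg_busbw() {
  local log
  for log in "$@"; do
    awk -v name="$log" '
      /Avg bus bandwidth/ { value = $NF }
      END { if (value != "") print name, value }
    ' "$log"
  done
}

# Example:
#   avg_busbw pix.log phb_v5.log sys_v5.log
```

Average bus bandwidth is a coarse summary; for topology questions it still helps to compare the large-message rows of each log directly.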
Key Takeaways
- validate_network.sh: standardized NCCL environment for SR-IOV GPU clusters
- Don't set NCCL_IB_HCA globally: each MPI rank auto-detects its own VF
- NCCL_IB_GID_INDEX=3: RoCEv2 routable IPv4 GID (verify with show_gids)
- NCCL_NET_GDR_LEVEL=SYS: use GDRDMA even when VF is on different socket than GPU
- NCCL_SOCKET_IFNAME=net1: point NCCL to SR-IOV Multus secondary interface
- Bandwidth troubleshooting: systematic checklist from VF allocation to HCA selection
- Keep multiple test variants (single/pix/phb/sys) to characterize cluster topology impact
- MPI launcher doesn't need GPUs or RDMA: only worker pods need resources

