NCCL RoCE Validation MPIJob Complete Reference
Complete nccl-roce-validation.yaml MPIJob reference for OpenShift GPU clusters. Full launcher environment variables, OpenMPI control plane settings, NCCL
💡 Quick Answer: The complete nccl-roce-validation.yaml MPIJob configures: NCCL env vars (GDR level, DMA-BUF, socket interface, debug), the OpenMPI control plane (OMPI_MCA_btl_tcp_if_include=eth0, SSH with no host key checking, 60s abort timeout), test bounds (NCCL_TEST_MIN_BYTES=1G, NCCL_TEST_MAX_BYTES=16G), NCCL_NET_GDR_READ=0, an optional CUDA_VISIBLE_DEVICES=2 (commented out in the YAML), and launcher resource limits (2 CPU/4Gi). The IB device tree logs show multiple QP numbers with ECE negotiation (query_ece, set_ece), indicating RoCEv2 enhanced connection establishment.
The Problem
- Need a complete, production-ready MPIJob YAML for NCCL RoCE validation
- OpenMPI control plane must use pod network (eth0) while NCCL data uses SR-IOV (net1)
- DNS resolution for MPI worker hostnames can fail or timeout
- Need to understand IB device tree connection logs (QPN, ECE, MTU, GID)
- Workers stuck in Terminating state after job completes
The Solution
Complete nccl-roce-validation.yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-roce-validation
  namespace: gpu-workloads
spec:
  launcherCreationPolicy: AtStartup
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: registry.example.com/nccl-tests:latest
              env:
                # === NCCL Network Settings ===
                - name: NCCL_IB_DISABLE
                  value: "0"
                - name: NCCL_COLLNET_ENABLE
                  value: "0"
                # GDR Level: PIX (commented out) or SYS (from validate_network.sh)
                # - name: NCCL_NET_GDR_LEVEL
                #   value: "PIX"
                - name: NCCL_NET_GDR_LEVEL
                  value: "PIX"
                - name: NCCL_DMABUF_ENABLE
                  value: "1"
                - name: NCCL_NET_PLUGIN
                  value: "none" # Socket fallback (remove for IB plugin)
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_SHM_DISABLE
                  value: "0"
                - name: NCCL_DEBUG_SUBSYS
                  value: "INIT,NET,GRAPH"
                # === Test Bounds ===
                - name: NCCL_TEST_MIN_BYTES
                  value: "1G"
                - name: NCCL_TEST_MAX_BYTES
                  value: "16G"
                # === GPUDirect Read ===
                - name: NCCL_NET_GDR_READ
                  value: "0" # Disable GDR for reads (use for debugging)
                # === OpenMPI Control Plane ===
                # MPI control traffic on eth0 (pod network), NOT on net1 (RDMA)
                - name: OMPI_MCA_btl_tcp_if_include
                  value: "eth0"
                - name: OMPI_MCA_plm_rsh_agent
                  value: "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
                - name: OMPI_MCA_orte_abort_timeout
                  value: "60"
                - name: OMPI_MCA_coll_ucc_enable
                  value: "0"
                - name: OMPI_MCA_coll_hcoll_enable
                  value: "0"
                # === GPU Visibility ===
                # - name: CUDA_VISIBLE_DEVICES
                #   value: "2" # Uncomment to limit GPU selection
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: sriov-rdma-net
        spec:
          containers:
            - name: worker
              image: registry.example.com/nccl-tests:latest
              env:
                - name: START_SSHD
                  value: "true"
                - name: NCCL_SOCKET_IFNAME
                  value: net1
                - name: NCCL_NET_GDR_LEVEL
                  value: SYS
                - name: NCCL_DMABUF_ENABLE
                  value: "1"
                - name: NCCL_SHM_DISABLE
                  value: "0"
              resources:
                limits:
                  nvidia.com/gpu: "2"
                  rdma/rdma_shared_device_a: "1"
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              volumeMounts:
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "64Gi"
OpenMPI Control Plane Settings Explained
Variable                    │ Value                           │ Purpose
────────────────────────────┼─────────────────────────────────┼──────────────────────────────
OMPI_MCA_btl_tcp_if_include │ eth0                            │ MPI control on pod network
                            │                                 │ (NOT net1/RDMA interface)
────────────────────────────┼─────────────────────────────────┼──────────────────────────────
OMPI_MCA_plm_rsh_agent      │ ssh -o StrictHostKeyChecking=no │ SSH without host key prompts
                            │ -o UserKnownHostsFile=/dev/null │ (pods are ephemeral)
────────────────────────────┼─────────────────────────────────┼──────────────────────────────
OMPI_MCA_orte_abort_timeout │ 60                              │ Wait 60s before killing ranks
                            │                                 │ on error (allows log flush)
────────────────────────────┼─────────────────────────────────┼──────────────────────────────
OMPI_MCA_coll_ucc_enable    │ 0                               │ Disable UCC collectives
                            │                                 │ (use NCCL instead)
────────────────────────────┼─────────────────────────────────┼──────────────────────────────
OMPI_MCA_coll_hcoll_enable  │ 0                               │ Disable HPC-X HCOLL
                            │                                 │ (let NCCL handle collectives)
────────────────────────────┴─────────────────────────────────┴──────────────────────────────
Key insight: MPI uses eth0 for process management (launch, signal, barrier).
NCCL uses net1 (SR-IOV) for actual GPU data transfer.
These are separate planes; don't mix them!
MPI DNS Resolution and Hostfile
=== MPI hostfile ===
nccl-roce-validation-worker-0.nccl-roce-validation.runai-benchmark.svc slots=2
nccl-roce-validation-worker-1.nccl-roce-validation.runai-benchmark.svc slots=2
Waiting for MPI worker DNS records to resolve...
Hostfile: /etc/mpi/hostfile
Timeout: 120s
DNS WAIT: nccl-roce-validation-worker-0.nccl-roce-validation.runai-benchmark.svc not resolvable yet
DNS WAIT: nccl-roce-validation-worker-1.nccl-roce-validation.runai-benchmark.svc not resolvable yet
...
(retries until workers are Running and headless Service has endpoints)
DNS format: <pod-name>.<headless-svc>.<namespace>.svc
slots=2 → 2 GPUs per worker (matches the nvidia.com/gpu: "2" limit on each worker)
IB Device Tree Connection Logs
# These logs show RoCE connection establishment:
NCCL INFO NET/IB: NCCL Dev 0 IBDev 0 Port 1 qpn 364 mtu 5 GID 3
(0/B9D4E80AFFFF0000) fifoRKey=0x41200 fifoLKey=0x41200
NCCL INFO NET/IB: IBDev 0 Port 1 qpn 364 query_ece={supported=1,
vendor_id=0x15b3, options=0x30000002, comp_mask=0x0}
NCCL INFO NET/IB: IBDev 0 Port 1 qpn 236 set_ece={supported=1,
vendor_id=0x15b3, options=0x30000002, comp_mask=0x0}
Decoded:
Dev 0 IBDev 0 Port 1 → First RDMA device, port 1
qpn 364/236/367/241 → Queue Pair Numbers (multiple QPs for parallel transfer)
mtu 5 → MTU index 5 = 4096 bytes (RoCE maximum)
GID 3 → GID index 3 (RoCEv2 IPv4 routable)
0/B9D4E80AFFFF0000 → GID value (IPv4-mapped IPv6 form)
fifoRKey/fifoLKey → Remote/Local memory registration keys for RDMA FIFO
vendor_id=0x15b3 → Mellanox/NVIDIA NIC
options=0x30000002 → ECE options (Enhanced Connection Establishment)
supported=1 → ECE negotiation successful
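The decoding above can be automated when many connection lines need checking. The following is a minimal sketch, assuming the log line format shown in this article (other NCCL versions may print these fields differently); `parse_ib_connection` and `IB_MTU_BYTES` are hypothetical names introduced for illustration. The MTU table follows the standard InfiniBand verbs MTU enumeration, where index 5 means 4096 bytes.

```python
import re

# InfiniBand verbs MTU enumeration: index -> bytes.
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}

# Matches the connection line format quoted in this article (an assumption:
# the exact wording can vary between NCCL releases).
_CONN_RE = re.compile(
    r"NCCL Dev (?P<dev>\d+) IBDev (?P<ibdev>\d+) Port (?P<port>\d+) "
    r"qpn (?P<qpn>\d+) mtu (?P<mtu>\d+) GID (?P<gid>\d+)"
)


def parse_ib_connection(line):
    """Return the numeric fields of a NET/IB connection line, or None."""
    match = _CONN_RE.search(line)
    if match is None:
        return None
    fields = {key: int(value) for key, value in match.groupdict().items()}
    # Translate the MTU index into bytes; unknown indexes map to None.
    fields["mtu_bytes"] = IB_MTU_BYTES.get(fields["mtu"])
    return fields


if __name__ == "__main__":
    sample = "NCCL INFO NET/IB: NCCL Dev 0 IBDev 0 Port 1 qpn 364 mtu 5 GID 3"
    print(parse_ib_connection(sample))
```

Counting the distinct `qpn` values across the parsed lines for one peer gives the number of QPs NCCL opened for that connection.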
# Multiple QPN lines = NCCL opening multiple QPs per connection:
qpn 364 → QP 1
qpn 236 → QP 2
qpn 367 → QP 3
qpn 241 → QP 4
→ 4 QPs for this connection (parallel RDMA operations)
# "Connected all trees" confirms ring/tree topology fully established:
NCCL INFO Connected all trees
NCCL_NET_GDR_READ Setting
NCCL_NET_GDR_READ=0 (disabled; the setting used in this job)
→ NIC does NOT read directly from GPU memory for SEND operations
→ Data path: GPU → host buffer → NIC → wire
→ Lower CPU load but one extra copy
NCCL_NET_GDR_READ=1 (enabled)
→ NIC reads directly from GPU memory (GPUDirect RDMA read)
→ Data path: GPU → NIC → wire (zero-copy)
→ Better bandwidth but requires close GPU-NIC topology
→ Can cause issues if GPU and NIC are on different NUMA nodes
For SR-IOV with non-deterministic placement:
NCCL_NET_GDR_READ=0 is safer (avoids cross-socket read penalties)
NCCL_NET_GDR_READ=1 with NCCL_NET_GDR_LEVEL=SYS works but may be slower
Validation Workflow
# 1. Apply the MPIJob
oc apply -f nccl-roce-validation.yaml
# 2. Watch pods come up
oc get pods -w
# nccl-roce-validation-launcher-xxx 0/1 Pending → Running
# nccl-roce-validation-worker-0 1/1 Running
# nccl-roce-validation-worker-1 1/1 Running
# 3. Follow launcher logs (real-time)
oc logs -f nccl-roce-validation-launcher-xxx
# 4. Wait for "Validation complete." message
# Look for: "Read the busbw column."
# 5. Check results
# Closing env plugin ncclEnvDefault → cleanup
# Look for final bandwidth line
# 6. Cleanup
oc delete -f nccl-roce-validation.yaml
# mpijob.kubeflow.org "nccl-roce-validation" deleted
oc get pods # Workers will be Terminating for a few minutes (grace period)
Workers Stuck Terminating
Observed pod status after deleting the job (from screenshots):
nccl-roce-validation-worker-0 1/1 Terminating 0 3m55s
nccl-roce-validation-worker-1 1/1 Terminating 0 3m55s
... still Terminating at 4m2s, 4m5s, 4m12s
This is NORMAL for:
- SR-IOV VF cleanup (VF must be released back to pool)
- GPU resource deallocation
- Shared memory cleanup (64Gi tmpfs)
- RDMA connection teardown
If stuck > 5 minutes:
oc delete pod nccl-roce-validation-worker-0 --force --grace-period=0
Key Results from PIX Test (with GDRDMA enabled for close pairs)
# With NCCL_NET_GDR_LEVEL=PIX, GDRDMA active (distance 9 <= 9 in SYS test):
# GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 9 <= 9), read 1 mode Default
# IB connection: MTU 5, GID 3, ECE supported, 4 QPs per connection
# "Connected all trees" → full topology established
# Result at 1073741824 bytes (1 GB):
1073741824 268435456 float sum -1 50047.0 21.45 [busbw] 0 50156.6 21.41 32.11 0
# ~32 GB/s busbw at 1GB message size with GDRDMA
# Compare to: ~13 GB/s without RDMA (NCCL_NET_PLUGIN=none)
# Compare to: ~68 GB/s NVLink intra-node
Common Issues
“DNS WAIT: … not resolvable yet” (loops indefinitely)
- Cause: Headless Service not created; or worker pods not yet Running
- Fix: Ensure nccl-roce-validation-headless-svc.yaml is applied; increase MPI_DNS_WAIT_SECONDS
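The launcher script that prints the DNS WAIT lines is not shown in this article, so the following is only a sketch of what such a wait loop might look like, assuming a hostfile with `host slots=N` lines and a timeout like the 120s shown earlier. `parse_hostfile` and `wait_for_dns` are hypothetical helper names; the resolver, sleep, and clock are injectable so the loop can be exercised without a cluster.

```python
import socket
import time


def parse_hostfile(text):
    """Parse 'host slots=N' lines into (host, slots) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        slots = 1  # default when no slots= field is present
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        entries.append((fields[0], slots))
    return entries


def _resolves(host, resolve):
    """True if the resolver returns without raising OSError."""
    try:
        resolve(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def wait_for_dns(hosts, timeout_s=120.0, interval_s=2.0,
                 resolve=socket.gethostbyname,
                 sleep=time.sleep, clock=time.monotonic):
    """Retry until every host resolves or the timeout passes.

    Returns the hosts that are still unresolvable (empty list on success).
    """
    deadline = clock() + timeout_s
    pending = list(hosts)
    while True:
        pending = [h for h in pending if not _resolves(h, resolve)]
        if not pending or clock() >= deadline:
            return pending
        sleep(interval_s)


if __name__ == "__main__":
    with open("/etc/mpi/hostfile") as fh:
        workers = [host for host, _ in parse_hostfile(fh.read())]
    unresolved = wait_for_dns(workers)
    for host in unresolved:
        print(f"DNS WAIT: {host} not resolvable yet")
```

Returning the unresolved hosts instead of looping forever makes the failure visible: a non-empty result after the timeout points at a missing headless Service or workers that never reached Running.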
“Closing env plugin ncclEnvDefault” but no results shown
- Cause: Test completed but terminal scrolled past results
- Fix: Check full logs: oc logs nccl-roce-validation-launcher-xxx --tail=100
Workers Terminating for > 5 minutes
- Cause: SR-IOV VF finalizer stuck; or GPU resource not released
- Fix: Force delete: oc delete pod <name> --force --grace-period=0
“OMPI_MCA_btl_tcp_if_include: eth0 not found”
- Cause: Pod network interface named differently (e.g., eth0@ifXXX)
- Fix: Check ip link in pod; use actual interface name
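For the last issue, it helps to know that `ip link` prints veth-style interfaces as `name@ifN`, where the part before `@` is the name to use in settings such as OMPI_MCA_btl_tcp_if_include. The sketch below extracts interface names from `ip link` text and reports which expected ones are missing. `interface_names` and `missing_interfaces` are hypothetical helpers, and the sample output is illustrative rather than captured from a real pod.

```python
def interface_names(ip_link_output):
    """Extract interface names from `ip link` output, dropping any '@ifN' suffix."""
    names = []
    for line in ip_link_output.splitlines():
        # Header lines look like "3: eth0@if42: <FLAGS> mtu ...";
        # the indented "link/ether ..." lines contain no ": " separator.
        parts = line.split(": ", 2)
        if len(parts) >= 2 and parts[0].strip().isdigit():
            names.append(parts[1].split("@", 1)[0])
    return names


def missing_interfaces(required, ip_link_output):
    """Return the required interface names that do not appear in the output."""
    present = set(interface_names(ip_link_output))
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Illustrative sample only; run `ip link` inside the pod for real data.
    sample = (
        "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN\n"
        "    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n"
        "3: eth0@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450\n"
        "    link/ether 0a:58:0a:80:02:1f brd ff:ff:ff:ff:ff:ff link-netnsid 0\n"
        "5: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000\n"
    )
    print(interface_names(sample))
    print(missing_interfaces(["eth0", "net1"], sample))
```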
Best Practices
- Separate control and data planes: MPI on eth0, NCCL on net1
- Disable UCC and HCOLL: let NCCL handle GPU collectives exclusively
- Set abort timeout = 60: gives time to flush logs before cleanup
- Use SSH without host key checking: pods are ephemeral, keys change
- NCCL_TEST_MIN_BYTES=1G: skip small messages for production validation (only large matters)
- NCCL_NET_GDR_READ=0 for SR-IOV: safer with non-deterministic VF placement
- Save all log files: compare pix/phb/sys variants to quantify topology impact
- Delete MPIJob (not pods): proper cleanup of all resources
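The first two practices can be checked mechanically before a job is submitted. This is a rough sketch, assuming the launcher and worker environment variables have been merged into one dictionary; `check_plane_separation` is a hypothetical helper, and it treats interface values as plain comma-separated names, ignoring the `^` and `=` prefixes that NCCL_SOCKET_IFNAME also accepts.

```python
def check_plane_separation(env):
    """Return a list of problems; an empty list means the planes look separate."""
    problems = []
    mpi_if = env.get("OMPI_MCA_btl_tcp_if_include", "")
    nccl_if = env.get("NCCL_SOCKET_IFNAME", "")

    if not mpi_if:
        problems.append("OMPI_MCA_btl_tcp_if_include is not set")
    if not nccl_if:
        problems.append("NCCL_SOCKET_IFNAME is not set")

    # Simplification: compare plain comma-separated interface names only.
    shared = set(filter(None, mpi_if.split(","))) & set(filter(None, nccl_if.split(",")))
    if shared:
        problems.append(
            "MPI and NCCL share interface(s): " + ", ".join(sorted(shared))
        )

    # UCC and HCOLL should be explicitly disabled so NCCL owns GPU collectives.
    for key in ("OMPI_MCA_coll_ucc_enable", "OMPI_MCA_coll_hcoll_enable"):
        if env.get(key) != "0":
            problems.append(f'{key} should be "0"')
    return problems


if __name__ == "__main__":
    env = {
        "OMPI_MCA_btl_tcp_if_include": "eth0",
        "NCCL_SOCKET_IFNAME": "net1",
        "OMPI_MCA_coll_ucc_enable": "0",
        "OMPI_MCA_coll_hcoll_enable": "0",
    }
    print(check_plane_separation(env))
```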
Key Takeaways
- Complete MPIJob: launcher (control) + workers (GPU + RDMA) with separate network planes
- MPI control on eth0 (pod network), NCCL data on net1 (SR-IOV RDMA)
- OpenMPI settings: no host key check, 60s abort timeout, UCC/HCOLL disabled
- IB device tree: QPN allocation, GID 3 (RoCEv2), ECE negotiation, MTU 5 (4096)
- NCCL_NET_GDR_READ=0: disable GPU-direct read (safer for SR-IOV)
- NCCL_TEST_MIN_BYTES/MAX_BYTES: bound test range (1G-16G for production)
- DNS resolution: headless Service → pod FQDN → MPI SSH connection
- Workers terminate slowly (VF cleanup, GPU release); normal up to 5 minutes
- “Validation complete. Read the busbw column.” = success
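To read the busbw column with some confidence, it can help to recompute it from the raw size and time. The sketch below assumes the test is an all-reduce (nccl-tests derives all-reduce bus bandwidth as algbw × 2(n−1)/n) and that the job ran 4 ranks, inferred here from 2 workers with 2 GPUs each; neither assumption is stated outright in the results. The function names are hypothetical.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: bytes per microsecond divided by 1000."""
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbps(algbw, n_ranks):
    """Bus bandwidth for all-reduce: algbw scaled by 2(n-1)/n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # In-place columns from the 1 GB result line quoted in this article.
    size_bytes = 1073741824
    time_us = 50156.6
    n_ranks = 4  # assumption: 2 workers x 2 GPUs

    algbw = algbw_gbps(size_bytes, time_us)
    busbw = allreduce_busbw_gbps(algbw, n_ranks)
    print(f"algbw = {algbw:.2f} GB/s, busbw = {busbw:.2f} GB/s")
```

With those inputs the recomputed values agree with the in-place algbw (21.41) and busbw (32.11) columns of the quoted line, which supports the 4-rank all-reduce reading.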

