NCCL RoCE Validation with Kubeflow MPIJob on Kubernetes
Run NCCL all_reduce_perf validation tests using Kubeflow MPIJob on GPU clusters. Configure MPI launcher and workers, NCCL environment variables, test
💡 Quick Answer: Use Kubeflow's MPIJob (v2beta1) to run NCCL all_reduce_perf validation across GPU nodes. The MPIJob creates a launcher pod and worker pods, orchestrates MPI rank placement, and runs collective tests. Single-node 8× H200 NVL achieves ~68 GB/s busbw (pure NVLink). Multi-node 2×2 GPU over RoCE with NCCL_NET_PLUGIN=none (socket fallback) gets ~13-35 GB/s. For full RDMA performance, ensure pods have /dev/infiniband access via the shared RDMA device plugin.
The Problem
- Need to validate GPU interconnect performance before running production training
- Must test both intra-node (NVLink) and inter-node (RoCE/IB) paths independently
- NCCL multi-node tests require MPI coordination across pods
- RDMA devices may be missing in pods if the device plugin is not configured
- Need standardized, repeatable benchmark jobs for cluster acceptance
The Solution
MPIJob for Single-Node 8-GPU Validation (NVLink)
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-single-node-validation
  namespace: gpu-workloads
spec:
  slotsPerWorker: 8   # one MPI slot per GPU so that -np 8 fits on the single worker (default is 1)
  launcherCreationPolicy: AtStartup
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            app: nccl-single-node-validation
        spec:
          containers:
            - name: mpi-job
              image: nvcr.io/nvidia/pytorch:24.04-py3
              env:
                - name: REWRITE_MPI_HOSTFILE_FQDN
                  value: "false"
                - name: MPI_DNS_WAIT_SECONDS
                  value: "120"
                - name: MPI_DNS_WAIT_INTERVAL
                  value: "3"
              command:
                - mpirun
              args:
                - --allow-run-as-root
                - -np
                - "8"
                - --bind-to
                - none
                - -x
                - NCCL_DEBUG=INFO
                - /opt/nccl-tests/build/all_reduce_perf
                - -b
                - "32G"
                - -e
                - "32G"
                - -f
                - "2"
                - -g
                - "1"
                - -w
                - "1"
                - -n
                - "20"
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - name: worker
              image: nvcr.io/nvidia/pytorch:24.04-py3
              resources:
                limits:
                  nvidia.com/gpu: "8"
              volumeMounts:
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "64Gi"
Expected Results: Single-Node 8× H200 NVL
# nccl-tests version 2.17.6 nccl-headers=22808 nccl-library=22808
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 8 minBytes 34359738368 maxBytes 34359738368 step: 2(factor)
#
# Using devices
# Rank 0 Group 0 Pid 52 on nccl-single-node-validation device 0 [0000:18:00] NVIDIA H200 NVL
# Rank 1 Group 0 Pid 52 on nccl-single-node-validation device 1 [0000:67:00] NVIDIA H200 NVL
# Rank 2 Group 0 Pid 52 on nccl-single-node-validation device 2 [0000:b2:00] NVIDIA H200 NVL
# Rank 3 Group 0 Pid 52 on nccl-single-node-validation device 3 [0000:d8:00] NVIDIA H200 NVL
# Rank 4 Group 0 Pid 52 on nccl-single-node-validation device 4 [0001:18:00] NVIDIA H200 NVL
# Rank 5 Group 0 Pid 52 on nccl-single-node-validation device 5 [0001:69:00] NVIDIA H200 NVL
# Rank 6 Group 0 Pid 52 on nccl-single-node-validation device 6 [0001:8f:00] NVIDIA H200 NVL
# Rank 7 Group 0 Pid 52 on nccl-single-node-validation device 7 [0001:b3:00] NVIDIA H200 NVL
#
# size count type redop root time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s)
34359738368 8589934592 float sum -1 875713 39.24 68.66 0
# Avg bus bandwidth : 68.6248
# Collective test concluded: all_reduce_perf
# ✅ 68.66 GB/s busbw = excellent (near H200 NVL theoretical max)
# This confirms NVLink fabric is healthy across all 8 GPUs
MPIJob for Multi-Node 2×2 GPU RoCE Validation
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-roce-validation
  namespace: gpu-workloads
spec:
  launcherCreationPolicy: AtStartup
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            app: nccl-roce-validation
        spec:
          containers:
            - name: mpi-job
              env:
                - name: REWRITE_MPI_HOSTFILE_FQDN
                  value: "false"
                - name: MPI_DNS_WAIT_SECONDS
                  value: "120"
                - name: MPI_DNS_WAIT_INTERVAL
                  value: "3"
                - name: MPI_NP
                  value: "4"
                - name: GPUS_PER_MPI_PROCESS
                  value: "1"
                - name: MPI_HOSTFILE
                  value: /etc/mpi/hostfile
                # NCCL configuration
                - name: NCCL_SOCKET_IFNAME
                  value: net1   # Secondary network interface (Multus)
                - name: NCCL_DMABUF_ENABLE
                  value: "1"    # Enable DMA-BUF for GPUDirect
                - name: NCCL_NET_PLUGIN
                  value: none   # Disable IB plugin (use TCP sockets)
                - name: NCCL_SHM_DISABLE
                  value: "1"    # Force network path (no SHM shortcut)
              image: nvcr.io/nvidia/pytorch:24.04-py3
              command:
                - /opt/nccl-tests/build/all_reduce_perf
              args:
                - -b
                - "8"
                - -e
                - "8G"
                - -f
                - "2"
                - -g
                - "1"
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: rdma-net
        spec:
          containers:
            - name: worker
              image: nvcr.io/nvidia/pytorch:24.04-py3
              resources:
                limits:
                  nvidia.com/gpu: "2"
                  rdma/rdma_shared_device_a: "1"
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              volumeMounts:
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "32Gi"
Results: Multi-Node Without RDMA (Socket Fallback)
# When /dev/infiniband is missing and NCCL_NET_PLUGIN=none:
# NCCL falls back to TCP sockets over the secondary network (net1)
=============== System diagnostics ===============
Hostname: nccl-roce-validation-launcher
RDMA devices:
0 HCAs found:
WARNING: /dev/infiniband is missing. RDMA will not work.
=================================================
# Results with socket fallback (2 nodes × 2 GPUs = 4 ranks, 16 ranks in full test):
# size count type redop root time algbw busbw #wrong
8589934592 2147483648 float sum -1 458955 18.72 35.09 0
# Avg bus bandwidth : 13.4939
#
# Peak: ~35 GB/s busbw (large messages)
# Average: ~13.5 GB/s (across all sizes)
#
# ⚠️ This is WITHOUT RDMA: TCP socket over RoCE NIC
# With proper RDMA (/dev/infiniband + IB plugin): expect 2-3× better
NCCL Environment Variables Explained
Variable                  │ Value │ Purpose
──────────────────────────┼───────┼─────────────────────────────────────────
NCCL_SOCKET_IFNAME        │ net1  │ Use secondary network (Multus) for NCCL
NCCL_DMABUF_ENABLE        │ 1     │ Allow DMA-BUF for GPUDirect RDMA
NCCL_NET_PLUGIN           │ none  │ Disable IB verbs plugin (force sockets)
NCCL_SHM_DISABLE          │ 1     │ Disable shared memory (force network path)
MPI_NP                    │ 4     │ Total MPI processes (ranks)
GPUS_PER_MPI_PROCESS      │ 1     │ Each rank gets 1 GPU
MPI_DNS_WAIT_SECONDS      │ 120   │ Wait for worker DNS resolution
MPI_DNS_WAIT_INTERVAL     │ 3     │ DNS retry interval (seconds)
REWRITE_MPI_HOSTFILE_FQDN │ false │ Don't rewrite hostfile with FQDNs
──────────────────────────┴───────┴─────────────────────────────────────────
To enable RDMA instead of sockets:
  NCCL_NET_PLUGIN: ""           (or remove it to use the default IB plugin)
  NCCL_IB_HCA: mlx5_0,mlx5_3    (specify HCAs)
  NCCL_NET_GDR_LEVEL: 5         (enable GPUDirect RDMA)
Fix: Enable RDMA in Multi-Node Test
# The 2x2gpu test showed "0 HCAs found" because:
# 1. Pods don't request rdma/rdma_shared_device_a
# 2. NCCL_NET_PLUGIN=none explicitly disables IB
# Fixed version with RDMA:
env:
  - name: NCCL_SOCKET_IFNAME
    value: net1
  - name: NCCL_DMABUF_ENABLE
    value: "1"
  # Remove NCCL_NET_PLUGIN=none (let NCCL use IB plugin)
  - name: NCCL_IB_HCA
    value: "mlx5_0,mlx5_3,mlx5_5,mlx5_6"
  - name: NCCL_NET_GDR_LEVEL
    value: "5"
  # Remove NCCL_SHM_DISABLE (allow SHM for intra-node)
# Worker must request RDMA device:
resources:
  limits:
    nvidia.com/gpu: "2"
    rdma/rdma_shared_device_a: "1"   # ← This gives /dev/infiniband
securityContext:
  capabilities:
    add: ["IPC_LOCK"]                # ← Required for RDMA
Test Matrix: Recommended Validations
Test Name        │ Config          │ Validates                    │ Expected busbw
─────────────────┼─────────────────┼──────────────────────────────┼────────────────────
nccl-prod-1x4    │ 1 node, 4 GPU   │ NVLink within NVL4 group     │ ~68 GB/s
nccl-prod-1x8    │ 1 node, 8 GPU   │ Full NVLink fabric (2×NVL4)  │ ~68 GB/s
nccl-prod-2x2gpu │ 2 nodes, 2/node │ Cross-node network path      │ ~35 GB/s (socket)
                 │                 │                              │ ~50 GB/s (RDMA)
nccl-prod-2x8gpu │ 2 nodes, 8/node │ Full multi-node scale        │ ~35 GB/s (socket)
                 │                 │                              │ ~48 GB/s (RDMA+GDR)
─────────────────┴─────────────────┴──────────────────────────────┴────────────────────
Naming convention: nccl-prod-{nodes}x{gpus_per_node}
Files generated:
- nccl-prod-1x8.log (benchmark output)
- nccl-prod-1x8.describe.txt (kubectl describe of MPIJob)
Diagnostic Output Interpretation
=============== System diagnostics ===============
Hostname: nccl-roce-validation-launcher
Date: Wed May 28 12:38:32 UTC 2026
User: uid=0(root) gid=0(root) groups=0(root)
Interfaces:
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0@if257 UP 10.233.8.27/23 fe80::858:aff:fee9:81b/64
WARNING: nvidia-smi not found.    ← Launcher pod has no GPUs (expected)
                                    Workers have GPUs, not the launcher
RDMA devices:
0 HCAs found:                     ← No RDMA in launcher (expected if launcher-only)
WARNING: /dev/infiniband is missing. RDMA will not work.
                                  ↑ If workers also show this = problem!
=================================================
================ NCCL / MPI environment ================
CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0
CUDA_DRIVER_VERSION=560.95.05
CUDA_VERSION=13.0.2.006
GPUS_PER_MPI_PROCESS=1
MPI_DNS_WAIT_INTERVAL=3
...
Run:ai Integration
# When running under Run:ai, the MPIJob gets Run:ai annotations:
metadata:
  annotations:
    runai-calculated-status: Running
    runai-current-allocated-gpus: "4"
    runai-current-allocated-gpus-memory: "301509"
    runai-current-requested-gpus: "4"
    runai-running-pods: "2"
    runai-total-requested-gpus: "4"
    runai-used-nodes: gpu-node-0, gpu-node-1
  namespace: project-001   # Run:ai project namespace
# Run:ai scheduler:
# - Places workers on nodes with available GPUs
# - Tracks GPU memory allocation (301509 MB = ~294 GB for 4× H200)
# - Reports used nodes for visibility
Full Validation Script
#!/bin/bash
# run-nccl-validation.sh: run all NCCL test variants and collect evidence
set -euo pipefail   # stop on the first failed kubectl call or unset variable

NAMESPACE="gpu-workloads"
IMAGE="nvcr.io/nvidia/pytorch:24.04-py3"   # image referenced by the manifests (not used below)

# Test 1: Single-node 8 GPU (NVLink validation)
echo "Starting 1x8 NVLink test..."
kubectl apply -f nccl-single-node-1x8.yaml -n "$NAMESPACE"
kubectl wait --for=condition=Succeeded mpijob/nccl-single-node-validation \
  -n "$NAMESPACE" --timeout=600s
# The app label is set on the launcher pod only, so this captures launcher output
kubectl logs -n "$NAMESPACE" -l app=nccl-single-node-validation \
  --tail=50 > nccl-prod-1x8.log
kubectl describe mpijob nccl-single-node-validation \
  -n "$NAMESPACE" > nccl-prod-1x8.describe.txt

# Test 2: Multi-node 2x2 GPU (network validation)
echo "Starting 2x2 RoCE test..."
kubectl apply -f nccl-roce-2x2gpu.yaml -n "$NAMESPACE"
kubectl wait --for=condition=Succeeded mpijob/nccl-roce-validation \
  -n "$NAMESPACE" --timeout=600s
kubectl logs -n "$NAMESPACE" -l app=nccl-roce-validation \
  --tail=100 > nccl-prod-2x2gpu.log
kubectl describe mpijob nccl-roce-validation \
  -n "$NAMESPACE" > nccl-prod-2x2gpu.describe.txt

# Parse results
echo "=== Results ==="
grep "Avg bus bandwidth" nccl-prod-*.log
Common Issues
“WARNING: /dev/infiniband is missing. RDMA will not work.”
- Cause: Pod doesn’t request rdma/rdma_shared_device_a, or the RDMA device plugin is not deployed
- Fix: Add the RDMA resource request to worker pods; deploy the shared RDMA device plugin
“WARNING: nvidia-smi not found” in launcher
- Cause: Launcher pod doesn’t need GPUs; it only coordinates MPI
- Fix: This is expected. Only workers need GPU resources. Ignore this warning in launcher logs.
Low busbw on multi-node (13 GB/s instead of 50 GB/s)
- Cause: NCCL_NET_PLUGIN=none forces TCP sockets; no RDMA
- Fix: Remove NCCL_NET_PLUGIN=none; add the RDMA device to workers; set NCCL_IB_HCA
MPI launcher times out waiting for workers
- Cause: DNS not resolving worker hostnames, or workers not ready
- Fix: Increase MPI_DNS_WAIT_SECONDS; verify worker pods are Running; check the headless Service
“NCCL WARN Connect to … failed”
- Cause: Network policy blocking inter-pod traffic, or wrong NCCL_SOCKET_IFNAME
- Fix: Allow all traffic between NCCL pods; set NCCL_SOCKET_IFNAME to the correct interface (net1 for Multus secondary)
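The warning strings listed above can also be checked mechanically in a saved log. The following is a minimal sketch and not part of the original validation script: the function name `scan_nccl_log` and the tag names it prints are hypothetical, and it only greps for the three message fragments quoted in this section. Keep in mind that the /dev/infiniband and nvidia-smi warnings are expected in launcher logs, so the tags matter mainly when they show up in worker logs.

```shell
#!/usr/bin/env bash
# Scan a saved NCCL test log for the warning strings quoted in "Common Issues".
# Prints one short tag per issue found, or "clean" if none of them match.
# (scan_nccl_log and the tag names are illustrative, not from any tool.)
scan_nccl_log() {
  local log="$1" found=0
  if grep -q '/dev/infiniband is missing' "$log"; then
    echo "rdma-missing"; found=1
  fi
  if grep -q 'nvidia-smi not found' "$log"; then
    echo "no-nvidia-smi"; found=1
  fi
  if grep -Eq 'NCCL WARN Connect to .* failed' "$log"; then
    echo "connect-failed"; found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo "clean"
  fi
}

# Example (log file name taken from the validation script):
#   scan_nccl_log nccl-prod-2x2gpu.log
```

A log that prints `rdma-missing` for a worker pod points back to the first issue above: the RDMA resource request or the device plugin.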
Best Practices
- Test NVLink first (1x8): validate intra-node before adding network complexity
- Then test network (2x2): isolates network performance from NVLink
- Save logs and describe output: create test evidence for cluster acceptance
- Compare socket vs RDMA: run with and without NCCL_NET_PLUGIN=none to measure RDMA gain
- Use large messages for peak bandwidth: 32GB messages show true fabric capacity
- Run regularly: detect hardware degradation early
- Pin NCCL test image version: reproducible results across test runs
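The socket-versus-RDMA comparison can be reduced to a single ratio of the two "Avg bus bandwidth" lines. The sketch below assumes each run was saved to its own log file; the helper names `avg_busbw` and `busbw_gain` and the log file names in the usage comment are hypothetical. The same ratio works for the "run regularly" practice if the first argument is a stored baseline log.

```shell
#!/usr/bin/env bash
# Print the "Avg bus bandwidth" value (GB/s) from an nccl-tests log.
# The line looks like: "# Avg bus bandwidth : 68.6248"
avg_busbw() {
  awk '/Avg bus bandwidth/ { print $NF; exit }' "$1"
}

# Print second-log / first-log average busbw with two decimals.
# Exits non-zero if the first log has no usable value.
busbw_gain() {
  local base new
  base="$(avg_busbw "$1")"
  new="$(avg_busbw "$2")"
  awk -v b="$base" -v n="$new" 'BEGIN {
    if ((b + 0) <= 0) exit 1
    printf "%.2f\n", n / b
  }'
}

# Example (hypothetical file names, one log per variant):
#   busbw_gain nccl-prod-2x2gpu-socket.log nccl-prod-2x2gpu-rdma.log
```

A value well above 1 indicates that the RDMA path is actually in use; a value near 1 suggests NCCL is still on sockets.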
Key Takeaways
- MPIJob (kubeflow.org/v2beta1): standard way to run multi-node NCCL tests on Kubernetes
- 1x8 H200 NVL: ~68 GB/s busbw = healthy NVLink (near theoretical max)
- 2x2 socket fallback: ~13-35 GB/s = works but suboptimal (no RDMA)
- 2x2 with RDMA: ~48-50 GB/s expected with GDRDMA + IB plugin
- NCCL_NET_PLUGIN=none deliberately disables RDMA: useful for socket baseline testing
- Launcher pod has no GPUs and no RDMA (expected): only workers need resources
- Run:ai tracks GPU allocation and node placement via annotations
- Missing /dev/infiniband = need rdma/rdma_shared_device_a resource in pod spec
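The busbw figures quoted throughout can be cross-checked from the raw size and time columns of the log. nccl-tests defines all-reduce bus bandwidth as algbw × 2(n−1)/n, where algbw is bytes divided by time and n is the number of ranks. The sketch below applies that formula with awk; the function name `allreduce_busbw` is hypothetical. Note that the multi-node row only reproduces with n = 16, which matches the "16 ranks in full test" remark in the results section rather than the 4-rank 2×2 job.

```shell
#!/usr/bin/env bash
# All-reduce bus bandwidth as defined by nccl-tests:
#   algbw = bytes / time
#   busbw = algbw * 2 * (n - 1) / n
# Arguments: message size (bytes), time (microseconds), number of ranks.
allreduce_busbw() {
  awk -v bytes="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = bytes / us / 1000          # bytes per microsecond -> GB/s (1e9 bytes)
    printf "%.2f\n", algbw * 2 * (n - 1) / n
  }'
}

allreduce_busbw 34359738368 875713 8    # single-node row → 68.66
allreduce_busbw 8589934592 458955 16    # multi-node row  → 35.09
```

If a computed value disagrees with the busbw column, the rank count passed to the function is the first thing to check.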

