OpenMPI Control Plane Separation for NCCL RDMA
Configure OpenMPI to use eth0 for MPI control traffic while NCCL uses net1 SR-IOV for data. Covers btl_tcp_if_include, pml, routed direct, and plm_rsh_agent SSH.
💡 Quick Answer: In a Kubeflow MPIJob with SR-IOV RDMA, you must separate control and data planes: OpenMPI uses eth0 (pod network) for SSH, process management, and barrier synchronization via --mca btl_tcp_if_include eth0, while NCCL uses net1 (SR-IOV VF) for GPU collective operations via NCCL_SOCKET_IFNAME=net1. This prevents MPI from attempting to route control traffic over the RDMA interface, which lacks pod DNS resolution.
The Problem
- MPI control traffic (SSH, process spawn, barriers) needs pod-to-pod DNS resolution
- SR-IOV net1 interface only provides L2/L3 connectivity for RDMA, with no Kubernetes DNS
- If OpenMPI tries to use net1 for control traffic, SSH connections and DNS lookups fail
- MPI collective libraries (UCC, HCOLL) can conflict with NCCL's own collective implementation
- Need clean separation: MPI manages processes, NCCL manages GPU data
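The separation described above can be checked before launching anything. The following is a minimal sketch, assuming the two interface settings are passed as the environment variables OMPI_MCA_btl_tcp_if_include and NCCL_SOCKET_IFNAME; the function name check_plane_separation is hypothetical and not part of OpenMPI, NCCL, or the MPI Operator.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (illustration only).
# Reports whether the MPI control plane and the NCCL data plane
# are pinned to two different network interfaces.
check_plane_separation() {
  local control_if="${OMPI_MCA_btl_tcp_if_include:-}"
  local data_if="${NCCL_SOCKET_IFNAME:-}"

  if [ -z "${control_if}" ]; then
    echo "FAIL: OMPI_MCA_btl_tcp_if_include is unset"
    return 1
  fi
  if [ -z "${data_if}" ]; then
    echo "FAIL: NCCL_SOCKET_IFNAME is unset"
    return 1
  fi
  if [ "${control_if}" = "${data_if}" ]; then
    echo "FAIL: control and data planes both use ${control_if}"
    return 1
  fi
  echo "OK: control=${control_if} data=${data_if}"
}
```

Calling check_plane_separation at the top of a launcher script and aborting on a non-zero return catches a missing or duplicated interface setting early. It only inspects variables; it does not verify that the interfaces exist in the pod.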
The Solution
Complete MPI Environment Configuration
env:
  # === OpenMPI Control Plane (process management) ===
  # Force TCP byte transfer layer on eth0 only
  - name: OMPI_MCA_btl
    value: "self,tcp"            # self (loopback) + tcp (no openib)
  - name: OMPI_MCA_btl_tcp_if_include
    value: "eth0"                # Pod network interface
  # Point-to-point messaging layer
  - name: OMPI_MCA_pml
    value: "ob1"                 # Use ob1 (not ucx) for simplicity
  # SSH agent for launching remote processes
  - name: OMPI_MCA_plm_rsh_agent
    value: >-
      ssh -o StrictHostKeyChecking=no
      -o UserKnownHostsFile=/dev/null
      -o GlobalKnownHostsFile=/dev/null
  # Direct routing (no tree/ring for process management)
  # Set via mpirun --mca routed direct
  # Abort timeout for hung processes
  - name: OMPI_MCA_orte_abort_timeout
    value: "60"
  # === Disable MPI Collectives (use NCCL instead) ===
  # Disable the UCC (Unified Collective Communication) collective component
  - name: OMPI_MCA_coll_ucc_enable
    value: "0"
  # Disable Mellanox HCOLL (hardware collectives)
  - name: OMPI_MCA_coll_hcoll_enable
    value: "0"
  # === NCCL Data Plane (GPU collectives) ===
  # NCCL bootstrap and socket operations on SR-IOV interface
  - name: NCCL_SOCKET_IFNAME
    value: "net1"
  # Allow running as root in containers
  - name: OMPI_ALLOW_RUN_AS_ROOT
    value: "1"
  - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
    value: "1"

The mpirun Command
# Comments are kept above the command: a comment after a trailing
# backslash would break the line continuation.
#   --bind-to none                  Don't bind to cores (GPU workload)
#   --map-by slot                   One rank per slot (GPU)
#   --mca routed direct             Direct process routing
#   --mca btl "self,tcp"            TCP only (no openib BTL)
#   --mca pml ob1                   ob1 PML (not UCX)
#   --mca btl_tcp_if_include eth0   Control on pod network
#   -x VAR                          Forward NCCL vars to workers
mpirun \
  -np 4 \
  --hostfile /etc/mpi/hostfile \
  --bind-to none \
  --map-by slot \
  --mca routed direct \
  --mca btl "self,tcp" \
  --mca pml ob1 \
  --mca btl_tcp_if_include eth0 \
  -x NCCL_IB_HCA \
  -x NCCL_IB_GID_INDEX \
  -x NCCL_IB_DISABLE \
  -x NCCL_SOCKET_IFNAME \
  -x NCCL_NET_GDR_LEVEL \
  -x NCCL_DMABUF_ENABLE \
  -x NCCL_COLLNET_ENABLE \
  -x NCCL_DEBUG \
  -x NCCL_DEBUG_SUBSYS \
  -x NCCL_IB_QPS_PER_CONNECTION \
  -x NCCL_IB_SPLIT_DATA_ON_QPS \
  -x NCCL_SHM_DISABLE \
  -x NCCL_NET_PLUGIN \
  -x OMPI_MCA_coll_ucc_enable \
  -x OMPI_MCA_coll_hcoll_enable \
  -x LD_LIBRARY_PATH \
  -x PATH \
  /opt/nccl-tests/build/all_reduce_perf \
  -b 1G -e 16G -f 2 -g 1

Traffic Flow Diagram
┌─────────────────────────────────────────────────────────────────┐
│                       Kubernetes Cluster                        │
│                                                                 │
│  ┌──────────────┐      eth0 (pod network)        ┌──────────┐   │
│  │   Launcher   │◄───── SSH + MPI control ──────►│ Worker-0 │   │
│  │              │◄───── SSH + MPI control ──────►│ Worker-1 │   │
│  └──────────────┘      (DNS resolvable)          └──────────┘   │
│                                                                 │
│  ┌──────────────┐      net1 (SR-IOV VF)          ┌──────────┐   │
│  │   Worker-0   │◄───── NCCL RDMA data ─────────►│ Worker-1 │   │
│  │   GPU 0,1    │      (L2/L3 only)              │ GPU 2,3  │   │
│  └──────────────┘      (no DNS needed)           └──────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

eth0: 10.128.x.x (Kubernetes pod CIDR, full DNS, Service discovery)
net1: 192.168.x.x (SR-IOV subnet, RDMA-capable, no K8s services)

Why Disable UCC and HCOLL
Component  │ What it does              │ Why disable
───────────┼───────────────────────────┼──────────────────────────────
coll_ucc   │ UCC-based MPI collectives │ Conflicts with NCCL allreduce
coll_hcoll │ Mellanox HW collectives   │ Conflicts with NCCL allreduce
───────────┴───────────────────────────┴──────────────────────────────

NCCL handles all GPU collective operations (allreduce, allgather, etc.).
MPI is only used for process management (spawn, barrier, finalize).
Enabling MPI collectives creates confusion about which library handles
the actual GPU data movement, so always let NCCL own the data path.

DNS Resolution for MPI Hostfile
# Kubeflow MPI Operator creates a headless Service and generates hostfile:
# /etc/mpi/hostfile contains:
#   nccl-roce-validation-worker-0.nccl-roce-validation.gpu-benchmark.svc slots=2
#   nccl-roce-validation-worker-1.nccl-roce-validation.gpu-benchmark.svc slots=2
# These DNS names resolve via eth0 (pod network)
# If using net1, getent hosts would fail and MPI could not SSH to workers
# The validate_network.sh script waits for DNS:
wait_for_mpi_dns() {
  while read -r host rest; do
    if getent hosts "${host}" >/dev/null 2>&1; then
      echo "DNS OK: ${host}"
    else
      echo "DNS WAIT: ${host} not resolvable yet"
    fi
  # Fall back to the operator-generated path when MPI_HOSTFILE is not set
  done < "${MPI_HOSTFILE:-/etc/mpi/hostfile}"
}

FQDN Rewriting for Stubborn DNS
# Some clusters need .svc.cluster.local suffix for resolution
# Enable with: REWRITE_MPI_HOSTFILE_FQDN=true
# Converts:
# worker-0.svc slots=2
# To:
# worker-0.svc.cluster.local slots=2
sed 's/\.svc /.svc.cluster.local /g' /etc/mpi/hostfile > /tmp/mpi-hostfile
export MPI_HOSTFILE="/tmp/mpi-hostfile"

Common Issues
"No route to host" on mpirun
- Cause: OpenMPI trying to use net1 for SSH
- Fix: Ensure OMPI_MCA_btl_tcp_if_include=eth0 is set on the launcher AND forwarded to workers
NCCL hangs after "Connected to proxy"
- Cause: NCCL trying to bootstrap on eth0 instead of net1
- Fix: Set NCCL_SOCKET_IFNAME=net1, which tells NCCL where to establish connections
MPI barrier timeout
- Cause: Firewall or NetworkPolicy blocking eth0 TCP between pods
- Fix: Ensure no NetworkPolicy restricts inter-pod traffic on port ranges used by MPI
Workers not reachable via SSH
- Cause: SSHD not running on workers, or hostfile DNS not resolved
- Fix: Workers must run in shell mode with START_SSHD=true
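The SSH issue above lists unresolved hostfile DNS as one cause. A retrying check along the following lines can help tell a DNS delay apart from a missing SSHD. This is a sketch under assumptions: the names hostfile_hosts, wait_for_hosts, and RESOLVER are hypothetical, and only getent hosts and the "hostname slots=N" hostfile layout come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (illustration only).

# Print the hostname column of an MPI hostfile, skipping blank lines and comments.
hostfile_hosts() {
  awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Retry until every host in the hostfile resolves, or give up.
# Usage: wait_for_hosts HOSTFILE [ATTEMPTS] [DELAY_SECONDS]
# RESOLVER (assumed variable) overrides the lookup command; default: getent hosts
wait_for_hosts() {
  local hostfile="$1" attempts="${2:-30}" delay="${3:-2}"
  local resolver="${RESOLVER:-getent hosts}"
  local try=1 pending host

  while [ "${try}" -le "${attempts}" ]; do
    pending=0
    for host in $(hostfile_hosts "${hostfile}"); do
      # Unquoted on purpose so "getent hosts" splits into command and argument.
      if ! ${resolver} "${host}" >/dev/null 2>&1; then
        echo "DNS WAIT: ${host} not resolvable yet (attempt ${try}/${attempts})"
        pending=1
      fi
    done
    if [ "${pending}" -eq 0 ]; then
      echo "DNS OK: all hosts resolvable"
      return 0
    fi
    if [ "${try}" -lt "${attempts}" ]; then
      sleep "${delay}"
    fi
    try=$((try + 1))
  done

  echo "DNS FAIL: gave up after ${attempts} attempts"
  return 1
}
```

A launcher could run wait_for_hosts /etc/mpi/hostfile 30 2 before mpirun and exit on failure. If the names resolve but SSH still fails, the remaining cause from the list above is SSHD not running on the workers.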
Best Practices
- Always set btl_tcp_if_include=eth0: never let MPI auto-detect interfaces
- Use pml=ob1, not UCX: simpler, no interference with NCCL's UCX usage
- Disable ALL MPI collectives: NCCL owns GPU data movement exclusively
- Forward all NCCL vars via -x: workers inherit from launcher environment
- Set routed direct: flat topology, no MPI routing overhead
- Use --bind-to none: GPU workloads manage their own affinity
- SSH with no host checking: pods are ephemeral, so strict checking rejects their previously unseen host keys
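Forwarding every NCCL variable by hand is easy to get wrong when a new setting is added. One possible approach, sketched here with a hypothetical helper named nccl_forward_flags, is to derive the -x arguments from the launcher environment instead of listing them one by one.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only).
# Reads an `env` dump on stdin and prints one "-x NAME" line for every
# variable whose name starts with NCCL_, sorted and de-duplicated.
nccl_forward_flags() {
  sed -n 's/^\(NCCL_[A-Za-z0-9_]*\)=.*/-x \1/p' | LC_ALL=C sort -u
}
```

Used as mpirun $(env | nccl_forward_flags) followed by the remaining arguments, the unquoted command substitution splits each line into the -x flag and the variable name. Variables outside the NCCL_ prefix, such as LD_LIBRARY_PATH and PATH, still need their own -x entries.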
Key Takeaways
- Two separate networks: eth0 (MPI control) and net1 (NCCL data)
- OpenMPI only manages processes: SSH, spawn, barriers, finalize
- NCCL exclusively handles GPU collective operations over RDMA
- Disable UCC + HCOLL to prevent MPI from touching GPU data
- NCCL_SOCKET_IFNAME=net1 is mandatory for NCCL to find the SR-IOV interface
- DNS resolution only works on eth0: the MPI hostfile relies on the pod network
- Forward all NCCL environment variables from launcher to workers via mpirun -x
