MPI DNS Resolution and Hostfile for Kubernetes GPU Jobs
Troubleshoot MPI hostfile DNS resolution in Kubeflow MPIJob on Kubernetes. Covers headless Service creation, subdomain configuration, DNS wait loops, FQDN
Quick Answer: MPI launcher pods must resolve worker hostnames before starting training. The Kubeflow MPI Operator creates a headless Service matching the subdomain field in worker pods. DNS propagation typically takes 5-30 seconds after pods reach the Running state. Implement a DNS wait loop with a configurable timeout (default 120s) and interval (default 3s) to handle this race condition gracefully.
The Problem
- MPI launcher starts before worker pod DNS records propagate
- Headless Service creation is delayed, so CoreDNS caches a stale NXDOMAIN
- Some clusters need the .svc.cluster.local FQDN suffix for resolution
- Hostfile entries generated by MPI Operator may not match DNS names
- Failed DNS → SSH failure → mpirun crash → entire job fails
The Solution
How MPI Operator Creates DNS
MPIJob "nccl-validation" in namespace "gpu-benchmark"
├── Creates headless Service: "nccl-validation"
├── Worker pods with subdomain: "nccl-validation"
└── Hostfile entries:
    nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc slots=2
    nccl-validation-worker-1.nccl-validation.gpu-benchmark.svc slots=2

DNS resolution:
nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc
└── headless Service: nccl-validation (namespace: gpu-benchmark)
    └── endpoint: pod IP of worker-0

DNS Wait Loop Implementation
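Before waiting on DNS, it can be useful to confirm that the hostfile actually contains the names that the naming scheme above implies. The following is a minimal sketch, not part of MPI Operator: the helper names expected_worker_hosts and unexpected_hostfile_entries are hypothetical, and the pattern <job>-worker-<i>.<job>.<namespace>.svc is taken from the example above.

```shell
# Hypothetical helper: print the worker hostnames implied by the
# <job>-worker-<i>.<job>.<namespace>.svc pattern, one per line.
expected_worker_hosts() {
  local job="$1" namespace="$2" replicas="$3"
  local i=0
  while [ "${i}" -lt "${replicas}" ]; do
    printf '%s-worker-%d.%s.%s.svc\n' "${job}" "${i}" "${job}" "${namespace}"
    i=$((i + 1))
  done
}

# Hypothetical helper: print hostfile entries that are NOT in the expected set.
# Empty output means every entry follows the pattern.
unexpected_hostfile_entries() {
  local hostfile="$1" job="$2" namespace="$3" replicas="$4"
  local host rest
  while read -r host rest || [ -n "${host}" ]; do
    # Skip blank lines and comments.
    case "${host}" in ''|'#'*) continue ;; esac
    if ! expected_worker_hosts "${job}" "${namespace}" "${replicas}" | grep -qxF -- "${host}"; then
      printf '%s\n' "${host}"
    fi
  done < "${hostfile}"
}
```

For the example job, unexpected_hostfile_entries /etc/mpi/hostfile nccl-validation gpu-benchmark 2 would print nothing when the hostfile matches. Any name it does print is a candidate for the "hostfile entries may not match DNS names" problem.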
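#!/bin/bash
# Settings can be overridden through environment variables.
MPI_DNS_WAIT_SECONDS="${MPI_DNS_WAIT_SECONDS:-120}"
MPI_DNS_WAIT_INTERVAL="${MPI_DNS_WAIT_INTERVAL:-3}"
MPI_HOSTFILE="${MPI_HOSTFILE:-/etc/mpi/hostfile}"

# Block until every host in the hostfile resolves; return 1 on timeout.
wait_for_mpi_dns() {
  local hostfile="$1"
  local elapsed=0
  local failed host rest ip

  # An unreadable hostfile would otherwise look like "nothing failed".
  if [[ ! -r "${hostfile}" ]]; then
    echo "ERROR: Hostfile not readable: ${hostfile}" >&2
    return 1
  fi

  echo "Waiting for MPI worker DNS records to resolve..."
  echo "Hostfile: ${hostfile}"
  echo "Timeout: ${MPI_DNS_WAIT_SECONDS}s"

  while true; do
    failed=0
    # The "|| [[ -n ... ]]" clause keeps a last line that lacks a trailing newline.
    while read -r host rest || [[ -n "${host}" ]]; do
      # Skip blank lines and comments.
      [[ -z "${host}" || "${host}" =~ ^# ]] && continue
      # Resolve once and reuse the first address returned.
      ip=$(getent hosts "${host}" 2>/dev/null | awk 'NR == 1 {print $1}')
      if [[ -n "${ip}" ]]; then
        echo "DNS OK: ${host} -> ${ip}"
      else
        echo "DNS WAIT: ${host} not resolvable yet"
        failed=1
      fi
    done < "${hostfile}"

    if [[ "${failed}" -eq 0 ]]; then
      echo "All MPI worker DNS records resolved."
      return 0
    fi

    if [[ "${elapsed}" -ge "${MPI_DNS_WAIT_SECONDS}" ]]; then
      # Dump resolver state to stderr so the failure can be diagnosed from logs.
      {
        echo "ERROR: Timed out waiting for MPI worker DNS."
        echo ""
        echo "Debug info:"
        echo "--- Hostfile ---"
        cat "${hostfile}"
        echo "--- resolv.conf ---"
        cat /etc/resolv.conf
        echo "--- /etc/hosts ---"
        cat /etc/hosts
      } >&2
      return 1
    fi

    sleep "${MPI_DNS_WAIT_INTERVAL}"
    elapsed=$((elapsed + MPI_DNS_WAIT_INTERVAL))
  done
}

# Abort the launcher before mpirun if the workers never became resolvable.
wait_for_mpi_dns "${MPI_HOSTFILE}" || exit 1

FQDN Rewriting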
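# Some clusters only resolve names that carry the full .svc.cluster.local suffix.
# MPI Operator generates: <worker>.<service>.<namespace>.svc (short form)
# Cluster DNS expects:    <worker>.<service>.<namespace>.svc.cluster.local
# Run this step before the DNS wait loop so the loop checks the rewritten names.
REWRITE_MPI_HOSTFILE_FQDN="${REWRITE_MPI_HOSTFILE_FQDN:-false}"
if [[ "${REWRITE_MPI_HOSTFILE_FQDN}" == "true" ]]; then
  # Work on a copy in /tmp and keep the original for debugging.
  cp "${MPI_HOSTFILE}" /tmp/mpi-hostfile.original
  # Append .cluster.local to names ending in .svc (before a space or at end of line).
  sed 's/\.svc /.svc.cluster.local /g; s/\.svc$/.svc.cluster.local/' \
    /tmp/mpi-hostfile.original > /tmp/mpi-hostfile
  export MPI_HOSTFILE="/tmp/mpi-hostfile"
  echo "Rewritten hostfile:"
  cat "${MPI_HOSTFILE}"
fi

Worker Pod Spec Requirements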
# subdomain MUST match the headless Service name
Worker:
  template:
    spec:
      subdomain: nccl-validation  # <- This creates DNS records
# The headless Service is auto-created by MPI Operator with:
#   metadata:
#     name: nccl-validation       # Matches subdomain
#   spec:
#     clusterIP: None             # Headless
#     selector:
#       app: nccl-validation
#       mpi-role: worker

Environment Variables
env:
  # DNS wait before mpirun (gives CoreDNS time to propagate)
  - name: MPI_DNS_WAIT_SECONDS
    value: "120"  # Max wait time
  - name: MPI_DNS_WAIT_INTERVAL
    value: "3"    # Check every 3 seconds
  # Enable FQDN rewriting if short names don't resolve
  - name: REWRITE_MPI_HOSTFILE_FQDN
    value: "false"  # Set "true" if needed
  # Hostfile location (MPI Operator default)
  - name: MPI_HOSTFILE
    value: "/etc/mpi/hostfile"

Debugging DNS Failures
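One thing worth ruling out before the DNS checks in this section is a malformed timeout or interval, since a value such as "3s" would break both the sleep call and the elapsed-time arithmetic in the wait loop. A minimal sketch of a pre-flight check follows. The helper names is_positive_int and validate_dns_wait_settings are hypothetical; the variable names and defaults are the ones used in this article.

```shell
# Hypothetical helper: succeed only if the argument is a positive integer.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$1" -gt 0 ]
}

# Hypothetical helper: validate the two wait-loop settings, using the
# article's defaults (120 and 3) when a variable is unset.
validate_dns_wait_settings() {
  local seconds="${MPI_DNS_WAIT_SECONDS:-120}"
  local interval="${MPI_DNS_WAIT_INTERVAL:-3}"
  if ! is_positive_int "${seconds}"; then
    echo "Invalid MPI_DNS_WAIT_SECONDS: ${seconds}" >&2
    return 1
  fi
  if ! is_positive_int "${interval}"; then
    echo "Invalid MPI_DNS_WAIT_INTERVAL: ${interval}" >&2
    return 1
  fi
}
```

Calling validate_dns_wait_settings || exit 1 at the top of the launcher script would turn a typo in the pod spec into an immediate, readable error instead of a confusing wait-loop failure.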
# Inside launcher pod:
# Check hostfile content
cat /etc/mpi/hostfile
# Manual DNS resolution
getent hosts nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc
# Check resolver config
cat /etc/resolv.conf
# Should show:
# nameserver 172.30.0.10
# search gpu-benchmark.svc.cluster.local svc.cluster.local cluster.local
# Verify headless Service exists
kubectl get svc -n gpu-benchmark | grep nccl-validation
# NAME              TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
# nccl-validation   ClusterIP   None         <none>        <none>    30s
# Check endpoints (workers registered?)
kubectl get endpoints nccl-validation -n gpu-benchmark
# NAME              ENDPOINTS                       AGE
# nccl-validation   10.128.4.15:22,10.128.5.23:22   30s
# CoreDNS logs for NXDOMAIN debugging (OpenShift shown here;
# on upstream Kubernetes: kubectl logs -n kube-system -l k8s-app=kube-dns)
kubectl logs -n openshift-dns -l dns.operator.openshift.io/daemonset-dns

Common Issues
DNS WAIT loop runs until the timeout (120s)
- Cause: Workers are not in the Running state yet, or the headless Service has not been created
- Fix: Check worker pod status; verify that MPI Operator created the Service
"NXDOMAIN" even after workers are Running
- Cause: CoreDNS caches negative responses for up to 30s by default
- Fix: Wait longer (increase MPI_DNS_WAIT_SECONDS) or restart the CoreDNS pods
Resolves on retry but the first attempt always fails
- Cause: Normal race condition; DNS propagation takes 5-30s
- Fix: The wait loop handles this. Set the interval to 3-5s for responsiveness.
Short names resolve but the FQDN doesn't (or vice versa)
- Cause: The resolv.conf search domains do not match the hostfile format
- Fix: Set REWRITE_MPI_HOSTFILE_FQDN=true to append .cluster.local
Workers resolve to the wrong IP
- Cause: Stale endpoint from a previous job run with the same name
- Fix: Delete the old MPIJob fully before redeploying; use unique names
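For the short-name versus FQDN mismatch, it can help to see which names the resolver would actually try. The sketch below reads the search line from a resolv.conf-style file and lists the candidates for a hostfile entry. It assumes the name has fewer dots than the resolver's ndots setting (pods using cluster DNS normally get ndots:5), so each search domain is appended before the bare name is tried. The helper names search_domains and candidate_names are hypothetical.

```shell
# Hypothetical helper: print the search domains from a resolv.conf-style file,
# one per line. If several "search" lines exist, the last one wins.
search_domains() {
  local resolv_conf="${1:-/etc/resolv.conf}"
  awk '$1 == "search" { out = ""; for (i = 2; i <= NF; i++) out = out $i "\n" }
       END { printf "%s", out }' "${resolv_conf}"
}

# Hypothetical helper: list the names tried for a host with fewer dots than
# ndots: the host with each search domain appended, then the bare host.
candidate_names() {
  local host="$1" resolv_conf="${2:-/etc/resolv.conf}"
  local domain
  search_domains "${resolv_conf}" | while read -r domain; do
    printf '%s.%s\n' "${host}" "${domain}"
  done
  printf '%s\n' "${host}"
}
```

With the resolv.conf shown in the debugging section, candidate_names nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc includes the .svc.cluster.local form, because cluster.local is in the search list. If that search domain is missing, none of the candidates is the full cluster name, and enabling REWRITE_MPI_HOSTFILE_FQDN is the likely fix.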
Best Practices
- Always implement a DNS wait loop: never assume instant resolution
- A 120s timeout is generous: most names resolve within 30s
- A 3s interval balances responsiveness with DNS server load
- Print debug info on timeout: hostfile, resolv.conf, /etc/hosts
- Use cleanPodPolicy: None to prevent premature Service deletion
- Use unique MPIJob names per run to avoid stale DNS cache hits
- Check endpoints, not just the Service: an existing Service does not guarantee populated endpoints
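The endpoints check can be made mechanical by comparing the number of endpoint IPs with the number of hostfile entries. This is a minimal sketch under stated assumptions: the helper names count_hostfile_workers and endpoints_match_hostfile are hypothetical, and the IP list is passed in as a space-separated string so the logic can be exercised without a cluster.

```shell
# Hypothetical helper: count non-empty, non-comment entries in a hostfile.
count_hostfile_workers() {
  local hostfile="$1"
  awk 'NF > 0 && $1 !~ /^#/ { n++ } END { print n + 0 }' "${hostfile}"
}

# Hypothetical helper: succeed only when the number of endpoint IPs equals
# the number of hostfile entries.
endpoints_match_hostfile() {
  local hostfile="$1" endpoint_ips="$2"
  local expected actual
  expected="$(count_hostfile_workers "${hostfile}")"
  # One whitespace-separated word per IP.
  actual="$(printf '%s' "${endpoint_ips}" | wc -w | tr -d ' ')"
  [ "${actual}" -eq "${expected}" ]
}

# Example use from a machine with kubectl access and a copy of the hostfile:
#   ips="$(kubectl get endpoints nccl-validation -n gpu-benchmark \
#            -o jsonpath='{.subsets[*].addresses[*].ip}')"
#   endpoints_match_hostfile ./hostfile "${ips}" || echo "Endpoints not fully populated"
```

A mismatch here points at workers that are not Ready yet or at a selector problem, which is a different failure from a missing Service.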
Key Takeaways
- MPI Operator creates a headless Service matching the worker pod subdomain
- DNS propagation takes 5-30s after pods reach the Running state
- Implement a wait loop with a timeout (120s) and debug output on failure
- FQDN rewriting handles clusters that need the .svc.cluster.local suffix
- getent hosts is more reliable than nslookup for testing resolution
- CoreDNS negative caching (30s) causes delays after a failed first lookup
- Worker pods must have the subdomain field set for DNS to work at all

