MPI DNS Resolution and Hostfile for Kubernetes GPU Jobs

💡 Quick Answer: MPI launcher pods must resolve worker hostnames before training starts. The Kubeflow MPI Operator creates a headless Service whose name matches the subdomain field of the worker pods. DNS records typically take 5-30 seconds to propagate after the pods reach the Running state, so add a DNS wait loop with a configurable timeout (default 120s) and polling interval (default 3s) to handle this race condition gracefully.

The Problem

The MPI launcher starts before worker pod DNS records have propagated
The headless Service is created with a delay, and CoreDNS can cache the NXDOMAIN answers returned in the meantime
Some clusters only resolve names that carry the full .svc.cluster.local suffix
Hostfile entries generated by the MPI Operator may not match the DNS names
Failed DNS lookup → SSH failure → mpirun crash → entire job fails

The Solution

How MPI Operator Creates DNS

MPIJob "nccl-validation" in namespace "gpu-benchmark"
  ├── Creates headless Service: "nccl-validation"
  ├── Worker pods with subdomain: "nccl-validation"
  └── Hostfile entries:
      nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc slots=2
      nccl-validation-worker-1.nccl-validation.gpu-benchmark.svc slots=2

DNS resolution:
  nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc
  └── headless Service: nccl-validation (namespace: gpu-benchmark)
  └── endpoint: pod IP of worker-0
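
The naming pattern above can be reproduced with a small helper, which is handy when checking that hostfile entries match what DNS will serve. This is a minimal sketch: the function name `mpi_worker_fqdn` and its optional suffix argument are illustrative and not part of the MPI Operator.

```shell
#!/bin/bash
# Hypothetical helper: build the DNS name of worker N for an MPIJob,
# following the <job>-worker-<N>.<job>.<namespace>.svc pattern shown above.
mpi_worker_fqdn() {
  local job="$1" namespace="$2" index="$3"
  local suffix="${4:-svc}"   # pass "svc.cluster.local" for the full FQDN
  printf '%s-worker-%s.%s.%s.%s\n' \
    "${job}" "${index}" "${job}" "${namespace}" "${suffix}"
}

mpi_worker_fqdn nccl-validation gpu-benchmark 0
mpi_worker_fqdn nccl-validation gpu-benchmark 1 svc.cluster.local
```

The first call prints the short `.svc` form used in the hostfile above; the second prints the same pattern with the `.svc.cluster.local` suffix.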

DNS Wait Loop Implementation

#!/bin/bash
MPI_DNS_WAIT_SECONDS="${MPI_DNS_WAIT_SECONDS:-120}"
MPI_DNS_WAIT_INTERVAL="${MPI_DNS_WAIT_INTERVAL:-3}"
MPI_HOSTFILE="${MPI_HOSTFILE:-/etc/mpi/hostfile}"

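wait_for_mpi_dns() {
  local hostfile="$1"
  local elapsed=0
  local failed host rest ip

  # Without this check a missing hostfile would be reported as "all resolved".
  if [[ ! -r "${hostfile}" ]]; then
    echo "ERROR: Hostfile ${hostfile} is missing or unreadable." >&2
    return 1
  fi

  echo "Waiting for MPI worker DNS records to resolve..."
  echo "Hostfile: ${hostfile}"
  echo "Timeout: ${MPI_DNS_WAIT_SECONDS}s"

  while true; do
    failed=0

    # The "|| [[ -n ... ]]" clause keeps a last line that has no trailing newline.
    while read -r host rest || [[ -n "${host}" ]]; do
      # Skip blank lines and comments.
      [[ -z "${host}" || "${host}" =~ ^# ]] && continue

      # Resolve once and reuse the result; empty output means no record yet.
      ip=$(getent hosts "${host}" 2>/dev/null | awk 'NR == 1 {print $1}')
      if [[ -n "${ip}" ]]; then
        echo "DNS OK: ${host} -> ${ip}"
      else
        echo "DNS WAIT: ${host} not resolvable yet"
        failed=1
      fi
    done < "${hostfile}"

    if [[ "${failed}" -eq 0 ]]; then
      echo "All MPI worker DNS records resolved."
      return 0
    fi

    if [[ "${elapsed}" -ge "${MPI_DNS_WAIT_SECONDS}" ]]; then
      # Send diagnostics to stderr.
      {
        echo "ERROR: Timed out waiting for MPI worker DNS."
        echo ""
        echo "Debug info:"
        echo "--- Hostfile ---"
        cat "${hostfile}"
        echo "--- resolv.conf ---"
        cat /etc/resolv.conf
        echo "--- /etc/hosts ---"
        cat /etc/hosts
      } >&2
      return 1
    fi

    sleep "${MPI_DNS_WAIT_INTERVAL}"
    elapsed=$((elapsed + MPI_DNS_WAIT_INTERVAL))
  done
}

# Abort the launcher if the workers never become resolvable.
wait_for_mpi_dns "${MPI_HOSTFILE}" || exit 1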
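
When the loop keeps reporting DNS WAIT, it can help to list only the entries that are still unresolved in a single pass. The sketch below assumes the same hostfile format as above; the function name `list_unresolved_hosts` and the `RESOLVE_CMD` override are hypothetical and exist only so the lookup command can be swapped out (for example in a test).

```shell
#!/bin/bash
# Hypothetical helper: print every hostfile entry that does not resolve yet.
# RESOLVE_CMD is an illustrative override; it defaults to "getent hosts".
list_unresolved_hosts() {
  local hostfile="$1"
  local host rest
  local -a resolver
  read -r -a resolver <<< "${RESOLVE_CMD:-getent hosts}"

  # "|| [[ -n ... ]]" keeps a final line that lacks a trailing newline.
  while read -r host rest || [[ -n "${host}" ]]; do
    # Skip blank lines and comments.
    [[ -z "${host}" || "${host}" == \#* ]] && continue
    # Detach stdin so the lookup command cannot consume hostfile lines.
    "${resolver[@]}" "${host}" </dev/null >/dev/null 2>&1 || echo "${host}"
  done < "${hostfile}"
}

# Example (inside the launcher pod):
#   list_unresolved_hosts /etc/mpi/hostfile
```

An empty result means every entry resolved; anything printed is a name worth checking by hand.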

FQDN Rewriting

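# Some clusters only resolve names that carry the .svc.cluster.local suffix.
# MPI Operator generates: <pod>.<service>.<namespace>.svc (short form)
# Cluster DNS expects:    <pod>.<service>.<namespace>.svc.cluster.local
# Run this block before wait_for_mpi_dns so the loop checks the rewritten names.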

REWRITE_MPI_HOSTFILE_FQDN="${REWRITE_MPI_HOSTFILE_FQDN:-false}"

if [[ "${REWRITE_MPI_HOSTFILE_FQDN}" == "true" ]]; then
  cp "${MPI_HOSTFILE}" /tmp/mpi-hostfile.original
  sed 's/\.svc /.svc.cluster.local /g; s/\.svc$/.svc.cluster.local/' \
    /tmp/mpi-hostfile.original > /tmp/mpi-hostfile
  export MPI_HOSTFILE="/tmp/mpi-hostfile"
  echo "Rewritten hostfile:"
  cat "${MPI_HOSTFILE}"
fi
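
The rewrite can also be wrapped in a function that tolerates tab-separated fields and leaves names that already end in .svc.cluster.local untouched. This is a sketch under those assumptions, not the operator's behavior; the name `rewrite_hostfile_fqdn` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: append .cluster.local to hostfile names ending in .svc.
# Handles space- or tab-separated fields; full FQDNs are left unchanged
# because ".svc" followed by "." does not match the pattern.
rewrite_hostfile_fqdn() {
  local src="$1" dst="$2"
  sed -E 's/\.svc([[:space:]]|$)/.svc.cluster.local\1/' "${src}" > "${dst}"
}

# Example:
#   rewrite_hostfile_fqdn /etc/mpi/hostfile /tmp/mpi-hostfile
```

Because the function is idempotent, running it on a hostfile that was already rewritten produces the same output.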

Worker Pod Spec Requirements

# subdomain MUST match the headless Service name
# (Worker sits under spec.mpiReplicaSpecs in the MPIJob)
Worker:
  template:
    spec:
      subdomain: nccl-validation    # ← This creates DNS records

      # The headless Service is auto-created by MPI Operator, roughly as:
      # metadata:
      #   name: nccl-validation     # Matches subdomain
      # spec:
      #   clusterIP: None           # Headless
      #   selector:                 # Illustrative; actual labels depend on the operator version
      #     app: nccl-validation
      #     mpi-role: worker

Environment Variables

env:
  # DNS wait before mpirun (gives CoreDNS time to propagate)
  - name: MPI_DNS_WAIT_SECONDS
    value: "120"          # Max wait time

  - name: MPI_DNS_WAIT_INTERVAL
    value: "3"            # Check every 3 seconds

  # Enable FQDN rewriting if short names don't resolve
  - name: REWRITE_MPI_HOSTFILE_FQDN
    value: "false"        # Set "true" if needed

  # Hostfile location (MPI Operator default)
  - name: MPI_HOSTFILE
    value: "/etc/mpi/hostfile"
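
Since these values arrive as strings, a launcher script may want to validate them before using them in arithmetic. The sketch below is one way to do that and to work out how many polls a given timeout allows; `is_positive_int` and `dns_wait_attempts` are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the value is a positive integer
# without leading zeros (which bash arithmetic would read as octal).
is_positive_int() {
  [[ "$1" =~ ^[1-9][0-9]*$ ]]
}

# Hypothetical helper: number of polls needed to cover the timeout, rounded up.
dns_wait_attempts() {
  local timeout="$1" interval="$2"
  is_positive_int "${timeout}" && is_positive_int "${interval}" || return 1
  echo $(( (timeout + interval - 1) / interval ))
}

dns_wait_attempts 120 3   # prints 40
```

With the defaults above (120s timeout, 3s interval), the launcher polls 40 times before giving up.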

Debugging DNS Failures

# Inside launcher pod:

# Check hostfile content
cat /etc/mpi/hostfile

# Manual DNS resolution
getent hosts nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc

# Check resolver config
cat /etc/resolv.conf
# Should show something like this (the nameserver IP varies by cluster):
#   nameserver 172.30.0.10
#   search gpu-benchmark.svc.cluster.local svc.cluster.local cluster.local

# Verify headless Service exists
kubectl get svc -n gpu-benchmark | grep nccl-validation
# NAME              TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
# nccl-validation   ClusterIP   None         <none>        <none>    30s

# Check endpoints (workers registered?)
kubectl get endpoints nccl-validation -n gpu-benchmark
# NAME              ENDPOINTS                           AGE
# nccl-validation   10.128.4.15:22,10.128.5.23:22      30s

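# CoreDNS logs for NXDOMAIN debugging (OpenShift)
kubectl logs -n openshift-dns -l dns.operator.openshift.io/daemonset-dns
# On upstream Kubernetes, CoreDNS usually runs in kube-system:
# kubectl logs -n kube-system -l k8s-app=kube-dns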
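
The short .svc names in the hostfile resolve only because the resolver appends the search domains from resolv.conf; with the search list shown above, the cluster.local entry completes a name ending in .svc. The sketch below prints the candidate names for a given host. It is a simplified model (it ignores ndots ordering and the final absolute lookup), and `search_candidates` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: list the names a resolver would try when it applies
# the "search" list from a resolv.conf file to a hostname.
# Simplified model: ignores ndots ordering and the final absolute lookup.
search_candidates() {
  local name="$1" resolv_conf="${2:-/etc/resolv.conf}"
  local domain
  # If several "search" lines exist, use the last one.
  for domain in $(awk '$1 == "search" { line = $0 }
                       END { n = split(line, a, " ");
                             for (i = 2; i <= n; i++) print a[i] }' \
                       "${resolv_conf}"); do
    echo "${name}.${domain}"
  done
}

# Example (inside the launcher pod):
#   search_candidates nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc
```

If none of the printed candidates equals the worker's full .svc.cluster.local name, the short hostfile entries cannot resolve and FQDN rewriting is the likely fix.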

Common Issues

DNS WAIT repeats until the 120s timeout

Cause: Workers not in Running state yet, or headless Service not created
Fix: Check worker pod status; verify MPI Operator created the Service

"NXDOMAIN" even after workers are Running

Cause: CoreDNS caches negative responses, commonly for up to 30s depending on the cluster's cache settings
Fix: Wait longer (increase MPI_DNS_WAIT_SECONDS) or, as a last resort, restart the CoreDNS pods to clear the cache

Resolves on retry but first attempt always fails

Cause: Normal race condition — DNS propagation takes 5-30s
Fix: The wait loop handles this. Set interval to 3-5s for responsiveness.

Short names resolve but FQDN doesn’t (or vice versa)

Cause: resolv.conf search domains not matching hostfile format
Fix: Enable REWRITE_MPI_HOSTFILE_FQDN=true to add .cluster.local

Workers resolve to wrong IP

Cause: Stale endpoint from previous job run with same name
Fix: Delete old MPIJob fully before redeploying; use unique names

Best Practices

Always implement DNS wait loop — never assume instant resolution
120s timeout is generous — most resolve within 30s
3s interval balances responsiveness with DNS server load
Print debug info on timeout — hostfile, resolv.conf, /etc/hosts
Use cleanPodPolicy: None while debugging (keeps pods around after the job ends so DNS failures can be inspected)
Unique MPIJob names per run — avoids stale DNS cache hits
Check endpoints, not just Service — Service exists ≠ endpoints populated
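
For the unique-name practice above, a per-run name can be derived from a base name and a timestamp. The sketch below is one possible approach; `unique_mpijob_name` is a hypothetical helper, and the 52-character cap is an assumption that reserves room for a "-worker-NNN" suffix within the 63-character DNS label limit.

```shell
#!/bin/bash
# Hypothetical helper: derive a per-run MPIJob name from a base name.
# Assumption: job names are capped at 52 characters so that
# "<job>-worker-NNN" still fits in a 63-character DNS label.
unique_mpijob_name() {
  local base="$1"
  local stamp max_base
  stamp="$(date +%Y%m%d-%H%M%S)"
  max_base=$(( 52 - ${#stamp} - 1 ))
  base="${base:0:max_base}"
  base="${base%-}"   # avoid a double hyphen if truncation ends on one
  printf '%s-%s\n' "${base}" "${stamp}"
}

unique_mpijob_name nccl-validation
```

Two runs started within the same second would still collide, so a pipeline that launches jobs in parallel would need an extra discriminator such as a build number.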

Key Takeaways

MPI Operator creates headless Service matching worker pod subdomain
DNS propagation takes 5-30s after pods reach Running state
Implement wait loop with timeout (120s) and debug output on failure
FQDN rewriting handles clusters that need .svc.cluster.local suffix
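getent hosts tests resolution the way applications see it (it goes through NSS, so /etc/hosts and search domains apply), whereas nslookup queries the DNS server directly
CoreDNS negative caching (commonly up to 30s) causes delays after a failed first lookup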
Worker pods must have subdomain field set for DNS to work at all
