ai intermediate ⏱ 15 minutes K8s 1.28+

MPI DNS Resolution and Hostfile for Kubernetes GPU Jobs

Troubleshoot MPI hostfile DNS resolution in Kubeflow MPIJob on Kubernetes. Covers headless Service creation, subdomain configuration, DNS wait loops, and FQDN rewriting.

By Luca Berton • 📖 5 min read

💡 Quick Answer: MPI launcher pods must resolve worker hostnames before starting training. The Kubeflow MPI Operator creates a headless Service whose name matches the subdomain field in worker pods. DNS propagation typically takes 5-30 seconds after pods reach the Running state. Implement a DNS wait loop with a configurable timeout (default 120s) and interval (default 3s) to handle this race condition gracefully.

The Problem

  • The MPI launcher starts before worker pod DNS records propagate
  • Headless Service creation is delayed, and CoreDNS can cache a stale NXDOMAIN in the meantime
  • Some clusters need the .svc.cluster.local FQDN suffix for resolution
  • Hostfile entries generated by MPI Operator may not match DNS names
  • Failed DNS → SSH failure → mpirun crash → entire job fails

The Solution

How MPI Operator Creates DNS

MPIJob "nccl-validation" in namespace "gpu-benchmark"
  └── Creates headless Service: "nccl-validation"
  └── Worker pods with subdomain: "nccl-validation"
  └── Hostfile entries:
      nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc slots=2
      nccl-validation-worker-1.nccl-validation.gpu-benchmark.svc slots=2

DNS resolution:
  nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc
  └── headless Service: nccl-validation (namespace: gpu-benchmark)
  └── endpoint: pod IP of worker-0
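
To make the naming scheme concrete, here is a minimal sketch that prints the names the hostfile would be expected to contain. The function name expected_worker_hosts is hypothetical, and the output assumes the <job>-worker-<index>.<job>.<namespace>.svc layout shown above; verify it against the hostfile your operator version actually generates.

```shell
# Hypothetical helper (not part of MPI Operator): print the expected
# DNS name of each worker, one per line.
# Usage: expected_worker_hosts <job-name> <namespace> <replicas>
expected_worker_hosts() {
  local job="$1" ns="$2" replicas="$3" i=0
  while [ "$i" -lt "$replicas" ]; do
    # <pod hostname>.<subdomain / headless Service>.<namespace>.svc
    printf '%s-worker-%d.%s.%s.svc\n' "$job" "$i" "$job" "$ns"
    i=$((i + 1))
  done
}
```

Comparing the output of expected_worker_hosts nccl-validation gpu-benchmark 2 with the first column of /etc/mpi/hostfile is a quick way to spot a hostfile that does not match the DNS names.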

DNS Wait Loop Implementation

#!/bin/bash
MPI_DNS_WAIT_SECONDS="${MPI_DNS_WAIT_SECONDS:-120}"
MPI_DNS_WAIT_INTERVAL="${MPI_DNS_WAIT_INTERVAL:-3}"
MPI_HOSTFILE="${MPI_HOSTFILE:-/etc/mpi/hostfile}"

wait_for_mpi_dns() {
  local hostfile="$1"
  local elapsed=0

  echo "Waiting for MPI worker DNS records to resolve..."
  echo "Hostfile: ${hostfile}"
  echo "Timeout: ${MPI_DNS_WAIT_SECONDS}s"

  while true; do
    local failed=0

    # The "|| [[ -n ... ]]" also processes a final line without a trailing newline
    while read -r host _ || [[ -n "${host}" ]]; do
      [[ -z "${host}" || "${host}" =~ ^# ]] && continue

      if getent hosts "${host}" >/dev/null 2>&1; then
        local ip
        ip=$(getent hosts "${host}" | awk '{print $1}' | head -1)
        echo "DNS OK: ${host} -> ${ip}"
      else
        echo "DNS WAIT: ${host} not resolvable yet"
        failed=1
      fi
    done < "${hostfile}"

    if [[ "${failed}" -eq 0 ]]; then
      echo "All MPI worker DNS records resolved."
      return 0
    fi

    if [[ "${elapsed}" -ge "${MPI_DNS_WAIT_SECONDS}" ]]; then
      echo "ERROR: Timed out waiting for MPI worker DNS."
      echo ""
      echo "Debug info:"
      echo "--- Hostfile ---"
      cat "${hostfile}"
      echo "--- resolv.conf ---"
      cat /etc/resolv.conf
      echo "--- /etc/hosts ---"
      cat /etc/hosts
      exit 1
    fi

    sleep "${MPI_DNS_WAIT_INTERVAL}"
    elapsed=$((elapsed + MPI_DNS_WAIT_INTERVAL))
  done
}

wait_for_mpi_dns "${MPI_HOSTFILE}"
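
For ad-hoc checks against a single host, the same wait-with-timeout pattern can be factored into a small generic helper. This is a sketch only: retry_until is a hypothetical name, and like the loop above it counts only the sleep time, not the time spent in each lookup.

```shell
# Hypothetical helper: re-run a command until it succeeds or the timeout
# is reached. Returns 0 on success, 1 on timeout.
# Usage: retry_until <timeout_seconds> <interval_seconds> <command...>
retry_until() {
  local timeout="$1" interval="$2" elapsed=0
  shift 2
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}
```

For example, retry_until 120 3 getent hosts nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc waits up to roughly two minutes for one worker. Use a non-zero interval in practice, since an interval of 0 never advances the elapsed counter.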

FQDN Rewriting

# Some clusters only resolve with .svc.cluster.local suffix
# MPI Operator generates: worker-0.svc (short form)
# Cluster DNS expects:    worker-0.svc.cluster.local

REWRITE_MPI_HOSTFILE_FQDN="${REWRITE_MPI_HOSTFILE_FQDN:-false}"

if [[ "${REWRITE_MPI_HOSTFILE_FQDN}" == "true" ]]; then
  cp "${MPI_HOSTFILE}" /tmp/mpi-hostfile.original
  sed 's/\.svc /.svc.cluster.local /g; s/\.svc$/.svc.cluster.local/' \
    /tmp/mpi-hostfile.original > /tmp/mpi-hostfile
  export MPI_HOSTFILE="/tmp/mpi-hostfile"
  echo "Rewritten hostfile:"
  cat "${MPI_HOSTFILE}"
fi
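
The rewrite above matches ".svc" only when it is followed by a space or the end of the line. A slightly more defensive variant, sketched below, also handles tab-separated hostfiles and leaves already-qualified names untouched. The function name rewrite_hostfile_fqdn and its optional cluster-domain argument are assumptions for illustration; cluster.local is only the common default.

```shell
# Hypothetical helper: print the hostfile with ".svc" names qualified.
# Usage: rewrite_hostfile_fqdn <hostfile> [cluster-domain]
rewrite_hostfile_fqdn() {
  local hostfile="$1" domain="${2:-cluster.local}"
  # Match ".svc" only when followed by whitespace or end of line, so
  # names that already end in ".svc.<domain>" are not rewritten twice.
  sed -E "s/\.svc([[:space:]]|\$)/.svc.${domain}\1/" "$hostfile"
}
```

Because the output goes to stdout, it can be redirected to /tmp/mpi-hostfile in the same way as the sed command above.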

Worker Pod Spec Requirements

# subdomain MUST match the headless Service name
Worker:
  template:
    spec:
      subdomain: nccl-validation    # ← This creates DNS records

      # The headless Service is auto-created by MPI Operator with:
      # metadata:
      #   name: nccl-validation     # Matches subdomain
      # spec:
      #   clusterIP: None           # Headless
      #   selector:
      #     app: nccl-validation
      #     mpi-role: worker
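
The two requirements above (subdomain equals the Service name, and the Service is headless) can be checked from a workstation with kubectl. The following is a minimal sketch; check_worker_dns_spec is a hypothetical name, and it assumes your kubeconfig can read pods and Services in the job's namespace.

```shell
# Hypothetical helper: verify that a worker pod's subdomain matches a
# headless Service of the same name.
# Usage: check_worker_dns_spec <pod> <service> <namespace>
check_worker_dns_spec() {
  local pod="$1" svc="$2" ns="$3" subdomain cluster_ip
  subdomain="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.subdomain}')"
  cluster_ip="$(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.clusterIP}')"

  if [ "$subdomain" != "$svc" ]; then
    echo "MISMATCH: pod subdomain '${subdomain}' != Service '${svc}'"
    return 1
  fi
  # A headless Service reports the literal string "None" as its clusterIP
  if [ "$cluster_ip" != "None" ]; then
    echo "NOT HEADLESS: Service '${svc}' has clusterIP '${cluster_ip}'"
    return 1
  fi
  echo "OK: ${pod} uses subdomain '${svc}' backed by a headless Service"
}
```

With the names used in this recipe, the call would be check_worker_dns_spec nccl-validation-worker-0 nccl-validation gpu-benchmark.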

Environment Variables

env:
  # DNS wait before mpirun (gives CoreDNS time to propagate)
  - name: MPI_DNS_WAIT_SECONDS
    value: "120"          # Max wait time

  - name: MPI_DNS_WAIT_INTERVAL
    value: "3"            # Check every 3 seconds

  # Enable FQDN rewriting if short names don't resolve
  - name: REWRITE_MPI_HOSTFILE_FQDN
    value: "false"        # Set "true" if needed

  # Hostfile location (MPI Operator default)
  - name: MPI_HOSTFILE
    value: "/etc/mpi/hostfile"
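
Since these values arrive as plain strings, a typo such as "2m" or an interval of "0" would only surface once the wait loop is already running (with an interval of 0 the elapsed counter never advances). A small pre-flight check, sketched here under the assumption that it runs in the launcher before the wait loop, can fail fast instead. The function name validate_mpi_dns_env is hypothetical.

```shell
# Hypothetical pre-flight check for the environment variables above.
# Returns 0 when all values are usable, 1 otherwise (message on stderr).
validate_mpi_dns_env() {
  local timeout="${MPI_DNS_WAIT_SECONDS:-120}"
  local interval="${MPI_DNS_WAIT_INTERVAL:-3}"
  local rewrite="${REWRITE_MPI_HOSTFILE_FQDN:-false}"

  case "$timeout" in
    ''|*[!0-9]*) echo "MPI_DNS_WAIT_SECONDS must be a whole number, got '${timeout}'" >&2; return 1 ;;
  esac
  case "$interval" in
    ''|*[!0-9]*) echo "MPI_DNS_WAIT_INTERVAL must be a whole number, got '${interval}'" >&2; return 1 ;;
  esac
  if [ "$interval" -eq 0 ]; then
    echo "MPI_DNS_WAIT_INTERVAL must be greater than 0" >&2
    return 1
  fi
  case "$rewrite" in
    true|false) ;;
    *) echo "REWRITE_MPI_HOSTFILE_FQDN must be 'true' or 'false', got '${rewrite}'" >&2; return 1 ;;
  esac
}
```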

Debugging DNS Failures

# Inside launcher pod:

# Check hostfile content
cat /etc/mpi/hostfile

# Manual DNS resolution
getent hosts nccl-validation-worker-0.nccl-validation.gpu-benchmark.svc

# Check resolver config
cat /etc/resolv.conf
# Should show something like (the nameserver IP is cluster-specific):
#   nameserver 172.30.0.10
#   search gpu-benchmark.svc.cluster.local svc.cluster.local cluster.local

# Verify headless Service exists
kubectl get svc -n gpu-benchmark | grep nccl-validation
# NAME              TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
# nccl-validation   ClusterIP   None         <none>        <none>    30s

# Check endpoints (workers registered?)
kubectl get endpoints nccl-validation -n gpu-benchmark
# NAME              ENDPOINTS                           AGE
# nccl-validation   10.128.4.15:22,10.128.5.23:22      30s

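# CoreDNS logs for NXDOMAIN debugging (OpenShift)
kubectl logs -n openshift-dns -l dns.operator.openshift.io/daemonset-dns=default -c dns
# On upstream Kubernetes, CoreDNS usually runs in kube-system:
# kubectl logs -n kube-system -l k8s-app=kube-dns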
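
Hostfile names that end in ".svc" resolve only because the resolver appends a search domain such as cluster.local, so it is worth checking the search list programmatically rather than by eye. The sketch below parses a resolv.conf-style file; search_domains and has_search_domain are hypothetical helper names.

```shell
# Hypothetical helpers for inspecting the resolver search list.

# Print each search domain on its own line (defaults to /etc/resolv.conf).
search_domains() {
  awk '$1 == "search" { for (i = 2; i <= NF; i++) print $i }' "${1:-/etc/resolv.conf}"
}

# Succeed if <domain> appears in the search list of the given file.
# Usage: has_search_domain <domain> [resolv.conf path]
has_search_domain() {
  search_domains "${2:-/etc/resolv.conf}" | grep -qxF -- "$1"
}
```

Inside the launcher pod, has_search_domain cluster.local || echo "consider REWRITE_MPI_HOSTFILE_FQDN=true" gives a quick hint about whether short ".svc" names are likely to resolve.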

Common Issues

DNS WAIT repeats until the 120s timeout

  • Cause: Workers not in Running state yet, or headless Service not created
  • Fix: Check worker pod status; verify MPI Operator created the Service

"NXDOMAIN" even after workers are Running

  • Cause: CoreDNS can cache negative responses for up to 30s in common default configurations
  • Fix: Wait longer (increase MPI_DNS_WAIT_SECONDS) or restart CoreDNS pods

Resolves on retry but first attempt always fails

  • Cause: Normal race condition: DNS propagation takes 5-30s
  • Fix: The wait loop handles this. Set interval to 3-5s for responsiveness.

Short names resolve but FQDN doesn't (or vice versa)

  • Cause: resolv.conf search domains do not match the hostfile format
  • Fix: Enable REWRITE_MPI_HOSTFILE_FQDN=true to add .cluster.local

Workers resolve to wrong IP

  • Cause: Stale endpoint from previous job run with same name
  • Fix: Delete old MPIJob fully before redeploying; use unique names
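
For the wrong-IP case, one way to confirm a stale record is to compare what the launcher resolves with the worker's current pod IP. This is a sketch under stated assumptions: it runs from a workstation with kubectl access, getent is available in the launcher image, and check_worker_ip is a hypothetical name.

```shell
# Hypothetical helper: compare the IP the launcher resolves for a worker
# with the worker pod's current IP.
# Usage: check_worker_ip <launcher-pod> <worker-pod> <worker-hostname> <namespace>
# Returns 0 on match, 1 on mismatch, 2 if the name does not resolve.
check_worker_ip() {
  local launcher="$1" worker="$2" host="$3" ns="$4" dns_ip pod_ip
  dns_ip="$(kubectl exec "$launcher" -n "$ns" -- getent hosts "$host" | awk 'NR == 1 { print $1 }')"
  pod_ip="$(kubectl get pod "$worker" -n "$ns" -o jsonpath='{.status.podIP}')"

  if [ -z "$dns_ip" ]; then
    echo "UNRESOLVED: ${host}"
    return 2
  fi
  if [ "$dns_ip" != "$pod_ip" ]; then
    echo "MISMATCH: ${host} resolves to ${dns_ip}, pod IP is ${pod_ip}"
    return 1
  fi
  echo "MATCH: ${host} -> ${dns_ip}"
}
```

A mismatch suggests a leftover endpoint or cached record from an earlier run, which is the situation the fix above addresses.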

Best Practices

  1. Always implement a DNS wait loop: never assume instant resolution
  2. A 120s timeout is generous: most names resolve within 30s
  3. A 3s interval balances responsiveness with DNS server load
  4. Print debug info on timeout: hostfile, resolv.conf, /etc/hosts
  5. Use cleanPodPolicy: None to prevent premature Service deletion
  6. Use unique MPIJob names per run to avoid stale DNS cache hits
  7. Check endpoints, not just the Service: an existing Service does not mean its endpoints are populated

Key Takeaways

  • MPI Operator creates headless Service matching worker pod subdomain
  • DNS propagation takes 5-30s after pods reach Running state
  • Implement wait loop with timeout (120s) and debug output on failure
  • FQDN rewriting handles clusters that need .svc.cluster.local suffix
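  • getent hosts is more reliable than nslookup for testing resolution because it goes through the same NSS lookup path (/etc/hosts, search domains) that applications use
  • CoreDNS negative caching (up to 30s) causes delays after a failed first lookup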
  • Worker pods must have subdomain field set for DNS to work at all
#mpi #dns #networking #troubleshooting #openshift
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
