Multi-Node Distributed Training on Kubernetes
Run distributed deep learning training across multiple GPU nodes on Kubernetes. Covers PyTorch DDP, DeepSpeed, Horovod, and MPI jobs with NCCL optimization.
💡 Quick Answer: Multi-node training on Kubernetes uses PyTorch DDP/FSDP or DeepSpeed with torchrun/MPI, scheduled via gang scheduling. Each node runs a worker with 8 GPUs communicating via NCCL over NVLink (intra-node) and RDMA/InfiniBand (inter-node).
The Problem
Training large models (7B+ parameters) on a single node is too slow:
- Time: Fine-tuning a 70B model on 8 GPUs takes weeks vs days on 32 GPUs
- Memory: Full model + optimizer states + gradients exceed single-node capacity
- Throughput: More GPUs allow a larger effective batch size and more samples per second, which usually shortens wall-clock training time
- Cost: 4 nodes × 2 days is cheaper than 1 node × 8 days (if using spot/preemptible)
The Solution
PyTorch DDP with torchrun (Recommended)
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
  namespace: training
  labels:
    scheduling.k8s.io/pod-group: finetune-group
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: llm-finetune
        scheduling.k8s.io/pod-group: finetune-group
    spec:
      subdomain: finetune-workers
      setHostnameAsFQDN: true
      containers:
        - name: trainer
          image: registry.example.com/training:v2.0
          command:
            - torchrun
            - --nnodes=4
            - --nproc_per_node=8
            - --node_rank=$(JOB_COMPLETION_INDEX)
            - --rdzv_backend=c10d
            - --rdzv_endpoint=llm-finetune-0.finetune-workers:29400
            - train.py
            - --model_name=meta-llama/Llama-3.1-70B
            - --batch_size=4
            - --gradient_accumulation_steps=8
            - --learning_rate=2e-5
            - --num_epochs=3
            - --output_dir=/checkpoints/run-001
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
            - name: NCCL_DEBUG
              value: "INFO"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
            - name: NCCL_IB_DISABLE
              value: "0"
            - name: MASTER_ADDR
              value: "llm-finetune-0.finetune-workers"
            - name: MASTER_PORT
              value: "29400"
          resources:
            limits:
              nvidia.com/gpu: 8
              memory: 512Gi
              rdma/rdma_shared_device_a: 1
            requests:
              nvidia.com/gpu: 8
              memory: 256Gi
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: checkpoints
              mountPath: /checkpoints
            - name: dataset
              mountPath: /data
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
        - name: checkpoints
          persistentVolumeClaim:
            claimName: training-checkpoints
        - name: dataset
          persistentVolumeClaim:
            claimName: training-dataset
      restartPolicy: Never
---
# Headless Service for DNS resolution between workers
apiVersion: v1
kind: Service
metadata:
  name: finetune-workers
  namespace: training
spec:
  clusterIP: None
  selector:
    app: llm-finetune
  ports:
    - port: 29400
      name: rdzv

DeepSpeed ZeRO-3 Multi-Node
apiVersion: batch/v1
kind: Job
metadata:
  name: deepspeed-training
  namespace: training
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: deepspeed-training   # lets a headless Service select these Pods
    spec:
      # Needs a headless Service named ds-workers (same pattern as finetune-workers)
      subdomain: ds-workers
      containers:
        - name: trainer
          image: registry.example.com/deepspeed-training:v3.0
          command:
            - torchrun
            - --nnodes=4
            - --nproc_per_node=8
            - --node_rank=$(JOB_COMPLETION_INDEX)
            - --rdzv_backend=c10d
            - --rdzv_endpoint=deepspeed-training-0.ds-workers:29400
            - train.py
            - --deepspeed
            - --deepspeed_config=ds_config.json
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: config
              mountPath: /app/ds_config.json
              subPath: ds_config.json
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
        - name: config
          configMap:
            name: deepspeed-config
      restartPolicy: Never
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: deepspeed-config
  namespace: training
data:
  ds_config.json: |
    {
      "train_batch_size": 128,
      "gradient_accumulation_steps": 4,
      "fp16": {"enabled": true},
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "none"},
        "offload_param": {"device": "none"},
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6
      },
      "communication_data_type": "fp16",
      "gradient_clipping": 1.0
    }

Kubeflow MPIJob (Horovod/NCCL)
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-training
  namespace: training
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: registry.example.com/horovod-training:v2.0
              # Each argv element is its own list item; a combined string
              # such as "-np 32" would reach mpirun as a single argument.
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - "32"
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - NCCL_SOCKET_IFNAME=eth0
                - -x
                - LD_LIBRARY_PATH
                - --mca
                - btl_tcp_if_include
                - eth0
                - python
                - train.py
                - --epochs
                - "10"
                - --batch-size
                - "64"
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: worker
              image: registry.example.com/horovod-training:v2.0
              resources:
                limits:
                  nvidia.com/gpu: 8
                  memory: 512Gi
              volumeMounts:
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 64Gi

NCCL Environment Optimization
env:
  # Network interface selection
  - name: NCCL_SOCKET_IFNAME
    value: "eth0"        # Or specific RDMA interface
  # InfiniBand/RDMA
  - name: NCCL_IB_DISABLE
    value: "0"           # 0 = enable IB
  - name: NCCL_IB_HCA
    value: "mlx5"        # Mellanox HCA device
  - name: NCCL_IB_GID_INDEX
    value: "3"           # RoCE v2 GID index
  # Performance tuning
  - name: NCCL_BUFFSIZE
    value: "8388608"     # 8MB buffer
  - name: NCCL_NTHREADS
    value: "512"
  - name: NCCL_ALGO
    value: "Ring,Tree"   # Algorithm selection
  # Debugging
  - name: NCCL_DEBUG
    value: "WARN"        # INFO for troubleshooting
  - name: NCCL_DEBUG_SUBSYS
    value: "ALL"

Checkpoint Management
# Shared storage for checkpoints (all workers write)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-checkpoints
  namespace: training
spec:
  accessModes:
    - ReadWriteMany   # Must be RWX for multi-node access
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 2Ti

Monitoring Training Progress
# List training Pods and the nodes they run on
kubectl get pods -n training -l app=llm-finetune -o wide
# Job Pods get a random name suffix, so look up the index-0 Pod by label
POD=$(kubectl get pods -n training -l app=llm-finetune,batch.kubernetes.io/job-completion-index=0 -o name)
# Check NCCL initialization
kubectl logs -n training "$POD" | grep "NCCL"
# Expected: "NCCL INFO Connected all ... ranks"
# Monitor GPU memory and compute
kubectl exec -n training "$POD" -- nvidia-smi
# Check training throughput (tokens/sec or samples/sec from logs)
kubectl logs -n training "$POD" --tail=20

Scaling Formula
Effective batch size = micro_batch × gradient_accumulation × num_gpus × num_nodes
Example:
  micro_batch = 4
  gradient_accumulation = 8
  num_gpus = 8
  num_nodes = 4
  Effective batch = 4 × 8 × 8 × 4 = 1024
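The arithmetic above, together with the linear scaling rule that follows, can be sketched as two small Python helpers. This is a minimal illustration, not part of any training framework: the function names are made up for this example, and the base batch size of 256 used in the learning-rate call is a hypothetical value chosen only to show the calculation.

```python
def effective_batch_size(micro_batch, gradient_accumulation, gpus_per_node, num_nodes):
    """Samples consumed per optimizer step across the whole job."""
    return micro_batch * gradient_accumulation * gpus_per_node * num_nodes


def linear_scaled_lr(lr_base, effective_batch, base_batch):
    """Linear scaling rule: scale the learning rate with the batch size ratio."""
    return lr_base * (effective_batch / base_batch)


# Values from the example in the text: 4 nodes with 8 GPUs each.
batch = effective_batch_size(micro_batch=4, gradient_accumulation=8,
                             gpus_per_node=8, num_nodes=4)
print(batch)

# base_batch=256 is a hypothetical reference point, not taken from the text.
print(linear_scaled_lr(lr_base=2e-5, effective_batch=batch, base_batch=256))
```

The same product explains the DeepSpeed config shown earlier: with 32 GPUs and 4 accumulation steps, a train_batch_size of 128 corresponds to a micro batch of 1 per GPU.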
Linear scaling rule: lr_new = lr_base × (effective_batch / base_batch)

Common Issues
NCCL timeout between nodes
- Cause: Firewall blocking ports, wrong interface, or network too slow
- Fix: Open ports 29400-29500; set NCCL_SOCKET_IFNAME; verify connectivity between all Pod IPs
Workers not finding each other (rendezvous failure)
- Cause: DNS not resolving headless service names
- Fix: Ensure subdomain matches Service name; use setHostnameAsFQDN: true
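To narrow down a rendezvous failure, it can help to list the per-Pod DNS names an Indexed Job produces behind a headless Service and check which ones fail to resolve from inside a worker Pod. The sketch below assumes the `<job-name>-<index>.<subdomain>` naming used in the manifests above; the helper names are hypothetical, and the resolver is injectable so the logic can be exercised without a cluster.

```python
import socket


def worker_hostnames(job_name, subdomain, num_workers):
    """Per-Pod DNS names for an Indexed Job behind a headless Service."""
    return [f"{job_name}-{index}.{subdomain}" for index in range(num_workers)]


def unresolved_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that fail DNS resolution."""
    failed = []
    for host in hostnames:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed


hosts = worker_hostnames("llm-finetune", "finetune-workers", 4)
print(hosts)
```

Calling `unresolved_hosts(hosts)` from a shell inside one of the worker Pods returns the names that do not resolve; an empty list suggests DNS is not the cause.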
Training slower with more nodes (negative scaling)
- Cause: Communication overhead exceeds compute benefit (model too small)
- Fix: Increase batch size proportionally; use gradient accumulation; ensure RDMA is active
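One way to tell whether a job is scaling poorly is to compare measured throughput against the ideal linear speedup. The helper below is a sketch of that comparison; the function names and the throughput figures are hypothetical and only illustrate the calculation.

```python
def scaling_efficiency(base_throughput, base_nodes, scaled_throughput, scaled_nodes):
    """Measured throughput as a fraction of ideal linear scaling (1.0 = perfect)."""
    ideal = base_throughput * (scaled_nodes / base_nodes)
    return scaled_throughput / ideal


def is_negative_scaling(base_throughput, scaled_throughput):
    """True when adding nodes made the job slower in absolute terms."""
    return scaled_throughput < base_throughput


# Hypothetical measurements in samples/sec: 1 node vs 4 nodes.
print(scaling_efficiency(1000.0, 1, 3200.0, 4))
print(is_negative_scaling(1000.0, 900.0))
```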
OOM during backward pass
- Cause: Activation memory peak exceeds GPU RAM
- Fix: Enable gradient checkpointing; use DeepSpeed ZeRO-3; reduce micro batch size
Checkpoint corruption with multi-node writes
- Cause: Multiple ranks writing simultaneously without coordination
- Fix: Only rank 0 saves full checkpoint; use dist.barrier() before/after
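The rank-0 checkpoint fix can be sketched as follows. To keep the example runnable without a GPU cluster, the barrier and the save function are passed in as callables; in a real training script they would typically be `torch.distributed.barrier` and `torch.save`. The function name `save_checkpoint_rank0` is hypothetical, and the write-to-temp-file-then-rename step is an extra safeguard added in this sketch rather than something the text prescribes.

```python
import os
import pickle
import tempfile


def _pickle_save(state, path):
    """Stand-in for torch.save so the sketch runs with the standard library."""
    with open(path, "wb") as handle:
        pickle.dump(state, handle)


def save_checkpoint_rank0(state, path, rank, barrier, save_fn=_pickle_save):
    """Write `state` to `path` from rank 0 only, fenced by two barriers."""
    barrier()  # every rank has finished the step before the write starts
    if rank == 0:
        directory = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        os.close(fd)
        try:
            save_fn(state, tmp_path)
            os.replace(tmp_path, path)  # rename within the same directory
        finally:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)  # clean up only if the save failed
    barrier()  # no rank moves on until the checkpoint is complete
```

Every rank must call the function, since a barrier that only some ranks reach will hang the job.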
Best Practices
- Use gang scheduling: all workers must start together or not at all
- Size /dev/shm at 32-64Gi: NCCL uses shared memory extensively
- Use RWX storage for checkpoints: NFS or parallel filesystem (Lustre, GPFS)
- Enable RDMA/InfiniBand: 200+ Gbps vs 25 Gbps Ethernet
- Gradient checkpointing: trades compute for memory (essential for large models)
- Monitor NCCL bandwidth: should see near line-rate for well-configured clusters
- Use Indexed Job completion mode: each Pod gets a unique index for rank assignment
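On the training-script side, rank assignment shows up as environment variables: torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to every process it launches. The sketch below shows one way a script such as train.py might read them and sanity-check the world size against the launch flags. The `DistEnv` class and both helper names are hypothetical, and the environment dictionary in the example stands in for what one process would see.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int         # global rank across all nodes
    local_rank: int   # rank within this node, usually the GPU index
    world_size: int   # total number of processes in the job


def read_dist_env(environ=os.environ):
    """Parse the variables torchrun sets for each worker process."""
    return DistEnv(
        rank=int(environ["RANK"]),
        local_rank=int(environ["LOCAL_RANK"]),
        world_size=int(environ["WORLD_SIZE"]),
    )


def expected_world_size(nnodes, nproc_per_node):
    """World size implied by the --nnodes and --nproc_per_node flags."""
    return nnodes * nproc_per_node


# Simulated environment for one process in the 4-node, 8-GPU-per-node job.
example_env = {"RANK": "9", "LOCAL_RANK": "1", "WORLD_SIZE": "32"}
env = read_dist_env(example_env)
print(env)
print(env.world_size == expected_world_size(nnodes=4, nproc_per_node=8))
```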
Key Takeaways
- Multi-node training uses PyTorch DDP/FSDP (torchrun), DeepSpeed ZeRO, or Horovod (MPI)
- torchrun with completionMode: Indexed is the simplest Kubernetes-native approach
- Headless Service + subdomain enables DNS-based worker discovery
- NCCL performance depends on network: RDMA/InfiniBand >> Ethernet
- Gang scheduling prevents resource waste from partial scheduling
- Checkpoint to RWX shared storage for fault tolerance

