Kubernetes 1.36 RestartAllContainers for ML
Use the RestartAllContainers policy in Kubernetes 1.36 to restart all Pod containers in-place when a worker fails, avoiding costly ML training rescheduling.
Quick Answer: Kubernetes 1.36 introduces RestartAllContainers (Alpha). When one container in a multi-container Pod fails, all containers restart in-place instead of the Pod being rescheduled, saving hours of ML training checkpoint recovery time.
The Problem
In distributed ML training, a multi-container Pod might have:
- A training worker container
- A communication sidecar (NCCL, MPI)
- A metrics/logging sidecar
When one container crashes, Kubernetes only restarts that container. But the other containers hold stale state (old NCCL communicator handles, dead rank connections). The training job hangs or produces corrupted results.
The only reliable fix was deleting the entire Pod, which means:
- Waiting for GPU re-scheduling (minutes to hours)
- Reloading model checkpoints (minutes)
- Re-establishing distributed communication (NCCL ring setup)
- Potential loss of uncheckpointed progress
The Solution
RestartAllContainers restarts every container in the Pod when any single container fails, keeping the Pod on the same node with the same GPU allocation.
Enable the Feature Gate (Alpha)
# Add to kube-apiserver and kubelet flags
--feature-gates=RestartAllContainers=true
Configure a Training Pod
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
spec:
  restartPolicy: Always
  containerRestartPolicy: RestartAllContainers  # NEW in 1.36
  containers:
    - name: trainer
      image: registry.example.com/training:v2.0
      command: ["torchrun", "--nproc_per_node=8", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 8
      env:
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        - name: MASTER_ADDR
          value: "training-worker-0.training.default.svc"
    - name: nccl-healthcheck
      image: registry.example.com/nccl-monitor:v1.0
      command: ["nccl-watchdog", "--timeout=300"]
      resources:
        limits:
          nvidia.com/gpu: 0
PyTorchJob with RestartAllContainers
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetune
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containerRestartPolicy: RestartAllContainers
          containers:
            - name: pytorch
              image: registry.example.com/training:v2.0
              command:
                - torchrun
                - --nnodes=4
                - --nproc_per_node=8
                - --rdzv_backend=c10d
                - train.py
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 3
      template:
        spec:
          containerRestartPolicy: RestartAllContainers
          containers:
            - name: pytorch
              image: registry.example.com/training:v2.0
              resources:
                limits:
                  nvidia.com/gpu: 8
Comparison: Without vs With RestartAllContainers
# WITHOUT RestartAllContainers:
# 1. Container "nccl-healthcheck" crashes
# 2. Only nccl-healthcheck restarts
# 3. trainer container holds stale NCCL state
# 4. Training hangs → timeout → Pod deleted → rescheduled
# 5. Total recovery: 10-30 minutes (GPU re-allocation + checkpoint reload)
# WITH RestartAllContainers:
# 1. Container "nccl-healthcheck" crashes
# 2. ALL containers restart together
# 3. Fresh NCCL state, fresh training from last checkpoint
# 4. Pod stays on same node, same GPUs
# 5. Total recovery: 30-60 seconds (checkpoint reload only)
Common Issues
Feature gate not recognized
- Cause: Running Kubernetes < 1.36
- Fix: Upgrade to 1.36+ and enable the RestartAllContainers feature gate
Containers restart too aggressively
- Cause: One flaky container causes constant full-Pod restarts
- Fix: Fix the flaky container; consider using restartPolicy: OnFailure for non-critical sidecars
GPU not released during restart
- Cause: Expected behavior. An in-place restart keeps the GPU allocation.
- Fix: This is the desired behavior. GPUs stay allocated to the Pod.
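To spot the "restarts too aggressively" case early, restart counts can be read from the Pod status. The following is a minimal sketch, assuming the Pod object comes from `kubectl get pod <name> -o json` and relying only on the standard `status.containerStatuses[].restartCount` field. The function names and the threshold of 5 are hypothetical choices for illustration, and the sketch makes no assumption about how the alpha feature itself reports counts.

```python
import json


def restart_counts(pod: dict) -> dict:
    """Map each container name to its restartCount from a Pod object."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return {s["name"]: s.get("restartCount", 0) for s in statuses}


def flaky_containers(pod: dict, threshold: int = 5) -> list:
    """Return the names of containers restarted at least `threshold` times.

    The threshold is an illustrative value, not a Kubernetes default.
    """
    counts = restart_counts(pod)
    return sorted(name for name, count in counts.items() if count >= threshold)


# Trimmed sample of `kubectl get pod training-worker-0 -o json` output.
sample_pod = json.loads("""
{
  "metadata": {"name": "training-worker-0"},
  "status": {
    "containerStatuses": [
      {"name": "trainer", "restartCount": 7},
      {"name": "nccl-healthcheck", "restartCount": 2}
    ]
  }
}
""")

print(restart_counts(sample_pod))
print(flaky_containers(sample_pod))
```

In practice the Pod JSON could be piped in with `json.load(sys.stdin)` instead of the inline sample, and the result fed to whatever alerting the cluster already uses.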
Best Practices
- Use for ML training Pods: the primary use case for this feature
- Implement checkpointing: restart is only useful if training can resume
- Monitor restart counts: frequent full restarts indicate a deeper issue
- Combine with liveness probes: detect hangs early, trigger clean restarts
- Pin to specific nodes: use nodeSelector/affinity to keep GPU locality
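The checkpointing practice above is what makes an in-place restart cheap, so here is a minimal sketch of a resumable loop. It assumes the checkpoint path sits on storage that outlives the container (for example a PersistentVolume mount), and it uses JSON and a step counter as a stand-in for real model state. The names `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical, not part of any library.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so a restart mid-write never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in a single step on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, checkpoint_every: int = 100) -> dict:
    """Resume from the last checkpoint and run until total_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # One real training step (forward, backward, optimizer) would run here.
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

In a PyTorch job the same shape would typically hold with `torch.save` and `torch.load` on the model and optimizer `state_dict()` in place of JSON. The write-to-temp-then-rename step matters here because a full-Pod restart can interrupt the trainer at any point, including during a checkpoint write.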
Key Takeaways
- RestartAllContainers is Alpha in Kubernetes 1.36 and requires a feature gate
- All containers restart together when any single container fails
- Pod stays on the same node with the same resource allocation (GPUs)
- Reduces ML training recovery from 10-30 minutes to 30-60 seconds
- Essential for distributed training with NCCL sidecars and multi-container Pods

