Kubernetes 1.36 Gang Scheduling
Use gang scheduling in Kubernetes 1.36 to schedule Pod groups atomically. Essential for distributed ML training, MPI jobs, and Spark workloads.
💡 Quick Answer: Kubernetes 1.36 advances Gang Scheduling (KEP-4671) with Workload Scheduling Cycles and Delayed Preemption. Pod groups are scheduled atomically (all-or-nothing), which prevents partial-scheduling deadlocks in distributed ML training.
The Problem
Distributed training needs N Pods running simultaneously. Without gang scheduling:
- Partial scheduling: 7 of 8 GPU Pods schedule, but the 8th can't find resources. The 7 idle Pods waste expensive GPUs while waiting.
- Deadlock: Two 4-Pod jobs each get 2 Pods scheduled. Neither job can start, and both hold GPUs hostage.
- Resource waste: Partially scheduled jobs block cluster capacity for minutes or hours.
- Training failures: Workers that start before all peers are ready crash or time out during NCCL initialization.
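The deadlock case can be reproduced with a toy model. The following Python sketch is not scheduler code; it assumes a hypothetical cluster with four free GPUs, one GPU per Pod, and a scheduler that places Pods one at a time in round-robin order with no awareness of groups. The function and job names are made up for illustration.

```python
from collections import deque

def schedule_per_pod(free_gpus, jobs):
    """Place Pods one at a time, round-robin across jobs, with no gang awareness.

    jobs maps job name -> number of single-GPU Pods the job needs to start.
    Returns job name -> number of Pods that were placed.
    """
    placed = {name: 0 for name in jobs}
    queue = deque(jobs)  # job names in arrival order
    while free_gpus > 0 and queue:
        name = queue.popleft()
        if placed[name] < jobs[name]:
            placed[name] += 1
            free_gpus -= 1
            queue.append(name)  # the job may still need more Pods
    return placed

jobs = {"job-a": 4, "job-b": 4}
placed = schedule_per_pod(free_gpus=4, jobs=jobs)
runnable = [name for name, count in placed.items() if count == jobs[name]]
print(placed)    # {'job-a': 2, 'job-b': 2}
print(runnable)  # []
```

All four GPUs are allocated, yet no job has enough Pods to begin training.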
The Solution
Gang scheduling ensures all Pods in a group schedule together or none do. Kubernetes 1.36 introduces native PodGroup and Workload APIs.
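The all-or-nothing rule can be sketched in a few lines of Python. This is a simplified illustration, not the kube-scheduler implementation: the function name, the node names, and the first-fit placement strategy are all assumptions made for the example.

```python
def gang_schedule(node_free_gpus, min_member, gpus_per_pod):
    """Return a Pod-to-node assignment for the whole group, or None.

    node_free_gpus maps node name -> free GPUs. The input is never mutated,
    so a failed attempt reserves nothing.
    """
    remaining = dict(node_free_gpus)  # trial copy of cluster capacity
    assignment = []
    for _ in range(min_member):
        # First-fit: pick the first node with enough free GPUs for one Pod.
        node = next((n for n, free in remaining.items() if free >= gpus_per_pod), None)
        if node is None:
            return None  # one member does not fit, so place none of them
        remaining[node] -= gpus_per_pod
        assignment.append(node)
    return assignment

nodes = {"gpu-node-1": 8, "gpu-node-2": 8, "gpu-node-3": 8}
print(gang_schedule(nodes, min_member=4, gpus_per_pod=8))  # None
nodes["gpu-node-4"] = 8
print(gang_schedule(nodes, min_member=4, gpus_per_pod=8))
# ['gpu-node-1', 'gpu-node-2', 'gpu-node-3', 'gpu-node-4']
```

With three nodes the group is rejected as a whole and no GPU is held; once a fourth node is available, all four members are placed together.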
Define a PodGroup
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: llm-training-group
  namespace: ml-training
spec:
  minMember: 4
  scheduleTimeoutSeconds: 300
Training Job with Gang Scheduling
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
  namespace: ml-training
  labels:
    scheduling.k8s.io/pod-group: llm-training-group
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        # Ties each Pod to the PodGroup defined above
        scheduling.k8s.io/pod-group: llm-training-group
    spec:
      schedulerName: default-scheduler
      containers:
        - name: trainer
          image: registry.example.com/training:v2.0
          command:
            - torchrun
            - --nnodes=4
            - --nproc_per_node=8
            - --rdzv_backend=c10d
            # Rendezvous host: this name must resolve from every worker
            # (for example, Indexed completion mode plus a headless Service)
            - --rdzv_endpoint=llm-finetune-0:29400
            - train.py
          resources:
            limits:
              nvidia.com/gpu: 8
              memory: 256Gi
            requests:
              nvidia.com/gpu: 8
              memory: 256Gi
      restartPolicy: Never
Workload API (1.36 Enhancement)
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  podGroups:
    - name: workers
      minMember: 4
      maxMember: 8
      template:
        spec:
          containers:
            - name: trainer
              image: registry.example.com/training:v2.0
              resources:
                limits:
                  nvidia.com/gpu: 8
  schedulingPolicy:
    preemptionPolicy: DelayedPreemption
    schedulingCycle: Atomic
Delayed Preemption (New in 1.36)
Instead of immediately preempting lower-priority Pods, the scheduler waits to see if resources free up naturally:
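As a rough illustration of that behaviour (not the scheduler's actual code), the decision can be sketched in Python. The function name, its arguments, and the three return values are hypothetical.

```python
def preemption_decision(group_fits, waited_seconds, timeout_seconds):
    """Decide the next step for a pending Pod group.

    group_fits: True if every member fits on currently free capacity.
    waited_seconds: how long the group has been pending.
    timeout_seconds: the delay before preemption is allowed.
    """
    if group_fits:
        return "schedule"  # capacity freed up; no eviction needed
    if waited_seconds < timeout_seconds:
        return "wait"      # keep waiting instead of evicting
    return "preempt"       # delay elapsed; evict lower-priority Pods

print(preemption_decision(False, 30, 120))   # wait
print(preemption_decision(True, 45, 120))    # schedule
print(preemption_decision(False, 120, 120))  # preempt
```

The manifest that follows configures this delay with `delayedPreemptionTimeout`: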
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: large-training
spec:
  podGroups:
    - name: workers
      minMember: 8
  schedulingPolicy:
    preemptionPolicy: DelayedPreemption
    delayedPreemptionTimeout: 120s  # Wait 2 min before preempting
Integration with Kueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: training-workload
spec:
  podSets:
    - name: workers
      count: 4
      minCount: 4  # Gang: require all 4
      template:
        spec:
          containers:
            - name: trainer
              image: registry.example.com/training:v2.0
              resources:
                requests:
                  nvidia.com/gpu: 8
  queueName: gpu-queue
MPI Job with Gang Scheduling
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-benchmark
  labels:
    scheduling.k8s.io/pod-group: nccl-bench-group
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            scheduling.k8s.io/pod-group: nccl-bench-group
        spec:
          containers:
            - name: launcher
              image: registry.example.com/nccl-tests:v2.0
              # Each list item is one argv entry, so flags and values are separate items
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - "32"  # 4 workers x 8 slots
                - -x
                - NCCL_DEBUG=INFO
                - /opt/nccl-tests/build/all_reduce_perf
                - -b
                - 1G
                - -e
                - 8G
                - -f
                - "2"
    Worker:
      replicas: 4
      template:
        metadata:
          labels:
            scheduling.k8s.io/pod-group: nccl-bench-group
        spec:
          containers:
            - name: worker
              image: registry.example.com/nccl-tests:v2.0
              resources:
                limits:
                  nvidia.com/gpu: 8
Verify Gang Scheduling
# Check PodGroup status (the PodGroup lives in the ml-training namespace)
kubectl get podgroup llm-training-group -n ml-training -o yaml
# Status shows: scheduled: true, members: 4/4
# Check scheduling events
kubectl get events -n ml-training --field-selector reason=GangScheduled
# Output: PodGroup llm-training-group successfully gang-scheduled (4/4 members)
# Check for deadlocks
kubectl get events -n ml-training --field-selector reason=GangSchedulingTimeout
Common Issues
PodGroup stuck in Pending
- Cause: Not enough resources for all members simultaneously
- Fix: Reduce minMember, add nodes, or configure preemption
Scheduling deadlock between groups
- Cause: Multiple PodGroups competing for the same resource pool
- Fix: Use priority classes and Delayed Preemption to break ties
Timeout before all members scheduled
- Cause: scheduleTimeoutSeconds too short for cluster size
- Fix: Increase timeout or reduce group size
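When diagnosing any of these issues, it helps to know how many group members the scheduler has actually placed. The sketch below counts them from the parsed output of `kubectl get pods -o json`, using the standard `PodScheduled` Pod condition. The helper name is hypothetical, the label key is the one used in this article's manifests, and the sample data is hand-written rather than real cluster output.

```python
import json

def scheduled_members(pod_list, group):
    """Count Pods labelled with the given group and how many are scheduled.

    pod_list is the parsed output of `kubectl get pods -o json`.
    Returns (scheduled, total).
    """
    label = "scheduling.k8s.io/pod-group"  # label key used in this article
    total = scheduled = 0
    for pod in pod_list.get("items", []):
        if pod["metadata"].get("labels", {}).get(label) != group:
            continue
        total += 1
        conditions = pod.get("status", {}).get("conditions", [])
        if any(c["type"] == "PodScheduled" and c["status"] == "True" for c in conditions):
            scheduled += 1
    return scheduled, total

# Hand-written sample standing in for real kubectl output.
sample = json.loads("""
{"items": [
  {"metadata": {"labels": {"scheduling.k8s.io/pod-group": "llm-training-group"}},
   "status": {"conditions": [{"type": "PodScheduled", "status": "True"}]}},
  {"metadata": {"labels": {"scheduling.k8s.io/pod-group": "llm-training-group"}},
   "status": {"conditions": [{"type": "PodScheduled", "status": "False"}]}},
  {"metadata": {"labels": {"app": "other"}}, "status": {}}
]}
""")
print(scheduled_members(sample, "llm-training-group"))  # (1, 2)
```

Against a live cluster, the sample could be replaced with `json.load(sys.stdin)` and fed from `kubectl get pods -n ml-training -o json`.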
Best Practices
- Set minMember carefully: allow elastic scaling when possible (e.g., 4 min, 8 max)
- Use priority classes: ensure training jobs can preempt lower-priority workloads
- Combine with Kueue: queue management prevents resource contention
- Set reasonable timeouts: 5-10 minutes for large GPU clusters
- Monitor scheduling latency: gang scheduling adds overhead vs individual Pod scheduling
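One way to approximate gang scheduling latency is to measure the time from the first member's creation to the moment the last member is marked scheduled. The sketch below does this from Pod objects, relying on the standard `metadata.creationTimestamp` field and the `lastTransitionTime` of the `PodScheduled` condition. The function name and the sample timestamps are invented for illustration, and this definition of latency is an assumption rather than an official metric.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used in Pod metadata and status

def gang_scheduling_latency(pods):
    """Seconds from the first member's creation to the last member's scheduling.

    pods is a list of Pod objects (dicts) that all belong to one group and
    all have a PodScheduled condition with status "True".
    """
    created = []
    scheduled = []
    for pod in pods:
        created.append(datetime.strptime(pod["metadata"]["creationTimestamp"], TS_FORMAT))
        for cond in pod["status"]["conditions"]:
            if cond["type"] == "PodScheduled" and cond["status"] == "True":
                scheduled.append(datetime.strptime(cond["lastTransitionTime"], TS_FORMAT))
    return (max(scheduled) - min(created)).total_seconds()

# Hand-written sample: two members created 2 s apart, both scheduled at 10:00:45.
pods = [
    {"metadata": {"creationTimestamp": "2025-01-01T10:00:00Z"},
     "status": {"conditions": [{"type": "PodScheduled", "status": "True",
                                "lastTransitionTime": "2025-01-01T10:00:45Z"}]}},
    {"metadata": {"creationTimestamp": "2025-01-01T10:00:02Z"},
     "status": {"conditions": [{"type": "PodScheduled", "status": "True",
                                "lastTransitionTime": "2025-01-01T10:00:45Z"}]}},
]
print(gang_scheduling_latency(pods))  # 45.0
```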
Key Takeaways
- Gang scheduling is Alpha v2 in Kubernetes 1.36 with Workload API and Delayed Preemption
- All Pods in a group schedule atomically, which prevents partial-scheduling waste
- Delayed Preemption avoids unnecessary evictions
- Essential for distributed ML training (PyTorch DDP, Horovod, MPI)
- Integrates with Kueue for enterprise batch scheduling

