ai • advanced • 15 minutes • K8s 1.28+

Kubernetes 1.36 Gang Scheduling

Use gang scheduling in Kubernetes 1.36 to schedule Pod groups atomically. Essential for distributed ML training, MPI jobs, and Spark workloads.

By Luca Berton • 5 min read

Quick Answer: Kubernetes 1.36 advances Gang Scheduling (KEP-4671) with Workload Scheduling Cycles and Delayed Preemption. Pod groups are scheduled atomically (all-or-nothing), which prevents partial scheduling deadlocks in distributed ML training.

The Problem

Distributed training needs N Pods running simultaneously. Without gang scheduling:

  • Partial scheduling: 7 of 8 GPU Pods schedule, but the 8th can't find resources. The 7 idle Pods waste expensive GPUs while waiting.
  • Deadlock: Two 4-Pod jobs each get 2 Pods scheduled. Neither can complete, and both hold GPUs hostage.
  • Resource waste: Partially scheduled jobs block cluster capacity for minutes or hours.
  • Training failures: Workers that start before all peers are ready crash or time out on NCCL initialization.

The Solution

Gang scheduling ensures that all Pods in a group schedule together or none do. Kubernetes 1.36 ships native PodGroup and Workload APIs at alpha level (scheduling.k8s.io/v1alpha2), so the fields shown here may change between releases.

Define a PodGroup

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: llm-training-group
  namespace: ml-training
spec:
  minMember: 4
  scheduleTimeoutSeconds: 300

Training Job with Gang Scheduling

apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
  namespace: ml-training
  labels:
    scheduling.k8s.io/pod-group: llm-training-group
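spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed    # Gives Pods stable hostnames: llm-finetune-0 .. llm-finetune-3
  template:
    metadata:
      labels:
        scheduling.k8s.io/pod-group: llm-training-group
    spec:
      schedulerName: default-scheduler
      subdomain: llm-finetune    # Must match a headless Service of the same name
      containers:
        - name: trainer
          image: registry.example.com/training:v2.0
          command:
            - torchrun
            - --nnodes=4
            - --nproc_per_node=8
            - --rdzv_backend=c10d
            - --rdzv_endpoint=llm-finetune-0.llm-finetune:29400
            - train.py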
          resources:
            limits:
              nvidia.com/gpu: 8
              memory: 256Gi
            requests:
              nvidia.com/gpu: 8
              memory: 256Gi
      restartPolicy: Never
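
The torchrun rendezvous endpoint in the Job above has to resolve through cluster DNS. One common way to get that, sketched here under the assumption that the Job runs with completionMode: Indexed and its Pod template sets subdomain: llm-finetune, is a headless Service whose name matches the subdomain. The Service name and the port name rdzv are illustrative choices, not something the recipe prescribes.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-finetune            # Assumed to equal the Pod template's subdomain
  namespace: ml-training
spec:
  clusterIP: None               # Headless: DNS returns per-Pod records
  publishNotReadyAddresses: true  # Let peers resolve each other before they report Ready
  selector:
    job-name: llm-finetune      # Label the Job controller adds to its Pods
  ports:
    - name: rdzv
      port: 29400               # Rendezvous port used by --rdzv_endpoint
```

With this in place, the first indexed Pod is reachable as llm-finetune-0.llm-finetune inside the ml-training namespace.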

Workload API (1.36 Enhancement)

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  podGroups:
    - name: workers
      minMember: 4
      maxMember: 8
      template:
        spec:
          containers:
            - name: trainer
              image: registry.example.com/training:v2.0
              resources:
                limits:
                  nvidia.com/gpu: 8
  schedulingPolicy:
    preemptionPolicy: DelayedPreemption
    schedulingCycle: Atomic

Delayed Preemption (New in 1.36)

Instead of immediately preempting lower-priority Pods, the scheduler waits to see if resources free up naturally:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: large-training
spec:
  podGroups:
    - name: workers
      minMember: 8
  schedulingPolicy:
    preemptionPolicy: DelayedPreemption
    delayedPreemptionTimeout: 120s    # Wait 2 min before preempting

Integration with Kueue

apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: training-workload
spec:
  podSets:
    - name: workers
      count: 4
      minCount: 4    # Gang: require all 4
      template:
        spec:
          containers:
            - name: trainer
              image: registry.example.com/training:v2.0
              resources:
                requests:
                  nvidia.com/gpu: 8
  queueName: gpu-queue
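
The Workload above points at a queue named gpu-queue, which has to exist before Kueue can admit anything. A minimal sketch of the supporting objects follows. The flavor name default-flavor, the ClusterQueue name gpu-cluster-queue, the ml-training namespace, and the 32-GPU quota (4 Pods × 8 GPUs) are assumptions for illustration.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor          # Hypothetical flavor name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue       # Hypothetical ClusterQueue name
spec:
  namespaceSelector: {}         # Accept workloads from any namespace
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 32  # Assumed quota: room for one 4x8-GPU gang
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-queue               # Matches queueName in the Workload
  namespace: ml-training        # Assumed namespace
spec:
  clusterQueue: gpu-cluster-queue
```

In practice Kueue usually creates the Workload object itself from a Job that carries the kueue.x-k8s.io/queue-name label, so hand-written Workload manifests are mostly useful for understanding what Kueue tracks.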

MPI Job with Gang Scheduling

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-benchmark
  labels:
    scheduling.k8s.io/pod-group: nccl-bench-group
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            scheduling.k8s.io/pod-group: nccl-bench-group
        spec:
          containers:
            - name: launcher
              image: registry.example.com/nccl-tests:v2.0
              command:
                # Each flag and each value must be its own list item (separate argv entry)
                - mpirun
                - --allow-run-as-root
                - -np
                - "32"
                - -x
                - NCCL_DEBUG=INFO
                - /opt/nccl-tests/build/all_reduce_perf
                - -b
                - 1G
                - -e
                - 8G
                - -f
                - "2"
    Worker:
      replicas: 4
      template:
        metadata:
          labels:
            scheduling.k8s.io/pod-group: nccl-bench-group
        spec:
          containers:
            - name: worker
              image: registry.example.com/nccl-tests:v2.0
              resources:
                limits:
                  nvidia.com/gpu: 8

Verify Gang Scheduling

# Check PodGroup status
kubectl get podgroup llm-training-group -o yaml
# Status shows: scheduled: true, members: 4/4

# Check scheduling events
kubectl get events --field-selector reason=GangScheduled
# Output: PodGroup llm-training-group successfully gang-scheduled (4/4 members)

# Check for deadlocks
kubectl get events --field-selector reason=GangSchedulingTimeout

Common Issues

PodGroup stuck in Pending

  • Cause: Not enough resources for all members simultaneously
  • Fix: Reduce minMember, add nodes, or configure preemption

Scheduling deadlock between groups

  • Cause: Multiple PodGroups competing for same resource pool
  • Fix: Use priority classes and Delayed Preemption to break ties
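
Priority classes are a stable Kubernetes API (scheduling.k8s.io/v1), so the tie-breaking fix can be sketched without relying on alpha fields. The class names and values below are hypothetical; only their relative order matters.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high           # Hypothetical name
value: 100000                   # Higher value wins when groups compete
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Distributed training gangs that may preempt lower-priority Pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low               # Hypothetical name
value: 1000
globalDefault: false
preemptionPolicy: Never         # Pods in this class never preempt others
description: "Best-effort batch work that yields to training jobs."
```

A Pod opts in by setting priorityClassName: training-high in its Pod spec, for example under spec.template.spec of a Job.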

Timeout before all members scheduled

  • Cause: scheduleTimeoutSeconds too short for cluster size
  • Fix: Increase timeout or reduce group size

Best Practices

  1. Set minMember carefully: allow elastic scaling when possible (e.g., 4 min, 8 max)
  2. Use priority classes: ensure training jobs can preempt lower-priority workloads
  3. Combine with Kueue: queue management prevents resource contention
  4. Set reasonable timeouts: 5-10 minutes for large GPU clusters
  5. Monitor scheduling latency: gang scheduling adds overhead compared with individual Pod scheduling
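
For the monitoring point, one possible starting place is an alert on the scheduler's pending-Pod gauge. The sketch below assumes the Prometheus Operator is installed (it provides the PrometheusRule resource) and that kube-scheduler metrics are being scraped. The rule name, namespace, threshold, and 10-minute window are illustrative, and the alert is not specific to gangs: it fires for any Pods stuck in the unschedulable queue.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gang-scheduling-alerts  # Hypothetical name
  namespace: monitoring         # Assumed namespace watched by Prometheus
spec:
  groups:
    - name: scheduler.gang
      rules:
        - alert: UnschedulableQueueNotEmpty
          # scheduler_pending_pods is a kube-scheduler gauge labelled by queue
          expr: sum(scheduler_pending_pods{queue="unschedulable"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "The scheduler's unschedulable queue has been non-empty for 10 minutes"
```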

Key Takeaways

  • Gang scheduling is alpha in Kubernetes 1.36 (v1alpha2 API), with the Workload API and Delayed Preemption
  • All Pods in a group schedule atomically, which prevents partial scheduling waste
  • Delayed Preemption avoids unnecessary evictions
  • Essential for distributed ML training (PyTorch DDP, Horovod, MPI)
  • Integrates with Kueue for enterprise batch scheduling
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
