ai intermediate ⏱ 20 minutes K8s 1.28+

AI Resource Allocation Optimization

Optimize GPU and memory allocation for AI workloads on Kubernetes. Right-size GPU requests, bin-packing strategies, gang scheduling.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Use gang scheduling (Volcano/Coscheduling) for distributed training, so that all workers start together or none do. Enable topology-aware scheduling to co-locate GPU pods on the same switch for NCCL performance. Implement priority-based preemption: inference > training > notebooks.

The Problem

AI/ML workloads have unique scheduling requirements: distributed training needs all workers to start simultaneously (gang scheduling), GPU communication requires network proximity (topology awareness), and mixed workloads (training + inference + notebooks) compete for limited GPU resources.

The Solution

Gang Scheduling with Volcano

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job
spec:
  minMember: 4
  queue: gpu-queue
  priorityClassName: training-standard  # must name an existing PriorityClass
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4
  queue: gpu-queue  # same queue as the PodGroup above; if omitted, the Job uses the default queue
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.example.com/training:1.0
              resources:
                limits:
                  nvidia.com/gpu: 8

All 4 workers (32 GPUs) must be schedulable simultaneously. If only 24 GPUs are available, the job waits rather than partially starting.
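
The manifests above reference a queue named gpu-queue, which the recipe does not define. A minimal sketch of such a queue follows, assuming the Volcano CRDs are installed. The weight and the 32-GPU cap are illustrative values chosen to match 4 workers × 8 GPUs, not settings taken from the recipe.

```yaml
# Hypothetical Volcano queue backing the "gpu-queue" name used above.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: gpu-queue
spec:
  # Relative share of cluster resources when several queues compete (assumed value).
  weight: 1
  # Do not let other queues reclaim resources already held by this queue (assumed policy).
  reclaimable: false
  # Upper bound on what jobs in this queue may consume in total.
  # 4 workers x 8 GPUs = 32 GPUs, enough for exactly one training gang.
  capability:
    nvidia.com/gpu: 32
```

With a cap of 32 GPUs, a second 4-worker job submitted to the same queue would wait until the first one finishes, which may or may not be the behavior you want.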

GPU Bin-Packing

# Scheduler configuration for GPU bin-packing
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: nvidia.com/gpu
                weight: 10
              - name: cpu
                weight: 1
              - name: memory
                weight: 1

MostAllocated packs GPU workloads onto fewer nodes, which frees up entire nodes for large multi-GPU jobs.
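
Changing the default profile affects every pod in the cluster. If that is too broad, one possible variation is to keep the default profile untouched and add a second profile that only GPU workloads opt into. The sketch below assumes you control the kube-scheduler configuration file (passed with --config), which is often not the case on managed control planes. The profile name gpu-binpack and the pod name are hypothetical.

```yaml
# Sketch: a dedicated bin-packing profile next to the unmodified default one.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler   # default scoring, unchanged
  - schedulerName: gpu-binpack         # hypothetical profile name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: nvidia.com/gpu
                weight: 10
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
---
# A pod opts into the profile through spec.schedulerName.
apiVersion: v1
kind: Pod
metadata:
  name: inference-binpacked            # hypothetical name
spec:
  schedulerName: gpu-binpack
  containers:
    - name: server
      image: registry.example.com/inference:1.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```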

Priority Hierarchy for AI Workloads

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical
value: 100000
description: "Production inference: preempts everything"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-standard
value: 10000
preemptionPolicy: Never
description: "Training: queues without preempting"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: notebook-dev
value: 1000
preemptionPolicy: Never
description: "Interactive notebooks: lowest priority"

graph TD
    subgraph GPU Cluster - 4 Nodes × 8 GPUs
        N1[Node 1<br/>8/8 GPU used<br/>Training Job A]
        N2[Node 2<br/>8/8 GPU used<br/>Training Job A]
        N3[Node 3<br/>4/8 GPU used<br/>Inference pods]
        N4[Node 4<br/>2/8 GPU used<br/>Notebooks]
    end
    
    GANG[Gang Scheduler<br/>All-or-nothing] -->|16 GPUs| N1 & N2
    BINPACK[Bin-Packing<br/>MostAllocated] -->|Pack inference| N3
    PREEMPT[Priority<br/>Inference > Training] -->|Can preempt| N4
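
A PriorityClass has no effect until a pod references it through spec.priorityClassName. As a minimal sketch, an inference Deployment could opt into the highest tier as shown here. The Deployment name, labels, replica count, and image are hypothetical placeholders.

```yaml
# Sketch: inference pods that may preempt lower-priority GPU workloads.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server               # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      # Refers to the inference-critical PriorityClass (value 100000) defined above.
      priorityClassName: inference-critical
      containers:
        - name: server
          image: registry.example.com/inference:1.0   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```

Because training-standard and notebook-dev set preemptionPolicy: Never, pods in those classes wait in the queue, while these inference pods can evict them when no GPU is free.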

Common Issues

Gang-scheduled job stuck in Pending

Not enough GPUs are free at the same time to place every member of the gang. Check the PodGroup status and events with kubectl describe podgroup training-job. Consider preempting lower-priority workloads or adding nodes.

Training pods scattered across racks: slow NCCL

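Enable topology-aware scheduling. Label nodes with a rack key such as topology.kubernetes.io/rack (Kubernetes does not set this label for you; only zone and region are well-known topology labels) and use pod affinity with that key as the topologyKey to co-locate training pods. Topology spread constraints do the opposite: they distribute pods evenly across domains.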
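
One way to express co-location is pod affinity keyed on the rack label. The sketch below assumes nodes have been labeled by hand with topology.kubernetes.io/rack and that the scheduler in use honors pod affinity. The pod name and the app: distributed-training label are hypothetical.

```yaml
# Sketch: prefer nodes in a rack that already runs pods of the same training job.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0              # hypothetical name
  labels:
    app: distributed-training          # hypothetical label shared by all workers
spec:
  affinity:
    podAffinity:
      # "preferred" keeps the pod schedulable when no single rack has room;
      # switch to requiredDuringSchedulingIgnoredDuringExecution for a hard rule.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: distributed-training
            topologyKey: topology.kubernetes.io/rack
  containers:
    - name: pytorch
      image: registry.example.com/training:1.0
      resources:
        limits:
          nvidia.com/gpu: 8
```

In a Volcano Job, the same affinity stanza would go under the task's template.spec.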

Best Practices

  • Gang scheduling for distributed training: partial starts waste GPU time
  • Bin-pack GPUs with MostAllocated scoring: this frees entire nodes for large jobs
  • Priority: inference > training > notebooks, so the production SLA always wins
  • Topology-aware placement: co-locate training pods on the same switch for NCCL performance
  • Set preemptionPolicy: Never for training, so jobs queue instead of disrupting other workloads

Key Takeaways

  • Gang scheduling ensures all workers start together, which prevents deadlocks and wasted GPUs
  • GPU bin-packing consolidates workloads onto fewer nodes
  • Priority-based preemption: inference always gets GPUs, training queues
  • Topology-aware scheduling can reduce NCCL communication latency by 2-5x
  • Combine gang scheduling + topology awareness + priority for optimal GPU cluster utilization
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
