AI · Advanced · ⏱ 35 minutes · K8s 1.28+

GPU Sharing and Bin Packing with KAI Scheduler

Maximize GPU utilization using KAI Scheduler's GPU sharing, fractional GPUs, and intelligent bin packing strategies

By Luca Berton

The Problem

GPUs are expensive resources that often sit underutilized. Small inference workloads and interactive notebooks rarely need a full GPU, yet standard Kubernetes scheduling allocates GPUs only in whole units, leaving much of the capacity you pay for idle.

The Solution

Use KAI Scheduler's GPU sharing capabilities with MIG (Multi-Instance GPU), time-slicing, or fractional GPU allocation. Combined with bin packing, these techniques can dramatically improve GPU utilization.

GPU Sharing Architecture

flowchart TB
    subgraph cluster["☸️ KUBERNETES CLUSTER"]
        subgraph kai["🎯 KAI SCHEDULER"]
            BP["Bin Packing<br/>Algorithm"]
            GS["GPU Sharing<br/>Controller"]
        end
        
        subgraph gpu["🎮 NVIDIA A100 80GB"]
            direction TB
            subgraph mig["MIG PARTITIONS"]
                M1["1g.10gb"]
                M2["1g.10gb"]
                M3["2g.20gb"]
                M4["3g.40gb"]
            end
            subgraph ts["TIME-SLICING"]
                TS1["Slot 1"]
                TS2["Slot 2"]
                TS3["Slot 3"]
                TS4["Slot 4"]
            end
        end
        
        subgraph workloads["🚀 WORKLOADS"]
            W1["Notebook<br/>0.25 GPU"]
            W2["Small Inference<br/>0.5 GPU"]
            W3["Medium Training<br/>2g MIG"]
            W4["Large Training<br/>Full GPU"]
        end
    end
    
    workloads --> kai
    kai --> gpu
    W1 -.-> TS1
    W2 -.-> M1
    W3 -.-> M3

Step 1: Enable GPU Sharing in KAI Scheduler

# Install KAI with GPU sharing enabled
helm upgrade -i kai-scheduler \
  oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler \
  -n kai-scheduler --create-namespace \
  --version v0.12.10 \
  --set gpuSharing.enabled=true \
  --set gpuSharing.strategy=binpack  # or 'spread'
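
The workload examples below reference scheduling queues through the runai/queue label. If your cluster has no queues yet, create a parent queue plus the leaf queues used in this recipe. The sketch below follows the KAI Scheduler quickstart; the scheduling.run.ai/v2 API group and field names may differ between versions, and the -1 quotas simply mean unlimited:

# queues.yaml (illustrative): one parent queue plus the leaf queues
# referenced by the runai/queue labels in later steps
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default
spec:
  resources:
    cpu: {quota: -1, limit: -1, overQuotaWeight: 1}
    gpu: {quota: -1, limit: -1, overQuotaWeight: 1}
    memory: {quota: -1, limit: -1, overQuotaWeight: 1}
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: development
spec:
  parentQueue: default
  resources:
    cpu: {quota: -1, limit: -1, overQuotaWeight: 1}
    gpu: {quota: -1, limit: -1, overQuotaWeight: 1}
    memory: {quota: -1, limit: -1, overQuotaWeight: 1}
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: inference
spec:
  parentQueue: default
  resources:
    cpu: {quota: -1, limit: -1, overQuotaWeight: 1}
    gpu: {quota: -1, limit: -1, overQuotaWeight: 1}
    memory: {quota: -1, limit: -1, overQuotaWeight: 1}

kubectl apply -f queues.yaml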

Step 2: Configure GPU Operator for Time-Slicing

# gpu-time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false  # keep advertising the shared resource as nvidia.com/gpu
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # 4 time-slices per GPU
kubectl apply -f gpu-time-slicing-config.yaml

# Point the GPU Operator's device plugin at the new config. The ClusterPolicy
# already exists (it is managed by the operator), so patch it rather than
# applying a second, partial manifest over it:
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

# The devicePlugin section of the ClusterPolicy should now read:
#   devicePlugin:
#     config:
#       name: time-slicing-config
#       default: any

# Restart the device plugin so it re-reads the sharing config
kubectl rollout restart daemonset nvidia-device-plugin-daemonset -n gpu-operator
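
Once the device plugin comes back up with the new config, each physical GPU on affected nodes is advertised as 4 allocatable nvidia.com/gpu resources. A quick check (substitute your node name):

# Each physical GPU should now show up as 4 allocatable nvidia.com/gpu units
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'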

Step 3: Request Fractional GPUs

# fractional-gpu-workloads.yaml
# Small workload requesting 1/4 GPU (time-sliced)
apiVersion: v1
kind: Pod
metadata:
  name: notebook-small
  namespace: ml-training
  labels:
    runai/queue: development
spec:
  schedulerName: kai-scheduler
  containers:
  - name: jupyter
    image: jupyter/pytorch-notebook:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # one time-slice replica (roughly 1/4 of the GPU's time; memory is shared)
    ports:
    - containerPort: 8888
---
# Medium workload requesting 2 time-slice replicas
apiVersion: v1
kind: Pod
metadata:
  name: inference-medium
  namespace: ml-training
  labels:
    runai/queue: inference
spec:
  schedulerName: kai-scheduler
  containers:
  - name: model-server
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 2  # 2 time-slice replicas; this does not guarantee proportionally more GPU time
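
The time-slice replicas above are requested in whole units of nvidia.com/gpu. KAI Scheduler also has its own fractional-GPU request path based on pod annotations; the sketch below assumes the gpu-fraction annotation from the KAI GPU-sharing examples and the GPU-sharing flag from Step 1, so verify the annotation name against the docs for your installed version:

# Alternative (illustrative): ask KAI directly for half a GPU
apiVersion: v1
kind: Pod
metadata:
  name: inference-half-gpu
  namespace: ml-training
  labels:
    runai/queue: inference
  annotations:
    gpu-fraction: "0.5"  # KAI-specific annotation; not a standard Kubernetes field
spec:
  schedulerName: kai-scheduler
  containers:
  - name: model-server
    image: nvcr.io/nvidia/tritonserver:24.01-py3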

Step 4: MIG-Based GPU Sharing

# mig-sharing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      
      mixed-workload:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 2
            "2g.20gb": 1
            "3g.40gb": 1
      
      inference-optimized:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 4
            "2g.20gb": 1
kubectl apply -f mig-sharing-config.yaml

# Label nodes with desired MIG config
kubectl label node gpu-node-1 nvidia.com/mig.config=mixed-workload
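
If you create a custom MIG ConfigMap like the one above, the GPU Operator's MIG manager also needs to be told to use it (by default it reads default-mig-parted-config). A hedged example, assuming the stock ClusterPolicy named cluster-policy:

# Point the MIG manager at the custom config, then (re)label nodes
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
  -p '{"spec": {"migManager": {"config": {"name": "mig-config"}}}}'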

Step 5: Request Specific MIG Profiles

# mig-workloads.yaml
# Small inference using 1g.10gb MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
  namespace: ml-training
spec:
  schedulerName: kai-scheduler
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
---
# Medium training using 2g.20gb MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: medium-training
  namespace: ml-training
spec:
  schedulerName: kai-scheduler
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/mig-2g.20gb: 1
---
# Large training using 3g.40gb MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: large-training
  namespace: ml-training
spec:
  schedulerName: kai-scheduler
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1
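
To confirm that a pod really received the MIG slice it asked for, list the devices visible inside the container; nvidia-smi shows only what was granted to that pod (the NGC Triton and PyTorch images ship nvidia-smi):

# Should print a single MIG device such as "MIG 2g.20gb"
kubectl exec -n ml-training medium-training -- nvidia-smi -L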

Step 6: Configure Bin Packing Strategy

# bin-packing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kai-scheduler-config
  namespace: kai-scheduler
data:
  config.yaml: |
    scheduling:
      strategy: binpack  # Pack pods onto fewest nodes
      binpacking:
        weights:
          gpu: 10        # Highest priority to pack GPUs
          cpu: 2
          memory: 1
        thresholds:
          gpu: 0.8       # Consider node "full" at 80% GPU
          cpu: 0.9
          memory: 0.9
kubectl apply -f bin-packing-config.yaml
kubectl rollout restart statefulset kai-scheduler -n kai-scheduler
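
With bin packing active, pods should concentrate onto as few GPU nodes as possible. A quick way to eyeball how full each GPU node is (substitute your node name):

# Compare requested vs. allocatable resources, including nvidia.com/gpu
kubectl describe node <gpu-node> | grep -A12 "Allocated resources"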

Step 7: Spread Scheduling for Resilience

# spread-scheduling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-spread
  namespace: ml-inference
  annotations:
    # Request spread scheduling for this deployment
    scheduling.run.ai/scheduling-strategy: spread
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
        runai/queue: inference
    spec:
      schedulerName: kai-scheduler
      # Pod anti-affinity for additional spreading
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: inference
              topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 1
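
Spreading helps at scheduling time; a PodDisruptionBudget keeps the spread replicas from being drained away all at once during node maintenance. A minimal sketch for the deployment above:

# Keep at least 3 of the 4 inference replicas up during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-spread-pdb
  namespace: ml-inference
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: inference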

Step 8: Monitor GPU Sharing Utilization

# View GPU utilization per node
kubectl exec -it -n gpu-operator \
  $(kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter -o name | head -1) \
  -- dcgmi dmon -e 155,156,203

# Check MIG devices
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,MIG-1G:.status.allocatable.nvidia\.com/mig-1g\.10gb,MIG-2G:.status.allocatable.nvidia\.com/mig-2g\.20gb'

# View time-sliced GPUs
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: {.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

# Check bin packing efficiency
kubectl get pods -A -o wide --field-selector spec.schedulerName=kai-scheduler | \
  awk '{print $8}' | sort | uniq -c | sort -rn
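
If you scrape DCGM with Prometheus, these figures are already available as metrics; without Prometheus you can read them straight from the exporter. This assumes the GPU Operator's default nvidia-dcgm-exporter service listening on port 9400:

# Read raw per-GPU utilization from dcgm-exporter
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL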

GPU Sharing Strategies Comparison

| Strategy | Use Case | Pros | Cons |
|---|---|---|---|
| MIG | Production inference | Hardware isolation, guaranteed memory | A100/A30/H100-class GPUs only, fixed profiles |
| Time-Slicing | Development, notebooks | Works on any GPU, flexible | No memory isolation, context-switching overhead |
| Fractional GPU | Small workloads | Fine-grained allocation | Requires driver support |
| Bin Packing | Cost optimization | Maximum utilization | Less resilience |
| Spread | Production HA | Fault tolerance | Lower utilization |

Step 9: DRA Integration for GPU Sharing

# dra-gpu-sharing.yaml
# Note: resource.k8s.io/v1 requires Kubernetes 1.34+ (DRA GA). The opaque
# parameters are interpreted by the NVIDIA DRA driver and their exact schema
# is version-specific, so treat this as an illustrative sketch.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  # DeviceClass names must be DNS subdomains (no "/")
  name: shared-gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: 'device.driver == "gpu.nvidia.com"'
  config:
  - opaque:
      driver: gpu.nvidia.com
      parameters:            # driver-specific sharing config; check your driver's docs
        sharing:
          strategy: TimeSlicing
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: shared-gpu-claim
  namespace: ml-training
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: shared-gpu.nvidia.com
          count: 1
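
A pod then consumes the shared GPU by referencing the claim template; the pod and claim names here are illustrative, and the queue label and scheduler follow the earlier examples:

# dra-shared-pod.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: dra-shared-inference
  namespace: ml-training
  labels:
    runai/queue: inference
spec:
  schedulerName: kai-scheduler
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: shared-gpu-claim
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    resources:
      claims:
      - name: gpu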

Troubleshooting

Time-sliced GPUs not appearing

# Check GPU Operator config
kubectl get clusterpolicy cluster-policy -o yaml | grep -A20 devicePlugin

# Verify time-slicing config
kubectl get configmap time-slicing-config -n gpu-operator -o yaml

# Check device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

MIG profiles not available

# Check MIG mode on node
nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv

# List available MIG profiles
nvidia-smi mig -lgip

# Check node labels
kubectl get node <node> -o yaml | grep nvidia.com/mig
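
Pods stuck in Pending

If pods submitted to kai-scheduler never get placed, the most common cause is a queue that does not exist or has exhausted its quota. A few checks (exact resource and pod names vary by KAI version):

# Scheduler events explain why the pod is not placed
kubectl describe pod <pod> -n ml-training

# Confirm the queue referenced by the runai/queue label exists
# (Queue is a CRD in the scheduling.run.ai API group)
kubectl get queues.scheduling.run.ai

# Inspect the scheduler's own logs for placement errors
kubectl get pods -n kai-scheduler
kubectl logs -n kai-scheduler <scheduler-pod-name>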

Best Practices

| Practice | Description |
|---|---|
| Match sharing to workload | Use MIG for production, time-slicing for dev |
| Set appropriate quotas | Account for shared GPU replicas in queue quotas |
| Monitor memory usage | Time-slicing shares memory, so watch for OOM |
| Use bin packing for cost | Maximize utilization in cloud environments |
| Use spread for production | Ensure availability for critical inference |

Summary

GPU sharing with KAI Scheduler maximizes the utilization of expensive GPUs. By combining MIG partitioning, time-slicing, and intelligent bin packing, you can run multiple workloads on a single GPU while maintaining appropriate isolation and performance guarantees.


📘 Go Further with Kubernetes Recipes

Love this recipe? There’s so much more! This is just one of 100+ hands-on recipes in our comprehensive Kubernetes Recipes book.

Inside the book, you’ll master:

  • ✅ Production-ready deployment strategies
  • ✅ Advanced networking and security patterns
  • ✅ Observability, monitoring, and troubleshooting
  • ✅ Real-world best practices from industry experts

“The practical, recipe-based approach made complex Kubernetes concepts finally click for me.”

👉 Get Your Copy Now — Start building production-grade Kubernetes skills today!

#kai-scheduler #nvidia #gpu #gpu-sharing #bin-packing #mig #time-slicing
