ai advanced ⏱ 20 minutes K8s 1.28+

Multi-Cloud AI Workloads on Kubernetes

Run AI workloads across multiple cloud providers with Kubernetes. Covers GPU instance availability, spot pricing arbitrage, and model portability.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Use Kubernetes as the abstraction layer for multi-cloud AI: define GPU workloads as standard K8s manifests, deploy via ArgoCD ApplicationSets across EKS/GKE/AKS, and use spot/preemptible instances with checkpointing for 60-80% cost savings. Store models on cloud-agnostic S3-compatible storage.

The Problem

GPU availability varies by cloud and region: H100s might be available on GCP but not on AWS this week. Pricing can differ by 2-3x between providers, especially once spot and preemptible discounts are factored in. Locking into one cloud means missing available capacity and overpaying. Kubernetes provides the portability layer to run AI workloads wherever GPUs are available.

The Solution

GPU Instance Comparison

GPU               AWS                     GCP                       Azure
A100 80GB         p4d.24xlarge ($32/hr)   a2-ultragpu-8g ($29/hr)   ND96amsr ($27/hr)
H100 80GB         p5.48xlarge ($98/hr)    a3-highgpu-8g ($85/hr)    ND96isr ($88/hr)
L4 24GB           g6.xlarge ($0.80/hr)    g2-standard-4 ($0.70/hr)  n/a
Spot/Preemptible  60-70% off              60-91% off                60-80% off
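The hourly figures above are per instance, so comparing them fairly means normalizing to a per-GPU price first. The sketch below does that with the table's numbers. The GPU counts per instance (eight for the A100 and H100 shapes, one for the L4 shapes) are an assumption added here, and the function names are illustrative only.

```python
# Approximate on-demand prices copied from the table above (USD per instance-hour).
# The GPU count per instance is an assumption added for normalization.
OFFERS = {
    "A100 80GB": {
        "aws": ("p4d.24xlarge", 32.0, 8),
        "gcp": ("a2-ultragpu-8g", 29.0, 8),
        "azure": ("ND96amsr", 27.0, 8),
    },
    "H100 80GB": {
        "aws": ("p5.48xlarge", 98.0, 8),
        "gcp": ("a3-highgpu-8g", 85.0, 8),
        "azure": ("ND96isr", 88.0, 8),
    },
    "L4 24GB": {
        "aws": ("g6.xlarge", 0.80, 1),
        "gcp": ("g2-standard-4", 0.70, 1),
    },
}

# Spot/preemptible discount ranges from the last table row (min, max).
SPOT_DISCOUNT = {"aws": (0.60, 0.70), "gcp": (0.60, 0.91), "azure": (0.60, 0.80)}


def per_gpu_hour(gpu: str, cloud: str) -> float:
    """On-demand price of one GPU for one hour."""
    _instance, price, gpu_count = OFFERS[gpu][cloud]
    return price / gpu_count


def spot_range(gpu: str, cloud: str) -> tuple:
    """(best case, worst case) spot price per GPU-hour."""
    min_discount, max_discount = SPOT_DISCOUNT[cloud]
    base = per_gpu_hour(gpu, cloud)
    return base * (1 - max_discount), base * (1 - min_discount)


def cheapest_on_demand(gpu: str) -> str:
    """Cloud with the lowest on-demand per-GPU price for this GPU type."""
    return min(OFFERS[gpu], key=lambda cloud: per_gpu_hour(gpu, cloud))
```

With the table's figures, `cheapest_on_demand("H100 80GB")` selects GCP. The wide spot ranges show why the ranking can change once discounts apply.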

Multi-Cloud Deployment with ArgoCD

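apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: inference-multicloud
  namespace: argocd
spec:
  generators:
    # Cluster generator: one Application per registered cluster whose
    # ArgoCD cluster Secret carries the label gpu-type=h100
    - clusters:
        selector:
          matchLabels:
            gpu-type: h100
  template:
    metadata:
      name: 'inference-{{name}}'
    spec:
      project: default  # required field; the "default" project is assumed here
      source:
        repoURL: https://git.example.com/ml/inference.git
        targetRevision: HEAD
        # Each cluster Secret also needs a cloud-provider label matching an overlay directory
        path: 'overlays/{{metadata.labels.cloud-provider}}'
      destination:
        server: '{{server}}'
        namespace: inference
      syncPolicy:
        automated:
          selfHeal: true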
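To make the templating concrete, the following sketch mimics how the cluster generator's parameters fill in the template fields. It is a simplified stand-in, not ArgoCD's implementation, and the cluster names, server URLs, and label values are hypothetical.

```python
# Hypothetical registered clusters; in ArgoCD these would be cluster Secrets.
CLUSTERS = [
    {"name": "eks-prod", "server": "https://eks.example.com",
     "labels": {"gpu-type": "h100", "cloud-provider": "aws"}},
    {"name": "gke-prod", "server": "https://gke.example.com",
     "labels": {"gpu-type": "h100", "cloud-provider": "gcp"}},
    {"name": "aks-dev", "server": "https://aks.example.com",
     "labels": {"gpu-type": "a100", "cloud-provider": "azure"}},
]


def render(template: str, params: dict) -> str:
    """Replace every {{key}} placeholder with its value."""
    out = template
    for key, value in params.items():
        out = out.replace("{{" + key + "}}", value)
    return out


def expand(clusters: list, match_labels: dict) -> list:
    """Return one rendered application per cluster matching all labels."""
    apps = []
    for cluster in clusters:
        labels = cluster["labels"]
        if not all(labels.get(k) == v for k, v in match_labels.items()):
            continue  # cluster does not match the selector
        params = {"name": cluster["name"], "server": cluster["server"]}
        for key, value in labels.items():
            params["metadata.labels." + key] = value
        apps.append({
            "name": render("inference-{{name}}", params),
            "path": render("overlays/{{metadata.labels.cloud-provider}}", params),
            "server": render("{{server}}", params),
        })
    return apps
```

Calling `expand(CLUSTERS, {"gpu-type": "h100"})` yields two applications, one per H100 cluster, each pointing at its own provider overlay. The A100 cluster is skipped.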

Spot Instance Training with Checkpointing

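apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resilient-training
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 4
      restartPolicy: OnFailure  # restart workers after a spot preemption
      template:
        spec:
          tolerations:
            # Taint used for GKE Spot node pools
            - key: cloud.google.com/gke-spot
              operator: Equal
              value: "true"
              effect: NoSchedule
            # Taint that AKS applies to spot node pools
            - key: kubernetes.azure.com/scalesetpriority
              operator: Equal
              value: spot
              effect: NoSchedule
          containers:
            - name: pytorch
              image: registry.example.com/ml/train:latest  # placeholder image, replace with your own
              command:
                - torchrun
                - --rdzv_backend=c10d
                - train.py
                - --checkpoint-dir=/checkpoints
                - --checkpoint-interval=500
                - --resume-from-checkpoint=latest
              resources:
                limits:
                  nvidia.com/gpu: 1  # assumed: one GPU per worker
              volumeMounts:
                - name: checkpoints
                  mountPath: /checkpoints
          volumes:
            - name: checkpoints
              persistentVolumeClaim:
                # Must support ReadWriteMany if workers land on different nodes
                claimName: training-checkpoints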
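The manifest passes checkpoint flags to `train.py` but does not show what the script does with them. Below is a minimal sketch of the save-and-resume pattern those flags imply. It uses JSON files and a toy update in place of real PyTorch state, and every function name here is hypothetical rather than taken from the original training script.

```python
import json
import os
from pathlib import Path


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a preemption never leaves a partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final = ckpt_dir / f"step-{step:08d}.json"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps({"step": step, "state": state}))
    os.replace(tmp, final)  # atomic rename on the same filesystem
    return final


def load_latest(ckpt_dir: Path) -> tuple:
    """Return (step, state) from the newest checkpoint, or a fresh start."""
    files = sorted(ckpt_dir.glob("step-*.json")) if ckpt_dir.is_dir() else []
    if not files:
        return 0, {"loss_sum": 0.0}
    payload = json.loads(files[-1].read_text())
    return payload["step"], payload["state"]


def train(ckpt_dir: Path, total_steps: int, interval: int, preempt_at=None) -> dict:
    """Resume from the latest checkpoint and save every `interval` steps."""
    step, state = load_latest(ckpt_dir)
    while step < total_steps:
        step += 1
        state["loss_sum"] += 1.0 / step  # stand-in for one real training step
        if preempt_at is not None and step == preempt_at:
            raise RuntimeError("simulated spot preemption")
        if step % interval == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return state
```

In a real job the JSON payload would be replaced by model and optimizer state, and typically only one rank would write the checkpoint. The zero-padded step number keeps a lexical sort of filenames in step order, which is what makes "latest" easy to find.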

Cloud-Agnostic Model Storage

# MinIO or any S3-compatible storage
apiVersion: v1
kind: Secret
metadata:
  name: model-storage
data:
  AWS_ACCESS_KEY_ID: <base64>
  AWS_SECRET_ACCESS_KEY: <base64>
  AWS_ENDPOINT_URL: <base64>  # MinIO or cloud S3 endpoint

Models stored on S3-compatible storage work across all clouds, with no vendor lock-in.
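The `<base64>` placeholders in the Secret stand for base64-encoded values, which Kubernetes requires under `data:`. This sketch shows one way to produce them. The endpoint and credentials are made-up examples, and `secret_manifest` is an illustrative helper, not part of any Kubernetes client.

```python
import base64


def encode_secret_data(values: dict) -> dict:
    """Base64-encode each value, as required for a Secret's data field."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


def secret_manifest(name: str, values: dict) -> dict:
    """Build a Secret manifest as a plain dict, ready to serialize."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": encode_secret_data(values),
    }


# Hypothetical values for illustration only.
manifest = secret_manifest("model-storage", {
    "AWS_ACCESS_KEY_ID": "example-access-key",
    "AWS_SECRET_ACCESS_KEY": "example-secret-key",
    "AWS_ENDPOINT_URL": "https://minio.example.com:9000",
})
```

Base64 is an encoding, not encryption, so the manifest should still be kept out of Git or sealed with a secrets tool.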

graph TD
    GIT[Git Repository<br/>ML Manifests] --> ARGOCD[ArgoCD<br/>Multi-cluster sync]
    ARGOCD --> EKS[AWS EKS<br/>p5 H100 spot]
    ARGOCD --> GKE[GCP GKE<br/>a3 H100 preemptible]
    ARGOCD --> AKS[Azure AKS<br/>ND96 H100 spot]
    
    EKS --> S3[S3-Compatible<br/>Model Storage]
    GKE --> S3
    AKS --> S3
    
    SCHEDULER[GPU Availability<br/>Scheduler] -->|Route to cheapest<br/>available cloud| ARGOCD
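The diagram's GPU Availability Scheduler is not specified further, so the following is only one possible reading of "route to cheapest available cloud": filter out clusters without enough free GPUs, then take the lowest price. The `Offer` type, the cluster names, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Offer:
    cluster: str               # target cluster name
    price_per_gpu_hour: float  # current spot or on-demand price in USD
    free_gpus: int             # GPUs that can be scheduled right now


def pick_cluster(offers: List[Offer], gpus_needed: int) -> Optional[Offer]:
    """Cheapest cluster that can fit the request, or None if none can."""
    candidates = [o for o in offers if o.free_gpus >= gpus_needed]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.price_per_gpu_hour)


# Hypothetical snapshot: the cheapest cluster has no free capacity.
offers = [
    Offer("eks-h100", 4.00, 0),
    Offer("gke-h100", 4.50, 16),
    Offer("aks-h100", 5.00, 32),
]
choice = pick_cluster(offers, gpus_needed=8)
```

The chosen cluster could then be fed back to ArgoCD, for example by adjusting the labels that the ApplicationSet selector matches on, though that wiring is an assumption and not described in this recipe.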

Common Issues

Model not loading on different cloud

Storage paths differ between providers. Use S3-compatible storage with a consistent endpoint URL. Never hardcode cloud-specific paths.
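One way to avoid hardcoded paths is to read every storage setting from the environment variables defined in the Secret. The sketch below assumes those variables are injected into the container. The helper names and the bucket and key in the usage note are hypothetical.

```python
import os

REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_ENDPOINT_URL")


def storage_config(env=None) -> dict:
    """Collect S3 client settings from the environment, failing fast if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing storage settings: " + ", ".join(missing))
    return {
        "endpoint_url": env["AWS_ENDPOINT_URL"],
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }


def model_uri(bucket: str, key: str) -> str:
    """Provider-neutral model location; only the endpoint differs per cloud."""
    return "s3://" + bucket + "/" + key.lstrip("/")
```

The returned keys match the keyword arguments that `boto3.client("s3", ...)` accepts, so the dict can be unpacked into it if boto3 is the client in use. A call such as `model_uri("models", "llm/v1/model.safetensors")` then yields the same URI on every cloud.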

Spot instance terminated mid-training

Checkpointing is mandatory on spot/preemptible instances. Set --checkpoint-interval=500 (steps) and resume from the latest checkpoint on restart.
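How much work a preemption can cost follows directly from the interval: at most one interval's worth of steps is lost. The sketch below expresses that arithmetic. The step time is an assumption that has to be measured for each model, and the function names are illustrative.

```python
def max_lost_seconds(interval_steps: int, seconds_per_step: float) -> float:
    """Upper bound on recomputed work after a preemption."""
    return interval_steps * seconds_per_step


def interval_for_budget(budget_seconds: float, seconds_per_step: float) -> int:
    """Largest checkpoint interval that keeps lost work within the budget."""
    return max(1, int(budget_seconds // seconds_per_step))
```

For instance, at an assumed 0.25 s per step, a 500-step interval bounds lost work at 125 seconds, and a 2-minute budget allows an interval of 480 steps. Slower steps call for a shorter interval, balanced against the cost of writing each checkpoint.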

Best Practices

  • S3-compatible storage for models: works across all clouds
  • Checkpoint every 500 steps on spot instances: about 2 minutes of lost work at most, depending on step time
  • ArgoCD ApplicationSets for multi-cloud deployment: same manifest, different overlays
  • Monitor spot pricing: shift workloads to the cheapest available cloud
  • Keep inference on on-demand instances: spot termination causes user-facing errors

Key Takeaways

  • Kubernetes provides the abstraction layer for multi-cloud AI workloads
  • GPU availability varies by cloud and region, and pricing can differ by 2-3x between providers
  • Spot/preemptible instances save 60-80% but require checkpointing
  • S3-compatible storage enables model portability across clouds
  • ArgoCD ApplicationSets deploy the same workload across multiple clouds
#multi-cloud #gpu-availability #spot-instances #cloud-agnostic
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
