πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai intermediate ⏱ 20 minutes K8s 1.28+

KAITO AI Model Inference Kubernetes

Deploy AI models with KAITO (Kubernetes AI Toolchain Operator) for automated GPU provisioning, model serving, and inference workload management.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Create a KAITO Workspace CR specifying the model name and GPU requirements. KAITO automatically provisions GPU nodes, pulls model weights, containerizes the model, and deploys an inference endpoint β€” reducing LLM deployment from days to minutes.

The Problem

Deploying AI models on Kubernetes requires multiple manual steps: provision GPU nodes, configure the GPU operator, pull model weights (often 50-200GB), build a serving container, create Deployments/Services, and configure autoscaling. KAITO automates the entire pipeline with a single custom resource.

The Solution

KAITO Architecture

KAITO operates as a Kubernetes operator with three main components:

  1. Workspace Controller β€” watches Workspace CRs, orchestrates deployment
  2. GPU Provisioner β€” auto-provisions GPU nodes via cloud APIs (Karpenter-based)
  3. Model Manager β€” handles model weight download and containerization

Install KAITO

# Install KAITO operator
helm repo add kaito https://azure.github.io/kaito/charts
helm repo update

helm install kaito-workspace kaito/kaito-workspace \
  --namespace kaito-workspace \
  --create-namespace \
  --set gpu-provisioner.enabled=true

Deploy a Preset Model

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: llama-3-70b
  namespace: inference
spec:
  resource:
    instanceType: "Standard_NC96ads_A100_v4"
    count: 2
    labelSelector:
      matchLabels:
        apps: llama-inference
  inference:
    preset:
      name: "llama-3-70b-instruct"
    adapters:
      - source:
          name: "fine-tuned-adapter"
          image: "registry.example.com/adapters/custom-lora:v1"

KAITO will:

  1. Provision 2Γ— A100 GPU nodes
  2. Download Llama 3 70B weights
  3. Deploy with tensor parallelism across 2 nodes
  4. Create a Service endpoint

Custom Model from Private Registry

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: custom-model
  namespace: inference
spec:
  resource:
    instanceType: "Standard_NC24ads_A100_v4"
    count: 1
  inference:
    template:
      containers:
        - name: inference
          image: registry.example.com/models/custom-bert:v2
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: MODEL_PATH
              value: /models/custom-bert

Query the Model

# Get the service endpoint
export SVC_IP=$(kubectl get svc llama-3-70b -n inference -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Inference request
curl -X POST http://${SVC_IP}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b-instruct",
    "prompt": "Explain Kubernetes pods in simple terms",
    "max_tokens": 256,
    "temperature": 0.7
  }'

KAITO vs Manual Deployment

StepManualKAITO
GPU node provisioning10-30 minAutomatic
GPU operator setup15 minPre-configured
Model download (70B)10-20 minAutomatic + cached
Container build30 minAutomatic
K8s manifests30 min1 CR
Tensor parallelismComplex configAutomatic
Total2-4 hours10-15 minutes
graph TD
    USER[User creates<br/>Workspace CR] --> CTRL[KAITO Controller<br/>Reads spec]
    CTRL --> PROV[GPU Provisioner<br/>Create nodes]
    CTRL --> PULL[Model Manager<br/>Pull weights]
    PROV --> NODE[GPU Node Ready<br/>A100 / H100]
    PULL --> CONTAINER[Model Containerized<br/>+ Inference server]
    NODE --> DEPLOY[Deploy Pod<br/>on GPU node]
    CONTAINER --> DEPLOY
    DEPLOY --> SVC[Service Endpoint<br/>/v1/completions]

Common Issues

Workspace stuck in Provisioning

GPU instance type unavailable in region. Check cloud provider quota: az vm list-usage --location eastus | grep NC. Request quota increase or try a different instance type.

Model download timeout

Large models (70B+) take 10-20 minutes to download. KAITO caches model weights on PVCs β€” subsequent deployments use the cache. Ensure PVC storage class supports ReadWriteMany for multi-node.

Best Practices

  • Use preset models for common LLMs β€” Llama, Mistral, Phi β€” one-line deployment
  • Private registry for custom models β€” use the template spec for full control
  • Cache model weights on PVC β€” avoid re-downloading on pod restart
  • Label GPU nodes for workload isolation β€” inference vs training vs notebooks
  • Monitor Workspace status β€” kubectl get workspace -w shows provisioning progress

Key Takeaways

  • KAITO automates the full LLM deployment pipeline: GPU provisioning β†’ model download β†’ containerization β†’ serving
  • Single Workspace CR replaces hours of manual configuration
  • Automatic GPU node provisioning via Karpenter-based provisioner
  • Supports preset models (Llama, Mistral, Phi) and custom models from private registries
  • LoRA adapter support for fine-tuned model variants without re-downloading base weights
#kaito #inference #gpu-provisioning #model-serving #azure
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens