KAITO AI Model Inference Kubernetes
Deploy AI models with KAITO (Kubernetes AI Toolchain Operator) for automated GPU provisioning, model serving, and inference workload management.
π‘ Quick Answer: Create a KAITO
WorkspaceCR specifying the model name and GPU requirements. KAITO automatically provisions GPU nodes, pulls model weights, containerizes the model, and deploys an inference endpoint β reducing LLM deployment from days to minutes.
The Problem
Deploying AI models on Kubernetes requires multiple manual steps: provision GPU nodes, configure the GPU operator, pull model weights (often 50-200GB), build a serving container, create Deployments/Services, and configure autoscaling. KAITO automates the entire pipeline with a single custom resource.
The Solution
KAITO Architecture
KAITO operates as a Kubernetes operator with three main components:
- Workspace Controller β watches Workspace CRs, orchestrates deployment
- GPU Provisioner β auto-provisions GPU nodes via cloud APIs (Karpenter-based)
- Model Manager β handles model weight download and containerization
Install KAITO
# Install KAITO operator
helm repo add kaito https://azure.github.io/kaito/charts
helm repo update
helm install kaito-workspace kaito/kaito-workspace \
--namespace kaito-workspace \
--create-namespace \
--set gpu-provisioner.enabled=trueDeploy a Preset Model
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
name: llama-3-70b
namespace: inference
spec:
resource:
instanceType: "Standard_NC96ads_A100_v4"
count: 2
labelSelector:
matchLabels:
apps: llama-inference
inference:
preset:
name: "llama-3-70b-instruct"
adapters:
- source:
name: "fine-tuned-adapter"
image: "registry.example.com/adapters/custom-lora:v1"KAITO will:
- Provision 2Γ A100 GPU nodes
- Download Llama 3 70B weights
- Deploy with tensor parallelism across 2 nodes
- Create a Service endpoint
Custom Model from Private Registry
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
name: custom-model
namespace: inference
spec:
resource:
instanceType: "Standard_NC24ads_A100_v4"
count: 1
inference:
template:
containers:
- name: inference
image: registry.example.com/models/custom-bert:v2
ports:
- containerPort: 8080
resources:
limits:
nvidia.com/gpu: 1
env:
- name: MODEL_PATH
value: /models/custom-bertQuery the Model
# Get the service endpoint
export SVC_IP=$(kubectl get svc llama-3-70b -n inference -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Inference request
curl -X POST http://${SVC_IP}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3-70b-instruct",
"prompt": "Explain Kubernetes pods in simple terms",
"max_tokens": 256,
"temperature": 0.7
}'KAITO vs Manual Deployment
| Step | Manual | KAITO |
|---|---|---|
| GPU node provisioning | 10-30 min | Automatic |
| GPU operator setup | 15 min | Pre-configured |
| Model download (70B) | 10-20 min | Automatic + cached |
| Container build | 30 min | Automatic |
| K8s manifests | 30 min | 1 CR |
| Tensor parallelism | Complex config | Automatic |
| Total | 2-4 hours | 10-15 minutes |
graph TD
USER[User creates<br/>Workspace CR] --> CTRL[KAITO Controller<br/>Reads spec]
CTRL --> PROV[GPU Provisioner<br/>Create nodes]
CTRL --> PULL[Model Manager<br/>Pull weights]
PROV --> NODE[GPU Node Ready<br/>A100 / H100]
PULL --> CONTAINER[Model Containerized<br/>+ Inference server]
NODE --> DEPLOY[Deploy Pod<br/>on GPU node]
CONTAINER --> DEPLOY
DEPLOY --> SVC[Service Endpoint<br/>/v1/completions]Common Issues
Workspace stuck in Provisioning
GPU instance type unavailable in region. Check cloud provider quota: az vm list-usage --location eastus | grep NC. Request quota increase or try a different instance type.
Model download timeout
Large models (70B+) take 10-20 minutes to download. KAITO caches model weights on PVCs β subsequent deployments use the cache. Ensure PVC storage class supports ReadWriteMany for multi-node.
Best Practices
- Use preset models for common LLMs β Llama, Mistral, Phi β one-line deployment
- Private registry for custom models β use the template spec for full control
- Cache model weights on PVC β avoid re-downloading on pod restart
- Label GPU nodes for workload isolation β inference vs training vs notebooks
- Monitor Workspace status β
kubectl get workspace -wshows provisioning progress
Key Takeaways
- KAITO automates the full LLM deployment pipeline: GPU provisioning β model download β containerization β serving
- Single Workspace CR replaces hours of manual configuration
- Automatic GPU node provisioning via Karpenter-based provisioner
- Supports preset models (Llama, Mistral, Phi) and custom models from private registries
- LoRA adapter support for fine-tuned model variants without re-downloading base weights

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
