ai, intermediate, 20 minutes, K8s 1.28+

KServe Model Serving Kubernetes

Deploy ML models with KServe for serverless inference on Kubernetes. InferenceService, scale-to-zero, canary rollouts, model transformers.

By Luca Berton • 5 min read

Quick Answer: Create a KServe InferenceService with your model storage URI. KServe handles model loading, autoscaling (including scale-to-zero), request batching, and canary rollouts. Use ModelMesh for high-density multi-model serving on shared GPU infrastructure.

The Problem

Serving ML models in production requires autoscaling, traffic splitting for A/B testing, request batching, model versioning, and monitoring, none of which a plain Deployment provides. KServe is a Kubernetes-native model serving platform that handles all of this declaratively.

The Solution

Basic InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: production
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/sklearn/iris/v1"
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          memory: 512Mi
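
The request batching mentioned in the Quick Answer is configured on the predictor. A minimal sketch, assuming the same sklearn-iris model; the maxBatchSize and maxLatency values are illustrative assumptions and should be tuned for your workload:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: production
spec:
  predictor:
    # Assumed values: send a batch once 32 requests are queued,
    # or after 500 ms, whichever comes first.
    batcher:
      maxBatchSize: 32
      maxLatency: 500
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/sklearn/iris/v1"
```

Batching trades a little per-request latency for throughput, so it tends to pay off for models that process batches efficiently.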

GPU Model Serving

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-server
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "pvc://model-storage/llama-7b"
      runtime: kserve-torchserve
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: 32Gi
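    minReplicas: 1            # keep one replica warm to avoid cold starts
    maxReplicas: 4
    scaleTarget: 10           # target of 10 in-flight requests per replica
    scaleMetric: concurrency  # autoscale on concurrent requests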

Canary Rollout

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: text-classifier
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/classifier/v2"

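Applying this spec over an existing text-classifier routes 20% of traffic to the new v2 revision, while the previously deployed revision keeps serving the remaining 80%. Monitor accuracy before promoting.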
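
Once the canary looks healthy, promotion is a matter of re-applying the spec. A sketch of that step, assuming serverless (Knative) deployment mode: removing canaryTrafficPercent sends all traffic to v2, while setting it to 0 shifts traffic back to the previous revision.

```yaml
# Promote: re-apply without canaryTrafficPercent so v2 receives 100% of traffic.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: text-classifier
spec:
  predictor:
    # To roll back instead, keep the field and set it to 0:
    # canaryTrafficPercent: 0
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/classifier/v2"
```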

ModelMesh for Multi-Model Serving

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-runtime
spec:
  supportedModelFormats:
    - name: onnx
      version: "1"
      autoSelect: true
  multiModel: true
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.07-py3
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-a
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: "s3://models/model-a"

ModelMesh packs multiple models onto shared GPU pods, which can yield 10-100x better GPU utilization than running one model per pod.
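
To see the packing in practice, each additional model is declared as its own InferenceService with the same annotation and model format. A sketch, where model-b and its storage path are hypothetical:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-b                # hypothetical second model
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: onnx             # matches triton-runtime's supportedModelFormats
      storageUri: "s3://models/model-b"   # assumed path
```

No per-model pod or GPU request appears here: ModelMesh decides which runtime pod loads each model.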

graph TD
    CLIENT[Client Request] --> ROUTER[KServe Router<br/>Traffic splitting]
    ROUTER -->|80%| STABLE[Stable Model v1<br/>2 replicas]
    ROUTER -->|20%| CANARY[Canary Model v2<br/>1 replica]
    
    subgraph ModelMesh
        REQ[Requests] --> MM[ModelMesh Router]
        MM --> GPU1[GPU Pod<br/>Model A + B + C]
        MM --> GPU2[GPU Pod<br/>Model D + E + F]
    end
    
    STABLE -->|Scale-to-zero<br/>after idle period| ZERO[0 replicas<br/>cold start on request]

Common Issues

InferenceService stuck in "Not Ready"

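Check the model download logs: kubectl logs -n production deploy/sklearn-iris-predictor -c storage-initializer. In serverless mode the Deployment name carries a revision suffix, so selecting pods with -l serving.kserve.io/inferenceservice=sklearn-iris is more reliable. Common causes: missing S3 credentials or a wrong storage URI.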
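
If the logs point to missing S3 credentials, the usual fix is a Secret attached to a ServiceAccount that the predictor runs as. A minimal sketch; the resource names, endpoint, and region are assumptions, and the key values are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials         # hypothetical name
  namespace: production
  annotations:
    serving.kserve.io/s3-endpoint: s3.amazonaws.com   # assumed endpoint
    serving.kserve.io/s3-usehttps: "1"
    serving.kserve.io/s3-region: us-east-1            # assumed region
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<your-access-key-id>"
  AWS_SECRET_ACCESS_KEY: "<your-secret-access-key>"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-s3-sa           # hypothetical name
  namespace: production
secrets:
  - name: s3-credentials
```

Reference the account from the InferenceService by adding serviceAccountName: kserve-s3-sa under spec.predictor, so the storage initializer can read the credentials.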

Scale-to-zero cold start too slow

Large models take minutes to load from storage. Use minReplicas: 1 for latency-sensitive services, or pre-warm with periodic health checks.
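
The trade-off comes down to a single field. A sketch of a scale-to-zero variant for a non-production environment, assuming serverless deployment mode; the staging namespace and maxReplicas value are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: staging           # hypothetical namespace
spec:
  predictor:
    minReplicas: 0             # allow scale-to-zero when idle
    maxReplicas: 2             # assumed ceiling
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/sklearn/iris/v1"
```

Changing minReplicas to 1 keeps one replica warm, so the first request after an idle period does not wait for the model to load.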

Best Practices

  • KServe for standardized serving: supports sklearn, PyTorch, TensorFlow, ONNX, XGBoost, LightGBM
  • ModelMesh for multi-model: pack 10-50 models per GPU pod
  • Canary rollouts for model updates: route 10-20% to the new version and monitor metrics
  • Scale-to-zero for dev/staging: save GPU costs when idle
  • minReplicas: 1 for production: avoid cold-start latency

Key Takeaways

  • KServe provides declarative model serving with InferenceService CRD
  • Supports scale-to-zero, autoscaling on concurrency, and canary rollouts
  • ModelMesh enables multi-model serving on shared GPUs, with 10-100x better utilization
  • Canary traffic splitting enables safe model updates with gradual promotion
  • Storage initializer handles model download from S3, GCS, PVC, or HTTP
#kserve #model-serving #inference #serverless #modelmesh
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
