ai advanced ⏱ 20 minutes K8s 1.28+

ModelMesh Multi-Model Serving Kubernetes

Deploy hundreds of ML models on shared GPU infrastructure with ModelMesh, which handles intelligent model loading and unloading, memory management, and request routing.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Deploy ModelMesh with KServe for intelligent multi-model serving on shared GPUs. ModelMesh automatically loads frequently accessed models into GPU memory and unloads idle ones, serving 100+ models on infrastructure that would otherwise require 100 dedicated GPU pods.

The Problem

Each ML model typically gets its own Deployment with a dedicated GPU. With 50-100 models in production, that's 50-100 GPUs, most of them sitting at <10% utilization. ModelMesh packs multiple models onto shared GPU pods, loading models on demand and evicting idle ones, which can yield 10-100x better GPU utilization.
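
The consolidation arithmetic behind that claim can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the figures (100 models, 3 shared GPU pods with one GPU each) are the ones used in the ModelMesh Behavior section of this recipe.

```python
def gpu_savings(num_models: int, shared_gpu_pods: int) -> dict:
    """Compare one-GPU-per-model with a shared pool of single-GPU pods."""
    if num_models <= 0 or shared_gpu_pods <= 0:
        raise ValueError("both counts must be positive")
    dedicated_gpus = num_models  # one Deployment and one GPU per model
    return {
        "dedicated_gpus": dedicated_gpus,
        "shared_gpus": shared_gpu_pods,
        "gpus_saved": dedicated_gpus - shared_gpu_pods,
        "consolidation_ratio": dedicated_gpus / shared_gpu_pods,
    }


if __name__ == "__main__":
    # 100 registered models served by 3 shared GPU pods
    print(gpu_savings(100, 3))
```

With 100 models on 3 GPUs the ratio is roughly 33x, which sits inside the 10-100x range quoted above.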

The Solution

Install ModelMesh with KServe

# Install KServe with ModelMesh
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml
kubectl apply -f https://github.com/kserve/modelmesh-serving/releases/download/v0.12.0/modelmesh.yaml

ServingRuntime for Model Format

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-modelmesh
  namespace: ml-serving
spec:
  supportedModelFormats:
    - name: onnx
      version: "1"
      autoSelect: true
    - name: tensorrt
      version: "8"
      autoSelect: true
    - name: pytorch
      version: "1"
      autoSelect: true
  multiModel: true
  grpcEndpoint: "port:8085"
  grpcDataEndpoint: "port:8001"
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.07-py3
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: 1
        limits:
          memory: 16Gi
          nvidia.com/gpu: 1
      args:
        - --model-control-mode=explicit
        - --strict-model-config=false
  replicas: 3

Deploy Multiple Models

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-model
  namespace: ml-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: "s3://models/sentiment/v3"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: translation-model
  namespace: ml-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: "s3://models/translation/v2"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: classification-model
  namespace: ml-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/classification/v1"

All 3 models share the same 3-replica Triton runtime; no model gets a dedicated pod.
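
With dozens or hundreds of models, writing each InferenceService by hand gets tedious. The following is a minimal sketch, assuming you keep a simple list of (name, format, storage URI) tuples; the function names and the `MODELS` list are illustrative, not part of KServe. It emits JSON, which `kubectl apply` accepts just like YAML.

```python
import json


def build_inference_service(name, model_format, storage_uri, namespace="ml-serving"):
    """Return one ModelMesh-mode InferenceService manifest as a dict."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"serving.kserve.io/deploymentMode": "ModelMesh"},
        },
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                }
            }
        },
    }


def build_manifest_list(models):
    """Wrap all manifests in a v1 List so they can be applied in one call."""
    items = [build_inference_service(*model) for model in models]
    return {"apiVersion": "v1", "kind": "List", "items": items}


# The three models from the manifests above.
MODELS = [
    ("sentiment-model", "onnx", "s3://models/sentiment/v3"),
    ("translation-model", "onnx", "s3://models/translation/v2"),
    ("classification-model", "pytorch", "s3://models/classification/v1"),
]

if __name__ == "__main__":
    print(json.dumps(build_manifest_list(MODELS), indent=2))
```

Saved under a hypothetical name such as gen_isvc.py, the output could be piped straight into the cluster with `python gen_isvc.py | kubectl apply -f -`.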

ModelMesh Behavior

100 models registered, 3 GPU pods (80 GB each):
  → 20 models actively loaded (fit in GPU memory)
  → 80 models on standby (on disk/S3)
  → Request for idle model → auto-load (200-500ms for small models)
  → LRU eviction when memory full → least-used model unloaded

Result: 100 models served by 3 GPUs instead of 100 GPUs

graph TD
    REQ[Inference Requests] --> ROUTER[ModelMesh Router<br/>Intelligent routing]
    ROUTER --> POD1[GPU Pod 1<br/>Models A, B, C<br/>loaded in memory]
    ROUTER --> POD2[GPU Pod 2<br/>Models D, E, F<br/>loaded in memory]
    ROUTER --> POD3[GPU Pod 3<br/>Models G, H, I<br/>loaded in memory]
    
    S3[S3 Model Storage<br/>100 models total] -.->|Load on demand| POD1
    S3 -.->|Load on demand| POD2
    S3 -.->|Load on demand| POD3
    
    POD1 -.->|Evict LRU| S3
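
The load-on-demand and LRU-eviction behavior described above can be illustrated with a toy simulation of a single pod's GPU memory. This is a simplified sketch, not ModelMesh's actual implementation: the class name, the model sizes, and the 8 GiB capacity are assumptions chosen to keep the example small.

```python
from collections import OrderedDict


class LruModelCache:
    """Toy model of one GPU pod: load on demand, evict least recently used."""

    def __init__(self, capacity_gib: float):
        self.capacity_gib = capacity_gib
        self.loaded = OrderedDict()  # model name -> size in GiB, oldest first
        self.cold_loads = 0
        self.evictions = 0

    def used_gib(self) -> float:
        return sum(self.loaded.values())

    def request(self, name: str, size_gib: float) -> str:
        """Serve a request; return 'warm' if already loaded, else 'cold'."""
        if size_gib > self.capacity_gib:
            raise ValueError(f"{name} does not fit in {self.capacity_gib} GiB")
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return "warm"
        # Evict least recently used models until the new one fits.
        while self.used_gib() + size_gib > self.capacity_gib:
            self.loaded.popitem(last=False)
            self.evictions += 1
        self.loaded[name] = size_gib
        self.cold_loads += 1
        return "cold"


if __name__ == "__main__":
    cache = LruModelCache(capacity_gib=8)
    for model in ["A", "B", "A", "C"]:
        print(model, cache.request(model, size_gib=4))
    print("loaded:", list(cache.loaded), "evictions:", cache.evictions)
```

In this run, model B is evicted when C arrives because A was touched more recently, which is the same pattern that keeps frequently requested models resident.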

Common Issues

First request to a model is slow (cold load)

This is expected: ModelMesh has to load the model from storage before it can serve the first request. For latency-sensitive models, use the serving.kserve.io/priority: high annotation to keep them always loaded.
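
To see how much the cold load actually costs for a given model, one option is to time the first request against the following ones. The helper below is a rough sketch: `send_request` is a hypothetical placeholder for whatever client call you use to hit the model, and the overhead is estimated as the first latency minus the median of the rest.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies(send_request: Callable[[], object], attempts: int = 5) -> List[float]:
    """Call send_request repeatedly and return each call's latency in seconds."""
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


def cold_load_overhead(latencies: List[float]) -> float:
    """Estimate cold-load cost: first call minus the median of later calls."""
    if len(latencies) < 2:
        raise ValueError("need at least one cold and one warm measurement")
    return latencies[0] - statistics.median(latencies[1:])


if __name__ == "__main__":
    # Stand-in for a real inference call; replace with your own client.
    samples = measure_latencies(lambda: time.sleep(0.01), attempts=3)
    print(f"estimated cold-load overhead: {cold_load_overhead(samples):.3f}s")
```

Run it against a model that has been idle long enough to be unloaded; otherwise the first request is already warm and the estimate will be close to zero.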

Model evicted too frequently

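Increase GPU memory per pod or add more replicas. To check eviction frequency, grep the logs of the ModelMesh container in the runtime Deployment (typically named modelmesh-serving-<runtime-name>, with the ModelMesh container called mm): kubectl logs -n ml-serving deploy/modelmesh-serving-triton-modelmesh -c mm | grep -i evict.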
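
A back-of-the-envelope estimate can help decide how many replicas are needed so that the frequently used ("hot") models stay resident. The sketch below is an assumption-laden lower bound: the function name is illustrative, `usable_fraction` is a made-up headroom factor for runtime overhead, and the calculation ignores packing inefficiency.

```python
import math
from typing import Sequence


def replicas_for_hot_set(
    hot_model_sizes_gib: Sequence[float],
    gpu_mem_gib: float,
    usable_fraction: float = 0.8,
) -> int:
    """Lower bound on single-GPU replicas needed to keep all hot models loaded."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_gib = gpu_mem_gib * usable_fraction
    if any(size > usable_gib for size in hot_model_sizes_gib):
        raise ValueError("at least one model does not fit on a single GPU")
    return max(1, math.ceil(sum(hot_model_sizes_gib) / usable_gib))


if __name__ == "__main__":
    # 40 hot models of 4 GiB each on 80 GB GPUs with 20% headroom
    print(replicas_for_hot_set([4] * 40, gpu_mem_gib=80))
```

If the result is higher than the current replica count, frequent evictions are to be expected even with a perfect eviction policy.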

Best Practices

  • ModelMesh for 10+ models on shared infrastructure; single-model KServe for <10
  • Group similar model sizes on the same runtime: prevents one large model from evicting many small ones
  • Set priority annotations for critical models: prevents LRU eviction
  • S3-compatible storage for model artifacts: consistent across environments
  • 3 replicas minimum for HA: ModelMesh distributes models across replicas
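
Grouping models by size, as suggested in the list above, can be sketched as a simple tiering step. The tier names and the 1 GiB and 10 GiB boundaries below are arbitrary assumptions for illustration; each tier would then map to its own ServingRuntime.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence

TIER_NAMES = ("small", "medium", "large")


def assign_tiers(
    model_sizes_gib: Dict[str, float],
    boundaries_gib: Sequence[float] = (1.0, 10.0),
) -> Dict[str, List[str]]:
    """Bucket models into size tiers; boundaries are the lower edge of the next tier."""
    if len(boundaries_gib) != len(TIER_NAMES) - 1:
        raise ValueError("need exactly one boundary between each pair of tiers")
    tiers: Dict[str, List[str]] = {name: [] for name in TIER_NAMES}
    for model, size in sorted(model_sizes_gib.items()):
        tier = TIER_NAMES[bisect_right(boundaries_gib, size)]
        tiers[tier].append(model)
    return tiers


if __name__ == "__main__":
    sizes = {"sentiment-model": 0.5, "translation-model": 4.0, "classification-model": 12.0}
    print(assign_tiers(sizes))
```

Each InferenceService can then be pinned to the runtime for its tier through the `runtime` field under `spec.predictor.model`, so a single large model never competes with many small ones for the same GPU memory.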

Key Takeaways

  • ModelMesh packs 10-100+ models onto shared GPU pods with intelligent memory management
  • On-demand loading with LRU eviction keeps frequently used models in memory and unloads idle ones
  • 10-100x better GPU utilization compared to dedicated pod-per-model deployments
  • Cold-load latency is 200-500ms for small models; use priority annotations for hot models
  • Works with the KServe InferenceService CRD: same API, just add the deploymentMode annotation