ModelMesh Multi-Model Serving on Kubernetes
Deploy hundreds of ML models on shared GPU infrastructure with ModelMesh, which handles intelligent model loading and unloading, memory management, and routing.
💡 Quick Answer: Deploy ModelMesh with KServe for intelligent multi-model serving on shared GPUs. ModelMesh automatically loads frequently accessed models into GPU memory and unloads idle ones, so 100+ models can run on a few shared GPU pods instead of 100 dedicated ones.
The Problem
Each ML model typically gets its own Deployment with a dedicated GPU. With 50-100 models in production, that's 50-100 GPUs, most sitting at <10% utilization. ModelMesh packs multiple models onto shared GPU pods, loading models on demand and evicting idle ones, which can yield 10-100x better GPU utilization.
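As a back-of-envelope check on that claim, the sketch below estimates how many shared GPUs would cover the same compute demand. The 2% average utilization and the 70% target utilization are illustrative assumptions, not measurements, and the estimate ignores GPU memory, which in practice often sets the real limit.

```python
import math


def shared_gpus_needed(n_models: int, avg_utilization: float,
                       target_utilization: float = 0.7) -> int:
    """Estimate GPUs needed when models share hardware (compute demand only).

    avg_utilization: assumed average busy fraction of a dedicated GPU per model.
    target_utilization: assumed busy fraction we are willing to run shared GPUs at.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    total_demand = n_models * avg_utilization  # in units of fully busy GPUs
    return max(1, math.ceil(total_demand / target_utilization))


# Hypothetical fleet: 100 models, each keeping its dedicated GPU 2% busy.
dedicated = 100
shared = shared_gpus_needed(100, 0.02)
print(f"dedicated: {dedicated} GPUs, shared: {shared} GPUs, "
      f"ratio: {dedicated / shared:.0f}x")
```

With these assumed numbers the estimate is 3 shared GPUs against 100 dedicated ones. Busier models move the result toward the lower end of the 10-100x range.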
The Solution
Install ModelMesh with KServe
# Install KServe with ModelMesh
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml
kubectl apply -f https://github.com/kserve/modelmesh-serving/releases/download/v0.12.0/modelmesh.yaml
ServingRuntime for Model Format
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-modelmesh
  namespace: ml-serving
spec:
  supportedModelFormats:
    - name: onnx
      version: "1"
      autoSelect: true
    - name: tensorrt
      version: "8"
    - name: pytorch
      version: "1"
  multiModel: true
  grpcEndpoint: "port:8085"
  grpcDataEndpoint: "port:8001"
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.07-py3
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: 1
        limits:
          memory: 16Gi
          nvidia.com/gpu: 1
      args:
        - --model-control-mode=explicit
        - --strict-model-config=false
  replicas: 3
Deploy Multiple Models
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-model
  namespace: ml-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: "s3://models/sentiment/v3"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: translation-model
  namespace: ml-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: "s3://models/translation/v2"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: classification-model
  namespace: ml-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/classification/v1"
All 3 models share the same 3-replica Triton runtime; no pod is dedicated to a single model.
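Because the models sit behind one shared endpoint, a client selects a model by name in the request path rather than by addressing a per-model pod. The sketch below builds a KServe V2 (Open Inference Protocol) REST request for `sentiment-model`. It assumes the ModelMesh REST proxy is enabled and port-forwarded to `localhost:8008`; the tensor name `input_ids`, its shape, and the token values are hypothetical and depend on the actual model.

```python
import json
from urllib import request


def build_infer_request(base_url, model_name, input_name, shape, datatype, data):
    """Build a V2 inference-protocol POST request for one named model."""
    url = f"{base_url}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {"name": input_name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }
    return request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed endpoint and hypothetical input tensor; adjust to your deployment.
req = build_infer_request(
    "http://localhost:8008", "sentiment-model",
    input_name="input_ids", shape=[1, 4], datatype="INT64",
    data=[101, 2023, 2003, 102],
)
print(req.get_method(), req.full_url)

# Sending the request needs a reachable endpoint, so it is left commented out:
# with request.urlopen(req, timeout=10) as resp:
#     print(json.load(resp))
```

Swapping the model name for `translation-model` or `classification-model` targets a different model through the same endpoint.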
ModelMesh Behavior
100 models registered, 3 GPU pods (each 80GB):
- 20 models actively loaded (fit in GPU memory)
- 80 models on standby (on disk/S3)
- Request for idle model → auto-load (200-500ms for small models)
- LRU eviction when memory is full → least recently used model unloaded
Result: 100 models served by 3 GPUs instead of 100 GPUs
graph TD
    REQ[Inference Requests] --> ROUTER[ModelMesh Router<br/>Intelligent routing]
    ROUTER --> POD1[GPU Pod 1<br/>Models A, B, C<br/>loaded in memory]
    ROUTER --> POD2[GPU Pod 2<br/>Models D, E, F<br/>loaded in memory]
    ROUTER --> POD3[GPU Pod 3<br/>Models G, H, I<br/>loaded in memory]
    S3[S3 Model Storage<br/>100 models total] -.->|Load on demand| POD1
    S3 -.->|Load on demand| POD2
    S3 -.->|Load on demand| POD3
    POD1 -.->|Evict LRU| S3
Common Issues
First request to a model is slow (cold load)
Expected: ModelMesh has to load the model from storage before it can serve the first request. For latency-sensitive models, use the serving.kserve.io/priority: high annotation to keep them always loaded.
Model evicted too frequently
Increase GPU memory per pod or add more replicas. Check eviction frequency: kubectl logs deploy/modelmesh-serving | grep evict.
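To see why a modest capacity increase can stop eviction churn, here is a toy size-aware LRU cache. It is a simplified stand-in for illustration, not ModelMesh's actual placement algorithm; the 10 GB model size, the capacities, and the cyclic request trace are all assumptions.

```python
from collections import OrderedDict


class LruModelCache:
    """Toy LRU cache keyed by model name, bounded by total size in GB."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.loaded = OrderedDict()  # name -> size_gb, least recently used first
        self.cold_loads = 0
        self.evictions = 0

    def request(self, name, size_gb):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return "hit"
        if size_gb > self.capacity_gb:
            raise ValueError(f"{name} does not fit in {self.capacity_gb} GB")
        # Evict least recently used models until the new one fits.
        while sum(self.loaded.values()) + size_gb > self.capacity_gb:
            self.loaded.popitem(last=False)
            self.evictions += 1
        self.loaded[name] = size_gb
        self.cold_loads += 1
        return "cold-load"


def count_evictions(capacity_gb, trace, size_gb=10):
    """Replay a request trace of model names and return the eviction count."""
    cache = LruModelCache(capacity_gb)
    for name in trace:
        cache.request(name, size_gb)
    return cache.evictions


# Four hypothetical 10 GB models requested round-robin, five times over.
trace = ["a", "b", "c", "d"] * 5
for capacity in (30, 40):
    print(f"{capacity} GB -> {count_evictions(capacity, trace)} evictions")
```

With room for only three of the four models, every request after warm-up misses and triggers an eviction (17 in this trace). Once all four fit, evictions drop to zero, which is the effect that adding memory or replicas aims for.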
Best Practices
- Use ModelMesh for 10+ models on shared infrastructure and single-model KServe for fewer than 10
- Group similar model sizes on the same runtime: this prevents one large model from evicting many small ones
- Set priority annotations for critical models to prevent LRU eviction
- Use S3-compatible storage for model artifacts so they stay consistent across environments
- Run at least 3 replicas for HA: ModelMesh distributes models across replicas
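The size-grouping practice above can be sketched as a simple bucketing step run before assigning models to runtimes. The tier names, size limits, and model sizes below are hypothetical; each resulting group would map to its own ServingRuntime.

```python
from collections import defaultdict


def group_by_size(model_sizes_gb, tier_limits_gb):
    """Assign each model to the first tier whose size limit it fits under.

    model_sizes_gb: dict of model name -> size in GB.
    tier_limits_gb: list of (tier_name, max_size_gb), ordered smallest first.
    """
    groups = defaultdict(list)
    for name, size in sorted(model_sizes_gb.items()):
        for tier, max_size in tier_limits_gb:
            if size <= max_size:
                groups[tier].append(name)
                break
        else:
            # No tier is large enough; flag the model for separate handling.
            groups["oversized"].append(name)
    return dict(groups)


# Hypothetical tiers and sizes, for illustration only.
tiers = [("small", 1.0), ("medium", 10.0), ("large", 40.0)]
models = {
    "sentiment-model": 0.4,
    "translation-model": 6.0,
    "classification-model": 0.9,
    "hypothetical-llm": 30.0,
}
for tier, names in group_by_size(models, tiers).items():
    print(tier, names)
```

Keeping the large tier on a separate runtime means loading one big model never forces out a batch of small ones.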
Key Takeaways
- ModelMesh packs 10-100+ models onto shared GPU pods with intelligent memory management
- On-demand loading brings requested models into GPU memory, and LRU eviction unloads idle ones
- 10-100x better GPU utilization compared to dedicated pod-per-model deployments
- Cold-load latency is 200-500ms for small models; use priority annotations for hot models
- Works with the KServe InferenceService CRD: same API, just add the deploymentMode annotation

