Shared Model Caching Across Pods on Kubernetes
Optimize LLM inference startup and reduce storage costs by sharing model weights across pods using emptyDir, hostPath, ReadWriteMany PVCs, and init containers.
Quick Answer: Use a ReadWriteMany PVC to store model weights once and share them across all inference pods. Pre-load models with an init container or a one-time Job. For fastest access, use `emptyDir` with `medium: Memory` for small models (<16GB) or `hostPath` with a DaemonSet model-loader for per-node caching of large models.

Key concept: Loading a 16GB model from a shared PVC takes ~30s vs ~5min downloading from HuggingFace on every pod start.

Gotcha: `emptyDir` with `medium: Memory` counts against the pod's memory limit. A 16GB model in memory means you need 16GB extra in your `limits.memory`.
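The arithmetic behind this gotcha can be sketched in a few lines of Python. The helper name `memory_emptydir_sizing` and the overhead figure passed to it are illustrative assumptions; the real overhead depends on the inference server and should be measured.

```python
def memory_emptydir_sizing(model_gib: int, app_overhead_gib: int) -> dict:
    """Return the emptyDir sizeLimit and container memory limit, in Gi.

    A memory-backed emptyDir is charged to the pod's memory, so the
    container limit has to cover the model plus the application's own usage.
    """
    return {
        "emptyDir.sizeLimit": f"{model_gib}Gi",
        "limits.memory": f"{model_gib + app_overhead_gib}Gi",
    }


if __name__ == "__main__":
    # 16 GiB model plus an assumed 16 GiB for the inference process itself
    print(memory_emptydir_sizing(16, 16))
```

With a 16 GiB model and 16 GiB assumed for the application, the sketch yields a `limits.memory` of `32Gi`.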
The Problem
Every LLM inference pod downloads or loads model weights at startup:
- Duplicate downloads waste bandwidth when scaling from 1 to 10 replicas
- Cold start times of 3-10 minutes make autoscaling impractical
- Storage costs multiply when each pod has its own model copy
- Registry bandwidth gets saturated pulling multi-GB model images
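To put a number on the duplicate-download problem, here is a rough back-of-the-envelope sketch. The function name `total_download_gb` is hypothetical, and the model treats "shared" as a single cluster-wide copy, which is a simplification.

```python
def total_download_gb(model_gb: float, replicas: int, shared: bool) -> float:
    """Data pulled from the model hub to bring up `replicas` pods.

    shared=False: every pod downloads its own copy.
    shared=True:  the model is downloaded once and reused by all pods.
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return model_gb if shared else model_gb * replicas


if __name__ == "__main__":
    # Scaling a 16 GB model to 10 replicas, with and without sharing
    print(total_download_gb(16, 10, shared=False))
    print(total_download_gb(16, 10, shared=True))
```

For a 16 GB model at 10 replicas, that is 160 GB of downloads without sharing versus 16 GB with it.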
The Solution
Share model weights across pods using Kubernetes storage primitives, choosing the right strategy based on model size and access patterns.
Architecture Overview
```mermaid
flowchart TB
    subgraph cluster["KUBERNETES CLUSTER"]
        subgraph strategy1["Strategy 1: RWX PVC"]
            PVC["ReadWriteMany PVC<br/>NFS / EFS / CephFS"]
            P1A["Pod 1"] --> PVC
            P2A["Pod 2"] --> PVC
            P3A["Pod 3"] --> PVC
        end
        subgraph strategy2["Strategy 2: Node Cache"]
            DS["DaemonSet<br/>Model Loader"]
            HP["hostPath<br/>/models"]
            P1B["Pod 1"] --> HP
            P2B["Pod 2"] --> HP
            DS --> HP
        end
        subgraph strategy3["Strategy 3: Memory"]
            ED["emptyDir<br/>medium: Memory"]
            IC["Init Container<br/>Model Copy"]
            P1C["Pod 1"] --> ED
            IC --> ED
        end
    end
```

Strategy 1: ReadWriteMany PVC (Recommended)
```yaml
# model-store-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store
  namespace: ai-inference
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 500Gi
  storageClassName: efs-sc  # Or nfs-csi, cephfs
---
# One-time job to download models
apiVersion: batch/v1
kind: Job
metadata:
  name: download-models
  namespace: ai-inference
spec:
  template:
    spec:
      containers:
        - name: downloader
          image: python:3.11-slim
          command: ["bash", "-c"]
          args:
            - |
              pip install huggingface_hub
              python -c "
              from huggingface_hub import snapshot_download
              import os
              models = [
                  ('meta-llama/Llama-3.1-8B-Instruct', '/models/llama-3.1-8b-instruct'),
                  ('mistralai/Mistral-7B-Instruct-v0.3', '/models/mistral-7b-instruct'),
              ]
              for repo_id, local_dir in models:
                  if not os.path.exists(local_dir):
                      print(f'Downloading {repo_id}...')
                      snapshot_download(repo_id=repo_id, local_dir=local_dir, token=os.environ['HF_TOKEN'])
                  else:
                      print(f'{repo_id} already cached')
              "
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-credentials
                  key: token
          volumeMounts:
            - name: models
              mountPath: /models
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-store
      restartPolicy: Never
---
# Inference deployment using shared models
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: ai-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.4
          args:
            - "--model=/models/llama-3.1-8b-instruct"
            - "--gpu-memory-utilization=0.9"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /models
              readOnly: true
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-store
            readOnly: true
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
```

Strategy 2: Node-Level Cache with DaemonSet
```yaml
# model-cache-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-cache-loader
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: model-cache-loader
  template:
    metadata:
      labels:
        app: model-cache-loader
    spec:
      initContainers:
        - name: download
          image: python:3.11-slim
          command: ["bash", "-c"]
          args:
            - |
              # Skip the download if the model is already on this node
              if [ ! -f /cache/llama-3.1-8b-instruct/config.json ]; then
                pip install huggingface_hub
                # The Python source must start at column 0 of the script
                python -c "
              from huggingface_hub import snapshot_download
              snapshot_download('meta-llama/Llama-3.1-8B-Instruct', local_dir='/cache/llama-3.1-8b-instruct')
              "
              fi
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-credentials
                  key: token
          volumeMounts:
            - name: cache
              mountPath: /cache
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
      containers:
        - name: keepalive
          image: busybox
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: cache
              mountPath: /cache
      volumes:
        - name: cache
          hostPath:
            path: /var/lib/model-cache
            type: DirectoryOrCreate
      nodeSelector:
        gpu.nvidia.com/class: A100_SXM4_80GB
---
# Inference pods use the hostPath cache
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-hostpath
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-hostpath
  template:
    metadata:
      labels:
        app: vllm-hostpath  # Must match spec.selector, or the Deployment is rejected
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.4
          args: ["--model=/models/llama-3.1-8b-instruct"]
          volumeMounts:
            - name: models
              mountPath: /models
              readOnly: true
          resources:
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: models
          hostPath:
            path: /var/lib/model-cache
            type: Directory  # Pod fails to start on nodes without the cache directory
```

Strategy 3: In-Memory with emptyDir
```yaml
# For small models (<8GB) that need fastest possible access
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fast-inference
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fast-inference
  template:
    metadata:
      labels:
        app: fast-inference  # Must match spec.selector, or the Deployment is rejected
    spec:
      initContainers:
        - name: load-model
          image: python:3.11-slim
          command: ["bash", "-c"]
          args:
            - |
              pip install huggingface_hub
              python -c "
              from huggingface_hub import snapshot_download
              snapshot_download('microsoft/phi-2', local_dir='/model-cache/phi-2')
              "
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-credentials
                  key: token
          volumeMounts:
            - name: model-mem
              mountPath: /model-cache
          resources:
            requests:
              memory: 8Gi
      containers:
        - name: inference
          image: vllm/vllm-openai:v0.6.4
          args: ["--model=/model-cache/phi-2"]
          volumeMounts:
            - name: model-mem
              mountPath: /model-cache
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 24Gi  # 8GB model + 16GB for inference
      volumes:
        - name: model-mem
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
```

Comparison
| Strategy | Startup Time | Model Size | Multi-Node | Cost |
|---|---|---|---|---|
| RWX PVC | ~30s | Unlimited | Yes | Medium |
| hostPath + DaemonSet | ~5s | Limited by disk | Per-node | Low |
| emptyDir Memory | ~60s (download) | <16GB | Per-pod | High (RAM) |
| OCI Image (baked in) | ~0s after image pull | <10GB | Yes | High (registry) |
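The size limits in this table can be turned into a quick feasibility filter. The sketch below is only a reading aid: the function name `candidate_strategies` and the `node_free_disk_gb` parameter are hypothetical, and it deliberately ignores startup time and cost, which still have to be weighed by hand.

```python
from typing import List

# Size ceilings copied from the comparison table; None means "Unlimited".
SIZE_LIMIT_GB = {
    "RWX PVC": None,
    "emptyDir Memory": 16,
    "OCI Image (baked in)": 10,
}


def candidate_strategies(model_gb: float, node_free_disk_gb: float) -> List[str]:
    """List the strategies whose size constraint the model satisfies."""
    candidates = []
    for name, limit in SIZE_LIMIT_GB.items():
        if limit is None or model_gb < limit:
            candidates.append(name)
    # hostPath + DaemonSet is "Limited by disk": the model must fit on the node.
    if model_gb <= node_free_disk_gb:
        candidates.append("hostPath + DaemonSet")
    return candidates


if __name__ == "__main__":
    # A 40 GB model on nodes with 200 GB of free local disk (assumed figures)
    print(candidate_strategies(40, 200))
```

A 40 GB model with 200 GB of free node disk leaves RWX PVC and hostPath + DaemonSet as the candidates; the memory and baked-image options drop out on size alone.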
Common Issues
Issue 1: NFS slow model loading
```bash
# Symptoms: Model loading takes 5+ minutes from NFS
# Solution: Use a faster storage backend
#   EFS with provisioned throughput: 1 GB/s+
#   Or use local NVMe with hostPath caching

# Check I/O throughput
kubectl exec -n ai-inference deploy/vllm-inference -- \
  dd if=/models/llama-3.1-8b-instruct/model-00001-of-00004.safetensors \
  of=/dev/null bs=1M count=1000 2>&1 | tail -1
```

Issue 2: emptyDir OOMKilled
```yaml
# Memory emptyDir counts against pod memory limits
# Increase memory limits to account for model size
resources:
  limits:
    memory: "32Gi"  # 16GB model + 16GB for application
```

Best Practices
- Use RWX PVC for multi-node: one download, all pods share
- Use hostPath + DaemonSet for single-model: fastest access, no network overhead
- Pre-download in Jobs, not init containers: init containers delay every pod start
- Set PVC to ReadOnly in inference pods: prevents accidental model corruption
- Use safetensors format: loads 2-5x faster than PyTorch .bin files via mmap
- Monitor storage I/O: slow NFS can bottleneck model loading more than the download itself
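For the storage I/O point, a sequential-read check can also be scripted in Python when `dd` is not convenient. This is a minimal sketch: the function name `read_throughput_mib_s` is hypothetical, and repeated runs on the same file will be inflated by the page cache.

```python
import time


def read_throughput_mib_s(path: str,
                          block_size: int = 1024 * 1024,
                          max_bytes: int = 1000 * 1024 * 1024) -> float:
    """Sequentially read up to max_bytes from path and return MiB/s."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS page cache still applies
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            chunk = f.read(min(block_size, max_bytes - total))
            if not chunk:  # end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mib = total / (1024 * 1024)
    return mib / elapsed if elapsed > 0 else float("inf")
```

Pointing it at one of the model shards on the mounted volume, for example `read_throughput_mib_s("/models/llama-3.1-8b-instruct/model-00001-of-00004.safetensors")`, gives a number comparable to the `dd` check with `bs=1M count=1000`.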
Key Takeaways
- ReadWriteMany PVCs are the most practical for multi-replica inference deployments
- hostPath with DaemonSet gives fastest access but requires per-node management
- emptyDir Memory is fastest for small models but expensive in RAM
- Pre-downloading models in a Job eliminates startup delays when scaling
- safetensors format enables memory-mapped loading for near-instant model access
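
Why safetensors suits memory-mapped loading is easier to see from the file layout: an 8-byte little-endian header length, a JSON header describing each tensor's dtype, shape, and byte offsets, then the raw tensor bytes. The following sketch reads only the header through `mmap` using the standard library. It is an illustration of the layout, not how vLLM or the `safetensors` package loads weights, and `read_safetensors_header` is a hypothetical helper name.

```python
import json
import mmap
import struct
from typing import Tuple


def read_safetensors_header(path: str) -> Tuple[int, dict]:
    """Return (header_length, header_dict) without reading any tensor data."""
    with open(path, "rb") as f:
        # Map the file read-only; pages are faulted in only when touched
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # First 8 bytes: unsigned 64-bit little-endian header size
            (header_len,) = struct.unpack("<Q", mm[:8])
            # Next header_len bytes: UTF-8 JSON with dtype, shape, data_offsets
            header = json.loads(mm[8:8 + header_len].decode("utf-8"))
    return header_len, header
```

Because every tensor's offsets are known from the header, a loader can map the file and touch only the byte ranges it needs, instead of deserializing the whole file first.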

