Autoscale LLM Inference on Kubernetes
Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.
💡 Quick Answer: Use KEDA with Prometheus triggers to autoscale LLM replicas based on request queue depth or GPU utilization. Standard HPA works for CPU-based metrics. For GPU-aware scaling, scrape DCGM metrics (DCGM_FI_DEV_GPU_UTIL) or vLLM's built-in /metrics endpoint (vllm:num_requests_waiting). Set minReplicas: 1 to avoid cold-start delays.
LLM inference workloads have variable demand. Autoscaling saves GPU costs during low traffic and prevents latency spikes during peaks.
Scaling Challenges for LLMs
| Challenge | Impact | Solution |
|---|---|---|
| Slow model loading | 30–120s cold start | Keep minReplicas ≥ 1 |
| GPU allocation | Must reserve full GPU per replica | Use GPU fractioning or time-slicing |
| Memory requirements | Each replica needs full model in VRAM | Plan total GPU budget |
| Batch processing | vLLM batches dynamically | Scale on queue depth, not CPU |
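As a rough illustration of the GPU allocation row, a replica that reserves a whole GPU typically declares it as a resource limit. The fragment below is a sketch, not the original deployment: the container name and image are assumptions, and the nvidia.com/gpu resource presumes the NVIDIA device plugin is installed on the cluster.

```yaml
# Pod template fragment (hypothetical names; adapt to your Deployment)
spec:
  containers:
    - name: vllm                      # assumed container name
      image: vllm/vllm-openai:latest  # assumed image
      resources:
        limits:
          nvidia.com/gpu: 1           # one full GPU reserved per replica
```

With a limit like this, maxReplicas cannot usefully exceed the number of schedulable GPUs, which is why the total GPU budget has to be planned up front.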
Strategy 1: HPA with Custom Metrics
vLLM Prometheus Metrics
vLLM exposes metrics at /metrics:
curl http://mistral-vllm:8000/metrics | grep vllm
Key scaling metrics:
| Metric | Description | Good for Scaling? |
|---|---|---|
| vllm:num_requests_running | Active requests | Yes |
| vllm:num_requests_waiting | Queued requests | Best |
| vllm:avg_generation_throughput_toks_per_s | Token throughput | Informational |
| vllm:gpu_cache_usage_perc | KV cache utilization | Yes |
Prometheus ServiceMonitor
# vllm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: mistral-vllm
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
HPA with Prometheus Adapter
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mistral-vllm-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mistral-vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"  # Scale up when >5 requests queued per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120  # Add 1 replica every 2 min max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300  # Remove 1 replica every 5 min max
Scale-down is deliberately slow because each replica holds significant GPU resources and model reload is expensive.
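The HPA above reads vllm_num_requests_waiting from the custom metrics API, whereas vLLM publishes the series as vllm:num_requests_waiting. A Prometheus Adapter rule can bridge the two names. The following is a sketch in the adapter Helm chart's values format; it assumes the scraped series carry namespace and pod labels, so check it against your own label set before relying on it.

```yaml
# prometheus-adapter Helm values (sketch; label names are assumptions)
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^vllm:num_requests_waiting$"
        as: "vllm_num_requests_waiting"
      # Per-pod queue depth; the HPA then averages it across pods
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Once the adapter has picked up the rule, kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 should list the renamed metric.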
Strategy 2: KEDA (Recommended)
KEDA provides richer trigger options and simpler configuration than raw HPA + Prometheus Adapter.
Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace
KEDA ScaledObject for vLLM
# vllm-keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mistral-vllm-keda
  namespace: ai-inference
spec:
  scaleTargetRef:
    name: mistral-vllm
  minReplicaCount: 1
  maxReplicaCount: 4
  cooldownPeriod: 300  # Wait 5 min before scaling to zero (no effect while minReplicaCount >= 1)
  pollingInterval: 30
  triggers:
    # Scale on queued requests
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: vllm_waiting_requests
        query: |
          sum(vllm:num_requests_waiting{namespace="ai-inference"})
        threshold: "10"  # Target about 10 queued requests per replica
    # Optional: scale on GPU utilization
    - type: prometheus
      metricType: Value  # Compare the average directly instead of dividing it by the replica count
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: gpu_utilization
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{namespace="ai-inference"})
        threshold: "85"  # Scale up when avg GPU util > 85%
KEDA with Scale-to-Zero
For non-production or cost-sensitive environments:
spec:
  minReplicaCount: 0  # Scale to zero when idle
  maxReplicaCount: 3
  cooldownPeriod: 600  # 10 min idle before scaling to zero
Warning: Scale-to-zero means a 30–120 second cold start on the next request (the model must reload into GPU memory).
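KEDA manages an HPA under the hood, so the slow scale-down policy from Strategy 1 can be carried over through the ScaledObject's advanced section. This is a minimal sketch that reuses the values from the HPA example; it is meant to be merged into an existing ScaledObject spec, not applied on its own.

```yaml
# Fragment of a ScaledObject spec (sketch; values copied from the HPA example)
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 300  # Remove at most 1 replica every 5 min
```

This matters because cooldownPeriod governs only the final step down to zero replicas; scale-down between N and 1 replicas follows the HPA behavior.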
Strategy 3: GPU-Metric-Based HPA
Using DCGM GPU metrics directly:
# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gpu-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mistral-vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "80"  # Scale when GPU util > 80%
This requires the Prometheus Adapter to expose DCGM metrics through the custom metrics API.
Run:ai Autoscaling
If using Run:ai, configure replica autoscaling in the UI:
| Field | Value |
|---|---|
| Minimum replicas | 1 |
| Maximum replicas | 4 |
| Scale-to-zero | Never (production) or after idle period |
Run:ai handles GPU allocation and quota management automatically.
Monitoring Autoscaling
# Check HPA status
kubectl get hpa -n ai-inference
# Watch KEDA ScaledObject
kubectl get scaledobject -n ai-inference
# Check current replicas
kubectl get deployment mistral-vllm -n ai-inference
# View scaling events
kubectl get events -n ai-inference --sort-by=.lastTimestamp | grep -i "scal"
Recommended Autoscaling Settings
| Environment | Min Replicas | Max Replicas | Scale-to-Zero | Cooldown |
|---|---|---|---|---|
| Production | 2 | 8 | No | 5–10 min |
| Staging | 1 | 3 | Optional | 5 min |
| Development | 0 | 2 | Yes | 2 min |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| HPA shows <unknown> | Metrics not being scraped | Check ServiceMonitor and Prometheus targets |
| Never scales up | Threshold too high | Lower threshold; check metric values |
| Scales up and down rapidly | No stabilization window | Increase stabilizationWindowSeconds |
| New replica not serving | Model still loading | Increase readiness probe initialDelaySeconds |
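For the last row, a startupProbe is one way to give the model time to load without slowing readiness checks once the server is up. The sketch below assumes the vLLM OpenAI-compatible server listening on port 8000 with its /health endpoint; the container name and timing values are illustrative, not measured.

```yaml
# Container fragment (hypothetical name; timings are illustrative)
containers:
  - name: vllm
    ports:
      - name: http
        containerPort: 8000
    startupProbe:
      httpGet:
        path: /health
        port: http
      periodSeconds: 10
      failureThreshold: 30  # allows up to 10s x 30 = 300s for model loading
    readinessProbe:
      httpGet:
        path: /health
        port: http
      periodSeconds: 5
```

The startup budget should comfortably exceed the observed cold-start time, otherwise the kubelet restarts the container before the model finishes loading.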