AI β€’ Advanced β€’ ⏱ 30 minutes β€’ Kubernetes 1.28+

Autoscale LLM Inference on Kubernetes

Configure Horizontal Pod Autoscaling and KEDA for LLM workloads using GPU utilization, request queue depth, and custom metrics.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Use KEDA with Prometheus triggers to autoscale LLM replicas based on request queue depth or GPU utilization. Standard HPA works for CPU-based metrics. For GPU-aware scaling, scrape DCGM metrics (DCGM_FI_DEV_GPU_UTIL) or vLLM’s built-in /metrics endpoint (vllm:num_requests_waiting). Set minReplicas: 1 to avoid cold-start delays.

LLM inference workloads have variable demand. Autoscaling saves GPU costs during low traffic and prevents latency spikes during peaks.

Scaling Challenges for LLMs

| Challenge | Impact | Solution |
|---|---|---|
| Slow model loading | 30–120s cold start | Keep minReplicas β‰₯ 1 |
| GPU allocation | Must reserve a full GPU per replica | Use GPU fractioning or time-slicing |
| Memory requirements | Each replica needs the full model in VRAM | Plan the total GPU budget |
| Batch processing | vLLM batches dynamically | Scale on queue depth, not CPU |
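Time-slicing lets several replicas share one physical GPU. As an illustrative sketch, the NVIDIA device plugin (deployed via the GPU Operator) accepts a ConfigMap like the following; the ConfigMap name, namespace, and replica count here are assumptions to adapt to your cluster:

```yaml
# time-slicing-config.yaml (sketch; assumes the NVIDIA GPU Operator /
# device plugin with time-slicing support is installed)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs
```

Note that time-sliced replicas share VRAM, so this only helps when the model (or several small models) fits within a single GPU's memory.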

Strategy 1: HPA with Custom Metrics

vLLM Prometheus Metrics

vLLM exposes metrics at /metrics:

curl http://mistral-vllm:8000/metrics | grep vllm

Key scaling metrics:

| Metric | Description | Good for Scaling? |
|---|---|---|
| vllm:num_requests_running | Active requests | Yes |
| vllm:num_requests_waiting | Queued requests | Best |
| vllm:avg_generation_throughput_toks_per_s | Token throughput | Informational |
| vllm:gpu_cache_usage_perc | KV cache utilization | Yes |

Prometheus ServiceMonitor

# vllm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: mistral-vllm
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
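The ServiceMonitor selects a Service (not the pods directly) by label, and its endpoint port name must match a named port on that Service. A minimal sketch, assuming the vLLM Deployment's pods carry the label app: mistral-vllm and serve on port 8000:

```yaml
# vllm-service.yaml (sketch; labels and ports must match your Deployment)
apiVersion: v1
kind: Service
metadata:
  name: mistral-vllm
  namespace: ai-inference
  labels:
    app: mistral-vllm        # matched by the ServiceMonitor's selector
spec:
  selector:
    app: mistral-vllm
  ports:
    - name: http             # must match the ServiceMonitor endpoint port name
      port: 8000
      targetPort: 8000
```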

HPA with Prometheus Adapter

# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mistral-vllm-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mistral-vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"    # Scale up when >5 requests queued per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120    # Add 1 replica every 2 min max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300    # Remove 1 replica every 5 min max

Scale-down is deliberately slow because each replica holds significant GPU resources and model reload is expensive.
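For the HPA's Pods metric name vllm_num_requests_waiting to exist at all, the Prometheus Adapter needs a rule that translates vLLM's colon-delimited Prometheus metric into a custom-metrics API name. A sketch of such a rule; the exact Helm values layout and label names depend on your adapter deployment:

```yaml
# prometheus-adapter values snippet (sketch; verify against your
# adapter version's rule syntax)
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "vllm:num_requests_waiting"
        as: "vllm_num_requests_waiting"   # the name the HPA references
      metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```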

Strategy 2: KEDA

KEDA provides richer trigger options and simpler configuration than raw HPA with the Prometheus Adapter.

Install KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace

KEDA ScaledObject for vLLM

# vllm-keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mistral-vllm-keda
  namespace: ai-inference
spec:
  scaleTargetRef:
    name: mistral-vllm
  minReplicaCount: 1
  maxReplicaCount: 4
  cooldownPeriod: 300            # Wait 5 min before scaling down
  pollingInterval: 30
  triggers:
    # Scale on queued requests
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: vllm_waiting_requests
        query: |
          sum(vllm:num_requests_waiting{namespace="ai-inference"})
        threshold: "10"          # Scale up when total queue > 10
    # Optional: scale on GPU utilization
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: gpu_utilization
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{namespace="ai-inference"})
        threshold: "85"          # Scale up when avg GPU util > 85%

KEDA with Scale-to-Zero

For non-production or cost-sensitive environments:

spec:
  minReplicaCount: 0             # Scale to zero when idle
  maxReplicaCount: 3
  idleReplicaCount: 0
  cooldownPeriod: 600            # 10 min idle before scaling to zero

Warning: Scale-to-zero means 30–120 second cold start on the next request (model must reload into GPU memory).
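To keep a cold-starting replica from receiving traffic before the model is loaded, give the container a generous startup probe. A sketch, assuming vLLM's /health endpoint on the API port; the thresholds are illustrative and should match your observed load time:

```yaml
# Probe snippet for the vLLM container in the Deployment (sketch)
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 60       # allow up to 60 x 5s = 5 min for model load
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
```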

Strategy 3: GPU-Metric-Based HPA

Using DCGM GPU metrics directly:

# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gpu-hpa
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mistral-vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "80"     # Scale when GPU util > 80%

This requires the Prometheus Adapter to expose the DCGM metrics through the Kubernetes custom metrics API.
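A sketch of the corresponding adapter rule; the label names on DCGM series vary with how dcgm-exporter is scraped, so verify them against your Prometheus targets first:

```yaml
# prometheus-adapter rule sketch for DCGM metrics (label names are an
# assumption; check your dcgm-exporter series before applying)
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```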

Run:ai Autoscaling

If using Run:ai, configure replica autoscaling in the UI:

| Field | Value |
|---|---|
| Minimum replicas | 1 |
| Maximum replicas | 4 |
| Scale-to-zero | Never (production) or after idle period |

Run:ai handles GPU allocation and quota management automatically.

Monitoring Autoscaling

# Check HPA status
kubectl get hpa -n ai-inference

# Watch KEDA ScaledObject
kubectl get scaledobject -n ai-inference

# Check current replicas
kubectl get deployment mistral-vllm -n ai-inference

# View scaling events
kubectl get events -n ai-inference --sort-by=.lastTimestamp | grep -i "scal"

Recommended Settings by Environment

| Environment | Min Replicas | Max Replicas | Scale-to-Zero | Cooldown |
|---|---|---|---|---|
| Production | 2 | 8 | No | 5–10 min |
| Staging | 1 | 3 | Optional | 5 min |
| Development | 0 | 2 | Yes | 2 min |

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| HPA shows `<unknown>` | Metrics not being scraped | Check the ServiceMonitor and Prometheus targets |
| Never scales up | Threshold too high | Lower the threshold; check the metric values |
| Scales up and down rapidly | No stabilization window | Increase stabilizationWindowSeconds |
| New replica not serving | Model still loading | Increase the readiness probe initialDelaySeconds |

#autoscaling #hpa #keda #llm #inference #gpu #scaling #ai-workloads
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
