ai advanced ⏱ 20 minutes K8s 1.28+

LLM Deployment Challenges on Kubernetes

Address common LLM deployment challenges on Kubernetes: GPU memory management, model loading optimization, inference latency tuning, and batch scheduling.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Tackle the 5 biggest LLM deployment challenges: (1) GPU memory: use quantization (AWQ/GPTQ) to fit larger models; (2) model loading: pre-cache models on PVCs instead of pulling them at every pod start; (3) latency: tune batching limits such as vLLM's --max-num-seqs and --max-num-batched-tokens; (4) scaling: autoscale on request queue depth, not CPU; (5) multi-node: use tensor parallelism across GPUs, and across nodes when a model doesn't fit on one node's GPUs.

The Problem

Deploying LLMs on Kubernetes is fundamentally different from deploying web services. Models are 10-200GB, require specialized GPU hardware, have complex memory requirements, and exhibit non-linear latency under load. Standard Kubernetes patterns (HPA on CPU, small container images, horizontal scaling) do not carry over unchanged.

The Solution

Challenge 1: GPU Memory Management

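# Model size vs GPU memory (weights only)
# Llama-3 70B in FP16: ~140GB VRAM → needs at least 2× H100 (80GB each)
# Llama-3 70B in INT4 (AWQ): ~35GB VRAM → fits on 1× A100 (80GB)
# The manifest below still splits the AWQ model across 2 GPUs, which leaves more VRAM for the KV cache.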

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
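spec:
  # A Deployment must declare a selector that matches the pod template labels.
  # The label app=llm-server is an illustrative choice.
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec: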
      containers:
        - name: vllm
          image: registry.example.com/vllm:0.6.0
          args:
            - --model=/models/llama-3-70b-awq
            - --quantization=awq
            - --tensor-parallel-size=2
            - --max-model-len=4096
            - --gpu-memory-utilization=0.90
          resources:
            limits:
              nvidia.com/gpu: 2
              memory: 64Gi
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-storage
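
With --tensor-parallel-size=2, vLLM runs one worker process per GPU, and these workers exchange data through shared memory. The default /dev/shm in a container is small, so a memory-backed emptyDir is a common workaround. The fragment below is a sketch of the fields that could be added to the Deployment above; the volume name dshm and the 8Gi size are assumptions, not values from this recipe.

```yaml
# Fragment: only the added fields are shown, merge them into the llm-server Deployment.
spec:
  template:
    spec:
      containers:
        - name: vllm
          volumeMounts:
            - name: dshm              # hypothetical volume name
              mountPath: /dev/shm     # shared memory used by the tensor-parallel workers
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory            # tmpfs-backed, so it lives in RAM
            sizeLimit: 8Gi            # assumption: tune to your workload
```

A memory-backed emptyDir counts toward the container's memory usage, so the memory limit may need to grow by the same amount.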

Challenge 2: Model Loading Speed

# Problem: Downloading 70GB model from S3 takes 10+ minutes
# Solution: Pre-cache on PVC with ReadWriteMany

apiVersion: batch/v1
kind: Job
metadata:
  name: model-downloader
spec:
  template:
    spec:
      # Jobs require restartPolicy Never or OnFailure
      restartPolicy: OnFailure
      containers:
        - name: download
          image: registry.example.com/model-downloader:1.0
          command:
            - huggingface-cli
            - download
            - meta-llama/Llama-3-70B-AWQ
            - --local-dir=/models/llama-3-70b-awq
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-storage
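
Both the Job and the Deployment mount a claim named model-storage, which the recipe does not define. A minimal sketch of such a claim follows. The storage class name shared-fs and the 200Gi size are assumptions, and ReadWriteMany only works if the cluster offers a storage class that supports it (for example an NFS- or CephFS-backed one).

```yaml
# Hypothetical PVC backing the "model-storage" claim referenced above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany               # lets the Job write once and several replicas read
  storageClassName: shared-fs     # assumption: replace with an RWX-capable class in your cluster
  resources:
    requests:
      storage: 200Gi              # assumption: room for a few quantized checkpoints
```

Create the claim first, run the downloader Job to completion, and only then roll out the Deployment so that replicas never start against an empty cache.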

Challenge 3: Inference Latency

# vLLM serving config for optimal latency
args:
  - --max-num-batched-tokens=4096
  - --max-num-seqs=32
  - --enable-chunked-prefill
  - --disable-log-requests
  # KV cache optimization
  - --kv-cache-dtype=fp8_e5m2
  - --enable-prefix-caching
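
To see why the KV cache flags matter, a rough per-token estimate helps. The sketch below assumes the published Llama-3 70B architecture (80 layers, 8 key/value heads, head dimension 128), which this recipe does not state, and it ignores allocator overhead. Here b is the number of bytes per cached element.

```latex
\[
\text{KV bytes per token} = 2 \cdot n_{\text{layers}} \cdot n_{\text{kv heads}} \cdot d_{\text{head}} \cdot b
\]
\[
\text{FP16 } (b = 2):\quad 2 \cdot 80 \cdot 8 \cdot 128 \cdot 2 = 327{,}680\ \text{B} = 320\ \text{KiB}
\]
\[
\text{FP8 } (b = 1):\quad 2 \cdot 80 \cdot 8 \cdot 128 \cdot 1 = 163{,}840\ \text{B} = 160\ \text{KiB}
\]
```

At 32 concurrent sequences of 4,096 tokens each, that is about 40 GiB of cache in FP16 versus 20 GiB in FP8, which is where the halving from --kv-cache-dtype=fp8_e5m2 comes from.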

Challenge 4: Autoscaling on Queue Depth

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 600
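
The HPA above reads a per-pod custom metric, which only exists if something publishes it through the custom metrics API. vLLM exports the gauge as vllm:num_requests_waiting, so the underscore name used in the HPA implies a rename step. A minimal sketch of a prometheus-adapter rule that could do this follows. It assumes prometheus-adapter is installed and that Prometheus attaches namespace and pod labels to the scraped series; neither is stated in this recipe.

```yaml
# Sketch of a prometheus-adapter rule (goes under "rules:" in the adapter config).
rules:
  - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map the Prometheus label to the Kubernetes resource
        pod: {resource: "pod"}
    name:
      matches: "^vllm:num_requests_waiting$"
      as: "vllm_num_requests_waiting"        # name the HPA refers to
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

With the metric available, the HPA computes desiredReplicas = ceil(currentReplicas × currentValue / targetValue). For example, 2 replicas averaging 12 waiting requests against the target of 5 give ceil(4.8) = 5 replicas.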

Challenge 5: Model Size Reference

Model   FP16 VRAM   INT4 VRAM   Min GPUs (FP16)
7B      14GB        4GB         1× T4/A10
13B     26GB        7GB         1× A100-40
34B     68GB        17GB        1× A100-80
70B     140GB       35GB        2× A100-80
405B    810GB       203GB       16× H100 (two 8-GPU nodes; 8× 80GB = 640GB is not enough)

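
The VRAM figures in the table follow a simple rule of thumb: weight memory is the parameter count times the bytes per parameter (2 for FP16, about 0.5 for INT4). A worked example for the 70B row, counting weights only and ignoring KV cache, activations, and quantization metadata:

```latex
\[
\text{VRAM}_{\text{weights}} \approx N_{\text{params}} \times b_{\text{bytes/param}}
\]
\[
\text{FP16: } 70 \times 10^{9} \times 2\ \text{B} = 140\ \text{GB}, \qquad
\text{INT4: } 70 \times 10^{9} \times 0.5\ \text{B} = 35\ \text{GB}
\]
\[
\text{Min GPUs (FP16, 80 GB each)} = \left\lceil 140 / 80 \right\rceil = 2
\]
```

Treat the result as a lower bound: a real deployment also needs room for the KV cache, so the usable budget per GPU is smaller than its nominal capacity.
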
graph TD
    MODEL[LLM Model<br/>70B params] -->|FP16: 140GB| MULTI[Multi-GPU<br/>Tensor Parallel]
    MODEL -->|INT4: 35GB| SINGLE[Single GPU<br/>A100-80GB]
    
    MULTI -->|Load from| PVC[PVC Model Cache<br/>Pre-downloaded]
    SINGLE -->|Load from| PVC
    
    PVC -->|10s load| FAST[Fast startup ✅]
    S3[S3 Download] -->|10min load| SLOW[Slow startup ❌]
    
    HPA[HPA on queue depth] -->|Scale| MULTI
    HPA -->|Scale| SINGLE

Common Issues

OOMKilled during model loading

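Model loading can temporarily use more host memory than steady-state inference, and exceeding the container memory limit gets the pod OOMKilled. Set the memory limit about 50% higher than the size of the model files. If the failure is instead a CUDA out-of-memory error on the GPU, lower --gpu-memory-utilization (for example to 0.85) to leave headroom.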

Inference latency spikes under load

Reduce --max-num-seqs to limit concurrent requests per instance, and scale horizontally instead of overloading one replica.

Best Practices

  • Pre-cache models on PVCs: never download at pod startup
  • Quantize aggressively: AWQ INT4 cuts weight memory by about 75% versus FP16, typically with under 1% accuracy loss (validate on your own workload)
  • Autoscale on queue depth, not CPU: LLM workloads are GPU-bound
  • Slow scale-down (600s): model loading is expensive, so avoid thrashing
  • FP8 KV cache: roughly halves KV cache memory versus FP16 with minimal quality impact

Key Takeaways

  • LLMs require fundamentally different deployment patterns than web services
  • GPU memory is the primary constraint: use quantization to fit larger models
  • Pre-cache models on a PVC: S3 downloads at pod startup cause 10+ minute cold starts
  • Autoscale on inference queue depth: CPU and memory utilization say little about GPU-bound load
  • Use multi-node tensor parallelism for models that don't fit on one node's GPUs
#llm #deployment #gpu-memory #inference #optimization
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
