ai intermediate ⏱ 20 minutes K8s 1.28+

Deploy vLLM OpenAI Container on Kubernetes

Deploy the vLLM OpenAI-compatible server container on Kubernetes. Pull ghcr.io/vllm-project/vllm-openai, configure GPU resources, and set up model loading.

By Luca Berton · 5 min read

Quick Answer: Deploy vLLM's OpenAI-compatible server with the ghcr.io/vllm-project/vllm-openai image (pin a version tag such as v0.8.0 rather than latest). Create a Deployment with GPU resources, mount a model cache PVC, and expose it via a Service. The container serves /v1/completions, /v1/chat/completions, and /v1/embeddings (for embedding models), making it a drop-in replacement for the OpenAI API backed by open models.

The Problem

You want to serve open-source LLMs (Llama, Mistral, Qwen, etc.) with an OpenAI-compatible API so existing application code works without changes. The vllm-openai container image from ghcr.io/vllm-project/vllm-openai provides exactly this: a high-performance inference server with continuous batching, PagedAttention, and tensor parallelism.

flowchart LR
    CLIENT["Application<br/>(OpenAI SDK)"] -->|"/v1/chat/completions"| SVC["K8s Service"]
    SVC --> POD1["vLLM Pod<br/>(GPU 0-1)"]
    SVC --> POD2["vLLM Pod<br/>(GPU 0-1)"]
    POD1 --> MODEL["Shared Model<br/>Cache (PVC)"]
    POD2 --> MODEL

The Solution

Basic Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: ghcr.io/vllm-project/vllm-openai:v0.8.0
          args:
            - --model=meta-llama/Llama-3.1-8B-Instruct
            - --tensor-parallel-size=1
            - --max-model-len=8192
            - --gpu-memory-utilization=0.90
            - --port=8000
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
            requests:
              nvidia.com/gpu: 1
              memory: 16Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120     # Model loading takes time
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm-server
  ports:
    - port: 8000
      targetPort: 8000
      name: http
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: "hf_xxxxxxxxxxxx"       # Your HuggingFace token
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
spec:
  accessModes: ["ReadWriteOnce"]   # Use ReadWriteMany if pods on different nodes share this cache
  resources:
    requests:
      storage: 100Gi               # Enough for several models
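
The probes above keep traffic away from the pod until /health answers. A client or CI job can apply the same idea before sending its first request. The following is a minimal sketch, assuming the Service name and port from the manifest; the helper names http_health_check and wait_until_ready are hypothetical and not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it succeeds or the next attempt would pass the deadline."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)


# Example (assumes in-cluster DNS resolves the Service name):
# ready = wait_until_ready(
#     lambda: http_health_check("http://vllm-server:8000/health")
# )
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting; in normal use the defaults are enough.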

Multi-GPU with Tensor Parallelism

# Llama 3.1 70B needs 4× A100 80GB (or 8× A100 40GB)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      containers:
        - name: vllm
          image: ghcr.io/vllm-project/vllm-openai:v0.8.0
          args:
            - --model=meta-llama/Llama-3.1-70B-Instruct
            - --tensor-parallel-size=4
            - --max-model-len=8192
            - --gpu-memory-utilization=0.92
            - --enable-chunked-prefill
            - --max-num-batched-tokens=8192
            - --port=8000
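          env:
            - name: HUGGING_FACE_HUB_TOKEN   # Llama 3.1 is a gated model
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 4
              memory: 200Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300   # 70B takes longer to load
          volumeMounts:
            - name: shm
              mountPath: /dev/shm      # Shared memory for multi-GPU communication
      volumes:
        - name: shm
          emptyDir:
            medium: Memory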

Model Sizing Guide

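| Model | Parameters | GPUs Needed | VRAM (FP16 weights) | TP Size |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 1× A100 80GB | ~16GB | 1 |
| Mistral 7B | 7B | 1× A100 80GB | ~14GB | 1 |
| Qwen2.5 14B | 14B | 1× A100 80GB | ~28GB | 1 |
| Llama 3.1 70B | 70B | 4× A100 80GB | ~140GB | 4 |
| Mixtral 8x7B | 47B (MoE) | 2× A100 80GB | ~90GB | 2 |
| Llama 3.1 405B | 405B | 8× H100 80GB (FP8 weights) | ~810GB in FP16, ~405GB in FP8 | 8 |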
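
The VRAM figures in this guide track a simple rule of thumb: roughly 2 bytes per parameter for FP16/BF16 weights. The sketch below applies that rule and compares the result with the memory vLLM is allowed to use under --gpu-memory-utilization. It is a rough estimate only: it ignores activation memory and per-model KV cache sizing, and the names MemoryEstimate and estimate_memory are hypothetical helpers, not vLLM APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEstimate:
    weights_gb: float      # memory taken by the model weights
    budget_gb: float       # memory vLLM may use across all GPUs
    kv_headroom_gb: float  # what remains for KV cache and activations

    @property
    def fits(self) -> bool:
        return self.kv_headroom_gb > 0


def estimate_memory(
    params_billion: float,
    num_gpus: int,
    gpu_gb: float = 80.0,
    bytes_per_param: float = 2.0,        # FP16/BF16; use 1.0 for 8-bit weights
    gpu_memory_utilization: float = 0.90,
) -> MemoryEstimate:
    """Estimate weight memory against the total GPU memory budget."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    budget_gb = num_gpus * gpu_gb * gpu_memory_utilization
    return MemoryEstimate(weights_gb, budget_gb, budget_gb - weights_gb)


for name, params, gpus in [("Llama 3.1 8B", 8, 1), ("Llama 3.1 70B", 70, 4)]:
    est = estimate_memory(params, gpus)
    print(
        f"{name}: weights ~{est.weights_gb:.0f} GB, "
        f"budget {est.budget_gb:.0f} GB, fits={est.fits}"
    )
```

A positive headroom is necessary but not sufficient: long contexts and high concurrency need a large KV cache, which is why the guide recommends more GPUs than the weights alone would require.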

HPA Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm:num_requests_running   # Pods metrics require a custom metrics adapter (e.g. Prometheus Adapter)
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Don't scale down too fast
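
To reason about how this HPA reacts to load, it helps to work through the scaling rule documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min and max. The sketch below is a simplified model of that rule. It assumes the default 10% tolerance and ignores stabilization windows and not-yet-ready pods; the function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_avg: float,          # observed average of the metric per pod
    target_avg: float = 10.0,    # averageValue from the HPA spec
    min_replicas: int = 1,
    max_replicas: int = 4,
    tolerance: float = 0.10,     # assumed default HPA tolerance
) -> int:
    """Simplified HPA calculation for an AverageValue Pods metric."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# One pod handling 18 running requests against a target of 10 per pod:
print(desired_replicas(current_replicas=1, current_avg=18))
# Two pods at 25 each would ask for 5 replicas, capped by maxReplicas:
print(desired_replicas(current_replicas=2, current_avg=25))
```

Because each new replica must pull the image and load the model before it becomes ready, scale-up relief arrives minutes after the decision, which is worth keeping in mind when choosing the target value.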

Query the Server

# Chat completions (same as OpenAI API)
curl http://vllm-server:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Kubernetes pods in one paragraph."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

# Text completions
curl http://vllm-server:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Kubernetes is",
    "max_tokens": 64
  }'

# List available models
curl http://vllm-server:8000/v1/models

# Prometheus metrics
curl http://vllm-server:8000/metrics
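
The /metrics endpoint returns the Prometheus text exposition format, which is easy to inspect by hand when no Prometheus stack is available. The sketch below is a minimal parser for that format, not a replacement for a real Prometheus client. The sample payload is illustrative rather than captured from a server, and vllm:num_requests_waiting is assumed to be exposed alongside the vllm:num_requests_running metric used by the HPA.

```python
from typing import Dict

# Illustrative payload; real output contains many more series.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 2.0
process_cpu_seconds_total 12.5
"""


def parse_metrics(metrics_text: str, prefix: str = "vllm:") -> Dict[str, float]:
    """Sum sample values per metric name for names starting with `prefix`."""
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{label="..."} value [timestamp]
            name = line[: line.index("{")]
            value_part = line[line.rindex("}") + 1:]
        else:
            # name value [timestamp]
            name, _, value_part = line.partition(" ")
        if not name.startswith(prefix):
            continue
        fields = value_part.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore malformed samples
        totals[name] = totals.get(name, 0.0) + value
    return totals


metrics = parse_metrics(SAMPLE)
print(metrics.get("vllm:num_requests_running"))
```

Summing across label sets is only meaningful for gauges and counters; histogram buckets would need separate handling.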

Use with OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-server:8000/v1",
    api_key="not-needed"             # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Write a haiku about Kubernetes"}
    ],
    max_tokens=64
)
print(response.choices[0].message.content)
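
For interactive use, the same call can stream tokens as they are generated by passing stream=True, which the OpenAI SDK supports for chat completions. The sketch below assumes the Service name and model from the earlier examples; iter_text, stream_chat, and fake_chunk are hypothetical helper names, and fake_chunk only mimics the shape of the SDK's chunk objects so the text extraction can be tried without a running server.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Optional


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the text fragments carried by streamed chat-completion chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        content = chunk.choices[0].delta.content
        if content:
            yield content


def stream_chat(prompt: str, base_url: str = "http://vllm-server:8000/v1") -> str:
    """Stream a chat completion, printing tokens as they arrive."""
    from openai import OpenAI  # imported lazily; requires the openai package

    client = OpenAI(base_url=base_url, api_key="not-needed")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
        stream=True,
    )
    parts = []
    for text in iter_text(stream):
        print(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Stand-in for an SDK chunk, for offline experiments only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Offline check of the extraction logic:
print("".join(iter_text([fake_chunk("Pods "), fake_chunk(None), fake_chunk("schedule")])))
```

Streaming does not change total generation time, but it lowers the perceived latency because the first tokens appear as soon as they are produced.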

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| OOM on model load | Model too large for GPU VRAM | Increase tensor-parallel-size or use smaller model |
| Slow first response | Model loading into GPU | Set initialDelaySeconds high on probes |
| CUDA out of memory during inference | gpu-memory-utilization too high | Lower to 0.85-0.90 |
| Image pull slow | 10-20GB container image | Use imagePullPolicy: IfNotPresent + node pre-pull |
| Model download on every restart | No persistent cache | Mount PVC at /root/.cache/huggingface |
| 401 on model download | Gated model needs HF token | Set HUGGING_FACE_HUB_TOKEN secret |
Best Practices

  • Pin image version: use v0.8.0, not latest, for reproducibility
  • Use PVC for model cache: avoids re-downloading on pod restarts
  • Set readiness probe with long delay: model loading takes 30-300s
  • Mount /dev/shm as Memory: required for multi-GPU tensor parallelism
  • Enable chunked prefill for long contexts: --enable-chunked-prefill
  • Monitor with /metrics endpoint: queue depth, latency, throughput
  • Use gpu-memory-utilization=0.90: vLLM preallocates that share for weights and KV cache, leaving about 10% of VRAM for the CUDA context and other allocations

Key Takeaways

  • ghcr.io/vllm-project/vllm-openai provides drop-in OpenAI API compatibility
  • Serves /v1/completions, /v1/chat/completions, /v1/embeddings
  • Continuous batching + PagedAttention for high throughput
  • Tensor parallelism across multiple GPUs for large models
  • Existing OpenAI SDK code works with just a base_url change
  • PVC model cache + proper probes = production-ready deployment