Deploy vLLM OpenAI Container on Kubernetes
Deploy the vLLM OpenAI-compatible server container on Kubernetes. Pull ghcr.io/vllm-project/vllm-openai, configure GPU resources, and set up model loading.
💡 Quick Answer: Deploy vLLM's OpenAI-compatible server with ghcr.io/vllm-project/vllm-openai:latest (pin a release tag such as v0.8.0 for production). Create a Deployment with GPU resources, mount a model cache PVC, and expose it via a Service. The container serves /v1/completions, /v1/chat/completions, and /v1/embeddings (the last one when an embedding model is loaded), making it a drop-in replacement for the OpenAI API with any open model.
The Problem
You want to serve open-source LLMs (Llama, Mistral, Qwen, etc.) with an OpenAI-compatible API so existing application code works without changes. The vllm-openai container image from ghcr.io/vllm-project/vllm-openai provides exactly this: a high-performance inference server with continuous batching, PagedAttention, and tensor parallelism.
flowchart LR
CLIENT["Application<br/>(OpenAI SDK)"] -->|"/v1/chat/completions"| SVC["K8s Service"]
SVC --> POD1["vLLM Pod<br/>(GPU 0-1)"]
SVC --> POD2["vLLM Pod<br/>(GPU 0-1)"]
POD1 --> MODEL["Shared Model<br/>Cache (PVC)"]
POD2 --> MODEL

The Solution
Basic Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: ghcr.io/vllm-project/vllm-openai:v0.8.0
          args:
            - --model=meta-llama/Llama-3.1-8B-Instruct
            - --tensor-parallel-size=1
            - --max-model-len=8192
            - --gpu-memory-utilization=0.90
            - --port=8000
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
            requests:
              nvidia.com/gpu: 1
              memory: 16Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120  # Model loading takes time
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm-server
  ports:
    - port: 8000
      targetPort: 8000
      name: http
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: "hf_xxxxxxxxxxxx"  # Your HuggingFace token
---
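apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
spec:
  # ReadWriteOnce can only be mounted from one node at a time.
  # Use ReadWriteMany if replicas on different nodes share this cache.
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi  # Enough for several models

Multi-GPU with Tensor Parallelism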
# Llama 3.1 70B needs 4× A100 80GB (or 8× A100 40GB)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      containers:
        - name: vllm
          image: ghcr.io/vllm-project/vllm-openai:v0.8.0
          args:
            - --model=meta-llama/Llama-3.1-70B-Instruct
            - --tensor-parallel-size=4
            - --max-model-len=8192
            - --gpu-memory-utilization=0.92
            - --enable-chunked-prefill
            - --max-num-batched-tokens=8192
            - --port=8000
          resources:
            limits:
              nvidia.com/gpu: 4
              memory: 200Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300  # 70B takes longer to load
          # Same /dev/shm mount as the basic Deployment; tensor parallelism needs shared memory
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi

Model Sizing Guide
| Model | Parameters | GPUs Needed | VRAM (FP16 weights) | TP Size |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 1× A100 80GB | ~16GB | 1 |
| Mistral 7B | 7B | 1× A100 80GB | ~14GB | 1 |
| Qwen2.5 14B | 14B | 1× A100 80GB | ~28GB | 1 |
| Llama 3.1 70B | 70B | 4× A100 80GB | ~140GB | 4 |
| Mixtral 8x7B | 47B (MoE) | 2× A100 80GB | ~90GB | 2 |
| Llama 3.1 405B | 405B | 8× H100 80GB (FP8 weights; FP16 exceeds the 640GB total) | ~810GB (~405GB in FP8) | 8 |
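
The VRAM column follows a weights-only rule of thumb: parameter count times bytes per parameter (2 bytes for FP16/BF16). The sketch below reproduces that arithmetic under stated assumptions. The function names are hypothetical, `usable_fraction` mirrors `--gpu-memory-utilization`, and rounding the GPU count up to a power of two is a common convention rather than a vLLM rule. The result is a lower bound, since KV cache and activations also need room, which is why the table recommends more GPUs than the weights alone require.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights only, in GB (1B params * 1 byte = 1 GB)."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(
    params_billion: float,
    gpu_vram_gb: float,
    bytes_per_param: float = 2.0,
    usable_fraction: float = 0.90,
) -> int:
    """Smallest GPU count whose usable VRAM holds the weights (ignores KV cache)."""
    usable_per_gpu = gpu_vram_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / usable_per_gpu)


def next_power_of_two(n: int) -> int:
    """Round a GPU count up to 1, 2, 4, 8, ... (assumed convention for TP sizes)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


if __name__ == "__main__":
    for name, params in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
        gpus = min_gpus_for_weights(params, gpu_vram_gb=80)
        print(name, f"{weight_memory_gb(params):.0f} GB weights,",
              f"at least {gpus} x 80GB GPUs, TP candidate {next_power_of_two(gpus)}")
```

Passing `bytes_per_param=1.0` models FP8 weights, which is how a 405B model drops from roughly 810GB to roughly 405GB.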
HPA Autoscaling
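
The HorizontalPodAutoscaler in this section targets an average of 10 running requests per pod. To get a feel for how that target turns into replica counts, here is a minimal sketch of the replica formula described in the Kubernetes HPA documentation: `ceil(currentReplicas * currentValue / targetValue)`, skipped when the ratio is within a tolerance (0.1 by default), then clamped to the min/max bounds. The function name is hypothetical, and stabilization windows are not modeled.

```python
import math


def desired_replicas(
    current_replicas: int,
    avg_metric: float,
    target: float,
    min_replicas: int = 1,
    max_replicas: int = 4,
    tolerance: float = 0.10,
) -> int:
    """Approximate the HPA replica calculation for an AverageValue Pods metric."""
    ratio = avg_metric / target
    # Within tolerance of the target: the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # One pod averaging 25 running requests against a target of 10.
    print(desired_replicas(current_replicas=1, avg_metric=25, target=10))
```

With the bounds used in this section (1 to 4 replicas), sustained load above roughly 40 running requests per pod can no longer be absorbed by scaling out.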
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # Pods metrics require a custom metrics adapter (e.g. Prometheus Adapter)
    # that exposes this vLLM metric through the custom metrics API.
    - type: Pods
      pods:
        metric:
          name: vllm:num_requests_running
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Don't scale down too fast

Query the Server
# Chat completions (same as OpenAI API)
curl http://vllm-server:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Kubernetes pods in one paragraph."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

# Text completions
curl http://vllm-server:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Kubernetes is",
    "max_tokens": 64
  }'

# List available models
curl http://vllm-server:8000/v1/models

# Prometheus metrics
curl http://vllm-server:8000/metrics

Use with OpenAI Python SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-server:8000/v1",
    api_key="not-needed",  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Write a haiku about Kubernetes"}
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)

Common Issues
| Issue | Cause | Fix |
|---|---|---|
| OOM on model load | Model too large for GPU VRAM | Increase tensor-parallel-size or use smaller model |
| Slow first response | Model loading into GPU | Set initialDelaySeconds high on probes |
| CUDA out of memory during inference | gpu-memory-utilization too high | Lower to 0.85-0.90 |
| Image pull slow | 10-20GB container image | Use imagePullPolicy: IfNotPresent + node pre-pull |
| Model download on every restart | No persistent cache | Mount PVC at /root/.cache/huggingface |
| 401 on model download | Gated model needs HF token | Set HUGGING_FACE_HUB_TOKEN secret |
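
The "Slow first response" row also affects clients and CI jobs that call the server right after a rollout. One option is to poll `/health` until the model has loaded. The helper below is a minimal sketch using only the standard library; `health_ok` and `wait_until_ready` are hypothetical names, and the 300-second deadline is an assumption taken from the load times quoted in this article.

```python
import time
import urllib.request
from typing import Callable


def health_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET url answers HTTP 200.

    Connection errors, timeouts, and non-2xx responses all count as not ready.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts are OSError subclasses
        return False


def wait_until_ready(
    check: Callable[[], bool],
    deadline_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() repeatedly until it returns True or deadline_s has elapsed."""
    start = clock()
    while True:
        if check():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: health_ok("http://vllm-server:8000/health"))`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.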
Best Practices
- Pin image version: use v0.8.0, not latest, for reproducibility
- Use PVC for model cache: avoids re-downloading on pod restarts
- Set readiness probe with long delay: model loading takes 30-300s
- Mount /dev/shm as Memory: required for multi-GPU tensor parallelism
- Enable chunked prefill for long contexts: --enable-chunked-prefill
- Monitor with /metrics endpoint: queue depth, latency, throughput
- Use gpu-memory-utilization=0.90: vLLM preallocates that fraction of VRAM for weights and KV cache, leaving about 10% as headroom
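
To make the monitoring item concrete, the sketch below parses the Prometheus text format returned by `/metrics` and sums each metric across label sets. It is a simplified parser, not a replacement for a Prometheus client library. `vllm:num_requests_running` comes from the HPA example in this article; `vllm:num_requests_waiting` and the `model_name` label are assumptions about what the server exposes, and `SAMPLE` is hand-written illustration, not captured server output.

```python
from collections import defaultdict

# Hand-written sample in Prometheus text format (hypothetical values).
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring comments and labels."""
    totals: defaultdict[str, float] = defaultdict(float)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            name, *fields = line.split()
        if not fields:
            continue  # malformed line without a value
        try:
            totals[name] += float(fields[0])  # fields[1], if present, is a timestamp
        except ValueError:
            continue
    return dict(totals)


if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    print("running:", metrics.get("vllm:num_requests_running"))
    print("waiting:", metrics.get("vllm:num_requests_waiting"))
```

A waiting count that stays above zero while the running count is flat is a rough sign that the pod is saturated and another replica may help.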
Key Takeaways
- ghcr.io/vllm-project/vllm-openai provides drop-in OpenAI API compatibility
- Serves /v1/completions, /v1/chat/completions, /v1/embeddings
- Continuous batching + PagedAttention for high throughput
- Tensor parallelism across multiple GPUs for large models
- Existing OpenAI SDK code works with just a base_url change
- PVC model cache + proper probes = production-ready deployment

