Deploy Mistral 7B with vLLM on Kubernetes
Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.
💡 Quick Answer: Run vLLM with
python -m vllm.entrypoints.openai.api_server --model /data/Mistral-7B-v0.1 --dtype bfloat16 --tensor-parallel-size 1. Mount model weights via PVC at /data. Set HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 for air-gapped clusters. The API is OpenAI-compatible on port 8000. Important: the model ID in API calls must match the exact path shown by
/v1/models (e.g., /data/Mistral-7B-v0.1).
vLLM is a high-throughput inference engine for LLMs that exposes an OpenAI-compatible API. This recipe walks through deploying Mistral-7B-v0.1 on Kubernetes using vLLM with GPU fractioning.
Architecture Overview
┌─────────────────────────────────────────────┐
│  Kubernetes / OpenShift Cluster             │
│                                             │
│  ┌────────────────────────────────────────┐ │
│  │ Inference Pod (vLLM)                   │ │
│  │  - python -m vllm...openai.api_server  │ │
│  │  - Port 8000 (HTTP)                    │ │
│  │  - GPU: 0.5–1.0 (fractioning)          │ │
│  │  - Volume: /data (PVC)                 │ │
│  └────────────────────────────────────────┘ │
│                                             │
│  ┌──────────────┐   ┌─────────────────────┐ │
│  │ PVC / S3     │   │ Ingress / Route     │ │
│  │ Model files  │   │ HTTPS → port 8000   │ │
│  └──────────────┘   └─────────────────────┘ │
└─────────────────────────────────────────────┘

Prerequisites
1) Model Weights on a PVC
Your PVC should contain the full Mistral-7B-v0.1 directory:
/data/Mistral-7B-v0.1/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── special_tokens_map.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
└── model.safetensors.index.json

2) Container Image
Use a vLLM image built with CUDA support. Example:
registry.example.com/org/vllm-cuda:latest

Deployment Manifest
# mistral-vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-vllm
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-vllm
  template:
    metadata:
      labels:
        app: mistral-vllm
    spec:
      containers:
        - name: vllm
          image: registry.example.com/org/vllm-cuda:latest
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /data/Mistral-7B-v0.1
            - --download-dir
            - /data
            - --dtype
            - bfloat16
            - --tensor-parallel-size
            - "1"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          env:
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: TRANSFORMERS_OFFLINE
              value: "1"
            - name: VLLM_NO_USAGE_STATS
              value: "1"
          resources:
            limits:
              nvidia.com/gpu: "1"   # or fractional via GPU operator
            requests:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-data
              mountPath: /data
              readOnly: true
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-vllm
  namespace: ai-inference
spec:
  selector:
    app: mistral-vllm
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http

Environment Variables Explained
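vLLM's OpenAI-compatible server exposes a lightweight GET /health endpoint, which makes Kubernetes probes straightforward. A sketch that could be added under the container spec (the timing values are illustrative; model loading can take a minute or more, hence the generous initialDelaySeconds):

```yaml
# Sketch: health probes for the vLLM container (timings are assumptions)
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60    # allow time for weights to load
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
```

With a readiness probe in place, the Service only routes traffic once the model has actually finished loading.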
| Variable | Value | Purpose |
|---|---|---|
| HF_HUB_OFFLINE | 1 | Prevents downloads from Hugging Face Hub |
| TRANSFORMERS_OFFLINE | 1 | Forces transformers to use local files only |
| VLLM_NO_USAGE_STATS | 1 | Disables telemetry |
These are critical for air-gapped or disconnected environments.
GPU Fractioning
If your cluster supports GPU fractioning (e.g., Run:ai, MIG, or time-slicing):
resources:
  limits:
    nvidia.com/gpu: "1"
  requests:
    nvidia.com/gpu: "1"

With Run:ai or similar schedulers, configure the fractional GPU (e.g., 50%) through the platform UI rather than the manifest.
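For clusters without Run:ai, the NVIDIA GPU Operator's time-slicing feature gives a similar effect: several pods share one physical GPU while each still requests nvidia.com/gpu: "1". A sketch of the device-plugin ConfigMap (the name, namespace, and replica count are illustrative; consult your GPU Operator setup for the exact wiring):

```yaml
# Sketch: time-slicing config for the NVIDIA device plugin (values are assumptions)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2   # each physical GPU is advertised as 2 schedulable GPUs
```

Note that time-slicing shares compute but not memory: both pods still contend for the same VRAM, so it only works if the combined models fit.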
Mistral-7B requirements:
- Minimum: ~14 GB VRAM (bfloat16)
- Recommended: 24+ GB VRAM for production batch sizes
- Works on: A10, A30, A100, H100
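The ~14 GB floor comes straight from the parameter count. A sketch of the arithmetic, assuming two bytes per parameter in bfloat16 (vLLM additionally reserves headroom for the KV cache, controlled by --gpu-memory-utilization, which defaults to 0.9):

```python
# Back-of-the-envelope VRAM estimate for serving a dense LLM.
# Weights dominate; vLLM also pre-allocates KV-cache space on top of this.

def weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just for the model weights, in GB.

    bytes_per_param: 2 for bfloat16/float16, 4 for float32,
    ~0.5-1 for 4/8-bit quantized formats.
    """
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weight_vram_gb(7.0))  # 14.0 -> ~14 GB for Mistral-7B in bf16
```

This is why a 16 GB card is tight for production batch sizes: after ~14 GB of weights, little is left for the KV cache, which bounds concurrency.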
Verify Deployment
# Check pod is running
kubectl get pods -n ai-inference -l app=mistral-vllm
# Check logs for successful startup
kubectl logs -n ai-inference deployment/mistral-vllm | tail -20
# List available models
curl -k https://<inference-endpoint>/v1/models

Expected /v1/models response:
{
  "object": "list",
  "data": [{
    "id": "/data/Mistral-7B-v0.1",
    "object": "model",
    "owned_by": "vllm",
    "max_model_len": 32768
  }]
}

Important: Model ID in API Calls
vLLM uses the exact model path as the model ID. You must use it as-is:
# Correct: uses the exact ID from /v1/models
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/Mistral-7B-v0.1",
    "prompt": "Write a one-line greeting:",
    "max_tokens": 32
  }'

# Wrong: this returns 404
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Mistral-7B-v0.1",
    "prompt": "Write a one-line greeting:",
    "max_tokens": 32
  }'

The second call fails with:

{"error": {"message": "The model `Mistral-7B-v0.1` does not exist.", "type": "NotFoundError", "code": 404}}

Chat vs Completions
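A simple client-side habit avoids this 404 entirely: read the model ID from /v1/models instead of hardcoding it. A sketch (the parsing helper below is illustrative, not part of any vLLM client library):

```python
import json

def served_model_id(models_response: str) -> str:
    """Extract the first served model ID from a /v1/models JSON response.

    vLLM reports the model under the exact path it was loaded from,
    so this ID must be echoed back verbatim in completion requests.
    """
    payload = json.loads(models_response)
    return payload["data"][0]["id"]

# Response shape as shown above for this deployment:
sample = ('{"object": "list", "data": [{"id": "/data/Mistral-7B-v0.1", '
          '"object": "model", "owned_by": "vllm", "max_model_len": 32768}]}')
print(served_model_id(sample))  # /data/Mistral-7B-v0.1
```

The returned string is then passed as "model" in the /v1/completions request body, so renaming the mount path never silently breaks clients.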
Mistral-7B-v0.1 (base model) does not include a chat template:
| Endpoint | Works? | Notes |
|---|---|---|
| /v1/completions | Yes | Use this for base Mistral |
| /v1/chat/completions | No | Requires a model with a chat template (e.g., Mistral-7B-Instruct) |
If you need /v1/chat/completions, deploy Mistral-7B-Instruct-v0.2 or newer instruct-tuned variants instead.
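When targeting an instruct-tuned checkpoint through /v1/completions, you can format the prompt yourself. The [INST] ... [/INST] wrapper below follows Mistral's instruct convention; treat it as a sketch, and note that the base v0.1 model was not trained on these tags, so plain prompts work better there:

```python
def format_instruct_prompt(user_message: str) -> str:
    """Wrap a user message in Mistral's instruct-style [INST] tags.

    Intended for /v1/completions against an instruct-tuned checkpoint
    (e.g., Mistral-7B-Instruct-v0.2); the base model has no chat
    template and expects plain continuation prompts instead.
    """
    return f"<s>[INST] {user_message} [/INST]"

print(format_instruct_prompt("Write a one-line greeting:"))
```

With a model that ships a chat template, prefer /v1/chat/completions and let the server apply the template for you.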
Run:ai Deployment (UI)
If using Run:ai, configure:
| Field | Value |
|---|---|
| Inference type | Custom |
| Image URL | registry.example.com/org/vllm-cuda:latest |
| Image pull | Only if not present (recommended) |
| Container port | 8000 (HTTP) |
| Command | python -m vllm.entrypoints.openai.api_server |
| Arguments | --model /data/Mistral-7B-v0.1 --download-dir /data --dtype bfloat16 --tensor-parallel-size 1 |
| Env: HF_HUB_OFFLINE | 1 |
| Env: TRANSFORMERS_OFFLINE | 1 |
| Env: VLLM_NO_USAGE_STATS | 1 |
| GPU devices | 1 |
| GPU fraction | 50% |
| Data origin (PVC) | your-model-storage-pvc |
| Container path | /data |
| Priority | high or very-high |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| 404 on /v1/completions | Wrong model name | Use exact ID from /v1/models |
| Chat template error | Base model has no template | Use /v1/completions or switch to Instruct variant |
| Pod OOMKilled | Insufficient GPU memory | Increase GPU fraction or use quantized model |
| Slow first request | Model loading / warmup | Wait 30–60s after pod starts |