AI · Intermediate · ⏱ 30 minutes · K8s 1.28+

Deploy Mistral 7B with vLLM on Kubernetes

Step-by-step guide to deploy Mistral-7B-v0.1 using vLLM as an OpenAI-compatible inference server on Kubernetes with GPU fractioning.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Run vLLM with python -m vllm.entrypoints.openai.api_server --model /data/Mistral-7B-v0.1 --dtype bfloat16 --tensor-parallel-size 1. Mount model weights via PVC at /data. Set HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 for air-gapped clusters. The API is OpenAI-compatible on port 8000.

Important: The model ID in API calls must match the exact path shown by /v1/models (e.g., /data/Mistral-7B-v0.1).

vLLM is a high-throughput inference engine for LLMs that exposes an OpenAI-compatible API. This recipe walks through deploying Mistral-7B-v0.1 on Kubernetes using vLLM with GPU fractioning.

Architecture Overview

┌──────────────────────────────────────────────┐
│  Kubernetes / OpenShift Cluster              │
│                                              │
│  ┌────────────────────────────────────────┐  │
│  │  Inference Pod (vLLM)                  │  │
│  │  - python -m vllm...openai.api_server  │  │
│  │  - Port 8000 (HTTP)                    │  │
│  │  - GPU: 0.5–1.0 (fractioning)          │  │
│  │  - Volume: /data (PVC)                 │  │
│  └────────────────────────────────────────┘  │
│                                              │
│  ┌──────────────┐   ┌─────────────────────┐  │
│  │ PVC / S3     │   │ Ingress / Route     │  │
│  │ Model files  │   │ HTTPS → port 8000   │  │
│  └──────────────┘   └─────────────────────┘  │
└──────────────────────────────────────────────┘

Prerequisites

1) Model Weights on a PVC

Your PVC should contain the full Mistral-7B-v0.1 directory:

/data/Mistral-7B-v0.1/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── special_tokens_map.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
└── model.safetensors.index.json
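A missing shard or tokenizer file only surfaces as a crash at pod startup, so it is worth confirming the PVC holds a complete copy of the weights before deploying. The sketch below is a hypothetical helper, not part of vLLM; the file list mirrors the tree above, and you would run it wherever the PVC contents are visible (e.g., from a debug pod):

```python
from pathlib import Path

# Files Mistral-7B-v0.1 needs at minimum (mirrors the tree above)
REQUIRED_FILES = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "model.safetensors.index.json",
]

def missing_model_files(model_dir):
    """Return the required files absent from a local model directory."""
    d = Path(model_dir)
    missing = [f for f in REQUIRED_FILES if not (d / f).is_file()]
    # The sharded weights themselves are listed in the index file,
    # so also require at least one *.safetensors shard to be present.
    if not any(d.glob("model-*.safetensors")):
        missing.append("model-*.safetensors")
    return missing
```

If the returned list is non-empty, fix the PVC contents first; vLLM will otherwise fail during model load.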

2) Container Image

Use a vLLM image built with CUDA support. Example:

registry.example.com/org/vllm-cuda:latest

Deployment Manifest

# mistral-vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-vllm
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-vllm
  template:
    metadata:
      labels:
        app: mistral-vllm
    spec:
      containers:
        - name: vllm
          image: registry.example.com/org/vllm-cuda:latest
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /data/Mistral-7B-v0.1
            - --download-dir
            - /data
            - --dtype
            - bfloat16
            - --tensor-parallel-size
            - "1"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          env:
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: TRANSFORMERS_OFFLINE
              value: "1"
            - name: VLLM_NO_USAGE_STATS
              value: "1"
          resources:
            limits:
              nvidia.com/gpu: "1"    # or fractional via GPU operator
            requests:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-data
              mountPath: /data
              readOnly: true
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-vllm
  namespace: ai-inference
spec:
  selector:
    app: mistral-vllm
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http

Environment Variables Explained

Variable               Value   Purpose
HF_HUB_OFFLINE         1       Prevents downloads from the Hugging Face Hub
TRANSFORMERS_OFFLINE   1       Forces transformers to use local files only
VLLM_NO_USAGE_STATS    1       Disables vLLM usage telemetry

These are critical for air-gapped or disconnected environments.

GPU Fractioning

If your cluster supports GPU fractioning (e.g., Run:ai, MIG, or time-slicing):

resources:
  limits:
    nvidia.com/gpu: "1"
  requests:
    nvidia.com/gpu: "1"

With Run:ai or similar schedulers, configure fractional GPU (e.g., 50%) through the platform UI rather than the manifest.
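If you are not on Run:ai, time-slicing through the NVIDIA GPU Operator is one common way to share a card: each physical GPU is advertised as several schedulable nvidia.com/gpu replicas, and the Deployment manifest above stays unchanged. A sketch of the device-plugin ConfigMap, with illustrative name and replica count (check the GPU Operator documentation for your version):

```yaml
# Hypothetical time-slicing config for the NVIDIA GPU Operator:
# one physical GPU is exposed as 2 schedulable replicas.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```

Keep in mind that time-slicing shares compute but not memory: vLLM pre-allocates most of the card by default, so lower its --gpu-memory-utilization flag (default 0.9) when two replicas share a GPU.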

Mistral-7B requirements:

  • Minimum: ~14 GB VRAM (bfloat16)
  • Recommended: 24+ GB VRAM for production batch sizes
  • Works on: A10, A30, A100, H100

Verify Deployment

# Check pod is running
kubectl get pods -n ai-inference -l app=mistral-vllm

# Check logs for successful startup
kubectl logs -n ai-inference deployment/mistral-vllm | tail -20

# List available models
curl -k https://<inference-endpoint>/v1/models

Expected /v1/models response:

{
  "object": "list",
  "data": [{
    "id": "/data/Mistral-7B-v0.1",
    "object": "model",
    "owned_by": "vllm",
    "max_model_len": 32768
  }]
}

Important: Model ID in API Calls

vLLM uses the exact model path as the model ID. You must use it as-is:

# Correct β€” uses the exact ID from /v1/models
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/Mistral-7B-v0.1",
    "prompt": "Write a one-line greeting:",
    "max_tokens": 32
  }'

# Wrong β€” this returns 404
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Mistral-7B-v0.1",
    "prompt": "Write a one-line greeting:",
    "max_tokens": 32
  }'

The second call fails with:

{"error": {"message": "The model `Mistral-7B-v0.1` does not exist.", "type": "NotFoundError", "code": 404}}

Chat vs Completions

Mistral-7B-v0.1 (base model) does not include a chat template:

Endpoint               Works?  Notes
/v1/completions        Yes     Use this for base Mistral
/v1/chat/completions   No      Requires a model with a chat template (e.g., Mistral-7B-Instruct)

If you need /v1/chat/completions, deploy Mistral-7B-Instruct-v0.2 or newer instruct-tuned variants instead.
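Until an instruct variant is deployed, chat-style messages can still go through /v1/completions by flattening them into a single prompt. The format below is a hand-rolled illustration only, not the official [INST] template that Mistral's Instruct models expect:

```python
def messages_to_prompt(messages):
    """Flatten OpenAI-style chat messages into one prompt string
    for a base model that has no chat template."""
    lines = [f"{m['role'].capitalize()}: {m['content']}" for m in messages]
    lines.append("Assistant:")  # cue the model to continue as the assistant
    return "\n".join(lines)

prompt = messages_to_prompt([
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Name the capital of France."},
])
print(prompt)
```

Expect rougher outputs than from a true instruct model: the base model was never trained on this convention and may continue the dialogue rather than answer it.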

Run:ai Deployment (UI)

If using Run:ai, configure:

Field                      Value
Inference type             Custom
Image URL                  registry.example.com/org/vllm-cuda:latest
Image pull                 Only if not present (recommended)
Container port             8000 (HTTP)
Command                    python -m vllm.entrypoints.openai.api_server
Arguments                  --model /data/Mistral-7B-v0.1 --download-dir /data --dtype bfloat16 --tensor-parallel-size 1
Env: HF_HUB_OFFLINE        1
Env: TRANSFORMERS_OFFLINE  1
Env: VLLM_NO_USAGE_STATS   1
GPU devices                1
GPU fraction               50%
Data origin (PVC)          your-model-storage-pvc
Container path             /data
Priority                   high or very-high

Troubleshooting

Symptom                  Cause                        Fix
404 on /v1/completions   Wrong model name             Use the exact ID from /v1/models
Chat template error      Base model has no template   Use /v1/completions or switch to an Instruct variant
Pod OOMKilled            Insufficient GPU memory      Increase GPU fraction or use a quantized model
Slow first request       Model loading / warmup       Wait 30–60 s after the pod starts
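Rather than sleeping a fixed 60 s for warmup, readiness can be polled (the vLLM OpenAI server exposes a /health endpoint for exactly this). A generic polling sketch; in practice `probe` would be an HTTP GET against /health that returns True on a 200:

```python
import time

def wait_until_ready(probe, timeout_s=120, interval_s=5.0):
    """Call `probe` (a zero-arg callable returning bool) until it succeeds
    or timeout_s elapses; return whether the probe ever succeeded."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

The same idea maps directly onto a Kubernetes readinessProbe pointing at /health on port 8000, which keeps traffic off the pod until the model has loaded.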
#vllm #mistral #llm #inference #gpu #ai-workloads #openai-api
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
