AI Β· Intermediate Β· ⏱ 20 minutes Β· K8s 1.28+

Quantize LLMs for Efficient GPU Inference on Kubernetes

Run quantized LLMs (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Use quantized models (AWQ or GPTQ) to cut GPU memory by 50–75%. Mistral-7B goes from ~14 GB (bf16) β†’ ~4 GB (4-bit). In vLLM, set --quantization awq or --quantization gptq. Download pre-quantized models from Hugging Face (e.g., TheBloke/Mistral-7B-v0.1-AWQ). No code changes needed β€” same OpenAI-compatible API.

Quantization reduces model precision (e.g., 16-bit β†’ 4-bit) to shrink GPU memory requirements and increase throughput. This lets you serve production LLMs on smaller or shared GPUs.
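The arithmetic behind those savings is simple: weight memory is roughly parameters Γ— bits per parameter. A minimal sketch (it ignores KV cache, activations, and runtime overhead, which add several GB on top in practice):

```python
def weight_memory_gb(num_params_b: float, bits_per_param: float) -> float:
    """Approximate GPU memory for model weights alone, in GB.

    num_params_b: parameter count in billions (e.g. 7 for Mistral-7B).
    bits_per_param: 16 for bf16/fp16, 8 or 4 for quantized weights.
    """
    total_bytes = num_params_b * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

print(weight_memory_gb(7, 16))  # Mistral-7B in bf16  -> 14.0
print(weight_memory_gb(7, 4))   # Mistral-7B in 4-bit -> 3.5
```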

Memory Savings Overview

| Model | bf16 (full) | 8-bit | 4-bit (AWQ/GPTQ) |
|---|---|---|---|
| Mistral-7B | ~14 GB | ~8 GB | ~4 GB |
| Llama-2-13B | ~26 GB | ~14 GB | ~7 GB |
| Llama-2-70B | ~140 GB | ~70 GB | ~35 GB |
| Mixtral-8x7B | ~90 GB | ~48 GB | ~24 GB |

Quantization Formats

| Format | Quality | Speed | vLLM Support | Notes |
|---|---|---|---|---|
| AWQ | Excellent | Fast | Yes | Recommended for vLLM |
| GPTQ | Excellent | Good | Yes | Widely adopted |
| GGUF | Good | Varies | No (use llama.cpp) | Best for CPU inference |
| bitsandbytes | Good | Moderate | Limited | Easiest to apply |
| FP8 | Near-lossless | Fastest | NIM only | Requires H100/Ada |
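If you're unsure which format a downloaded checkpoint uses, the `quantization_config` block in the model's `config.json` usually says. A small sketch that reads it (field names follow the Hugging Face Transformers convention, where `quant_method` is e.g. `awq` or `gptq`):

```python
import json
from pathlib import Path

def detect_quantization(model_dir: str) -> str:
    """Return the quantization method declared in config.json, or 'none'."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    qcfg = config.get("quantization_config")
    return qcfg.get("quant_method", "unknown") if qcfg else "none"
```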

Deploy AWQ Model with vLLM

Step 1: Get Pre-Quantized Weights

Download a pre-quantized model. Example with Mistral-7B AWQ:

# From Hugging Face (on a machine with internet access)
huggingface-cli download TheBloke/Mistral-7B-v0.1-AWQ \
  --local-dir ./Mistral-7B-v0.1-AWQ

# Upload to your PVC or S3 storage
# Model directory structure is identical to full-precision models

Step 2: Deployment Manifest

# mistral-awq-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-awq
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-awq
  template:
    metadata:
      labels:
        app: mistral-awq
    spec:
      containers:
        - name: vllm
          image: registry.example.com/org/vllm-cuda:latest
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /data/Mistral-7B-v0.1-AWQ
            - --quantization
            - awq
            - --dtype
            - float16
            - --tensor-parallel-size
            - "1"
            - --max-model-len
            - "8192"
          ports:
            - containerPort: 8000
          env:
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: TRANSFORMERS_OFFLINE
              value: "1"
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-data
              mountPath: /data
              readOnly: true
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-awq
  namespace: ai-inference
spec:
  selector:
    app: mistral-awq
  ports:
    - port: 8000
      targetPort: 8000

Key Differences from Full Precision

The only changes are:

--model /data/Mistral-7B-v0.1-AWQ    # quantized weights path
--quantization awq                    # tell vLLM the format
--dtype float16                       # AWQ works with fp16, not bf16

Deploy GPTQ Model with vLLM

The deployment is identical except for the container args:

args:
  - --model
  - /data/Mistral-7B-v0.1-GPTQ
  - --quantization
  - gptq
  - --dtype
  - float16
  - --tensor-parallel-size
  - "1"

GPU Selection Guide for Quantized Models

| GPU | VRAM | Mistral-7B (4-bit) | Llama-2-13B (4-bit) | Llama-2-70B (4-bit) |
|---|---|---|---|---|
| T4 | 16 GB | βœ… | βœ… (tight) | ❌ |
| A10 | 24 GB | βœ… | βœ… | ❌ |
| A30 | 24 GB | βœ… | βœ… | ❌ |
| A100-40GB | 40 GB | βœ… | βœ… | βœ… (tight) |
| A100-80GB | 80 GB | βœ… | βœ… | βœ… |
| H100 | 80 GB | βœ… | βœ… | βœ… |

With 4-bit quantization, Mistral-7B fits comfortably on a T4 β€” enabling inference on much cheaper hardware.
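The table above can be sanity-checked with the same rough weight-size arithmetic. A sketch (the 4 GB headroom for KV cache and overhead is an assumption, not a vLLM rule):

```python
def fits(gpu_vram_gb: float, model_params_b: float, bits: int,
         headroom_gb: float = 4.0) -> bool:
    """True if quantized weights plus headroom fit in the GPU's VRAM."""
    weights_gb = model_params_b * bits / 8  # billions of params * bits/8 bytes
    return weights_gb + headroom_gb <= gpu_vram_gb

print(fits(16, 7, 4))   # Mistral-7B 4-bit on a T4  -> True
print(fits(16, 70, 4))  # Llama-2-70B 4-bit on a T4 -> False
```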

Quality Comparison

Quantization introduces small accuracy trade-offs:

Benchmark (Mistral-7B):
  bf16 (baseline):  MMLU 62.5%  |  Perplexity 5.21
  AWQ 4-bit:        MMLU 62.1%  |  Perplexity 5.28
  GPTQ 4-bit:       MMLU 61.8%  |  Perplexity 5.32

Practical impact: Negligible for most applications.

Run:ai Configuration for Quantized Models

| Field | Full Precision | AWQ 4-bit |
|---|---|---|
| Image | vLLM container | Same |
| Arguments | `--model /data/Mistral-7B-v0.1 --dtype bfloat16` | `--model /data/Mistral-7B-v0.1-AWQ --quantization awq --dtype float16` |
| GPU fraction | 50% (of A100) | 25% or smaller GPU |
| GPU memory needed | ~14 GB | ~4 GB |

Verify Quantized Deployment

# Check model is loaded
curl -k https://<endpoint>/v1/models

# Run inference (same API as full-precision)
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/Mistral-7B-v0.1-AWQ",
    "prompt": "Explain quantization in one sentence:",
    "max_tokens": 32
  }'
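The same call from Python, using only the standard library (a sketch; `<endpoint>` is your ingress or service address, and the unverified TLS context mirrors `curl -k` for self-signed certs):

```python
import json
import ssl
import urllib.request

ENDPOINT = "https://<endpoint>"  # replace with your service address

def build_completion_request(model: str, prompt: str, max_tokens: int = 32) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/completions endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt: str) -> str:
    body = json.dumps(build_completion_request("/data/Mistral-7B-v0.1-AWQ", prompt))
    req = urllib.request.Request(
        f"{ENDPOINT}/v1/completions",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    # Equivalent of `curl -k`: skip TLS verification for self-signed certs
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)["choices"][0]["text"]
```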

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `ValueError: quantization method not supported` | Wrong vLLM version | Use vLLM β‰₯ 0.4.0 |
| Slow inference | CPU fallback for some ops | Ensure a GPU is allocated |
| Quality degradation | Over-aggressive quantization | Try AWQ instead of GPTQ, or use 8-bit |
| `CUDA out of memory` | Batch size too large for quantized model | Reduce `--max-num-seqs` |
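The OOM fix works because each concurrent sequence reserves its own KV cache. A rough per-sequence estimate (a sketch using Mistral-7B's published architecture: 32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache):

```python
def kv_cache_gb(seq_len: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache for one sequence: 2 (K and V) * layers * heads * dim * length."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * seq_len / 1e9

print(kv_cache_gb(8192))  # ~1.07 GB per full-length sequence
```

On a T4 with ~3.5 GB of weights loaded, this suggests only a handful of full-length sequences fit concurrently, which is why lowering `--max-num-seqs` (or `--max-model-len`) relieves OOM.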
#quantization #gptq #awq #gguf #llm #gpu #optimization #ai-workloads
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
