ai · intermediate · 20 minutes · K8s 1.28+

Quantize LLMs for Efficient GPU Inference

Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.

By Luca Berton • 5 min read

💡 Quick Answer: Use quantized models (AWQ or GPTQ) to cut GPU memory by 50–75%. Mistral-7B goes from ~14 GB (bf16) → ~4 GB (4-bit). In vLLM, set --quantization awq or --quantization gptq. Download pre-quantized models from Hugging Face (e.g., TheBloke/Mistral-7B-v0.1-AWQ). No client code changes are needed: the OpenAI-compatible API stays the same.

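Quantization reduces the numeric precision of the model weights (e.g., 16-bit → 4-bit) to shrink GPU memory requirements. It often increases throughput as well, because fewer bytes are read from GPU memory per generated token. This lets you serve production LLMs on smaller or shared GPUs.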

Memory Savings Overview

| Model | bf16 (full) | 8-bit | 4-bit (AWQ/GPTQ) |
|---|---|---|---|
| Mistral-7B | ~14 GB | ~8 GB | ~4 GB |
| Llama-2-13B | ~26 GB | ~14 GB | ~7 GB |
| Llama-2-70B | ~140 GB | ~70 GB | ~35 GB |
| Mixtral-8x7B | ~90 GB | ~48 GB | ~24 GB |
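The figures in the table follow from simple arithmetic: weight memory is roughly the parameter count times the bits per parameter, divided by 8. The sketch below reproduces that estimate. The parameter counts are approximations chosen for illustration, and the helper name `weight_gb` is hypothetical. Real checkpoints come out slightly larger because of quantization scales and layers that stay in higher precision.

```python
# Rough weight-memory estimate: parameters x bits per parameter / 8.
# Parameter counts (in billions) are approximations, not exact values.
PARAMS_BILLION = {
    "Mistral-7B": 7.2,
    "Llama-2-13B": 13.0,
    "Llama-2-70B": 70.0,
    "Mixtral-8x7B": 46.7,
}


def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8


for name, params in PARAMS_BILLION.items():
    sizes = ", ".join(
        f"{bits}-bit ~{weight_gb(params, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{name}: {sizes}")
```

This covers weights only. The KV cache and activations need additional GPU memory on top of these numbers.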

Quantization Formats

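| Format | Quality | Speed | vLLM Support | Notes |
|---|---|---|---|---|
| AWQ | Excellent | Fast | Yes | Recommended for vLLM |
| GPTQ | Excellent | Good | Yes | Widely adopted |
| GGUF | Good | Varies | Experimental (llama.cpp is the usual runtime) | Best for CPU inference |
| bitsandbytes | Good | Moderate | Limited | Easiest to apply |
| FP8 | Near-lossless | Fastest | Yes in recent releases (also used by NIM) | Requires H100/Ada |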

Deploy AWQ Model with vLLM

Step 1: Get Pre-Quantized Weights

Download a pre-quantized model. Example with Mistral-7B AWQ:

# From Hugging Face (on a machine with internet access)
huggingface-cli download TheBloke/Mistral-7B-v0.1-AWQ \
  --local-dir ./Mistral-7B-v0.1-AWQ

# Upload to your PVC or S3 storage
# Model directory structure is identical to full-precision models
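Before uploading, it can help to confirm that the downloaded directory really holds quantized weights. The sketch below assumes the checkpoint records its format under a `quantization_config` key in `config.json`, as many Hugging Face AWQ and GPTQ exports do. Some older repositories use a separate file instead, so treat a `None` result as "unknown", not as proof of full precision. The function name `detect_quant_method` is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_method(model_dir: str) -> Optional[str]:
    """Return the quant_method recorded in config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return None
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: the export stores its format under "quantization_config".
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method")
```

For example, `detect_quant_method("./Mistral-7B-v0.1-AWQ")` would be expected to return `"awq"` for a checkpoint that follows this convention.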

Step 2: Deployment Manifest

# mistral-awq-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-awq
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-awq
  template:
    metadata:
      labels:
        app: mistral-awq
    spec:
      containers:
        - name: vllm
          image: registry.example.com/org/vllm-cuda:latest
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /data/Mistral-7B-v0.1-AWQ
            - --quantization
            - awq
            - --dtype
            - float16
            - --tensor-parallel-size
            - "1"
            - --max-model-len
            - "8192"
          ports:
            - containerPort: 8000
          env:
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: TRANSFORMERS_OFFLINE
              value: "1"
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-data
              mountPath: /data
              readOnly: true
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-awq
  namespace: ai-inference
spec:
  selector:
    app: mistral-awq
  ports:
    - port: 8000
      targetPort: 8000

Key Difference from Full-Precision

The only changes are:

--model /data/Mistral-7B-v0.1-AWQ    # quantized weights path
--quantization awq                    # tell vLLM the format
--dtype float16                       # AWQ works with fp16, not bf16

Deploy GPTQ Model with vLLM

args:
  - --model
  - /data/Mistral-7B-v0.1-GPTQ
  - --quantization
  - gptq
  - --dtype
  - float16
  - --tensor-parallel-size
  - "1"
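The AWQ and GPTQ argument lists differ only in the model path and the value passed to --quantization. A minimal sketch of a helper that builds the container args for either case is shown here. The function name `vllm_args` is hypothetical, and the full-precision branch assumes bfloat16, matching the Run:ai comparison in this recipe.

```python
from typing import List, Optional


def vllm_args(model_path: str, quantization: Optional[str] = None) -> List[str]:
    """Build vLLM server args for a full-precision or quantized model."""
    args = ["--model", model_path]
    if quantization:
        # Quantized path: name the format and use float16, as in this recipe.
        args += ["--quantization", quantization, "--dtype", "float16"]
    else:
        # Full-precision path (assumed bfloat16).
        args += ["--dtype", "bfloat16"]
    args += ["--tensor-parallel-size", "1"]
    return args


print(vllm_args("/data/Mistral-7B-v0.1-GPTQ", "gptq"))
```

Generating the list from one function keeps the AWQ, GPTQ, and full-precision manifests from drifting apart.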

GPU Selection Guide for Quantized Models

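| GPU | VRAM | Mistral-7B (4-bit) | Llama-2-13B (4-bit) | Llama-2-70B (4-bit) |
|---|---|---|---|---|
| T4 | 16 GB | ✅ | ✅ (tight) | ❌ |
| A10 | 24 GB | ✅ | ✅ | ❌ |
| A30 | 24 GB | ✅ | ✅ | ❌ |
| A100-40GB | 40 GB | ✅ | ✅ | ✅ (tight) |
| A100-80GB | 80 GB | ✅ | ✅ | ✅ |
| H100 | 80 GB | ✅ | ✅ | ✅ |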

With 4-bit quantization, Mistral-7B fits comfortably on a T4, enabling inference on much cheaper hardware.
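The fit column can be approximated with a small headroom check: take the share of VRAM that vLLM is allowed to use (its --gpu-memory-utilization setting defaults to 0.90), subtract the 4-bit weight size, and see how much is left for the KV cache. In the sketch below the 8 GB "tight" threshold is a hypothetical cutoff picked so that the result lines up with the table. It is not a vLLM rule.

```python
GPU_MEMORY_UTILIZATION = 0.90  # vLLM default for --gpu-memory-utilization
TIGHT_BELOW_GB = 8.0           # hypothetical cutoff for "tight" headroom


def classify_fit(vram_gb: float, weights_gb: float) -> str:
    """Classify whether quantized weights fit on a GPU with KV-cache headroom."""
    headroom = vram_gb * GPU_MEMORY_UTILIZATION - weights_gb
    if headroom <= 0:
        return "no"
    if headroom < TIGHT_BELOW_GB:
        return "tight"
    return "fits"


# 4-bit weight sizes taken from the memory savings table in this recipe.
for gpu, vram in [("T4", 16), ("A10", 24), ("A100-40GB", 40), ("A100-80GB", 80)]:
    row = {model: classify_fit(vram, gb)
           for model, gb in [("Mistral-7B", 4), ("Llama-2-13B", 7), ("Llama-2-70B", 35)]}
    print(gpu, row)
```

A "tight" result means the model loads but leaves little room for long contexts or many concurrent requests.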

Quality Comparison

Quantization introduces small accuracy trade-offs:

Benchmark (Mistral-7B):
  bf16 (baseline):  MMLU 62.5%  |  Perplexity 5.21
  AWQ 4-bit:        MMLU 62.1%  |  Perplexity 5.28
  GPTQ 4-bit:       MMLU 61.8%  |  Perplexity 5.32

Practical impact: Negligible for most applications.
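To put those figures in perspective, the sketch below computes the drop in MMLU (in percentage points) and the relative increase in perplexity against the bf16 baseline. It uses only the numbers quoted above, and the function name `degradation` is hypothetical.

```python
BASELINE = {"mmlu": 62.5, "ppl": 5.21}
QUANTIZED = {
    "AWQ 4-bit": {"mmlu": 62.1, "ppl": 5.28},
    "GPTQ 4-bit": {"mmlu": 61.8, "ppl": 5.32},
}


def degradation(baseline: dict, candidate: dict) -> tuple:
    """Return (MMLU drop in points, perplexity increase in percent)."""
    mmlu_drop = round(baseline["mmlu"] - candidate["mmlu"], 2)
    ppl_increase_pct = round(
        (candidate["ppl"] - baseline["ppl"]) / baseline["ppl"] * 100, 2
    )
    return mmlu_drop, ppl_increase_pct


for name, scores in QUANTIZED.items():
    drop, ppl_pct = degradation(BASELINE, scores)
    print(f"{name}: MMLU -{drop} points, perplexity +{ppl_pct}%")
```

Both formats stay within about one MMLU point and a few percent of perplexity relative to the baseline on these figures.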

Run:ai Configuration for Quantized Models

| Field | Full Precision | AWQ 4-bit |
|---|---|---|
| Image | vLLM container | Same |
| Arguments | --model /data/Mistral-7B-v0.1 --dtype bfloat16 | --model /data/Mistral-7B-v0.1-AWQ --quantization awq --dtype float16 |
| GPU fraction | 50% (of A100) | 25% or smaller GPU |
| GPU memory needed | ~14 GB | ~4 GB |

Verify Quantized Deployment

# Check model is loaded
curl -k https://<endpoint>/v1/models

# Run inference (same API as full-precision)
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/Mistral-7B-v0.1-AWQ",
    "prompt": "Explain quantization in one sentence:",
    "max_tokens": 32
  }'
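The same check can be scripted with the Python standard library only. This is a minimal sketch: the helper names `build_completion_request` and `send` are hypothetical, the base URL is whatever your endpoint is, and `verify_tls=False` mirrors curl's -k flag, which should only be used against test endpoints with self-signed certificates.

```python
import json
import ssl
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, verify_tls: bool = True) -> dict:
    """Send the request and return the parsed JSON response."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Equivalent of curl -k: skip certificate checks (test endpoints only).
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=60) as response:
        return json.load(response)
```

A call would look like `send(build_completion_request(base_url, "/data/Mistral-7B-v0.1-AWQ", "Explain quantization in one sentence:"), verify_tls=False)`, where `base_url` is your own endpoint.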

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| ValueError: quantization method not supported | Wrong vLLM version | Use vLLM ≥ 0.4.0 |
| Slow inference | CPU fallback for some ops | Ensure GPU is allocated |
| Quality degradation | Over-aggressive quantization | Try AWQ instead of GPTQ, or use 8-bit |
| CUDA out of memory | Batch size too large for quantized model | Reduce --max-num-seqs |
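Quantizing the weights does not shrink the KV cache, which is why concurrency can still exhaust memory. A rough sketch of the arithmetic follows. It assumes Mistral-7B's published architecture (32 layers, 8 key/value heads, head dimension 128) and an fp16 cache, and it ignores sliding-window attention and runtime overhead. The function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_full_length_sequences(kv_budget_gib: float, max_model_len: int,
                              per_token_bytes: int) -> int:
    """How many sequences of max_model_len tokens fit in the KV budget."""
    return int(kv_budget_gib * 1024**3 // (max_model_len * per_token_bytes))


# Assumed Mistral-7B shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)  # bytes of KV cache per token
print(max_full_length_sequences(10, 8192, per_token))
```

Under these assumptions each token costs 128 KiB of cache, so one 8192-token sequence needs about 1 GiB. That gives a rough starting point for choosing --max-num-seqs and --max-model-len.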
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
