Quantize LLMs for Efficient GPU Inference on Kubernetes
Run quantized LLM models (GPTQ, AWQ, GGUF) on Kubernetes to reduce GPU memory requirements and serve models on smaller GPUs.
π‘ Quick Answer: Use quantized models (AWQ or GPTQ) to cut GPU memory by 50β75%. Mistral-7B goes from ~14 GB (bf16) β ~4 GB (4-bit). In vLLM, set
--quantization awqor--quantization gptq. Download pre-quantized models from Hugging Face (e.g.,TheBloke/Mistral-7B-v0.1-AWQ). No code changes needed β same OpenAI-compatible API.
Quantization reduces model precision (e.g., 16-bit β 4-bit) to shrink GPU memory requirements and increase throughput. This lets you serve production LLMs on smaller or shared GPUs.
Memory Savings Overview
| Model | bf16 (full) | 8-bit | 4-bit (AWQ/GPTQ) |
|---|---|---|---|
| Mistral-7B | ~14 GB | ~8 GB | ~4 GB |
| Llama-2-13B | ~26 GB | ~14 GB | ~7 GB |
| Llama-2-70B | ~140 GB | ~70 GB | ~35 GB |
| Mixtral-8x7B | ~90 GB | ~48 GB | ~24 GB |
Quantization Formats
| Format | Quality | Speed | vLLM Support | Notes |
|---|---|---|---|---|
| AWQ | Excellent | Fast | Yes | Recommended for vLLM |
| GPTQ | Excellent | Good | Yes | Widely adopted |
| GGUF | Good | Varies | No (use llama.cpp) | Best for CPU inference |
| bitsandbytes | Good | Moderate | Limited | Easiest to apply |
| FP8 | Near-lossless | Fastest | NIM only | Requires H100/Ada |
Deploy AWQ Model with vLLM
Step 1: Get Pre-Quantized Weights
Download a pre-quantized model. Example with Mistral-7B AWQ:
# From Hugging Face (on a machine with internet access)
huggingface-cli download TheBloke/Mistral-7B-v0.1-AWQ \
--local-dir ./Mistral-7B-v0.1-AWQ
# Upload to your PVC or S3 storage
# Model directory structure is identical to full-precision modelsStep 2: Deployment Manifest
# mistral-awq-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-awq
namespace: ai-inference
spec:
replicas: 1
selector:
matchLabels:
app: mistral-awq
template:
metadata:
labels:
app: mistral-awq
spec:
containers:
- name: vllm
image: registry.example.com/org/vllm-cuda:latest
command:
- python
- -m
- vllm.entrypoints.openai.api_server
args:
- --model
- /data/Mistral-7B-v0.1-AWQ
- --quantization
- awq
- --dtype
- float16
- --tensor-parallel-size
- "1"
- --max-model-len
- "8192"
ports:
- containerPort: 8000
env:
- name: HF_HUB_OFFLINE
value: "1"
- name: TRANSFORMERS_OFFLINE
value: "1"
resources:
limits:
nvidia.com/gpu: "1"
volumeMounts:
- name: model-data
mountPath: /data
readOnly: true
volumes:
- name: model-data
persistentVolumeClaim:
claimName: model-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
name: mistral-awq
namespace: ai-inference
spec:
selector:
app: mistral-awq
ports:
- port: 8000
targetPort: 8000Key Difference from Full-Precision
The only changes are:
--model /data/Mistral-7B-v0.1-AWQ # quantized weights path
--quantization awq # tell vLLM the format
--dtype float16 # AWQ works with fp16, not bf16Deploy GPTQ Model with vLLM
args:
- --model
- /data/Mistral-7B-v0.1-GPTQ
- --quantization
- gptq
- --dtype
- float16
- --tensor-parallel-size
- "1"GPU Selection Guide for Quantized Models
| GPU | VRAM | Mistral-7B (4-bit) | Llama-2-13B (4-bit) | Llama-2-70B (4-bit) |
|---|---|---|---|---|
| T4 | 16 GB | β | β (tight) | β |
| A10 | 24 GB | β | β | β |
| A30 | 24 GB | β | β | β |
| A100-40GB | 40 GB | β | β | β (tight) |
| A100-80GB | 80 GB | β | β | β |
| H100 | 80 GB | β | β | β |
With 4-bit quantization, Mistral-7B fits comfortably on a T4 β enabling inference on much cheaper hardware.
Quality Comparison
Quantization introduces small accuracy trade-offs:
Benchmark (Mistral-7B):
bf16 (baseline): MMLU 62.5% | Perplexity 5.21
AWQ 4-bit: MMLU 62.1% | Perplexity 5.28
GPTQ 4-bit: MMLU 61.8% | Perplexity 5.32
Practical impact: Negligible for most applications.Run:ai Configuration for Quantized Models
| Field | Full Precision | AWQ 4-bit |
|---|---|---|
| Image | vLLM container | Same |
| Arguments | --model /data/Mistral-7B-v0.1 --dtype bfloat16 | --model /data/Mistral-7B-v0.1-AWQ --quantization awq --dtype float16 |
| GPU fraction | 50% (of A100) | 25% or smaller GPU |
| GPU memory needed | ~14 GB | ~4 GB |
Verify Quantized Deployment
# Check model is loaded
curl -k https://<endpoint>/v1/models
# Run inference (same API as full-precision)
curl -k -X POST https://<endpoint>/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/Mistral-7B-v0.1-AWQ",
"prompt": "Explain quantization in one sentence:",
"max_tokens": 32
}'Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
ValueError: quantization method not supported | Wrong vLLM version | Use vLLM β₯ 0.4.0 |
| Slow inference | CPU fallback for some ops | Ensure GPU is allocated |
| Quality degradation | Over-aggressive quantization | Try AWQ instead of GPTQ, or use 8-bit |
CUDA out of memory | Batch size too large for quantized model | Reduce --max-num-seqs |
Related Recipes

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
