Deploy Llama 3.1 8B Instruct on K8s
Deploy Meta Llama 3.1 8B Instruct on Kubernetes with vLLM. Production-ready single-GPU deployment with 128K context, tool calling, and autoscaling.
π‘ Quick Answer: Deploy Llama 3.1 8B Instruct with vLLM on a single A100 or L40S. It supports 128K context, native tool/function calling, and serves 50+ concurrent users. At ~16GB VRAM (FP16), it fits comfortably on most datacenter GPUs.
The Problem
Llama 3.1 8B Instruct is Metaβs best small model β 128K context window, native tool calling, multilingual support. But deploying it on Kubernetes requires:
- Context length management β 128K tokens needs careful KV cache sizing
- Tool calling β OpenAI-compatible function calling format
- Efficient serving β maximizing throughput on a single GPU
- Cost optimization β choosing between FP16, FP8, and AWQ based on hardware
The Solution
Step 1: Deploy with vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama31-8b
namespace: ai-inference
labels:
app: llama31-8b
spec:
replicas: 1
selector:
matchLabels:
app: llama31-8b
template:
metadata:
labels:
app: llama31-8b
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--max-model-len"
- "32768"
- "--gpu-memory-utilization"
- "0.90"
- "--max-num-seqs"
- "64"
- "--enable-chunked-prefill"
- "--enable-auto-tool-choice"
- "--tool-call-parser"
- "llama3_json"
- "--port"
- "8000"
ports:
- containerPort: 8000
name: http
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: token
resources:
limits:
nvidia.com/gpu: "1"
memory: 32Gi
cpu: "8"
requests:
memory: 16Gi
cpu: "4"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 20
readinessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 30
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: llama31-model-cache
- name: shm
emptyDir:
medium: Memory
sizeLimit: 4Gi
---
apiVersion: v1
kind: Service
metadata:
name: llama31-8b
namespace: ai-inference
spec:
selector:
app: llama31-8b
ports:
- port: 8000
targetPort: 8000
name: httpStep 2: Full 128K Context Deployment
# For 128K context β needs A100 80GB or H100
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama31-8b-128k
namespace: ai-inference
spec:
replicas: 1
selector:
matchLabels:
app: llama31-8b-128k
template:
metadata:
labels:
app: llama31-8b-128k
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--max-model-len"
- "131072"
- "--gpu-memory-utilization"
- "0.95"
- "--max-num-seqs"
- "8"
- "--enable-chunked-prefill"
- "--enable-auto-tool-choice"
- "--tool-call-parser"
- "llama3_json"
resources:
limits:
nvidia.com/gpu: "1"
memory: 96Gi
cpu: "16"Step 3: Tool Calling / Function Calling
# Llama 3.1 supports native tool calling
kubectl run test-tools --rm -it --image=curlimages/curl -- \
curl -s http://llama31-8b:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant with access to tools."},
{"role": "user", "content": "What pods are running in the default namespace?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "kubectl_get",
"description": "Run kubectl get command",
"parameters": {
"type": "object",
"properties": {
"resource": {"type": "string", "description": "Kubernetes resource type"},
"namespace": {"type": "string", "description": "Namespace"}
},
"required": ["resource"]
}
}
}
],
"tool_choice": "auto",
"max_tokens": 512
}'Step 4: AWQ for Smaller GPUs (A10G, T4)
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama31-8b-awq
namespace: ai-inference
spec:
replicas: 1
selector:
matchLabels:
app: llama31-8b-awq
template:
metadata:
labels:
app: llama31-8b-awq
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
- "--quantization"
- "awq"
- "--max-model-len"
- "16384"
- "--gpu-memory-utilization"
- "0.90"
- "--max-num-seqs"
- "128"
resources:
limits:
nvidia.com/gpu: "1"
memory: 24Gi
cpu: "4"Context Length vs GPU Memory
| Context Length | VRAM (FP16) | VRAM (AWQ) | Recommended GPU |
|---------------|-------------|------------|--------------------|
| 8,192 | ~18GB | ~8GB | A10G, L4, T4 |
| 32,768 | ~26GB | ~14GB | A100 40GB, L40S |
| 65,536 | ~42GB | ~22GB | A100 80GB |
| 131,072 | ~72GB | ~38GB | A100 80GB, H100 |flowchart TD
A{GPU Available?} --> B[A10G / T4 / L4 - 24GB]
A --> C[A100 40GB / L40S]
A --> D[A100 80GB / H100]
B --> E[AWQ INT4 + 8K-16K context]
C --> F[FP16 + 32K context]
D --> G[FP16 + 128K full context]
E --> H[~128 concurrent seqs]
F --> I[~64 concurrent seqs]
G --> J[~8 concurrent seqs at 128K]
H --> K[Best throughput per dollar]
I --> L[Balance of context and throughput]
J --> M[Long document processing]Common Issues
KV cache OOM at high concurrency + long context
# Reduce max concurrent sequences when using long context
--max-num-seqs 8 # for 128K context
--max-num-seqs 32 # for 32K context
--max-num-seqs 128 # for 8K context
# Or reduce max_model_len to cap per-request context
--max-model-len 32768 # limit to 32K even though model supports 128KTool calling not working
# vLLM needs explicit tool call parser for Llama 3.1
--enable-auto-tool-choice \
--tool-call-parser llama3_json
# Without these flags, tool_choice in API requests is ignoredSlow prefill on long inputs
# Chunked prefill prevents long inputs from blocking short ones
--enable-chunked-prefill
# This is critical for mixed workloads with both short and long promptsBest Practices
- Start with 32K context β most use cases donβt need 128K, and 32K is 3x more memory efficient
- AWQ for cost-sensitive deployments β fits on 24GB GPUs with minimal quality loss
- Enable tool calling β
--enable-auto-tool-choice --tool-call-parser llama3_jsonfor agent workflows - Chunked prefill β essential for mixed-length workloads
- PVC for model cache β ~16GB model files, avoid re-downloading
Key Takeaways
- Llama 3.1 8B Instruct fits on a single GPU (~16GB VRAM at FP16)
- Supports 128K context natively β but reduce
max-model-lento save VRAM for throughput - Native tool/function calling via vLLMβs
llama3_jsonparser - AWQ INT4 enables deployment on 24GB GPUs (A10G, L4, T4)
- Best balance of quality, speed, and cost for instruction-following and code generation
- Use
--enable-chunked-prefillfor mixed-length workloads

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
