Deploy Llama 2 70B on Kubernetes
Deploy Meta Llama 2 70B on Kubernetes with multi-GPU tensor parallelism, vLLM serving, and production-ready health checks and resource limits.
Quick Answer: Deploy Llama 2 70B with vLLM using tensor_parallel_size: 4 across 4x A100 80GB GPUs. Use AWQ quantization to fit on 2x A100 or a single H100. Serve via the OpenAI-compatible API with health checks and HPA autoscaling.
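The GPU counts in the quick answer follow from simple arithmetic on the weight footprint. The sketch below is a rough estimate only: it assumes about 70 billion parameters, 2 bytes per parameter in FP16 and roughly 0.5 bytes per parameter with 4-bit AWQ, and it ignores quantization scales, activations, and the KV cache. The function names are illustrative and not part of vLLM.

```python
import math

PARAMS = 70e9          # assumption: ~70 billion parameters
GPU_MEMORY_GB = 80     # A100 80GB or H100 80GB


def weight_footprint_gb(bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return PARAMS * bytes_per_param / 1e9


def min_gpus_for_weights(bytes_per_param: float, utilization: float = 0.90) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    `utilization` mirrors --gpu-memory-utilization. The result ignores
    KV-cache headroom, so treat it as a lower bound, not a recommendation.
    """
    usable_per_gpu = GPU_MEMORY_GB * utilization
    return math.ceil(weight_footprint_gb(bytes_per_param) / usable_per_gpu)


fp16_gb = weight_footprint_gb(2.0)   # FP16: 2 bytes per parameter
awq_gb = weight_footprint_gb(0.5)    # 4-bit AWQ: ~0.5 bytes per parameter
print(f"FP16 weights: {fp16_gb:.0f} GB, lower bound {min_gpus_for_weights(2.0)} GPUs")
print(f"AWQ weights:  {awq_gb:.0f} GB, lower bound {min_gpus_for_weights(0.5)} GPUs")
```

Under these assumptions FP16 technically fits on two 80GB GPUs, but only about 4 GB (2 × 72 − 140) would remain for the KV cache, which is why four GPUs are the practical choice for FP16 and two are comfortable with AWQ.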
The Problem
Llama 2 70B is one of the most capable open-weight LLMs, but deploying it on Kubernetes is challenging:
- Model size: about 140GB in FP16, which requires multiple GPUs with tensor parallelism
- Memory management: the KV cache can exhaust GPU memory under concurrent load
- Multi-GPU coordination: NCCL communication between GPUs needs proper configuration
- Production readiness: health checks, graceful shutdown, and autoscaling are essential
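To make the KV cache concern concrete, the following sketch estimates cache size per token and for a worst-case batch. It assumes the architecture values from the published Llama 2 70B model config (80 layers, 8 key/value heads with grouped-query attention, head dimension 128) and an FP16 cache; the helper names are hypothetical.

```python
# Assumed Llama 2 70B architecture values (from the published model config).
NUM_LAYERS = 80
NUM_KV_HEADS = 8       # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # FP16 cache entries


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gib(num_seqs: int, tokens_per_seq: int) -> float:
    """KV cache in GiB if every sequence holds `tokens_per_seq` tokens."""
    return num_seqs * tokens_per_seq * kv_bytes_per_token() / 2**30


per_token_kib = kv_bytes_per_token() / 1024
worst_case_gib = kv_cache_gib(num_seqs=64, tokens_per_seq=4096)
print(f"KV cache per token: {per_token_kib:.0f} KiB")
print(f"64 sequences x 4096 tokens: {worst_case_gib:.0f} GiB")
```

With these numbers, 64 concurrent sequences at the full 4096-token context would need about 80 GiB of cache in total across the GPUs. Real workloads rarely hit that worst case, but the estimate shows why the concurrency limit and context length need to be sized against the memory left after the weights are loaded.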
The Solution
Step 1: Create Secrets and Storage
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-token
  namespace: ai-inference
type: Opaque
stringData:
  token: "hf_your_token_here"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-llama70b
  namespace: ai-inference
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi
  storageClassName: fast-ssd
Step 2: Deploy Llama 2 70B with vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama2-70b
  namespace: ai-inference
  labels:
    app: llama2-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama2-70b
  template:
    metadata:
      labels:
        app: llama2-70b
    spec:
      containers:
        - name: vllm
          # Pin a specific image tag in production instead of "latest"
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-2-70b-chat-hf"
            - "--tensor-parallel-size"
            - "4"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-num-seqs"
            - "64"
            - "--enable-chunked-prefill"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
            - name: NCCL_DEBUG
              value: "WARN"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
          resources:
            limits:
              nvidia.com/gpu: "4"
              memory: 64Gi
              cpu: "16"
            requests:
              memory: 32Gi
              cpu: "8"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
          # Liveness and readiness probes only start after the startupProbe
          # succeeds, so they need no initialDelaySeconds of their own.
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
            failureThreshold: 30  # 120s + 30 × 30s = 17 minutes for model loading
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-llama70b
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
      terminationGracePeriodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: llama2-70b
  namespace: ai-inference
spec:
  selector:
    app: llama2-70b
  ports:
    - port: 8000
      targetPort: 8000
      name: http
Step 3: AWQ Quantized Version (2x A100)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama2-70b-awq
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama2-70b-awq
  template:
    metadata:
      labels:
        app: llama2-70b-awq
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "TheBloke/Llama-2-70B-Chat-AWQ"
            - "--quantization"
            - "awq"
            - "--tensor-parallel-size"
            - "2"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-num-seqs"
            - "128"
          resources:
            limits:
              nvidia.com/gpu: "2"
              memory: 48Gi
              cpu: "8"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
Step 4: Test the Deployment
# Wait for model to load (can take 10-15 minutes)
kubectl wait --for=condition=ready pod -l app=llama2-70b \
  -n ai-inference --timeout=900s
# Test inference from a pod in the same namespace, so the short
# Service name llama2-70b resolves
kubectl run test-llama --rm -it --restart=Never -n ai-inference \
  --image=curlimages/curl -- \
  curl -s http://llama2-70b:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-70b-chat-hf",
    "messages": [{"role": "user", "content": "Explain Kubernetes pods in 3 sentences"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'
With tensor parallelism, every GPU holds a slice of every layer rather than a contiguous range of layers, and the GPUs exchange partial results over NCCL within each layer:
flowchart TD
    A[Client Request] --> B[Service: llama2-70b:8000]
    B --> C[vLLM Pod]
    C --> D[GPU 0 - shard 1 of 4, all layers]
    C --> E[GPU 1 - shard 2 of 4, all layers]
    C --> F[GPU 2 - shard 3 of 4, all layers]
    C --> G[GPU 3 - shard 4 of 4, all layers]
    D <-->|NCCL| E
    E <-->|NCCL| F
    F <-->|NCCL| G
    C --> H[OpenAI-compatible Response]
Common Issues
NCCL timeout during multi-GPU initialization
# Increase NCCL timeout and ensure shared memory
env:
  - name: NCCL_TIMEOUT
    value: "1800"
  - name: NCCL_SOCKET_IFNAME
    value: "eth0"
# Ensure /dev/shm is large enough (16Gi for 4 GPUs)
OOM during model loading
# FP16 needs ~140GB GPU memory total
# 4x A100 80GB = 320GB available (safe)
# 2x A100 80GB = 160GB: tight, use AWQ quantization
# Use --gpu-memory-utilization 0.85 to leave headroom
Startup probe timeout
# 70B model takes 10-15 minutes to load
startupProbe:
  failureThreshold: 30  # 30 × 30s = 15 minutes
  periodSeconds: 30
Best Practices
- Use /dev/shm with adequate size: NCCL uses shared memory for GPU communication
- PVC for model cache: avoid re-downloading 140GB on every pod restart
- AWQ for cost efficiency: 4-bit quantization fits on 2 GPUs with minimal quality loss
- Startup probes with long timeout: large models need 10-15 minutes to load
- Set NCCL_SOCKET_IFNAME: prevents NCCL from using the wrong network interface
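The startup-probe advice comes down to checking that the probe's time budget exceeds the expected load time. A minimal sketch of that check, assuming the budget is approximately initialDelaySeconds + periodSeconds × failureThreshold (per-probe timeouts are ignored, and the function name is illustrative):

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time a container has to pass its startup probe."""
    return initial_delay + period * failure_threshold


EXPECTED_LOAD_SECONDS = 15 * 60  # upper end of the 10-15 minute load time

# Example settings: (initialDelaySeconds, periodSeconds, failureThreshold)
candidates = {
    "threshold 20, delay 120s": (120, 30, 20),
    "threshold 30, no delay": (0, 30, 30),
    "threshold 30, delay 120s": (120, 30, 30),
}

for label, (delay, period, threshold) in candidates.items():
    budget = startup_budget_seconds(delay, period, threshold)
    verdict = "ok" if budget >= EXPECTED_LOAD_SECONDS else "too short"
    print(f"{label}: {budget}s ({budget / 60:.0f} min) -> {verdict}")
```

A failureThreshold of 20 with a 30 second period and a 120 second initial delay allows only 12 minutes, which is shorter than a 15 minute load; raising the threshold to 30 covers it.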
Key Takeaways
- Llama 2 70B requires 4x A100 80GB in FP16 or 2x A100 with AWQ quantization
- Use vLLM with --tensor-parallel-size matching your GPU count
- Mount /dev/shm as emptyDir Memory for NCCL inter-GPU communication
- Startup probes need a 10-15 minute timeout for model loading
- AWQ quantization reduces GPU requirement by 50% with minimal quality impact
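Once the pod is ready, the OpenAI-compatible endpoint can also be called from Python instead of curl. The sketch below uses only the standard library and assumes the Service has been made reachable on localhost, for example with kubectl port-forward svc/llama2-70b 8000:8000 -n ai-inference. The base URL and helper names are assumptions for illustration; the response parsing assumes the standard OpenAI chat completion shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: port-forward to the Service
MODEL = "meta-llama/Llama-2-70b-chat-hf"


def build_chat_request(prompt: str, max_tokens: int = 256,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def send_chat(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response: choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network; send_chat() does.
request = build_chat_request("Explain Kubernetes pods in 3 sentences")
print(request.get_method(), request.full_url)
```

Calling send_chat(request) with the port-forward active returns the generated text; the generous timeout allows for long generations on a busy server.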

