Llama 2 70B FP16 Model Size 140GB Guide
Llama 2 70B FP16 model size is 140GB. Complete GPU memory requirements for FP16, FP8, INT4 quantization, and multi-GPU tensor parallelism on Kubernetes.
💡 Quick Answer: Llama 2 70B in FP16 precision requires ~140GB of VRAM for the weights alone (70 billion parameters × 2 bytes). A single H200 (141GB) can hold the weights, but with almost no headroom left for KV cache. On H100 (80GB), use 2× GPUs with tensor parallelism. FP8 halves the weights to ~70GB (1× H100), and INT4/GPTQ reduces them to ~35GB (1× A100 40GB).
The Problem
Before deploying Llama 2 70B (or any large model) on Kubernetes, you need to calculate VRAM requirements to choose the right GPU type, count, and parallelism strategy. Getting this wrong means either OOM crashes or wasted GPU spend.
flowchart TB
    PARAMS["70B Parameters"] --> FP16["FP16: 70B × 2 bytes<br/>= 140 GB"]
    PARAMS --> FP8["FP8: 70B × 1 byte<br/>= 70 GB"]
    PARAMS --> INT4["INT4: 70B × 0.5 bytes<br/>= 35 GB"]
    FP16 --> H200["1× H200 (141GB)<br/>or 2× H100 (80GB)"]
    FP8 --> H100["1× H100 (80GB)<br/>or 2× A100 (40GB)"]
    INT4 --> A100["1× A100 40GB<br/>or 1× L40S (48GB)"]

The Solution
Model Size Formula
VRAM = Parameters × Bytes per Parameter + Overhead
Bytes per precision:
  FP32: 4 bytes
  FP16: 2 bytes (BF16 same)
  FP8:  1 byte
  INT4: 0.5 bytes (GPTQ/AWQ)
Overhead: ~10-20% for KV cache, activations, CUDA kernels

Llama Model Family Sizes
| Model | Parameters | FP16 (GB) | FP8 (GB) | INT4 (GB) | +20% Overhead (FP16) |
|---|---|---|---|---|---|
| Llama 2 7B | 7B | 14 | 7 | 3.5 | 17 |
| Llama 2 13B | 13B | 26 | 13 | 6.5 | 31 |
| Llama 2 70B | 70B | 140 | 70 | 35 | 168 |
| Llama 3.1 8B | 8B | 16 | 8 | 4 | 19 |
| Llama 3.1 70B | 70B | 140 | 70 | 35 | 168 |
| Llama 3.1 405B | 405B | 810 | 405 | 203 | 972 |
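The sizing formula and the FP16 columns of the table above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, sizes are in decimal GB, and the flat 20% overhead is the article's rule of thumb rather than a measured value.

```python
# Bytes per parameter for each precision, as listed in the formula above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_size_gb(params_billions: float, precision: str) -> float:
    """Weights-only size in GB: parameters (in billions) x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def vram_estimate_gb(params_billions: float, precision: str, overhead: float = 0.20) -> float:
    """Weights plus a flat overhead fraction (rule of thumb: 0.10 to 0.20)."""
    return weight_size_gb(params_billions, precision) * (1.0 + overhead)


if __name__ == "__main__":
    for name, size in [("Llama 2 7B", 7), ("Llama 2 70B", 70), ("Llama 3.1 405B", 405)]:
        fp16 = weight_size_gb(size, "fp16")
        padded = round(vram_estimate_gb(size, "fp16"))
        print(f"{name}: {fp16:.0f} GB weights, ~{padded} GB with overhead")
```

Treat the padded figure as a planning number only; real overhead depends on context length and batch size.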
GPU Memory Reference
| GPU | VRAM | Fits Llama 70B | Precision |
|---|---|---|---|
| A100 40GB | 40 GB | INT4 only (1×) or FP8 (2×) | INT4 ✅, FP8 2×, FP16 4× |
| A100 80GB | 80 GB | FP8 (1×) or FP16 (2×) | FP8 ✅, FP16 2× |
| L40S | 48 GB | INT4 only (1×) or FP8 (2×) | INT4 ✅, FP8 2× |
| H100 80GB | 80 GB | FP8 (1×) or FP16 (2×) | FP8 ✅, FP16 2× |
| H200 | 141 GB | FP16 (1×) | FP16 ✅ |
| GH200 480GB | 480 GB | FP16 (1×) with room | FP16 ✅ |
| B200 | 192 GB | FP16 (1×) with room | FP16 ✅ |
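The GPU counts in the table follow from dividing the weight size by per-GPU VRAM and rounding up. The sketch below is one way to compute that lower bound. The helper name and the power-of-two rounding are assumptions made for illustration: tensor-parallel sizes are usually chosen to divide the attention head count evenly, which is why a raw count of 3 is rounded up to 4.

```python
import math


def min_gpus(weights_gb: float, gpu_vram_gb: float, power_of_two: bool = True) -> int:
    """Smallest GPU count whose combined VRAM holds the weights alone.

    KV cache and runtime overhead are ignored, so this is a lower bound.
    With power_of_two=True the count is rounded up to 1, 2, 4, 8, ...
    """
    count = math.ceil(weights_gb / gpu_vram_gb)
    if power_of_two:
        # Round up to the next power of two (3 -> 4, 4 -> 4, 5 -> 8).
        count = 1 << (count - 1).bit_length()
    return count


if __name__ == "__main__":
    for gpu, vram in [("A100 40GB", 40), ("H100 80GB", 80), ("H200", 141)]:
        print(gpu, "FP16:", min_gpus(140, vram), "FP8:", min_gpus(70, vram))
```

Because the result ignores KV cache, a configuration that only just fits (such as 140 GB of weights on a 141 GB GPU) still needs separate checking.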
Kubernetes Deployment by GPU
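1× H200: FP16 (Best Quality)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-fp16
spec:
  replicas: 1
  selector:                      # required for apps/v1 Deployments
    matchLabels:
      app: llama-70b-fp16        # illustrative label
  template:
    metadata:
      labels:
        app: llama-70b-fp16      # must match spec.selector
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-2-70b-chat-hf
            - --dtype=float16
            - --tensor-parallel-size=1
            - --max-model-len=4096
            # 0.90 × 141 GB ≈ 127 GB, which is below the ~140 GB of FP16 weights;
            # this value likely needs to be raised for the model to load.
            - --gpu-memory-utilization=0.90
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H200

2× H100: FP16 with Tensor Parallelism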
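apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-fp16-tp2
spec:
  replicas: 1
  selector:                      # required for apps/v1 Deployments
    matchLabels:
      app: llama-70b-fp16-tp2    # illustrative label
  template:
    metadata:
      labels:
        app: llama-70b-fp16-tp2  # must match spec.selector
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-2-70b-chat-hf
            - --dtype=float16
            - --tensor-parallel-size=2
            - --max-model-len=4096
            - --gpu-memory-utilization=0.90
          resources:
            limits:
              nvidia.com/gpu: 2
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: shm          # tensor-parallel workers communicate via shared memory
              mountPath: /dev/shm
      volumes:
        - name: shm              # the default container /dev/shm is often too small
          emptyDir:
            medium: Memory
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3

1× H100: FP8 (Best Balance)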
# Fragment: args and resources go under the container, nodeSelector under the pod spec.
args:
  - --model=meta-llama/Llama-2-70b-chat-hf
  - --dtype=float16
  - --quantization=fp8
  - --tensor-parallel-size=1
  - --max-model-len=4096
resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3

1× A100 40GB: INT4 GPTQ (Most Affordable)
# Fragment: args and resources go under the container, nodeSelector under the pod spec.
args:
  - --model=TheBloke/Llama-2-70B-Chat-GPTQ
  - --quantization=gptq
  - --tensor-parallel-size=1
  - --max-model-len=2048           # reduced context to fit in 40GB
  - --gpu-memory-utilization=0.95
resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB

KV Cache Memory Impact
Model weights are only part of the story. The KV cache grows with context length and batch size:
KV cache per token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_param
(the leading 2 covers keys and values)
Llama 2 70B (FP16):
  KV per token = 2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes ≈ 0.31 MB
Context 4096 tokens × batch 16:
  KV cache = 4096 × 16 × 0.31 MB ≈ 20 GB
Total VRAM = Model (140 GB) + KV Cache (20 GB) + Overhead ≈ 168 GB

This is why a single H200 (141GB) can load the model weights but has very little memory left for KV cache, so batch size and context length must stay small.
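The KV cache arithmetic above can be checked with a short script. This is a sketch under the article's stated Llama 2 70B shape (80 layers, 8 KV heads, head dimension 128); the function names are illustrative. Note that the 0.31 MB and 20 GB figures come out exactly in binary units (MiB and GiB), which is what the code reports.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one token: keys and values (factor 2) in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_len: int, batch_size: int, per_token_bytes: int) -> float:
    """Total KV cache in GiB when every sequence in the batch uses the full context."""
    return context_len * batch_size * per_token_bytes / 2**30


def max_batch(free_gib: float, context_len: int, per_token_bytes: int) -> int:
    """Largest batch of full-length sequences that fits in free_gib of memory."""
    return int(free_gib * 2**30 // (context_len * per_token_bytes))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    print(per_token, "bytes per token")
    print(kv_cache_gib(4096, 16, per_token), "GiB at context 4096, batch 16")
    print(max_batch(10, 4096, per_token), "sequences fit in 10 GiB at context 4096")
```

Running the same functions with the memory actually left after loading weights gives a quick upper bound on batch size for a given `--max-model-len`.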
Quick Sizing Decision Matrix
| Your GPU | Budget | Recommendation |
|---|---|---|
| H200 / B200 / GH200 | High | FP16, TP=1: best quality, simplest setup |
| 2× H100 80GB | Medium | FP16, TP=2: full quality, needs NVLink |
| 1× H100 80GB | Medium | FP8, TP=1: minimal quality loss |
| 2× A100 80GB | Medium | FP8, TP=2: good balance |
| 4× A100 40GB | Lower | FP8, TP=4: more GPUs but works |
| 1× A100 40GB / L40S | Low | INT4 GPTQ: noticeable quality loss |
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| OOM on 1× H100 with FP16 | 140GB > 80GB VRAM | Use FP8 or add second GPU with TP=2 |
| Slow inference on 4× GPU | Communication bottleneck | Ensure NVLink (not PCIe) between GPUs |
| Quality degradation | INT4 quantization | Move to FP8: much better quality/VRAM tradeoff |
| KV cache OOM at high batch | Model fits but KV cache doesn't | Reduce `--max-model-len` or batch size (`--max-num-seqs`) |
| Model download timeout | 140GB+ download over slow network | Pre-cache model on PV or use `modelcache` init container |
Best Practices
- Start with FP8 on H100/H200: best quality-per-VRAM ratio
- Use tensor parallelism, not pipeline parallelism, for inference: lower latency
- Set `--gpu-memory-utilization=0.90`: vLLM allocates weights and KV cache inside this budget, and the remaining 10% is headroom for the CUDA context and other allocations
- Pre-download models to PersistentVolumes: avoids cold-start download delays
- Use NVLink for multi-GPU: PCIe bottlenecks tensor parallelism significantly
- Monitor with `nvidia-smi`: watch memory usage under load, not just at startup
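For the monitoring practice above, one option is to poll `nvidia-smi` in CSV mode and parse the result. The sketch below assumes `nvidia-smi` is on the PATH of the container or node; the helper names and the sample values in the usage lines are hypothetical.

```python
import subprocess

# Query per-GPU memory in MiB as plain CSV (no header, no unit suffixes).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> list:
    """Parse `index, used, total` rows (MiB) into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({
            "gpu": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "used_frac": int(used) / int(total),
        })
    return rows


def read_gpu_memory() -> list:
    """Run nvidia-smi once and return parsed memory usage for every GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)


if __name__ == "__main__":
    # Hypothetical sample output for two GPUs; call read_gpu_memory() on a GPU node.
    sample = "0, 73000, 81559\n1, 72000, 81559\n"
    for row in parse_memory_csv(sample):
        print(f"GPU {row['gpu']}: {row['used_frac']:.0%} used")
```

Calling `read_gpu_memory()` on a schedule while the server is handling requests shows usage under load rather than at startup.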
Key Takeaways
- Llama 2 70B FP16 = 140GB VRAM (70B params × 2 bytes)
- Add ~10-20% overhead for KV cache, activations, and CUDA context
- H200 (141GB) fits FP16 on 1 GPU; H100 (80GB) needs FP8 or 2× GPUs
- FP8 is the sweet spot: 50% less VRAM with minimal quality loss
- INT4/GPTQ cuts to 35GB but quality degrades noticeably
- KV cache scales with context length × batch size, so factor this into VRAM planning

