ai intermediate ⏱ 15 minutes K8s 1.28+

Llama 2 70B FP16 Model Size 140GB Guide

Llama 2 70B FP16 model size is 140GB. Complete GPU memory requirements for FP16, FP8, INT4 quantization, and multi-GPU tensor parallelism on Kubernetes.

By Luca Berton • 6 min read

💡 Quick Answer: Llama 2 70B in FP16 precision requires ~140GB of VRAM for the weights alone (70 billion parameters × 2 bytes). A single H200 (141GB) can hold them, with almost no room left for KV cache. On H100 (80GB), use 2× GPUs with tensor parallelism. FP8 halves the weights to ~70GB (1× H100), and INT4/GPTQ reduces them to ~35GB (1× A100 40GB).

The Problem

Before deploying Llama 2 70B (or any large model) on Kubernetes, you need to calculate VRAM requirements to choose the right GPU type, count, and parallelism strategy. Getting this wrong means either OOM crashes or wasted GPU spend.

flowchart TB
    PARAMS["70B Parameters"] --> FP16["FP16: 70B × 2 bytes<br/>= 140 GB"]
    PARAMS --> FP8["FP8: 70B × 1 byte<br/>= 70 GB"]
    PARAMS --> INT4["INT4: 70B × 0.5 bytes<br/>= 35 GB"]
    FP16 --> H200["1× H200 (141GB)<br/>or 2× H100 (80GB)"]
    FP8 --> H100["1× H100 (80GB)<br/>or 2× A100 (40GB)"]
    INT4 --> A100["1× A100 40GB<br/>or 1× L40S (48GB)"]

The Solution

Model Size Formula

VRAM = Parameters × Bytes per Parameter + Overhead

Bytes per precision:
  FP32:  4 bytes
  FP16:  2 bytes (BF16 same)
  FP8:   1 byte
  INT4:  0.5 bytes (GPTQ/AWQ)

Overhead: ~10-20% for KV cache, activations, CUDA kernels
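
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names `weights_gb` and `vram_estimate_gb` are made up for this example, sizes are in decimal GB, and the overhead is treated as a flat fraction of the weights, which is only a rule of thumb.

```python
# Bytes per parameter for each precision, taken from the list above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, precision: str) -> float:
    """Weight memory in decimal GB: billions of parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def vram_estimate_gb(params_billions: float, precision: str, overhead: float = 0.20) -> float:
    """Weights plus a flat overhead fraction (an assumed 0.10 to 0.20 rule of thumb)."""
    return weights_gb(params_billions, precision) * (1.0 + overhead)


if __name__ == "__main__":
    # Print weights and weights-plus-overhead for a 70B model.
    for precision in ("fp16", "fp8", "int4"):
        print(precision, weights_gb(70, precision), round(vram_estimate_gb(70, precision), 1))
```

Real overhead depends on context length and batch size, so treat the flat percentage as a starting point for planning.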

Llama Model Family Sizes

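| Model | Parameters | FP16 (GB) | FP8 (GB) | INT4 (GB) | +20% Overhead (FP16, GB) |
|---|---|---|---|---|---|
| Llama 2 7B | 7B | 14 | 7 | 3.5 | 17 |
| Llama 2 13B | 13B | 26 | 13 | 6.5 | 31 |
| Llama 2 70B | 70B | 140 | 70 | 35 | 168 |
| Llama 3.1 8B | 8B | 16 | 8 | 4 | 19 |
| Llama 3.1 70B | 70B | 140 | 70 | 35 | 168 |
| Llama 3.1 405B | 405B | 810 | 405 | 203 | 972 |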

GPU Memory Reference

| GPU | VRAM | Fits Llama 70B | Precision |
|---|---|---|---|
| A100 40GB | 40 GB | INT4 only (1×) or FP8 (2×) | INT4 ✅, FP8 2×, FP16 4× |
| A100 80GB | 80 GB | FP8 (1×) or FP16 (2×) | FP8 ✅, FP16 2× |
| L40S | 48 GB | INT4 only (1×) or FP8 (2×) | INT4 ✅, FP8 2× |
| H100 80GB | 80 GB | FP8 (1×) or FP16 (2×) | FP8 ✅, FP16 2× |
| H200 | 141 GB | FP16 (1×), weights only | FP16 ✅ |
| GH200 480GB | 480 GB CPU LPDDR5X plus 96 or 144 GB HBM | FP16 (1×) with room when the coherent CPU memory is counted | FP16 ✅ |
| B200 | 192 GB | FP16 (1×) with room | FP16 ✅ |
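
The GPU counts in this table follow from dividing the weight size by per-GPU VRAM. The sketch below reproduces that arithmetic. The helper name `min_gpus` is hypothetical, and rounding up to a power of two is an assumption made here so the count divides the model's attention heads evenly under tensor parallelism.

```python
import math


def min_gpus(weights_gb: float, gpu_vram_gb: float, overhead: float = 0.0) -> int:
    """Smallest power-of-two GPU count whose combined VRAM covers the requirement.

    overhead=0.0 matches the table (weights only); pass 0.2 to include the
    rule-of-thumb 20% for KV cache and activations.
    """
    needed = weights_gb * (1.0 + overhead)
    count = math.ceil(needed / gpu_vram_gb)
    # Round up to the next power of two (1, 2, 4, 8, ...).
    return 1 << (count - 1).bit_length()


if __name__ == "__main__":
    # Llama 2 70B weights: 140 GB (FP16), 70 GB (FP8), 35 GB (INT4).
    for name, vram in (("A100 40GB", 40), ("H100 80GB", 80), ("H200", 141)):
        print(name, [min_gpus(size, vram) for size in (140, 70, 35)])
```

Passing `overhead=0.2` for FP16 on an H200 returns 2 rather than 1, which shows how little margin the single-GPU FP16 option has.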

Kubernetes Deployment by GPU

1× H200: FP16 (Best Quality)

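apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-fp16
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: llama-70b-fp16
  template:
    metadata:
      labels:
        app: llama-70b-fp16
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-2-70b-chat-hf
            - --dtype=float16
            - --tensor-parallel-size=1
            - --max-model-len=4096
            # 0.90 × 141 GB ≈ 127 GB, below the ~140 GB of FP16 weights:
            # expect to raise this value or shorten --max-model-len if loading fails
            - --gpu-memory-utilization=0.90
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H200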

2× H100: FP16 with Tensor Parallelism

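apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-fp16-tp2
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: llama-70b-fp16-tp2
  template:
    metadata:
      labels:
        app: llama-70b-fp16-tp2
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-2-70b-chat-hf
            - --dtype=float16
            - --tensor-parallel-size=2
            - --max-model-len=4096
            - --gpu-memory-utilization=0.90
          resources:
            limits:
              nvidia.com/gpu: 2
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          # Tensor-parallel workers exchange data through shared memory,
          # so back /dev/shm with an in-memory emptyDir
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory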

1× H100: FP8 (Best Balance)

args:
  - --model=meta-llama/Llama-2-70b-chat-hf
  - --dtype=float16
  - --quantization=fp8
  - --tensor-parallel-size=1
  - --max-model-len=4096
resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3

1× A100 40GB: INT4 GPTQ (Most Affordable)

args:
  - --model=TheBloke/Llama-2-70B-Chat-GPTQ
  - --quantization=gptq
  - --tensor-parallel-size=1
  - --max-model-len=2048    # Reduced for 40GB
  - --gpu-memory-utilization=0.95
resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB

KV Cache Memory Impact

Model weights are just part of the story. KV cache grows with context length and batch size:

KV Cache per token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_param

Llama 2 70B (FP16):
  KV per token = 2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes ≈ 0.31 MB

Context 4096 tokens × batch 16:
  KV cache = 4096 × 16 × 0.31 MB ≈ 20 GB

Total VRAM = Model (140 GB) + KV Cache (20 GB) + Overhead ≈ 168 GB
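
The same arithmetic as a small Python sketch, using the Llama 2 70B values quoted above (80 layers, 8 KV heads, head dimension 128, 2 bytes per value). The function names are illustrative only, and the result is reported in GiB (2^30 bytes), which is the unit in which the 20 GB figure comes out exact.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_len: int, batch_size: int, per_token_bytes: int) -> float:
    """KV cache size in GiB when every sequence in the batch fills the context."""
    return context_len * batch_size * per_token_bytes / 2**30


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    print(per_token)                            # bytes per token
    print(kv_cache_gib(4096, 16, per_token))    # GiB at 4096 tokens x batch 16
```

This is a worst case: it assumes all 16 sequences reach the full 4096-token context at the same time.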

This is why a single H200 (141GB) can hold the FP16 weights but not much else: with roughly 1GB to spare, it needs a short context and a small batch, or a lower-precision format.
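
To see how much room a given setup leaves for KV cache, subtract the weights from the memory the server is allowed to use. The sketch below is a rough approximation under stated assumptions: it treats `--gpu-memory-utilization` as a plain fraction of total VRAM, ignores activations and CUDA context, and uses the 327,680 bytes per token computed for Llama 2 70B in FP16. The helper names are hypothetical.

```python
def kv_budget_gb(num_gpus: int, gpu_vram_gb: float, utilization: float,
                 weights_gb: float) -> float:
    """Decimal GB left for KV cache once the weights are loaded.

    A negative result means the weights alone exceed the allowed memory.
    """
    return num_gpus * gpu_vram_gb * utilization - weights_gb


def max_cached_tokens(budget_gb: float, per_token_bytes: int = 327_680) -> int:
    """Tokens (summed over all sequences) that fit in the KV budget."""
    if budget_gb <= 0:
        return 0
    return int(budget_gb * 1e9 // per_token_bytes)


if __name__ == "__main__":
    for label, gpus, vram in (("1x H200", 1, 141), ("2x H100", 2, 80)):
        budget = kv_budget_gb(gpus, vram, 0.90, 140)
        print(label, round(budget, 1), max_cached_tokens(budget))
```

Under these assumptions the single-H200 case comes out negative at 0.90 utilization, while 2× H100 leaves a few GB, enough for roughly three concurrent 4096-token sequences.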

Quick Sizing Decision Matrix

| Your GPU | Budget | Recommendation |
|---|---|---|
| H200 / B200 / GH200 | High | FP16, TP=1: best quality, simplest setup |
| 2× H100 80GB | Medium | FP16, TP=2: full quality, needs NVLink |
| 1× H100 80GB | Medium | FP8, TP=1: minimal quality loss |
| 2× A100 80GB | Medium | FP8, TP=2: good balance |
| 4× A100 40GB | Lower | FP8, TP=4: more GPUs but works |
| 1× A100 40GB / L40S | Low | INT4 GPTQ: noticeable quality loss |

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| OOM on 1× H100 with FP16 | 140GB > 80GB VRAM | Use FP8 or add second GPU with TP=2 |
| Slow inference on 4× GPU | Communication bottleneck | Ensure NVLink (not PCIe) between GPUs |
| Quality degradation | INT4 quantization | Move to FP8 for a much better quality/VRAM tradeoff |
| KV cache OOM at high batch | Model fits but KV cache doesn't | Reduce `--max-model-len` or batch size |
| Model download timeout | 140GB+ download over slow network | Pre-cache model on PV or use `modelcache` init container |

Best Practices

  • Start with FP8 on H100/H200: best quality-per-VRAM ratio
  • Use tensor parallelism, not pipeline parallelism, for inference: lower latency
  • Set `--gpu-memory-utilization=0.90`: keeps about 10% of VRAM free for allocations outside the server's own budget, such as the CUDA context
  • Pre-download models to PersistentVolumes: avoid cold-start download delays
  • Use NVLink for multi-GPU: PCIe bottlenecks tensor parallelism significantly
  • Monitor with `nvidia-smi`: watch memory usage under load, not just at startup
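
For the monitoring point, a minimal Python sketch that reads per-GPU memory through `nvidia-smi`'s CSV query mode and parses it. The function names are illustrative. `read_gpu_memory` assumes `nvidia-smi` is on the PATH of the node or container where it runs, and the parser can be exercised on captured text without a GPU.

```python
import subprocess

# CSV query mode: one line per GPU, memory values in MiB without unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Turn 'index, used, total' lines into dicts with a used fraction."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used_mib, total_mib = (int(field.strip()) for field in line.split(","))
        rows.append({
            "index": index,
            "used_mib": used_mib,
            "total_mib": total_mib,
            "used_fraction": used_mib / total_mib,
        })
    return rows


def read_gpu_memory() -> list[dict]:
    """Run nvidia-smi once and return the parsed rows (requires a GPU node)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `read_gpu_memory()` periodically while the server handles real traffic shows whether usage stays near the configured utilization or climbs toward the card's limit.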

Key Takeaways

  • Llama 2 70B FP16 = 140GB VRAM (70B params × 2 bytes)
  • Add 20% overhead for KV cache, activations, and CUDA context
  • H200 (141GB) fits FP16 on 1 GPU; H100 (80GB) needs FP8 or 2× GPUs
  • FP8 is the sweet spot: 50% less VRAM with minimal quality loss
  • INT4/GPTQ cuts to 35GB but quality degrades noticeably
  • KV cache scales with context length × batch size, so factor this into VRAM planning
#llama #model-sizing #gpu-requirements #quantization #vram
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
