ai · intermediate · 15 minutes · K8s 1.28+

NVIDIA H300 GPU Setup on Kubernetes

Deploy NVIDIA H300 GPUs on Kubernetes. H300 vs H100 vs H200 specs comparison, memory bandwidth, GPU Operator setup, and AI inference optimization.

By Luca Berton • 5 min read

💡 Quick Answer: The NVIDIA H300 is a rumored, upcoming GPU in the Hopper family, positioned between the H200 and the Blackwell B200. The production GPUs for Kubernetes AI workloads today are the H100 (80GB HBM3, 3.35 TB/s) and the H200 (141GB HBM3e, 4.8 TB/s). Deploy them with the GPU Operator and Node Feature Discovery, request them with nvidia.com/gpu: 1, and use MIG or time-slicing for multi-tenant sharing.

The Problem

Choosing the right NVIDIA GPU for Kubernetes AI workloads requires understanding:

  • Memory capacity (does the model fit in VRAM?)
  • Memory bandwidth (bounds inference tokens/s)
  • Interconnect topology (NVLink, NVSwitch for multi-GPU)
  • Cost vs performance trade-offs
  • Kubernetes scheduling and resource management

The Solution

NVIDIA GPU Comparison Matrix

| GPU | Architecture | VRAM | Memory BW | FP16 TFLOPS (dense) | NVLink | Use Case |
|---|---|---|---|---|---|---|
| A100 40GB | Ampere | 40GB HBM2 | 1.6 TB/s | 312 | 600 GB/s | Training/inference |
| A100 80GB | Ampere | 80GB HBM2e | 2.0 TB/s | 312 | 600 GB/s | Large model training |
| H100 SXM | Hopper | 80GB HBM3 | 3.35 TB/s | 990 | 900 GB/s | Flagship training |
| H100 PCIe | Hopper | 80GB HBM2e | 2.0 TB/s | 756 | 600 GB/s | Inference / edge |
| H200 SXM | Hopper | 141GB HBM3e | 4.8 TB/s | 990 | 900 GB/s | Large model inference |
| B100 | Blackwell | 192GB HBM3e | 8.0 TB/s | 1800 | 1800 GB/s | Next-gen training |
| B200 | Blackwell | 192GB HBM3e | 8.0 TB/s | 2250 | 1800 GB/s | Flagship next-gen |
| GB200 | Blackwell | 384GB (2×192) | 16 TB/s | 4500 | NVLink 5 | Superchip |

Deploy GPUs on Kubernetes

# 1. Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 2. Install GPU Operator
#    The chart deploys Node Feature Discovery as a dependency by default,
#    so a separate NFD install is not needed.
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set mig.strategy=single

# 3. Verify GPU nodes (gpu-node-1 is an example node name)
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node gpu-node-1 | grep -A8 "Allocatable:"
#   nvidia.com/gpu: 8

Schedule Pods on Specific GPU Types

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3   # H100 SXM; copy the exact value from your node labels
  containers:
  - name: inference
    image: nvcr.io/nvidia/pytorch:24.07-py3
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        cpu: "4"
        memory: 32Gi

# List available GPU types in cluster
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}' | sort -u
# Example output (exact strings follow the driver-reported product name):
# NVIDIA-A100-SXM4-80GB
# NVIDIA-H100-80GB-HBM3
# NVIDIA-H200
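
The label listing above can also be turned into a per-type GPU count. The following is a minimal Python sketch, assuming the node list has the shape produced by `kubectl get nodes -o json`. The helper name `gpu_inventory` and the sample data are hypothetical. The `nvidia.com/gpu.product` label and the `nvidia.com/gpu` allocatable resource are the ones used in this recipe.

```python
from collections import Counter

PRODUCT_LABEL = "nvidia.com/gpu.product"   # label used in this recipe
GPU_RESOURCE = "nvidia.com/gpu"            # extended resource used in this recipe


def gpu_inventory(node_list: dict) -> Counter:
    """Sum allocatable GPUs per product label (hypothetical helper)."""
    totals = Counter()
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        product = labels.get(PRODUCT_LABEL)
        if product is None:
            continue  # no GPU product label: CPU-only or not yet labeled
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes serializes resource quantities as strings, e.g. "8".
        totals[product] += int(allocatable.get(GPU_RESOURCE, "0"))
    return totals


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = {
    "items": [
        {"metadata": {"labels": {PRODUCT_LABEL: "NVIDIA-A100-SXM4-80GB"}},
         "status": {"allocatable": {GPU_RESOURCE: "8", "cpu": "96"}}},
        {"metadata": {"labels": {PRODUCT_LABEL: "NVIDIA-A100-SXM4-80GB"}},
         "status": {"allocatable": {GPU_RESOURCE: "8", "cpu": "96"}}},
        {"metadata": {"labels": {}},
         "status": {"allocatable": {"cpu": "32"}}},
    ]
}

inventory = gpu_inventory(sample)
for product, count in sorted(inventory.items()):
    print(f"{product}: {count} GPUs")
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with `kubectl get nodes -o json` piped in.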

Model-to-GPU Sizing Guide

| Model | Params | FP16 Size | Min GPU | Recommended |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~16GB | 1× A100 40GB | 1× H100 |
| Llama 3.1 70B | 70B | ~140GB | 2× H100 80GB | 2× H200 141GB |
| Llama 3.1 405B | 405B | ~810GB | 8× H200 (TP=8) | 8× B200 |
| Mixtral 8×22B | 141B | ~282GB | 4× H100 80GB | 4× H200 |
| GPT-4 class | ~1.8T | ~3.6TB | 64× H100 | 32× B200 |
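
The sizing arithmetic behind a guide like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive sizing tool: the helper name `min_gpus` is hypothetical, FP16 weights are taken as 2 bytes per parameter, and the GPU count is rounded up to a power of two because tensor-parallel sizes are usually chosen that way. The `headroom` parameter models the 20% KV cache allowance this recipe recommends.

```python
import math


def min_gpus(params_billion: float, vram_gb: float,
             bytes_per_param: float = 2.0, headroom: float = 0.0) -> int:
    """Estimate GPUs needed to hold the weights (hypothetical helper).

    params_billion * bytes_per_param gives the weight footprint in GB.
    headroom is extra space as a fraction, e.g. 0.2 for 20% KV cache.
    """
    needed_gb = params_billion * bytes_per_param * (1.0 + headroom)
    count = max(1, math.ceil(needed_gb / vram_gb))
    # Assumption: round up to the next power of two for tensor parallelism.
    return 1 << (count - 1).bit_length()


weights_only = min_gpus(70, 80)                  # Llama 3.1 70B on 80GB cards
with_kv_cache = min_gpus(70, 80, headroom=0.2)   # same model, 20% headroom
print(f"70B on 80GB GPUs, weights only: {weights_only}")
print(f"70B on 80GB GPUs, with 20% headroom: {with_kv_cache}")
print(f"405B on 141GB GPUs, weights only: {min_gpus(405, 141)}")
```

Under these assumptions, 70B on 80GB cards needs 2 GPUs for the weights alone and 4 once the 20% headroom is included, so a weights-only minimum leaves very little room for KV cache.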

Memory Bandwidth Impact on Inference

Decode tokens/s ≈ Memory_BW / Weight_bytes, where Weight_bytes = 2 × Params at FP16

Llama 70B FP16 (~140GB of weights) on different GPUs:
- A100 80GB:  2000 GB/s / 140GB = ~14 tok/s per GPU
- H100 SXM:   3350 GB/s / 140GB = ~24 tok/s per GPU
- H200 SXM:   4800 GB/s / 140GB = ~34 tok/s per GPU
- B200:       8000 GB/s / 140GB = ~57 tok/s per GPU

These are single-stream upper bounds per GPU of bandwidth. At large batch sizes, compute (TFLOPS) becomes the bottleneck instead.
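
A minimal Python sketch of the bandwidth-bound estimate follows. It assumes each generated token streams the full set of weights from HBM once, so the divisor is the weight footprint in bytes (2 bytes per parameter at FP16, about 140GB for 70B parameters). The helper name `decode_tokens_per_s` is hypothetical. The `tp_size` argument assumes bandwidth scales linearly with tensor parallelism and ignores communication cost, so real throughput will be lower.

```python
def decode_tokens_per_s(mem_bw_gb_s: float, weights_gb: float,
                        tp_size: int = 1) -> float:
    """Upper bound on single-stream decode speed (hypothetical helper).

    Assumes every token reads all weights once and that tensor
    parallelism multiplies the available memory bandwidth.
    """
    return (mem_bw_gb_s * tp_size) / weights_gb


WEIGHTS_GB = 70 * 2  # 70B parameters x 2 bytes (FP16) = 140 GB

# Memory bandwidth in GB/s, taken from the figures in this recipe.
bandwidth_gb_s = {
    "A100 80GB": 2000,
    "H100 SXM": 3350,
    "H200 SXM": 4800,
    "B200": 8000,
}

estimates = {
    gpu: decode_tokens_per_s(bw, WEIGHTS_GB)
    for gpu, bw in bandwidth_gb_s.items()
}
for gpu, tps in estimates.items():
    print(f"{gpu}: ~{tps:.0f} tok/s upper bound")
```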

Multi-GPU Tensor Parallelism

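# Deploy 70B model across 2× H100 with tensor parallelism
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-70b
spec:
  replicas: 1
  selector:                          # required for apps/v1 Deployments
    matchLabels:
      app: llm-70b
  template:
    metadata:
      labels:
        app: llm-70b                 # must match spec.selector
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-70B-Instruct
        - --tensor-parallel-size=2   # must equal the GPU count below
        - --gpu-memory-utilization=0.95
        resources:
          limits:
            nvidia.com/gpu: 2        # 2 GPUs with NVLink
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: shm                  # shared memory for tensor-parallel workers
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory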
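
To see how `--tensor-parallel-size` and `--gpu-memory-utilization` interact with VRAM, here is a rough Python sketch. It assumes the weights split evenly across tensor-parallel ranks and that the serving engine may use `gpu-memory-utilization` × VRAM on each GPU. The helper name `per_gpu_budget_gb` is hypothetical, and activations and CUDA context overhead are ignored, so the real KV cache budget is smaller.

```python
def per_gpu_budget_gb(weights_gb: float, tp_size: int,
                      vram_gb: float, gpu_memory_utilization: float):
    """Return (usable, weights_share, remainder) in GB per GPU.

    Hypothetical helper: remainder approximates the space left for
    KV cache after the weight shard is loaded.
    """
    usable = vram_gb * gpu_memory_utilization
    weights_share = weights_gb / tp_size
    return usable, weights_share, usable - weights_share


# Llama 3.1 70B in FP16 (~140 GB) with the settings from the Deployment.
usable, share, left = per_gpu_budget_gb(
    weights_gb=140, tp_size=2, vram_gb=80, gpu_memory_utilization=0.95
)
print(f"usable per GPU: {usable:.1f} GB")
print(f"weight shard per GPU: {share:.1f} GB")
print(f"left for KV cache per GPU: {left:.1f} GB")
```

With these inputs only a few GB per GPU remain after the weights load, which is why 2× H100 is a tight fit for a 70B FP16 model and why larger-VRAM GPUs or quantization leave more room for long contexts.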

Common Issues

“CUDA out of memory” on model load

The model doesn’t fit in GPU VRAM. Use quantization (INT4/INT8) or raise the tensor-parallel size to spread the weights across more GPUs.

NVLink not detected between GPUs

Full NVLink bandwidth requires GPUs on the same NVSwitch fabric (SXM form factor). PCIe cards fall back to the slower PCIe interconnect unless they are paired with an NVLink bridge. Check the topology with: nvidia-smi topo -m.

GPU Operator not detecting GPUs

Check node labels: kubectl get node <node> -o yaml | grep nvidia. Ensure the host GPU driver is recent enough for the CUDA version in the container image.

Best Practices

  • H100 SXM for training: highest compute with NVLink bandwidth
  • H200 for inference: 141GB VRAM fits larger models without TP
  • Use MIG on H100 for multi-tenant inference workloads
  • Pin GPU type with nodeSelector to prevent scheduling on the wrong GPU
  • Size VRAM for FP16 model + 20% KV cache overhead

Key Takeaways

  • H100 (80GB, 3.35TB/s) is the current training standard; H200 (141GB, 4.8TB/s) is best for large model inference
  • Memory bandwidth bounds single-stream inference throughput; TFLOPS bounds training speed
  • Use the nvidia.com/gpu.product label in a nodeSelector to target specific GPU types
  • Model size in FP16 bytes × 1.2 (KV cache) = minimum VRAM needed
  • Use tensor parallelism across NVLink-connected GPUs for models exceeding single-GPU VRAM
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
