Distributed Multi-GPU Inference on Kubernetes
Deploy distributed inference across multiple GPUs and nodes on Kubernetes. Covers tensor parallelism, pipeline parallelism, vLLM, and NIM multi-GPU serving.
π‘ Quick Answer: Large models (70B+ parameters) donβt fit on a single GPU. Use tensor parallelism (split layers across GPUs on one node) or pipeline parallelism (split model stages across nodes) to serve them. vLLM and NIM handle this natively on Kubernetes.
The Problem
- Llama 3.1 405B needs ~810GB VRAM in FP16 β thatβs 10+ A100 80GB GPUs
- Mixtral 8x22B needs ~176GB β 3 A100 80GB GPUs minimum
- Single-GPU serving has latency and throughput limits even for models that fit
- Multi-node inference adds network overhead β needs high-bandwidth interconnects
The Solution
Single-Node Multi-GPU: Tensor Parallelism with vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-70b-tp8
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: llama-70b
template:
metadata:
labels:
app: llama-70b
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.0
args:
- --model=meta-llama/Llama-3.1-70B-Instruct
- --tensor-parallel-size=8
- --gpu-memory-utilization=0.90
- --max-model-len=32768
- --port=8000
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 8
memory: 512Gi
requests:
nvidia.com/gpu: 8
memory: 256Gi
env:
- name: NCCL_DEBUG
value: "WARN"
- name: CUDA_VISIBLE_DEVICES
value: "0,1,2,3,4,5,6,7"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-storage
- name: shm
emptyDir:
medium: Memory
sizeLimit: 64Gi
nodeSelector:
nvidia.com/gpu.count: "8"Multi-Node Inference: Pipeline Parallelism with vLLM
# Head node (rank 0) β coordinates inference
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-405b-head
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: llama-405b
role: head
template:
metadata:
labels:
app: llama-405b
role: head
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.0
args:
- --model=meta-llama/Llama-3.1-405B-Instruct
- --tensor-parallel-size=8
- --pipeline-parallel-size=2
- --port=8000
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 8
env:
- name: VLLM_HOST_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: NCCL_SOCKET_IFNAME
value: "eth0"
volumeMounts:
- name: shm
mountPath: /dev/shm
- name: model-cache
mountPath: /root/.cache/huggingface
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 64Gi
- name: model-cache
persistentVolumeClaim:
claimName: model-storage-405b
---
# Worker node (rank 1)
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-405b-worker
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: llama-405b
role: worker
template:
metadata:
labels:
app: llama-405b
role: worker
spec:
containers:
- name: vllm-worker
image: vllm/vllm-openai:v0.8.0
args:
- --model=meta-llama/Llama-3.1-405B-Instruct
- --tensor-parallel-size=8
- --pipeline-parallel-size=2
resources:
limits:
nvidia.com/gpu: 8
env:
- name: VLLM_HOST_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: NCCL_SOCKET_IFNAME
value: "eth0"
volumeMounts:
- name: shm
mountPath: /dev/shm
- name: model-cache
mountPath: /root/.cache/huggingface
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 64Gi
- name: model-cache
persistentVolumeClaim:
claimName: model-storage-405bNIM Multi-GPU Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: nim-llama-70b
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: nim-llama-70b
template:
metadata:
labels:
app: nim-llama-70b
spec:
containers:
- name: nim
image: nvcr.io/nim/meta/llama-3.1-70b-instruct:1.8.0
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 8
env:
- name: NIM_TENSOR_PARALLEL_SIZE
value: "8"
- name: NIM_MAX_MODEL_LEN
value: "32768"
- name: NIM_GPU_MEMORY_UTILIZATION
value: "0.90"
volumeMounts:
- name: nim-cache
mountPath: /opt/nim/.cache
- name: shm
mountPath: /dev/shm
volumes:
- name: nim-cache
persistentVolumeClaim:
claimName: nim-model-cache
- name: shm
emptyDir:
medium: Memory
sizeLimit: 64GiService and Autoscaling
apiVersion: v1
kind: Service
metadata:
name: inference-api
namespace: inference
spec:
selector:
app: llama-70b
ports:
- port: 8000
targetPort: 8000
type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: inference-hpa
namespace: inference
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llama-70b-tp8
minReplicas: 1
maxReplicas: 4
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_waiting
target:
type: AverageValue
averageValue: "10"Parallelism Strategy Decision
Model fits on 1 GPU?
β Single GPU, no parallelism needed
Model fits on 1 node (multiple GPUs)?
β Tensor Parallelism (TP)
β tp_size = number of GPUs needed
β Best performance (NVLink bandwidth)
Model needs multiple nodes?
β Pipeline Parallelism (PP) across nodes + TP within each node
β pp_size = number of nodes
β tp_size = GPUs per node
β Network becomes bottleneck (use RDMA/InfiniBand)GPU Memory Calculator
Model VRAM (FP16) β parameters Γ 2 bytes
Model VRAM (INT8) β parameters Γ 1 byte
Model VRAM (INT4/GPTQ) β parameters Γ 0.5 bytes
KV Cache VRAM = batch_size Γ seq_len Γ hidden_dim Γ num_layers Γ 2 Γ 2 bytes
Examples:
7B FP16: ~14 GB β 1Γ A100 β
70B FP16: ~140 GB β 2Γ A100-80GB (TP=2) or 4Γ A100-40GB (TP=4)
405B FP16: ~810 GB β 2 nodes Γ 8Γ A100-80GB (TP=8, PP=2)Common Issues
NCCL timeout during multi-node inference
- Cause: Network latency too high or firewall blocking NCCL ports
- Fix: Use RDMA/InfiniBand; open ports 29400-29500; set
NCCL_SOCKET_IFNAME
OOM with tensor parallelism
- Cause: KV cache too large for remaining memory after model sharding
- Fix: Reduce
--max-model-lenor increase--tensor-parallel-size
Uneven GPU utilization
- Cause: Pipeline parallelism creates bubbles between stages
- Fix: Use tensor parallelism when possible; PP is last resort for models too large for one node
Best Practices
- Prefer tensor parallelism β lower latency than pipeline parallelism
- Use NVLink/NVSwitch nodes β 900 GB/s vs 32 GB/s PCIe for inter-GPU communication
- Size
/dev/shmgenerously β NCCL uses shared memory for GPU communication - Pin to topology β use topology-aware scheduling to keep TP GPUs on same NVLink domain
- Quantize first β INT8/INT4 reduces GPU count needed before resorting to multi-node
Key Takeaways
- Tensor parallelism splits across GPUs on one node (fast, uses NVLink)
- Pipeline parallelism splits across nodes (slower, needs network)
- vLLM and NIM handle parallelism natively via
--tensor-parallel-sizeand--pipeline-parallel-size - Always try quantization before adding more GPUs
/dev/shmmust be large enough for NCCL shared memory buffers

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
