Multi-GPU and Tensor Parallel LLM Inference on Kubernetes
Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.
💡 Quick Answer: Set --tensor-parallel-size N in vLLM (or NIM_TP_SIZE=N for NIM), where N matches the GPU count. Request nvidia.com/gpu: N in the pod spec. vLLM automatically shards model layers across GPUs. A 70B model at bf16 needs 4× A100-40GB or 2× A100-80GB. Ensure the GPUs are on the same node, connected via NVLink, for best performance.
Models larger than ~15B parameters typically exceed single-GPU memory. Tensor parallelism splits the model across multiple GPUs so they work together on each request.
When You Need Multi-GPU
| Model | Parameters | bf16 Memory | GPUs Needed (A100-80GB) |
|---|---|---|---|
| Mistral-7B | 7B | ~14 GB | 1 |
| Llama-2-13B | 13B | ~26 GB | 1 |
| Llama-2-70B | 70B | ~140 GB | 2 |
| Mixtral-8x7B | 46.7B | ~90 GB | 2 |
| Llama-3.1-405B | 405B | ~810 GB | 16 |
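The bf16 column above is just parameter count × 2 bytes per parameter. A quick back-of-the-envelope check (model names and parameter counts taken from the table; real usage adds KV cache and activation memory on top):

```python
# Rough bf16 weight memory: 2 bytes per parameter, so N billion params ~= 2N GB.
def bf16_weight_gb(params_billions: float) -> float:
    return params_billions * 2  # 1e9 params * 2 bytes = 2 GB

for name, params in [("Mistral-7B", 7), ("Llama-2-70B", 70), ("Llama-3.1-405B", 405)]:
    print(f"{name}: ~{bf16_weight_gb(params):.0f} GB")  # 14 GB, 140 GB, 810 GB
```

Weights alone do not determine the GPU count: leave roughly 20% headroom for KV cache and activations, as discussed in the TP size guide below.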
Parallelism Strategies
Tensor Parallelism (TP)
Splits each layer across GPUs. All GPUs process every request together.
GPU 0: Layer 1 (half) + Layer 2 (half) + ... + Layer N (half)
GPU 1: Layer 1 (half) + Layer 2 (half) + ... + Layer N (half)
- Best for: low-latency single-request inference
- Requires: GPUs on same node with NVLink
- Set with:
--tensor-parallel-size
Pipeline Parallelism (PP)
Assigns different layers to different GPUs. Requests flow through GPUs sequentially.
GPU 0: Layers 1-16
GPU 1: Layers 17-32
- Best for: spreading across nodes or non-NVLink setups
- Higher latency per request but more flexible
- Set with:
--pipeline-parallel-size
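To make the two layouts concrete, here is a small sketch (layer count and split logic are illustrative, not vLLM's internal implementation) of how a 32-layer model maps to GPUs under pipeline parallelism, versus tensor parallelism where every GPU holds a slice of all layers:

```python
import math

def pipeline_stages(num_layers: int, pp_size: int) -> list[range]:
    """Assign contiguous layer ranges to pipeline stages (even split; last stage may be smaller)."""
    per_stage = math.ceil(num_layers / pp_size)
    return [range(i * per_stage, min((i + 1) * per_stage, num_layers))
            for i in range(pp_size)]

# A 32-layer model with --pipeline-parallel-size 2:
for gpu, layers in enumerate(pipeline_stages(32, 2)):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
# Under tensor parallelism each GPU instead holds a 1/TP slice of ALL 32 layers,
# which is why TP needs fast all-reduce links (NVLink) between the GPUs.
```

The two can be combined: total GPUs = tensor-parallel-size × pipeline-parallel-size, with TP within a node and PP across nodes.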
vLLM Multi-GPU Deployment
# llama-70b-multi-gpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b
  template:
    metadata:
      labels:
        app: llama-70b
    spec:
      containers:
      - name: vllm
        image: registry.example.com/org/vllm-cuda:latest
        command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model
        - /data/Llama-2-70B-hf
        - --dtype
        - bfloat16
        - --tensor-parallel-size
        - "2"                  # Split across 2 GPUs
        - --max-model-len
        - "4096"
        ports:
        - containerPort: 8000
        env:
        - name: HF_HUB_OFFLINE
          value: "1"
        - name: TRANSFORMERS_OFFLINE
          value: "1"
        - name: NCCL_DEBUG
          value: "WARN"        # Set to INFO for debugging
        resources:
          limits:
            nvidia.com/gpu: "2"   # Must match tensor-parallel-size
          requests:
            nvidia.com/gpu: "2"
        volumeMounts:
        - name: model-data
          mountPath: /data
          readOnly: true
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: model-data
        persistentVolumeClaim:
          claimName: model-storage-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi      # Shared memory for NCCL
---
apiVersion: v1
kind: Service
metadata:
  name: llama-70b
  namespace: ai-inference
spec:
  selector:
    app: llama-70b
  ports:
  - port: 8000
    targetPort: 8000
Critical: Shared Memory
Multi-GPU inference uses NCCL for GPU-to-GPU communication. NCCL requires shared memory (/dev/shm):
volumes:
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 16Gi   # At least 1 GB; 16 GB recommended for large models
Without this, you get:
NCCL WARN: Failed to open shared memory
NIM Multi-GPU Deployment
For NVIDIA NIM, set tensor parallelism via environment variable:
env:
- name: NIM_MODEL_NAME
  value: "/data/Llama-2-70B-hf/"
- name: NIM_SERVED_MODEL_NAME
  value: "Llama-2-70B"
- name: NIM_TP_SIZE
  value: "2"
resources:
  limits:
    nvidia.com/gpu: "2"
Topology-Aware Scheduling
For best multi-GPU performance, schedule pods on nodes where GPUs are connected via NVLink:
# Node affinity for NVLink nodes
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.count
            operator: Gt        # valid operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
            values: ["1"]       # more than 1 GPU, i.e. at least 2
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A100-SXM4-80GB
            - NVIDIA-H100-SXM5-80GB
If using KAI Scheduler, it can automatically detect and prefer NVLink topologies. See KAI Scheduler Topology-Aware Placement.
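To check which nodes actually carry the labels used above, and whether a node's GPUs are NVLink-connected, you can inspect the cluster directly. These commands assume NVIDIA GPU Feature Discovery is installed (it sets the nvidia.com/gpu.count and nvidia.com/gpu.product labels); adjust the pod name and namespace to your deployment:

```shell
# List GPU count and product labels per node
kubectl get nodes -L nvidia.com/gpu.count,nvidia.com/gpu.product

# Inside a running GPU pod, print the GPU interconnect matrix.
# NV# entries mean NVLink; PIX/PXB/PHB mean PCIe-only paths.
kubectl exec -it <pod> -n ai-inference -- nvidia-smi topo -m
```

If the topology matrix shows only PCIe paths between GPUs, tensor parallelism will still work but all-reduce traffic becomes the bottleneck.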
Verify Multi-GPU Setup
# Check pod has multiple GPUs
kubectl exec -it <pod> -n ai-inference -- nvidia-smi
# Should show 2+ GPUs listed
# Verify NCCL connectivity
kubectl logs <pod> -n ai-inference | grep -i "nccl\|parallel"
# Look for successful initialization:
# "Initializing tensor parallel group with size 2"
# "NCCL version: ..."
# Test inference
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/Llama-2-70B-hf",
    "prompt": "Hello from multi-GPU inference:",
    "max_tokens": 32
  }'
TP Size Selection Guide
| Model Size | A100-40GB | A100-80GB | H100-80GB |
|---|---|---|---|
| 7B (bf16) | TP=1 | TP=1 | TP=1 |
| 13B (bf16) | TP=1 | TP=1 | TP=1 |
| 34B (bf16) | TP=2 | TP=1 | TP=1 |
| 70B (bf16) | TP=4 | TP=2 | TP=2 |
| 70B (AWQ 4-bit) | TP=2 | TP=1 | TP=1 |
| 405B (bf16) | TP=16 | TP=16 | TP=16 (FP8: TP=8) |
Rule of thumb: TP = ceil(model_bf16_GB × 1.2 / GPU_VRAM_GB)
The 1.2× factor accounts for KV cache and activation memory.
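The rule of thumb can be sketched as a small helper (a hypothetical function, not part of vLLM). Note that it is deliberately conservative: the table above shows 70B bf16 running at TP=2 on 80 GB GPUs by capping --max-model-len, while the formula suggests 3, and vLLM additionally requires TP to divide the model's attention-head count, so in practice you round to a power of two:

```python
import math

def tp_rule_of_thumb(model_bf16_gb: float, gpu_vram_gb: float) -> int:
    """Conservative lower bound on tensor-parallel size:
    ceil(model_bf16_GB * 1.2 / GPU_VRAM_GB)."""
    return math.ceil(model_bf16_gb * 1.2 / gpu_vram_gb)

print(tp_rule_of_thumb(14, 80))   # 7B on A100-80GB  -> 1
print(tp_rule_of_thumb(140, 80))  # 70B on A100-80GB -> 3 (table: TP=2 with shorter context)
print(tp_rule_of_thumb(140, 40))  # 70B on A100-40GB -> 5 (table: TP=4 with shorter context)
```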
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| NCCL error: unhandled system error | Missing /dev/shm mount | Add emptyDir with medium: Memory |
| Slow multi-GPU inference | PCIe instead of NVLink | Use SXM GPUs or NVSwitch topology |
| CUDA error: out of memory | TP size too small | Increase --tensor-parallel-size |
| Pod pending | Not enough GPUs on one node | Check node GPU count; use PP for cross-node |
| Hangs on startup | NCCL port blocked | Ensure pod-to-pod communication is allowed |