Category: AI • Level: advanced • ⏱ 30 minutes • Kubernetes 1.28+

Multi-GPU and Tensor Parallel LLM Inference on Kubernetes

Deploy large language models across multiple GPUs using tensor parallelism with vLLM and NVIDIA NIM on Kubernetes for high-throughput inference serving.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Set --tensor-parallel-size N in vLLM (or NIM_TP_SIZE=N for NIM) where N matches the GPU count. Request nvidia.com/gpu: N in the pod spec. vLLM automatically shards model layers across GPUs. A 70B model at bf16 needs 4× A100-40GB or 2× A100-80GB. Ensure GPUs are on the same node with NVLink for best performance.

Models larger than ~15B parameters typically exceed single-GPU memory. Tensor parallelism splits the model across multiple GPUs so they work together on each request.

When You Need Multi-GPU

| Model | Parameters | bf16 Memory | GPUs Needed (A100-80GB) |
|---|---|---|---|
| Mistral-7B | 7B | ~14 GB | 1 |
| Llama-2-13B | 13B | ~26 GB | 1 |
| Llama-2-70B | 70B | ~140 GB | 2 |
| Mixtral-8x7B | 46.7B | ~90 GB | 2 |
| Llama-3-405B | 405B | ~810 GB | 8+ |
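
The memory figures above follow directly from bf16's 2 bytes per parameter. A quick sketch (helper names are illustrative, not from any library) to reproduce the table:

```python
import math

def bf16_gb(params_billion: float) -> float:
    # bf16 stores 2 bytes per parameter: 1B params ~= 2 GB of weights
    return params_billion * 2

def min_gpus(params_billion: float, vram_gb: float) -> int:
    # Floor on GPU count from weight memory alone;
    # KV cache and activations need extra headroom on top.
    return math.ceil(bf16_gb(params_billion) / vram_gb)

print(bf16_gb(70))       # 140.0 -> matches the ~140 GB row above
print(min_gpus(70, 80))  # 2x A100-80GB
print(min_gpus(70, 40))  # 4x A100-40GB
```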

Parallelism Strategies

Tensor Parallelism (TP)

Splits each layer across GPUs. All GPUs process every request together.

GPU 0: Layer 1 (half) + Layer 2 (half) + ... + Layer N (half)
GPU 1: Layer 1 (half) + Layer 2 (half) + ... + Layer N (half)
  • Best for: low-latency single-request inference
  • Requires: GPUs on same node with NVLink
  • Set with: --tensor-parallel-size
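
As a toy illustration (NumPy standing in for the GPUs, not vLLM's actual kernels), a column-parallel split gives each device one slice of a layer's weight matrix; each computes its slice of the output, and the slices are gathered:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))        # one token's hidden state
W = rng.standard_normal((8, 16))       # full linear-layer weight

# Column parallelism: "GPU 0" holds the left half of W, "GPU 1" the right half.
W0, W1 = W[:, :8], W[:, 8:]
y0 = x @ W0                            # computed on "GPU 0"
y1 = x @ W1                            # computed on "GPU 1"
y = np.concatenate([y0, y1], axis=1)   # an all-gather over NCCL in a real system

assert np.allclose(y, x @ W)           # identical to the single-GPU result
```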

Pipeline Parallelism (PP)

Assigns different layers to different GPUs. Requests flow through GPUs sequentially.

GPU 0: Layers 1-16
GPU 1: Layers 17-32
  • Best for: spreading across nodes or non-NVLink setups
  • Higher latency per request but more flexible
  • Set with: --pipeline-parallel-size
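
The same toy setup sketches pipeline parallelism: each stage applies its contiguous block of layers and forwards the activations to the next stage, matching a single-device pass:

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(4)]  # 4 toy "layers"

def forward(x, layer_weights):
    # Apply each layer in sequence with a simple nonlinearity
    for W in layer_weights:
        x = np.tanh(x @ W)
    return x

x = rng.standard_normal((1, 8))
full = forward(x, layers)                      # single-device reference

# Pipeline parallelism: stage 0 runs layers 0-1, stage 1 runs layers 2-3.
stage0_out = forward(x, layers[:2])            # on "GPU 0"
stage1_out = forward(stage0_out, layers[2:])   # activations sent to "GPU 1"

assert np.allclose(full, stage1_out)
```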

vLLM Multi-GPU Deployment

# llama-70b-multi-gpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b
  template:
    metadata:
      labels:
        app: llama-70b
    spec:
      containers:
        - name: vllm
          image: registry.example.com/org/vllm-cuda:latest
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /data/Llama-2-70B-hf
            - --dtype
            - bfloat16
            - --tensor-parallel-size
            - "2"                    # Split across 2 GPUs
            - --max-model-len
            - "4096"
          ports:
            - containerPort: 8000
          env:
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: TRANSFORMERS_OFFLINE
              value: "1"
            - name: NCCL_DEBUG
              value: "WARN"          # Set to INFO for debugging
          resources:
            limits:
              nvidia.com/gpu: "2"    # Must match tensor-parallel-size
            requests:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-data
              mountPath: /data
              readOnly: true
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-storage-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi        # Shared memory for NCCL
---
apiVersion: v1
kind: Service
metadata:
  name: llama-70b
  namespace: ai-inference
spec:
  selector:
    app: llama-70b
  ports:
    - port: 8000
      targetPort: 8000

Critical: Shared Memory

Multi-GPU inference uses NCCL for GPU-to-GPU communication. NCCL requires shared memory (/dev/shm):

volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 16Gi    # At least 1 GB, 16 GB recommended for large models

Without this, you get:

NCCL WARN: Failed to open shared memory

NIM Multi-GPU Deployment

For NVIDIA NIM, set tensor parallelism via environment variable:

env:
  - name: NIM_MODEL_NAME
    value: "/data/Llama-2-70B-hf/"
  - name: NIM_SERVED_MODEL_NAME
    value: "Llama-2-70B"
  - name: NIM_TP_SIZE
    value: "2"
resources:
  limits:
    nvidia.com/gpu: "2"

Topology-Aware Scheduling

For best multi-GPU performance, schedule pods on nodes where GPUs are connected via NVLink:

# Node affinity for NVLink nodes
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.count
                operator: Gt
                values: ["1"]        # Gt "1" matches nodes with 2+ GPUs
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
                  - NVIDIA-H100-SXM5-80GB

If using KAI Scheduler, it can automatically detect and prefer NVLink topologies. See KAI Scheduler Topology-Aware Placement.

Verify Multi-GPU Setup

# Check pod has multiple GPUs
kubectl exec -it <pod> -n ai-inference -- nvidia-smi

# Should show 2+ GPUs listed

# Verify NCCL connectivity
kubectl logs <pod> -n ai-inference | grep -i "nccl\|parallel"

# Look for successful initialization:
# "Initializing tensor parallel group with size 2"
# "NCCL version: ..."

# Test inference
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/Llama-2-70B-hf",
    "prompt": "Hello from multi-GPU inference:",
    "max_tokens": 32
  }'
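
The same smoke test can be scripted with Python's standard library; the endpoint URL here is a placeholder for the in-cluster Service address, and the helper names are illustrative:

```python
import json
import urllib.request

def build_completion_request(model: str, prompt: str, max_tokens: int = 32) -> dict:
    # Payload shape for the OpenAI-compatible /v1/completions endpoint vLLM serves
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(endpoint: str, payload: dict) -> dict:
    # endpoint is a placeholder, e.g. "http://llama-70b.ai-inference:8000"
    req = urllib.request.Request(
        f"{endpoint}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_completion_request("/data/Llama-2-70B-hf",
                                   "Hello from multi-GPU inference:")
# complete("http://llama-70b.ai-inference:8000", payload)  # run in-cluster
```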

TP Size Selection Guide

| Model Size | A100-40GB | A100-80GB | H100-80GB |
|---|---|---|---|
| 7B (bf16) | TP=1 | TP=1 | TP=1 |
| 13B (bf16) | TP=1 | TP=1 | TP=1 |
| 34B (bf16) | TP=2 | TP=1 | TP=1 |
| 70B (bf16) | TP=4 | TP=2 | TP=2 |
| 70B (AWQ 4-bit) | TP=2 | TP=1 | TP=1 |
| 405B (bf16) | TP=8+ | TP=8 | TP=8 |

Rule of thumb: TP = ceil(model_bf16_GB × 1.2 / single_GPU_VRAM_GB)

The 1.2× factor reserves headroom for the KV cache and activation memory.
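
The rule of thumb can be sketched in Python. This version additionally rounds up to a power of two, since vLLM requires the TP size to evenly divide the model's attention-head count (that rounding is an added assumption, not part of the rule); the result is deliberately conservative and may land one step above the table's tight minimums:

```python
import math

def tp_size(model_bf16_gb: float, gpu_vram_gb: float, overhead: float = 1.2) -> int:
    # Weights plus ~20% headroom for KV cache and activations,
    # then rounded up to the next power of two.
    raw = math.ceil(model_bf16_gb * overhead / gpu_vram_gb)
    return 2 ** math.ceil(math.log2(raw))

print(tp_size(14, 80))   # 7B on A100-80GB  -> 1
print(tp_size(140, 40))  # 70B on A100-40GB -> 8 (the table's TP=4 is the tight minimum)
```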

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| NCCL error: unhandled system error | Missing /dev/shm mount | Add emptyDir with medium: Memory |
| Slow multi-GPU inference | PCIe instead of NVLink | Use SXM GPUs or NVSwitch topology |
| CUDA error: out of memory | TP size too small | Increase --tensor-parallel-size |
| Pod pending | Not enough GPUs on one node | Check node GPU count; use PP for cross-node |
| Hangs on startup | NCCL port blocked | Ensure pod-to-pod communication is allowed |
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
