AI • Advanced • ⏱ 15 minutes • K8s 1.28+

Distributed Inference on Kubernetes

Deploy distributed LLM inference with tensor parallelism across multiple GPUs and pipeline parallelism across nodes on Kubernetes.

By Luca Berton • 📖 6 min read

πŸ’‘ Quick Answer: Use tensor parallelism (TP) to split model layers across GPUs within a node (NVLink), and pipeline parallelism (PP) to split layer groups across nodes (InfiniBand/RoCE). vLLM: set --tensor-parallel-size 8 --pipeline-parallel-size 2 for 16-GPU inference across 2 nodes.

The Problem

Large models (70B+) don't fit in a single GPU's memory. A 70B FP16 model needs ~140GB of VRAM, more than one H200 (141GB) can spare once the KV cache is counted. A 405B model needs ~810GB. You must split the model across multiple GPUs, and often multiple nodes. The challenge: minimizing inter-GPU communication latency while maximizing throughput.
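The arithmetic behind those numbers is simple; a quick sketch (weights only, ignoring KV cache and activation overhead):

```python
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB: parameters x bytes per parameter.
    FP16 = 2 bytes/param, FP8 = 1 byte/param."""
    return params_billion * bytes_per_param

print(weight_gb(70))      # 140 GB -> over budget on one 141 GB H200 once KV cache is added
print(weight_gb(405))     # 810 GB -> multiple GPUs, likely multiple nodes
print(weight_gb(405, 1))  # 405 GB with FP8 quantization
```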

The Solution

Parallelism Strategies

tensor_parallelism:
  what: "Split each layer across GPUs: every GPU computes part of every layer"
  when: "GPUs within a single node connected via NVLink"
  bandwidth: "NVLink 4.0: 900 GB/s (H200)"
  latency: "~1μs per all-reduce"
  max_practical: "TP=8 (one node)"

pipeline_parallelism:
  what: "Split layer groups across nodes: each node handles consecutive layers"
  when: "Model too large for a single node; multi-node inference"
  bandwidth: "InfiniBand NDR: 400 Gb/s, RoCE: 200-400 Gb/s"
  latency: "~2-5μs per send/recv"
  note: "Creates micro-batch pipeline bubbles"

combined:
  example: "405B model on 2 nodes × 8 GPUs = TP=8, PP=2"
  total_gpus: 16
  memory_per_gpu: "~51GB (810GB / 16)"
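A quick sanity check of a TP×PP layout. These helpers are illustrative; the bubble formula assumes a simple synchronous pipeline schedule:

```python
def per_gpu_weight_gb(model_gb: float, tp: int, pp: int) -> float:
    """Weights are sharded across all tp * pp GPUs."""
    return model_gb / (tp * pp)

def pipeline_bubble_fraction(pp: int, micro_batches: int) -> float:
    """Idle fraction from pipeline fill/drain: (pp - 1) / (micro_batches + pp - 1)."""
    return (pp - 1) / (micro_batches + pp - 1)

print(per_gpu_weight_gb(810, tp=8, pp=2))         # 50.625 GB per GPU (the ~51GB above)
print(round(pipeline_bubble_fraction(2, 16), 3))  # 0.059: more micro-batches shrink the bubble
```

Note that TP does not reduce the bubble; only more in-flight micro-batches (i.e., larger batches) do, which is why PP hurts latency-sensitive, small-batch workloads most.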

vLLM Distributed Inference (Single Node, TP=8)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: tenant-alpha
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-70b
  template:
    metadata:
      labels:
        app: vllm-70b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.6
          args:
            - "--model"
            - "meta-llama/Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size"
            - "8"
            - "--gpu-memory-utilization"
            - "0.92"
            - "--max-model-len"
            - "8192"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 8
          env:
            - name: NCCL_DEBUG
              value: "WARN"
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1,2,3,4,5,6,7"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
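Once the pod is Running, the container serves the OpenAI-compatible API on port 8000. A minimal stdlib client sketch; the in-cluster URL below assumes a Service named after the Deployment, which is not shown in the manifest:

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str,
                       model: str = "meta-llama/Llama-3.1-70B-Instruct"):
    """Build an OpenAI-style /v1/chat/completions request for the vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# In-cluster (hypothetical Service name):
# req = build_chat_request("http://vllm-llama-70b.tenant-alpha.svc:8000", "Hello")
# resp = json.load(urllib.request.urlopen(req))
```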

vLLM Multi-Node (TP=8, PP=2) via Ray

# Head node (Ray head + vLLM controller)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm-405b-head
  namespace: tenant-alpha
spec:
  serviceName: vllm-head
  replicas: 1
  selector:
    matchLabels:
      app: vllm-405b
      role: head
  template:
    metadata:
      labels:
        app: vllm-405b
        role: head
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.6
          # Start the Ray head first, then launch vLLM on the existing Ray cluster;
          # otherwise vLLM spins up an in-pod Ray instance the worker can never join.
          command: ["/bin/sh", "-c"]
          args:
            - |
              ray start --head --port=6379 &&
              exec vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
                --tensor-parallel-size 8 \
                --pipeline-parallel-size 2 \
                --distributed-executor-backend ray \
                --gpu-memory-utilization 0.92 \
                --port 8000
          ports:
            - containerPort: 8000
              name: api
            - containerPort: 6379
              name: ray-head
          resources:
            limits:
              nvidia.com/gpu: 8
          env:
            - name: RAY_ADDRESS
              value: "auto"
            - name: NCCL_IB_HCA
              value: "mlx5_0,mlx5_1"
            - name: NCCL_NET_GDR_LEVEL
              value: "5"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-405b-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 128Gi
---
# Worker node (Ray worker)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm-405b-worker
  namespace: tenant-alpha
spec:
  serviceName: vllm-worker
  replicas: 1
  selector:
    matchLabels:
      app: vllm-405b
      role: worker
  template:
    metadata:
      labels:
        app: vllm-405b
        role: worker
    spec:
      containers:
        - name: ray-worker
          image: vllm/vllm-openai:v0.6.6
          command: ["ray", "start", "--block", "--address=vllm-head-0.vllm-head:6379"]
          resources:
            limits:
              nvidia.com/gpu: 8
          env:
            - name: NCCL_IB_HCA
              value: "mlx5_0,mlx5_1"
            - name: NCCL_NET_GDR_LEVEL
              value: "5"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-405b-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 128Gi
---
# Headless service for Ray discovery
apiVersion: v1
kind: Service
metadata:
  name: vllm-head
  namespace: tenant-alpha
spec:
  clusterIP: None
  selector:
    app: vllm-405b
    role: head
  ports:
    - port: 6379
      name: ray
    - port: 8000
      name: api
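Worker pods join the head through this headless Service, so its DNS name must resolve before ray start can succeed. A small stdlib check equivalent to nslookup; the in-cluster hostname is the one the StatefulSet and Service above produce:

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to an IP address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# In-cluster: resolves("vllm-head-0.vllm-head") should be True before workers start.
print(resolves("localhost"))
```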

TensorRT-LLM Multi-GPU (Triton)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-trtllm-70b
  namespace: tenant-alpha
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-trtllm
  template:
    metadata:
      labels:
        app: triton-trtllm
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
          args:
            - "tritonserver"
            - "--model-repository=/models"
            - "--model-control-mode=explicit"
            - "--load-model=llama-70b"
          resources:
            limits:
              nvidia.com/gpu: 8
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1,2,3,4,5,6,7"
          volumeMounts:
            - name: models
              mountPath: /models
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: trtllm-models
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi

Model Engine Build (TensorRT-LLM)

# Build TensorRT-LLM engine with TP=8
python convert_checkpoint.py \
  --model_dir /models/llama-70b-hf \
  --output_dir /engines/llama-70b-tp8 \
  --tp_size 8 \
  --dtype float16

trtllm-build \
  --checkpoint_dir /engines/llama-70b-tp8 \
  --output_dir /models/llama-70b/1/ \
  --gemm_plugin float16 \
  --gpt_attention_plugin float16 \
  --max_batch_size 64 \
  --max_input_len 2048 \
  --max_seq_len 4096

# For multi-node PP=2:
# Build with --tp_size 8 --pp_size 2
# Deploy with MPI across 2 nodes
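The --max_batch_size and --max_seq_len flags bound the KV cache the engine must be able to hold. A rough sketch of that arithmetic, using Llama-70B architecture figures (80 layers, 8 KV heads with GQA, head dim 128); these are assumptions about the model, not measured values:

```python
def kv_cache_gb(batch: int, seq_len: int, layers: int = 80,
                kv_heads: int = 8, head_dim: int = 128, bytes_per: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x tokens x dtype bytes."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per
    return batch * seq_len * per_token_bytes / 1e9

print(round(kv_cache_gb(64, 4096), 1))  # ~85.9 GB at full batch and context, spread over the TP group
```

This is why a "fits in memory" check on weights alone is misleading: at max_batch_size 64 the KV cache rivals the weights themselves.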

Sizing Guide

# Model size to GPU mapping:
# FP16 memory ≈ 2 × params (bytes)
# FP8 memory ≈ 1 × params (bytes), plus KV cache

models:
  "7B-8B":
    fp16_memory: "~16 GB"
    config: "TP=1, 1x H200"

  "13B":
    fp16_memory: "~26 GB"
    config: "TP=1, 1x H200"

  "70B":
    fp16_memory: "~140 GB"
    config: "TP=2 (H200 141GB just fits TP=1 with small context)"
    fp8_config: "TP=1, 1x H200 with FP8 quantization"

  "120B":
    fp16_memory: "~240 GB"
    config: "TP=2 or TP=4"

  "405B":
    fp16_memory: "~810 GB"
    config: "TP=8, PP=1 (8x H200 = 1128 GB)"
    fp8_config: "TP=8, 1 node (8x H200)"

  "405B+":
    fp8_memory: "~405 GB + large KV cache"
    config: "TP=8, PP=2 (2 nodes, 16 GPUs)"

Pipeline Topology

graph TD
    A[405B Model] --> B[PP Stage 1: first half of layers]
    A --> C[PP Stage 2: second half of layers]
    
    B --> D[Node 1: 8x H200]
    C --> E[Node 2: 8x H200]
    
    D --> F[TP across 8 GPUs via NVLink]
    E --> G[TP across 8 GPUs via NVLink]
    
    D -->|InfiniBand or RoCE| E
    
    H[Request] --> D
    D -->|Pipeline Stage 1| E
    E -->|Result| I[Response]

Common Issues

  • OOM on model load: model too large for the TP config; increase TP size or use FP8 quantization
  • Multi-node inference slow: check the NCCL transport with NCCL_DEBUG=INFO; it should show NET/IB (NET/Socket means NCCL fell back to TCP); ensure GPUDirect RDMA is active
  • Ray worker can't connect: the headless service DNS must resolve; check with nslookup vllm-head-0.vllm-head
  • /dev/shm too small: NCCL uses shared memory for intra-node communication; set sizeLimit: 64Gi or more
  • Pipeline bubbles hurt throughput: PP adds latency per micro-batch; maximize batch size to fill the pipeline

Best Practices

  • TP within a node (NVLink), PP across nodes (InfiniBand); never the reverse
  • Use FP8 quantization to halve memory requirements with minimal quality loss
  • Size /dev/shm to at least 1GB per GPU for NCCL shared memory
  • Pre-download models to a PVC; don't download at pod startup
  • Run genai-perf after deployment to validate TTFT/ITL SLOs
  • Use StatefulSets for multi-node inference: stable network identities for Ray discovery

Key Takeaways

  • Tensor parallelism splits layers across GPUs (intra-node, NVLink)
  • Pipeline parallelism splits layer groups across nodes (inter-node, InfiniBand)
  • vLLM supports both TP and PP via Ray for multi-node inference
  • TensorRT-LLM requires engine compilation with TP/PP baked in
  • H200 (141GB) enables 70B on 1-2 GPUs; 405B needs 8+ GPUs across 1-2 nodes
  • FP8 quantization halves memory with <1% quality loss on most models
#distributed-inference #tensor-parallelism #pipeline-parallelism #vllm #triton #multi-gpu
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
