πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai advanced ⏱ 15 minutes K8s 1.28+

LeaderWorkerSet Multi-Node Inference on K8s

Deploy multi-node distributed inference using LeaderWorkerSet (LWS) operator on Kubernetes. Covers vLLM pipeline parallelism across nodes for 405B+ parameter

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: LeaderWorkerSet (LWS) operator manages multi-node inference deployments where a leader Pod coordinates workers. Use it for models too large for a single node (405B+) that need pipeline parallelism across multiple 8-GPU nodes.

The Problem

Models like Llama 3.1 405B need 10+ A100-80GB GPUs β€” more than fit in a single node. You need:

  • A leader Pod that serves the API and coordinates inference
  • Worker Pods on other nodes that hold model shards
  • Reliable discovery between leader and workers
  • Automatic restart if any Pod fails (all must restart together)

The Solution

Install LeaderWorkerSet Operator

# Install LWS CRD and controller
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/lws/releases/download/v0.5.0/manifests.yaml

# Verify
kubectl get pods -n lws-system

Deploy 405B Model Across 2 Nodes

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-405b-inference
  namespace: inference
spec:
  replicas: 1                          # 1 replica group (leader + workers)
  leaderWorkerTemplate:
    size: 2                            # 2 Pods total (1 leader + 1 worker)
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm
            image: vllm/vllm-openai:v0.8.0
            command:
              - bash
              - -c
              - |
                # Leader starts Ray head
                ray start --head --port=6379
                
                # Wait for worker to join
                while [ $(ray status 2>/dev/null | grep -c "node_") -lt 2 ]; do
                  echo "Waiting for worker..."
                  sleep 5
                done
                
                # Start vLLM with pipeline parallelism
                python -m vllm.entrypoints.openai.api_server \
                  --model meta-llama/Llama-3.1-405B-Instruct \
                  --tensor-parallel-size 8 \
                  --pipeline-parallel-size 2 \
                  --port 8000 \
                  --trust-remote-code
            ports:
              - containerPort: 8000
                name: http
              - containerPort: 6379
                name: ray
            resources:
              limits:
                nvidia.com/gpu: 8
                memory: 600Gi
              requests:
                nvidia.com/gpu: 8
                memory: 400Gi
            env:
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token
                    key: token
            volumeMounts:
              - name: model-cache
                mountPath: /root/.cache/huggingface
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache-405b
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.count: "8"
    workerTemplate:
      metadata:
        labels:
          role: worker
      spec:
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:v0.8.0
            command:
              - bash
              - -c
              - |
                # Connect to leader's Ray head
                LEADER_ADDR=$(echo $LWS_LEADER_ADDRESS)
                ray start --address=${LEADER_ADDR}:6379 --block
            resources:
              limits:
                nvidia.com/gpu: 8
                memory: 600Gi
              requests:
                nvidia.com/gpu: 8
                memory: 400Gi
            env:
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              - name: LWS_LEADER_ADDRESS
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/leader-address']
            volumeMounts:
              - name: model-cache
                mountPath: /root/.cache/huggingface
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache-405b
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.count: "8"
---
apiVersion: v1
kind: Service
metadata:
  name: llama-405b-api
  namespace: inference
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: llama-405b-inference
    role: leader
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP

Key LWS Features

Feature                          Benefit
────────────────────────────────────────────────────────
RecreateGroupOnPodRestart        If any Pod dies, entire group restarts
LWS_LEADER_ADDRESS annotation    Workers auto-discover leader IP
size: N                          1 leader + (N-1) workers guaranteed together
replicas: M                      Scale to M independent serving groups
Exclusive placement              Each group gets dedicated nodes

Benchmark the Distributed Endpoint

# After deployment is ready
genai-perf \
  --endpoint-type chat \
  --backend vllm \
  --url http://llama-405b-api.inference:8000/v1 \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --concurrency 1 \
  --input-tokens-mean 200 \
  --output-tokens-mean 200 \
  --num-requests 50

Common Issues

Worker can’t connect to leader Ray head

  • Cause: Network policy blocking port 6379 between Pods
  • Fix: Allow intra-namespace traffic; verify LWS_LEADER_ADDRESS resolves

Pipeline parallel slower than expected

  • Cause: Network bandwidth between nodes insufficient for activation transfers
  • Fix: Use RDMA/InfiniBand; reduce pipeline stages; increase tensor parallelism per node

Model loading OOM

  • Cause: Both nodes trying to load full 405B model simultaneously
  • Fix: Use shared RWX PVC with pre-downloaded model; Ray handles shard distribution

Best Practices

  1. LWS over manual Deployments β€” handles group restart semantics correctly
  2. Pre-download models β€” avoid each Pod downloading 800GB independently
  3. RDMA networking β€” pipeline parallelism is network-bound between nodes
  4. Size replicas for HA β€” replicas: 2 gives you a hot spare serving group
  5. Monitor both nodes β€” pipeline bubble means one GPU is idle while other computes

Key Takeaways

  • LWS operator manages leader+worker groups with atomic restart
  • Workers discover leader via LWS_LEADER_ADDRESS annotation
  • Pipeline parallel across nodes + tensor parallel within each node
  • 405B model needs 2Γ— 8-GPU nodes minimum (A100-80GB or H200)
  • RecreateGroupOnPodRestart ensures consistent model state after failure
  • Benchmark with GenAI-Perf to validate multi-node overhead is acceptable
#inference #distributed #lws #vllm #multi-node
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens