NIM Multi-Node Deployment with Helm on K8s
Deploy NVIDIA NIM across multiple Kubernetes nodes using Helm, LeaderWorkerSet, Ray, and vLLM. Run Llama 405B and DeepSeek-R1 on 16+ GPUs.
π‘ Quick Answer: Enable
multiNode.enabled=truein the NIM Helm chart withtensorParallelSizeandpipelineParallelSizevalues. NIM uses LeaderWorkerSet + Ray for cluster formation and vLLM for distributed execution. The leader downloads the model to shared storage (RWX PVC); workers join via Ray and provide GPU resources.
The Problem
Very large language models like Llama 3.1 405B (810GB in BF16) or DeepSeek-R1 (1.2TB) cannot fit on a single nodeβs GPU memory. Even with 8Γ H100 80GB (640GB total), you need multiple nodes. NIM supports multi-node deployment using Ray for cluster orchestration and vLLM for distributed model parallelism, but setting it up on Kubernetes requires LeaderWorkerSet, shared storage, and proper NCCL networking configuration.
flowchart LR
subgraph Leader["Leader Node (8Γ GPU)"]
RH["Ray Head"]
VLLM["vLLM Server<br/>API endpoint :8000"]
end
subgraph Worker["Worker Node (8Γ GPU)"]
RW["Ray Worker"]
WGPU["8Γ H100 GPUs"]
end
RH <-->|"Ray cluster<br/>port 6379"| RW
VLLM -->|"NCCL over<br/>InfiniBand/RoCE"| WGPU
SHARED["Shared PVC (RWX)<br/>Model cache 200Gi"] --- Leader
SHARED --- WorkerThe Solution
Prerequisites
1. Install LeaderWorkerSet CRD:
kubectl apply --server-side -f \
https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml2. Create NGC API secret:
kubectl create secret generic ngc-api \
--from-literal=NGC_API_KEY=<your-ngc-key>3. Create image pull secret for NGC registry:
kubectl create secret docker-registry nvcr-imagepull \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=<your-ngc-key>4. Ensure shared storage β a StorageClass that supports ReadWriteMany (NFS, CephFS, Lustre, etc.)
5. High-speed networking β InfiniBand or RoCE strongly recommended for cross-node NCCL traffic.
Parallelism Strategies
NIM splits model weights across nodes using two strategies:
| Strategy | What it splits | When to use |
|---|---|---|
| Tensor Parallelism (TP) | Model layers across GPUs | Within a node (fast NVLink) |
| Pipeline Parallelism (PP) | Model stages across nodes | Across nodes (slower network) |
Rule of thumb: Set TP = GPUs per node, PP = number of nodes.
Total GPUs = TP Γ PP
Example: Llama 405B on 2 nodes Γ 8 GPUs
TP = 8 (split layers across 8 GPUs per node)
PP = 2 (split stages across 2 nodes)
Total = 16 GPUsFor high-bandwidth interconnects (InfiniBand), you can also use multi-node TP:
Example: TP=16, PP=1 across 2 nodes Γ 8 GPUs
One TP group spans both nodes
Requires RDMA for acceptable latencyDeploy Llama 3.1 405B (TP=8, PP=2)
# values-llama-405b.yaml
image:
repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
tag: "1.7.3"
model:
ngcAPISecret: ngc-api
jsonLogging: false # REQUIRED: JSON logging breaks Ray workers
multiNode:
enabled: true
workers: 1 # 1 worker + 1 leader = 2 nodes total
tensorParallelSize: 8 # 8 GPUs per node
pipelineParallelSize: 2 # 2 pipeline stages (1 per node)
resources:
limits:
nvidia.com/gpu: 8
requests:
nvidia.com/gpu: 8
persistence:
enabled: true
size: 200Gi
accessMode: ReadWriteMany
storageClass: nfs-csi # Must support RWX
imagePullSecrets:
- name: nvcr-imagepull# Install
helm install nim-405b nim-llm/ -f values-llama-405b.yaml
# Monitor
kubectl get pods -w
# nim-405b-0 (leader) β downloads model, starts Ray head + vLLM
# nim-405b-0-1 (worker) β joins Ray cluster, provides GPUsDeploy DeepSeek-R1 (TP=8, PP=2)
# values-deepseek-r1.yaml
image:
repository: nvcr.io/nim/deepseek-ai/deepseek-r1
tag: "1.7.3"
model:
ngcAPISecret: ngc-api
jsonLogging: false
multiNode:
enabled: true
workers: 1
tensorParallelSize: 8
pipelineParallelSize: 2
resources:
limits:
nvidia.com/gpu: 8
requests:
nvidia.com/gpu: 8
persistence:
enabled: true
size: 300Gi # DeepSeek-R1 is larger
accessMode: ReadWriteMany
storageClass: nfs-csi
imagePullSecrets:
- name: nvcr-imagepullMulti-Node Tensor Parallelism (TP=16)
For InfiniBand-equipped clusters where you want a single TP group spanning nodes:
multiNode:
enabled: true
workers: 1
tensorParallelSize: 16 # Single TP group across 2 nodes
pipelineParallelSize: 1 # No pipeline split
env:
- name: NCCL_IB_DISABLE
value: "0" # Enable InfiniBand
- name: NCCL_NET_GDR_LEVEL
value: "5" # GPUDirect RDMAβ οΈ Multi-node TP requires high-bandwidth, low-latency interconnect (InfiniBand/RoCE with RDMA). Ethernet-only clusters should use PP instead.
Profile Selection
Two approaches to select the model profile:
Option A: TP/PP values (recommended)
helm install nim-llm nim-llm/ \
--set multiNode.enabled=true \
--set multiNode.workers=1 \
--set multiNode.tensorParallelSize=8 \
--set multiNode.pipelineParallelSize=2NIM injects NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE and auto-selects the matching profile.
Option B: Explicit profile name
helm install nim-llm nim-llm/ \
--set multiNode.enabled=true \
--set multiNode.workers=1 \
--set model.profile=vllm-bf16-tp8-pp2Model Storage Options
Shared PVC (recommended):
persistence:
enabled: true
size: 200Gi
accessMode: ReadWriteMany # REQUIRED for multi-node
storageClass: nfs-csi
# Or use a pre-created PVC:
# existingClaim: nim-model-cache # Survives helm uninstallNFS direct mount:
persistence:
enabled: true
nfs:
enabled: true
server: "10.0.1.50"
path: "/exports/nim-cache"Independent downloads (not recommended):
Without shared storage, each node downloads the model independently to emptyDir. Wastes time and bandwidth.
Model-Free Deployment (Hugging Face Models)
Deploy any HF model across multiple nodes without a pre-built NIM container:
image:
repository: nvcr.io/nim/nim-llm
tag: "2.0.2"
model:
modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
ngcAPISecret: ngc-api
hfTokenSecret: hf-token # Secret with HF_TOKEN key
jsonLogging: false
multiNode:
enabled: true
workers: 1
tensorParallelSize: 8
pipelineParallelSize: 2
persistence:
enabled: true
size: 200Gi
accessMode: ReadWriteMany
storageClass: nfs-csi# Create HF token secret
kubectl create secret generic hf-token \
--from-literal=HF_TOKEN=<your-hf-token>NCCL Tuning for Multi-Node
# Add to Helm values
env:
# InfiniBand settings
- name: NCCL_IB_DISABLE
value: "0" # 0 = enable IB
- name: NCCL_NET_GDR_LEVEL
value: "5" # GPUDirect RDMA level
- name: NCCL_IB_HCA
value: "mlx5" # HCA device filter
# Debugging
- name: NCCL_DEBUG
value: "WARN" # Use INFO for troubleshooting
# For Ethernet-only (no IB)
# - name: NCCL_SOCKET_IFNAME
# value: "eth0"
# - name: NCCL_IB_DISABLE
# value: "1"Verify Deployment
# Check all pods are running
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# nim-405b-0 1/1 Running 0 12m (leader)
# nim-405b-0-1 1/1 Running 0 12m (worker)
# Check leader logs for Ray cluster formation
kubectl logs nim-405b-0 | grep -E "Ray|worker|GPU"
# Ray head started at 10.244.1.5:6379
# Worker 10.244.2.8 connected with 8 GPUs
# Total GPUs in cluster: 16
# Check profile selection
kubectl logs nim-405b-0 | grep "profile"
# Selected profile: vllm-bf16-tp8-pp2
# Test inference
kubectl port-forward svc/nim-405b 8000:8000
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-405b-instruct",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 64
}'Helm Values Quick Reference
| Parameter | Description | Default |
|---|---|---|
multiNode.enabled | Enable multi-node mode | false |
multiNode.workers | Worker pods (total nodes = workers + 1) | 1 |
multiNode.tensorParallelSize | GPUs per TP group | 0 |
multiNode.pipelineParallelSize | Pipeline stages across nodes | 0 |
multiNode.ray.port | Ray head communication port | 6379 |
model.profile | Explicit profile name or hash | "" |
model.jsonLogging | JSON log format (must be false) | true |
model.hfTokenSecret | Secret name for HF_TOKEN | "" |
persistence.accessMode | PVC access mode (must be RWX) | ReadWriteOnce |
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Worker fails to start with log formatter error | jsonLogging is true | Set model.jsonLogging: false β NIM JSON formatter not available in Ray workers |
| Worker canβt join Ray cluster | Network policy blocking port 6379 | Allow TCP 6379 between leader and worker pods |
| PVC mount fails on worker | PVC is ReadWriteOnce | Change to ReadWriteMany β RWO canβt mount on multiple nodes |
| NCCL timeout | No RDMA, slow Ethernet between nodes | Use PP instead of multi-node TP, or enable InfiniBand |
| Model download on every restart | No persistent storage | Use PVC with existingClaim β survives helm uninstall |
LWS_LEADER_ADDRESS not set | LeaderWorkerSet CRD not installed | Install LWS: kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml |
| Pods pending (insufficient GPUs) | Not enough GPU nodes | Verify nvidia.com/gpu availability: kubectl describe nodes | grep nvidia.com/gpu |
| Profile not found | TP/PP and profile both unset | Must set either tensorParallelSize/pipelineParallelSize OR model.profile |
Best Practices
- Always disable JSON logging β
model.jsonLogging: falseis mandatory for multi-node - Use shared storage β RWX PVC lets the leader download once; workers read from cache
- Use
existingClaimβ pre-created PVC persists across Helm upgrades and reinstalls - TP within node, PP across nodes β NVLink is 10-50Γ faster than network; keep TP intra-node
- Pin image versions β use
tag: "1.7.3"notlatestfor reproducible deployments - Size PVC for your model β 200Gi for 405B BF16, 300Gi for DeepSeek-R1, 100Gi for FP8 quantized
- InfiniBand for multi-node TP β if using TP across nodes, RDMA is not optional
Key Takeaways
- NIM multi-node uses LeaderWorkerSet + Ray + vLLM for distributed inference across Kubernetes nodes
- Set TP = GPUs per node, PP = number of nodes as the default strategy
- Shared RWX storage eliminates redundant model downloads across nodes
- JSON logging must be disabled β it crashes Ray worker processes
- The NIM Helm chart handles all Ray cluster formation and vLLM distributed execution automatically
- Multi-node TP (single TP group spanning nodes) requires InfiniBand/RoCE for production performance

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
