Deploy Multinode NIM Models on Kubernetes
Run large language models across multiple GPU nodes with NVIDIA NIM. Tensor parallelism, NCCL, InfiniBand, and Kubernetes Job orchestration.
💡 Quick Answer: Multinode NIM deploys models too large for a single node (like DeepSeek-R1 671B or Llama 3.1 405B) across multiple GPU nodes using tensor parallelism over NCCL + InfiniBand. Deploy as a Kubernetes LeaderWorkerSet or paired Jobs with RDMA networking for optimal performance.
The Problem
Some large language models exceed the GPU memory available on a single node. Even a node with 8× H100 80GB (640GB total) cannot hold models like DeepSeek-R1 (671B parameters, MoE architecture) or Llama 3.1 405B at FP16 (roughly 810GB of weights). You need to split the model across multiple nodes while maintaining the low-latency inter-GPU communication required for inference.
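The arithmetic behind that claim can be sketched in a few lines of shell. This is a rough estimate only: it counts weights and ignores KV cache, activations, and runtime overhead, so real requirements are higher. The function names are hypothetical helpers, and the 1 byte per parameter used for DeepSeek-R1 assumes FP8 weights.

```shell
#!/bin/sh
# Rough weight-memory check: parameters (billions) x bytes per parameter = GB of weights.
# Assumption: only weights are counted; KV cache and activations need extra headroom.

# weights_gb PARAMS_BILLION BYTES_PER_PARAM -> weight size in GB
weights_gb() {
  echo $(( $1 * $2 ))
}

# fits_single_node PARAMS_BILLION BYTES_PER_PARAM NODE_GPU_MEM_GB -> "fits" or "multinode"
fits_single_node() {
  if [ "$(weights_gb "$1" "$2")" -lt "$3" ]; then
    echo "fits"
  else
    echo "multinode"
  fi
}

NODE_MEM_GB=$(( 8 * 80 ))   # 8x H100 80GB = 640 GB

echo "Llama 3.1 405B @ FP16: $(weights_gb 405 2) GB -> $(fits_single_node 405 2 "$NODE_MEM_GB")"
echo "DeepSeek-R1 671B @ FP8 (assumed): $(weights_gb 671 1) GB -> $(fits_single_node 671 1 "$NODE_MEM_GB")"
```

Both models land on "multinode" against a 640GB node, which is why the rest of this recipe spreads them over two nodes.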
flowchart TB
    subgraph Node1["Node 1 - 8× H100 GPUs"]
        G1["GPU 0-7<br/>TP ranks 0-7"]
        NIC1["ConnectX-7<br/>InfiniBand 400Gb/s"]
    end
    subgraph Node2["Node 2 - 8× H100 GPUs"]
        G2["GPU 0-7<br/>TP ranks 8-15"]
        NIC2["ConnectX-7<br/>InfiniBand 400Gb/s"]
    end
    NIC1 <-->|"GPUDirect RDMA<br/>NCCL AllReduce"| NIC2
    CLIENT["Client Request"] --> G1
    G2 --> G1
    G1 --> RESP["Response"]

The Solution
Prerequisites
Before deploying multinode NIM, ensure your cluster has:
- NVIDIA GPU Operator with open kernel modules (useOpenKernelModules: true)
- Network Operator with InfiniBand or RoCE support
- GPUDirect RDMA enabled (DMA-BUF export)
- SR-IOV configured for network isolation (optional but recommended)
- NFD (Node Feature Discovery) labeling GPU nodes
# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu.present=true
# Inspect the GPU/NIC topology of one node (nvidia-smi topo -m only shows the local node)
kubectl exec -it gpu-test-pod -- nvidia-smi topo -m
# Look for NV# (NVLink, e.g. NV18 on HGX H100) between local GPUs
# Look for PIX or PXB between each GPU and its NIC; SYS means traffic crosses CPU sockets
# Verify RDMA devices are exposed to the pod
kubectl exec -it gpu-test-pod -- \
  bash -c 'ls /dev/infiniband/uverbs* 2>/dev/null && echo "RDMA devices available"'

Which NIM Models Support Multinode?
Models that support multinode inference on NVIDIA NGC NIM:
| Model | Parameters | Architecture | Min GPUs | Why Multinode |
|---|---|---|---|---|
| DeepSeek-R1 | 671B | MoE | 16× H100 | All experts must stay resident; total weights exceed single-node memory |
| Llama 3.1 405B | 405B | Dense | 16× H100 | Full model > 640GB at FP16 |
| Nemotron-4 340B | 340B | Dense | 16× H100 | Dense architecture, large KV cache |
| Mixtral 8x22B | 141B | MoE | 16× H100 | MoE routing + expert memory |
| DBRX | 132B | MoE | 16× H100 | Expert parallelism across nodes |
Check the NGC catalog for the current list: filter by Multinode Support: Yes on any NIM container page.
Understanding Tensor Parallelism Across Nodes
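NIM uses tensor parallelism (TP) to shard each layer's weight matrices across GPUs, so every GPU holds a slice of every layer and the partial results are combined with an AllReduce. Within a node, GPUs communicate via NVLink. Across nodes, NCCL uses InfiniBand with GPUDirect RDMA for direct GPU-to-GPU data transfer.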
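To make the rank layout concrete, here is a minimal shell sketch of how a TP size of 16 maps onto two 8-GPU nodes. It assumes ranks are assigned contiguously, node by node (ranks 0-7 on the first node, 8-15 on the second); the function names are hypothetical and are not part of NIM or NCCL.

```shell
#!/bin/sh
# Sketch: map tensor-parallel ranks to nodes and local GPUs.
# Assumption: contiguous rank placement, GPUS_PER_NODE ranks per node.
GPUS_PER_NODE=8
NUM_NODES=2
TP_SIZE=$(( GPUS_PER_NODE * NUM_NODES ))   # 16

# rank_to_node RANK -> index of the node hosting that rank
rank_to_node() { echo $(( $1 / GPUS_PER_NODE )); }

# rank_to_local_gpu RANK -> GPU index within its node
rank_to_local_gpu() { echo $(( $1 % GPUS_PER_NODE )); }

# link_between RANK_A RANK_B -> interconnect used between the two ranks
link_between() {
  if [ "$(rank_to_node "$1")" -eq "$(rank_to_node "$2")" ]; then
    echo "NVLink"
  else
    echo "InfiniBand"
  fi
}

echo "TP size: $TP_SIZE"
for rank in 0 7 8 15; do
  echo "TP rank $rank -> node $(rank_to_node "$rank"), local GPU $(rank_to_local_gpu "$rank")"
done
echo "rank 0 <-> rank 7: $(link_between 0 7)"
echo "rank 7 <-> rank 8: $(link_between 7 8)"
```

Any pair of ranks that straddles the node boundary pays the InfiniBand cost on every AllReduce, which is why cross-node bandwidth and latency dominate multinode TP performance.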
┌─────────────────────────────────────────────┐
│ Tensor Parallelism                          │
│                                             │
│ TP Rank 0-7 (Node 1, GPUs 0-7)              │
│   ↕ NVLink (900 GB/s)                       │
│ TP Rank 0-7 communicate locally             │
│                                             │
│   ↕ InfiniBand (400 Gb/s)                   │
│   ↕ GPUDirect RDMA (zero-copy)              │
│                                             │
│ TP Rank 8-15 (Node 2, GPUs 0-7)             │
│   ↕ NVLink (900 GB/s)                       │
│ TP Rank 8-15 communicate locally            │
└─────────────────────────────────────────────┘

Method 1: LeaderWorkerSet (Recommended)
The LeaderWorkerSet API is the cleanest way to deploy multinode NIM on Kubernetes:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1-nim
  namespace: ai-inference
  labels:
    app: deepseek-r1
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2  # 2 nodes total
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          app: deepseek-r1  # Matched by the inference Service selector
          role: leader
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "16"  # 8 GPUs × 2 nodes
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "1"
              - name: NCCL_IB_HCA
                value: "mlx5"
              - name: NCCL_IB_GID_INDEX
                value: "3"
              - name: NCCL_NET_GDR_LEVEL
                value: "SYS"
              - name: NCCL_P2P_LEVEL
                value: "SYS"
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NIM_LEADER_ADDRESS
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: NIM_NUM_NODES
                value: "2"
              - name: NIM_NODE_RANK
                value: "0"
            ports:
              - containerPort: 8000
                name: http
            resources:
              limits:
                nvidia.com/gpu: "8"
                # Request RDMA device if using SR-IOV
                # nvidia.com/rdma_shared_device: "1"
              requests:
                nvidia.com/gpu: "8"
                cpu: "32"
                memory: "256Gi"
            volumeMounts:
              - name: model-cache
                mountPath: /opt/nim/.cache
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
    workerTemplate:
      metadata:
        labels:
          app: deepseek-r1
          role: worker
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "16"
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "1"
              - name: NCCL_IB_HCA
                value: "mlx5"
              - name: NCCL_IB_GID_INDEX
                value: "3"
              - name: NCCL_NET_GDR_LEVEL
                value: "SYS"
              - name: NCCL_P2P_LEVEL
                value: "SYS"
              - name: NIM_LEADER_ADDRESS
                value: "$(LWS_LEADER_ADDRESS)"
              - name: NIM_NUM_NODES
                value: "2"
              - name: NIM_NODE_RANK
                value: "1"
            resources:
              limits:
                nvidia.com/gpu: "8"
              requests:
                nvidia.com/gpu: "8"
                cpu: "32"
                memory: "256Gi"
            volumeMounts:
              - name: model-cache
                mountPath: /opt/nim/.cache
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule

Install LeaderWorkerSet if not already present:
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml

Method 2: Paired Kubernetes Jobs with Indexed Completion
For clusters without LeaderWorkerSet:
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-headless
  namespace: ai-inference
spec:
  clusterIP: None
  selector:
    job-name: deepseek-r1-multinode
  ports:
    - port: 29500
      name: nccl
---
apiVersion: batch/v1
kind: Job
metadata:
  name: deepseek-r1-multinode
  namespace: ai-inference
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: deepseek-r1
    spec:
      subdomain: deepseek-r1-headless
      containers:
        - name: nim
          image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
          env:
            - name: NIM_TENSOR_PARALLEL_SIZE
              value: "16"
            - name: NIM_NUM_NODES
              value: "2"
            - name: NIM_NODE_RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: NIM_LEADER_ADDRESS
              value: "deepseek-r1-multinode-0.deepseek-r1-headless"
            - name: NCCL_IB_HCA
              value: "mlx5"
            - name: NCCL_NET_GDR_LEVEL
              value: "SYS"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 29500
              name: nccl
          resources:
            limits:
              nvidia.com/gpu: "8"
            requests:
              cpu: "32"
              memory: "256Gi"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      restartPolicy: Never

Service for Inference Endpoint
Expose only the leader node (rank 0) for inference requests:
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-inference
  namespace: ai-inference
spec:
  selector:
    app: deepseek-r1
    role: leader  # Only route to leader
    # Method 2 (Indexed Job) pods carry no role label; there, select on
    # batch.kubernetes.io/job-completion-index: "0" instead (pod label on Kubernetes 1.28+)
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: deepseek-r1-ingress
  namespace: ai-inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  rules:
    - host: deepseek-r1.ai.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: deepseek-r1-inference
                port:
                  number: 8000

NCCL Environment Variables Explained
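env:
  # Which RDMA devices (HCAs) NCCL may use; the mlx5 prefix matches all ConnectX devices (mlx5_0, mlx5_1, ...)
  - name: NCCL_IB_HCA
    value: "mlx5"
  # GID index for RoCE (usually 3 for RoCEv2)
  - name: NCCL_IB_GID_INDEX
    value: "3"
  # Maximum NIC-to-GPU topology distance at which GPUDirect RDMA is used;
  # SYS allows it even when NIC and GPU sit on different CPU sockets
  - name: NCCL_NET_GDR_LEVEL
    value: "SYS"
  # Maximum GPU-to-GPU topology distance at which peer-to-peer transfers are used;
  # SYS allows P2P across CPU sockets (not just NVLink or a shared PCIe switch)
  - name: NCCL_P2P_LEVEL
    value: "SYS"
  # Network interface for bootstrap and out-of-band control traffic
  - name: NCCL_SOCKET_IFNAME
    value: "eth0"
  # Debug logging (INFO for setup, WARN for production)
  - name: NCCL_DEBUG
    value: "INFO"
  # Setting this to "1" disables the IB/RoCE verbs transport and forces TCP sockets;
  # leave it at "0" (the default) for both InfiniBand and RoCE
  # - name: NCCL_IB_DISABLE
  #   value: "0"

Verify Multinode Communication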
# Check NCCL initialization in leader logs
kubectl logs -f deepseek-r1-nim-0 -n ai-inference | grep -i nccl
# Expected:
# NCCL INFO Bootstrap: Using eth0:10.244.1.5<0>
# NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/IB/0
# NCCL INFO Connected all rings
# NCCL INFO 16 coll channels, 16 nvls channels, ...
# Verify all ranks joined
kubectl logs -f deepseek-r1-nim-0 -n ai-inference | grep "rank"
# Expected: All 16 ranks initialized
# Test inference
curl -s http://deepseek-r1-inference.ai-inference.svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/deepseek-r1",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }' | jq '.choices[0].message.content'

Model Cache with Shared PVC
For multinode, both nodes need access to the model weights. Use ReadWriteMany storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache
  namespace: ai-inference
spec:
  accessModes:
    - ReadWriteMany  # Both nodes read the model
  resources:
    requests:
      storage: 1Ti  # Must exceed the on-disk size of the profile you deploy; 671B parameters at FP8 is already about 670GB
  storageClassName: cephfs  # Or any RWX-capable StorageClass

Alternatively, use an init container to download the model to local NVMe on each node:
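initContainers:
  - name: model-download
    image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
    # download-to-cache is the cache utility shipped in NIM LLM images; confirm the name for your NIM version
    command: ["download-to-cache"]
    env:
      - name: NIM_CACHE_PATH
        value: /cache
    volumeMounts:
      - name: local-cache
        mountPath: /cache
volumes:
  - name: local-cache
    emptyDir:
      sizeLimit: 500Gi  # Adjust to the on-disk size of your model profile

Common Issues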
| Issue | Cause | Fix |
|---|---|---|
| NCCL timeout during init | Nodes can't reach each other on NCCL port | Ensure headless Service covers port 29500, check NetworkPolicy |
| NCCL WARN Connect to ... failed | InfiniBand not configured or RDMA unavailable | Verify ibstat, check GPU Operator ClusterPolicy has RDMA enabled |
| OOM on single node | TP size too small for model | Increase to 16 (2 nodes) or 32 (4 nodes); the TP size must evenly divide the model's attention head count |
| Slow inference despite multinode | Falling back to TCP instead of RDMA | Check NCCL_DEBUG=INFO logs for NET/Socket vs NET/IB; it should be IB |
| Worker can't find leader | DNS not resolving headless service | Wait for both pods to be Running, verify nslookup from worker |
| GPUDirect RDMA not available | Open kernel modules not loaded | Set useOpenKernelModules: true in ClusterPolicy |
| Asymmetric GPU topology | Nodes have different GPU/NIC affinity | Use nvidia-smi topo -m to verify, pin NCCL to correct HCA |
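The TCP-fallback check from the table can be scripted. The sketch below is a hypothetical helper (not an NCCL or NIM tool) that reads NCCL_DEBUG=INFO output on stdin and reports which network transport the channels use. It assumes channel lines contain "via NET/IB" or "via NET/Socket", matching the log format shown in the verification section.

```shell
#!/bin/sh
# Classify the cross-node transport NCCL selected, based on NCCL_DEBUG=INFO output.
# Reads log text on stdin and prints one of: rdma, tcp, mixed, unknown.
nccl_transport() {
  nccl_log=$(cat)
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  ib_count=$(printf '%s\n' "$nccl_log" | grep -c 'via NET/IB' || true)
  sock_count=$(printf '%s\n' "$nccl_log" | grep -c 'via NET/Socket' || true)
  if [ "$ib_count" -gt 0 ] && [ "$sock_count" -eq 0 ]; then
    echo "rdma"
  elif [ "$ib_count" -eq 0 ] && [ "$sock_count" -gt 0 ]; then
    echo "tcp"
  elif [ "$ib_count" -gt 0 ] && [ "$sock_count" -gt 0 ]; then
    echo "mixed"
  else
    echo "unknown"
  fi
}

# Demo with a sample channel line in the expected format.
printf '%s\n' 'NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/IB/0' | nccl_transport
```

Against a live deployment, the same function could be fed from the leader pod, for example kubectl logs deepseek-r1-nim-0 -n ai-inference | nccl_transport. A result of tcp or mixed suggests RDMA is not being used on every channel.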
Best Practices
- Use InfiniBand over RoCE when available: typically lower latency, and no PFC tuning needed
- Pin NCCL to the correct HCA: NCCL_IB_HCA=mlx5_0 avoids NCCL probing all devices
- Size /dev/shm generously: NCCL uses shared memory for local GPU communication; 64Gi minimum
- Use LeaderWorkerSet over raw Jobs: automatic leader election, coordinated restarts, cleaner lifecycle
- Monitor with DCGM: track NVLink/IB bandwidth utilization per GPU during inference
- Pre-pull images: NIM containers are 10-15GB; use DaemonSets to pre-pull on GPU nodes
- Test NCCL standalone first: run nccl-tests (allreduce) across nodes before deploying NIM
Key Takeaways
- Multinode NIM is required for models exceeding single-node GPU memory (DeepSeek-R1 671B, Llama 405B, Nemotron 340B)
- Tensor parallelism shards each layer's weights across all GPUs on all nodes; NCCL handles the AllReduce communication
- GPUDirect RDMA over InfiniBand is essential for acceptable cross-node latency
- Deploy with LeaderWorkerSet for production or Indexed Jobs for simpler setups
- The full NVIDIA stack (GPU Operator + Network Operator + SR-IOV) must be in place before multinode NIM works

