ai • advanced • 45 minutes • K8s 1.28+

Deploy Multinode NIM Models on Kubernetes

Run large language models across multiple GPU nodes with NVIDIA NIM. Tensor parallelism, NCCL, InfiniBand, and Kubernetes Job orchestration.

By Luca Berton • 9 min read

Quick Answer: Multinode NIM deploys models too large for a single node (like DeepSeek-R1 671B or Llama 3.1 405B) across multiple GPU nodes using tensor parallelism over NCCL and InfiniBand. Deploy it as a Kubernetes LeaderWorkerSet or as an Indexed Job, with RDMA networking for best performance.

The Problem

Some large language models exceed the GPU memory available on a single node. Even a node with 8× H100 80GB (640 GB total) cannot hold DeepSeek-R1 (671B parameters, MoE architecture) or Llama 3.1 405B at FP16 (about 810 GB of weights alone). You need to split the model across multiple nodes while keeping the low-latency inter-GPU communication that inference requires.
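The sizing argument above can be checked with quick arithmetic. The sketch below is a rough estimate of weight memory only (KV cache and activations come on top), and it assumes 2 bytes per parameter for FP16 and 1 byte per parameter for an FP8 DeepSeek-R1 checkpoint. The function names are illustrative, not part of any NVIDIA tool.

```shell
# Estimate weight memory in GB: billions of parameters x bytes per parameter.
# (1e9 parameters x 1 byte = 1 GB, so the product is already in GB.)
weights_gb() {
  echo $(( $1 * $2 ))
}

# Aggregate HBM on one node: GPU count x GB per GPU.
node_hbm_gb() {
  echo $(( $1 * $2 ))
}

node=$(node_hbm_gb 8 80)       # 8x H100 80GB
llama=$(weights_gb 405 2)      # FP16: 2 bytes per parameter
deepseek=$(weights_gb 671 1)   # assumption: FP8, 1 byte per parameter

echo "Node HBM:               ${node} GB"
echo "Llama 3.1 405B (FP16):  ${llama} GB"
echo "DeepSeek-R1 671B (FP8): ${deepseek} GB"
```

Both weight totals are larger than the 640 GB of one node, before any memory is reserved for the KV cache.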

flowchart TB
    subgraph Node1["Node 1: 8× H100 GPUs"]
        G1["GPU 0-7<br/>TP ranks 0-7"]
        NIC1["ConnectX-7<br/>InfiniBand 400Gb/s"]
    end
    subgraph Node2["Node 2: 8× H100 GPUs"]
        G2["GPU 0-7<br/>TP ranks 8-15"]
        NIC2["ConnectX-7<br/>InfiniBand 400Gb/s"]
    end
    NIC1 <-->|"GPUDirect RDMA<br/>NCCL AllReduce"| NIC2
    CLIENT["Client Request"] --> G1
    G2 --> G1
    G1 --> RESP["Response"]

The Solution

Prerequisites

Before deploying multinode NIM, ensure your cluster has:

  1. NVIDIA GPU Operator with open kernel modules (useOpenKernelModules: true)
  2. Network Operator with InfiniBand or RoCE support
  3. GPUDirect RDMA enabled (DMA-BUF export)
  4. SR-IOV configured for network isolation (optional but recommended)
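  5. NFD (Node Feature Discovery) labeling GPU nodes

# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu.present=true

# Inspect the GPU/NIC topology of one node (the matrix covers a single host)
kubectl exec -it gpu-test-pod -- nvidia-smi topo -m
# NV# (NV18 on H100 SXM) = NVLink between local GPUs
# PIX/PXB between a GPU and a NIC = path stays below the PCIe host bridge, which suits GPUDirect RDMA
# SYS = path crosses the inter-socket link inside the node (it is not a cross-node link)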

# Verify NCCL can use RDMA
kubectl exec -it gpu-test-pod -- \
  bash -c 'ls /dev/infiniband/uverbs* 2>/dev/null && echo "RDMA devices available"'

Which NIM Models Support Multinode?

Models that support multinode inference on NVIDIA NGC NIM:

| Model | Parameters | Architecture | Min GPUs | Why Multinode |
| --- | --- | --- | --- | --- |
| DeepSeek-R1 | 671B | MoE | 16× H100 | All expert weights must stay resident, so total parameters exceed single-node memory |
| Llama 3.1 405B | 405B | Dense | 16× H100 | Full model > 640GB at FP16 |
| Nemotron-4 340B | 340B | Dense | 16× H100 | Dense architecture, large KV cache |
| Mixtral 8x22B | 141B | MoE | 16× H100 | MoE routing + expert memory |
| DBRX | 132B | MoE | 16× H100 | Expert parallelism across nodes |

Check the NGC catalog for the current list: filter by Multinode Support: Yes on any NIM container page.

Understanding Tensor Parallelism Across Nodes

NIM uses tensor parallelism (TP) to shard the weight matrices of each layer across GPUs, so every GPU holds a slice of every layer. Within a node, GPUs communicate via NVLink. Across nodes, NCCL uses InfiniBand with GPUDirect RDMA for direct GPU-to-GPU data transfer.

┌─────────────────────────────────────────────┐
│            Tensor Parallelism               │
│                                             │
│  TP Rank 0-7 (Node 1, GPUs 0-7)             │
│       ↕ NVLink (900 GB/s)                   │
│  TP Rank 0-7 communicate locally            │
│                                             │
│       ↕ InfiniBand (400 Gb/s)               │
│       ↕ GPUDirect RDMA (zero-copy)          │
│                                             │
│  TP Rank 8-15 (Node 2, GPUs 0-7)            │
│       ↕ NVLink (900 GB/s)                   │
│  TP Rank 8-15 communicate locally           │
└─────────────────────────────────────────────┘
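The rank layout in the diagram can be sketched as simple integer arithmetic. This assumes ranks are assigned contiguously, node by node, as the diagram shows. The helper names are hypothetical, and node indices here are 0-based (index 0 is "Node 1" in the diagram).

```shell
# Tensor-parallel size: GPUs per node x number of nodes.
tp_size() {
  echo $(( $1 * $2 ))
}

# 0-based index of the node hosting a given TP rank, assuming ranks are
# assigned contiguously node by node.
node_of_rank() {
  echo $(( $1 / $2 ))
}

gpus_per_node=8
nodes=2
echo "TP size: $(tp_size "$gpus_per_node" "$nodes")"
for rank in 0 7 8 15; do
  echo "rank ${rank} -> node index $(node_of_rank "$rank" "$gpus_per_node")"
done
```

Ranks 0-7 land on node index 0 and talk over NVLink. Any collective that involves ranks 8-15 has to cross the InfiniBand link.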

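Method 1: LeaderWorkerSet

The LeaderWorkerSet (LWS) API is the cleanest way to deploy multinode NIM on Kubernetes. Each replica is a group of one leader pod plus worker pods, and the group is restarted as a unit: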

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1-nim
  namespace: ai-inference
  labels:
    app: deepseek-r1
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2  # 2 nodes total
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "16"  # 8 GPUs × 2 nodes
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "1"
              - name: NCCL_IB_HCA
                value: "mlx5"
              - name: NCCL_IB_GID_INDEX
                value: "3"
              - name: NCCL_NET_GDR_LEVEL
                value: "SYS"
              - name: NCCL_P2P_LEVEL
                value: "SYS"
              - name: NCCL_DEBUG
                value: "INFO"
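              - name: NIM_LEADER_ADDRESS
                # LWS injects LWS_LEADER_ADDRESS into every pod of the group;
                # a bare pod name (metadata.name) is not a resolvable DNS name
                value: "$(LWS_LEADER_ADDRESS)"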
              - name: NIM_NUM_NODES
                value: "2"
              - name: NIM_NODE_RANK
                value: "0"
            ports:
              - containerPort: 8000
                name: http
            resources:
              limits:
                nvidia.com/gpu: "8"
                # Request RDMA device if using SR-IOV
                # nvidia.com/rdma_shared_device: "1"
              requests:
                nvidia.com/gpu: "8"
                cpu: "32"
                memory: "256Gi"
            volumeMounts:
              - name: model-cache
                mountPath: /opt/nim/.cache
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
    workerTemplate:
      metadata:
        labels:
          role: worker
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "16"
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "1"
              - name: NCCL_IB_HCA
                value: "mlx5"
              - name: NCCL_IB_GID_INDEX
                value: "3"
              - name: NCCL_NET_GDR_LEVEL
                value: "SYS"
              - name: NCCL_P2P_LEVEL
                value: "SYS"
              - name: NIM_LEADER_ADDRESS
                value: "$(LWS_LEADER_ADDRESS)"
              - name: NIM_NUM_NODES
                value: "2"
              - name: NIM_NODE_RANK
                value: "1"
            resources:
              limits:
                nvidia.com/gpu: "8"
              requests:
                nvidia.com/gpu: "8"
                cpu: "32"
                memory: "256Gi"
            volumeMounts:
              - name: model-cache
                mountPath: /opt/nim/.cache
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule

Install LeaderWorkerSet if not already present:

# The LWS project publishes its install bundle as manifests.yaml
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
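Before applying the LeaderWorkerSet manifest, it can help to confirm that the CRD is actually registered. The helper below is a minimal sketch: the function name is hypothetical, and it only wraps `kubectl get crd` for the `leaderworkersets.leaderworkerset.x-k8s.io` resource that matches the `apiVersion` and `kind` used above.

```shell
# Succeeds (exit 0) if the LeaderWorkerSet CRD is registered in the cluster.
lws_crd_present() {
  kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io >/dev/null 2>&1
}

# Usage (needs cluster access):
#   if lws_crd_present; then
#     echo "LeaderWorkerSet CRD found"
#   else
#     echo "LeaderWorkerSet CRD missing, install LWS first"
#   fi
```

Keeping the check in a function makes it easy to gate a deploy script on it without printing kubectl output.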

Method 2: Indexed Kubernetes Job

For clusters without LeaderWorkerSet, a single Job in Indexed completion mode runs one pod per node and gives each pod a stable index and hostname:

apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-headless
  namespace: ai-inference
spec:
  clusterIP: None
  selector:
    job-name: deepseek-r1-multinode
  ports:
    - port: 29500
      name: nccl
---
apiVersion: batch/v1
kind: Job
metadata:
  name: deepseek-r1-multinode
  namespace: ai-inference
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: deepseek-r1
    spec:
      subdomain: deepseek-r1-headless
      containers:
        - name: nim
          image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
          env:
            - name: NIM_TENSOR_PARALLEL_SIZE
              value: "16"
            - name: NIM_NUM_NODES
              value: "2"
            - name: NIM_NODE_RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: NIM_LEADER_ADDRESS
              value: "deepseek-r1-multinode-0.deepseek-r1-headless"
            - name: NCCL_IB_HCA
              value: "mlx5"
            - name: NCCL_NET_GDR_LEVEL
              value: "SYS"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 29500
              name: nccl
          resources:
            limits:
              nvidia.com/gpu: "8"
            requests:
              cpu: "32"
              memory: "256Gi"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      restartPolicy: Never
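The NIM_LEADER_ADDRESS value in the Job follows from how Kubernetes names Indexed Job pods: each pod gets the hostname `<job-name>-<index>`, and with `subdomain` set to a headless Service it resolves as `<job-name>-<index>.<service>` inside the namespace. A small sketch of that naming rule, with a hypothetical helper name:

```shell
# Build the DNS name of one pod of an Indexed Job behind a headless Service.
# Args: job name, completion index, headless Service name.
indexed_pod_host() {
  echo "$1-$2.$3"
}

# Leader (index 0) for the manifests above
indexed_pod_host deepseek-r1-multinode 0 deepseek-r1-headless
```

The same helper with index 1 gives the worker's name, which is handy when checking DNS resolution from inside either pod.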

Service for Inference Endpoint

Expose only the leader node (rank 0) for inference requests:

apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-inference
  namespace: ai-inference
spec:
  selector:
    app: deepseek-r1
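    # Only route to the leader. Method 2 pods carry no role label; there,
    # select batch.kubernetes.io/job-completion-index: "0" instead.
    role: leader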
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: deepseek-r1-ingress
  namespace: ai-inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  rules:
    - host: deepseek-r1.ai.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: deepseek-r1-inference
                port:
                  number: 8000

NCCL Environment Variables Explained

env:
  # HCA name filter; "mlx5" is a prefix that matches every mlx5_* device
  - name: NCCL_IB_HCA
    value: "mlx5"

  # GID index, only relevant for RoCE (often 3 for RoCEv2)
  - name: NCCL_IB_GID_INDEX
    value: "3"

  # Maximum GPU-to-NIC topology distance at which GPUDirect RDMA is used;
  # SYS allows it even across the inter-socket link
  - name: NCCL_NET_GDR_LEVEL
    value: "SYS"

  # Maximum GPU-to-GPU topology distance at which intra-node P2P is used;
  # SYS allows it across sockets, not just over NVLink
  - name: NCCL_P2P_LEVEL
    value: "SYS"

  # Network interface for bootstrap and out-of-band control traffic
  - name: NCCL_SOCKET_IFNAME
    value: "eth0"

  # Debug logging (INFO for setup, WARN for production)
  - name: NCCL_DEBUG
    value: "INFO"

  # Leave at "0" for both InfiniBand and RoCE (RoCE uses the same IB verbs
  # transport); "1" forces NCCL onto TCP sockets
  # - name: NCCL_IB_DISABLE
  #   value: "0"

Verify Multinode Communication

# Check NCCL initialization in leader logs
kubectl logs -f deepseek-r1-nim-0 -n ai-inference | grep -i nccl
# Expected:
# NCCL INFO Bootstrap: Using eth0:10.244.1.5<0>
# NCCL INFO Channel 00 : 0[0] -> 1[1] via NET/IB/0
# NCCL INFO Connected all rings
# NCCL INFO 16 coll channels, 16 nvls channels, ...

# Verify all ranks joined
kubectl logs deepseek-r1-nim-0 -n ai-inference | grep -i "rank"
# Expected: All 16 ranks initialized

# Test inference
curl -s http://deepseek-r1-inference.ai-inference.svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/deepseek-r1",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }' | jq '.choices[0].message.content'

Model Cache with Shared PVC

For multinode, both nodes need access to the model weights. Use ReadWriteMany storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache
  namespace: ai-inference
spec:
  accessModes:
    - ReadWriteMany  # Both nodes read the model
  resources:
    requests:
      storage: 1Ti  # Assumption: 671B parameters at FP8 need roughly 700GB on disk; size to your model
  storageClassName: cephfs  # Or any RWX-capable StorageClass

Alternatively, use an init container to download the model to local NVMe on each node:

initContainers:
  - name: model-download
    image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
    command: ["download-to-cache"]  # NIM's bundled cache download utility
    env:
      - name: NIM_CACHE_PATH
        value: /cache
    volumeMounts:
      - name: local-cache
        mountPath: /cache
volumes:
  - name: local-cache
    emptyDir:
      sizeLimit: 1Ti  # Assumption: must hold the full checkpoint; size to your model

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| NCCL timeout during init | Nodes can't reach each other on the NCCL port | Ensure the headless Service covers port 29500, check NetworkPolicy |
| NCCL WARN Connect to ... failed | InfiniBand not configured or RDMA unavailable | Verify ibstat, check the GPU Operator ClusterPolicy has RDMA enabled |
| OOM on single node | TP size too small for the model | Increase to 16 (2 nodes) or 32 (4 nodes); the TP size generally has to divide the model's attention head count |
| Slow inference despite multinode | Falling back to TCP instead of RDMA | Check NCCL_DEBUG=INFO logs for NET/Socket vs NET/IB; it should be NET/IB |
| Worker can't find leader | DNS not resolving the headless Service | Wait for both pods to be Running, verify nslookup from the worker |
| GPUDirect RDMA not available | Open kernel modules not loaded | Set useOpenKernelModules: true in ClusterPolicy |
| Asymmetric GPU topology | Nodes have different GPU/NIC affinity | Use nvidia-smi topo -m to verify, pin NCCL to the correct HCA |
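The NET/Socket vs NET/IB check from the table can be scripted. The filter below is a minimal sketch with a hypothetical function name. It only looks for the two transport strings in whatever log text it receives on stdin, so it assumes NCCL_DEBUG=INFO is set and that the relevant lines are present in the logs.

```shell
# Classify the NCCL inter-node transport from log text on stdin.
# Prints "rdma" if any line mentions NET/IB, "tcp" if only NET/Socket
# appears, and "unknown" if neither string is found.
nccl_transport() {
  awk '
    /NET\/IB/     { ib = 1 }
    /NET\/Socket/ { sock = 1 }
    END {
      if (ib)        print "rdma"
      else if (sock) print "tcp"
      else           print "unknown"
    }'
}

# Usage (needs cluster access):
#   kubectl logs deepseek-r1-nim-0 -n ai-inference | nccl_transport
```

A result of "tcp" points at the RDMA prerequisites (device plugin, uverbs devices, NCCL_IB_HCA) rather than at NIM itself.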

Best Practices

  • Use InfiniBand over RoCE when available: lower latency, higher bandwidth, no PFC tuning needed
  • Pin NCCL to the correct HCA: NCCL_IB_HCA=mlx5_0 avoids NCCL probing all devices
  • Size /dev/shm generously: NCCL uses shared memory for local GPU communication; 64Gi minimum
  • Use LeaderWorkerSet over raw Jobs: a fixed leader pod per group, coordinated restarts, cleaner lifecycle
  • Monitor with DCGM: track NVLink/IB bandwidth utilization per GPU during inference
  • Pre-pull images: NIM containers are 10-15GB; use DaemonSets to pre-pull on GPU nodes
  • Test NCCL standalone first: run nccl-tests (allreduce) across nodes before deploying NIM

Key Takeaways

  • Multinode NIM is required for models exceeding single-node GPU memory (DeepSeek-R1 671B, Llama 3.1 405B, Nemotron-4 340B)
  • Tensor parallelism shards every layer across all GPUs on all nodes; NCCL handles the AllReduce communication
  • GPUDirect RDMA over InfiniBand is essential for acceptable cross-node latency
  • Deploy with LeaderWorkerSet for production or an Indexed Job for simpler setups
  • The NVIDIA stack (GPU Operator + Network Operator, plus SR-IOV where you use it) must be in place before multinode NIM works
#nvidia-nim #multinode #tensor-parallelism #nccl #infiniband #gpudirect-rdma
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
