ai advanced ⏱ 60 minutes K8s 1.28+

Run:ai NIM Distributed Inference Tutorial

Step-by-step guide to deploy DeepSeek-R1 distributed inference on Run:ai with LeaderWorkerSet, SGLang, PVC caching, and OpenShift security.

By Luca Berton • 10 min read

💡 Quick Answer: Deploy DeepSeek-R1 671B across 2 nodes (16× H100 GPUs) using the NVIDIA Run:ai distributed inference API. The deployment uses LeaderWorkerSet with the SGLang runtime, TP=8 per node, PP=2 across nodes, and PVC-cached model weights for fast restarts.

The Problem

DeepSeek-R1 is a 671B parameter Mixture-of-Experts model that requires more GPU memory than a single node provides. You need to split inference across multiple nodes using tensor parallelism (within each node) and pipeline parallelism (across nodes), while managing NGC credentials, model caching, and OpenShift security contexts. The NVIDIA Run:ai platform orchestrates this via its distributed inference API.
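The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: it assumes roughly one byte per parameter (FP8 weights), and it ignores KV cache, activations, and runtime overhead, all of which need additional headroom. The function names are illustrative and are not part of any Run:ai or NIM API.

```python
# Rough weight-memory estimate; ignores KV cache, activations and runtime overhead.


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def gpu_memory_gb(nodes: int, gpus_per_node: int = 8, gb_per_gpu: int = 80) -> int:
    """Total GPU memory across the given number of H100 80GB nodes."""
    return nodes * gpus_per_node * gb_per_gpu


if __name__ == "__main__":
    weights = weight_gb(671, 1.0)  # assumption: ~1 byte per parameter
    for nodes in (1, 2):
        total = gpu_memory_gb(nodes)
        verdict = "fits" if weights < total else "does not fit"
        print(f"{nodes} node(s): {total} GB GPU memory, weights ~{weights:.0f} GB -> {verdict}")
```

Under these assumptions the weights alone (about 671 GB) exceed the 640 GB available on one 8-GPU node, while two nodes provide 1,280 GB.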

flowchart TB
    CLIENT["Client Request"] --> LB["Load Balancer"]
    LB --> NGINX["NGINX Ingress"]
    NGINX --> LEADER["Leader Pod<br/>NIM_LEADER_ROLE=1<br/>NIM_NODE_RANK=0<br/>8× H100 GPUs<br/>TP=8"]
    LEADER -->|"Pipeline Parallel<br/>NCCL over IB"| WORKER["Worker Pod<br/>NIM_LEADER_ROLE=0<br/>NIM_NODE_RANK=1<br/>8× H100 GPUs<br/>TP=8"]
    WORKER --> LEADER
    LEADER --> RESP["Response"]

    subgraph Cache["Shared PVC"]
        PVC["/opt/nim/.cache<br/>Model weights<br/>Tokenizer<br/>Compiled artifacts"]
    end
    LEADER -.-> PVC
    WORKER -.-> PVC

The Solution

Prerequisites

Before starting, ensure:

  1. Run:ai project created by your administrator
  2. LeaderWorkerSet (LWS) installed on the cluster
  3. External access configured (endpoints ending in .svc.cluster.local are cluster-internal only)
  4. NGC account with an active API key from https://catalog.ngc.nvidia.com/ → Setup → API Keys
  5. GPU nodes with H100 80GB GPUs (2 nodes × 8 GPUs = 16 total)

Step 1: Create a Run:ai Access Key

Access keys provide client credentials for API authentication:

  1. In the Run:ai UI, click the user avatar → Settings
  2. Click +ACCESS KEY
  3. Enter a name → CREATE
  4. Copy the Client ID and Client Secret (store securely)

Request an API token:

# Obtain API token from Run:ai
curl -X POST 'https://runai.cluster.example.com/api/v1/token' \
  -H 'Accept: */*' \
  -H 'Content-Type: application/json' \
  -d '{
    "grantType": "client_credentials",
    "clientId": "<CLIENT_ID>",
    "clientSecret": "<CLIENT_SECRET>"
  }'

# Export the token for subsequent calls
export TOKEN="<token-from-response>"
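For scripted setups, the same token request can be built in Python. This is a minimal sketch that mirrors the curl call above using only the standard library. The helper names are hypothetical. The name of the token field in the JSON response is not shown in this recipe, so the sketch returns the whole parsed body rather than guessing it.

```python
import json
import urllib.request


def build_token_request(base_url: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/token",
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": "*/*", "Content-Type": "application/json"},
        method="POST",
    )


def fetch_token_response(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and return the parsed JSON body for inspection."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the body with json.dumps also sidesteps the shell line-continuation pitfalls listed under Common Issues.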

Step 2: Store NGC API Key as User Credential

In the Run:ai UI (not available via API):

  1. Click user avatar → Settings
  2. Click +CREDENTIAL → select NGC API key
  3. Enter a unique name (e.g., ngc-credentials)
  4. Paste your NGC API key → CREATE CREDENTIAL

Step 3: Create PVC for Model Cache

The PVC caches downloaded model weights, tokenizer files, and compiled artifacts so subsequent runs start faster:

Via UI:

  1. Go to Workload manager → Data sources
  2. Click +NEW DATA SOURCE → PVC
  3. Configure:
    • Access mode: Read-write by many nodes
    • Claim size: 2 TB
    • Volume mode: Filesystem
    • Container path: /opt/nim/.cache
  4. Click CREATE DATA SOURCE

Via API:

curl -L 'https://runai.cluster.example.com/api/v1/asset/datasource/pvc' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "meta": {
      "name": "nim-model-cache",
      "scope": "project"
    },
    "spec": {
      "path": "/opt/nim/.cache",
      "existingPvc": false,
      "claimInfo": {
        "size": "2TB",
        "storageClass": "cephfs-rwx",
        "accessModes": {
          "readWriteMany": true
        },
        "volumeMode": "Filesystem"
      }
    }
  }'

⚠️ The first launch with a new PVC takes longer: storage is provisioned on first claim.

Step 4: Deploy Distributed Inference Workload

This is the core step. The configuration splits DeepSeek-R1 across 2 nodes using:

  • Tensor Parallelism (TP=8): Each node's 8 GPUs share layer computation
  • Pipeline Parallelism (PP=2): Model split into 2 sequential stages across nodes
  • SGLang runtime: Accelerated inference engine required for DeepSeek-R1
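To make the TP=8 × PP=2 layout concrete, the sketch below maps each of the 16 GPUs to a pipeline stage (node) and a tensor-parallel rank (GPU within the node). The rank ordering is an assumption for illustration, with one pipeline stage per node. The ordering the runtime actually uses may differ.

```python
def gpu_layout(tp: int, pp: int) -> list:
    """Map each global GPU rank to its pipeline stage (node) and tensor rank (local GPU)."""
    layout = []
    for rank in range(tp * pp):
        layout.append({
            "global_rank": rank,
            "pipeline_stage": rank // tp,  # which node; assumes one stage per node
            "tensor_rank": rank % tp,      # which GPU inside that node
        })
    return layout


if __name__ == "__main__":
    for entry in gpu_layout(tp=8, pp=2):
        print(entry)
```

The total GPU count is always tp * pp, which is why TP=8 and PP=2 require exactly two 8-GPU nodes.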

Via API:

curl -L 'https://runai.cluster.example.com/api/v1/workloads/distributed-inferences' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "name": "deepseek-r1-distributed",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
      "workers": 1,
      "servingPort": {
        "port": 8000,
        "authorizationType": "authenticatedUsers"
      },
      "leader": {
        "image": "nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3",
        "environmentVariables": [
          {
            "name": "NGC_API_KEY",
            "userCredential": {
              "name": "ngc-credentials",
              "key": "NGC_API_KEY"
            }
          },
          { "name": "NIM_LEADER_ROLE", "value": "1" },
          { "name": "OMPI_MCA_orte_keep_fqdn_hostnames", "value": "1" },
          { "name": "OMPI_MCA_plm_rsh_args", "value": "-o ConnectionAttempts=20" },
          { "name": "NIM_USE_SGLANG", "value": "1" },
          { "name": "NIM_MULTI_NODE", "value": "1" },
          { "name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8" },
          { "name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2" },
          { "name": "NIM_TRUST_CUSTOM_CODE", "value": "1" },
          { "name": "NIM_MODEL_PROFILE", "value": "sglang-h100-bf16-tp8-pp2" },
          {
            "name": "NIM_NODE_RANK",
            "podFieldRef": {
              "path": "metadata.labels['"'"'leaderworkerset.sigs.k8s.io/worker-index'"'"']"
            }
          },
          { "name": "NIM_NUM_COMPUTE_NODES", "value": "2" }
        ],
        "imagePullSecrets": [
          { "name": "ngc-credentials", "userCredential": true }
        ],
        "storage": {
          "pvc": [{
            "path": "/opt/nim/.cache",
            "existingPvc": true,
            "claimName": "<pvc-claim-name>"
          }]
        },
        "compute": { "gpuDevicesRequest": 8 },
        "security": {
          "runAsUid": 1000,
          "runAsGid": 1000,
          "runAsNonRoot": true
        }
      },
      "worker": {
        "image": "nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3",
        "environmentVariables": [
          {
            "name": "NGC_API_KEY",
            "userCredential": {
              "name": "ngc-credentials",
              "key": "NGC_API_KEY"
            }
          },
          { "name": "NIM_LEADER_ROLE", "value": "0" },
          { "name": "NIM_USE_SGLANG", "value": "1" },
          { "name": "NIM_MULTI_NODE", "value": "1" },
          { "name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8" },
          { "name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2" },
          { "name": "NIM_TRUST_CUSTOM_CODE", "value": "1" },
          { "name": "NIM_MODEL_PROFILE", "value": "sglang-h100-bf16-tp8-pp2" },
          {
            "name": "NIM_NODE_RANK",
            "podFieldRef": {
              "path": "metadata.labels['"'"'leaderworkerset.sigs.k8s.io/worker-index'"'"']"
            }
          },
          { "name": "NIM_NUM_COMPUTE_NODES", "value": "2" }
        ],
        "imagePullSecrets": [
          { "name": "ngc-credentials", "userCredential": true }
        ],
        "storage": {
          "pvc": [{
            "path": "/opt/nim/.cache",
            "existingPvc": true,
            "claimName": "<pvc-claim-name>"
          }]
        },
        "compute": { "gpuDevicesRequest": 8 },
        "security": {
          "runAsUid": 1000,
          "runAsGid": 1000,
          "runAsNonRoot": true
        }
      }
    }
  }'
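The leader and worker sections of the request above differ only in NIM_LEADER_ROLE and the two OpenMPI variables, so one way to keep them from drifting apart is to generate both from a shared base. The sketch below is a hypothetical helper, not a Run:ai client. It reuses the field names and values from the curl example and only assembles the JSON.

```python
import json

IMAGE = "nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3"
RANK_LABEL = "metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"

# Variables that must be identical on leader and worker.
SHARED_ENV = {
    "NIM_USE_SGLANG": "1",
    "NIM_MULTI_NODE": "1",
    "NIM_TENSOR_PARALLEL_SIZE": "8",
    "NIM_PIPELINE_PARALLEL_SIZE": "2",
    "NIM_TRUST_CUSTOM_CODE": "1",
    "NIM_MODEL_PROFILE": "sglang-h100-bf16-tp8-pp2",
    "NIM_NUM_COMPUTE_NODES": "2",
}

# Variables the request above sets on the leader only.
LEADER_ONLY_ENV = {
    "OMPI_MCA_orte_keep_fqdn_hostnames": "1",
    "OMPI_MCA_plm_rsh_args": "-o ConnectionAttempts=20",
}


def pod_spec(is_leader: bool, claim_name: str) -> dict:
    """Build the leader or worker section from the shared settings."""
    env = [
        {"name": "NGC_API_KEY",
         "userCredential": {"name": "ngc-credentials", "key": "NGC_API_KEY"}},
        {"name": "NIM_LEADER_ROLE", "value": "1" if is_leader else "0"},
        {"name": "NIM_NODE_RANK", "podFieldRef": {"path": RANK_LABEL}},
    ]
    plain = {**SHARED_ENV, **(LEADER_ONLY_ENV if is_leader else {})}
    env += [{"name": key, "value": value} for key, value in plain.items()]
    return {
        "image": IMAGE,
        "environmentVariables": env,
        "imagePullSecrets": [{"name": "ngc-credentials", "userCredential": True}],
        "storage": {"pvc": [{"path": "/opt/nim/.cache", "existingPvc": True,
                             "claimName": claim_name}]},
        "compute": {"gpuDevicesRequest": 8},
        "security": {"runAsUid": 1000, "runAsGid": 1000, "runAsNonRoot": True},
    }


def build_payload(project_id: str, cluster_id: str, claim_name: str) -> dict:
    """Assemble the full distributed-inference request body."""
    return {
        "name": "deepseek-r1-distributed",
        "projectId": project_id,
        "clusterId": cluster_id,
        "spec": {
            "workers": 1,
            "servingPort": {"port": 8000, "authorizationType": "authenticatedUsers"},
            "leader": pod_spec(True, claim_name),
            "worker": pod_spec(False, claim_name),
        },
    }


if __name__ == "__main__":
    # Placeholders are kept as-is; substitute real values before submitting.
    print(json.dumps(build_payload("<PROJECT-ID>", "<CLUSTER-UUID>", "<pvc-claim-name>"), indent=2))
```

The printed JSON can be saved to a file and passed to curl with -d @payload.json, which also avoids the nested shell quoting around the pod field reference.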

Via CLI v2:

runai inference distributed submit deepseek-r1-dist \
  -p <project-id> \
  -i nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3 \
  --workers 1 \
  --serving-port "container=8000,authorization-type=authenticatedUsers" \
  -g 8 \
  --existing-pvc claimname=<pvc-claim-name>,path=/opt/nim/.cache \
  --env-secret NGC_API_KEY=ngc-credentials,key=NGC_API_KEY \
  --environment NIM_NUM_COMPUTE_NODES=2 \
  --environment NIM_LEADER_ROLE=1 \
  --environment OMPI_MCA_orte_keep_fqdn_hostnames=1 \
  --environment "OMPI_MCA_plm_rsh_args=-o ConnectionAttempts=20" \
  --environment NIM_USE_SGLANG=1 \
  --environment NIM_MULTI_NODE=1 \
  --environment NIM_TENSOR_PARALLEL_SIZE=8 \
  --environment NIM_PIPELINE_PARALLEL_SIZE=2 \
  --environment NIM_TRUST_CUSTOM_CODE=1 \
  --environment NIM_MODEL_PROFILE=sglang-h100-bf16-tp8-pp2 \
  --env-pod-field-ref "NIM_NODE_RANK=metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"

Understanding the NIM Environment Variables

| Variable | Leader | Worker | Purpose |
| --- | --- | --- | --- |
| NIM_LEADER_ROLE | 1 | 0 | Designates which pod runs the API server |
| NIM_NODE_RANK | auto (0) | auto (1..N) | Injected from LWS worker-index label |
| NIM_MULTI_NODE | 1 | 1 | Enables multinode mode on both pods |
| NIM_TENSOR_PARALLEL_SIZE | 8 | 8 | Splits layers across 8 GPUs per node |
| NIM_PIPELINE_PARALLEL_SIZE | 2 | 2 | Splits model into 2 pipeline stages across nodes |
| NIM_NUM_COMPUTE_NODES | 2 | 2 | Total nodes (must match LWS size) |
| NIM_MODEL_PROFILE | sglang-h100-bf16-tp8-pp2 | same | Optimized profile for H100 + TP8 + PP2 |
| NIM_USE_SGLANG | 1 | 1 | SGLang runtime (required for DeepSeek-R1) |
| NIM_TRUST_CUSTOM_CODE | 1 | 1 | Load custom kernels from NIM image |
| OMPI_MCA_orte_keep_fqdn_hostnames | 1 | - | OpenMPI: keep full hostnames for DNS |
| OMPI_MCA_plm_rsh_args | -o ConnectionAttempts=20 | - | OpenMPI: retry SSH connections |
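The constraints in this table are easy to break when editing the leader and worker settings separately. The following sketch checks them before submission. The function is hypothetical and takes plain name-to-value dictionaries. It encodes only the rules stated in the table, not any validation that NIM itself performs.

```python
def check_nim_env(leader: dict, worker: dict, lws_size: int) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    must_match = [
        "NIM_MULTI_NODE", "NIM_USE_SGLANG", "NIM_TRUST_CUSTOM_CODE",
        "NIM_TENSOR_PARALLEL_SIZE", "NIM_PIPELINE_PARALLEL_SIZE",
        "NIM_NUM_COMPUTE_NODES", "NIM_MODEL_PROFILE",
    ]
    for name in must_match:
        if leader.get(name) != worker.get(name):
            problems.append(
                f"{name} differs: leader={leader.get(name)!r} worker={worker.get(name)!r}"
            )
    # Multinode mode has to be enabled on both pods.
    for role, env in (("leader", leader), ("worker", worker)):
        if env.get("NIM_MULTI_NODE") != "1":
            problems.append(f"{role} must set NIM_MULTI_NODE=1")
    if leader.get("NIM_LEADER_ROLE") != "1":
        problems.append("leader must set NIM_LEADER_ROLE=1")
    if worker.get("NIM_LEADER_ROLE") != "0":
        problems.append("worker must set NIM_LEADER_ROLE=0")
    # The node count must match the LeaderWorkerSet size (leader plus workers).
    nodes = leader.get("NIM_NUM_COMPUTE_NODES")
    if nodes != str(lws_size):
        problems.append(f"NIM_NUM_COMPUTE_NODES={nodes!r} does not match LWS size {lws_size}")
    return problems
```

An empty result means the two environments agree with the table. Each string in a non-empty result names the variable to fix.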

Step 5: Test the Inference Endpoint

# Get the inference URL from Run:ai
runai inference list -p <project-id>
# or via API:
curl -s 'https://runai.cluster.example.com/api/v1/workloads/distributed-inferences' \
  -H "Authorization: Bearer $TOKEN" | jq '.[].status.endpoints'

# Send a test request (authenticated)
curl -s 'https://deepseek-r1-dist.inference.example.com/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "deepseek-ai/deepseek-r1",
    "messages": [
      {"role": "user", "content": "Explain tensor parallelism in 3 sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }' | jq '.choices[0].message.content'
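When the test moves from the shell into a script, the request body and the jq filter translate directly. This minimal sketch builds the same JSON body and extracts the first reply from an already-parsed response. The helper names are hypothetical, and the response shape is assumed to be the one the jq filter above relies on.

```python
import json


def chat_body(prompt: str, max_tokens: int = 256, temperature: float = 0.7) -> bytes:
    """Serialize the same JSON body the curl example sends."""
    return json.dumps({
        "model": "deepseek-ai/deepseek-r1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")


def first_reply(response: dict) -> str:
    """Equivalent of the jq filter .choices[0].message.content, with a clearer error."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {response}")
    return choices[0]["message"]["content"]
```

Raising on an empty choices list surfaces error responses (for example an expired token) instead of printing null as jq would.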

Equivalent Kubernetes Manifests (Without Run:ai)

For clusters without Run:ai, here's the equivalent LeaderWorkerSet:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1
  namespace: ai-inference
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NGC_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: ngc-secret
                    key: api-key
              - name: NIM_LEADER_ROLE
                value: "1"
              - name: NIM_USE_SGLANG
                value: "1"
              - name: NIM_MULTI_NODE
                value: "1"
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "8"
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: NIM_TRUST_CUSTOM_CODE
                value: "1"
              - name: NIM_MODEL_PROFILE
                value: "sglang-h100-bf16-tp8-pp2"
              - name: NIM_NODE_RANK
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
              - name: NIM_NUM_COMPUTE_NODES
                value: "2"
              - name: OMPI_MCA_orte_keep_fqdn_hostnames
                value: "1"
              - name: OMPI_MCA_plm_rsh_args
                value: "-o ConnectionAttempts=20"
            ports:
              - containerPort: 8000
            resources:
              limits:
                nvidia.com/gpu: "8"
              requests:
                cpu: "32"
                memory: "256Gi"
            volumeMounts:
              - name: cache
                mountPath: /opt/nim/.cache
              - name: shm
                mountPath: /dev/shm
            securityContext:
              runAsUser: 1000
              runAsGroup: 1000
              runAsNonRoot: true
        volumes:
          - name: cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
    workerTemplate:
      spec:
        containers:
          - name: nim
            image: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            env:
              - name: NGC_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: ngc-secret
                    key: api-key
              - name: NIM_LEADER_ROLE
                value: "0"
              - name: NIM_USE_SGLANG
                value: "1"
              - name: NIM_MULTI_NODE
                value: "1"
              - name: NIM_TENSOR_PARALLEL_SIZE
                value: "8"
              - name: NIM_PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: NIM_TRUST_CUSTOM_CODE
                value: "1"
              - name: NIM_MODEL_PROFILE
                value: "sglang-h100-bf16-tp8-pp2"
              - name: NIM_NODE_RANK
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
              - name: NIM_NUM_COMPUTE_NODES
                value: "2"
            resources:
              limits:
                nvidia.com/gpu: "8"
              requests:
                cpu: "32"
                memory: "256Gi"
            volumeMounts:
              - name: cache
                mountPath: /opt/nim/.cache
              - name: shm
                mountPath: /dev/shm
            securityContext:
              runAsUser: 1000
              runAsGroup: 1000
              runAsNonRoot: true
        volumes:
          - name: cache
            persistentVolumeClaim:
              claimName: nim-model-cache
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 64Gi
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
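With size: 2, each LeaderWorkerSet group consists of one leader and one worker, and the worker-index label is what feeds NIM_NODE_RANK. The sketch below lists the pods one group would contain together with their index. The pod naming pattern (name-group for the leader, name-group-index for workers) is an assumption based on LWS conventions, so confirm it with kubectl get pods on your cluster.

```python
def lws_group_pods(name: str, size: int, group: int = 0) -> list:
    """List (pod_name, worker_index) pairs for one LeaderWorkerSet group."""
    pods = [(f"{name}-{group}", 0)]  # the leader carries worker-index 0
    pods += [(f"{name}-{group}-{index}", index) for index in range(1, size)]
    return pods


if __name__ == "__main__":
    for pod_name, worker_index in lws_group_pods("deepseek-r1", size=2):
        print(f"{pod_name}: NIM_NODE_RANK={worker_index}")
```

Because the rank comes from the label, the leader and worker templates can share the same NIM_NODE_RANK definition.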

Parallelism Strategies Explained

┌─────────────────────────────────────────────────────┐
│              DeepSeek-R1 (671B MoE)                 │
│                                                     │
│  Tensor Parallelism (TP=8), within each node:       │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐  │
│  │GPU 0│GPU 1│GPU 2│GPU 3│GPU 4│GPU 5│GPU 6│GPU 7│  │
│  │ ←── Each GPU holds 1/8 of each layer ──→      │  │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘  │
│                                                     │
│  Pipeline Parallelism (PP=2), across nodes:         │
│  Node 0 (Leader): Layers 0-N/2  → forward pass →    │
│  Node 1 (Worker): Layers N/2-N  → forward pass →    │
│  ← output returned to leader ←                      │
│                                                     │
│  Total: 16 GPUs, TP=8 × PP=2                        │
└─────────────────────────────────────────────────────┘

Tensor Parallelism (TP=8): Splits each transformer layer horizontally across 8 GPUs on one node. All 8 GPUs process the same layer simultaneously, communicating via NVLink (900 GB/s). The low latency and high bandwidth make TP well suited to intra-node communication.
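The idea behind tensor parallelism can be shown with a toy matrix product. In this sketch a weight matrix is split into column blocks, each "GPU" multiplies the same input by its own block, and the partial outputs are concatenated. It uses plain Python lists and is only an illustration of the technique. Real runtimes shard attention and MoE weights in more elaborate ways and exchange results with NCCL collectives.

```python
def matmul(x, w):
    """Plain matrix product for lists of lists: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split a weight matrix into `parts` equal column blocks, one per GPU."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across GPUs"
    step = n // parts
    return [[row[i:i + step] for row in w] for i in range(0, n, step)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((partial[r] for partial in partials), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    # The sharded result matches the unsharded product.
    print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))
```

Every shard needs the full input and the outputs must be gathered after each layer, which is why TP depends on a fast interconnect such as NVLink.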

Pipeline Parallelism (PP=2): Splits the model vertically: the first half of the layers runs on node 0, the second half on node 1. Data flows sequentially between nodes over InfiniBand. Latency is higher than with TP, but PP enables scaling beyond single-node memory.
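A small sketch of the pipeline split: layers are assigned to stages in contiguous ranges, and an activation passes through the stages in order. The even split and the toy layer function are assumptions for illustration. The actual layer assignment is decided by the runtime.

```python
def pipeline_stages(num_layers: int, pp: int) -> list:
    """Assign contiguous layer ranges to pipeline stages as evenly as possible."""
    base, extra = divmod(num_layers, pp)
    stages, start = [], 0
    for stage in range(pp):
        count = base + (1 if stage < extra else 0)
        stages.append(range(start, start + count))
        start += count
    return stages


def run_pipeline(stages, x, layer_fn):
    """Pass the activation through each stage in order; each stage boundary is a node-to-node hop."""
    for layers in stages:
        for layer in layers:
            x = layer_fn(layer, x)
    return x


if __name__ == "__main__":
    stages = pipeline_stages(num_layers=8, pp=2)  # 8 layers is a toy value
    print([list(stage) for stage in stages])
    print(run_pipeline(stages, 0, lambda layer, value: value + 1))
```

Only the activation at the stage boundary crosses the network, so PP tolerates the slower inter-node link better than TP would.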

SGLang Runtime: In this NIM deployment, DeepSeek-R1 runs on SGLang (not vLLM or TensorRT-LLM), which handles its MoE routing and custom attention patterns. Setting NIM_MODEL_PROFILE=sglang-h100-bf16-tp8-pp2 selects the pre-optimized configuration.

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| permission denied on /opt/nim/.cache | OpenShift SCC blocks root | Add runAsUid: 1000, runAsGid: 1000, runAsNonRoot: true |
| Worker can't connect to leader | DNS not resolving LWS headless service | Check OMPI_MCA_orte_keep_fqdn_hostnames=1 and ConnectionAttempts=20 |
| NIM_MULTI_NODE not set | Forgotten on worker pod | Must be 1 on both leader and worker |
| Model download takes 30+ minutes | First run with empty PVC | Expected; subsequent runs use cached weights |
| unsupported value '' for GrantType | Trailing spaces after \ in curl | Ensure backslash is last char on line (no trailing spaces) |
| Token endpoint returns 400 | Wrong field case | Use grantType (camelCase) per Run:ai API |
| NCCL timeout | No InfiniBand/RDMA between nodes | Verify ibstat, check GPUDirect RDMA is enabled |
| SGLang compilation errors | Wrong model profile | Verify NIM_MODEL_PROFILE matches your GPU type (H100/A100) |
| PVC not provisioning | StorageClass missing or wrong access mode | Ensure RWX-capable StorageClass exists |
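The trailing-space problem is hard to spot by eye. A short check like the one below flags lines where a backslash is followed by spaces or tabs, which stops the shell from treating it as a line continuation. The function name is hypothetical. It scans script text and does not run anything.

```python
import re

# A backslash followed by spaces or tabs at end of line no longer continues the line.
BROKEN_CONTINUATION = re.compile(r"\\[ \t]+$")


def broken_continuations(script: str) -> list:
    """Return 1-based line numbers where a trailing backslash is followed by whitespace."""
    return [
        number
        for number, line in enumerate(script.splitlines(), start=1)
        if BROKEN_CONTINUATION.search(line)
    ]


if __name__ == "__main__":
    sample = "curl -X POST 'https://runai.cluster.example.com/api/v1/token' \\ \n  -d '{}'\n"
    print(broken_continuations(sample))
```

Running this over a saved copy of the curl command points to the exact lines to clean up.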

Best Practices

  • Pin NIM image versions: use 1.7.3, not latest, for reproducible deployments
  • Pre-cache model weights: run a warm-up job first to populate the PVC before serving production traffic
  • Use authenticatedUsers: never expose large model inference endpoints publicly without auth
  • Size the PVC generously: DeepSeek-R1 needs ~300GB for weights plus compiled artifacts; 2TB allows room for multiple models
  • Set the OpenShift security context: always set UID/GID 1000 and runAsNonRoot to avoid permission errors on cache mounts
  • Monitor with Run:ai metrics: track GPU utilization, TTFT, and KV-cache usage across both leader and worker
  • Use topology-aware scheduling: Run:ai's scheduler can place leader/worker on the same rack for lower latency
  • Test NCCL separately: run nccl-tests allreduce between nodes before deploying NIM

Key Takeaways

  • Run:ai distributed inference deploys NIM across multiple nodes via LeaderWorkerSet (LWS)
  • DeepSeek-R1 uses TP=8 (per node) × PP=2 (across nodes) = 16 GPUs total with SGLang runtime
  • The leader pod handles the API server and auth; workers handle compute and connect via LWS_LEADER_ADDRESS
  • NIM_NODE_RANK is auto-injected from the LWS worker-index label, so no manual rank assignment is needed
  • PVC caching at /opt/nim/.cache dramatically speeds up subsequent deployments
  • OpenShift requires explicit runAsUid/runAsGid: 1000 and runAsNonRoot: true for cache directory permissions
#nvidia-runai #nvidia-nim #distributed-inference #deepseek-r1 #leader-worker-set #sglang
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.