ai advanced ⏱ 15 minutes K8s 1.28+

Run:ai Distributed Inference with vLLM and NCCL

Deploy distributed LLM inference on Run:ai with vLLM tensor parallelism across multiple workers. Covers multi-node GPU splitting, NCCL configuration, PVC model

By Luca Berton • 8 min read

💡 Quick Answer: Run:ai's inference distributed submit deploys vLLM across multiple GPU workers with tensor parallelism. For a 119B parameter model needing 4 GPUs total: use 2 workers × 2 GPUs each with --tensor-parallel-size 2. NCCL handles inter-GPU communication; disable InfiniBand (NCCL_IB_DISABLE=1) when using Ethernet-only clusters.

The Problem

Large language models (100B+ parameters) don't fit on a single GPU:

  • Mistral-Small-4 119B needs ~240GB VRAM in float16 (3-4× A100 80GB)
  • Must split model across GPUs using tensor parallelism
  • Multi-node inference needs NCCL for inter-worker communication
  • Run:ai manages GPU scheduling, but distributed inference needs specific config
  • Security requirements: non-root, specific UID/GID, preemptible workloads
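
The ~240GB figure above follows from simple arithmetic: parameter count times bytes per parameter. The helpers below are a minimal sketch of that estimate; the function names `weights_gb` and `min_gpus` are hypothetical, and the calculation covers weights only (no KV cache, activations, or runtime overhead), so treat the GPU count as a lower bound.

```shell
# Weight-only VRAM estimate in decimal GB (hypothetical helper).
# Usage: weights_gb PARAMS_IN_BILLIONS BYTES_PER_PARAM
weights_gb() {
  echo $(( $1 * $2 ))
}

# Smallest number of GPUs whose combined memory holds the weights
# (ceiling division; ignores KV cache and overhead).
# Usage: min_gpus WEIGHTS_GB GPU_MEMORY_GB
min_gpus() {
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

For example, `weights_gb 119 2` prints 238 (float16 uses 2 bytes per parameter), and `min_gpus 238 80` prints 3. Planning for a fourth 80GB GPU leaves room for the KV cache.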

The Solution

Run:ai Distributed Inference Command

runai inference distributed submit my-llm-inference \
  -p my-project \
  -i registry.example.com/vllm-openai:latest \
  --existing-pvc claimname=my-project-models,path=/data \
  --workers 2 \
  -g 2 \
  --serving-port container=8000,authorization-type=authenticatedUsers \
  --environment-variable NCCL_IB_DISABLE=1 \
  --environment-variable NCCL_P2P_DISABLE=0 \
  --run-as-uid 2000 \
  --run-as-gid 2000 \
  --run-as-non-root \
  --preemptibility preemptible \
  -- \
  --model /data/input/Models/Mistral-Small-4-119B-2603 \
  --served-model-name mistral4 \
  --tensor-parallel-size 2 \
  --port 8000

Breaking Down Each Flag

Run:ai Flags:
──────────────────────────────────────────────────────────────────
Flag                              Purpose
──────────────────────────────────────────────────────────────────
inference distributed submit      Distributed inference workload type
my-llm-inference                  Workload name
-p my-project                     Run:ai project (quota + namespace)
-i registry.example.com/...       vLLM container image
--existing-pvc ...                Mount PVC with model weights
--workers 2                       2 worker Pods (1 head + 1 worker)
-g 2                              2 GPUs per worker (4 total)
--serving-port container=8000     Expose inference endpoint
--environment-variable ...        NCCL tuning
--run-as-uid 2000                 Non-root UID
--run-as-gid 2000                 Non-root GID
--run-as-non-root                 Security: forbid root
--preemptibility preemptible      Can be evicted for higher-priority jobs

vLLM Flags (after --):
──────────────────────────────────────────────────────────────────
--model /data/input/Models/...    Path to model weights on PVC
--served-model-name mistral4      API model name for OpenAI-compatible endpoint
--tensor-parallel-size 2          Split model across 2 GPUs per worker
--port 8000                       vLLM HTTP server port

GPU Topology for This Deployment

Total: 2 workers × 2 GPUs = 4 GPUs
──────────────────────────────────────────────────────────────────

Worker 0 (Head):                   Worker 1:
┌─────────────────────────┐        ┌─────────────────────────┐
│  GPU 0   GPU 1          │        │  GPU 0   GPU 1          │
│  ┌───────────────────┐  │        │  ┌───────────────────┐  │
│  │ TP rank 0, 1      │  │        │  │ TP rank 0, 1      │  │
│  │ (tensor parallel) │  │        │  │ (tensor parallel) │  │
│  └───────────────────┘  │        │  └───────────────────┘  │
│  vLLM engine            │        │  vLLM engine            │
│  Port 8000 (API)        │        │                         │
└────────────┬────────────┘        └────────────┬────────────┘
             │         NCCL (Ethernet)          │
             └──────────────────────────────────┘

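Model split with TP=2 only: 119B params / 2 = ~60B per GPU
VRAM per GPU: ~120GB (float16) → does not fit on an A100 80GB
To spread the weights over all 4 GPUs, also pass --pipeline-parallel-size 2
(PP = number of workers): 119B params / (TP 2 × PP 2) = ~30B per GPU
VRAM per GPU: ~60GB (float16) → fits on an A100 80GB, with limited room for KV cache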
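
The per-GPU figures can be sanity-checked before submitting a workload. This is a rough sketch that assumes the weights divide evenly across TP × PP shards and ignores KV cache and runtime overhead; `per_gpu_weights_gb` is a hypothetical helper name, not part of Run:ai or vLLM.

```shell
# Approximate weight footprint per GPU in decimal GB, rounded up.
# Usage: per_gpu_weights_gb PARAMS_IN_BILLIONS BYTES_PER_PARAM TP_SIZE PP_SIZE
per_gpu_weights_gb() {
  local total_gb=$(( $1 * $2 ))   # e.g. 119B params * 2 bytes (float16)
  local shards=$(( $3 * $4 ))     # number of GPUs the weights are split over
  echo $(( (total_gb + shards - 1) / shards ))
}
```

`per_gpu_weights_gb 119 2 2 1` prints 119, while `per_gpu_weights_gb 119 2 2 2` prints 60. Compare the result against the memory of the target GPU (80GB for an A100 80GB) and leave headroom for the KV cache.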

NCCL Configuration Explained

# NCCL_IB_DISABLE=1
# Disable the InfiniBand transport so NCCL uses TCP sockets over Ethernet
# Use when:
#   - Cluster has no InfiniBand fabric
#   - Only Ethernet available between workers
#   - SR-IOV/RDMA not configured

# NCCL_P2P_DISABLE=0
# Enable GPU-to-GPU peer-to-peer within each worker
# P2P via NVLink/PCIe between the 2 GPUs in each worker
# Only disabling IB for inter-node, keeping P2P for intra-node

NCCL Transport Selection for This Setup:
──────────────────────────────────────────────────────────────────
Path                      Transport        Performance
──────────────────────────────────────────────────────────────────
GPU0 ↔ GPU1 (same worker) NVLink/PCIe P2P  Best (~600 GB/s NVLink)
Worker0 ↔ Worker1         TCP/Ethernet     Good enough for inference
                                            (~10-25 Gb/s)

For training: IB/RDMA would be critical (all-reduce heavy)
For inference: Ethernet is often sufficient (less cross-node traffic)
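
The choice between the two NCCL settings can be scripted. The sketch below assumes that InfiniBand/RDMA devices show up under `/sys/class/infiniband` on the node where it runs; the helper name `nccl_env_flags` is hypothetical. A device being present does not prove that RDMA is usable from inside Pods, so treat the output as a starting point rather than a guarantee.

```shell
# Print Run:ai --environment-variable flags for NCCL based on whether any
# InfiniBand device is listed in the given sysfs directory.
# Usage: nccl_env_flags [IB_SYSFS_DIR]   (defaults to /sys/class/infiniband)
nccl_env_flags() {
  local ib_dir="${1:-/sys/class/infiniband}"
  if [ -d "$ib_dir" ] && [ -n "$(ls -A "$ib_dir" 2>/dev/null)" ]; then
    # At least one IB device found: leave the IB transport enabled
    echo "--environment-variable NCCL_IB_DISABLE=0 --environment-variable NCCL_P2P_DISABLE=0"
  else
    # No IB devices: force NCCL onto TCP sockets, keep intra-node P2P
    echo "--environment-variable NCCL_IB_DISABLE=1 --environment-variable NCCL_P2P_DISABLE=0"
  fi
}
```

The printed flags can be pasted into the submit command, or captured with `flags="$(nccl_env_flags)"` and expanded unquoted.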

Equivalent Kubernetes Manifests

# Simplified sketch of what Run:ai creates under the hood (actual objects may differ):

# Head worker (rank 0)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-llm-inference-head
  namespace: runai-my-project
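spec:
  replicas: 1
  selector:                      # required for apps/v1 Deployments
    matchLabels:
      run.ai/workload: my-llm-inference
      run.ai/role: head
  template:
    metadata:
      labels:
        run.ai/workload: my-llm-inference
        run.ai/role: head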
    spec:
      securityContext:
        runAsUser: 2000
        runAsGroup: 2000
        runAsNonRoot: true
      containers:
        - name: vllm
          image: registry.example.com/vllm-openai:latest
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /data/input/Models/Mistral-Small-4-119B-2603
            - --served-model-name
            - mistral4
            - --tensor-parallel-size
            - "2"
            - --port
            - "8000"
          env:
            - name: NCCL_IB_DISABLE
              value: "1"
            - name: NCCL_P2P_DISABLE
              value: "0"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-data
              mountPath: /data
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: my-project-models
---
# Worker (rank 1+)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-llm-inference-worker-0
  namespace: runai-my-project
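spec:
  replicas: 1
  selector:                      # required for apps/v1 Deployments
    matchLabels:
      run.ai/workload: my-llm-inference
      run.ai/role: worker        # assumed label value, for illustration
  template:
    metadata:
      labels:
        run.ai/workload: my-llm-inference
        run.ai/role: worker      # assumed label value, for illustration
    spec: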
      securityContext:
        runAsUser: 2000
        runAsGroup: 2000
        runAsNonRoot: true
      containers:
        - name: vllm-worker
          image: registry.example.com/vllm-openai:latest
          env:
            - name: NCCL_IB_DISABLE
              value: "1"
            - name: NCCL_P2P_DISABLE
              value: "0"
          resources:
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-data
              mountPath: /data
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: my-project-models

PVC for Model Weights

# Model PVC: must be ReadWriteMany for multi-node
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-project-models
  namespace: runai-my-project
spec:
  accessModes:
    - ReadWriteMany          # Required for multi-worker access
  resources:
    requests:
      storage: 500Gi         # 119B model ≈ 240GB in float16
  storageClassName: nfs       # NFS, Lustre, or GPFS for RWX
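
Before submitting, it can help to confirm that an existing claim really is ReadWriteMany. The following is a small sketch: `has_rwx` and `check_pvc_rwx` are hypothetical helper names, and the second function assumes `kubectl` is installed and pointed at the right cluster.

```shell
# Return 0 if the given access-modes string contains ReadWriteMany.
# Usage: has_rwx ACCESS_MODES_STRING
has_rwx() {
  case "$1" in
    *ReadWriteMany*) return 0 ;;
    *) return 1 ;;
  esac
}

# Look up a PVC's access modes with kubectl and check them.
# Returns 2 if the kubectl call itself fails.
# Usage: check_pvc_rwx NAMESPACE PVC_NAME
check_pvc_rwx() {
  local modes
  modes="$(kubectl get pvc "$2" -n "$1" -o jsonpath='{.spec.accessModes}')" || return 2
  has_rwx "$modes"
}
```

For the claim above, `check_pvc_rwx runai-my-project my-project-models && echo "RWX ok"` reports whether the workers can share the volume.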

Scaling Options

# Scale up: more workers for pipeline parallelism
runai inference distributed submit my-llm-large \
  -p my-project \
  -i registry.example.com/vllm-openai:latest \
  --existing-pvc claimname=my-project-models,path=/data \
  --workers 4 \
  -g 4 \
  -- \
  --model /data/input/Models/Large-405B \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 4 \
  --port 8000
# Total: 4 workers × 4 GPUs = 16 GPUs
# TP=4 (split layers across 4 GPUs per node)
# PP=4 (pipeline across 4 nodes)

# Scale down: single worker for smaller models
runai inference submit my-llm-small \
  -p my-project \
  -i registry.example.com/vllm-openai:latest \
  --existing-pvc claimname=my-project-models,path=/data \
  -g 2 \
  -- \
  --model /data/input/Models/Small-7B \
  --tensor-parallel-size 2 \
  --port 8000
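
The two commands above follow one pattern: tensor parallelism matches the GPUs per worker, and pipeline parallelism matches the worker count when there is more than one worker. A minimal sketch of that mapping, using a hypothetical helper named `vllm_parallel_flags`:

```shell
# Derive vLLM parallelism flags from the Run:ai worker layout.
# Usage: vllm_parallel_flags WORKERS GPUS_PER_WORKER
vllm_parallel_flags() {
  local workers="$1" gpus="$2"
  if [ "$workers" -gt 1 ]; then
    # Multi-worker: TP within each worker, PP across workers
    echo "--tensor-parallel-size $gpus --pipeline-parallel-size $workers"
  else
    # Single worker: tensor parallelism only
    echo "--tensor-parallel-size $gpus"
  fi
}
```

`vllm_parallel_flags 4 4` prints `--tensor-parallel-size 4 --pipeline-parallel-size 4`, and `vllm_parallel_flags 1 2` prints `--tensor-parallel-size 2`, matching the scale-up and scale-down examples.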

Enable InfiniBand (When Available)

# When SR-IOV RDMA is configured, enable IB for better performance:
runai inference distributed submit my-llm-ib \
  -p my-project \
  -i registry.example.com/vllm-openai:latest \
  --existing-pvc claimname=my-project-models,path=/data \
  --workers 2 \
  -g 2 \
  --environment-variable NCCL_IB_DISABLE=0 \
  --environment-variable NCCL_IB_HCA=mlx5_0 \
  --environment-variable NCCL_NET_GDR_LEVEL=5 \
  --environment-variable NCCL_P2P_DISABLE=0 \
  -- \
  --model /data/input/Models/Mistral-Small-4-119B-2603 \
  --served-model-name mistral4 \
  --tensor-parallel-size 2 \
  --port 8000

Monitor the Deployment

# Check workload status
runai describe job my-llm-inference -p my-project

# Check worker Pods
kubectl get pods -n runai-my-project -l run.ai/workload=my-llm-inference

# Check vLLM logs (head worker)
kubectl logs -n runai-my-project -l run.ai/role=head -f

# Look for:
# "INFO: Started server process [pid]"
# "INFO: Application startup complete."
# "INFO: Uvicorn running on http://0.0.0.0:8000"

# Test inference endpoint
curl -X POST http://my-llm-inference:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral4",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'

# Check GPU utilization across workers
runai exec my-llm-inference -- nvidia-smi
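
Loading a model of this size can take several minutes, so a readiness poll is often more convenient than tailing logs. The sketch below polls the `/health` route of the vLLM OpenAI-compatible server. The helper name `wait_for_vllm` and the 5-second interval are illustrative choices, and an endpoint exposed with `authorization-type=authenticatedUsers` may additionally need a bearer token, which this sketch does not send.

```shell
# Poll BASE_URL/health until it answers successfully or the timeout expires.
# Returns 0 when the server is healthy, 1 on timeout.
# Usage: wait_for_vllm BASE_URL TIMEOUT_SECONDS
wait_for_vllm() {
  local base_url="$1" timeout="$2" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if curl -fs -o /dev/null --max-time 5 "$base_url/health"; then
      return 0
    fi
    sleep 5
    waited=$(( waited + 5 ))
  done
  return 1
}
```

A typical call would be `wait_for_vllm http://my-llm-inference:8000 900 && echo ready`, run from a Pod that can resolve the service name.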

Security: Non-Root Execution

Security Configuration:
──────────────────────────────────────────────────────────────────
--run-as-uid 2000         Container runs as UID 2000 (not root)
--run-as-gid 2000         Container runs as GID 2000
--run-as-non-root         Kubernetes enforces non-root

Requirements:
  • Model files on PVC must be readable by UID 2000
  • vLLM image must support non-root (no bind to port < 1024)
  • /tmp and cache dirs must be writable (or use emptyDir)

Common fix for permission issues:
  chown -R 2000:2000 /data/input/Models/
  # Or set group-readable:
  chmod -R g+r /data/input/Models/
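
Before changing ownership, it can be useful to list only the files that the target UID/GID cannot read. The sketch below applies the standard owner/group/other permission rules with `find`; it ignores ACLs, supplementary groups, and directory execute bits, and `list_unreadable` is a hypothetical helper name.

```shell
# Print regular files under DIR that the given UID/GID cannot read,
# judged by owner/group/other read bits only.
# Usage: list_unreadable DIR UID GID
list_unreadable() {
  local dir="$1" uid="$2" gid="$3"
  find "$dir" -type f \( \
      \( -user "$uid" ! -perm -u=r \) -o \
      \( ! -user "$uid" -group "$gid" ! -perm -g=r \) -o \
      \( ! -user "$uid" ! -group "$gid" ! -perm -o=r \) \
    \) -print
}
```

Running `list_unreadable /data/input/Models 2000 2000` from a Pod that mounts the PVC shows which files still need `chown` or `chmod`; empty output means the read bits look fine.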

Preemptibility

--preemptibility preemptible
──────────────────────────────────────────────────────────────────
  • Run:ai can evict this workload for higher-priority jobs
  • Inference resumes when GPUs become available
  • Use for dev/staging inference endpoints
  • Production inference: use --preemptibility non-preemptible

Priority order in Run:ai:
  1. Non-preemptible training
  2. Non-preemptible inference
  3. Preemptible training
  4. Preemptible inference  ← this workload

Common Issues

NCCL timeout between workers

  • Cause: Workers can't reach each other on NCCL port; network policy blocking
  • Fix: Ensure Pods can communicate on all ports; check NCCL_SOCKET_IFNAME
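
When debugging timeouts, it helps to confirm which network transport NCCL actually selected. With `NCCL_DEBUG=INFO` set on the workload, NCCL typically logs a line containing "Using network" followed by the transport name. The sketch below relies on that wording, which may vary between NCCL versions, so treat the match strings as assumptions; `nccl_transport_from_log` is a hypothetical helper name.

```shell
# Read NCCL_DEBUG=INFO output on stdin and print the selected network
# transport: "ib", "socket", or "unknown" if no matching line is found.
nccl_transport_from_log() {
  local log
  log="$(cat)"
  case "$log" in
    *"Using network IB"*) echo "ib" ;;
    *"Using network Socket"*) echo "socket" ;;
    *) echo "unknown" ;;
  esac
}
```

One way to use it is `kubectl logs -n runai-my-project -l run.ai/role=head | nccl_transport_from_log`. With `NCCL_IB_DISABLE=1` the expected answer is `socket`.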

Model loading OOM

  • Cause: 119B model too large for available VRAM with current TP size
  • Fix: Increase --tensor-parallel-size or add --workers; check --max-model-len

Permission denied on model files

  • Cause: PVC files owned by root; container runs as UID 2000
  • Fix: chown -R 2000:2000 /data/input/Models/ on the PVC

Preempted during inference

  • Cause: Higher-priority job needs GPUs; this workload is preemptible
  • Fix: Use --preemptibility non-preemptible for production endpoints

Workers start but can't find head

  • Cause: Head Pod DNS not resolvable; Ray/vLLM cluster init failed
  • Fix: Check Run:ai creates headless Service; verify head Pod is Running first

Best Practices

  1. TP size = GPUs per worker: tensor parallelism within a node (fast NVLink)
  2. PP size = number of workers: pipeline parallelism across nodes (network)
  3. Disable IB only when unavailable: InfiniBand (e.g. 200 Gb/s HDR) offers roughly 10x the bandwidth of 10-25 Gb/s Ethernet for NCCL
  4. RWX PVC for multi-worker: all workers need to read model weights
  5. Non-root always: security best practice; fix file permissions on PVC
  6. Preemptible for dev: save GPU quota; non-preemptible for production
  7. Start with Ethernet: enable IB/RDMA after validating the setup works

Key Takeaways

  • runai inference distributed submit manages multi-worker vLLM with tensor parallelism
  • 2 workers × 2 GPUs = 4 GPUs total; TP=2 splits model across GPUs within each worker
  • NCCL_IB_DISABLE=1 uses Ethernet for inter-node (sufficient for inference)
  • NCCL_P2P_DISABLE=0 keeps NVLink P2P for intra-node GPU communication
  • PVC must be ReadWriteMany (NFS/Lustre) for multi-worker model access
  • Non-root execution (UID/GID 2000) requires model files readable by that UID
  • Preemptible workloads yield GPUs to higher-priority jobs; use them for staging
  • When SR-IOV RDMA is ready, switch to NCCL_IB_DISABLE=0 + NCCL_IB_HCA=mlx5_0
#runai #vllm #nccl #distributed-inference #tensor-parallelism
