Run:ai Distributed Inference with vLLM and NCCL
Deploy distributed LLM inference on Run:ai with vLLM tensor parallelism across multiple workers. Covers multi-node GPU splitting, NCCL configuration, PVC model
π‘ Quick Answer: Run:aiβs
inference distributed submitdeploys vLLM across multiple GPU workers with tensor parallelism. For a 119B parameter model needing 4 GPUs total: use 2 workers Γ 2 GPUs each with--tensor-parallel-size 2. NCCL handles inter-GPU communication β disable InfiniBand (NCCL_IB_DISABLE=1) when using Ethernet-only clusters.
The Problem
Large language models (100B+ parameters) donβt fit on a single GPU:
- Mistral-Small-4 119B needs ~240GB VRAM in float16 (3-4Γ A100 80GB)
- Must split model across GPUs using tensor parallelism
- Multi-node inference needs NCCL for inter-worker communication
- Run:ai manages GPU scheduling, but distributed inference needs specific config
- Security requirements: non-root, specific UID/GID, preemptible workloads
The Solution
Run:ai Distributed Inference Command
runai inference distributed submit my-llm-inference \
-p my-project \
-i registry.example.com/vllm-openai:latest \
--existing-pvc claimname=my-project-models,path=/data \
--workers 2 \
-g 2 \
--serving-port container=8000,authorization-type=authenticatedUsers \
--environment-variable NCCL_IB_DISABLE=1 \
--environment-variable NCCL_P2P_DISABLE=0 \
--run-as-uid 2000 \
--run-as-gid 2000 \
--run-as-non-root \
--preemptibility preemptible \
-- \
--model /data/input/Models/Mistral-Small-4-119B-2603 \
--served-model-name mistral4 \
--tensor-parallel-size 2 \
--port 8000Breaking Down Each Flag
Run:ai Flags:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Flag Purpose
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
inference distributed submit Distributed inference workload type
my-llm-inference Workload name
-p my-project Run:ai project (quota + namespace)
-i registry.example.com/... vLLM container image
--existing-pvc ... Mount PVC with model weights
--workers 2 2 worker Pods (1 head + 1 worker)
-g 2 2 GPUs per worker (4 total)
--serving-port container=8000 Expose inference endpoint
--environment-variable ... NCCL tuning
--run-as-uid 2000 Non-root UID
--run-as-gid 2000 Non-root GID
--run-as-non-root Security: forbid root
--preemptibility preemptible Can be evicted for higher-priority jobs
vLLM Flags (after --):
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
--model /data/input/Models/... Path to model weights on PVC
--served-model-name mistral4 API model name for OpenAI-compatible endpoint
--tensor-parallel-size 2 Split model across 2 GPUs per worker
--port 8000 vLLM HTTP server portGPU Topology for This Deployment
Total: 2 workers Γ 2 GPUs = 4 GPUs
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Worker 0 (Head): Worker 1:
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
β GPU 0 GPU 1 β β GPU 0 GPU 1 β
β ββββββββββ€ β β ββββββββββ€ β
β β TP rank 0, 1 β β β β TP rank 0, 1 β β
β β (tensor parallel)β β β β (tensor parallel)β β
β ββββββββββββββββββ β β ββββββββββββββββββ β
β vLLM engine β β vLLM engine β
β Port 8000 (API) β β β
ββββββββββββ¬ββββββββββββββ ββββββββββββ¬ββββββββββββββ
β NCCL (Ethernet) β
ββββββββββββββββββββββββββββββββββββ
Model split: 119B params / 2 TP = ~60B per GPU
VRAM per GPU: ~120GB (float16) β fits on 2Γ A100 80GB with KV cacheNCCL Configuration Explained
# NCCL_IB_DISABLE=1
# Disable InfiniBand transport β use Ethernet (TCP) for NCCL
# Use when:
# - Cluster has no InfiniBand fabric
# - Only Ethernet available between workers
# - SR-IOV/RDMA not configured
# NCCL_P2P_DISABLE=0
# Enable GPU-to-GPU peer-to-peer within each worker
# P2P via NVLink/PCIe between the 2 GPUs in each worker
# Only disabling IB for inter-node, keeping P2P for intra-nodeNCCL Transport Selection for This Setup:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Path Transport Performance
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
GPU0 β GPU1 (same worker) NVLink/PCIe P2P Best (~600 GB/s NVLink)
Worker0 β Worker1 TCP/Ethernet Good enough for inference
(~10-25 Gb/s)
For training: IB/RDMA would be critical (all-reduce heavy)
For inference: Ethernet is often sufficient (less cross-node traffic)Equivalent Kubernetes Manifests
# What Run:ai creates under the hood:
# Head worker (rank 0)
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-llm-inference-head
namespace: runai-my-project
spec:
replicas: 1
template:
metadata:
labels:
run.ai/workload: my-llm-inference
run.ai/role: head
spec:
securityContext:
runAsUser: 2000
runAsGroup: 2000
runAsNonRoot: true
containers:
- name: vllm
image: registry.example.com/vllm-openai:latest
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
args:
- --model
- /data/input/Models/Mistral-Small-4-119B-2603
- --served-model-name
- mistral4
- --tensor-parallel-size
- "2"
- --port
- "8000"
env:
- name: NCCL_IB_DISABLE
value: "1"
- name: NCCL_P2P_DISABLE
value: "0"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "2"
volumeMounts:
- name: model-data
mountPath: /data
volumes:
- name: model-data
persistentVolumeClaim:
claimName: my-project-models
---
# Worker (rank 1+)
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-llm-inference-worker-0
namespace: runai-my-project
spec:
replicas: 1
template:
spec:
securityContext:
runAsUser: 2000
runAsGroup: 2000
runAsNonRoot: true
containers:
- name: vllm-worker
image: registry.example.com/vllm-openai:latest
env:
- name: NCCL_IB_DISABLE
value: "1"
- name: NCCL_P2P_DISABLE
value: "0"
resources:
limits:
nvidia.com/gpu: "2"
volumeMounts:
- name: model-data
mountPath: /data
volumes:
- name: model-data
persistentVolumeClaim:
claimName: my-project-modelsPVC for Model Weights
# Model PVC β must be ReadWriteMany for multi-node
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-project-models
namespace: runai-my-project
spec:
accessModes:
- ReadWriteMany # Required for multi-worker access
resources:
requests:
storage: 500Gi # 119B model β 240GB in float16
storageClassName: nfs # NFS, Lustre, or GPFS for RWXScaling Options
# Scale up: more workers for pipeline parallelism
runai inference distributed submit my-llm-large \
-p my-project \
-i registry.example.com/vllm-openai:latest \
--existing-pvc claimname=my-project-models,path=/data \
--workers 4 \
-g 4 \
-- \
--model /data/input/Models/Large-405B \
--tensor-parallel-size 4 \
--pipeline-parallel-size 4 \
--port 8000
# Total: 4 workers Γ 4 GPUs = 16 GPUs
# TP=4 (split layers across 4 GPUs per node)
# PP=4 (pipeline across 4 nodes)
# Scale down: single worker for smaller models
runai inference submit my-llm-small \
-p my-project \
-i registry.example.com/vllm-openai:latest \
--existing-pvc claimname=my-project-models,path=/data \
-g 2 \
-- \
--model /data/input/Models/Small-7B \
--tensor-parallel-size 2 \
--port 8000Enable InfiniBand (When Available)
# When SR-IOV RDMA is configured, enable IB for better performance:
runai inference distributed submit my-llm-ib \
-p my-project \
-i registry.example.com/vllm-openai:latest \
--existing-pvc claimname=my-project-models,path=/data \
--workers 2 \
-g 2 \
--environment-variable NCCL_IB_DISABLE=0 \
--environment-variable NCCL_IB_HCA=mlx5_0 \
--environment-variable NCCL_NET_GDR_LEVEL=5 \
--environment-variable NCCL_P2P_DISABLE=0 \
-- \
--model /data/input/Models/Mistral-Small-4-119B-2603 \
--served-model-name mistral4 \
--tensor-parallel-size 2 \
--port 8000Monitor the Deployment
# Check workload status
runai describe job my-llm-inference -p my-project
# Check worker Pods
kubectl get pods -n runai-my-project -l run.ai/workload=my-llm-inference
# Check vLLM logs (head worker)
kubectl logs -n runai-my-project -l run.ai/role=head -f
# Look for:
# "INFO: Started server process [pid]"
# "INFO: Application startup complete."
# "INFO: Uvicorn running on http://0.0.0.0:8000"
# Test inference endpoint
curl -X POST http://my-llm-inference:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral4",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'
# Check GPU utilization across workers
runai exec my-llm-inference -- nvidia-smiSecurity: Non-Root Execution
Security Configuration:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
--run-as-uid 2000 Container runs as UID 2000 (not root)
--run-as-gid 2000 Container runs as GID 2000
--run-as-non-root Kubernetes enforces non-root
Requirements:
β’ Model files on PVC must be readable by UID 2000
β’ vLLM image must support non-root (no bind to port < 1024)
β’ /tmp and cache dirs must be writable (or use emptyDir)
Common fix for permission issues:
chown -R 2000:2000 /data/input/Models/
# Or set group-readable:
chmod -R g+r /data/input/Models/Preemptibility
--preemptibility preemptible
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β’ Run:ai can evict this workload for higher-priority jobs
β’ Inference resumes when GPUs become available
β’ Use for dev/staging inference endpoints
β’ Production inference: use --preemptibility non-preemptible
Priority order in Run:ai:
1. Non-preemptible training
2. Non-preemptible inference
3. Preemptible training
4. Preemptible inference β this workloadCommon Issues
NCCL timeout between workers
- Cause: Workers canβt reach each other on NCCL port; network policy blocking
- Fix: Ensure Pods can communicate on all ports; check
NCCL_SOCKET_IFNAME
Model loading OOM
- Cause: 119B model too large for available VRAM with current TP size
- Fix: Increase
--tensor-parallel-sizeor add--workers; check--max-model-len
Permission denied on model files
- Cause: PVC files owned by root; container runs as UID 2000
- Fix:
chown -R 2000:2000 /data/input/Models/on the PVC
Preempted during inference
- Cause: Higher-priority job needs GPUs; this workload is preemptible
- Fix: Use
--preemptibility non-preemptiblefor production endpoints
Workers start but canβt find head
- Cause: Head Pod DNS not resolvable; Ray/vLLM cluster init failed
- Fix: Check Run:ai creates headless Service; verify head Pod is Running first
Best Practices
- TP size = GPUs per worker β tensor parallelism within a node (NVLink fast)
- PP size = number of workers β pipeline parallelism across nodes (network)
- Disable IB only when unavailable β InfiniBand is 10x faster than Ethernet for NCCL
- RWX PVC for multi-worker β all workers need to read model weights
- Non-root always β security best practice; fix file permissions on PVC
- Preemptible for dev β save GPU quota; non-preemptible for production
- Start with Ethernet β enable IB/RDMA after validating the setup works
Key Takeaways
runai inference distributed submitmanages multi-worker vLLM with tensor parallelism- 2 workers Γ 2 GPUs = 4 GPUs total; TP=2 splits model across GPUs within each worker
NCCL_IB_DISABLE=1uses Ethernet for inter-node (sufficient for inference)NCCL_P2P_DISABLE=0keeps NVLink P2P for intra-node GPU communication- PVC must be ReadWriteMany (NFS/Lustre) for multi-worker model access
- Non-root execution (UID/GID 2000) requires model files readable by that UID
- Preemptible workloads yield GPUs to higher priority β use for staging
- When SR-IOV RDMA is ready, switch to
NCCL_IB_DISABLE=0+NCCL_IB_HCA=mlx5_0

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
