Run:ai Distrib. vLLM Inference Multimodal LLMs
Deploy multimodal LLMs with Run:ai distributed inference and vLLM on Kubernetes. Tensor parallelism, NCCL over NVLink, GPUDirect RDMA.
π‘ Quick Answer: Use
runai inference distributed submitwith--workers 2 -g 2 --tensor-parallel-size 2to shard a large multimodal model across GPUs via vLLM. SetNCCL_SOCKET_IFNAME=net1for the RDMA secondary NIC, requestopenshift.io/mellanoxnics=1for GPUDirect RDMA, and run fully offline withTRANSFORMERS_OFFLINE=1 HF_HUB_OFFLINE=1for air-gapped environments. NCCL auto-discovers NVLink (123.6 GB/s) between local GPUs and uses P2P/CUMEM transport.
The Problem
Large multimodal models (100B+ parameters with vision encoders) exceed single-GPU memory. You need tensor parallelism across multiple GPUs, NCCL for collective operations, and an air-gapped deployment that pulls models from a local PVC β not the internet. Run:aiβs distributed inference CLI handles the orchestration, but getting NCCL, RDMA, and vLLM configured correctly requires precise environment variables and resource requests.
flowchart TB
subgraph RUNAI["Run:ai Distributed Inference"]
CLI["runai inference distributed submit"]
end
subgraph POD["Worker Pod (2 GPUs)"]
VLLM["vLLM 0.18.0<br/>OpenAI-compatible API"]
subgraph GPUS["GPU Topology"]
GPU0["GPU 0<br/>rank 0"]
GPU1["GPU 1<br/>rank 1"]
GPU0 <-->|"NVLink<br/>123.6 GB/s"| GPU1
end
NIC["ConnectX-7<br/>400 Gb/s RoCE"]
PVC["PVC: /data<br/>Model weights"]
end
VLLM --> GPU0
VLLM --> GPU1
GPU0 <-->|"P2P/CUMEM"| GPU1
NIC -.->|"GPUDirect RDMA<br/>(multi-node)"| GPU0
PVC -->|"Model load"| VLLM
CLI --> POD
CLIENT["Client"] -->|"POST /v1/chat/completions"| VLLMThe Solution
Run:ai Distributed Inference Command
runai inference distributed submit my-model-dist \
-p my-project \
-i registry.example.com/ai-platform/vllm-openai:v0.18.0 \
--existing-pvc claimname=my-project-pvc,path=/data \
--workers 2 \
-g 2 \
--serving-port container=8000,authorization-type=authenticatedUsers \
--environment-variable TRANSFORMERS_OFFLINE=1 \
--environment-variable HF_HUB_OFFLINE=1 \
--environment-variable NCCL_DEBUG=INFO \
--environment-variable NCCL_DEBUG_SUBSYS=ALL \
--environment-variable NCCL_SOCKET_IFNAME=net1 \
--extended-resource "openshift.io/mellanoxnics=1" \
--annotation "k8s.v1.cni.cncf.io/networks=rdma-net" \
--run-as-uid 2000 \
--run-as-gid 2000 \
--run-as-non-root \
--preemptibility preemptible \
-- --model /data/input/Models/My-Model-119B \
--served-model-name my-model \
--tensor-parallel-size 2 \
--port 8000Breaking Down the Flags
Run:ai orchestration:
| Flag | Value | Purpose |
|---|---|---|
--workers 2 | 2 worker pods | Distributed inference topology |
-g 2 | 2 GPUs per worker | GPU allocation per pod |
--serving-port | container=8000 | OpenAI-compatible endpoint |
--preemptibility | preemptible | Can be preempted for higher-priority jobs |
--run-as-uid/gid 2000 | Non-root | Security compliance (OpenShift SCC) |
Air-gapped environment:
| Variable | Purpose |
|---|---|
TRANSFORMERS_OFFLINE=1 | Donβt attempt HuggingFace Hub downloads |
HF_HUB_OFFLINE=1 | Fully offline β fail instead of reaching out |
--existing-pvc | Model weights pre-loaded on PVC |
NCCL and networking:
| Variable | Purpose |
|---|---|
NCCL_DEBUG=INFO | Enable NCCL debug logging |
NCCL_DEBUG_SUBSYS=ALL | Log all NCCL subsystems (NET, INIT, TOPO) |
NCCL_SOCKET_IFNAME=net1 | Use secondary NIC (Multus) for NCCL bootstrap |
openshift.io/mellanoxnics=1 | Request RDMA shared device plugin NIC |
k8s.v1.cni.cncf.io/networks=rdma-net | Attach Multus secondary network |
vLLM engine (after --):
| Flag | Value | Purpose |
|---|---|---|
--model | /data/input/Models/My-Model-119B | Local model path on PVC |
--served-model-name | my-model | API model name |
--tensor-parallel-size 2 | TP=2 | Shard across 2 GPUs |
--port 8000 | API port | OpenAI-compatible server |
NCCL Tuning for Different Network Topologies
# Intra-node only (NVLink available) β default, no special config needed
# NCCL auto-detects NVLink and uses P2P/CUMEM transport
# Multi-node with IB/RoCE RDMA
--environment-variable NCCL_NET=IB \
--environment-variable NCCL_IB_HCA=mlx5_0:1 \
--environment-variable NCCL_IB_DISABLE=0 \
--environment-variable NCCL_P2P_DISABLE=0 \
# Multi-node without RDMA (TCP fallback β much slower)
--environment-variable NCCL_IB_DISABLE=1 \
--environment-variable NCCL_SOCKET_IFNAME=eth1 \Reading NCCL Debug Output
When NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL are set, the logs tell you everything about the communication topology. Hereβs what to look for:
β Good Signs
# NCCL found the IB/RoCE device
NET/IB: [0] mlx5_130:uverbs132:1/RoCE provider=Mlx5 speed=400000
NET/IB : Using [0]mlx5_130:1/RoCE [RO]; OOB net1:10.x.x.x<0>
# GPUDirect RDMA is active
GPU Direct RDMA Enabled for HCA 0 'mlx5_130'
GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 5)
DMA-BUF is available on GPU device 0
# NVLink detected between GPUs (123.6 GB/s)
GPU/0-118000 :GPU/0-1b3000 (1/123.6/NVL)
# P2P/CUMEM transport (GPU-to-GPU via NVLink, not going through CPU)
Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
# Init completed successfully
ncclCommInitRank comm 0x... rank 0 nranks 2 ... - Init COMPLETE
Init timings - total 0.33sβ οΈ Warning Signs
# Missing NCCL net plugin (not necessarily a problem for intra-node)
NET/Plugin: Could not find: libnccl-net.so.
# Missing mlx5 symbols (older libmlx5 β no data-direct or DMA-BUF MR)
dlvsym failed on mlx5dv_get_data_direct_sysfs_path
dlvsym failed on mlx5dv_reg_dmabuf_mr
# If you see NET/Socket instead of NET/IB, RDMA is NOT being used
NET/Socket : Using [0]eth0:10.x.x.x<0> β TCP fallback!Understanding the Topology Tree
# NCCL topology output shows the PCI hierarchy:
=== System : maxBw 123.6 totalBw 123.6 ===
CPU/0-1 (1/1/3)
+ PCI[48.0] - PCI/0-115000
+ PCI[48.0] - GPU/0-118000 (0) β GPU 0
+ NVL[123.6] - GPU/0-1b3000 β NVLink to GPU 1
+ PCI[48.0] - NIC/0-119000 β ConnectX NIC
+ PCI[48.0] - PCI/0-1b0000
+ PCI[48.0] - GPU/0-1b3000 (1) β GPU 1
+ NVL[123.6] - GPU/0-118000 β NVLink to GPU 0
# This tells you:
# - 2 GPUs connected via NVLink at 123.6 GB/s
# - NIC is on the same PCI switch as GPU 0 (distance 4)
# - Both GPUs can do GPUDirect RDMA via the NICCommunication Patterns
NCCL selects algorithms based on topology:
# Ring: Good for AllReduce (TP main operation)
Ring 00 : 0 -> 1 -> 0
Ring 01 : 0 -> 1 -> 0
# Tree: Good for Broadcast/Reduce
Tree 0 : -1 -> 0 -> 1/-1/-1
Tree 2 : 1 -> 0 -> -1/-1/-1
# 8 collective channels, 8 P2P channels
8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels
# Bandwidth matrix shows what NCCL can achieve:
# Algorithm | Ring (Simple) | Tree (Simple)
# AllReduce | 120.0 GB/s | 77.4 GB/s
# AllGather | 240.0 GB/s | N/A
# ReduceScatter | 240.0 GB/s | N/AvLLM Configuration Details
What vLLM Auto-Detects
From the logs, vLLM 0.18.0 automatically:
# Inferred from model files:
dtype: torch.bfloat16 # From consolidated*.safetensors
architecture: PixtralForConditionalGeneration # Multimodal (vision + language)
quantization: fp8 # Auto-detected FP8 quantization
max_seq_len: 1048576 # 1M context (from model config)
# Enabled optimizations:
chunked_prefill: true # max_num_batched_tokens=8192
async_scheduling: true # Asynchronous request scheduling
enable_prefix_caching: true # KV cache reuse for shared prefixes
compilation_mode: VLLM_COMPILE # Torch compilation + CUDAGraph
custom_fusions: allreduce_rms # Fused AllReduce + RMSNorm kernel
# CUDAGraph captures for batch sizes: 1,2,4,8,...,512
cudagraph_capture_sizes: [1, 2, 4, 8, 16, 24, 32, ..., 512]Important: Model Path Warning
vLLM 0.18.0 deprecates --model as a flag:
WARNING: With `vllm serve`, you should provide the model as a
positional argument or in a config file instead of via the `--model`
option. The `--model` option will be removed in v0.13.For future versions, use:
-- vllm serve /data/input/Models/My-Model-119B \
--served-model-name my-model \
--tensor-parallel-size 2 \
--port 8000Setting Max Model Length
For models with very long context (1M tokens), you may need to limit it:
# Default uses model config (1M tokens = massive KV cache memory)
# Reduce if you don't need full context:
--max-model-len 32768
# β οΈ REQUIRED for some hardware (e.g., Huawei Ascend NPUs):
# O(nΒ²) attention mask causes OOM at large max_model_len
--max-model-len 4096Air-Gapped Deployment Pattern
flowchart LR
subgraph EXTERNAL["Connected Environment"]
HF["HuggingFace Hub"]
REG_EXT["Public Registry"]
end
subgraph TRANSFER["Transfer"]
DISK["Portable Storage<br/>or Approved Transfer"]
end
subgraph AIRGAP["Air-Gapped Cluster"]
REG_INT["Internal Registry<br/>registry.example.com"]
PVC["PVC with<br/>Model Weights"]
POD["vLLM Pod<br/>TRANSFORMERS_OFFLINE=1<br/>HF_HUB_OFFLINE=1"]
end
HF -->|"Download model"| DISK
REG_EXT -->|"Pull image"| DISK
DISK -->|"Copy"| REG_INT
DISK -->|"Load"| PVC
REG_INT -->|"Pull image"| POD
PVC -->|"Mount /data"| PODPre-Loading Models to PVC
# On a connected machine, download the model:
huggingface-cli download org/My-Model-119B \
--local-dir /mnt/transfer/Models/My-Model-119B
# Transfer to air-gapped environment
# (USB drive, approved file transfer, etc.)
# Create PVC and copy model
kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-project-pvc
spec:
accessModes: [ReadWriteMany]
storageClassName: nfs-client
resources:
requests:
storage: 500Gi
EOF
# Copy model to PVC via a temporary pod
kubectl run model-loader --image=busybox --restart=Never \
--overrides='{"spec":{"containers":[{"name":"loader","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"my-project-pvc"}}]}}'
kubectl cp /mnt/transfer/Models/My-Model-119B \
model-loader:/data/input/Models/My-Model-119B
kubectl delete pod model-loaderMirroring vLLM Image to Internal Registry
# On connected machine
skopeo copy \
docker://vllm/vllm-openai:v0.18.0 \
docker://registry.example.com/ai-platform/vllm-openai:v0.18.0
# Or using podman
podman pull vllm/vllm-openai:v0.18.0
podman tag vllm/vllm-openai:v0.18.0 \
registry.example.com/ai-platform/vllm-openai:v0.18.0
podman push registry.example.com/ai-platform/vllm-openai:v0.18.0Security: Non-Root Execution
Run:ai + OpenShift require non-root by default:
--run-as-uid 2000 \
--run-as-gid 2000 \
--run-as-non-rootThis maps to the pod security context:
securityContext:
runAsUser: 2000
runAsGroup: 2000
runAsNonRoot: true
fsGroup: 2000Common issue: If model files on PVC are owned by root, the vLLM process (UID 2000) canβt read them. Fix:
# In the model-loader pod:
chown -R 2000:2000 /data/input/Models/
# Or use fsGroupChangePolicy in PVC:Multus Secondary Network for NCCL
The --annotation "k8s.v1.cni.cncf.io/networks=rdma-net" attaches a secondary NIC via Multus:
# NetworkAttachmentDefinition
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: rdma-net
namespace: my-project
spec:
config: |
{
"cniVersion": "0.3.1",
"type": "macvlan",
"master": "ens8f0np0",
"mode": "bridge",
"ipam": {
"type": "whereabouts",
"range": "10.232.0.0/16"
}
}This creates net1 inside the pod β which is why NCCL_SOCKET_IFNAME=net1 points NCCL to the RDMA-capable interface.
Scaling: Single-Node vs Multi-Node
Single-Node TP (What This Example Does)
Both GPUs on the same node, communicating via NVLink:
nNodes=1, localRanks=2
Transport: P2P/CUMEM (GPU-to-GPU direct memory)
Bandwidth: 123.6 GB/s per NVLink link
Latency: ~1-2 ΞΌsBest for: Models that fit in 2-8 GPUs on one node.
Multi-Node TP (Scaling Beyond One Node)
When you need more GPUs than one node has:
runai inference distributed submit my-large-model \
-p my-project \
-i registry.example.com/ai-platform/vllm-openai:v0.18.0 \
--existing-pvc claimname=my-project-pvc,path=/data \
--workers 4 \
-g 8 \
--environment-variable NCCL_NET=IB \
--environment-variable NCCL_IB_HCA=mlx5_0:1 \
--environment-variable NCCL_SOCKET_IFNAME=net1 \
--environment-variable NCCL_DEBUG=INFO \
--extended-resource "openshift.io/mellanoxnics=1" \
--annotation "k8s.v1.cni.cncf.io/networks=rdma-net" \
-- --model /data/input/Models/My-Huge-Model-700B \
--served-model-name huge-model \
--tensor-parallel-size 8 \
--pipeline-parallel-size 4 \
--port 8000Key differences for multi-node:
NCCL_NET=IBβ force IB/RoCE transport for inter-nodeNCCL_IB_HCA=mlx5_0:1β specify which HCA port to use- TP handles intra-node (NVLink), PP handles inter-node (RDMA)
Common Issues
| Issue | Cause | Fix |
|---|---|---|
NET/Socket in NCCL logs | RDMA not configured | Check openshift.io/mellanoxnics resource + Multus annotation |
libnccl-net.so not found | Missing NCCL net plugin | Usually OK for intra-node; install nccl-rdma-sharp-plugins for multi-node |
mlx5dv_reg_dmabuf_mr symbol missing | Old libmlx5 in container | Rebuild image with newer MLNX OFED or use NVIDIA GPU Cloud base images |
| Model download fails | Forgot offline flags | Set both TRANSFORMERS_OFFLINE=1 and HF_HUB_OFFLINE=1 |
| Permission denied on model files | PVC files owned by root | chown -R 2000:2000 or set fsGroup: 2000 |
| OOM on model load | 1M context KV cache | Add --max-model-len 32768 to reduce KV cache size |
| NCCL timeout on multi-node | Firewall blocking NCCL ports | Open port range 29400-29500 between worker nodes |
| Slow inference despite NVLink | CUDAGraph not capturing | Check logs for cudagraph_capture_sizes, ensure warm-up completed |
--model deprecation warning | vLLM 0.18+ | Use positional arg: vllm serve /path/to/model |
Best Practices
- Always set
NCCL_DEBUG=INFOinitially β verify the transport is what you expect, then remove for production - Pin vLLM image versions β use
v0.18.0notlatest; breaking changes happen between versions - Use
TRANSFORMERS_OFFLINE=1even in connected environments β prevents unexpected downloads during inference - Request RDMA resources explicitly β
openshift.io/mellanoxnics=1ensures the pod lands on an RDMA-capable node - Match TP size to GPU count per node β TP=2 for 2 GPUs, TP=8 for 8 GPUs; add PP for multi-node
- Set
--max-model-lenfor production β donβt use the modelβs full context unless you need it - Non-root is mandatory on OpenShift β pre-set file ownership on PVCs
- Preemptible for dev/test β saves GPU quota; use non-preemptible for production endpoints
Key Takeaways
runai inference distributed submitorchestrates multi-GPU vLLM deployment with one command- NCCL auto-discovers NVLink (123.6 GB/s) for intra-node GPU communication
- GPUDirect RDMA requires: Multus secondary NIC + RDMA shared device plugin + NCCL_SOCKET_IFNAME
- Air-gapped deployment: pre-load model to PVC, mirror image to internal registry, set offline flags
- Read NCCL logs:
NET/IB= RDMA β ,NET/Socket= TCP fallback β,P2P/CUMEM= NVLink β - vLLM 0.18.0 auto-detects: dtype, quantization (FP8), architecture, enables chunked prefill + CUDAGraph
- Pin image versions, limit context length, and run non-root for production deployments

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
