πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai advanced ⏱ 30 minutes K8s 1.28+

Run:ai Distrib. vLLM Inference Multimodal LLMs

Deploy multimodal LLMs with Run:ai distributed inference and vLLM on Kubernetes. Tensor parallelism, NCCL over NVLink, GPUDirect RDMA.

By Luca Berton β€’ β€’ πŸ“– 11 min read

πŸ’‘ Quick Answer: Use runai inference distributed submit with --workers 2 -g 2 --tensor-parallel-size 2 to shard a large multimodal model across GPUs via vLLM. Set NCCL_SOCKET_IFNAME=net1 for the RDMA secondary NIC, request openshift.io/mellanoxnics=1 for GPUDirect RDMA, and run fully offline with TRANSFORMERS_OFFLINE=1 HF_HUB_OFFLINE=1 for air-gapped environments. NCCL auto-discovers NVLink (123.6 GB/s) between local GPUs and uses P2P/CUMEM transport.

The Problem

Large multimodal models (100B+ parameters with vision encoders) exceed single-GPU memory. You need tensor parallelism across multiple GPUs, NCCL for collective operations, and an air-gapped deployment that pulls models from a local PVC β€” not the internet. Run:ai’s distributed inference CLI handles the orchestration, but getting NCCL, RDMA, and vLLM configured correctly requires precise environment variables and resource requests.

flowchart TB
    subgraph RUNAI["Run:ai Distributed Inference"]
        CLI["runai inference distributed submit"]
    end
    
    subgraph POD["Worker Pod (2 GPUs)"]
        VLLM["vLLM 0.18.0<br/>OpenAI-compatible API"]
        
        subgraph GPUS["GPU Topology"]
            GPU0["GPU 0<br/>rank 0"]
            GPU1["GPU 1<br/>rank 1"]
            GPU0 <-->|"NVLink<br/>123.6 GB/s"| GPU1
        end
        
        NIC["ConnectX-7<br/>400 Gb/s RoCE"]
        PVC["PVC: /data<br/>Model weights"]
    end
    
    VLLM --> GPU0
    VLLM --> GPU1
    GPU0 <-->|"P2P/CUMEM"| GPU1
    NIC -.->|"GPUDirect RDMA<br/>(multi-node)"| GPU0
    PVC -->|"Model load"| VLLM
    CLI --> POD
    
    CLIENT["Client"] -->|"POST /v1/chat/completions"| VLLM

The Solution

Run:ai Distributed Inference Command

runai inference distributed submit my-model-dist \
  -p my-project \
  -i registry.example.com/ai-platform/vllm-openai:v0.18.0 \
  --existing-pvc claimname=my-project-pvc,path=/data \
  --workers 2 \
  -g 2 \
  --serving-port container=8000,authorization-type=authenticatedUsers \
  --environment-variable TRANSFORMERS_OFFLINE=1 \
  --environment-variable HF_HUB_OFFLINE=1 \
  --environment-variable NCCL_DEBUG=INFO \
  --environment-variable NCCL_DEBUG_SUBSYS=ALL \
  --environment-variable NCCL_SOCKET_IFNAME=net1 \
  --extended-resource "openshift.io/mellanoxnics=1" \
  --annotation "k8s.v1.cni.cncf.io/networks=rdma-net" \
  --run-as-uid 2000 \
  --run-as-gid 2000 \
  --run-as-non-root \
  --preemptibility preemptible \
  -- --model /data/input/Models/My-Model-119B \
  --served-model-name my-model \
  --tensor-parallel-size 2 \
  --port 8000

Breaking Down the Flags

Run:ai orchestration:

FlagValuePurpose
--workers 22 worker podsDistributed inference topology
-g 22 GPUs per workerGPU allocation per pod
--serving-portcontainer=8000OpenAI-compatible endpoint
--preemptibilitypreemptibleCan be preempted for higher-priority jobs
--run-as-uid/gid 2000Non-rootSecurity compliance (OpenShift SCC)

Air-gapped environment:

VariablePurpose
TRANSFORMERS_OFFLINE=1Don’t attempt HuggingFace Hub downloads
HF_HUB_OFFLINE=1Fully offline β€” fail instead of reaching out
--existing-pvcModel weights pre-loaded on PVC

NCCL and networking:

VariablePurpose
NCCL_DEBUG=INFOEnable NCCL debug logging
NCCL_DEBUG_SUBSYS=ALLLog all NCCL subsystems (NET, INIT, TOPO)
NCCL_SOCKET_IFNAME=net1Use secondary NIC (Multus) for NCCL bootstrap
openshift.io/mellanoxnics=1Request RDMA shared device plugin NIC
k8s.v1.cni.cncf.io/networks=rdma-netAttach Multus secondary network

vLLM engine (after --):

FlagValuePurpose
--model/data/input/Models/My-Model-119BLocal model path on PVC
--served-model-namemy-modelAPI model name
--tensor-parallel-size 2TP=2Shard across 2 GPUs
--port 8000API portOpenAI-compatible server

NCCL Tuning for Different Network Topologies

# Intra-node only (NVLink available) β€” default, no special config needed
# NCCL auto-detects NVLink and uses P2P/CUMEM transport

# Multi-node with IB/RoCE RDMA
--environment-variable NCCL_NET=IB \
--environment-variable NCCL_IB_HCA=mlx5_0:1 \
--environment-variable NCCL_IB_DISABLE=0 \
--environment-variable NCCL_P2P_DISABLE=0 \

# Multi-node without RDMA (TCP fallback β€” much slower)
--environment-variable NCCL_IB_DISABLE=1 \
--environment-variable NCCL_SOCKET_IFNAME=eth1 \

Reading NCCL Debug Output

When NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL are set, the logs tell you everything about the communication topology. Here’s what to look for:

βœ… Good Signs

# NCCL found the IB/RoCE device
NET/IB: [0] mlx5_130:uverbs132:1/RoCE provider=Mlx5 speed=400000
NET/IB : Using [0]mlx5_130:1/RoCE [RO]; OOB net1:10.x.x.x<0>

# GPUDirect RDMA is active
GPU Direct RDMA Enabled for HCA 0 'mlx5_130'
GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 4 <= 5)
DMA-BUF is available on GPU device 0

# NVLink detected between GPUs (123.6 GB/s)
GPU/0-118000 :GPU/0-1b3000 (1/123.6/NVL)

# P2P/CUMEM transport (GPU-to-GPU via NVLink, not going through CPU)
Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM

# Init completed successfully
ncclCommInitRank comm 0x... rank 0 nranks 2 ... - Init COMPLETE
Init timings - total 0.33s

⚠️ Warning Signs

# Missing NCCL net plugin (not necessarily a problem for intra-node)
NET/Plugin: Could not find: libnccl-net.so.

# Missing mlx5 symbols (older libmlx5 β€” no data-direct or DMA-BUF MR)
dlvsym failed on mlx5dv_get_data_direct_sysfs_path
dlvsym failed on mlx5dv_reg_dmabuf_mr

# If you see NET/Socket instead of NET/IB, RDMA is NOT being used
NET/Socket : Using [0]eth0:10.x.x.x<0>  ← TCP fallback!

Understanding the Topology Tree

# NCCL topology output shows the PCI hierarchy:
=== System : maxBw 123.6 totalBw 123.6 ===
CPU/0-1 (1/1/3)
+ PCI[48.0] - PCI/0-115000
              + PCI[48.0] - GPU/0-118000 (0)        ← GPU 0
                            + NVL[123.6] - GPU/0-1b3000  ← NVLink to GPU 1
              + PCI[48.0] - NIC/0-119000             ← ConnectX NIC
+ PCI[48.0] - PCI/0-1b0000
              + PCI[48.0] - GPU/0-1b3000 (1)        ← GPU 1
                            + NVL[123.6] - GPU/0-118000  ← NVLink to GPU 0

# This tells you:
# - 2 GPUs connected via NVLink at 123.6 GB/s
# - NIC is on the same PCI switch as GPU 0 (distance 4)
# - Both GPUs can do GPUDirect RDMA via the NIC

Communication Patterns

NCCL selects algorithms based on topology:

# Ring: Good for AllReduce (TP main operation)
Ring 00 : 0 -> 1 -> 0
Ring 01 : 0 -> 1 -> 0

# Tree: Good for Broadcast/Reduce
Tree 0 : -1 -> 0 -> 1/-1/-1
Tree 2 :  1 -> 0 -> -1/-1/-1

# 8 collective channels, 8 P2P channels
8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels

# Bandwidth matrix shows what NCCL can achieve:
#   Algorithm   |     Ring (Simple)  |    Tree (Simple)
#   AllReduce   |     120.0 GB/s     |     77.4 GB/s
#   AllGather   |     240.0 GB/s     |     N/A
# ReduceScatter |     240.0 GB/s     |     N/A

vLLM Configuration Details

What vLLM Auto-Detects

From the logs, vLLM 0.18.0 automatically:

# Inferred from model files:
dtype: torch.bfloat16           # From consolidated*.safetensors
architecture: PixtralForConditionalGeneration  # Multimodal (vision + language)
quantization: fp8               # Auto-detected FP8 quantization
max_seq_len: 1048576            # 1M context (from model config)

# Enabled optimizations:
chunked_prefill: true           # max_num_batched_tokens=8192
async_scheduling: true          # Asynchronous request scheduling
enable_prefix_caching: true     # KV cache reuse for shared prefixes
compilation_mode: VLLM_COMPILE  # Torch compilation + CUDAGraph
custom_fusions: allreduce_rms   # Fused AllReduce + RMSNorm kernel

# CUDAGraph captures for batch sizes: 1,2,4,8,...,512
cudagraph_capture_sizes: [1, 2, 4, 8, 16, 24, 32, ..., 512]

Important: Model Path Warning

vLLM 0.18.0 deprecates --model as a flag:

WARNING: With `vllm serve`, you should provide the model as a 
positional argument or in a config file instead of via the `--model` 
option. The `--model` option will be removed in v0.13.

For future versions, use:

-- vllm serve /data/input/Models/My-Model-119B \
  --served-model-name my-model \
  --tensor-parallel-size 2 \
  --port 8000

Setting Max Model Length

For models with very long context (1M tokens), you may need to limit it:

# Default uses model config (1M tokens = massive KV cache memory)
# Reduce if you don't need full context:
--max-model-len 32768

# ⚠️ REQUIRED for some hardware (e.g., Huawei Ascend NPUs):
# O(nΒ²) attention mask causes OOM at large max_model_len
--max-model-len 4096

Air-Gapped Deployment Pattern

flowchart LR
    subgraph EXTERNAL["Connected Environment"]
        HF["HuggingFace Hub"]
        REG_EXT["Public Registry"]
    end
    
    subgraph TRANSFER["Transfer"]
        DISK["Portable Storage<br/>or Approved Transfer"]
    end
    
    subgraph AIRGAP["Air-Gapped Cluster"]
        REG_INT["Internal Registry<br/>registry.example.com"]
        PVC["PVC with<br/>Model Weights"]
        POD["vLLM Pod<br/>TRANSFORMERS_OFFLINE=1<br/>HF_HUB_OFFLINE=1"]
    end
    
    HF -->|"Download model"| DISK
    REG_EXT -->|"Pull image"| DISK
    DISK -->|"Copy"| REG_INT
    DISK -->|"Load"| PVC
    REG_INT -->|"Pull image"| POD
    PVC -->|"Mount /data"| POD

Pre-Loading Models to PVC

# On a connected machine, download the model:
huggingface-cli download org/My-Model-119B \
  --local-dir /mnt/transfer/Models/My-Model-119B

# Transfer to air-gapped environment
# (USB drive, approved file transfer, etc.)

# Create PVC and copy model
kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-project-pvc
spec:
  accessModes: [ReadWriteMany]
  storageClassName: nfs-client
  resources:
    requests:
      storage: 500Gi
EOF

# Copy model to PVC via a temporary pod
kubectl run model-loader --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"loader","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"my-project-pvc"}}]}}'

kubectl cp /mnt/transfer/Models/My-Model-119B \
  model-loader:/data/input/Models/My-Model-119B

kubectl delete pod model-loader

Mirroring vLLM Image to Internal Registry

# On connected machine
skopeo copy \
  docker://vllm/vllm-openai:v0.18.0 \
  docker://registry.example.com/ai-platform/vllm-openai:v0.18.0

# Or using podman
podman pull vllm/vllm-openai:v0.18.0
podman tag vllm/vllm-openai:v0.18.0 \
  registry.example.com/ai-platform/vllm-openai:v0.18.0
podman push registry.example.com/ai-platform/vllm-openai:v0.18.0

Security: Non-Root Execution

Run:ai + OpenShift require non-root by default:

--run-as-uid 2000 \
--run-as-gid 2000 \
--run-as-non-root

This maps to the pod security context:

securityContext:
  runAsUser: 2000
  runAsGroup: 2000
  runAsNonRoot: true
  fsGroup: 2000

Common issue: If model files on PVC are owned by root, the vLLM process (UID 2000) can’t read them. Fix:

# In the model-loader pod:
chown -R 2000:2000 /data/input/Models/
# Or use fsGroupChangePolicy in PVC:

Multus Secondary Network for NCCL

The --annotation "k8s.v1.cni.cncf.io/networks=rdma-net" attaches a secondary NIC via Multus:

# NetworkAttachmentDefinition
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
  namespace: my-project
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens8f0np0",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "10.232.0.0/16"
      }
    }

This creates net1 inside the pod β€” which is why NCCL_SOCKET_IFNAME=net1 points NCCL to the RDMA-capable interface.

Scaling: Single-Node vs Multi-Node

Single-Node TP (What This Example Does)

Both GPUs on the same node, communicating via NVLink:

nNodes=1, localRanks=2
Transport: P2P/CUMEM (GPU-to-GPU direct memory)
Bandwidth: 123.6 GB/s per NVLink link
Latency: ~1-2 ΞΌs

Best for: Models that fit in 2-8 GPUs on one node.

Multi-Node TP (Scaling Beyond One Node)

When you need more GPUs than one node has:

runai inference distributed submit my-large-model \
  -p my-project \
  -i registry.example.com/ai-platform/vllm-openai:v0.18.0 \
  --existing-pvc claimname=my-project-pvc,path=/data \
  --workers 4 \
  -g 8 \
  --environment-variable NCCL_NET=IB \
  --environment-variable NCCL_IB_HCA=mlx5_0:1 \
  --environment-variable NCCL_SOCKET_IFNAME=net1 \
  --environment-variable NCCL_DEBUG=INFO \
  --extended-resource "openshift.io/mellanoxnics=1" \
  --annotation "k8s.v1.cni.cncf.io/networks=rdma-net" \
  -- --model /data/input/Models/My-Huge-Model-700B \
  --served-model-name huge-model \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --port 8000

Key differences for multi-node:

  • NCCL_NET=IB β€” force IB/RoCE transport for inter-node
  • NCCL_IB_HCA=mlx5_0:1 β€” specify which HCA port to use
  • TP handles intra-node (NVLink), PP handles inter-node (RDMA)

Common Issues

IssueCauseFix
NET/Socket in NCCL logsRDMA not configuredCheck openshift.io/mellanoxnics resource + Multus annotation
libnccl-net.so not foundMissing NCCL net pluginUsually OK for intra-node; install nccl-rdma-sharp-plugins for multi-node
mlx5dv_reg_dmabuf_mr symbol missingOld libmlx5 in containerRebuild image with newer MLNX OFED or use NVIDIA GPU Cloud base images
Model download failsForgot offline flagsSet both TRANSFORMERS_OFFLINE=1 and HF_HUB_OFFLINE=1
Permission denied on model filesPVC files owned by rootchown -R 2000:2000 or set fsGroup: 2000
OOM on model load1M context KV cacheAdd --max-model-len 32768 to reduce KV cache size
NCCL timeout on multi-nodeFirewall blocking NCCL portsOpen port range 29400-29500 between worker nodes
Slow inference despite NVLinkCUDAGraph not capturingCheck logs for cudagraph_capture_sizes, ensure warm-up completed
--model deprecation warningvLLM 0.18+Use positional arg: vllm serve /path/to/model

Best Practices

  • Always set NCCL_DEBUG=INFO initially β€” verify the transport is what you expect, then remove for production
  • Pin vLLM image versions β€” use v0.18.0 not latest; breaking changes happen between versions
  • Use TRANSFORMERS_OFFLINE=1 even in connected environments β€” prevents unexpected downloads during inference
  • Request RDMA resources explicitly β€” openshift.io/mellanoxnics=1 ensures the pod lands on an RDMA-capable node
  • Match TP size to GPU count per node β€” TP=2 for 2 GPUs, TP=8 for 8 GPUs; add PP for multi-node
  • Set --max-model-len for production β€” don’t use the model’s full context unless you need it
  • Non-root is mandatory on OpenShift β€” pre-set file ownership on PVCs
  • Preemptible for dev/test β€” saves GPU quota; use non-preemptible for production endpoints

Key Takeaways

  • runai inference distributed submit orchestrates multi-GPU vLLM deployment with one command
  • NCCL auto-discovers NVLink (123.6 GB/s) for intra-node GPU communication
  • GPUDirect RDMA requires: Multus secondary NIC + RDMA shared device plugin + NCCL_SOCKET_IFNAME
  • Air-gapped deployment: pre-load model to PVC, mirror image to internal registry, set offline flags
  • Read NCCL logs: NET/IB = RDMA βœ…, NET/Socket = TCP fallback ❌, P2P/CUMEM = NVLink βœ…
  • vLLM 0.18.0 auto-detects: dtype, quantization (FP8), architecture, enables chunked prefill + CUDAGraph
  • Pin image versions, limit context length, and run non-root for production deployments
#runai #vllm #distributed-inference #tensor-parallelism #nccl #nvlink #gpudirect-rdma #multimodal #air-gapped
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens