Run:ai Distributed PyTorch Training on OpenShift
Submit multi-node distributed PyTorch training jobs on OpenShift using Run:ai CLI. Covers DDP, FSDP, RDMA networking, and GPU scheduling.
π‘ Quick Answer: Run:ai provides a CLI (
runai training pytorch submit) to launch distributed PyTorch jobs on OpenShift with automatic rank assignment, RDMA networking, and GPU scheduling. It handlesMASTER_ADDR,MASTER_PORT, andRANKinjection automatically.
The Problem
Launching distributed training on OpenShift manually requires:
- Creating headless Services for worker discovery
- Managing environment variables (RANK, MASTER_ADDR, MASTER_PORT) per Pod
- Configuring SR-IOV / Mellanox RDMA network attachments
- Handling security contexts (non-root UID/GID for OpenShift SCC)
- Coordinating shared storage mounts across workers
Run:ai abstracts all of this into a single CLI command.
The Solution
Submit a Multi-Node DDP Job
runai training pytorch submit retinanet-ddp \
--image registry.example.com/training/pytorch:24.12 \
--annotation "k8s.v1.cni.cncf.io/networks=sriov-rdma" \
--extended-resource "openshift.io/mellanoxnics=1" \
--gpu-devices-request 1 \
--workers 2 \
--large-shm \
--cpu-core-request 16 \
--cpu-memory-request 2560 \
--cpu-memory-limit 8896 \
--run-as-uid 2000 \
--run-as-gid 2000 \
--environment-variable NCCL_DEBUG="INFO" \
--environment-variable NCCL_SOCKET_IFNAME="net1" \
--environment-variable NCCL_IB_QPS_PER_CONNECTION=1 \
--environment-variable NCCL_IB_SPLIT_DATA_ON_QPS=1 \
--environment-variable NCCL_SOCKET_NTHREADS=2 \
--environment-variable NCCL_NSOCKS_PERTHREAD=2 \
--environment-variable CUDA_VISIBLE_DEVICES=0 \
--environment-variable RDMAV_FORK_SAFE=1 \
--environment-variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--existing-pvc claimname=project-data,path=/data \
--command -- /data/scripts/train/shell/retinanet-ddp.shSubmit an FSDP Fine-Tuning Job
runai training pytorch submit mistral-finetune-fsdp \
--image registry.example.com/training/pytorch:24.12 \
--annotation "k8s.v1.cni.cncf.io/networks=sriov-rdma" \
--extended-resource "openshift.io/mellanoxnics=1" \
--large-shm \
--workers 2 \
--gpu-devices-request 1 \
--cpu-memory-request 3846 \
--cpu-memory-limit 8996 \
--run-as-uid 2000 \
--run-as-gid 2000 \
--working-dir /data/scripts/llm/finetune-peft \
--environment-variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--environment-variable NCCL_DEBUG="INFO" \
--environment-variable NCCL_IB_QPS_PER_CONNECTION=1 \
--environment-variable NCCL_IB_SPLIT_DATA_ON_QPS=1 \
--environment-variable NCCL_SOCKET_NTHREADS=2 \
--environment-variable NCCL_NSOCKS_PERTHREAD=2 \
--environment-variable NCCL_SOCKET_IFNAME="net1" \
--environment-variable CUDA_VISIBLE_DEVICES=0 \
--environment-variable RDMAV_FORK_SAFE=1 \
--existing-pvc claimname=project-data,path=/data \
--command -- /data/scripts/llm/shell/accelerate-peft-fsdp.shWhat Run:ai Injects Automatically
Run:ai sets these environment variables in each worker Pod:
RANK=0 # Worker index (0, 1, 2, ...)
MASTER_ADDR=... # IP/hostname of rank 0
MASTER_PORT=... # Port for rendezvous (default 29500)Your training script uses them directly:
#!/bin/bash
set -euo pipefail
echo "rank = $RANK"
echo "master addr = $MASTER_ADDR"
echo "master port = $MASTER_PORT"
export RDMAV_FORK_SAFE=1
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=${RANK} \
--master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
train.py \
--batch-size 16 \
--accum-steps 1 \
--image-size 800 \
--max-train 200 \
--backend cpu \
--num-workers 8 \
--epochs 2NCCL Environment Explained
| Variable | Value | Purpose |
|---|---|---|
NCCL_SOCKET_IFNAME | net1 | Use SR-IOV RDMA interface (not default eth0) |
NCCL_IB_QPS_PER_CONNECTION | 1 | QP count per IB connection |
NCCL_IB_SPLIT_DATA_ON_QPS | 1 | Distribute data across QPs |
NCCL_SOCKET_NTHREADS | 2 | Socket threads for fallback |
NCCL_NSOCKS_PERTHREAD | 2 | Sockets per thread |
RDMAV_FORK_SAFE | 1 | Safe RDMA after fork() |
CUDA_VISIBLE_DEVICES | 0 | Restrict to single GPU per worker |
PYTORCH_CUDA_ALLOC_CONF | expandable_segments:True | Reduce memory fragmentation |
Check Job Status
# View job status
runai training standard describe ${JOB_NAME}
# Get logs
runai training standard logs ${JOB_NAME} -f
# Shell into a running worker
runai training standard exec ${JOB_NAME} --pod ${JOB_NAME}-worker-0 -- /bin/bash
# Delete job
runai training standard delete ${JOB_NAME}SR-IOV / Mellanox RDMA on OpenShift
The --annotation and --extended-resource flags configure RDMA:
# Network attachment: SR-IOV interface for RDMA
--annotation "k8s.v1.cni.cncf.io/networks=sriov-rdma"
# Request one Mellanox NIC VF (Virtual Function)
--extended-resource "openshift.io/mellanoxnics=1"This gives each Pod a dedicated RDMA-capable network interface (net1) for NCCL communication at full InfiniBand/RoCE bandwidth.
Security: Non-Root Execution
OpenShift requires non-root by default. Run:ai supports this:
--run-as-uid 2000
--run-as-gid 2000Ensures Pods comply with OpenShift Security Context Constraints (SCC) without needing privileged or anyuid.
Common Issues
NCCL canβt find RDMA interface
- Cause:
NCCL_SOCKET_IFNAMEdoesnβt match the SR-IOV interface name - Fix: Check interface name inside Pod:
ip addr | grep netβ itβs usuallynet1
Job stuck in Pending
- Cause: Not enough GPUs or SR-IOV VFs available
- Fix: Check Run:ai dashboard for resource availability; reduce
--workers
Authentication failed for git pull
- Cause: Token expired for internal Git repos
- Fix: Refresh Git credentials in the mounted secret or config
OOM on large model loading
- Cause: All ranks trying to load full model weights simultaneously
- Fix: Use
fsdp_cpu_ram_efficient_loading(only rank 0 loads, broadcasts to others)
Best Practices
- Use
--large-shmβ enables large/dev/shmfor NCCL shared memory - Set
RDMAV_FORK_SAFE=1β prevents RDMA issues after Python multiprocessing fork - Use
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:Trueβ reduces OOM from fragmentation - Pin to
net1interface β SR-IOV interface for RDMA, not default Pod network - Non-root UIDs β always set
--run-as-uid/gidfor OpenShift SCC compliance - Mount persistent storage β use
--existing-pvcfor datasets, checkpoints, and scripts
Key Takeaways
- Run:ai CLI automates distributed PyTorch job submission on OpenShift
- Automatically injects
RANK,MASTER_ADDR,MASTER_PORTper worker - SR-IOV RDMA networking via annotations + extended resources
--large-shm+ NCCL tuning essential for multi-node GPU training- Non-root execution with
--run-as-uid/gidfor OpenShift SCC compliance

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
