ai intermediate ⏱ 15 minutes K8s 1.28+

Run:ai Training Job Submit Script Pattern

Production pattern for submitting Run:ai training jobs via shell scripts with GPU fractional allocation, NFS mounts, custom Python environments, and private

By Luca Berton • 5 min read

💡 Quick Answer: Wrap runai training standard submit in a shell script with image pinning (SHA256), GPU fraction requests, NFS shared storage, Python virtual environments, and private PyPI configuration for reproducible fine-tuning job submission.

The Problem

Submitting Run:ai training jobs manually via CLI is error-prone. You need:

  • Reproducible job submission (version-controlled scripts)
  • GPU fractional allocation (MIG or time-slicing)
  • Shared NFS storage for datasets and checkpoints
  • Private container registries with SHA pinning
  • Custom Python environments with private PyPI mirrors
  • Consistent UID/GID for NFS permission compatibility

The Solution

Complete Training Submit Script

#!/bin/bash

echo "Submitting fine-tuning job"
export MSYS_NO_PATHCONV=1
export MSYS2_ARG_CONV_EXCL="*"

# Clean up previous run (optional)
#runai training standard delete finetune-job

# Pin image by SHA256 for reproducibility
IMAGE="registry.example.com/ml/vscode@sha256:51561bd181fbc8c55859ab6876f79b25af82f57bd5531a5edbf854783a747b45"

runai training standard submit finetune-job \
  --image "$IMAGE" \
  --gpu-devices-request 2 \
  --gpu-portion-request 1 \
  --gpu-portion-limit 1 \
  --run-as-uid 2000 \
  --run-as-gid 2000 \
  --working-dir /data/scripts/archive/gen-bench-main/llm/finetune-peft \
  --environment-variable CUDA_HOME=/shared/cuda-13.0 \
  --environment-variable VIRTUAL_ENV=/data/scripts/archive/gen-bench-main/llm/finetune-peft/.venv \
  --environment-variable UV_CACHE_DIR=/data/output/.cache/uv \
  --environment-variable UV_CONFIG_FILE=/data/scripts/archive/gen-bench-main/config/uv.toml \
  --nfs path=/ifs/S1000575/platform/shared,server=nfs-platform.sto.example.com,mountpath=/shared,readwrite \
  --environment-variable CUDA_VISIBLE_DEVICES=0 \
  --environment-variable PIP_INDEX=https://artifactory.example.com/api/pypi/pypi-virtual/pypi \
  --environment-variable PIP_INDEX_URL=https://artifactory.example.com/api/pypi/pypi-virtual/simple \
  --environment-variable PIP_TRUSTED_HOST=artifactory.example.com \
  --existing-pvc claimname=project-001,path=/data \
  --command -- uv run python finetune_mistral.py --config config/devstral_123b_1xH200.yaml

# Useful commands after submission:
#runai training standard exec finetune-job --pod finetune-job-0-0 -- nvidia-smi
#runai training standard describe finetune-job
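
One way to keep a long flag list readable is to build it as a Bash array, which allows a comment next to each flag, and to add a dry-run switch that prints the command instead of running it. The sketch below is an illustration under stated assumptions, not the author's script: JOB_NAME, DRY_RUN, build_args, and submit_job are hypothetical names, and only a subset of the flags from the script above is reproduced.

```shell
set -euo pipefail

# Hypothetical variables for this sketch (not part of the runai CLI).
JOB_NAME="${JOB_NAME:-finetune-job}"
DRY_RUN="${DRY_RUN:-1}"   # defaults to printing only; set DRY_RUN=0 to submit
IMAGE="registry.example.com/ml/vscode@sha256:51561bd181fbc8c55859ab6876f79b25af82f57bd5531a5edbf854783a747b45"

# Collect flags in an array so each one can carry its own comment.
build_args() {
  ARGS=(
    --image "$IMAGE"
    --gpu-devices-request 2     # number of GPU devices
    --gpu-portion-request 1     # fraction of each device
    --run-as-uid 2000
    --run-as-gid 2000
    --existing-pvc claimname=project-001,path=/data
  )
}

submit_job() {
  build_args
  local cmd=(runai training standard submit "$JOB_NAME" "${ARGS[@]}"
             --command -- uv run python finetune_mistral.py
             --config config/devstral_123b_1xH200.yaml)
  if [ "$DRY_RUN" = "1" ]; then
    # Print the assembled command instead of executing it.
    printf '%s ' "${cmd[@]}"
    printf '\n'
  else
    "${cmd[@]}"
  fi
}

submit_job
```

Running the sketch as written only prints the command, which makes it safe to review in a merge request before anyone runs it with DRY_RUN=0.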

Script Breakdown

GPU Fractional Allocation

--gpu-devices-request 2      # Request 2 physical GPU devices
--gpu-portion-request 1      # Request 100% of each GPU (1.0 = full GPU)
--gpu-portion-limit 1        # Limit to 100% (no overcommit)

GPU allocation modes:
├── --gpu-devices-request N    → Number of physical GPUs
├── --gpu-portion-request 0.5  → 50% of each GPU (MIG or time-slicing)
├── --gpu-portion-limit 1      → Hard cap (prevents burst above allocation)
└── --gpu-memory-request 20Gi  → Request by VRAM instead of portion
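
The total GPU capacity a job asks for is the device count multiplied by the per-device portion. As a quick sanity check before submitting, that arithmetic can be sketched as below; total_gpu is a hypothetical helper written for this illustration, not a runai command.

```shell
# total_gpu is a hypothetical helper, not part of the runai CLI.
# It multiplies the device count by the per-device portion.
total_gpu() {
  local devices="$1" portion="$2"
  LC_ALL=C awk -v d="$devices" -v p="$portion" 'BEGIN { printf "%.2f\n", d * p }'
}

total_gpu 2 1      # the values used in the submit script: two full devices
total_gpu 1 0.5    # half of a single device
```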

UID/GID for NFS Compatibility

--run-as-uid 2000      # Match NFS export squash UID
--run-as-gid 2000      # Match NFS export squash GID

This ensures files created in NFS shares have correct ownership, avoiding permission denied errors when multiple users share storage.
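
A small preflight check at the start of the training entrypoint can catch a UID/GID mismatch before any file is written. The following is a minimal sketch; check_ids is a hypothetical helper, and it relies only on the standard id utility.

```shell
# check_ids is a hypothetical preflight helper for this sketch.
# It compares the process UID/GID with the values the NFS export expects.
check_ids() {
  local want_uid="$1" want_gid="$2"
  local have_uid have_gid
  have_uid="$(id -u)"
  have_gid="$(id -g)"
  if [ "$have_uid" != "$want_uid" ] || [ "$have_gid" != "$want_gid" ]; then
    echo "UID/GID mismatch: running as ${have_uid}:${have_gid}, expected ${want_uid}:${want_gid}" >&2
    return 1
  fi
  echo "UID/GID ok: ${have_uid}:${have_gid}"
}

# Inside the job container the call would be: check_ids 2000 2000
# Here the current IDs are passed so the sketch runs anywhere.
check_ids "$(id -u)" "$(id -g)"
```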

NFS Mount (Shared Storage)

--nfs path=/ifs/S1000575/platform/shared,server=nfs-platform.sto.example.com,mountpath=/shared,readwrite

NFS mount options in Run:ai:
├── path       → Export path on NFS server
├── server     → NFS server hostname/IP
├── mountpath  → Mount point inside container
└── readwrite  → Access mode (readwrite | readonly)
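
The --nfs value has to reach the CLI as a single shell word, so splitting it across indented continuation lines would break it into several arguments. One way to keep it readable is to assemble the string in a function. This is a sketch only; nfs_flag is a hypothetical helper, and the two access modes are the ones listed above.

```shell
# nfs_flag is a hypothetical helper; it only assembles the comma-separated value.
nfs_flag() {
  local path="$1" server="$2" mountpath="$3" mode="${4:-readwrite}"
  case "$mode" in
    readwrite|readonly) ;;
    *) echo "mode must be readwrite or readonly, got: $mode" >&2; return 1 ;;
  esac
  printf 'path=%s,server=%s,mountpath=%s,%s\n' "$path" "$server" "$mountpath" "$mode"
}

nfs_flag /ifs/S1000575/platform/shared nfs-platform.sto.example.com /shared
```

In a submit script the result would be passed quoted, for example --nfs "$(nfs_flag ...)", so it stays one argument.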

Existing PVC (Dataset Storage)

--existing-pvc claimname=project-001,path=/data

A pre-provisioned PVC holds the datasets and output directories, and it persists across job restarts.

Python Environment (uv + Private PyPI)

--environment-variable VIRTUAL_ENV=/data/.../finetune-peft/.venv \
--environment-variable UV_CACHE_DIR=/data/output/.cache/uv \
--environment-variable UV_CONFIG_FILE=/data/.../config/uv.toml \
--environment-variable PIP_INDEX=https://artifactory.example.com/.../pypi \
--environment-variable PIP_INDEX_URL=https://artifactory.example.com/.../simple \
--environment-variable PIP_TRUSTED_HOST=artifactory.example.com \

Using uv (fast Python package manager) with:

  • Shared virtual environment on persistent storage (no reinstall per job)
  • Private Artifactory PyPI mirror (air-gapped environments)
  • Cache directory on persistent volume (speeds up subsequent runs)
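
The recipe does not show the contents of the uv.toml that UV_CONFIG_FILE points to. Because uv does not read pip-specific settings such as PIP_INDEX_URL, the mirror normally has to be declared in uv's own configuration. The sketch below writes one plausible uv.toml; the file contents and the UV_TOML variable are assumptions made for illustration, and only the Artifactory URL comes from the script.

```shell
# Hypothetical location for this sketch; the real uv.toml lives on the PVC
# and its actual contents are not shown in this recipe.
UV_TOML="${UV_TOML:-$(mktemp -d)/uv.toml}"

cat > "$UV_TOML" <<'EOF'
# Assumed contents: use the Artifactory mirror as the default package index.
[[index]]
name = "artifactory"
url = "https://artifactory.example.com/api/pypi/pypi-virtual/simple"
default = true
EOF

echo "wrote $UV_TOML"
# A job would then select it with: --environment-variable UV_CONFIG_FILE=<path>
```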

Image Pinning by SHA256

IMAGE="registry.example.com/ml/vscode@sha256:51561bd181..."

Never use :latest in production training. SHA pinning ensures:

  • Exact same image across all runs
  • No surprise dependency changes mid-experiment
  • Audit trail of which image produced which results
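
A submit script can enforce this rule by refusing any image reference that is not pinned by digest. The following is a minimal sketch; is_digest_pinned is a hypothetical helper that accepts only references ending in @sha256: followed by 64 lowercase hex characters.

```shell
# is_digest_pinned is a hypothetical helper for this sketch.
# It succeeds only when the reference ends in @sha256:<64 lowercase hex chars>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

IMAGE="registry.example.com/ml/vscode@sha256:51561bd181fbc8c55859ab6876f79b25af82f57bd5531a5edbf854783a747b45"

if is_digest_pinned "$IMAGE"; then
  echo "image is pinned by digest"
else
  echo "image is not pinned by digest; refusing to submit" >&2
fi
```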

Training Config File Pattern

# config/devstral_123b_1xH200.yaml
model:
  name: mistral-small-instruct
  size: 123b
  dtype: bfloat16

training:
  strategy: fsdp
  sharding: full_shard
  batch_size_per_device: 1
  gradient_accumulation_steps: 8
  num_epochs: 3
  learning_rate: 2.0e-5
  warmup_steps: 100

hardware:
  gpus: 1
  gpu_type: H200
  precision: bf16

data:
  dataset_path: /data/datasets/instruction-tuning
  max_seq_length: 4096
  
output:
  checkpoint_dir: /data/output/checkpoints
  save_steps: 500
  logging_steps: 10

Management Commands

# Check job status
runai training standard describe finetune-job

# Exec into running Pod
runai training standard exec finetune-job \
  --pod finetune-job-0-0 -- nvidia-smi

# Stream logs
runai training standard logs finetune-job -f

# Delete job
runai training standard delete finetune-job

# List all training jobs in project
runai training standard list

Common Issues

Permission denied on NFS mount

  • Cause: Container UID doesn't match NFS export anonuid
  • Fix: Set --run-as-uid and --run-as-gid to match NFS server config

CUDA_HOME not found

  • Cause: CUDA toolkit on shared NFS not in expected path
  • Fix: Verify NFS mount succeeded; check path with ls /shared/cuda-13.0

uv fails to install packages

  • Cause: Private PyPI mirror unreachable from GPU node
  • Fix: Verify network policy allows egress to Artifactory; check PIP_TRUSTED_HOST

GPU portion request denied

  • Cause: Requested GPU fraction not available (MIG not configured for that slice)
  • Fix: Check available GPU fractions with runai list nodes; adjust portion request
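
Some of these failures can be surfaced before training starts. As one example, the CUDA_HOME issue can be checked with a short preflight step at the top of the entrypoint. This is a sketch under assumptions: check_cuda_home is a hypothetical helper, and it only tests that the directory exists.

```shell
# check_cuda_home is a hypothetical preflight helper for this sketch.
# It fails early when the toolkit directory is absent, for example because
# the NFS mount did not succeed.
check_cuda_home() {
  local dir="${1:-${CUDA_HOME:-}}"
  if [ -z "$dir" ]; then
    echo "CUDA_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$dir" ]; then
    echo "CUDA_HOME directory missing: $dir" >&2
    return 1
  fi
  echo "CUDA_HOME ok: $dir"
}

check_cuda_home /shared/cuda-13.0 || echo "preflight failed; check the NFS mount"
```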

Best Practices

  1. Pin images by SHA256, never :latest, for training reproducibility
  2. Version-control submit scripts: treat them as code in GitLab
  3. Use NFS for shared CUDA/datasets: avoid downloading per job
  4. Set UID/GID explicitly: NFS permission errors are the #1 time waste
  5. uv over pip: 10-100x faster package resolution
  6. Config files over CLI args: easier to track experiments
  7. Separate existing-pvc for data: survives job deletion; shared across experiments

Key Takeaways

  • Shell scripts wrap runai training standard submit for reproducibility
  • GPU fraction allocation: --gpu-devices-request × --gpu-portion-request
  • NFS mounts provide shared CUDA toolkit, datasets, and model weights
  • Private PyPI (Artifactory) enables air-gapped package installation
  • uv run python executes within the persistent virtual environment
  • UID/GID alignment critical for NFS permission compatibility
  • SHA256 image pinning ensures experiment reproducibility
  • Config YAML files (e.g., devstral_123b_1xH200.yaml) parametrize training runs
#runai #training #gpu #finetuning #shell-scripting
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
