FSDP Distributed Training on Run:ai
Run PyTorch FSDP distributed training workloads on Run:ai with GPU scheduling, event tracking, and GPU memory monitoring. Covers Mistral-class model
π‘ Quick Answer: Submit PyTorch FSDP (Fully Sharded Data Parallel) training jobs through Run:ai to automatically schedule multi-GPU workloads with preemption, fairness queuing, and real-time GPU utilization monitoring. Run:ai handles container lifecycle, GPU allocation, and metrics collection.
The Problem
Training large language models (7B-100B+ parameters) requires:
- Multiple GPUs with FSDP sharding across devices
- Fair scheduling when multiple teams share a GPU cluster
- Real-time GPU compute and memory utilization monitoring
- Automatic container lifecycle management (pull, create, start)
- Preemption support for priority workloads
The Solution
Submit FSDP Training Workload via Run:ai CLI
# Submit a Mistral-class model fine-tuning job
runai submit mistral-fsdp \
--project team-nlp \
--type training \
--gpu 4 \
--image registry.example.com/ml/pytorch-fsdp:2.3.0-cuda12.4 \
--pvc data-vol:/data \
--pvc model-vol:/models \
--environment MASTER_PORT=29500 \
--environment NCCL_DEBUG=INFO \
--environment TORCH_DISTRIBUTED_DEBUG=DETAIL \
--command "torchrun --nproc_per_node=4 train_fsdp.py \
--model_name mistral-small \
--dataset /data/finetune-dataset \
--output_dir /models/checkpoints \
--fsdp_sharding_strategy FULL_SHARD \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--num_train_epochs 3 \
--bf16 true"Run:ai Workload YAML (Declarative)
apiVersion: run.ai/v2alpha1
kind: TrainingWorkload
metadata:
name: mistral-fsdp
namespace: runai-team-nlp
spec:
gpu:
value: "4"
image:
value: registry.example.com/ml/pytorch-fsdp:2.3.0-cuda12.4
name:
value: mistral-fsdp
pvcs:
items:
- claimName: data-vol
path: /data
- claimName: model-vol
path: /models
environment:
items:
- name: MASTER_PORT
value: "29500"
- name: NCCL_DEBUG
value: INFO
- name: FSDP_SHARDING_STRATEGY
value: FULL_SHARD
command:
value: >-
torchrun --nproc_per_node=4 train_fsdp.py
--model_name mistral-small
--fsdp_sharding_strategy FULL_SHARD
--bf16 true
priority:
value: low
preemptible:
value: trueWorkload Lifecycle Events
Date & Time Event Type Issuer Component Details
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
05/05/2026 15:32:36.000 Started Normal kubelet Pod Started container pytorch
05/05/2026 15:32:36.000 Created Normal kubelet Pod Created container pytorch
05/05/2026 15:32:56.173 Status - - - Running
05/05/2026 15:33:00.000 Running Normal status- PodGroup Job status changed from
updater ContainerCreating to Run
05/05/2026 15:34:01.000 Pulled Normal kubelet Pod Container image already
present on machine
05/05/2026 15:34:03.000 Created Normal kubelet Pod Created container pytorch
05/05/2026 15:34:03.000 Started Normal kubelet Pod Started container pytorchGPU Metrics During Training
GPU Compute Utilization:
βββ Ramp-up phase: 0% β 95% (first 30s as FSDP initializes)
βββ Steady-state: 85-98% (training loop active)
βββ Checkpoint intervals: drops to ~20% briefly
βββ End of epoch: brief 0% during eval
GPU Memory Usage:
βββ Model shards loaded: ~12 GB per GPU (FSDP FULL_SHARD)
βββ Optimizer states: ~8 GB per GPU
βββ Activation memory: ~4 GB per GPU (with gradient checkpointing)
βββ NCCL buffers: ~2 GB
βββ Total per GPU: ~26 GB / 80 GB (A100)
GPU Memory Utilization:
βββ Steady-state: ~32% of 80GB A100
βββ Peak (during all-gather): ~65%
βββ With gradient accumulation: ~45%Monitor Running Workload
# Check workload status
runai describe job mistral-fsdp -p team-nlp
# Stream logs
runai logs mistral-fsdp -p team-nlp --tail 100 -f
# Check GPU utilization via Run:ai
runai top job mistral-fsdp -p team-nlp
# Or via oc/kubectl
oc exec -n runai-team-nlp mistral-fsdp-0-0 -- nvidia-smi
# DCGM metrics (if DCGM exporter running)
oc exec -n runai-team-nlp mistral-fsdp-0-0 -- \
dcgmi dmon -e 203,204,1001,1002 -d 1000FSDP Configuration for Run:ai
# train_fsdp.py - Key FSDP configuration
import torch
from torch.distributed.fsdp import (
FullyShardedDataParallel as FSDP,
ShardingStrategy,
MixedPrecision,
BackwardPrefetch,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
# FSDP wrapping policy for transformer models
auto_wrap_policy = transformer_auto_wrap_policy(
transformer_layer_cls={MistralDecoderLayer},
)
# Mixed precision for memory efficiency
mp_policy = MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.bfloat16,
buffer_dtype=torch.bfloat16,
)
# Wrap model
model = FSDP(
model,
sharding_strategy=ShardingStrategy.FULL_SHARD,
mixed_precision=mp_policy,
auto_wrap_policy=auto_wrap_policy,
backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
device_id=torch.cuda.current_device(),
limit_all_gathers=True, # Reduces peak memory
)Run:ai Scheduling Behavior
Scheduling flow:
1. User submits training workload (priority: low, preemptible: true)
2. Run:ai scheduler checks project GPU quota
3. If GPUs available β schedule immediately
4. If over-quota β queue until resources free
5. If higher-priority job arrives β preempt this job
6. Preempted job re-queues and resumes from last checkpoint
Priority levels:
βββ Critical β never preempted
βββ High β preempts low/medium
βββ Medium β preempts low
βββ Low β preempted by all (good for training experiments)Multi-Node FSDP (Large Models)
# For models that don't fit on a single node (70B+)
runai submit mistral-large-fsdp \
--project team-nlp \
--type training \
--gpu 8 \
--workers 2 \
--image registry.example.com/ml/pytorch-fsdp:2.3.0-cuda12.4 \
--environment NCCL_IB_DISABLE=0 \
--environment NCCL_NET_GDR_LEVEL=5 \
--environment NCCL_SOCKET_IFNAME=eth0 \
--command "torchrun \
--nnodes=2 \
--nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint=\$MASTER_ADDR:\$MASTER_PORT \
train_fsdp.py --model_name mistral-large --fsdp_sharding_strategy HYBRID_SHARD"Common Issues
Job stuck in βInitializingβ state
- Cause: Image pull taking long (large PyTorch images ~15GB)
- Fix: Pre-pull images on GPU nodes; use
imagePullPolicy: IfNotPresent
GPU memory OOM during training
- Cause: Batch size too large or gradient accumulation not enabled
- Fix: Reduce
per_device_train_batch_size; enable gradient checkpointing
NCCL timeout errors
- Cause: Network issues between GPUs/nodes; one rank slower
- Fix: Set
NCCL_TIMEOUT=1800; check for stragglers; verify InfiniBand/RoCE
Job preempted mid-training
- Cause: Higher-priority job needed GPUs
- Fix: Save checkpoints frequently; use
--save_steps 500; set higher priority
Best Practices
- Always use
preemptible: truefor experimental training β improves cluster utilization - Checkpoint every N steps β protects against preemption and OOM
- Use FULL_SHARD for memory efficiency on single-node multi-GPU
- Use HYBRID_SHARD for multi-node (shards within node, replicates across)
- Set NCCL_DEBUG=INFO initially, then disable for production runs (reduces overhead)
- Monitor GPU memory via Run:ai metrics β stay below 80% to avoid OOM spikes
Key Takeaways
- Run:ai handles GPU scheduling, queuing, and preemption for FSDP training jobs
- Workload lifecycle: Pulled β Created β Started β Running (tracked via events)
- GPU metrics: compute utilization, memory usage, and GPU usage graphs in Run:ai UI
- FSDP FULL_SHARD splits model, optimizer, and gradients across all GPUs
- Preemptible + low priority = maximum cluster utilization for training experiments
- Multi-node FSDP uses HYBRID_SHARD with NCCL RDMA for cross-node communication
- Run:ai + Thanos collects historical GPU metrics (when Thanos Receive is healthy π)

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
