Run:ai GPU Scheduling with Kubeflow MPIJob
Integrate Run:ai GPU scheduler with Kubeflow MPIJob for multi-node NCCL workloads. Covers Run:ai project namespaces, GPU quota annotations, pod group
π‘ Quick Answer: Run:ai automatically manages GPU scheduling for Kubeflow MPIJob workloads. Deploy MPIJobs into Run:ai project namespaces (e.g.,
my-project). Run:ai adds tracking annotations (runai-current-allocated-gpus,runai-used-nodes) and enforces per-project GPU quotas. UsecleanPodPolicy: NoneandbackoffLimit: 0for validation jobs to preserve logs and prevent retry loops.
The Problem
- Multi-node GPU jobs need gang scheduling (all workers or none)
- GPU quota must be enforced per team/project
- Need visibility into GPU memory allocation across nodes
- Kubernetes default scheduler doesnβt understand multi-node GPU topology
- Failed validation jobs shouldnβt retry indefinitely
The Solution
Run:ai Project Namespace
# MPIJob deployed to a Run:ai project namespace
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
name: nccl-roce-validation
namespace: my-project # Run:ai project namespace
spec:
runPolicy:
cleanPodPolicy: None # Keep pods for log inspection
backoffLimit: 0 # Don't retry on failure
# ...Run:ai Annotations (Auto-Added)
# Run:ai controller automatically adds these annotations:
metadata:
annotations:
# Status tracking
runai-calculated-status: Running # Overall job status
runai-pending-pods: "0" # Pods waiting for resources
runai-running-pods: "2" # Active worker pods
# GPU allocation
runai-current-allocated-gpus: "4" # Total GPUs in use
runai-current-allocated-gpus-memory: "301509" # MB (~294 GB)
runai-current-requested-gpus: "4" # Requested GPU count
runai-total-requested-gpus: "4" # Total across all pods
# Pod group tracking
runai-podgroup-requested-gpus: "4" # Gang total
runai-podgroup-requested-gpus-memory: "0" # 0 = no memory fraction
# Topology
runai-used-nodes: gpu-worker-01, gpu-worker-02 # PlacementGPU Memory Calculation
runai-current-allocated-gpus-memory: "301509"
301509 MB Γ· 4 GPUs = ~75,377 MB per GPU = ~73.6 GB
β H200 NVL (80 GB HBM3e, usable ~75 GB after ECC)
This confirms:
- 4Γ NVIDIA H200 NVL GPUs allocated
- Full GPU allocation (not MIG or time-slicing)
- Memory tracking is per-GPU-model awareRun:ai Scheduling Flow
1. User submits MPIJob to Run:ai project namespace
2. Run:ai admission webhook intercepts the MPIJob
3. Run:ai scheduler evaluates:
- Project GPU quota (e.g., 8 GPU limit)
- Current usage in project
- Node availability with required GPUs
- Anti-affinity requirements (multi-node)
4. If resources available β schedule all pods together (gang)
5. If resources unavailable β queue job with priority
6. Run:ai adds tracking annotations to MPIJob
7. Pods start β Run:ai updates status annotations
8. Job completes β annotations show final stateRun Policy Configuration
runPolicy:
# Don't retry failed validation jobs
backoffLimit: 0
# Keep pods after completion for log access
cleanPodPolicy: None
# Options:
# None β keep all pods (launcher + workers) after completion
# Running β delete running pods on failure, keep completed
# All β delete all pods after completion (clean but no logs)
# Don't suspend by default
suspend: falseGang Scheduling Behavior
# Run:ai ensures all workers start together (gang scheduling):
# - 2 workers Γ 2 GPUs = 4 GPUs total must be available
# - If only 2 GPUs available β job stays queued (not partial start)
# - Workers must fit on different nodes (if anti-affinity set)
# slotsPerWorker tells MPI Operator and Run:ai the GPU-per-worker ratio:
spec:
slotsPerWorker: 2 # Each worker pod gets 2 GPUs (slots)
# Total GPUs = replicas Γ slotsPerWorker = 2 Γ 2 = 4Project Quota Example
# Run:ai project configuration (admin sets via UI or API):
# Project: my-project
# Quota: 8 GPUs guaranteed, 16 GPUs over-quota allowed
# Priority: Normal
# Node pools: gpu-h200-pool
# The MPIJob requesting 4 GPUs fits within 8 GPU quota
# No queuing required if other jobs use β€ 4 GPUs in this projectMonitoring with Run:ai CLI
# Check job status:
runai describe job nccl-roce-validation -p my-project
# List GPU allocation:
runai list jobs -p my-project
# NAME STATUS GPUs NODE
# nccl-roce-validation Running 4 gpu-worker-01,gpu-worker-02
# Check project quota:
runai list projects
# PROJECT GPU_QUOTA GPU_USED OVER_QUOTA
# my-project 8 4 0Common Issues
Job stuck in βPendingβ with Run:ai
- Cause: Project GPU quota exceeded or no nodes with required GPUs
- Fix: Check
runai list jobs; free GPUs or increase quota
Workers scheduled on same node (not distributed)
- Cause: No anti-affinity and Run:ai optimizes for bin-packing
- Fix: Add podAntiAffinity or use Run:ai topology-aware scheduling
βrunai-current-allocated-gpus-memory: 0β
- Cause: GPU memory tracking not available (older GPU Operator)
- Fix: Upgrade GPU Operator; verify DCGM exporter is running
Job succeeds but runai-calculated-status still βRunningβ
- Cause: Status reconciliation delay with Kubeflow controller
- Fix: Wait 30-60 seconds; Run:ai eventually picks up MPIJob condition
Best Practices
- Deploy to Run:ai project namespace β enables quota and priority
backoffLimit: 0for validation β donβt burn GPU hours on retriescleanPodPolicy: Nonefor debugging β preserves worker logsslotsPerWorker= GPUs per worker β enables correct gang accounting- Check
runai-used-nodesβ confirms multi-node placement - Monitor GPU memory β 301509 MB Γ· 4 confirms H200 full allocation
- Use Run:ai priorities for production vs. benchmark workloads
Key Takeaways
- Run:ai automatically tracks GPU allocation via annotations on MPIJob
- Gang scheduling ensures all workers start together or none start
- GPU memory tracking confirms correct hardware allocation (H200 = ~75 GB each)
cleanPodPolicy: None+backoffLimit: 0= ideal for validation workloadsslotsPerWorkermust match GPU count in worker resource requests- Run:ai node placement visible via
runai-used-nodesannotation - Project quotas enforce per-team fairness without manual scheduling

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
