Run:ai Workload Controllers on OpenShift
Understand Run:ai cluster-level workload controllers on OpenShift: workload-controller, workload-overseer, workload-exporter, and status-updater components.
💡 Quick Answer: Run:ai deploys 5 cluster-level controllers in the runai namespace that manage the GPU workload lifecycle: scheduling, status tracking, metrics export, and shared resource coordination. These run on infra nodes alongside per-node DaemonSets (node-exporter, runtime-installer, container-toolkit).
The Problem
When GPU workloads don't schedule, metrics are missing, or job status is stale, you need to know which Run:ai controller to investigate.
The Solution
Cluster-Level Controllers
oc get pods -n runai
# NAME READY STATUS AGE
# shared-objects-controller-<hash> 1/1 Running 23h
# status-updater-<hash> 1/1 Running 0
# workload-controller-<hash> 1/1 Running 0
# workload-exporter-<hash> 1/1 Running 2 (23h)
# workload-overseer-<hash> 1/1 Running 0
Controller Responsibilities
| Controller | Purpose | Failure Impact |
|---|---|---|
| workload-controller | Reconciles Run:ai workloads → K8s Pods/Jobs | Jobs won't start/stop |
| workload-overseer | Monitors workload health, enforces policies | No preemption, no fairness |
| workload-exporter | Exports workload metrics to Prometheus | Missing dashboard data |
| status-updater | Syncs workload status to Run:ai backend | UI shows stale status |
| shared-objects-controller | Manages shared ConfigMaps, Secrets, PVCs | Shared resources unavailable |
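To find which of the controllers above is unhealthy, the pod listing can be filtered instead of read by eye. The following is a minimal sketch: `unready_pods` is a hypothetical helper name, and it assumes the default column order of `oc get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin and
# prints every pod whose READY column is not n/n or whose STATUS is not Running.
unready_pods() {
  awk '{
    split($2, ready, "/")                      # READY looks like "1/1"
    if (ready[1] != ready[2] || $3 != "Running")
      print $1, $2, $3                         # name, ready, status
  }'
}

# Usage against a live cluster (needs oc and read access to the runai namespace):
#   oc get pods -n runai --no-headers | unready_pods
```

An empty result means every controller pod reports Ready and Running; any line printed names the pod to check against the Failure Impact column.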
Per-Node DaemonSets
# These run on every GPU node
oc get ds -n runai
# NAME DESIRED CURRENT READY
# runai-node-exporter 8 8 8 # GPU metrics
# runai-runtime-installer 8 8 8 # Container runtime hooks
# runai-container-toolkit 8 8 8 # GPU toolkit injection
Workload Controller Deep Dive
workload-controller watches:
├── RunaiJob CRD → Creates K8s Jobs/Pods
├── RunaiTrainingWorkload → Multi-node training setup
├── RunaiInferenceWorkload → Deployment with GPU scheduling
└── RunaiInteractiveWorkload → Notebook/IDE Pods
Reconciliation loop:
1. User submits workload via UI/CLI
2. Run:ai backend creates CRD in cluster
3. workload-controller detects new CRD
4. Creates K8s resources (Pod, Service, PVC)
5. GPU scheduler places Pod on best node
6. status-updater reports back to backend
Workload Exporter Metrics
# Metrics exported by workload-exporter:
runai_workload_status           # Current state (pending/running/completed/failed)
runai_workload_gpu_allocation   # GPUs allocated per workload
runai_workload_runtime_seconds  # Total runtime
runai_workload_queue_time       # Time spent waiting for resources
runai_workload_preemptions      # Number of preemptions
Troubleshooting Controllers
# Check controller logs
oc logs -n runai deploy/workload-controller --tail=50
oc logs -n runai deploy/workload-overseer --tail=50
oc logs -n runai deploy/status-updater --tail=50
# Check if controllers are leader-elected
oc get lease -n runai
# Restart a specific controller
oc rollout restart deploy/workload-controller -n runai
# Check workload CRDs
oc get runaiworkloads -A
oc get runaijobs -A
Workload Exporter Restart Count
# workload-exporter shows "2 (23h)" in the RESTARTS column
# This means 2 restarts, the most recent about 23 hours ago. Possible causes:
# - An OOMKill during a metric spike
# - A restart during node maintenance
# Check restart reason:
oc get pod -n runai -l app=workload-exporter -o jsonpath='{.items[0].status.containerStatuses[0].lastState}'
Integration with Run:ai Backend
runai namespace (cluster agents)        runai-backend namespace (control plane)
┌─────────────────────────────┐         ┌───────────────────────────────────┐
│ workload-controller         │──NATS──▶│ cluster-service                   │
│ status-updater              │──NATS──▶│ workloads-service                 │
│ workload-exporter           │──Prom──▶│ metrics-service → thanos-receive  │
│ workload-overseer           │──NATS──▶│ policy-service                    │
│ shared-objects-controller   │         │                                   │
└─────────────────────────────┘         └───────────────────────────────────┘
               │                                         │
               ▼                                         ▼
GPU Nodes (DaemonSets)                  PostgreSQL + NATS
- node-exporter                         (persistent state)
- runtime-installer
- container-toolkit
Common Issues
Workloads stuck in "Pending" forever
- Cause: workload-controller can't create Pods (RBAC, quota, or crash)
- Fix: Check controller logs; verify ClusterRole bindings
Dashboard shows "Unknown" status
- Cause: status-updater can't reach Run:ai backend (NATS down)
- Fix: Check NATS cluster health; verify network policies
GPU metrics missing from Grafana
- Cause: workload-exporter crashing or node-exporter DaemonSet not ready
- Fix: Check exporter Pod restarts; verify ServiceMonitor exists
Preemption not working
- Cause: workload-overseer not running or policy-service unreachable
- Fix: Check overseer logs; verify NATS connectivity to backend
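The symptom-to-controller mapping in the issues above can be captured in a small lookup so the right logs are one command away. This is only a sketch: `controller_for` and its symptom keywords are invented for illustration, and the Deployment names are assumed to match the controller names used elsewhere in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a symptom keyword (made up for this sketch) to the
# controller the Common Issues list points at as the first place to look.
controller_for() {
  case "$1" in
    pending)    echo "workload-controller" ;;  # workloads stuck in Pending
    unknown)    echo "status-updater" ;;       # dashboard shows Unknown status
    metrics)    echo "workload-exporter" ;;    # GPU metrics missing from Grafana
    preemption) echo "workload-overseer" ;;    # preemption not working
    *)          echo "unrecognized symptom: $1" >&2; return 1 ;;
  esac
}

# Usage against a live cluster (assumes a Deployment with the same name exists):
#   oc logs -n runai "deploy/$(controller_for pending)" --tail=50
```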
Best Practices
- Monitor controller restarts: more than 5/day indicates resource issues
- Check NATS connectivity: all controllers depend on NATS for backend comms
- DaemonSets must be 100% ready: missing node-exporter = missing GPU metrics
- Don't scale controllers: they use leader election (only 1 active)
- Log level info is sufficient: debug level causes excessive NATS traffic
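The "DaemonSets must be 100% ready" rule can be checked mechanically. The sketch below assumes the default column order of `oc get ds --no-headers` (NAME, DESIRED, CURRENT, READY, ...); `lagging_daemonsets` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get ds --no-headers` output on stdin and prints
# each DaemonSet whose READY count (column 4) is below DESIRED (column 2).
lagging_daemonsets() {
  awk '$4 + 0 < $2 + 0 { print $1 ": " $4 "/" $2 " ready" }'
}

# Usage against a live cluster:
#   oc get ds -n runai --no-headers | lagging_daemonsets
```

No output means every DaemonSet has all desired Pods ready; a printed line identifies which per-node component is short.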
Key Takeaways
- 5 controllers in the runai namespace manage the full workload lifecycle
- Communication to backend is via NATS (events, status) and Prometheus (metrics)
- Per-node DaemonSets (node-exporter, runtime-installer, container-toolkit) run on every GPU node
- workload-controller is the most critical: without it, no Pods get created
- Restart counts of 1-2 over 23h are normal; 100+ indicates OOM or crash loop
- All controllers are stateless, so a restart fixes most transient issues

