ai · advanced · 15 minutes · K8s 1.28+

Run:ai Workload Controllers on OpenShift

Understand Run:ai cluster-level workload controllers on OpenShift: workload-controller, workload-overseer, workload-exporter, and status-updater components.

By Luca Berton · 5 min read

Quick Answer: Run:ai deploys 5 cluster-level controllers in the runai namespace that manage the GPU workload lifecycle: scheduling, status tracking, metrics export, and shared resource coordination. The controllers run on infra nodes, while the per-node DaemonSets (node-exporter, runtime-installer, container-toolkit) run on every GPU node.

The Problem

When GPU workloads don't schedule, metrics are missing, or job status is stale, you need to know which Run:ai controller to investigate.

The Solution

Cluster-Level Controllers

oc get pods -n runai

# NAME                                      READY   STATUS    RESTARTS   AGE
# shared-objects-controller-<hash>          1/1     Running              23h
# status-updater-<hash>                     1/1     Running   0
# workload-controller-<hash>                1/1     Running   0
# workload-exporter-<hash>                  1/1     Running   2 (23h)
# workload-overseer-<hash>                  1/1     Running   0
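
To spot an unhealthy controller quickly, the listing above can be filtered. This is a minimal sketch, assuming the default oc get pods column order (NAME, READY, STATUS, ...); unhealthy_pods is a hypothetical helper name, not part of Run:ai or oc.

```shell
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin and
# prints every Pod that is not fully ready or not in the Running state.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")                      # READY column, e.g. "0/1"
    if (ready[1] != ready[2] || $3 != "Running")
      print $1, $2, $3                         # name, ready count, status
  }'
}
```

Usage: oc get pods -n runai --no-headers | unhealthy_pods
An empty result means every controller Pod reports Running with all containers ready.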

Controller Responsibilities

Controller                | Purpose                                     | Failure Impact
workload-controller       | Reconciles Run:ai workloads → K8s Pods/Jobs | Jobs won't start/stop
workload-overseer         | Monitors workload health, enforces policies | No preemption, no fairness
workload-exporter         | Exports workload metrics to Prometheus      | Missing dashboard data
status-updater            | Syncs workload status to Run:ai backend     | UI shows stale status
shared-objects-controller | Manages shared ConfigMaps, Secrets, PVCs    | Shared resources unavailable
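
The table can be turned into a small lookup so that a symptom leads straight to the right logs. This is only a sketch: the symptom keywords and the function name controller_for_symptom are made up for illustration, and the mapping simply mirrors the Failure Impact column.

```shell
# Hypothetical lookup: map an observed symptom to the controller to inspect.
# The symptom keywords are illustrative assumptions, not Run:ai terminology.
controller_for_symptom() {
  case "$1" in
    jobs-not-starting)   echo workload-controller ;;
    no-preemption)       echo workload-overseer ;;
    missing-metrics)     echo workload-exporter ;;
    stale-status)        echo status-updater ;;
    shared-resources)    echo shared-objects-controller ;;
    *) echo "unknown symptom: $1" >&2; return 1 ;;
  esac
}
```

Usage: oc logs -n runai "deploy/$(controller_for_symptom stale-status)" --tail=50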

Per-Node DaemonSets

# These run on every GPU node
oc get ds -n runai

# NAME                       DESIRED   CURRENT   READY
# runai-node-exporter        8         8         8      # GPU metrics
# runai-runtime-installer    8         8         8      # Container runtime hooks
# runai-container-toolkit    8         8         8      # GPU toolkit injection
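
Since a DaemonSet that is not fully ready means missing coverage on some GPU node, a quick comparison of DESIRED and READY helps. The sketch below assumes the standard oc get ds column order (NAME, DESIRED, CURRENT, READY, ...); daemonsets_not_ready is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get ds --no-headers` output on stdin and
# reports DaemonSets whose READY count (column 4) differs from DESIRED (column 2).
daemonsets_not_ready() {
  awk '$2 != $4 { printf "%s: %s/%s ready\n", $1, $4, $2 }'
}
```

Usage: oc get ds -n runai --no-headers | daemonsets_not_ready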

Workload Controller Deep Dive

workload-controller watches:
├── RunaiJob CRD             → Creates K8s Jobs/Pods
├── RunaiTrainingWorkload    → Multi-node training setup
├── RunaiInferenceWorkload   → Deployment with GPU scheduling
└── RunaiInteractiveWorkload → Notebook/IDE Pods

Reconciliation loop:
1. User submits workload via UI/CLI
2. Run:ai backend creates CRD in cluster
3. workload-controller detects new CRD
4. Creates K8s resources (Pod, Service, PVC)
5. GPU scheduler places Pod on best node
6. status-updater reports back to backend
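
The loop above suggests a rough way to guess where a workload is stuck from the Pod phase alone. Treat this as a heuristic sketch rather than Run:ai behavior: stuck_stage is a hypothetical helper, and a Pending Pod can also be waiting on image pulls or volumes rather than on the scheduler.

```shell
# Hypothetical heuristic: given a Pod phase (empty if no Pod exists yet),
# point at the step of the reconciliation loop that most likely needs attention.
stuck_stage() {
  case "$1" in
    "")      echo "no Pod yet: check workload-controller (steps 3-4)" ;;
    Pending) echo "Pod not placed: check the GPU scheduler (step 5)" ;;
    Running|Succeeded|Failed)
             echo "Pod exists: if the UI disagrees, check status-updater (step 6)" ;;
    *)       echo "unrecognised phase: $1" ;;
  esac
}
```

Usage: stuck_stage "$(oc get pod <pod-name> -n <project> -o jsonpath='{.status.phase}' 2>/dev/null)"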

Workload Exporter Metrics

# Metrics exported by workload-exporter:
runai_workload_status           # Current state (pending/running/completed/failed)
runai_workload_gpu_allocation   # GPUs allocated per workload
runai_workload_runtime_seconds  # Total runtime
runai_workload_queue_time       # Time spent waiting for resources
runai_workload_preemptions      # Number of preemptions
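
To confirm the exporter is actually serving these series, a scrape can be checked against the list. This is a sketch under assumptions: it expects Prometheus text exposition format on stdin, the metric names are taken from the list above, and missing_workload_metrics is a hypothetical helper. How you reach the exporter's metrics endpoint (port, path) depends on your installation and is not shown in this recipe.

```shell
# Metric names as listed in this recipe.
EXPECTED_METRICS="runai_workload_status runai_workload_gpu_allocation runai_workload_runtime_seconds runai_workload_queue_time runai_workload_preemptions"

# Hypothetical helper: reads a Prometheus text scrape on stdin and prints
# each expected metric name that has no sample line in it.
missing_workload_metrics() {
  scrape=$(cat)
  for metric in $EXPECTED_METRICS; do
    # A sample line starts with the metric name followed by "{" or a space.
    printf '%s\n' "$scrape" | grep "^${metric}[ {]" >/dev/null || echo "$metric"
  done
}
```

Usage: curl -s http://<exporter-host>:<metrics-port>/metrics | missing_workload_metrics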

Troubleshooting Controllers

# Check controller logs
oc logs -n runai deploy/workload-controller --tail=50
oc logs -n runai deploy/workload-overseer --tail=50
oc logs -n runai deploy/status-updater --tail=50

# Check if controllers are leader-elected
oc get lease -n runai

# Restart a specific controller
oc rollout restart deploy/workload-controller -n runai

# Check workload CRDs
oc get runaiworkloads -A
oc get runaijobs -A
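
When it is unclear which controller is misbehaving, the log checks above can be run across all five Deployments at once. This is a minimal sketch; scan_controller_logs is a hypothetical helper, and matching the word "error" case-insensitively is an assumption about the log format.

```shell
# The five controller Deployments listed in this recipe.
CONTROLLERS="workload-controller workload-overseer workload-exporter status-updater shared-objects-controller"

# Hypothetical helper: count recent log lines containing "error" per controller.
scan_controller_logs() {
  for c in $CONTROLLERS; do
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    n=$(oc logs -n runai "deploy/$c" --tail=200 2>/dev/null | grep -ci 'error' || true)
    echo "$c: ${n:-0}"
  done
}
```

Usage: scan_controller_logs
A controller with a count far above the others is a reasonable place to start reading logs in full.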

Workload Exporter Restart Count

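# workload-exporter shows "2 (23h)" in the RESTARTS column
# oc prints this as the restart count plus the time since the most recent
# restart, so read it as 2 restarts, the latest about 23 hours ago. Likely causes:
# - One OOMKill during metric spike
# - One restart during node maintenance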

# Check restart reason:
oc get pod -n runai -l app=workload-exporter -o jsonpath='{.items[0].status.containerStatuses[0].lastState}'
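
The lastState output includes an exit code under terminated.exitCode, which helps tell the two suspected causes apart. The sketch below relies on the usual convention that exit codes above 128 mean "terminated by signal (code minus 128)"; explain_exit_code is a hypothetical helper.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL (128+9), commonly an OOM kill" ;;
    143) echo "SIGTERM (128+15), commonly a graceful shutdown such as a node drain" ;;
    *)   if [ "$code" -gt 128 ]; then
           echo "terminated by signal $((code - 128))"
         else
           echo "application error, exit code $code"
         fi ;;
  esac
}
```

Usage: explain_exit_code "$(oc get pod -n runai -l app=workload-exporter -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.exitCode}')"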

Integration with Run:ai Backend

runai namespace (cluster agents)        runai-backend namespace (control plane)
┌─────────────────────────────┐         ┌───────────────────────────────────┐
│ workload-controller         │──NATS──▶│ cluster-service                   │
│ status-updater              │──NATS──▶│ workloads-service                 │
│ workload-exporter           │──Prom──▶│ metrics-service → thanos-receive  │
│ workload-overseer           │──NATS──▶│ policy-service                    │
│ shared-objects-controller   │         │                                   │
└─────────────────────────────┘         └───────────────────────────────────┘
         │                                           │
         ▼                                           ▼
    GPU Nodes (DaemonSets)                     PostgreSQL + NATS
    - node-exporter                            (persistent state)
    - runtime-installer
    - container-toolkit

Common Issues

Workloads stuck in "Pending" forever

  • Cause: workload-controller can't create Pods (RBAC, quota, or crash)
  • Fix: Check controller logs; verify ClusterRole bindings

Dashboard shows "Unknown" status

  • Cause: status-updater can't reach Run:ai backend (NATS down)
  • Fix: Check NATS cluster health; verify network policies

GPU metrics missing from Grafana

  • Cause: workload-exporter crashing or node-exporter DaemonSet not ready
  • Fix: Check exporter Pod restarts; verify ServiceMonitor exists

Preemption not working

  • Cause: workload-overseer not running or policy-service unreachable
  • Fix: Check overseer logs; verify NATS connectivity to backend
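
For the RBAC case in the first issue, oc auth can-i with impersonation gives a direct answer on whether a service account may create the resources the controller needs. This is a sketch: check_controller_rbac is a hypothetical helper, the resource list is taken from the reconciliation steps in this recipe, and the service account name is a placeholder you must look up in your own cluster (for example with oc get sa -n runai). Impersonation also requires that your own user is allowed to use --as.

```shell
# Hypothetical helper: ask the API server whether a service account in the
# runai namespace may create workload resources in a target project.
#   $1 = service account name (placeholder, installation-specific)
#   $2 = target namespace/project
check_controller_rbac() {
  sa=$1
  ns=$2
  for res in pods jobs services persistentvolumeclaims; do
    # can-i prints "yes" or "no" and exits non-zero on "no", hence "|| true".
    ans=$(oc auth can-i create "$res" -n "$ns" --as="system:serviceaccount:runai:$sa" 2>/dev/null || true)
    echo "$res: ${ans:-unknown}"
  done
}
```

Usage: check_controller_rbac <service-account> <project>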

Best Practices

  1. Monitor controller restarts: more than 5/day indicates resource issues
  2. Check NATS connectivity: all controllers depend on NATS for backend comms
  3. DaemonSets must be 100% ready: missing node-exporter = missing GPU metrics
  4. Don't scale controllers: they use leader election (only 1 active)
  5. Log level info is sufficient: debug level causes excessive NATS traffic
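
The restart guideline in item 1 can be checked with a small filter. This is a sketch with one caveat: restartCount is cumulative over the Pod's lifetime, so compare it against the Pod's age before applying a per-day threshold. flag_restarts is a hypothetical helper, and the default of 5 comes from the guideline above.

```shell
# Hypothetical helper: reads "pod-name restart-count" lines on stdin and
# prints the Pods whose count exceeds the threshold (default 5).
flag_restarts() {
  awk -v max="${1:-5}" '$2 + 0 > max + 0 { print $1 " restarted " $2 " times" }'
}
```

Usage: oc get pods -n runai -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | flag_restarts 5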

Key Takeaways

  • 5 controllers in runai namespace manage the full workload lifecycle
  • Communication to backend is via NATS (events, status) and Prometheus (metrics)
  • Per-node DaemonSets (node-exporter, runtime-installer, container-toolkit) run on every GPU node
  • workload-controller is the most critical: without it, no Pods get created
  • Restart counts of 1-2 over 23h are normal; 100+ indicates OOM or crash loop
  • All controllers are stateless: a restart fixes most transient issues
