Run:ai Platform Backend Components
Overview of Run:ai backend StatefulSets and components on OpenShift: Thanos receive/query, Keycloak, NATS, Redis, PostgreSQL, workload controllers, and their
💡 Quick Answer: Run:ai backend deploys ~15 StatefulSets/Deployments in the runai-backend namespace, covering GPU scheduling, metrics (Thanos), auth (Keycloak), messaging (NATS), storage (PostgreSQL/Redis), and workload management. It is managed via ArgoCD with Helm values in a GitOps repo.
The Problem
Understanding Run:ai backend components helps you:
- Right-size infrastructure nodes for the control plane
- Troubleshoot component failures (which piece is broken?)
- Plan for high availability
- Manage GitOps values for each component independently
The Solution
Backend Component Inventory
Namespace: runai-backend

StatefulSets:
├── runai-backend-thanos-receive        (metrics ingestion from GPU nodes)
├── runai-backend-thanos-query          (PromQL queries for dashboards)
├── runai-backend-postgresql            (workload metadata, projects, quotas)
├── runai-backend-redis                 (session cache, job queuing)
├── runai-backend-nats                  (event bus between components)
└── runai-backend-keycloak              (SSO / authentication)

Deployments:
├── runai-backend-workload-controller   (job scheduling logic)
├── runai-backend-api-server            (REST API for CLI/UI)
├── runai-backend-frontend              (React dashboard UI)
├── runai-backend-grafana               (GPU metrics dashboards)
├── runai-backend-traefik               (ingress/routing)
└── runai-backend-redoc                 (API documentation)

Helm Values Structure (GitOps)
# values.yaml for runai-backend chart
keycloak:
  tolerations: *tolerations

grafana:
  db:
    existingSecret: grafana-db-secret
    userKey: username
    passwordKey: password
  tolerations: *tolerations
  adminUser: admin
  adminPassword: admin  # Override in production!
  dbScheme: backend

traefik:
  tolerations: *tolerations

thanos:
  tolerations: *tolerations
  query:
    tolerations: *tolerations
  receive:
    tolerations: *tolerations
    resources:
      limits:
        cpu: 800m
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi

nats:
  tolerations: *tolerations

redoc:
  tolerations: *tolerations

workloads:
  tolerations: *tolerations

Component Dependencies
User → Traefik (ingress) → Frontend (UI)
                         → API Server → PostgreSQL (metadata)
                                      → Redis (cache)
                                      → NATS (events)
                                      → Keycloak (auth)
                         → Grafana → Thanos Query → Thanos Receive
                                                          ↑
GPU Nodes → DCGM Exporter → Prometheus → Remote Write → Thanos Receive

Tolerations Pattern (Anchor/Alias)
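# Define once at top of values:
tolerations: &tolerations
  - key: "node-role.kubernetes.io/infra"
    operator: "Exists"
    effect: "NoSchedule"

# Reference everywhere via the *tolerations alias.
# This lets all Run:ai backend Pods schedule onto tainted infra nodes.
# A toleration alone does not pin Pods there; pair it with a nodeSelector
# or node affinity if they must stay off other nodes.

Resource Requirements Summary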
Component            CPU Req    Mem Req    CPU Lim    Mem Lim    Replicas
──────────────────────────────────────────────────────────────────────────
thanos-receive       500m       2Gi        800m       4Gi        1
thanos-query         200m       512Mi      500m       1Gi        1
postgresql           250m       512Mi      1000m      2Gi        1
redis                100m       128Mi      500m       512Mi      1
nats                 100m       128Mi      500m       512Mi      1 (or 3)
keycloak             500m       1Gi        1000m      2Gi        1
api-server           200m       512Mi      500m       1Gi        2
workload-controller  200m       512Mi      500m       1Gi        2
frontend             50m        128Mi      200m       256Mi      2
grafana              100m       256Mi      500m       1Gi        1
traefik              100m       128Mi      500m       512Mi      2
──────────────────────────────────────────────────────────────────────────
TOTAL (approx)       ~3 cores   ~8Gi       ~8 cores   ~16Gi

Health Check All Components
# Quick status check
oc get pods -n runai-backend -o wide

# Check StatefulSets
oc get sts -n runai-backend

# Check which components are unhealthy
oc get pods -n runai-backend --field-selector=status.phase!=Running

# Thanos Receive specifically
oc logs -n runai-backend runai-backend-thanos-receive-0 --tail=20

# Keycloak (auth issues)
oc logs -n runai-backend -l app=keycloak --tail=20

# API server
oc logs -n runai-backend -l app=runai-api-server --tail=20

ArgoCD Application Structure
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: runai-backend
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/gitops/runai.git
    path: config/runai/backend
    targetRevision: main
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: runai-backend
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Common Issues
Thanos Receive CrashLoopBackOff
- Cause: Memory limit too low for WAL replay
- Fix: Increase memory in GitOps values; see dedicated troubleshooting recipe
Keycloak login fails
- Cause: PostgreSQL connection lost or secret rotation
- Fix: Check that grafana-db-secret exists; verify the PostgreSQL Pod is running
Grafana shows no GPU metrics
- Cause: Thanos Receive down → no metrics ingested
- Fix: Fix Thanos Receive first; historical data may have gaps
NATS message backlog
- Cause: Consumer (workload-controller) overwhelmed or crashed
- Fix: Check workload-controller logs; restart if stuck
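The issues above can be turned into a small triage helper. The following is a minimal sketch, not part of Run:ai: the function names `suggest_check` and `triage` are hypothetical, and the hints simply restate the fixes listed above. It assumes Pod names contain the component names from the inventory and that input follows the default `oc get pods --no-headers` column order (NAME, READY, STATUS, ...).

```shell
# Map a Pod name to the first check suggested in the Common Issues list.
# (Hypothetical helper; the hints only restate the fixes from this article.)
suggest_check() {
  case "$1" in
    *thanos-receive*)      echo "raise the memory limit in GitOps values (WAL replay); Grafana shows no GPU metrics until fixed" ;;
    *keycloak*)            echo "check that grafana-db-secret exists and the PostgreSQL Pod is running" ;;
    *postgresql*)          echo "Keycloak logins depend on this Pod; restore it first" ;;
    *nats*)                echo "check workload-controller logs for a stuck consumer" ;;
    *workload-controller*) echo "inspect logs and restart if stuck; NATS may be backlogged" ;;
    *)                     echo "no recipe listed; inspect the Pod logs" ;;
  esac
}

# Read `oc get pods --no-headers` output on stdin and print one hint
# for every Pod whose STATUS column is not Running or Completed.
triage() {
  while read -r name ready status rest; do
    [ -z "$name" ] && continue
    case "$status" in
      Running|Completed) continue ;;
    esac
    echo "$name ($status): $(suggest_check "$name")"
  done
}
```

A typical invocation would be `oc get pods -n runai-backend --no-headers | triage`. Note that a Pod reported as Running but not ready (for example `0/1 Running`) is not flagged by this sketch.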
Best Practices
- Schedule all backend Pods onto infra nodes via tolerations (add a nodeSelector or node affinity to actually pin them) and keep GPU nodes clean
- Use YAML anchors (&tolerations / *tolerations) to avoid repetition
- Never store passwords in values.yaml; use existingSecret references
- Size Thanos Receive for your metrics volume: 4Gi minimum for production
- Enable ArgoCD selfHeal, which auto-reverts manual drift
- Monitor the monitoring: alert on Thanos Receive restarts
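The "never store passwords in values.yaml" rule can be checked mechanically before a commit. Here is a rough sketch, assuming that any key whose name ends in `password` (case-insensitive) and carries a literal value is suspect. The function name `find_plaintext_passwords` and the key-name pattern are illustrative assumptions, not a Run:ai or Helm convention.

```shell
# Print (with line numbers) every line of a values file that assigns a
# literal value to a key ending in "password", e.g. "adminPassword: admin".
# Keys such as "passwordKey" only name a field inside a Secret, so they
# are not matched. Exit status is 1 when nothing is found (grep semantics).
find_plaintext_passwords() {
  grep -nEi '^[[:space:]]*[A-Za-z0-9_]*password[[:space:]]*:[[:space:]]*[^[:space:]#]' "$1"
}
```

For example, `find_plaintext_passwords values.yaml` run against the values shown earlier would report the `adminPassword: admin` line while leaving `passwordKey: password` alone. A pre-commit hook or CI step could fail the pipeline whenever the function prints anything.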
Key Takeaways
- Run:ai backend has ~15 components totaling ~3 cores / 8Gi RAM minimum
- Thanos Receive is the most resource-hungry and crash-prone component
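- All components share one tolerations block so they can run on tainted infra nodes (tolerations permit scheduling there; a nodeSelector or affinity is what pins them)
- ArgoCD manages lifecycle: all changes must go through Git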
- Grafana connects to Thanos Query (not directly to Prometheus)
- Component failure impact: Thanos down = no dashboards; Keycloak down = no login; API down = no job submission
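To sanity-check the sizing figures, the per-component numbers from the Resource Requirements Summary can be summed across replicas. This is a minimal sketch with hypothetical helper names. It hard-codes the table rows, counts NATS as a single replica, and only understands the `m`, `Mi`, and `Gi` suffixes used in that table.

```shell
# Convert a CPU quantity ("500m" or "2") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Convert a memory quantity ("512Mi" or "2Gi") to MiB.
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Rows: name cpu_req mem_req cpu_lim mem_lim replicas (copied from the table).
backend_table() {
  cat <<'EOF'
thanos-receive 500m 2Gi 800m 4Gi 1
thanos-query 200m 512Mi 500m 1Gi 1
postgresql 250m 512Mi 1000m 2Gi 1
redis 100m 128Mi 500m 512Mi 1
nats 100m 128Mi 500m 512Mi 1
keycloak 500m 1Gi 1000m 2Gi 1
api-server 200m 512Mi 500m 1Gi 2
workload-controller 200m 512Mi 500m 1Gi 2
frontend 50m 128Mi 200m 256Mi 2
grafana 100m 256Mi 500m 1Gi 1
traefik 100m 128Mi 500m 512Mi 2
EOF
}

# Read rows on stdin and print totals, each value multiplied by replicas.
sum_resources() {
  t_cr=0; t_mr=0; t_cl=0; t_ml=0
  while read -r name cpu_req mem_req cpu_lim mem_lim replicas; do
    [ -z "$name" ] && continue
    t_cr=$(( t_cr + $(cpu_to_millicores "$cpu_req") * replicas ))
    t_mr=$(( t_mr + $(mem_to_mib "$mem_req") * replicas ))
    t_cl=$(( t_cl + $(cpu_to_millicores "$cpu_lim") * replicas ))
    t_ml=$(( t_ml + $(mem_to_mib "$mem_lim") * replicas ))
  done
  echo "cpu_req=${t_cr}m mem_req=${t_mr}Mi cpu_lim=${t_cl}m mem_lim=${t_ml}Mi"
}
```

Running `backend_table | sum_resources` prints `cpu_req=2850m mem_req=7168Mi cpu_lim=8200m mem_lim=16896Mi`, which is roughly 2.9 cores and 7Gi requested, with limits of about 8.2 cores and 16.5Gi. Running NATS with three replicas or changing any value in GitOps would shift these totals accordingly.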

