ai intermediate ⏱ 15 minutes K8s 1.28+

Run:ai Platform Backend Components

Overview of Run:ai backend StatefulSets and components on OpenShift: Thanos receive/query, Keycloak, NATS, Redis, PostgreSQL, workload controllers, and their

By Luca Berton • 📖 5 min read

💡 Quick Answer: The Run:ai backend deploys ~15 StatefulSets and Deployments in the runai-backend namespace, covering GPU scheduling, metrics (Thanos), auth (Keycloak), messaging (NATS), storage (PostgreSQL/Redis), and workload management. It is managed via ArgoCD, with Helm values kept in a GitOps repo.

The Problem

Understanding Run:ai backend components helps you:

  • Right-size infrastructure nodes for the control plane
  • Troubleshoot component failures (which piece is broken?)
  • Plan for high availability
  • Manage GitOps values for each component independently

The Solution

Backend Component Inventory

Namespace: runai-backend

StatefulSets:
├── runai-backend-thanos-receive     (metrics ingestion from GPU nodes)
├── runai-backend-thanos-query       (PromQL queries for dashboards)
├── runai-backend-postgresql         (workload metadata, projects, quotas)
├── runai-backend-redis              (session cache, job queuing)
├── runai-backend-nats               (event bus between components)
└── runai-backend-keycloak           (SSO / authentication)

Deployments:
├── runai-backend-workload-controller  (job scheduling logic)
├── runai-backend-api-server           (REST API for CLI/UI)
├── runai-backend-frontend             (React dashboard UI)
├── runai-backend-grafana              (GPU metrics dashboards)
├── runai-backend-traefik              (ingress/routing)
└── runai-backend-redoc                (API documentation)
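
To confirm that a cluster matches this inventory, the listed names can be compared against what oc reports. The sketch below is a minimal helper, assuming your chart version uses exactly the StatefulSet names shown above; missing_components and EXPECTED_STS are hypothetical names chosen for illustration, not part of any Run:ai tooling.

```shell
# Expected StatefulSet names, copied from the inventory above.
# Assumption: your chart version uses these exact names.
EXPECTED_STS="runai-backend-thanos-receive runai-backend-thanos-query runai-backend-postgresql runai-backend-redis runai-backend-nats runai-backend-keycloak"

# missing_components EXPECTED ACTUAL
# Both arguments are space-separated name lists.
# Prints every expected name that does not appear in ACTUAL, one per line.
missing_components() {
  for name in $1; do
    case " $2 " in
      *" $name "*) ;;                 # present, nothing to report
      *) printf '%s\n' "$name" ;;     # absent from the cluster listing
    esac
  done
}

# Against a live cluster (requires oc and access to the namespace):
#   actual=$(oc get sts -n runai-backend -o jsonpath='{.items[*].metadata.name}')
#   missing_components "$EXPECTED_STS" "$actual"
```

An empty result means every expected StatefulSet exists; the same function works for the Deployments list if you pass their names instead.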

Helm Values Structure (GitOps)

# values.yaml for runai-backend chart
# *tolerations is a YAML alias; its &tolerations anchor must be defined earlier in the same file
keycloak:
  tolerations: *tolerations

grafana:
  db:
    existingSecret: grafana-db-secret
    userKey: username
    passwordKey: password
  tolerations: *tolerations
  adminUser: admin
  adminPassword: admin        # Override in production!
  dbScheme: backend

traefik:
  tolerations: *tolerations

thanos:
  tolerations: *tolerations
  query:
    tolerations: *tolerations
  receive:
    tolerations: *tolerations
    resources:
      limits:
        cpu: 800m
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi

nats:
  tolerations: *tolerations

redoc:
  tolerations: *tolerations

workloads:
  tolerations: *tolerations

Component Dependencies

User → Traefik (ingress) → Frontend (UI)
                         → API Server → PostgreSQL (metadata)
                                      → Redis (cache)
                                      → NATS (events)
                         → Keycloak (auth)
                         → Grafana → Thanos Query → Thanos Receive
                                                        ↑
GPU Nodes → DCGM Exporter → Prometheus → Remote Write → Thanos Receive

Tolerations Pattern (Anchor/Alias)

# Define once at top of values:
tolerations: &tolerations
  - key: "node-role.kubernetes.io/infra"
    operator: "Exists"
    effect: "NoSchedule"

# Reference everywhere via the *tolerations alias
# Tolerations only allow Run:ai backend Pods onto tainted infra nodes;
# add a nodeSelector or node affinity to actually pin them there
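
A toleration by itself lets a Pod land on a tainted node but does not prevent it from landing elsewhere, so it is worth checking where the backend Pods actually run. The following is a minimal sketch, assuming infra nodes also carry the node-role.kubernetes.io/infra label (the source only shows that key as a taint); pods_off_infra is a hypothetical helper name.

```shell
# pods_off_infra INFRA_NODES
# INFRA_NODES is a space-separated list of node names.
# Reads "pod node" pairs on stdin and prints the pairs whose node
# is not in INFRA_NODES (unscheduled Pods show up with node "<none>").
pods_off_infra() {
  infra_nodes="$1"
  while read -r pod node; do
    [ -n "$pod" ] || continue          # skip blank lines
    case " $infra_nodes " in
      *" $node "*) ;;                  # running on an infra node
      *) printf '%s %s\n' "$pod" "$node" ;;
    esac
  done
}

# Against a live cluster (assumes the infra label exists on your nodes):
#   infra=$(oc get nodes -l node-role.kubernetes.io/infra \
#     -o jsonpath='{.items[*].metadata.name}')
#   oc get pods -n runai-backend --no-headers \
#     -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName \
#     | pods_off_infra "$infra"
```

Any output points to a Pod that tolerates the infra taint but was scheduled onto another node, which usually means a nodeSelector or affinity rule is missing.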

Resource Requirements Summary

Component                  CPU Req   Mem Req   CPU Lim   Mem Lim   Replicas
─────────────────────────────────────────────────────────────────────────────
thanos-receive             500m      2Gi       800m      4Gi       1
thanos-query               200m      512Mi     500m      1Gi       1
postgresql                 250m      512Mi     1000m     2Gi       1
redis                      100m      128Mi     500m      512Mi     1
nats                       100m      128Mi     500m      512Mi     1 (or 3)
keycloak                   500m      1Gi       1000m     2Gi       1
api-server                 200m      512Mi     500m      1Gi       2
workload-controller        200m      512Mi     500m      1Gi       2
frontend                   50m       128Mi     200m      256Mi     2
grafana                    100m      256Mi     500m      1Gi       1
traefik                    100m      128Mi     500m      512Mi     2
─────────────────────────────────────────────────────────────────────────────
TOTAL (approx)             ~3 cores  ~8Gi      ~8 cores  ~16Gi
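
The totals can be re-derived from the rows when you change replica counts or limits. This is a minimal sketch, assuming each row's figures are per replica and that nats runs a single replica; sum_resources is a hypothetical helper, and the rows are copied from the table above.

```shell
# sum_resources
# Reads rows of "name cpu_req mem_req cpu_lim mem_lim replicas" on stdin.
# Multiplies each figure by the replica count and prints the totals as
# "<cpu_req>m <mem_req>Mi <cpu_lim>m <mem_lim>Mi".
sum_resources() {
  awk '
    # Convert a CPU quantity ("500m" or "2") to millicores.
    function cpu(q) { return (q ~ /m$/) ? (substr(q, 1, length(q) - 1) + 0) : (q * 1000) }
    # Convert a memory quantity ("512Mi" or "2Gi") to MiB.
    function mem(q) { return (q ~ /Gi$/) ? (substr(q, 1, length(q) - 2) * 1024) : (substr(q, 1, length(q) - 2) + 0) }
    NF >= 6 {
      n = $6 + 0
      cpu_req += cpu($2) * n; mem_req += mem($3) * n
      cpu_lim += cpu($4) * n; mem_lim += mem($5) * n
    }
    END { printf "%dm %dMi %dm %dMi\n", cpu_req, mem_req, cpu_lim, mem_lim }
  '
}

sum_resources <<'EOF'
thanos-receive 500m 2Gi 800m 4Gi 1
thanos-query 200m 512Mi 500m 1Gi 1
postgresql 250m 512Mi 1000m 2Gi 1
redis 100m 128Mi 500m 512Mi 1
nats 100m 128Mi 500m 512Mi 1
keycloak 500m 1Gi 1000m 2Gi 1
api-server 200m 512Mi 500m 1Gi 2
workload-controller 200m 512Mi 500m 1Gi 2
frontend 50m 128Mi 200m 256Mi 2
grafana 100m 256Mi 500m 1Gi 1
traefik 100m 128Mi 500m 512Mi 2
EOF
```

Change the last column for nats to 3 to see what the clustered setup costs, and compare the result with the approximate totals row.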

Health Check All Components

# Quick status check
oc get pods -n runai-backend -o wide

# Check StatefulSets
oc get sts -n runai-backend

# List Pods that are not in the Running phase
# (crash-looping Pods can still report Running, so also check READY and RESTARTS)
oc get pods -n runai-backend --field-selector=status.phase!=Running

# Thanos Receive specifically
oc logs -n runai-backend runai-backend-thanos-receive-0 --tail=20

# Keycloak (auth issues)
oc logs -n runai-backend -l app=keycloak --tail=20

# API server
oc logs -n runai-backend -l app=runai-api-server --tail=20
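
Because a phase filter can miss Pods that are Running but not Ready, a small filter over the plain Pod listing is sometimes more useful. The sketch below assumes the default column layout of oc get pods (NAME, READY, STATUS, RESTARTS, AGE); unhealthy_pods is a hypothetical helper name.

```shell
# unhealthy_pods
# Reads `oc get pods --no-headers` output on stdin.
# Prints NAME, READY and STATUS for every Pod that is not fully ready
# or not in Running status, ignoring Pods that completed normally.
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }         # finished Jobs are not failures
    {
      split($2, ready, "/")            # READY column looks like "1/2"
      if (ready[1] != ready[2] || $3 != "Running") print $1, $2, $3
    }
  '
}

# Against a live cluster:
#   oc get pods -n runai-backend --no-headers | unhealthy_pods
```

No output means every Pod in the namespace is Running with all containers ready.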

ArgoCD Application Structure

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: runai-backend
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/gitops/runai.git
    path: config/runai/backend
    targetRevision: main
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: runai-backend
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
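
With automated sync enabled, it helps to verify that the Application has actually converged before trusting the cluster state. The following is a minimal sketch that reads the sync and health status from the Application resource; argo_app_ok is a hypothetical helper name, and the Application name and namespace are taken from the manifest above.

```shell
# argo_app_ok SYNC HEALTH
# Returns 0 only when the Application is both Synced and Healthy.
argo_app_ok() {
  [ "$1" = "Synced" ] && [ "$2" = "Healthy" ]
}

# Against a live cluster:
#   status=$(oc get applications.argoproj.io runai-backend -n argocd \
#     -o jsonpath='{.status.sync.status} {.status.health.status}')
#   # $status is left unquoted on purpose so it splits into two arguments
#   argo_app_ok $status || echo "runai-backend is: $status"
```

A result such as OutOfSync or Degraded is the cue to look at the Application's events before touching any resource by hand, since selfHeal will revert manual changes.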

Common Issues

Thanos Receive CrashLoopBackOff

  • Cause: Memory limit too low for WAL replay
  • Fix: Increase memory in GitOps values; see dedicated troubleshooting recipe
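
Before raising the limit, it is worth confirming that the last restart really was a memory kill. This is a minimal sketch, assuming the Thanos Receive container is the first container in the Pod; last_termination_hint is a hypothetical helper, and the values path in its message comes from the Helm values shown earlier.

```shell
# last_termination_hint REASON EXIT_CODE
# Maps the previous container termination reason to a next step.
last_termination_hint() {
  case "$1" in
    OOMKilled) echo "memory limit hit: raise thanos.receive.resources.limits.memory in Git" ;;
    Error)     echo "process exited with code $2: read the previous container logs" ;;
    "")        echo "no previous termination recorded" ;;
    *)         echo "terminated with reason $1 (exit code $2)" ;;
  esac
}

# Against a live cluster:
#   pod=runai-backend-thanos-receive-0
#   reason=$(oc get pod "$pod" -n runai-backend \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
#   code=$(oc get pod "$pod" -n runai-backend \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   last_termination_hint "$reason" "$code"
#   oc logs "$pod" -n runai-backend --previous --tail=50
```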

Keycloak login fails

  • Cause: PostgreSQL connection lost or secret rotation
  • Fix: Check that the database secret Keycloak uses exists and is current (the grafana-db-secret shown earlier belongs to Grafana); verify the PostgreSQL Pod is running

Grafana shows no GPU metrics

  • Cause: Thanos Receive is down, so no metrics are ingested
  • Fix: Fix Thanos Receive first; historical data may have gaps

NATS message backlog

  • Cause: Consumer (workload-controller) overwhelmed or crashed
  • Fix: Check workload-controller logs; restart if stuck

Best Practices

  1. Schedule all backend Pods on infra nodes to keep GPU nodes clean: tolerations admit them to tainted infra nodes, and a nodeSelector or node affinity pins them there
  2. Use YAML anchors (&tolerations / *tolerations) to avoid repetition
  3. Never store passwords in values.yaml; use existingSecret references
  4. Size Thanos Receive for your metrics volume, with 4Gi of memory as the minimum for production
  5. Enable ArgoCD selfHeal so manual drift is reverted automatically
  6. Monitor the monitoring: alert on Thanos Receive restarts

Key Takeaways

  • Run:ai backend has ~15 components totaling ~3 cores / 8Gi RAM minimum
  • Thanos Receive is the most resource-hungry and crash-prone component
  • All components share one tolerations block so they can run on tainted infra nodes
  • ArgoCD manages the lifecycle, so all changes must go through Git
  • Grafana connects to Thanos Query (not directly to Prometheus)
  • Component failure impact: Thanos down = no dashboards; Keycloak down = no login; API down = no job submission
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
