OpenShift GPU Node Resource Planning
Plan CPU, memory, and overhead budgets for GPU nodes running NVIDIA GPU Operator, Network Operator, Run:ai, and OpenShift infrastructure Pods. Understand what
💡 Quick Answer: A typical OpenShift GPU node runs 40+ infrastructure Pods before any AI workload starts. These consume ~4-8 GB RAM and ~2-4 CPU cores of overhead. Plan node sizing accordingly: a 192-core / 1.5TB RAM node may only have ~180 cores and ~1.4TB available for training.
The Problem
GPU nodes aren't just GPUs. Each node runs a stack of infrastructure:
- NVIDIA GPU Operator (5 Pods)
- NVIDIA Network Operator (4-5 Pods)
- OpenShift platform (15+ Pods)
- Run:ai scheduler and exporters (5+ Pods)
- Monitoring and networking (10+ Pods)
Understanding this overhead is critical for capacity planning.
The Solution
Typical GPU Node Pod Inventory
NAMESPACE                          POD                                   CPU REQ     MEM REQ     MEM LIMIT
──────────────────────────────────────────────────────────────────────────────────────────────────────────
# NVIDIA GPU Operator (5 Pods)
nvidia-gpu-operator                nvidia-device-plugin-daemonset        0           0           0
nvidia-gpu-operator                nvidia-driver-daemonset               0           0           0
nvidia-gpu-operator                nvidia-mig-manager                    0           0           0
nvidia-gpu-operator                nvidia-node-status-exporter           0           0           0
nvidia-gpu-operator                nvidia-operator-validator             0           0           0
# NVIDIA Network Operator (4 Pods)
nvidia-network-operator            mofed-rhel9.6-ds                      0           0           0
nvidia-network-operator            network-operator-sriov-device-plugin  0           0           0
nvidia-network-operator            nic-feature-discovery-ds              0           0           0
nvidia-network-operator            nv-ipam-node                          100m        300m        150Mi
# OpenShift Cluster Node Tuning
openshift-cluster-node-tuning      tuned                                 10m         0           0
# OpenShift DNS (2 Pods)
openshift-dns                      dns-default                           0           0           0
openshift-dns                      node-resolver                         60m         0           110Mi
# OpenShift Image Registry
openshift-image-registry           node-ca                               5m          0           0
# OpenShift Ingress
openshift-ingress-canary           ingress-canary                        10m         0           0
# OpenShift Insights
openshift-insights                 insights-runtime-extractor            10m         0           0
# OpenShift KNI Infra (HA/Networking)
openshift-kni-infra                coredns-node                          0           0           0
openshift-kni-infra                keepalived-node                       30m (0%)    0           0
# OpenShift Kube Storage Version Migrator
openshift-kube-storage-version     migrator                              20m (0%)    0           0
# OpenShift Machine Config
openshift-machine-config-operator  kube-rbac-proxy-crio                  1m          0           0
openshift-machine-config-operator  machine-config-daemon                 20m         0           50Mi
# OpenShift Monitoring
openshift-monitoring               node-exporter                         4m (0%)     0           0
# OpenShift Multus (3 Pods)
openshift-multus                   multus-additional-cni-plugins         10m         0           0
openshift-multus                   multus-kube                           10m         0           0
openshift-multus                   network-metrics-daemon                20m (0%)    0           0
# OpenShift Network Diagnostics
openshift-network-diagnostics      network-check-target                  10m (0%)    0           120Mi
# OpenShift Network Operator
openshift-network-operator         iptables-alerter                      10m (0%)    10m (0%)    0
# OpenShift NFD
openshift-nfd                      nfd-worker                            10m         0           65Mi
# OpenShift NMState
openshift-nmstate                  nmstate-handler                       0 (0%)      0           0
# OpenShift OVN Kubernetes
openshift-ovn-kubernetes           ovnkube-node                          100m        500m        100Mi
# OpenShift SR-IOV (2 Pods)
openshift-sriov-network-operator   sriov-device-plugin                   80m (0%)    0           1634Mi
openshift-sriov-network-operator   sriov-network-config-daemon           10m         0           54Mi
# RHACS (Security)
rhacs-operator                     collector                             10m         0           0
# Run:ai Backend (6+ Pods)
runai-backend                      runai-backend-catalog-service         7m (0%)     275m (1%)   340Mi
runai-backend                      runai-backend-cluster-service         70m         200m        500m
runai-backend                      runai-backend-frontend                15m         0           500m
runai-backend                      runai-backend-metrics-service         25m         0           500m
runai-backend                      runai-backend-org-unit-service        25m         0           500m
runai-backend                      runai-container-toolkit               250m        500m (0%)   256Mi
# Run:ai (Node-Level)
runai                              runai-node-exporter                   0 (0%)      1500m (0%)  2Gi (0%)
runai                              runai-runtime-installer               10m         0           0
Resource Overhead Summary
Category                   CPU Requests   Memory Requests   Memory Limits
─────────────────────────────────────────────────────────────────────────
NVIDIA GPU Operator        ~0             ~0                ~0
NVIDIA Network Operator    ~200m          ~500Mi            ~300Mi
OpenShift Platform         ~400m          ~1Gi              ~2Gi
Run:ai                     ~400m          ~3Gi              ~5Gi
Monitoring/Networking      ~200m          ~500Mi            ~2Gi
─────────────────────────────────────────────────────────────────────────
TOTAL OVERHEAD             ~1.2 cores     ~5 Gi             ~9 Gi
Node Sizing Formula
Available for AI workloads = Node Total - System Reserved - Infra Overhead
Example: 192-core / 1.5 TiB node
System reserved (kubelet): 2 cores, 16 Gi
Infra Pod overhead: 1.2 cores, 5 Gi
Available for training: ~188 cores, ~1.48 TiB
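The arithmetic above can be reproduced with a short Python sketch. The `Resources` class and the `available_for_workloads` function are illustrative names invented for this example, not part of any OpenShift tooling; the inputs are the figures from the example node.

```python
from dataclasses import dataclass

GIB_PER_TIB = 1024


@dataclass(frozen=True)
class Resources:
    """CPU in cores and memory in GiB (illustrative container, not an OpenShift type)."""
    cores: float
    mem_gib: float

    def __sub__(self, other: "Resources") -> "Resources":
        return Resources(self.cores - other.cores, self.mem_gib - other.mem_gib)


def available_for_workloads(node_total, system_reserved, infra_overhead):
    """Available = Node Total - System Reserved - Infra Overhead."""
    return node_total - system_reserved - infra_overhead


node = Resources(cores=192, mem_gib=1.5 * GIB_PER_TIB)  # 192 cores / 1.5 TiB
reserved = Resources(cores=2, mem_gib=16)               # kubelet system-reserved
infra = Resources(cores=1.2, mem_gib=5)                 # sum of infra Pod requests

free = available_for_workloads(node, reserved, infra)
print(f"{free.cores:.1f} cores, {free.mem_gib / GIB_PER_TIB:.2f} TiB")
# prints "188.8 cores, 1.48 TiB"
```

Swapping in the capacity of a different node type shows how much of the fixed overhead a smaller node absorbs.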
For GPU memory: Full GPU memory is available (infra Pods don't request GPU)
8× H100 80GB = 640 GB GPU memory, all for AI workloads
Monitor Actual Usage
# Top memory consumers across all namespaces (cluster-wide, not filtered by node)
oc adm top pods --all-namespaces --sort-by=memory | head -30
# Node allocatable vs capacity
oc get node <gpu-node> -o json | jq '{
  capacity: .status.capacity,
  allocatable: .status.allocatable
}'
# What's actually being used (live)
oc adm top node <gpu-node>
# Breakdown by namespace (CPU in millicores, memory in Mi)
oc adm top pods -A --no-headers | \
  awk '{cpu[$1]+=$3; mem[$1]+=$4} END {for (ns in cpu) print ns, cpu[ns]"m", mem[ns]"Mi"}'
Overcommitment Warning
⚠️ "Total limits may be over 100 percent, i.e., overcommitted."
This is normal for GPU nodes. Infrastructure Pods set low requests
but may burst. The key distinction:
- Requests = guaranteed minimum (used for scheduling)
- Limits = maximum allowed (memory: OOM-killed if exceeded; CPU: throttled)
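The difference between requests and limits can be sketched in a few lines of Python. This is a simplified model of memory only: the function name is made up, and the training job figures are hypothetical. The 512 GiB node and the ~5 GiB / ~9 GiB infra totals reuse numbers from this article.

```python
def overcommit_report(allocatable_gib, requests_gib, limits_gib):
    """Summarise memory requests and limits against node allocatable (all in GiB)."""
    return {
        # The scheduler only compares requests against allocatable.
        "schedulable": requests_gib <= allocatable_gib,
        "requests_pct": round(100 * requests_gib / allocatable_gib, 1),
        # Limits above 100% are what triggers the overcommitment note.
        "limits_pct": round(100 * limits_gib / allocatable_gib, 1),
        "overcommitted": limits_gib > allocatable_gib,
    }


# Hypothetical 512 GiB node: infra requests/limits plus one made-up training job.
report = overcommit_report(
    allocatable_gib=512,
    requests_gib=5 + 480,  # infra requests + job request
    limits_gib=9 + 520,    # infra limits + job limit
)
print(report)
```

In this made-up case the requests fit (about 95% of allocatable), so the job schedules, while the limits add up to slightly more than 100%, which is the situation the warning describes.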
If sum(requests) ≤ node allocatable → scheduling works fine
If sum(actual usage) > allocatable → evictions and OOM kills start
Right-Sizing Infrastructure Pods
# Run:ai node exporter is the heaviest infra Pod
# Requests: 1500m CPU, 2Gi memory
# If GPU metrics are critical, keep these limits
# SR-IOV device plugin also significant
# Memory limit: 1634Mi (manages VF allocation state)
# For nodes with limited memory (e.g., 512Gi total):
# Consider reducing monitoring Pod limits
# or moving non-essential services to infra nodes
Common Issues
AI workload pending: "Insufficient memory"
- Cause: Infra Pod requests + AI workload requests > allocatable
- Fix: Account for ~5Gi infra overhead; request slightly less than full node memory
Node eviction due to memory pressure
- Cause: Infra Pods bursting well above their requests during spikes
- Fix: Set system-reserved in the kubelet config; use eviction-hard thresholds
SR-IOV device plugin using 1.6Gi
- Cause: Normal for managing many VFs (64+ per NIC)
- Fix: Expected behavior; factor into capacity planning
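To factor values like these into a capacity plan, it helps to turn Kubernetes memory quantities into plain numbers first. The helper below is a minimal sketch and is not taken from any Kubernetes client library: `parse_memory` is a hypothetical name, and it handles only the binary suffixes (`Ki`, `Mi`, `Gi`, `Ti`) that appear in this article, not the full quantity syntax.

```python
import re

# Binary suffixes only; decimal suffixes such as "M" or "G" are not handled here.
_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '1634Mi' or '2Gi' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    value, unit = match.groups()
    # A bare number (no suffix) is interpreted as bytes.
    return int(float(value) * _MEM_UNITS.get(unit, 1))


# Memory limits of the two heaviest per-node infra Pods from the inventory.
heaviest_limits = {"sriov-device-plugin": "1634Mi", "runai-node-exporter": "2Gi"}
total_bytes = sum(parse_memory(q) for q in heaviest_limits.values())
print(f"{total_bytes / 2**30:.2f} GiB")  # prints "3.60 GiB"
```

The same helper can be applied to every row of the inventory to recompute the summary totals for a specific cluster.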
Best Practices
- Account for ~5Gi RAM overhead on every GPU node for infra
- Set system-reserved in the kubelet config to protect node daemons from being starved by workloads
- Monitor infra Pod growth: new operators add overhead silently
- GPU memory is unaffected: infra Pods use CPU/RAM only
- Run:ai exporter is heavy (2Gi) because it collects per-GPU, per-Pod metrics
- Use dedicated infra nodes for Run:ai backend (frontend, catalog, cluster-service)
Key Takeaways
- 40+ infra Pods run on each GPU node consuming ~1.2 cores and ~5Gi RAM
- GPU memory (H100/A100) is fully available: infra Pods don't request GPUs
- Overcommitment warnings are normal: requests are what matter for scheduling
- Plan node RAM as: Total - 16Gi system - 5Gi infra = available for training
- Run:ai node-exporter and SR-IOV plugin are the heaviest per-node infra Pods
- Monitor with oc adm top to catch infra Pod memory creep over time
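One way to spot memory creep is to save the output of `oc adm top pods -A --no-headers` periodically and compare per-namespace totals. The sketch below assumes four whitespace-separated columns (namespace, Pod, CPU, memory) with memory reported in `Mi`. The function names and the sample snapshots are invented for illustration.

```python
from collections import defaultdict


def memory_by_namespace(top_output: str) -> dict:
    """Sum the memory column (assumed to be in Mi) per namespace."""
    totals = defaultdict(int)
    for line in top_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        namespace, _pod, _cpu, memory = fields
        if not memory.endswith("Mi"):
            continue  # this sketch only handles Mi values
        totals[namespace] += int(memory[:-2])
    return dict(totals)


def memory_growth(before: dict, after: dict) -> dict:
    """Per-namespace change in Mi between two snapshots (positive = growth)."""
    namespaces = sorted(set(before) | set(after))
    return {ns: after.get(ns, 0) - before.get(ns, 0) for ns in namespaces}


# Made-up snapshots standing in for saved command output.
last_week = """runai runai-node-exporter-x1 10m 800Mi
openshift-sriov-network-operator sriov-device-plugin-x2 5m 900Mi"""
today = """runai runai-node-exporter-x1 12m 950Mi
openshift-sriov-network-operator sriov-device-plugin-x2 5m 900Mi
openshift-monitoring node-exporter-x3 3m 40Mi"""

growth = memory_growth(memory_by_namespace(last_week), memory_by_namespace(today))
print(growth)
```

Namespaces that appear only in the newer snapshot show up with their full usage as growth, which makes newly installed operators easy to notice.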

