Configuration | Intermediate | 15 minutes | K8s 1.28+

OpenShift Cluster Update Process Explained

Complete guide to OpenShift Container Platform cluster updates, covering the CVO workflow, Runlevels, Machine Config Operator node updates, and update channels.

By Luca Berton • 10 min read

💡 Quick Answer: OpenShift updates are orchestrated by the Cluster Version Operator (CVO), which applies release manifests in ordered Runlevels. The CVO updates all control plane Operators first (60-120 min), then the Machine Config Operator (MCO) rolls out OS and config changes to nodes one-by-one (5+ min per node). Use oc adm upgrade to check available versions and oc adm upgrade --to=<version> to initiate.

The Problem

  • Cluster updates are complex: multiple Operators must update in sequence
  • Wrong update channel selection can delay access to critical patches
  • Node updates drain workloads, so poor planning causes application downtime
  • Conditional updates with known risks need informed decision-making
  • Estimating update duration is difficult without understanding the phases

The Solution

OpenShift Update Architecture

┌───────────────────────────────────────────────────────────────────┐
│ OpenShift Update Service (OSUS)                                   │
│ • Hosts update graph of all release versions                     │
│ • Evaluates conditional risks per cluster                         │
│ • Recommends safe update paths based on channel + version         │
└────────────────────────────┬──────────────────────────────────────┘
                             │ Query: "What can I update to?"
                             ▼
┌───────────────────────────────────────────────────────────────────┐
│ Cluster Version Operator (CVO)                                    │
│ • Manages ClusterVersion resource                                 │
│ • Downloads + validates release image                             │
│ • Applies manifests in Runlevel order                             │
│ • Monitors Operator health between Runlevels                      │
└────────────────────────────┬──────────────────────────────────────┘
                             │ After control plane complete
                             ▼
┌───────────────────────────────────────────────────────────────────┐
│ Machine Config Operator (MCO)                                     │
│ • Updates OS + system config on each node                         │
│ • Cordon → Drain → Update → Reboot → Uncordon                     │
│ • Respects maxUnavailable (default: 1)                            │
│ • Control plane + compute pools updated in parallel               │
└───────────────────────────────────────────────────────────────────┘

Check Available Updates

# View recommended updates
oc adm upgrade

# Include updates with known issues (conditional updates)
oc adm upgrade --include-not-recommended

# Check current cluster version
oc get clusterversion
# NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
# version   4.19.12   True        False         3d      Cluster version is 4.19.12

# View available updates as JSON
oc get clusterversion version -o json | jq '.status.availableUpdates[] | .version'

# Check conditional updates (with known risks)
oc get clusterversion version -o json | jq '.status.conditionalUpdates[] |
  {version: .release.version, recommended: .conditions[0].status, reason: .conditions[0].reason}'

Initiate an Update

# Update to specific version
oc adm upgrade --to=4.20.3

# Update to latest in channel
oc adm upgrade --to-latest=true

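# Accept a conditional (not recommended) update after evaluating its risks
oc adm upgrade --to=4.20.3 --allow-not-recommended
# ⚠️ Use only when you've evaluated the risk and accept it
# Note: --force is a different switch. It skips release image verification
# and precondition checks, and should be avoided.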

# Switch channel first if needed
oc adm upgrade channel stable-4.20

Understanding Update Channels

Channel        Description                              When to Use
───────────────────────────────────────────────────────────────────────────
candidate-4.20 Unsupported early access (pre-GA)        Testing only
fast-4.20      GA releases immediately on publish       Need fixes ASAP
stable-4.20    GA releases after promotion delay        Most production clusters
eus-4.y        Extended Update Support (even versions)  EUS-to-EUS jumps
───────────────────────────────────────────────────────────────────────────

Promotion flow:
  candidate → fast (GA + errata) → stable (after delay)

Delay between fast and stable:
  • z-stream updates: ~1-2 weeks
  • Minor version initial: ~45-90 days

Key facts:
  • fast and stable are BOTH fully supported
  • The only difference is time-to-availability
  • If a regression is found on fast, it is handled the same way as on stable
  • Newly installed clusters default to stable

# Switch channels
oc adm upgrade channel fast-4.20    # Get patches sooner
oc adm upgrade channel stable-4.20  # Wait for broader validation
oc adm upgrade channel eus-4.20     # For EUS-to-EUS updates

# Empty channel (disconnect from OSUS, e.g. air-gapped clusters)
oc adm upgrade channel ""
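
The channel names in the table follow a `<type>-<major>.<minor>` pattern. A minimal sketch of deriving one from a full version string is shown below; `channel_for` is a hypothetical helper written for this recipe, not part of `oc`, and the even-minor check simply mirrors the "even versions" note in the table.

```shell
#!/bin/sh
# channel_for is a hypothetical helper (not an oc subcommand).
# It builds a channel name such as "stable-4.20" from a channel type
# and a version string such as 4.20.3.
channel_for() {
  kind=$1       # candidate | fast | stable | eus
  version=$2    # e.g. 4.20.3
  major=${version%%.*}   # text before the first dot, e.g. "4"
  rest=${version#*.}     # text after the first dot, e.g. "20.3"
  minor=${rest%%.*}      # e.g. "20"
  case $kind in
    candidate|fast|stable) ;;
    eus)
      # Assumption taken from the table above: EUS applies to even minors only
      if [ $((minor % 2)) -ne 0 ]; then
        echo "no eus channel for odd minor $major.$minor" >&2
        return 1
      fi
      ;;
    *)
      echo "unknown channel type: $kind" >&2
      return 1
      ;;
  esac
  echo "$kind-$major.$minor"
}

channel_for stable 4.20.3   # prints stable-4.20
```

The result can be passed to the command shown above, for example `oc adm upgrade channel "$(channel_for stable 4.20.3)"`.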

Update Process Workflow (Detailed)

Step 1: Admin sets target version
        └─► spec.desiredUpdate.version in ClusterVersion CR

Step 2: CVO resolves version → release image pull spec
        └─► Uses OSUS graph data

Step 3: CVO validates release image integrity
        └─► Cryptographic signature verification (built-in public keys)

Step 4: CVO creates extraction Job
        ├─► openshift-cluster-version/version-$version-$hash
        └─► Downloads release image, extracts manifests

Step 5: CVO validates extracted manifests + metadata

Step 6: CVO checks preconditions
        ├─► Operators report Upgradeable=True/False
        └─► Blocks if critical precondition fails

Step 7: CVO records in status.desired + status.history

Step 8: CVO applies manifests in Runlevel order
        ├─► Runlevel 03: CRDs
        ├─► Runlevel 10: Core Operators
        ├─► Runlevel 15: CVO itself updates (pod restarts)
        ├─► Runlevel 20: kube-apiserver, kube-controller-manager
        ├─► Runlevel 25: Other Operators
        ├─► ...
        └─► Runlevel 90: MCO manifests (last)

        Between each Runlevel, CVO waits for ALL Operators to report:
        • Available=True
        • Degraded=False
        • Achieved desired version

Step 9: MCO updates nodes
        ├─► Cordon → Drain → OS update → Reboot → Uncordon
        └─► maxUnavailable=1 (default, recommended)

Step 10: Cluster reports Updated
         └─► Control plane done; nodes may still be rolling

Monitor Update Progress

# Overall progress
oc adm upgrade
# or
oc get clusterversion version

# Watch Operator status during update
oc get clusteroperators
# NAME                  VERSION   AVAILABLE   PROGRESSING   DEGRADED
# kube-apiserver        4.20.3    True        True          False     ← updating
# network               4.19.12   True        True          False     ← updating
# machine-config        4.19.12   True        False         False     ← waiting

# Detailed CVO status
oc get clusterversion version -o json | jq '.status.conditions[] |
  {type: .type, status: .status, message: .message}'

# Watch node updates (MCO phase)
oc get mcp
# NAME     CONFIG                          UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT
# master   rendered-master-abc123          True      False      False      3              3
# worker   rendered-worker-def456          False     True       False      6              4

# Watch specific node progress (quoted so the shell does not expand the brackets)
oc get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,UNSCHEDULABLE:.spec.unschedulable'
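
The stability condition from Step 8 (Available=True, Degraded=False, desired version reached) can be checked mechanically against the ClusterOperator listing. The sketch below is one way to do that; `operators_settled` is a hypothetical helper, and it assumes the column order of the `oc get clusteroperators` listing shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED) with every row carrying a version.

```shell
#!/bin/sh
# operators_settled is a hypothetical helper. It reads rows laid out as
# "NAME VERSION AVAILABLE PROGRESSING DEGRADED ..." from stdin and succeeds
# only if every row is at the target version with Available=True and
# Degraded=False. Operators that are not there yet are printed.
operators_settled() {
  target=$1
  awk -v target="$target" '
    NF >= 5 && ($2 != target || $3 != "True" || $5 != "False") {
      print $1 " not settled (version=" $2 ", available=" $3 ", degraded=" $5 ")"
      pending = 1
    }
    END { exit pending + 0 }
  '
}

# Sample rows modelled on the listing above
printf '%s\n' \
  'kube-apiserver   4.20.3    True   True    False   5m' \
  'network          4.19.12   True   True    False   3d' \
  'machine-config   4.19.12   True   False   False   3d' |
  operators_settled 4.20.3 || echo "update still in progress"
```

Against a live cluster the input would come from `oc get clusteroperators --no-headers`, piped into the same function.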

Runlevel Manifest Ordering

# Release image manifests are named:
# 0000_<runlevel>_<component>_<manifest-name>.yaml

# Extract and inspect release contents
oc adm release extract quay.io/openshift-release-dev/ocp-release:4.20.3-x86_64

# View ordering
ls | head -20
# 0000_03_authorization-openshift_01_rolebindingrestriction.crd.yaml
# 0000_03_config-operator_01_proxy.crd.yaml
# 0000_10_cluster-openshift-controller-manager_00_namespace.yaml
# 0000_20_kube-apiserver-operator_00_namespace.yaml
# 0000_25_kube-scheduler-operator_00_namespace.yaml
# 0000_50_cluster-ingress-operator_00_namespace.yaml
# 0000_90_machine-config_01_namespace.yaml

# Rules:
# 1. Lower Runlevel applied before higher
# 2. Within Runlevel: different components in parallel
# 3. Within component: lexicographic order
# 4. CVO waits for stability before next Runlevel
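
To see the ordering rules applied to a set of file names, the Runlevel and component can be pulled out of each name and sorted. This is only an illustrative sketch: `runlevels` is a hypothetical helper, and it assumes component names contain no underscores, as in the listing above.

```shell
#!/bin/sh
# runlevels is a hypothetical helper. It reads manifest file names from
# stdin and prints one "<runlevel> <component>" line per component,
# sorted by Runlevel (numeric) and then by component name.
runlevels() {
  awk -F_ '/^0000_[0-9]+_/ { print $2, $3 }' | LC_ALL=C sort -u -k1,1n -k2,2
}

printf '%s\n' \
  0000_90_machine-config_01_namespace.yaml \
  0000_03_config-operator_01_proxy.crd.yaml \
  0000_20_kube-apiserver-operator_00_namespace.yaml \
  0000_03_authorization-openshift_01_rolebindingrestriction.crd.yaml |
  runlevels
```

In a directory produced by `oc adm release extract`, `ls | runlevels` gives the same kind of summary for the whole release.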

Estimate Update Duration

Formula:
  Update time = CVO phase + (node iterations × time per node)

CVO phase: 60-120 minutes (control plane Operators)

Node update time per node:
  • Cloud instances: 5-10 minutes (fast reboot)
  • Bare metal: 15-30 minutes (slow reboot + BIOS POST)

Node iterations = ceil(nodes_in_pool / maxUnavailable)
  (pools update in parallel, so use the largest pool; the examples below count compute nodes)

Examples:
─────────────────────────────────────────────────────────────────
Cluster: 3 control + 6 compute, cloud, maxUnavailable=1
  = 60 min + (6 iterations × 5 min) = 90 minutes

Cluster: 3 control + 6 compute, cloud, maxUnavailable=2
  = 60 min + (3 iterations × 5 min) = 75 minutes

Cluster: 3 control + 20 compute, bare metal, maxUnavailable=1
  = 90 min + (20 iterations × 20 min) = 490 minutes (~8 hours)

Cluster: 3 control + 20 compute, bare metal, maxUnavailable=5
  = 90 min + (4 iterations × 20 min) = 170 minutes (~3 hours)
─────────────────────────────────────────────────────────────────
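
The formula is easy to script for maintenance-window planning. The following is a minimal sketch using integer shell arithmetic; `estimate_update_minutes` is a hypothetical helper, and the inputs are the same planning figures used in the examples above, not measured values.

```shell
#!/bin/sh
# estimate_update_minutes is a hypothetical helper implementing
#   CVO phase + ceil(nodes / maxUnavailable) * minutes per node
# Arguments: <cvo_minutes> <nodes> <max_unavailable> <minutes_per_node>
estimate_update_minutes() {
  cvo_minutes=$1
  nodes=$2
  max_unavailable=$3
  minutes_per_node=$4
  # Integer ceiling division: (a + b - 1) / b
  iterations=$(( (nodes + max_unavailable - 1) / max_unavailable ))
  echo $(( cvo_minutes + iterations * minutes_per_node ))
}

estimate_update_minutes 60 6 1 5     # prints 90
estimate_update_minutes 60 6 2 5     # prints 75
estimate_update_minutes 90 20 1 20   # prints 490
estimate_update_minutes 90 20 5 20   # prints 170
```

The four calls reproduce the four worked examples, which makes it easy to try other values of maxUnavailable before changing the pool.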

MCO Node Update Sequence

For each MachineConfigPool (master, worker):

  While nodes remain to update:
    1. Select up to maxUnavailable nodes
    2. Cordon selected nodes (no new workloads scheduled)
    3. Drain pods (respecting PodDisruptionBudgets)
    4. Apply new MachineConfig (OS + systemd + kubelet + CRI-O)
    5. Reboot node
    6. Node comes back Ready
    7. Uncordon node (workloads can schedule again)
    8. Repeat with next batch

  Node selection order:
    • Alphabetical by topology.kubernetes.io/zone
    • Within zone: oldest nodes first
    • No zones: oldest first

# Check MCP status during update
oc get mcp worker -o yaml | yq '.status'
# machineCount: 6
# readyMachineCount: 4
# updatedMachineCount: 4
# unavailableMachineCount: 1
# degradedMachineCount: 0

# See which node is currently updating (UNSCHEDULABLE is true while a node is cordoned)
oc get nodes -l node-role.kubernetes.io/worker \
  -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,UNSCHEDULABLE:.spec.unschedulable'
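
The node selection order described above (zone alphabetically, then oldest first) combined with maxUnavailable gives a rough preview of how a pool will roll. The sketch below is a simplification: `plan_batches` is a hypothetical helper, the input format (name, zone, creation timestamp per line) is an assumption chosen for the example, and it treats batches as fixed groups rather than modelling how the MCO actually paces nodes.

```shell
#!/bin/sh
# plan_batches is a hypothetical helper. stdin carries one node per line as
#   <name> <zone> <creationTimestamp>
# and the argument is maxUnavailable. One line is printed per batch.
plan_batches() {
  max_unavailable=$1
  # Zone alphabetically, then oldest first (RFC 3339 timestamps sort as text)
  LC_ALL=C sort -k2,2 -k3,3 |
    awk -v n="$max_unavailable" '
      { batch = (batch == "" ? $1 : batch " " $1) }
      NR % n == 0 { print "batch " (NR / n) ": " batch; batch = "" }
      END { if (batch != "") print "batch " (int(NR / n) + 1) ": " batch }
    '
}

printf '%s\n' \
  'worker-c us-east-1b 2024-03-01T10:00:00Z' \
  'worker-a us-east-1a 2024-05-01T10:00:00Z' \
  'worker-b us-east-1a 2024-01-15T10:00:00Z' \
  'worker-d us-east-1b 2024-02-01T10:00:00Z' \
  'worker-e us-east-1c 2024-04-01T10:00:00Z' |
  plan_batches 2
```

With the sample data, the two us-east-1a nodes form the first batch (older node first), the us-east-1b nodes the second, and the single us-east-1c node the last.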

Conditional Updates (Known Risks)

# View conditional updates with risk details
oc get clusterversion version -o json | jq '.status.conditionalUpdates[] | {
  version: .release.version,
  recommended: .conditions[0].status,
  reason: .conditions[0].reason,
  message: .conditions[0].message
}'

# Example output:
# {
#   "version": "4.20.2",
#   "recommended": "False",
#   "reason": "MultipleReasons",
#   "message": "In Azure clusters with user-provisioned registry storage..."
# }

# Risk evaluation: CVO continuously checks if YOUR cluster matches risk criteria
# If no match → appears in availableUpdates (recommended)
# If matches → appears in conditionalUpdates (known issues)
# You can still update: the status is informational, not blocking (unless Upgradeable=False)

ClusterOperator Condition Types

# Check all operator conditions
oc get co -o json | jq '.items[] | {
  name: .metadata.name,
  available: (.status.conditions[] | select(.type=="Available") | .status),
  progressing: (.status.conditions[] | select(.type=="Progressing") | .status),
  degraded: (.status.conditions[] | select(.type=="Degraded") | .status),
  upgradeable: (.status.conditions[] | select(.type=="Upgradeable") | .status)
}'

Condition Types:
───────────────────────────────────────────────────────────────────
Available=True    Operator is functional (False = admin intervention needed)
Progressing=True  Operator is rolling out changes (normal during update)
Degraded=True     Persistent issue requiring attention (not transient)
Upgradeable=False Operator says cluster shouldn't update (blocks minor updates)
───────────────────────────────────────────────────────────────────

ClusterVersion Condition Types:
───────────────────────────────────────────────────────────────────
Failing           Cannot reach desired state (unhealthy)
Invalid           Error prevents CVO from taking action
RetrievedUpdates  Successfully fetched update graph from OSUS
ReleaseAccepted   Release payload loaded and verified successfully
───────────────────────────────────────────────────────────────────

EUS-to-EUS Updates (Control Plane Only)

# EUS versions: 4.14, 4.16, 4.18, 4.20 (even minor versions)
# Skip intermediate minor for worker nodes

# 1. Pause worker MCP
oc patch mcp/worker --type merge --patch '{"spec":{"paused":true}}'

# 2. Update control plane through intermediate version
#    (--to needs an exact version; pick one listed by `oc adm upgrade`)
oc adm upgrade channel eus-4.20
oc adm upgrade --to=<4.19.z>   # intermediate
# Wait for control plane...
oc adm upgrade --to=<4.20.z>   # target EUS

# 3. Resume worker MCP (nodes update directly to 4.20)
oc patch mcp/worker --type merge --patch '{"spec":{"paused":false}}'

# Benefit: Workers reboot only ONCE (not twice)
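
The hop sequence for the control plane follows directly from the "even minor versions" rule. As a small illustration, the sketch below prints the minors involved; `eus_path` is a hypothetical helper that expects a `<major>.<minor>` argument and does not consult the update graph, so the actual z-stream targets still have to come from `oc adm upgrade`.

```shell
#!/bin/sh
# eus_path is a hypothetical helper. Given the current EUS minor
# (for example 4.18) it prints the minors the control plane passes
# through on an EUS-to-EUS update.
eus_path() {
  major=${1%%.*}   # e.g. "4"
  minor=${1#*.}    # e.g. "18"
  if [ $((minor % 2)) -ne 0 ]; then
    echo "$1 is not an EUS (even) minor version" >&2
    return 1
  fi
  echo "$major.$minor -> $major.$((minor + 1)) -> $major.$((minor + 2))"
}

eus_path 4.18   # prints 4.18 -> 4.19 -> 4.20
```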

Common Issues

Update stuck at “Progressing” for >2 hours

  • Cause: An Operator can’t reach stable state (often kube-apiserver graceful termination)
  • Fix: Check oc get co for Progressing=True operators; inspect their logs

Node stuck in “SchedulingDisabled” after update

  • Cause: MCO drain stuck on pod with restrictive PDB
  • Fix: Run oc get pdb -A and look for PDBs with 0 allowed disruptions; relax the PDB or scale up the workload so the drain can proceed

“Upgradeable=False” blocking update

  • Cause: An Operator detected a condition preventing safe update
  • Fix: Run oc get co <operator> -o json | jq '.status.conditions[] | select(.type=="Upgradeable")' to see message

Update not available in channel

  • Cause: Release not yet promoted to stable; or conditional risk blocks recommendation
  • Fix: Switch to fast channel; or use --include-not-recommended to see all options

MCO Degraded after node reboot

  • Cause: Node failed to apply new machine config (disk full, kernel panic, etc.)
  • Fix: SSH to node; check journalctl -u machine-config-daemon; may need to oc debug node/

Best Practices

  1. Use stable channel for production: fast only when you need specific fixes immediately
  2. Never change maxUnavailable for control plane: keep at 1 (sequential)
  3. Check Upgradeable conditions before starting: oc adm upgrade shows blockers
  4. Monitor PDBs before update: restrictive PDBs cause drain timeouts
  5. Ensure all nodes are Ready: unavailable nodes delay the entire update
  6. EUS-to-EUS for large clusters: saves one full reboot cycle for all workers
  7. Test in non-production first: use fast channel in staging, stable in production
  8. Plan maintenance windows: estimate as CVO time + (iterations × node time)

Key Takeaways

  • OpenShift updates = CVO phase (Operators in Runlevels) + MCO phase (node OS/config)
  • CVO applies manifests in dependency order (Runlevel 03 → 90); waits for stability between levels
  • MCO updates nodes one-by-one by default: cordon → drain → update → reboot → uncordon
  • Four channels: candidate (testing), fast (GA immediate), stable (GA delayed), eus (skip minors)
  • Conditional updates: OSUS publishes known risks and the CVO checks whether your cluster matches them
  • Duration estimate: 60-120 min CVO + (nodes/maxUnavailable × reboot time)
  • Default maxUnavailable=1 for both pools: increase compute only, never control plane
  • EUS-to-EUS: pause workers, update control plane through intermediate, resume = one reboot
  • ClusterOperator conditions (Available/Progressing/Degraded/Upgradeable) drive update flow
#openshift #cluster-update #cvo #machine-config-operator #lifecycle
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
