ITMS Race Condition with Ingress Controllers
Resolve the ITMS race condition where ImageTagMirrorSet rollouts deadlock with hostNetwork ingress controllers during MCO drain.
💡 Quick Answer: When an ITMS updates registries.conf, the MCO reboots nodes one by one. Nodes already updated resolve images via the new mirror. If the mirror is missing the ingress router image, replacement router pods can't pull on updated nodes → the PDB blocks eviction on the next node → the MCP deadlocks. Fix: pre-sync all images (especially the ingress router image) to the mirror before applying the ITMS.
The Problem
You apply an ImageTagMirrorSet (ITMS) to redirect image pulls from an external registry to an internal mirror. The MCO begins rolling the new registries.conf across worker nodes. Halfway through the rollout, the MCP stalls at UPDATING=True. Drains hang, nodes stay cordoned, and the cluster is in a split-brain state: some nodes have the old registries.conf, others have the new one.
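For concreteness, a minimal ITMS of the shape discussed in this article might look like the following (a sketch; the mirror hostname is this article's running example, not a real endpoint):

# Hypothetical ITMS matching this article's example names
oc apply -f - <<'EOF'
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  name: my-tag-mirror-set
spec:
  imageTagMirrors:
  - source: quay.io/openshift-release-dev
    mirrors:
    - mirror.internal.example.com/openshift-release-dev
EOF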
The root cause is a circular dependency between the ITMS rollout and the ingress controller's ability to reschedule.
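To see the mechanism concretely, this is roughly what the MCO renders into each node's registries.conf once the ITMS lands (a sketch: the tag-only mirror stanza is how ITMS entries are rendered; the node and mirror names are this article's examples):

# Inspect the rendered pull policy on an already-updated node
oc debug node/worker-1 -- chroot /host cat /etc/containers/registries.conf
# [[registry]]
#   location = "quay.io/openshift-release-dev"
#   [[registry.mirror]]
#     location = "mirror.internal.example.com/openshift-release-dev"
#     pull-from-mirror = "tag-only"   # ITMS entries apply only to tag pulls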
The Race Condition, Step by Step
sequenceDiagram
participant Admin
participant MCO
participant Node1 as Worker-1 (updated)
participant Node2 as Worker-2 (draining)
participant Node3 as Worker-3 (pending)
participant Mirror as Internal Mirror
participant PDB as Router PDB
Admin->>MCO: Apply ITMS
MCO->>Node1: Drain → reboot → new registries.conf ✅
Note over Node1: Now pulls from mirror.internal
MCO->>Node2: Begin drain
Node2->>PDB: Evict router pod?
PDB->>Node2: Need replacement first (minAvailable)
Node2->>Node1: Schedule replacement router on Worker-1
Node1->>Mirror: Pull router image from mirror
Mirror-->>Node1: ❌ 404, image not mirrored!
Note over Node1: ImagePullBackOff
Note over PDB: Replacement not Running → eviction blocked
Note over MCO: Drain hangs → MCP stuck
What Happened
1. ITMS applied → the MCO renders a new registries.conf routing, e.g., quay.io/openshift-release-dev/* → mirror.internal.example.com/openshift-release-dev/*.
2. Worker-1 updated first → drained successfully (the router pod was evicted; its replacement ran on Worker-3, which still had the old registries.conf and pulled from the original registry). Worker-1 reboots with the new registries.conf.
3. Worker-2 drain begins → the MCO tries to evict the router pod on Worker-2. The PDB requires minAvailable: N, so a replacement must schedule and pass readiness first.
4. Replacement schedules on Worker-1 (already updated) → CRI-O on Worker-1 reads the new registries.conf and tries to pull the ingress router image from mirror.internal.example.com.
5. Mirror doesn't have the router image → the admin mirrored application images but forgot (or didn't know) that the ingress router image must be in the mirror too. The pull fails with ImagePullBackOff.
6. PDB can't be satisfied → the replacement pod never reaches Running, so the original pod on Worker-2 can't be evicted. The drain hangs and the MCP is stuck (you can confirm this from the PDB, as shown after this list).
7. All further nodes are blocked → Worker-3, Worker-4, etc. can't proceed because the MCO drains nodes sequentially.
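The PDB side of the deadlock is visible directly; a quick check (the namespace is the standard one, but the PDB name and values here are illustrative):

# 0 allowed disruptions while the replacement sits in ImagePullBackOff
oc get pdb -n openshift-ingress
# NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# router-default   1               N/A               0                     42d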
The Split-Brain State
At this point the cluster has:
- Worker-1: new registries.conf → pulls from the mirror (some images may fail)
- Worker-2: cordoned, drain hanging → old registries.conf still active
- Worker-3, 4, 5, 6: old registries.conf → still pulling from the original registry
Any pod rescheduled to Worker-1 that needs an image not in the mirror will also fail. The blast radius grows over time as pods are naturally rescheduled.
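To map which side of the split each node is on, compare the MCO's current and desired rendered configs per node (these annotation keys are the standard MCO ones):

oc get nodes -o custom-columns='NAME:.metadata.name,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig'
# Nodes where CURRENT != DESIRED are still waiting on the stalled rollout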
The Solution
Immediate Fix: Unblock the Stuck Rollout
# Step 1: Identify the stuck node
oc get mcp worker
oc get nodes -l node-role.kubernetes.io/worker= -o custom-columns=\
'NAME:.metadata.name,STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state,CONFIG:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig'
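# Illustrative pool status while stuck (values hypothetical):
#   NAME     UPDATED   UPDATING   DEGRADED   MACHINECOUNT   UPDATEDMACHINECOUNT
#   worker   False     True       False      6              1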
# Step 2: Find the failing replacement pod
oc get pods -n openshift-ingress -o wide | grep -E "Pending|ImagePull|ErrImage"
# router-custom-7f8b9c-abc12   0/1   ImagePullBackOff   worker-1   ← on the updated node
# Step 3: Check what image is failing
oc describe pod router-custom-7f8b9c-abc12 -n openshift-ingress | grep -A3 "Events:"
# Failed to pull image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:..."
# → mirror.internal.example.com resolved it but doesn't have it
# Step 4: Mirror the missing image NOW
skopeo copy \
docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abc123... \
docker://mirror.internal.example.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:abc123... \
--src-creds=user:token --dest-tls-verify=false
# Step 5: Delete the failing pod to trigger re-pull
oc delete pod router-custom-7f8b9c-abc12 -n openshift-ingress
# Step 6: Watch the replacement come up
oc get pods -n openshift-ingress -o wide -w
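# Example healthy state (illustrative values):
# router-custom-7f8b9c-xyz99   1/1   Running   0   30s   worker-1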
# Once it's Running, the PDB is satisfied, the drain unblocks, and the MCP continues
Alternative: Scale Down to Unblock
If you can't mirror the image quickly:
# Temporarily scale down the router to unblock the drain
REPLICAS=$(oc get deploy router-custom -n openshift-ingress -o jsonpath='{.spec.replicas}')
oc scale deploy router-custom -n openshift-ingress --replicas=0
# Wait for drain to complete and node to reboot
watch oc get mcp worker
# After all nodes are updated, restore the original replica count
# (note: if the router Deployment is managed by an IngressController, the
#  ingress operator may reconcile replicas; set .spec.replicas on the
#  IngressController instead if the scale doesn't stick)
oc scale deploy router-custom -n openshift-ingress --replicas=$REPLICAS
Nuclear Option: Revert the ITMS
If too many images are missing from the mirror:
# Delete the ITMS to revert registries.conf
oc delete itms my-tag-mirror-set
# MCO will render a new MachineConfig WITHOUT the mirror entries
# and roll it out β but this means ANOTHER round of drains/reboots
oc get mcp worker -w
Prevention: The Pre-Flight Checklist
1. Inventory ALL Images Before Applying ITMS
# List every image running in the cluster
oc get pods -A -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{range .spec.initContainers[*]}{.image}{"\n"}{end}{end}' | sort -u > /tmp/all-cluster-images.txt
# Don't forget operator images, init containers, and sidecar injectors
oc get csv -A -o json | jq -r '.items[].spec.install.spec.deployments[].spec.template.spec.containers[].image' | sort -u >> /tmp/all-cluster-images.txt
# Check which images match your ITMS source patterns
grep "quay.io" /tmp/all-cluster-images.txt2. Mirror ALL Matched Images
# For each image that matches the ITMS source:
while read img; do
  echo "Mirroring: $img"
  skopeo copy "docker://$img" \
    "docker://mirror.internal.example.com/${img#*/}" \
    --all --src-creds=user:token
done < /tmp/matched-images.txt
3. Verify Mirror Completeness
# Test that every matched image is accessible from the mirror
while read img; do
  MIRROR_IMG="mirror.internal.example.com/${img#*/}"
  if skopeo inspect "docker://$MIRROR_IMG" --tls-verify=false > /dev/null 2>&1; then
    echo "✅ $MIRROR_IMG"
  else
    echo "❌ MISSING: $MIRROR_IMG"
  fi
done < /tmp/matched-images.txt
4. Apply ITMS Only After Full Mirror Sync
# All images verified? Now safe to apply
oc apply -f itms.yaml
5. Use a Dedicated MCP for Critical Infra
Separate ingress/router nodes into their own MCP so they update independently:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, infra]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
  maxUnavailable: 1
  paused: true # Update infra nodes manually after workers succeed
Why This Race Condition Is Hard to Catch
1. First node always succeeds: Worker-1's router pod gets evicted and rescheduled to a node still on the OLD registries.conf, so the pull works fine from the original registry.
2. Failure only appears on the second node onward: by then, the replacement targets Worker-1 (now on the NEW registries.conf), and the mirror lookup fails.
3. Non-obvious image dependency: admins think of application images but forget that infrastructure components (ingress routers, monitoring agents, logging collectors) also pull images affected by the ITMS. A quick check for this follows this list.
4. ITMS is tag-based: unlike IDMS (which applies to digest pulls), ITMS applies to tag-based pulls. A broad source like quay.io catches far more images than expected, including OpenShift infrastructure images.
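One way to catch the infrastructure-image dependency ahead of time: ask the release payload which image the router actually uses, then test that exact reference against the mirror (a sketch; the mirror host is this article's example):

# Resolve the router image shipped with the running release
ROUTER_IMG=$(oc adm release info --image-for=haproxy-router)
echo "$ROUTER_IMG"
# Probe the same repo path on the mirror
skopeo inspect "docker://mirror.internal.example.com/${ROUTER_IMG#*/}" > /dev/null 2>&1 \
  && echo "✅ mirrored" || echo "❌ MISSING from mirror"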
Common Issues
Partial Mirror: Some Tags Work, Others Don't
# ITMS with mirrorSourcePolicy: NeverContactSource means NO fallback
# If the mirror is missing ANY matched image, it fails hard
spec:
  imageTagMirrors:
  - mirrors:
    - mirror.internal.example.com
    source: quay.io
    mirrorSourcePolicy: NeverContactSource # ⚠️ Dangerous without a full sync
Use AllowContactingSource during migration to allow fallback:
mirrorSourcePolicy: AllowContactingSource # Falls back to the original registry on a mirror miss
Multiple Ingress Controllers Compound the Problem
If you have 5 custom IngressControllers, each with hostNetwork and strict PDBs, the drain has 5× the chance of hitting a mirror miss.
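To gauge your exposure, list how many router Deployments and PDBs are in play (standard namespaces; output varies by cluster):

# Each IngressController gets its own router Deployment and PDB
oc get ingresscontroller -n openshift-ingress-operator
oc get deploy,pdb -n openshift-ingress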
ITMS + IDMS Ordering Confusion
# Check which mirror sets are active
oc get itms,idms
# ITMS and IDMS entries are merged into registries.conf
# More specific sources take priority
Best Practices
- Always pre-sync mirrors before applying an ITMS: treat it like a database migration
- Use AllowContactingSource initially: switch to NeverContactSource only after verifying all images are mirrored
- Include infrastructure images in your mirror plan: ingress routers, monitoring, logging, operators
- Pause the MCP before applying the ITMS: sync mirrors while paused, then unpause (see the sketch after this list)
- Separate critical infra into a dedicated MCP: control update order explicitly
- Test with one node first: set maxUnavailable: 1 and watch the full cycle complete before continuing
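A sketch of the pause-then-sync flow from the list above (pool name worker as elsewhere in this article; keep the pause window short, since pausing also defers any other pending MachineConfig changes):

# Pause the pool so the ITMS renders but nodes don't start draining
oc patch mcp worker --type merge -p '{"spec":{"paused":true}}'
oc apply -f itms.yaml
# ... mirror and verify all matched images here ...
# Unpause and let the rollout proceed node by node
oc patch mcp worker --type merge -p '{"spec":{"paused":false}}'
oc get mcp worker -w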
Key Takeaways
- ITMS changes registries.conf via the MCO: a rolling reboot across all nodes in the MCP
- Race condition: updated nodes resolve images via the mirror; if the mirror is incomplete, pods fail to pull
- hostNetwork ingress routers with PDBs amplify the problem: the drain deadlocks when replacements can't pull images
- Pre-sync ALL images (including infrastructure) to the mirror before applying the ITMS
- Use AllowContactingSource as a safety net during migration
- First-node success is deceptive: the failure only manifests on subsequent nodes
