Kubernetes Pod Priority and Preemption
Configure pod priority and preemption in Kubernetes for critical workloads. PriorityClass definitions, preemption behavior, protecting system
💡 Quick Answer: A PriorityClass maps a name to an integer priority. User-defined classes can use values up to 1000000000 (one billion); larger values are reserved for the built-in system classes. Higher-priority pods are placed ahead of lower-priority pods in the scheduling queue and can preempt (evict) them when the cluster is full. Create PriorityClasses for your workload tiers (critical/high/normal/low), then reference them in pod specs with priorityClassName.
The Problem
- Critical production pods can't schedule because batch jobs consumed all resources
- No distinction between must-run services and best-effort workloads
- Cluster full: need to automatically make room for important pods
- System components (DNS, monitoring) must never be evicted
- Want to run low-priority workloads that yield resources when needed
The Solution
Define PriorityClasses
# Cluster-critical add-ons (highest user-definable priority)
# Names starting with "system-" are reserved for the built-in classes,
# so this custom class uses a different name.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: infra-critical
value: 1000000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "System-critical pods (DNS, ingress, monitoring)"
---
# Production workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Production services that must always run"
---
# Default priority for most workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-normal
value: 50000
globalDefault: true # Applied to pods without explicit priority
preemptionPolicy: PreemptLowerPriority
description: "Standard production workloads"
---
# Low priority (batch, dev, preemptible)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 10000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Batch jobs and non-critical workloads"
---
# Best effort (can be preempted, never preempts others)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort
value: 1
globalDefault: false
preemptionPolicy: Never # Won't evict others to schedule
description: "Best-effort workloads, will be preempted first"

Use PriorityClass in Pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  selector: # Required for apps/v1 Deployments
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service # Must match the selector above
    spec:
      priorityClassName: production-high # High priority: preempts lower
      containers:
        - name: app
          image: registry.example.com/payment:v2
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    spec:
      priorityClassName: batch-low # Low priority: preempted by production
      restartPolicy: Never # Jobs require Never or OnFailure
      containers:
        - name: worker
          image: registry.example.com/batch-worker:v1
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"

How Preemption Works
Scenario: Cluster is full, high-priority pod can't schedule
1. Scheduler finds no node that fits the pending pod (priority 100000)
2. Scheduler looks for nodes where evicting lower-priority pods would make room
3. Scheduler picks the node with the least disruption (fewest PDB violations first, then lowest-priority and fewest victims)
4. Victim pods get graceful termination (terminationGracePeriodSeconds)
5. The pending pod is nominated for that node (status.nominatedNodeName) and normally schedules there once the victims have exited
Protection:
- PDBs (PodDisruptionBudgets) are honored on a best-effort basis: the scheduler prefers victims whose PDBs would not be violated, but preempts anyway if no such victims exist
- Pods with higher or equal priority are never preempted
- System pods (priority > 1B) are never preempted by user workloads

Non-Preempting Priority (Queue Ordering Only)
# High scheduling priority but won't evict others
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-no-preempt
value: 90000
preemptionPolicy: Never # Schedules first, but waits for resources
description: "High scheduling priority without preemption"

Built-in System PriorityClasses
kubectl get priorityclasses
# NAME                      VALUE        GLOBAL-DEFAULT
# system-cluster-critical   2000000000   false
# system-node-critical      2000001000   false
# (your custom classes)

# system-node-critical: kube-proxy and other per-node essentials
# system-cluster-critical: CoreDNS, metrics-server (cluster-wide essentials)

Practical Priority Tier Design
Priority Tier           │ Value        │ Preempts │ Use Case
────────────────────────┼──────────────┼──────────┼───────────────────────────
system-node-critical    │ 2000001000   │ All      │ kube-proxy, node agents
system-cluster-critical │ 2000000000   │ All user │ CoreDNS, ingress, CNI
platform-critical       │ 1000000      │ User     │ Monitoring, logging, mesh
production-high         │ 100000       │ Normal+  │ Payment, auth services
production-normal       │ 50000 (def)  │ Low+     │ Standard services
batch-normal            │ 20000        │ Low+     │ Scheduled batch jobs
batch-low               │ 10000        │ Best-eff │ Backfill jobs
best-effort             │ 1 (Never)    │ None     │ Dev experiments, spot-like
────────────────────────┴──────────────┴──────────┴───────────────────────────

Common Issues
Critical pod not preempting lower-priority pods
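- Cause: No single node would have enough room even after evicting its lower-priority pods; or the pending pod's PriorityClass sets preemptionPolicy: Never. PDBs on the lower-priority pods steer victim selection but do not block preemption outright
- Fix: Confirm the lower-priority pods really have a lower value; check the pending pod's preemptionPolicy; review PDB settings and node resource distribution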
All batch jobs getting killed constantly
- Cause: Cluster too full, so production workloads keep preempting batch
- Fix: Add dedicated batch nodes with taints; or use cluster autoscaler to add capacity
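One way to sketch the dedicated-node approach is to pin batch Jobs to tainted nodes, so they no longer share capacity with production pods. The taint and label `workload=batch` are hypothetical names chosen for illustration, and the nodes are assumed to have been tainted and labeled beforehand.

```yaml
# Assumes the batch nodes were prepared beforehand, for example:
#   kubectl taint nodes <node-name> workload=batch:NoSchedule
#   kubectl label nodes <node-name> workload=batch
# The key/value "workload=batch" is a hypothetical choice, not a convention.
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    spec:
      priorityClassName: batch-low
      restartPolicy: Never
      nodeSelector:
        workload: batch          # only land on the dedicated batch nodes
      tolerations:
        - key: workload          # tolerate the taint that keeps other pods off
          operator: Equal
          value: batch
          effect: NoSchedule
      containers:
        - name: worker
          image: registry.example.com/batch-worker:v1
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
```

The toleration only permits scheduling onto the tainted nodes; the nodeSelector is what keeps the Job off the shared production nodes.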
Pods without PriorityClass get priority 0
- Cause: No PriorityClass with globalDefault: true is defined
- Fix: Create a PriorityClass with globalDefault: true for a sensible default
Preemption cascade (chain reaction)
- Cause: Evicted pod triggers rescheduling which preempts another pod
- Fix: Use clear priority tiers with gaps between values; set PDBs on critical workloads
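As a rough illustration of setting a PDB on a critical workload, the sketch below keeps at least two payment-service pods available during voluntary disruptions. It assumes the pods carry a hypothetical `app: payment-service` label; the budget name and the `minAvailable` value are also illustrative.

```yaml
# Minimal PodDisruptionBudget sketch for the payment-service pods.
# Assumption: the pods are labeled app=payment-service (hypothetical label).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb      # hypothetical name
  namespace: production
spec:
  minAvailable: 2                # with 3 replicas, at most 1 pod may be disrupted at a time
  selector:
    matchLabels:
      app: payment-service
```

The scheduler treats this budget as a preference when choosing preemption victims, not a guarantee, so it reduces the chance of a cascade without ruling one out.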
Best Practices
- Define 4-6 priority tiers and keep the hierarchy clear; don't over-complicate
- Set a globalDefault so pods without a class don't fall back to priority 0
- Use preemptionPolicy: Never for batch so jobs queue by priority without evicting others
- Protect critical services with PodDisruptionBudgets
- Leave gaps between values so new tiers fit without reshuffling
- Don't use priorities > 1B; they are reserved for system components
- Combine with resource quotas to prevent low-priority namespaces from hoarding
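A quota can be scoped to a PriorityClass, which is one way to cap what a namespace may consume at a given priority. The sketch below limits pods using batch-low in a hypothetical `batch` namespace; the quota name and all numbers are illustrative assumptions.

```yaml
# ResourceQuota that only counts pods whose priorityClassName is batch-low.
# Assumptions: the "batch" namespace, the quota name, and the limits are made up.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-low-quota
  namespace: batch
spec:
  hard:
    pods: "50"
    requests.cpu: "20"
    requests.memory: 64Gi
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass   # match pods by their priority class
        operator: In
        values: ["batch-low"]
```

A similar quota with `values: ["production-high"]` could restrict which namespaces are allowed to create high-priority pods at all.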
Key Takeaways
- PriorityClass: numeric value (higher = more important) + preemption policy
- Higher-priority pods schedule first AND can evict lower-priority pods
- preemptionPolicy: Never gives high queue priority without eviction power
- Built-in: system-node-critical (2000001000) and system-cluster-critical (2000000000)
- globalDefault: true is applied to pods without explicit priorityClassName
- PDBs are honored on a best-effort basis during preemption; protected pods can still be evicted if no other victims exist
- Design 4-6 clear tiers: system → platform → production → batch → best-effort
