etcd Maintenance Operations Kubernetes
Perform etcd maintenance for Kubernetes clusters. Defragmentation, compaction, snapshot backup, member health checks, and performance monitoring with etcdctl.
π‘ Quick Answer: Run
etcdctl defragafter compaction to reclaim disk space, take snapshots withetcdctl snapshot save, check member health withetcdctl endpoint health --cluster, and monitordb_sizevsdb_size_in_useto detect fragmentation.
The Problem
etcd stores all Kubernetes cluster state. Without maintenance, etcdβs database grows unbounded (fragmentation), performance degrades, and snapshots become unreliable. A full etcd disk (default quota: 2GB) causes the cluster to become read-only β no new pods, no updates, no scaling.
The Solution
Health Check
# Set etcd connection variables
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
# Cluster health
etcdctl endpoint health --cluster
# https://10.0.0.1:2379 is healthy: committed proposal: took = 2ms
# https://10.0.0.2:2379 is healthy: committed proposal: took = 3ms
# https://10.0.0.3:2379 is healthy: committed proposal: took = 2ms
# Cluster status (db size, leader, raft index)
etcdctl endpoint status --cluster -w tableSnapshot Backup
# Take snapshot
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
# Verify snapshot
etcdctl snapshot status /backup/etcd-20260424-020000.db -w table
# +----------+----------+------------+------------+
# | HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | 3c27e36a | 1285023 | 4521 | 45 MB |
# +----------+----------+------------+------------+Compaction and Defragmentation
# Get current revision
REV=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
# Compact old revisions (keeps only current state)
etcdctl compact $REV
# Defragment to reclaim disk space (run on each member)
etcdctl defrag --endpoints=https://10.0.0.1:2379
etcdctl defrag --endpoints=https://10.0.0.2:2379
etcdctl defrag --endpoints=https://10.0.0.3:2379β οΈ Defragmentation blocks the member briefly. Run on one member at a time, not the leader first.
Monitoring Queries
# Database size vs in-use (fragmentation indicator)
etcdctl endpoint status -w json | jq '.[] | {
endpoint: .Endpoint,
dbSize: (.Status.dbSize / 1048576 | floor | tostring + " MB"),
dbSizeInUse: (.Status.dbSizeInUse / 1048576 | floor | tostring + " MB"),
leader: .Status.leader,
raftIndex: .Status.raftIndex
}'
# Alarm check (NOSPACE alarm means quota exceeded)
etcdctl alarm list
# Clear alarm after fixing space
etcdctl alarm disarmgraph TD
WRITE[K8s API Writes] -->|Append| ETCD[etcd Database<br/>WAL + B+ tree]
ETCD -->|Revisions grow| FRAG[Fragmentation<br/>dbSize >> dbSizeInUse]
FRAG -->|Fix| COMPACT[etcdctl compact<br/>Remove old revisions]
COMPACT --> DEFRAG[etcdctl defrag<br/>Reclaim disk space]
DEFRAG --> HEALTHY[Healthy etcd<br/>dbSize β dbSizeInUse]
ETCD -->|Periodic| SNAP[etcdctl snapshot save<br/>Daily backup]Common Issues
etcd NOSPACE alarm β cluster read-only
The database exceeded its quota (default 2GB). Emergency fix:
etcdctl alarm list # Confirm NOSPACE
REV=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
etcdctl compact $REV # Compact
etcdctl defrag # Reclaim space
etcdctl alarm disarm # Clear alarmDefragmentation makes member unresponsive
Expected β defrag blocks I/O briefly. Run on followers first, leader last. Monitor with endpoint health between each.
etcd slow β high latency on API operations
Check disk I/O: iostat -x 1. etcd needs fast fsync β SSDs are required. Also check dbSize for fragmentation.
Best Practices
- Daily snapshot backups β automate with CronJob or systemd timer
- Compact + defrag weekly β prevents unbounded database growth
- Monitor
db_sizePrometheus metric β alert when approaching quota - SSD storage is mandatory β etcd performance depends entirely on disk fsync latency
- 3 or 5 member clusters β odd numbers for quorum; 5 tolerates 2 failures
- Defrag one member at a time β followers first, leader last
Key Takeaways
- etcd fragmentation causes
dbSize>>dbSizeInUseβ defrag reclaims the difference - NOSPACE alarm makes the cluster read-only β compact + defrag + disarm to recover
- Daily snapshots are your DR lifeline β verify them with
snapshot status - SSD storage is non-negotiable for etcd β spinning disks cause cluster instability
- Run defrag on followers first, leader last, one at a time

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
