How to Backup and Restore etcd
Protect your Kubernetes cluster with etcd backup strategies. Learn to create snapshots, automate backups, and restore etcd data for disaster recovery.
The Problem
Your Kubernetes cluster's entire state is stored in etcd. Without proper backups, a corrupted or lost etcd database means losing all cluster configuration, secrets, and resource definitions.
The Solution
Implement a robust etcd backup strategy with regular snapshots, secure storage, and tested restore procedures.
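At a glance, the workflow comes down to three operations, shown here as a minimal sketch assuming a kubeadm cluster with etcd certificates under /etc/kubernetes/pki/etcd (each step is detailed below):
# 1. Snapshot etcd on a control plane node
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# 2. Copy the snapshot off the cluster (S3 shown; any durable storage works)
aws s3 cp /backup/etcd-snapshot.db s3://my-cluster-backups/etcd/
# 3. Regularly prove the snapshot is usable
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table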
etcd Architecture Overview
flowchart TB
  subgraph cluster["KUBERNETES CLUSTER"]
    subgraph controlplane["CONTROL PLANE"]
      APIServer["API Server"]
      Controllers["Controllers"]
      Scheduler["Scheduler"]
      subgraph etcd_db["etcd"]
        data["Data:<br/>- Pods<br/>- Services<br/>- ConfigMaps<br/>- Secrets"]
      end
      APIServer --> etcd_db
      APIServer --> Controllers
      Controllers --> Scheduler
      etcd_db --> Snapshot["SNAPSHOT<br/>(Backup)"]
    end
  end
Step 1: Install etcdctl
On a Control Plane Node
# Download etcdctl matching your etcd version
ETCD_VERSION=v3.5.11
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz
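# Optional: verify the download against the SHA256SUMS file published with
# etcd releases (skip this step if your mirror does not provide one)
wget -q https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/SHA256SUMS
sha256sum --check --ignore-missing SHA256SUMS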
tar xzf etcd-${ETCD_VERSION}-linux-amd64.tar.gz
sudo mv etcd-${ETCD_VERSION}-linux-amd64/etcdctl /usr/local/bin/
# Verify installation
etcdctl version
Find etcd Connection Details
# Get etcd pod info
kubectl get pods -n kube-system -l component=etcd
# View etcd configuration
kubectl describe pod etcd-controlplane -n kube-system | grep -A 20 Command
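# Alternatively, read the flags straight from the etcd static pod manifest
# (kubeadm clusters keep it on the control plane node)
sudo grep -E "cert-file|key-file|trusted-ca-file|data-dir|listen-client-urls" \
  /etc/kubernetes/manifests/etcd.yaml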
# Common paths (kubeadm clusters)
# Certificates: /etc/kubernetes/pki/etcd/
# Data directory: /var/lib/etcd
Step 2: Create etcd Snapshot
Manual Snapshot
# Set environment variables
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
# Create snapshot
etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
# Verify snapshot
etcdctl snapshot status /backup/etcd-snapshot-20260128-120000.db --write-out=table
One-liner Backup Command
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Step 3: Automate Backups with CronJob
Backup Script
#!/bin/bash
# /usr/local/bin/etcd-backup.sh
set -e
BACKUP_DIR="/backup/etcd"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_NAME="etcd-snapshot-${TIMESTAMP}.db"
# Create backup directory
mkdir -p ${BACKUP_DIR}
# Create snapshot
ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_DIR}/${SNAPSHOT_NAME} \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_DIR}/${SNAPSHOT_NAME}
# Compress snapshot
gzip ${BACKUP_DIR}/${SNAPSHOT_NAME}
# Clean up old backups
find ${BACKUP_DIR} -name "etcd-snapshot-*.db.gz" -mtime +${RETENTION_DAYS} -delete
echo "Backup completed: ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz"Cron Schedule
# Add to crontab
sudo crontab -e
# Backup every 6 hours
0 */6 * * * /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1
Kubernetes CronJob (Alternative)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: backup
              image: bitnami/etcd:3.5
              command:
                - /bin/sh
                - -c
                - |
                  TIMESTAMP=$(date +%Y%m%d-%H%M%S)
                  etcdctl snapshot save /backup/etcd-snapshot-${TIMESTAMP}.db \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  # Upload to cloud storage (requires an image that includes the aws CLI)
                  aws s3 cp /backup/etcd-snapshot-${TIMESTAMP}.db s3://my-backup-bucket/etcd/
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
              env:
                - name: ETCDCTL_API
                  value: "3"
          restartPolicy: OnFailure
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /backup/etcd
Step 4: Store Backups Securely
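Regardless of provider, you may also want to encrypt snapshots client-side before uploading, since they contain every Secret in the cluster. A minimal sketch using openssl; the passphrase file path /etc/etcd-backup/passphrase is a placeholder you would create and protect yourself:
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz \
  -out ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz.enc \
  -pass file:/etc/etcd-backup/passphrase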
Upload to S3
#!/bin/bash
# Add to backup script
# Upload to S3
aws s3 cp ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz s3://my-cluster-backups/etcd/
# Encrypt with KMS
aws s3 cp ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz \
s3://my-cluster-backups/etcd/ \
--sse aws:kms \
--sse-kms-key-id alias/etcd-backup-key
Upload to Azure Blob
# Upload to Azure Blob Storage
az storage blob upload \
--account-name mystorageaccount \
--container-name etcd-backups \
--name ${SNAPSHOT_NAME}.gz \
--file ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz
Upload to GCS
# Upload to Google Cloud Storage
gsutil cp ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz gs://my-cluster-backups/etcd/
Step 5: Restore etcd from Snapshot
Pre-Restore Checklist
# 1. Stop kube-apiserver (on all control plane nodes)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/
# 2. Stop etcd (on all control plane nodes)
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/
# 3. Wait for pods to terminate
sudo crictl ps | grep -E "etcd|kube-apiserver"
# Should return nothing once both static pods have stopped
# (kubectl cannot be used here because the API server is down)
# 4. Backup current data directory
sudo mv /var/lib/etcd /var/lib/etcd.backup
Restore Snapshot
# Restore to new data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20260128-120000.db \
--data-dir=/var/lib/etcd \
--name=controlplane \
--initial-cluster=controlplane=https://192.168.1.10:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://192.168.1.10:2380
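# Note: on etcd v3.5+, "etcdctl snapshot restore" prints a deprecation warning;
# the recommended equivalent uses etcdutl (shipped in the same release tarball):
#   etcdutl snapshot restore /backup/etcd-snapshot-20260128-120000.db \
#     --data-dir=/var/lib/etcd --name=controlplane \
#     --initial-cluster=controlplane=https://192.168.1.10:2380 \
#     --initial-cluster-token=etcd-cluster-1 \
#     --initial-advertise-peer-urls=https://192.168.1.10:2380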
# Set ownership to match how etcd runs on this node (on kubeadm clusters the
# etcd static pod runs as root and no etcd user exists, so skip this chown)
sudo chown -R etcd:etcd /var/lib/etcd
Post-Restore Steps
# 1. Restore etcd manifest
sudo mv /etc/kubernetes/etcd.yaml /etc/kubernetes/manifests/
# 2. Wait for etcd to start
sudo crictl ps | grep etcd
# 3. Verify etcd health
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# 4. Restore kube-apiserver manifest
sudo mv /etc/kubernetes/kube-apiserver.yaml /etc/kubernetes/manifests/
# 5. Verify cluster health
kubectl get nodes
kubectl get pods -A
Multi-Node etcd Cluster Restore
For Each Control Plane Node
# Node 1 (192.168.1.10)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=node1 \
--initial-cluster=node1=https://192.168.1.10:2380,node2=https://192.168.1.11:2380,node3=https://192.168.1.12:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://192.168.1.10:2380
# Node 2 (192.168.1.11)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=node2 \
--initial-cluster=node1=https://192.168.1.10:2380,node2=https://192.168.1.11:2380,node3=https://192.168.1.12:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://192.168.1.11:2380
# Node 3 (192.168.1.12)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=node3 \
--initial-cluster=node1=https://192.168.1.10:2380,node2=https://192.168.1.11:2380,node3=https://192.168.1.12:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://192.168.1.12:2380
etcd Health Monitoring
Check Cluster Health
# Endpoint health
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://192.168.1.10:2379,https://192.168.1.11:2379,https://192.168.1.12:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Endpoint status
ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://192.168.1.10:2379,https://192.168.1.11:2379,https://192.168.1.12:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--write-out=table
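# Check for active alarms (for example NOSPACE when the backend quota is exceeded)
ETCDCTL_API=3 etcdctl alarm list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key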
# Member list
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--write-out=table
Prometheus Alerts for etcd
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-alerts
  namespace: monitoring
spec:
  groups:
    - name: etcd
      rules:
        - alert: EtcdMembersDown
          expr: |
            max without (endpoint) (
              sum without (instance) (up{job="etcd"} == bool 0)
            or
              count without (To) (
                sum without (instance) (rate(etcd_network_peer_sent_failures_total[120s])) > 0.01
              )
            ) > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "etcd cluster members are down"
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "etcd cluster has no leader"
        - alert: EtcdHighNumberOfFailedGRPCRequests
          expr: |
            sum(rate(grpc_server_handled_total{job="etcd", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m]))
              /
            sum(rate(grpc_server_handled_total{job="etcd"}[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High rate of failed gRPC requests"
        - alert: EtcdDatabaseQuotaLow
          expr: |
            (etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd database quota usage above 80%"
Backup Verification Script
#!/bin/bash
# /usr/local/bin/verify-etcd-backup.sh
SNAPSHOT=$1
TEMP_DIR=$(mktemp -d)
echo "Verifying snapshot: ${SNAPSHOT}"
# Check snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status ${SNAPSHOT} --write-out=table
# Test restore to temp directory
ETCDCTL_API=3 etcdctl snapshot restore ${SNAPSHOT} \
--data-dir=${TEMP_DIR}/etcd \
--name=test-restore \
--initial-cluster=test-restore=http://localhost:2380 \
--initial-cluster-token=test-token \
--initial-advertise-peer-urls=http://localhost:2380
if [ $? -eq 0 ]; then
echo "β Snapshot is valid and restorable"
rm -rf ${TEMP_DIR}
exit 0
else
echo "β Snapshot verification failed"
rm -rf ${TEMP_DIR}
exit 1
fi
Disaster Recovery Runbook
## etcd Disaster Recovery Procedure
### 1. Assess Situation
- [ ] Check which nodes are affected
- [ ] Verify backup availability
- [ ] Document current cluster state
### 2. Prepare for Restore
- [ ] SSH to all control plane nodes
- [ ] Stop kube-apiserver on all nodes
- [ ] Stop etcd on all nodes
- [ ] Backup existing /var/lib/etcd directories
### 3. Restore etcd
- [ ] Download latest verified backup
- [ ] Run restore command on each node
- [ ] Set correct file permissions
- [ ] Start etcd on all nodes
### 4. Verify Restore
- [ ] Check etcd member list
- [ ] Verify endpoint health
- [ ] Start kube-apiserver
- [ ] Run kubectl get nodes
- [ ] Verify all workloads
### 5. Post-Recovery
- [ ] Document incident
- [ ] Review backup schedule
- [ ] Update runbook if needed
Summary
Regular etcd backups are essential for Kubernetes disaster recovery. Automate backups with cron jobs, store them securely off-cluster, and regularly test your restore procedures to ensure they work when needed.
Go Further with Kubernetes Recipes
Love this recipe? There's so much more! This is just one of 100+ hands-on recipes in our comprehensive Kubernetes Recipes book.
Inside the book, you'll master:
- Production-ready deployment strategies
- Advanced networking and security patterns
- Observability, monitoring, and troubleshooting
- Real-world best practices from industry experts
"The practical, recipe-based approach made complex Kubernetes concepts finally click for me."
Get Your Copy Now and start building production-grade Kubernetes skills today!