How to Backup and Restore etcd
Protect your Kubernetes cluster with etcd backup strategies. Learn to create snapshots, automate backups, and restore etcd data for disaster recovery.
💡 Quick Answer: Backup etcd:
ETCDCTL_API=3 etcdctl snapshot save backup.db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key. Restore: stop kube-apiserver, etcdctl snapshot restore backup.db --data-dir=/var/lib/etcd-new, update etcd config, restart. For managed K8s (EKS/GKE/AKS), use the provider's backup features instead.
The Problem
Your Kubernetes cluster's entire state is stored in etcd. Without proper backups, a corrupted or lost etcd database means losing all cluster configuration, secrets, and resource definitions.
The Solution
Implement a robust etcd backup strategy with regular snapshots, secure storage, and tested restore procedures.
etcd Architecture Overview
flowchart TB
subgraph cluster["☸️ KUBERNETES CLUSTER"]
subgraph controlplane["🎛️ CONTROL PLANE"]
APIServer["🌐 API Server"]
Controllers["⚙️ Controllers"]
Scheduler["📅 Scheduler"]
subgraph etcd_db["💾 etcd"]
data["📦 Data:<br/>- Pods<br/>- Services<br/>- ConfigMaps<br/>- Secrets"]
end
APIServer --> etcd_db
APIServer --> Controllers
Controllers --> Scheduler
etcd_db --> Snapshot["💿 SNAPSHOT<br/>(Backup)"]
end
end

Step 1: Install etcdctl
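etcdctl should match the running server's minor version. As a sketch (kubeadm layout assumed; on a real node the manifest lives at /etc/kubernetes/manifests/etcd.yaml), the version can be read from the static pod manifest's image tag before downloading:

```shell
#!/bin/sh
# Sketch: read the etcd version from a kubeadm static pod manifest.
# The helper and sample manifest below are illustrative.

etcd_version_from_manifest() {
  # Match "image: ...etcd:3.5.11-0" and keep only the "3.5.11" part.
  grep -m1 'image:' "$1" | sed -E 's/.*etcd:([0-9]+\.[0-9]+\.[0-9]+).*/\1/'
}

# Demo against a sample manifest fragment (on a real control plane node,
# pass /etc/kubernetes/manifests/etcd.yaml instead):
sample=$(mktemp)
cat > "$sample" <<'EOF'
spec:
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.11-0
EOF
etcd_version_from_manifest "$sample"   # prints 3.5.11
rm -f "$sample"
```

The printed value can feed straight into the ETCD_VERSION variable used in the download step below.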
On Control Plane Node
# Download etcdctl matching your etcd version
ETCD_VERSION=v3.5.11
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz
tar xzf etcd-${ETCD_VERSION}-linux-amd64.tar.gz
sudo mv etcd-${ETCD_VERSION}-linux-amd64/etcdctl /usr/local/bin/
# Verify installation
etcdctl version

Find etcd Connection Details
# Get etcd pod info
kubectl get pods -n kube-system -l component=etcd
# View etcd configuration
kubectl describe pod etcd-controlplane -n kube-system | grep -A 20 Command
# Common paths (kubeadm clusters)
# Certificates: /etc/kubernetes/pki/etcd/
# Data directory: /var/lib/etcd

Step 2: Create etcd Snapshot
Manual Snapshot
# Set environment variables
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
# Create snapshot
etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
# Verify snapshot
etcdctl snapshot status /backup/etcd-snapshot-20260128-120000.db --write-out=table

One-liner Backup Command
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key

Step 3: Automate Backups with CronJob
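Cron has no built-in overlap protection: if a snapshot ever runs long, the next scheduled run can start on top of it. A common guard (sketch, using util-linux `flock`; lock path illustrative) skips the run instead of queueing it:

```shell
#!/bin/sh
# Sketch: skip a backup run if the previous one still holds the lock.
# flock(1) is from util-linux; the lock path is illustrative.
LOCK=/tmp/etcd-backup.lock

# Hold the lock in the background to simulate a long-running backup.
flock "$LOCK" sleep 2 &
sleep 0.2   # give the background job time to acquire the lock

# A second, non-blocking (-n) attempt fails instead of piling up.
if flock -n "$LOCK" echo "backup started"; then
  echo "ran"
else
  echo "skipped: previous backup still running"
fi
wait
```

In a crontab entry this becomes a one-word prefix: `flock -n /var/lock/etcd-backup.lock /usr/local/bin/etcd-backup.sh`.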
Backup Script
#!/bin/bash
# /usr/local/bin/etcd-backup.sh
set -e
BACKUP_DIR="/backup/etcd"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_NAME="etcd-snapshot-${TIMESTAMP}.db"
# Create backup directory
mkdir -p ${BACKUP_DIR}
# Create snapshot
ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_DIR}/${SNAPSHOT_NAME} \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_DIR}/${SNAPSHOT_NAME}
# Compress snapshot
gzip ${BACKUP_DIR}/${SNAPSHOT_NAME}
# Clean up old backups
find ${BACKUP_DIR} -name "etcd-snapshot-*.db.gz" -mtime +${RETENTION_DAYS} -delete
echo "Backup completed: ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz"

Cron Schedule
# Add to crontab
sudo crontab -e
# Backup every 6 hours
0 */6 * * * /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1

Kubernetes CronJob (Alternative)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: backup
            image: bitnami/etcd:3.5
            command:
            - /bin/sh
            - -c
            - |
              TIMESTAMP=$(date +%Y%m%d-%H%M%S)
              etcdctl snapshot save /backup/etcd-snapshot-${TIMESTAMP}.db \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
              # Upload to cloud storage (note: the aws CLI is not included
              # in the bitnami/etcd image; use a custom image or a separate
              # upload step)
              aws s3 cp /backup/etcd-snapshot-${TIMESTAMP}.db s3://my-backup-bucket/etcd/
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
            env:
            - name: ETCDCTL_API
              value: "3"
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /backup/etcd

Step 4: Store Backups Securely
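Wherever the copies land, it helps to store a checksum alongside each snapshot so a later restore can prove the file survived transit intact. A minimal sketch (filenames illustrative):

```shell
#!/bin/sh
# Sketch: write a SHA-256 sidecar next to each snapshot and verify it
# before any restore. Filenames here are illustrative stand-ins.
SNAPSHOT=/tmp/etcd-snapshot-demo.db
printf 'demo snapshot bytes' > "$SNAPSHOT"   # stand-in for a real snapshot

# At backup time: record the checksum next to the snapshot.
sha256sum "$SNAPSHOT" > "${SNAPSHOT}.sha256"

# At restore time: refuse to proceed if the file changed.
if sha256sum -c "${SNAPSHOT}.sha256" >/dev/null 2>&1; then
  echo "checksum OK"
else
  echo "checksum MISMATCH - do not restore this file" >&2
  exit 1
fi
```

Upload the `.sha256` sidecar together with the snapshot; S3, Azure Blob, and GCS all store it as just another object.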
Upload to S3
#!/bin/bash
# Add to backup script
# Upload to S3
aws s3 cp ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz s3://my-cluster-backups/etcd/
# Encrypt with KMS
aws s3 cp ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz \
s3://my-cluster-backups/etcd/ \
--sse aws:kms \
--sse-kms-key-id alias/etcd-backup-key

Upload to Azure Blob
# Upload to Azure Blob Storage
az storage blob upload \
--account-name mystorageaccount \
--container-name etcd-backups \
--name ${SNAPSHOT_NAME}.gz \
--file ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz

Upload to GCS
# Upload to Google Cloud Storage
gsutil cp ${BACKUP_DIR}/${SNAPSHOT_NAME}.gz gs://my-cluster-backups/etcd/

Step 5: Restore etcd from Snapshot
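A restore must land in an empty or brand-new data directory; restoring on top of existing member data is how restores go wrong. A small pre-flight guard (sketch; on kubeadm clusters the real directory is /var/lib/etcd) makes the checklist below harder to skip:

```shell
#!/bin/sh
# Sketch: refuse to restore into a non-empty data directory.
# On kubeadm clusters the real directory is /var/lib/etcd.

safe_to_restore() {
  # Succeeds only if the directory is absent or empty.
  [ ! -d "$1" ] || [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

# Demo with a throwaway directory:
demo=$(mktemp -d)
safe_to_restore "$demo" && echo "empty dir: ok to restore"
touch "$demo/member"
safe_to_restore "$demo" || echo "non-empty dir: move it aside first"
rm -rf "$demo"
```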
Pre-Restore Checklist
# 1. Stop kube-apiserver (on all control plane nodes)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/
# 2. Stop etcd (on all control plane nodes)
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/
# 3. Wait for the pods to terminate (kubectl stops working once the
#    API server is down, so check the container runtime directly)
sudo crictl ps | grep -e etcd -e kube-apiserver
# Should return no results
# 4. Backup current data directory
sudo mv /var/lib/etcd /var/lib/etcd.backup

Restore Snapshot
# Restore to new data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20260128-120000.db \
--data-dir=/var/lib/etcd \
--name=controlplane \
--initial-cluster=controlplane=https://192.168.1.10:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://192.168.1.10:2380
# Set correct ownership (kubeadm static pods run etcd as root; the
# etcd:etcd user applies when etcd runs as a systemd service)
sudo chown -R etcd:etcd /var/lib/etcd

Post-Restore Steps
# 1. Restore etcd manifest
sudo mv /etc/kubernetes/etcd.yaml /etc/kubernetes/manifests/
# 2. Wait for etcd to start
sudo crictl ps | grep etcd
# 3. Verify etcd health
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# 4. Restore kube-apiserver manifest
sudo mv /etc/kubernetes/kube-apiserver.yaml /etc/kubernetes/manifests/
# 5. Verify cluster health
kubectl get nodes
kubectl get pods -A

Multi-Node etcd Cluster Restore
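The per-node restore commands below differ only in `--name` and the advertise URL, while `--initial-cluster` must be byte-for-byte identical on every node. A sketch that generates all three commands from one node table (names and IPs are this article's examples) avoids copy-paste drift:

```shell
#!/bin/sh
# Sketch: generate per-node restore commands from a single node table,
# so the shared --initial-cluster string is built exactly once.
NODES="node1=192.168.1.10 node2=192.168.1.11 node3=192.168.1.12"

# Build the shared --initial-cluster value.
CLUSTER=""
for n in $NODES; do
  name=${n%%=*}; ip=${n#*=}
  CLUSTER="${CLUSTER:+$CLUSTER,}${name}=https://${ip}:2380"
done

# Emit one restore command per node.
for n in $NODES; do
  name=${n%%=*}; ip=${n#*=}
  cat <<EOF
# Run on ${name} (${ip}):
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \\
  --data-dir=/var/lib/etcd \\
  --name=${name} \\
  --initial-cluster=${CLUSTER} \\
  --initial-cluster-token=etcd-cluster-1 \\
  --initial-advertise-peer-urls=https://${ip}:2380

EOF
done
```

Pipe the output to a file and run each block on its node, or compare it against the hand-written commands below.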
For Each Control Plane Node
# Node 1 (192.168.1.10)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=node1 \
--initial-cluster=node1=https://192.168.1.10:2380,node2=https://192.168.1.11:2380,node3=https://192.168.1.12:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://192.168.1.10:2380
# Node 2 (192.168.1.11)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=node2 \
--initial-cluster=node1=https://192.168.1.10:2380,node2=https://192.168.1.11:2380,node3=https://192.168.1.12:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://192.168.1.11:2380
# Node 3 (192.168.1.12)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=node3 \
--initial-cluster=node1=https://192.168.1.10:2380,node2=https://192.168.1.11:2380,node3=https://192.168.1.12:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://192.168.1.12:2380

etcd Health Monitoring
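Health checks are ultimately about quorum: an n-member cluster needs floor(n/2)+1 healthy members and tolerates floor((n-1)/2) failures, which is why etcd clusters use odd sizes. A quick arithmetic helper (sketch):

```shell
#!/bin/sh
# Sketch: quorum arithmetic for an etcd cluster of N members.
quorum() { echo $(( $1 / 2 + 1 )); }
failures_tolerated() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum $n) tolerates=$(failures_tolerated $n)"
done
# members=1 quorum=1 tolerates=0
# members=3 quorum=2 tolerates=1
# members=5 quorum=3 tolerates=2
```

If the health checks below show fewer healthy endpoints than the quorum number, the cluster has lost write availability and you are in restore territory.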
Check Cluster Health
# Endpoint health
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://192.168.1.10:2379,https://192.168.1.11:2379,https://192.168.1.12:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Endpoint status
ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://192.168.1.10:2379,https://192.168.1.11:2379,https://192.168.1.12:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--write-out=table
# Member list
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--write-out=table

Prometheus Alerts for etcd
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-alerts
  namespace: monitoring
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdMembersDown
      expr: |
        max without (endpoint) (
          sum without (instance) (up{job="etcd"} == bool 0)
        or
          count without (To) (
            sum without (instance) (rate(etcd_network_peer_sent_failures_total[120s])) > 0.01
          )
        ) > 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "etcd cluster members are down"
    - alert: EtcdNoLeader
      expr: etcd_server_has_leader == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "etcd cluster has no leader"
    - alert: EtcdHighNumberOfFailedGRPCRequests
      expr: |
        sum(rate(grpc_server_handled_total{job="etcd", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m]))
        / sum(rate(grpc_server_handled_total{job="etcd"}[5m])) > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High rate of failed gRPC requests"
    - alert: EtcdDatabaseQuotaLow
      expr: |
        (etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "etcd database quota usage above 80%"

Backup Verification Script
#!/bin/bash
# /usr/local/bin/verify-etcd-backup.sh
SNAPSHOT=$1
TEMP_DIR=$(mktemp -d)
echo "Verifying snapshot: ${SNAPSHOT}"
# Check snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status ${SNAPSHOT} --write-out=table
# Test restore to temp directory
ETCDCTL_API=3 etcdctl snapshot restore ${SNAPSHOT} \
--data-dir=${TEMP_DIR}/etcd \
--name=test-restore \
--initial-cluster=test-restore=http://localhost:2380 \
--initial-cluster-token=test-token \
--initial-advertise-peer-urls=http://localhost:2380
if [ $? -eq 0 ]; then
echo "✅ Snapshot is valid and restorable"
rm -rf ${TEMP_DIR}
exit 0
else
echo "❌ Snapshot verification failed"
rm -rf ${TEMP_DIR}
exit 1
fi

Disaster Recovery Runbook
## etcd Disaster Recovery Procedure
### 1. Assess Situation
- [ ] Check which nodes are affected
- [ ] Verify backup availability
- [ ] Document current cluster state
### 2. Prepare for Restore
- [ ] SSH to all control plane nodes
- [ ] Stop kube-apiserver on all nodes
- [ ] Stop etcd on all nodes
- [ ] Backup existing /var/lib/etcd directories
### 3. Restore etcd
- [ ] Download latest verified backup
- [ ] Run restore command on each node
- [ ] Set correct file permissions
- [ ] Start etcd on all nodes
### 4. Verify Restore
- [ ] Check etcd member list
- [ ] Verify endpoint health
- [ ] Start kube-apiserver
- [ ] Run kubectl get nodes
- [ ] Verify all workloads
### 5. Post-Recovery
- [ ] Document incident
- [ ] Review backup schedule
- [ ] Update runbook if needed

Summary
Regular etcd backups are essential for Kubernetes disaster recovery. Automate backups with cron jobs, store them securely off-cluster, and regularly test your restore procedures to ensure they work when needed.
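Testing restores also means noticing when backups silently stop arriving. A freshness check like this (sketch; directory name and threshold are illustrative) can run from cron or a monitoring agent:

```shell
#!/bin/sh
# Sketch: alert if the newest snapshot is older than MAX_AGE_HOURS.
# BACKUP_DIR and the threshold are illustrative defaults.
BACKUP_DIR="${BACKUP_DIR:-/tmp/etcd-backup-demo}"
MAX_AGE_HOURS="${MAX_AGE_HOURS:-8}"

mkdir -p "$BACKUP_DIR"
touch "$BACKUP_DIR/etcd-snapshot-demo.db.gz"   # stand-in for a real backup

# find -mmin: anything modified within the window counts as fresh.
fresh=$(find "$BACKUP_DIR" -name 'etcd-snapshot-*.gz' -mmin -$((MAX_AGE_HOURS * 60)) | head -n1)
if [ -n "$fresh" ]; then
  echo "OK: recent backup found: $fresh"
else
  echo "ALERT: no backup newer than ${MAX_AGE_HOURS}h in $BACKUP_DIR" >&2
  exit 1
fi
```

Wire the failure branch into whatever already pages you; a backup schedule nobody watches is indistinguishable from no backups.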
📘 Go Further with Kubernetes Recipes
Love this recipe? There's so much more! This is just one of 100+ hands-on recipes in our comprehensive Kubernetes Recipes book.
Inside the book, you'll master:
- ✅ Production-ready deployment strategies
- ✅ Advanced networking and security patterns
- ✅ Observability, monitoring, and troubleshooting
- ✅ Real-world best practices from industry experts
"The practical, recipe-based approach made complex Kubernetes concepts finally click for me."
👉 Get Your Copy Now → Start building production-grade Kubernetes skills today!
