Troubleshooting • advanced • 15 minutes • K8s 1.28+

Fix etcd High Latency and Slow API Server

Debug etcd performance issues causing slow kubectl responses and API server timeouts. Covers disk I/O, compaction, defragmentation, and leader elections.

By Luca Berton • 5 min read

💡 Quick Answer: Slow kubectl responses and API server timeouts are often caused by etcd disk latency. etcd needs a 99th percentile fsync latency under 10ms for stable operation. Check member status and DB size with etcdctl endpoint status, and monitor etcd_disk_wal_fsync_duration_seconds. Fix: use fast SSDs, run compaction, defragment, or reduce object count.

Key insight: etcd stores ALL cluster state. If etcd is slow, everything is slow: scheduling, pod creation, service discovery, everything.

The Problem

# kubectl responses are slow (>5 seconds)
$ time kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
myapp-7b9f5c6d4-x2k8j  1/1     Running   0          1h
real    0m8.234s

# API server logs show etcd errors
# "etcdserver: request timed out"
# "slow fdatasync"
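To put a number on "slow", a rough sketch like the one below can time any command in milliseconds. The helper name time_ms is hypothetical, and the sketch assumes GNU date (for %N nanoseconds).

```shell
# time_ms: run a command and print its wall-clock duration in milliseconds.
# Hypothetical helper; assumes GNU date, which supports %N (nanoseconds).
time_ms() {
  tm_start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  tm_end=$(date +%s%N)
  echo $(( (tm_end - tm_start) / 1000000 ))
}

# Example (not run here):
#   time_ms kubectl get pods
#   time_ms kubectl get --raw='/readyz/etcd'
```

Comparing the time for a normal request with the time for the API server's etcd readiness check can hint at whether the delay sits in etcd or elsewhere.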

The Solution

Step 1: Check etcd Health

# In OpenShift
oc exec -n openshift-etcd etcd-master-0 -c etcd -- \
  etcdctl endpoint health --cluster -w table

# Standard Kubernetes
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status -w table
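The health check also reports how long a test proposal took to commit, which is a quick latency signal. The sketch below filters that output for slow or unhealthy members. It assumes the default (non-table) output, where each line ends in "took = <duration>", and the function name slow_members is hypothetical.

```shell
# slow_members: read `etcdctl endpoint health` lines on stdin and print
# endpoints that are unhealthy or whose "took" time exceeds a limit in
# milliseconds (first argument, default 100).
# Assumes lines such as:
#   https://10.0.0.1:2379 is healthy: successfully committed proposal: took = 8.5ms
slow_members() {
  awk -v limit="${1:-100}" '
    / is unhealthy/ { print $1, "unhealthy"; next }
    /took = / {
      t = $NF
      num = t
      sub(/[^0-9.]+$/, "", num)            # numeric part, e.g. 8.5
      unit = substr(t, length(num) + 1)    # unit suffix, e.g. ms
      if (unit == "ms")      ms = num + 0
      else if (unit == "s")  ms = num * 1000
      else if (unit == "ns") ms = num / 1000000
      else                   ms = num / 1000   # treat anything else as microseconds
      if (ms > limit + 0) printf "%s %.1fms\n", $1, ms
    }'
}

# Example (not run here; add the same TLS flags as above):
#   etcdctl endpoint health --cluster 2>&1 | slow_members 100
```

The 2>&1 is there because some etcdctl versions may print health lines to stderr; treat the exact output format as an assumption and check it against your version.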

Step 2: Check Disk Latency

# On the etcd node: test fsync latency on the disk that holds the etcd data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd \
    --size=22m --bs=2300 --name=etcd-fsync-test

# Read the 99.00th value under "sync percentiles" in the fio output
# etcd needs: 99th percentile fsync < 10ms
# If > 10ms, your storage is too slow for etcd
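Reading the percentile by eye is error-prone, so the sketch below pulls the 99th percentile out of fio's text output and compares it with the 10ms guideline. It assumes fio's usual "sync percentiles (usec):" layout; the function names fsync_p99_ms and storage_ok are hypothetical.

```shell
# fsync_p99_ms: read fio text output on stdin and print the 99th percentile
# fdatasync latency in milliseconds.
# Assumes a "sync percentiles (usec|nsec|msec):" header followed by
# "| 99.00th=[ 2376], ..." lines.
fsync_p99_ms() {
  awk '
    /sync percentiles \(/ {
      unit = $0
      sub(/.*\(/, "", unit)    # drop everything up to "("
      sub(/\).*/, "", unit)    # drop ")" and the rest, leaving e.g. usec
      in_sync = 1
      next
    }
    in_sync && /99\.00th=\[/ {
      v = $0
      sub(/.*99\.00th=\[ */, "", v)
      sub(/\].*/, "", v)
      scale = (unit == "nsec") ? 1000000 : ((unit == "usec") ? 1000 : 1)
      printf "%.3f\n", v / scale
      exit
    }'
}

# storage_ok: succeed when a p99 value in ms is under the 10ms guideline.
storage_ok() {
  awk -v p="$1" 'BEGIN { exit !(p + 0 < 10) }'
}

# Example (not run here):
#   p99=$(fio ...same flags as above... | fsync_p99_ms)
#   storage_ok "$p99" || echo "p99 fsync ${p99}ms: storage too slow for etcd"
```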

Step 3: Run Compaction and Defragmentation

# Get current revision (assumes endpoints and TLS settings are already exported, e.g. ETCDCTL_ENDPOINTS)
REV=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')

# Compact old revisions
etcdctl compact "$REV"

# Defragment each member (one at a time!)
etcdctl defrag --endpoints=https://etcd-0:2379
etcdctl defrag --endpoints=https://etcd-1:2379
etcdctl defrag --endpoints=https://etcd-2:2379
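The "one at a time" rule can be sketched as a small loop that stops at the first failure and checks health before moving to the next member. The function name defrag_rolling is hypothetical, and the sketch assumes TLS settings are already exported (for example via ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY).

```shell
# defrag_rolling: defragment the given endpoints one at a time.
# Stops at the first failure so a problem on one member is not repeated
# on the others.
defrag_rolling() {
  for ep in "$@"; do
    echo "defragmenting $ep"
    etcdctl defrag --endpoints="$ep" || return 1
    # Confirm the member answers health checks again before continuing.
    etcdctl endpoint health --endpoints="$ep" || return 1
  done
}

# Example (not run here):
#   defrag_rolling https://etcd-0:2379 https://etcd-1:2379 https://etcd-2:2379
```

Defragmentation blocks the member while it runs, so a common suggestion is to pass the current leader last.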

Step 4: Check Database Size

# etcd DB should be < 4GB (upstream default quota is 2GB via --quota-backend-bytes; 8GB is the suggested maximum)
etcdctl endpoint status -w table
# Check DB SIZE column

# If approaching the limit, find which key prefixes hold the most keys
etcdctl get / --prefix --keys-only | grep -v '^$' | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -20

graph TD
    A[Slow API Server] --> B{etcd healthy?}
    B -->|Unhealthy| C[Check disk latency]
    B -->|Healthy but slow| D[Check DB size]
    C -->|fsync > 10ms| E[Use faster SSD storage]
    C -->|fsync OK| F[Check network between members]
    D -->|DB > 4GB| G[Compact and defragment]
    D -->|DB normal| H[Check leader elections]
    H -->|Frequent elections| I[Network or clock skew issue]
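For the database size check in Step 4, grouping keys by resource type is often easier to read than grouping by namespace. The sketch below assumes the Kubernetes key layout /registry/<resource>/..., and the function name keys_per_resource is hypothetical.

```shell
# keys_per_resource: read etcd keys on stdin (one per line, blank lines
# ignored) and print a key count per /registry/<resource> prefix,
# largest first.
keys_per_resource() {
  awk -F/ 'NF >= 3 { count["/" $2 "/" $3]++ }
           END { for (p in count) print count[p], p }' | sort -rn
}

# Example (not run here):
#   etcdctl get /registry --prefix --keys-only | keys_per_resource | head -20
```

A resource type with an unusually high count, such as events or a custom resource, is a likely candidate for cleanup. Custom resources are grouped by their API group in this sketch because the group is the second path component.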

Common Issues

Frequent leader elections

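Caused by network latency between etcd members, clock skew, or slow disk. Keep the round-trip time between members well below the heartbeat interval (100ms by default, with a 1000ms election timeout); under 50ms RTT is a reasonable target.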
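To see whether elections are actually happening, the etcd_server_leader_changes_seen_total counter can be sampled from a member's metrics endpoint. The sketch below extracts one unlabeled metric from Prometheus text output. The function name metric_value is hypothetical, and the metrics URL in the example is an assumption (kubeadm clusters usually expose it on 127.0.0.1:2381).

```shell
# metric_value: read Prometheus text-format metrics on stdin and print the
# value of the first sample whose name matches exactly.
# Only handles metrics without labels, which is enough for simple
# counters and gauges.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example (not run here; the URL is an assumption for kubeadm clusters):
#   curl -s http://127.0.0.1:2381/metrics | metric_value etcd_server_leader_changes_seen_total
#   sleep 300
#   curl -s http://127.0.0.1:2381/metrics | metric_value etcd_server_leader_changes_seen_total
```

If the counter rises between two samples taken a few minutes apart, the cluster elected a new leader in that window.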

etcd alarm NOSPACE

DB hit the quota limit. Emergency fix:

etcdctl alarm list
# alarm:NOSPACE
etcdctl compact $(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
etcdctl defrag
etcdctl alarm disarm
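After the emergency fix it helps to confirm that the alarm is gone and to see how close the DB still is to its quota. The helpers below are a minimal sketch with hypothetical names (has_alarm, quota_used_pct); the quota value in the example is an assumption and should match your --quota-backend-bytes setting.

```shell
# has_alarm: succeed if `etcdctl alarm list` output on stdin contains the
# named alarm (for example NOSPACE).
has_alarm() {
  grep -q "alarm:$1"
}

# quota_used_pct: print DB size as a whole percentage of the quota.
# Both arguments are in bytes.
quota_used_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Example (not run here; 2147483648 is the assumed 2GB default quota):
#   etcdctl alarm list | has_alarm NOSPACE && echo "NOSPACE still raised"
#   size=$(etcdctl endpoint status -w json | jq '.[0].Status.dbSize')
#   echo "quota used: $(quota_used_pct "$size" 2147483648)%"
```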

Slow after cluster upgrade

Upgrades can leave the DB fragmented. Run defrag on each member, one at a time, after each upgrade.

Best Practices

  • Use NVMe/SSD for etcd storage: etcd is latency-sensitive, not throughput-sensitive
  • Separate etcd disks from other I/O: don't share disks with container storage
  • Monitor etcd_disk_wal_fsync_duration_seconds: alert if 99th percentile > 10ms
  • Run compaction daily via cron or use --auto-compaction-retention=1h
  • Keep DB under 4GB: large DBs cause slow snapshots and leader elections
  • 3 or 5 members only: more members = slower consensus
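The member-count advice follows from quorum arithmetic: a write needs a majority of members, so an even member count adds replication work without tolerating more failures. A small sketch of that arithmetic (the function names are illustrative):

```shell
# quorum: number of members that must agree for a write to commit.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance: number of members that can fail while keeping quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for cluster sizes 1 through 7.
for n in 1 2 3 4 5 6 7; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Four members tolerate one failure, the same as three, and six tolerate two, the same as five, which is why odd sizes are the usual choice.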

Key Takeaways

  • etcd disk latency is one of the most common causes of slow Kubernetes API responses
  • Requires <10ms fsync latency at the 99th percentile: use dedicated SSDs
  • Compact and defragment regularly to keep DB size healthy
  • Monitor with Prometheus: etcd_disk_wal_fsync_duration_seconds and etcd_server_has_leader
  • NOSPACE alarm = emergency: compact and defrag immediately
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
