DNS Autoscaling and CoreDNS Scaling
Scale CoreDNS horizontally with dns-autoscaler and proportional autoscaling. Tune cache size, configure node-local DNS cache.
π‘ Quick Answer: Deploy
cluster-proportional-autoscalerto scale CoreDNS replicas with cluster size. For large clusters (500+ nodes), add NodeLocal DNS Cache β each node runs a local DNS cache that reduces CoreDNS load by 90%.
The Problem
At scale (500+ nodes, 10K+ pods), CoreDNS becomes a bottleneck. Two default replicas canβt handle the DNS query volume, causing 5-second resolution timeouts, failed service discovery, and cascading application failures.
The Solution
Proportional DNS Autoscaler
apiVersion: apps/v1
kind: Deployment
metadata:
name: dns-autoscaler
namespace: kube-system
spec:
replicas: 1
template:
spec:
containers:
- name: autoscaler
image: registry.k8s.io/cpa/cluster-proportional-autoscaler:v1.8.9
command:
- /cluster-proportional-autoscaler
- --namespace=kube-system
- --configmap=dns-autoscaler
- --target=deployment/coredns
- --logtostderr=true
- --v=2
---
apiVersion: v1
kind: ConfigMap
metadata:
name: dns-autoscaler
namespace: kube-system
data:
linear: |
{
"coresPerReplica": 256,
"nodesPerReplica": 16,
"min": 2,
"max": 20,
"preventSinglePointFailure": true
}This scales CoreDNS: 1 replica per 16 nodes or 256 cores (whichever produces more replicas).
NodeLocal DNS Cache
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-local-dns
namespace: kube-system
spec:
template:
spec:
hostNetwork: true
containers:
- name: node-cache
image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1
args:
- -localip
- "169.254.20.10"
- -conf
- /etc/Corefile
- -upstreamsvc
- kube-dnsPods resolve DNS via the local cache (169.254.20.10) instead of the cluster CoreDNS service. Cache hits never leave the node.
graph TD
POD[Pod DNS Query] -->|169.254.20.10| LOCAL[NodeLocal DNS Cache<br/>on same node]
LOCAL -->|Cache hit| FAST[β
<1ms response]
LOCAL -->|Cache miss| COREDNS[CoreDNS Deployment<br/>2-20 replicas]
COREDNS -->|cluster.local| K8S[Kubernetes API]
COREDNS -->|external| UPSTREAM[Upstream DNS]
SCALER[Proportional Autoscaler] -->|Scale based on<br/>node/core count| COREDNSCommon Issues
DNS 5-second timeouts at scale
Usually conntrack table race condition (SNAT collision). NodeLocal DNS Cache eliminates this by using the nodeβs loopback.
CoreDNS OOMKilled
Increase memory limits. Each CoreDNS pod caches data β larger clusters need more memory. Set cache plugin size appropriately.
Best Practices
- Proportional autoscaler as baseline β 1 replica per 16 nodes
- NodeLocal DNS Cache for 500+ node clusters β eliminates conntrack race conditions
- CoreDNS memory: 170Mi base + 70Mi per 1000 pods β adjust limits accordingly
- Monitor
coredns_dns_request_duration_secondsβ alert on p99 > 100ms - Set
ndots: 2in pod dnsConfig β reduces unnecessary search domain lookups
Key Takeaways
- Default 2 CoreDNS replicas is insufficient for clusters above ~100 nodes
- Proportional autoscaler scales CoreDNS based on cluster size automatically
- NodeLocal DNS Cache eliminates 90% of CoreDNS traffic and conntrack race conditions
- DNS is the #1 silent failure point at scale β monitor latency and error rates
ndots: 2in pod DNS config reduces query amplification from 5x to 1-2x

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
