Chaos Mesh Fault Injection on Kubernetes
Deploy Chaos Mesh for chaos engineering on Kubernetes. Covers PodChaos, NetworkChaos, IOChaos, and StressChaos experiments, scheduling, and RBAC.
💡 Quick Answer: Chaos Mesh is a CNCF incubating project that injects faults into Kubernetes workloads (Pod kills, network delays, disk I/O errors, CPU/memory stress) via CRDs. Install with Helm, define experiments as YAML, scope with namespace selectors, and integrate into CI/CD for automated resilience testing.
The Problem
You can't know if your application is resilient until something breaks:
- What happens when a Pod is killed mid-request?
- How does your app behave with 200ms network latency?
- Does your database failover work when the primary Pod dies?
- Will your HPA react fast enough under CPU stress?
Testing these manually is slow, inconsistent, and scary in production.
The Solution
Install Chaos Mesh
# Add Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
# Install (with dashboard)
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.create=true \
  --set dashboard.securityMode=true
# Verify
kubectl get pods -n chaos-mesh
# NAME                           READY   STATUS
# chaos-controller-manager-xxx   1/1     Running
# chaos-daemon-xxxxx             1/1     Running   (DaemonSet)
# chaos-dashboard-xxx            1/1     Running
PodChaos: Kill Pods
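# Kill random Pods to test self-healing
# Runs every 10 minutes during business hours (Mon-Fri, 09:00-17:50).
# Chaos Mesh 2.x schedules recurring experiments with the Schedule CRD;
# the inline `scheduler` field belongs to the 1.x API.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-test
  namespace: chaos-mesh
spec:
  schedule: "*/10 9-17 * * 1-5"
  concurrencyPolicy: Forbid   # Skip a run if the previous one is still active
  historyLimit: 2             # Keep the last two finished experiments
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one                 # Kill one random matching Pod
    selector:
      namespaces:
        - production
      labelSelectors:
        app: my-api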
---
# Pod failure: make Pods unavailable for a fixed time instead of deleting them
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
  namespace: chaos-mesh
spec:
  action: pod-failure
  mode: fixed-percent
  value: "30"        # Fail 30% of matching Pods
  duration: "60s"    # Pods stay failed for 60s
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
NetworkChaos: Latency, Loss, Partition
# Add 200ms latency to traffic from the API to the database
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-test
  namespace: chaos-mesh
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
  delay:
    latency: "200ms"
    correlation: "50"   # 50% correlation between packets
    jitter: "50ms"      # ±50ms variation
  direction: to         # Only outgoing traffic
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: my-database   # Delay only traffic TO database
    mode: all
  duration: "5m"
---
# Network partition between frontend and backend
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-test
  namespace: chaos-mesh
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      tier: frontend
  direction: both
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        tier: backend
    mode: all
  duration: "2m"
---
# 10% packet loss
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-test
  namespace: chaos-mesh
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
  loss:
    loss: "10"
    correlation: "25"
  duration: "5m"
StressChaos: CPU and Memory Pressure
# CPU stress: test HPA reaction time
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-test
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
  stressors:
    cpu:
      workers: 4   # 4 CPU-burning threads
      load: 80     # Target 80% CPU usage
  duration: "10m"
---
# Memory stress: test OOMKill recovery
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-test
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
  stressors:
    memory:
      workers: 2
      size: "512MB"   # Allocate 512MB
  duration: "5m"
IOChaos: Disk Faults
# Inject I/O latency on filesystem operations
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-test
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-database
  volumePath: /var/lib/postgresql/data
  path: "**/*"
  delay: "100ms"
  percent: 50   # 50% of I/O operations affected
  duration: "5m"
---
# I/O errors (simulate disk failure)
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-error-test
  namespace: chaos-mesh
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-database
  volumePath: /var/lib/postgresql/data
  path: "**/*.dat"
  errno: 5      # EIO: input/output error
  percent: 10   # 10% of operations fail
  duration: "2m"
Workflow: Multi-Step Chaos Scenarios
# Sequential chaos: network delay → pod kill → verify recovery
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: resilience-workflow
  namespace: chaos-mesh
spec:
  entry: resilience-test
  templates:
    - name: resilience-test
      templateType: Serial
      children:
        - network-degradation
        - pod-kill-primary
        - verify-recovery
    - name: network-degradation
      templateType: NetworkChaos
      deadline: "5m"
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces: [production]
          labelSelectors: {app: my-api}
        delay:
          latency: "100ms"
    - name: pod-kill-primary
      templateType: PodChaos
      deadline: "30s"
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces: [production]
          labelSelectors: {app: my-database, role: primary}
    - name: verify-recovery
      templateType: Suspend
      deadline: "2m"   # Wait for recovery, check manually or via webhook
RBAC: Scope Chaos to Namespaces
# Limit who can run chaos: this Role grants chaos permissions in staging only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-runner
  namespace: staging
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["create", "delete", "get", "list", "watch"]
# Chaos Mesh namespace annotation: protect critical namespaces
# Add to namespaces that should NEVER be targeted.
# Verify this against your Chaos Mesh version: the upstream docs describe an
# allow-list model instead (controllerManager.enableFilterNamespace=true, then
# annotate permitted namespaces with chaos-mesh.org/inject=enabled).
kubectl annotate namespace kube-system \
  chaos-mesh.org/inject=disabled
kubectl annotate namespace chaos-mesh \
  chaos-mesh.org/inject=disabled
Common Issues
Chaos experiment stuck in "Running" after duration
- Cause: chaos-daemon on the target node crashed or restarted
- Fix: Delete the experiment; check the chaos-daemon Pod logs
IOChaos not working
- Cause: Container filesystem not using the expected mount path
- Fix: Check that volumePath matches the actual mount path inside the container
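One way to check the mount path is to put the workload's volumeMounts next to the experiment's volumePath. The sketch below is illustrative only: the StatefulSet name, the container name postgres, the image tag, and the volume name data are assumptions, not part of the recipe above. Only the app: my-database label and the /var/lib/postgresql/data path come from the earlier examples.

```yaml
# Hypothetical workload fragment: the data volume is mounted at
# /var/lib/postgresql/data inside the "postgres" container.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-database            # assumed name
  namespace: production
spec:
  serviceName: my-database
  replicas: 1
  selector:
    matchLabels:
      app: my-database
  template:
    metadata:
      labels:
        app: my-database
    spec:
      containers:
        - name: postgres       # assumed container name
          image: postgres:16   # assumed image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data   # must equal volumePath below
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
---
# The experiment points at the same path and names the container explicitly.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-test
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-database
  containerNames:
    - postgres                 # assumed; restricts injection to this container
  volumePath: /var/lib/postgresql/data
  path: "**/*"
  delay: "100ms"
  percent: 50
  duration: "5m"
```

If the two paths differ, for example because the image mounts its data directory elsewhere, the experiment has nothing to attach to, which matches the symptom described above.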
NetworkChaos affects all Pods, not just target
- Cause: Selector too broad; missing target field
- Fix: Use target.selector to scope the other end of the connection
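As a sketch of that fix, the packet-loss experiment from earlier can be narrowed so that only traffic from the API Pods to the database Pods is affected. The experiment name is hypothetical; the labels and namespaces are reused from the examples above.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-to-database   # hypothetical name
  namespace: chaos-mesh
spec:
  action: loss
  mode: all
  selector:                       # source side: the API Pods
    namespaces:
      - production
    labelSelectors:
      app: my-api
  loss:
    loss: "10"
    correlation: "25"
  direction: to                   # only traffic going from selector to target
  target:                         # destination side: the database Pods
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        app: my-database
  duration: "5m"
```

Without the target block, the loss rule applies to all traffic of the selected Pods, which is usually what makes the blast radius look larger than intended.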
Best Practices
- Start in staging: never run untested experiments in production
- Always set a duration: experiments without one keep running until you delete them
- Protect system namespaces: annotate with chaos-mesh.org/inject=disabled
- Use Workflows for multi-step scenarios: serial or parallel
- Monitor during chaos: watch Prometheus/Grafana for SLO violations
- Integrate in CI/CD: run chaos tests before promoting to production
- Start small: one Pod, short duration, then expand
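The Workflow shown earlier is serial. For the parallel case, a minimal sketch might look like the following, assuming a staging copy of the my-api workload; the workflow and template names are hypothetical, and the values are deliberately small to match the "start small" advice.

```yaml
# Run network latency and CPU stress at the same time against staging.
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: parallel-pressure-workflow   # hypothetical name
  namespace: chaos-mesh
spec:
  entry: combined-pressure
  templates:
    - name: combined-pressure
      templateType: Parallel         # children start together
      deadline: "5m"
      children:
        - api-latency
        - api-cpu-stress
    - name: api-latency
      templateType: NetworkChaos
      deadline: "3m"
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces: [staging]
          labelSelectors: {app: my-api}
        delay:
          latency: "100ms"
    - name: api-cpu-stress
      templateType: StressChaos
      deadline: "3m"
      stressChaos:
        mode: one
        selector:
          namespaces: [staging]
          labelSelectors: {app: my-api}
        stressors:
          cpu:
            workers: 1
            load: 50
```

A Parallel template can itself be listed as a child of a Serial template, so a combined-pressure step like this one can sit between other steps of a longer scenario.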
Key Takeaways
- Chaos Mesh injects faults via CRDs: PodChaos, NetworkChaos, IOChaos, StressChaos
- Every experiment needs a selector (what to target); timed faults also need a duration (how long)
- mode: one, all, fixed, fixed-percent, random-max-percent
- NetworkChaos supports delay, loss, partition, duplicate, corrupt
- Workflows chain multiple experiments into resilience test suites
- Protect namespaces with the chaos-mesh.org/inject=disabled annotation
- Dashboard provides visual experiment management and status
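The examples above only use the one, all, and fixed-percent modes. As a sketch of the remaining two, assuming the same my-api labels and hypothetical experiment names, value is read as a Pod count for fixed and as an upper percentage bound for random-max-percent:

```yaml
# fixed: fail exactly two matching Pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-fixed        # hypothetical name
  namespace: chaos-mesh
spec:
  action: pod-failure
  mode: fixed
  value: "2"                     # number of Pods
  duration: "60s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
---
# random-max-percent: fail a random share of matching Pods, at most 50%
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-random       # hypothetical name
  namespace: chaos-mesh
spec:
  action: pod-failure
  mode: random-max-percent
  value: "50"                    # upper bound, in percent
  duration: "60s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
```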
