Objective
Chaos engineering is the discipline of probing system resilience by introducing controlled failure. This exercise walks the complete cycle: define a steady state, inject failure, measure the impact, stop the experiment, and translate gaps into concrete hardening work.
Prerequisites
- Kubernetes cluster with at least 3 nodes
- Helm installed
- kube-prometheus-stack deployed (for Grafana metrics during experiments)
- A sample workload to target (we'll deploy one in step 2)
- kubectl access with cluster-admin for Chaos Mesh CRD installation
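The prerequisites above can be sanity-checked up front. A small preflight helper (hypothetical; in a real run you would call it as `require kubectl helm`):

```shell
# require: verify each named CLI is on PATH; print "ok" if all are present,
# otherwise report the first missing command and return non-zero
require() {
  for c in "$@"; do
    command -v "$c" > /dev/null 2>&1 || { echo "missing: $c" >&2; return 1; }
  done
  echo "ok"
}

require sh awk   # demo with tools present on any POSIX system
## ok
```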
Chaos Experiment Types
| Type | CRD Kind | What It Tests |
|---|---|---|
| PodChaos | PodChaos | Pod kill tolerance, restart behavior, PDB effectiveness |
| NetworkChaos | NetworkChaos | Latency tolerance, timeout handling, circuit breaker behavior |
| StressChaos | StressChaos | Memory/CPU spike handling, HPA response time |
| DNSChaos | DNSChaos | DNS failure handling, service discovery resilience |
| IOChaos | IOChaos | Disk latency tolerance, write-ahead log resilience |
| TimeChaos | TimeChaos | Clock skew handling, certificate validation, cron scheduling |
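Every kind in the table is targeted the same way: a `selector` picks the pods, `mode` decides how many of the matches to hit, and most kinds accept a `duration`. A sketch of the shared stanza (illustrative fragment, not a complete experiment):

```yaml
# Common targeting fields shared by Chaos Mesh experiment kinds
spec:
  mode: one            # or: all, fixed, fixed-percent, random-max-percent
  selector:
    namespaces:
      - chaos-target
    labelSelectors:
      app: httpbin
  duration: "5m"       # how long the fault stays injected (where applicable)
```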
Steps
Install Chaos Mesh
```shell
# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh with dashboard enabled
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set dashboard.create=true \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --version 2.7.0 \
  --wait

# Verify all components are running
kubectl get pods -n chaos-mesh
## NAME                                        READY   STATUS
## chaos-controller-manager-5f6b9d7f4c-xkzpq   1/1     Running
## chaos-daemon-2vgkn                          1/1     Running   (DaemonSet)
## chaos-daemon-9rjpw                          1/1     Running
## chaos-daemon-vplmt                          1/1     Running
## chaos-dashboard-6f8c9b7d5-nqmrk             1/1     Running

# Access the Chaos Mesh dashboard
kubectl port-forward svc/chaos-dashboard \
  -n chaos-mesh 2333:2333 &
# Open http://localhost:2333
```
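Recent Chaos Mesh releases run the dashboard in security mode by default, so the login page asks for an RBAC token rather than opening directly. A minimal read-only account might look like the following (the names here are illustrative; the dashboard's login screen can also generate an equivalent manifest for you):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-dashboard-viewer   # hypothetical name
  namespace: chaos-mesh
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-dashboard-viewer
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-dashboard-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-dashboard-viewer
subjects:
  - kind: ServiceAccount
    name: chaos-dashboard-viewer
    namespace: chaos-mesh
```

On Kubernetes 1.24+, fetch a token for the login form with `kubectl create token chaos-dashboard-viewer -n chaos-mesh`.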
Deploy the target application
Deploy a simple HTTP server with multiple replicas. We'll use this as the target for both chaos experiments.
```shell
# Create the chaos target namespace
kubectl create namespace chaos-target

# Deploy a multi-replica HTTP server
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin
  namespace: chaos-target
spec:
  replicas: 3
  selector:
    matchLabels:
      app: httpbin
  template:
    metadata:
      labels:
        app: httpbin
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: httpbin
      containers:
        - name: httpbin
          image: kennethreitz/httpbin:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "50m"
              memory: "64Mi"
            limits:
              cpu: "200m"
              memory: "128Mi"
          readinessProbe:
            httpGet:
              path: /status/200
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: httpbin
  namespace: chaos-target
spec:
  selector:
    app: httpbin
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: httpbin-pdb
  namespace: chaos-target
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: httpbin
EOF

kubectl get pods -n chaos-target
## NAME                      READY   STATUS    RESTARTS   AGE
## httpbin-6d7f8b9c5-4xkzp   1/1     Running   0          20s
## httpbin-6d7f8b9c5-9mnrq   1/1     Running   0          20s
## httpbin-6d7f8b9c5-vxpqr   1/1     Running   0          20s
```
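To confirm the `topologySpreadConstraints` actually spread the replicas, count pods per node. A small helper (hypothetical; pipe it the output of `kubectl get pods -n chaos-target -l app=httpbin -o custom-columns=:spec.nodeName --no-headers`):

```shell
# max_skew: reads one node name per line (one line per pod) on stdin and
# prints the difference in pod count between the busiest and least-busy
# node that hosts at least one pod
max_skew() {
  sort | uniq -c | awk '
    { if ($1 > max) max = $1
      if (min == "" || $1 < min) min = $1 }
    END { print max - min }'
}

# Three pods on three distinct nodes -> skew 0, satisfying maxSkew: 1
printf 'node-a\nnode-b\nnode-c\n' | max_skew
## 0
```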
Establish steady-state baseline
Before injecting chaos, define and measure steady state. This is the condition you compare against during and after experiments.
```shell
# Start a continuous load generator in the background
kubectl run load-gen \
  --image=busybox:1.36 \
  -n chaos-target \
  --restart=Never \
  -- sh -c \
  'while true; do wget -qO- http://httpbin/status/200 > /dev/null; sleep 0.5; done'

# Watch for failures (steady state = 100% success rate;
# wget errors, if any, will appear in the logs)
kubectl logs load-gen -n chaos-target --follow &

# Define steady-state metrics to monitor in Grafana:
# 1. Error rate:          rate(http_requests_total{status=~"5.."}[1m]) == 0
# 2. Ready replica count: kube_deployment_status_replicas_ready == 3
# 3. P99 latency:         histogram_quantile(0.99, ...) < 200ms

# Confirm all 3 pods are healthy before starting the experiment
# (--no-headers so the header row is not counted)
kubectl get pods -n chaos-target -l app=httpbin --no-headers \
  --field-selector=status.phase=Running | wc -l
## 3
```
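Steady state is easier to assert when it is a number. A tiny helper (hypothetical; feed it request counts collected from the load generator over the measurement window):

```shell
# success_rate TOTAL FAILED -> prints the percentage of successful requests
success_rate() {
  awk -v t="$1" -v f="$2" 'BEGIN { printf "%.1f\n", (t - f) * 100 / t }'
}

# Example: 600 requests over 5 minutes, 3 failures
success_rate 600 3
## 99.5
```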
Experiment 1 — PodChaos: random pod kill
Kill one pod at random every 30 seconds for 5 minutes. Observe whether traffic is disrupted during the kill and how quickly Kubernetes recovers.
```shell
# Apply the PodChaos experiment.
# Chaos Mesh 2.x moved recurring experiments to the Schedule CRD;
# the old in-spec `scheduler.cron` field is no longer valid.
cat << 'EOF' | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-experiment
  namespace: chaos-target
spec:
  schedule: "@every 30s"     # kill a pod every 30 seconds
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one                # kill one pod at a time
    selector:
      namespaces:
        - chaos-target
      labelSelectors:
        app: httpbin
EOF

# Watch pod restarts during the experiment
kubectl get pods -n chaos-target -l app=httpbin -w
## NAME                      READY   STATUS        RESTARTS
## httpbin-6d7f8b9c5-4xkzp   1/1     Running       0
## httpbin-6d7f8b9c5-4xkzp   0/1     Terminating   0   ← chaos kills it
## httpbin-6d7f8b9c5-pq7mn   0/1     Pending       0   ← replacement starts
## httpbin-6d7f8b9c5-pq7mn   1/1     Running       0   ← back to 3 ready

# Measure recovery time for each kill
# In Grafana: kube_deployment_status_replicas_ready{deployment="httpbin"}
# The dip should be brief (under ~15s) and never reach 0

# Check experiment status
kubectl get schedule -n chaos-target
## NAME                  AGE
## pod-kill-experiment   90s

# Let the experiment run for ~5 minutes, then pause it:
kubectl annotate schedule pod-kill-experiment \
  -n chaos-target \
  experiment.chaos-mesh.org/pause=true
```
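Rather than eyeballing the watch stream, recovery time can be computed from two timestamps, such as the kill event and the replacement pod's Ready transition from `kubectl get events`. A tiny helper, assuming GNU `date` as found on most Linux systems:

```shell
# recovery_seconds KILLED_AT BECAME_READY -> elapsed seconds between two
# ISO-8601 timestamps (requires GNU date for the -d flag)
recovery_seconds() {
  echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) ))
}

# Example: pod killed at 12:00:05Z, replacement Ready at 12:00:17Z
recovery_seconds "2024-05-01T12:00:05Z" "2024-05-01T12:00:17Z"
## 12
```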
Experiment 2 — NetworkChaos: latency injection
Inject 200ms of latency with 50ms jitter on all traffic to the httpbin pods. This simulates a degraded network link or a slow dependency and reveals whether the application has sensible timeouts.
```shell
# Clean up experiment 1 first (remove the schedule and any chaos objects it created)
kubectl delete schedule,podchaos --all -n chaos-target

# Apply the NetworkChaos latency experiment
cat << 'EOF' | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-experiment
  namespace: chaos-target
spec:
  action: delay
  mode: all                  # affect all matching pods
  selector:
    namespaces:
      - chaos-target
    labelSelectors:
      app: httpbin
  delay:
    latency: "200ms"
    correlation: "25"        # 25% correlation between consecutive packets
    jitter: "50ms"
  direction: to              # delay packets leaving the selected pods (their responses)
  duration: "5m"
EOF

# Measure the latency impact from the load generator
kubectl exec -n chaos-target load-gen -- \
  time wget -qO- http://httpbin/delay/0
## real    0m 0.247s    ← normally ~10ms, now 200+ms

# Check P99 latency in Grafana using:
# histogram_quantile(0.99,
#   sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
# )

# Test whether a 1-second timeout correctly fails fast
# (busybox wget takes -T SEC rather than --timeout):
kubectl exec -n chaos-target load-gen -- \
  wget -qO- -T 1 http://httpbin/delay/2
## wget: download timed out    ← good: timeout triggered correctly

# Stop the experiment
kubectl delete networkchaos network-latency-experiment -n chaos-target
```
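The Grafana query above needs histogram metrics. If all you have are raw samples (for example, wget timings collected from the load generator), a nearest-rank P99 can be computed directly. A hypothetical helper:

```shell
# p99: reads newline-separated numeric latency samples on stdin and prints
# the 99th percentile using the nearest-rank method, i.e. the ceil(0.99*N)-th
# smallest sample (integer arithmetic avoids floating-point edge cases)
p99() {
  sort -n | awk '
    { a[NR] = $1 }
    END {
      idx = int((99 * NR + 99) / 100)   # = ceil(0.99 * N)
      print a[idx]
    }'
}

# 100 samples of 1..100 ms -> nearest-rank P99 is the 99th value
seq 1 100 | p99
## 99
```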
Compile the hardening backlog
Translate the experiment observations into a prioritized list of system improvements. This is the most important output of any chaos experiment.
```shell
## ── Hardening Backlog Template ──
#
# Experiment 1: PodChaos — pod-kill-experiment
#   Hypothesis:   service maintains <1% error rate during pod kill
#   Observation:  ~3s traffic disruption on each kill (readiness probe delay)
#   Steady state breached: YES — brief spike to 2% error rate
#
#   Finding 1 (P1): readinessProbe initialDelaySeconds=5 too high
#     → New pod takes 5s before receiving traffic
#     → Action: reduce to initialDelaySeconds=2 and add a startupProbe
#
#   Finding 2 (P2): missing preStop sleep hook
#     → Killed pod still receives in-flight requests while Terminating
#     → Action: add lifecycle.preStop.exec: sleep 5
#
# Experiment 2: NetworkChaos — network-latency-experiment
#   Hypothesis:   application fails fast when dependency latency > 500ms
#   Observation:  default HTTP client has no timeout — requests queue indefinitely
#   Steady state breached: YES — goroutine/thread pool exhaustion after 60s
#
#   Finding 3 (P1): HTTP client timeout not configured
#     → Action: set client.Timeout = 2*time.Second in application code
#
#   Finding 4 (P2): no circuit breaker between services
#     → Action: evaluate Resilience4j / Istio circuit breaker

# Apply the quick fixes to the running deployment
kubectl patch deployment httpbin -n chaos-target \
  --type=strategic --patch='
spec:
  template:
    spec:
      containers:
        - name: httpbin
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]
          readinessProbe:
            httpGet:
              path: /status/200
              port: 80
            initialDelaySeconds: 2
            periodSeconds: 3
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /status/200
              port: 80
            failureThreshold: 10
            periodSeconds: 2'

# Re-run experiment 1 with the fix applied and compare error rates
kubectl rollout status deployment/httpbin -n chaos-target
```
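Finding 4 can only be sketched here, since it depends on the service mesh in use. If the cluster runs Istio, a circuit breaker for httpbin might look like this `DestinationRule` (thresholds are illustrative starting points, not recommendations):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-circuit-breaker
  namespace: chaos-target
spec:
  host: httpbin.chaos-target.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # bound the request queue
    outlierDetection:
      consecutive5xxErrors: 5          # eject a pod after 5 consecutive 5xx
      interval: 10s                    # analysis sweep interval
      baseEjectionTime: 30s            # minimum ejection duration
      maxEjectionPercent: 50           # never eject more than half the pods
```

Re-running the NetworkChaos experiment with this in place is a natural follow-up: the steady-state check then verifies that slow pods are ejected instead of exhausting client threads.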
Cleanup
```shell
# Delete target workloads
kubectl delete namespace chaos-target

# Uninstall Chaos Mesh (if desired)
helm uninstall chaos-mesh -n chaos-mesh
kubectl delete namespace chaos-mesh

# Delete Chaos Mesh CRDs (helm uninstall leaves them behind;
# -r skips the delete if grep matched nothing)
kubectl get crd | grep chaos-mesh.org \
  | awk '{print $1}' \
  | xargs -r kubectl delete crd

# Verify clean state
kubectl get crd | grep chaos
## (no output)
```
Success Criteria
- Steady state holds: the load generator maintains a ~100% success rate, with only brief dips during pod kills
- Ready replicas never drop below the PDB's minAvailable of 2 during Experiment 1
- With latency injected in Experiment 2, a 1-second client timeout fails fast instead of letting requests queue
- Every steady-state breach is recorded as a prioritized finding in the hardening backlog, and the quick fixes are re-tested
Further Reading
- Chaos Mesh documentation — chaos-mesh.org/docs
- Principles of Chaos Engineering — principlesofchaos.org
- Chaos Mesh experiment types — chaos-mesh.org/docs/simulate-pod-chaos-on-kubernetes
- Netflix Chaos Monkey — netflix.github.io/chaosmonkey
- AWS Fault Injection Simulator — aws.amazon.com/fis