Objective
Chaos engineering is the discipline of probing system resilience by introducing controlled failure. This exercise walks the complete cycle: define a steady state, inject failure, measure the impact, stop the experiment, and translate gaps into concrete hardening work.
Prerequisites
- Kubernetes cluster with at least 3 nodes
- Helm installed
- kube-prometheus-stack deployed (for Grafana metrics during experiments)
- A sample workload to target (we'll deploy one in step 2)
- kubectl access with cluster-admin for Chaos Mesh CRD installation
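The prerequisites above can be sanity-checked up front. A small preflight helper (hypothetical; in a real run you would call it as `require kubectl helm`):

```shell
# require: verify each named CLI is on PATH; print "ok" if all are present,
# otherwise report the first missing command and return non-zero
require() {
  for c in "$@"; do
    command -v "$c" > /dev/null 2>&1 || { echo "missing: $c" >&2; return 1; }
  done
  echo "ok"
}

require sh awk   # demo with tools present on any POSIX system
## ok
```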
Chaos Experiment Types
| Type | CRD Kind | What It Tests |
|---|---|---|
| PodChaos | PodChaos | Pod kill tolerance, restart behavior, PDB effectiveness |
| NetworkChaos | NetworkChaos | Latency tolerance, timeout handling, circuit breaker behavior |
| StressChaos | StressChaos | Memory/CPU spike handling, HPA response time |
| DNSChaos | DNSChaos | DNS failure handling, service discovery resilience |
| IOChaos | IOChaos | Disk latency tolerance, write-ahead log resilience |
| TimeChaos | TimeChaos | Clock skew handling, certificate validation, cron scheduling |
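Every kind in the table is targeted the same way: a `selector` picks the pods, `mode` decides how many of the matches to hit, and most kinds accept a `duration`. A sketch of the shared stanza (illustrative fragment, not a complete experiment):

```yaml
# Common targeting fields shared by Chaos Mesh experiment kinds
spec:
  mode: one            # or: all, fixed, fixed-percent, random-max-percent
  selector:
    namespaces:
      - chaos-target
    labelSelectors:
      app: httpbin
  duration: "5m"       # how long the fault stays injected (where applicable)
```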
Steps
Install Chaos Mesh
```shell
# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh with dashboard enabled
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set dashboard.create=true \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --version 2.7.0 \
  --wait

# Verify all components are running
kubectl get pods -n chaos-mesh
## NAME                                        READY   STATUS
## chaos-controller-manager-5f6b9d7f4c-xkzpq   1/1     Running
## chaos-daemon-2vgkn                          1/1     Running   (DaemonSet)
## chaos-daemon-9rjpw                          1/1     Running
## chaos-daemon-vplmt                          1/1     Running
## chaos-dashboard-6f8c9b7d5-nqmrk             1/1     Running

# Access the Chaos Mesh dashboard
kubectl port-forward svc/chaos-dashboard \
  -n chaos-mesh 2333:2333 &
# Open http://localhost:2333
```
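Recent Chaos Mesh releases run the dashboard in security mode by default, so the login page asks for an RBAC token rather than opening directly. A minimal read-only account might look like the following (the names here are illustrative; the dashboard's login screen can also generate an equivalent manifest for you):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-dashboard-viewer   # hypothetical name
  namespace: chaos-mesh
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-dashboard-viewer
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-dashboard-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-dashboard-viewer
subjects:
  - kind: ServiceAccount
    name: chaos-dashboard-viewer
    namespace: chaos-mesh
```

On Kubernetes 1.24+, fetch a token for the login form with `kubectl create token chaos-dashboard-viewer -n chaos-mesh`.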
Deploy the target application
Deploy a simple HTTP server with multiple replicas. We'll use this as the target for both chaos experiments.
```shell
# Create the chaos target namespace
kubectl create namespace chaos-target

# Deploy a multi-replica HTTP server
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin
  namespace: chaos-target
spec:
  replicas: 3
  selector:
    matchLabels:
      app: httpbin
  template:
    metadata:
      labels:
        app: httpbin
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: httpbin
      containers:
        - name: httpbin
          image: kennethreitz/httpbin:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "50m"
              memory: "64Mi"
            limits:
              cpu: "200m"
              memory: "128Mi"
          readinessProbe:
            httpGet:
              path: /status/200
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: httpbin
  namespace: chaos-target
spec:
  selector:
    app: httpbin
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: httpbin-pdb
  namespace: chaos-target
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: httpbin
EOF

kubectl get pods -n chaos-target
## NAME                      READY   STATUS    RESTARTS   AGE
## httpbin-6d7f8b9c5-4xkzp   1/1     Running   0          20s
## httpbin-6d7f8b9c5-9mnrq   1/1     Running   0          20s
## httpbin-6d7f8b9c5-vxpqr   1/1     Running   0          20s
```
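To confirm the `topologySpreadConstraints` actually spread the replicas, count pods per node. A small helper (hypothetical; pipe it the output of `kubectl get pods -n chaos-target -l app=httpbin -o custom-columns=:spec.nodeName --no-headers`):

```shell
# max_skew: reads one node name per line (one line per pod) on stdin and
# prints the difference in pod count between the busiest and least-busy
# node that hosts at least one pod
max_skew() {
  sort | uniq -c | awk '
    { if ($1 > max) max = $1
      if (min == "" || $1 < min) min = $1 }
    END { print max - min }'
}

# Three pods on three distinct nodes -> skew 0, satisfying maxSkew: 1
printf 'node-a\nnode-b\nnode-c\n' | max_skew
## 0
```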
Establish steady-state baseline
Before injecting chaos, define and measure steady state. This is the condition you compare against during and after experiments.
```shell
# Start a continuous load generator in the background
kubectl run load-gen \
  --image=busybox:1.36 \
  -n chaos-target \
  --restart=Never \
  -- sh -c \
  'while true; do wget -qO- http://httpbin/status/200 > /dev/null; sleep 0.5; done'

# Watch for failures (steady state = 100% success rate;
# wget errors, if any, will appear in the logs)
kubectl logs load-gen -n chaos-target --follow &

# Define steady-state metrics to monitor in Grafana:
# 1. Error rate:          rate(http_requests_total{status=~"5.."}[1m]) == 0
# 2. Ready replica count: kube_deployment_status_replicas_ready == 3
# 3. P99 latency:         histogram_quantile(0.99, ...) < 200ms

# Confirm all 3 pods are healthy before starting the experiment
# (--no-headers so the header row is not counted)
kubectl get pods -n chaos-target -l app=httpbin --no-headers \
  --field-selector=status.phase=Running | wc -l
## 3
```
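Steady state is easier to assert when it is a number. A tiny helper (hypothetical; feed it request counts collected from the load generator over the measurement window):

```shell
# success_rate TOTAL FAILED -> prints the percentage of successful requests
success_rate() {
  awk -v t="$1" -v f="$2" 'BEGIN { printf "%.1f\n", (t - f) * 100 / t }'
}

# Example: 600 requests over 5 minutes, 3 failures
success_rate 600 3
## 99.5
```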
Experiment 1 — PodChaos: random pod kill
Kill one pod at random every 30 seconds for 5 minutes. Observe whether traffic is disrupted during the kill and how quickly Kubernetes recovers.
```shell
# Apply the PodChaos experiment.
# Chaos Mesh 2.x moved recurring experiments to the Schedule CRD;
# the old in-spec `scheduler.cron` field is no longer valid.
cat << 'EOF' | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-experiment
  namespace: chaos-target
spec:
  schedule: "@every 30s"     # kill a pod every 30 seconds
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one                # kill one pod at a time
    selector:
      namespaces:
        - chaos-target
      labelSelectors:
        app: httpbin
EOF

# Watch pod restarts during the experiment
kubectl get pods -n chaos-target -l app=httpbin -w
## NAME                      READY   STATUS        RESTARTS
## httpbin-6d7f8b9c5-4xkzp   1/1     Running       0
## httpbin-6d7f8b9c5-4xkzp   0/1     Terminating   0   ← chaos kills it
## httpbin-6d7f8b9c5-pq7mn   0/1     Pending       0   ← replacement starts
## httpbin-6d7f8b9c5-pq7mn   1/1     Running       0   ← back to 3 ready

# Measure recovery time for each kill
# In Grafana: kube_deployment_status_replicas_ready{deployment="httpbin"}
# The dip should be brief (under ~15s) and never reach 0

# Check experiment status
kubectl get schedule -n chaos-target
## NAME                  AGE
## pod-kill-experiment   90s

# Let the experiment run for ~5 minutes, then pause it:
kubectl annotate schedule pod-kill-experiment \
  -n chaos-target \
  experiment.chaos-mesh.org/pause=true
```
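Rather than eyeballing the watch stream, recovery time can be computed from two timestamps, such as the kill event and the replacement pod's Ready transition from `kubectl get events`. A tiny helper, assuming GNU `date` as found on most Linux systems:

```shell
# recovery_seconds KILLED_AT BECAME_READY -> elapsed seconds between two
# ISO-8601 timestamps (requires GNU date for the -d flag)
recovery_seconds() {
  echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) ))
}

# Example: pod killed at 12:00:05Z, replacement Ready at 12:00:17Z
recovery_seconds "2024-05-01T12:00:05Z" "2024-05-01T12:00:17Z"
## 12
```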
Experiment 2 — NetworkChaos: latency injection
Inject 200ms of latency with 50ms jitter on all traffic to the httpbin pods. This simulates a degraded network link or a slow dependency and reveals whether the application has sensible timeouts.
```shell
# Clean up experiment 1 first (remove the schedule and any chaos objects it created)
kubectl delete schedule,podchaos --all -n chaos-target

# Apply the NetworkChaos latency experiment
cat << 'EOF' | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-experiment
  namespace: chaos-target
spec:
  action: delay
  mode: all                  # affect all matching pods
  selector:
    namespaces:
      - chaos-target
    labelSelectors:
      app: httpbin
  delay:
    latency: "200ms"
    correlation: "25"        # 25% correlation between consecutive packets
    jitter: "50ms"
  direction: to              # delay packets leaving the selected pods (their responses)
  duration: "5m"
EOF

# Measure the latency impact from the load generator
kubectl exec -n chaos-target load-gen -- \
  time wget -qO- http://httpbin/delay/0
## real    0m 0.247s    ← normally ~10ms, now 200+ms

# Check P99 latency in Grafana using:
# histogram_quantile(0.99,
#   sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
# )

# Test whether a 1-second timeout correctly fails fast
# (busybox wget takes -T SEC rather than --timeout):
kubectl exec -n chaos-target load-gen -- \
  wget -qO- -T 1 http://httpbin/delay/2
## wget: download timed out    ← good: timeout triggered correctly

# Stop the experiment
kubectl delete networkchaos network-latency-experiment -n chaos-target
```
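The Grafana query above needs histogram metrics. If all you have are raw samples (for example, wget timings collected from the load generator), a nearest-rank P99 can be computed directly. A hypothetical helper:

```shell
# p99: reads newline-separated numeric latency samples on stdin and prints
# the 99th percentile using the nearest-rank method, i.e. the ceil(0.99*N)-th
# smallest sample (integer arithmetic avoids floating-point edge cases)
p99() {
  sort -n | awk '
    { a[NR] = $1 }
    END {
      idx = int((99 * NR + 99) / 100)   # = ceil(0.99 * N)
      print a[idx]
    }'
}

# 100 samples of 1..100 ms -> nearest-rank P99 is the 99th value
seq 1 100 | p99
## 99
```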
Compile the hardening backlog
Translate the experiment observations into a prioritized list of system improvements. This is the most important output of any chaos experiment.
```shell
## ── Hardening Backlog Template ──
#
# Experiment 1: PodChaos — pod-kill-experiment
#   Hypothesis:   service maintains <1% error rate during pod kill
#   Observation:  ~3s traffic disruption on each kill (readiness probe delay)
#   Steady state breached: YES — brief spike to 2% error rate
#
#   Finding 1 (P1): readinessProbe initialDelaySeconds=5 too high
#     → New pod takes 5s before receiving traffic
#     → Action: reduce to initialDelaySeconds=2 and add a startupProbe
#
#   Finding 2 (P2): missing preStop sleep hook
#     → Killed pod still receives in-flight requests while Terminating
#     → Action: add lifecycle.preStop.exec: sleep 5
#
# Experiment 2: NetworkChaos — network-latency-experiment
#   Hypothesis:   application fails fast when dependency latency > 500ms
#   Observation:  default HTTP client has no timeout — requests queue indefinitely
#   Steady state breached: YES — goroutine/thread pool exhaustion after 60s
#
#   Finding 3 (P1): HTTP client timeout not configured
#     → Action: set client.Timeout = 2*time.Second in application code
#
#   Finding 4 (P2): no circuit breaker between services
#     → Action: evaluate Resilience4j / Istio circuit breaker

# Apply the quick fixes to the running deployment
kubectl patch deployment httpbin -n chaos-target \
  --type=strategic --patch='
spec:
  template:
    spec:
      containers:
        - name: httpbin
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]
          readinessProbe:
            httpGet:
              path: /status/200
              port: 80
            initialDelaySeconds: 2
            periodSeconds: 3
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /status/200
              port: 80
            failureThreshold: 10
            periodSeconds: 2'

# Re-run experiment 1 with the fix applied and compare error rates
kubectl rollout status deployment/httpbin -n chaos-target
```
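Finding 4 can only be sketched here, since it depends on the service mesh in use. If the cluster runs Istio, a circuit breaker for httpbin might look like this `DestinationRule` (thresholds are illustrative starting points, not recommendations):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-circuit-breaker
  namespace: chaos-target
spec:
  host: httpbin.chaos-target.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # bound the request queue
    outlierDetection:
      consecutive5xxErrors: 5          # eject a pod after 5 consecutive 5xx
      interval: 10s                    # analysis sweep interval
      baseEjectionTime: 30s            # minimum ejection duration
      maxEjectionPercent: 50           # never eject more than half the pods
```

Re-running the NetworkChaos experiment with this in place is a natural follow-up: the steady-state check then verifies that slow pods are ejected instead of exhausting client threads.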
Cleanup
```shell
# Delete target workloads
kubectl delete namespace chaos-target

# Uninstall Chaos Mesh (if desired)
helm uninstall chaos-mesh -n chaos-mesh
kubectl delete namespace chaos-mesh

# Delete Chaos Mesh CRDs (helm uninstall leaves them behind;
# -r skips the delete if grep matched nothing)
kubectl get crd | grep chaos-mesh.org \
  | awk '{print $1}' \
  | xargs -r kubectl delete crd

# Verify clean state
kubectl get crd | grep chaos
## (no output)
```
Success Criteria
- Steady state holds: the load generator maintains a ~100% success rate, with only brief dips during pod kills
- Ready replicas never drop below the PDB's minAvailable of 2 during Experiment 1
- With latency injected in Experiment 2, a 1-second client timeout fails fast instead of letting requests queue
- Every steady-state breach is recorded as a prioritized finding in the hardening backlog, and the quick fixes are re-tested
Further Reading
- Chaos Mesh documentation — chaos-mesh.org/docs
- Principles of Chaos Engineering — principlesofchaos.org
- Chaos Mesh experiment types — chaos-mesh.org/docs/simulate-pod-chaos-on-kubernetes
- Netflix Chaos Monkey — netflix.github.io/chaosmonkey
- AWS Fault Injection Simulator — aws.amazon.com/fis