Objective
Design workloads with realistic HA configurations (topology spread constraints, PDBs), then simulate an AZ failure by cordoning and draining all nodes in one zone. Measure how long each workload class takes to recover to full capacity. Identify which workloads fail ungracefully and produce a hardening list.
Prerequisites
- Multi-zone EKS or AKS cluster with at least 2 nodes per zone (6 nodes minimum)
- kubectl with cluster-admin access
- Prometheus + Alertmanager deployed for RTO measurement
- Understanding of Kubernetes scheduling concepts (taints, node selectors, affinity)
Steps
Deploy the test workloads
Create two workloads with different HA configurations to compare behaviour during AZ failure. Each represents a real-world pattern.
```yaml
# workload-a-zone-spread.yaml — Stateless with topology spread
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-a
spec:
  replicas: 6
  selector:
    matchLabels: { app: workload-a }
  template:
    metadata:
      labels: { app: workload-a }
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: { app: workload-a }
      containers:
        - name: app
          image: nginx:alpine
          resources:
            requests: { cpu: 50m, memory: 64Mi }
```

```yaml
# workload-b-no-spread.yaml — No topology spread (bad practice)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-b
spec:
  replicas: 6
  selector:
    matchLabels: { app: workload-b }
  template:
    metadata:
      labels: { app: workload-b }
    spec:
      containers:
        - name: app
          image: nginx:alpine
          resources:
            requests: { cpu: 50m, memory: 64Mi }
```

```bash
kubectl apply -f workload-a-zone-spread.yaml
kubectl apply -f workload-b-no-spread.yaml
```
Apply PodDisruptionBudgets
PDBs ensure that during node drains, the minimum number of pods required for service availability is maintained. Without a PDB, draining several nodes in quick succession can evict every replica of a workload at once.
```yaml
# pdb-workload-a.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-workload-a
spec:
  minAvailable: 4  # At least 4/6 pods must be up
  selector:
    matchLabels: { app: workload-a }
```

```yaml
# pdb-workload-b.yaml — Stricter PDB to demonstrate blocking
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-workload-b
spec:
  maxUnavailable: 1  # Only 1 pod can be disrupted at a time
  selector:
    matchLabels: { app: workload-b }
```

```bash
kubectl apply -f pdb-workload-a.yaml
kubectl apply -f pdb-workload-b.yaml

# Verify PDBs are healthy (ALLOWED DISRUPTIONS > 0)
kubectl get pdb
```
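The ALLOWED DISRUPTIONS figure that `kubectl get pdb` reports follows directly from the spec. A minimal sketch of the arithmetic, assuming all replicas are healthy (the helper names are made up for illustration):

```shell
# How many voluntary disruptions a PDB permits when every replica is healthy.

# minAvailable form: allowed = healthy - minAvailable (never negative)
disruptions_allowed_min() {
  local healthy=$1 min_available=$2
  local allowed=$(( healthy - min_available ))
  if (( allowed < 0 )); then allowed=0; fi
  echo "$allowed"
}

# maxUnavailable form: allowed = maxUnavailable (capped at healthy pods)
disruptions_allowed_max() {
  local healthy=$1 max_unavailable=$2
  if (( max_unavailable > healthy )); then max_unavailable=$healthy; fi
  echo "$max_unavailable"
}

disruptions_allowed_min 6 4   # pdb-workload-a → 2
disruptions_allowed_max 6 1   # pdb-workload-b → 1
```

This is why pdb-workload-b slows the drain so much: after each eviction the budget drops to zero until the replacement pod is Ready.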
Record baseline state before simulation
Document the pod distribution across zones before starting the failure simulation. This is your baseline for measuring RTO.
```bash
# Record start time
BASELINE_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Baseline recorded at: $BASELINE_TIME"

# Show pod placement (zone is a node label, not a pod label,
# so join the pod's NODE column against the node list)
kubectl get pods -o wide
kubectl get nodes --label-columns=topology.kubernetes.io/zone

# Identify nodes in zone us-east-1a (adjust for your zone)
TARGET_ZONE="us-east-1a"
ZONE_NODES=$(kubectl get nodes \
  --selector="topology.kubernetes.io/zone=${TARGET_ZONE}" \
  -o jsonpath='{.items[*].metadata.name}')
echo "Nodes in ${TARGET_ZONE}: ${ZONE_NODES}"

# Verify workloads are healthy
kubectl get deployments -o wide
```
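To snapshot the distribution without eyeballing the output, the NODE column of `kubectl get pods -o wide` can be tallied with a small awk pipeline. A sketch fed with canned sample output so the logic is visible without a cluster (the helper name is made up):

```shell
# Tally pods per node from `kubectl get pods -o wide` output (NODE = column 7).
pods_per_node() {
  awk 'NR > 1 { count[$7]++ } END { for (n in count) print count[n], n }' | sort -k2
}

# Canned example; in practice: kubectl get pods -o wide | pods_per_node
pods_per_node <<'EOF'
NAME          READY  STATUS   RESTARTS  AGE  IP        NODE
workload-a-1  1/1    Running  0         1m   10.0.0.1  node-1a
workload-a-2  1/1    Running  0         1m   10.0.0.2  node-1b
workload-a-3  1/1    Running  0         1m   10.0.0.3  node-1a
EOF
# → 2 node-1a
#   1 node-1b
```

Run it before the drain and again afterwards; the before/after diff is the baseline evidence for the RTO table.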
Simulate AZ failure — cordon and drain all nodes in zone 1
Cordon marks nodes as unschedulable (no new pods). Drain evicts existing pods gracefully, respecting PDBs and terminationGracePeriodSeconds. Record timestamps at each step.
```bash
# Step 1: Cordon all nodes in the target zone
CORDON_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  echo "Cordoning $NODE..."
  kubectl cordon "$NODE"
done

# Verify nodes are SchedulingDisabled
kubectl get nodes --selector="topology.kubernetes.io/zone=${TARGET_ZONE}"

# Step 2: Drain nodes (respects PDBs)
DRAIN_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  echo "Draining $NODE at $(date -u)..."
  # --force=false: refuse to delete pods without a controller;
  # drain always waits on PDBs regardless of this flag
  kubectl drain "$NODE" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s \
    --force=false
done
DRAIN_END=$(date -u +%s)
echo "Drain duration: $((DRAIN_END - DRAIN_START))s"
```
Monitor rescheduling in real time
Open a separate terminal and watch pod events. Record when each workload reaches its desired replica count.
```bash
# Watch pod transitions (run in a separate terminal)
kubectl get pods -w --output-watch-events

# In another terminal: watch deployment status
watch -n 2 'kubectl get deployments \
  -o custom-columns="NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas"'

# Record the time when each deployment reaches its desired replica count
# workload-a (with topology spread): ____ seconds
# workload-b (no topology spread):   ____ seconds

# Check scheduler events for topology spread decisions
kubectl get events --field-selector reason=FailedScheduling \
  --sort-by='.lastTimestamp'
```
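If Prometheus with kube-state-metrics is scraping the cluster (a stated prerequisite), capacity can also be tracked as a ratio rather than by watching terminals. A sketch using standard kube-state-metrics series (verify the exact metric names against your deployment):

```promql
# Fraction of desired replicas currently available, per workload
kube_deployment_status_replicas_available{deployment=~"workload-.*"}
  /
kube_deployment_spec_replicas{deployment=~"workload-.*"}

# Alert-style condition: a workload below 50% capacity
(kube_deployment_status_replicas_available / kube_deployment_spec_replicas) < 0.5
```

Plotting the first expression and reading off when it returns to 1 gives the "time to 100% capacity" column of the RTO table without manual stopwatching.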
Analyse topology spread behaviour
After the drain completes, examine how the two workloads behaved differently. Workload-a's DoNotSchedule constraint can leave pods Pending even when the remaining zones have spare capacity: the cordoned nodes still exist, so the empty zone still counts as a topology domain, and packing all six replicas into two zones would exceed maxSkew: 1.
```bash
# Check final pod distribution
kubectl get pods -o wide | grep workload

# Check whether workload-a pods are Pending (DoNotSchedule constraint)
kubectl describe pods -l app=workload-a | grep -A5 "Events:"
# Look for messages like:
#   "0/6 nodes are available: ... node(s) didn't match pod topology spread constraints"

# workload-b (no spread constraints)
# Check how its pods redistributed across the remaining nodes
kubectl get pods -l app=workload-b -o wide | awk 'NR>1 {print $7}' | sort | uniq -c
```
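The Pending pods follow from the skew arithmetic: after the drain, the cordoned zone is a topology domain with zero pods. A sketch of the computation (the helper name is made up):

```shell
# Observed skew = pods in the most-loaded zone minus pods in the least-loaded zone.
zone_skew() {
  # args: pod counts per zone, e.g. zone_skew 3 3 0
  printf '%s\n' "$@" | sort -n | awk 'NR == 1 { min = $1 } { max = $1 } END { print max - min }'
}

zone_skew 2 2 2   # healthy baseline across 3 zones → 0
zone_skew 3 3 0   # zone drained, all 6 pods in 2 zones → 3
# With maxSkew: 1 and DoNotSchedule, any placement that would push skew
# above 1 is rejected, so two workload-a pods stay Pending until the
# zone returns.
```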
Restore the zone and measure full RTO
Uncordon the nodes to simulate zone recovery. Measure how long until workloads rebalance back to the restored zone.
```bash
# Uncordon zone nodes to simulate zone recovery
RESTORE_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  kubectl uncordon "$NODE"
done

# Watch rebalancing — topology spread will NOT auto-rebalance
# existing pods unless the Descheduler is running
kubectl get pods -o wide -w | grep workload-a

# Force a rebalance by rolling the deployment
kubectl rollout restart deployment/workload-a
kubectl rollout status deployment/workload-a

RESTORE_END=$(date -u +%s)
echo "Full RTO: $((RESTORE_END - DRAIN_START))s"
```
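Raw epoch deltas are awkward in a report; a small formatter (a hypothetical helper, not part of kubectl) turns the recorded timestamps into table-ready durations:

```shell
# Format an epoch-seconds interval as minutes and seconds for the RTO table.
fmt_duration() {
  local start=$1 end=$2
  local secs=$(( end - start ))
  printf '%dm%02ds\n' $(( secs / 60 )) $(( secs % 60 ))
}

fmt_duration 1700000000 1700000185   # → 3m05s
# e.g. fmt_duration "$DRAIN_START" "$RESTORE_END"
```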
Record RTO measurements
Fill in this table with your observed values. These measurements become the basis for SLO definitions.
| Workload | Config | Time to 50% capacity | Time to 100% capacity | Observations |
|---|---|---|---|---|
| workload-a | TopologySpread + PDB | ___ s | ___ s | May stay Pending if no capacity in remaining zones |
| workload-b | No spread + strict PDB | ___ s | ___ s | Drain slowed by PDB; fully reschedules to other zones |
Success Criteria
- Both workload classes return to their desired replica counts after the zone is restored, with an RTO recorded for each in the table above
- workload-a's Pending pods and workload-b's PDB-slowed drain are observed and explained
- A hardening list is produced for every workload that failed ungracefully
Key Concepts
- DoNotSchedule vs ScheduleAnyway — DoNotSchedule is strict: pods stay Pending rather than violating spread. ScheduleAnyway allows scheduling with skew, degrading zone balance
- PDB during drain — kubectl drain respects PDBs by waiting; --force only affects pods without a controller, while --disable-eviction bypasses PDBs entirely and should only be used in genuine emergencies
- Descheduler — Kubernetes does not auto-rebalance pods after zone restoration; the Descheduler project (sigs.k8s.io/descheduler) handles this
- maxSkew semantics — a maxSkew of 1 means the difference between the zone with most pods and the zone with fewest cannot exceed 1
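To automate the rebalance step instead of rolling deployments by hand, a Descheduler policy can evict pods that violate their spread constraints once the zone returns. A minimal sketch against the v1alpha2 policy API (check the plugin name and schema against your Descheduler version, as the policy format has changed between releases):

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: rebalance-after-zone-recovery
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"
```

Evicted pods are then rescheduled by the normal scheduler, which the spread constraint steers back into the restored zone.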
Further Reading
- Pod Topology Spread Constraints — kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints
- Disruptions and PDBs — kubernetes.io/docs/concepts/workloads/pods/disruptions
- Descheduler project — github.com/kubernetes-sigs/descheduler
- Chaos Engineering with node failure — principlesofchaos.org