Cluster Architecture L3 · ADVANCED ~90 min

Simulate AZ Failure with Topology Spread & PDBs

Simulate a full availability zone failure in a multi-zone EKS cluster by cordoning and draining all nodes in one AZ. Measure RTO for each workload class and validate that topology spread constraints and PodDisruptionBudgets govern safe rescheduling.

Objective

Design workloads with realistic HA configurations (topology spread constraints, PDBs), then simulate an AZ failure by cordoning and draining every node in one zone. Measure how long each workload class takes to recover to full capacity, identify which workloads fail ungracefully, and produce a hardening list.

Run this exercise in a non-production cluster. Draining nodes with PDBs can cause temporary service degradation if configurations are incorrect.

Prerequisites
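At minimum this drill needs a multi-zone cluster and permission to evict pods (drain uses the eviction API). A quick pre-flight sketch, assuming the standard `topology.kubernetes.io/zone` node label used on EKS:

```shell
# Count worker nodes per zone; an AZ-failure drill needs at least 2 zones
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' \
  | sort | uniq -c

# Confirm you are allowed to evict pods before starting the drain step
kubectl auth can-i create pods/eviction
```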

Steps

01

Deploy the test workloads

Create four workloads with different HA configurations to compare behaviour during AZ failure. Each represents a real-world pattern.

# workload-a-zone-spread.yaml — Stateless with topology spread
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-a
spec:
  replicas: 6
  selector:
    matchLabels: { app: workload-a }
  template:
    metadata:
      labels: { app: workload-a }
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: { app: workload-a }
      containers:
      - name: app
        image: nginx:alpine
        resources:
          requests: { cpu: 50m, memory: 64Mi }
---
# workload-b-no-spread.yaml — No topology spread (bad practice)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-b
spec:
  replicas: 6
  selector:
    matchLabels: { app: workload-b }
  template:
    metadata:
      labels: { app: workload-b }
    spec:
      containers:
      - name: app
        image: nginx:alpine
        resources:
          requests: { cpu: 50m, memory: 64Mi }

kubectl apply -f workload-a-zone-spread.yaml
kubectl apply -f workload-b-no-spread.yaml
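Before moving on, wait for both rollouts to settle and confirm the spread constraint actually took effect. A quick check (assumes three zones, so a 6-replica spread with maxSkew: 1 settles at 2 pods per zone; column 7 of `kubectl get pods -o wide` is the node name):

```shell
# Block until both Deployments report all replicas ready
kubectl rollout status deployment/workload-a --timeout=120s
kubectl rollout status deployment/workload-b --timeout=120s

# Count workload-a pods per node; with one node group per zone this
# should roughly mirror the 2/2/2 zone spread
kubectl get pods -l app=workload-a -o wide --no-headers | awk '{print $7}' | sort | uniq -c
```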
02

Apply PodDisruptionBudgets

PDBs ensure that during node drains, the minimum number of pods required for service availability is maintained. Without PDBs, a drain can terminate all pods simultaneously.

# pdb-workload-a.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-workload-a
spec:
  minAvailable: 4    # At least 4/6 pods must be up
  selector:
    matchLabels: { app: workload-a }
---
# pdb-workload-b.yaml — Stricter PDB to demonstrate blocking
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-workload-b
spec:
  maxUnavailable: 1   # Only 1 pod can be disrupted at a time
  selector:
    matchLabels: { app: workload-b }

kubectl apply -f pdb-workload-a.yaml
kubectl apply -f pdb-workload-b.yaml

# Verify PDBs are healthy (DISRUPTIONS ALLOWED > 0)
kubectl get pdb
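The DISRUPTIONS ALLOWED column is the number the eviction API checks on every drain. For scripting it can be read straight from PDB status; a sketch:

```shell
# disruptionsAllowed = current healthy pods minus the PDB's required minimum
for PDB in pdb-workload-a pdb-workload-b; do
  ALLOWED=$(kubectl get pdb "$PDB" -o jsonpath='{.status.disruptionsAllowed}')
  echo "$PDB currently allows $ALLOWED voluntary disruption(s)"
done
```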
03

Record baseline state before simulation

Document the pod distribution across zones before starting the failure simulation. This is your baseline for measuring RTO.

# Record start time
BASELINE_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Baseline recorded at: $BASELINE_TIME"

# Show where pods landed. The zone label lives on nodes, not pods,
# so list node zones first and read placement from the NODE column
kubectl get nodes -L topology.kubernetes.io/zone
kubectl get pods -o wide | sort -k7

# Identify nodes in zone us-east-1a (adjust for your zone)
TARGET_ZONE="us-east-1a"
ZONE_NODES=$(kubectl get nodes \
  --selector="topology.kubernetes.io/zone=${TARGET_ZONE}" \
  -o jsonpath='{.items[*].metadata.name}')
echo "Nodes in ${TARGET_ZONE}: ${ZONE_NODES}"

# Verify workloads are healthy
kubectl get deployments -o wide
04

Simulate AZ failure — cordon and drain all nodes in zone 1

Cordon marks nodes as unschedulable (no new pods). Drain evicts existing pods gracefully, respecting PDBs and terminationGracePeriodSeconds. Record timestamps at each step.

# Step 1: Cordon all nodes in the target zone
CORDON_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  echo "Cordoning $NODE..."
  kubectl cordon $NODE
done

# Verify nodes are SchedulingDisabled
kubectl get nodes --selector="topology.kubernetes.io/zone=${TARGET_ZONE}"

# Step 2: Drain nodes (respects PDBs)
DRAIN_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  echo "Draining $NODE at $(date -u)..."
  kubectl drain $NODE \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s \
    --force=false    # --force only affects unmanaged pods; drain always honours PDBs
done
DRAIN_END=$(date -u +%s)

echo "Drain duration: $((DRAIN_END - DRAIN_START))s"
If drain blocks due to a PDB, it will retry until the timeout. Watch the output carefully — it shows which pods are blocking and why. This is expected behaviour when PDBs are configured correctly.
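To see the PDB gate in action, poll disruptionsAllowed from a second terminal while the drain runs; a sketch:

```shell
# Poll PDB headroom every 2s (Ctrl-C to stop). disruptionsAllowed drops
# to 0 while evicted pods reschedule; the eviction API refuses further
# evictions until headroom returns, and drain keeps retrying meanwhile.
while true; do
  echo "$(date -u +%H:%M:%S)  $(kubectl get pdb pdb-workload-b \
    -o jsonpath='{.status.disruptionsAllowed}') disruption(s) allowed"
  sleep 2
done
```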
05

Monitor rescheduling in real time

Open a separate terminal and watch pod events. Record when each workload reaches its desired replica count.

# Watch pod transitions (run in separate terminal)
kubectl get pods -w --output-watch-events

# In another terminal: watch deployment status
watch -n 2 'kubectl get deployments \
  -o custom-columns="NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas"'

# Record time when each deployment reaches desired replicas
# workload-a (with topology spread): ____ seconds
# workload-b (no topology spread):   ____ seconds

# Check scheduler events for topology spread decisions
kubectl get events --field-selector reason=FailedScheduling \
  --sort-by='.lastTimestamp'
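Rather than eyeballing the watch output, you can timestamp recovery automatically. A sketch that polls each Deployment until readyReplicas matches spec.replicas and prints elapsed seconds (assumes DRAIN_START from step 04 is still set in this shell):

```shell
for DEPLOY in workload-a workload-b; do
  (
    WANT=$(kubectl get deployment "$DEPLOY" -o jsonpath='{.spec.replicas}')
    # Loop until the ready count equals the desired count
    until [ "$(kubectl get deployment "$DEPLOY" \
        -o jsonpath='{.status.readyReplicas}')" = "$WANT" ]; do
      sleep 2
    done
    echo "$DEPLOY back to $WANT replicas after $(( $(date -u +%s) - DRAIN_START ))s"
  ) &
done
wait
```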
06

Analyse topology spread behaviour

After the drain completes, examine how the two workloads behaved differently. Workload-A with DoNotSchedule spread may not fully reschedule if the remaining zones lack capacity.

# Check final pod distribution
kubectl get pods -o wide | grep workload

# Check if workload-a is Pending (DoNotSchedule constraint)
kubectl describe pods -l app=workload-a | grep -A5 "Events:"
# Look for scheduler messages like:
# "N node(s) didn't match pod topology spread constraints"

# workload-b (no topology constraints)
# Count pods per node to see how they redistributed
kubectl get pods -l app=workload-b -o wide --no-headers | awk '{print $7}' | sort | uniq -c
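`kubectl get pods -o wide` prints the node, not the zone, so a per-zone count needs each pod's node joined to that node's zone label. A sketch:

```shell
# Build a pod -> node -> zone chain, then count workload-b pods per zone
kubectl get pods -l app=workload-b -o wide --no-headers | awk '{print $7}' |
while read -r NODE; do
  kubectl get node "$NODE" \
    -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
done | sort | uniq -c
```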
07

Restore the zone and measure full RTO

Uncordon the nodes to simulate zone recovery. Measure how long until workloads rebalance back to the restored zone.

# Uncordon zone nodes to simulate zone recovery
RESTORE_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  kubectl uncordon $NODE
done

# Watch rebalancing; topology spread will NOT auto-rebalance existing
# pods unless the Descheduler is running (grep buffers a watch stream,
# so filter with a label selector instead)
kubectl get pods -l app=workload-a -o wide -w

# Force rebalance by rolling the deployment
kubectl rollout restart deployment/workload-a
kubectl rollout status deployment/workload-a
RESTORE_END=$(date -u +%s)

echo "Full RTO: $((RESTORE_END - DRAIN_START))s"
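With the timestamps captured along the way, the total RTO can be broken into phases; a sketch assuming CORDON_START, DRAIN_START, DRAIN_END, RESTORE_START and RESTORE_END are still set in this shell:

```shell
echo "Phase timings:"
echo "  cordon -> drain start          : $((DRAIN_START - CORDON_START))s"
echo "  drain duration                 : $((DRAIN_END - DRAIN_START))s"
echo "  degraded (drain end -> restore): $((RESTORE_START - DRAIN_END))s"
echo "  restore + forced rebalance     : $((RESTORE_END - RESTORE_START))s"
echo "  total (drain start -> restored): $((RESTORE_END - DRAIN_START))s"
```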
08

Record RTO measurements

Fill in this table with your observed values. These measurements become the basis for SLO definitions.

Workload     Config                   Time to 50% capacity   Time to 100% capacity   Observations
workload-a   TopologySpread + PDB     ___ s                  ___ s                   May stay Pending if no capacity in remaining zones
workload-b   No spread + strict PDB   ___ s                  ___ s                   Drain slowed by PDB; fully reschedules to other zones

Success Criteria

Key Concepts

Further Reading