Objective
Design workloads with realistic HA configurations (topology spread constraints, PDBs), then simulate an AZ failure by cordoning and draining all nodes in one zone. Measure how long each workload class takes to recover to full capacity. Identify which workloads fail ungracefully and produce a hardening list.
Prerequisites
- Multi-zone EKS or AKS cluster with at least 2 nodes per zone (6 nodes minimum)
- kubectl with cluster-admin access
- Prometheus + Alertmanager deployed for RTO measurement
- Understanding of Kubernetes scheduling concepts (taints, node selectors, affinity)
Steps
Deploy the test workloads
Create two workloads with different HA configurations to compare behaviour during AZ failure. Each represents a real-world pattern.
```yaml
# workload-a-zone-spread.yaml — Stateless with topology spread
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-a
spec:
  replicas: 6
  selector:
    matchLabels: { app: workload-a }
  template:
    metadata:
      labels: { app: workload-a }
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: { app: workload-a }
      containers:
        - name: app
          image: nginx:alpine
          resources:
            requests: { cpu: 50m, memory: 64Mi }
```

```yaml
# workload-b-no-spread.yaml — No topology spread (bad practice)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-b
spec:
  replicas: 6
  selector:
    matchLabels: { app: workload-b }
  template:
    metadata:
      labels: { app: workload-b }
    spec:
      containers:
        - name: app
          image: nginx:alpine
          resources:
            requests: { cpu: 50m, memory: 64Mi }
```

```bash
kubectl apply -f workload-a-zone-spread.yaml
kubectl apply -f workload-b-no-spread.yaml
```
Apply PodDisruptionBudgets
PDBs ensure that during node drains, the minimum number of pods required for service availability is maintained. Without a PDB, draining several nodes in quick succession can evict every replica of a workload at once.
```yaml
# pdb-workload-a.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-workload-a
spec:
  minAvailable: 4  # At least 4/6 pods must be up
  selector:
    matchLabels: { app: workload-a }
```

```yaml
# pdb-workload-b.yaml — Stricter PDB to demonstrate blocking
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-workload-b
spec:
  maxUnavailable: 1  # Only 1 pod can be disrupted at a time
  selector:
    matchLabels: { app: workload-b }
```

```bash
kubectl apply -f pdb-workload-a.yaml
kubectl apply -f pdb-workload-b.yaml

# Verify PDBs are healthy (ALLOWED DISRUPTIONS > 0)
kubectl get pdb
```
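The ALLOWED DISRUPTIONS figure that `kubectl get pdb` reports follows directly from the spec. A minimal sketch of the arithmetic, assuming all replicas are healthy (the helper names are made up for illustration):

```shell
# How many voluntary disruptions a PDB permits when every replica is healthy.

# minAvailable form: allowed = healthy - minAvailable (never negative)
disruptions_allowed_min() {
  local healthy=$1 min_available=$2
  local allowed=$(( healthy - min_available ))
  if (( allowed < 0 )); then allowed=0; fi
  echo "$allowed"
}

# maxUnavailable form: allowed = maxUnavailable (capped at healthy pods)
disruptions_allowed_max() {
  local healthy=$1 max_unavailable=$2
  if (( max_unavailable > healthy )); then max_unavailable=$healthy; fi
  echo "$max_unavailable"
}

disruptions_allowed_min 6 4   # pdb-workload-a → 2
disruptions_allowed_max 6 1   # pdb-workload-b → 1
```

This is why pdb-workload-b slows the drain so much: after each eviction the budget drops to zero until the replacement pod is Ready.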
Record baseline state before simulation
Document the pod distribution across zones before starting the failure simulation. This is your baseline for measuring RTO.
```bash
# Record start time
BASELINE_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Baseline recorded at: $BASELINE_TIME"

# Show pod placement (zone is a node label, not a pod label,
# so join the pod's NODE column against the node list)
kubectl get pods -o wide
kubectl get nodes --label-columns=topology.kubernetes.io/zone

# Identify nodes in zone us-east-1a (adjust for your zone)
TARGET_ZONE="us-east-1a"
ZONE_NODES=$(kubectl get nodes \
  --selector="topology.kubernetes.io/zone=${TARGET_ZONE}" \
  -o jsonpath='{.items[*].metadata.name}')
echo "Nodes in ${TARGET_ZONE}: ${ZONE_NODES}"

# Verify workloads are healthy
kubectl get deployments -o wide
```
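To snapshot the distribution without eyeballing the output, the NODE column of `kubectl get pods -o wide` can be tallied with a small awk pipeline. A sketch fed with canned sample output so the logic is visible without a cluster (the helper name is made up):

```shell
# Tally pods per node from `kubectl get pods -o wide` output (NODE = column 7).
pods_per_node() {
  awk 'NR > 1 { count[$7]++ } END { for (n in count) print count[n], n }' | sort -k2
}

# Canned example; in practice: kubectl get pods -o wide | pods_per_node
pods_per_node <<'EOF'
NAME          READY  STATUS   RESTARTS  AGE  IP        NODE
workload-a-1  1/1    Running  0         1m   10.0.0.1  node-1a
workload-a-2  1/1    Running  0         1m   10.0.0.2  node-1b
workload-a-3  1/1    Running  0         1m   10.0.0.3  node-1a
EOF
# → 2 node-1a
#   1 node-1b
```

Run it before the drain and again afterwards; the before/after diff is the baseline evidence for the RTO table.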
Simulate AZ failure — cordon and drain all nodes in zone 1
Cordon marks nodes as unschedulable (no new pods). Drain evicts existing pods gracefully, respecting PDBs and terminationGracePeriodSeconds. Record timestamps at each step.
```bash
# Step 1: Cordon all nodes in the target zone
CORDON_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  echo "Cordoning $NODE..."
  kubectl cordon "$NODE"
done

# Verify nodes are SchedulingDisabled
kubectl get nodes --selector="topology.kubernetes.io/zone=${TARGET_ZONE}"

# Step 2: Drain nodes (respects PDBs)
DRAIN_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  echo "Draining $NODE at $(date -u)..."
  # --force=false: refuse to delete pods without a controller;
  # drain always waits on PDBs regardless of this flag
  kubectl drain "$NODE" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s \
    --force=false
done
DRAIN_END=$(date -u +%s)
echo "Drain duration: $((DRAIN_END - DRAIN_START))s"
```
Monitor rescheduling in real time
Open a separate terminal and watch pod events. Record when each workload reaches its desired replica count.
```bash
# Watch pod transitions (run in a separate terminal)
kubectl get pods -w --output-watch-events

# In another terminal: watch deployment status
watch -n 2 'kubectl get deployments \
  -o custom-columns="NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas"'

# Record the time when each deployment reaches its desired replica count
# workload-a (with topology spread): ____ seconds
# workload-b (no topology spread):   ____ seconds

# Check scheduler events for topology spread decisions
kubectl get events --field-selector reason=FailedScheduling \
  --sort-by='.lastTimestamp'
```
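If Prometheus with kube-state-metrics is scraping the cluster (a stated prerequisite), capacity can also be tracked as a ratio rather than by watching terminals. A sketch using standard kube-state-metrics series (verify the exact metric names against your deployment):

```promql
# Fraction of desired replicas currently available, per workload
kube_deployment_status_replicas_available{deployment=~"workload-.*"}
  /
kube_deployment_spec_replicas{deployment=~"workload-.*"}

# Alert-style condition: a workload below 50% capacity
(kube_deployment_status_replicas_available / kube_deployment_spec_replicas) < 0.5
```

Plotting the first expression and reading off when it returns to 1 gives the "time to 100% capacity" column of the RTO table without manual stopwatching.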
Analyse topology spread behaviour
After the drain completes, examine how the two workloads behaved differently. Workload-a's DoNotSchedule constraint can leave pods Pending even when the remaining zones have spare capacity: the cordoned nodes still exist, so the empty zone still counts as a topology domain, and packing all six replicas into two zones would exceed maxSkew: 1.
```bash
# Check final pod distribution
kubectl get pods -o wide | grep workload

# Check whether workload-a pods are Pending (DoNotSchedule constraint)
kubectl describe pods -l app=workload-a | grep -A5 "Events:"
# Look for messages like:
#   "0/6 nodes are available: ... node(s) didn't match pod topology spread constraints"

# workload-b (no spread constraints)
# Check how its pods redistributed across the remaining nodes
kubectl get pods -l app=workload-b -o wide | awk 'NR>1 {print $7}' | sort | uniq -c
```
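The Pending pods follow from the skew arithmetic: after the drain, the cordoned zone is a topology domain with zero pods. A sketch of the computation (the helper name is made up):

```shell
# Observed skew = pods in the most-loaded zone minus pods in the least-loaded zone.
zone_skew() {
  # args: pod counts per zone, e.g. zone_skew 3 3 0
  printf '%s\n' "$@" | sort -n | awk 'NR == 1 { min = $1 } { max = $1 } END { print max - min }'
}

zone_skew 2 2 2   # healthy baseline across 3 zones → 0
zone_skew 3 3 0   # zone drained, all 6 pods in 2 zones → 3
# With maxSkew: 1 and DoNotSchedule, any placement that would push skew
# above 1 is rejected, so two workload-a pods stay Pending until the
# zone returns.
```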
Restore the zone and measure full RTO
Uncordon the nodes to simulate zone recovery. Measure how long until workloads rebalance back to the restored zone.
```bash
# Uncordon zone nodes to simulate zone recovery
RESTORE_START=$(date -u +%s)
for NODE in $ZONE_NODES; do
  kubectl uncordon "$NODE"
done

# Watch rebalancing — topology spread will NOT auto-rebalance
# existing pods unless the Descheduler is running
kubectl get pods -o wide -w | grep workload-a

# Force a rebalance by rolling the deployment
kubectl rollout restart deployment/workload-a
kubectl rollout status deployment/workload-a

RESTORE_END=$(date -u +%s)
echo "Full RTO: $((RESTORE_END - DRAIN_START))s"
```
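Raw epoch deltas are awkward in a report; a small formatter (a hypothetical helper, not part of kubectl) turns the recorded timestamps into table-ready durations:

```shell
# Format an epoch-seconds interval as minutes and seconds for the RTO table.
fmt_duration() {
  local start=$1 end=$2
  local secs=$(( end - start ))
  printf '%dm%02ds\n' $(( secs / 60 )) $(( secs % 60 ))
}

fmt_duration 1700000000 1700000185   # → 3m05s
# e.g. fmt_duration "$DRAIN_START" "$RESTORE_END"
```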
Record RTO measurements
Fill in this table with your observed values. These measurements become the basis for SLO definitions.
| Workload | Config | Time to 50% capacity | Time to 100% capacity | Observations |
|---|---|---|---|---|
| workload-a | TopologySpread + PDB | ___ s | ___ s | May stay Pending if no capacity in remaining zones |
| workload-b | No spread + strict PDB | ___ s | ___ s | Drain slowed by PDB; fully reschedules to other zones |
Success Criteria
- Both workload classes return to their desired replica counts after the zone is restored, with an RTO recorded for each in the table above
- workload-a's Pending pods and workload-b's PDB-slowed drain are observed and explained
- A hardening list is produced for every workload that failed ungracefully
Key Concepts
- DoNotSchedule vs ScheduleAnyway — DoNotSchedule is strict: pods stay Pending rather than violating spread. ScheduleAnyway allows scheduling with skew, degrading zone balance
- PDB during drain — kubectl drain respects PDBs by waiting; --force only affects pods without a controller, while --disable-eviction bypasses PDBs entirely and should only be used in genuine emergencies
- Descheduler — Kubernetes does not auto-rebalance pods after zone restoration; the Descheduler project (sigs.k8s.io/descheduler) handles this
- maxSkew semantics — a maxSkew of 1 means the difference between the zone with most pods and the zone with fewest cannot exceed 1
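To automate the rebalance step instead of rolling deployments by hand, a Descheduler policy can evict pods that violate their spread constraints once the zone returns. A minimal sketch against the v1alpha2 policy API (check the plugin name and schema against your Descheduler version, as the policy format has changed between releases):

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: rebalance-after-zone-recovery
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"
```

Evicted pods are then rescheduled by the normal scheduler, which the spread constraint steers back into the restored zone.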
Further Reading
- Pod Topology Spread Constraints — kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints
- Disruptions and PDBs — kubernetes.io/docs/concepts/workloads/pods/disruptions
- Descheduler project — github.com/kubernetes-sigs/descheduler
- Chaos Engineering with node failure — principlesofchaos.org