Objective
Stateful workloads require special handling during node drains — a database pod cannot be recreated instantly because it must wait for the PVC to detach from the old node and reattach to the new one. This exercise designs a complete upgrade runbook, executes it against a cluster with a real StatefulSet database, measures RTO, and documents gaps between expected and actual behaviour.
Prerequisites
- Kubernetes cluster with at least 3 worker nodes across 2+ zones
- kubectl with cluster-admin access and SSH to nodes (for timing validation)
- A storage class that supports topology-aware binding
- Velero or cloud snapshots configured for PVC backup
Steps
01
Deploy stateful and stateless test workloads
# Deploy PostgreSQL StatefulSet (stateful workload)
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: default
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels: {app: postgres}
  template:
    metadata:
      labels: {app: postgres}
    spec:
      containers:
      - name: postgres
        image: postgres:15
        env:
        - name: POSTGRES_PASSWORD
          value: testpassword
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        resources:
          requests: {cpu: 100m, memory: 256Mi}
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 5Gi
EOF

# Deploy stateless app with a strict PDB
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-app
  namespace: default
spec:
  replicas: 4
  selector:
    matchLabels: {app: stateless-app}
  template:
    metadata:
      labels: {app: stateless-app}
    spec:
      containers:
      - name: app
        image: nginx:alpine
        resources:
          requests: {cpu: 50m, memory: 64Mi}
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stateless-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels: {app: stateless-app}
EOF

# Write test data to PostgreSQL
kubectl wait pod/postgres-0 --for=condition=Ready --timeout=120s
kubectl exec postgres-0 -- psql -U postgres -c \
  "CREATE TABLE upgrade_test (id serial, note text); INSERT INTO upgrade_test (note) VALUES ('pre-upgrade-data');"

# Verify the test data
kubectl exec postgres-0 -- psql -U postgres -c \
  "SELECT * FROM upgrade_test;"
02
Pre-drain checks (runbook Phase 1)
# Record baseline state
echo "=== UPGRADE RUNBOOK EXECUTION LOG ===" | tee runbook.log
echo "Start: $(date -u)" | tee -a runbook.log

# P1.1: All workloads healthy (this listing should be empty)
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed | \
  tee -a runbook.log

# P1.2: PDB status (every PDB must show ALLOWED DISRUPTIONS > 0)
kubectl get pdb --all-namespaces | tee -a runbook.log

# P1.3: PVC status (all must be Bound; this listing should be empty)
kubectl get pvc --all-namespaces | grep -v Bound | tee -a runbook.log

# P1.4: Identify the node that hosts postgres-0
POSTGRES_NODE=$(kubectl get pod postgres-0 -o jsonpath='{.spec.nodeName}')
POSTGRES_ZONE=$(kubectl get node $POSTGRES_NODE \
  --output=jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
echo "postgres-0 is on $POSTGRES_NODE in zone $POSTGRES_ZONE" | tee -a runbook.log

# P1.5: Create a backup before touching the node
velero backup create pre-drain-$(date +%Y%m%d%H%M) \
  --include-namespaces default --wait | tee -a runbook.log
03
Drain the StatefulSet node (runbook Phase 2)
# Phase 2: drain the stateful node
echo "=== PHASE 2: DRAIN START $(date -u) ===" | tee -a runbook.log
kubectl cordon $POSTGRES_NODE
CORDON_TIME=$(date -u +%s)

# Monitor PDB blocking (run in a separate terminal)
# watch -n 2 'kubectl get pdb --all-namespaces'

DRAIN_START=$(date -u +%s)
kubectl drain $POSTGRES_NODE \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s 2>&1 | tee -a runbook.log
DRAIN_END=$(date -u +%s)
echo "Drain took: $((DRAIN_END - DRAIN_START))s" | tee -a runbook.log
04
Monitor PVC detachment and pod rescheduling
# Watch the postgres-0 reschedule sequence
kubectl get pod postgres-0 -w 2>&1 | head -20

# Expected sequence and timings:
#   postgres-0 Running → Terminating   (drain initiated)
#   postgres-0 Terminating             (PVC detach: 5-30s)
#   postgres-0 Pending                 (scheduling on new node)
#   postgres-0 ContainerCreating       (PVC attach: 10-60s on cloud)
#   postgres-0 Running                 (ready)

# Record timing until the pod is Running again
POD_READY_START=$(date -u +%s)
kubectl wait pod/postgres-0 --for=condition=Ready --timeout=300s
POD_READY_END=$(date -u +%s)
STATEFUL_RTO=$((POD_READY_END - DRAIN_START))
echo "StatefulSet RTO: ${STATEFUL_RTO}s" | tee -a runbook.log

# Verify the PVC reattached on the new node
NEW_POSTGRES_NODE=$(kubectl get pod postgres-0 -o jsonpath='{.spec.nodeName}')
echo "postgres-0 rescheduled to: $NEW_POSTGRES_NODE" | tee -a runbook.log

# Verify data integrity
kubectl exec postgres-0 -- psql -U postgres -c \
  "SELECT * FROM upgrade_test;" | tee -a runbook.log
# Must show: pre-upgrade-data still present
05
Verify stateless PDB enforcement
# Confirm the PDB kept the stateless workload at or above minAvailable
# (during the drain, stateless-app should never have dropped below 3 pods)
kubectl get deployment stateless-app | tee -a runbook.log

# Review events for PDB activity during the drain
kubectl get events --field-selector reason=DisruptionAllowed \
  --sort-by='.lastTimestamp' | tee -a runbook.log

# Uncordon the node (simulates upgrade completion)
kubectl uncordon $POSTGRES_NODE
echo "Node uncordoned: $(date -u)" | tee -a runbook.log
06
Document findings and runbook gaps
# Generate the execution report
cat << 'EOF' >> runbook.log
=== EXECUTION SUMMARY ===
Drain duration: ___s
StatefulSet RTO: ___s (target: <120s)
Stateless workload impact: min available maintained? YES/NO
Data integrity: Intact? YES/NO
=== GAPS IDENTIFIED ===
[ ] PVC detach took longer than expected (expected <=30s, actual ___s)
[ ] PDB blocked drain longer than expected (record which PDB)
[ ] No alerting triggered for pod unavailability during drain
[ ] postgres-0 not in a separate zone from stateless workload
(recommendation: use pod anti-affinity or dedicated node pool)
=== RUNBOOK IMPROVEMENTS ===
- Add PVC snapshot step before drain (currently manual)
- Add zone check: never drain all nodes in the same AZ simultaneously
- Set terminationGracePeriodSeconds on postgres pod to allow clean shutdown
- Consider separate node pool for stateful workloads to isolate drain blast radius
EOF
cat runbook.log
Success Criteria
- Drain completes within the 300s timeout without manual intervention
- StatefulSet RTO under 120s
- The 'pre-upgrade-data' row is still present in upgrade_test after rescheduling
- stateless-app never drops below 3 available replicas during the drain
Key Concepts
- ReadWriteOnce PVC constraint — RWO block volumes (e.g. AWS EBS, Azure Disk) can be mounted on only one node at a time; the detach/attach cycle adds 30-60s to StatefulSet RTO
- StatefulSet ordering — with the default OrderedReady pod management policy, Kubernetes honours pod ordering guarantees (pod N-1 must be ready before pod N starts), which also governs recovery after eviction
- terminationGracePeriodSeconds — set long enough for your database to flush writes; default 30s may not be enough for large PostgreSQL checkpoints
- Dedicated stateful node pool — isolating stateful workloads to a separate node pool lets you control drain order precisely
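The last two concepts can be expressed directly in the StatefulSet pod template. A minimal sketch, assuming 120s is enough for your checkpoint size and that nodes in a dedicated stateful pool carry a workload-pool: stateful label (both values are illustrative, not part of the exercise manifests):

```yaml
# Illustrative fragment to merge into the postgres StatefulSet pod spec.
spec:
  template:
    spec:
      # Assumed value: long enough for PostgreSQL to flush its final
      # checkpoint on shutdown; the 30s default may cut large flushes short.
      terminationGracePeriodSeconds: 120
      # Hypothetical label identifying a dedicated stateful node pool.
      nodeSelector:
        workload-pool: stateful
```

Pinning stateful pods to their own pool means a drain of the general pool never touches the database, and the stateful pool can be drained last, one node at a time.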
Further Reading
- StatefulSet documentation — kubernetes.io/docs/concepts/workloads/controllers/statefulset
- PVC volume binding — kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode
- Kubernetes graceful shutdown — kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
- Draining nodes — kubernetes.io/docs/tasks/administer-cluster/safely-drain-node