Lifecycle Management L2 · PRACTICAL ~90 min

Zero-Downtime Kubernetes Minor Version Upgrade

Execute a minor version upgrade from Kubernetes 1.28 to 1.29 on an AKS cluster using node pool surge upgrades. Monitor rolling node replacements with kubectl and validate all workloads remain healthy throughout the process.

Objective

Kubernetes minor version upgrades replace every node in the cluster. Done incorrectly, they cause downtime. This exercise performs a controlled upgrade using surge nodes — extra nodes added before draining existing ones — ensuring capacity is always available. You will monitor each phase and validate workload health at each step.

Test this on a non-production cluster. A failed upgrade mid-process can leave the cluster in a mixed-version state that requires careful recovery.

Prerequisites

- A disposable (non-production) AKS cluster running Kubernetes 1.28, with at least one user node pool
- Azure CLI (az) authenticated with permissions to upgrade the cluster
- kubectl configured against the cluster
- pluto installed for the deprecated-API scan, and velero (or an equivalent backup tool) for the backup step

Steps

01

Pre-upgrade checklist

# 1. Record current state
kubectl get nodes -o wide
kubectl version --output=yaml | grep gitVersion

# 2. Verify all nodes are healthy
kubectl get nodes --no-headers | grep -vw Ready
# Should return nothing (plain "grep -v Ready" would also hide NotReady nodes)

# 3. Check for PodDisruptionBudgets
kubectl get pdb --all-namespaces
# Every PDB should show at least 1 under ALLOWED DISRUPTIONS

# 4. Scan for deprecated APIs (must be zero removed APIs)
pluto detect-all-in-cluster --target-versions k8s=v1.29.0
# Fix any REMOVED items before proceeding

# 5. Backup etcd / cluster state
velero backup create pre-upgrade-$(date +%Y%m%d) --wait
# Or use AKS cluster snapshot if available

# 6. Check available upgrade versions
az aks get-upgrades \
  --resource-group my-rg \
  --name my-cluster \
  --output table
# Should show 1.29.x as available
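One more pre-flight check worth scripting: minor versions must be upgraded one step at a time, so a 1.27 node pool cannot jump straight to 1.29. A minimal sketch of that check; the helper names minor_of and check_skew are illustrative, not part of az or kubectl:

```shell
# Hypothetical pre-flight helper: confirm no node is more than one minor
# version behind the target, since minor upgrades must be applied
# sequentially (1.27 -> 1.28 -> 1.29).
minor_of() {                       # "v1.28.5" -> 28
  echo "$1" | cut -d. -f2
}

check_skew() {                     # usage: check_skew <target> <node-versions...>
  local target_minor
  target_minor=$(minor_of "$1"); shift
  local v
  for v in "$@"; do
    if [ $(( target_minor - $(minor_of "$v") )) -gt 1 ]; then
      echo "skew violation: $v is more than one minor behind 1.$target_minor"
      return 1
    fi
  done
  echo "skew OK"
}

# Feed it the live node versions:
#   check_skew v1.29.0 \
#     $(kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}')
```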
02

Configure surge upgrade settings

Surge upgrades add extra nodes before draining old ones. Setting max-surge to 33% (1 extra node per 3) ensures capacity is never reduced during the upgrade.

# Set surge upgrade on the user node pool
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name user \
  --max-surge 33%

# Verify the setting
az aks nodepool show \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name user \
  --query "upgradeSettings" \
  --output table
max-surge as a percentage (33%) is preferred over a fixed number because it scales with node pool size. For 3 nodes, 33% = 1 surge node. For 9 nodes, 33% = 3 surge nodes, allowing parallel upgrades per zone.
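Note that AKS rounds fractional surge counts up, so 33% of a 4-node pool is 2 surge nodes, not 1. The arithmetic can be sketched as follows (the surge_nodes helper is illustrative, not an az command):

```shell
# Illustrative helper: how many surge nodes a max-surge percentage yields
# for a pool of N nodes (fractional results round up).
surge_nodes() {                    # usage: surge_nodes <pool-size> <percent>
  echo $(( ($1 * $2 + 99) / 100 )) # ceiling division
}

surge_nodes 3 33    # -> 1
surge_nodes 9 33    # -> 3
surge_nodes 4 33    # -> 2 (1.32 rounds up)
```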
03

Deploy a canary workload to test upgrade impact

# Deploy a load-generating workload to monitor during upgrade
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: upgrade-canary
  namespace: default
spec:
  replicas: 6
  selector:
    matchLabels: {app: upgrade-canary}
  template:
    metadata:
      labels: {app: upgrade-canary}
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: {app: upgrade-canary}
      containers:
      - name: app
        image: nginx:alpine
        resources:
          requests: {cpu: 50m, memory: 64Mi}
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: upgrade-canary-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels: {app: upgrade-canary}
EOF

kubectl wait deployment/upgrade-canary --for=condition=Available --timeout=120s
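The topology spread constraint above caps the per-zone imbalance at one pod. A quick way to sanity-check the spread before starting the upgrade; the max_skew helper is a sketch, not a kubectl feature:

```shell
# Illustrative checker: given per-zone replica counts, report the skew
# (busiest zone minus quietest). It should never exceed maxSkew: 1.
max_skew() {                       # usage: max_skew <count> <count> ...
  local min=$1 max=$1 c
  for c in "$@"; do
    [ "$c" -lt "$min" ] && min=$c
    [ "$c" -gt "$max" ] && max=$c
  done
  echo $(( max - min ))
}

max_skew 2 2 2    # 6 replicas over 3 zones -> 0

# Gather live per-node counts (roll up to zones via the node's
# topology.kubernetes.io/zone label):
#   kubectl get pods -l app=upgrade-canary \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
```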
04

Initiate the upgrade and monitor

Start the upgrade in a separate terminal window while monitoring node and pod state in another.

# Terminal 1: Start monitoring
watch -n 5 'kubectl get nodes -o wide && echo "---" && kubectl get pods -o wide | head -20'

# Terminal 2: Start the upgrade
UPGRADE_START=$(date -u +%s)
az aks upgrade \
  --resource-group my-rg \
  --name my-cluster \
  --kubernetes-version 1.29.0 \
  --yes
# (leave out --node-image-only: that flag selects a node-image-only
#  refresh and does not take a true/false argument)

# In Terminal 1, watch for the sequence:
# 1. New surge node appears (STATUS: Ready)
# 2. Old node becomes SchedulingDisabled (cordoned)
# 3. Pods evict from old node, reschedule on surge node
# 4. Old node terminates
# 5. Process repeats for next node
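To quantify progress from the watch output, a small counter helps. This is an illustrative sketch; the commented live call assumes your kubeconfig points at the cluster:

```shell
# Illustrative progress counter: how many of the given node versions have
# reached the target prefix.
count_on_version() {               # usage: count_on_version <prefix> <versions...>
  local prefix=$1 n=0 v; shift
  for v in "$@"; do
    case "$v" in "$prefix"*) n=$((n + 1)) ;; esac
  done
  echo "$n"
}

count_on_version v1.29 v1.29.0 v1.28.9 v1.28.9   # -> 1 (of 3 nodes done)

# Live usage while the rollout runs:
#   count_on_version v1.29 \
#     $(kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}')
```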
05

Monitor workload health during upgrade

# Watch PDB status — should never show 0 disruptions available
watch -n 3 'kubectl get pdb --all-namespaces'

# Check deployment available replicas never drop below minimum
watch -n 5 'kubectl get deployments --all-namespaces \
  -o custom-columns="NAME:.metadata.name,DESIRED:.spec.replicas,AVAILABLE:.status.availableReplicas,READY:.status.readyReplicas"'

# Watch node versions change
kubectl get nodes -w \
  --label-columns=kubernetes.azure.com/node-image-version

# Monitor events for cordon and eviction activity (drain evictions are
# not always tagged reason=Evicted, so filter broadly)
kubectl get events --all-namespaces -w | grep -iE 'evict|cordon|drain'
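The checks above are visual; for an unattended record, a tiny watchdog around the PDB floor works too. A sketch, where below_floor is a hypothetical helper and 4 is the canary's minAvailable from step 03:

```shell
# Hypothetical watchdog predicate: true when available replicas have
# fallen below the PDB floor.
below_floor() {                    # usage: below_floor <available> <minAvailable>
  [ "${1:-0}" -lt "$2" ]
}

# Sample the live deployment during the upgrade, e.g.:
#   while sleep 5; do
#     avail=$(kubectl get deployment upgrade-canary \
#       -o jsonpath='{.status.availableReplicas}')
#     below_floor "$avail" 4 && echo "DIP: $avail available at $(date -u +%T)"
#   done

below_floor 3 4 && echo "dip"      # 3 < 4, so this prints "dip"
below_floor 6 4 || echo "healthy"  # 6 >= 4, so this prints "healthy"
```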
06

Post-upgrade validation

# Verify all nodes show the new version
kubectl get nodes -o custom-columns="NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion"
# Every VERSION should read v1.29.x; cross-check STATUS with plain "kubectl get nodes"

# Verify API server version
kubectl version --output=yaml | grep gitVersion
# Should show: v1.29.x

# Verify all workloads are healthy (READY column must be n/n for every deployment)
kubectl get deployments --all-namespaces --no-headers | \
  awk '{split($3, r, "/"); if (r[1] != r[2]) print}'
# Any output here indicates a workload that didn't recover

# Check for any pods not in Running/Completed state
kubectl get pods --all-namespaces \
  --field-selector='status.phase!=Running,status.phase!=Succeeded'

# Record upgrade duration (run in Terminal 2, where UPGRADE_START was set)
UPGRADE_END=$(date -u +%s)
echo "Upgrade duration: $(( (UPGRADE_END - UPGRADE_START) / 60 )) minutes"
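For a runbook or CI step, the node-version check can also be turned into a pass/fail gate. The assert_all_upgraded name is illustrative, not an az or kubectl command:

```shell
# Illustrative gate: fail loudly if any node is still on a pre-target version.
assert_all_upgraded() {            # usage: assert_all_upgraded <prefix> <versions...>
  local target=$1 v; shift
  for v in "$@"; do
    case "$v" in
      "$target"*) ;;               # matches v1.29.x
      *) echo "stale node version: $v"; return 1 ;;
    esac
  done
  echo "all nodes on $target"
}

# assert_all_upgraded v1.29 \
#   $(kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}')
```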

Success Criteria

Further Reading