Objective
Kubernetes minor version upgrades roll the control plane forward and replace every node in the cluster. Done incorrectly, they cause downtime. This exercise performs a controlled upgrade using surge nodes (extra nodes added before existing ones are drained) so that capacity is always available. You will monitor each phase and validate workload health at every step.
Prerequisites
- AKS cluster running 1.28 with at least one user node pool
- Azure CLI authenticated with Contributor access
- kubectl configured to the cluster
- Test workloads deployed with PDBs (deploy some from earlier exercises)
- Pluto scan completed — no removed API manifests (from d4-e1)
Steps
Pre-upgrade checklist
# 1. Record current state
kubectl get nodes -o wide
kubectl version --output=yaml | grep gitVersion

# 2. Verify all nodes are healthy (-w matches "Ready" as a whole word,
# so NotReady nodes are not accidentally filtered out)
kubectl get nodes --no-headers | grep -vw Ready
# Should return nothing

# 3. Check for PodDisruptionBudgets
kubectl get pdb --all-namespaces
# All PDBs should show ALLOWED DISRUPTIONS > 0

# 4. Scan for deprecated APIs (must be zero removed APIs)
pluto detect-all-in-cluster --target-versions k8s=v1.29.0
# Fix any REMOVED items before proceeding

# 5. Backup etcd / cluster state
velero backup create pre-upgrade-$(date +%Y%m%d) --wait
# Or use AKS cluster snapshot if available

# 6. Check available upgrade versions
az aks get-upgrades \
  --resource-group my-rg \
  --name my-cluster \
  --output table
# Should show 1.29.x as available
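Step 3 of the checklist can be automated so the upgrade is blocked when any PDB would prevent a drain. This is a minimal sketch: `check_pdbs` is a hypothetical helper (not part of any tool above) that parses `kubectl get pdb --all-namespaces --no-headers` output, where ALLOWED DISRUPTIONS is the fifth column.

```shell
#!/usr/bin/env bash
# Hypothetical pre-upgrade gate: fail if any PDB reports 0 allowed disruptions.
# Expects `kubectl get pdb --all-namespaces --no-headers` output on stdin:
#   NAMESPACE NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE
check_pdbs() {
  local blocked=0
  while read -r ns name min max allowed _; do
    if [ "${allowed}" = "0" ]; then
      echo "BLOCKED: ${ns}/${name} allows 0 disruptions"
      blocked=1
    fi
  done
  return "${blocked}"
}

# Usage (against a live cluster):
#   kubectl get pdb --all-namespaces --no-headers | check_pdbs || exit 1
```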
Configure surge upgrade settings
Surge upgrades add extra nodes before draining old ones. Setting max-surge to 33% (1 extra node per 3) ensures capacity is never reduced during the upgrade.
# Set surge upgrade on the user node pool
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name user \
  --max-surge 33%

# Verify the setting
az aks nodepool show \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name user \
  --query "upgradeSettings" \
  --output table
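To reason about how many extra nodes a percentage buys you, this small sketch mirrors the rounding AKS documents for percentage max-surge values (fractional results round up to the next whole node). The `surge_nodes` function is illustrative, not an Azure CLI command.

```shell
#!/usr/bin/env bash
# Illustrative helper: how many surge nodes a percentage max-surge yields.
# Assumption: AKS rounds a fractional node count up, per its documented behavior.
surge_nodes() {
  local pool_size=$1 surge_pct=$2
  # Integer ceiling of pool_size * surge_pct / 100
  echo $(( (pool_size * surge_pct + 99) / 100 ))
}

surge_nodes 3 33    # 3-node pool at 33%  -> 1 surge node
surge_nodes 10 33   # 10-node pool at 33% -> 4 surge nodes
```

A larger max-surge finishes the upgrade faster but provisions more temporary capacity (and cost); 33% is a common middle ground.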
Deploy a canary workload to test upgrade impact
# Deploy a load-generating workload to monitor during upgrade
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: upgrade-canary
  namespace: default
spec:
  replicas: 6
  selector:
    matchLabels: {app: upgrade-canary}
  template:
    metadata:
      labels: {app: upgrade-canary}
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: {app: upgrade-canary}
      containers:
      - name: app
        image: nginx:alpine
        resources:
          requests: {cpu: 50m, memory: 64Mi}
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: upgrade-canary-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels: {app: upgrade-canary}
EOF
kubectl wait deployment/upgrade-canary --for=condition=Available --timeout=120s
Initiate the upgrade and monitor
Start the upgrade in a separate terminal window while monitoring node and pod state in another.
# Terminal 1: Start monitoring
watch -n 5 'kubectl get nodes -o wide && echo "---" && kubectl get pods -o wide | head -20'

# Terminal 2: Start the upgrade (a full upgrade is the default;
# pass --node-image-only only when you want node images without a version change)
UPGRADE_START=$(date -u +%s)
az aks upgrade \
  --resource-group my-rg \
  --name my-cluster \
  --kubernetes-version 1.29.0 \
  --yes

# In Terminal 1, watch for the sequence:
# 1. New surge node appears (STATUS: Ready)
# 2. Old node becomes SchedulingDisabled (cordoned)
# 3. Pods evict from the old node and reschedule on the surge node
# 4. Old node terminates
# 5. Process repeats for the next node
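Step 2 of that sequence can be spotted mechanically instead of by eye. This is a sketch: `cordoned_nodes` is a hypothetical helper that filters `kubectl get nodes --no-headers` output for nodes whose STATUS column contains `SchedulingDisabled`.

```shell
#!/usr/bin/env bash
# Hypothetical monitoring helper: print the names of cordoned nodes.
# Expects `kubectl get nodes --no-headers` output on stdin:
#   NAME STATUS ROLES AGE VERSION
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ {print $1}'
}

# Usage (against a live cluster, e.g. inside a watch loop):
#   kubectl get nodes --no-headers | cordoned_nodes
```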
Monitor workload health during upgrade
# Watch PDB status: ALLOWED DISRUPTIONS should never hit 0
watch -n 3 'kubectl get pdb --all-namespaces'

# Check that deployment available replicas never drop below minimum
watch -n 5 'kubectl get deployments --all-namespaces \
  -o custom-columns="NAME:.metadata.name,DESIRED:.spec.replicas,AVAILABLE:.status.availableReplicas,READY:.status.readyReplicas"'

# Watch node versions change
kubectl get nodes -w \
  --label-columns=kubernetes.azure.com/node-image-version

# Monitor events for eviction activity
# (--sort-by cannot be combined with -w, so list and re-run as needed)
kubectl get events --all-namespaces \
  --field-selector reason=Evicted \
  --sort-by='.lastTimestamp'
Post-upgrade validation
# Verify all nodes show the new version
kubectl get nodes \
  -o custom-columns="NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion"

# Verify API server version
kubectl version --output=yaml | grep gitVersion
# Should show: v1.29.x

# Verify all workloads are healthy
kubectl get deployments --all-namespaces --no-headers \
  | grep -v "1/1\|2/2\|3/3\|4/4\|5/5\|6/6"
# Any output here indicates a workload that didn't recover
# (extend the pattern if you run deployments with more than 6 replicas)

# Check for any pods not in Running/Completed state
kubectl get pods --all-namespaces \
  --field-selector='status.phase!=Running,status.phase!=Succeeded'

# Record upgrade duration
UPGRADE_END=$(date -u +%s)
echo "Upgrade duration: $(( (UPGRADE_END - UPGRADE_START) / 60 )) minutes"
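The first validation step can also be turned into a pass/fail gate. This is a sketch: `nodes_not_on` is a hypothetical helper that prints any node whose kubelet version (the fifth column of `kubectl get nodes --no-headers`) does not start with the target minor version; empty output means every node upgraded.

```shell
#!/usr/bin/env bash
# Hypothetical post-upgrade gate: print nodes not yet on the target version.
# Expects `kubectl get nodes --no-headers` output on stdin:
#   NAME STATUS ROLES AGE VERSION
nodes_not_on() {
  local target=$1   # e.g. v1.29
  awk -v t="$target" 'index($5, t) != 1 {print $1}'
}

# Usage (against a live cluster):
#   stragglers=$(kubectl get nodes --no-headers | nodes_not_on v1.29)
#   [ -z "$stragglers" ] && echo "all nodes upgraded"
```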
Success Criteria
- All nodes report v1.29.x and a Ready status after the upgrade
- No PDB reached 0 allowed disruptions at any point during the upgrade
- The upgrade-canary deployment kept at least 4 of 6 replicas available throughout
- All workloads returned to their desired replica counts post-upgrade
Further Reading
- AKS upgrade documentation — learn.microsoft.com/azure/aks/upgrade-cluster
- Kubernetes version skew policy — kubernetes.io/releases/version-skew-policy
- AKS node surge upgrades — learn.microsoft.com/azure/aks/upgrade-aks-cluster
- Upgrade planning guide — kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade