01 · Cluster Architecture
Multi-Cloud Cluster Design & Operations
AKS, EKS, GKE administration · HA architecture · multi-zone · DR
Multi-Cloud Cluster Provisioning
Provision and manage clusters across AKS, EKS, and GKE with parity in configuration standards. Understand cloud-specific control plane differences, SLAs, and managed add-on ecosystems.
High Availability Design
Design clusters with multi-zone node pools, pod topology spread constraints, pod disruption budgets, and anti-affinity rules. Validate failure scenarios across AZ loss events.
Namespace Tenancy & RBAC
Design multi-tenant cluster topologies with namespace isolation, resource quotas, limit ranges, and hierarchical RBAC. Implement least-privilege service accounts and audit RBAC drift.
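Auditing RBAC drift reduces to diffing periodic snapshots of binding subjects. A minimal sketch, assuming a snapshot format of binding name → set of subjects (in practice the data would come from `kubectl get rolebindings -o json` or the API):

```python
# Sketch: detect RBAC drift by diffing two snapshots of binding subjects.
# Snapshot format ({binding_name: set_of_subjects}) is an assumption.

def rbac_drift(baseline, current):
    """Return subjects added/removed per binding relative to the baseline."""
    drift = {}
    for name in set(baseline) | set(current):
        before = baseline.get(name, set())
        after = current.get(name, set())
        added, removed = after - before, before - after
        if added or removed:
            drift[name] = {"added": sorted(added), "removed": sorted(removed)}
    return drift

baseline = {"team-a-edit": {"user:alice", "user:bob"}}
current = {"team-a-edit": {"user:alice", "user:mallory"},
           "team-b-admin": {"user:eve"}}
print(rbac_drift(baseline, current))
```

Run on a schedule, any non-empty result is a drift alert to triage against approved changes.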
Disaster Recovery
Design and test cluster DR runbooks: cross-region failover, backup/restore of cluster state, etcd snapshot management, and RTO/RPO validation using Velero and provider-native tools.
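The RTO/RPO validation step is plain timestamp arithmetic. A sketch, with illustrative targets per workload class (the tiers here are assumptions, not standards):

```python
from datetime import datetime, timedelta

# Illustrative RTO/RPO targets per workload class (assumed values).
TARGETS = {"critical": {"rto": timedelta(minutes=30), "rpo": timedelta(minutes=5)},
           "standard": {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)}}

def dr_result(klass, outage_start, service_restored, last_backup):
    """Compare measured recovery against the class targets."""
    rto_actual = service_restored - outage_start   # time to restore service
    rpo_actual = outage_start - last_backup        # data-loss window
    t = TARGETS[klass]
    return {"rto_met": rto_actual <= t["rto"], "rpo_met": rpo_actual <= t["rpo"]}

outage = datetime(2024, 5, 1, 12, 0)
restored = datetime(2024, 5, 1, 12, 20)
backup = datetime(2024, 5, 1, 11, 58)
print(dr_result("critical", outage, restored, backup))
```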
Hands-On Exercises
L1 · INTRO
Provision a 3-node AKS cluster using Terraform. Enable availability zones for node pools. Verify zone distribution via kubectl get nodes -o wide and review the generated Azure resource model.
L2 · PRACTICAL
Deploy the same workload on EKS and GKE using a shared Terraform module with provider-specific overrides. Create a comparison matrix of control plane configuration differences, default CNI, and managed add-on gaps.
L2 · PRACTICAL
Configure a multi-tenant namespace model: two teams, separate ResourceQuotas, LimitRanges, and RBAC bindings. Simulate a privileged namespace breach attempt and validate isolation holds.
L3 · ADVANCED
Simulate a full AZ failure in a multi-zone EKS cluster using node cordon + drain on all nodes in one AZ. Document how pod topology spread constraints and PDBs govern rescheduling. Measure RTO for each workload class.
L3 · ADVANCED
Implement a full cluster backup with Velero (including PVs). Destroy the cluster. Restore to a new cluster in a different region. Validate all workloads, secrets, and service accounts are intact and functional.
Key Resources
Kubernetes Official Docs: Cluster Administration & Multi-tenancy
CKA/CKAD exam curriculum — scheduling, HA, RBAC deep dives
Velero documentation: backup, restore, and schedule policies
Azure/AWS/GCP Well-Architected Framework: AKS/EKS/GKE sections
Practice: Use kind or minikube locally; use free-tier cloud accounts for multi-cloud labs.
You're ready to advance when you can provision a production-grade multi-zone cluster from scratch using IaC, explain every control-plane component's role, design a multi-tenant RBAC model, and execute a cluster restore under a defined RTO target.
02 · Security & Hardening
Platform Security Controls
Pod Security · Network Policies · Secrets · Supply Chain · Runtime Protection
Pod Security & Admission Control
Enforce Pod Security Standards (Baseline/Restricted) via namespace labels, OPA Gatekeeper policies, or Kyverno. Block privileged containers, host path mounts, and root execution cluster-wide.
Network Segmentation
Author Kubernetes NetworkPolicies to implement default-deny ingress/egress. Layer CNI-level policies (Cilium/Calico). Validate with policy simulators and egress traffic audits.
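Before authoring real policies, it helps to reason about default-deny intent with a toy model. The simulator below is an illustration only — actual enforcement semantics depend on the CNI and on full NetworkPolicy selector rules:

```python
# Toy model of default-deny plus allow rules, for reasoning about intent.
# A rule allows traffic matching (from_label, to_label, port). This is NOT
# a real policy engine — just a way to sanity-check a connectivity matrix.

def is_allowed(rules, src_labels, dst_labels, port):
    """Default-deny: traffic passes only if some allow rule matches."""
    for rule in rules:
        if (rule["from"] in src_labels
                and rule["to"] in dst_labels
                and port == rule["port"]):
            return True
    return False

rules = [{"from": "app=frontend", "to": "app=api", "port": 8080}]
print(is_allowed(rules, {"app=frontend"}, {"app=api"}, 8080))  # allowed
print(is_allowed(rules, {"app=frontend"}, {"app=db"}, 5432))   # denied
```

Encoding the intended matrix this way gives you test cases to verify later with kubectl exec probes.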
Secrets Management
Integrate Vault, Azure Key Vault, or AWS Secrets Manager via CSI driver or External Secrets Operator. Enforce encryption at rest for etcd. Rotate secrets without workload restart where possible.
Supply Chain & Image Security
Implement image signing with Cosign/Notary, enforce admission policies to only allow signed images from trusted registries. Run Trivy or Grype in CI pipelines. Maintain a curated base image catalog.
Runtime Protection
Deploy Falco for runtime anomaly detection. Write custom Falco rules for crypto miner patterns, shell exec in containers, and unexpected network connections. Integrate alerts with SIEM/PagerDuty.
Certificate Lifecycle
Deploy cert-manager to automate TLS certificate issuance and renewal via Let's Encrypt or internal CAs. Monitor certificate expiry. Manage mTLS between services using Istio/Linkerd service mesh.
Hands-On Exercises
L1 · INTRO
Apply Pod Security Standards to a namespace using labels (pod-security.kubernetes.io/enforce: restricted). Try deploying a privileged pod and document the admission rejection. Explain the three PSS profiles and when to use each.
L1 · INTRO
Write a default-deny NetworkPolicy for a namespace, then create allow policies for specific pod selectors and ports. Use kubectl exec to verify allowed and blocked connections match intent.
L2 · PRACTICAL
Deploy External Secrets Operator. Connect it to AWS Secrets Manager or Azure Key Vault. Create an ExternalSecret that syncs a database credential into a Kubernetes Secret automatically on rotation.
L2 · PRACTICAL
Sign a container image with Cosign. Write a Kyverno policy that blocks unsigned images or images not from your approved registry. Validate that unsigned images from Docker Hub are rejected at admission.
L3 · ADVANCED
Deploy Falco with a custom ruleset. Simulate attack scenarios: shell exec in a running container, outbound connection to a known C2 IP, privilege escalation attempt. Validate Falco alerts fire within SLA. Write a triage runbook for each scenario.
L3 · ADVANCED
Run CIS Kubernetes Benchmark and kube-bench against a cluster. Produce a prioritized remediation plan. Implement 10 failing checks, re-run, and document delta. Score against a target of ≥95% pass rate.
Key Resources
CKS (Certified Kubernetes Security Specialist) exam curriculum
NIST SP 800-190: Application Container Security Guide
CIS Kubernetes Benchmark & kube-bench tool
Sigstore / Cosign documentation and policy controller
Falco documentation: rules reference and gVisor/eBPF drivers
OWASP Kubernetes Security Cheat Sheet
You're ready to advance when you can harden a cluster from scratch against the CIS Benchmark, build an image supply chain with provenance, enforce zero-trust network policies, and respond to a Falco runtime alert end-to-end.
03 · GitOps & CI/CD
GitOps Patterns & Deployment Pipelines
Flux · ArgoCD · Release safety · Auditability · Progressive delivery
GitOps with Flux & Argo CD
Design and operate GitOps control loops using Flux v2 (Kustomize, HelmRelease) or Argo CD application sets. Manage multi-cluster deployments from a single Git source of truth with drift detection.
Progressive Delivery
Implement canary and blue/green deployments with Flagger, Argo Rollouts, or service mesh traffic shifting. Define automated analysis templates with Prometheus metrics and webhook checks.
Release Safety & Auditability
Enforce PR-based change flows, signed commits, policy gates in CI (OPA Conftest, Kubeconform), and mandatory review paths. Maintain a full audit trail from commit to running pod via Git history.
CI Pipeline Security
Integrate vulnerability scanning (Trivy), SBOM generation, and image signing into CI. Enforce SLSA provenance levels. Use OIDC-based authentication to cloud registries, eliminating long-lived CI credentials.
Hands-On Exercises
L1 · INTRO
Bootstrap Flux v2 onto a cluster pointing to a personal GitHub repo. Deploy a sample app via a Kustomization. Make a change in Git and observe the reconciliation loop sync the cluster state automatically.
L2 · PRACTICAL
Implement a multi-environment GitOps structure: base/, overlays/staging/, overlays/prod/. Use Flux or Argo CD to target different clusters per environment. Enforce image tag pinning in production but allow semver ranges in staging.
L2 · PRACTICAL
Build a GitHub Actions pipeline that: lints Kubernetes manifests with Kubeconform, runs OPA Conftest policies, scans the image with Trivy (fail on HIGH/CRITICAL), and signs the image with Cosign before pushing.
L3 · ADVANCED
Implement a canary release with Flagger and Prometheus. Define a Canary resource with a metric template checking the 5xx error rate. Deliberately inject errors into the canary deployment. Validate that Flagger automatically rolls back and fires an alert.
L3 · ADVANCED
Design a GitOps emergency change process ("break-glass" procedure) for critical hotfixes that bypass normal PR review without losing auditability. Implement it and demonstrate a full trace from the hotfix commit to running pod.
Key Resources
Flux v2 documentation: GitRepository, Kustomization, HelmRelease controllers
Argo CD documentation: ApplicationSets and progressive delivery patterns
Flagger documentation: canary analysis templates and metric providers
CNCF GitOps Working Group: OpenGitOps principles
SLSA Framework documentation: provenance requirements and build levels (L0–L3 in v1.0)
04 · Lifecycle Management
Kubernetes Lifecycle & Capacity Operations
Version upgrades · Node pool strategy · Capacity planning · Add-on management
Version Upgrade Strategy
Plan and execute Kubernetes version upgrades (minor/patch) with zero downtime: node pool surge upgrades, compatibility matrix checks, API deprecation scanning with Pluto, and staged rollout across clusters.
Node Pool Strategy
Design node pool topologies with dedicated system pools, spot/preemptible pools with workload toleration, and GPU/high-memory pools. Implement Cluster Autoscaler and KEDA for demand-driven scaling.
Capacity Planning
Right-size workloads using VPA recommendations and resource analysis. Build capacity forecasting from VPA/Prometheus metrics. Set request/limit ratios to prevent OOM and CPU throttling at scale.
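VPA-style right-sizing boils down to a percentile over usage samples plus headroom. A minimal sketch — the percentile and margin here are illustrative choices, not VPA's actual algorithm:

```python
# Sketch: derive a CPU request from observed usage samples, VPA-style.
# The p95 target and 15% headroom margin are illustrative assumptions.

def recommend_request(samples_millicores, percentile=0.95, margin=1.15):
    """Request = the chosen usage percentile plus a headroom margin."""
    s = sorted(samples_millicores)
    idx = min(int(len(s) * percentile), len(s) - 1)
    return round(s[idx] * margin)

samples = [100, 120, 110, 400, 130, 125, 115, 140, 135, 128]
print(recommend_request(samples), "millicores")
```

Note how a single spike (400m) dominates a high percentile on a small sample — real forecasting needs longer observation windows and outlier handling.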
Add-on & Extension Management
Manage cluster add-ons (CoreDNS, kube-proxy, metrics-server, CNI) through a versioned, tested upgrade pipeline. Maintain compatibility between add-on versions and Kubernetes minor versions.
Hands-On Exercises
L1 · INTRO
Run Pluto against a cluster or set of manifests to identify deprecated/removed API versions. Produce a remediation list for upgrading from 1.27 to 1.29. Update the manifests to use current API groups.
L2 · PRACTICAL
Execute a minor version upgrade (e.g., 1.28 → 1.29) on an AKS cluster using node pool surge upgrades. Observe the rollout with kubectl get nodes -w. Validate all workloads remain healthy throughout (zero restarts on PDB-protected deployments).
L2 · PRACTICAL
Deploy Goldilocks VPA alongside a set of workloads. After 24 hours of traffic, review the recommendations dashboard. Apply recommendations to 3 workloads and measure the effect on cluster resource headroom.
L3 · ADVANCED
Design a node pool upgrade strategy for a cluster with stateful workloads (databases on PVs) and stateless workloads with strict PDBs. Write the full runbook. Execute it and measure actual vs expected disruption. Identify any gaps.
Key Resources
Kubernetes release notes: API deprecation and removal timelines per version
Pluto: Detect deprecated Kubernetes apiVersions in code
Goldilocks: VPA-based resource request recommendations dashboard
AKS upgrade documentation, EKS managed node group upgrade docs
KEDA documentation: ScaledObject and external event-driven autoscaling
05 · Observability
Platform Observability & SLOs
Metrics · Logs · Traces · Alerting · SLO/SLI definition · Runbooks
Metrics & Alerting Stack
Deploy and operate Prometheus + Alertmanager + Grafana. Write PromQL queries for cluster KPIs (node pressure, pod restart rates, API server latency). Define alert severity levels and routing rules.
Log Aggregation
Build a structured log pipeline with Fluent Bit → Loki or OpenSearch. Enforce structured JSON logging standards for workloads. Implement log-based alerts for error patterns and security events.
Distributed Tracing
Deploy OpenTelemetry Collector for trace collection. Configure instrumented workloads to emit spans to Tempo or Jaeger. Correlate traces with logs and metrics for cross-service incident investigation.
SLOs, SLIs & Error Budgets
Define SLIs (availability, latency p99, error rate) and encode SLOs in Prometheus recording rules or Sloth. Build error budget burn rate alerts and Grafana dashboards for SLO tracking by service.
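The burn-rate alert reduces to one ratio, the same arithmetic the PromQL example later in this document encodes. A sketch (the SLO value is an example):

```python
# Sketch: error budget burn rate for an availability SLO.
# SLO 99.9% => the error budget is 0.1% of requests in the window.

def burn_rate(error_requests, total_requests, slo=0.999):
    """>1 means the budget is being consumed faster than sustainable."""
    error_ratio = error_requests / total_requests
    return error_ratio / (1 - slo)

# 30 errors out of 10,000 requests: burning ~3x the sustainable rate.
print(burn_rate(30, 10_000))
```

Multi-window alerts (e.g. fast burn over 5m and slow burn over 1h) trigger paging only when both windows exceed their thresholds.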
Hands-On Exercises
L1 · INTRO
Deploy kube-prometheus-stack via Helm. Write 5 PromQL queries for cluster health: node memory pressure, pending pods, API server request rate, PVC binding failures, and container restart rate per namespace.
L2 · PRACTICAL
Define an SLO for a sample web service: 99.9% availability over 30 days. Implement Sloth or manual Prometheus recording rules to track the error budget. Create a Grafana dashboard showing current burn rate and projected budget exhaustion date.
L2 · PRACTICAL
Deploy OpenTelemetry Collector in gateway mode. Configure a sample instrumented app (OpenTelemetry demo app) to emit traces. Correlate a failing request trace with its associated logs in Loki using the trace ID.
L3 · ADVANCED
Design and write a complete runbook for a "PodCrashLoopBackOff flood" incident: detection query, initial triage steps, likely causes decision tree, remediation procedures, escalation path, and post-incident review template. Validate it against a simulated incident.
Key Resources
Prometheus documentation: recording rules, alerting rules, PromQL reference
Google SRE Book (free online): SLO chapter and error budget policy
Sloth: SLO generator for Prometheus
OpenTelemetry documentation: Collector configuration and SDK guides
Grafana Mimir / Thanos: long-term metrics storage at scale
06 · Platform Operations
Incident Response & Platform Support
RCA · 2nd/3rd level support · On-call operations · Application team partnership
Kubernetes Internals Deep Dive
Master control plane components: kube-scheduler (scoring, filtering, preemption), kube-controller-manager, etcd consistency, API server admission chain, and kubelet reconciliation loops. Understand failure modes for each.
Incident Triage & RCA
Build structured incident response: detect → triage → contain → remediate → RCA. Use structured 5 Whys or fishbone (Ishikawa) analysis. Write blameless post-mortems with timeline, contributing factors, and preventive actions.
Application Team Partnership
Advise teams on container best practices: correct resource requests/limits, liveness vs readiness vs startup probes, graceful shutdown (preStop hooks, SIGTERM handling), health endpoint standards, and ingress/DNS setup.
Storage & Stateful Workloads
Operate CSI drivers (Azure Disk, EFS, GCS Fuse). Manage PVC lifecycle, storage classes, reclaim policies, and volume snapshots. Understand StatefulSet rollout guarantees and pod identity for stateful systems.
Hands-On Exercises
L1 · INTRO
Walk through the complete lifecycle of a pod: creation, scheduling decisions, image pull, container runtime setup, readiness probe evaluation, and termination with preStop hook. Trace each step using kubectl describe and control plane logs.
L2 · PRACTICAL
Simulate 5 common Kubernetes failure scenarios: ImagePullBackOff, OOMKilled, Evicted (disk pressure), Pending (resource constraints), Terminating (stuck finalizer). For each, practice the full triage-to-resolution flow and document the runbook entry.
L2 · PRACTICAL
Review a real-world or sample application deployment missing best practices (no resource limits, no probes, no graceful shutdown). Write an advisory report and a corrected Deployment manifest with full annotations explaining each change.
L3 · ADVANCED
Run a chaos engineering exercise using Chaos Mesh or LitmusChaos: pod failure, network latency injection, and node restart. Document which workloads degraded gracefully vs failed ungracefully. Produce a hardening backlog from findings.
Key Resources
Kubernetes Internals: "Kubernetes Up and Running" (O'Reilly) — chapters on scheduler, controllers
GitHub: Kubernetes source code — pkg/scheduler, pkg/controller
Chaos Mesh / LitmusChaos: chaos engineering for Kubernetes
Learnk8s.io: detailed architectural diagrams of Kubernetes components
Google SRE Book: Incident Management chapter and post-mortem culture
07 · Automation & IaC
Platform Automation & Infrastructure as Code
Terraform · Ansible · Bash/Python/Go · Policy-as-code · Self-service workflows
Terraform for Kubernetes Platforms
Build modular Terraform for multi-cloud cluster provisioning (AKS, EKS, GKE), node pools, managed identities/IAM, add-ons, and monitoring integrations. Use remote state, workspaces, and Atlantis for collaborative IaC.
Policy-as-Code
Write OPA Rego policies or Kyverno ClusterPolicies to enforce standards at admission. Validate manifests in CI with Conftest. Build auto-remediation via Kyverno mutating policies for missing labels, resource limits, and security contexts.
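The mutation logic such a rule performs can be sketched in plain Python over a Deployment manifest dict — this is an illustration of the remediation idea, not Kyverno's patch syntax, and the default limits are assumed values:

```python
# Sketch of auto-remediation a Kyverno mutate rule would perform, expressed
# over a Deployment dict. DEFAULT_LIMITS values are illustrative assumptions.

DEFAULT_LIMITS = {"cpu": "500m", "memory": "256Mi"}

def remediate(deployment):
    """Fill in a missing name label and missing resource limits in place."""
    meta = deployment.setdefault("metadata", {})
    labels = meta.setdefault("labels", {})
    labels.setdefault("app.kubernetes.io/name", meta.get("name", "unknown"))
    pod_spec = deployment["spec"]["template"]["spec"]
    for container in pod_spec["containers"]:
        resources = container.setdefault("resources", {})
        resources.setdefault("limits", dict(DEFAULT_LIMITS))
    return deployment

d = {"metadata": {"name": "web"},
     "spec": {"template": {"spec": {"containers": [{"name": "web"}]}}}}
remediate(d)
print(d["metadata"]["labels"])
```

In Kyverno the same effect is achieved declaratively with a mutate rule and `+(...)` "add if absent" anchors.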
Scripting & Tooling (Bash / Python / Go)
Write robust Bash scripts for operational tasks (node draining, cert rotation, log collection). Build Python tools using the Kubernetes client library for custom controllers/automation. Write Go-based kubectl plugins for self-service workflows.
Operator Pattern & Controllers
Understand the controller-runtime framework. Build a simple custom controller scaffolded with Kubebuilder (controller-gen generates the CRD and RBAC manifests). Understand reconciliation loops, finalizers, owner references, and status subresources. Evaluate and operate existing operators (Prometheus Operator, Strimzi).
Hands-On Exercises
L1 · INTRO
Write a Terraform module that provisions an AKS cluster with a system node pool and a user node pool, enabling RBAC, managed identity, and Azure Monitor integration. Use variables for environment-specific sizing.
L2 · PRACTICAL
Write a Kyverno policy that: (1) requires all Deployments to have resource requests and limits, (2) mutates missing app.kubernetes.io/name labels from the Deployment name, and (3) blocks containers running as UID 0. Test each rule with conformant and non-conformant manifests.
L2 · PRACTICAL
Write a Python script using the official Kubernetes client that lists all deployments across all namespaces where replica count is 1 (single-point-of-failure). Output a report grouped by namespace with the owning team label.
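The core of that report is pure logic once the deployments are fetched. A sketch over plain dicts — in a real script the records would come from `kubernetes.client.AppsV1Api().list_deployment_for_all_namespaces()`, and the `team` label key is an assumed convention:

```python
from collections import defaultdict

# Sketch: group single-replica (single-point-of-failure) deployments by
# namespace. Input records and the "team" label key are assumptions; the
# real data would come from the Kubernetes Python client.

def spof_report(deployments, team_label="team"):
    report = defaultdict(list)
    for d in deployments:
        if d["replicas"] == 1:
            team = d.get("labels", {}).get(team_label, "unlabelled")
            report[d["namespace"]].append((d["name"], team))
    return dict(report)

deps = [{"namespace": "shop", "name": "cart", "replicas": 1,
         "labels": {"team": "checkout"}},
        {"namespace": "shop", "name": "web", "replicas": 3, "labels": {}}]
print(spof_report(deps))
```

Separating fetch from logic like this also makes the report unit-testable without a cluster.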
L3 · ADVANCED
Build a self-service namespace provisioning workflow: a Python or Go webhook that receives a request with team name + tier, creates the namespace, applies ResourceQuota, LimitRange, NetworkPolicy defaults, and RBAC bindings. Ensure idempotency and GitOps sync compatibility.
Key Resources
HashiCorp Terraform documentation: module design and provider reference for AKS/EKS/GKE
Kyverno documentation: policy types, generate rules, and mutation patterns
OPA documentation: Rego language reference and Conftest guide
Kubernetes Python client: github.com/kubernetes-client/python
Kubebuilder book: building controllers and operators with controller-runtime
08 · Compliance & Auditing
Compliance, CVE Remediation & Audit Evidence
CIS Benchmark · CVE management · Supply-chain risk · Audit evidence · Reporting
Benchmark Compliance
Run kube-bench against CIS Kubernetes Benchmark. Track compliance scores over time. Integrate benchmark checks into CI and cluster provisioning pipelines to prevent regression. Target SOC 2, ISO 27001, or PCI-DSS alignment where required.
CVE & Vulnerability Management
Operate a continuous vulnerability management lifecycle: scan (Trivy/Grype), prioritise (CVSS + exploitability), patch (image rebuild or node OS update), verify, and report. Maintain SLAs per severity tier and track exceptions with risk acceptance records.
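SLA tracking per severity tier is date arithmetic. A sketch with illustrative tiers (the day counts are assumptions, not a standard):

```python
from datetime import date, timedelta

# Illustrative remediation SLAs per severity tier (assumed values).
SLA_DAYS = {"CRITICAL": 7, "HIGH": 30, "MEDIUM": 90, "LOW": 180}

def patch_deadline(severity, detected):
    """Date by which a finding of this severity must be remediated."""
    return detected + timedelta(days=SLA_DAYS[severity])

def is_breached(severity, detected, today):
    return today > patch_deadline(severity, detected)

found = date(2024, 1, 1)
print(patch_deadline("CRITICAL", found))
print(is_breached("HIGH", found, date(2024, 2, 15)))
```

Exceptions (accepted risk) would carry their own expiry date and be excluded from breach counts until it passes.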
Audit Evidence Collection
Enable and collect Kubernetes audit logs. Build automated evidence collection scripts for configuration drift, RBAC snapshots, image scanning results, and policy compliance reports. Store evidence with tamper-evident retention.
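Extracting evidence from audit logs can be sketched as a filter over JSON-lines events; the fields used (`verb`, `user`, `objectRef`) follow the audit.k8s.io/v1 Event schema:

```python
import json

# Sketch: filter Kubernetes API audit events for secret access.
# Each input line is one JSON-encoded audit.k8s.io/v1 Event.

def secret_access_events(lines):
    """Return (username, verb, secret_name) for every secrets access."""
    hits = []
    for line in lines:
        event = json.loads(line)
        ref = event.get("objectRef", {})
        if ref.get("resource") == "secrets":
            hits.append((event["user"]["username"],
                         event["verb"],
                         ref.get("name")))
    return hits

log = ['{"verb":"get","user":{"username":"alice"},'
       '"objectRef":{"resource":"secrets","name":"db-cred"}}',
       '{"verb":"list","user":{"username":"ci-bot"},'
       '"objectRef":{"resource":"pods"}}']
print(secret_access_events(log))
```

Aggregating these tuples by user and week yields the access report described above.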
Misconfiguration Detection
Run Kubescape, Checkov, or Polaris for continuous misconfiguration scanning. Integrate with admission control to block known-bad patterns. Generate compliance posture reports for security and leadership stakeholders.
Hands-On Exercises
L1 · INTRO
Run kube-bench on a cluster. Export the results as JSON. Write a script that categorises failures by CIS section and outputs an HTML report with a compliance score. Identify the top 5 highest-priority remediations.
L2 · PRACTICAL
Build an automated vulnerability management pipeline: daily Trivy scan of all images running in a cluster (using kubectl get pods to enumerate), de-duplicate by image digest, produce a prioritised JIRA/ticket export by severity, and track SLA breach dates.
L2 · PRACTICAL
Enable Kubernetes API server audit logging with a policy file capturing reads and writes to secrets, RBAC resources, and privileged pod creation. Parse the audit log to produce a weekly access report suitable for a security team review.
L3 · ADVANCED
Simulate an internal audit request: compile evidence for 10 security controls (network segmentation, encryption at rest, RBAC least privilege, image scanning, etc.). Automate evidence collection into a structured ZIP with a manifest. Measure time-to-evidence and identify collection gaps to automate.
Key Resources
CIS Kubernetes Benchmark: download free PDF at cisecurity.org
kube-bench: github.com/aquasecurity/kube-bench
Kubescape documentation: MITRE ATT&CK and NSA/CISA framework checks
NIST NVD / CVE database for vulnerability research
Polaris: fairwinds.com/polaris — open source Kubernetes policy engine
Recommended Learning Path
| Phase | Focus | Duration | Key Milestone |
|---|---|---|---|
| Phase 1 | CKA + CKS certification prep. Kubernetes internals, RBAC, networking, storage. Deploy your first multi-cloud cluster via Terraform. | 8–12 weeks | CKA + CKS certified; cluster from IaC in 2 clouds |
| Phase 2 | GitOps pipeline (Flux or Argo CD). Observability stack (Prometheus, Loki, Grafana). First SLO defined and tracked. Kyverno policy baseline. | 6–8 weeks | GitOps deployed; SLO dashboard live; policy enforced |
| Phase 3 | Security hardening: Falco, supply chain (Cosign), network policies, Secrets Manager integration. First CIS Benchmark run + remediation cycle. | 6–8 weeks | ≥90% CIS pass rate; supply chain signed; runtime alerts firing |
| Phase 4 | Advanced automation: custom controllers, self-service namespace provisioning, chaos engineering validation, DR test, vulnerability management pipeline. | 8–10 weeks | Full DR test passed; chaos runbook validated; audit evidence automated |
You have achieved senior platform engineering readiness when you can: provision a secure, multi-cloud cluster from IaC, operate a full GitOps and observability stack, run a DR exercise to RTO target, respond to a production incident with structured RCA, and produce automated compliance evidence — all without referencing notes.
Kubernetes Internals Quick Reference
# Key subsystems to understand deeply
Scheduler → Filtering (predicates) → Scoring (priorities) → Binding
Preemption, taints/tolerations, topology constraints
Controller Mgr → Deployment, ReplicaSet, StatefulSet, Job controllers
Reconciliation loop: observe → diff → act
API Server → Authentication → Authorization (RBAC) → Admission (mutating → validating)
Resource versioning, watches, etcd writes
etcd → Raft consensus, leader election, snapshot/restore
Encryption at rest (--encryption-provider-config)
kubelet → Pod spec sync, cgroup management, CSI/CNI/CRI calls
Eviction thresholds: memory.available, nodefs.available
CNI → Calico, Cilium, Azure CNI, AWS VPC CNI, GKE Dataplane v2
NetworkPolicy enforcement point (varies by CNI)
CSI → Provision → Attach → Mount lifecycle
StorageClass reclaim policies, volume snapshots, topology
Ingress → NGINX, Traefik, Gateway API (v1.0 GA since late 2023)
TLS termination, cert-manager integration, path routing
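The scheduler's filter → score → bind flow above can be illustrated with a toy in Python — real scheduler plugins (see pkg/scheduler) consider far more: affinity, topology spread, volume topology, and preemption.

```python
# Toy illustration of the scheduler's filter -> score flow. The node/pod
# shapes and the "most free CPU wins" scoring are illustrative assumptions.

def schedule(pod, nodes):
    # Filtering: node must have enough free CPU and the pod must tolerate
    # all of the node's taints (modeled here as simple string sets).
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"]
                and pod.get("tolerations", set()) >= n.get("taints", set())]
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    # Scoring: prefer the node with the most free CPU, a crude stand-in
    # for the real "least allocated" scoring plugin.
    return max(feasible, key=lambda n: n["free_cpu"])["name"]

nodes = [{"name": "n1", "free_cpu": 500, "taints": set()},
         {"name": "n2", "free_cpu": 2000, "taints": {"gpu"}},
         {"name": "n3", "free_cpu": 1500, "taints": set()}]
print(schedule({"cpu": 1000}, nodes))  # n1 filtered out, n2 tainted -> n3
```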
Multi-Cloud Parity Cheat Sheet
# Key differences to know across cloud providers
| | AKS (Azure) | EKS (AWS) | GKE (GCP) |
|---|---|---|---|
| Identity | Workload Identity | IRSA / Pod Identity | Workload Identity |
| CNI default | Azure CNI / Overlay | Amazon VPC CNI | Dataplane v2 (Cilium) |
| Node images | AzureLinux / Ubuntu | Amazon Linux 2 | Container-Optimized OS |
| Autoscaler | Cluster Autoscaler | Karpenter / CAS | Node Auto Provisioning |
| LB | Azure LB / App GW | AWS ALB / NLB | Cloud Load Balancing |
| Upgrades | Node pool surge | Managed Node Group | Node pool upgrade |
| Secrets | Key Vault CSI Driver | Secrets Manager CSI | Secret Manager + ESO |
| Networking | VNet peering | VPC peering | VPC native |
| Enclave | Azure Confidential | Nitro Enclaves | Confidential GKE |
Essential Scripting Patterns
# Bash — drain all nodes in a given AZ
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=60
done
# Python — list images with HIGH+ CVEs across all running pods
from kubernetes import client, config
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_pod_for_all_namespaces()
images = {c.image for pod in pods.items for c in pod.spec.containers}
# then: for each image, run trivy image --format json --exit-code 1
# PromQL — 5m error budget burn rate (SLO: 99.9% availability)
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) / (1 - 0.999) # > 1 means burning budget faster than allowed
Incident Response Quick-Reference
# Triage toolkit — commands for common scenarios
# Node pressure / OOM events
kubectl describe node <node> | grep -A5 Conditions
kubectl get events --sort-by=.lastTimestamp -A | grep -i "OOM\|Evicted\|Failed"
# Pod stuck Terminating (stuck finalizer)
kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}' --type=merge
# API server latency spike
kubectl get --raw /metrics | grep apiserver_request_duration_seconds_bucket
# etcd health check
ETCDCTL_API=3 etcdctl endpoint health --cluster
# Cordon + drain a suspect node safely
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Capture cluster state snapshot for RCA
kubectl cluster-info dump --output-directory=/tmp/cluster-dump --all-namespaces