Platform Engineering · Skills Practice Guide

Kubernetes Platform
Engineering Mastery

A structured, hands-on curriculum to build and validate expertise across multi-cloud Kubernetes operations, security, observability, and platform automation.

8 Core Domains
AKS · EKS · GKE
L1 → L3 Exercises
Hands-On Labs
01
Cluster Architecture

Multi-Cloud Cluster Design & Operations

AKS, EKS, GKE administration · HA architecture · multi-zone · DR
Multi-Cloud Cluster Provisioning
Provision and manage clusters across AKS, EKS, and GKE with parity in configuration standards. Understand cloud-specific control plane differences, SLAs, and managed add-on ecosystems.
AKS · EKS · GKE · terraform · control-plane
High Availability Design
Design clusters with multi-zone node pools, pod topology spread constraints, pod disruption budgets, and anti-affinity rules. Validate failure scenarios across AZ loss events.
multi-zone · PDB · topology-spread · anti-affinity
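Before running a failover test, the arithmetic behind AZ loss is worth checking on paper. A minimal Python sketch (function names and the even-spread assumption are illustrative; real placement depends on your topologySpreadConstraints):

```python
def zone_distribution(replicas: int, zones: int) -> list[int]:
    """Even spread across zones, as maxSkew=1 topology spread would place pods."""
    base, extra = divmod(replicas, zones)
    return [base + (1 if i < extra else 0) for i in range(zones)]

def survives_zone_loss(replicas: int, zones: int, min_available: int) -> bool:
    """After losing the most-loaded zone, does the PDB's minAvailable still hold?"""
    remaining = replicas - max(zone_distribution(replicas, zones))
    return remaining >= min_available

# 6 replicas over 3 zones -> [2, 2, 2]; an AZ loss leaves 4 running
```

Running this for your real replica counts tells you whether a PDB is even satisfiable during a zone outage, before the chaos test proves it the hard way.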
Namespace Tenancy & RBAC
Design multi-tenant cluster topologies with namespace isolation, resource quotas, limit ranges, and hierarchical RBAC. Implement least-privilege service accounts and audit RBAC drift.
RBAC · namespaces · ResourceQuota · LimitRange
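Auditing RBAC drift reduces to diffing two permission snapshots. A minimal sketch in Python (the snapshot shape, a subject mapped to a set of verbs, is an assumption for illustration):

```python
def rbac_drift(baseline: dict[str, set[str]],
               live: dict[str, set[str]]) -> dict:
    """Compare subject->verbs snapshots; report additions (privilege creep)
    and removals (broken access) relative to the committed baseline."""
    added, removed = {}, {}
    for subject in baseline.keys() | live.keys():
        extra = live.get(subject, set()) - baseline.get(subject, set())
        missing = baseline.get(subject, set()) - live.get(subject, set())
        if extra:
            added[subject] = extra
        if missing:
            removed[subject] = missing
    return {"added": added, "removed": removed}
```

In practice the snapshots would come from `kubectl get rolebindings,clusterrolebindings -o json`, serialized on a schedule and committed next to the IaC that defines the intended state.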
Disaster Recovery
Design and test cluster DR runbooks: cross-region failover, backup/restore of cluster state, etcd snapshot management, and RTO/RPO validation using Velero and provider-native tools.
Velero · etcd · cross-region · RTO/RPO
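RPO validation is gap analysis over backup timestamps. A hedged sketch (Velero reports backup completion times; fetching them is left out here):

```python
from datetime import datetime, timedelta, timezone

def rpo_violations(backups: list[datetime], rpo: timedelta,
                   now: datetime) -> list[timedelta]:
    """Return every gap between consecutive backups (including the gap from
    the latest backup to now) that exceeds the RPO target."""
    points = sorted(backups) + [now]
    gaps = [later - earlier for earlier, later in zip(points, points[1:])]
    return [g for g in gaps if g > rpo]
```

An empty result means the backup schedule has actually been meeting the RPO; anything else belongs in the DR runbook review.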
Key Resources
Kubernetes Official Docs: Cluster Administration & Multi-tenancy
CKA/CKAD exam curriculum — scheduling, HA, RBAC deep dives
Velero documentation: backup, restore, and schedule policies
Azure/AWS/GCP Well-Architected Framework: AKS/EKS/GKE sections
Practice: Use kind or minikube locally; use free-tier cloud accounts for multi-cloud labs
✦ CHECKPOINT
You're ready to advance when you can provision a production-grade multi-zone cluster from scratch using IaC, explain every control-plane component's role, design a multi-tenant RBAC model, and execute a cluster restore under a defined RTO target.
02
Security & Hardening

Platform Security Controls

Pod Security · Network Policies · Secrets · Supply Chain · Runtime Protection
Pod Security & Admission Control
Enforce Pod Security Standards (Baseline/Restricted) via namespace labels, OPA Gatekeeper policies, or Kyverno. Block privileged containers, host path mounts, and root execution cluster-wide.
PSS · Kyverno · OPA Gatekeeper · admission webhooks
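The intent of a Baseline policy can be illustrated with a small manifest checker (a toy sketch for understanding the checks; real enforcement belongs in the admission chain via PSS namespace labels, Kyverno, or Gatekeeper):

```python
def baseline_violations(pod: dict) -> list[str]:
    """Flag a few Pod Security 'baseline' violations in a pod manifest dict:
    privileged containers, hostPath volumes, hostNetwork."""
    spec = pod.get("spec", {})
    findings = []
    if spec.get("hostNetwork"):
        findings.append("hostNetwork enabled")
    for vol in spec.get("volumes", []):
        if "hostPath" in vol:
            findings.append(f"hostPath volume {vol.get('name')}")
    for c in spec.get("containers", []):
        if c.get("securityContext", {}).get("privileged"):
            findings.append(f"privileged container {c.get('name')}")
    return findings
```

Writing the checks by hand once makes the corresponding Kyverno or Gatekeeper policies much easier to read and review.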
Network Segmentation
Author Kubernetes NetworkPolicies to implement default-deny ingress/egress. Layer CNI-level policies (Cilium/Calico). Validate with policy simulators and egress traffic audits.
NetworkPolicy · Cilium · Calico · CNI · microsegmentation
Secrets Management
Integrate Vault, Azure Key Vault, or AWS Secrets Manager via CSI driver or External Secrets Operator. Enforce encryption at rest for etcd. Rotate secrets without workload restart where possible.
Vault · ESO · CSI Secrets Store · etcd encryption
Supply Chain & Image Security
Implement image signing with Cosign or Notary, and enforce admission policies that admit only signed images from trusted registries. Run Trivy or Grype in CI pipelines. Maintain a curated base image catalog.
Cosign · Trivy · SBOM · image provenance · Notary
Runtime Protection
Deploy Falco for runtime anomaly detection. Write custom Falco rules for crypto miner patterns, shell exec in containers, and unexpected network connections. Integrate alerts with SIEM/PagerDuty.
Falco · eBPF · seccomp · AppArmor · runtime rules
Certificate Lifecycle
Deploy cert-manager to automate TLS certificate issuance and renewal via Let's Encrypt or internal CAs. Monitor certificate expiry. Manage mTLS between services using Istio/Linkerd service mesh.
cert-manager · mTLS · Istio · PKI · SAN
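Expiry monitoring is date arithmetic on a certificate's notAfter field (cert-manager exposes expiry as a Prometheus metric; here the timestamp is just a string for illustration):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Days remaining given the certificate's notAfter as an RFC 3339 string."""
    return (datetime.fromisoformat(not_after) - now).days

def needs_renewal(not_after: str, now: datetime, threshold_days: int = 30) -> bool:
    """cert-manager renews by default when a third of the lifetime remains;
    a fixed-days threshold like this is a simpler alerting heuristic."""
    return days_until_expiry(not_after, now) < threshold_days
```

Alerting on the threshold independently of cert-manager catches the case where renewal itself is silently failing.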
Key Resources
CKS (Certified Kubernetes Security Specialist) exam curriculum
NIST SP 800-190: Application Container Security Guide
CIS Kubernetes Benchmark & kube-bench tool
Sigstore / Cosign documentation and policy controller
Falco documentation: rules reference and gVisor/eBPF drivers
OWASP Kubernetes Security Cheat Sheet
✦ CHECKPOINT
You're ready to advance when you can harden a cluster from scratch against the CIS Benchmark, build an image supply chain with provenance, enforce zero-trust network policies, and respond to a Falco runtime alert end-to-end.
03
GitOps & CI/CD

GitOps Patterns & Deployment Pipelines

Flux · ArgoCD · Release safety · Auditability · Progressive delivery
GitOps with Flux & Argo CD
Design and operate GitOps control loops using Flux v2 (Kustomize, HelmRelease) or Argo CD application sets. Manage multi-cluster deployments from a single Git source of truth with drift detection.
Flux v2 · Argo CD · HelmRelease · Kustomize
Progressive Delivery
Implement canary and blue/green deployments with Flagger, Argo Rollouts, or service mesh traffic shifting. Define automated analysis templates with Prometheus metrics and webhook checks.
Flagger · Argo Rollouts · canary · blue-green
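The promotion decision behind canary analysis can be sketched as a pure function (loosely modeled on Flagger's threshold and failure-count behavior; names and defaults are illustrative, not Flagger's API):

```python
def canary_verdict(success_rates: list[float], threshold: float = 0.99,
                   max_failures: int = 2) -> str:
    """Promote if the analysis intervals pass; roll back after max_failures
    consecutive intervals below the success-rate threshold."""
    consecutive = 0
    for rate in success_rates:
        if rate < threshold:
            consecutive += 1
            if consecutive >= max_failures:
                return "rollback"
        else:
            consecutive = 0
    return "promote"
```

The key design point mirrors real progressive delivery tools: a single bad interval is noise, a run of bad intervals triggers an automatic rollback with no human in the loop.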
Release Safety & Auditability
Enforce PR-based change flows, signed commits, policy gates in CI (OPA Conftest, Kubeconform), and mandatory review paths. Maintain a full audit trail from commit to running pod via Git history.
Conftest · Kubeconform · Git audit · policy gates
CI Pipeline Security
Integrate vulnerability scanning (Trivy), SBOM generation, and image signing into CI. Enforce SLSA provenance levels. Use OIDC-based authentication to cloud registries, eliminating long-lived CI credentials.
SLSA · OIDC · SBOM · GitHub Actions · Tekton
Key Resources
Flux v2 documentation: GitRepository, Kustomization, HelmRelease controllers
Argo CD documentation: ApplicationSets and progressive delivery patterns
Flagger documentation: canary analysis templates and metric providers
CNCF GitOps Working Group: OpenGitOps principles
SLSA Framework documentation: provenance levels 1–4
04
Lifecycle Management

Kubernetes Lifecycle & Capacity Operations

Version upgrades · Node pool strategy · Capacity planning · Add-on management
Version Upgrade Strategy
Plan and execute Kubernetes version upgrades (minor/patch) with zero downtime: node pool surge upgrades, compatibility matrix checks, API deprecation scanning with Pluto, and staged rollout across clusters.
Pluto · node surge · API deprecation · upgrade runbook
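Part of every upgrade plan is the version skew policy: the kubelet may trail the kube-apiserver by up to three minor versions (since Kubernetes 1.28; two before that) and must never be newer. A small illustrative check:

```python
def kubelet_skew_ok(api_server: str, kubelet: str, max_skew: int = 3) -> bool:
    """Validate kubelet vs API server skew per the Kubernetes support policy:
    same major version, kubelet never newer, at most max_skew minors behind."""
    api_major, api_minor = (int(x) for x in api_server.split(".")[:2])
    kb_major, kb_minor = (int(x) for x in kubelet.split(".")[:2])
    if kb_major != api_major:
        return False
    if kb_minor > api_minor:
        return False
    return api_minor - kb_minor <= max_skew
```

Running a check like this across all node pools before bumping the control plane catches the classic mistake of upgrading the API server past stale nodes.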
Node Pool Strategy
Design node pool topologies with dedicated system pools, spot/preemptible pools with workload toleration, and GPU/high-memory pools. Implement Cluster Autoscaler and KEDA for demand-driven scaling.
Cluster Autoscaler · KEDA · spot instances · taints/tolerations
Capacity Planning
Right-size workloads using VPA recommendations and resource analysis. Build capacity forecasting from VPA/Prometheus metrics. Set request/limit ratios to prevent OOM and CPU throttling at scale.
VPA · Goldilocks · resource analysis · HPA
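The core of right-sizing is percentile math over usage samples. A sketch of a VPA-style target recommendation (the percentile and headroom defaults are illustrative, not VPA's actual estimator):

```python
def recommended_request(usage_samples: list[float], percentile: float = 0.95,
                        headroom: float = 1.15) -> float:
    """Recommend a resource request: take a high percentile of observed usage
    and add headroom so normal spikes don't hit the limit."""
    ordered = sorted(usage_samples)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return round(ordered[idx] * headroom, 3)
```

Feeding this with per-container CPU or memory series from Prometheus gives a defensible starting request; Goldilocks automates the same idea on top of VPA recommendations.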
Add-on & Extension Management
Manage cluster add-ons (CoreDNS, kube-proxy, metrics-server, CNI) through a versioned, tested upgrade pipeline. Maintain compatibility between add-on versions and Kubernetes minor versions.
CoreDNS · CNI · add-on lifecycle · Helm charts
Key Resources
Kubernetes release notes: API deprecation and removal timelines per version
Pluto: Detect deprecated Kubernetes apiVersions in code
Goldilocks: VPA-based resource request recommendations dashboard
AKS upgrade documentation, EKS managed node group upgrade docs
KEDA documentation: ScaledObject and external event-driven autoscaling
05
Observability

Platform Observability & SLOs

Metrics · Logs · Traces · Alerting · SLO/SLI definition · Runbooks
Metrics & Alerting Stack
Deploy and operate Prometheus + Alertmanager + Grafana. Write PromQL queries for cluster KPIs (node pressure, pod restart rates, API server latency). Define alert severity levels and routing rules.
Prometheus · PromQL · Alertmanager · Grafana · kube-state-metrics
Log Aggregation
Build a structured log pipeline with Fluent Bit → Loki or OpenSearch. Enforce structured JSON logging standards for workloads. Implement log-based alerts for error patterns and security events.
Fluent Bit · Loki · OpenSearch · structured logging
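The workload-side half of that standard is easy to enforce with a custom formatter. A minimal Python example (the field names are an assumed convention, not a Fluent Bit requirement):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so Fluent Bit can parse fields
    directly instead of regex-scraping free-form text."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })
```

Attach it to a StreamHandler on stdout; in Kubernetes, stdout is the log pipeline, so no file handling is needed.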
Distributed Tracing
Deploy OpenTelemetry Collector for trace collection. Configure instrumented workloads to emit spans to Tempo or Jaeger. Correlate traces with logs and metrics for cross-service incident investigation.
OpenTelemetry · Tempo · Jaeger · trace correlation
SLOs, SLIs & Error Budgets
Define SLIs (availability, latency p99, error rate) and encode SLOs in Prometheus recording rules or Sloth. Build error budget burn rate alerts and Grafana dashboards for SLO tracking by service.
SLO · SLI · Sloth · error budget · burn rate
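The burn-rate arithmetic is worth internalizing before encoding it in PromQL. A sketch of the multiwindow check described in the Google SRE Workbook (14.4x is the standard fast-burn page threshold: it exhausts 2% of a 30-day budget in one hour):

```python
def burn_rate(error_ratio: float, slo: float = 0.999) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    return error_ratio / (1 - slo)

def page_worthy(fast_ratio: float, slow_ratio: float, slo: float = 0.999) -> bool:
    """Multiwindow alert: page only when both the short window (fresh signal)
    and the long window (sustained signal) exceed the fast-burn threshold."""
    return burn_rate(fast_ratio, slo) >= 14.4 and burn_rate(slow_ratio, slo) >= 14.4
```

Requiring both windows is what keeps a brief error spike from paging anyone while still catching a sustained burn within minutes.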
Key Resources
Prometheus documentation: recording rules, alerting rules, PromQL reference
Google SRE Book (free online): SLO chapter and error budget policy
Sloth: SLO generator for Prometheus
OpenTelemetry documentation: Collector configuration and SDK guides
Grafana Mimir / Thanos: long-term metrics storage at scale
06
Platform Operations

Incident Response & Platform Support

RCA · 2nd/3rd level support · On-call operations · Application team partnership
Kubernetes Internals Deep Dive
Master control plane components: kube-scheduler (scoring, filtering, preemption), kube-controller-manager, etcd consistency, API server admission chain, and kubelet reconciliation loops. Understand failure modes for each.
scheduler · etcd · API server · kubelet · controller-manager
Incident Triage & RCA
Build structured incident response: detect → triage → contain → remediate → RCA. Use structured 5-why or Fishbone analysis. Write blameless post-mortems with timeline, contributing factors, and preventive actions.
RCA · post-mortem · 5-why · on-call
Application Team Partnership
Advise teams on container best practices: correct resource requests/limits, liveness vs readiness vs startup probes, graceful shutdown (preStop hooks, SIGTERM handling), health endpoint standards, and ingress/DNS setup.
probes · resources · graceful shutdown · ingress · DNS
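Graceful shutdown hinges on catching SIGTERM and draining work before terminationGracePeriodSeconds runs out. A minimal Python sketch of the pattern (framework-specific shutdown hooks vary):

```python
import os
import signal

class GracefulShutdown:
    """Flip a flag on SIGTERM so the main loop can stop accepting work and
    drain in-flight requests before the kubelet follows up with SIGKILL."""
    def __init__(self) -> None:
        self.stopping = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self.stopping = True

shutdown = GracefulShutdown()
os.kill(os.getpid(), signal.SIGTERM)  # simulate the kubelet's termination signal
# a real service would run: while not shutdown.stopping: handle_one_request()
```

Pair this with a short preStop sleep so the pod is removed from Service endpoints before the process stops accepting new connections.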
Storage & Stateful Workloads
Operate CSI drivers (Azure Disk, EFS, GCS Fuse). Manage PVC lifecycle, storage classes, reclaim policies, and volume snapshots. Understand StatefulSet rollout guarantees and pod identity for stateful systems.
CSI · PVC · StorageClass · StatefulSet · snapshots
Key Resources
Kubernetes Internals: "Kubernetes Up and Running" (O'Reilly) — chapters on scheduler, controllers
GitHub: Kubernetes source code — pkg/scheduler, pkg/controller
Chaos Mesh / LitmusChaos: chaos engineering for Kubernetes
Learnk8s.io: detailed architectural diagrams of Kubernetes components
Google SRE Book: Incident Management chapter and post-mortem culture
07
Automation & IaC

Platform Automation & Infrastructure as Code

Terraform · Ansible · Bash/Python/Go · Policy-as-code · Self-service workflows
Terraform for Kubernetes Platforms
Build modular Terraform for multi-cloud cluster provisioning (AKS, EKS, GKE), node pools, managed identities/IAM, add-ons, and monitoring integrations. Use remote state, workspaces, and Atlantis for collaborative IaC.
Terraform · modules · remote state · Atlantis · OpenTofu
Policy-as-Code
Write OPA Rego policies or Kyverno ClusterPolicies to enforce standards at admission. Validate manifests in CI with Conftest. Build auto-remediation via Kyverno mutating policies for missing labels, resource limits, and security contexts.
OPA · Rego · Kyverno · Conftest · mutating
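The effect of a mutating policy for missing resource limits can be sketched as a pure function over a pod manifest (a toy stand-in for what Kyverno patches at admission; the default values are illustrative):

```python
import copy

DEFAULT_RESOURCES = {"requests": {"cpu": "100m", "memory": "128Mi"},
                     "limits": {"cpu": "500m", "memory": "256Mi"}}

def mutate_missing_resources(pod: dict) -> dict:
    """Return a copy of the pod with default resources filled in where absent,
    leaving containers that already declare resources untouched."""
    patched = copy.deepcopy(pod)
    for c in patched.get("spec", {}).get("containers", []):
        if not c.get("resources"):
            c["resources"] = copy.deepcopy(DEFAULT_RESOURCES)
    return patched
```

Note the mutation only fills gaps and never overrides an explicit choice, which is the convention that keeps auto-remediation policies safe to roll out cluster-wide.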
Scripting & Tooling (Bash / Python / Go)
Write robust Bash scripts for operational tasks (node draining, cert rotation, log collection). Build Python tools using the Kubernetes client library for custom controllers/automation. Write Go-based kubectl plugins for self-service workflows.
Bash · Python k8s client · Go · kubectl plugins
Operator Pattern & Controllers
Understand the controller-runtime framework. Build a simple custom controller with controller-gen. Understand reconciliation loops, finalizers, owner references, and status subresources. Evaluate and operate existing operators (Prometheus Operator, Strimzi).
controller-runtime · CRD · operator-sdk · reconciliation
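The reconciliation loop itself is a small idea. A language-agnostic sketch in Python (in controller-runtime the diffing lives inside your Reconcile method; the names here are illustrative):

```python
def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """One pass of observe -> diff -> act: return the actions a controller
    would take to converge actual state onto desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions
```

Everything else in a real operator, such as watches, work queues, finalizers, and status updates, exists to call this diff safely and repeatedly until it returns no actions.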
Key Resources
HashiCorp Terraform documentation: module design and provider reference for AKS/EKS/GKE
Kyverno documentation: policy types, generate rules, and mutation patterns
OPA documentation: Rego language reference and Conftest guide
Kubernetes Python client: github.com/kubernetes-client/python
Kubebuilder book: building controllers and operators with controller-runtime
08
Compliance & Auditing

Compliance, CVE Remediation & Audit Evidence

CIS Benchmark · CVE management · Supply-chain risk · Audit evidence · Reporting
Benchmark Compliance
Run kube-bench against CIS Kubernetes Benchmark. Track compliance scores over time. Integrate benchmark checks into CI and cluster provisioning pipelines to prevent regression. Target SOC 2, ISO 27001, or PCI-DSS alignment where required.
kube-bench · CIS · SOC 2 · ISO 27001
CVE & Vulnerability Management
Operate a continuous vulnerability management lifecycle: scan (Trivy/Grype), prioritise (CVSS + exploitability), patch (image rebuild or node OS update), verify, and report. Maintain SLAs per severity tier and track exceptions with risk acceptance records.
Trivy · Grype · CVSS · patch SLA · exception tracking
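The prioritisation step can be captured in a few lines (the severity tiers and SLA days are example values; set your own policy):

```python
PATCH_SLA_DAYS = {"critical": 7, "high": 30, "medium": 90, "low": 180}

def prioritise(findings: list[dict]) -> list[dict]:
    """Order findings for remediation: known-exploited CVEs first, then by
    CVSS score; attach the patch SLA for each severity tier."""
    for f in findings:
        f["sla_days"] = PATCH_SLA_DAYS[f["severity"]]
    return sorted(findings, key=lambda f: (not f.get("exploited", False), -f["cvss"]))
```

Sorting exploitability above raw CVSS is the key policy choice: a medium-severity CVE on the CISA KEV list usually deserves the patch window of a critical.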
Audit Evidence Collection
Enable and collect Kubernetes audit logs. Build automated evidence collection scripts for configuration drift, RBAC snapshots, image scanning results, and policy compliance reports. Store evidence with tamper-evident retention.
audit logs · evidence collection · policy reports · RBAC snapshots
Misconfiguration Detection
Run Kubescape, Checkov, or Polaris for continuous misconfiguration scanning. Integrate with admission control to block known-bad patterns. Generate compliance posture reports for security and leadership stakeholders.
Kubescape · Checkov · Polaris · posture reporting
Key Resources
CIS Kubernetes Benchmark: download free PDF at cisecurity.org
kube-bench: github.com/aquasecurity/kube-bench
Kubescape documentation: MITRE ATT&CK and NSA/CISA framework checks
NIST NVD / CVE database for vulnerability research
Polaris: fairwinds.com/polaris — open source Kubernetes policy engine
Recommended Learning Path
Phase 1 · 8–12 weeks
Focus: CKA + CKS certification prep. Kubernetes internals, RBAC, networking, storage. Deploy your first multi-cloud cluster via Terraform.
Milestone: CKA + CKS certified; cluster from IaC in 2 clouds

Phase 2 · 6–8 weeks
Focus: GitOps pipeline (Flux or Argo CD). Observability stack (Prometheus, Loki, Grafana). First SLO defined and tracked. Kyverno policy baseline.
Milestone: GitOps deployed; SLO dashboard live; policy enforced

Phase 3 · 6–8 weeks
Focus: Security hardening: Falco, supply chain (Cosign), network policies, Secrets Manager integration. First CIS Benchmark run + remediation cycle.
Milestone: ≥90% CIS pass rate; supply chain signed; runtime alerts firing

Phase 4 · 8–10 weeks
Focus: Advanced automation: custom controllers, self-service namespace provisioning, chaos engineering validation, DR test, vulnerability management pipeline.
Milestone: Full DR test passed; chaos runbook validated; audit evidence automated
✦ MASTERY
You have achieved senior platform engineering readiness when you can: provision a secure, multi-cloud cluster from IaC, operate a full GitOps and observability stack, run a DR exercise to RTO target, respond to a production incident with structured RCA, and produce automated compliance evidence — all without referencing notes.
Kubernetes Internals Quick Reference
# Key subsystems to understand deeply
Scheduler       → Filtering (predicates) → Scoring (priorities) → Binding
                  Preemption, taints/tolerations, topology constraints
Controller Mgr  → Deployment, ReplicaSet, StatefulSet, Job controllers
                  Reconciliation loop: observe → diff → act
API Server      → Authentication → Authorization (RBAC) → Admission (mutating → validating)
                  Resource versioning, watches, etcd writes
etcd            → Raft consensus, leader election, snapshot/restore
                  Encryption at rest (--encryption-provider-config)
kubelet         → Pod spec sync, cgroup management, CSI/CNI/CRI calls
                  Eviction thresholds: memory.available, nodefs.available
CNI             → Calico, Cilium, Azure CNI, AWS VPC CNI, GKE Dataplane v2
                  NetworkPolicy enforcement point (varies by CNI)
CSI             → Provision → Attach → Mount lifecycle
                  StorageClass reclaim policies, volume snapshots, topology
Ingress         → NGINX, Traefik, Gateway API (v1 GA since K8s 1.28)
                  TLS termination, cert-manager integration, path routing
Multi-Cloud Parity Cheat Sheet
# Key differences to know across cloud providers
              AKS (Azure)            EKS (AWS)              GKE (GCP)
Identity      Workload Identity      IRSA / Pod Identity    Workload Identity
CNI default   Azure CNI / Overlay    Amazon VPC CNI         Dataplane v2 (Cilium)
Node images   AzureLinux / Ubuntu    Amazon Linux 2         Container-Optimized OS
Autoscaler    Cluster Autoscaler     Karpenter / CAS        Node Auto Provisioning
LB            Azure LB / App GW      AWS ALB / NLB          Cloud Load Balancing
Upgrades      Node pool surge        Managed Node Group     Node pool upgrade
Secrets       Key Vault CSI Driver   Secrets Manager CSI    Secret Manager + ESO
Networking    VNet peering           VPC peering            VPC native
Enclave       Azure Confidential     Nitro Enclaves         Confidential GKE
Essential Scripting Patterns
# Bash — drain all nodes in a given AZ
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=60
done

# Python — list images with HIGH+ CVEs across all running pods
from kubernetes import client, config
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_pod_for_all_namespaces()
images = {c.image for pod in pods.items for c in pod.spec.containers}
# then: for each image, run trivy image --format json --exit-code 1

# PromQL — 5m error budget burn rate (SLO: 99.9% availability)
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) / (1 - 0.999)
# > 1 means burning budget faster than allowed
Incident Response Quick-Reference
# Triage toolkit — commands for common scenarios

# Node pressure / OOM events
kubectl describe node <node> | grep -A5 Conditions
kubectl get events --sort-by=.lastTimestamp -A | grep -i "OOM\|Evicted\|Failed"

# Pod stuck Terminating (stuck finalizer)
kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}' --type=merge

# API server latency spike
kubectl get --raw /metrics | grep apiserver_request_duration_seconds_bucket

# etcd health check
ETCDCTL_API=3 etcdctl endpoint health --cluster

# Cordon + drain a suspect node safely
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Capture cluster state snapshot for RCA
kubectl cluster-info dump --output-directory=/tmp/cluster-dump --all-namespaces