01 · Cluster Architecture
Multi-Cloud Cluster Design & Operations
AKS, EKS, GKE administration · HA architecture · multi-zone · DR
Multi-Cloud Cluster Provisioning
Provision and manage clusters across AKS, EKS, and GKE with parity in configuration standards. Understand cloud-specific control plane differences, SLAs, and managed add-on ecosystems.
High Availability Design
Design clusters with multi-zone node pools, pod topology spread constraints, pod disruption budgets, and anti-affinity rules. Validate failure scenarios across AZ loss events.
Namespace Tenancy & RBAC
Design multi-tenant cluster topologies with namespace isolation, resource quotas, limit ranges, and hierarchical RBAC. Implement least-privilege service accounts and audit RBAC drift.
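Auditing RBAC drift reduces to diffing periodic snapshots of binding subjects. A minimal sketch, assuming a snapshot format of binding name → set of subjects (in practice the data would come from `kubectl get rolebindings -o json` or the API):

```python
# Sketch: detect RBAC drift by diffing two snapshots of binding subjects.
# Snapshot format ({binding_name: set_of_subjects}) is an assumption.

def rbac_drift(baseline, current):
    """Return subjects added/removed per binding relative to the baseline."""
    drift = {}
    for name in set(baseline) | set(current):
        before = baseline.get(name, set())
        after = current.get(name, set())
        added, removed = after - before, before - after
        if added or removed:
            drift[name] = {"added": sorted(added), "removed": sorted(removed)}
    return drift

baseline = {"team-a-edit": {"user:alice", "user:bob"}}
current = {"team-a-edit": {"user:alice", "user:mallory"},
           "team-b-admin": {"user:eve"}}
print(rbac_drift(baseline, current))
```

Run on a schedule, any non-empty result is a drift alert to triage against approved changes.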
Disaster Recovery
Design and test cluster DR runbooks: cross-region failover, backup/restore of cluster state, etcd snapshot management, and RTO/RPO validation using Velero and provider-native tools.
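The RTO/RPO validation step is plain timestamp arithmetic. A sketch, with illustrative targets per workload class (the tiers here are assumptions, not standards):

```python
from datetime import datetime, timedelta

# Illustrative RTO/RPO targets per workload class (assumed values).
TARGETS = {"critical": {"rto": timedelta(minutes=30), "rpo": timedelta(minutes=5)},
           "standard": {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)}}

def dr_result(klass, outage_start, service_restored, last_backup):
    """Compare measured recovery against the class targets."""
    rto_actual = service_restored - outage_start   # time to restore service
    rpo_actual = outage_start - last_backup        # data-loss window
    t = TARGETS[klass]
    return {"rto_met": rto_actual <= t["rto"], "rpo_met": rpo_actual <= t["rpo"]}

outage = datetime(2024, 5, 1, 12, 0)
restored = datetime(2024, 5, 1, 12, 20)
backup = datetime(2024, 5, 1, 11, 58)
print(dr_result("critical", outage, restored, backup))
```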
Hands-On Exercises
L1 · INTRO
Provision a 3-node AKS cluster using Terraform. Enable availability zones for node pools. Verify zone distribution via kubectl get nodes -o wide and review the generated Azure resource model.
L2 · PRACTICAL
Deploy the same workload on EKS and GKE using a shared Terraform module with provider-specific overrides. Create a comparison matrix of control plane configuration differences, default CNI, and managed add-on gaps.
L2 · PRACTICAL
Configure a multi-tenant namespace model: two teams, separate ResourceQuotas, LimitRanges, and RBAC bindings. Simulate a privileged namespace breach attempt and validate isolation holds.
L3 · ADVANCED
Simulate a full AZ failure in a multi-zone EKS cluster using node cordon + drain on all nodes in one AZ. Document how pod topology spread constraints and PDBs govern rescheduling. Measure RTO for each workload class.
L3 · ADVANCED
Implement a full cluster backup with Velero (including PVs). Destroy the cluster. Restore to a new cluster in a different region. Validate all workloads, secrets, and service accounts are intact and functional.
Key Resources
Kubernetes Official Docs: Cluster Administration & Multi-tenancy
CKA/CKAD exam curriculum — scheduling, HA, RBAC deep dives
Velero documentation: backup, restore, and schedule policies
Azure/AWS/GCP Well-Architected Framework: AKS/EKS/GKE sections
Practice: Use kind or minikube locally; use free-tier cloud accounts for multi-cloud labs.
You're ready to advance when you can provision a production-grade multi-zone cluster from scratch using IaC, explain every control-plane component's role, design a multi-tenant RBAC model, and execute a cluster restore under a defined RTO target.
02 · Security & Hardening
Platform Security Controls
Pod Security · Network Policies · Secrets · Supply Chain · Runtime Protection
Pod Security & Admission Control
Enforce Pod Security Standards (Baseline/Restricted) via namespace labels, OPA Gatekeeper policies, or Kyverno. Block privileged containers, host path mounts, and root execution cluster-wide.
Network Segmentation
Author Kubernetes NetworkPolicies to implement default-deny ingress/egress. Layer CNI-level policies (Cilium/Calico). Validate with policy simulators and egress traffic audits.
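Before authoring real policies, it helps to reason about default-deny intent with a toy model. The simulator below is an illustration only — actual enforcement semantics depend on the CNI and on full NetworkPolicy selector rules:

```python
# Toy model of default-deny plus allow rules, for reasoning about intent.
# A rule allows traffic matching (from_label, to_label, port). This is NOT
# a real policy engine — just a way to sanity-check a connectivity matrix.

def is_allowed(rules, src_labels, dst_labels, port):
    """Default-deny: traffic passes only if some allow rule matches."""
    for rule in rules:
        if (rule["from"] in src_labels
                and rule["to"] in dst_labels
                and port == rule["port"]):
            return True
    return False

rules = [{"from": "app=frontend", "to": "app=api", "port": 8080}]
print(is_allowed(rules, {"app=frontend"}, {"app=api"}, 8080))  # allowed
print(is_allowed(rules, {"app=frontend"}, {"app=db"}, 5432))   # denied
```

Encoding the intended matrix this way gives you test cases to verify later with kubectl exec probes.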
Secrets Management
Integrate Vault, Azure Key Vault, or AWS Secrets Manager via CSI driver or External Secrets Operator. Enforce encryption at rest for etcd. Rotate secrets without workload restart where possible.
Supply Chain & Image Security
Implement image signing with Cosign/Notary, enforce admission policies to only allow signed images from trusted registries. Run Trivy or Grype in CI pipelines. Maintain a curated base image catalog.
Runtime Protection
Deploy Falco for runtime anomaly detection. Write custom Falco rules for crypto miner patterns, shell exec in containers, and unexpected network connections. Integrate alerts with SIEM/PagerDuty.
Certificate Lifecycle
Deploy cert-manager to automate TLS certificate issuance and renewal via Let's Encrypt or internal CAs. Monitor certificate expiry. Manage mTLS between services using Istio/Linkerd service mesh.
Hands-On Exercises
L1 · INTRO
Apply Pod Security Standards to a namespace using labels (pod-security.kubernetes.io/enforce: restricted). Try deploying a privileged pod and document the admission rejection. Explain the three PSS profiles and when to use each.
L1 · INTRO
Write a default-deny NetworkPolicy for a namespace, then create allow policies for specific pod selectors and ports. Use kubectl exec to verify allowed and blocked connections match intent.
L2 · PRACTICAL
Deploy External Secrets Operator. Connect it to AWS Secrets Manager or Azure Key Vault. Create an ExternalSecret that syncs a database credential into a Kubernetes Secret automatically on rotation.
L2 · PRACTICAL
Sign a container image with Cosign. Write a Kyverno policy that blocks unsigned images or images not from your approved registry. Validate that unsigned images from Docker Hub are rejected at admission.
L3 · ADVANCED
Deploy Falco with a custom ruleset. Simulate attack scenarios: shell exec in a running container, outbound connection to a known C2 IP, privilege escalation attempt. Validate Falco alerts fire within SLA. Write a triage runbook for each scenario.
L3 · ADVANCED
Run CIS Kubernetes Benchmark and kube-bench against a cluster. Produce a prioritized remediation plan. Implement 10 failing checks, re-run, and document delta. Score against a target of ≥95% pass rate.
Key Resources
CKS (Certified Kubernetes Security Specialist) exam curriculum
NIST SP 800-190: Application Container Security Guide
CIS Kubernetes Benchmark & kube-bench tool
Sigstore / Cosign documentation and policy controller
Falco documentation: rules reference and gVisor/eBPF drivers
OWASP Kubernetes Security Cheat Sheet
You're ready to advance when you can harden a cluster from scratch against the CIS Benchmark, build an image supply chain with provenance, enforce zero-trust network policies, and respond to a Falco runtime alert end-to-end.
03 · GitOps & CI/CD
GitOps Patterns & Deployment Pipelines
Flux · ArgoCD · Release safety · Auditability · Progressive delivery
GitOps with Flux & Argo CD
Design and operate GitOps control loops using Flux v2 (Kustomize, HelmRelease) or Argo CD application sets. Manage multi-cluster deployments from a single Git source of truth with drift detection.
Progressive Delivery
Implement canary and blue/green deployments with Flagger, Argo Rollouts, or service mesh traffic shifting. Define automated analysis templates with Prometheus metrics and webhook checks.
Release Safety & Auditability
Enforce PR-based change flows, signed commits, policy gates in CI (OPA Conftest, Kubeconform), and mandatory review paths. Maintain a full audit trail from commit to running pod via Git history.
CI Pipeline Security
Integrate vulnerability scanning (Trivy), SBOM generation, and image signing into CI. Enforce SLSA provenance levels. Use OIDC-based authentication to cloud registries, eliminating long-lived CI credentials.
Hands-On Exercises
L1 · INTRO
Bootstrap Flux v2 onto a cluster pointing to a personal GitHub repo. Deploy a sample app via a Kustomization. Make a change in Git and observe the reconciliation loop sync the cluster state automatically.
L2 · PRACTICAL
Implement a multi-environment GitOps structure: base/, overlays/staging/, overlays/prod/. Use Flux or Argo CD to target different clusters per environment. Enforce image tag pinning in production but allow semver ranges in staging.
L2 · PRACTICAL
Build a GitHub Actions pipeline that: lints Kubernetes manifests with Kubeconform, runs OPA Conftest policies, scans the image with Trivy (fail on HIGH/CRITICAL), and signs the image with Cosign before pushing.
L3 · ADVANCED
Implement a canary release with Flagger and Prometheus. Define a Canary resource with a metric template checking the 5xx error rate. Deliberately inject errors into the canary deployment. Validate that Flagger automatically rolls back and fires an alert.
L3 · ADVANCED
Design a GitOps emergency change process ("break-glass" procedure) for critical hotfixes that bypass normal PR review without losing auditability. Implement it and demonstrate a full trace from the hotfix commit to running pod.
Key Resources
Flux v2 documentation: GitRepository, Kustomization, HelmRelease controllers
Argo CD documentation: ApplicationSets and progressive delivery patterns
Flagger documentation: canary analysis templates and metric providers
CNCF GitOps Working Group: OpenGitOps principles
SLSA Framework documentation: provenance requirements and build levels (L0–L3 in v1.0)
04 · Lifecycle Management
Kubernetes Lifecycle & Capacity Operations
Version upgrades · Node pool strategy · Capacity planning · Add-on management
Version Upgrade Strategy
Plan and execute Kubernetes version upgrades (minor/patch) with zero downtime: node pool surge upgrades, compatibility matrix checks, API deprecation scanning with Pluto, and staged rollout across clusters.
Node Pool Strategy
Design node pool topologies with dedicated system pools, spot/preemptible pools with workload toleration, and GPU/high-memory pools. Implement Cluster Autoscaler and KEDA for demand-driven scaling.
Capacity Planning
Right-size workloads using VPA recommendations and resource analysis. Build capacity forecasting from VPA/Prometheus metrics. Set request/limit ratios to prevent OOM and CPU throttling at scale.
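VPA-style right-sizing boils down to a percentile over usage samples plus headroom. A minimal sketch — the percentile and margin here are illustrative choices, not VPA's actual algorithm:

```python
# Sketch: derive a CPU request from observed usage samples, VPA-style.
# The p95 target and 15% headroom margin are illustrative assumptions.

def recommend_request(samples_millicores, percentile=0.95, margin=1.15):
    """Request = the chosen usage percentile plus a headroom margin."""
    s = sorted(samples_millicores)
    idx = min(int(len(s) * percentile), len(s) - 1)
    return round(s[idx] * margin)

samples = [100, 120, 110, 400, 130, 125, 115, 140, 135, 128]
print(recommend_request(samples), "millicores")
```

Note how a single spike (400m) dominates a high percentile on a small sample — real forecasting needs longer observation windows and outlier handling.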
Add-on & Extension Management
Manage cluster add-ons (CoreDNS, kube-proxy, metrics-server, CNI) through a versioned, tested upgrade pipeline. Maintain compatibility between add-on versions and Kubernetes minor versions.
Hands-On Exercises
L1 · INTRO
Run Pluto against a cluster or set of manifests to identify deprecated/removed API versions. Produce a remediation list for upgrading from 1.27 to 1.29. Update the manifests to use current API groups.
L2 · PRACTICAL
Execute a minor version upgrade (e.g., 1.28 → 1.29) on an AKS cluster using node pool surge upgrades. Observe the rollout with kubectl get nodes -w. Validate all workloads remain healthy throughout (zero restarts on PDB-protected deployments).
L2 · PRACTICAL
Deploy Goldilocks VPA alongside a set of workloads. After 24 hours of traffic, review the recommendations dashboard. Apply recommendations to 3 workloads and measure the effect on cluster resource headroom.
L3 · ADVANCED
Design a node pool upgrade strategy for a cluster with stateful workloads (databases on PVs) and stateless workloads with strict PDBs. Write the full runbook. Execute it and measure actual vs expected disruption. Identify any gaps.
Key Resources
Kubernetes release notes: API deprecation and removal timelines per version
Pluto: Detect deprecated Kubernetes apiVersions in code
Goldilocks: VPA-based resource request recommendations dashboard
AKS upgrade documentation, EKS managed node group upgrade docs
KEDA documentation: ScaledObject and external event-driven autoscaling
05 · Observability
Platform Observability & SLOs
Metrics · Logs · Traces · Alerting · SLO/SLI definition · Runbooks
Metrics & Alerting Stack
Deploy and operate Prometheus + Alertmanager + Grafana. Write PromQL queries for cluster KPIs (node pressure, pod restart rates, API server latency). Define alert severity levels and routing rules.
Log Aggregation
Build a structured log pipeline with Fluent Bit → Loki or OpenSearch. Enforce structured JSON logging standards for workloads. Implement log-based alerts for error patterns and security events.
Distributed Tracing
Deploy OpenTelemetry Collector for trace collection. Configure instrumented workloads to emit spans to Tempo or Jaeger. Correlate traces with logs and metrics for cross-service incident investigation.
SLOs, SLIs & Error Budgets
Define SLIs (availability, latency p99, error rate) and encode SLOs in Prometheus recording rules or Sloth. Build error budget burn rate alerts and Grafana dashboards for SLO tracking by service.
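The burn-rate alert reduces to one ratio, the same arithmetic the PromQL example later in this document encodes. A sketch (the SLO value is an example):

```python
# Sketch: error budget burn rate for an availability SLO.
# SLO 99.9% => the error budget is 0.1% of requests in the window.

def burn_rate(error_requests, total_requests, slo=0.999):
    """>1 means the budget is being consumed faster than sustainable."""
    error_ratio = error_requests / total_requests
    return error_ratio / (1 - slo)

# 30 errors out of 10,000 requests: burning ~3x the sustainable rate.
print(burn_rate(30, 10_000))
```

Multi-window alerts (e.g. fast burn over 5m and slow burn over 1h) trigger paging only when both windows exceed their thresholds.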
Hands-On Exercises
L1 · INTRO
Deploy kube-prometheus-stack via Helm. Write 5 PromQL queries for cluster health: node memory pressure, pending pods, API server request rate, PVC binding failures, and container restart rate per namespace.
L2 · PRACTICAL
Define an SLO for a sample web service: 99.9% availability over 30 days. Implement Sloth or manual Prometheus recording rules to track the error budget. Create a Grafana dashboard showing current burn rate and projected budget exhaustion date.
L2 · PRACTICAL
Deploy OpenTelemetry Collector in gateway mode. Configure a sample instrumented app (OpenTelemetry demo app) to emit traces. Correlate a failing request trace with its associated logs in Loki using the trace ID.
L3 · ADVANCED
Design and write a complete runbook for a "PodCrashLoopBackOff flood" incident: detection query, initial triage steps, likely causes decision tree, remediation procedures, escalation path, and post-incident review template. Validate it against a simulated incident.
Key Resources
Prometheus documentation: recording rules, alerting rules, PromQL reference
Google SRE Book (free online): SLO chapter and error budget policy
Sloth: SLO generator for Prometheus
OpenTelemetry documentation: Collector configuration and SDK guides
Grafana Mimir / Thanos: long-term metrics storage at scale
06 · Platform Operations
Incident Response & Platform Support
RCA · 2nd/3rd level support · On-call operations · Application team partnership
Kubernetes Internals Deep Dive
Master control plane components: kube-scheduler (scoring, filtering, preemption), kube-controller-manager, etcd consistency, API server admission chain, and kubelet reconciliation loops. Understand failure modes for each.
Incident Triage & RCA
Build structured incident response: detect → triage → contain → remediate → RCA. Use structured 5 Whys or fishbone (Ishikawa) analysis. Write blameless post-mortems with timeline, contributing factors, and preventive actions.
Application Team Partnership
Advise teams on container best practices: correct resource requests/limits, liveness vs readiness vs startup probes, graceful shutdown (preStop hooks, SIGTERM handling), health endpoint standards, and ingress/DNS setup.
Storage & Stateful Workloads
Operate CSI drivers (Azure Disk, EFS, GCS Fuse). Manage PVC lifecycle, storage classes, reclaim policies, and volume snapshots. Understand StatefulSet rollout guarantees and pod identity for stateful systems.
Hands-On Exercises
L1 · INTRO
Walk through the complete lifecycle of a pod: creation, scheduling decisions, image pull, container runtime setup, readiness probe evaluation, and termination with preStop hook. Trace each step using kubectl describe and control plane logs.
L2 · PRACTICAL
Simulate 5 common Kubernetes failure scenarios: ImagePullBackOff, OOMKilled, Evicted (disk pressure), Pending (resource constraints), Terminating (stuck finalizer). For each, practice the full triage-to-resolution flow and document the runbook entry.
L2 · PRACTICAL
Review a real-world or sample application deployment missing best practices (no resource limits, no probes, no graceful shutdown). Write an advisory report and a corrected Deployment manifest with full annotations explaining each change.
L3 · ADVANCED
Run a chaos engineering exercise using Chaos Mesh or LitmusChaos: pod failure, network latency injection, and node restart. Document which workloads degraded gracefully vs failed ungracefully. Produce a hardening backlog from findings.
Key Resources
Kubernetes Internals: "Kubernetes Up and Running" (O'Reilly) — chapters on scheduler, controllers
GitHub: Kubernetes source code — pkg/scheduler, pkg/controller
Chaos Mesh / LitmusChaos: chaos engineering for Kubernetes
Learnk8s.io: detailed architectural diagrams of Kubernetes components
Google SRE Book: Incident Management chapter and post-mortem culture
07 · Automation & IaC
Platform Automation & Infrastructure as Code
Terraform · Ansible · Bash/Python/Go · Policy-as-code · Self-service workflows
Terraform for Kubernetes Platforms
Build modular Terraform for multi-cloud cluster provisioning (AKS, EKS, GKE), node pools, managed identities/IAM, add-ons, and monitoring integrations. Use remote state, workspaces, and Atlantis for collaborative IaC.
Policy-as-Code
Write OPA Rego policies or Kyverno ClusterPolicies to enforce standards at admission. Validate manifests in CI with Conftest. Build auto-remediation via Kyverno mutating policies for missing labels, resource limits, and security contexts.
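The mutation logic such a rule performs can be sketched in plain Python over a Deployment manifest dict — this is an illustration of the remediation idea, not Kyverno's patch syntax, and the default limits are assumed values:

```python
# Sketch of auto-remediation a Kyverno mutate rule would perform, expressed
# over a Deployment dict. DEFAULT_LIMITS values are illustrative assumptions.

DEFAULT_LIMITS = {"cpu": "500m", "memory": "256Mi"}

def remediate(deployment):
    """Fill in a missing name label and missing resource limits in place."""
    meta = deployment.setdefault("metadata", {})
    labels = meta.setdefault("labels", {})
    labels.setdefault("app.kubernetes.io/name", meta.get("name", "unknown"))
    pod_spec = deployment["spec"]["template"]["spec"]
    for container in pod_spec["containers"]:
        resources = container.setdefault("resources", {})
        resources.setdefault("limits", dict(DEFAULT_LIMITS))
    return deployment

d = {"metadata": {"name": "web"},
     "spec": {"template": {"spec": {"containers": [{"name": "web"}]}}}}
remediate(d)
print(d["metadata"]["labels"])
```

In Kyverno the same effect is achieved declaratively with a mutate rule and `+(...)` "add if absent" anchors.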
Scripting & Tooling (Bash / Python / Go)
Write robust Bash scripts for operational tasks (node draining, cert rotation, log collection). Build Python tools using the Kubernetes client library for custom controllers/automation. Write Go-based kubectl plugins for self-service workflows.
Operator Pattern & Controllers
Understand the controller-runtime framework. Build a simple custom controller scaffolded with Kubebuilder (controller-gen generates the CRD and RBAC manifests). Understand reconciliation loops, finalizers, owner references, and status subresources. Evaluate and operate existing operators (Prometheus Operator, Strimzi).
Hands-On Exercises
L1 · INTRO
Write a Terraform module that provisions an AKS cluster with a system node pool and a user node pool, enabling RBAC, managed identity, and Azure Monitor integration. Use variables for environment-specific sizing.
L2 · PRACTICAL
Write a Kyverno policy that: (1) requires all Deployments to have resource requests and limits, (2) mutates missing app.kubernetes.io/name labels from the Deployment name, and (3) blocks containers running as UID 0. Test each rule with conformant and non-conformant manifests.
L2 · PRACTICAL
Write a Python script using the official Kubernetes client that lists all deployments across all namespaces where replica count is 1 (single-point-of-failure). Output a report grouped by namespace with the owning team label.
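The core of that report is pure logic once the deployments are fetched. A sketch over plain dicts — in a real script the records would come from `kubernetes.client.AppsV1Api().list_deployment_for_all_namespaces()`, and the `team` label key is an assumed convention:

```python
from collections import defaultdict

# Sketch: group single-replica (single-point-of-failure) deployments by
# namespace. Input records and the "team" label key are assumptions; the
# real data would come from the Kubernetes Python client.

def spof_report(deployments, team_label="team"):
    report = defaultdict(list)
    for d in deployments:
        if d["replicas"] == 1:
            team = d.get("labels", {}).get(team_label, "unlabelled")
            report[d["namespace"]].append((d["name"], team))
    return dict(report)

deps = [{"namespace": "shop", "name": "cart", "replicas": 1,
         "labels": {"team": "checkout"}},
        {"namespace": "shop", "name": "web", "replicas": 3, "labels": {}}]
print(spof_report(deps))
```

Separating fetch from logic like this also makes the report unit-testable without a cluster.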
L3 · ADVANCED
Build a self-service namespace provisioning workflow: a Python or Go webhook that receives a request with team name + tier, creates the namespace, applies ResourceQuota, LimitRange, NetworkPolicy defaults, and RBAC bindings. Ensure idempotency and GitOps sync compatibility.
Key Resources
HashiCorp Terraform documentation: module design and provider reference for AKS/EKS/GKE
Kyverno documentation: policy types, generate rules, and mutation patterns
OPA documentation: Rego language reference and Conftest guide
Kubernetes Python client: github.com/kubernetes-client/python
Kubebuilder book: building controllers and operators with controller-runtime
08 · Compliance & Auditing
Compliance, CVE Remediation & Audit Evidence
CIS Benchmark · CVE management · Supply-chain risk · Audit evidence · Reporting
Benchmark Compliance
Run kube-bench against CIS Kubernetes Benchmark. Track compliance scores over time. Integrate benchmark checks into CI and cluster provisioning pipelines to prevent regression. Target SOC 2, ISO 27001, or PCI-DSS alignment where required.
CVE & Vulnerability Management
Operate a continuous vulnerability management lifecycle: scan (Trivy/Grype), prioritise (CVSS + exploitability), patch (image rebuild or node OS update), verify, and report. Maintain SLAs per severity tier and track exceptions with risk acceptance records.
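SLA tracking per severity tier is date arithmetic. A sketch with illustrative tiers (the day counts are assumptions, not a standard):

```python
from datetime import date, timedelta

# Illustrative remediation SLAs per severity tier (assumed values).
SLA_DAYS = {"CRITICAL": 7, "HIGH": 30, "MEDIUM": 90, "LOW": 180}

def patch_deadline(severity, detected):
    """Date by which a finding of this severity must be remediated."""
    return detected + timedelta(days=SLA_DAYS[severity])

def is_breached(severity, detected, today):
    return today > patch_deadline(severity, detected)

found = date(2024, 1, 1)
print(patch_deadline("CRITICAL", found))
print(is_breached("HIGH", found, date(2024, 2, 15)))
```

Exceptions (accepted risk) would carry their own expiry date and be excluded from breach counts until it passes.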
Audit Evidence Collection
Enable and collect Kubernetes audit logs. Build automated evidence collection scripts for configuration drift, RBAC snapshots, image scanning results, and policy compliance reports. Store evidence with tamper-evident retention.
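Extracting evidence from audit logs can be sketched as a filter over JSON-lines events; the fields used (`verb`, `user`, `objectRef`) follow the audit.k8s.io/v1 Event schema:

```python
import json

# Sketch: filter Kubernetes API audit events for secret access.
# Each input line is one JSON-encoded audit.k8s.io/v1 Event.

def secret_access_events(lines):
    """Return (username, verb, secret_name) for every secrets access."""
    hits = []
    for line in lines:
        event = json.loads(line)
        ref = event.get("objectRef", {})
        if ref.get("resource") == "secrets":
            hits.append((event["user"]["username"],
                         event["verb"],
                         ref.get("name")))
    return hits

log = ['{"verb":"get","user":{"username":"alice"},'
       '"objectRef":{"resource":"secrets","name":"db-cred"}}',
       '{"verb":"list","user":{"username":"ci-bot"},'
       '"objectRef":{"resource":"pods"}}']
print(secret_access_events(log))
```

Aggregating these tuples by user and week yields the access report described above.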
Misconfiguration Detection
Run Kubescape, Checkov, or Polaris for continuous misconfiguration scanning. Integrate with admission control to block known-bad patterns. Generate compliance posture reports for security and leadership stakeholders.
Hands-On Exercises
L1 · INTRO
Run kube-bench on a cluster. Export the results as JSON. Write a script that categorises failures by CIS section and outputs an HTML report with a compliance score. Identify the top 5 highest-priority remediations.
L2 · PRACTICAL
Build an automated vulnerability management pipeline: daily Trivy scan of all images running in a cluster (using kubectl get pods to enumerate), de-duplicate by image digest, produce a prioritised JIRA/ticket export by severity, and track SLA breach dates.
L2 · PRACTICAL
Enable Kubernetes API server audit logging with a policy file capturing reads and writes to secrets, RBAC resources, and privileged pod creation. Parse the audit log to produce a weekly access report suitable for a security team review.
L3 · ADVANCED
Simulate an internal audit request: compile evidence for 10 security controls (network segmentation, encryption at rest, RBAC least privilege, image scanning, etc.). Automate evidence collection into a structured ZIP with a manifest. Measure time-to-evidence and identify collection gaps to automate.
Key Resources
CIS Kubernetes Benchmark: download free PDF at cisecurity.org
kube-bench: github.com/aquasecurity/kube-bench
Kubescape documentation: MITRE ATT&CK and NSA/CISA framework checks
NIST NVD / CVE database for vulnerability research
Polaris: fairwinds.com/polaris — open source Kubernetes policy engine
Recommended Learning Path
| Phase | Focus | Duration | Key Milestone |
|---|---|---|---|
| Phase 1 | CKA + CKS certification prep. Kubernetes internals, RBAC, networking, storage. Deploy your first multi-cloud cluster via Terraform. | 8–12 weeks | CKA + CKS certified; cluster from IaC in 2 clouds |
| Phase 2 | GitOps pipeline (Flux or Argo CD). Observability stack (Prometheus, Loki, Grafana). First SLO defined and tracked. Kyverno policy baseline. | 6–8 weeks | GitOps deployed; SLO dashboard live; policy enforced |
| Phase 3 | Security hardening: Falco, supply chain (Cosign), network policies, Secrets Manager integration. First CIS Benchmark run + remediation cycle. | 6–8 weeks | ≥90% CIS pass rate; supply chain signed; runtime alerts firing |
| Phase 4 | Advanced automation: custom controllers, self-service namespace provisioning, chaos engineering validation, DR test, vulnerability management pipeline. | 8–10 weeks | Full DR test passed; chaos runbook validated; audit evidence automated |
You have achieved senior platform engineering readiness when you can: provision a secure, multi-cloud cluster from IaC, operate a full GitOps and observability stack, run a DR exercise to RTO target, respond to a production incident with structured RCA, and produce automated compliance evidence — all without referencing notes.
Kubernetes Internals Quick Reference
# Key subsystems to understand deeply
Scheduler → Filtering (predicates) → Scoring (priorities) → Binding
Preemption, taints/tolerations, topology constraints
Controller Mgr → Deployment, ReplicaSet, StatefulSet, Job controllers
Reconciliation loop: observe → diff → act
API Server → Authentication → Authorization (RBAC) → Admission (mutating → validating)
Resource versioning, watches, etcd writes
etcd → Raft consensus, leader election, snapshot/restore
Encryption at rest (--encryption-provider-config)
kubelet → Pod spec sync, cgroup management, CSI/CNI/CRI calls
Eviction thresholds: memory.available, nodefs.available
CNI → Calico, Cilium, Azure CNI, AWS VPC CNI, GKE Dataplane v2
NetworkPolicy enforcement point (varies by CNI)
CSI → Provision → Attach → Mount lifecycle
StorageClass reclaim policies, volume snapshots, topology
Ingress → NGINX, Traefik, Gateway API (v1.0 GA since late 2023)
TLS termination, cert-manager integration, path routing
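The scheduler's filter → score → bind flow above can be illustrated with a toy in Python — real scheduler plugins (see pkg/scheduler) consider far more: affinity, topology spread, volume topology, and preemption.

```python
# Toy illustration of the scheduler's filter -> score flow. The node/pod
# shapes and the "most free CPU wins" scoring are illustrative assumptions.

def schedule(pod, nodes):
    # Filtering: node must have enough free CPU and the pod must tolerate
    # all of the node's taints (modeled here as simple string sets).
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"]
                and pod.get("tolerations", set()) >= n.get("taints", set())]
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    # Scoring: prefer the node with the most free CPU, a crude stand-in
    # for the real "least allocated" scoring plugin.
    return max(feasible, key=lambda n: n["free_cpu"])["name"]

nodes = [{"name": "n1", "free_cpu": 500, "taints": set()},
         {"name": "n2", "free_cpu": 2000, "taints": {"gpu"}},
         {"name": "n3", "free_cpu": 1500, "taints": set()}]
print(schedule({"cpu": 1000}, nodes))  # n1 filtered out, n2 tainted -> n3
```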
Multi-Cloud Parity Cheat Sheet
# Key differences to know across cloud providers
| | AKS (Azure) | EKS (AWS) | GKE (GCP) |
|---|---|---|---|
| Identity | Workload Identity | IRSA / Pod Identity | Workload Identity |
| CNI default | Azure CNI / Overlay | Amazon VPC CNI | Dataplane v2 (Cilium) |
| Node images | AzureLinux / Ubuntu | Amazon Linux 2 | Container-Optimized OS |
| Autoscaler | Cluster Autoscaler | Karpenter / CAS | Node Auto Provisioning |
| LB | Azure LB / App GW | AWS ALB / NLB | Cloud Load Balancing |
| Upgrades | Node pool surge | Managed Node Group | Node pool upgrade |
| Secrets | Key Vault CSI Driver | Secrets Manager CSI | Secret Manager + ESO |
| Networking | VNet peering | VPC peering | VPC native |
| Enclave | Azure Confidential | Nitro Enclaves | Confidential GKE |
Essential Scripting Patterns
# Bash — drain all nodes in a given AZ
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=60
done
# Python — list images with HIGH+ CVEs across all running pods
from kubernetes import client, config
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_pod_for_all_namespaces()
images = {c.image for pod in pods.items for c in pod.spec.containers}
# then: for each image, run trivy image --format json --exit-code 1
# PromQL — 5m error budget burn rate (SLO: 99.9% availability)
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) / (1 - 0.999) # > 1 means burning budget faster than allowed
Incident Response Quick-Reference
# Triage toolkit — commands for common scenarios
# Node pressure / OOM events
kubectl describe node <node> | grep -A5 Conditions
kubectl get events --sort-by=.lastTimestamp -A | grep -i "OOM\|Evicted\|Failed"
# Pod stuck Terminating (stuck finalizer)
kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}' --type=merge
# API server latency spike
kubectl get --raw /metrics | grep apiserver_request_duration_seconds_bucket
# etcd health check
ETCDCTL_API=3 etcdctl endpoint health --cluster
# Cordon + drain a suspect node safely
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Capture cluster state snapshot for RCA
kubectl cluster-info dump --output-directory=/tmp/cluster-dump --all-namespaces