Objective
Each failure scenario follows the same triage loop: observe the symptom, read the signal (events, logs, describe output), identify the root cause, and apply a fix. Working through all five builds pattern recognition so you can diagnose production incidents faster under pressure.
Prerequisites
- A running Kubernetes cluster with kubectl access
- A namespace you can freely create broken workloads in
- Sufficient node capacity for at least 4 pods
- kubectl and jq installed locally
Quick Reference: Triage Signals
| Failure State | Primary Signal | Root Cause Category | First Fix Attempt |
|---|---|---|---|
| ImagePullBackOff | kubectl get pod → ErrImagePull → ImagePullBackOff | Registry auth or image tag wrong | Fix image name or add imagePullSecret |
| OOMKilled | Exit code 137, Reason: OOMKilled | Memory limit too low or leak | Raise limit or profile memory usage |
| Evicted | Status: Failed, Reason: Evicted | Node disk/memory pressure | Free node resources or add resource requests |
| Pending | Status: Pending, no node assigned | Insufficient CPU/memory or taint | Reduce requests or add toleration |
| Terminating | Status: Terminating, age keeps growing | Finalizer not cleared | Patch finalizers: [] to unblock deletion |
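The table above can be folded into a small triage helper. The sketch below is ours, not a kubectl feature: `classify_pod` reads pod JSON on stdin and prints the most specific failure signal it finds, checking the last termination first, then the current waiting state, then the pod-level reason and phase. In real use you would feed it `kubectl get pod <name> -n <ns> -o json`; here an embedded sample stands in for live cluster output.

```shell
# classify_pod: print the most specific failure reason in a pod's status.
# Hypothetical helper name; requires jq.
classify_pod() {
  jq -r '.status.containerStatuses[0].lastState.terminated.reason
         // .status.containerStatuses[0].state.waiting.reason
         // .status.reason
         // .status.phase'
}

# Demo on a sample OOMKilled-then-backoff status:
classify_pod << 'EOF'
{"status":{"phase":"Running","containerStatuses":[{"lastState":{"terminated":{"exitCode":137,"reason":"OOMKilled"}},"state":{"waiting":{"reason":"CrashLoopBackOff"}}}]}}
EOF
```

An Evicted pod, whose status has no containerStatuses entries, falls through to the pod-level reason and classifies as "Evicted".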
Steps
Simulate and fix an image pull failure
Deploy a pod referencing a non-existent image tag. Observe the exponential backoff, read the event, and fix it.
```shell
# Create a namespace for all five scenarios
kubectl create namespace failure-lab

# Deploy a pod with a deliberately wrong image tag
kubectl run bad-image \
  --image=nginx:99.99.99-does-not-exist \
  -n failure-lab

# Watch the status progress from ErrImagePull → ImagePullBackOff
kubectl get pod bad-image -n failure-lab -w
## Expected output after ~30 seconds:
## NAME        READY   STATUS             RESTARTS   AGE
## bad-image   0/1     ErrImagePull       0          8s
## bad-image   0/1     ImagePullBackOff   0          21s

# Read the event — it tells you exactly what went wrong
kubectl describe pod bad-image -n failure-lab | grep -A 8 Events:
## Failed to pull image "nginx:99.99.99-does-not-exist": rpc error:
## manifest for nginx:99.99.99-does-not-exist not found: manifest unknown

# Fix: use a valid tag. Most pod spec fields are immutable, so the
# simplest route is to delete the pod and recreate it
kubectl delete pod bad-image -n failure-lab
kubectl run bad-image \
  --image=nginx:1.25-alpine \
  -n failure-lab

kubectl get pod bad-image -n failure-lab
## NAME        READY   STATUS    RESTARTS   AGE
## bad-image   1/1     Running   0          5s
```
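The same event text is also available as structured data. This sketch pulls the waiting reason and full registry message with jq instead of grepping describe output; the embedded JSON is a stand-in for what `kubectl get pod bad-image -n failure-lab -o json` would return.

```shell
# Extract the image-pull failure as "<reason>: <message>" from pod JSON.
# The sample input below mimics a live ImagePullBackOff status.
jq -r '.status.containerStatuses[0].state.waiting
       | "\(.reason): \(.message)"' << 'EOF'
{"status":{"containerStatuses":[{"state":{"waiting":{"reason":"ImagePullBackOff","message":"Back-off pulling image \"nginx:99.99.99-does-not-exist\""}}}]}}
EOF
```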
Simulate and fix an out-of-memory container kill
Deploy a container that allocates more memory than its limit allows. The kernel OOM killer terminates it with exit code 137.
```shell
# Deploy a pod with a very low memory limit
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
  namespace: failure-lab
spec:
  containers:
  - name: stress
    image: polinux/stress
    args:
    - stress
    - --vm
    - "1"
    - --vm-bytes
    - "150M"           # tries to allocate 150 MiB
    - --vm-hang
    - "1"
    resources:
      limits:
        memory: "64Mi"  # but limit is only 64 MiB — will OOMKill
      requests:
        memory: "32Mi"
EOF

# Watch it get killed within a few seconds
kubectl get pod oom-demo -n failure-lab -w
## NAME       READY   STATUS             RESTARTS   AGE
## oom-demo   0/1     OOMKilled          0          4s
## oom-demo   0/1     CrashLoopBackOff   1          10s

# Confirm the exit code is 137 (128 + SIGKILL)
kubectl get pod oom-demo -n failure-lab -o json \
  | jq '.status.containerStatuses[0].lastState.terminated'
## {
##   "exitCode": 137,
##   "reason": "OOMKilled",
##   "startedAt": "...",
##   "finishedAt": "..."
## }

# Fix: raise the memory limit
kubectl delete pod oom-demo -n failure-lab
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
  namespace: failure-lab
spec:
  containers:
  - name: stress
    image: polinux/stress
    args: [stress, --vm, "1", --vm-bytes, "150M", --vm-hang, "60"]
    resources:
      limits:
        memory: "256Mi"  # enough headroom
      requests:
        memory: "160Mi"
EOF

kubectl get pod oom-demo -n failure-lab
## NAME       READY   STATUS    RESTARTS   AGE
## oom-demo   1/1     Running   0          6s
```
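Exit codes above 128 follow a fixed convention: the code is 128 plus the number of the signal that killed the process. A minimal decoder sketch (`decode_exit` is a hypothetical helper, not a kubectl feature):

```shell
# Decode a container exit code. Values above 128 mean "killed by signal
# (code - 128)": 137 = SIGKILL (9, what the OOM killer sends),
# 143 = SIGTERM (15, a graceful shutdown request).
decode_exit() {
  if [ "$1" -gt 128 ]; then
    echo "$1 = 128 + $(( $1 - 128 )) (killed by signal $(( $1 - 128 )))"
  else
    echo "$1 = normal exit, application status $1"
  fi
}

decode_exit 137   # 137 = 128 + 9 (killed by signal 9)
decode_exit 1     # 1 = normal exit, application status 1
```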
Understand eviction and prevent it with proper resource requests
Pods without any resource requests get the BestEffort QoS class and are the first evicted when a node hits disk or memory pressure; Burstable pods (requests set below limits) go next, and Guaranteed pods last. This scenario shows how to read eviction events and prevent eviction with proper resource requests.
```shell
# Deploy a pod without any resource requests (BestEffort QoS)
kubectl run best-effort-pod \
  --image=nginx:1.25-alpine \
  -n failure-lab

# Check its QoS class
kubectl get pod best-effort-pod -n failure-lab -o json \
  | jq '.status.qosClass'
## "BestEffort"

# In a real eviction event, you'd see this in pod status:
# kubectl get pod <name> -n <ns> -o json | jq '.status'
## {
##   "phase": "Failed",
##   "reason": "Evicted",
##   "message": "The node was low on resource: ephemeral-storage."
## }

# Find all evicted pods across all namespaces
kubectl get pods --all-namespaces \
  --field-selector=status.phase=Failed \
  -o json | jq -r \
  '.items[] | select(.status.reason=="Evicted")
   | [.metadata.namespace, .metadata.name, .status.message] | @tsv'

# Under node pressure, the kubelet evicts in this order:
#   1. BestEffort pods (no requests)
#   2. Burstable pods exceeding their requests
#   3. Guaranteed pods (limits == requests) — last evicted

# Fix: add resource requests so the pod becomes Burstable (safer)
kubectl delete pod best-effort-pod -n failure-lab
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
  namespace: failure-lab
spec:
  containers:
  - name: nginx
    image: nginx:1.25-alpine
    resources:
      requests:
        cpu: "50m"
        memory: "32Mi"
      limits:
        cpu: "200m"
        memory: "64Mi"
EOF

kubectl get pod burstable-pod -n failure-lab -o json \
  | jq '.status.qosClass'
## "Burstable"

# Clean up evicted pod records (they stay around as Failed)
kubectl get pods -n failure-lab \
  --field-selector=status.phase=Failed -o name \
  | xargs -r kubectl delete -n failure-lab 2>/dev/null || true
```
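How the QoS class follows from a pod's spec can be sketched in jq. This is a simplification of the real rules (which also count init containers and match per-resource); `qos_class` is our helper name, and it reads a pod manifest as JSON on stdin rather than asking the API server for `.status.qosClass`.

```shell
# qos_class: derive a pod's QoS class from its spec alone (simplified):
#   no requests or limits anywhere            -> BestEffort
#   cpu+memory requests == limits everywhere  -> Guaranteed
#   anything else                             -> Burstable
qos_class() {
  jq -r '
    [.spec.containers[].resources] as $r
    | if all($r[]; ((.requests // {}) == {}) and ((.limits // {}) == {}))
      then "BestEffort"
      elif all($r[]; .requests.cpu != null and .requests.memory != null
                     and .requests == .limits)
      then "Guaranteed"
      else "Burstable" end'
}

# Demo: requests below limits -> Burstable
qos_class << 'EOF'
{"spec":{"containers":[{"resources":{"requests":{"cpu":"50m","memory":"32Mi"},"limits":{"cpu":"200m","memory":"64Mi"}}}]}}
EOF
```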
Diagnose and fix an unschedulable pod
A pod that cannot be scheduled stays Pending indefinitely. The scheduler records the reason in events. This scenario covers both resource exhaustion and taint-based blocking.
```shell
# Deploy a pod requesting more CPU than any node has
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: unschedulable-pod
  namespace: failure-lab
spec:
  containers:
  - name: nginx
    image: nginx:1.25-alpine
    resources:
      requests:
        cpu: "99"        # 99 cores — no node has this
        memory: "256Mi"
EOF

kubectl get pod unschedulable-pod -n failure-lab
## NAME                READY   STATUS    RESTARTS   AGE
## unschedulable-pod   0/1     Pending   0          30s

# Read the scheduler's reason from events
kubectl describe pod unschedulable-pod -n failure-lab \
  | grep -A 5 "Events:"
## Warning  FailedScheduling  default-scheduler
## 0/3 nodes are available: 3 Insufficient cpu.
## preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

# Check actual allocatable capacity on each node
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory'
## NAME     CPU     MEM
## node-1   3920m   14336Mi
## node-2   3920m   14336Mi

# Fix: reduce the CPU request to something schedulable
kubectl delete pod unschedulable-pod -n failure-lab
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: schedulable-pod
  namespace: failure-lab
spec:
  containers:
  - name: nginx
    image: nginx:1.25-alpine
    resources:
      requests:
        cpu: "100m"
        memory: "64Mi"
EOF

kubectl get pod schedulable-pod -n failure-lab
## NAME              READY   STATUS    RESTARTS   AGE
## schedulable-pod   1/1     Running   0          4s

## --- Bonus: taint-based Pending ---

# Taint a node to block scheduling. Note: on a multi-node cluster the pod
# will simply land on an untainted node — taint every node to see Pending.
NODE=$(kubectl get nodes -o name | head -1 | cut -d/ -f2)
kubectl taint node "$NODE" env=prod:NoSchedule

# A pod without a matching toleration stays Pending
kubectl run taint-test --image=nginx:1.25-alpine -n failure-lab
kubectl describe pod taint-test -n failure-lab | grep "FailedScheduling"
## Warning FailedScheduling: 1 node(s) had untolerated taint {env: prod}

# Remove the taint to restore scheduling
kubectl taint node "$NODE" env=prod:NoSchedule-
kubectl delete pod taint-test -n failure-lab
```
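Comparing a request against allocatable capacity means normalizing CPU quantities first, since nodes report millicores ("3920m") while the pod above asked for whole cores ("99"). A minimal conversion sketch, handling only integer core counts and the "m" suffix (`to_millicores` is a hypothetical helper):

```shell
# to_millicores: convert a Kubernetes CPU quantity to an integer number
# of millicores. Covers "2" -> 2000 and "100m" -> 100; fractional cores
# like "0.5" are not handled in this sketch.
to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

request=$(to_millicores "99")     # 99000
alloc=$(to_millicores "3920m")    # 3920
if [ "$request" -gt "$alloc" ]; then
  echo "request ${request}m exceeds allocatable ${alloc}m: pod stays Pending"
fi
```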
Remove a stuck finalizer to unblock deletion
Finalizers are strings in metadata.finalizers that prevent an object from being deleted until a controller clears them. If the controller is gone or broken, the object is stuck in Terminating forever.
```shell
# Create a pod with a custom finalizer
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: finalizer-demo
  namespace: failure-lab
  finalizers:
  - example.com/custom-cleanup   # a controller is supposed to remove this
spec:
  containers:
  - name: nginx
    image: nginx:1.25-alpine
EOF

kubectl get pod finalizer-demo -n failure-lab
## NAME             READY   STATUS    RESTARTS   AGE
## finalizer-demo   1/1     Running   0          5s

# Delete the pod — it will hang in Terminating
kubectl delete pod finalizer-demo -n failure-lab &

# In another terminal, observe it stuck
kubectl get pod finalizer-demo -n failure-lab
## NAME             READY   STATUS        RESTARTS   AGE
## finalizer-demo   0/1     Terminating   0          45s

# Confirm the finalizer is blocking deletion
kubectl get pod finalizer-demo -n failure-lab -o json \
  | jq '.metadata.finalizers'
## ["example.com/custom-cleanup"]

# Check that deletionTimestamp is set (deletion was requested)
kubectl get pod finalizer-demo -n failure-lab -o json \
  | jq '.metadata.deletionTimestamp'
## "2024-11-15T10:23:47Z"

# Fix: remove the finalizers list — this unblocks deletion
kubectl patch pod finalizer-demo -n failure-lab \
  --type=json \
  -p='[{"op":"remove","path":"/metadata/finalizers"}]'

# The pod should now disappear immediately
kubectl get pod finalizer-demo -n failure-lab
## Error from server (NotFound): pods "finalizer-demo" not found

# The same pattern works for any Kubernetes resource, including stuck
# namespace deletion:
# kubectl patch namespace <ns> --type=json \
#   -p='[{"op":"remove","path":"/metadata/finalizers"}]'
```
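When a whole namespace hangs in Terminating, the first question is which objects inside it still carry finalizers. This jq sketch scans a Kubernetes List for them; in real use you would feed it something like `kubectl get all -n failure-lab -o json`, while the embedded List here stands in for live output.

```shell
# List every object in a List document whose finalizers are non-empty —
# the usual culprits behind a namespace stuck in Terminating.
jq -r '.items[]
       | select(((.metadata.finalizers // []) | length) > 0)
       | [.kind, .metadata.name, (.metadata.finalizers | join(","))]
       | @tsv' << 'EOF'
{"items":[
  {"kind":"Pod","metadata":{"name":"finalizer-demo","finalizers":["example.com/custom-cleanup"]}},
  {"kind":"Pod","metadata":{"name":"healthy","finalizers":[]}}
]}
EOF
```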
Cleanup and validate resolution of all scenarios
```shell
# Verify no pods are in a failed state in failure-lab
kubectl get pods -n failure-lab

# Delete the namespace to clean up all resources
kubectl delete namespace failure-lab

# Confirm deletion
kubectl get namespace failure-lab
## Error from server (NotFound): namespaces "failure-lab" not found

## --- Summary of diagnostic commands ---
# Universal triage starting point for any pod failure:
kubectl describe pod <name> -n <ns>        # Events + state + probe results
kubectl logs <name> -n <ns> --previous     # Logs from the previous container run
kubectl get pod <name> -n <ns> -o json \
  | jq '.status.containerStatuses'         # Exit codes + restart counts + OOM reason
kubectl get events -n <ns> \
  --sort-by='.lastTimestamp'               # Chronological event stream
```
Success Criteria
- All five failure states (ImagePullBackOff, OOMKilled, Evicted, Pending, Terminating) were reproduced and their primary signal observed
- Each fixed pod reached Running, and the finalizer-demo pod deleted cleanly after its finalizer was removed
- The failure-lab namespace was deleted with no leftover resources
Further Reading
- Pod lifecycle — kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle
- Resource QoS — kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod
- Node-pressure eviction — kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction
- Finalizers — kubernetes.io/docs/concepts/overview/working-with-objects/finalizers
- Scheduler — kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler