Objective
Each failure scenario follows the same triage loop: observe the symptom, read the signal (events, logs, describe output), identify the root cause, and apply a fix. Working through all five builds pattern recognition so you can diagnose production incidents faster under pressure.
Prerequisites
- A running Kubernetes cluster with kubectl access
- A namespace you can freely create broken workloads in
- Sufficient node capacity for at least 4 pods
- kubectl and jq installed locally
Quick Reference: Triage Signals
| Failure State | Primary Signal | Root Cause Category | First Fix Attempt |
|---|---|---|---|
| ImagePullBackOff | kubectl get pod → ErrImagePull → ImagePullBackOff | Registry auth or image tag wrong | Fix image name or add imagePullSecret |
| OOMKilled | Exit code 137, Reason: OOMKilled | Memory limit too low or leak | Raise limit or profile memory usage |
| Evicted | Status: Failed, Reason: Evicted | Node disk/memory pressure | Free node resources or add resource requests |
| Pending | Status: Pending, no node assigned | Insufficient CPU/memory or taint | Reduce requests or add toleration |
| Terminating | Status: Terminating, age keeps growing | Finalizer not cleared | Patch finalizers: [] to unblock deletion |
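The table above can be folded into a small triage helper. The sketch below is ours, not a kubectl feature: `classify_pod` reads pod JSON on stdin and prints the most specific failure signal it finds, checking the last termination first, then the current waiting state, then the pod-level reason and phase. In real use you would feed it `kubectl get pod <name> -n <ns> -o json`; here an embedded sample stands in for live cluster output.

```shell
# classify_pod: print the most specific failure reason in a pod's status.
# Hypothetical helper name; requires jq.
classify_pod() {
  jq -r '.status.containerStatuses[0].lastState.terminated.reason
         // .status.containerStatuses[0].state.waiting.reason
         // .status.reason
         // .status.phase'
}

# Demo on a sample OOMKilled-then-backoff status:
classify_pod << 'EOF'
{"status":{"phase":"Running","containerStatuses":[{"lastState":{"terminated":{"exitCode":137,"reason":"OOMKilled"}},"state":{"waiting":{"reason":"CrashLoopBackOff"}}}]}}
EOF
```

An Evicted pod, whose status has no containerStatuses entries, falls through to the pod-level reason and classifies as "Evicted".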
Steps
Simulate and fix an image pull failure
Deploy a pod referencing a non-existent image tag. Observe the exponential backoff, read the event, and fix it.
```shell
# Create a namespace for all five scenarios
kubectl create namespace failure-lab

# Deploy a pod with a deliberately wrong image tag
kubectl run bad-image \
  --image=nginx:99.99.99-does-not-exist \
  -n failure-lab

# Watch the status progress from ErrImagePull → ImagePullBackOff
kubectl get pod bad-image -n failure-lab -w
## Expected output after ~30 seconds:
## NAME        READY   STATUS             RESTARTS   AGE
## bad-image   0/1     ErrImagePull       0          8s
## bad-image   0/1     ImagePullBackOff   0          21s

# Read the event — it tells you exactly what went wrong
kubectl describe pod bad-image -n failure-lab | grep -A 8 Events:
## Failed to pull image "nginx:99.99.99-does-not-exist": rpc error:
## manifest for nginx:99.99.99-does-not-exist not found: manifest unknown

# Fix: use a valid tag. Most pod spec fields are immutable, so the
# simplest route is to delete the pod and recreate it
kubectl delete pod bad-image -n failure-lab
kubectl run bad-image \
  --image=nginx:1.25-alpine \
  -n failure-lab

kubectl get pod bad-image -n failure-lab
## NAME        READY   STATUS    RESTARTS   AGE
## bad-image   1/1     Running   0          5s
```
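The same event text is also available as structured data. This sketch pulls the waiting reason and full registry message with jq instead of grepping describe output; the embedded JSON is a stand-in for what `kubectl get pod bad-image -n failure-lab -o json` would return.

```shell
# Extract the image-pull failure as "<reason>: <message>" from pod JSON.
# The sample input below mimics a live ImagePullBackOff status.
jq -r '.status.containerStatuses[0].state.waiting
       | "\(.reason): \(.message)"' << 'EOF'
{"status":{"containerStatuses":[{"state":{"waiting":{"reason":"ImagePullBackOff","message":"Back-off pulling image \"nginx:99.99.99-does-not-exist\""}}}]}}
EOF
```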
Simulate and fix an out-of-memory container kill
Deploy a container that allocates more memory than its limit allows. The kernel OOM killer terminates it with exit code 137.
```shell
# Deploy a pod with a very low memory limit
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
  namespace: failure-lab
spec:
  containers:
  - name: stress
    image: polinux/stress
    args:
    - stress
    - --vm
    - "1"
    - --vm-bytes
    - "150M"           # tries to allocate 150 MiB
    - --vm-hang
    - "1"
    resources:
      limits:
        memory: "64Mi"  # but limit is only 64 MiB — will OOMKill
      requests:
        memory: "32Mi"
EOF

# Watch it get killed within a few seconds
kubectl get pod oom-demo -n failure-lab -w
## NAME       READY   STATUS             RESTARTS   AGE
## oom-demo   0/1     OOMKilled          0          4s
## oom-demo   0/1     CrashLoopBackOff   1          10s

# Confirm the exit code is 137 (128 + SIGKILL)
kubectl get pod oom-demo -n failure-lab -o json \
  | jq '.status.containerStatuses[0].lastState.terminated'
## {
##   "exitCode": 137,
##   "reason": "OOMKilled",
##   "startedAt": "...",
##   "finishedAt": "..."
## }

# Fix: raise the memory limit
kubectl delete pod oom-demo -n failure-lab
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
  namespace: failure-lab
spec:
  containers:
  - name: stress
    image: polinux/stress
    args: [stress, --vm, "1", --vm-bytes, "150M", --vm-hang, "60"]
    resources:
      limits:
        memory: "256Mi"  # enough headroom
      requests:
        memory: "160Mi"
EOF

kubectl get pod oom-demo -n failure-lab
## NAME       READY   STATUS    RESTARTS   AGE
## oom-demo   1/1     Running   0          6s
```
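Exit codes above 128 follow a fixed convention: the code is 128 plus the number of the signal that killed the process. A minimal decoder sketch (`decode_exit` is a hypothetical helper, not a kubectl feature):

```shell
# Decode a container exit code. Values above 128 mean "killed by signal
# (code - 128)": 137 = SIGKILL (9, what the OOM killer sends),
# 143 = SIGTERM (15, a graceful shutdown request).
decode_exit() {
  if [ "$1" -gt 128 ]; then
    echo "$1 = 128 + $(( $1 - 128 )) (killed by signal $(( $1 - 128 )))"
  else
    echo "$1 = normal exit, application status $1"
  fi
}

decode_exit 137   # 137 = 128 + 9 (killed by signal 9)
decode_exit 1     # 1 = normal exit, application status 1
```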
Understand eviction and prevent it with proper resource requests
Pods without any resource requests get the BestEffort QoS class and are the first evicted when a node hits disk or memory pressure; Burstable pods (requests set below limits) go next, and Guaranteed pods last. This scenario shows how to read eviction events and prevent eviction with proper resource requests.
```shell
# Deploy a pod without any resource requests (BestEffort QoS)
kubectl run best-effort-pod \
  --image=nginx:1.25-alpine \
  -n failure-lab

# Check its QoS class
kubectl get pod best-effort-pod -n failure-lab -o json \
  | jq '.status.qosClass'
## "BestEffort"

# In a real eviction event, you'd see this in pod status:
# kubectl get pod <name> -n <ns> -o json | jq '.status'
## {
##   "phase": "Failed",
##   "reason": "Evicted",
##   "message": "The node was low on resource: ephemeral-storage."
## }

# Find all evicted pods across all namespaces
kubectl get pods --all-namespaces \
  --field-selector=status.phase=Failed \
  -o json | jq -r \
  '.items[] | select(.status.reason=="Evicted")
   | [.metadata.namespace, .metadata.name, .status.message] | @tsv'

# Under node pressure, the kubelet evicts in this order:
#   1. BestEffort pods (no requests)
#   2. Burstable pods exceeding their requests
#   3. Guaranteed pods (limits == requests) — last evicted

# Fix: add resource requests so the pod becomes Burstable (safer)
kubectl delete pod best-effort-pod -n failure-lab
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
  namespace: failure-lab
spec:
  containers:
  - name: nginx
    image: nginx:1.25-alpine
    resources:
      requests:
        cpu: "50m"
        memory: "32Mi"
      limits:
        cpu: "200m"
        memory: "64Mi"
EOF

kubectl get pod burstable-pod -n failure-lab -o json \
  | jq '.status.qosClass'
## "Burstable"

# Clean up evicted pod records (they stay around as Failed)
kubectl get pods -n failure-lab \
  --field-selector=status.phase=Failed -o name \
  | xargs -r kubectl delete -n failure-lab 2>/dev/null || true
```
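How the QoS class follows from a pod's spec can be sketched in jq. This is a simplification of the real rules (which also count init containers and match per-resource); `qos_class` is our helper name, and it reads a pod manifest as JSON on stdin rather than asking the API server for `.status.qosClass`.

```shell
# qos_class: derive a pod's QoS class from its spec alone (simplified):
#   no requests or limits anywhere            -> BestEffort
#   cpu+memory requests == limits everywhere  -> Guaranteed
#   anything else                             -> Burstable
qos_class() {
  jq -r '
    [.spec.containers[].resources] as $r
    | if all($r[]; ((.requests // {}) == {}) and ((.limits // {}) == {}))
      then "BestEffort"
      elif all($r[]; .requests.cpu != null and .requests.memory != null
                     and .requests == .limits)
      then "Guaranteed"
      else "Burstable" end'
}

# Demo: requests below limits -> Burstable
qos_class << 'EOF'
{"spec":{"containers":[{"resources":{"requests":{"cpu":"50m","memory":"32Mi"},"limits":{"cpu":"200m","memory":"64Mi"}}}]}}
EOF
```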
Diagnose and fix an unschedulable pod
A pod that cannot be scheduled stays Pending indefinitely. The scheduler records the reason in events. This scenario covers both resource exhaustion and taint-based blocking.
```shell
# Deploy a pod requesting more CPU than any node has
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: unschedulable-pod
  namespace: failure-lab
spec:
  containers:
  - name: nginx
    image: nginx:1.25-alpine
    resources:
      requests:
        cpu: "99"        # 99 cores — no node has this
        memory: "256Mi"
EOF

kubectl get pod unschedulable-pod -n failure-lab
## NAME                READY   STATUS    RESTARTS   AGE
## unschedulable-pod   0/1     Pending   0          30s

# Read the scheduler's reason from events
kubectl describe pod unschedulable-pod -n failure-lab \
  | grep -A 5 "Events:"
## Warning  FailedScheduling  default-scheduler
## 0/3 nodes are available: 3 Insufficient cpu.
## preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

# Check actual allocatable capacity on each node
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory'
## NAME     CPU     MEM
## node-1   3920m   14336Mi
## node-2   3920m   14336Mi

# Fix: reduce the CPU request to something schedulable
kubectl delete pod unschedulable-pod -n failure-lab
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: schedulable-pod
  namespace: failure-lab
spec:
  containers:
  - name: nginx
    image: nginx:1.25-alpine
    resources:
      requests:
        cpu: "100m"
        memory: "64Mi"
EOF

kubectl get pod schedulable-pod -n failure-lab
## NAME              READY   STATUS    RESTARTS   AGE
## schedulable-pod   1/1     Running   0          4s

## --- Bonus: taint-based Pending ---

# Taint a node to block scheduling. Note: on a multi-node cluster the pod
# will simply land on an untainted node — taint every node to see Pending.
NODE=$(kubectl get nodes -o name | head -1 | cut -d/ -f2)
kubectl taint node "$NODE" env=prod:NoSchedule

# A pod without a matching toleration stays Pending
kubectl run taint-test --image=nginx:1.25-alpine -n failure-lab
kubectl describe pod taint-test -n failure-lab | grep "FailedScheduling"
## Warning FailedScheduling: 1 node(s) had untolerated taint {env: prod}

# Remove the taint to restore scheduling
kubectl taint node "$NODE" env=prod:NoSchedule-
kubectl delete pod taint-test -n failure-lab
```
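Comparing a request against allocatable capacity means normalizing CPU quantities first, since nodes report millicores ("3920m") while the pod above asked for whole cores ("99"). A minimal conversion sketch, handling only integer core counts and the "m" suffix (`to_millicores` is a hypothetical helper):

```shell
# to_millicores: convert a Kubernetes CPU quantity to an integer number
# of millicores. Covers "2" -> 2000 and "100m" -> 100; fractional cores
# like "0.5" are not handled in this sketch.
to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

request=$(to_millicores "99")     # 99000
alloc=$(to_millicores "3920m")    # 3920
if [ "$request" -gt "$alloc" ]; then
  echo "request ${request}m exceeds allocatable ${alloc}m: pod stays Pending"
fi
```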
Remove a stuck finalizer to unblock deletion
Finalizers are strings in metadata.finalizers that prevent an object from being deleted until a controller clears them. If the controller is gone or broken, the object is stuck in Terminating forever.
```shell
# Create a pod with a custom finalizer
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: finalizer-demo
  namespace: failure-lab
  finalizers:
  - example.com/custom-cleanup   # a controller is supposed to remove this
spec:
  containers:
  - name: nginx
    image: nginx:1.25-alpine
EOF

kubectl get pod finalizer-demo -n failure-lab
## NAME             READY   STATUS    RESTARTS   AGE
## finalizer-demo   1/1     Running   0          5s

# Delete the pod — it will hang in Terminating
kubectl delete pod finalizer-demo -n failure-lab &

# In another terminal, observe it stuck
kubectl get pod finalizer-demo -n failure-lab
## NAME             READY   STATUS        RESTARTS   AGE
## finalizer-demo   0/1     Terminating   0          45s

# Confirm the finalizer is blocking deletion
kubectl get pod finalizer-demo -n failure-lab -o json \
  | jq '.metadata.finalizers'
## ["example.com/custom-cleanup"]

# Check that deletionTimestamp is set (deletion was requested)
kubectl get pod finalizer-demo -n failure-lab -o json \
  | jq '.metadata.deletionTimestamp'
## "2024-11-15T10:23:47Z"

# Fix: remove the finalizers list — this unblocks deletion
kubectl patch pod finalizer-demo -n failure-lab \
  --type=json \
  -p='[{"op":"remove","path":"/metadata/finalizers"}]'

# The pod should now disappear immediately
kubectl get pod finalizer-demo -n failure-lab
## Error from server (NotFound): pods "finalizer-demo" not found

# The same pattern works for any Kubernetes resource, including stuck
# namespace deletion:
# kubectl patch namespace <ns> --type=json \
#   -p='[{"op":"remove","path":"/metadata/finalizers"}]'
```
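When a whole namespace hangs in Terminating, the first question is which objects inside it still carry finalizers. This jq sketch scans a Kubernetes List for them; in real use you would feed it something like `kubectl get all -n failure-lab -o json`, while the embedded List here stands in for live output.

```shell
# List every object in a List document whose finalizers are non-empty —
# the usual culprits behind a namespace stuck in Terminating.
jq -r '.items[]
       | select(((.metadata.finalizers // []) | length) > 0)
       | [.kind, .metadata.name, (.metadata.finalizers | join(","))]
       | @tsv' << 'EOF'
{"items":[
  {"kind":"Pod","metadata":{"name":"finalizer-demo","finalizers":["example.com/custom-cleanup"]}},
  {"kind":"Pod","metadata":{"name":"healthy","finalizers":[]}}
]}
EOF
```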
Cleanup and validate resolution of all scenarios
```shell
# Verify no pods are in a failed state in failure-lab
kubectl get pods -n failure-lab

# Delete the namespace to clean up all resources
kubectl delete namespace failure-lab

# Confirm deletion
kubectl get namespace failure-lab
## Error from server (NotFound): namespaces "failure-lab" not found

## --- Summary of diagnostic commands ---
# Universal triage starting point for any pod failure:
kubectl describe pod <name> -n <ns>        # Events + state + probe results
kubectl logs <name> -n <ns> --previous     # Logs from the previous container run
kubectl get pod <name> -n <ns> -o json \
  | jq '.status.containerStatuses'         # Exit codes + restart counts + OOM reason
kubectl get events -n <ns> \
  --sort-by='.lastTimestamp'               # Chronological event stream
```
Success Criteria
- All five failure states (ImagePullBackOff, OOMKilled, Evicted, Pending, Terminating) were reproduced and their primary signal observed
- Each fixed pod reached Running, and the finalizer-demo pod deleted cleanly after its finalizer was removed
- The failure-lab namespace was deleted with no leftover resources
Further Reading
- Pod lifecycle — kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle
- Resource QoS — kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod
- Node-pressure eviction — kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction
- Finalizers — kubernetes.io/docs/concepts/overview/working-with-objects/finalizers
- Scheduler — kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler