Observability L3 · ADVANCED ~90 min

Write a PodCrashLoopBackOff Incident Runbook

Design and write a complete runbook for a PodCrashLoopBackOff flood incident. Include a detection PromQL alert, triage steps, a decision tree over causes (OOM vs config error vs application crash vs probe failure), remediation for each cause, and an escalation path. Validate the runbook against a simulated incident.

Objective

A CrashLoopBackOff flood (many pods crashing simultaneously) is one of the most common and urgent Kubernetes incidents. Without a runbook, engineers waste time on inconsistent triage. This exercise builds a production-quality runbook and then validates it by deliberately triggering each crash cause in a test environment.

Prerequisites

- A Kubernetes test cluster with kubectl access and permission to create namespaces
- kube-prometheus-stack (or another Prometheus Operator installation), since Step 01 creates a PrometheusRule resource
- metrics-server installed, so kubectl top works during triage

Steps

01

Create the detection alert rule

# crashloop-alert.yaml
cat << 'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crashloop-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: crashloop
    rules:
    - alert: PodCrashLoopFlood
      expr: |
        count by (namespace) (
          kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        ) > 3
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "CrashLoopBackOff flood in {{ $labels.namespace }}"
        description: "{{ $value }} pods in CrashLoopBackOff in namespace {{ $labels.namespace }}"
        runbook_url: "https://wiki.company.com/runbooks/crashloop-flood"

    - alert: SinglePodCrashLoop
      expr: |
        kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} in CrashLoopBackOff"
        description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is CrashLoopBackOff"
EOF
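The rule logic can be unit-tested offline before it reaches the cluster. The following is a sketch using promtool's rule tests; it assumes the spec.groups block above has been extracted into a plain Prometheus rules file named crashloop-rules.yaml, since promtool does not read PrometheusRule custom resources directly:

```yaml
# crashloop-rules-test.yaml
# Run with: promtool test rules crashloop-rules-test.yaml
rule_files:
  - crashloop-rules.yaml   # plain rules file: just the groups: block from the CR
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Four pods stuck in CrashLoopBackOff in the same namespace → flood
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",namespace="prod",pod="a"}'
        values: '1x10'
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",namespace="prod",pod="b"}'
        values: '1x10'
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",namespace="prod",pod="c"}'
        values: '1x10'
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",namespace="prod",pod="d"}'
        values: '1x10'
    alert_rule_test:
      - eval_time: 5m          # past the 2m "for:" window, so the alert is firing
        alertname: PodCrashLoopFlood
        exp_alerts:
          - exp_labels:
              severity: critical
              namespace: prod
            exp_annotations:
              summary: "CrashLoopBackOff flood in prod"
              description: "4 pods in CrashLoopBackOff in namespace prod"
              runbook_url: "https://wiki.company.com/runbooks/crashloop-flood"
```

A passing run prints SUCCESS; a failing one shows the alerts that actually fired, which catches threshold and label mistakes before deployment.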
02

Triage Phase 1 — Initial assessment (target: under 5 minutes)

## RUNBOOK: PodCrashLoopBackOff Flood
## Severity: P1 when >3 pods affected; P2 for single pod
## Last Updated: 2024-01

## PHASE 1: Initial Triage (target: 5 min)

# 1.1 Identify affected pods and namespaces
# NOTE: pods in CrashLoopBackOff report phase=Running, so don't filter on phase
kubectl get pods --all-namespaces | grep CrashLoopBackOff

# More detailed view
kubectl get pods --all-namespaces -o wide | grep -E "CrashLoop|Error"

# 1.2 Check restart count and last exit code
# (-o json keeps the output valid JSON even for multi-container pods;
#  the jsonpath '[*]' form emits space-separated objects that json.load rejects)
kubectl get pod $POD_NAME -n $NAMESPACE -o json | python3 -c "
import sys, json
pod = json.load(sys.stdin)
for c in pod['status'].get('containerStatuses', []):
    print(f'Container: {c[\"name\"]}')
    print(f'  Restart count: {c[\"restartCount\"]}')
    lstate = c.get('lastState', {}).get('terminated', {})
    print(f'  Last exit code: {lstate.get(\"exitCode\", \"N/A\")}')
    print(f'  Last reason: {lstate.get(\"reason\", \"N/A\")}')
"

# 1.3 Quick event scan
kubectl describe pod $POD_NAME -n $NAMESPACE | \
  grep -A5 "Events:"
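For a flood, the per-pod 1.1 and 1.2 checks are faster as a single pass over `kubectl get pods -A -o json`. A minimal sketch (the function name and row format are illustrative, not part of the runbook):

```python
def summarize_crashloops(pods_json: dict) -> list:
    """One row per container currently in CrashLoopBackOff.

    Expects the parsed output of `kubectl get pods -A -o json`.
    """
    rows = []
    for pod in pods_json.get("items", []):
        for c in pod.get("status", {}).get("containerStatuses", []):
            waiting = c.get("state", {}).get("waiting") or {}
            if waiting.get("reason") != "CrashLoopBackOff":
                continue
            term = c.get("lastState", {}).get("terminated") or {}
            rows.append({
                "namespace": pod["metadata"]["namespace"],
                "pod": pod["metadata"]["name"],
                "container": c["name"],
                "restarts": c.get("restartCount", 0),
                "exit_code": term.get("exitCode"),  # feeds the decision tree below
                "reason": term.get("reason"),       # e.g. "OOMKilled", "Error"
            })
    return rows
```

Wire it to live data with something like `kubectl get pods -A -o json | python3 -c 'import sys, json; ...'`, importing the function however it is packaged.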
03

Triage Phase 2 — Decision tree by exit code

## DECISION TREE based on exit code from 'lastState.terminated.exitCode'

## EXIT CODE 137 → OOMKilled (out of memory)
#   Detection: kubectl describe pod | grep -i "oom\|killed\|137"
#   Confirm: kubectl top pod $POD --containers (was using near limit?)
#   Fix: Increase memory limit OR reduce memory consumption
kubectl get pod $POD_NAME -n $NAMESPACE \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# "OOMKilled" → increase memory limits

## EXIT CODE 1 → Application error / config error
#   Detection: kubectl logs $POD --previous (look for startup errors)
kubectl logs $POD_NAME -n $NAMESPACE --previous 2>&1 | tail -50

## EXIT CODE 0 → Clean exit (main process completed and exited)
#   Common cause: the container command is not long-running (e.g. a one-shot
#   script in a Deployment). Probe-triggered restarts exit 143/137, not 0.
#   Either way, review probe config for mismatches:
kubectl describe pod $POD_NAME -n $NAMESPACE | \
  grep -A10 "Liveness\|Readiness"

## EXIT CODE 128+N → Signal-based termination
#   Code 143 (128+15) = SIGTERM (graceful shutdown signal)
#   Code 137 (128+9)  = SIGKILL (OOM or force kill)

## EXIT CODE 126/127 → Command not found / permission error
#   Check entrypoint command in pod spec vs what's in the image
kubectl get pod $POD_NAME -n $NAMESPACE \
  -o jsonpath='{.spec.containers[0].command}'
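To keep triage consistent across engineers, the decision tree can be encoded as a small lookup. A sketch; the cause labels paraphrase the branches above, and the function name is illustrative:

```python
def classify_exit_code(code: int) -> str:
    """Map a container's last exit code to the likely CrashLoopBackOff cause."""
    if code == 0:
        return "clean exit: main process completed (command not long-running?)"
    if code == 1:
        return "application/config error: check `kubectl logs --previous`"
    if code == 126:
        return "command not executable: permission or entrypoint problem"
    if code == 127:
        return "command not found: entrypoint mismatch with image contents"
    if code == 137:
        return "SIGKILL (128+9): OOMKilled or force kill; confirm via lastState reason"
    if code == 143:
        return "SIGTERM (128+15): graceful shutdown signal"
    if code > 128:
        return f"killed by signal {code - 128}: check for segfaults/core dumps"
    return "unclassified: inspect logs and pod events"
```

Feeding in the exit codes collected during Phase 1 turns the tree into a one-line diagnosis per container.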
04

Simulate each crash scenario and validate triage

kubectl create namespace crashloop-test

## Scenario 1: OOMKilled
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
  namespace: crashloop-test
spec:
  containers:
  - name: oom
    image: polinux/stress
    command: ["stress", "--vm", "1", "--vm-bytes", "200M"]
    resources:
      limits:
        memory: 50Mi  # Will OOMKill at 50MB when allocating 200MB
EOF

# Wait for OOMKill then check exit code
sleep 30
kubectl describe pod oom-test -n crashloop-test | grep -E "OOM|137|Killed"
# Expected: OOMKilled / exit code 137

## Scenario 2: Config error (bad env var)
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: config-error-test
  namespace: crashloop-test
spec:
  containers:
  - name: app
    image: nginx:alpine
    command: ["/bin/sh", "-c", "[ -n \"$REQUIRED_VAR\" ] || { echo 'FATAL: REQUIRED_VAR not set' >&2; exit 1; }"]
    env: []  # REQUIRED_VAR is intentionally missing, so the container exits 1 at startup
EOF
# Wait for the first restart so --previous has logs to show
sleep 20
kubectl logs config-error-test -n crashloop-test --previous 2>&1 | head -5

## Scenario 3: Bad probes (liveness failures restart a healthy container)
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe-fail-test
  namespace: crashloop-test
spec:
  containers:
  - name: app
    image: nginx:alpine
    readinessProbe:
      httpGet:
        path: /nonexistent-path
        port: 80
      failureThreshold: 1
      initialDelaySeconds: 5
    livenessProbe:
      httpGet:
        path: /nonexistent-path
        port: 80
      failureThreshold: 2
      initialDelaySeconds: 10
EOF
# After ~20s the liveness probe fails twice and kubelet restarts the container
# (SIGTERM, exit 143); repeated restarts escalate to CrashLoopBackOff.
# A failing readiness probe alone only marks the pod NotReady; it never restarts it.

# Check events for probe failures
kubectl describe pod probe-fail-test -n crashloop-test | grep -A5 "Events"
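With the three test pods running, the validation loop can be automated. A sketch, assuming the last exit codes have already been collected into a {pod_name: exit_code} dict (e.g. with the Phase 1 query); the pod names match the scenarios above:

```python
# Expected last exit codes for each simulated scenario
EXPECTED = {
    "oom-test": 137,         # OOMKilled (SIGKILL)
    "config-error-test": 1,  # startup failure
    "probe-fail-test": 143,  # liveness kill (SIGTERM); may surface as 137 if SIGKILL escalates
}

def validate_scenarios(observed: dict) -> list:
    """Return human-readable mismatches; an empty list means the runbook validated."""
    problems = []
    for pod, want in EXPECTED.items():
        got = observed.get(pod)
        if got is None:
            problems.append(f"{pod}: no exit code observed (not restarted yet?)")
        elif got != want:
            problems.append(f"{pod}: expected exit code {want}, got {got}")
    return problems
```

An empty result is the exercise's pass condition: every simulated cause produced the signature the decision tree predicts.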
05

Write the remediation playbook

## REMEDIATION ACTIONS BY CAUSE

## OOMKilled remediation
# Short-term: increase memory limit
kubectl patch deployment $DEPLOYMENT_NAME -n $NAMESPACE \
  --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"512Mi"}]'

# Medium-term: check VPA recommendations for right-sizing
kubectl get vpa -n $NAMESPACE
# If not installed: kubectl top pods --containers -n $NAMESPACE

## Config error remediation
# Check if Secret/ConfigMap referenced by env exists
kubectl get secrets -n $NAMESPACE | grep required-secret
kubectl get configmap -n $NAMESPACE | grep app-config

# If missing secret: create it
kubectl create secret generic required-secret \
  --from-literal=REQUIRED_VAR=value \
  -n $NAMESPACE

# Then restart the pod
kubectl rollout restart deployment/$DEPLOYMENT_NAME -n $NAMESPACE

## Probe failure remediation
# Increase failure threshold while investigating
kubectl patch deployment $DEPLOYMENT_NAME -n $NAMESPACE \
  --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":5}]'

## Escalation matrix
# Single pod, non-critical: Create ticket, monitor
# Multiple pods, non-critical: Page on-call engineer
# Any pods, production user-facing: P1 alert, wake-up on-call
# Data-layer pods (DB, cache): P0, escalate to on-call + manager
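The matrix above can be codified so paging decisions do not depend on who is on shift. A sketch; the function signature and tier strings are illustrative:

```python
def escalation(pod_count: int, production_facing: bool, data_layer: bool) -> str:
    """Apply the escalation matrix; the most severe matching row wins."""
    if data_layer:                     # DB/cache pods
        return "P0: escalate to on-call + manager"
    if production_facing:              # any production user-facing pods
        return "P1: wake up on-call"
    if pod_count > 1:                  # multiple non-critical pods
        return "page on-call engineer"
    return "create ticket, monitor"    # single non-critical pod
```

Ordering matters: data-layer impact outranks everything else, matching the matrix's bottom row.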

Success Criteria

- The PodCrashLoopFlood alert fires within ~2 minutes of more than 3 pods entering CrashLoopBackOff in one namespace
- All three simulated scenarios are diagnosed correctly using only the runbook's decision tree
- Phase 1 triage of a simulated incident completes within the 5-minute target
- Each remediation command restores the affected workload to Running

Further Reading