Observability L3 · ADVANCED ~90 min

Write a PodCrashLoopBackOff Incident Runbook

Design and write a complete runbook for a PodCrashLoopBackOff flood incident. Include a detection PromQL alert, triage steps, a decision tree over causes (OOM vs config error vs application crash vs probe failure), remediation for each cause, and an escalation path. Validate the runbook against a simulated incident.

Objective

A CrashLoopBackOff flood (many pods crashing simultaneously) is one of the most common and urgent Kubernetes incidents. Without a runbook, engineers waste time on inconsistent triage. This exercise builds a production-quality runbook and then validates it by deliberately triggering each crash cause in a test environment.

Prerequisites

- A Kubernetes test cluster with kubectl access and permission to create namespaces
- kube-prometheus-stack (or another Prometheus Operator installation), since Step 01 creates a PrometheusRule resource
- metrics-server installed, so kubectl top works during triage

Steps

01

Create the detection alert rule

# crashloop-alert.yaml
cat << 'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crashloop-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: crashloop
    rules:
    - alert: PodCrashLoopFlood
      expr: |
        count by (namespace) (
          kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        ) > 3
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "CrashLoopBackOff flood in {{ $labels.namespace }}"
        description: "{{ $value }} pods in CrashLoopBackOff in namespace {{ $labels.namespace }}"
        runbook_url: "https://wiki.company.com/runbooks/crashloop-flood"

    - alert: SinglePodCrashLoop
      expr: |
        kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} in CrashLoopBackOff"
        description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is CrashLoopBackOff"
EOF
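The rule logic can be unit-tested offline before it reaches the cluster. The following is a sketch using promtool's rule tests; it assumes the spec.groups block above has been extracted into a plain Prometheus rules file named crashloop-rules.yaml, since promtool does not read PrometheusRule custom resources directly:

```yaml
# crashloop-rules-test.yaml
# Run with: promtool test rules crashloop-rules-test.yaml
rule_files:
  - crashloop-rules.yaml   # plain rules file: just the groups: block from the CR
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Four pods stuck in CrashLoopBackOff in the same namespace → flood
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",namespace="prod",pod="a"}'
        values: '1x10'
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",namespace="prod",pod="b"}'
        values: '1x10'
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",namespace="prod",pod="c"}'
        values: '1x10'
      - series: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",namespace="prod",pod="d"}'
        values: '1x10'
    alert_rule_test:
      - eval_time: 5m          # past the 2m "for:" window, so the alert is firing
        alertname: PodCrashLoopFlood
        exp_alerts:
          - exp_labels:
              severity: critical
              namespace: prod
            exp_annotations:
              summary: "CrashLoopBackOff flood in prod"
              description: "4 pods in CrashLoopBackOff in namespace prod"
              runbook_url: "https://wiki.company.com/runbooks/crashloop-flood"
```

A passing run prints SUCCESS; a failing one shows the alerts that actually fired, which catches threshold and label mistakes before deployment.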
02

Triage Phase 1 — Initial assessment (target: under 5 minutes)

## RUNBOOK: PodCrashLoopBackOff Flood
## Severity: P1 when >3 pods affected; P2 for single pod
## Last Updated: 2024-01

## PHASE 1: Initial Triage (target: 5 min)

# 1.1 Identify affected pods and namespaces
# NOTE: pods in CrashLoopBackOff report phase=Running, so don't filter on phase
kubectl get pods --all-namespaces | grep CrashLoopBackOff

# More detailed view
kubectl get pods --all-namespaces -o wide | grep -E "CrashLoop|Error"

# 1.2 Check restart count and last exit code
# (-o json keeps the output valid JSON even for multi-container pods;
#  the jsonpath '[*]' form emits space-separated objects that json.load rejects)
kubectl get pod $POD_NAME -n $NAMESPACE -o json | python3 -c "
import sys, json
pod = json.load(sys.stdin)
for c in pod['status'].get('containerStatuses', []):
    print(f'Container: {c[\"name\"]}')
    print(f'  Restart count: {c[\"restartCount\"]}')
    lstate = c.get('lastState', {}).get('terminated', {})
    print(f'  Last exit code: {lstate.get(\"exitCode\", \"N/A\")}')
    print(f'  Last reason: {lstate.get(\"reason\", \"N/A\")}')
"

# 1.3 Quick event scan
kubectl describe pod $POD_NAME -n $NAMESPACE | \
  grep -A5 "Events:"
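For a flood, the per-pod 1.1 and 1.2 checks are faster as a single pass over `kubectl get pods -A -o json`. A minimal sketch (the function name and row format are illustrative, not part of the runbook):

```python
def summarize_crashloops(pods_json: dict) -> list:
    """One row per container currently in CrashLoopBackOff.

    Expects the parsed output of `kubectl get pods -A -o json`.
    """
    rows = []
    for pod in pods_json.get("items", []):
        for c in pod.get("status", {}).get("containerStatuses", []):
            waiting = c.get("state", {}).get("waiting") or {}
            if waiting.get("reason") != "CrashLoopBackOff":
                continue
            term = c.get("lastState", {}).get("terminated") or {}
            rows.append({
                "namespace": pod["metadata"]["namespace"],
                "pod": pod["metadata"]["name"],
                "container": c["name"],
                "restarts": c.get("restartCount", 0),
                "exit_code": term.get("exitCode"),  # feeds the decision tree below
                "reason": term.get("reason"),       # e.g. "OOMKilled", "Error"
            })
    return rows
```

Wire it to live data with something like `kubectl get pods -A -o json | python3 -c 'import sys, json; ...'`, importing the function however it is packaged.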
03

Triage Phase 2 — Decision tree by exit code

## DECISION TREE based on exit code from 'lastState.terminated.exitCode'

## EXIT CODE 137 → OOMKilled (out of memory)
#   Detection: kubectl describe pod | grep -i "oom\|killed\|137"
#   Confirm: kubectl top pod $POD --containers (was using near limit?)
#   Fix: Increase memory limit OR reduce memory consumption
kubectl get pod $POD_NAME -n $NAMESPACE \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# "OOMKilled" → increase memory limits

## EXIT CODE 1 → Application error / config error
#   Detection: kubectl logs $POD --previous (look for startup errors)
kubectl logs $POD_NAME -n $NAMESPACE --previous 2>&1 | tail -50

## EXIT CODE 0 → Clean exit (main process completed and exited)
#   Common cause: the container command is not long-running (e.g. a one-shot
#   script in a Deployment). Probe-triggered restarts exit 143/137, not 0.
#   Either way, review probe config for mismatches:
kubectl describe pod $POD_NAME -n $NAMESPACE | \
  grep -A10 "Liveness\|Readiness"

## EXIT CODE 128+N → Signal-based termination
#   Code 143 (128+15) = SIGTERM (graceful shutdown signal)
#   Code 137 (128+9)  = SIGKILL (OOM or force kill)

## EXIT CODE 126/127 → Command not found / permission error
#   Check entrypoint command in pod spec vs what's in the image
kubectl get pod $POD_NAME -n $NAMESPACE \
  -o jsonpath='{.spec.containers[0].command}'
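To keep triage consistent across engineers, the decision tree can be encoded as a small lookup. A sketch; the cause labels paraphrase the branches above, and the function name is illustrative:

```python
def classify_exit_code(code: int) -> str:
    """Map a container's last exit code to the likely CrashLoopBackOff cause."""
    if code == 0:
        return "clean exit: main process completed (command not long-running?)"
    if code == 1:
        return "application/config error: check `kubectl logs --previous`"
    if code == 126:
        return "command not executable: permission or entrypoint problem"
    if code == 127:
        return "command not found: entrypoint mismatch with image contents"
    if code == 137:
        return "SIGKILL (128+9): OOMKilled or force kill; confirm via lastState reason"
    if code == 143:
        return "SIGTERM (128+15): graceful shutdown signal"
    if code > 128:
        return f"killed by signal {code - 128}: check for segfaults/core dumps"
    return "unclassified: inspect logs and pod events"
```

Feeding in the exit codes collected during Phase 1 turns the tree into a one-line diagnosis per container.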
04

Simulate each crash scenario and validate triage

kubectl create namespace crashloop-test

## Scenario 1: OOMKilled
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
  namespace: crashloop-test
spec:
  containers:
  - name: oom
    image: polinux/stress
    command: ["stress", "--vm", "1", "--vm-bytes", "200M"]
    resources:
      limits:
        memory: 50Mi  # Will OOMKill at 50MB when allocating 200MB
EOF

# Wait for OOMKill then check exit code
sleep 30
kubectl describe pod oom-test -n crashloop-test | grep -E "OOM|137|Killed"
# Expected: OOMKilled / exit code 137

## Scenario 2: Config error (bad env var)
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: config-error-test
  namespace: crashloop-test
spec:
  containers:
  - name: app
    image: nginx:alpine
    command: ["/bin/sh", "-c", "[ -n \"$REQUIRED_VAR\" ] || { echo 'FATAL: REQUIRED_VAR not set' >&2; exit 1; }"]
    env: []  # REQUIRED_VAR is intentionally missing, so the container exits 1 at startup
EOF
# Wait for the first restart so --previous has logs to show
sleep 20
kubectl logs config-error-test -n crashloop-test --previous 2>&1 | head -5

## Scenario 3: Bad probes (liveness failures restart a healthy container)
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe-fail-test
  namespace: crashloop-test
spec:
  containers:
  - name: app
    image: nginx:alpine
    readinessProbe:
      httpGet:
        path: /nonexistent-path
        port: 80
      failureThreshold: 1
      initialDelaySeconds: 5
    livenessProbe:
      httpGet:
        path: /nonexistent-path
        port: 80
      failureThreshold: 2
      initialDelaySeconds: 10
EOF
# After ~20s the liveness probe fails twice and kubelet restarts the container
# (SIGTERM, exit 143); repeated restarts escalate to CrashLoopBackOff.
# A failing readiness probe alone only marks the pod NotReady; it never restarts it.

# Check events for probe failures
kubectl describe pod probe-fail-test -n crashloop-test | grep -A5 "Events"
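With the three test pods running, the validation loop can be automated. A sketch, assuming the last exit codes have already been collected into a {pod_name: exit_code} dict (e.g. with the Phase 1 query); the pod names match the scenarios above:

```python
# Expected last exit codes for each simulated scenario
EXPECTED = {
    "oom-test": 137,         # OOMKilled (SIGKILL)
    "config-error-test": 1,  # startup failure
    "probe-fail-test": 143,  # liveness kill (SIGTERM); may surface as 137 if SIGKILL escalates
}

def validate_scenarios(observed: dict) -> list:
    """Return human-readable mismatches; an empty list means the runbook validated."""
    problems = []
    for pod, want in EXPECTED.items():
        got = observed.get(pod)
        if got is None:
            problems.append(f"{pod}: no exit code observed (not restarted yet?)")
        elif got != want:
            problems.append(f"{pod}: expected exit code {want}, got {got}")
    return problems
```

An empty result is the exercise's pass condition: every simulated cause produced the signature the decision tree predicts.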
05

Write the remediation playbook

## REMEDIATION ACTIONS BY CAUSE

## OOMKilled remediation
# Short-term: increase memory limit
kubectl patch deployment $DEPLOYMENT_NAME -n $NAMESPACE \
  --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"512Mi"}]'

# Medium-term: check VPA recommendations for right-sizing
kubectl get vpa -n $NAMESPACE
# If not installed: kubectl top pods --containers -n $NAMESPACE

## Config error remediation
# Check if Secret/ConfigMap referenced by env exists
kubectl get secrets -n $NAMESPACE | grep required-secret
kubectl get configmap -n $NAMESPACE | grep app-config

# If missing secret: create it
kubectl create secret generic required-secret \
  --from-literal=REQUIRED_VAR=value \
  -n $NAMESPACE

# Then restart the pod
kubectl rollout restart deployment/$DEPLOYMENT_NAME -n $NAMESPACE

## Probe failure remediation
# Increase failure threshold while investigating
kubectl patch deployment $DEPLOYMENT_NAME -n $NAMESPACE \
  --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":5}]'

## Escalation matrix
# Single pod, non-critical: Create ticket, monitor
# Multiple pods, non-critical: Page on-call engineer
# Any pods, production user-facing: P1 alert, wake-up on-call
# Data-layer pods (DB, cache): P0, escalate to on-call + manager
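The matrix above can be codified so paging decisions do not depend on who is on shift. A sketch; the function signature and tier strings are illustrative:

```python
def escalation(pod_count: int, production_facing: bool, data_layer: bool) -> str:
    """Apply the escalation matrix; the most severe matching row wins."""
    if data_layer:                     # DB/cache pods
        return "P0: escalate to on-call + manager"
    if production_facing:              # any production user-facing pods
        return "P1: wake up on-call"
    if pod_count > 1:                  # multiple non-critical pods
        return "page on-call engineer"
    return "create ticket, monitor"    # single non-critical pod
```

Ordering matters: data-layer impact outranks everything else, matching the matrix's bottom row.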

Success Criteria

- The PodCrashLoopFlood alert fires within ~2 minutes of more than 3 pods entering CrashLoopBackOff in one namespace
- All three simulated scenarios are diagnosed correctly using only the runbook's decision tree
- Phase 1 triage of a simulated incident completes within the 5-minute target
- Each remediation command restores the affected workload to Running

Further Reading