Objective
A CrashLoopBackOff flood (many pods crashing simultaneously) is one of the most common and urgent Kubernetes incidents. Without a runbook, engineers waste time on inconsistent triage. This exercise builds a production-quality runbook and then validates it by deliberately triggering each crash cause in a test environment.
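The "BackOff" in the name is why floods linger: the kubelet restarts a crashing container with an exponential backoff that starts at 10 seconds, doubles after each crash, and is capped at 5 minutes (resetting after the container runs cleanly for 10 minutes). A minimal sketch of that default schedule:

```python
def crash_backoff_delays(restarts, base=10, cap=300):
    """Sketch of the kubelet's default restart backoff: starts at `base`
    seconds, doubles after each crash, capped at `cap` seconds (5 minutes)."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]

# First six delays between restarts, in seconds
print(crash_backoff_delays(6))  # [10, 20, 40, 80, 160, 300]
```

After roughly five crashes a pod sits idle for 5 minutes between attempts, which is why a flood of crashing pods can look "stuck" rather than actively restarting.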
Prerequisites
- kube-prometheus-stack deployed (for alerts)
- kubectl access to create test pods
- A namespace where you can deploy test workloads
Steps
01
Create the detection alert rule
# crashloop-alert.yaml
cat << 'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crashloop-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: crashloop
    rules:
    - alert: PodCrashLoopFlood
      expr: |
        count by (namespace) (
          kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        ) > 3
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "CrashLoopBackOff flood in {{ $labels.namespace }}"
        description: "{{ $value }} pods in CrashLoopBackOff in namespace {{ $labels.namespace }}"
        runbook_url: "https://wiki.company.com/runbooks/crashloop-flood"
    - alert: SinglePodCrashLoop
      expr: |
        kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} in CrashLoopBackOff"
        description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is in CrashLoopBackOff"
EOF
02
Triage Phase 1 — Initial assessment (target: under 5 minutes)
## RUNBOOK: PodCrashLoopBackOff Flood
## Severity: P1 when >3 pods affected; P2 for a single pod
## Last Updated: 2024-01

## PHASE 1: Initial Triage (target: 5 min)

# 1.1 Identify affected pods and namespaces
kubectl get pods --all-namespaces \
  --field-selector=status.phase!=Running | grep CrashLoop

# More detailed view
kubectl get pods --all-namespaces -o wide | grep -E "CrashLoop|Error"

# 1.2 Check restart count and last exit code
kubectl get pod $POD_NAME -n $NAMESPACE -o json | python3 -c "
import sys, json
pod = json.load(sys.stdin)
for c in pod['status']['containerStatuses']:
    print(f'Container: {c[\"name\"]}')
    print(f'  Restart count: {c[\"restartCount\"]}')
    lstate = c.get('lastState', {}).get('terminated', {})
    print(f'  Last exit code: {lstate.get(\"exitCode\", \"N/A\")}')
    print(f'  Last reason: {lstate.get(\"reason\", \"N/A\")}')
"

# 1.3 Quick event scan
kubectl describe pod $POD_NAME -n $NAMESPACE | \
  grep -A5 "Events:"
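The per-namespace view from step 1.1 can also be computed directly from `kubectl get pods --all-namespaces -o json`, mirroring the alert's count-by-namespace. A sketch, fed a stub pod list here in place of live cluster output:

```python
from collections import Counter

def crashloop_counts(pods_json):
    """Count pods with a container waiting in CrashLoopBackOff, per namespace."""
    counts = Counter()
    for pod in pods_json["items"]:
        for cs in pod.get("status", {}).get("containerStatuses", []):
            reason = cs.get("state", {}).get("waiting", {}).get("reason")
            if reason == "CrashLoopBackOff":
                counts[pod["metadata"]["namespace"]] += 1
                break  # count each pod once, even if several containers crash
    return counts

# Stub standing in for: kubectl get pods --all-namespaces -o json
stub = {"items": [
    {"metadata": {"namespace": "payments"},
     "status": {"containerStatuses": [
         {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"namespace": "payments"},
     "status": {"containerStatuses": [
         {"state": {"running": {}}}]}},
]}
print(crashloop_counts(stub))  # Counter({'payments': 1})
```

Namespace names above are illustrative; the JSON shape matches what `kubectl get pods -o json` emits.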
03
Triage Phase 2 — Decision tree by exit code
## DECISION TREE based on exit code from 'lastState.terminated.exitCode'

## EXIT CODE 137 → OOMKilled (out of memory)
# Detection: kubectl describe pod | grep -i "oom\|killed\|137"
# Confirm: kubectl top pod $POD_NAME --containers (was usage near the limit?)
# Fix: increase the memory limit OR reduce memory consumption
kubectl get pod $POD_NAME -n $NAMESPACE \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# "OOMKilled" → increase memory limits

## EXIT CODE 1 → Application error / config error
# Detection: kubectl logs $POD_NAME --previous (look for startup errors)
kubectl logs $POD_NAME -n $NAMESPACE --previous 2>&1 | tail -50

## EXIT CODE 0 → Clean exit (liveness probe killing an otherwise healthy pod)
# Detection: probe config mismatch
kubectl describe pod $POD_NAME -n $NAMESPACE | \
  grep -A10 "Liveness\|Readiness"

## EXIT CODE 128+N → Signal-based termination
# Code 143 (128+15) = SIGTERM (graceful shutdown signal)
# Code 137 (128+9)  = SIGKILL (OOM or force kill)

## EXIT CODE 126/127 → Command not executable / command not found
# Check the entrypoint command in the pod spec vs what's in the image
kubectl get pod $POD_NAME -n $NAMESPACE \
  -o jsonpath='{.spec.containers[0].command}'
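The decision tree above is mechanical enough to encode as a helper, useful in triage scripts. A sketch that follows the same branches (the wording of each verdict is illustrative):

```python
import signal

def classify_exit_code(code):
    """Map a container exit code to the likely cause per the decision tree."""
    if code == 0:
        return "clean exit - suspect a liveness probe killing a healthy pod"
    if code == 1:
        return "application/config error - check 'kubectl logs --previous'"
    if code in (126, 127):
        return "command not executable / not found - check the entrypoint"
    if code == 137:
        return "SIGKILL (128+9) - usually OOMKilled, check memory limits"
    if code > 128:
        sig = code - 128
        return f"terminated by signal {sig} ({signal.Signals(sig).name})"
    return "application-specific exit code - check logs"

print(classify_exit_code(137))  # SIGKILL (128+9) - usually OOMKilled, ...
print(classify_exit_code(143))  # terminated by signal 15 (SIGTERM)
```

Note that 137 is checked before the generic 128+N branch so the common OOMKilled case gets the specific verdict.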
04
Simulate each crash scenario and validate triage
kubectl create namespace crashloop-test

## Scenario 1: OOMKilled
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
  namespace: crashloop-test
spec:
  containers:
  - name: oom
    image: polinux/stress
    command: ["stress", "--vm", "1", "--vm-bytes", "200M"]
    resources:
      limits:
        memory: 50Mi  # Will be OOMKilled at 50Mi when allocating 200M
EOF

# Wait for the OOMKill, then check the exit code
sleep 30
kubectl describe pod oom-test -n crashloop-test | grep -E "OOM|137|Killed"
# Expected: OOMKilled / exit code 137

## Scenario 2: Config error (bad env var)
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: config-error-test
  namespace: crashloop-test
spec:
  containers:
  - name: app
    image: nginx:alpine
    command: ["/bin/sh", "-c", "echo $REQUIRED_VAR && exit 1"]
    env: []  # REQUIRED_VAR is missing
EOF

kubectl logs config-error-test -n crashloop-test --previous 2>&1 | head -5

## Scenario 3: Failing readiness and liveness probes
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe-fail-test
  namespace: crashloop-test
spec:
  containers:
  - name: app
    image: nginx:alpine
    readinessProbe:
      httpGet:
        path: /nonexistent-path
        port: 80
      failureThreshold: 1
      initialDelaySeconds: 5
    livenessProbe:
      httpGet:
        path: /nonexistent-path
        port: 80
      failureThreshold: 2
      initialDelaySeconds: 10
EOF

# After ~30s the pod transitions to CrashLoopBackOff as the liveness probe keeps failing
# Check events for the probe failures
kubectl describe pod probe-fail-test -n crashloop-test | grep -A5 "Events"
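To make the validation explicit, the expected outcome of each scenario can be pinned down and compared against what triage observes. A sketch; the observed values would come from `kubectl get pod -o json`, and the probe-fail expectation assumes nginx exits cleanly on SIGTERM:

```python
# Expected (exit code, lastState reason) per test pod, from the scenarios above.
# probe-fail-test assumes nginx shuts down cleanly when the kubelet sends SIGTERM.
EXPECTED = {
    "oom-test": (137, "OOMKilled"),
    "config-error-test": (1, "Error"),
    "probe-fail-test": (0, "Completed"),
}

def validate(pod_name, exit_code, reason):
    """True when the observed termination matches the runbook's prediction."""
    return EXPECTED.get(pod_name) == (exit_code, reason)

print(validate("oom-test", 137, "OOMKilled"))  # True
print(validate("oom-test", 1, "Error"))        # False
```

Running this after each simulation turns the exercise into a pass/fail check instead of eyeballing `kubectl describe` output.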
05
Write the remediation playbook
## REMEDIATION ACTIONS BY CAUSE

## OOMKilled remediation
# Short-term: increase the memory limit
kubectl patch deployment $DEPLOYMENT_NAME -n $NAMESPACE \
  --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"512Mi"}]'
# Medium-term: check VPA recommendations for right-sizing
kubectl get vpa -n $NAMESPACE
# If VPA is not installed: kubectl top pods --containers -n $NAMESPACE

## Config error remediation
# Check whether the Secret/ConfigMap referenced by env exists
kubectl get secrets -n $NAMESPACE | grep required-secret
kubectl get configmap -n $NAMESPACE | grep app-config
# If the secret is missing: create it
kubectl create secret generic required-secret \
  --from-literal=REQUIRED_VAR=value \
  -n $NAMESPACE
# Then restart the workload
kubectl rollout restart deployment/$DEPLOYMENT_NAME -n $NAMESPACE

## Probe failure remediation
# Raise the failure threshold while investigating
kubectl patch deployment $DEPLOYMENT_NAME -n $NAMESPACE \
  --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":5}]'

## Escalation matrix
# Single pod, non-critical:          create a ticket, monitor
# Multiple pods, non-critical:       page the on-call engineer
# Any pods, production user-facing:  P1 alert, wake the on-call engineer
# Data-layer pods (DB, cache):       P0, escalate to on-call + manager
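The escalation matrix can be expressed as a function for use in alert routing. A sketch of the matrix above; the P2/P3 labels for the non-critical rows are assumptions, since the matrix only names P0 and P1 explicitly:

```python
def escalation(pod_count, production_facing, data_layer):
    """Return (severity, action) per the runbook's escalation matrix.
    P2/P3 labels for the non-critical rows are assumed, not from the matrix."""
    if data_layer:
        return ("P0", "escalate to on-call + manager")
    if production_facing:
        return ("P1", "alert and wake the on-call engineer")
    if pod_count > 1:
        return ("P2", "page the on-call engineer")
    return ("P3", "create a ticket and monitor")

print(escalation(1, False, False))  # ('P3', 'create a ticket and monitor')
print(escalation(5, True, False))   # ('P1', 'alert and wake the on-call engineer')
```

Order matters: the data-layer check comes first so a crashing database is never downgraded by the pod count.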
Success Criteria
- The PodCrashLoopFlood alert fires within ~2 minutes when more than 3 pods in one namespace enter CrashLoopBackOff
- Each simulated scenario produces the exit code and reason predicted by the decision tree
- Each remediation action returns the affected pods to Running
- The finished runbook is published at the runbook_url referenced by the alert
Further Reading
- Kubernetes exit codes — komodor.com/learn/kubernetes-exit-codes
- Debugging pods — kubernetes.io/docs/tasks/debug/debug-application/debug-pods
- OOMKilled troubleshooting — kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource
- Probe configuration — kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes