Curriculum 29: Troubleshooting Playbook

Pod Crash Loops

Diagnosing CrashLoopBackOff

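CrashLoopBackOff means a container in the pod starts, crashes, and is restarted repeatedly. Kubernetes backs off exponentially between restarts (by default 10s, 20s, 40s, and so on, capped at five minutes), so a pod in this state spends most of its time waiting rather than running. Diagnosing the root cause requires checking logs, events, exit codes, and resource and probe configuration.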
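To make the back-off concrete, the following sketch prints an approximate restart delay schedule. It assumes the documented kubelet defaults (a 10-second starting delay that doubles per crash, capped at 300 seconds); `backoff_delay` is a hypothetical helper name, and real clusters may be configured differently.

```shell
# Approximate kubelet restart back-off: starts at 10s, doubles per crash,
# capped at 300s. backoff_delay is a hypothetical helper, not a kubectl feature.
backoff_delay() {
  restarts="$1"
  delay=10
  i=1
  while [ "$i" -lt "$restarts" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  # Apply the cap in case the last doubling overshot it
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

for n in 1 2 3 4 5 6 7; do
  echo "restart $n: wait about $(backoff_delay "$n")s"
done
```

A high restart count combined with long gaps between restarts is therefore expected behavior, not a second problem.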

Initial Diagnosis

# Check pod status and restart count
kubectl get pods -n production

# Get detailed pod information
kubectl describe pod myapp-7d9f8b6c4-x2k1m -n production

# Look at the Events section at the bottom of describe output
# Common clues: OOMKilled, failed liveness probe, image pull errors
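When the describe output is long, it can help to pull only the events for the one pod. The sketch below wraps that in a hypothetical helper, `pod_events`; the default namespace `production` is taken from the examples above, and the function name is an assumption for illustration.

```shell
# Hypothetical helper: list events for a single pod, sorted by last timestamp.
pod_events() {
  pod="$1"
  ns="${2:-production}"   # namespace; defaults to the one used in this lesson
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Usage (requires cluster access):
#   pod_events myapp-7d9f8b6c4-x2k1m production
```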

Checking Logs

# View current container logs
kubectl logs myapp-7d9f8b6c4-x2k1m -n production

# View logs from the previous crashed container
kubectl logs myapp-7d9f8b6c4-x2k1m --previous -n production

# View logs from all containers in the pod (including sidecars)
kubectl logs myapp-7d9f8b6c4-x2k1m --all-containers -n production

# Follow logs in real time
kubectl logs -f myapp-7d9f8b6c4-x2k1m -n production
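In a crash loop the current container has often only just started, so the useful output is usually in the previous instance. A minimal sketch of that decision follows; `crash_logs` is a hypothetical helper, and it only inspects the first container in the pod.

```shell
# Hypothetical helper: prefer logs from the crashed instance when one exists.
crash_logs() {
  pod="$1"
  ns="${2:-production}"
  # Restart count of the first container; empty if status is not populated yet
  restarts="$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}')"
  if [ "${restarts:-0}" -gt 0 ]; then
    # At least one restart: read the previous (crashed) container
    kubectl logs "$pod" -n "$ns" --previous --tail=50
  else
    kubectl logs "$pod" -n "$ns" --tail=50
  fi
}

# Usage (requires cluster access):
#   crash_logs myapp-7d9f8b6c4-x2k1m production
```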

Common Causes and Fixes

OOMKilled - The container exceeded its memory limit:

# Check resource requests and limits of the first container
kubectl get pod myapp-7d9f8b6c4-x2k1m -n production \
  -o jsonpath='{.spec.containers[0].resources}'

# Check if the last termination was an OOM kill (prints "OOMKilled" if so)
kubectl get pod myapp-7d9f8b6c4-x2k1m -n production \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Application error - The process exits with a non-zero code:

# Check the exit code of the last terminated container
kubectl get pod myapp-7d9f8b6c4-x2k1m -n production \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
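The exit code narrows the search considerably. The sketch below maps common values to a likely meaning, using the shell convention that codes above 128 mean "killed by signal (code minus 128)". `explain_exit_code` is a hypothetical helper, and the hints are starting points rather than a definitive diagnosis.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the image's command and args" ;;
    137) echo "SIGKILL (128+9); often OOMKilled or a forced kill" ;;
    139) echo "SIGSEGV (128+11); segmentation fault" ;;
    143) echo "SIGTERM (128+15); asked to terminate" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined error; read the previous container's logs"
      fi
      ;;
  esac
}

explain_exit_code 137
explain_exit_code 1
```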

Failed health checks - Liveness probe kills the container:

# Inspect the liveness probe configuration of the first container
kubectl get pod myapp-7d9f8b6c4-x2k1m -n production \
  -o jsonpath='{.spec.containers[0].livenessProbe}'
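A probe that is too aggressive kills an application that is merely slow to start. As a rough estimate, a container that never passes its probe is restarted after about initialDelaySeconds + periodSeconds × failureThreshold. The sketch below computes that figure; `probe_kill_after` is a hypothetical helper, its fallbacks are the Kubernetes probe defaults (0, 10, 3), and it ignores timeoutSeconds.

```shell
# Hypothetical helper: rough time (seconds) before a never-healthy container
# is restarted by its liveness probe. Ignores timeoutSeconds.
probe_kill_after() {
  initial_delay="${1:-0}"       # initialDelaySeconds (default 0)
  period="${2:-10}"             # periodSeconds (default 10)
  failure_threshold="${3:-3}"   # failureThreshold (default 3)
  echo $((initial_delay + period * failure_threshold))
}

probe_kill_after 5 10 3   # prints 35
```

If the application's startup time in the logs is longer than this figure, the probe rather than the application is the likely cause of the restarts.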

Interactive Debugging

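# Create a copy of the pod (debug-pod) in which the myapp container
# runs a shell instead of its normal command, then attach to it
kubectl debug myapp-7d9f8b6c4-x2k1m -n production -it \
  --copy-to=debug-pod --container=myapp -- /bin/sh

# Add an ephemeral debug container to the existing pod
kubectl debug myapp-7d9f8b6c4-x2k1m -n production -it --image=busybox

# Delete the copied pod when finished
kubectl delete pod debug-pod -n production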

Work through the checklist: logs, events, exit codes, resource limits, then probe configurations.
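The checklist can be sketched as a single function that runs each step in order. `triage_pod` is a hypothetical helper assembled from the commands in this lesson; it only inspects the first container, and the `--tail=20` limit is an arbitrary choice.

```shell
# Hypothetical helper: run the checklist in order for one pod.
triage_pod() {
  pod="$1"
  ns="${2:-production}"

  echo "== 1. logs (previous container) =="
  kubectl logs "$pod" -n "$ns" --previous --tail=20

  echo "== 2. events =="
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod"

  echo "== 3. last exit code and reason =="
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

  echo "== 4. resource limits =="
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.containers[0].resources}{"\n"}'

  echo "== 5. liveness probe =="
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}'
}

# Usage (requires cluster access):
#   triage_pod myapp-7d9f8b6c4-x2k1m production
```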