CrashLoop
1. Summary
A pod starts and repeatedly exits or fails health checks. The immediate symptom is restart churn, but the real cause may be application configuration, dependency startup, or probe design.
flowchart TD
A[Symptom] --> B[Hypotheses]
B --> C[Evidence]
C --> D[Disprove weak paths]
D --> E[Mitigation]

2. Common Misreadings
- The first visible symptom is the root cause.
- Restarting the pod proves the issue is fixed.
- If only one namespace is affected, the rest of the cluster is healthy.
3. Competing Hypotheses
- H1: The container process exits due to application error.
- H2: Liveness/readiness/startup probes are misconfigured.
- H3: Required configuration or secret data is missing.
- H4: The process is OOMKilled or resource starved.
4. What to Check First
kubectl get pods -A
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
5. Evidence to Collect
- Previous container logs.
- Exit code and termination reason.
- Probe configuration and timing.
- Secret, ConfigMap, and dependency readiness state.
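
The first two evidence items can be pulled out of a pod's status programmatically. The following is a minimal sketch, not a definitive tool: it assumes the pod object has already been obtained as JSON (for example from `kubectl get pod <pod-name> -n <namespace> -o json`), and the function name `termination_evidence` and the sample pod are hypothetical illustrations.

```python
import json

# Hypothetical sample, shaped like the "status" section of a pod object.
SAMPLE_POD = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "app",
        "restartCount": 7,
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}}
      }
    ]
  }
}
""")


def termination_evidence(pod):
    """Return restart count, exit code, and termination reason per container."""
    evidence = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated describes the previous container instance,
        # the same one that `kubectl logs --previous` reads from.
        terminated = status.get("lastState", {}).get("terminated") or {}
        waiting = status.get("state", {}).get("waiting") or {}
        evidence.append({
            "container": status.get("name"),
            "restarts": status.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),
            "exit_code": terminated.get("exitCode"),
            "termination_reason": terminated.get("reason"),
        })
    return evidence


for item in termination_evidence(SAMPLE_POD):
    print(item)
```

Containers that have never terminated report `None` for the exit code and termination reason, which is itself a hint that restarts are not the whole story.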
6. Validation and Disproof by Hypothesis
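- If the container exits with a non-zero exit code and the previous logs show a stack trace, H1 is the strongest hypothesis.
- If the logs are clean but probes are failing, rule out an application crash (H1) first, then inspect the probe configuration (H2).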
- If the termination reason is OOMKilled, prioritize requests/limits and memory use.
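
The rules above can be written as a small decision function. This is a sketch under stated assumptions: the function name `first_hypothesis`, its boolean inputs, and the choice to check OOMKilled before anything else are illustrative and not part of any Kubernetes API. The inputs are meant to be filled in by hand from the evidence already collected.

```python
def first_hypothesis(termination_reason, exit_code, logs_show_error, probe_failures):
    """Suggest which hypothesis (H1, H2, H4) to investigate first.

    termination_reason: reason of the last terminated state, or None
    exit_code:          exit code of the last terminated state, or None
    logs_show_error:    True if previous logs contain a stack trace or fatal error
    probe_failures:     True if probe failures were observed for the pod
    """
    # Assumption: an OOMKilled reason is treated as the most specific signal.
    if termination_reason == "OOMKilled":
        return "H4"
    # Non-zero exit plus an error in the logs points at the application itself.
    if exit_code not in (None, 0) and logs_show_error:
        return "H1"
    # Clean logs with failing probes points at probe design or timing.
    if probe_failures and not logs_show_error:
        return "H2"
    # Nothing conclusive yet: keep collecting evidence, including H3 checks.
    return "undetermined"


print(first_hypothesis("OOMKilled", 137, False, False))
print(first_hypothesis("Error", 1, True, False))
print(first_hypothesis("Error", 143, False, True))
```

The result only orders the investigation. It does not confirm a root cause, and an "undetermined" result is a prompt to check configuration and secret state rather than a clean bill of health.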
7. Likely Root Cause Patterns
- Invalid app config or missing secret.
- Startup work taking longer than probe thresholds.
- Memory limits too low for workload behavior.
- Dependency endpoints unavailable at startup.
8. Immediate Mitigations
- Scale down a noisy rollout if needed.
- Fix configuration or probe settings.
- Increase limits only if evidence supports it.
- Use startupProbe for slow initialization instead of weakening liveness blindly.
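
To size a startupProbe from evidence rather than by guesswork, it helps to work out how much time a given setting allows. A startup probe gives the container roughly `failureThreshold * periodSeconds` seconds (plus any `initialDelaySeconds`) before it is restarted. The sketch below does that arithmetic; the helper names and the 1.5 safety factor are assumptions for illustration, and the observed startup time has to come from your own measurements.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a container has to start before the startup probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds


def suggest_failure_threshold(observed_startup_seconds, period_seconds, safety_factor=1.5):
    """Smallest failureThreshold that covers the observed startup time with headroom.

    safety_factor is a hypothetical margin, not a Kubernetes setting.
    """
    return math.ceil(observed_startup_seconds * safety_factor / period_seconds)


# failureThreshold=30 with periodSeconds=10 allows about 300 seconds of startup.
print(startup_budget_seconds(30, 10))

# An app observed to take 90 s to start, probed every 10 s, with 1.5x headroom.
print(suggest_failure_threshold(90, 10))
```

Keeping the slow-start allowance in the startup probe means the liveness probe can stay strict once the application is running.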
9. Prevention
- Add startup validation in CI/CD.
- Keep probe design workload-specific.
- Capture restart alerts with namespace and owner labels.