Pod CrashLoopBackOff
1. Summary
Use this playbook when a pod repeatedly restarts and Kubernetes reports CrashLoopBackOff. In AKS, the visible symptom often hides a smaller set of root causes: image pull and startup drift, container exit failures, OOM kills, broken liveness probes, or missing configuration and secret dependencies.
Typical incident window: 10-20 minutes to establish whether the issue is workload-specific, node-specific, or cluster-wide. Time to resolution: 30 minutes to several hours depending on whether the fix is manifest-level, node-level, or Azure control-plane level.
Symptoms
- `kubectl get pods --all-namespaces` shows `CrashLoopBackOff` or rapidly increasing restart counts.
- `kubectl describe pod` shows `Back-off restarting failed container`.
- Container Insights reports a sharp increase in restarts or termination reasons.
- Only one namespace or one rollout revision may be affected even when the cluster itself is healthy.
Diagnostic flowchart
flowchart TD
A[Reported symptom] --> B{Can the object be reproduced now?}
B -->|No| C[Use recent events, Container Insights, and rollout history]
B -->|Yes| D[Capture current state with kubectl and Azure CLI]
D --> E{Is the issue isolated to one workload or node pool?}
E -->|Workload| F[Check image, probes, config, and service wiring]
E -->|Node pool| G[Check node health, autoscaler, subnet, and VMSS state]
F --> H[Validate with KQL and controller logs]
G --> H
H --> I[Apply targeted fix and verify telemetry returns to baseline]
2. Common Misreadings
| Observation | Often Misread As | Actually Means |
|---|---|---|
| Symptom appears in one namespace | Entire cluster outage | The issue may still be isolated to one rollout, one pool, or one ingress class. |
| Azure portal shows cluster healthy | Workload path is healthy | Control plane health does not prove pod, node, or ingress behavior. |
| Restart or reschedule seems to help briefly | Root cause is fixed | Many AKS issues recur until the underlying manifest, node, or network condition is corrected. |
| Monitoring has partial data | Monitoring is the problem | Partial Container Insights data is itself useful evidence about scope and timing. |
3. Competing Hypotheses
| Hypothesis | Likelihood | Key Discriminator |
|---|---|---|
| Image or startup dependency drift | High | Events show image pull delay, missing secrets, or entrypoint errors before the loop starts. |
| Container exits because of application error | High | Previous logs show stack traces, non-zero exit code, or dependency initialization failure. |
| OOM kill caused by requests and limits mismatch | High | Termination reason is OOMKilled and memory metrics climb before restart. |
| Liveness probe is too aggressive | Medium | Logs show the app becomes healthy eventually, but the kubelet restarts it first. |
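Several of the discriminators above depend on the last termination reason and exit code, which Kubernetes records per container in the pod status. A minimal sketch for reading them follows; the function name `last_termination` and its argument order are illustrative assumptions, not part of any existing tooling.

```shell
# Print name, last termination reason, and exit code for each container
# in a pod, one tab-separated line per container.
# Usage: last_termination <namespace> <pod>
# (last_termination is a hypothetical helper name.)
last_termination() {
  kubectl get pod "$2" \
    --namespace="$1" \
    --output=jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

An `OOMKilled` reason points at the memory hypothesis. An empty reason means the container has no recorded previous termination yet.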
4. What to Check First
- Confirm the current object state from Kubernetes.
- Describe the affected object to capture recent events.
- Check AKS cluster and node pool configuration from Azure.
- List node pools and autoscaler settings.
- Run a fast Container Insights control query.
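The first four checks can be sketched as one helper. This is an illustration under assumptions: `first_checks` is a hypothetical name, `RG` and `CLUSTER_NAME` are expected to be set as in the other Azure CLI examples in this playbook, and the `--query` projections are only one reasonable choice of fields.

```shell
# First-response snapshot for one pod.
# Usage: first_checks <namespace> <pod>
# Assumes RG and CLUSTER_NAME are exported (placeholders for your environment).
first_checks() {
  # 1. Current object state, including restart counts and node placement.
  kubectl get pods --namespace="$1" --output=wide

  # 2. Recent events, last container state, and probe settings for the pod.
  kubectl describe pod "$2" --namespace="$1"

  # 3. Cluster provisioning and power state as Azure reports it.
  az aks show \
    --resource-group "$RG" \
    --name "$CLUSTER_NAME" \
    --query "{provisioningState:provisioningState,powerState:powerState.code,kubernetesVersion:kubernetesVersion}" \
    --output table

  # 4. Node pools with their autoscaler settings.
  az aks nodepool list \
    --resource-group "$RG" \
    --cluster-name "$CLUSTER_NAME" \
    --query "[].{name:name,mode:mode,count:count,autoscale:enableAutoScaling,min:minCount,max:maxCount,state:provisioningState}" \
    --output table
}
```

The Container Insights control query is run separately in Log Analytics; the queries in section 5.1 are suitable starting points.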
5. Evidence to Collect
5.1 KQL Queries
KubePodInventory
| where TimeGenerated > ago(30m)
| summarize Restarts=max(ContainerRestartCount), LastSeen=max(TimeGenerated) by ClusterName, Namespace, PodName=Name, ContainerName
| order by Restarts desc
| Column | Example value | Interpretation |
|---|---|---|
| `Restarts` | 14 | Confirms the issue is current and identifies which container is unstable. |
| `LastSeen` | 2026-04-07 09:41:00 | Shows how fresh the inventory signal is. |
| `Namespace` | payments | Helps isolate whether blast radius is limited. |
How to Read This
Start by proving scope. If restart or state anomalies are limited to one namespace or one pool, avoid cluster-wide changes first.
ContainerLogV2
| where TimeGenerated > ago(30m)
| summarize LogLines=count(), LastSeen=max(TimeGenerated) by Namespace=PodNamespace, PodName
| order by LastSeen desc
| Column | Example value | Interpretation |
|---|---|---|
| `LogLines` | 152 | Confirms whether the pod is emitting logs during failure. |
| `LastSeen` | recent timestamp | Stale logs can indicate the container never reaches full runtime. |
How to Read This
Pair this query with `kubectl logs --previous` so you do not confuse current healthy logs with the failing previous container instance.
KubeEvents
| where TimeGenerated > ago(30m)
| where Reason in ("Failed", "BackOff", "Unhealthy", "NodeNotReady", "FailedScheduling")
| project TimeGenerated, Namespace, Name, Reason, Message
| order by TimeGenerated desc
| Column | Example value | Interpretation |
|---|---|---|
| `Reason` | BackOff | Indicates repeated restart attempts or scheduling failures depending on the object. |
| `Message` | Back-off restarting failed container | Often provides the shortest path to the likely hypothesis. |
How to Read This
Events often age out faster than logs. Capture them early in the incident before recreating pods or nodes.
5.2 CLI Investigation
Interpretation: previous logs are usually more valuable than current logs during restart loops because they contain the container exit path.
Interpretation: look for probe failures, image pull errors, FailedScheduling, NodeNotReady, or backend controller warnings near the incident start time.
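The two interpretations above refer to previous-container logs and pod events. One possible way to capture both to disk before anything is deleted is sketched below; `capture_evidence`, its arguments, and the output file names are assumptions made for illustration.

```shell
# Save previous-container logs and pod events into a directory.
# Usage: capture_evidence <namespace> <pod> <container> <output-dir>
# (capture_evidence and the file names are hypothetical.)
capture_evidence() {
  mkdir -p "$4"

  # Logs from the previous (crashed) instance, which hold the exit path.
  kubectl logs "$2" \
    --namespace="$1" \
    --container="$3" \
    --previous > "$4/previous.log"

  # Events that reference this pod, oldest first.
  kubectl get events \
    --namespace="$1" \
    --field-selector="involvedObject.name=$2" \
    --sort-by=.lastTimestamp > "$4/events.txt"
}
```

For example, `capture_evidence payments api-0 api ./incident` would write `previous.log` and `events.txt` under `./incident`; the namespace, pod, and container names here are placeholders.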
az vmss list-instances \
--resource-group "$NODE_RESOURCE_GROUP" \
--name "$VMSS_NAME" \
--query "[].{instanceId:instanceId,provisioningState:provisioningState,latestModelApplied:latestModelApplied}" \
--output table
Interpretation: when the problem is node- or ingress-related, VMSS state and model drift provide important Azure-side evidence.
6. Validation and Disproof by Hypothesis
Image or startup dependency drift
Proves if: Events show image pull delays or errors, missing secrets or configuration, or entrypoint errors before the restart loop starts.
Disproves if: The image pulls cleanly, every referenced secret and config object exists, and the container runs application code before it exits.
Container exits because of application error
Proves if: Previous logs show stack traces, a non-zero exit code, or a dependency initialization failure.
Disproves if: The termination reason is `OOMKilled`, or events show the kubelet killed the container after probe failures rather than the process exiting on its own.
OOM kill caused by requests and limits mismatch
Proves if: The termination reason is `OOMKilled` and memory metrics climb toward the limit before each restart.
Disproves if: The termination reason is anything other than `OOMKilled` and memory usage stays well below the limit up to the restart.
Liveness probe is too aggressive
Proves if: Events show liveness probe failures, and logs show the app would become healthy eventually but the kubelet restarts it first.
Disproves if: The container exits on its own before any probe failure is recorded, or restarts continue with the same timing after the probe is relaxed.
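As a first-pass aid, the termination reason and exit code can be mapped onto the four hypotheses. The sketch below is a heuristic only: the function name and output labels are invented for illustration, and the mapping assumes the usual conventions that exit code 137 is 128 + 9 (SIGKILL) and 143 is 128 + 15 (SIGTERM). Confirm any result against events and previous logs.

```shell
# Map a last termination reason and exit code to the hypothesis to test first.
# Usage: classify_termination <reason> <exit-code>
# (classify_termination and its labels are hypothetical.)
classify_termination() {
  case "$1:$2" in
    # The kernel OOM killer ended the container.
    OOMKilled:*) echo "oom-kill" ;;
    # The runtime could not start the process at all.
    StartError:*|ContainerCannotRun:*) echo "startup-drift" ;;
    # Killed by SIGKILL or SIGTERM without an OOM reason: look for
    # liveness probe failures in the pod events.
    *:137|*:143) echo "check-liveness-probe" ;;
    # Clean exit inside a restart loop: the entrypoint finished, or the
    # app shut down gracefully after a probe-triggered SIGTERM.
    *:0) echo "clean-exit-check-entrypoint" ;;
    # Any other non-zero exit is treated as an application failure.
    *) echo "application-error" ;;
  esac
}
```

Order matters here: `OOMKilled` is matched before the generic 137 case because OOM kills also surface as exit code 137.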
7. Likely Root Cause Patterns
| Pattern | Evidence | Resolution |
|---|---|---|
| Manifest drift after a rollout | New revision correlates with events, logs, or controller errors | Revert or patch the manifest and validate against staging first |
| Pool-level capacity mismatch | Pending pods, high utilization, or NotReady nodes align to one pool | Tune requests, autoscaler limits, or node pool shape |
| Network or DNS drift | Ingress, image pull, or dependency lookups fail while pods otherwise look normal | Correct DNS, NSG, route, or ingress controller configuration |
| Operational blind spot | Teams deleted or recreated resources before collecting evidence | Add a first-response checklist and automation for evidence capture |
8. Immediate Mitigations and Step-by-Step Resolution
- Capture previous logs and termination reason before deleting the pod.
- Fix image reference, environment variables, or secrets if the container never starts correctly.
- If termination is `OOMKilled`, increase memory only after validating real usage and startup footprint.
- Replace over-aggressive liveness checks with `startupProbe` plus a realistic readiness design.
- Roll out the corrected workload gradually and confirm restarts return to baseline in Container Insights.
Example resolution commands:
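# Raise the autoscaler ceiling for one node pool. Autoscaler bounds are changed
# with --update-cluster-autoscaler and both counts; MIN_COUNT is a placeholder
# for the pool's existing minimum.
az aks nodepool update \
--resource-group "$RG" \
--cluster-name "$CLUSTER_NAME" \
--name "$NODEPOOL_NAME" \
--update-cluster-autoscaler \
--min-count "$MIN_COUNT" \
--max-count 10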
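For the workload-level mitigations in this section, a hedged sketch using `kubectl` might look like the following. The function names, the probe timings (`periodSeconds` 10, `failureThreshold` 30, which allows roughly 300 seconds of startup), and the HTTP probe shape are assumptions; substitute values that match the application's measured behavior.

```shell
# Raise memory for one container, then wait for the rollout to settle.
# Usage: raise_memory_limit <namespace> <deployment> <container> <request> <limit>
# (hypothetical helper name)
raise_memory_limit() {
  kubectl set resources "deployment/$2" \
    --namespace="$1" \
    --containers="$3" \
    --requests="memory=$4" \
    --limits="memory=$5" &&
    kubectl rollout status "deployment/$2" --namespace="$1" --timeout=5m
}

# Add an HTTP startupProbe so slow startup is not judged by the liveness probe.
# Usage: add_startup_probe <namespace> <deployment> <container> <path> <port>
# (hypothetical helper name; timings are illustrative)
add_startup_probe() {
  patch="$(printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","startupProbe":{"httpGet":{"path":"%s","port":%s},"periodSeconds":10,"failureThreshold":30}}]}}}}' "$3" "$4" "$5")"
  kubectl patch "deployment/$2" \
    --namespace="$1" \
    --type=strategic \
    --patch="$patch" &&
    kubectl rollout status "deployment/$2" --namespace="$1" --timeout=5m
}
```

If the new revision does not stabilize, `kubectl rollout undo "deployment/<name>" --namespace="<namespace>"` returns the workload to the previous revision.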
9. Prevention Checklist
- [ ] Create saved Container Insights queries for the symptom family and link them in the team runbook.
- [ ] Require long-flag CLI examples and standardized evidence capture in incident response docs.
- [ ] Review ingress, autoscaler, probes, and node pool settings during every production readiness review.
- [ ] Alert on restart spikes, `NotReady` nodes, and `FailedScheduling` events before customers report impact.
- [ ] Document which changes require platform-team approval, especially around networking, ingress, and security policy.