CrashLoop OOM and Resource Pressure¶
1. Summary¶
Symptom¶
Replicas repeatedly restart with CrashLoopBackOff, OOMKilled, or ContainerTerminated events. Application may briefly serve traffic before crashing, or fail to start at all. Latency spikes and timeouts often precede or accompany crash cycles.
Why this scenario is confusing¶
Resource pressure manifests differently depending on the bottleneck (memory vs CPU vs startup time). An application bug can look like a resource limit issue, and vice versa. OOM kills may occur without obvious memory warnings if the spike is sudden.
Troubleshooting decision flow¶
graph TD
A[Symptom: Frequent restarts] --> B{OOMKilled signal in logs?}
B -->|Yes| H1[H1: Memory limit exceeded]
B -->|No| C{Exit code 137?}
C -->|Yes| H1
C -->|No| D{Probe failures during startup?}
D -->|Yes| H2[H2: CPU throttling delays startup]
D -->|No| E{Crash after serving traffic?}
E -->|Yes| H3[H3: Memory leak or burst allocation]
E -->|No| H4[H4: Application startup crash] 2. Common Misreadings¶
- "Application bug only" — Resource pressure can trigger crashes in correct code.
- "Just raise memory limit" — Unbounded memory growth will still fail; root cause matters.
- "CPU throttling won't crash the app" — Probe timeouts due to CPU starvation cause restarts.
- "OOMKilled always means memory leak" — Can also be legitimate peak usage exceeding limits.
- "Exit code 1 means app bug" — Could be uncaught exception from resource exhaustion.
3. Competing Hypotheses¶
| Hypothesis | Typical Evidence For | Typical Evidence Against |
|---|---|---|
| H1: Memory limit too low | OOMKilled signals, exit code 137, abrupt termination | Stable memory profile well below limit |
| H2: CPU throttling delays startup | Probe timeout with high CPU contention, slow startup logs | Startup latency unchanged under CPU increase |
| H3: Memory leak or burst allocation | Memory grows over time, crashes after serving traffic | Memory stable, crashes during startup |
| H4: Application startup crash | Exit code 1, exception stack in logs, crashes before any requests | Crashes only under load, not at startup |
4. What to Check First¶
Metrics¶
- Memory working set vs memory limit over time
- CPU usage vs CPU limit (throttling indicator)
- Restart count trend
- Request latency percentiles (P95, P99) before crashes
Logs¶
let AppName = "ca-myapp";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where TimeGenerated > ago(2h)
| where Reason_s has_any ("OOMKilled", "CrashLoopBackOff", "ContainerTerminated", "BackOff", "Killed")
or Log_s has_any ("OOM", "killed", "terminated", "exit code", "signal 9", "signal 137")
| project TimeGenerated, RevisionName_s, ReplicaName_s, Reason_s, Log_s
| order by TimeGenerated desc
Platform Signals¶
# Check resource allocation
az containerapp show --name "$APP_NAME" --resource-group "$RG" \
--query "properties.template.containers[0].resources" --output json
# Check probe configuration
az containerapp show --name "$APP_NAME" --resource-group "$RG" \
--query "properties.template.containers[0].probes" --output json
# Check replica status and restart count
az containerapp replica list --name "$APP_NAME" --resource-group "$RG" --output table
5. Evidence to Collect¶
Required Evidence¶
| Evidence | Command/Query | Purpose |
|---|---|---|
| Resource limits | az containerapp show ... --query containers[0].resources | Verify CPU/memory allocation |
| System logs | KQL for OOM/crash events | Identify crash pattern |
| Console logs | az containerapp logs show --type console | Find stack traces, memory errors |
| Restart timeline | KQL with time bins | Correlate crashes with events |
| Probe config | az containerapp show ... --query probes | Check timeout/threshold settings |
Useful Context¶
- Application memory footprint at idle and under load
- Startup time requirements
- Recent code or dependency changes
- Traffic pattern during incidents
6. Validation and Disproof by Hypothesis¶
H1: Memory limit too low¶
Signals that support:
- Explicit
OOMKilledin system logs - Exit code 137 (128 + SIGKILL signal 9)
- Container terminated abruptly without graceful shutdown logs
- Memory usage approaching limit before termination
Signals that weaken:
- Memory usage consistently well below limit
- Graceful shutdown messages in logs
- Exit code 1 with application exception
What to verify:
// Find OOM signals
let AppName = "ca-myapp";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where TimeGenerated > ago(6h)
| where Log_s has_any ("OOM", "137", "killed", "memory")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
# Check current memory limits
az containerapp show --name "$APP_NAME" --resource-group "$RG" \
--query "properties.template.containers[0].resources.memory" --output tsv
# Typical fix: increase memory
az containerapp update --name "$APP_NAME" --resource-group "$RG" \
--memory "1.0Gi" --cpu "0.5"
H2: CPU throttling delays startup¶
Signals that support:
- Probe failures during startup phase
- Startup takes longer than probe
initialDelaySeconds - CPU limit is very low (e.g., 0.25 cores)
- Logs show slow initialization steps
Signals that weaken:
- Startup completes within probe window
- CPU usage well below limit
- Crashes occur after stable running period
What to verify:
# Check probe configuration
az containerapp show --name "$APP_NAME" --resource-group "$RG" \
--query "properties.template.containers[0].probes" --output json
# Check CPU allocation
az containerapp show --name "$APP_NAME" --resource-group "$RG" \
--query "properties.template.containers[0].resources.cpu" --output tsv
// Find probe failures
let AppName = "ca-myapp";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where TimeGenerated > ago(2h)
| where Reason_s == "ProbeFailed" or Log_s has "probe"
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
H3: Memory leak or burst allocation¶
Signals that support:
- Memory grows steadily over time (saw-tooth pattern)
- Crashes occur after serving traffic for some period
- Restart temporarily fixes the issue
- Specific endpoints or operations correlate with crashes
Signals that weaken:
- Memory stable throughout operation
- Crashes during startup before any traffic
What to verify:
// Correlate crashes with traffic
let AppName = "ca-myapp";
let Crashes = ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where TimeGenerated > ago(6h)
| where Reason_s has_any ("OOMKilled", "ContainerTerminated")
| project CrashTime=TimeGenerated;
// Cross-reference with request patterns in Application Insights if available
# Check for memory-related app settings
az containerapp show --name "$APP_NAME" --resource-group "$RG" \
--query "properties.template.containers[0].env[?contains(name, 'MEMORY') || contains(name, 'HEAP')]" \
--output table
H4: Application startup crash¶
Signals that support:
- Exit code 1 with exception stack trace
- Crashes immediately at startup
- Missing configuration or environment variables
- Dependency connection failures in logs
Signals that weaken:
- Successful startup, crashes only under load
- OOMKilled or exit code 137
What to verify:
# Get console logs for stack traces
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" \
--type console --tail 200
// Find startup errors
let AppName = "ca-myapp";
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == AppName
| where TimeGenerated > ago(2h)
| where Log_s has_any ("error", "exception", "failed", "traceback", "Error:", "Exception:")
| project TimeGenerated, Log_s
| order by TimeGenerated desc
| take 50
7. Likely Root Cause Patterns¶
| Pattern | Frequency | First Signal | Typical Resolution |
|---|---|---|---|
| Memory limit too low | Very common | Exit code 137, OOMKilled | Increase memory allocation |
| Startup too slow for probes | Common | ProbeFailed during startup | Increase initialDelaySeconds or CPU |
| Memory leak in app | Occasional | Crashes after running period | Fix app code, add memory monitoring |
| Missing env vars | Occasional | Exit code 1 at startup | Add required configuration |
| Dependency unavailable | Occasional | Connection errors in console | Fix dependency access |
8. Immediate Mitigations¶
-
If OOM: Increase memory limit
-
If probe timeout: Relax probe settings
-
If startup crash: Roll back to known good revision
-
If memory leak: Restart replicas while investigating
9. Prevention¶
- Baseline resource profiles in staging before production
- Set resource requests based on observed P95 usage + headroom
- Implement memory monitoring and alerts in Application Insights
- Add startup health endpoints that fail fast on missing dependencies
- Use liveness probes with appropriate failure thresholds
- Apply performance regression checks in CI pipeline
See Also¶
- Container Start Failure
- Probe Failure and Slow Start
- HTTP Scaling Not Triggering
- Replica Crash Signals KQL
- Restart Timing Correlation KQL
- Repeated Startup Attempts KQL