Memory Leak OOMKilled
Use this playbook when replicas restart with OOMKilled, exit code 137, or repeated memory-pressure symptoms that briefly improve after a restart and then return.
Symptom
- Replicas restart repeatedly after serving traffic for a period of time.
- System logs or revision events contain OOMKilled, Killed, or abrupt termination signals.
- Working set memory climbs until it approaches the configured limit.
- Increasing traffic or long-lived sessions accelerates the restart pattern.
Possible Causes
- Application memory leak in caches, queues, or retained objects.
- Legitimate peak memory demand that exceeds the replica limit.
- Memory-heavy startup or background jobs sharing the same replica budget.
- Missing backpressure, causing unbounded in-memory buffering.
- Too few replicas, forcing each replica to hold too much state.
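The missing-backpressure cause above can be illustrated with a bounded buffer. This is a minimal sketch, not taken from any specific application: `MAX_PENDING`, `work_queue`, and `try_enqueue` are hypothetical names, and the limit of 100 is an arbitrary assumption. The idea is that a full buffer rejects new work instead of growing until the replica hits its memory limit.

```python
import queue

# Hypothetical limit: at most 100 pending items are buffered per replica.
MAX_PENDING = 100

# A bounded queue refuses new items once maxsize is reached.
work_queue: "queue.Queue[bytes]" = queue.Queue(maxsize=MAX_PENDING)


def try_enqueue(payload: bytes) -> bool:
    """Buffer a payload if there is room; otherwise tell the caller to back off."""
    try:
        work_queue.put_nowait(payload)
        return True
    except queue.Full:
        # Rejecting here (for example by returning HTTP 429 or 503 upstream)
        # keeps memory bounded instead of buffering without limit.
        return False
```

A caller would typically translate a `False` result into a retryable error so that clients slow down rather than pile more data into memory.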
Diagnosis Steps
flowchart TD
A[Replica restarts or OOMKilled] --> B[Check memory limit and WorkingSetBytes]
B --> C{Memory grows toward limit?}
C -->|No| D[Check startup failures or other crash causes]
C -->|Yes| E[Compare growth pattern with traffic and time]
E --> F{Resets after restart then grows again?}
F -->|Yes| G[Strong leak or unbounded buffer suspicion]
F -->|No| H[Peak demand may exceed limit]
G --> I[Profile code path and reduce retained memory]
H --> J[Raise memory, scale out, or reduce per-request footprint]
- Confirm the configured memory limit and the latest replica restart pattern.
- Pull memory metrics for the incident window.
- Collect system and console logs to distinguish hard OOM from application exceptions.
    let AppName = "ca-myapp";
    ContainerAppSystemLogs_CL
    | where ContainerAppName_s == AppName
    | where TimeGenerated > ago(6h)
    | where Reason_s has_any ("OOMKilled", "ContainerTerminated", "BackOff")
        or Log_s has_any ("OOM", "137", "Killed", "memory")
    | project TimeGenerated, RevisionName_s, ReplicaName_s, Reason_s, Log_s
    | order by TimeGenerated desc
- Verify whether scale configuration forces too much work into each replica.
| Command or Query | Why it is used |
|---|---|
| az containerapp show --query resources | Verifies the enforced memory ceiling per replica. |
| az containerapp replica list | Shows restart churn and replica health at the platform level. |
| az monitor metrics list --metric WorkingSetBytes ... | Confirms whether memory climbs toward the limit over time. |
| KQL for OOMKilled and console logs | Separates hard memory pressure from normal app exceptions. |
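The decision flow in this section can be sketched as a small function over exported memory samples. This is an illustrative heuristic only: the function names, the 0.9 "near limit" threshold, and the rule that a drop below half the previous sample marks a restart are all assumptions, and samples are assumed to be in the same unit as the limit.

```python
from typing import List, Sequence

OTHER_CAUSES = "check startup failures or other crash causes"
LEAK = "suspect leak or unbounded buffer"
PEAK = "peak demand may exceed limit"


def split_at_restarts(samples: Sequence[float], drop_ratio: float = 0.5) -> List[List[float]]:
    """Split a memory series into segments, one per assumed replica lifetime.

    Assumption: a sample below drop_ratio times the previous sample marks a restart.
    """
    segments: List[List[float]] = [[]]
    for value in samples:
        current = segments[-1]
        if current and value < current[-1] * drop_ratio:
            segments.append([])
        segments[-1].append(value)
    return segments


def classify_memory_pattern(samples: Sequence[float], limit: float, near_limit: float = 0.9) -> str:
    """Follow the flowchart: does memory near the limit, and does it regrow after a reset?"""
    if not samples or max(samples) < limit * near_limit:
        return OTHER_CAUSES
    # Count lifetimes that started lower and ended close to the limit.
    climbing = [
        s for s in split_at_restarts(samples)
        if len(s) >= 2 and s[0] < s[-1] and s[-1] >= limit * near_limit
    ]
    # Two or more such lifetimes means memory reset and then grew again.
    return LEAK if len(climbing) >= 2 else PEAK


print(classify_memory_pattern([100, 300, 500, 950, 120, 320, 520, 960], limit=1000))
```

A result from a heuristic like this is only a starting point; it still needs to be compared against traffic and restart events before concluding that the application leaks.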
Resolution
- Fix the leak or unbounded in-memory behavior in the application first.
- Increase memory only as a mitigation or to buy investigation time.
- Scale out earlier so each replica handles less concurrent state.
- Add or tune readiness and liveness probes so restart behavior is easier to interpret.
- Capture heap dumps, profiles, or allocation traces in a non-production reproduction environment.
Prevention
- Track memory growth over long test runs, not only short smoke tests.
- Avoid unbounded caches, queues, and per-request object retention.
- Set alerts on rising WorkingSetBytes plus restart count.
- Load-test with realistic payload sizes and concurrency.
- Keep memory budgets documented per revision and revisit them after dependency changes.
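As an illustration of avoiding unbounded caches, the sketch below shows a small least-recently-used cache with a fixed entry limit. `BoundedCache` and `max_entries` are hypothetical names chosen for this example. Note that it bounds the number of entries, not bytes, so it only caps memory when the values themselves are of limited size.

```python
from collections import OrderedDict
from typing import Generic, Hashable, Optional, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class BoundedCache(Generic[K, V]):
    """LRU cache that evicts the least recently used entry once it is full."""

    def __init__(self, max_entries: int) -> None:
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._items: "OrderedDict[K, V]" = OrderedDict()

    def get(self, key: K) -> Optional[V]:
        if key not in self._items:
            return None
        # Mark the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: K, value: V) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict from the least recently used end until within the limit.
        while len(self._items) > self._max_entries:
            self._items.popitem(last=False)

    def __len__(self) -> int:
        return len(self._items)
```

For plain function memoization in Python, `functools.lru_cache(maxsize=...)` offers the same bounded behavior without a custom class.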