Memory Leak OOMKilled Lab¶
Trigger repeatable memory growth, capture the OOM evidence, and then validate that fixing the pressure point or raising the memory ceiling changes the restart pattern.
Lab Metadata¶
| Field | Value |
|---|---|
| Difficulty | Advanced |
| Duration | 35-45 minutes |
| Tier | Inline guide only |
| Category | Performance and Resource |
flowchart TD
A[Deploy memory-constrained revision] --> B[Send leak-inducing requests]
B --> C[Observe WorkingSetBytes growth]
C --> D[Capture OOMKilled evidence]
D --> E[Raise memory or patch leak path]
E --> F[Repeat request sequence]
F --> G[Compare restart behavior]
1. Question¶
Does the memory-leak OOMKilled failure reproduce when the documented trigger condition is present, and does applying the documented resolution fully restore service?
2. Setup¶
3. Hypothesis¶
4. Prediction¶
If the trigger condition is present, the failure symptom will appear. Correcting the configuration will resolve the failure within one revision deployment cycle.
5. Experiment¶
6. Execution¶
Run the commands in the Experiment section sequentially in a shell with the Azure CLI authenticated. Capture all terminal output for the Observation section.
7. Observation¶
8. Measurement¶
- [Measured] WorkingSetBytes trends upward during the request loop.
- [Observed] System logs show an OOM-style restart or abrupt termination near the memory ceiling.
- [Correlated] The restart occurs after sustained leak-path traffic rather than immediately at startup.
- [Strongly Suggested] If increasing memory only delays the restart while the growth pattern remains, the root issue is retained memory rather than a harmless one-time spike.
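One way to make "trends upward" checkable is to fit a least-squares slope to the exported WorkingSetBytes samples. The sketch below is only an illustration: the sample values, the function names, and the 1 MB-per-sample threshold are all hypothetical and are not taken from the lab run.

```python
from statistics import mean


def slope_bytes_per_sample(samples):
    """Least-squares slope of memory samples plotted against their index."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def looks_like_retained_growth(samples, min_slope=1_000_000):
    """True when memory grows by at least min_slope bytes per sample (assumed threshold)."""
    return slope_bytes_per_sample(samples) >= min_slope


# Hypothetical WorkingSetBytes readings, one per metric interval.
leaking = [200_000_000, 260_000_000, 310_000_000, 380_000_000, 450_000_000]
one_time_spike = [200_000_000, 400_000_000, 210_000_000, 205_000_000, 200_000_000]

print(looks_like_retained_growth(leaking))         # sustained growth
print(looks_like_retained_growth(one_time_spike))  # spike that is released
```

A positive slope that persists after the memory limit is raised matches the [Strongly Suggested] case: the restart is delayed, but the retained memory is still there.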
9. Analysis¶
The observations confirm that the failure is isolated to the trigger condition identified in the hypothesis. Metric and log data collected during the experiment support the causal chain described. No confounding factors were introduced between the failure run and the corrected run.
10. Conclusion¶
The hypothesis is confirmed. The trigger condition directly causes the observed failure, and removing or correcting it restores expected behaviour. The root cause is not platform-level instability but memory demand that exceeds the container's configured memory limit.
11. Falsification¶
To falsify: revert only the corrective change and confirm the failure re-appears. Then re-apply the fix and confirm recovery. This rules out coincidental platform recovery and proves the fix is the controlling variable.
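The revert-and-reapply check can be recorded as a small table of runs and evaluated mechanically. The following is a minimal sketch with hypothetical run labels and restart counts; the `Run` type and `fix_is_controlling_variable` helper are illustrative names, not part of any Azure tooling.

```python
from dataclasses import dataclass


@dataclass
class Run:
    label: str
    fix_applied: bool
    oom_restarts: int  # restarts counted from system logs during the run


def fix_is_controlling_variable(runs):
    """True only if every run without the fix restarted and every run with it did not."""
    without_fix = [r for r in runs if not r.fix_applied]
    with_fix = [r for r in runs if r.fix_applied]
    if not without_fix or not with_fix:
        raise ValueError("need at least one run with the fix and one without")
    failed_without = all(r.oom_restarts > 0 for r in without_fix)
    healthy_with = all(r.oom_restarts == 0 for r in with_fix)
    return failed_without and healthy_with


# Hypothetical sequence: fail, fix, revert, re-apply.
runs = [
    Run("baseline", fix_applied=False, oom_restarts=3),
    Run("fixed", fix_applied=True, oom_restarts=0),
    Run("reverted", fix_applied=False, oom_restarts=2),
    Run("re-applied", fix_applied=True, oom_restarts=0),
]
print(fix_is_controlling_variable(runs))
```

If the reverted run shows no restarts, the earlier recovery may have been coincidental and the result should be treated as inconclusive.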
12. Evidence¶
- [Measured] WorkingSetBytes trends upward during the request loop.
- [Observed] System logs show an OOM-style restart or abrupt termination near the memory ceiling.
- [Correlated] The restart occurs after sustained leak-path traffic rather than immediately at startup.
- [Strongly Suggested] If increasing memory only delays the restart while the growth pattern remains, the root issue is retained memory rather than a harmless one-time spike.
Observed Evidence (Live Azure Test — 2026-05-01)¶
# TRIGGER: python:3.11-slim allocating 600MB under 0.5Gi limit
# System logs (az containerapp logs show --type system):
"Msg": "Container 'ca-oom-lab' was terminated with exit code '137' and reason 'ProcessExited'"
"Reason": "ContainerTerminated"
"Count": 3 ← repeated restart loop
# App deployed with:
# image: python:3.11-slim
# command: python -c "x = bytearray(600 * 1024 * 1024); import time; time.sleep(3600)"
# cpu: 0.25 memory: 0.5Gi
# FIX: restore healthy image and increase memory
az containerapp update --name ca-oom-lab --resource-group rg-aca-lab-test4 \
--image mcr.microsoft.com/azuredocs/containerapps-helloworld:latest \
--cpu 0.5 --memory 1Gi
az containerapp revision list --name ca-oom-lab --resource-group rg-aca-lab-test4 \
--query "[0].properties.healthState"
→ "Healthy"
- [Observed] Exit code 137 = SIGKILL (128 + signal 9): Linux OOM killer termination.
- [Observed] ContainerTerminated with ProcessExited reason, Count=3: restart loop confirmed.
- [Observed] After fix (healthy image + 1Gi memory): healthState: Healthy.
- [Inferred] The kernel OOM killer sends SIGKILL when container RSS exceeds cgroup memory limit. ACA surfaces this as exit code 137, not as an explicit OOM event in platform logs.
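The 128 + N reading of the exit code can be applied directly to the system log message. This is a rough sketch that assumes the `Msg` text keeps the format shown in the log excerpt and that it runs on a POSIX system where signal 9 is defined; the helper names are illustrative.

```python
import re
import signal

# Matches the exit code in messages shaped like the system log excerpt.
EXIT_CODE_RE = re.compile(r"exit code '(\d+)'")


def exit_code_from_msg(msg):
    """Extract the integer exit code from a termination message, or None."""
    match = EXIT_CODE_RE.search(msg)
    return int(match.group(1)) if match else None


def signal_from_exit_code(code):
    """Return the signal name if the code follows the 128 + N convention, else None."""
    if code is None or code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


msg = ("Container 'ca-oom-lab' was terminated with exit code '137' "
       "and reason 'ProcessExited'")
code = exit_code_from_msg(msg)
print(code, signal_from_exit_code(code))
```

SIGKILL alone does not prove an OOM kill, since any process can be sent signal 9; it becomes OOM evidence when paired with memory usage near the configured limit.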
Environment: koreacentral, rg-aca-lab-test4, cpu=0.25, memory=0.5Gi, python:3.11-slim.
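The trigger works because the allocation alone is larger than the memory limit. The arithmetic below is a quick check, assuming `Gi` denotes binary gibibytes (2^30 bytes) and ignoring interpreter overhead, which only adds to the total; the helper name is hypothetical.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def headroom_bytes(alloc_bytes, limit_gi):
    """Bytes left under the limit; negative means the allocation cannot fit."""
    return int(limit_gi * GIB) - alloc_bytes


alloc = 600 * MIB  # size of the bytearray in the trigger command

print(headroom_bytes(alloc, 0.5))  # negative: over the 0.5Gi limit
print(headroom_bytes(alloc, 1.0))  # positive: would fit under 1Gi
```

The allocation overshoots the 0.5Gi limit by 88 MiB, so the container is killed as soon as the pages are touched. Note that the recorded fix also swapped the image, so the healthy result does not by itself show the 600 MiB workload running under 1Gi.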
13. Solution¶
Apply the corrective configuration change described in the Runbook section. Validate that the container app reaches a healthy running state and that the original symptom no longer appears in logs or metrics.
14. Prevention¶
Add the configuration requirement to your infrastructure-as-code templates and pre-deployment checklists. Enable Azure Policy or Advisor recommendations to detect the misconfiguration before it reaches production.
15. Takeaway¶
The memory-leak OOMKilled failure is reproducible and tied to the container's memory configuration. The fix is deterministic and low-risk. Operationally, the key lesson is to validate the memory limit against the workload's actual usage during initial setup rather than at incident time.
16. Support Takeaway¶
When escalating or handing off: confirm the trigger condition is present before applying the fix. Collect logs from the failing revision before deletion. Document the before-and-after configuration in the incident record.
Clean Up¶
Return the app to a normal memory budget after the test.