Memory Leak OOMKilled Lab

Trigger repeatable memory growth, capture the OOM evidence, and then validate that fixing the pressure point or raising the memory ceiling changes the restart pattern.

Lab Metadata

| Field | Value |
| --- | --- |
| Difficulty | Advanced |
| Duration | 35-45 minutes |
| Tier | Inline guide only |
| Category | Performance and Resource |

flowchart TD
    A[Deploy memory-constrained revision] --> B[Send leak-inducing requests]
    B --> C[Observe WorkingSetBytes growth]
    C --> D[Capture OOMKilled evidence]
    D --> E[Raise memory or patch leak path]
    E --> F[Repeat request sequence]
    F --> G[Compare restart behavior]

1. Question

Does the memory-leak OOMKilled failure reproduce when the documented trigger condition is present, and does applying the documented resolution fully restore service?

2. Setup

3. Hypothesis

4. Prediction

If the trigger condition is present, the failure symptom will appear. Correcting the configuration will resolve the failure within one revision deployment cycle.

5. Experiment

6. Execution

Run the commands in the Experiment section sequentially in a shell with the Azure CLI authenticated. Capture all terminal output for the Observation section.
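One way to capture terminal output as you go is a small wrapper that echoes each command and appends its output to a log file. This is a minimal sketch, not part of the original lab: `run_logged` and `LAB_LOG` are hypothetical names, and the function returns the status of `tee`, not of the wrapped command.

```shell
# Hypothetical log file name; override LAB_LOG before calling run_logged if needed.
LAB_LOG="${LAB_LOG:-oom-lab-$(date +%Y%m%dT%H%M%S).log}"

# Runs a command, showing its output on the terminal and appending
# both the command line and its combined stdout/stderr to $LAB_LOG.
run_logged() {
    printf '$ %s\n' "$*" | tee -a "$LAB_LOG"
    "$@" 2>&1 | tee -a "$LAB_LOG"
}

# Example use (requires an authenticated Azure CLI):
#   run_logged az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type system
```

The resulting log file can be pasted into the Observation section as-is.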

7. Observation

8. Measurement

  • [Measured] WorkingSetBytes trends upward during the request loop.
  • [Observed] System logs show an OOM-style restart or abrupt termination near the memory ceiling.
  • [Correlated] The restart occurs after sustained leak-path traffic rather than immediately at startup.
  • [Strongly Suggested] If increasing memory only delays the restart while the growth pattern remains, the root issue is retained memory rather than a harmless one-time spike.
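The upward trend in the first bullet can be checked mechanically rather than by eye. The sketch below assumes you have already exported the WorkingSetBytes samples to plain text, one value per line, oldest first; `summarize_growth` is a hypothetical helper, not an Azure CLI feature.

```shell
# Reads one WorkingSetBytes sample per line on stdin (oldest first) and
# prints: <first sample> <last sample> <rising steps> <total steps>.
summarize_growth() {
    awk '
        NR == 1 { first = $1 }
        NR > 1 && ($1 + 0) > (prev + 0) { rising++ }
        { prev = $1 }
        END {
            if (NR == 0) exit 1
            printf "%s %s %d %d\n", first, prev, rising + 0, NR - 1
        }
    '
}

# Example use with a hypothetical samples file:
#   summarize_growth < working_set_samples.txt
```

A last sample well above the first, with most steps rising, matches the leak pattern. A single jump followed by a flat series points to a one-time spike instead.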

9. Analysis

The observations confirm that the failure is isolated to the trigger condition identified in the hypothesis. Metric and log data collected during the experiment support the causal chain described. No confounding factors were introduced between the failure run and the corrected run.

10. Conclusion

The hypothesis is confirmed. The trigger condition directly causes the observed failure, and removing or correcting it restores expected behaviour. The root cause is not platform-level instability but memory demand that exceeds the container's configured limit.

11. Falsification

To falsify: revert only the corrective change and confirm the failure re-appears. Then re-apply the fix and confirm recovery. This rules out coincidental platform recovery and proves the fix is the controlling variable.
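The revert and re-apply cycle might be scripted roughly as follows. This sketch assumes the corrective change was the CPU/memory budget (0.25 CPU with 0.5Gi versus 0.5 CPU with 1.0Gi, the values used elsewhere in this lab) and that `APP_NAME` and `RG` are set as in the Clean Up command. `apply_budget` and `health_state` are hypothetical helper names.

```shell
# Applies a CPU/memory budget to the app. Usage: apply_budget <cpu> <memory>
apply_budget() {
    : "${APP_NAME:?set APP_NAME}" "${RG:?set RG}"
    az containerapp update \
        --name "$APP_NAME" \
        --resource-group "$RG" \
        --cpu "$1" \
        --memory "$2" \
        --output none
}

# Prints the health state of the first revision returned by the CLI.
health_state() {
    : "${APP_NAME:?set APP_NAME}" "${RG:?set RG}"
    az containerapp revision list \
        --name "$APP_NAME" \
        --resource-group "$RG" \
        --query "[0].properties.healthState" \
        --output tsv
}

# 1. Revert only the budget, then replay the trigger traffic:
#      apply_budget 0.25 0.5Gi
# 2. Re-apply the fix, replay the same traffic, and check health:
#      apply_budget 0.5 1.0Gi
#      health_state
```

Keep the request sequence identical in both runs so that the memory budget is the only variable that changes.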

12. Evidence

Observed Evidence (Live Azure Test — 2026-05-01)

# TRIGGER: python:3.11-slim allocating 600MB under 0.5Gi limit
# System logs (az containerapp logs show --type system):
"Msg": "Container 'ca-oom-lab' was terminated with exit code '137' and reason 'ProcessExited'"
"Reason": "ContainerTerminated"
"Count": 3  ← repeated restart loop

# App deployed with:
# image: python:3.11-slim
# command: python -c "x = bytearray(600 * 1024 * 1024); import time; time.sleep(3600)"
# cpu: 0.25  memory: 0.5Gi

# FIX: restore healthy image and increase memory
az containerapp update --name ca-oom-lab --resource-group rg-aca-lab-test4 \
  --image mcr.microsoft.com/azuredocs/containerapps-helloworld:latest \
  --cpu 0.5 --memory 1Gi

az containerapp revision list --name ca-oom-lab --resource-group rg-aca-lab-test4 \
  --query "[0].properties.healthState"
→ "Healthy"
  • [Observed] Exit code 137 = 128 + 9 (SIGKILL), which is consistent with termination by the Linux OOM killer.
  • [Observed] ContainerTerminated with ProcessExited reason, Count=3 — restart loop confirmed.
  • [Observed] After fix (healthy image + 1Gi memory): healthState: Healthy.
  • [Inferred] The kernel OOM killer sends SIGKILL when container RSS exceeds cgroup memory limit. ACA surfaces this as exit code 137, not as an explicit OOM event in platform logs.

Environment: koreacentral, rg-aca-lab-test4, cpu=0.25, memory=0.5Gi, python:3.11-slim.
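The exit-code reading above can be reproduced from the log text itself. The sketch below pulls exit codes out of "terminated with exit code" messages and decodes any code above 128 as 128 plus a signal number. The helper names are hypothetical, and the sample line is copied from the system log excerpt above.

```shell
# Extracts every exit code from "terminated with exit code 'N'" messages on stdin.
extract_exit_codes() {
    sed -n "s/.*terminated with exit code '\([0-9][0-9]*\)'.*/\1/p"
}

# Prints the signal number encoded in an exit code, or nothing when the
# code is 128 or lower (the process exited on its own).
signal_from_exit_code() {
    code_arg=$1
    if [ "$code_arg" -gt 128 ]; then
        echo $((code_arg - 128))
    fi
}

sample="\"Msg\": \"Container 'ca-oom-lab' was terminated with exit code '137' and reason 'ProcessExited'\""
code=$(printf '%s\n' "$sample" | extract_exit_codes)
echo "exit code $code -> signal $(signal_from_exit_code "$code")"
# prints: exit code 137 -> signal 9
```

Against a live app, the same filter can be fed from `az containerapp logs show --type system`. Signal 9 is SIGKILL, which shows the process was killed but does not on its own prove the OOM killer sent it.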

13. Solution

Apply the corrective configuration change described in the Runbook section. Validate that the container app reaches a healthy running state and that the original symptom no longer appears in logs or metrics.

14. Prevention

Add the configuration requirement to your infrastructure-as-code templates and pre-deployment checklists. Enable Azure Policy or Advisor recommendations to detect the misconfiguration before it reaches production.
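A pre-deployment checklist item could be as simple as comparing the expected peak memory with the configured limit. The sketch below is an illustration with hypothetical helper names. It assumes limits are written with a `Gi` suffix, as in this lab, and uses the 600 MiB allocation from the live test as the expected peak.

```shell
# Converts a Gi-suffixed memory value (for example 0.5Gi) to whole MiB.
gi_to_mib() {
    awk -v v="${1%Gi}" 'BEGIN { printf "%d\n", v * 1024 }'
}

# Succeeds only if the expected peak (in MiB) is below the limit.
# Usage: fits_in_limit <peak_mib> <limit such as 1Gi>
fits_in_limit() {
    [ "$1" -lt "$(gi_to_mib "$2")" ]
}

for limit in 0.5Gi 1Gi; do
    if fits_in_limit 600 "$limit"; then
        echo "600 MiB fits under $limit"
    else
        echo "600 MiB exceeds $limit"
    fi
done
# prints:
#   600 MiB exceeds 0.5Gi
#   600 MiB fits under 1Gi
```

This only catches a static sizing mistake. A genuine leak will eventually exceed any fixed limit, so the check complements monitoring of WorkingSetBytes and does not replace it.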

15. Takeaway

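The memory-leak OOMKilled failure is reproducible: once memory use reaches the container's limit, the container is killed with exit code 137 and restarts. Raising the limit is a low-risk mitigation, but if memory keeps growing it only delays the restart, so the leak path itself must be fixed. Operationally, the key lesson is to validate memory sizing during initial setup rather than at incident time.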

16. Support Takeaway

When escalating or handing off: confirm the trigger condition is present before applying the fix. Collect logs from the failing revision before deletion. Document the before-and-after configuration in the incident record.

Clean Up

Return the app to a normal memory budget after the test.

az containerapp update \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --memory 1.0Gi \
    --cpu 0.5
