Out of Memory / Worker Crash Playbook¶
1. Summary¶
This playbook is for incidents where Azure Functions executions fail, restart, or disappear under load because worker memory is exhausted. It is most common in workloads that materialize large payloads in memory, read blobs fully instead of streaming, hold unbounded collections, or run CPU-heavy media operations that allocate large temporary buffers.
On Azure Functions, memory ceilings are plan-dependent and hard in practice: Consumption is typically limited to about 1.5 GB per instance, Premium EP1 to about 3.5 GB, EP2 to about 7 GB, and EP3 to about 14 GB. When allocations spike beyond available memory, workers can throw System.OutOfMemoryException, trigger process recycling, and leave heartbeat gaps that appear as random cold starts or transient availability loss.
Decision Flow¶
flowchart TD
A["Incident detected: failures/timeouts/restarts"] --> B{Any OOM exceptions in traces or FunctionAppLogs?}
B -->|Yes| C[Correlate by role instance and operation]
B -->|No| D{Heartbeat gaps or worker restarts?}
D -->|Yes| E[Suspect crash without surfaced exception]
D -->|No| F[Check upstream dependencies and network]
C --> G{Large payload or media processing path?}
E --> G
G -->|Yes| H[Validate memory pressure pattern in AppMetrics]
G -->|No| I["Check unbounded collections / leaks / disposal"]
H --> J{Plan memory ceiling reached?}
I --> J
J -->|Yes| K["Immediate mitigation: throttle/scale up/stream"]
J -->|No| L[Validate runtime config and library behavior]
K --> M[Confirm stabilization and no new crash gaps]
L --> M
Severity guidance¶
| Condition | Severity | Action priority |
|---|---|---|
| OOM exceptions on single app instance with auto-recovery in under 5 minutes | Sev3 | Start triage within 60 minutes |
| Repeated worker crashes across multiple instances causing queue growth or API errors | Sev2 | Start mitigation within 15 minutes |
| Broad outage with sustained crash loops and customer-visible failure | Sev1 | Immediate response and incident command |
Signal snapshot¶
| Signal | Normal | Incident |
|---|---|---|
| traces containing System.OutOfMemoryException | None or very rare | Bursts aligned with failed requests |
| AppMetrics working set trend | Stable sawtooth with release | Stair-step growth with no return |
| Host heartbeat continuity | Regular heartbeat cadence | Gaps during crash/recycle windows |
| Request success rate | >99% for steady workload | Drops with 5xx and timeout spikes |
| Queue backlog | Bounded and catches up | Persistent growth during restarts |
flowchart LR
A[Trigger receives event] --> B[Function worker allocates payload]
B --> C[Deserializer + transform buffers]
C --> D[Business logic allocations]
D --> E[Downstream SDK buffers]
E --> F{Memory below ceiling?}
F -->|Yes| G[Execution completes]
F -->|No| H[OOM exception or process crash]
H --> I["Retry/restart causes repeated pressure"]
sequenceDiagram
participant Q as Queue/Event Source
participant H as Host
participant W as Worker
participant AI as App Insights
Q->>H: Trigger message batch
H->>W: Dispatch invocation
W->>W: Materialize large object graph
W--xH: Crash / OOM
H->>AI: traces with restart gap
H->>Q: Message visibility timeout expires
Q->>H: Message redelivered
2. Common Misreadings¶
| Misreading | Why incorrect | Correct interpretation |
|---|---|---|
| "No OOM exception means memory is fine" | Hard crashes can terminate before managed exception logging | Use heartbeat gaps, restarts, and abrupt request drops as corroboration |
| "CPU is low, so memory cannot be the issue" | Memory exhaustion can happen with modest CPU | Track working set and allocation bursts independently |
| "Only large files cause OOM" | Many medium objects plus retention can exceed limits | Check cumulative allocations and object lifetime |
| "Premium always prevents OOM" | Plan upgrade raises ceiling but does not remove leaks or spikes | Fix memory behavior and then right-size plan |
| "Retries fix transient crash loops" | Retries can amplify pressure by repeating allocations | Add backoff, lower concurrency, and reduce payload footprint |
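The backoff advice in the last row can be sketched as capped exponential backoff with full jitter. This is an illustration only, not the Functions runtime's retry policy; `backoff_delays`, `base_seconds`, and `cap_seconds` are hypothetical names and values chosen for the example.

```python
import random


def backoff_delays(attempts, base_seconds=2.0, cap_seconds=60.0, rng=None):
    """Return one randomized delay (seconds) per retry attempt.

    The upper bound doubles on each attempt and is capped, so retries
    spread out instead of repeating the same allocation in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        upper = min(cap_seconds, base_seconds * (2 ** attempt))
        delays.append(rng.uniform(0.0, upper))  # full jitter: anywhere in [0, upper]
    return delays
```

Jitter matters here because several instances retrying at the same instant recreate the original memory spike.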
3. Competing Hypotheses¶
| ID | Hypothesis | Confirming signal | Disproving signal |
|---|---|---|---|
| H1 | Large payloads are fully buffered in memory | OOMs correlate with large request/body size operations | OOMs happen on tiny payload paths |
| H2 | Blob/content downloads are materialized rather than streamed | OOMs correlate with blob read operations and high bytes read | Blob operations absent near failure windows |
| H3 | Unbounded collections or cache growth in worker process | Working set grows monotonically across invocations | Memory returns to baseline after each batch |
| H4 | EF or DB resources are not disposed causing retention | Long-lived contexts/connections around failed windows | No correlation with DB activity and objects released promptly |
| H5 | Image/PDF processing library spikes temporary allocations | Failures cluster on media conversion endpoints | Media pipeline inactive during incidents |
| H6 | Plan memory ceiling is too low for legitimate workload | OOM near stable peak load with efficient code path | Same load succeeds after code-level optimization without scaling |
4. What to Check First¶
- Confirm whether failures align with memory-intensive functions, payload types, and event sizes.
- Check for crash signatures: OOM traces, heartbeat gaps, and sudden worker restarts.
- Compare memory trend and request success trend in the same 30-minute window.
- Determine if immediate containment needs concurrency reduction or temporary scale-up.
Quick portal checks¶
- In Application Insights, search traces for System.OutOfMemoryException and restart markers.
- In Metrics, overlay memory working set and failed requests for the same time range.
- In Function App logs, inspect crash windows for host heartbeat discontinuity.
Quick CLI checks¶
az functionapp show --name $APP_NAME --resource-group $RG --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(30m) | where Message contains 'OutOfMemory' | project timestamp=TimeGenerated, cloud_RoleInstance=AppRoleInstance, operation_Id=OperationId, message=Message" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppMetrics | where TimeGenerated > ago(30m) | where Name in ('WorkingSet','PrivateBytes') | summarize avg(Value), max(Value) by Name, bin(TimeGenerated, 5m)" --output table
Example output¶
Name ResourceGroup State RuntimeVersion LinuxFxVersion
------------------ ------------------- ------- ---------------- --------------------
func-prod-orders rg-functions-prod Running ~4 PYTHON|3.11
timestamp cloud_RoleInstance operation_Id message
-------------------------- ------------------------------ ------------------------------------ --------------------------------------------------------------
2026-04-05T03:10:11.220Z RD281878D4A1C2 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx System.OutOfMemoryException: Exception of type...
2026-04-05T03:10:11.907Z RD281878D4A1C2 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx Worker process terminated due to memory pressure
Name TimeGenerated avg_Value max_Value
----------- -------------------------- ------------ ------------
WorkingSet 2026-04-05T03:05:00.000Z 1287348224 1499822080
WorkingSet 2026-04-05T03:10:00.000Z 1519804416 1602242560
5. Evidence to Collect¶
KQL Table Names
Most queries use Application Insights table names (traces, requests, dependencies) with classic columns (timestamp, duration). The AppMetrics table is a Log Analytics-only table and uses TimeGenerated instead of timestamp.
| Source | Query/Command | Purpose |
|---|---|---|
| traces | Search for System.OutOfMemoryException, worker process, recycle | Establish explicit crash/error signatures |
| traces | Filter by host/worker restart events | Confirm process-level disruption timeline |
| AppMetrics | WorkingSet, PrivateBytes, GC metrics by 1-5 minute bins | Measure memory growth and saturation trend |
| AppMetrics | Analyze GC pause and allocation counters by role instance | Distinguish leak-like growth from bursty allocations |
| requests | Failed/timeout requests per operation | Quantify customer impact and hot endpoints |
| dependencies | High latency or failed downstream calls around OOM windows | Rule out external dependency primary failure |
| Deployment history | Runtime/library version changes before incident | Detect regressions introduced by release |
| Function config | host.json concurrency and batching settings | Identify configuration amplifying memory pressure |
| Hosting plan details | SKU/instance memory envelope | Validate if ceiling is structurally insufficient |
6. Validation and Disproof by Hypothesis¶
H1: Large payloads are fully buffered in memory¶
Confirming KQL¶
traces
| where timestamp > ago(6h)
| where message has_any ("OutOfMemoryException", "OutOfMemory", "worker process")
| project oomTime=timestamp, operation_Id, cloud_RoleInstance, message
| join kind=leftouter (
requests
| where timestamp > ago(6h)
| extend requestBytes = tolong(customDimensions["RequestBodyBytes"])
| project operation_Id, reqTime=timestamp, name, resultCode, requestBytes
) on operation_Id
| where isnotempty(requestBytes)
| summarize incidents=count(), p95RequestBytes=percentile(requestBytes,95), maxRequestBytes=max(requestBytes) by name
| order by incidents desc
Expected output¶
name incidents p95RequestBytes maxRequestBytes
--------------------------- --------- --------------- ---------------
ProcessOrderHttpTrigger 41 52428800 104857600
ImportCatalogHttpTrigger 13 73400320 125829120
Disproving check¶
If OOM windows show low or missing request payload sizes and the same operations fail with tiny payloads, this hypothesis weakens. Shift focus to retained state or library allocation spikes.
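If H1 is confirmed, the usual fix is to process the body incrementally instead of materializing it. The sketch below assumes the payload is newline-delimited JSON, which the playbook does not state; `iter_records`, `total_quantity`, and the `quantity` field are hypothetical.

```python
import io
import json


def iter_records(stream, max_line_bytes=1_048_576):
    """Yield one decoded JSON record per line without buffering the whole body."""
    while True:
        # readline(size) never reads more than size bytes, so one oversized
        # record cannot be pulled into memory in full.
        raw = stream.readline(max_line_bytes + 1)
        if not raw:
            return
        if len(raw) > max_line_bytes:
            raise ValueError("record exceeds per-line budget")
        line = raw.strip()
        if line:
            yield json.loads(line)


def total_quantity(stream):
    """Aggregate across records while holding only one record at a time."""
    return sum(record.get("quantity", 0) for record in iter_records(stream))


body = io.BytesIO(b'{"sku": "A", "quantity": 2}\n{"sku": "B", "quantity": 5}\n')
print(total_quantity(body))  # 7
```

Peak memory is bounded by the per-line budget rather than by the request size.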
H2: Blob/content downloads are materialized rather than streamed¶
Confirming KQL¶
dependencies
| where timestamp > ago(6h)
| where type =~ "Azure blob"
| extend bytes = tolong(customDimensions["ContentLengthBytes"])
| summarize blobOps=count(), maxBytes=max(bytes), p95Bytes=percentile(bytes,95) by operation_Id
| join kind=inner (
traces
| where timestamp > ago(6h)
| where message has_any ("OutOfMemory", "worker process terminated", "System.OutOfMemoryException")
| project operation_Id, oomMessage=message, oomTime=timestamp
) on operation_Id
| project oomTime, operation_Id, blobOps, p95Bytes, maxBytes, oomMessage
| order by oomTime desc
Expected output¶
oomTime operation_Id blobOps p95Bytes maxBytes oomMessage
------------------------- ------------------------------------ ------- --------- --------- -------------------------------------------------
2026-04-05T03:10:11.220Z xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 4 67108864 83886080 System.OutOfMemoryException: Exception of type...
2026-04-05T03:28:02.114Z xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 3 50331648 67108864 Worker process terminated due to memory pressure
Disproving check¶
If blob dependencies are small, infrequent, or absent around crash periods, bulk materialization of blob content is unlikely to be primary.
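For contrast with full materialization, here is a minimal sketch of chunked consumption. `io.BytesIO` stands in for a blob stream; how a real function obtains a streaming handle from the storage SDK or binding is not shown and depends on the SDK version in use. `stream_digest` and the 4 MiB chunk size are illustrative assumptions.

```python
import hashlib
import io


def stream_digest(source, chunk_size=4 * 1024 * 1024):
    """Hash a file-like object chunk by chunk; the peak buffer is one chunk."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        digest.update(chunk)
        total += len(chunk)
    return total, digest.hexdigest()


size, checksum = stream_digest(io.BytesIO(b"x" * 10_000_000))
print(size)  # 10000000
```

The same loop shape applies to any transform that can work on one chunk at a time (hashing, copying, line splitting).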
H3: Unbounded collections or cache growth in worker process¶
Confirming KQL¶
AppMetrics
| where TimeGenerated > ago(12h)
| where Name in ("WorkingSet", "PrivateBytes")
| summarize avgValue=avg(Value), maxValue=max(Value) by Name, cloud_RoleInstance, bin(TimeGenerated, 10m)
| order by cloud_RoleInstance asc, TimeGenerated asc
Expected output¶
Name cloud_RoleInstance TimeGenerated avgValue maxValue
----------- ------------------- -------------------------- ------------ ------------
WorkingSet RD281878D4A1C2 2026-04-05T02:40:00.000Z 1042284544 1090519040
WorkingSet RD281878D4A1C2 2026-04-05T02:50:00.000Z 1170837504 1222311936
WorkingSet RD281878D4A1C2 2026-04-05T03:00:00.000Z 1329594368 1388314624
WorkingSet RD281878D4A1C2 2026-04-05T03:10:00.000Z 1528823808 1602242560
PrivateBytes RD281878D4A1C2 2026-04-05T02:40:00.000Z 950534144 1003126784
PrivateBytes RD281878D4A1C2 2026-04-05T03:10:00.000Z 1468006400 1539316275
Disproving check¶
If memory exhibits healthy sawtooth behavior with periodic drops after GC and workload drain, persistent retention is less likely. Re-examine event-size bursts or one-off media tasks.
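The sawtooth versus stair-step distinction can be checked mechanically on binned working-set values. This is a rough heuristic sketch; `classify_trend` and the 10% release threshold are assumptions for illustration, not a documented rule.

```python
def classify_trend(samples, release_fraction=0.10):
    """Label a working-set series as 'stair-step', 'sawtooth', or 'flat'.

    A release is any sample at least release_fraction below the running peak.
    """
    if len(samples) < 3:
        return "insufficient-data"
    peak = samples[0]
    released = False
    for value in samples[1:]:
        if value <= peak * (1.0 - release_fraction):
            released = True
        peak = max(peak, value)
    if released:
        return "sawtooth"
    grew = samples[-1] > samples[0] * (1.0 + release_fraction)
    return "stair-step" if grew else "flat"


# WorkingSet avgValue column from the expected output above
print(classify_trend([1042284544, 1170837504, 1329594368, 1528823808]))  # stair-step
```

Run it per role instance; mixing instances in one series hides recycles.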
H4: EF or database resources are not disposed causing retention¶
Confirming KQL¶
dependencies
| where timestamp > ago(6h)
| where type in~ ("SQL", "Azure SQL")
| summarize depCount=count(), failed=countif(success == false), p95Duration=percentile(duration,95) by operation_Id
| join kind=leftouter (
traces
| where timestamp > ago(6h)
| where message has_any ("OutOfMemory", "DbContext", "connection pool")
| project operation_Id, traceMsg=message
) on operation_Id
| where isnotempty(traceMsg)
| project operation_Id, depCount, failed, p95Duration, traceMsg
| order by depCount desc
Expected output¶
operation_Id depCount failed p95Duration traceMsg
------------------------------------ -------- ------ ------------- ------------------------------------------------------------
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 184 12 00:00:01.214 DbContext retained across invocations; memory pressure rising
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 171 8 00:00:01.009 System.OutOfMemoryException after repeated query materialization
Disproving check¶
If DB dependency counts and durations remain normal and no retention indicators appear in traces, EF/context disposal is less likely than payload or media buffering.
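H4 names EF, which is .NET-specific. For a Python worker the equivalent discipline is to scope connections and cursors to a single invocation. The sketch below uses the standard-library `sqlite3` module purely as a stand-in for a real database client; `count_orders` and the `orders` table are hypothetical.

```python
import sqlite3
from contextlib import closing


def count_orders(order_ids):
    """Open, use, and close the connection inside one invocation."""
    # closing() guarantees close() runs even if a statement raises, so no
    # connection or cursor survives into the next invocation.
    with closing(sqlite3.connect(":memory:")) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
            cur.executemany(
                "INSERT INTO orders (id) VALUES (?)", [(i,) for i in order_ids]
            )
            cur.execute("SELECT COUNT(*) FROM orders")
            (count,) = cur.fetchone()
    return count


print(count_orders([1, 2, 3]))  # 3
```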
H5: Image/PDF processing spikes temporary allocations¶
Confirming KQL¶
requests
| where timestamp > ago(6h)
| where name has_any ("Image", "Pdf", "Render", "Thumbnail")
| summarize mediaCalls=count(), failed=countif(success == false), p95Duration=percentile(duration,95) by operation_Id, name
| join kind=inner (
traces
| where timestamp > ago(6h)
| where message has_any ("OutOfMemory", "bitmap", "PDF", "buffer")
| project operation_Id, oomMessage=message, oomTime=timestamp
) on operation_Id
| project oomTime, name, mediaCalls, failed, p95Duration, oomMessage
| order by oomTime desc
Expected output¶
oomTime name mediaCalls failed p95Duration oomMessage
------------------------- ---------------------- ---------- ------ ----------- ------------------------------------------------------
2026-04-05T04:02:45.110Z RenderPdfActivity 1 1 00:00:22.919 System.OutOfMemoryException during page rasterization
2026-04-05T04:06:31.442Z GenerateThumbnail 1 1 00:00:08.305 Worker process terminated while processing image buffer
Disproving check¶
If media-related requests are absent or successful during failure windows, prioritize hypotheses around generic payload buffering and retained collections.
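When H5 holds, capping how many conversions run at once inside a worker limits the sum of temporary buffers. A minimal sketch, assuming thread-based concurrency within one process; `MediaGate`, `fake_convert`, and the limit of 2 are hypothetical, and host-level concurrency settings still apply on top.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MediaGate:
    """Allow at most max_parallel memory-heavy conversions at once."""

    def __init__(self, max_parallel=2):
        self._slots = threading.BoundedSemaphore(max_parallel)

    def run(self, convert, item):
        with self._slots:  # blocks while all slots are busy
            return convert(item)


def fake_convert(page_count):
    """Stand-in for a real rasterizer: allocates page_count bytes."""
    return bytes(page_count)


gate = MediaGate(max_parallel=2)
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(lambda n: len(gate.run(fake_convert, n)), range(8)))
print(sizes)  # [0, 1, 2, 3, 4, 5, 6, 7]
```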
H6: Plan memory ceiling is too low for legitimate workload¶
Confirming KQL¶
AppMetrics
| where TimeGenerated > ago(24h)
| where Name == "WorkingSet"
| summarize p95=percentile(Value,95), p99=percentile(Value,99), peak=max(Value) by cloud_RoleInstance, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
Expected output¶
cloud_RoleInstance TimeGenerated p95 p99 peak
------------------- -------------------------- ----------- ----------- -----------
RD281878D4A1C2 2026-04-05T03:00:00.000Z 1459617792 1542449152 1602242560
RD281878D4A1C2 2026-04-05T04:00:00.000Z 1480589312 1559238656 1601048576
Disproving check¶
If code changes (streaming, batching, disposal fixes) materially reduce peak memory below the same plan limit, a pure sizing issue is disproved and optimization is primary.
Cross-check result interpretation:
- If p95 is consistently within 5-10% of the plan ceiling, short bursts can trigger recurring crash loops.
- If p95 remains far below the ceiling but isolated peaks crash workers, investigate single-operation allocation spikes.
- If multiple instances show the same pattern at similar utilization, capacity is likely a structural bottleneck.
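The first two interpretations reduce to simple arithmetic against the plan ceiling. The sketch below assumes the ceiling is expressed in binary gigabytes and uses the wide end (10%) of the "within 5-10%" band; `interpret_cross_check` and its labels are hypothetical. The multi-instance rule needs per-instance data and is not covered.

```python
def interpret_cross_check(p95_bytes, peak_bytes, ceiling_bytes, near_margin=0.10):
    """Map p95/peak working set to one of the interpretations above."""
    near_line = ceiling_bytes * (1.0 - near_margin)
    if p95_bytes >= near_line:
        return "sustained-pressure"   # short bursts can trigger crash loops
    if peak_bytes >= near_line:
        return "isolated-spike"       # look for single-operation allocations
    return "headroom-available"


CONSUMPTION_CEILING = int(1.5 * 1024**3)  # about 1.5 GB, per this playbook

# p95 and peak taken from the first row of the H6 expected output
print(interpret_cross_check(1459617792, 1602242560, CONSUMPTION_CEILING))
# sustained-pressure
```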
7. Likely Root Cause Patterns¶
| Pattern | Evidence signature | Frequency |
|---|---|---|
| Full in-memory payload deserialization | OOMs tied to endpoints with high request bytes | High |
| Blob download then process in-memory | Blob dependencies with large content lengths near OOM | High |
| Unbounded in-process cache/list growth | WorkingSet climbs steadily between invocations | Medium |
| Resource lifecycle leak (DB context/object graph) | Long-lived references and rising PrivateBytes | Medium |
| Media conversion buffer amplification | Crashes clustered around PDF/image handlers | Medium |
flowchart TD
A[Deploy code change] --> B[Memory trend begins rising]
B --> C[Latency increases]
C --> D[GC pressure increases]
D --> E[OOM exception or abrupt worker exit]
E --> F[Retries and redelivery]
F --> G[Backlog and error rate increase]
G --> H[Incident declared]
8. Immediate Mitigations¶
- Reduce concurrency and batch size in host.json, then restart the app to reduce per-instance memory pressure.
- Switch blob handling to streaming patterns and avoid loading entire content into byte arrays before processing.
- Temporarily scale to a higher Premium SKU when sustained peaks approach current plan limits.
- Apply guardrails for payload size at the application level. Validate incoming payload size in function code and reject oversized items before buffering.
- Configure the queue poison-queue threshold and retry interval via host.json extensions so repeated crash retries do not amplify memory pressure. Messages exceeding maxDequeueCount are moved to the poison queue; visibilityTimeout controls the delay between retry attempts.
- Confirm host.json memory-sensitive settings are explicit and aligned with runtime behavior.
Example host.json excerpt for controlled throughput (Service Bus settings shown in the flat layout used by extension 5.x; 4.x nests them under messageHandlerOptions with different names):
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 8,
      "newBatchThreshold": 4,
      "maxDequeueCount": 5,
      "visibilityTimeout": "00:00:30"
    },
    "serviceBus": {
      "prefetchCount": 0,
      "maxConcurrentCalls": 8,
      "autoCompleteMessages": false
    }
  },
  "concurrency": {
    "dynamicConcurrencyEnabled": true,
    "snapshotPersistenceEnabled": true
  }
}
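The application-level payload guardrail from the mitigation list can be sketched as a bounded read. `read_bounded`, `PayloadTooLarge`, and the 8 MiB budget are hypothetical; how the stream and declared length are obtained depends on the trigger type and is not shown.

```python
import io

MAX_PAYLOAD_BYTES = 8 * 1024 * 1024  # hypothetical budget; tune per trigger


class PayloadTooLarge(Exception):
    """Raised when a work item exceeds the configured budget."""


def read_bounded(stream, declared_length=None, limit=MAX_PAYLOAD_BYTES):
    """Return the body only if it fits; never buffer more than limit + 1 bytes."""
    if declared_length is not None and declared_length > limit:
        raise PayloadTooLarge(f"declared {declared_length} bytes > {limit}")
    body = stream.read(limit + 1)  # one extra byte reveals an oversized body
    if len(body) > limit:
        raise PayloadTooLarge(f"body exceeds {limit} bytes")
    return body


print(len(read_bounded(io.BytesIO(b"ok"))))  # 2
```

Rejecting on the declared length first avoids touching the stream at all for obviously oversized items.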
Plan memory guidance for escalation decisions:
| Plan | Typical memory ceiling per instance | Use when |
|---|---|---|
| Consumption | 1.5 GB | Lightweight, bursty, memory-efficient handlers |
| Premium EP1 | 3.5 GB | Moderate memory workloads and reduced cold starts |
| Premium EP2 | 7 GB | Heavy payload transforms and media processing |
| Premium EP3 | 14 GB | Very large in-memory pipelines after optimization |
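For escalation decisions, the table can be turned into a quick lookup. The ceilings come from the table above; the 25% headroom default and the name `smallest_fitting_plan` are assumptions for illustration, and the ceilings are treated as binary gigabytes.

```python
PLAN_CEILINGS_GB = [  # typical per-instance ceilings from the table above
    ("Consumption", 1.5),
    ("Premium EP1", 3.5),
    ("Premium EP2", 7.0),
    ("Premium EP3", 14.0),
]


def smallest_fitting_plan(peak_bytes, headroom=0.25):
    """Return the first plan that keeps `headroom` free above the peak, or None."""
    for name, ceiling_gb in PLAN_CEILINGS_GB:
        usable = ceiling_gb * 1024**3 * (1.0 - headroom)
        if peak_bytes <= usable:
            return name
    return None  # no listed plan fits; optimize or split the workload


# Peak WorkingSet from the H6 expected output
print(smallest_fitting_plan(1602242560))  # Premium EP1
```

A `None` result means scaling up alone will not help.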
9. Prevention¶
- Enforce streaming I/O for blob/file paths and prohibit full-buffer helper usage in code review.
- Add payload-size budgets per trigger and reject or split work items that exceed thresholds.
- Instrument memory telemetry (WorkingSet, PrivateBytes) with alerts at 75%, 85%, and 95% of plan ceiling.
- Dispose EF/DB contexts and large object graphs deterministically; avoid static caches without eviction.
- Load test with production-like payload distributions before release, then validate no monotonic memory growth.
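As an example of a cache with eviction, here is a minimal least-recently-used sketch built on the standard library. `BoundedCache` and its default size are hypothetical; for caching function results, `functools.lru_cache(maxsize=...)` offers similar bounding with less code.

```python
from collections import OrderedDict


class BoundedCache:
    """Least-recently-used cache with a hard entry limit."""

    def __init__(self, max_entries=256):
        if max_entries < 1:
            raise ValueError("max_entries must be >= 1")
        self._max = max_entries
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict least recently used

    def __len__(self):
        return len(self._items)
```

The limit counts entries, not bytes, so it only bounds memory when values have a predictable size.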
See Also¶
- Troubleshooting architecture
- Troubleshooting methodology
- Troubleshooting KQL guide
- Out of memory crash lab guide
- Durable orchestration stuck playbook