Out of Memory / Worker Crash Playbook¶

1. Summary¶

This playbook is for incidents where Azure Functions executions fail, restart, or disappear under load because worker memory is exhausted. It is most common in workloads that materialize large payloads in memory, read blobs fully instead of streaming, hold unbounded collections, or run CPU-heavy media operations that allocate large temporary buffers.

On Azure Functions, memory ceilings are plan-dependent and hard in practice: Consumption is typically limited to about 1.5 GB per instance, Premium EP1 to about 3.5 GB, EP2 to about 7 GB, and EP3 to about 14 GB. When allocations spike beyond available memory, workers can throw System.OutOfMemoryException, trigger process recycling, and leave heartbeat gaps that appear as random cold starts or transient availability loss.

Decision Flow¶

flowchart TD
    A["Incident detected: failures/timeouts/restarts"] --> B{Any OOM exceptions in traces or FunctionAppLogs?}
    B -->|Yes| C[Correlate by role instance and operation]
    B -->|No| D{Heartbeat gaps or worker restarts?}
    D -->|Yes| E[Suspect crash without surfaced exception]
    D -->|No| F[Check upstream dependencies and network]
    C --> G{Large payload or media processing path?}
    E --> G
    G -->|Yes| H[Validate memory pressure pattern in AppMetrics]
    G -->|No| I["Check unbounded collections / leaks / disposal"]
    H --> J{Plan memory ceiling reached?}
    I --> J
    J -->|Yes| K["Immediate mitigation: throttle/scale up/stream"]
    J -->|No| L[Validate runtime config and library behavior]
    K --> M[Confirm stabilization and no new crash gaps]
    L --> M

Severity guidance¶

Condition	Severity	Action priority
OOM exceptions on single app instance with auto-recovery in under 5 minutes	Sev3	Start triage within 60 minutes
Repeated worker crashes across multiple instances causing queue growth or API errors	Sev2	Start mitigation within 15 minutes
Broad outage with sustained crash loops and customer-visible failure	Sev1	Immediate response and incident command

Signal snapshot¶

Signal	Normal	Incident
`traces` containing `System.OutOfMemoryException`	None or very rare	Bursts aligned with failed requests
`AppMetrics` working set trend	Stable sawtooth with release	Stair-step growth with no return
Host heartbeat continuity	Regular heartbeat cadence	Gaps during crash/recycle windows
Request success rate	>99% for steady workload	Drops with 5xx and timeout spikes
Queue backlog	Bounded and catches up	Persistent growth during restarts

flowchart LR
    A[Trigger receives event] --> B[Function worker allocates payload]
    B --> C[Deserializer + transform buffers]
    C --> D[Business logic allocations]
    D --> E[Downstream SDK buffers]
    E --> F{Memory below ceiling?}
    F -->|Yes| G[Execution completes]
    F -->|No| H[OOM exception or process crash]
    H --> I["Retry/restart causes repeated pressure"]

sequenceDiagram
    participant Q as Queue/Event Source
    participant H as Host
    participant W as Worker
    participant AI as App Insights
    Q->>H: Trigger message batch
    H->>W: Dispatch invocation
    W->>W: Materialize large object graph
    W--xH: Crash / OOM
    H->>AI: traces with restart gap
    H->>Q: Message visibility timeout expires
    Q->>H: Message redelivered

2. Common Misreadings¶

Misreading	Why incorrect	Correct interpretation
"No OOM exception means memory is fine"	Hard crashes can terminate before managed exception logging	Use heartbeat gaps, restarts, and abrupt request drops as corroboration
"CPU is low, so memory cannot be the issue"	Memory exhaustion can happen with modest CPU	Track working set and allocation bursts independently
"Only large files cause OOM"	Many medium objects plus retention can exceed limits	Check cumulative allocations and object lifetime
"Premium always prevents OOM"	Plan upgrade raises ceiling but does not remove leaks or spikes	Fix memory behavior and then right-size plan
"Retries fix transient crash loops"	Retries can amplify pressure by repeating allocations	Add backoff, lower concurrency, and reduce payload footprint

3. Competing Hypotheses¶

ID	Hypothesis	Confirming signal	Disproving signal
H1	Large payloads are fully buffered in memory	OOMs correlate with large request/body size operations	OOMs happen on tiny payload paths
H2	Blob/content downloads are materialized rather than streamed	OOMs correlate with blob read operations and high bytes read	Blob operations absent near failure windows
H3	Unbounded collections or cache growth in worker process	Working set grows monotonically across invocations	Memory returns to baseline after each batch
H4	EF or DB resources are not disposed causing retention	Long-lived contexts/connections around failed windows	No correlation with DB activity and objects released promptly
H5	Image/PDF processing library spikes temporary allocations	Failures cluster on media conversion endpoints	Media pipeline inactive during incidents
H6	Plan memory ceiling is too low for legitimate workload	OOM near stable peak load with efficient code path	Same load succeeds after code-level optimization without scaling

4. What to Check First¶

Confirm whether failures align with memory-intensive functions, payload types, and event sizes.
Check for crash signatures: OOM traces, heartbeat gaps, and sudden worker restarts.
Compare memory trend and request success trend in the same 30-minute window.
Determine if immediate containment needs concurrency reduction or temporary scale-up.

Quick portal checks¶

In Application Insights, search traces for System.OutOfMemoryException and restart markers.
In Metrics, overlay memory working set and failed requests for the same time range.
In Function App logs, inspect crash windows for host heartbeat discontinuity.

Quick CLI checks¶

az functionapp show --name $APP_NAME --resource-group $RG --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "traces | where timestamp > ago(30m) | where message contains 'OutOfMemory' | project timestamp, cloud_RoleInstance, operation_Id, message" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppMetrics | where TimeGenerated > ago(30m) | where Name in ('WorkingSet','PrivateBytes') | summarize avg(Value), max(Value) by Name, bin(TimeGenerated, 5m)" --output table

Example output¶

Name                ResourceGroup        State    RuntimeVersion    LinuxFxVersion
------------------  -------------------  -------  ----------------  --------------------
func-prod-orders    rg-functions-prod    Running  ~4                PYTHON|3.11

timestamp                   cloud_RoleInstance              operation_Id                           message
--------------------------  ------------------------------  ------------------------------------   --------------------------------------------------------------
2026-04-05T03:10:11.220Z    RD281878D4A1C2                 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   System.OutOfMemoryException: Exception of type...
2026-04-05T03:10:11.907Z    RD281878D4A1C2                 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   Worker process terminated due to memory pressure

Name         TimeGenerated               avg_Value      max_Value
-----------  --------------------------  ------------   ------------
WorkingSet   2026-04-05T03:05:00.000Z   1287348224     1499822080
WorkingSet   2026-04-05T03:10:00.000Z   1519804416     1602242560

5. Evidence to Collect¶

KQL Table Names

Most queries use Application Insights table names (traces, requests, dependencies) with classic columns (timestamp, duration). The AppMetrics table is a Log Analytics-only table and uses TimeGenerated instead of timestamp.

Source	Query/Command	Purpose
`traces`	Search for `System.OutOfMemoryException`, `worker process`, `recycle`	Establish explicit crash/error signatures
`traces`	Filter by host/worker restart events	Confirm process-level disruption timeline
`AppMetrics`	WorkingSet, PrivateBytes, GC metrics by 1-5 minute bins	Measure memory growth and saturation trend
`AppMetrics`	Analyze GC pause and allocation counters by role instance	Distinguish leak-like growth from bursty allocations
`requests`	Failed/timeout requests per operation	Quantify customer impact and hot endpoints
`dependencies`	High latency or failed downstream calls around OOM windows	Rule out external dependency primary failure
Deployment history	Runtime/library version changes before incident	Detect regressions introduced by release
Function config	`host.json` concurrency and batching settings	Identify configuration amplifying memory pressure
Hosting plan details	SKU/instance memory envelope	Validate if ceiling is structurally insufficient

6. Validation and Disproof by Hypothesis¶

H1: Large payloads are fully buffered in memory¶

Confirming KQL¶

traces
| where timestamp > ago(6h)
| where message has_any ("OutOfMemoryException", "OutOfMemory", "worker process")
| project oomTime=timestamp, operation_Id, cloud_RoleInstance, message
| join kind=leftouter (
    requests
    | where timestamp > ago(6h)
    | extend requestBytes = tolong(customDimensions["RequestBodyBytes"])
    | project operation_Id, reqTime=timestamp, name, resultCode, requestBytes
) on operation_Id
| where isnotempty(requestBytes)
| summarize incidents=count(), p95RequestBytes=percentile(requestBytes,95), maxRequestBytes=max(requestBytes) by name
| order by incidents desc

Expected output¶

name                          incidents   p95RequestBytes   maxRequestBytes
---------------------------   ---------   ---------------   ---------------
ProcessOrderHttpTrigger       41          52428800          104857600
ImportCatalogHttpTrigger      13          73400320          125829120

Disproving check¶

If OOM windows show low or missing request payload sizes and the same operations fail with tiny payloads, this hypothesis weakens. Shift focus to retained state or library allocation spikes.

H2: Blob/content downloads are materialized rather than streamed¶

Confirming KQL¶

dependencies
| where timestamp > ago(6h)
| where type =~ "Azure blob"
| extend bytes = tolong(customDimensions["ContentLengthBytes"])
| summarize blobOps=count(), maxBytes=max(bytes), p95Bytes=percentile(bytes,95) by operation_Id
| join kind=inner (
    traces
    | where timestamp > ago(6h)
    | where message has_any ("OutOfMemory", "worker process terminated", "System.OutOfMemoryException")
    | project operation_Id, oomMessage=message, oomTime=timestamp
) on operation_Id
| project oomTime, operation_Id, blobOps, p95Bytes, maxBytes, oomMessage
| order by oomTime desc

Expected output¶

oomTime                     operation_Id                           blobOps   p95Bytes    maxBytes    oomMessage
-------------------------   ------------------------------------   -------   ---------   ---------   -------------------------------------------------
2026-04-05T03:10:11.220Z    xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  4         67108864    83886080    System.OutOfMemoryException: Exception of type...
2026-04-05T03:28:02.114Z    xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  3         50331648    67108864    Worker process terminated due to memory pressure

Disproving check¶

If blob dependencies are small, infrequent, or absent around crash periods, bulk materialization of blob content is unlikely to be primary.

H3: Unbounded collections or cache growth in worker process¶

Confirming KQL¶

AppMetrics
| where TimeGenerated > ago(12h)
| where Name in ("WorkingSet", "PrivateBytes")
| summarize avgValue=avg(Value), maxValue=max(Value) by Name, cloud_RoleInstance, bin(TimeGenerated, 10m)
| order by cloud_RoleInstance asc, TimeGenerated asc

Expected output¶

Name         cloud_RoleInstance   TimeGenerated               avgValue      maxValue
-----------  -------------------  --------------------------  ------------  ------------
WorkingSet   RD281878D4A1C2       2026-04-05T02:40:00.000Z   1042284544    1090519040
WorkingSet   RD281878D4A1C2       2026-04-05T02:50:00.000Z   1170837504    1222311936
WorkingSet   RD281878D4A1C2       2026-04-05T03:00:00.000Z   1329594368    1388314624
WorkingSet   RD281878D4A1C2       2026-04-05T03:10:00.000Z   1528823808    1602242560
PrivateBytes RD281878D4A1C2       2026-04-05T02:40:00.000Z   950534144     1003126784
PrivateBytes RD281878D4A1C2       2026-04-05T03:10:00.000Z   1468006400    1539316275

Disproving check¶

If memory exhibits healthy sawtooth behavior with periodic drops after GC and workload drain, persistent retention is less likely. Re-examine event-size bursts or one-off media tasks.

H4: EF or database resources are not disposed causing retention¶

Confirming KQL¶

dependencies
| where timestamp > ago(6h)
| where type in~ ("SQL", "Azure SQL")
| summarize depCount=count(), failed=countif(success == false), p95Duration=percentile(duration,95) by operation_Id
| join kind=leftouter (
    traces
    | where timestamp > ago(6h)
    | where message has_any ("OutOfMemory", "DbContext", "connection pool")
    | project operation_Id, traceMsg=message
) on operation_Id
| where isnotempty(traceMsg)
| project operation_Id, depCount, failed, p95Duration, traceMsg
| order by depCount desc

Expected output¶

operation_Id                           depCount   failed   p95Duration     traceMsg
------------------------------------   --------   ------   -------------   ------------------------------------------------------------
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   184        12       00:00:01.214    DbContext retained across invocations; memory pressure rising
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   171        8        00:00:01.009    System.OutOfMemoryException after repeated query materialization

Disproving check¶

If DB dependency counts and durations remain normal and no retention indicators appear in traces, EF/context disposal is less likely than payload or media buffering.

H5: Image/PDF processing spikes temporary allocations¶

Confirming KQL¶

requests
| where timestamp > ago(6h)
| where name has_any ("Image", "Pdf", "Render", "Thumbnail")
| summarize mediaCalls=count(), failed=countif(success == false), p95Duration=percentile(duration,95) by operation_Id, name
| join kind=inner (
    traces
    | where timestamp > ago(6h)
    | where message has_any ("OutOfMemory", "bitmap", "PDF", "buffer")
    | project operation_Id, oomMessage=message, oomTime=timestamp
) on operation_Id
| project oomTime, name, mediaCalls, failed, p95Duration, oomMessage
| order by oomTime desc

Expected output¶

oomTime                     name                     mediaCalls   failed   p95Duration   oomMessage
-------------------------   ----------------------   ----------   ------   -----------   ------------------------------------------------------
2026-04-05T04:02:45.110Z    RenderPdfActivity        1            1        00:00:22.919  System.OutOfMemoryException during page rasterization
2026-04-05T04:06:31.442Z    GenerateThumbnail        1            1        00:00:08.305  Worker process terminated while processing image buffer

Disproving check¶

If media-related requests are absent or successful during failure windows, prioritize hypotheses around generic payload buffering and retained collections.

H6: Plan memory ceiling is too low for legitimate workload¶

Confirming KQL¶

AppMetrics
| where TimeGenerated > ago(24h)
| where Name == "WorkingSet"
| summarize p95=percentile(Value,95), p99=percentile(Value,99), peak=max(Value) by cloud_RoleInstance, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

Expected output¶

cloud_RoleInstance   TimeGenerated               p95          p99          peak
-------------------  --------------------------  -----------  -----------  -----------
RD281878D4A1C2       2026-04-05T03:00:00.000Z   1459617792   1542449152   1602242560
RD281878D4A1C2       2026-04-05T04:00:00.000Z   1480589312   1559238656   1601048576

Disproving check¶

If code changes (streaming, batching, disposal fixes) materially reduce peak memory below the same plan limit, a pure sizing issue is disproved and optimization is primary.

Cross-check result interpretation: - If p95 is consistently within 5-10% of the plan ceiling, short bursts can trigger recurring crash loops. - If p95 remains far below the ceiling but isolated peaks crash workers, investigate single-operation allocation spikes. - If multiple instances show the same pattern at similar utilization, capacity is likely a structural bottleneck.

7. Likely Root Cause Patterns¶

Pattern	Evidence signature	Frequency
Full in-memory payload deserialization	OOMs tied to endpoints with high request bytes	High
Blob download then process in-memory	Blob dependencies with large content lengths near OOM	High
Unbounded in-process cache/list growth	WorkingSet climbs steadily between invocations	Medium
Resource lifecycle leak (DB context/object graph)	Long-lived references and rising PrivateBytes	Medium
Media conversion buffer amplification	Crashes clustered around PDF/image handlers	Medium

flowchart TD
    A[Deploy code change] --> B[Memory trend begins rising]
    B --> C[Latency increases]
    C --> D[GC pressure increases]
    D --> E[OOM exception or abrupt worker exit]
    E --> F[Retries and redelivery]
    F --> G[Backlog and error rate increase]
    G --> H[Incident declared]

8. Immediate Mitigations¶

Reduce concurrency and batch size in host.json, then restart the app to reduce per-instance memory pressure.

az functionapp config appsettings set --name $APP_NAME --resource-group $RG --settings "AzureFunctionsJobHost__extensions__queues__batchSize=8" "AzureFunctionsJobHost__extensions__queues__newBatchThreshold=4" --output table

Switch blob handling to streaming patterns and avoid loading entire content into byte arrays before processing.

az functionapp deployment source config-zip --name $APP_NAME --resource-group $RG --src ./deployments/streaming-hotfix.zip --output table

Temporarily scale to a higher Premium SKU when sustained peaks approach current plan limits.

az functionapp plan update --name $PLAN_NAME --resource-group $RG --sku EP2 --number-of-workers 2 --output table

Apply guardrails for payload size at the application level. Validate incoming payload size in function code and reject oversized items before buffering.

# Redeploy with payload validation guards added to function handlers
az functionapp deployment source config-zip --name $APP_NAME --resource-group $RG --src ./deployments/payload-guard-hotfix.zip --output table

Configure queue poison-queue threshold and retry interval via host.json extensions so repeated crash retries do not amplify memory pressure. Messages exceeding maxDequeueCount are moved to the poison queue; visibilityTimeout controls the delay between retry attempts.
```
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "maxDequeueCount": 5,
      "visibilityTimeout": "00:01:00"
    }
  }
}
```
Confirm host.json memory-sensitive settings are explicit and aligned with runtime behavior.

Example host.json excerpt for controlled throughput:

{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 8,
      "newBatchThreshold": 4,
      "maxDequeueCount": 5
    },
    "serviceBus": {
      "prefetchCount": 0,
      "messageHandlerOptions": {
        "maxConcurrentCalls": 8,
        "autoCompleteMessages": false
      }
    }
  },
  "concurrency": {
    "dynamicConcurrencyEnabled": true,
    "snapshotPersistenceEnabled": true
  }
}

Plan memory guidance for escalation decisions:

Plan	Typical memory ceiling per instance	Use when
Consumption	1.5 GB	Lightweight, bursty, memory-efficient handlers
Premium EP1	3.5 GB	Moderate memory workloads and reduced cold starts
Premium EP2	7 GB	Heavy payload transforms and media processing
Premium EP3	14 GB	Very large in-memory pipelines after optimization

9. Prevention¶

Enforce streaming I/O for blob/file paths and prohibit full-buffer helper usage in code review.
Add payload-size budgets per trigger and reject or split work items that exceed thresholds.
Instrument memory telemetry (WorkingSet, PrivateBytes) with alerts at 75%, 85%, and 95% of plan ceiling.
Dispose EF/DB contexts and large object graphs deterministically; avoid static caches without eviction.
Load test with production-like payload distributions before release, then validate no monotonic memory growth.

Out of Memory / Worker Crash Playbook¶

1. Summary¶

Decision Flow¶

Severity guidance¶

Signal snapshot¶

2. Common Misreadings¶

3. Competing Hypotheses¶

4. What to Check First¶

Quick portal checks¶

Quick CLI checks¶

Example output¶

5. Evidence to Collect¶

6. Validation and Disproof by Hypothesis¶

H1: Large payloads are fully buffered in memory¶

Confirming KQL¶

Expected output¶

Disproving check¶

H2: Blob/content downloads are materialized rather than streamed¶

Confirming KQL¶

Expected output¶

Disproving check¶

H3: Unbounded collections or cache growth in worker process¶

Confirming KQL¶

Expected output¶

Disproving check¶

H4: EF or database resources are not disposed causing retention¶

Confirming KQL¶

Expected output¶

Disproving check¶

H5: Image/PDF processing spikes temporary allocations¶

Confirming KQL¶

Expected output¶

Disproving check¶

H6: Plan memory ceiling is too low for legitimate workload¶

Confirming KQL¶

Expected output¶

Disproving check¶

7. Likely Root Cause Patterns¶

8. Immediate Mitigations¶

9. Prevention¶

See Also¶

Sources¶