Windows Memory Pressure and IIS Worker Recycling (Azure App Service Windows)¶

1. Summary¶

Symptom¶

The app is stable after restart, then latency and intermittent 5xx return as uptime increases. Windows workers (w3wp.exe) recycle unexpectedly, sessions may reset, and logs show process restarts without a code deployment event.

Why this scenario is confusing¶

Memory pressure on App Service Windows often appears as periodic worker recycling rather than one obvious crash. IIS overlap recycle, platform restarts, .NET GC behavior, and native allocations can all look similar from high-level uptime signals. Teams also assume on-premises IIS behavior applies directly, but App Service sandbox and SKU memory ceilings materially change outcomes.

Troubleshooting decision flow (mermaid diagram)¶

graph TD
    A[Symptom: Windows app degrades with uptime and w3wp.exe recycles] --> B{First discriminator}
    B --> C[Private Bytes rises steadily across hours]
    B --> D[Memory reaches tier ceiling then recycle]
    B --> E[Recycle cadence is clock-like or config-linked]
    B --> F[High Gen2 or LOH warnings before recycle]
    B --> G[Managed heap looks stable but process memory still grows]
    C --> H1[H1: Application memory leak]
    D --> H2[H2: Tier memory limit reached]
    E --> H3[H3: IIS proactive/time-based recycle]
    F --> H4[H4: .NET GC pressure and LOH fragmentation]
    G --> H5[H5: Native memory leak in unmanaged path]

Limitations¶

This playbook targets Windows App Service only and focuses on w3wp.exe recycling behavior.
It does not cover Linux container workers, Oryx startup, or dotnet process model on Linux plans.
App Service host internals are abstracted; some low-level host counters from on-premises IIS are not directly exposed.
Evidence quality depends on enabled diagnostics and retention period in the connected Log Analytics workspace.

Quick Conclusion¶

If w3wp.exe private bytes climbs toward the SKU memory envelope and recycle events correlate with that rise, treat memory pressure as primary. Use Kudu Process Explorer and App Service Diagnostics Memory detector first, then disprove competing hypotheses with KQL timeline correlation before changing code or scaling policy.

2. Common Misreadings¶

"IIS recycled, so App Service platform is unstable" (often memory threshold or configured recycle).
"CPU is normal, so memory cannot be the bottleneck" (memory-only pressure can degrade latency first).
"Restart fixed it permanently" (restart can mask leak or fragmentation pattern for a few hours).
"On-premises IIS private memory tuning should be identical" (App Service imposes tier/sandbox constraints).
"Managed heap is small, therefore no leak exists" (native allocations and fragmentation can dominate process RSS).
"Only one app instance shows restarts, so this is random" (instance-local traffic skew can trigger per-instance pressure).

3. Competing Hypotheses¶

H1: Application memory leak (steady growth until recycling)
- Long-lived references, unbounded caches, retained per-request objects, or event-handler leaks grow w3wp.exe private bytes over uptime.
H2: Tier memory limit reached (app needs more RAM than plan provides)
- Workload memory demand exceeds practical per-worker envelope for the selected SKU, so recycle is capacity-driven.
H3: IIS proactive recycling (time-based or overlapping recycle)
- Scheduled/time-based recycle or config-triggered overlap recycle creates restart signatures that are not pure leak behavior.
H4: GC pressure in .NET (LOH fragmentation, excessive Gen2)
- High allocation churn and LOH fragmentation produce pause amplification and memory growth without classic object leak signatures.
H5: Native memory leak in unmanaged code
- Interop/native library usage leaks memory outside managed heap, so .NET counters look acceptable while process memory rises.

4. What to Check First¶

Fast triage sequence¶

Open Diagnose and solve problems in App Service and run the Memory detector.
Open Kudu (https://<app-name>.scm.azurewebsites.net) and inspect Process Explorer for w3wp.exe private bytes trend.
Correlate recycle/restart times with memory rise using KQL across app, platform, and metrics logs.
Confirm whether recycle timing is threshold-driven (reactive) or schedule/config-driven (proactive).

Practical memory envelope reference (Windows workers)¶

Plan SKU	Practical `w3wp.exe` private bytes envelope	Operational interpretation
S1	~1 GB	Frequent pressure if app holds large in-memory caches or large object payloads
P1v2	~3.5 GB	Better headroom, still sensitive to fragmentation and native leaks
P1v3	~8 GB	Significant headroom; sustained climb here strongly suggests leak or oversized workload

Interpret SKU values correctly

These values are practical troubleshooting envelopes, not hard contractual limits for every runtime combination. Always validate with observed per-instance metrics and current platform behavior.

App Service Windows vs on-premises IIS mental model¶

On-premises IIS lets you directly tune many app pool knobs per server; App Service abstracts host control and enforces multi-tenant safety boundaries.
App Service applies platform-level safeguards that can recycle workers before traditional on-prem thresholds would be reached.
Diagnosing memory must combine IIS/app evidence with App Service diagnostics signals; pure server-admin assumptions can mislead.

Proactive vs reactive recycling discriminator¶

Reactive memory recycle: private bytes trend approaches envelope, recycle occurs near peak, then memory resets lower.
Proactive/time-based recycle: recycle occurs on regular cadence independent of memory peak.
Config-triggered recycle: recycle aligns with configuration/app setting changes, slot swaps, or web.config touches.
Overlap recycle behavior: temporary dual-worker overlap can increase short-lived memory pressure during transition.

Portal view: Diagnose and solve problems hub as entry point for the Memory detector¶

The Diagnose and solve problems blade with the Common Solutions tab selected is the orientation surface this playbook's Section 4 names as the first triage step. The visible Availability and Performance category card lists Application Logs, App Down Workflow, and Web App Down as the entry-point links inside that category. The visible Diagnostic Tools category card lists Auto-Heal, Network Troubleshooter, and Advanced Application Restart — Auto-Heal being the same name this playbook's H3 hypothesis discriminator references. The Popular troubleshooting tools strip at the bottom lists Application Logs, App Down Workflow, Web App Down, Web App Slow, and Process Full List. The source app for this capture is Linux; the Diagnose and solve problems hub itself is OS-agnostic and is the equivalent first stop for a Windows worker-recycle incident.

5. Evidence to Collect¶

Required evidence¶

Kudu Process Explorer snapshots for w3wp.exe over time.
App Service Diagnostics output from Memory detector.
KQL timeline: memory counters + recycle/restart events + latency/error effects.
App configuration timeline (settings change, slot swap, deployment).
Runtime evidence: GC indicators (Gen2, LOH behavior) and native module usage.

Kudu Process Explorer sample output¶

Site: <app-name> (Windows)
Timestamp (UTC): 2026-04-09T07:42:00Z

Process Name   PID    CPU%   Private Bytes   Working Set   Handles   Threads   Uptime
-----------    ----   ----   -------------   -----------   -------   -------   --------
w3wp.exe       10324  28.4   987 MB          1.14 GB       2184      142       06:11:32
w3wp.exe       11488   2.1   126 MB          211 MB         512       33        00:00:41
dotnet.exe      8420   0.6    74 MB          129 MB         268       18        12:05:19
php-cgi.exe     0000   N/A    N/A            N/A            N/A       N/A       N/A

Observation:
- PID 10324 is the active worker near S1 practical envelope.
- PID 11488 indicates overlap recycle warm-up.

How to Read This

A short-lived second w3wp.exe during recycle is expected for overlap startup. The key signal is whether the long-running worker shows monotonic private-bytes growth leading to recycle.

App Service Diagnostics sample findings (Memory detector)¶

Detector: Memory Analysis (App Service Diagnostics)
Time range: Last 24 hours

Findings:
1) Worker memory usage reached 94% of available memory on instance RD00155D4E2A3B.
2) Process w3wp.exe exceeded normal baseline for 3h 18m before recycle.
3) 4 recycle events detected; 3 correlated with memory high-water marks.
4) Recommendation: Review in-memory cache size and evaluate scale-up to higher memory SKU.

KQL Queries with Example Output¶

Query 1: Private bytes trend for worker process¶

AppMetrics
| where TimeGenerated > ago(24h)
| where Name has "Private Bytes" or Name has "Working Set"
| where Properties has "w3wp"
| extend Process=tostring(Properties.ProcessName)
| summarize avg_value=avg(Val), max_value=max(Val) by bin(TimeGenerated, 5m), Process, Name
| order by TimeGenerated asc

Example Output

TimeGenerated	Process	Name	avg_value	max_value
2026-04-09 04:00:00	w3wp.exe	Process\Private Bytes	512	590
2026-04-09 05:00:00	w3wp.exe	Process\Private Bytes	731	804
2026-04-09 06:00:00	w3wp.exe	Process\Private Bytes	892	970
2026-04-09 06:55:00	w3wp.exe	Process\Private Bytes	981	1008
2026-04-09 07:00:00	w3wp.exe	Process\Private Bytes	212	284

How to Read This

A sharp reset after a peak is recycle-consistent. Repeating stair-step growth strongly supports H1 or H2.

Query 2: Restart/recycle correlation timeline¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where Message has_any ("recycle", "restart", "worker process", "w3wp", "site stopped", "site started")
| project TimeGenerated, Level, Message, _ResourceId
| order by TimeGenerated desc

Example Output

TimeGenerated	Level	Message
2026-04-09 07:00:06	Informational	Site: started.
2026-04-09 06:59:58	Warning	Worker process recycle initiated due to memory pressure.
2026-04-09 06:59:55	Informational	Site: stopped.
2026-04-09 02:59:57	Warning	Worker process recycle initiated due to memory pressure.

How to Read This

If each recycle aligns with memory peaks, H1/H2 strengthen. If recycle occurs on fixed intervals regardless of memory, H3 strengthens.

Query 3: Latency/error impact around recycle windows¶

AppServiceHTTPLogs
| where TimeGenerated > ago(24h)
| summarize requests=count(), p95=percentile(TimeTaken,95), p99=percentile(TimeTaken,99), errors=countif(ScStatus >= 500) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

Example Output

TimeGenerated	requests	p95	p99	errors
2026-04-09 06:45:00	11240	980	1840	9
2026-04-09 06:50:00	11308	1150	2220	14
2026-04-09 06:55:00	11092	1390	3110	26
2026-04-09 07:00:00	10231	420	760	3

AppTraces
| where TimeGenerated > ago(24h)
| where Message has_any ("Gen2", "Large Object Heap", "LOH", "GC", "allocation", "OutOfMemoryException")
| project TimeGenerated, SeverityLevel, Message
| order by TimeGenerated desc

Example Output

TimeGenerated	SeverityLevel	Message
2026-04-09 06:57:13	Warning	Gen2 collection duration increased to 280ms percentile band.
2026-04-09 06:55:02	Warning	LOH allocation pressure detected in image transform path.
2026-04-09 06:54:48	Error	System.OutOfMemoryException in report aggregation pipeline.

Query 5: Config/deployment-change correlation¶

AppServiceAuditLogs
| where TimeGenerated > ago(24h)
| where OperationName has_any ("Update site config", "Swap slots", "Restart Web App", "Set app settings")
| project TimeGenerated, OperationName, Caller, ResultDescription
| order by TimeGenerated desc

Example Output

TimeGenerated	OperationName	Caller	ResultDescription
2026-04-09 06:00:02	Update site config	user@example.com	Succeeded
2026-04-09 04:00:04	Set app settings	user@example.com	Succeeded

Data source availability

Table names can vary by workspace onboarding path. If a table is unavailable, use equivalent App Service log tables already enabled in your environment and preserve the same time-correlation logic.

CLI investigation commands (long flags only)¶

az webapp show --resource-group <resource-group> --name <app-name>
az webapp config show --resource-group <resource-group> --name <app-name>
az webapp config appsettings list --resource-group <resource-group> --name <app-name>
az webapp restart --resource-group <resource-group> --name <app-name>
az monitor metrics list --resource <app-service-plan-resource-id> --metric "MemoryPercentage,CpuPercentage,Http5xx" --interval PT1M --aggregation Average
az monitor app-insights query --app <application-insights-name> --analytics-query "AppServiceHTTPLogs | where TimeGenerated > ago(1h) | summarize requests=count() by bin(TimeGenerated, 5m)"

Example Output (sanitized)

$ az monitor metrics list --resource <app-service-plan-resource-id> --metric "MemoryPercentage,CpuPercentage,Http5xx" --interval PT1M --aggregation Average
timestamp                  MemoryPercentage_Average   CpuPercentage_Average   Http5xx_Average
-------------------------  ------------------------   ---------------------   ---------------
2026-04-09T06:55:00Z       92.8                       38.1                    1.0
2026-04-09T06:56:00Z       93.4                       39.6                    2.0
2026-04-09T07:01:00Z       47.2                       24.8                    0.0

Normal vs Abnormal Comparison¶

Signal	Normal (Healthy)	Abnormal (Memory Pressure and Recycling)
`w3wp.exe` private bytes	Stable or cyclical within safe band	Stair-step growth toward SKU envelope then reset
Recycle timing	Rare, change-driven	Repeated near memory high-water mark or fixed cadence
GC behavior	Routine background collections	Frequent Gen2, LOH warnings, longer pause impact
HTTP tail latency	Predictable under same traffic	P95/P99 rises before recycle, then drops after restart
CPU vs memory	Roughly proportional trends	CPU moderate while memory saturates
Managed vs process memory	Similar directional movement	Process memory climbs beyond managed heap trend

6. Validation and Disproof by Hypothesis¶

H1: Application memory leak (steady growth until recycling)¶

Signals that support
- Private bytes trend increases continuously across steady traffic periods.
- Recycle/reset temporarily restores latency and memory baseline.
- Same growth pattern repeats after each recycle.
- Cache cardinality or retained object counts increase over time.
Signals that weaken
- Growth disappears under identical load after code rollback.
- Memory remains flat when instrumentation and cache features are disabled.
- Recycle occurs at fixed wall-clock intervals without memory rise.
What to verify
- Confirm monotonic private-bytes trend from Kudu snapshots and Query 1.
- Compare pre/post-recycle latency from Query 3.
- Review recent changes for unbounded cache, static collections, and retention bugs.

H2: Tier memory limit reached (app needs more RAM than plan provides)¶

Signals that support
- Peaks consistently align near practical envelope (S1 ~1 GB, P1v2 ~3.5 GB, P1v3 ~8 GB).
- Recycle frequency drops immediately after scale-up to higher-memory SKU.
- Workload characteristics include large payload in-memory transformations.
- Multiple instances show similar high-memory pattern under load.
Signals that weaken
- Pressure persists after scale-up with proportional headroom increase.
- Only one code path or tenant triggers growth independent of traffic volume.
- Process memory remains low but recycles still occur by schedule.
What to verify
- Compare memory peaks against SKU envelope and detector findings.
- Run controlled scale-up and observe recycle reduction.
- Validate whether memory demand is expected workload size or unexpected retention.

H3: IIS proactive recycling (time-based or overlapping recycle)¶

Signals that support
- Recycles occur on predictable cadence (for example every N hours) independent of memory peak.
- Recycles align with app setting updates, slot swaps, or config writes.
- Overlap recycle shows short dual-w3wp.exe period with smooth handoff.
- No strong memory stair-step preceding every recycle event.
Signals that weaken
- Recycle timing varies with load and follows memory highs.
- Recycles disappear without config changes but with memory optimization.
- Detector explicitly flags memory-triggered recycle correlation.
What to verify
- Use Query 2 and Query 5 to align recycle events to schedule/config operations.
- Check deployment history and automation jobs touching configuration.
- Distinguish overlap recycle (expected) from recycle storm (problematic).

H4: GC pressure in .NET (LOH fragmentation, excessive Gen2)¶

Signals that support
- App traces show increased Gen2 frequency/duration near incident windows.
- LOH-heavy paths (large JSON, image/PDF processing, report generation) correlate with latency spikes.
- Managed heap size oscillates while process memory remains elevated due to fragmentation.
- Throughput drops despite moderate CPU because GC pause overhead increases.
Signals that weaken
- GC metrics remain stable while private bytes rises linearly.
- Non-.NET components dominate process allocations.
- Load tests with identical payloads do not reproduce GC stress.
What to verify
- Run Query 4 and correlate timestamps with Query 3 degradation periods.
- Profile allocation hotspots in affected endpoints.
- Validate payload size distribution and serialization buffering behavior.

H5: Native memory leak in unmanaged code¶

Signals that support
- Process private bytes grows while managed heap/counters remain comparatively stable.
- Leak appears in features using interop/native dependencies (image codecs, PDF engines, compression libs).
- Memory does not return after full GC, only after process recycle.
- Handle count or native object usage trends upward with uptime.
Signals that weaken
- Growth disappears after managed-code fix with no native dependency changes.
- Managed heap rise fully explains total process memory growth.
- Native-heavy code paths are not active during incidents.
What to verify
- Correlate feature-level traffic with private-bytes rise.
- Capture repeated process snapshots via Kudu Process Explorer.
- Audit native package versions and known leak advisories.

7. Likely Root Cause Patterns¶

Pattern A: Unbounded in-memory cache in ASP.NET application layer
- Cache keyed by request/user dimensions grows without eviction and drives recurring high-water recycle.
Pattern B: SKU mismatch for workload memory profile
- S1 or P1v2 selected for workloads requiring larger sustained object graphs.
Pattern C: Recycle policy noise mistaken for instability
- Time-based/config-triggered overlap recycle interpreted as random platform crash.
Pattern D: LOH-heavy payload processing under burst traffic
- Large allocations fragment heap, increase Gen2 pause cost, and amplify tail latency.
Pattern E: Native library leak in mixed-mode path
- Unmanaged allocations accumulate until worker recycle resets process memory.

Investigation notes¶

Treat recycle as a symptom classification problem first: reactive memory vs proactive schedule/config.
Always inspect per-instance memory timelines; averages hide one-hot instance failure.
Correlate latency, memory, and restart events in a single time axis before selecting mitigations.
Keep evidence sanitized and avoid embedding tenant or subscription identifiers.

Quick conclusion¶

In Windows App Service, repeated w3wp.exe recycle with uptime-linked degradation is usually memory-shape driven, not random platform restart. Distinguish leak, capacity ceiling, proactive recycle, GC pressure, and native leak using Kudu + Diagnostics + KQL correlation, then apply the least risky mitigation that matches the validated hypothesis.

8. Immediate Mitigations¶

Scale up to a higher-memory SKU when evidence supports H2 and user impact is active.
Reduce in-memory cache limits and enforce bounded eviction to lower private-bytes slope.
Move memory-intensive endpoints to async/background architecture to reduce in-flight object retention.
Trigger controlled app restart during incident bridge window while deeper fix is prepared.
Temporarily lower concurrency for heavy endpoints to reduce allocation bursts and Gen2 pressure.
Patch or rollback known leaky native/interop library versions in affected code paths.
Stagger or remove unnecessary config updates that trigger avoidable proactive recycles.

Mitigation ordering

Prefer reversible mitigations first (scale, cap cache, controlled restart) while evidence collection continues. Avoid simultaneous large code and infrastructure changes that erase causal clarity.

9. Prevention¶

Define memory budgets per endpoint and per worker, then validate in load tests before production rollout.
Add memory SLO guardrails: alert on private-bytes slope, recycle frequency, and P95/P99 co-movement.
Use bounded caches with explicit size limits and expiration policy across all app modules.
Include LOH and Gen2 telemetry in release validation for .NET applications handling large objects.
Audit native dependencies quarterly and track security/performance advisories for leak regressions.
Isolate high-memory workloads onto dedicated plans to reduce cross-workload contention.
Use staged rollout with slot testing to detect memory growth patterns before full production traffic.

Windows Memory Pressure and IIS Worker Recycling (Azure App Service Windows)¶

1. Summary¶

Symptom¶

Why this scenario is confusing¶

Troubleshooting decision flow (mermaid diagram)¶

Limitations¶

Quick Conclusion¶

2. Common Misreadings¶

3. Competing Hypotheses¶

4. What to Check First¶

Fast triage sequence¶

Practical memory envelope reference (Windows workers)¶

App Service Windows vs on-premises IIS mental model¶

Proactive vs reactive recycling discriminator¶

Portal view: Diagnose and solve problems hub as entry point for the Memory detector¶

5. Evidence to Collect¶

Required evidence¶

Kudu Process Explorer sample output¶

App Service Diagnostics sample findings (Memory detector)¶

KQL Queries with Example Output¶

Query 1: Private bytes trend for worker process¶

Query 2: Restart/recycle correlation timeline¶

Query 3: Latency/error impact around recycle windows¶

Query 5: Config/deployment-change correlation¶

CLI investigation commands (long flags only)¶

Normal vs Abnormal Comparison¶

6. Validation and Disproof by Hypothesis¶

H1: Application memory leak (steady growth until recycling)¶

H2: Tier memory limit reached (app needs more RAM than plan provides)¶

H3: IIS proactive recycling (time-based or overlapping recycle)¶

H4: GC pressure in .NET (LOH fragmentation, excessive Gen2)¶

H5: Native memory leak in unmanaged code¶

7. Likely Root Cause Patterns¶

Investigation notes¶

Quick conclusion¶

8. Immediate Mitigations¶

9. Prevention¶

See Also¶

Sources¶

Windows Memory Pressure and IIS Worker Recycling (Azure App Service Windows)¶

1. Summary¶

Symptom¶

Why this scenario is confusing¶

Troubleshooting decision flow (mermaid diagram)¶

Limitations¶

Quick Conclusion¶

2. Common Misreadings¶

3. Competing Hypotheses¶

4. What to Check First¶

Fast triage sequence¶

Practical memory envelope reference (Windows workers)¶

App Service Windows vs on-premises IIS mental model¶

Proactive vs reactive recycling discriminator¶

Portal view: Diagnose and solve problems hub as entry point for the Memory detector¶

5. Evidence to Collect¶

Required evidence¶

Kudu Process Explorer sample output¶

App Service Diagnostics sample findings (Memory detector)¶

KQL Queries with Example Output¶

Query 1: Private bytes trend for worker process¶

Query 2: Restart/recycle correlation timeline¶

Query 3: Latency/error impact around recycle windows¶

Query 4: .NET GC pressure signature (Gen2 and LOH-related indicators)¶

Query 5: Config/deployment-change correlation¶

CLI investigation commands (long flags only)¶

Normal vs Abnormal Comparison¶

6. Validation and Disproof by Hypothesis¶

H1: Application memory leak (steady growth until recycling)¶

H2: Tier memory limit reached (app needs more RAM than plan provides)¶

H3: IIS proactive recycling (time-based or overlapping recycle)¶

H4: GC pressure in .NET (LOH fragmentation, excessive Gen2)¶

H5: Native memory leak in unmanaged code¶

7. Likely Root Cause Patterns¶

Investigation notes¶

Quick conclusion¶

8. Immediate Mitigations¶

9. Prevention¶

See Also¶

Sources¶