Troubleshooting Decision Tree

Use this page when you need to triage quickly from symptom to likely failure category and then open the right playbook.

The tree is intentionally symptom-first and optimized for the first 10–15 minutes of incident response.

Portal view: Diagnose and solve problems as the triage entry point

Before walking the manual decision tree below, open Diagnose and solve problems in the Portal. Its Risk alerts panel and Common Solutions may already point directly to the failure category. Several of the seven Troubleshooting categories correspond to branches in the decision tree: Availability and Performance covers the 5xx and startup branches, Networking covers the outbound dependency branch, Configuration and Management covers the slot swap and config drift branches, and Deployment covers the post-deploy regression branch. Select App Down Workflow for an automated walkthrough that runs many of the queries from the decision tree below in a guided sequence.

Main triage decision tree

flowchart TD
    S[Incident starts: user-visible impact] --> Q1{Is it a 5xx issue?}

    Q1 -->|Yes| Q1A{Intermittent or constant?}
    Q1 -->|No| Q2{Is startup failing?}

    Q1A -->|Intermittent under load| P1[Playbook: Intermittent 5xx Under Load]
    Q1A -->|Constant after deploy or swap| Q3{Was there a recent restart or deployment?}
    Q1A -->|Constant, no deploy| P2[Playbook: Slow Response but Low CPU]

    Q3 -->|Yes, startup symptoms| P3[Playbook: Deployment Succeeded but Startup Failed]
    Q3 -->|Yes, proxy forward errors| P4[Playbook: Failed to Forward Request]
    Q3 -->|Yes, slot behavior changed| P5[Playbook: Slot Swap Config Drift]
    Q3 -->|No clear deploy signal| P6[Playbook: Container Didn't Respond to HTTP Pings]

    Q2 -->|Yes| Q2A{Container up but unhealthy?}
    Q2 -->|No| Q4{Is outbound dependency failing?}

    Q2A -->|Startup timeout or ping failure| P6
    Q2A -->|Warm-up confusion| P7[Playbook: Warm-up vs Health Check]
    Q2A -->|Swap warm-up failure| P8[Playbook: Slot Swap Failed During Warm-up]

    Q4 -->|Yes| Q4A{DNS, SNAT, or private endpoint route?}
    Q4 -->|No| Q5{Is it a performance issue?}

    Q4A -->|DNS| P9[Playbook: DNS Resolution VNet-Integrated App Service]
    Q4A -->|SNAT or outbound churn| P10[Playbook: SNAT or Application Issue]
    Q4A -->|Private endpoint/custom DNS| P11[Playbook: Private Endpoint Custom DNS Route Confusion]

    Q5 -->|Yes| Q5A{CPU, memory, or disk?}
    Q5 -->|No| P12[Use Methodology: build hypotheses from evidence]

    Q5A -->|CPU not saturated but slow| P2
    Q5A -->|Memory pressure or kill/recycle| P13[Playbook: Memory Pressure and Worker Degradation]
    Q5A -->|Disk full| P14[Playbook: No Space Left on Device]
    Q5A -->|First request very slow| P15[Playbook: Slow Start Cold Start]

5xx branch deep-dive tree

flowchart TD
    A[Observed 5xx] --> B{Status pattern}
    B -->|Mostly 500| C[Check application exceptions and endpoint concentration]
    B -->|Mostly 502| D[Check forwarding path and dependency timeouts]
    B -->|Mostly 503| E[Check warm-up, restart, and health transitions]

    C --> F{Startup related?}
    F -->|Yes| G[Container Didn't Respond to HTTP Pings]
    F -->|No| H[Slow Response but Low CPU]

    D --> I{Outbound signatures present?}
    I -->|Yes| J[SNAT or Application Issue]
    I -->|No| K[Failed to Forward Request]

    E --> L{Recent deploy or swap?}
    L -->|Yes| M[Deployment Succeeded but Startup Failed]
    L -->|Swap-specific| N[Slot Swap Failed During Warm-up]
    L -->|No| O[Intermittent 5xx Under Load]
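
The first level of the deep-dive tree above can be sketched as a small lookup, assuming you already have per-status request counts (for example, summarized from AppServiceHTTPLogs). The function name `first_check_for_5xx` and the input shape are illustrative assumptions, not part of any Azure tooling.

```python
from collections import Counter

# First check per dominant status, copied from the deep-dive tree.
FIRST_CHECK = {
    500: "Check application exceptions and endpoint concentration",
    502: "Check forwarding path and dependency timeouts",
    503: "Check warm-up, restart, and health transitions",
}


def first_check_for_5xx(status_counts):
    """Return (dominant_status, first_check) for a mapping of status -> count.

    Hypothetical helper: only 500/502/503 are considered, as in the tree.
    Returns None when none of those statuses were observed.
    """
    relevant = Counter(
        {s: n for s, n in status_counts.items() if s in FIRST_CHECK and n > 0}
    )
    if not relevant:
        return None
    dominant, _ = relevant.most_common(1)[0]
    return dominant, FIRST_CHECK[dominant]
```

The result only selects the first check. The second-level questions (startup related, outbound signatures, recent deploy or swap) still need log evidence.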

Playbook leaves (direct links)

Quick reference matrix

Symptom Pattern	Most Likely Cause Category	Playbook Link
5xx spikes only during traffic bursts	worker saturation, timeout queueing, or outbound pressure	Intermittent 5xx Under Load
503 after deployment or restart	startup/warm-up sequence instability	Deployment Succeeded but Startup Failed
502 with proxy-forward messages	front-end to worker forwarding path issue	Failed to Forward Request
Container appears running but no responses	binding mismatch or app not listening correctly	Container Didn't Respond to HTTP Pings
Swap operation fails during warm-up	warm-up endpoint mismatch or timeout	Slot Swap Failed During Warm-up
App became unstable after swap	slot config drift or restart race	Slot Swap Config Drift
High latency with low CPU	dependency wait, lock contention, sync blocking	Slow Response but Low CPU
Gradual slowdown then recycle	memory growth and worker degradation	Memory Pressure and Worker Degradation
Intermittent outbound timeout/reset	SNAT pressure or outbound connection churn	SNAT or Application Issue?
Name resolution failures in VNet integration	DNS resolver path or custom DNS mismatch	DNS Resolution (VNet-Integrated)
Private endpoint dependency unreachable	private DNS zone/routing configuration mismatch	Private Endpoint / Custom DNS Route Confusion
Errors include `No space left on device`	local filesystem exhaustion	No Space Left on Device
First request after idle/deploy is very slow	cold start behavior or startup regression	Slow Start / Cold Start
Health check reports unhealthy while app path works	warm-up vs health-check semantics confusion	Warm-up vs Health Check

Triage prompts to ask in order

1. Is it a 5xx issue? If yes, is it intermittent or constant?
2. Was there a recent restart or deployment in the incident window?
3. Is startup failing (container not ready, ping failure, warm-up timeout)?
4. Is outbound dependency failing (DNS, SNAT, private endpoint route)?
5. Is it a performance issue (CPU, memory, disk, or cold start)?
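
As a rough illustration of asking the prompts in order, the sketch below walks yes/no answers in the same sequence as the main tree and returns the first branch that matches. The `Signals` structure, its field names, and the branch labels are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Yes/no answers to the triage prompts (hypothetical structure)."""
    is_5xx: bool = False
    intermittent: bool = False
    recent_deploy_or_restart: bool = False
    startup_failing: bool = False
    outbound_failing: bool = False
    performance_issue: bool = False


def choose_branch(s: Signals) -> str:
    """Walk the prompts in order and return the first matching branch."""
    if s.is_5xx:
        if s.intermittent:
            return "5xx: intermittent under load"
        if s.recent_deploy_or_restart:
            return "5xx: constant after deploy or swap"
        return "5xx: constant, no deploy"
    if s.startup_failing:
        return "startup"
    if s.outbound_failing:
        return "outbound dependency"
    if s.performance_issue:
        return "performance"
    return "no clean match: use Troubleshooting Method"
```

Because the function returns on the first match, an incident with several causes only surfaces its earliest branch. Re-run it with individual signals cleared to see which other branches also apply.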

Minimal evidence before choosing a branch

15-minute HTTP status trend (AppServiceHTTPLogs)
Platform event timeline for restarts/deployments (AppServicePlatformLogs + Activity Log)
Console signatures for startup and outbound failures (AppServiceConsoleLogs)

AppServiceHTTPLogs
| where TimeGenerated > ago(2h)
| summarize total=count(), err5xx=countif(ScStatus >= 500 and ScStatus < 600), p95=percentile(TimeTaken,95) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("restart", "recycle", "health", "swap", "deploy", "container")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated desc

AppServiceConsoleLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("timeout", "failed", "could not bind", "No space left", "DNS", "ConnectTimeout")
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc

CLI triage bundle

az monitor activity-log list --resource-group <resource-group> --offset 24h
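az monitor metrics list --resource <app-resource-id> --metric "Http5xx,Requests,AverageResponseTime,MemoryWorkingSet" --interval PT1M
# CpuPercentage is an App Service plan metric, so query it against the plan resource.
az monitor metrics list --resource <plan-resource-id> --metric "CpuPercentage" --interval PT1M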
az webapp log show --resource-group <resource-group> --name <app-name>
az webapp config show --resource-group <resource-group> --name <app-name>

Avoid branch bias

Do not choose a branch only because it matches a familiar past issue. If the first branch is disproven by timestamps, return to the top and re-classify. Decision trees accelerate triage, but evidence still decides root cause.
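
One way to make the timestamp check concrete is to compare the first error against the deployment time. This is a minimal sketch: the function name and the 30-minute `max_lag` default are illustrative assumptions, not platform values.

```python
from datetime import datetime, timedelta


def deploy_branch_plausible(first_error, deploy_time, max_lag=timedelta(minutes=30)):
    """Return True if errors began at or after the deployment, within max_lag.

    Errors that start before the deployment, or long after it, argue
    against the deployment branch. max_lag is an illustrative threshold.
    """
    return deploy_time <= first_error <= deploy_time + max_lag
```

For example, if the deployment finished at 10:00 and the first 5xx appeared at 09:50, the function returns False. That disproves the deployment branch, so return to the top of the tree.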

Decision Tree Limits

This tree is optimized for App Service Linux OSS workloads.
Multi-cause incidents can map to more than one branch.
If no branch matches cleanly, use Troubleshooting Method and build explicit competing hypotheses.

Branch-specific first checks

If you choose the startup branch

Confirm expected port and startup command alignment.
Check whether health check path depends on unavailable dependencies.
Validate slot-specific settings when swap is part of the timeline.
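
To confirm that the app is listening on the port you expect, a quick TCP probe from inside the container can help. The sketch below assumes Python is available in the container (for example, in an SSH session) and that you supply the expected port from your own configuration. `port_is_listening` is a hypothetical helper name.

```python
import socket


def port_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A False result on the expected port alongside a running process suggests a binding mismatch, such as the app listening on a different port or interface.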

If you choose the outbound branch

Verify whether only one dependency host fails.
Compare failure windows against outbound-heavy endpoints.
Test DNS resolution and route behavior from the running app context.
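
A minimal sketch of testing resolution from the running app context, assuming Python is available where the app runs. It resolves a host and reports whether each returned address is private, which is useful when a private endpoint is expected. The function name is hypothetical.

```python
import ipaddress
import socket


def resolve_with_scope(host, port=443):
    """Resolve host and report whether each address is private.

    Returns {address: is_private}; an empty dict means resolution failed.
    """
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return {}
    result = {}
    for info in infos:
        addr = info[4][0].split("%")[0]  # drop any IPv6 zone id
        result[addr] = ipaddress.ip_address(addr).is_private
    return result
```

If a dependency that should sit behind a private endpoint resolves to a public address, the DNS path is a likely suspect. An empty result points at name resolution itself.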

If you choose the runtime degradation branch

Compare memory trend with restart cadence.
Check for No space left on device and temporary filesystem growth.
Identify whether high latency leads 5xx or follows it.
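
To decide whether latency leads or follows 5xx, compare the first time bin where each crosses a threshold. The sketch below assumes rows shaped like the output of a 5-minute summary of AppServiceHTTPLogs (bin start, p95 in milliseconds, 5xx count). The thresholds and the function name are illustrative assumptions.

```python
def latency_vs_errors(rows, p95_threshold_ms, err_threshold=1):
    """Compare the onset of slow responses with the onset of 5xx errors.

    rows: list of (bin_start, p95_ms, err5xx), sorted by bin_start ascending.
    """
    first_slow = next((t for t, p95, _ in rows if p95 >= p95_threshold_ms), None)
    first_err = next((t for t, _, err in rows if err >= err_threshold), None)
    if first_slow is None or first_err is None:
        return "inconclusive"
    if first_slow < first_err:
        return "latency leads 5xx"
    if first_slow > first_err:
        return "latency follows 5xx"
    return "same bin"
```

Latency that leads 5xx is consistent with growing waits or resource pressure before failures begin. Latency that follows 5xx suggests the slowdown may be a side effect of the failures, such as restarts or retries.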

Practical triage examples

Intermittent 502 + connect timeout logs + burst traffic
- Decision tree branch: 5xx → intermittent → outbound candidate.
- Start with SNAT or Application Issue?
Deployment succeeded + immediate 503 + ping failures
- Decision tree branch: restart/deployment → startup failing.
- Start with Deployment Succeeded but Startup Failed.
Latency grows over hours + recycle + memory climb
- Decision tree branch: performance → memory.
- Start with Memory Pressure and Worker Degradation.