Skip to content

Troubleshooting Decision Tree

Use this page when you need to triage quickly from symptom to likely failure category and then open the right playbook.

The tree is intentionally symptom-first and optimized for the first 10–15 minutes of incident response.

Portal view: Diagnose and solve problems as the triage entry point

Diagnose and solve problems blade for app-test-20251107 showing the Common Solutions tab with a Risk alerts panel (Availability 2 Critical, View more details link) and a Troubleshooting categories grid with seven cards: Availability and Performance (Application Logs, App Down Workflow, Web App Down links), Configuration and Management (Investigate EasyAuth errors, IP Address Configuration, All Scaling Operations), Risk Assessments (Availability risks, Configuration risks), Deployment (Troubleshoot link), Networking (Troubleshoot link), Diagnostic Tools (Auto-Heal, Network Troubleshooter, Advanced Application Restart), and Load Test your App (Create Load Test link). A Popular troubleshooting tools list at the bottom shows Application Logs, App Down Workflow, Web App Down, Web App Slow, and Process Full List.

Before walking the manual decision tree below, open Diagnose and solve problems in the Portal — its Risk alerts panel and Common Solutions may already point directly to the failure category. The seven Troubleshooting categories correspond to the branches in the decision tree: Availability and Performance covers the 5xx and startup branches, Networking covers the outbound dependency branch, Configuration and Management covers the slot swap and config drift branches, and Deployment covers the post-deploy regression branch. Click App Down Workflow for an automated walkthrough that runs many of the queries from the decision tree below in a guided sequence.

Main triage decision tree

flowchart TD
    S[Incident starts: user-visible impact] --> Q1{Is it a 5xx issue?}

    Q1 -->|Yes| Q1A{Intermittent or constant?}
    Q1 -->|No| Q2{Is startup failing?}

    Q1A -->|Intermittent under load| P1[Playbook: Intermittent 5xx Under Load]
    Q1A -->|Constant after deploy or swap| Q3{Was there a recent restart or deployment?}
    Q1A -->|Constant, no deploy| P2[Playbook: Slow Response but Low CPU]

    Q3 -->|Yes, startup symptoms| P3[Playbook: Deployment Succeeded but Startup Failed]
    Q3 -->|Yes, proxy forward errors| P4[Playbook: Failed to Forward Request]
    Q3 -->|Yes, slot behavior changed| P5[Playbook: Slot Swap Config Drift]
    Q3 -->|No clear deploy signal| P6[Playbook: Container Didn't Respond to HTTP Pings]

    Q2 -->|Yes| Q2A{Container up but unhealthy?}
    Q2 -->|No| Q4{Is outbound dependency failing?}

    Q2A -->|Startup timeout or ping failure| P6
    Q2A -->|Warm-up confusion| P7[Playbook: Warm-up vs Health Check]
    Q2A -->|Swap warm-up failure| P8[Playbook: Slot Swap Failed During Warm-up]

    Q4 -->|Yes| Q4A{DNS, SNAT, or private endpoint route?}
    Q4 -->|No| Q5{Is it a performance issue?}

    Q4A -->|DNS| P9[Playbook: DNS Resolution VNet-Integrated App Service]
    Q4A -->|SNAT or outbound churn| P10[Playbook: SNAT or Application Issue]
    Q4A -->|Private endpoint/custom DNS| P11[Playbook: Private Endpoint Custom DNS Route Confusion]

    Q5 -->|Yes| Q5A{CPU, memory, or disk?}
    Q5 -->|No| P12[Use Methodology: build hypotheses from evidence]

    Q5A -->|CPU not saturated but slow| P2
    Q5A -->|Memory pressure or kill/recycle| P13[Playbook: Memory Pressure and Worker Degradation]
    Q5A -->|Disk full| P14[Playbook: No Space Left on Device]
    Q5A -->|First request very slow| P15[Playbook: Slow Start Cold Start]

5xx branch deep-dive tree

flowchart TD
    A[Observed 5xx] --> B{Status pattern}
    B -->|Mostly 500| C[Check application exceptions and endpoint concentration]
    B -->|Mostly 502| D[Check forwarding path and dependency timeouts]
    B -->|Mostly 503| E[Check warm-up, restart, and health transitions]

    C --> F{Startup related?}
    F -->|Yes| G[Container Didn't Respond to HTTP Pings]
    F -->|No| H[Slow Response but Low CPU]

    D --> I{Outbound signatures present?}
    I -->|Yes| J[SNAT or Application Issue]
    I -->|No| K[Failed to Forward Request]

    E --> L{Recent deploy or swap?}
    L -->|Yes| M[Deployment Succeeded but Startup Failed]
    L -->|Swap-specific| N[Slot Swap Failed During Warm-up]
    L -->|No| O[Intermittent 5xx Under Load]

Quick reference matrix

Symptom Pattern Most Likely Cause Category Playbook Link
5xx spikes only during traffic bursts worker saturation, timeout queueing, or outbound pressure Intermittent 5xx Under Load
503 after deployment or restart startup/warm-up sequence instability Deployment Succeeded but Startup Failed
502 with proxy-forward messages front-end to worker forwarding path issue Failed to Forward Request
Container appears running but no responses binding mismatch or app not listening correctly Container Didn't Respond to HTTP Pings
Swap operation fails during warm-up warm-up endpoint mismatch or timeout Slot Swap Failed During Warm-up
App became unstable after swap slot config drift or restart race Slot Swap Config Drift
High latency with low CPU dependency wait, lock contention, sync blocking Slow Response but Low CPU
Gradual slowdown then recycle memory growth and worker degradation Memory Pressure and Worker Degradation
Intermittent outbound timeout/reset SNAT pressure or outbound connection churn SNAT or Application Issue?
Name resolution failures in VNet integration DNS resolver path or custom DNS mismatch DNS Resolution (VNet-Integrated)
Private endpoint dependency unreachable private DNS zone/routing configuration mismatch Private Endpoint / Custom DNS Route Confusion
Errors include No space left on device local filesystem exhaustion No Space Left on Device
First request after idle/deploy is very slow cold start behavior or startup regression Slow Start / Cold Start
Health check reports unhealthy while app path works warm-up vs health-check semantics confusion Warm-up vs Health Check

Triage prompts to ask in order

  1. Is it a 5xx issue? If yes, is it intermittent or constant?
  2. Was there a recent restart or deployment in the incident window?
  3. Is startup failing (container not ready, ping failure, warm-up timeout)?
  4. Is outbound dependency failing (DNS, SNAT, private endpoint route)?
  5. Is it a performance issue (CPU, memory, disk, or cold start)?

Minimal evidence before choosing a branch

  • 15-minute HTTP status trend (AppServiceHTTPLogs)
  • platform event timeline for restarts/deployments (AppServicePlatformLogs + Activity Log)
  • console signatures for startup and outbound failures (AppServiceConsoleLogs)
AppServiceHTTPLogs
| where TimeGenerated > ago(2h)
| summarize total=count(), err5xx=countif(ScStatus >= 500 and ScStatus < 600), p95=percentile(TimeTaken,95) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("restart", "recycle", "health", "swap", "deploy", "container")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated desc
AppServiceConsoleLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("timeout", "failed", "could not bind", "No space left", "DNS", "ConnectTimeout")
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc

CLI triage bundle

az monitor activity-log list --resource-group <resource-group> --offset 24h
az monitor metrics list --resource <app-resource-id> --metric "Http5xx,Requests,AverageResponseTime,CpuPercentage,MemoryWorkingSet" --interval PT1M
az webapp log show --resource-group <resource-group> --name <app-name>
az webapp config show --resource-group <resource-group> --name <app-name>

Avoid branch bias

Do not choose a branch only because it matches a familiar past issue. If the first branch is disproven by timestamps, return to the top and re-classify. Decision trees accelerate triage, but evidence still decides root cause.

Decision Tree Limits

  • This tree is optimized for App Service Linux OSS workloads.
  • Multi-cause incidents can map to more than one branch.
  • If no branch matches cleanly, use Troubleshooting Method and build explicit competing hypotheses.

Branch-specific first checks

If you choose the startup branch

  • Confirm expected port and startup command alignment.
  • Check whether health check path depends on unavailable dependencies.
  • Validate slot-specific settings when swap is part of the timeline.

If you choose the outbound branch

  • Verify whether only one dependency host fails.
  • Compare failure windows against outbound-heavy endpoints.
  • Test DNS resolution and route behavior from the running app context.

If you choose the runtime degradation branch

  • Compare memory trend with restart cadence.
  • Check for No space left on device and temporary filesystem growth.
  • Identify whether high latency leads 5xx or follows it.

Practical triage examples

  1. Intermittent 502 + connect timeout logs + burst traffic
  2. Deployment succeeded + immediate 503 + ping failures
  3. Latency grows over hours + recycle + memory climb

See Also

Sources