Evidence Map for App Service Troubleshooting¶

This page maps common investigation questions to the best evidence source, the CLI command to run, and the KQL table/query to use.

Use it when you know what you need to answer but not where to collect proof.

flowchart TD
    Q[Investigation Question] --> S[Evidence Source]
    S --> CLI[CLI Command]
    S --> KQL[KQL Query]
    CLI --> P[Proof Artifact]
    KQL --> P
    P --> H[Hypothesis Validation]

Why an evidence map¶

During incidents, teams lose time by checking the wrong signal first.

HTTP errors are checked in CPU charts.
Startup issues are checked only in app code.
DNS/SNAT issues are diagnosed without outbound evidence.

An evidence map reduces this by pairing each question with a reproducible command and query.

Use CLI and query artifacts for reproducible investigations

Capturing log results from the browser is awkward and hard to maintain, so prefer CLI queries and saved example outputs. This keeps the investigation reproducible and makes the evidence easier to copy and interpret.
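One way to make CLI output a durable artifact is to wrap each command so that the command line, a UTC timestamp, and the output land in one file. The sketch below is illustrative only: the `capture_evidence` function name and the `EVIDENCE_DIR` variable are hypothetical and not part of the Azure CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Azure CLI feature): run any command and store
# its command line, a UTC timestamp, and its combined stdout/stderr in one
# file that can be attached to the incident ticket.
capture_evidence() {
  local label="$1"; shift
  local dir="${EVIDENCE_DIR:-./evidence}"   # assumed output directory
  local stamp out status
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  out="${dir}/${stamp}-${label}.txt"
  mkdir -p "$dir"
  {
    printf '# command: %s\n' "$*"
    printf '# captured_utc: %s\n' "$stamp"
    "$@"
  } > "$out" 2>&1
  status=$?
  printf '%s\n' "$out"    # print the artifact path for the ticket notes
  return "$status"        # preserve the wrapped command's exit status
}

# Example usage (RESOURCE_GROUP is an assumed shell variable):
#   capture_evidence activity-log \
#     az monitor activity-log list --resource-group "$RESOURCE_GROUP" --offset 24h
```

The function prints the path of the file it wrote, so the path can be pasted into the ticket next to the query text.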

Portal view: Application Insights Logs query editor¶

The Logs blade is the canonical entry point for every KQL snippet in this evidence map. The Time range: Last 24 hours selector is the first control to adjust during an incident. Most queries below scope to ago(6h) or ago(24h), so the blade-level window must match the | where TimeGenerated > ago(...) clause; a stale time filter that is narrower than the query window produces empty results. The KQL mode dropdown distinguishes raw Kusto from Simple mode; all queries on this page assume KQL mode. The empty Query history panel here means a fresh workspace. During real incidents this panel becomes the reproducibility artifact you preserve in the ticket alongside CLI output.
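On the CLI side, the same alignment can be kept by deriving one explicit start and end timestamp and reusing it for every command. This is a minimal sketch that assumes GNU `date` (it will not work with BSD/macOS `date`); the `incident_window` function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: incident_window END_UTC HOURS
# Prints "START END" as ISO 8601 UTC timestamps, where START = END - HOURS.
# Assumes GNU date for the -d option.
incident_window() {
  local end="$1" hours="$2"
  local end_epoch start_epoch
  end_epoch="$(date -u -d "$end" +%s)" || return 1
  start_epoch=$(( end_epoch - hours * 3600 ))
  printf '%s %s\n' \
    "$(date -u -d "@${start_epoch}" +%Y-%m-%dT%H:%M:%SZ)" \
    "$(date -u -d "@${end_epoch}" +%Y-%m-%dT%H:%M:%SZ)"
}

# Example usage (APP_RESOURCE_ID is an assumed shell variable):
#   read -r start end < <(incident_window 2024-05-01T12:00:00Z 6)
#   az monitor metrics list --resource "$APP_RESOURCE_ID" --metric Requests \
#     --start-time "$start" --end-time "$end" --interval PT1M
```

The same two timestamps can be pasted into a KQL filter such as `| where TimeGenerated between (datetime(<start>) .. datetime(<end>))`, so metrics and logs cover the same window.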

Quick Map (Question → Source → Command → Table)¶

Question	Best Source	CLI Command	KQL Table
Was the app restarting?	Platform logs + Activity Log	`az monitor activity-log list --resource-group <resource-group> --offset 24h`	`AppServicePlatformLogs`
Were requests failing?	HTTP logs	`az monitor metrics list --resource <app-resource-id> --metric "Http5xx,Requests" --interval PT1M`	`AppServiceHTTPLogs`
Was startup failing?	Console logs	`az webapp log tail --resource-group <resource-group> --name <app-name>`	`AppServiceConsoleLogs`
Was a dependency slow?	App logs + latency trend	`az monitor metrics list --resource <app-resource-id> --metric "AverageResponseTime" --interval PT1M`	`AppServiceAppLogs`
Was DNS failing?	Console/app logs + runtime test output	`az webapp ssh --resource-group <resource-group> --name <app-name>`	`AppServiceConsoleLogs`
Was scale involved?	Plan-level metrics + platform events	`az monitor metrics list --resource <app-service-plan-resource-id> --metric "CpuPercentage,MemoryPercentage" --interval PT1M`	`AppServicePlatformLogs`
Was disk full?	Console logs + filesystem command output	`az webapp log tail --resource-group <resource-group> --name <app-name>`	`AppServiceConsoleLogs`
Was memory exhausted?	Process/platform signals	`az monitor metrics list --resource <app-resource-id> --metric "MemoryWorkingSet" --interval PT1M`	`AppServicePlatformLogs`
Was SNAT exhausted?	Outbound diagnostics + errors	`az monitor metrics list --resource <app-resource-id> --metric "Http5xx,Requests" --interval PT1M`	`AppServiceConsoleLogs`
Was there a deployment?	Activity Log	`az monitor activity-log list --resource-group <resource-group> --offset 24h --status Succeeded`	`AppServicePlatformLogs`
Was health check failing?	Platform health signals	`az webapp show --resource-group <resource-group> --name <app-name>`	`AppServicePlatformLogs`
Was slot swap involved?	Activity Log + swap events	`az webapp deployment slot list --resource-group <resource-group> --name <app-name>`	`AppServicePlatformLogs`
Was there a config change?	Activity Log + app settings snapshot	`az webapp config appsettings list --resource-group <resource-group> --name <app-name>`	`AppServicePlatformLogs`
Was the container killed?	Platform + console kill messages	`az webapp log tail --resource-group <resource-group> --name <app-name>`	`AppServicePlatformLogs`
Were there network errors?	Console logs	`az webapp log tail --resource-group <resource-group> --name <app-name>`	`AppServiceConsoleLogs`
Did warm-up fail during slot swap?	Swap diagnostics + startup logs	`az monitor activity-log list --resource-group <resource-group> --offset 6h`	`AppServicePlatformLogs`

Detailed Evidence Recipes¶

1) Was the app restarting?¶

CLI¶

az monitor activity-log list --resource-group <resource-group> --offset 24h

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("restart", "recycle", "container", "stopped", "started")
| project TimeGenerated, OperationName, ResultDescription, Host
| order by TimeGenerated desc

2) Were requests failing?¶

CLI¶

az monitor metrics list --resource <app-resource-id> --metric "Http5xx,Requests" --interval PT1M

KQL¶

AppServiceHTTPLogs
| where TimeGenerated > ago(6h)
| summarize total=count(), err5xx=countif(ScStatus >= 500 and ScStatus < 600) by bin(TimeGenerated, 5m)
| extend errPct=todouble(err5xx)*100.0/iif(total==0,1,total)
| order by TimeGenerated asc

3) Was startup failing?¶

Portal view: Log stream blade streaming runtime logs¶

The Log stream blade is the fastest way to confirm whether startup is silent (no logs at all) or failing (errors visible in real time). The Runtime radio selects application logs from your container's stdout/stderr, which is where Python tracebacks, Node.js Error: listen EADDRINUSE, and Java BindException surface. Switch to Platform when investigating whether App Service itself is restarting the container or printing health-check failures. The Instances dropdown matters during multi-instance incidents because each worker streams independently: if startup is failing on only one instance, you must step through each instance ID to find it. The CLI command below produces the same stream without the browser-side UI lag.

CLI¶

az webapp log tail --resource-group <resource-group> --name <app-name>

KQL¶

AppServiceConsoleLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("failed to start", "could not bind", "listen", "startup", "didn't respond")
| project TimeGenerated, ResultDescription, Host
| order by TimeGenerated desc

4) Was a dependency slow?¶

Portal view: Application Insights Overview dashboard¶

The Application Insights Overview is the right starting point when the question is "is a dependency slow?" because Server response time is the primary signal: values well above the historical baseline indicate downstream latency. Pair this with Failed requests to distinguish slow-but-succeeding calls from outright dependency failures. The Logs button in the command bar opens the same KQL editor used by the query below, but you should also check Application Dashboard for the curated dependency view. The 1 hour window is the right scope for active incidents; widen to 1 day to see whether the latency pattern is recurring (e.g., scheduled batch jobs or noisy neighbors). The Logs workspace field shows that this is a workspace-based Application Insights resource, which is required if you want to query the AppDependencies and AppRequests tables. The KQL below uses AppServiceAppLogs, which comes from App Service diagnostic settings rather than from Application Insights.

CLI¶

az monitor metrics list --resource <app-resource-id> --metric "AverageResponseTime" --interval PT1M

KQL¶

AppServiceAppLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("timeout", "dependency", "upstream", "database", "redis", "key vault")
| summarize hits=count() by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

5) Was DNS failing?¶

CLI¶

az webapp ssh --resource-group <resource-group> --name <app-name>

Inside the session, run:

nslookup <dependency-hostname>
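When several dependencies are suspect, the lookup can be repeated per hostname and summarized as pass/fail lines for the ticket. The sketch below is an illustration, not a platform tool: `check_dns` and the `RESOLVE_CMD` override are hypothetical names, and it assumes the resolver command exits non-zero when a lookup fails, which is not true of every `nslookup` build.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check_dns HOST...
# Resolves each hostname and prints one PASS/FAIL line per host.
# RESOLVE_CMD is an assumed override; it defaults to nslookup.
check_dns() {
  local resolver="${RESOLVE_CMD:-nslookup}"
  local host failures=0
  for host in "$@"; do
    if "$resolver" "$host" >/dev/null 2>&1; then
      printf 'PASS %s\n' "$host"
    else
      printf 'FAIL %s\n' "$host"
      failures=$((failures + 1))
    fi
  done
  # Return non-zero if any lookup failed.
  [ "$failures" -eq 0 ]
}

# Example usage inside the SSH session (hostnames are placeholders):
#   check_dns mydb.example.internal mycache.example.internal
```

Because the output is one line per hostname, it can be pasted directly next to the matching console-log query results.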

KQL¶

AppServiceConsoleLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("Name or service not known", "Temporary failure in name resolution", "getaddrinfo", "DNS")
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc

6) Was scale involved?¶

Portal view: Metrics blade empty state with metric selector¶

The Metrics blade is the visual companion to the az monitor metrics list CLI command below. The required fields are the same: Scope (the App Service resource, or the App Service plan for plan-level metrics such as CpuPercentage), Metric Namespace (App Service standard metrics), Metric (e.g., CpuPercentage, MemoryWorkingSet, Http5xx), and Aggregation (Avg, Max, Sum, Count). For "was scale involved?" the canonical chart pairs CpuPercentage Max with Requests Sum over the same window: a CPU spike that precedes a Requests drop indicates autoscale lag, while spikes that align indicate normal capacity tracking. Use Apply splitting to split a single metric by Instance to see whether one worker is hot, which is a critical signal for diagnosing noisy-neighbor or sticky-session imbalance. The Drill into Logs button is the bridge from the chart to the matching KQL query.

CLI¶

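az monitor metrics list --resource <app-service-plan-resource-id> --metric "CpuPercentage,MemoryPercentage" --interval PT1M
az monitor metrics list --resource <app-resource-id> --metric "MemoryWorkingSet,Http5xx,AverageResponseTime" --interval PT1M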

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("scale", "instance", "restart", "recycle")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated asc

7) Was disk full?¶

CLI¶

az webapp log tail --resource-group <resource-group> --name <app-name>

Use SSH to confirm with:

df -h
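To turn the filesystem output into a short artifact, the mounts above a usage threshold can be filtered out of the POSIX-format listing. This is a sketch under assumptions: `flag_full_mounts` is a hypothetical name, the 90 percent default is arbitrary, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flag_full_mounts [THRESHOLD]
# Reads `df -P` output on stdin and prints "<mount> <use%>" for every
# filesystem whose capacity is at or above THRESHOLD (default 90).
flag_full_mounts() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" 'NR > 1 {
    use = $5            # fifth column of df -P is Capacity, e.g. "95%"
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Example usage inside the SSH session:
#   df -P | flag_full_mounts 90
```

An empty result means no mount reached the threshold, which is itself worth recording in the ticket.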

KQL¶

AppServiceConsoleLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("No space left on device", "ENOSPC", "disk full")
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc

8) Was memory exhausted?¶

CLI¶

az monitor metrics list --resource <app-resource-id> --metric "MemoryWorkingSet" --interval PT1M

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("OOM", "killed", "memory", "SIGKILL")
| project TimeGenerated, OperationName, ResultDescription, Host
| order by TimeGenerated desc

9) Was SNAT exhausted?¶

CLI¶

az monitor metrics list --resource <app-resource-id> --metric "Http5xx,Requests,AverageResponseTime" --interval PT1M

KQL¶

AppServiceConsoleLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("connect timed out", "ReadTimeout", "ConnectTimeout", "socket", "ECONNRESET")
| summarize errors=count() by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

10) Was there a deployment?¶

Portal view: Activity log blade for the Web App¶

The Activity log is the authoritative record of every control-plane operation against this Web App: deployments, configuration changes, slot swaps, and scale operations all appear as discrete Operation name entries with the initiating principal. For "was there a deployment?" the operations to look for are Microsoft.Web/sites/publish/Action, Update Site, Update Configuration, and Update App Settings. The Event initiated by column distinguishes user-driven deployments from automated CI/CD service principals, which is critical for separating intentional deploys from third-party automation. The Timespan: Last 6 hours chip should match the incident window; the Add Filter button lets you scope further by operation name. Use Export Activity Logs to set up continuous export to Log Analytics, where control-plane events land in the AzureActivity table and can be correlated with the platform-log query below.

CLI¶

az monitor activity-log list --resource-group <resource-group> --offset 24h --status Succeeded

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where OperationName has_any ("Deploy", "Publish", "SiteConfig", "Container")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated desc

11) Was health check failing?¶

CLI¶

az webapp show --resource-group <resource-group> --name <app-name>

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("health check", "unhealthy", "warmup", "probe")
| project TimeGenerated, OperationName, ResultDescription, Host
| order by TimeGenerated desc

12) Was slot swap involved?¶

CLI¶

az webapp deployment slot list --resource-group <resource-group> --name <app-name>

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("swap", "slot", "warm-up", "warmed up")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated desc

13) Was there a config change?¶

CLI¶

az webapp config appsettings list --resource-group <resource-group> --name <app-name>
az monitor activity-log list --resource-group <resource-group> --offset 24h

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where OperationName has_any ("Update Site", "Update App Settings", "Update Configuration")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated desc

14) Was the container killed?¶

CLI¶

az webapp log tail --resource-group <resource-group> --name <app-name>

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("killed", "SIGKILL", "exit code", "container stopped", "OOM")
| project TimeGenerated, OperationName, ResultDescription, Host
| order by TimeGenerated desc

15) Were there network errors?¶

CLI¶

az webapp log tail --resource-group <resource-group> --name <app-name>

KQL¶

AppServiceConsoleLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("connection refused", "connection reset", "ENETUNREACH", "EHOSTUNREACH", "timed out")
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc

16) Did warm-up fail during slot swap?¶

CLI¶

az monitor activity-log list --resource-group <resource-group> --offset 6h

KQL¶

AppServicePlatformLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("swap", "warm-up", "did not respond", "health check")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated desc

Evidence Quality Checklist¶

Keep all evidence in one incident time window.
Correlate HTTP, console, and platform signals before selecting a root cause.
Preserve query text used during the incident for post-incident review.
Capture command outputs in ticket notes with sensitive identifiers removed.
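For the last checklist item, a simple filter can mask GUID-shaped values such as subscription and tenant IDs before output is pasted into a ticket. This is a minimal sketch: `redact_ids` is a hypothetical name, and it only covers GUIDs, so hostnames, keys, and connection strings still need their own patterns or manual review.

```shell
#!/usr/bin/env bash
# Hypothetical helper: redact_ids
# Reads text on stdin and replaces every GUID-shaped token
# (8-4-4-4-12 hex digits) with a fixed placeholder.
redact_ids() {
  sed -E 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/<redacted-guid>/g'
}

# Example usage (RESOURCE_GROUP is an assumed shell variable):
#   az monitor activity-log list --resource-group "$RESOURCE_GROUP" --offset 24h \
#     | redact_ids > activity-log-redacted.json
```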
