
Quick Diagnosis Cards

One-page reference cards for rapid incident triage. Each card maps: Symptom → First Query → Platform Segment → Playbook.

Use these when you have 60 seconds to identify the failure category.


Card 1: App Returns 503 After Deployment

graph TD
    A[503 after deploy] --> B[First Query]
    B --> C[Platform Segment]
    C --> D[Playbook]
| Step | Action |
|---|---|
| Symptom | All requests return 503 immediately after deployment or restart |
| First Query | `AppServiceConsoleLogs \| where TimeGenerated > ago(15m) \| where ResultDescription has_any ("failed", "error", "exception", "listening") \| take 50` |
| What to Look For | Missing startup logs = startup command issue. Listening at 127.0.0.1 = wrong bind. Traceback = app crash. |
| Platform Segment | Startup / Availability |
| Playbook | Deployment Succeeded but Startup Failed |

Quick CLI Check:

az webapp log tail --resource-group <resource-group> --name <app-name>
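
The "What to Look For" rules for this card can be sketched as a small classifier over captured console lines. This is a minimal illustration, not part of any Azure tooling: the function name `classify_startup_logs` and the returned labels are hypothetical, and the substring checks are a simplification of reading the log by eye.

```python
def classify_startup_logs(lines):
    """Map console log lines to the likely Card 1 failure category."""
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        # No startup output at all points at the startup command.
        return "startup command issue"
    if any("Traceback" in ln for ln in lines):
        return "app crash"
    if any("listening" in ln.lower() and "127.0.0.1" in ln for ln in lines):
        # Bound to loopback only, so the platform cannot reach the app.
        return "wrong bind"
    return "inconclusive"
```

An "inconclusive" result means none of the three signatures matched, so the full playbook is the next step.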

Card 2: Intermittent 5xx Under Load

| Step | Action |
|---|---|
| Symptom | 5xx errors appear during traffic spikes, recover when load drops |
| First Query | `AppServiceHTTPLogs \| where TimeGenerated > ago(1h) \| summarize total=count(), err5xx=countif(ScStatus >= 500) by bin(TimeGenerated, 5m) \| order by TimeGenerated asc` |
| What to Look For | 5xx spikes correlating with request volume. High TimeTaken on failed requests. |
| Platform Segment | Performance |
| Playbook | Intermittent 5xx Under Load |

Quick CLI Check:

az monitor metrics list --resource <app-resource-id> --metric "Http5xx,Requests,CpuPercentage" --interval PT1M
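
To make "5xx spikes correlating with request volume" concrete, the per-bin `total` and `err5xx` values from the first query can be checked with a Pearson correlation. The sketch below assumes the bins have been exported as `(total, err5xx)` pairs; the function names and the 0.7 threshold are illustrative choices, not values from the playbook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(vx * vy)


def load_correlated(bins, threshold=0.7):
    """bins: list of (total, err5xx) per 5-minute bucket, in time order."""
    totals = [total for total, _ in bins]
    errors = [err for _, err in bins]
    return pearson(totals, errors) >= threshold
```

A constant error count across bins returns `False`, which suggests the 5xx responses are not load-driven and another card may fit better.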

Card 3: Outbound Connection Timeouts

| Step | Action |
|---|---|
| Symptom | Requests to external APIs/databases time out intermittently |
| First Query | `AppServiceConsoleLogs \| where TimeGenerated > ago(1h) \| where ResultDescription has_any ("connect timed out", "ReadTimeout", "ConnectTimeout", "ECONNRESET") \| summarize count() by bin(TimeGenerated, 5m)` |
| What to Look For | Connection errors increasing over time. Correlation with high outbound request volume. |
| Platform Segment | Outbound / Network |
| Playbook | SNAT or Application Issue |

Quick CLI Check (Linux):

# Check for connection patterns in logs
az webapp log tail --resource-group <resource-group> --name <app-name> | grep -i "timeout\|reset\|refused"
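
When only a local log capture is available, the 5-minute bucketing done by `summarize count() by bin(TimeGenerated, 5m)` can be approximated in Python. This is a rough sketch: the `(datetime, message)` input shape and the name `bin_timeouts` are assumptions, and the case-insensitive substring match is looser than KQL's term-based `has_any`.

```python
from collections import Counter
from datetime import datetime

SIGNATURES = ("connect timed out", "readtimeout", "connecttimeout", "econnreset")


def bin_timeouts(entries, minutes=5):
    """Count matching log entries per time bucket.

    entries: iterable of (datetime, message) pairs.
    Returns {bucket_start: count}, ordered by bucket start.
    """
    counts = Counter()
    for ts, message in entries:
        if any(sig in message.lower() for sig in SIGNATURES):
            # Floor the timestamp to the start of its bucket.
            bucket = ts.replace(minute=ts.minute - ts.minute % minutes,
                                second=0, microsecond=0)
            counts[bucket] += 1
    return dict(sorted(counts.items()))
```

Counts that rise from bucket to bucket match the "increasing over time" pattern this card describes.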

Card 4: DNS Resolution Failures (VNet)

| Step | Action |
|---|---|
| Symptom | App cannot resolve private endpoint FQDNs or custom DNS names |
| First Query | `AppServiceConsoleLogs \| where TimeGenerated > ago(1h) \| where ResultDescription has_any ("Name or service not known", "getaddrinfo", "DNS", "NXDOMAIN") \| take 50` |
| What to Look For | Resolution failures for `*.privatelink.*` domains. Public IP returned instead of private. |
| Platform Segment | Outbound / Network |
| Playbook | DNS Resolution VNet-Integrated |

Quick SSH Check (Linux):

# SSH into container and test resolution
az webapp ssh --resource-group <resource-group> --name <app-name>
# Then run: nslookup <private-endpoint-fqdn>
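
The "public IP returned instead of private" check can also be scripted from inside the container if Python is available there. The sketch below uses only the standard library; the function names are hypothetical. Note that `ipaddress` treats loopback and link-local ranges as private too, so read the result alongside the actual address.

```python
import ipaddress
import socket


def classify_address(ip):
    """Return 'private' or 'public' for an IP address string."""
    # Drop an IPv6 zone suffix such as '%eth0' before parsing.
    address = ipaddress.ip_address(ip.split("%")[0])
    return "private" if address.is_private else "public"


def check_private_endpoint(fqdn):
    """Resolve fqdn and classify every address it returns.

    Raises socket.gaierror when resolution fails outright.
    """
    infos = socket.getaddrinfo(fqdn, None)
    addresses = sorted({info[4][0] for info in infos})
    return {ip: classify_address(ip) for ip in addresses}
```

A private endpoint FQDN that maps to a "public" address usually means the query is not reaching the private DNS zone.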

Card 5: Slow First Request (Cold Start)

| Step | Action |
|---|---|
| Symptom | First request after deploy or idle period takes 10-60+ seconds |
| First Query | `AppServiceHTTPLogs \| where TimeGenerated > ago(1h) \| where TimeTaken > 10000 \| project TimeGenerated, CsUriStem, TimeTaken, ScStatus \| order by TimeTaken desc` |
| What to Look For | High TimeTaken on first requests only. Subsequent requests normal. |
| Platform Segment | Performance |
| Playbook | Slow Start / Cold Start |

Quick CLI Check (Linux):

# Stream logs to see startup sequence
az webapp log tail --resource-group <resource-group> --name <app-name>
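
The cold-start signature (slow first request, normal requests afterwards) can be expressed as a simple check over `TimeTaken` values in arrival order. This is an illustrative sketch: the function name is hypothetical, and the 10,000 ms default mirrors the `TimeTaken > 10000` filter in the query above.

```python
def looks_like_cold_start(time_taken_ms, slow_ms=10_000):
    """time_taken_ms: request durations in arrival order after a deploy or idle period."""
    if len(time_taken_ms) < 2:
        return False  # one request cannot show the "first slow, rest fast" shape
    first, rest = time_taken_ms[0], time_taken_ms[1:]
    return first > slow_ms and all(t <= slow_ms for t in rest)
```

If later requests are slow as well, the pattern points away from cold start and toward the load or memory cards.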

Card 6: Memory Pressure / Worker Restarts

| Step | Action |
|---|---|
| Symptom | App becomes slow, then restarts. Pattern repeats. |
| First Query | `AppServicePlatformLogs \| where TimeGenerated > ago(6h) \| where ResultDescription has_any ("OOM", "killed", "memory", "SIGKILL", "recycle") \| project TimeGenerated, ResultDescription` |
| What to Look For | OOM kill messages. Restart timing correlating with memory growth. |
| Platform Segment | Performance |
| Playbook | Memory Pressure and Worker Degradation |

Quick CLI Check:

az monitor metrics list --resource <app-resource-id> --metric "MemoryWorkingSet" --interval PT5M
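
A repeating grow-then-restart cycle shows up in the `MemoryWorkingSet` series as a sawtooth: steady growth followed by a sharp drop. A minimal sketch of spotting those drops, assuming the metric values have been extracted into a plain list in time order; the function name and the 50% drop ratio are illustrative assumptions.

```python
def find_restart_drops(samples, drop_ratio=0.5):
    """Return indices where memory fell by more than drop_ratio versus the previous sample."""
    drops = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev > 0 and cur < prev * (1 - drop_ratio):
            drops.append(i)
    return drops
```

Drop indices that line up with restart entries from the platform log query support the memory-pressure diagnosis.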

Card 7: Slot Swap Broke the App

| Step | Action |
|---|---|
| Symptom | App worked in staging slot, fails after swap to production |
| First Query | `AppServicePlatformLogs \| where TimeGenerated > ago(6h) \| where ResultDescription has_any ("swap", "slot", "warm-up") \| project TimeGenerated, ResultDescription` |
| What to Look For | Config values that should have stayed in production slot. Connection strings pointing to wrong environment. |
| Platform Segment | Startup / Availability |
| Playbook | Slot Swap Config Drift |

Quick CLI Check:

# Compare app settings between slots
az webapp config appsettings list --resource-group <resource-group> --name <app-name> --slot staging
az webapp config appsettings list --resource-group <resource-group> --name <app-name>
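
Comparing the two listings by eye is error-prone, so the JSON output of both commands can be diffed with a short script. This sketch assumes each command's output was saved as JSON text containing a list of objects with `name`, `value`, and `slotSetting` fields; the function name and report labels are hypothetical.

```python
import json


def diff_slot_settings(staging_json, production_json):
    """Report app settings that differ between the staging and production slots."""
    staging = {s["name"]: s for s in json.loads(staging_json)}
    production = {s["name"]: s for s in json.loads(production_json)}
    report = {}
    for name in sorted(staging.keys() | production.keys()):
        s, p = staging.get(name), production.get(name)
        if s is None:
            report[name] = "only in production"
        elif p is None:
            report[name] = "only in staging"
        elif s.get("value") != p.get("value"):
            # Settings marked as slot settings stay with the slot during a swap.
            sticky = s.get("slotSetting") or p.get("slotSetting")
            report[name] = "differs (slot setting)" if sticky else "differs (not sticky)"
    return report
```

Entries reported as "differs (not sticky)" are the first candidates to review, since non-sticky values travel with the app during a swap.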

Card 8: Disk Full / No Space Left

| Step | Action |
|---|---|
| Symptom | Errors include "No space left on device" or ENOSPC |
| First Query | `AppServiceConsoleLogs \| where TimeGenerated > ago(24h) \| where ResultDescription has_any ("No space left", "ENOSPC", "disk full") \| take 50` |
| What to Look For | Temp file accumulation. Log rotation not working. Large uploads filling /tmp. |
| Platform Segment | Performance |
| Playbook | No Space Left on Device |

Quick SSH Check (Linux):

# SSH into container
az webapp ssh --resource-group <resource-group> --name <app-name>
# Then run:
df -h
du -sh /tmp/* | sort -h | tail -20
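
If the container image lacks `du` or `sort`, a similar ranking can be produced with Python's standard library. This is a rough equivalent rather than a drop-in replacement: the function name is hypothetical, and it sums apparent file sizes instead of allocated disk blocks, so the numbers can differ from `du`.

```python
import os


def largest_entries(root="/tmp", top=20):
    """Return (size_bytes, path) for the largest direct children of root, biggest first."""
    sizes = []
    for entry in os.scandir(root):
        total = 0
        if entry.is_dir(follow_symlinks=False):
            # Sum every file below this directory.
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable
        else:
            try:
                total = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
        sizes.append((total, entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_entries("/tmp")` returns the biggest entries first, which is the reverse of the `sort -h | tail -20` ordering above.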

Universal First 3 Queries

When you don't know where to start, run these three queries to establish a baseline:

Portal view: Log Analytics editor where the three baseline queries run

[Screenshot: Application Insights Logs blade for ai-test-20251107, showing the KQL editor with a New Query 1 tab, a Run button, Time range set to Last 24 hours, Show 1000 results, and a KQL mode dropdown. The query editor and the Query history panel below it are both empty.]

All three baseline queries below run in this Application Insights Logs editor. Before pasting each query, set the blade-level Time range to match the query's ago(...) window: 2 hours for Query 1, 24 hours for Query 2, and 6 hours for Query 3. The Show: 1000 results cap matters during high-volume incidents; for noisy queries, raise the cap or tighten the filter. Leave the mode dropdown on KQL mode, because all queries below assume raw Kusto syntax. As you run each query, the Query history panel populates; export the queries that produced actionable evidence into the incident ticket as your reproducibility artifact.

Query 1: HTTP Error Trend

AppServiceHTTPLogs
| where TimeGenerated > ago(2h)
| summarize total=count(), err5xx=countif(ScStatus >= 500), p95=percentile(TimeTaken, 95) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

Query 2: Platform Events

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("restart", "recycle", "health", "swap", "deploy", "container", "OOM", "killed")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated desc

Query 3: Console Error Signatures

AppServiceConsoleLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("timeout", "failed", "error", "exception", "could not", "DNS", "connect")
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc
| take 100

Decision Matrix

| Observation | Most Likely Card | Confidence |
|---|---|---|
| 503 + no console output | Card 1 (Startup) | High |
| 5xx spikes with traffic | Card 2 (Load) | High |
| Outbound timeout errors | Card 3 (SNAT) | Medium-High |
| DNS resolution errors | Card 4 (VNet DNS) | High |
| First request slow, rest fast | Card 5 (Cold Start) | High |
| Gradual slowdown → restart | Card 6 (Memory) | High |
| Broken after slot swap | Card 7 (Config Drift) | High |
| ENOSPC errors | Card 8 (Disk) | High |
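
For teams that automate first-pass triage, the matrix can be encoded as a lookup table. The sketch below transcribes the rows above; the signal keys and the function name are hypothetical labels chosen for illustration, and High-confidence matches are listed first.

```python
# Transcribed from the Decision Matrix; the signal keys are hypothetical labels.
DECISION_MATRIX = {
    "503_no_console_output": ("Card 1 (Startup)", "High"),
    "5xx_spikes_with_traffic": ("Card 2 (Load)", "High"),
    "outbound_timeout_errors": ("Card 3 (SNAT)", "Medium-High"),
    "dns_resolution_errors": ("Card 4 (VNet DNS)", "High"),
    "first_request_slow_rest_fast": ("Card 5 (Cold Start)", "High"),
    "gradual_slowdown_then_restart": ("Card 6 (Memory)", "High"),
    "broken_after_slot_swap": ("Card 7 (Config Drift)", "High"),
    "enospc_errors": ("Card 8 (Disk)", "High"),
}


def suggest_cards(observed):
    """Return (card, confidence) pairs for observed signals, High confidence first."""
    hits = [DECISION_MATRIX[s] for s in observed if s in DECISION_MATRIX]
    # sorted() is stable, so ties keep the order in which signals were given.
    return sorted(hits, key=lambda hit: hit[1] != "High")
```

Unknown signals are ignored rather than raising, so the helper can be fed partial observations during an incident.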
