Troubleshooting Architecture Overview

This page answers one practical question first: where in the platform can a request or deployment fail?

Before deep debugging, map the symptom to a platform segment (Front End, Worker, app container, outbound path, or deployment pipeline). That classification tells you which logs to query first and which playbook to open.

Why this page exists

Playbooks are symptom-driven and detailed. During active incidents, engineers usually need a faster artifact first, one that combines:

- A request-path view with failure points
- A runtime model that explains timeout, OOM, and recycle behavior
- A deployment path showing where startup and config drift failures happen
- A network path showing DNS/SNAT/private routing issues

Use this architecture map to route quickly to the right playbook.

1) Request Path Architecture (where 5xx can originate)

flowchart TD
    A[Client Browser or API Caller] --> B[Azure App Service Front End]
    B --> C[Worker VM Instance]
    C --> D[App Container or Runtime Process]
    D --> E[Application Handler]
    E --> F[Response]

    B -. FP-REQ-01 .-> B1[Failure Point: FE gateway or routing issue]
    C -. FP-REQ-02 .-> C1[Failure Point: worker unhealthy or recycled]
    D -. FP-REQ-03 .-> D1[Failure Point: container crash or startup loop]
    E -. FP-REQ-04 .-> E1[Failure Point: unhandled exception or timeout]

Typical interpretation

- HTTP 5xx is not one thing.
- A 5xx can originate at Front End, Worker, Container startup, or app code.
- Treat each layer as a competing hypothesis until logs disprove it.

Request-path failure points and playbooks

Failure Point	Typical Symptom	First Evidence	Playbook
FP-REQ-01 Front End	502/503 bursts, request forwarding issues	`AppServicePlatformLogs` + HTTP status trend	Failed to Forward Request
FP-REQ-02 Worker	Intermittent 5xx during load, restart overlap	restart timing, platform recycle events	Intermittent 5xx Under Load
FP-REQ-03 Container	app marked up/down, ping failures, cold start failures	console startup logs + health probe messages	Container Didn't Respond to HTTP Pings
FP-REQ-04 Application code	500 with stack traces or long latency before error	app/console logs + endpoint-level HTTP logs	Slow Response but Low CPU
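
The table above is effectively a lookup from failure point to first evidence and playbook, and the interpretation notes describe a process of elimination. A minimal Python sketch of both follows. The `FailurePoint` class, the `REQUEST_PATH` dictionary, and the `remaining_hypotheses` helper are hypothetical names introduced for illustration; they are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePoint:
    layer: str
    first_evidence: str
    playbook: str


# Rows copied from the request-path table; keys are the failure point IDs.
REQUEST_PATH = {
    "FP-REQ-01": FailurePoint(
        "Front End",
        "AppServicePlatformLogs + HTTP status trend",
        "Failed to Forward Request",
    ),
    "FP-REQ-02": FailurePoint(
        "Worker",
        "restart timing, platform recycle events",
        "Intermittent 5xx Under Load",
    ),
    "FP-REQ-03": FailurePoint(
        "Container",
        "console startup logs + health probe messages",
        "Container Didn't Respond to HTTP Pings",
    ),
    "FP-REQ-04": FailurePoint(
        "Application code",
        "app/console logs + endpoint-level HTTP logs",
        "Slow Response but Low CPU",
    ),
}


def remaining_hypotheses(disproved: set) -> list:
    """Return the failure point IDs that logs have not yet ruled out."""
    return [fp_id for fp_id in REQUEST_PATH if fp_id not in disproved]
```

For example, once platform logs show no Front End forwarding errors, `remaining_hypotheses({"FP-REQ-01"})` leaves the Worker, Container, and Application code layers to investigate.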

2) Runtime / Worker Model (memory pressure, SIGKILL, timeout)

flowchart TD
    A[Worker Instance] --> B[Runtime Process Manager]
    B --> C[App Worker Processes]
    C --> D[Request Queue]
    D --> E[Dependency Calls]

    C --> F{Memory growth trend}
    F -->|steady growth| G[Pressure: GC churn and degraded latency]
    F -->|exceeds limit| H[SIGKILL or OOM kill]

    D --> I{Queue delay exceeds timeout}
    I -->|yes| J[Worker timeout and 5xx]
    I -->|no| K[Request completes]

    H --> L[Platform restart or process replacement]
    J --> M[Intermittent failures under burst]

Runtime failure mapping

Failure Point	Why it happens	What to check first	Playbook
FP-RUN-01 Memory pressure	Memory leak, large object churn, excess workers	memory trend + console logs for kill/restart	Memory Pressure and Worker Degradation
FP-RUN-02 SIGKILL/OOM	process exceeded practical memory envelope	platform/console kill messages, restart cadence	Memory Pressure and Worker Degradation
FP-RUN-03 Worker timeout	backlog + slow dependencies + low worker throughput	high `TimeTaken`, timeout signatures	Intermittent 5xx Under Load
FP-RUN-04 Disk pressure impact	temp/log growth blocks runtime operations	`No space left on device` in console	No Space Left on Device
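
The two decision nodes in the runtime flowchart (memory growth trend and queue delay versus timeout) can be sketched as a small classifier. This is an illustration under stated assumptions, not platform behavior: the function name and parameters are hypothetical, "steady growth" is approximated as strictly increasing samples, and the memory limit and timeout must be supplied by the caller because this page does not define them.

```python
def classify_runtime(memory_mb, memory_limit_mb, queue_delay_s, timeout_s):
    """Map runtime observations to FP-RUN IDs, following the flowchart branches."""
    findings = []

    # Memory branch: "exceeds limit" takes priority over "steady growth".
    if memory_mb and max(memory_mb) > memory_limit_mb:
        findings.append("FP-RUN-02")  # SIGKILL or OOM kill
    elif len(memory_mb) >= 2 and all(
        later > earlier for earlier, later in zip(memory_mb, memory_mb[1:])
    ):
        # Assumption: strictly increasing samples stand in for "steady growth".
        findings.append("FP-RUN-01")  # memory pressure

    # Queue branch: delay beyond the timeout means worker timeout and 5xx.
    if queue_delay_s > timeout_s:
        findings.append("FP-RUN-03")

    return findings
```

An empty result means neither branch matched, which points away from the runtime model and toward another segment such as the outbound path.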

3) Deployment Path (startup failures and config drift)

flowchart TD
    A[Code or image change] --> B[Deployment to slot]
    B --> C[Container start command executes]
    C --> D[Port binding and warm-up probes]
    D --> E[Health check evaluation]
    E --> F[Slot swap to production]

    C -. FP-DEP-01 .-> C1[Startup command mismatch]
    D -. FP-DEP-02 .-> D1[Port binding mismatch]
    E -. FP-DEP-03 .-> E1[Health check path failure]
    F -. FP-DEP-04 .-> F1[Slot swap drift or restart race]

Deployment path failure mapping

Failure Point	Typical signal	Primary playbook
FP-DEP-01 Startup command mismatch	deployment green but app never healthy	Deployment Succeeded but Startup Failed
FP-DEP-02 Port mismatch	container runs but does not answer expected port	Container Didn't Respond to HTTP Pings
FP-DEP-03 Warm-up/health confusion	swap warm-up passes/fails unexpectedly	Warm-up vs Health Check
FP-DEP-04 Swap/config drift	slot swap introduces config regression	Slot Swap Config Drift
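
Because the deployment stages run in order, the earliest stage that is not confirmed healthy is usually the one to investigate first. The sketch below walks the stages in flowchart order and returns the matching failure point and playbook. The stage keys (`start_command`, `port_binding`, and so on) and the function name are hypothetical labels for this example, and treating an unobserved stage as unconfirmed is an assumption.

```python
from typing import Dict, Optional, Tuple

# (stage key, failure point ID, primary playbook), in flowchart order.
DEPLOY_STAGES = [
    ("start_command", "FP-DEP-01", "Deployment Succeeded but Startup Failed"),
    ("port_binding", "FP-DEP-02", "Container Didn't Respond to HTTP Pings"),
    ("health_check", "FP-DEP-03", "Warm-up vs Health Check"),
    ("slot_swap", "FP-DEP-04", "Slot Swap Config Drift"),
]


def first_unconfirmed_stage(results: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Return (failure point, playbook) for the earliest stage not confirmed OK.

    A stage missing from `results` counts as unconfirmed (assumption).
    Returns None when every stage is confirmed.
    """
    for stage, failure_point, playbook in DEPLOY_STAGES:
        if not results.get(stage, False):
            return failure_point, playbook
    return None
```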

4) Outbound / Network Path (SNAT, DNS, private routing)

flowchart TD
    A[App Worker] --> B[DNS Resolver Path]
    B --> C[Resolved target IP]
    C --> D[Outbound NAT/SNAT ports]
    D --> E[Internet or Private Endpoint]
    E --> F[Dependency Service]

    B -. FP-NET-01 .-> B1[DNS resolution failure]
    D -. FP-NET-02 .-> D1[SNAT port pressure]
    E -. FP-NET-03 .-> E1[Route or NSG mismatch]
    F -. FP-NET-04 .-> F1[Dependency timeout or refusal]

Outbound path failure mapping

Failure Point	Symptom pattern	Primary playbook
FP-NET-01 DNS failure	intermittent/constant name lookup failures	DNS Resolution (VNet-Integrated)
FP-NET-02 SNAT pressure	connect timeout spikes under parallel outbound load	SNAT or Application Issue?
FP-NET-03 Private route confusion	endpoint unreachable despite private endpoint setup	Private Endpoint / Custom DNS Route Confusion
FP-NET-04 Dependency failures	outbound errors isolated to one backend service	SNAT or Application Issue?
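
The symptom patterns in this table can be turned into a rough first-pass classifier. The sketch below is a heuristic under assumptions: the failure kinds (`dns`, `connect_timeout`, `refused`) are hypothetical labels you would assign while reading logs, the set of private endpoint hosts is supplied by the caller, and the order of the checks is a judgment call rather than documented platform behavior.

```python
from collections import Counter


def route_outbound(failures, private_hosts=frozenset()):
    """Classify outbound failures into an FP-NET ID.

    failures: list of (target_host, kind) pairs.
    private_hosts: hosts expected to resolve to private endpoints.
    """
    hosts = {host for host, _ in failures}
    kinds = Counter(kind for _, kind in failures)

    if kinds["dns"] > 0:
        return "FP-NET-01"  # name lookup failures
    if hosts and hosts <= set(private_hosts):
        return "FP-NET-03"  # only private endpoint targets fail
    if len(hosts) == 1:
        return "FP-NET-04"  # isolated to one backend service
    if kinds["connect_timeout"] > 0:
        return "FP-NET-02"  # connect timeouts spread across targets
    return "unclassified"
```

The result is a starting hypothesis only. A single failing host that also times out under load could still be SNAT pressure, which is why FP-NET-02 and FP-NET-04 share the same playbook.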

5) Observability Coverage Map

Portal view: Diagnose and solve problems as observability gateway

The Diagnose and solve problems blade is the Portal-side mirror of this observability coverage map and the fastest first stop during an active incident. Each card under Troubleshooting categories maps to one or more components in the diagram below:

- Availability and Performance covers the Front End and worker lifecycle.
- Configuration and Management and Deployment cover Activity Log signals.
- Networking covers the outbound path.
- Diagnostic Tools provides mitigation actions such as Auto-Heal and Advanced Application Restart.

The Risk alerts panel runs continuous health checks and can surface critical issues before symptoms appear in your own monitoring. Check it before falling through to the manual KQL queries below.

flowchart TD
    A[Component] --> B[Best First Log Source]

    C[Front End and proxy path] --> D[AppServiceHTTPLogs + AppServicePlatformLogs]
    E[Worker lifecycle and recycle] --> F[AppServicePlatformLogs]
    G[Container startup and runtime stderr/stdout] --> H[AppServiceConsoleLogs]
    I[App-level business errors] --> J[AppServiceAppLogs or structured app logs]
    K[Deployment and config changes] --> L[Azure Activity Log]
    M[Dependency failures] --> N[App logs + HTTP log correlation]

Quick evidence commands by component

az webapp log show --resource-group <resource-group> --name <app-name>
az monitor activity-log list --resource-group <resource-group> --offset 24h
az monitor metrics list --resource <app-resource-id> --metric "Http5xx,Requests,AverageResponseTime,MemoryWorkingSet" --interval PT1M
az webapp config show --resource-group <resource-group> --name <app-name>
az webapp config appsettings list --resource-group <resource-group> --name <app-name>

Portal view: Log Analytics editor where the KQL queries below execute

[Screenshot: Application Insights Logs blade for ai-test-20251107. The KQL editor shows a New Query 1 tab, a Run button, Time range set to Last 24 hours, Show 1000 results, and a KQL mode dropdown. The query editor is empty, and the Query history panel below it shows no saved history.]

The KQL snippets below all run in the Logs editor shown here, which you open from the Logs button of an Application Insights resource or a Log Analytics workspace. Before pasting a query, set the blade-level Time range to match the | where TimeGenerated > ago(...) clause in the query (the queries here use ago(2h), ago(6h), or ago(24h)). The KQL mode dropdown switches between raw Kusto and Simple mode; all three queries below require KQL mode. During a real incident, the Query history panel fills with the queries you ran and becomes the reproducibility record: copy the important queries from it into the runbook for post-incident review.

AppServiceHTTPLogs
| where TimeGenerated > ago(2h)
| summarize total=count(), errors5xx=countif(ScStatus >= 500 and ScStatus < 600), p95=percentile(TimeTaken,95) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
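
Once the query above returns one row per 5-minute bin, a short script can flag the bins worth correlating with other logs. This is a sketch, assuming the result has been exported as a list of dictionaries keyed by the column names from the `summarize` clause. The function name and both thresholds are illustrative assumptions, not recommended values.

```python
def spike_bins(rows, min_requests=20, max_error_ratio=0.05):
    """Return TimeGenerated values of bins whose 5xx ratio exceeds the threshold.

    Bins with fewer than `min_requests` requests are skipped so that a
    handful of errors on near-zero traffic does not look like a spike.
    """
    flagged = []
    for row in rows:
        total = row["total"]
        if total < max(min_requests, 1):  # also guards against division by zero
            continue
        if row["errors5xx"] / total > max_error_ratio:
            flagged.append(row["TimeGenerated"])
    return flagged
```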

AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where ResultDescription has_any ("restart", "recycle", "health", "container", "start", "stop")
| project TimeGenerated, OperationName, ResultDescription
| order by TimeGenerated desc

AppServiceConsoleLogs
| where TimeGenerated > ago(6h)
| where ResultDescription has_any ("Exception", "timed out", "No space left", "OOM", "killed", "could not bind")
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc
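
The advice to match the Time range to each query's `ago(...)` clause can be checked mechanically before a query is pasted. The helper below is a minimal sketch that extracts the largest lookback from a query string. The function name is hypothetical, and the pattern handles only whole-number day, hour, and minute literals such as `ago(2h)`, which covers the three queries above but not every KQL timespan form.

```python
import re
from datetime import timedelta
from typing import Optional

# Matches ago(<integer><unit>) where unit is d, h, or m.
_AGO = re.compile(r"ago\(\s*(\d+)\s*([dhm])\s*\)")
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def query_lookback(kql: str) -> Optional[timedelta]:
    """Return the largest ago(...) window found in the query, or None."""
    spans = [
        timedelta(**{_UNITS[unit]: int(amount)})
        for amount, unit in _AGO.findall(kql)
    ]
    return max(spans) if spans else None
```

For the console log query above, `query_lookback` returns a six-hour `timedelta`, so the Time range picker should cover at least the last 6 hours.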

6) Fast routing examples

Example A: 5xx appears only during bursts.
- Start with runtime/worker model (FP-RUN-03), then outbound pressure (FP-NET-02).
- Open: Intermittent 5xx Under Load and SNAT or Application Issue?
Example B: deployment says success, app unavailable.
- Start with deployment path (FP-DEP-01/02/03).
- Open: Deployment Succeeded but Startup Failed and Container Didn't Respond to HTTP Pings.
Example C: outbound calls fail but only for private endpoint targets.
- Start with outbound path (FP-NET-01/03).
- Open: Private Endpoint / Custom DNS Route Confusion.
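
The three examples can be captured as a small routing table so that an on-call script or chat bot returns the same first steps every time. This is a sketch only: the symptom keys are hypothetical labels, while the failure point IDs and playbook names are copied from Examples A to C.

```python
# Symptom keys are hypothetical; routes mirror Examples A, B, and C.
ROUTES = {
    "5xx_only_during_bursts": {
        "check": ["FP-RUN-03", "FP-NET-02"],
        "open": ["Intermittent 5xx Under Load", "SNAT or Application Issue?"],
    },
    "deploy_success_app_unavailable": {
        "check": ["FP-DEP-01", "FP-DEP-02", "FP-DEP-03"],
        "open": [
            "Deployment Succeeded but Startup Failed",
            "Container Didn't Respond to HTTP Pings",
        ],
    },
    "outbound_fails_private_endpoint_only": {
        "check": ["FP-NET-01", "FP-NET-03"],
        "open": ["Private Endpoint / Custom DNS Route Confusion"],
    },
}


def route(symptom: str) -> dict:
    """Return the failure points to check and playbooks to open for a symptom."""
    try:
        return ROUTES[symptom]
    except KeyError:
        raise ValueError(
            f"no fast route for {symptom!r}; classify by platform segment first"
        ) from None
```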

How to use this architecture page during incidents

Do not treat any single metric as proof. Use this page to identify the most likely failure layer, then validate with time-correlated evidence. Move to the linked playbook only after you identify which layer best matches the symptom timing.
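
Time correlation can be as simple as pairing each error spike with the platform events that occurred close to it. The sketch below assumes you already have spike timestamps (for example, from the HTTP log query) and a list of timestamped events (for example, from the platform log query). The function name and the five-minute default window are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def correlate(spikes, events, window=timedelta(minutes=5)):
    """Pair each spike with every event within +/- window of it.

    spikes: list of datetime values.
    events: list of (datetime, description) tuples.
    Returns a list of (spike_time, event_description) pairs.
    """
    pairs = []
    for spike in spikes:
        for event_time, description in events:
            if abs(spike - event_time) <= window:
                pairs.append((spike, description))
    return pairs
```

A pair in the output narrows the search to one layer; it is still a hypothesis to confirm in the linked playbook, not proof of cause.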