App Service OSS Troubleshooting¶

A hypothesis-driven troubleshooting guide for Azure App Service OSS workloads.

What This Is¶

A practical field guide for troubleshooting real-world issues on Azure App Service Linux.

This is not a general Azure tutorial. It is designed to help engineers move from an observed symptom to a validated root cause faster.

How It Works¶

graph TD
    A[Observe Symptom] --> B[List Hypotheses]
    B --> C[Collect Evidence]
    C --> D[Validate / Disprove]
    D --> E[Identify Root Cause]
    E --> F[Mitigate]

Every playbook follows this flow:

1. Start from the symptom: what the engineer actually observes
2. List competing hypotheses: multiple plausible causes
3. Collect evidence: metrics, logs, detectors, configuration
4. Validate or disprove each hypothesis with specific signals
5. Identify the most likely root cause pattern
6. Apply mitigations: immediate and long-term
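
The flow above can be sketched as a small data structure: each hypothesis carries a check that either survives or is disproved by the collected evidence. This is a minimal illustration, not part of any playbook. The `Hypothesis` class, the `triage` function, the evidence keys, and the mitigation strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Evidence is modeled as a plain dict; the keys used below are hypothetical.
Evidence = Dict[str, object]


@dataclass
class Hypothesis:
    name: str
    # Returns True if the evidence supports the hypothesis, False if it disproves it.
    check: Callable[[Evidence], bool]
    mitigation: str


def triage(evidence: Evidence, hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Return only the hypotheses that survive validation against the evidence."""
    return [h for h in hypotheses if h.check(evidence)]


# Illustrative evidence for a "503 on all requests" symptom.
evidence = {"http_status": 503, "console_log_lines": 0, "bind_address": None}

hypotheses = [
    Hypothesis(
        name="Wrong startup command",
        check=lambda e: e["console_log_lines"] == 0,
        mitigation="Fix the startup command and redeploy",
    ),
    Hypothesis(
        name="Wrong bind address",
        check=lambda e: e["bind_address"] == "127.0.0.1",
        mitigation="Bind to 0.0.0.0 instead of 127.0.0.1",
    ),
]

surviving = triage(evidence, hypotheses)
for h in surviving:
    print(f"{h.name}: {h.mitigation}")
```

Keeping the check separate from the mitigation mirrors the ordering of the steps: nothing is mitigated until a hypothesis has been validated.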

Start Here¶

Your Situation	Go To
First incident, no idea where to start	Architecture Overview
Need to identify the failure category	Decision Tree
Want 60-second symptom-to-playbook cards	Quick Diagnosis Cards
Want to understand what evidence to collect	Evidence Map
Need a mental framework for diagnosis	Mental Model
Already know the symptom category	Jump to Playbooks below
Need KQL queries to investigate	KQL Query Library
Want hands-on practice	Labs below

Quick Decision Tree¶

Use this to route to the right playbook in under 60 seconds:

graph TD
    A[Symptom Observed] --> B{App returns HTTP errors?}
    B -->|503 on all requests| C[Startup Failure]
    B -->|Intermittent 5xx| D[Performance / Load]
    B -->|200 but slow| E[Performance / Cold Start]
    B -->|No errors but wrong behavior| F[Config / Routing]

    C --> C1[Startup Probe Failed?]
    C1 -->|Yes, 0 console logs| C2[Wrong startup command → Deployment Succeeded Startup Failed]
    C1 -->|Yes, app listening on 127.0.0.1| C3[Wrong bind address → Failed to Forward Request]
    C1 -->|Yes, timeout on port| C4[Port mismatch → Container HTTP Pings]

    D --> D1{Outbound connections involved?}
    D1 -->|Yes, connection timeouts| D2[SNAT Exhaustion]
    D1 -->|No, sync worker blocking| D3[Intermittent 5xx Under Load]

    E --> E1{First request after deploy?}
    E1 -->|Yes| E2[Cold Start / Slow Start]
    E1 -->|No, always slow| E3[Memory Pressure or CPU]

    F --> F1{After slot swap?}
    F1 -->|Yes| F2[Slot Swap Config Drift]
    F1 -->|No, DNS/network| F3[DNS VNet Resolution]

    style C fill:#c62828,color:#fff
    style D fill:#ef6c00,color:#fff
    style E fill:#f9a825,color:#000
    style F fill:#1565c0,color:#fff
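
For scripted triage, the same tree can be expressed as a nested lookup. The sketch below assumes the caller has already classified the symptom. The playbook names come from the diagram, while the dictionary keys (`503_all`, `port_timeout`, and so on) and the `route` function are hypothetical labels invented for this example.

```python
# Nested lookup mirroring the decision tree: first key is the HTTP-level
# symptom, second key is the distinguishing detail. Key names are hypothetical.
DECISION_TREE = {
    "503_all": {
        "no_console_logs": "Deployment Succeeded Startup Failed",
        "listening_on_localhost": "Failed to Forward Request",
        "port_timeout": "Container HTTP Pings",
    },
    "intermittent_5xx": {
        "outbound_timeouts": "SNAT Exhaustion",
        "sync_worker_blocking": "Intermittent 5xx Under Load",
    },
    "200_slow": {
        "first_request_after_deploy": "Cold Start / Slow Start",
        "always_slow": "Memory Pressure or CPU",
    },
    "wrong_behavior": {
        "after_slot_swap": "Slot Swap Config Drift",
        "dns_or_network": "DNS VNet Resolution",
    },
}


def route(symptom: str, detail: str) -> str:
    """Return the playbook name for a symptom/detail pair, or raise ValueError."""
    try:
        return DECISION_TREE[symptom][detail]
    except KeyError:
        raise ValueError(f"No route for {symptom!r} / {detail!r}") from None


print(route("503_all", "port_timeout"))
```

Raising on an unknown pair, rather than returning a default, keeps an unclassified symptom from being silently routed to the wrong playbook.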

Hosting Mode: Where to Look First¶

Different hosting modes have different observation points. Use this table to prioritize your investigation:

Symptom	Linux Code	Linux Container	Windows Code
Startup fails	`AppServiceConsoleLogs` — Oryx build output, runtime startup	`AppServiceConsoleLogs` — Docker logs, `ENTRYPOINT`/`CMD` output	Application Event Logs, `WEBSITE_RUN_FROM_PACKAGE` extraction
Wrong port/binding	Check `--bind` in startup command (Gunicorn, etc.)	Check `WEBSITES_PORT`, `PORT` env var, `EXPOSE` in Dockerfile	Typically auto-configured; check `web.config` for IIS settings
Missing dependencies	Oryx build logs, `requirements.txt` / `package.json`	Image build logs; dependencies baked into image	NuGet restore logs, MSBuild output
Slow cold start	Module import time, lazy loading patterns	Image pull time (check image size), container init	Assembly loading, JIT compilation
Memory pressure	`MemoryWorkingSet` metric, OOM in platform logs	`MemoryWorkingSet`, container memory limits	`MemoryWorkingSet`, w3wp process memory
Outbound timeouts	SNAT metrics, `AppServiceConsoleLogs` connection errors	Same as Linux Code	SNAT metrics, outbound connection tracking
Config drift after swap	App settings marked as sticky (deployment slot settings)	Same as Linux Code	`web.config` transforms, connection strings
Filesystem issues	`/home` (persistent) vs `/tmp` (ephemeral), `df -h` via SSH	Container filesystem (ephemeral by default), mounted volumes	`D:\home` (persistent) vs `D:\local` (ephemeral)

Hosting Mode Detection

Use `az webapp show --name <app-name> --resource-group <resource-group> --query kind --output tsv` to check the hosting mode:

app,linux → Linux Code
app,linux,container → Linux Container
app → Windows Code
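
When checking many apps, the mapping above can be applied in code. This is a minimal sketch that covers only the three `kind` values listed here. The `hosting_mode` function is a hypothetical helper, and any other `kind` value is reported as `Unknown` rather than guessed.

```python
# Exact mapping from the comma-separated `kind` value to the hosting mode.
# Only the three values listed in this guide are covered.
KIND_TO_MODE = {
    frozenset({"app", "linux"}): "Linux Code",
    frozenset({"app", "linux", "container"}): "Linux Container",
    frozenset({"app"}): "Windows Code",
}


def hosting_mode(kind: str) -> str:
    """Map a `kind` string such as 'app,linux' to a hosting mode name."""
    parts = frozenset(p.strip().lower() for p in kind.split(",") if p.strip())
    return KIND_TO_MODE.get(parts, "Unknown")


print(hosting_mode("app,linux,container"))
```

Comparing sets instead of raw strings makes the lookup tolerant of casing, stray whitespace, and token order.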

Windows-Specific Gaps

This guide focuses on Linux workloads. Windows-specific playbooks (IIS configuration, web.config issues, Windows containers) are referenced but not exhaustively covered.

Representative Log Patterns¶

Quick reference for recognizing common failure signatures:

Pattern	Indicates	Playbook
`503` + TimeTaken > 40000ms + 0 console logs	Startup failure — app never ran	Deployment Succeeded Startup Failed
Console: `Listening at: http://127.0.0.1:8000`	Wrong bind address	Failed to Forward Request
`499` + TimeTaken ~5000ms on `/slow` endpoints	Client timeout, sync worker blocking	Intermittent 5xx Under Load
`499` + TimeTaken ~30000ms on `/outbound`	SNAT exhaustion or outbound timeout	SNAT or Application Issue
`/resolve` returns public IP for privatelink FQDN	DNS misconfiguration (Private DNS Zone not linked)	DNS Resolution VNet
`startup_duration` > 30s in platform logs	Cold start / slow start	Slow Start / Cold Start
`/disk-status` shows /tmp > 50%	Disk pressure	No Space Left on Device
`/config` returns wrong environment values after swap	Slot swap config drift (sticky settings missing)	Slot Swap Config Drift
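
The HTTP-log rows in the table above can be turned into a simple matcher. The sketch below is an assumption-laden illustration: the `match_pattern` function and its parameters are hypothetical, and the 20% tolerance used to interpret the "~" values is an arbitrary choice for the example, not a threshold from the playbooks.

```python
from typing import Optional


def near(value: float, target: float, rel: float = 0.2) -> bool:
    """True if value is within rel (assumed 20%) of target."""
    return abs(value - target) <= rel * target


def match_pattern(
    status: int,
    time_taken_ms: int,
    path: str = "",
    console_log_lines: Optional[int] = None,
) -> Optional[str]:
    """Return the playbook suggested by one HTTP log record, or None."""
    # 503 + TimeTaken > 40000 ms + 0 console logs: the app never ran.
    if status == 503 and time_taken_ms > 40000 and console_log_lines == 0:
        return "Deployment Succeeded Startup Failed"
    # 499 + roughly 30000 ms on /outbound: SNAT exhaustion or outbound timeout.
    if status == 499 and path.startswith("/outbound") and near(time_taken_ms, 30000):
        return "SNAT or Application Issue"
    # 499 + roughly 5000 ms on /slow: client timeout, sync worker blocking.
    if status == 499 and path.startswith("/slow") and near(time_taken_ms, 5000):
        return "Intermittent 5xx Under Load"
    return None


print(match_pattern(499, 5100, "/slow"))
```

A match is only a starting hypothesis. It still needs to be validated against the evidence listed in the corresponding playbook.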

Topics¶

Performance¶

Outbound / Network¶

Startup / Availability¶

Quick Start¶

Need	Start Here
First 10 minutes of a performance issue	Performance Checklist
First 10 minutes of a network issue	Network Checklist
First 10 minutes of a startup failure	Startup Checklist
Reusable KQL queries	Query Library

Hands-on Labs¶

Deploy reproduction environments to your Azure subscription and observe real symptoms:

Architecture & Methodology¶

Portal view: Diagnose and solve problems landing page¶

The Portal's built-in Diagnose and solve problems blade is the operational entry point that complements the architecture and methodology pages below. Treat it as the first stop during an active incident. The Risk alerts section surfaces pre-detected critical issues, and the seven Troubleshooting categories map directly to the failure classifications in the Mental Model. The Popular troubleshooting tools (App Down Workflow, Web App Slow) run guided diagnostic flows that consolidate many manual KQL queries.

Architecture Overview — How App Service components interact during failures
Decision Tree — Route from symptom to playbook in 60 seconds
Evidence Map — What evidence to collect for each failure type
Mental Model — Framework for hypothesis-driven diagnosis
Troubleshooting Method — Full methodology deep-dive
Detector Map — Platform diagnostic detectors and what they check