App Service Labs

Azure App Service troubleshooting experiments focused on worker behavior, memory management, networking, deployment patterns, and platform/application boundary interpretation.

Architecture Overview

Azure App Service runs web applications inside Docker containers on shared or dedicated virtual machines (workers). Understanding the component architecture is essential for diagnosing where failures originate.

```mermaid
graph TB
    subgraph "Azure App Service Platform"
        Client([Client]) --> FE[Frontend<br/>ARR / Load Balancer]

        FE --> W1[Worker Instance 1]
        FE --> W2[Worker Instance 2]

        subgraph "Worker Instance"
            direction TB
            W1 --> SC[Site Container<br/>Application Code]
            SC --> HOME["/home<br/>CIFS Mount<br/>(Persistent)"]
            SC --> TMP["/tmp<br/>Writable Layer<br/>(Ephemeral)"]
        end

        W1 --> SNAT[SNAT Pool<br/>Outbound Connections]
        SNAT --> EXT([External Services])

        HC[Health Check<br/>Probe] --> W1
        HC --> W2

        subgraph "App Service Plan"
            direction LR
            CPU[Shared CPU]
            MEM[Shared Memory]
            SWAP[Swap Space]
        end
    end

    style FE fill:#4a9eff,color:#fff
    style HC fill:#ff9800,color:#fff
    style SNAT fill:#e91e63,color:#fff
    style HOME fill:#4caf50,color:#fff
    style TMP fill:#ff5722,color:#fff
```

Key Components for Troubleshooting

| Component | Role | Why It Matters |
|-----------|------|----------------|
| Frontend (ARR) | HTTP load balancer and SSL termination | ARR affinity, timeout settings, and routing decisions affect request behavior before the app sees them |
| Worker | VM running the application container | Workers are shared across apps in the same plan; one app's resource usage affects others |
| App Service Plan | Resource allocation boundary (CPU, memory) | Plan SKU determines available memory, CPU, SNAT ports, and swap behavior |
| Health Check | Platform probe that monitors app health | Misconfigured health checks can cause instance eviction even when the app is partially functional |
| SNAT | Source Network Address Translation for outbound connections | Limited port pool (128 per instance); exhaustion causes connection failures with no CPU/memory signal |
| /home mount | Persistent CIFS (SMB) storage shared across instances | Data survives restarts and scale events; performance is network-dependent |
| Writable layer | Ephemeral container filesystem overlay | Data is lost on restart, deployment, or instance migration |

Note

These experiments target Linux App Service plans unless otherwise noted. Windows-specific behavior may differ.

Experiment Status

| Experiment | Status | Description |
|------------|--------|-------------|
| Filesystem Persistence | Published | /home vs writable layer data survival across restarts |
| Health Check Eviction | Published | Cascading outage from partial dependency failure |
| SNAT Exhaustion | Published | Connection failures without CPU/memory pressure |
| Memory Pressure | Published | Plan-level degradation, swap thrashing, kernel page reclaim |
| Custom DNS Resolution | Published | Private name resolution drift after VNet changes |
| procfs Interpretation | Published | /proc reliability and limits in Linux containers |
| Slow Requests | Published | Frontend timeout vs. worker-side delay vs. dependency latency |
| Zip Deploy vs Container | Published | Deployment method behavioral differences |
| Slot Swap Warmup | Draft | In-flight request handling during slot swap warmup |
| Access Restrictions SCM | Draft | Access restriction behavior on SCM site |

Published Experiments

Filesystem Persistence (Published)

/home mount vs writable layer persistence behavior across restarts, deployments, and scale operations. Confirms that /home (CIFS) survives all lifecycle events while the writable layer is lost on any container recreation.

Experiment Complete

Completed 2026-04 on B1 Linux (koreacentral). All three hypotheses confirmed with evidence from 4 trigger events across 2 instances.
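The persistence check described above can be sketched as a small marker-file probe. This is an illustrative sketch, not the experiment's actual harness: the paths and function names are assumptions, based on the documented layout where /home is the persistent CIFS mount and /tmp lives in the ephemeral writable layer.

```python
import os
import time

# Illustrative paths: /home is the persistent CIFS mount on Linux App
# Service; /tmp sits in the ephemeral container writable layer.
LOCATIONS = {"persistent": "/home/persistence-marker.txt",
             "ephemeral": "/tmp/persistence-marker.txt"}

def write_markers(locations, now=None):
    """Drop a timestamped marker file in each location before a lifecycle event."""
    now = now or time.strftime("%Y-%m-%dT%H:%M:%S")
    for path in locations.values():
        with open(path, "w") as f:
            f.write(now)

def check_markers(locations):
    """After a restart/redeploy/scale event, report which markers survived."""
    return {name: os.path.exists(path) for name, path in locations.items()}
```

Run `write_markers`, trigger a restart or deployment, then `check_markers`: per the experiment's finding, only the /home marker should remain after container recreation.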

Health Check Eviction (Published)

How Health Check failures from partial dependency outages cause cascading instance eviction. Demonstrates that a single unhealthy dependency endpoint can trigger progressive instance removal.

Experiment Complete

Completed 2026-04 on B1 Linux (koreacentral). Documents the eviction cascade timeline and recovery behavior.
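One mitigation the experiment's finding suggests is a health endpoint that only fails the probe on critical dependencies, so a partial outage does not cascade into eviction. A minimal sketch, assuming hypothetical checker functions (`check_database`, `check_cache` are illustrative names, not from the experiment):

```python
import json

# Hypothetical dependency checkers; names and results are illustrative.
def check_database():  # critical dependency
    return True

def check_cache():     # non-critical dependency, currently failing
    return False

def health_response(critical_checks, optional_checks):
    """Return (status_code, body) for the platform health probe.

    Only critical failures return 503, which is what can get an instance
    pulled from rotation. Optional failures are reported in the body for
    observability but keep the instance serving, avoiding the cascade
    from a partial dependency outage.
    """
    critical_ok = all(fn() for fn in critical_checks)
    optional = {fn.__name__: fn() for fn in optional_checks}
    status = 200 if critical_ok else 503
    body = json.dumps({"critical_ok": critical_ok, "optional": optional})
    return status, body
```

With the cache down but the database up, the probe still sees 200 and the degraded state is visible only in the response body.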

SNAT Exhaustion (Published)

Connection failures caused by SNAT port exhaustion without corresponding CPU or memory pressure. Shows how outbound connection patterns can exhaust the limited SNAT port pool (128 per instance) while all other metrics remain normal.

Experiment Complete

Completed 2026-04 on B1 Linux (koreacentral). Captures the characteristic signal pattern: connection failures with clean CPU/memory metrics.
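The mechanism can be illustrated with a back-of-envelope model of port consumption against the 128-port preallocation. The pool size and lifetimes below are illustrative assumptions, not measured values from the experiment:

```python
# Each App Service instance is preallocated 128 SNAT ports for outbound
# connections to a given destination.
SNAT_PORTS_PER_INSTANCE = 128

def snat_ports_held(requests_per_second, conn_lifetime_s, pool_size=None):
    """Estimate concurrently held SNAT ports for one outbound destination.

    pool_size=None models no keep-alive: every request opens a fresh
    connection that holds a port for its full lifetime (including
    TIME_WAIT). With a keep-alive pool, only pool_size long-lived
    connections hold ports, regardless of request rate.
    """
    if pool_size is not None:
        return pool_size
    return requests_per_second * conn_lifetime_s

def exhausts_snat(requests_per_second, conn_lifetime_s, pool_size=None):
    held = snat_ports_held(requests_per_second, conn_lifetime_s, pool_size)
    return held > SNAT_PORTS_PER_INSTANCE
```

A modest 10 req/s with ~2-minute connection lifetimes already holds ~1200 ports, far past the pool, while CPU and memory stay flat; a small keep-alive pool at the same rate holds only a handful.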

Memory Pressure (Published)

Plan-level degradation under memory pressure. Investigates swap thrashing, kernel page reclaim effects, and cross-app impact on shared plans. Explores whether memory pressure manifests as CPU increase and how plan-level metrics can mislead per-app diagnosis.

Experiment Complete

Completed 2025-07. Covers B1/B2 Linux plans with multiple apps, measuring swap usage, page fault rates, CPU attribution, and cross-app latency impact.
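The swap and available-memory signals this experiment measures can be derived from /proc/meminfo. A minimal sketch of the parsing, assuming the standard "Key: value kB" format (field names are from /proc/meminfo itself; the derived-signal names are mine):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:   value kB' lines into a dict of kB values."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            out[key] = int(fields[0])
    return out

def pressure_signals(meminfo):
    """Derive the signals the experiment watches: low available memory and
    active swap use, which together point at plan-level thrashing rather
    than a single app's working set."""
    total, avail = meminfo["MemTotal"], meminfo["MemAvailable"]
    swap_used = meminfo["SwapTotal"] - meminfo["SwapFree"]
    return {"available_pct": round(100 * avail / total, 1),
            "swap_used_kb": swap_used}
```

On a live instance you would feed it `open("/proc/meminfo").read()`; sampling both values over time separates steady pressure from reclaim-driven thrashing.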

Custom DNS Resolution (Published)

Private name resolution behavior after Private DNS Zone link changes on VNet-integrated App Service. Tests whether DNS cache drift causes intermittent failures. Finding: DNS changes propagate immediately — the cache drift hypothesis is refuted.

Experiment Complete

Completed 2026-04 on P1v3 Linux (koreacentral). Four phases tested: baseline, unlink, re-link, and rapid toggle. All DNS transitions were immediate across 80 total probes.
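A repeated-resolution probe like the one this experiment ran can be sketched with the standard library. The hostname and resolver injection are illustrative; the experiment's actual probe tooling is not shown here:

```python
import socket
import time

def probe_resolution(hostname, attempts=5, interval_s=1.0, resolver=None):
    """Record what each resolution attempt returns across a DNS zone change.

    resolver defaults to socket.getaddrinfo and is injectable for testing.
    A result that flaps between attempts would support the cache-drift
    hypothesis; the experiment instead observed immediate, consistent
    transitions on link/unlink.
    """
    resolver = resolver or (
        lambda h: [ai[4][0] for ai in socket.getaddrinfo(h, None)])
    results = []
    for i in range(attempts):
        try:
            results.append(sorted(set(resolver(hostname))))
        except OSError:
            results.append(None)  # NXDOMAIN or other resolution failure
        if i < attempts - 1:
            time.sleep(interval_s)
    return results
```

Running this across an unlink/re-link of the Private DNS Zone shows either a clean step change (what was observed) or mixed results within a window (what drift would look like).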

procfs Interpretation (Published)

Reliability and limits of reading /proc filesystem data inside App Service Linux containers. Examines where procfs values reflect the container vs. the host, and how cgroup v1/v2 boundaries affect metric interpretation.

Experiment Complete

Completed 2026-04 on B1/P1v3/P2v3 Linux (koreacentral). Key finding: memory and CPU limits are not enforced via cgroup (both report unlimited); /proc/meminfo MemTotal tracks the SKU spec within 3-5%.
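Interpreting the cgroup limit files correctly requires handling both hierarchies' "unlimited" encodings. A sketch of that interpretation step, assuming the standard file locations (/sys/fs/cgroup/memory.max for v2, /sys/fs/cgroup/memory/memory.limit_in_bytes for v1):

```python
def parse_cgroup_memory_limit(raw):
    """Interpret the raw contents of a cgroup memory limit file.

    cgroup v2 (memory.max) writes the literal string 'max' for unlimited;
    cgroup v1 (memory.limit_in_bytes) uses a huge sentinel near 2**63.
    The experiment found App Service Linux containers report unlimited
    either way, so /proc/meminfo MemTotal (which tracks the SKU spec)
    is the more useful capacity signal.
    """
    raw = raw.strip()
    if raw == "max":
        return None  # v2 unlimited
    value = int(raw)
    if value >= 2**60:  # v1 "no limit" sentinel region
        return None
    return value
```

A return of None from either file means "do not trust cgroup for the memory ceiling here"; fall back to MemTotal.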

Slow Requests (Published)

Diagnosing slow HTTP responses under pressure conditions. Distinguishes between frontend (ARR) timeout, worker-side processing delay, and downstream dependency latency. Tests how different bottleneck locations produce different diagnostic signals.

Experiment Complete

Completed 2026-04 on B1 Linux (koreacentral). Key finding: CPU-bound delays serialize on Node.js event loop (p50 inflates 2-3×); async dependency delays process in parallel at constant latency. ARR timeout confirmed at 230-240s.
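The serialization effect behind that finding can be reproduced on any single-threaded event loop. A sketch using Python asyncio as a stand-in for the Node.js loop tested in the experiment (the handler names are illustrative):

```python
import asyncio
import time

async def dependency_wait(delay_s):
    """Models waiting on a downstream dependency: yields the event loop,
    so concurrent requests overlap and total latency stays ~constant."""
    await asyncio.sleep(delay_s)

async def cpu_bound(delay_s):
    """Models CPU-bound work on the request path: never yields, so the
    event loop runs the concurrent requests one after another and p50
    inflates with concurrency (the effect measured in the experiment)."""
    end = time.monotonic() + delay_s
    while time.monotonic() < end:
        pass

async def timed(handler, n, delay_s):
    """Wall time to serve n concurrent requests with the given handler."""
    start = time.monotonic()
    await asyncio.gather(*(handler(delay_s) for _ in range(n)))
    return time.monotonic() - start
```

Five concurrent 50 ms async waits finish in ~50 ms total, while five 50 ms CPU-bound handlers take ~250 ms: the same request mix, very different diagnostic signature.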

Zip Deploy vs Container (Published)

Behavioral differences between zip deployment and custom container deployment. Investigates startup time, file system behavior, environment variable handling, and troubleshooting signal availability across deployment methods.

Experiment Complete

Completed 2026-04 on B1/P1v3 Linux (koreacentral). Key finding: cold start timing is dominated by platform variance (>50% CV), not deployment method. Major differences are in filesystem layout (cwd, /home contents) and environment variables (PORT vs WEBSITES_PORT).
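The PORT vs WEBSITES_PORT difference means one startup script can stay portable across both deployment methods by checking both variables. A minimal sketch (the default of 8080 is an assumption for illustration, not a platform guarantee):

```python
def resolve_listen_port(environ, default=8080):
    """Pick the port the app should bind to.

    Zip/Oryx deployments expose the expected port as PORT; custom
    container deployments use the operator-set WEBSITES_PORT app setting
    instead. Checking both keeps one startup path working under either
    deployment method, per the experiment's finding on env var handling.
    """
    for var in ("PORT", "WEBSITES_PORT"):
        value = environ.get(var)
        if value and value.isdigit():
            return int(value)
    return default
```

Call it with `os.environ` at startup; the fallback only fires locally or when neither variable is set.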

Draft Experiments

Slot Swap Warmup (Draft)

How in-flight requests are handled during slot swap warmup phase. Investigates whether requests are dropped, queued, or routed to the old slot during the warmup period, and what signals indicate warmup completion.

Status: Draft - Awaiting Execution

Designed based on Oracle recommendations.

Access Restrictions SCM (Draft)

Behavior of access restrictions on the SCM (Kudu) site. Tests whether main site access restrictions automatically apply to SCM, whether SCM has independent restrictions, and edge cases with VNet integration.

Status: Draft - Awaiting Execution

Designed based on Oracle recommendations.

Related Experiments

  • Functions: Cold Start explores startup phase breakdown, which shares diagnostic overlap with App Service container startup behavior.
  • Container Apps: OOM Visibility Gap investigates OOM kill observability, relevant when comparing App Service memory pressure signals.
  • Cross-cutting: MI RBAC Propagation tests identity propagation delays that affect App Service managed identity usage.