Skip to content

Lab Guide (Level 3): SNAT Exhaustion on Azure App Service Linux

This lab is a full diagnostic reference for reproducing and proving outbound SNAT pressure on Azure App Service Linux using a Python/Flask workload. It expands the original scaffold into a complete investigation package with architecture background, falsifiable hypothesis, deterministic runbook, and artifact-backed experiment log.


Lab Metadata

Attribute Value
Difficulty Advanced
Estimated Duration 60-75 minutes
Tier Basic
Failure Mode Outbound connection churn without pooling drives SNAT pressure, timeouts, and worker instability
Skills Practiced Outbound dependency troubleshooting, SNAT-pressure analysis, network diagnostics, HTTP and console log correlation

What this guide is

This is a troubleshooting reference lab guide intended for engineers who need repeatable, evidence-driven diagnosis. It is not a quickstart.

PII policy

All IDs in this guide are already sanitized. Keep all examples sanitized if you copy this structure for new investigations.


1) Background

SNAT exhaustion on App Service is rarely a single-event failure. It is usually a cascade:

  1. App generates high outbound connection churn.
  2. Platform SNAT mapping inventory gets stressed.
  3. Pending outbound calls wait longer for usable translated ports.
  4. Upstream call latency grows.
  5. Worker threads block and queue expands.
  6. Gunicorn workers timeout and get killed/recycled.
  7. Inbound availability degrades (499/503/000 symptoms).

1.1 Outbound flow on App Service Linux

The following logical flow explains where SNAT sits in the path:

flowchart TD
    A[Client Request] --> B[App Service Front End]
    B --> C[Linux Worker / Gunicorn]
    C --> D[Flask /outbound endpoint]
    D --> E[New outbound TCP connection]
    E --> F[Azure Load Balancer SNAT]
    F --> G[Internet target e.g. httpbin.org]
    G --> F
    F --> C

    style F fill:#f57c00,color:#fff
    style D fill:#1976d2,color:#fff
    style C fill:#455a64,color:#fff

Key point: SNAT mapping happens on platform egress. Your code does not directly manage SNAT tables, but your connection behavior determines churn pressure.

1.2 Why per-request TCP creation is dangerous

In this lab app:

  • /outbound uses urllib.request with Connection: close.
  • Every outbound call tends to create a fresh TCP socket.
  • Under concurrency, sockets accumulate in active and post-close states.

When requests finish, connections do not disappear immediately due to TCP lifecycle behavior (for example, TIME_WAIT). That lag means transient churn can still consume port inventory for a while.

1.3 SNAT port inventory and App Service guidance

Microsoft guidance for App Service outbound troubleshooting describes finite SNAT inventory and recommends connection reuse/pooling to avoid intermittent failures.

Operationally relevant concepts:

  • Finite SNAT mappings per instance.
  • Port reuse delay due to TCP lifecycle.
  • Connection pooling reduces churn and improves stability.
  • Symptom signatures: outbound timeout, connection refused, intermittent spikes.

1.4 Causal mechanics with TCP states

sequenceDiagram
    participant W as Worker Thread
    participant A as App Code (/outbound)
    participant L as LB SNAT
    participant T as External Target

    W->>A: Handle inbound request
    A->>L: Open outbound TCP
    L->>T: Forward with translated source port
    T-->>L: Response
    L-->>A: Return response
    A-->>W: Complete call
    Note over L: Port mapping not always instantly reusable

Under low traffic this is fine. Under burst concurrency with no pooling, the per-call setup/teardown overhead becomes dominant and error-prone.

1.5 Why this lab can also show worker SIGKILL

The lab does not claim SNAT directly kills a worker process. The chain is indirect:

  • Outbound calls stall.
  • Request handlers exceed worker timeout thresholds.
  • Worker recycling/kill events increase.
  • Platform and app become unstable.

This is a classic cascading failure pattern, where initial network pressure manifests as process churn.

1.6 Lab code paths relevant to diagnosis

Endpoint Purpose Behavior
/outbound Reproduce anti-pattern no pooling, Connection: close
/outbound-fixed Control path requests.Session() + pooled adapter
/diag/net Network diagnostics sockstat, TCP line count, local port range
/diag/stats Process counters request counters + outbound counters
/diag/env Runtime context safe env projection

1.7 Why this is not only an outbound problem

Outbound instability can surface as inbound errors:

  • 499 in HTTP logs (client closed or downstream timeout path).
  • 503 when process/service is degraded.
  • 000 in synthetic probes (curl transport failure).

1.8 Diagram: healthy vs exhausted behavior

flowchart TD
    A[Inbound request] --> B{Outbound call mode}
    B -->|Pooled| C[Reuse existing sockets]
    C --> D[Low churn]
    D --> E[Stable latency]
    E --> F[Healthy workers]

    B -->|No pooling| G[Create many new sockets]
    G --> H[High SNAT churn]
    H --> I[Outbound delay/timeout]
    I --> J[Worker timeout]
    J --> K[SIGKILL/recycle events]
    K --> L[499/503 increase]

    style C fill:#2e7d32,color:#fff
    style G fill:#ef6c00,color:#fff
    style K fill:#c62828,color:#fff

1.9 Baseline environment evidence (from artifacts)

Source files:

  • baseline/diag-net.json
  • baseline/diag-stats.json
  • baseline/diag-env.json
  • baseline/app-config.json
  • baseline/health.json

Observed baseline values:

Signal Value
Health payload {"lab":"snat-exhaustion","status":"healthy"}
Gunicorn startup command gunicorn --bind=0.0.0.0 --timeout=120 --workers=4 app:app
WEBSITES_PORT 8000
/proc/sys/net/ipv4/ip_local_port_range 32768-60999
Baseline connection_count 10
Baseline sockstat TCP in-use 5
Baseline sockstat TCP tw 4

2) Hypothesis

2.1 Statement (falsifiable)

Hypothesis:

When a Python/Flask app creates a new outbound TCP connection per request without connection pooling, SNAT ports exhaust within minutes under concurrent load, causing timeouts and SIGKILL'd workers.

2.2 Causal chain under test

flowchart TD
    A[No pooling in /outbound] --> B[High outbound socket churn]
    B --> C[SNAT mapping pressure]
    C --> D[Outbound timeout growth]
    D --> E[Gunicorn worker timeout]
    E --> F[Worker SIGKILL / recycle]
    F --> G[HTTP 499/503 and curl 000]

    style C fill:#ef6c00,color:#fff
    style E fill:#d84315,color:#fff
    style F fill:#b71c1c,color:#fff

2.3 Proof criteria

All of the following must be observed in the same trigger window:

  1. Transport failures appear under load
    • curl results include 000 responses and long (~60s) waits.
  2. HTTP log degradation appears
    • Large share of 499/503 with elevated TimeTaken.
  3. Application timeout signatures appear
    • Body samples include timeout text (for example, The read operation timed out).
  4. Worker instability appears in console logs
    • WORKER TIMEOUT and SIGKILL events recorded.
  5. Recovery indicator appears after pressure drops
    • Diagnostic endpoints become reachable again and counters restart/new PID appears.

2.4 Disproof criteria

Any one of the following disconfirms this specific chain:

  • High concurrency produces no transport failures and no elevated HTTP time.
  • No worker timeout/SIGKILL events during failure period.
  • Failures occur equally in pooled and non-pooled paths with equivalent concurrency.
  • Artifact evidence shows stable outbound behavior and no timeout signatures.

2.5 Scope boundaries

This lab tests application-driven outbound churn behavior, not every possible outbound failure root cause.

Not in scope:

  • Upstream service outage as primary fault.
  • DNS-wide outage.
  • VNet routing misconfiguration.
  • TLS certificate trust misconfiguration.

2.6 Expected measurable variables

Layer Variable Expected during failure
Trigger CSV status code many 000
Trigger CSV elapsed seconds cluster near 60
App response body sampleErrors timeout message present
HTTP logs ScStatus 499/503 rise
HTTP logs TimeTaken long-tail near timeout window
Console logs Gunicorn events WORKER TIMEOUT, SIGKILL
Diag endpoints reachability transient unreachability

2.7 Competing explanations considered

Alternative explanation How assessed in this lab
App code crash unrelated to outbound Console pattern shows repeated timeout->kill loop tied to pressure window
One-off platform restart Repeated failure signals in multiple artifacts, not a single restart message
Pure client-side network issue Server-side logs show timeout and worker churn signatures

3) Runbook

This runbook is the repeatable execution path. Use long-form flags for Azure CLI commands.

3.1 Prerequisites

Tool Check command
Azure CLI az version
Bash bash --version
jq jq --version
Authenticated session az account show

3.2 Variable setup

export RG="rg-lab-snat"
export LOCATION="koreacentral"
export TEMPLATE_FILE="labs/snat-exhaustion/main.bicep"

3.3 Deploy infrastructure

az group create --name "$RG" --location "$LOCATION"

az deployment group create \
  --resource-group "$RG" \
  --template-file "$TEMPLATE_FILE"

Capture app name:

export APP_NAME=$(az webapp list \
  --resource-group "$RG" \
  --query "[0].name" \
  --output tsv)

export APP_HOST=$(az webapp show \
  --resource-group "$RG" \
  --name "$APP_NAME" \
  --query "defaultHostName" \
  --output tsv)

export APP_URL="https://$APP_HOST"

3.4 Deploy lab app

az webapp deploy \
  --resource-group "$RG" \
  --name "$APP_NAME" \
  --src-path "labs/snat-exhaustion/app" \
  --type zip \
  --restart true

3.5 Baseline checks

curl --silent --show-error "$APP_URL/health"
curl --silent --show-error "$APP_URL/diag/env"
curl --silent --show-error "$APP_URL/diag/net"
curl --silent --show-error "$APP_URL/diag/stats"

Expected baseline shape:

  • Health returns status=healthy.
  • WEBSITES_PORT and/or PORT indicate container listener context.
  • /diag/net returns low TCP pressure.

3.6 Trigger failure mode

bash "labs/snat-exhaustion/trigger.sh" "$APP_URL"

Trigger behavior from script:

  • Sends 200 /outbound?calls=40 requests.
  • Runs concurrent batches (capped job count).
  • Summarizes transport (000) and HTTP (5xx) failures.

3.7 Optional control check (pooled endpoint)

Run a smaller controlled load against pooled mode:

for request_number in $(seq 1 40); do
  curl \
    --silent \
    --show-error \
    --output /dev/null \
    --write-out "%{http_code}\n" \
    "$APP_URL/outbound-fixed?calls=40"
done

3.8 Collect platform diagnostics

Portal view: Diagnose and solve (network and SNAT detector hub)

Azure portal Diagnose and solve problems blade for app-test-20251107 with a Search box, Common Solutions tab selected, and a Risk alerts panel showing Availability 2 Critical with View more details link. The Troubleshooting categories grid shows seven cards: Availability and Performance (links Application Logs, App Down Workflow, Web App Down), Configuration and Management (links Investigate EasyAuth errors, IP Address Configuration, All Scaling Operations), Risk Assessments (links Availability risks, Configuration risks), Deployment (Troubleshoot link), Networking (Troubleshoot link), Diagnostic Tools (links Auto-Heal, Network Troubleshooter, Advanced Application Restart), and Load Test your App (Create Load Test link). A Popular troubleshooting tools list at the bottom shows Application Logs, App Down Workflow, Web App Down, Web App Slow, and Process Full List.

The Diagnose and solve problems hub is the Portal first-stop for SNAT investigations - the Networking troubleshoot card and the Network Troubleshooter link under Diagnostic Tools both pivot directly to detectors that surface SNAT port pressure and outbound IP saturation. The Availability and Performance card's Web App Slow detector also frequently lights up first under SNAT exhaustion because the worker queue fills with TCP-blocked requests. Click into Networking here to land on the same blade shown below, then verify the outbound IP list against the per-instance SNAT port budget referenced in section 1.3. After this top-down triage, the queries in section 3.9 quantify the 499/503 error rate the detector tiles only summarize.

Portal view: Networking blade (outbound IP context)

Azure portal Networking blade for app-test-20251107 (Web App) with toolbar Refresh, Troubleshoot, Send us your feedback and a "Check your network configuration..." description with a Learn more link. Two-column layout: Inbound traffic configuration shows Public network access "Enabled with no access restrictions (Using default behavior)", App assigned address "Not configured", Private endpoints "0 private endpoints", Inbound IPv4 addresses 20.200.197.3, Inbound IPv6 addresses 2603:1040:f05:3::208, plus Optional inbound services with Azure Front Door "View details". Outbound traffic configuration shows Virtual network integration "Not configured", Hybrid connections "Not configured", Outbound DNS "Default (Azure-provided)", and a long Outbound IPv4 addresses list (20.214.209.150, 20.214.209.176, 20.214.209.187, ... ~30 platform-pool addresses) plus an Outbound IPv6 addresses list. Integration subnet configuration shows NAT gateway, Network security group, and User defined route all N/A. Left nav highlights Networking (under Favorites).

The Networking blade is the Portal counterpart to the SNAT KQL queries below. The Outbound traffic configuration column confirms Virtual network integration is Not configured and shows the ~30 shared platform-pool Outbound IPv4 addresses (20.214.209.150, 20.214.209.176, ...) that this app is multiplexing with other tenants - the exact root cause of SNAT port exhaustion under load. NAT gateway, Network security group, and User defined route all show N/A under Integration subnet configuration, which is the documented anti-pattern: a stateless outbound burst app has no dedicated SNAT pool and inherits the shared one. After confirming this state, run the queries below to quantify the resulting 499/503 errors.

HTTP signal query

AppServiceHTTPLogs
| where TimeGenerated > ago(2h)
| where CsHost has "azurewebsites"
| where CsUriStem in ("/outbound", "/outbound-fixed", "/diag/net", "/health")
| project TimeGenerated, CsUriStem, ScStatus, TimeTaken, CsHost
| order by TimeGenerated desc

Console signal query

AppServiceConsoleLogs
| where TimeGenerated > ago(2h)
| where ResultDescription has_any (
    "WORKER TIMEOUT",
    "SIGKILL",
    "timed out",
    "connection refused",
    "Cannot assign requested address",
    "EADDRNOTAVAIL"
)
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc

Platform signal query

AppServicePlatformLogs
| where TimeGenerated > ago(2h)
| where Message has_any (
    "warmup",
    "Container",
    "startup",
    "timeout"
)
| project TimeGenerated, Level, Message
| order by TimeGenerated desc

3.9 Azure CLI-based KQL execution (optional automation)

export WORKSPACE_ID="<log-analytics-workspace-id>"

az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "AppServiceHTTPLogs | where TimeGenerated > ago(2h) | where CsUriStem in ('/outbound','/outbound-fixed','/diag/net') | project TimeGenerated, CsUriStem, ScStatus, TimeTaken | order by TimeGenerated desc" \
  --output json

3.10 Validate recovery state

curl --silent --show-error "$APP_URL/diag/net"
curl --silent --show-error "$APP_URL/diag/stats"
curl --silent --show-error "$APP_URL/health"

If reachable and stable after pressure window, recovery evidence is present.

4) Experiment Log

This section is derived from actual sanitized lab artifacts:

labs/snat-exhaustion/artifacts-sanitized/

4.1 Artifact inventory used

Category File
Baseline baseline/diag-stats.json
Baseline baseline/diag-net.json
Baseline baseline/diag-env.json
Baseline baseline/app-config.json
Baseline baseline/health.json
Trigger trigger/outbound-targeted-20260404T055433Z.csv
Trigger trigger/error-body-*_body-20260404T055433Z.txt
Trigger trigger/diag-net-before-20260404T053447Z.json
Trigger trigger/diag-net-posttrigger-20260404T054508Z.json
Trigger trigger/diag-net-recovered-20260404T055433Z.json
Trigger trigger/kql-http-20260404T060610Z.json
Trigger trigger/kql-console-20260404T060610Z.json
Trigger trigger/kql-platform-20260404T060610Z.json

4.2 Baseline snapshot

4.2.1 /diag/stats baseline

Raw signal:

{"endpoint_counters":{"diag_stats":1},"outbound_call_counters":{"with-pooling":{"failures":0,"successes":0},"without-pooling":{"failures":0,"successes":0}},"pid":1907,"process_start_time":"2026-04-04T05:05:40.566103+00:00","request_count":1,"uptime_seconds":1642.522}

Interpretation:

  • Process was long-lived before trigger.
  • No outbound failures accumulated yet.

4.2.2 /diag/net baseline

Raw signal:

{"connection_count":10,"ip_local_port_range":{"end":"60999","start":"32768"},"sockstat":{"sockets":{"used":"10"},"tcp":{"alloc":"196","inuse":"5","mem":"18","orphan":"1","tw":"4"},"udp":{"inuse":"1","mem":"0"}}}

Interpretation:

  • Low active pressure.
  • Existing tw=4 confirms normal TCP post-close behavior.

4.3 Trigger output summary (targeted CSV)

Source: trigger/outbound-targeted-20260404T055433Z.csv

Derived summary:

Metric Value
Total requests 30
Transport failures (000) 22
HTTP 200 8
Requests at ~60s 22
Fastest successful request 31.967409s
Slowest successful request 57.227848s
Avg successful request time 44.16s

Selected rows:

Row Status Elapsed (s)
1 000 60.0002131
2 200 36.500380
3 200 53.071651
5 000 60.0000175
9 000 59.9991789
14 200 32.898595
20 000 60.0002442
30 000 60.0002993

4.4 Error-body sample evidence

Sources:

  • error-body-2_body-20260404T055433Z.txt
  • error-body-3_body-20260404T055433Z.txt
  • error-body-10_body-20260404T055433Z.txt
  • error-body-11_body-20260404T055433Z.txt
  • error-body-14_body-20260404T055433Z.txt

Aggregated findings:

Metric Value
Sample files analyzed 5
Files containing failures 1
Aggregate failures 1
Timeout text observed The read operation timed out

Representative payload with failure:

{"calls":20,"elapsedMs":35774,"failures":1,"mode":"without-pooling","sampleErrors":["The read operation timed out"],"successes":19,"target":"https://httpbin.org/get"}

4.5 Diag endpoint reachability during pressure

Checkpoint Artifact Observed
Pre-trigger diag-net-before-20260404T053447Z.json normal JSON (connection_count=6, tw=0)
During/after pressure diag-net-posttrigger-20260404T054508Z.json 504.0 GatewayTimeout
Recovered diag-net-recovered-20260404T055433Z.json normal JSON (connection_count=7, tw=1)

This confirms transient unreachability during the failure window.

4.6 HTTP KQL analysis

Source: kql-http-20260404T060610Z.json

Dataset size:

  • Total rows: 195

Status distribution:

Status Count
200 47
202 2
499 129
503 17

Endpoint-focused findings:

Metric Value
/outbound total rows 138
/outbound with 499 122
/outbound with 200 16
Rows with TimeTaken >= 59000ms 123
/outbound rows with 499 and TimeTaken >= 59000ms 118

Interpretation:

  • HTTP telemetry matches timeout-dominated failure shape.
  • The near-60s cluster strongly aligns with outbound wait/timeout behavior.

4.7 Console KQL analysis

Source: kql-console-20260404T060610Z.json

Dataset size and window:

Metric Value
Total rows 500
First row timestamp (oldest) 2026-04-04T05:40:02.5061808Z
Last row timestamp (newest) 2026-04-04T05:59:43.0881145Z

Pattern counts:

Signature Count
WORKER TIMEOUT 18
SIGKILL 14

Representative lines:

[2026-04-04 05:59:42 +0000] [1904] [CRITICAL] WORKER TIMEOUT (pid:1908)
[2026-04-04 05:59:43 +0000] [1904] [ERROR] Worker (pid:1908) was sent SIGKILL! Perhaps out of memory?
[2026-04-04 05:58:45 +0000] [1904] [CRITICAL] WORKER TIMEOUT (pid:1905)
[2026-04-04 05:58:46 +0000] [1904] [ERROR] Worker (pid:1905) was sent SIGKILL! Perhaps out of memory?

Interpretation:

  • Strong process churn coincides with high outbound timeout phase.
  • This supports cascading instability, not isolated request errors.

4.8 Platform KQL analysis

Source: kql-platform-20260404T060610Z.json

Rows sampled: 200

Dominant content:

  • Container lifecycle messages.
  • Startup/warmup informational traces.
  • No contradictory signal that would independently explain the timeout burst.

4.9 PID rollover evidence

Compare baseline vs recovered diagnostics:

Snapshot PID Process start time
Baseline diag-stats.json 1907 2026-04-04T05:05:40.566103+00:00
Recovered diag-stats-recovered...json 1908 2026-04-04T05:52:07.242253+00:00

This indicates process turnover occurred during the trigger window.

4.10 Failure cascade timeline (reconstructed)

timeline
    title SNAT Lab Failure Cascade (artifact reconstruction)
    05:40:02 : Console window starts
    05:52:07 : New worker generation visible
    05:54-05:56 : Trigger pressure period
    05:56:00-05:57:26 : HTTP 499 near 59-60s dominates
    05:58-05:59 : Repeated WORKER TIMEOUT and SIGKILL events
    05:55+ : /diag endpoints eventually recover

4.11 Hypothesis verdict

Criterion Result Evidence
Transport failures under load ✅ Met 22/30 probe rows = 000
HTTP degradation with long times ✅ Met 129×499, 17×503, long TimeTaken cluster
Timeout body evidence ✅ Met The read operation timed out in sample
Worker churn evidence ✅ Met 18 WORKER TIMEOUT, 14 SIGKILL
Recovery after pressure ✅ Met diag-net recovers from 504 to JSON

Final verdict: Hypothesis supported by artifacts.

4.12 Practical mitigation mapping

Symptom Mitigation
High churn outbound Reuse sessions/connection pools
Timeout bursts Reduce per-request outbound fan-out
Worker timeout/SIGKILL loops Increase resiliency + reduce blocked call time
Recurrence under load Scale out and validate outbound dependency behavior

To make this lab even stronger, add a matched run against /outbound-fixed with the same trigger shape and log both runs side-by-side.

Suggested comparison table:

Metric No pooling With pooling
curl 000 ratio expected high expected low
499 count expected high expected low
Worker timeout events expected present expected rare/none

Expected Evidence

This section defines what you SHOULD observe at each phase of the lab. Use it to validate your investigation is on track.

Before Trigger (Baseline)

Evidence Source Expected State What to Capture
AppServiceHTTPLogs All 200s with low latency Baseline query snapshot for /health, /diag/stats, /diag/net
AppServiceConsoleLogs Normal Gunicorn startup behavior Boot lines showing 4 sync workers
AppServicePlatformLogs Standard startup lifecycle Site start sequence without churn
/diag/stats + /diag/net Low outbound churn and stable socket counts Baseline connection_count, sockstat, and endpoint counters

During Incident

Evidence Source Expected State Key Indicator
AppServiceHTTPLogs (/outbound) 499 dominates during burst TimeTaken ~29786-29840 ms on timed-out outbound calls
AppServiceHTTPLogs (/diag/stats) Diagnostic endpoint can also time out 499 with TimeTaken 59709 ms indicates full stall
Trigger CSV + app payloads Mixed 000/499/503 and timeout text Connection churn exceeds available SNAT mappings
Console logs Worker timeout and kill churn WORKER TIMEOUT and SIGKILL align with outbound pressure window

After Recovery

Evidence Source Expected State Key Indicator
AppServiceHTTPLogs Timeout ratio drops when pressure is reduced Fewer long-tail 499 events after concurrency reduction
/diag/net + /diag/stats Endpoints become reachable again Diagnostic JSON resumes after stall period
Mitigation test Connection pooling/reduced fan-out improves stability Recovery requires reducing concurrent outbound calls or using service endpoints/private connectivity patterns
Incident interpretation 499 remains key symptom Front-end timeout waiting on blocked worker path, not immediate app-side 5xx

Evidence Timeline

graph TD
    A[Baseline Capture] --> B[Trigger Fault]
    B --> C[During: Collect Evidence]
    C --> D[After: Compare to Baseline]
    D --> E[Verdict: Confirmed/Falsified]

Evidence Chain: Why This Proves the Hypothesis

Falsification Logic

If you observe long TimeTaken 499 patterns on /outbound and even /diag/stats, plus worker timeout/kill churn in the same window, the hypothesis is CONFIRMED because outbound connection churn is stalling request processing in a SNAT-pressure cascade.

If you do NOT observe timeout clustering, diagnostic endpoint stall, or worker churn under equivalent outbound concurrency, the hypothesis is FALSIFIED — consider upstream dependency outages or non-SNAT network constraints.


Clean Up

az group delete --name "$RG" --yes --no-wait


See Also

Sources