Troubleshooting Architecture Map for Azure Functions

This guide is a diagnostic architecture reference for incident response. It is intentionally failure-oriented so you can locate where symptoms originate and decide which evidence source to check first. Use it as a fast index from symptom category to architecture layer ownership.

How to use this document

This is not a design poster. Read each diagram in its flow direction (left-to-right or top-to-bottom) as an evidence path: symptom → likely layer → telemetry/CLI confirmation.

Troubleshooting workflow

Start with First 10 Minutes, apply Methodology, and run focused queries from KQL Query Library. Then execute the scenario fixes from the guide that matches the symptom: Functions not executing, High latency / slow responses, Functions failing with errors, or Deployment failures.

Request path architecture (where user-facing failures surface)

The request path is where availability and latency symptoms first appear. Most 5xx, timeout, DNS, and auth issues can be mapped to one of these nodes.

```mermaid
flowchart LR
    A[Client] --> B[DNS]
    B --> C[Load Balancer]
    C --> D[Frontend]
    D --> E[Worker Process]
    E --> F[Function Code]
    F --> G[Output Binding]
    G --> H[Downstream]

    B -. DNS failure .-> X1[Name resolution error]
    D -. 401 or 403 auth failure .-> X2[Auth or EasyAuth reject]
    D -. 5xx gateway failure .-> X3[Frontend-generated 5xx]
    E -. timeout .-> X4[Execution timeout]
    G -. 5xx or retry storm .-> X5[Binding write failure]
    H -. timeout or 5xx .-> X6[Dependency failure]
```

Request-path failure map

| Hop | Typical Symptom | Common Failure | Primary Evidence | First Action |
| --- | --- | --- | --- | --- |
| Client → DNS | Immediate failure before app code runs | DNS record mismatch, private DNS zone missing | Client logs, Azure Front Door/Application Gateway logs, Azure DNS analytics | Validate DNS resolution path and private zone link |
| DNS → Load Balancer | Intermittent connect failures | Edge routing issue or transient platform fault | requests + Azure status | Check Azure Service Health and region incidents |
| Load Balancer → Frontend | 502/503 spikes | Frontend cannot route to healthy worker | requests resultCode + platform logs | Correlate failure spike with restarts |
| Frontend → Worker Process | Long tail latency, timed-out requests | Worker warm-up delay, instance recycle | traces host lifecycle, request duration | Check host start and recycle timeline |
| Worker Process → Function Code | 500 with exception | Runtime crash, unhandled exception | exceptions, traces | Identify dominant exception family |
| Function Code → Output Binding | Retry storms, partial success | Binding auth/config mismatch | traces binding errors, dependency failures | Validate binding connection settings |
| Output Binding → Downstream | High p95 and downstream 5xx | API/database slowness or throttling | dependencies p95/failure rate | Isolate failing target and apply backoff |
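
The failure map above can be read as a lookup from a coarse symptom class to a hop and its first action. The following is a minimal sketch of that idea, not a complete classifier: the function names, the symptom classes, and the ordering of the rules are assumptions made for illustration, and only the hop names and first actions come from the table.

```python
from typing import Optional

# Hypothetical lookup mirroring part of the request-path failure map.
# Keys are illustrative symptom classes; values are (hop, first action) from the table.
HOP_BY_SYMPTOM = {
    "dns": ("Client -> DNS", "Validate DNS resolution path and private zone link"),
    "gateway": ("Load Balancer -> Frontend", "Correlate failure spike with restarts"),
    "timeout": ("Frontend -> Worker Process", "Check host start and recycle timeline"),
    "exception": ("Worker Process -> Function Code", "Identify dominant exception family"),
}


def classify_symptom(result_code: Optional[int], has_exception: bool = False,
                     timed_out: bool = False, dns_error: bool = False) -> str:
    """Reduce one failed request to a coarse symptom class (illustrative rules only)."""
    if dns_error:
        return "dns"
    if timed_out:
        return "timeout"
    if result_code in (502, 503):
        return "gateway"
    if result_code == 500 and has_exception:
        return "exception"
    return "unknown"


def first_action(symptom: str) -> str:
    """Return 'hop: first action' for a symptom class, or a fallback for unmapped ones."""
    hop, action = HOP_BY_SYMPTOM.get(
        symptom, ("unmapped", "Use the summary routing table"))
    return f"{hop}: {action}"
```

For example, `first_action(classify_symptom(503))` points at the Load Balancer → Frontend hop. Real incidents usually mix several classes, so treat the result as a starting hypothesis rather than a verdict.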

High-signal KQL for request-path triage

```kusto
requests
| where timestamp > ago(30m)
| summarize
    Total=count(),
    Failed=countif(success == false),
    P95=percentile(duration, 95)
  by resultCode, bin(timestamp, 5m)
| order by timestamp desc
```

Runtime and worker model (where execution failures originate)

Azure Functions execution is split across host and language worker boundaries. Symptoms often appear in requests, but root causes are frequently in process lifecycle and resource pressure.

```mermaid
flowchart TB
    A[Functions Host Process\nTrigger listeners, bindings, scale hooks] --> B[Language Worker Process\nPython, Node.js, Java, .NET isolated]
    B --> C[User Function Code]

    A -. startup failure .-> X1[Host startup failed]
    B -. process crash .-> X2[Worker terminated]
    B -. GIL contention .-> X3[Python throughput collapse]
    A -. memory pressure .-> X4[Recycle or OOM risk]
    C -. blocking call .-> X5[Execution timeout]
```

Runtime diagnostic table

| Layer | Component | Common Failure | Evidence Source |
| --- | --- | --- | --- |
| Host | JobHost startup | Host cannot initialize listeners or storage lock | traces messages containing Initializing Host, Host lock, Host started |
| Host | Trigger listener | Listener disabled, misconfigured, or auth denied | traces listener warnings + trigger silence in requests/metrics |
| Worker | Language worker process | Crash loop, startup timeout, incompatible runtime | platform logs + traces worker startup lines |
| Worker | Python execution runtime | GIL contention under CPU-bound concurrency | high duration variance, low CPU parallelism, dependencies idle while requests queue |
| Runtime | Memory management | Memory pressure and instance recycle | Activity Log events, traces host shutdown/startup cycle |
| Code | Function entrypoint | Unhandled exceptions and blocking operations | exceptions, slow request traces, timeout messages |

Runtime evidence shortcuts

| Signal | Why it matters | Quick check |
| --- | --- | --- |
| Host started missing | App state may be Running but host not healthy | Query traces over last 15 minutes |
| Repeated host start/shutdown | Crash loop or platform recycle | Compare traces timeline with Activity Log |
| Fast failure with no dependency call | Fails before downstream access | Inspect startup/config/identity exceptions first |
| Latency rises while dependency latency stable | Worker-side bottleneck | Check CPU/memory pressure and concurrency model |
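
The first two shortcuts can be checked mechanically once trace rows have been exported. Below is a minimal sketch that assumes the rows are already available as `(timestamp, message)` tuples. The function names, the 15-minute default window, and the restart threshold of 3 are illustrative assumptions; only the `Host started` marker comes from the table above.

```python
from datetime import datetime, timedelta


def count_host_starts(trace_rows, window=timedelta(minutes=15), now=None):
    """Count trace messages containing 'Host started' inside the lookback window."""
    if not trace_rows:
        return 0
    if now is None:
        # Assumption: anchor the window at the newest exported row.
        now = max(ts for ts, _ in trace_rows)
    cutoff = now - window
    return sum(1 for ts, msg in trace_rows if ts >= cutoff and "Host started" in msg)


def host_state(trace_rows, restart_threshold=3, **kwargs):
    """Map the start count to one of the signals in the shortcuts table."""
    starts = count_host_starts(trace_rows, **kwargs)
    if starts == 0:
        return "Host started missing: host may not be healthy"
    if starts >= restart_threshold:
        return "Repeated host starts: possible crash loop or platform recycle"
    return "Host started: no restart loop in window"
```

A "repeated starts" result is only a prompt to compare the traces timeline with the Activity Log, as the table suggests. It does not by itself distinguish a crash loop from a platform recycle.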

Deployment path (where release regressions appear)

Many incidents begin at deployment transitions. Map failures by stage so rollback and forward-fix decisions are evidence-based.

```mermaid
flowchart LR
    A[Code] --> B[Build]
    B --> C[Artifact]
    C --> D[Deploy Target]
    D --> E[Slot]
    E --> F[Production]

    B -. dependency restore failure .-> X1[Build failed]
    C -. runtime mismatch .-> X2[Invalid artifact]
    D -. deploy auth failure .-> X3[Deploy rejected]
    E -. slot config drift .-> X4[Swap risk]
    F -. post-release 5xx .-> X5[Production regression]
```

Deployment stage troubleshooting table

| Stage | Failure Mode | Detection Method | Recovery Action |
| --- | --- | --- | --- |
| Code | Missing config contract, breaking change | PR checks, config schema validation | Revert change or patch config compatibility |
| Build | Dependency resolution or compile failure | CI logs and build summary | Pin package versions, fix pipeline cache and restore |
| Artifact | Runtime/version mismatch with app settings | Compare artifact metadata to function runtime | Rebuild artifact with aligned runtime |
| Deploy Target | Access denied or failed deployment operation | Activity Log and deployment task output | Fix RBAC/service principal scope and redeploy |
| Slot | Slot-specific app settings missing | Slot config diff and smoke tests | Sync required settings and mark sticky config |
| Production | Immediate 5xx/timeouts after release | requests and exceptions spike post-deploy | Swap back or roll back to last known good artifact |
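
The Production row depends on spotting a failure spike right after a release. One way to make that comparison explicit is sketched below, assuming request rows have been exported as `(timestamp, success)` tuples and the deployment time is known from the Activity Log. The function names, the 15-minute window, and the `factor` and `min_rate` thresholds are illustrative assumptions, not recommended values.

```python
from datetime import datetime, timedelta


def failure_rate(rows):
    """Fraction of failed requests in rows of (timestamp, success) tuples."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for _, ok in rows if not ok) / len(rows)


def regression_after_deploy(rows, deployed_at, window=timedelta(minutes=15),
                            factor=2.0, min_rate=0.05):
    """Compare failure rates in equal windows before and after a deployment.

    Returns (regressed, rate_before, rate_after). The thresholds are assumptions.
    """
    before = [r for r in rows if deployed_at - window <= r[0] < deployed_at]
    after = [r for r in rows if deployed_at <= r[0] < deployed_at + window]
    rate_before, rate_after = failure_rate(before), failure_rate(after)
    regressed = rate_after >= min_rate and rate_after >= factor * rate_before
    return regressed, rate_before, rate_after
```

Comparing equal windows on both sides of the deployment keeps the decision to swap back or roll forward tied to evidence. Low-traffic windows produce noisy rates, so check the row counts before acting on the result.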

Minimal CLI checks for deployment path

```bash
az monitor activity-log list \
  --subscription "<subscription-id>" \
  --resource-group "rg-myapp-prod" \
  --offset 2h \
  --max-events 50 \
  --output table

az functionapp deployment slot list \
  --resource-group "rg-myapp-prod" \
  --name "func-myapp-prod" \
  --output table
```

Network and outbound path (where external connectivity fails)

Outbound failures often look like app bugs but originate in network controls. Use this path to separate DNS, routing, NSG, and SNAT issues.

```mermaid
flowchart LR
    A[Function App] --> B[VNet Integration]
    B --> C[Subnet]
    C --> D[NSG]
    D --> E[UDR]
    E --> F[NAT or Firewall]
    F --> G[Internet or Private Endpoint]

    C -. SNAT exhaustion .-> X1[Ephemeral port depletion]
    D -. NSG block .-> X2[Outbound denied]
    E -. UDR misconfiguration .-> X3[Blackhole route]
    F -. firewall deny .-> X4[Egress blocked]
    G -. DNS resolution failure .-> X5[Name lookup failed]
```

Outbound failure evidence table

| Failure Point | Symptom | CLI Check | KQL Query |
| --- | --- | --- | --- |
| SNAT exhaustion | Intermittent connect timeout to many external targets | Diagnose and Solve Problems → SNAT Port Exhaustion; `az monitor metrics list --resource "/subscriptions/<subscription-id>/resourceGroups/rg-myapp-prod/providers/Microsoft.Web/sites/func-myapp-prod" --metric "TcpSynSent" --interval PT1M --aggregation Total --offset 30m --output table` | `dependencies \| where timestamp > ago(30m) \| where success == false \| summarize failures=count() by type, target` |
| DNS resolution (outbound) | ENOTFOUND, Name or service not known | `az network private-dns zone list --resource-group "rg-network" --output table` | `exceptions \| where timestamp > ago(30m) \| where type has "SocketException" or outerMessage has "DNS" or outerMessage has "NameResolution"` |
| NSG block | Hard timeout after SYN attempts | `az network nsg rule list --resource-group "rg-network" --nsg-name "nsg-functions" --output table` | `dependencies \| where timestamp > ago(30m) \| summarize timeoutCount=countif(tostring(resultCode) in ("", "0")) by target` |
| UDR misconfiguration | All traffic to one range fails after route change | `az network route-table route list --resource-group "rg-network" --route-table-name "rt-functions" --output table` | `dependencies \| where timestamp > ago(30m) \| where success == false \| summarize count() by target` |
| Firewall or NAT | Region-wide external egress failures | `az network firewall show --resource-group "rg-network" --name "fw-hub" --output table` | `dependencies \| where timestamp > ago(30m) \| summarize failed=countif(success == false), p95=percentile(duration,95) by target` |
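
Several rows above differ mainly in how widely failures spread across targets: SNAT exhaustion and egress blocks touch many targets, while a route change tends to affect one range. The sketch below groups exported dependency rows by target to make that spread visible. It assumes each row is a dict with `target` and `success` keys. The function names and the `many_targets` threshold are hypothetical, and the hints are starting points rather than diagnoses.

```python
from collections import Counter


def failures_by_target(dependency_rows):
    """Count failed dependency calls per target (rows are dicts with 'target' and 'success')."""
    return Counter(r["target"] for r in dependency_rows if not r["success"])


def spread_hint(dependency_rows, many_targets=3):
    """Suggest which part of the outbound path to check first, based on failure spread."""
    counts = failures_by_target(dependency_rows)
    if not counts:
        return "no failed dependencies in window"
    if len(counts) >= many_targets:
        return "failures across many targets: check SNAT and firewall/NAT egress first"
    return "failures concentrated on few targets: check routes, NSG rules, and DNS for those targets"
```

This mirrors the `summarize ... by target` queries in the table. The value of the grouping is that one failing target and a dozen failing targets call for different first checks.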

Observability map (where evidence is collected)

This map shows the primary data paths for troubleshooting. During incidents, choose evidence source by hypothesis rather than querying everything.

```mermaid
flowchart LR
    A[Function App] --> B[Application Insights]
    A --> C[Platform Logs]
    B --> D[Log Analytics Workspace]
    C --> D
    D --> E[Queries and Dashboards]
    B -. requests, traces, dependencies, exceptions .-> E
    C -. startup, recycle, platform events .-> E
```

Observability source matrix

| Data Source | What It Captures | Best For | Latency |
| --- | --- | --- | --- |
| Application Insights requests | Invocation success/failure, latency, result codes | User-facing 5xx, timeout trends, p95 analysis | Near real-time |
| Application Insights traces | Host lifecycle, listener state, runtime diagnostics | Startup failures, trigger initialization issues | Near real-time |
| Application Insights exceptions | Exception type, message, stack traces | Root-cause clustering by error family | Near real-time |
| Application Insights dependencies | Outbound call target, duration, success | Downstream slowness, DNS/network symptoms | Near real-time |
| Platform logs | Host/container/platform lifecycle events | Recycle loops and platform-generated restarts | Minutes |
| Activity Log | Configuration, deployment, RBAC change history | Change correlation and blast-window audit | Near real-time |

Where problems happen (summary)

Use this as the first routing table when symptom ownership is unclear.

| Symptom Category | Architecture Layer | Evidence Source | First Check |
| --- | --- | --- | --- |
| 5xx responses | Frontend / Worker | requests table, Http5xx metric | KQL: failed requests by resultCode |
| Startup failure | Host process | traces table, platform logs | KQL: host startup events |
| DNS or SNAT failure | Network / outbound | dependencies + exceptions, app logs | Run High latency / slow responses checks and SNAT detector + TcpSynSent metric |
| Trigger silence | Listener / storage | traces table, queue metrics | CLI: function list + storage peek |
| Slow responses | Worker / dependency | dependencies table | KQL: dependency p95 |
| Recycle or restart | Platform events | Activity Log, traces | KQL: host shutdown/startup timeline |

Suggested incident flow through architecture layers

```mermaid
flowchart TD
    A[Symptom detected] --> B{User-facing 5xx or timeout?}
    B -->|Yes| C[Check requests and exceptions]
    B -->|No| D[Check trigger activity and traces]
    C --> E{Downstream latency high?}
    E -->|Yes| F[Inspect dependencies and outbound network path]
    E -->|No| G[Inspect worker and host lifecycle]
    D --> H{No listener startup events?}
    H -->|Yes| I[Investigate host startup and config]
    H -->|No| J[Inspect trigger source and backlog]
    F --> K[Apply smallest safe mitigation]
    G --> K
    I --> K
    J --> K
```
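
The same decision flow can be written as a small function, which is convenient for runbook automation or for checking that on-call notes follow the intended order. This is a minimal sketch: the function name and the boolean parameter names are hypothetical, and the returned strings are the node labels from the flowchart.

```python
def next_step(user_facing_failure: bool,
              downstream_latency_high: bool = False,
              listener_startup_events_missing: bool = False) -> str:
    """Return the investigation step the flowchart reaches before mitigation."""
    if user_facing_failure:
        # Branch entered after checking requests and exceptions.
        if downstream_latency_high:
            return "Inspect dependencies and outbound network path"
        return "Inspect worker and host lifecycle"
    # Branch entered after checking trigger activity and traces.
    if listener_startup_events_missing:
        return "Investigate host startup and config"
    return "Inspect trigger source and backlog"
```

Every branch ends at the same place in the flowchart: apply the smallest safe mitigation once the layer is identified.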
