Troubleshooting Architecture Map for Azure Functions

This guide is a diagnostic architecture reference for incident response. It is intentionally failure-oriented so you can locate where symptoms originate and which evidence source to check first. Use it as a fast index from symptom category to architecture layer ownership.

How to use this document

This is not a design poster. Read each diagram in its flow direction (left to right, or top to bottom for the runtime and incident-flow diagrams) as an evidence path: symptom → likely layer → telemetry/CLI confirmation.

Troubleshooting workflow

Start with First 10 Minutes, apply Methodology, and run focused queries from the KQL Query Library. Then execute the fixes from the scenario page that matches the symptom: Functions not executing, High latency / slow responses, Functions failing with errors, or Deployment failures.

Request path architecture (where user-facing failures surface)

The request path is where availability and latency symptoms first appear. Most 5xx, timeout, DNS, and auth issues can be mapped to one of these nodes.

flowchart LR
    A[Client] --> B[DNS]
    B --> C[Load Balancer]
    C --> D[Frontend]
    D --> E[Worker Process]
    E --> F[Function Code]
    F --> G[Output Binding]
    G --> H[Downstream]

    B -. DNS failure .-> X1[Name resolution error]
    D -. 401 or 403 auth failure .-> X2[Auth or EasyAuth reject]
    D -. 5xx gateway failure .-> X3[Frontend-generated 5xx]
    E -. timeout .-> X4[Execution timeout]
    G -. 5xx or retry storm .-> X5[Binding write failure]
    H -. timeout or 5xx .-> X6[Dependency failure]

Request-path failure map

Hop	Typical Symptom	Common Failure	Primary Evidence	First Action
Client → DNS	Immediate failure before app code runs	DNS record mismatch, private DNS zone missing	Client logs, Azure Front Door/Application Gateway logs, Azure DNS analytics	Validate DNS resolution path and private zone link
DNS → Load Balancer	Intermittent connect failures	Edge routing issue or transient platform fault	`requests` + Azure status	Check Azure Service Health and region incidents
Load Balancer → Frontend	`502`/`503` spikes	Frontend cannot route to healthy worker	`requests` resultCode + platform logs	Correlate failure spike with restarts
Frontend → Worker Process	Long tail latency, timed-out requests	Worker warm-up delay, instance recycle	`traces` host lifecycle, request duration	Check host start and recycle timeline
Worker Process → Function Code	`500` with exception	Runtime crash, unhandled exception	`exceptions`, `traces`	Identify dominant exception family
Function Code → Output Binding	Retry storms, partial success	Binding auth/config mismatch	`traces` binding errors, dependency failures	Validate binding connection settings
Output Binding → Downstream	High p95 and downstream `5xx`	API/database slowness or throttling	`dependencies` p95/failure rate	Isolate failing target and apply backoff

High-signal KQL for request-path triage

requests
| where timestamp > ago(30m)
| summarize
    Total=count(),
    Failed=countif(success == false),
    P95=percentile(duration, 95)
  by resultCode, bin(timestamp, 5m)
| order by timestamp desc
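
To tie the query output back to the failure map, the rows can be bucketed by the hop most likely responsible. The following is a minimal Python sketch, assuming the rows were exported as dictionaries with the query's column names (`resultCode`, `Total`, `Failed`). The `likely_hop` and `rank_hops` helpers and the code-to-hop mapping are illustrative simplifications of the table above, not an official classification.

```python
from collections import Counter

# Illustrative mapping derived from the request-path failure map above.
# Real incidents can cross hops; treat the result as a starting hypothesis.
HOP_BY_RESULT_CODE = {
    "401": "Frontend (auth or EasyAuth reject)",
    "403": "Frontend (auth or EasyAuth reject)",
    "500": "Worker Process -> Function Code",
    "502": "Load Balancer -> Frontend",
    "503": "Load Balancer -> Frontend",
}

UNCLASSIFIED = "Unclassified: check traces and dependencies"


def likely_hop(result_code):
    """Return the hop to inspect first for a given HTTP result code."""
    return HOP_BY_RESULT_CODE.get(str(result_code), UNCLASSIFIED)


def rank_hops(rows):
    """Sum failed requests per likely hop, highest count first."""
    failed_by_hop = Counter()
    for row in rows:
        if row["Failed"] > 0:
            failed_by_hop[likely_hop(row["resultCode"])] += row["Failed"]
    return failed_by_hop.most_common()


sample_rows = [  # hypothetical rows shaped like the query output
    {"resultCode": "200", "Total": 900, "Failed": 0},
    {"resultCode": "503", "Total": 40, "Failed": 40},
    {"resultCode": "500", "Total": 12, "Failed": 12},
    {"resultCode": "502", "Total": 5, "Failed": 5},
]
print(rank_hops(sample_rows))
```

With the sample rows, the Load Balancer → Frontend hop ranks first (45 failed requests), which points to the restart-correlation check before any code-level debugging.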

Runtime and worker model (where execution failures originate)

Azure Functions execution is split across host and language worker boundaries. Symptoms often appear in the `requests` table, but root causes frequently lie in process lifecycle and resource pressure.

flowchart TB
    A[Functions Host Process\nTrigger listeners, bindings, scale hooks] --> B[Language Worker Process\nPython, Node.js, Java, .NET isolated]
    B --> C[User Function Code]

    A -. startup failure .-> X1[Host startup failed]
    B -. process crash .-> X2[Worker terminated]
    B -. GIL contention .-> X3[Python throughput collapse]
    A -. memory pressure .-> X4[Recycle or OOM risk]
    C -. blocking call .-> X5[Execution timeout]

Runtime diagnostic table

Layer	Component	Common Failure	Evidence Source
Host	JobHost startup	Host cannot initialize listeners or storage lock	`traces` messages containing `Initializing Host`, `Host lock`, `Host started`
Host	Trigger listener	Listener disabled, misconfigured, or auth denied	`traces` listener warnings + trigger silence in `requests`/metrics
Worker	Language worker process	Crash loop, startup timeout, incompatible runtime	platform logs + `traces` worker startup lines
Worker	Python execution runtime	GIL contention under CPU-bound concurrency	high duration variance, low CPU parallelism, `dependencies` idle while requests queue
Runtime	Memory management	Memory pressure and instance recycle	Activity Log events, `traces` host shutdown/startup cycle
Code	Function entrypoint	Unhandled exceptions and blocking operations	`exceptions`, slow request traces, timeout messages

Runtime evidence shortcuts

Signal	Why it matters	Quick check
`Host started` missing	App state may be Running but host not healthy	Query `traces` over last 15 minutes
Repeated host start/shutdown	Crash loop or platform recycle	Compare `traces` timeline with Activity Log
Fast failure with no dependency call	Fails before downstream access	Inspect startup/config/identity exceptions first
Latency rises while dependency latency stable	Worker-side bottleneck	Check CPU/memory pressure and concurrency model
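
The first two shortcuts (a missing `Host started` message and repeated start/shutdown cycles) can be checked mechanically once `traces` rows are exported. The sketch below assumes each event is a `(timestamp, message)` tuple. The function names, the 15-minute default window, and the restart threshold of 3 are illustrative assumptions rather than platform-defined values.

```python
from datetime import datetime, timedelta


def count_host_starts(events, window_end, window=timedelta(minutes=15)):
    """Count 'Host started' trace messages in the window ending at window_end."""
    window_start = window_end - window
    return sum(
        1
        for timestamp, message in events
        if window_start <= timestamp <= window_end and "Host started" in message
    )


def classify_host_state(events, window_end, restart_threshold=3):
    """Classify host health from lifecycle messages (threshold is an assumption)."""
    starts = count_host_starts(events, window_end)
    if starts == 0:
        # Only suspicious if a restart or deployment happened inside the window;
        # a long-running healthy host emits no new start message.
        return "no-start"
    if starts >= restart_threshold:
        return "restart-loop"
    return "started"


now = datetime(2024, 1, 1, 12, 0)  # hypothetical incident time
sample_events = [  # hypothetical trace rows
    (now - timedelta(minutes=12), "Host started (412ms)"),
    (now - timedelta(minutes=8), "Initializing Host."),
    (now - timedelta(minutes=7), "Host started (398ms)"),
    (now - timedelta(minutes=2), "Host started (405ms)"),
]
print(classify_host_state(sample_events, now))
```

A `restart-loop` result is a prompt to compare the `traces` timeline with the Activity Log, as the table suggests; it does not by itself distinguish a crash loop from a platform recycle.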

Deployment path (where release regressions appear)

Many incidents begin at deployment transitions. Map failures by stage so rollback and forward-fix decisions are evidence-based.

flowchart LR
    A[Code] --> B[Build]
    B --> C[Artifact]
    C --> D[Deploy Target]
    D --> E[Slot]
    E --> F[Production]

    B -. dependency restore failure .-> X1[Build failed]
    C -. runtime mismatch .-> X2[Invalid artifact]
    D -. deploy auth failure .-> X3[Deploy rejected]
    E -. slot config drift .-> X4[Swap risk]
    F -. post-release 5xx .-> X5[Production regression]

Deployment stage troubleshooting table

Stage	Failure Mode	Detection Method	Recovery Action
Code	Missing config contract, breaking change	PR checks, config schema validation	Revert change or patch config compatibility
Build	Dependency resolution or compile failure	CI logs and build summary	Pin package versions, fix pipeline cache and restore
Artifact	Runtime/version mismatch with app settings	Compare artifact metadata to function runtime	Rebuild artifact with aligned runtime
Deploy Target	Access denied or failed deployment operation	Activity Log and deployment task output	Fix RBAC/service principal scope and redeploy
Slot	Slot-specific app settings missing	Slot config diff and smoke tests	Sync required settings and mark sticky config
Production	Immediate `5xx`/timeouts after release	`requests` and `exceptions` spike post-deploy	Swap back or roll back to last known good artifact
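
For the Production row, the "spike post-deploy" signal can be made explicit by comparing failure rates on either side of the deployment time. This is a minimal sketch, assuming `requests` data was summarized into time buckets with `timestamp`, `Total`, and `Failed` fields. The helper names and the 5-percentage-point `min_increase` threshold are hypothetical and should be tuned to the app's normal error budget.

```python
from datetime import datetime


def failure_rate(buckets):
    """Return failed/total across buckets, or 0.0 when there is no traffic."""
    total = sum(b["Total"] for b in buckets)
    failed = sum(b["Failed"] for b in buckets)
    return failed / total if total else 0.0


def regression_after_deploy(buckets, deployed_at, min_increase=0.05):
    """Compare failure rates before and after the deployment timestamp."""
    before = [b for b in buckets if b["timestamp"] < deployed_at]
    after = [b for b in buckets if b["timestamp"] >= deployed_at]
    rate_before = failure_rate(before)
    rate_after = failure_rate(after)
    return {
        "before": rate_before,
        "after": rate_after,
        # min_increase is an assumed threshold, not a platform default.
        "regressed": rate_after - rate_before >= min_increase,
    }


deployed_at = datetime(2024, 1, 1, 12, 0)  # hypothetical release time
sample_buckets = [  # hypothetical 5-minute buckets
    {"timestamp": datetime(2024, 1, 1, 11, 50), "Total": 500, "Failed": 5},
    {"timestamp": datetime(2024, 1, 1, 11, 55), "Total": 500, "Failed": 5},
    {"timestamp": datetime(2024, 1, 1, 12, 0), "Total": 250, "Failed": 40},
    {"timestamp": datetime(2024, 1, 1, 12, 5), "Total": 250, "Failed": 60},
]
print(regression_after_deploy(sample_buckets, deployed_at))
```

A `regressed` result supports the swap-back or rollback action in the table; a flat rate suggests looking for a cause unrelated to the release.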

Minimal CLI checks for deployment path

az monitor activity-log list \
  --subscription "<subscription-id>" \
  --resource-group "rg-myapp-prod" \
  --offset 2h \
  --max-events 50 \
  --output table

az functionapp deployment slot list \
  --resource-group "rg-myapp-prod" \
  --name "func-myapp-prod" \
  --output table
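
Once the slots are known, slot config drift can be checked by diffing app settings between production and the slot. The sketch below assumes both setting lists were exported as JSON arrays of objects with `name`, `value`, and `slotSetting` keys, which is the shape `az functionapp config appsettings list` is expected to return (verify against your CLI version). `diff_slot_settings` is a hypothetical helper, and the sample setting names other than `FUNCTIONS_WORKER_RUNTIME` are made up.

```python
def diff_slot_settings(production, slot):
    """Report settings missing on either side and sticky-flag mismatches."""
    prod = {s["name"]: s for s in production}
    other = {s["name"]: s for s in slot}
    shared = prod.keys() & other.keys()
    return {
        "missing_in_slot": sorted(prod.keys() - other.keys()),
        "missing_in_production": sorted(other.keys() - prod.keys()),
        # Values are not compared: slots often hold different values on purpose.
        "sticky_mismatch": sorted(
            name
            for name in shared
            if prod[name].get("slotSetting", False)
            != other[name].get("slotSetting", False)
        ),
    }


production_settings = [  # hypothetical export from the production slot
    {"name": "FUNCTIONS_WORKER_RUNTIME", "value": "python", "slotSetting": False},
    {"name": "ServiceBusConnection", "value": "<redacted>", "slotSetting": True},
    {"name": "FEATURE_FLAG_X", "value": "true", "slotSetting": False},
]
staging_settings = [  # hypothetical export from a staging slot
    {"name": "FUNCTIONS_WORKER_RUNTIME", "value": "python", "slotSetting": False},
    {"name": "ServiceBusConnection", "value": "<redacted>", "slotSetting": False},
]
print(diff_slot_settings(production_settings, staging_settings))
```

Keys listed under `missing_in_slot` or `sticky_mismatch` are candidates for the "sync required settings and mark sticky config" recovery action before the next swap.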

Network and outbound path (where external connectivity fails)

Outbound failures often look like app bugs but originate in network controls. Use this path to separate DNS, routing, NSG, and SNAT issues.

flowchart LR
    A[Function App] --> B[VNet Integration]
    B --> C[Subnet]
    C --> D[NSG]
    D --> E[UDR]
    E --> F[NAT or Firewall]
    F --> G[Internet or Private Endpoint]

    C -. SNAT exhaustion .-> X1[Ephemeral port depletion]
    D -. NSG block .-> X2[Outbound denied]
    E -. UDR misconfiguration .-> X3[Blackhole route]
    F -. firewall deny .-> X4[Egress blocked]
    G -. DNS resolution failure .-> X5[Name lookup failed]

Outbound failure evidence table

Failure Point	Symptom	CLI Check	KQL Query
SNAT exhaustion	Intermittent connect timeout to many external targets	Diagnose and Solve Problems → SNAT Port Exhaustion; `az monitor metrics list --resource "/subscriptions/<subscription-id>/resourceGroups/rg-myapp-prod/providers/Microsoft.Web/sites/func-myapp-prod" --metric "TcpSynSent" --interval PT1M --aggregation Total --offset 30m --output table`	`dependencies | where timestamp > ago(30m) | where success == false | summarize failures=count() by type, target`
DNS resolution (outbound)	`ENOTFOUND`, `Name or service not known`	`az network private-dns zone list --resource-group "rg-network" --output table`	`exceptions | where timestamp > ago(30m) | where type has "SocketException" or outerMessage has "DNS" or outerMessage has "NameResolution"`
NSG block	Hard timeout after SYN attempts	`az network nsg rule list --resource-group "rg-network" --nsg-name "nsg-functions" --output table`	`dependencies | where timestamp > ago(30m) | summarize timeoutCount=countif(tostring(resultCode) in ("", "0")) by target`
UDR misconfiguration	All traffic to one range fails after route change	`az network route-table route list --resource-group "rg-network" --route-table-name "rt-functions" --output table`	`dependencies | where timestamp > ago(30m) | where success == false | summarize count() by target`
Firewall or NAT	Region-wide external egress failures	`az network firewall show --resource-group "rg-network" --name "fw-hub" --output table`	`dependencies | where timestamp > ago(30m) | summarize failed=countif(success == false), p95=percentile(duration,95) by target`
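
The symptom column above can be turned into a rough first-pass classifier over failed `dependencies` rows. This is only a sketch of the table's heuristics, assuming each failure is a dictionary with a `target` and an optional `message`. The function name, the labels, and both thresholds (half of failures for DNS, five distinct targets for SNAT) are assumptions chosen for illustration.

```python
# Error strings listed in the DNS resolution row of the table above.
DNS_MARKERS = ("ENOTFOUND", "Name or service not known")


def suggest_outbound_check(failures, many_targets_threshold=5):
    """Suggest which outbound failure point to check first (heuristic only)."""
    if not failures:
        return "none"
    dns_hits = sum(
        1
        for failure in failures
        if any(marker in failure.get("message", "") for marker in DNS_MARKERS)
    )
    if dns_hits * 2 >= len(failures):
        return "dns"  # most failures mention name resolution
    distinct_targets = {failure["target"] for failure in failures}
    if len(distinct_targets) >= many_targets_threshold:
        return "snat"  # failures spread across many external targets
    return "nsg-udr-firewall"  # failures concentrated on few targets


sample_failures = [  # hypothetical failed dependency rows
    {"target": f"api{i}.example.com", "message": "connect timeout"}
    for i in range(6)
]
print(suggest_outbound_check(sample_failures))
```

The returned label only selects which row's CLI check and KQL query to run first; it is not a diagnosis.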

Observability map (where evidence is collected)

This map shows the primary data paths for troubleshooting. During incidents, choose the evidence source that fits your hypothesis rather than querying everything.

flowchart LR
    A[Function App] --> B[Application Insights]
    A --> C[Platform Logs]
    B --> D[Log Analytics Workspace]
    C --> D
    D --> E[Queries and Dashboards]
    B -. requests, traces, dependencies, exceptions .-> E
    C -. startup, recycle, platform events .-> E

Observability source matrix

Data Source	What It Captures	Best For	Latency
Application Insights `requests`	Invocation success/failure, latency, result codes	User-facing `5xx`, timeout trends, p95 analysis	Near real-time
Application Insights `traces`	Host lifecycle, listener state, runtime diagnostics	Startup failures, trigger initialization issues	Near real-time
Application Insights `exceptions`	Exception type, message, stack traces	Root-cause clustering by error family	Near real-time
Application Insights `dependencies`	Outbound call target, duration, success	Downstream slowness, DNS/network symptoms	Near real-time
Platform logs	Host/container/platform lifecycle events	Recycle loops and platform-generated restarts	Minutes
Activity Log	Configuration, deployment, RBAC change history	Change correlation and blast-window audit	Near real-time

Where problems happen (summary)

Use this as the first routing table when symptom ownership is unclear.

Symptom Category	Architecture Layer	Evidence Source	First Check
5xx responses	Frontend / Worker	`requests` table, `Http5xx` metric	KQL: failed requests by resultCode
Startup failure	Host process	`traces` table, platform logs	KQL: host startup events
DNS or SNAT failure	Network / outbound	`dependencies` + `exceptions`, app logs	Run the High latency / slow responses checks, the SNAT detector, and the `TcpSynSent` metric query
Trigger silence	Listener / storage	`traces` table, queue metrics	CLI: function list + storage peek
Slow responses	Worker / dependency	`dependencies` table	KQL: dependency p95
Recycle or restart	Platform events	Activity Log, `traces`	KQL: host shutdown/startup timeline

Suggested incident flow through architecture layers

flowchart TD
    A[Symptom detected] --> B{User-facing 5xx or timeout?}
    B -->|Yes| C[Check requests and exceptions]
    B -->|No| D[Check trigger activity and traces]
    C --> E{Downstream latency high?}
    E -->|Yes| F[Inspect dependencies and outbound network path]
    E -->|No| G[Inspect worker and host lifecycle]
    D --> H{No listener startup events?}
    H -->|Yes| I[Investigate host startup and config]
    H -->|No| J[Inspect trigger source and backlog]
    F --> K[Apply smallest safe mitigation]
    G --> K
    I --> K
    J --> K
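
The same flow can be written as a small decision function, which is handy for runbook automation or for checking that on-call notes follow the intended order. This is a direct transcription of the diagram into Python. The function and parameter names are hypothetical, and the three boolean inputs are assumed to come from the `requests`, `dependencies`, and `traces` checks described earlier.

```python
def next_step(user_facing_failure, downstream_latency_high=False,
              listener_startup_events_seen=True):
    """Return the next investigation step from the incident flow diagram."""
    if user_facing_failure:
        # Branch C: requests and exceptions have been checked first.
        if downstream_latency_high:
            return "Inspect dependencies and outbound network path"
        return "Inspect worker and host lifecycle"
    # Branch D: trigger activity and traces have been checked first.
    if not listener_startup_events_seen:
        return "Investigate host startup and config"
    return "Inspect trigger source and backlog"


print(next_step(user_facing_failure=True, downstream_latency_high=True))
print(next_step(user_facing_failure=False, listener_startup_events_seen=False))
```

Every branch ends at the same place as the diagram: apply the smallest safe mitigation once the responsible layer is identified.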