Troubleshooting Mental Model¶
This page provides a classification model for Azure Functions incidents so you can start with the correct evidence source instead of guessing.
Core idea: classify the problem first, then investigate deeply.
Why this model matters¶
Most incident delays come from category mistakes:
- trigger listener failures investigated as application code bugs
- outbound DNS/storage auth failures investigated as CPU problems
- deployment events ignored while symptoms are treated as random instability
- cold start behavior on Consumption plans mistaken for application regression
This classification helps you avoid looking at the wrong logs from the start.
Classification flowchart¶
```mermaid
flowchart TD
    A[Observed symptom] --> B{Primary failure signal}
    B -->|No invocations, zero executions| C[Category 1: Trigger and listener issue]
    B -->|5xx, latency, timeout| D[Category 2: Execution and runtime issue]
    B -->|Degradation over time| E[Category 3: Resource exhaustion]
    B -->|Dependency timeout, DNS, connection reset| F[Category 4: Dependency and outbound issue]
    B -->|Regression after change, restart, config| G[Category 5: Deployment and configuration event]
    C --> C1[Start with trigger listener and host startup evidence]
    D --> D1[Start with requests and exceptions tables]
    E --> E1[Start with memory and cold start progression]
    F --> F1[Start with dependencies and resolver evidence]
    G --> G1[Start with Activity Log and config delta]
```

Category summary matrix¶
| Category | Typical Symptoms | First Signal to Check | Common Mistake |
|---|---|---|---|
| Trigger/listener issue | Zero invocations, listener failed to start, function disabled | traces table: listener and host startup messages | Assuming function is running because the app is up |
| Execution/runtime issue | 5xx errors, timeouts, exception storms | requests + exceptions tables | Restarting app before collecting error evidence |
| Resource exhaustion | Gradual slowdown, worker crashes, cold start spikes | Memory metrics + traces for OOM/restart patterns | Looking only at CPU when memory is the bottleneck |
| Dependency/outbound issue | Connect timeout, 401/403, DNS failures | dependencies table + DNS resolver checks | Blaming function code when downstream is unreachable |
| Deployment/config event | Incident starts after deploy/config change/restart | Activity Log + traces for host lifecycle | Treating change-related incidents as random noise |
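
The flowchart and matrix above amount to a lookup from primary failure signal to category and first evidence source, followed by a confirm-or-reclassify step. A minimal Python sketch of that idea follows. The `Category` type, the `classify` and `triage` functions, and the keyword lists are illustrative assumptions: the keywords are adapted from the flowchart edges, are not exhaustive, and are not part of any Azure SDK.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Category:
    number: int
    name: str
    first_signal: str
    keywords: tuple  # hypothetical keywords, adapted from the flowchart edges


CATEGORIES = (
    Category(1, "Trigger and listener issue",
             "traces table: listener and host startup messages",
             ("no invocations", "zero executions", "listener")),
    Category(2, "Execution and runtime issue",
             "requests + exceptions tables",
             ("5xx", "latency", "timeout", "exception")),
    Category(3, "Resource exhaustion",
             "memory metrics + traces for OOM/restart patterns",
             ("degradation over time", "out of memory", "worker crash")),
    Category(4, "Dependency and outbound issue",
             "dependencies table + DNS resolver checks",
             ("dependency timeout", "dns", "connection reset")),
    Category(5, "Deployment and configuration event",
             "Activity Log + traces for host lifecycle",
             ("after deploy", "after restart", "config change")),
)


def classify(symptom: str) -> List[Category]:
    """Return candidate categories, most keyword hits first."""
    text = symptom.lower()
    scored = [(sum(k in text for k in c.keywords), c) for c in CATEGORIES]
    hits = [(score, c) for score, c in scored if score > 0]
    # Stable sort: ties keep the flowchart order (category 1 before 2, ...).
    hits.sort(key=lambda pair: -pair[0])
    return [c for _, c in hits]


def triage(symptom: str,
           confirms: Callable[[Category], bool]) -> Optional[Category]:
    """Check each candidate's first signal; move on when evidence eliminates it."""
    for candidate in classify(symptom):
        if confirms(candidate):
            return candidate
    return None
```

Here `confirms` stands in for whatever the responder does with the first signal, such as running a query and reading the result. A return value of `None` means every candidate was eliminated and the symptom description needs another look.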
1) Category: Trigger and Listener Issue¶
Trigger issues are failures in the path from event source to function invocation.
Typical symptom patterns¶
- Zero invocations despite active event source
- `listener ... unable to start` in traces
- Function shows as disabled in portal
- Blob trigger not firing on Flex Consumption (missing Event Grid subscription)
First signal to check¶
```kusto
let appName = "func-myapp-prod";
traces
| where timestamp > ago(30m)
| where cloud_RoleName =~ appName
| where message has_any ("listener", "disabled", "unable to start", "trigger", "Host started")
| project timestamp, severityLevel, message
| order by timestamp desc
```
Key differentiation¶
| Sub-pattern | Evidence | Resolution Direction |
|---|---|---|
| Function disabled | IsDisabled=true in function list | Remove disable setting |
| Listener auth failure | 403 or 401 in listener start error | Fix RBAC or connection string |
| Host not completing startup | Host started missing | Check app settings and runtime config |
| Source not delivering | Zero messages in source metrics | Fix upstream publisher or subscription |
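
The table above can be read as an ordered set of checks. The sketch below applies it to three inputs a responder would already have: the trace messages, the disabled flag from the function list, and the message count from the source metrics. The function name `listener_sub_pattern`, the order of the checks, and the substring matching are assumptions made for illustration; real listener error text varies by trigger type.

```python
from typing import Iterable


def listener_sub_pattern(trace_messages: Iterable[str],
                         is_disabled: bool,
                         source_message_count: int) -> str:
    """Map trigger/listener evidence to a resolution direction from the table."""
    messages = [m.lower() for m in trace_messages]

    # Evidence: IsDisabled=true in the function list.
    if is_disabled:
        return "Function disabled: remove disable setting"

    # Evidence: 401 or 403 inside a listener start error.
    listener_errors = [m for m in messages
                       if "listener" in m and "unable to start" in m]
    if any("401" in m or "403" in m for m in listener_errors):
        return "Listener auth failure: fix RBAC or connection string"

    # Evidence: no "Host started" message in the window.
    if not any("host started" in m for m in messages):
        return "Host not completing startup: check app settings and runtime config"

    # Evidence: the event source itself produced nothing.
    if source_message_count == 0:
        return "Source not delivering: fix upstream publisher or subscription"

    return "No listener sub-pattern matched: re-classify"
```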
Related playbooks¶
2) Category: Execution and Runtime Issue¶
Execution issues are failures during function invocation: the function starts but returns errors or exceeds its time limit.
Typical symptom patterns¶
- HTTP 5xx responses from function endpoints
- `RpcException` or application exceptions in logs
- Execution timeout exceeded messages
- High error rate on specific functions while others are healthy
First signal to check¶
```kusto
let appName = "func-myapp-prod";
requests
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
    Invocations = count(),
    Failures = countif(success == false),
    FailureRate = round(100.0 * countif(success == false) / count(), 2),
    P95Ms = percentile(duration, 95)
    by FunctionName = operation_Name
| order by Failures desc
```
Key differentiation¶
| Sub-pattern | Evidence | Resolution Direction |
|---|---|---|
| Application exception | Dominant exception type in exceptions | Fix application code |
| Execution timeout | Timeout value of ... exceeded in traces | Reduce work or increase timeout |
| HTTP 230s load balancer timeout | HTTP trigger returns 502 after ~230s | Use Durable Functions async pattern |
| Poison message loop | Same message dequeued repeatedly then poisoned | Fix processing code or increase dequeue count |
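
The poison message row can be made concrete with a small in-memory simulation. For Azure Queue storage triggers, a failing message is retried until its dequeue count reaches `maxDequeueCount` (default 5) and is then moved to a queue named `<originalqueuename>-poison`. The sketch below imitates that behavior without any Azure SDK; `process_with_poison_queue` and the `handler` callback are hypothetical names.

```python
from collections import deque
from typing import Callable, Dict, List

MAX_DEQUEUE_COUNT = 5  # mirrors the documented default of queues.maxDequeueCount


def process_with_poison_queue(
        messages: List[str],
        handler: Callable[[str], None],
        max_dequeue_count: int = MAX_DEQUEUE_COUNT) -> Dict[str, List[str]]:
    """Simulate queue-trigger retries; return processed and poisoned messages."""
    queue = deque((message, 0) for message in messages)
    processed: List[str] = []
    poison: List[str] = []

    while queue:
        message, dequeue_count = queue.popleft()
        dequeue_count += 1
        try:
            handler(message)
            processed.append(message)
        except Exception:
            if dequeue_count >= max_dequeue_count:
                # Stands in for the "<originalqueuename>-poison" queue.
                poison.append(message)
            else:
                # The message becomes visible again and is retried later.
                queue.append((message, dequeue_count))

    return {"processed": processed, "poison": poison}
```

With a handler that always fails on a given message, raising the dequeue count only adds retries before the message is poisoned, which is why fixing the processing code is listed first in the resolution direction.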
Related playbooks¶
3) Category: Resource Exhaustion¶
Resource exhaustion issues develop gradually as load increases or memory accumulates over time.
Typical symptom patterns¶
- Increasing latency over hours
- Worker process crashes (OOM)
- Cold start frequency increasing
- `System.OutOfMemoryException` in exceptions
First signal to check¶
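```kusto
let appName = "func-myapp-prod";
exceptions
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
// contains, not has_any: has_any matches whole terms only, and
// "OutOfMemory" is just part of the term "OutOfMemoryException".
| where type contains "OutOfMemory" or type contains "StackOverflow" or type contains "ThreadAbort"
| summarize Count = count() by bin(timestamp, 15m), type
| order by timestamp desc
```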
Key differentiation¶
| Sub-pattern | Evidence | Resolution Direction |
|---|---|---|
| Memory pressure | OOM exceptions + worker restarts | Reduce memory usage or upgrade plan |
| Cold start cascade | High startup frequency + latency spikes | Pre-warm or use Premium plan |
| Thread pool exhaustion | Async deadlock patterns + growing latency | Fix sync-over-async code |
| GIL contention (Python) | CPU flat but latency high on CPU-bound work | Use multiprocessing or offload to Durable Functions |
Related playbooks¶
4) Category: Dependency and Outbound Issue¶
Dependency issues are failures in outbound calls to external services, storage, databases, or APIs.
Typical symptom patterns¶
- `ConnectTimeout` or `ConnectionRefused` in dependency logs
- 401/403 from downstream services (managed identity issues)
- DNS resolution failures in VNet-integrated apps
- SNAT port exhaustion on Consumption plan
First signal to check¶
```kusto
let appName = "func-myapp-prod";
dependencies
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where success == false
| summarize Count = count(), AvgDuration = avg(duration) by target, resultCode, type
| order by Count desc
```
Key differentiation¶
| Sub-pattern | Evidence | Resolution Direction |
|---|---|---|
| Auth failure (managed identity) | 401/403 on specific targets | Fix role assignments or identity config |
| DNS resolution failure | DNS-related error messages in VNet app | Fix private DNS zones or DNS forwarding |
| Storage unreachable | Failed calls to blob/queue/table endpoints | Check firewall rules and network config |
| SNAT exhaustion | Intermittent outbound failures at scale | Use connection pooling, consider VNet integration |
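
For the SNAT row, connection pooling usually means creating outbound clients once per worker process and reusing them across invocations, rather than opening a new connection on every call. The sketch below shows that shape with the standard library only. `get_connection` and `fetch_status` are hypothetical helpers, and a real app would more likely reuse an SDK client or HTTP session in the same way.

```python
import http.client
import threading
from typing import Dict

# Module-level state lives for the lifetime of the worker process,
# so every invocation in that process sees the same objects.
_connections: Dict[str, http.client.HTTPSConnection] = {}
_lock = threading.Lock()


def get_connection(host: str) -> http.client.HTTPSConnection:
    """Return one shared connection object per host instead of one per call."""
    with _lock:
        conn = _connections.get(host)
        if conn is None:
            # Constructing the object does not open a socket;
            # the socket is opened on the first request.
            conn = http.client.HTTPSConnection(host, timeout=10)
            _connections[host] = conn
        return conn


def fetch_status(host: str, path: str = "/") -> int:
    """Issue a GET over the shared connection and return the status code.

    Assumption: one request at a time per host. http.client connections
    are not safe for concurrent requests from multiple threads.
    """
    conn = get_connection(host)
    conn.request("GET", path)
    response = conn.getresponse()
    response.read()  # drain the body so the connection can be reused
    return response.status
```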
Related playbooks¶
5) Category: Deployment and Configuration Event¶
Deployment and configuration issues are failures triggered by recent changes: deployments, setting modifications, identity updates, or platform events.
Typical symptom patterns¶
- Incident starts immediately after deployment or config change
- `No job functions found` after deploy
- Host startup failure after runtime version change
- Functions disappear after `FUNCTIONS_WORKER_RUNTIME` change
First signal to check¶
```bash
az monitor activity-log list \
    --resource-group "$RG" \
    --offset 2h \
    --status Succeeded \
    --output table
```
Key differentiation¶
| Sub-pattern | Evidence | Resolution Direction |
|---|---|---|
| Wrong runtime setting | No job functions found after deploy | Fix FUNCTIONS_WORKER_RUNTIME |
| Missing storage config | Host fails to start | Restore AzureWebJobsStorage |
| Extension bundle mismatch | Binding errors at startup | Update extensionBundle in host.json |
| Key Vault reference syntax error | Setting resolves to literal @Microsoft.KeyVault(...) | Fix reference URI syntax |
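
The evidence in the last row can be checked mechanically: when a Key Vault reference fails to resolve, the app sees the literal reference string as the setting value. The sketch below scans a mapping of settings for values that still start with `@Microsoft.KeyVault(`. The function name `unresolved_keyvault_settings` is hypothetical, and reading from `os.environ` by default assumes the check runs inside the function app process.

```python
import os
from typing import List, Mapping, Optional

KEYVAULT_PREFIX = "@Microsoft.KeyVault("


def unresolved_keyvault_settings(
        settings: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return names of settings whose value is still a literal Key Vault reference."""
    source = os.environ if settings is None else settings
    return sorted(name for name, value in source.items()
                  if value.strip().startswith(KEYVAULT_PREFIX))
```

Logging only the setting names, never the values, keeps the check safe to run during an incident.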
Related playbooks¶
Using this model during incidents¶
```mermaid
sequenceDiagram
    participant R as Responder
    participant M as Mental Model
    participant E as Evidence
    participant P as Playbook
    R->>M: Observe primary symptom
    M->>R: Classify into category (1-5)
    R->>E: Check first signal for that category
    E->>R: Evidence confirms or eliminates category
    alt Category confirmed
        R->>P: Open category-specific playbook
        P->>R: Follow hypothesis-driven investigation
    else Category eliminated
        R->>M: Re-classify with next likely category
    end
```

Anti-patterns¶
| Anti-pattern | Why It Fails | Better Approach |
|---|---|---|
| Restart first, ask questions later | Destroys diagnostic state | Collect first signal, then restart if needed |
| Assume it is always code | Config and platform causes are equally common | Classify first, investigate accordingly |
| Check everything at once | Wastes time and creates noise | Use category to narrow first evidence source |
| Skip classification on familiar symptoms | Confirmation bias leads to wrong fix | Always validate classification with evidence |