Troubleshooting Decision Tree
Use this page to triage quickly from a symptom to the likely failure category, then open the matching playbook.
The tree is intentionally symptom-first and optimized for the first 10–15 minutes of incident response.
Main triage decision tree
flowchart TD
S[Incident starts: user-visible impact] --> Q1{Functions executing?}
Q1 -->|No invocations| Q1A{Function enabled and listener healthy?}
Q1 -->|"Yes but slow/failing"| Q2{Is it a latency issue?}
Q1A -->|Function disabled| P1[Playbook: Functions Not Executing]
Q1A -->|Listener failed to start| Q1B{Auth or connection error?}
Q1A -->|Host not starting| P2[Playbook: App Settings Misconfiguration]
Q1B -->|"Auth error 401/403"| P3["Playbook: Managed Identity / RBAC Failure"]
Q1B -->|Connection error| P4[Playbook: Functions Not Executing]
Q2 -->|Yes, high latency| Q2A{Cold start or dependency?}
Q2 -->|"No, errors/failures"| Q3{What type of failure?}
Q2A -->|Cold start pattern| P5[Playbook: High Latency]
Q2A -->|Dependency timeout| P6[Check dependency health]
Q2A -->|Execution timeout| P7["Playbook: Timeout / Execution Limit Exceeded"]
Q3 -->|5xx errors| Q3A{After deployment?}
Q3 -->|Exception storm| P8[Playbook: Functions Failing]
Q3 -->|Queue backlog growing| P9[Playbook: Queue Piling Up]
Q3A -->|Yes| P10[Playbook: Deployment Failures]
Q3A -->|No| Q4{Memory or OOM signals?}
Q4 -->|Yes| P11["Playbook: Out of Memory / Worker Crash"]
Q4 -->|No| P12[Use Methodology: build hypotheses from evidence]
Trigger-specific decision tree
flowchart LR
A[Trigger type] --> B{Which trigger?}
B -->|HTTP| C{Status code pattern}
B -->|"Queue/Service Bus"| D{Processing pattern}
B -->|Blob| E{Event delivery}
B -->|Timer| F{Schedule pattern}
B -->|Event Hub| G{Lag pattern}
B -->|Durable| H{Orchestration state}
C -->|5xx spike| I[Functions Failing]
C -->|Timeout 230s| J["Timeout / Execution Limit"]
D -->|Backlog growing| K[Queue Piling Up]
D -->|Poison messages| L[Functions Failing]
E -->|Not firing on FC1| M[Blob Trigger Not Firing]
E -->|Delayed on Y1| N[Blob Trigger Not Firing]
F -->|Missed executions| O[Check isPastDue and RunOnStartup]
G -->|Checkpoint behind| P["Event Hub / Service Bus Lag"]
H -->|Stuck in Running| Q[Durable Orchestration Stuck]
Playbook leaves (direct links)
Triggers
Scaling
Auth / Config
General
- Functions Not Executing
- High Latency
- Functions Failing
- Queue Piling Up
- Blob Trigger Not Firing
- Deployment Failures
Quick reference matrix
| Symptom Pattern | Most Likely Cause Category | Playbook Link |
|---|---|---|
| Zero invocations despite active source | Disabled function, listener failure, or host not starting | Functions Not Executing |
| 5xx errors after deployment | Deployment artifact or config regression | Deployment Failures |
| High P95 latency, normal P50 | Cold start or intermittent dependency | High Latency |
| Execution timeout errors | Function exceeds plan timeout limit | Timeout / Execution Limit |
| Queue depth rising, executions flat | Trigger stall or scaling bottleneck | Queue Piling Up |
| Blob trigger not firing on FC1 | Missing Event Grid subscription | Blob Trigger Not Firing |
| 401/403 on dependency calls | Managed identity or RBAC misconfiguration | Managed Identity / RBAC Failure |
| Host fails to start, no functions found | Missing or wrong app settings | App Settings Misconfiguration |
| Worker crashes under load | Memory exhaustion | Out of Memory / Worker Crash |
| Durable orchestration stuck in Running | Replay storm or non-deterministic code | Durable Orchestration Stuck |
| Event Hub/Service Bus processing falling behind | Checkpoint lag or slow processing | Event Hub / Service Bus Lag |
| Repeated exceptions dominating failures | Application code error | Functions Failing |
Triage prompts to ask in order
- Are functions executing at all? If no, is the function enabled and listener healthy?
- Was there a recent deployment or configuration change in the incident window?
- Is it a latency issue (slow) or a failure issue (errors)?
- Is the issue specific to one trigger type or affecting all functions?
- Are there memory pressure or OOM signals in the logs?
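To make the ordering concrete, the main triage tree can be encoded as a small function. The following is a minimal sketch, not part of any Azure SDK or tool: the `Signals` fields and the `triage` function are hypothetical names chosen for illustration, and the fallback for an unclassified latency cause is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Observed symptoms. Every field name here is hypothetical."""
    executing: bool                       # any invocations in the window?
    function_disabled: bool = False
    host_starting: bool = True
    listener_error: Optional[str] = None  # "auth", "connection", or None
    high_latency: bool = False
    latency_cause: Optional[str] = None   # "cold_start", "dependency", "timeout"
    failure_type: Optional[str] = None    # "5xx", "exceptions", "backlog"
    after_deployment: bool = False
    memory_signals: bool = False


def triage(s: Signals) -> str:
    """Walk the main triage tree top-down and return a playbook name."""
    if not s.executing:
        if s.function_disabled:
            return "Functions Not Executing"
        if not s.host_starting:
            return "App Settings Misconfiguration"
        if s.listener_error == "auth":
            return "Managed Identity / RBAC Failure"
        # Connection errors and anything else on this branch
        return "Functions Not Executing"
    if s.high_latency:
        # Assumption: an unclassified latency cause falls back to High Latency
        return {
            "cold_start": "High Latency",
            "dependency": "Check dependency health",
            "timeout": "Timeout / Execution Limit Exceeded",
        }.get(s.latency_cause, "High Latency")
    if s.failure_type == "exceptions":
        return "Functions Failing"
    if s.failure_type == "backlog":
        return "Queue Piling Up"
    if s.failure_type == "5xx":
        if s.after_deployment:
            return "Deployment Failures"
        if s.memory_signals:
            return "Out of Memory / Worker Crash"
    # No branch matched cleanly: build hypotheses from evidence
    return "Troubleshooting Method"
```

For example, zero invocations with a listener that fails on a 403 would be expressed as `triage(Signals(executing=False, listener_error="auth"))`, which lands on the Managed Identity / RBAC Failure playbook.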
Minimal evidence before choosing a branch
- 15-minute function execution trend (`requests` table)
- Host lifecycle events for restarts/startups (`traces` table)
- Recent Activity Log operations
// Function execution trend: total, failed, and P95 duration per 5-minute bin over the last 2 hours
let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize total=count(), err=countif(success == false), p95=percentile(duration, 95) by bin(timestamp, 5m)
| order by timestamp asc
// Host lifecycle events (startups, shutdowns, listener problems) over the last 24 hours
let appName = "func-myapp-prod";
traces
| where timestamp > ago(24h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Host shutdown", "restart", "listener", "unable to start", "timeout")
| project timestamp, message
| order by timestamp desc
# Recent Activity Log operations for the resource group (expects $RG to be set)
az monitor activity-log list \
    --resource-group "$RG" \
    --offset 24h \
    --max-events 20 \
    --output table
Avoid branch bias
Do not choose a branch only because it matches a familiar past issue. If the first branch is disproven by timestamps, return to the top and re-classify. Decision trees accelerate triage, but evidence still decides root cause.
Decision Tree Limits
- This tree is optimized for Azure Functions serverless workloads.
- Multi-cause incidents can map to more than one branch.
- If no branch matches cleanly, use Troubleshooting Method and build explicit competing hypotheses.
Branch-specific first checks
If you choose the trigger branch
- Confirm function is enabled and listener started.
- Check trigger-specific connection strings and auth.
- For blob triggers on FC1, verify Event Grid subscription exists.
If you choose the latency branch
- Compare cold start frequency against latency pattern.
- Check dependency P95 for single-target bottleneck.
- Verify `functionTimeout` in `host.json` matches plan limits.
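The `functionTimeout` check can be scripted against a local copy of `host.json`. The sketch below is illustrative only: `parse_function_timeout` and `exceeds_limit` are hypothetical helpers, only the `hh:mm:ss` form and the unbounded value `-1` are handled, and the plan limit is passed in by the caller rather than looked up. The 10-minute value in the usage lines is an example for a Consumption plan; confirm the limit for your plan against current documentation.

```python
import json
from datetime import timedelta
from typing import Optional


def parse_function_timeout(host_json_text: str) -> Optional[timedelta]:
    """Return the configured functionTimeout.

    None means the setting is absent (the plan default applies);
    timedelta.max stands in for the unbounded value "-1".
    Only the "hh:mm:ss" form is handled in this sketch.
    """
    value = json.loads(host_json_text).get("functionTimeout")
    if value is None:
        return None
    if value == "-1":
        return timedelta.max
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def exceeds_limit(configured: Optional[timedelta], plan_max: timedelta) -> bool:
    """True when the configured timeout is above the limit supplied by the caller."""
    return configured is not None and configured > plan_max


# Usage: a 15-minute timeout checked against an example 10-minute limit
host_json = '{"version": "2.0", "functionTimeout": "00:15:00"}'
configured = parse_function_timeout(host_json)
print(exceeds_limit(configured, timedelta(minutes=10)))
```

Passing the limit as a parameter keeps the helper independent of any one hosting plan, since limits differ between plans and can change over time.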
If you choose the failure branch
- Correlate error onset with deployment timestamps.
- Check for dominant exception type in `exceptions` table.
- Verify app settings (especially `FUNCTIONS_WORKER_RUNTIME`, `AzureWebJobsStorage`).
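Correlating error onset with a deployment timestamp is easy to get wrong by eye, so it can help to compute it. The following is a minimal sketch under stated assumptions: the bucket tuples stand in for rows of a 5-minute execution summary, `error_onset` and `follows_deployment` are hypothetical helpers, and the 5% error-rate threshold and 30-minute window are arbitrary illustrative defaults, not recommended values.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def error_onset(buckets: Iterable[Tuple[datetime, int, int]],
                threshold: float = 0.05) -> Optional[datetime]:
    """Return the timestamp of the first bucket whose error rate reaches the threshold.

    Each bucket is (timestamp, total, errors). Empty buckets are skipped.
    """
    for timestamp, total, errors in sorted(buckets):
        if total > 0 and errors / total >= threshold:
            return timestamp
    return None


def follows_deployment(onset: Optional[datetime], deployed_at: datetime,
                       window: timedelta = timedelta(minutes=30)) -> bool:
    """True when errors began at or after the deployment and within the window."""
    return onset is not None and deployed_at <= onset <= deployed_at + window


# Usage: errors begin in the fourth 5-minute bucket
t0 = datetime(2024, 1, 1, 12, 0)
buckets = [(t0 + timedelta(minutes=5 * i), 100, 0 if i < 3 else 20) for i in range(6)]
onset = error_onset(buckets)
print(onset, follows_deployment(onset, deployed_at=t0 + timedelta(minutes=10)))
```

If the onset precedes the deployment, the deployment branch is disproven by timestamps and the incident should be re-classified from the top of the tree.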
Practical triage examples
- Zero invocations + listener unable to start + 403 error
    - Decision tree branch: not executing → listener failed → auth error.
    - Start with Managed Identity / RBAC Failure.
- Deployment succeeded + immediate 500 errors + no functions found
    - Decision tree branch: errors → after deployment → config issue.
    - Start with App Settings Misconfiguration.
- Queue depth growing over hours + worker restarts + OOM exceptions
    - Decision tree branch: queue backlog → memory signals.
    - Start with Out of Memory / Worker Crash.
See Also
- Troubleshooting Method
- Detector Map
- Architecture
- Evidence Map
- Mental Model
- First 10 Minutes: Triggers Not Firing
- First 10 Minutes: High Latency
- First 10 Minutes: Scaling Issues