Troubleshooting¶
Use this section when Azure Functions workloads are degraded, failing, or behaving unexpectedly. It is designed for incident response first, then root-cause analysis and prevention.
Operations Guide
For monitoring setup and alert configuration, see Monitoring and Alerts.
What this section covers¶
- First 10 Minutes: incident triage checklist for rapid stabilization.
- Decision Tree: visual routing from symptom to investigation path.
- Mental Model: conceptual framework for Azure Functions troubleshooting.
- Playbooks: scenario runbooks with symptoms, diagnosis, and fixes.
- Methodology: repeatable troubleshooting workflow for complex incidents.
- KQL Query Library: ready-to-use Application Insights and Log Analytics queries.
- Lab Guides: hands-on failure simulations to practice response.
Suggested incident flow¶
- Start with First 10 Minutes to verify platform health and blast radius.
- Move to Playbooks for scenario-specific diagnosis paths.
- Use KQL Query Library to validate hypotheses with telemetry.
- Apply Methodology to avoid guesswork and reduce MTTR.
- Rehearse with Lab Guides to improve operational readiness.
Troubleshooting mental model¶
Classify the symptom first to narrow down where to collect evidence.
| Category | Examples | First Check | Typical Evidence |
|---|---|---|---|
| Request path issue | 5xx, timeout, 403, connection refused | requests + exceptions tables | HTTP status codes, error types |
| App startup issue | Host not starting, container ping failure, health check timeout | traces table (host lifecycle) | Host started missing, startup duration |
| Runtime degradation | Memory pressure, GIL contention, thread pool starvation | customMetrics, process metrics | CPU/memory trends, cold start frequency |
| Dependency / outbound issue | DNS failure, SNAT exhaustion, private endpoint unreachable | dependencies table | Failed dependency calls, target resolution |
| Deployment / recycle event | Post-deploy failures, slot swap issues, config drift | Activity Log, traces | Deploy events, host restart events |
About customMetrics
The customMetrics table contains metrics explicitly emitted by your application or SDK. Only a few metrics (for example, FunctionExecutionCount, FunctionExecutionUnits) are emitted automatically by the Azure Functions runtime. Queue-related metrics and custom business metrics require explicit instrumentation.
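The classification table above can be kept next to triage tooling as a small lookup. The following is a minimal sketch, not part of any Azure SDK: the dictionary keys (`request_path`, `app_startup`, and so on) and the `first_check` helper are hypothetical names chosen for illustration, while the values are copied from the table.

```python
from typing import NamedTuple


class Category(NamedTuple):
    first_check: str
    typical_evidence: str


# Values mirror the classification table; the keys are illustrative names.
CATEGORIES = {
    "request_path": Category(
        "requests + exceptions tables", "HTTP status codes, error types"
    ),
    "app_startup": Category(
        "traces table (host lifecycle)", "Host started missing, startup duration"
    ),
    "runtime_degradation": Category(
        "customMetrics, process metrics", "CPU/memory trends, cold start frequency"
    ),
    "dependency_outbound": Category(
        "dependencies table", "Failed dependency calls, target resolution"
    ),
    "deployment_recycle": Category(
        "Activity Log, traces", "Deploy events, host restart events"
    ),
}


def first_check(category: str) -> str:
    """Return where to look first for a category; raise ValueError if unknown."""
    try:
        return CATEGORIES[category].first_check
    except KeyError:
        raise ValueError(f"unknown category: {category!r}") from None
```

For example, `first_check("app_startup")` points the responder at the traces table before any other source is queried.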
Decision tree¶
flowchart TD
A[Issue detected] --> B{Is it a 5xx issue?}
B -->|Yes| C{Intermittent or constant?}
C -->|Constant| D[Check host startup + recent deploy]
C -->|Intermittent| E{Recent deployment?}
E -->|Yes| F[Compare before/after metrics, consider rollback]
E -->|No| G{Dependency-correlated?}
G -->|Yes| H[Check dependency health + outbound networking]
G -->|No| I[Check concurrency + memory + cold start]
B -->|No| J{Trigger not firing?}
J -->|Yes| K[Check listener status + connection config]
J -->|No| L{Performance degradation?}
L -->|Yes| M[Check dependencies + scaling + storage]
L -->|No| N[Review evidence-map for matching symptoms]
Representative log patterns (quick reference)¶
| Pattern | Indicates | Severity | Next Action |
|---|---|---|---|
| Container didn't respond to HTTP pings | Host startup failure | Critical | Check host logs and recent deploy activity |
| Storage operation failed: (403) Forbidden | Storage auth broken | Critical | Check managed identity assignments and RBAC scope |
| Host started (>10000ms) | Severe cold start | Warning | Check dependency initialization path and hosting plan |
| Message has been dequeued 'N' time(s) | Poison message loop | Warning | Check handler idempotency and maxDequeueCount |
| getaddrinfo ENOTFOUND | DNS resolution failure | Critical | Check VNet integration and private DNS zones |
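The patterns in this table can be matched against raw log lines with a few regular expressions. This is a minimal sketch under stated assumptions: the `classify` function and `Finding` type are hypothetical helpers, the regexes assume the log text appears exactly as written in the table, and `Host started (>10000ms)` is read as "a host start duration above 10000 ms". Validate the expressions against your own log output before relying on them.

```python
import re
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    indicates: str
    severity: str
    next_action: str


# Threshold taken from the "Host started (>10000ms)" row.
COLD_START_THRESHOLD_MS = 10_000

_HOST_STARTED = re.compile(r"Host started \((\d+)ms\)")

# Assumed regex forms of the remaining table rows.
_PATTERNS = [
    (re.compile(r"Container didn't respond to HTTP pings"),
     Finding("Host startup failure", "Critical",
             "Check host logs and recent deploy activity")),
    (re.compile(r"Storage operation failed: \(403\) Forbidden"),
     Finding("Storage auth broken", "Critical",
             "Check managed identity assignments and RBAC scope")),
    (re.compile(r"Message has been dequeued '\d+' time\(s\)"),
     Finding("Poison message loop", "Warning",
             "Check handler idempotency and maxDequeueCount")),
    (re.compile(r"getaddrinfo ENOTFOUND"),
     Finding("DNS resolution failure", "Critical",
             "Check VNet integration and private DNS zones")),
]


def classify(line: str) -> Optional[Finding]:
    """Return the matching table row for a log line, or None if nothing matches."""
    started = _HOST_STARTED.search(line)
    if started and int(started.group(1)) > COLD_START_THRESHOLD_MS:
        return Finding("Severe cold start", "Warning",
                       "Check dependency initialization path and hosting plan")
    for regex, finding in _PATTERNS:
        if regex.search(line):
            return finding
    return None
```

A host start below the threshold is deliberately not flagged, so routine `Host started` lines do not generate noise during triage.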
Quick investigation flow¶
- For architecture context, see Troubleshooting Architecture.
- For "where do I look first?", see Evidence Map.
- For fast triage sequence, start at First 10 Minutes.
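The decision tree earlier on this page can also be written as a routing function, which is convenient for chat-ops bots or runbook automation. The sketch below is an illustration only: the `Symptoms` fields and the `route` function are hypothetical names, and the returned strings are the node labels from the flowchart.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    # Each field answers one question from the flowchart (names are illustrative).
    is_5xx: bool = False
    constant: bool = False  # False means the 5xx errors are intermittent
    recent_deployment: bool = False
    dependency_correlated: bool = False
    trigger_not_firing: bool = False
    performance_degradation: bool = False


def route(s: Symptoms) -> str:
    """Walk the flowchart branches in order and return the next investigation step."""
    if s.is_5xx:
        if s.constant:
            return "Check host startup + recent deploy"
        if s.recent_deployment:
            return "Compare before/after metrics, consider rollback"
        if s.dependency_correlated:
            return "Check dependency health + outbound networking"
        return "Check concurrency + memory + cold start"
    if s.trigger_not_firing:
        return "Check listener status + connection config"
    if s.performance_degradation:
        return "Check dependencies + scaling + storage"
    return "Review evidence-map for matching symptoms"
```

For instance, `route(Symptoms(is_5xx=True, dependency_correlated=True))` follows the intermittent branch with no recent deployment and lands on the dependency health check.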
Updated section map¶
| Document | Coverage |
|---|---|
| First 10 Minutes | Time-boxed triage checks for active incidents |
| Decision Tree | Visual routing from symptom to investigation path |
| Mental Model | Conceptual framework for Azure Functions troubleshooting |
| Playbooks | Scenario-based diagnostics and mitigations |
| Methodology | Reproducible Observe → Hypothesize → Test → Fix → Verify workflow |
| KQL Query Library | Reusable telemetry and evidence queries |
| Troubleshooting Architecture | Component boundaries and failure-domain context |
| Evidence Map | Symptom-to-evidence lookup for first-query selection |
| Lab Guides | Failure drills for response readiness |
Scope and source policy¶
- Guidance in this section follows Microsoft Learn documentation for Azure Functions, App Service, Application Insights, and Azure Monitor.
- Product behavior, limits, and trigger specifics should always be validated against the linked Learn references.
- Examples use masked identifiers (<subscription-id>, xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) to avoid exposing PII.