Troubleshooting
Use this section when Azure Functions workloads are degraded, failing, or behaving unexpectedly. It is designed for incident response first, then root-cause analysis and prevention.
Operations Guide
For monitoring setup and alert configuration, see Monitoring and Alerts.
What this section covers
- First 10 Minutes: incident triage checklist for rapid stabilization.
- Decision Tree: visual routing from symptom to investigation path.
- Mental Model: conceptual framework for Azure Functions troubleshooting.
- Playbooks: scenario runbooks with symptoms, diagnosis, and fixes.
- Methodology: repeatable troubleshooting workflow for complex incidents.
- KQL Query Library: ready-to-use Application Insights and Log Analytics queries.
- Lab Guides: hands-on failure simulations to practice response.
Suggested incident flow
- Start with First 10 Minutes to verify platform health and blast radius.
- Move to Playbooks for scenario-specific diagnosis paths.
- Use KQL Query Library to validate hypotheses with telemetry.
- Apply Methodology to avoid guesswork and reduce MTTR.
- Rehearse with Lab Guides to improve operational readiness.
Troubleshooting mental model
Use this classification first to narrow where to collect evidence.
| Category | Examples | First Check | Typical Evidence |
|---|---|---|---|
| Request path issue | 5xx, timeout, 403, connection refused | requests + exceptions tables | HTTP status codes, error types |
| App startup issue | Host not starting, container ping failure, health check timeout | traces table (host lifecycle) | Host started missing, startup duration |
| Runtime degradation | Memory pressure, GIL contention, thread pool starvation | customMetrics, process metrics | CPU/memory trends, cold start frequency |
| Dependency / outbound issue | DNS failure, SNAT exhaustion, private endpoint unreachable | dependencies table | Failed dependency calls, target resolution |
| Deployment / recycle event | Post-deploy failures, slot swap issues, config drift | Activity Log, traces | Deploy events, host restart events |
About customMetrics
The customMetrics table contains metrics explicitly emitted by your application or SDK. Only a few metrics (for example, FunctionExecutionCount, FunctionExecutionUnits) are emitted automatically by the Azure Functions runtime. Queue-related metrics and custom business metrics require explicit instrumentation.
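The classification table above can be sketched as a small lookup that routes a free-text symptom to a category and its first check. This is a minimal illustration, not part of any Azure SDK: the `MENTAL_MODEL` structure and the `first_check` function are hypothetical names, and the keywords are lowercase adaptations of the "Examples" column.

```python
# Hypothetical routing helper built from the mental-model table.
# Keywords are lowercase adaptations of the "Examples" column; extend as needed.
MENTAL_MODEL = [
    ("Request path issue",
     ("5xx", "timeout", "403", "connection refused"),
     "requests + exceptions tables"),
    ("App startup issue",
     ("host not starting", "container ping failure", "health check timeout"),
     "traces table (host lifecycle)"),
    ("Runtime degradation",
     ("memory pressure", "gil contention", "thread pool starvation"),
     "customMetrics, process metrics"),
    ("Dependency / outbound issue",
     ("dns failure", "snat exhaustion", "private endpoint unreachable"),
     "dependencies table"),
    ("Deployment / recycle event",
     ("post-deploy failures", "slot swap", "config drift"),
     "Activity Log, traces"),
]


def first_check(symptom):
    """Return (category, first_check) for the longest matching keyword, or None."""
    text = symptom.lower()
    best = None  # (keyword, category, check)
    for category, keywords, check in MENTAL_MODEL:
        for keyword in keywords:
            # Prefer the longest keyword so "health check timeout"
            # wins over the more generic "timeout".
            if keyword in text and (best is None or len(keyword) > len(best[0])):
                best = (keyword, category, check)
    if best is None:
        return None
    return best[1], best[2]
```

Preferring the longest keyword is a design choice to keep specific symptoms from being captured by generic ones; a symptom that matches nothing returns `None`, which is a cue to consult the Evidence Map instead.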
Decision tree
flowchart TD
A[Issue detected] --> B{Is it a 5xx issue?}
B -->|Yes| C{Intermittent or constant?}
C -->|Constant| D[Check host startup + recent deploy]
C -->|Intermittent| E{Recent deployment?}
E -->|Yes| F["Compare before/after metrics, consider rollback"]
E -->|No| G{Dependency-correlated?}
G -->|Yes| H[Check dependency health + outbound networking]
G -->|No| I[Check concurrency + memory + cold start]
B -->|No| J{Trigger not firing?}
J -->|Yes| K[Check listener status + connection config]
J -->|No| L{Performance degradation?}
L -->|Yes| M[Check dependencies + scaling + storage]
L -->|No| N[Review evidence-map for matching symptoms]
Representative log patterns (quick reference)
| Pattern | Indicates | Severity | Next Action |
|---|---|---|---|
| Container didn't respond to HTTP pings | Host startup failure | Critical | Check host logs and recent deploy activity |
| Storage operation failed: (403) Forbidden | Storage auth broken | Critical | Check managed identity assignments and RBAC scope |
| Host started (>10000ms) | Severe cold start | Warning | Check dependency initialization path and hosting plan |
| Message has been dequeued 'N' time(s) | Poison message loop | Warning | Check handler idempotency and maxDequeueCount |
| getaddrinfo ENOTFOUND | DNS resolution failure | Critical | Check VNet integration and private DNS zones |
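The log patterns above can be applied to exported log lines with a few regular expressions. The sketch below is illustrative only: `classify_log_line` is a hypothetical helper, and the regexes assume the log text matches the table wording exactly, which may differ across runtime versions and languages.

```python
import re

# Threshold taken from the "Host started (>10000ms)" row of the table.
COLD_START_THRESHOLD_MS = 10_000

# Assumed log format: "Host started (<duration>ms)".
HOST_STARTED = re.compile(r"Host started \((\d+)ms\)")

# (regex, indicates, severity) copied from the table. The regexes are
# assumptions about the exact log text, not a documented contract.
PATTERNS = [
    (re.compile(r"Container didn't respond to HTTP pings"),
     "Host startup failure", "Critical"),
    (re.compile(r"Storage operation failed: \(403\) Forbidden"),
     "Storage auth broken", "Critical"),
    (re.compile(r"Message has been dequeued '\d+' time\(s\)"),
     "Poison message loop", "Warning"),
    (re.compile(r"getaddrinfo ENOTFOUND"),
     "DNS resolution failure", "Critical"),
]


def classify_log_line(line):
    """Return (indicates, severity) for a known pattern, or None."""
    match = HOST_STARTED.search(line)
    if match:
        # Only startups slower than the threshold count as a severe cold start.
        if int(match.group(1)) > COLD_START_THRESHOLD_MS:
            return ("Severe cold start", "Warning")
        return None
    for regex, indicates, severity in PATTERNS:
        if regex.search(line):
            return (indicates, severity)
    return None
```

A match is only a routing hint. For example, a single dequeue message does not prove a poison loop, so confirm the dequeue count against maxDequeueCount before acting.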
Quick investigation flow
- For architecture context, see Troubleshooting Architecture.
- For "where do I look first?", see Evidence Map.
- For fast triage sequence, start at First 10 Minutes.
Updated section map
| Document | Coverage |
|---|---|
| First 10 Minutes | Time-boxed triage checks for active incidents |
| Decision Tree | Visual routing from symptom to investigation path |
| Mental Model | Conceptual framework for Azure Functions troubleshooting |
| Playbooks | Scenario-based diagnostics and mitigations |
| Methodology | Reproducible Observe → Hypothesize → Test → Fix → Verify workflow |
| KQL Query Library | Reusable telemetry and evidence queries |
| Troubleshooting Architecture | Component boundaries and failure-domain context |
| Evidence Map | Symptom-to-evidence lookup for first-query selection |
| Lab Guides | Failure drills for response readiness |
Scope and source policy
- Guidance in this section follows Microsoft Learn documentation for Azure Functions, App Service, Application Insights, and Azure Monitor.
- Product behavior, limits, and trigger specifics should always be validated against the linked Learn references.
- Examples use masked identifiers (<subscription-id>, xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) to avoid exposing PII.
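When you paste your own incident evidence into tickets or postmortems, the same masking convention can be applied automatically. The following is a minimal sketch: `mask_identifiers` is a hypothetical helper, and it assumes identifiers appear as standard hyphenated GUIDs and that subscription IDs appear in `/subscriptions/<guid>` resource paths.

```python
import re

# Standard 8-4-4-4-12 hyphenated GUID.
GUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
# Subscription IDs are assumed to appear in ARM-style resource paths.
SUBSCRIPTION = re.compile(r"(/subscriptions/)" + GUID.pattern)

GUID_MASK = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"


def mask_identifiers(text):
    """Replace subscription IDs and any remaining GUIDs with placeholders."""
    # Mask subscription IDs first so they get the more specific placeholder.
    text = SUBSCRIPTION.sub(r"\1<subscription-id>", text)
    return GUID.sub(GUID_MASK, text)
```

The function is idempotent, because neither placeholder matches the GUID pattern. It does not catch other sensitive values such as hostnames, keys, or email addresses.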
See Also
Cross-service references
If you also operate App Service or Container Apps with container-based continuous deployment (CD) that uses managed identity to pull from Azure Container Registry, the RBAC uniqueness constraint on (scope, principal, role) can surface as RoleAssignmentExists when CD is reconnected. Azure Functions managed deployments do not currently expose this exact failure mode, but the diagnostic pattern (convert hex assignment ID → GUID, look up principal, delete orphaned assignment) transfers directly:
- App Service: CD RBAC Role Assignment Conflict (playbook)
- Container Apps: CD RBAC Role Assignment Conflict (playbook)
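The first step of that diagnostic pattern, converting a hex assignment ID to a GUID, can be sketched with the Python standard library. This assumes the ID is a 32-character hexadecimal string; `hex_to_guid` is a hypothetical helper name. The exact error payload is not shown here, so extracting the ID from the message is left to the linked playbooks.

```python
import uuid


def hex_to_guid(hex_id):
    """Convert a 32-character hex string to canonical 8-4-4-4-12 GUID form.

    Raises ValueError if the input is not a valid 32-digit hex string.
    """
    # uuid.UUID validates the input and str() yields the lowercase form.
    return str(uuid.UUID(hex=hex_id.strip()))
```

The resulting GUID is the role assignment name to use when looking up the principal and removing the orphaned assignment.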