Azure Functions Troubleshooting Evidence Map¶

Use this page when you need to answer: "I need to know X, where do I look first?" It maps diagnostic questions to the fastest evidence source, then shows what healthy and unhealthy log patterns look like.

Use with incident workflow

Pair this quick map with First 10 Minutes for triage order, Methodology for hypothesis-driven validation, KQL Query Library for reusable queries, and Playbooks for scenario-specific recovery actions.

Evidence path overview¶

flowchart LR
    A[Diagnostic question] --> B{What kind of symptom?}
    B -->|Request failure or latency| C[Application Insights: requests and dependencies]
    B -->|Startup or trigger issue| D[Application Insights: traces]
    B -->|Restart or deployment correlation| E[Platform logs and Activity Log]
    B -->|Queue backlog or storage issues| F[Storage metrics and queue inspection]
    B -->|Regional concern| G[Azure Service Health]
    C --> H[Decide first mitigation]
    D --> H
    E --> H
    F --> H
    G --> H

Evidence source inventory¶

Source	Type	Access Method	Best For	Latency
Application Insights (requests)	Telemetry	KQL / Portal	Request success/failure, latency	Near real-time
Application Insights (traces)	Logs	KQL / Portal	Host lifecycle, trigger status	Near real-time
Application Insights (exceptions)	Telemetry	KQL / Portal	Error types, stack traces	Near real-time
Application Insights (dependencies)	Telemetry	KQL / Portal	Outbound call health	Near real-time
Platform logs	Diagnostic	Log Analytics / Portal	Host startup, recycle events	Minutes
Activity Log	Audit	CLI / Portal	Deploy, config, RBAC changes	Near real-time
Storage metrics	Metrics	CLI / Portal	Queue depth, throttling	Minutes
Azure Service Health	Status	Portal	Regional outages	Real-time

Question-to-evidence mapping (primary routing table)¶

Use this as your first lookup table during active incident triage.

Question	Best Source	CLI Query	KQL Query	Portal Path
Was the app restarting?	Platform logs / Activity Log	`az monitor activity-log list --subscription "<subscription-id>" --resource-group "rg-myapp-prod" --offset 2h --max-events 50 --output table`	`traces \\| where timestamp > ago(2h) \\| where message has "Host started"`	Diagnose and Solve Problems → App Restarts
Were requests failing?	`requests` table	`az monitor metrics list --resource "/subscriptions/<subscription-id>/resourceGroups/rg-myapp-prod/providers/Microsoft.Web/sites/func-myapp-prod" --metric "Http5xx" --interval PT1M --aggregation Total --offset 1h --output table`	`requests \\| where timestamp > ago(1h) \\| where success == false`	Application Insights → Failures
Was startup failing?	`traces` + `exceptions` tables	`az functionapp show --name "func-myapp-prod" --resource-group "rg-myapp-prod" --query "state" --output tsv`	`traces \\| where timestamp > ago(1h) \\| where message has "Host initialization" or message has "A host error has occurred" \\| where severityLevel >= 3`	Log stream / Console logs
Was dependency slow?	`dependencies` table	`N/A`	`dependencies \\| where timestamp > ago(1h) \\| summarize p95=percentile(duration,95) by target`	Application Insights → Performance
Was DNS failing?	`exceptions`/`traces` + app logs	`az network vnet show --resource-group "rg-network" --name "vnet-prod" --output table`	`exceptions \\| where timestamp > ago(1h) \\| where type has "SocketException" or outerMessage has "DNS" or outerMessage has "NameResolution"`	Diagnose and Solve Problems → Networking
Was scale involved?	Metrics / platform signals	`az monitor metrics list --resource "/subscriptions/<subscription-id>/resourceGroups/rg-myapp-prod/providers/Microsoft.Web/sites/func-myapp-prod" --metric "FunctionExecutionCount" --interval PT1M --aggregation Total --offset 1h --output table`	`traces \\| where timestamp > ago(1h) \\| where message has "scale"`	Metrics → Instance Count
Were messages piling up?	Storage metrics	`az monitor metrics list --resource "/subscriptions/<subscription-id>/resourceGroups/rg-myapp-prod/providers/Microsoft.Storage/storageAccounts/stmyapp" --metric "QueueMessageCount" --interval PT1M --aggregation Average --offset 1h --output table`	`N/A (QueueMessageCount is a Storage metric, not an Application Insights custom metric)`	Storage account → Queue metrics
Was identity broken?	`exceptions` + Activity Log	`az role assignment list --scope "/subscriptions/<subscription-id>/resourceGroups/rg-myapp-prod" --output table`	`exceptions \\| where timestamp > ago(1h) \\| where type has "Authorization"`	Activity Log → RBAC changes

Symptom category to first evidence source¶

Symptom	First Evidence Source	Why first	Secondary Source
Sudden `5xx` increase	`requests`	Direct failure-rate and result-code signal	`exceptions` for error family
Trigger stopped firing	`traces`	Listener initialization/shutdown appears here first	Storage metrics and queue depth
Slow response tail	`dependencies`	Quickly separates internal vs downstream slowness	`requests` percentile trend
Repeated recycle	Platform logs / Activity Log	Confirms whether restart is platform- or change-driven	`traces` host lifecycle
Outbound timeout	`dependencies`	Captures target-level timeout distribution	VNet/NSG/UDR CLI checks
Auth failures after change	Activity Log	Change timeline for RBAC/app settings	`exceptions` authorization traces

Representative log patterns¶

Recognizing these signatures in raw logs reduces time-to-hypothesis. For each pattern: what it looks like → what it means → normal vs abnormal → next step.

1) Healthy host startup¶

What it looks like:

Host lock lease acquired by instance ID 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'.
Initializing Host. OperationId: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'.
Host started (Xms)
Host initialized (Xms)

What it means: - Host acquired storage lease and completed startup sequence. - Trigger listeners can initialize normally.

Normal vs abnormal: - Normal: sequence appears once on cold start or planned recycle. - Abnormal: repeated startup cycles in short intervals with no stable execution window.

Next step: - If healthy, move focus to function code or dependencies. - If startup loops, correlate with Activity Log changes and container health events.

2) Worker timeout¶

What it looks like:

Worker was unable to load function: 'FunctionName'
Timeout value of 00:00:30 exceeded by host
The operation was canceled.

What it means: - Worker failed to initialize or function execution exceeded configured timeout. - Can indicate runtime mismatch, blocking startup code, or downstream stall.

Normal vs abnormal: - Normal: occasional timeout under rare dependency spikes. - Abnormal: repeated timeout messages across many invocations or immediately after deploy.

Next step: - Check deployment/runtime compatibility and recent config changes. - Query dependencies for latency spike and validate timeout policy.

3) Connection refused or storage failure¶

What it looks like:

An error occurred while processing the request. Connection refused (127.0.0.1:XXXXX)
Storage operation failed: The remote server returned an error: (403) Forbidden.
Unable to connect to the remote server

What it means: - Connection target is unreachable, blocked, or not listening. - Storage auth or network path is broken for trigger/binding operations.

Normal vs abnormal: - Normal: brief transient bursts with automatic retry recovery. - Abnormal: sustained failures causing trigger silence, retry storms, or poison growth.

Next step: - Validate storage identity/RBAC and firewall/network rules. - Confirm connection strings or identity-based settings are present and correct.

4) Health check failed¶

What it looks like:

Container func-myapp_X didn't respond to HTTP pings on port XXXX
Health check failure: StatusCode=503
Sending SIGTERM to container

What it means: - Platform health probe marked instance unhealthy and initiated recycle. - Usually associated with startup deadlock, severe resource pressure, or port binding failure.

Normal vs abnormal: - Normal: isolated occurrence during planned restart. - Abnormal: repeated probe failures with short-lived containers and rising 503.

Next step: - Review host startup sequence and memory/CPU pressure. - Correlate with deployment changes and warm-up behavior.

5) 503 spike after restart¶

What it looks like:

Host is shutting down
Stopping JobHost
Host started (Xms)  ← new instance
Request timed out after 230000ms

What it means: - Requests hit transition window between old and new host states. - Can indicate cold-start amplification or unhealthy rollout.

Normal vs abnormal: - Normal: brief 503 blip during controlled swap/restart. - Abnormal: prolonged timeout window and repeated shutdown/start loops.

Next step: - Check if restart reason is platform, deployment, or configuration. - Validate slot health before swap and ensure dependency readiness.

6) DNS resolution failure¶

What it looks like:

Name or service not known
getaddrinfo ENOTFOUND myservice.privatelink.database.windows.net
A connection attempt failed because the connected party did not properly respond

What it means: - Name resolution failed or resolved endpoint is unreachable from current network route. - Common with private endpoint DNS zone linkage or route misconfiguration.

Normal vs abnormal: - Normal: short-lived resolver transient with fast retry recovery. - Abnormal: persistent ENOTFOUND against private endpoints across instances.

Next step: - Verify VNet integration, private DNS zone link, and route table intent. - Cross-check dependency failures by target in dependencies.

Query snippets for fast evidence confirmation¶

// Failed requests by code
requests
| where timestamp > ago(30m)
| where success == false
| summarize failures=count() by resultCode
| order by failures desc

// Host startup and shutdown timeline
traces
| where timestamp > ago(2h)
| where message has_any ("Host started", "Host is shutting down", "Stopping JobHost")
| project timestamp, message
| order by timestamp desc

// Dependency tail latency by target
dependencies
| where timestamp > ago(1h)
| summarize p95=percentile(duration,95), failed=countif(success == false) by target
| order by failed desc, p95 desc

Azure Functions Troubleshooting Evidence Map¶

Evidence path overview¶

Evidence source inventory¶

Question-to-evidence mapping (primary routing table)¶

Symptom category to first evidence source¶

Representative log patterns¶

1) Healthy host startup¶

2) Worker timeout¶

3) Connection refused or storage failure¶

4) Health check failed¶

5) 503 spike after restart¶

6) DNS resolution failure¶

Query snippets for fast evidence confirmation¶

See Also¶

Sources¶