Functions Failing with Errors¶

1. Summary¶

Symptom¶

Function invocations fail repeatedly with 5xx, listener startup errors, retry storms, or host unhealthy events.

Troubleshooting decision flow¶

flowchart TD
    A[Incident: Functions failing with errors] --> B{Failures start after deploy or config change?}
    B -->|Yes| H4[H4 Deployment regression]
    B -->|No| C{Dominant failure signal}
    C -->|403 or Unauthorized| H1[H1 Auth failures to downstream resources]
    C -->|Long duration then fail| H2[H2 Function timeout reached]
    C -->|Worker crash or runtime errors| H3[H3 Memory pressure or runtime mismatch]
    C -->|Mixed signals| D[Collect baseline evidence]
    D --> E{Host says AzureWebJobsStorage unhealthy?}
    E -->|Yes| H1
    E -->|No| F{Durations climbing toward timeout ceiling?}
    F -->|Yes| H2
    F -->|No| G{Worker restarts and runtime churn?}
    G -->|Yes| H3
    G -->|No| H4

2. Common Misreadings¶

"Retry more" when failures are deterministic (403, permission mismatch, startup listener failure).
"Host unhealthy means platform outage" when host message explicitly identifies storage authorization.
"Queue backlog means scale only" while listeners are failing to start.
"Exception count equals failed invocation count" although one invocation can emit wrapper + inner exceptions.

3. Competing Hypotheses¶

H1: Auth failures to downstream resources¶

Managed identity role removed or assigned at wrong scope.
Secret/connection setting drift causes forbidden or unauthorized calls.
Storage authorization failure blocks listener startup and cascades failures.

H2: Function timeout reached¶

Invocation durations rise until functionTimeout threshold is exceeded.
Dependency latency + retries consume the full execution budget.
Queue trigger throughput drops while failed execution duration approaches timeout ceiling.

H3: Memory pressure or runtime mismatch¶

Worker restarts/crashes under load (code 137, out-of-memory patterns).
Runtime stack mismatch after release (language version, extensions, native dependencies).
Startup appears healthy, then workers terminate during processing.

H4: Deployment regression¶

New artifact introduces deterministic exception path.
Deployment changed app settings, binding behavior, or extension compatibility.
Failure slope changes sharply at release time.

4. What to Check First¶

First-pass checks¶

Top exception type and message trend in the incident window.
Per-function failure rate and failed invocation duration distribution.
Host traces for listener startup and storage health diagnostics.
Deployment/config/identity changes around onset time.

Portal checks¶

Application Insights -> Failures -> Exception types: identify dominant error family.
Application Insights -> Failures -> Samples: review outerMessage, stack trace, operation correlation.
Function App -> Configuration: inspect identity, app settings, runtime version, and drift.
Activity Log: correlate deployment and configuration operations with failure onset.

CLI quick checks¶

az account show --subscription "$SUBSCRIPTION_ID" --output table
az functionapp show --resource-group "$RG" --name "$APP_NAME" --output table
az functionapp config show --resource-group "$RG" --name "$APP_NAME" --output json
az functionapp identity show --resource-group "$RG" --name "$APP_NAME" --output json

5. Evidence to Collect¶

Required evidence set¶

UTC timeline: start, peak, mitigation, recovery.
Top exception clusters (type, outerMessage, count).
Function-level success/failure and latency profile.
Host storage/listener health messages.
Deployment and configuration activity timeline.

Sample Log Patterns¶

# Abnormal: managed identity RBAC failure pattern (real-world style)
[2026-04-04T12:15:42Z] The listener for function 'Functions.scheduled_cleanup' was unable to start.
[2026-04-04T12:17:54Z] Storage operation failed: Status: 403 (AuthorizationPermissionMismatch)
[2026-04-04T12:22:42Z] Process reporting unhealthy: azure.functions.webjobs.storage: Unhealthy
[2026-04-04T12:22:42Z] description="Unable to access AzureWebJobsStorage" errorCode="AuthorizationPermissionMismatch"

# Abnormal: timeout pattern
[2026-04-04T12:31:04Z] Function 'Functions.QueueProcessor' timed out after 00:05:00.
[2026-04-04T12:31:04Z] Microsoft.Azure.WebJobs.Host.FunctionTimeoutException: Timeout value of 00:05:00 exceeded.

# Abnormal: runtime crash pattern
[2026-04-04T12:40:10Z] Worker process started and initialized.
[2026-04-04T12:40:14Z] Worker process exited with code 137.

# Normal: transient error recovered by retry
[2026-04-04T12:50:07Z] Dependency timeout on attempt 1.
[2026-04-04T12:50:09Z] Retry succeeded on attempt 2.

How to Read This

Repeating AuthorizationPermissionMismatch with listener startup failure is deterministic authorization breakage, not transient dependency noise.

KQL Queries with Example Output¶

Query 1: Function execution summary (success/failure/duration)¶

let appName = "func-myapp-prod";
requests
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
    Invocations = count(),
    Failures = countif(success == false),
    FailureRatePercent = round(100.0 * countif(success == false) / count(), 2),
    P95Ms = percentile(duration, 95)
  by FunctionName = operation_Name
| order by Failures desc, P95Ms desc

Example Output

FunctionName	Invocations	Failures	FailureRatePercent	P95Ms
Functions.ErrorHandler	15	15	100.00	13730
Functions.scheduled_cleanup	10	10	100.00	9800
Functions.HttpTrigger	50	1	2.00	1729

Query 2: Failed invocations with error details¶

let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| where success == false
| project timestamp, operation_Id, FunctionName = operation_Name, resultCode, duration
| join kind=leftouter (
    exceptions
    | where timestamp > ago(2h)
    | where cloud_RoleName =~ appName
    | project operation_Id, ExceptionType = type, ExceptionMessage = outerMessage
) on operation_Id
| order by timestamp desc

Example Output

timestamp	operation_Id	FunctionName	resultCode	duration	ExceptionType	ExceptionMessage
2026-04-04T12:17:54Z	xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx	Functions.scheduled_cleanup	500	8023	System.UnauthorizedAccessException	Storage operation failed: Status: 403 (AuthorizationPermissionMismatch)
2026-04-04T12:17:42Z	xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx	Functions.scheduled_cleanup	500	7988	Microsoft.Azure.WebJobs.Host.Listeners.FunctionListenerException	The listener for function 'Functions.scheduled_cleanup' was unable to start.

Query 3: Exception trends¶

let appName = "func-myapp-prod";
exceptions
| where timestamp > ago(24h)
| where cloud_RoleName =~ appName
| summarize Count=count() by bin(timestamp, 15m), type
| order by timestamp desc

Example Output

timestamp	type	Count
2026-04-04T12:15:00Z	Microsoft.Azure.WebJobs.Host.Listeners.FunctionListenerException	18
2026-04-04T12:15:00Z	System.UnauthorizedAccessException	18
2026-04-04T12:00:00Z	Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException	3

How to Read This

Abrupt multi-bin dominance by one exception type indicates a regression window, not random transient activity.

CLI Investigation Commands¶

az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "exceptions | where timestamp > ago(1h) | where cloud_RoleName =~ '$APP_NAME' | summarize count() by type, outerMessage | order by count_ desc" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "traces | where timestamp > ago(1h) | where cloud_RoleName =~ '$APP_NAME' | where message has_any ('unhealthy','Unable to access AzureWebJobsStorage','AuthorizationPermissionMismatch') | project timestamp, severityLevel, message | order by timestamp desc" --output table
# Note: If the storage account is in a different resource group, adjust $RG accordingly
STORAGE_ID=$(az storage account show --resource-group "$RG" --name "$STORAGE_NAME" --query id --output tsv)
az role assignment list --scope "$STORAGE_ID" --query "[?principalId=='<principal-id>'].{role:roleDefinitionName, scope:scope}" --output table
az monitor activity-log list --subscription "$SUBSCRIPTION_ID" --resource-group "$RG" --max-events 50 --output table

Example Output (sanitized)

$ az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "exceptions | where timestamp > ago(1h) | where cloud_RoleName =~ '$APP_NAME' | summarize count() by type, outerMessage | order by count_ desc" --output table
type                                                         outerMessage                                                                    count_
-----------------------------------------------------------  ------------------------------------------------------------------------------  ------
System.UnauthorizedAccessException                           Storage operation failed: Status: 403 (AuthorizationPermissionMismatch)         18
Microsoft.Azure.WebJobs.Host.Listeners.FunctionListenerException  The listener for function 'Functions.scheduled_cleanup' was unable to start.  18

$ STORAGE_ID=$(az storage account show --resource-group "$RG" --name "$STORAGE_NAME" --query id --output tsv)
$ az role assignment list --scope "$STORAGE_ID" --query "[?principalId=='<principal-id>'].{role:roleDefinitionName, scope:scope}" --output table
Role                              Scope
--------------------------------  -------------------------------------------------------------------------------------------------------------
Storage Blob Data Contributor     /subscriptions/<subscription-id>/resourceGroups/rg-app-prod/providers/Microsoft.Storage/storageAccounts/stfuncprod

6. Validation and Disproof by Hypothesis¶

H1: Auth failures to downstream resources¶

Signals that support
- 403, UnauthorizedAccessException, or AuthorizationPermissionMismatch dominates.
- Listener startup failures mention storage access.
- Host unhealthy traces reference AzureWebJobsStorage authorization.
Signals that weaken
- Errors are primarily timeout or worker crash without auth wording.
- Current role assignments and credential versions are validated and unchanged.

What to verify with INLINE KQL

let appName = "func-myapp-prod";
traces
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where message has_any ("AuthorizationPermissionMismatch", "Unable to access AzureWebJobsStorage", "unhealthy", "Forbidden", "Unauthorized")
| project timestamp, severityLevel, message
| order by timestamp desc

Example Output

timestamp	severityLevel	message
2026-04-04T12:22:42Z	3	Process reporting unhealthy: azure.functions.webjobs.storage: Unhealthy
2026-04-04T12:22:42Z	3	Unable to access AzureWebJobsStorage. Status: 403 (AuthorizationPermissionMismatch)
2026-04-04T12:17:54Z	3	Storage operation failed: Status: 403 (AuthorizationPermissionMismatch)

How to Read This

Repeating host-storage authorization messages confirm a deterministic permission break. Retrying only adds load.

CLI verification

az functionapp identity show --resource-group "$RG" --name "$APP_NAME" --output json
PRINCIPAL_ID=$(az functionapp identity show --resource-group "$RG" --name "$APP_NAME" --query principalId --output tsv)
# Note: If the storage account is in a different resource group, adjust $RG accordingly
STORAGE_ID=$(az storage account show --resource-group "$RG" --name "$STORAGE_NAME" --query id --output tsv)
az role assignment list --scope "$STORAGE_ID" --query "[?principalId=='$PRINCIPAL_ID'].{role:roleDefinitionName, scope:scope}" --output table

H2: Function timeout reached¶

Signals that support
- Failed invocations cluster near a fixed duration ceiling.
- Timeout exception messages increase with dependency latency/retry activity.
- Queue processing delay and retry volume rise together.
Signals that weaken
- Failures are short and immediate.
- Timeout exception text is absent.

What to verify with INLINE KQL

let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
    Invocations = count(),
    Failures = countif(success == false),
    P95Ms = percentile(duration, 95),
    MaxMs = max(duration)
  by bin(timestamp, 10m), FunctionName = operation_Name
| order by timestamp desc

Example Output

timestamp	FunctionName	Invocations	Failures	P95Ms	MaxMs
2026-04-04T12:30:00Z	Functions.QueueProcessor	120	37	294000	300000
2026-04-04T12:20:00Z	Functions.QueueProcessor	118	11	220000	300000
2026-04-04T12:10:00Z	Functions.QueueProcessor	110	2	64000	98000

How to Read This

Rising duration bins followed by failures near one ceiling strongly indicate timeout budget exhaustion.

CLI verification

az functionapp config appsettings list --resource-group "$RG" --name "$APP_NAME" --output table
az functionapp config show --resource-group "$RG" --name "$APP_NAME" --output json
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "exceptions | where timestamp > ago(2h) | where cloud_RoleName =~ '$APP_NAME' | where outerMessage has_any ('timed out','FunctionTimeoutException') | summarize count() by type, outerMessage" --output table

H3: Memory pressure or runtime mismatch¶

Signals that support
- Worker start/exit churn increases.
- Failures start after runtime stack or package update.
- Crash/exit patterns correlate with invocation errors.
Signals that weaken
- Runtime unchanged and worker lifecycle stable.
- Error mix is dominated by auth/forbidden signals.

What to verify with INLINE KQL

let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any ("Worker process started", "Worker process exited", "out of memory", "Host started", "Starting Host")
| summarize Events=count() by bin(timestamp, 15m), message
| order by timestamp desc

Example Output

timestamp	message	Events
2026-04-04T12:45:00Z	Worker process exited with code 137	6
2026-04-04T12:45:00Z	Worker process started and initialized.	6
2026-04-04T12:30:00Z	Host started (75ms)	1

How to Read This

Frequent start-exit cycles over short bins indicate unstable runtime behavior and likely memory or compatibility problems.

CLI verification

az functionapp config show --resource-group "$RG" --name "$APP_NAME" --query "linuxFxVersion" --output tsv
az functionapp config appsettings list --resource-group "$RG" --name "$APP_NAME" --output table
az functionapp log deployment show --resource-group "$RG" --name "$APP_NAME"

H4: Deployment regression¶

Signals that support
- Exception and failure rates jump immediately after deployment.
- New dominant exception type appears in same release window.
- Rollback lowers failure rate.
Signals that weaken
- Failure trend predates deployment.
- No change in release artifact or app configuration.

What to verify with INLINE KQL

let appName = "func-myapp-prod";
let timeGrain = 1h;
let exceptionTrend = exceptions
| where timestamp > ago(24h)
| where cloud_RoleName =~ appName
| summarize ExceptionCount=count() by bin(timestamp, timeGrain), type;
let requestFailureTrend = requests
| where timestamp > ago(24h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| where success == false
| summarize FailedRequests=count() by bin(timestamp, timeGrain), operation_Name;
exceptionTrend
| join kind=inner requestFailureTrend on timestamp
| project timestamp, operation_Name, type, ExceptionCount, FailedRequests
| order by timestamp desc

Example Output

timestamp	operation_Name	type	ExceptionCount	FailedRequests
2026-04-04T12:00:00Z	Functions.ErrorHandler	Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException	2	1
2026-04-04T11:00:00Z	Functions.ErrorHandler	Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException	32	15

How to Read This

Hard step-change at release time across exceptions and request failures is strong regression evidence.

CLI verification

az functionapp deployment source show --resource-group "$RG" --name "$APP_NAME" --output json
az functionapp deployment list-publishing-profiles --resource-group "$RG" --name "$APP_NAME" --output table
az monitor activity-log list --subscription "$SUBSCRIPTION_ID" --resource-group "$RG" --max-events 50 --output table

Normal vs Abnormal Comparison¶

Signal	Normal	Abnormal	Interpretation
Exception cadence	Sporadic and mixed	Sustained dominance by one type	Deterministic issue likely
Host storage health	Healthy or brief transient warning	Repeated `AuthorizationPermissionMismatch`	RBAC/scope break likely
Failure onset timing	No release coupling	Sharp increase after deploy/config change	Regression or drift likely
Failed duration profile	Mostly short with noise	Clusters near timeout ceiling	Timeout budget exhaustion
Worker lifecycle	Infrequent restarts	Frequent start-exit cycles	Memory/runtime mismatch likely

7. Likely Root Cause Patterns¶

Pattern A: Managed identity scope error
- Identity exists but lacks required data-plane role at target resource scope.
- Listener startup and storage-backed functions fail first.
Pattern B: Timeout budget overrun
- Retry amplification and dependency latency push invocations beyond functionTimeout.
Pattern C: Runtime drift after deployment
- Updated runtime/dependency set creates worker instability.
Pattern D: Deterministic code-path regression
- New release introduces repeatable failure under common traffic path.

8. Immediate Mitigations¶

Restore least-privilege RBAC required for failing dependency access.
Roll back to last known good artifact if deployment regression is strongly indicated.
Temporarily reduce concurrency pressure by isolating non-critical triggers.
Adjust timeout only as short-term stabilization with explicit rollback plan.
Communicate deterministic vs transient classification to prevent ineffective retry storms.

Example mitigation commands¶

az role assignment create --assignee-object-id "<principal-id>" --role "Storage Blob Data Contributor" --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"
az role assignment create --assignee-object-id "<principal-id>" --role "Storage Queue Data Contributor" --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"
az functionapp restart --resource-group "$RG" --name "$APP_NAME"

Mitigation discipline

Change one variable at a time, then re-check failure rate and dominant exception type before applying the next action.

9. Prevention¶

Add pre-deploy RBAC validation for all managed identity dependencies.
Enforce config/runtime drift detection in CI/CD before promotion.
Alert on sustained FailureRatePercent, dominant exception share, and host unhealthy storage signals.
Load-test timeout budgets with realistic dependency latency and retry behavior.
Require release guardrails: staged rollout, health gates, and rollback criteria.

Operational checklist¶

Identity roles validated at exact resource scope.
Timeout budgets reviewed against SLA and trigger behavior.
Runtime version pinning and extension compatibility verified.
Incident workbook contains H1-H4 query panels.

Functions Failing with Errors¶

1. Summary¶

Symptom¶

Troubleshooting decision flow¶

2. Common Misreadings¶

3. Competing Hypotheses¶

H1: Auth failures to downstream resources¶

H2: Function timeout reached¶

H3: Memory pressure or runtime mismatch¶

H4: Deployment regression¶

4. What to Check First¶

First-pass checks¶

Portal checks¶

CLI quick checks¶

5. Evidence to Collect¶

Required evidence set¶

Sample Log Patterns¶

KQL Queries with Example Output¶

Query 1: Function execution summary (success/failure/duration)¶

Query 2: Failed invocations with error details¶

Query 3: Exception trends¶

CLI Investigation Commands¶

6. Validation and Disproof by Hypothesis¶

H1: Auth failures to downstream resources¶

H2: Function timeout reached¶

H3: Memory pressure or runtime mismatch¶

H4: Deployment regression¶

Normal vs Abnormal Comparison¶

7. Likely Root Cause Patterns¶

8. Immediate Mitigations¶

Example mitigation commands¶

9. Prevention¶

Operational checklist¶

See Also¶

Sources¶

Functions Failing with Errors¶

1. Summary¶

Symptom¶

Troubleshooting decision flow¶

2. Common Misreadings¶

3. Competing Hypotheses¶

H1: Auth failures to downstream resources¶

H2: Function timeout reached¶

H3: Memory pressure or runtime mismatch¶

H4: Deployment regression¶

4. What to Check First¶

First-pass checks¶

Portal checks¶

CLI quick checks¶

5. Evidence to Collect¶

Required evidence set¶

Sample Log Patterns¶

KQL Queries with Example Output¶

Query 1: Function execution summary (success/failure/duration)¶

Query 2: Failed invocations with error details¶

Query 3: Exception trends¶

CLI Investigation Commands¶

6. Validation and Disproof by Hypothesis¶

H1: Auth failures to downstream resources¶

H2: Function timeout reached¶

H3: Memory pressure or runtime mismatch¶

H4: Deployment regression¶

Normal vs Abnormal Comparison¶

7. Likely Root Cause Patterns¶

8. Immediate Mitigations¶

Example mitigation commands¶

9. Prevention¶

Operational checklist¶

Related Labs¶

See Also¶

Sources¶