Quick Diagnosis Cards

One-page reference cards for rapid Azure Functions incident triage. Each card maps: Symptom → First Query → Platform Segment → Playbook.

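Use these when you have 60 seconds to classify the failure before opening the deeper playbooks. The First Query rows and Quick CLI Checks use the workspace-based Log Analytics tables (AppTraces, AppRequests, AppExceptions); the Quick KQL Checks use the equivalent Application Insights tables (traces, requests, exceptions). Replace func-myapp-prod with your own function app's role name.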


Card 1: Slow First Invocation / Cold Start

graph LR
    A[Slow first invocation] --> B[First Query]
    B --> C[Platform Segment]
    C --> D[Playbook]
| Step | Action |
| --- | --- |
| Symptom | First HTTP request or first trigger execution after idle/restart takes seconds longer than steady-state traffic |
| First Query | `AppTraces \| where TimeGenerated > ago(2h) \| where AppRoleName =~ "func-myapp-prod" \| where Message has_any ("Host started", "Initializing Host", "Host lock lease acquired") \| project TimeGenerated, Message \| order by TimeGenerated desc` |
| What to Look For | Repeated host startups, large startup gaps before first invocation, or high first-request duration after scale-out |
| Platform Segment | Startup / Performance |
| Playbook | High Latency |

Quick KQL Check:

let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Initializing Host", "Host lock lease acquired")
| summarize StartupEvents=count() by bin(timestamp, 15m)
| join kind=leftouter (
    requests
    | where timestamp > ago(6h)
    | where cloud_RoleName =~ appName
    | where operation_Name startswith "Functions."
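    // requests.duration is already a real number of milliseconds; max() surfaces the slow cold-start request in each bin
    | summarize FirstInvocation=min(timestamp), MaxDurationMs=max(duration) by bin(timestamp, 15m)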
) on timestamp
| order by timestamp desc

Quick CLI Check:

az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(2h) | where AppRoleName =~ '$APP_NAME' | where Message has_any ('Host started','Initializing Host','Host lock lease acquired') | project TimeGenerated, Message | order by TimeGenerated desc" --output table
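The CLI checks on this page assume that `$WORKSPACE_ID` and `$APP_NAME` are already set in the shell. As a minimal sketch, the query string can be built by a small function so that the app name and lookback window are passed in rather than edited by hand. The function name `startup_query` is hypothetical, and the sketch assumes the app name contains no single quotes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the Card 1 startup query for a given app name
# and lookback window (defaults to 2h, matching the check above).
# Assumption: the app name contains no single quotes.
startup_query() {
  local app_name="$1" lookback="${2:-2h}"
  printf "AppTraces | where TimeGenerated > ago(%s) | where AppRoleName =~ '%s' | where Message has_any ('Host started','Initializing Host','Host lock lease acquired') | project TimeGenerated, Message | order by TimeGenerated desc" "$lookback" "$app_name"
}

# Usage (needs a signed-in Azure CLI; WORKSPACE_ID is the workspace GUID):
# az monitor log-analytics query --workspace "$WORKSPACE_ID" \
#   --analytics-query "$(startup_query "$APP_NAME" 6h)" --output table
```

Widening the window to 6h, as in the commented usage, lines the CLI result up with the Quick KQL Check above.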

Card 2: Trigger Failures by Trigger Type

graph LR
    A[Trigger not firing] --> B{Trigger type}
    B --> C[Source evidence]
    C --> D[Listener traces]
| Step | Action |
| --- | --- |
| Symptom | HTTP, Timer, Queue, Event Hub, Blob, or Cosmos DB trigger stops executing while the app still appears available |
| First Query | `AppRequests \| where TimeGenerated > ago(1h) \| where AppRoleName =~ "func-myapp-prod" \| where OperationName startswith "Functions." \| summarize Invocations=count(), Failures=countif(Success == false) by OperationName \| order by Invocations asc` |
| What to Look For | HTTP: 404/401/5xx patterns. Timer: missing schedule traces or isPastDue. Queue: backlog rising while invocations stay flat. Event Hub: checkpoint lag. Blob: missing Event Grid subscription or listener startup. Cosmos DB: lease/checkpoint errors or connection failures. |
| Platform Segment | Trigger Listener / Source Delivery |
| Playbook | Functions Not Executing |

Quick KQL Check:

let appName = "func-myapp-prod";
let recentInvocations =
requests
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize Invocations=count(), Failures=countif(success == false) by FunctionName=operation_Name;
let recentTriggerTraces =
traces
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where tostring(customDimensions.Category) startswith "Function" or operation_Name startswith "Functions."
| where message has_any ("listener", "Timer", "Blob", "Queue", "EventHub", "Cosmos", "unable to start", "isPastDue")
| summarize TraceHits=count() by FunctionName=operation_Name;
recentInvocations
| join kind=leftouter recentTriggerTraces on FunctionName
| order by Invocations asc, Failures desc

traces.operation_Name can include non-function traces. The function-category filter above reduces false matches in trigger-correlation joins.

Quick CLI Check:

az functionapp function list --resource-group "$RG" --name "$APP_NAME" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(1h) | where AppRoleName =~ '$APP_NAME' | where Message has_any ('listener','unable to start','Timer','Blob','Queue','EventHub','Cosmos','isPastDue') | project TimeGenerated, Message | order by TimeGenerated desc" --output table

Card 3: Binding and Extension Errors

graph LR
    A[Binding errors] --> B[Indexing traces]
    B --> C[Config mismatch]
    C --> D[Auth or extension fix]
| Step | Action |
| --- | --- |
| Symptom | Functions fail during host startup or invocation with binding, indexing, serialization, or extension-bundle errors |
| First Query | `AppTraces \| where TimeGenerated > ago(2h) \| where AppRoleName =~ "func-myapp-prod" \| where Message has_any ("Error indexing method", "binding", "extension", "Unable to resolve app setting", "Storage account connection string") \| project TimeGenerated, Message \| order by TimeGenerated desc` |
| What to Look For | Error indexing method, missing app setting names, unsupported binding attributes, wrong extension bundle version, or identity-based connection settings that do not match the binding configuration |
| Platform Segment | Runtime / Bindings |
| Playbook | App Settings Misconfiguration |

Quick KQL Check:

let appName = "func-myapp-prod";
traces
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where message has_any (
    "Error indexing method",
    "binding",
    "extension",
    "Unable to resolve app setting",
    "Storage account connection string",
    "Microsoft.Azure.WebJobs"
)
| project timestamp, severityLevel, message
| order by timestamp desc

Quick CLI Check:

az functionapp config appsettings list --resource-group "$RG" --name "$APP_NAME" --output table
az functionapp config show --resource-group "$RG" --name "$APP_NAME" --output json

Card 4: Timeout / Execution Limit Exceeded

graph LR
    A[Timeout symptom] --> B[Duration query]
    B --> C[Plan limit check]
    C --> D[Timeout playbook]
| Step | Action |
| --- | --- |
| Symptom | Invocations end with timeout errors, 230-second HTTP cutoff behavior, or long-running trigger executions that never complete |
| First Query | `AppRequests \| where TimeGenerated > ago(2h) \| where AppRoleName =~ "func-myapp-prod" \| where OperationName startswith "Functions." \| summarize P95Ms=percentile(DurationMs, 95), MaxMs=max(DurationMs), Failures=countif(Success == false) by OperationName \| order by MaxMs desc` |
| What to Look For | Duration clustering near plan limit, requests ending with timeout-related exceptions, HTTP triggers failing at the front-end timeout boundary, or durable/orchestrated work incorrectly running inside a regular function |
| Platform Segment | Execution / Limits |
| Playbook | Timeout / Execution Limit Exceeded |

Quick KQL Check:

let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
    Invocations=count(),
    Failures=countif(success == false),
    P95Ms=percentile(duration, 95),
    MaxMs=max(duration)
  by FunctionName=operation_Name
| order by MaxMs desc

Quick CLI Check:

az functionapp config appsettings list --resource-group "$RG" --name "$APP_NAME" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppExceptions | where TimeGenerated > ago(2h) | where AppRoleName =~ '$APP_NAME' | where OuterMessage has_any ('timeout','timed out','execution time') | project TimeGenerated, ExceptionType, OuterMessage | order by TimeGenerated desc" --output table
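To judge whether a `MaxMs` value from the queries above is "clustering near the limit", it can help to compare it against the timeout explicitly. The following is a minimal sketch, not part of any Azure tooling: `near_timeout` is a hypothetical helper, and the 90 percent default threshold is an arbitrary assumption to tune for your workload.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flags a duration that sits within a given percentage
# of a timeout. Arguments: max duration in ms (fraction is dropped),
# timeout in seconds, threshold percent (default 90, an arbitrary choice).
near_timeout() {
  local max_ms="${1%.*}" limit_s="$2" threshold_pct="${3:-90}"
  local limit_ms=$(( limit_s * 1000 ))
  # Integer comparison: max_ms / limit_ms >= threshold_pct / 100
  if [ $(( max_ms * 100 )) -ge $(( limit_ms * threshold_pct )) ]; then
    echo "near-limit"
  else
    echo "ok"
  fi
}

# Example: an HTTP-triggered function whose slowest request took 229 s,
# checked against the 230-second HTTP cutoff named in the symptom row.
near_timeout 229000 230
```

For non-HTTP triggers, pass the `functionTimeout` value configured in host.json (converted to seconds) as the second argument instead of 230.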

Card 5: Memory or CPU Exhaustion

graph LR
    A[Worker slowdown] --> B[Metrics]
    B --> C[Restart or crash traces]
    C --> D[Resource-pressure playbook]
| Step | Action |
| --- | --- |
| Symptom | Throughput drops, worker restarts, OOM kills appear, or CPU-bound functions cause broad latency across multiple triggers |
| First Query | `AppTraces \| where TimeGenerated > ago(6h) \| where AppRoleName =~ "func-myapp-prod" \| where Message has_any ("OOM", "OutOfMemory", "worker process started and initialized", "Host is shutting down", "restarting", "health check") \| project TimeGenerated, Message \| order by TimeGenerated desc` |
| What to Look For | Restart storms, OOM signatures, high latency across unrelated functions, dependency noise caused by saturation, or queue backlog growth with stable upstream volume |
| Platform Segment | Compute / Worker Health |
| Playbook | Out of Memory / Worker Crash |

Quick KQL Check:

let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any (
    "OOM",
    "OutOfMemory",
    "worker process started and initialized",
    "Host is shutting down",
    "restarting",
    "health check"
)
| project timestamp, severityLevel, message
| order by timestamp desc

Quick CLI Check:

az monitor metrics list --resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" --metric "CpuPercentage" "MemoryWorkingSet" "Requests" --interval PT1M --aggregation Average Maximum Total --output table
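The metrics commands on this page pass a full resource ID assembled from `$SUBSCRIPTION_ID`, `$RG`, and `$APP_NAME`. As a small sketch, that ID can be built once and reused so the long path is not retyped during an incident. The function name `function_app_resource_id` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the ARM resource ID of a function app in the
# same form used by the metrics command above.
function_app_resource_id() {
  local subscription_id="$1" resource_group="$2" app_name="$3"
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Web/sites/%s' \
    "$subscription_id" "$resource_group" "$app_name"
}

# Usage (needs a signed-in Azure CLI):
# SUBSCRIPTION_ID="$(az account show --query id --output tsv)"
# RESOURCE_ID="$(function_app_resource_id "$SUBSCRIPTION_ID" "$RG" "$APP_NAME")"
# az monitor metrics list --resource "$RESOURCE_ID" --metric "MemoryWorkingSet" \
#   --interval PT1M --aggregation Average Maximum --output table
```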

Card 6: Deployment Succeeded but Functions Broke

graph LR
    A[Failure after deploy] --> B[Activity Log]
    B --> C[Startup and indexing traces]
    C --> D[Deployment playbook]
| Step | Action |
| --- | --- |
| Symptom | Release completed, but functions are missing, returning errors, or no longer processing events immediately afterward |
| First Query | `AppTraces \| where TimeGenerated > ago(6h) \| where AppRoleName =~ "func-myapp-prod" \| where Message has_any ("Host started", "Generating", "No job functions found", "Error indexing method", "Syncing triggers") \| project TimeGenerated, Message \| order by TimeGenerated desc` |
| What to Look For | No job functions found, broken package structure, runtime mismatch, extension load errors, or trigger sync problems introduced at deployment time |
| Platform Segment | Deployment / Startup |
| Playbook | Deployment Failures |

Quick KQL Check:

let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any (
    "Host started",
    "Generating",
    "No job functions found",
    "Error indexing method",
    "Syncing triggers",
    "Worker process started and initialized"
)
| project timestamp, severityLevel, message
| order by timestamp desc

Quick CLI Check:

az monitor activity-log list --resource-group "$RG" --offset 6h --max-events 20 --output table
az functionapp function list --resource-group "$RG" --name "$APP_NAME" --output table
az functionapp config appsettings list --resource-group "$RG" --name "$APP_NAME" --output table
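When the symptom is "functions are missing", comparing the deployed function list against the names you expect is faster than reading the table by eye. The sketch below assumes the deployed names arrive one per line on standard input. The helper name `missing_functions` is hypothetical, and it strips any `app-name/` prefix because the CLI may report names in that form (an assumption about the output shape, not something this page confirms).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads deployed function names from stdin (one per
# line, optionally prefixed with "app-name/") and prints every expected
# name, given as arguments, that is not deployed.
missing_functions() {
  local deployed expected
  # Drop everything up to the last slash so "app/Func" and "Func" both work.
  deployed="$(sed 's#^.*/##')"
  for expected in "$@"; do
    printf '%s\n' "$deployed" | grep -Fxq -- "$expected" || printf '%s\n' "$expected"
  done
}

# Usage (needs a signed-in Azure CLI; the function names are placeholders):
# az functionapp function list --resource-group "$RG" --name "$APP_NAME" \
#   --query "[].name" --output tsv | missing_functions HttpIngest QueueWorker
```

Empty output means every expected function was found; any printed name is a candidate for the "No job functions found" or indexing checks above.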

Card 7: Scale Out Not Keeping Up

graph LR
    A[Backlog grows] --> B[Invocation trend]
    B --> C[Scale signals]
    C --> D[Queue or event-hub playbook]
| Step | Action |
| --- | --- |
| Symptom | Queue depth, Event Hub lag, or pending work grows faster than completions even though the function app is still executing some work |
| First Query | `AppRequests \| where TimeGenerated > ago(2h) \| where AppRoleName =~ "func-myapp-prod" \| where OperationName startswith "Functions." \| summarize Invocations=count(), Failures=countif(Success == false), P95Ms=percentile(DurationMs, 95) by bin(TimeGenerated, 5m), OperationName \| order by TimeGenerated asc` |
| What to Look For | Flat or weak invocation growth while source volume rises, repeated scale-controller or listener warnings, partition imbalance, checkpoint lag, or one hot partition blocking the rest of the workload |
| Platform Segment | Scaling / Throughput |
| Playbook | Queue Piling Up and Event Hub / Service Bus Lag |

Quick KQL Check:

let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize Invocations=count(), Failures=countif(success == false), P95Ms=percentile(duration, 95) by bin(timestamp, 5m), FunctionName=operation_Name
| order by timestamp asc

Quick CLI Check:

az monitor metrics list --resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" --metric "FunctionExecutionCount" "FunctionExecutionUnits" "Requests" --interval PT5M --aggregation Total Average --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(2h) | where AppRoleName =~ '$APP_NAME' | where Message has_any ('scale','partition','checkpoint','listener','backlog') | project TimeGenerated, Message | order by TimeGenerated desc" --output table

Universal First 3 Queries

When you do not know where to start, run these three queries first to establish the failure domain.

Query 1: Function Execution Trend

let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize total=count(), failed=countif(success == false), p95=percentile(duration, 95) by bin(timestamp, 5m), operation_Name
| order by timestamp asc

Query 2: Host and Listener Events

let appName = "func-myapp-prod";
traces
| where timestamp > ago(24h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Host shutdown", "listener", "unable to start", "Error indexing method", "Syncing triggers", "scale")
| project timestamp, severityLevel, message
| order by timestamp desc

Query 3: Dominant Exceptions and Dependencies

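let appName = "func-myapp-prod";
union
    (exceptions
    | where timestamp > ago(6h) and cloud_RoleName =~ appName
    | summarize Count=count() by Name=type, Detail=outerMessage
    | extend Kind="exception"),
    (dependencies
    | where timestamp > ago(6h) and cloud_RoleName =~ appName and success == false
    | summarize Count=count() by Name=type, Detail=strcat(target, " ", resultCode)
    | extend Kind="dependency")
| project Kind, Name, Detail, Count
| order by Count desc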

Decision Matrix

| Observation | Most Likely Card | Confidence |
| --- | --- | --- |
| First invocation slow after idle or scale event | Card 1 (Cold Start) | High |
| HTTP/Timer/Queue/Event Hub/Blob/Cosmos trigger not firing | Card 2 (Trigger Failures) | High |
| Host startup logs mention indexing, binding, or extension issues | Card 3 (Binding Errors) | High |
| Durations cluster near timeout boundary | Card 4 (Timeout) | High |
| Slowdown followed by worker recycle or OOM traces | Card 5 (Memory/CPU Exhaustion) | High |
| Incident begins right after deployment or trigger sync | Card 6 (Deployment Failures) | High |
| Backlog or lag grows while invocations rise too slowly | Card 7 (Scale Out Problems) | Medium-High |
