Quick Diagnosis Cards

One-page reference cards for rapid Azure Functions incident triage. Each card maps: Symptom → First Query → Platform Segment → Playbook.

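Use these when you have 60 seconds to classify the failure before opening the deeper playbooks. First Query rows and CLI checks use the workspace-based Log Analytics schema (`AppTraces`, `TimeGenerated`); the Quick KQL Checks use the classic Application Insights schema (`traces`, `timestamp`).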

Card 1: Slow First Invocation / Cold Start

graph LR
    A[Slow first invocation] --> B[First Query]
    B --> C[Platform Segment]
    C --> D[Playbook]

Step	Action
Symptom	First HTTP request or first trigger execution after idle/restart takes seconds longer than steady-state traffic
First Query	`AppTraces | where TimeGenerated > ago(2h) | where AppRoleName =~ "func-myapp-prod" | where Message has_any ("Host started", "Initializing Host", "Host lock lease acquired") | project TimeGenerated, Message | order by TimeGenerated desc`
What to Look For	Repeated host startups, large startup gaps before first invocation, or high first-request duration after scale-out
Platform Segment	Startup / Performance
Playbook	High Latency

Quick KQL Check:

let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Initializing Host", "Host lock lease acquired")
| summarize StartupEvents=count() by bin(timestamp, 15m)
| join kind=leftouter (
    requests
    | where timestamp > ago(6h)
    | where cloud_RoleName =~ appName
    | where operation_Name startswith "Functions."
    | summarize FirstInvocation=min(timestamp), MinDurationMs=min(duration) by bin(timestamp, 15m)
) on timestamp
| order by timestamp desc

Quick CLI Check:

az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(2h) | where AppRoleName =~ '$APP_NAME' | where Message has_any ('Host started','Initializing Host','Host lock lease acquired') | project TimeGenerated, Message | order by TimeGenerated desc" --output table
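
The CLI check above fixes the lookback window at two hours. A minimal sketch of a wrapper that builds the same query for any app name and window might look like the following. `build_startup_query` is a hypothetical helper name, not an Azure CLI command, and the query text is copied from this card.

```shell
# build_startup_query APP_NAME [LOOKBACK]
# Hypothetical helper: prints the Card 1 startup-trace query as one line.
# LOOKBACK is a KQL timespan literal such as 2h or 6h (defaults to 2h).
build_startup_query() {
  app_name="$1"
  lookback="${2:-2h}"
  printf "AppTraces | where TimeGenerated > ago(%s) | where AppRoleName =~ '%s' | where Message has_any ('Host started','Initializing Host','Host lock lease acquired') | project TimeGenerated, Message | order by TimeGenerated desc" "$lookback" "$app_name"
}

# Usage (assumes a logged-in Azure CLI session and $WORKSPACE_ID / $APP_NAME set):
# az monitor log-analytics query --workspace "$WORKSPACE_ID" \
#   --analytics-query "$(build_startup_query "$APP_NAME" 6h)" --output table
```

Keeping the query in one function means the window can be widened during an incident without retyping the filter list.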

Card 2: Trigger Failures by Trigger Type

graph LR
    A[Trigger not firing] --> B{Trigger type}
    B --> C[Source evidence]
    C --> D[Listener traces]

Step	Action
Symptom	HTTP, Timer, Queue, Event Hub, Blob, or Cosmos DB trigger stops executing while the app still appears available
First Query	`AppRequests | where TimeGenerated > ago(1h) | where AppRoleName =~ "func-myapp-prod" | where OperationName startswith "Functions." | summarize Invocations=count(), Failures=countif(Success == false) by OperationName | order by Invocations asc`
What to Look For	HTTP: 404/401/5xx patterns. Timer: missing schedule traces or `isPastDue`. Queue: backlog rising while invocations stay flat. Event Hub: checkpoint lag. Blob: missing Event Grid subscription or listener startup. Cosmos DB: lease/checkpoint errors or connection failures.
Platform Segment	Trigger Listener / Source Delivery
Playbook	Functions Not Executing

Quick KQL Check:

let appName = "func-myapp-prod";
let recentInvocations =
requests
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize Invocations=count(), Failures=countif(success == false) by FunctionName=operation_Name;
let recentTriggerTraces =
traces
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where tostring(customDimensions.Category) startswith "Function" or operation_Name startswith "Functions."
| where message has_any ("listener", "Timer", "Blob", "Queue", "EventHub", "Cosmos", "unable to start", "isPastDue")
| summarize TraceHits=count() by FunctionName=operation_Name;
recentInvocations
| join kind=leftouter recentTriggerTraces on FunctionName
| order by Invocations asc, Failures desc

Note: `traces.operation_Name` can also be set on traces that do not come from function code. The function-category filter above reduces false matches when trigger traces are joined to invocations.

Quick CLI Check:

az functionapp function list --resource-group "$RG" --name "$APP_NAME" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(1h) | where AppRoleName =~ '$APP_NAME' | where Message has_any ('listener','unable to start','Timer','Blob','Queue','EventHub','Cosmos','isPastDue') | project TimeGenerated, Message | order by TimeGenerated desc" --output table

Card 3: Binding and Extension Errors

graph LR
    A[Binding errors] --> B[Indexing traces]
    B --> C[Config mismatch]
    C --> D[Auth or extension fix]

Step	Action
Symptom	Functions fail during host startup or invocation with binding, indexing, serialization, or extension-bundle errors
First Query	`AppTraces | where TimeGenerated > ago(2h) | where AppRoleName =~ "func-myapp-prod" | where Message has_any ("Error indexing method", "binding", "extension", "Unable to resolve app setting", "Storage account connection string") | project TimeGenerated, Message | order by TimeGenerated desc`
What to Look For	`Error indexing method`, missing app setting names, unsupported binding attributes, wrong extension bundle version, or identity-based connection settings that do not match the binding configuration
Platform Segment	Runtime / Bindings
Playbook	App Settings Misconfiguration

Quick KQL Check:

let appName = "func-myapp-prod";
traces
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where message has_any (
    "Error indexing method",
    "binding",
    "extension",
    "Unable to resolve app setting",
    "Storage account connection string",
    "Microsoft.Azure.WebJobs"
)
| project timestamp, severityLevel, message
| order by timestamp desc

Quick CLI Check:

az functionapp config appsettings list --resource-group "$RG" --name "$APP_NAME" --output table
az functionapp config show --resource-group "$RG" --name "$APP_NAME" --output json

Card 4: Timeout / Execution Limit Exceeded

graph LR
    A[Timeout symptom] --> B[Duration query]
    B --> C[Plan limit check]
    C --> D[Timeout playbook]

Step	Action
Symptom	Invocations end with timeout errors, 230-second HTTP cutoff behavior, or long-running trigger executions that never complete
First Query	`AppRequests | where TimeGenerated > ago(2h) | where AppRoleName =~ "func-myapp-prod" | where OperationName startswith "Functions." | summarize P95Ms=percentile(DurationMs, 95), MaxMs=max(DurationMs), Failures=countif(Success == false) by OperationName | order by MaxMs desc`
What to Look For	Duration clustering near plan limit, requests ending with timeout-related exceptions, HTTP triggers failing at the front-end timeout boundary, or durable/orchestrated work incorrectly running inside a regular function
Platform Segment	Execution / Limits
Playbook	Timeout / Execution Limit Exceeded

Quick KQL Check:

let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
    Invocations=count(),
    Failures=countif(success == false),
    P95Ms=percentile(duration, 95),
    MaxMs=max(duration)
  by FunctionName=operation_Name
| order by MaxMs desc

Quick CLI Check:

az functionapp config appsettings list --resource-group "$RG" --name "$APP_NAME" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppExceptions | where TimeGenerated > ago(2h) | where AppRoleName =~ '$APP_NAME' | where OuterMessage has_any ('timeout','timed out','execution time') | project TimeGenerated, ExceptionType, OuterMessage | order by TimeGenerated desc" --output table
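
Durations clustering near a limit are easier to spot with an explicit threshold. The sketch below classifies one duration against a limit you supply. `classify_duration` is a hypothetical helper, the 90% "near-limit" cutoff is an assumption chosen for illustration, and 230000 ms corresponds to the 230-second HTTP cutoff named in this card. Substitute your plan's configured function timeout for non-HTTP triggers.

```shell
# classify_duration DURATION_MS LIMIT_MS
# Hypothetical helper: prints at-limit, near-limit, or ok.
# Fractional milliseconds (as returned by KQL) are truncated before comparing.
classify_duration() {
  duration_ms="${1%.*}"          # drop any fractional part, e.g. 229500.4 -> 229500
  duration_ms="${duration_ms:-0}"
  limit_ms="$2"
  near_ms=$(( limit_ms * 90 / 100 ))   # assumed cutoff: 90% of the limit
  if [ "$duration_ms" -ge "$limit_ms" ]; then
    echo "at-limit"
  elif [ "$duration_ms" -ge "$near_ms" ]; then
    echo "near-limit"
  else
    echo "ok"
  fi
}

# Example: feed MaxMs values from the query above.
# classify_duration 229500.4 230000
```

A MaxMs value that lands in the at-limit or near-limit band for the HTTP boundary points at the front-end cutoff rather than the function's own timeout setting.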

Card 5: Memory or CPU Exhaustion

graph LR
    A[Worker slowdown] --> B[Metrics]
    B --> C[Restart or crash traces]
    C --> D[Resource-pressure playbook]

Step	Action
Symptom	Throughput drops, worker restarts, OOM kills appear, or CPU-bound functions cause broad latency across multiple triggers
First Query	`AppTraces | where TimeGenerated > ago(6h) | where AppRoleName =~ "func-myapp-prod" | where Message has_any ("OOM", "OutOfMemory", "worker process started and initialized", "Host is shutting down", "restarting", "health check") | project TimeGenerated, Message | order by TimeGenerated desc`
What to Look For	Restart storms, OOM signatures, high latency across unrelated functions, dependency noise caused by saturation, or queue backlog growth with stable upstream volume
Platform Segment	Compute / Worker Health
Playbook	Out of Memory / Worker Crash

Quick KQL Check:

let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any (
    "OOM",
    "OutOfMemory",
    "worker process started and initialized",
    "Host is shutting down",
    "restarting",
    "health check"
)
| project timestamp, severityLevel, message
| order by timestamp desc

Quick CLI Check:

az monitor metrics list --resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" --metric "CpuPercentage" "MemoryWorkingSet" "Requests" --interval PT1M --aggregation Average Maximum Total --output table

Card 6: Deployment Succeeded but Functions Broke

graph LR
    A[Failure after deploy] --> B[Activity Log]
    B --> C[Startup and indexing traces]
    C --> D[Deployment playbook]

Step	Action
Symptom	Release completed, but functions are missing, returning errors, or no longer processing events immediately afterward
First Query	`AppTraces | where TimeGenerated > ago(6h) | where AppRoleName =~ "func-myapp-prod" | where Message has_any ("Host started", "Generating", "No job functions found", "Error indexing method", "Syncing triggers") | project TimeGenerated, Message | order by TimeGenerated desc`
What to Look For	`No job functions found`, broken package structure, runtime mismatch, extension load errors, or trigger sync problems introduced at deployment time
Platform Segment	Deployment / Startup
Playbook	Deployment Failures

Quick KQL Check:

let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any (
    "Host started",
    "Generating",
    "No job functions found",
    "Error indexing method",
    "Syncing triggers",
    "Worker process started and initialized"
)
| project timestamp, severityLevel, message
| order by timestamp desc

Quick CLI Check:

az monitor activity-log list --resource-group "$RG" --offset 6h --max-events 20 --output table
az functionapp function list --resource-group "$RG" --name "$APP_NAME" --output table
az functionapp config appsettings list --resource-group "$RG" --name "$APP_NAME" --output table

Card 7: Scale Out Not Keeping Up

graph LR
    A[Backlog grows] --> B[Invocation trend]
    B --> C[Scale signals]
    C --> D[Queue or event-hub playbook]

Step	Action
Symptom	Queue depth, Event Hub lag, or pending work grows faster than completions even though the function app is still executing some work
First Query	`AppRequests | where TimeGenerated > ago(2h) | where AppRoleName =~ "func-myapp-prod" | where OperationName startswith "Functions." | summarize Invocations=count(), Failures=countif(Success == false), P95Ms=percentile(DurationMs, 95) by bin(TimeGenerated, 5m), OperationName | order by TimeGenerated asc`
What to Look For	Flat or weak invocation growth while source volume rises, repeated scale-controller or listener warnings, partition imbalance, checkpoint lag, or one hot partition blocking the rest of the workload
Platform Segment	Scaling / Throughput
Playbook	Queue Piling Up and Event Hub / Service Bus Lag

Quick KQL Check:

let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize Invocations=count(), Failures=countif(success == false), P95Ms=percentile(duration, 95) by bin(timestamp, 5m), FunctionName=operation_Name
| order by timestamp asc

Quick CLI Check:

az monitor metrics list --resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" --metric "FunctionExecutionCount" "FunctionExecutionUnits" "Requests" --interval PT5M --aggregation Total Average --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(2h) | where AppRoleName =~ '$APP_NAME' | where Message has_any ('scale','partition','checkpoint','listener','backlog') | project TimeGenerated, Message | order by TimeGenerated desc" --output table
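
The symptom for this card is a backlog that grows faster than it drains. As a rough sketch, the trend can be read from a few backlog samples taken over the incident window. `backlog_trend` is a hypothetical helper, and where the samples come from (queue depth, Event Hub lag, or another source metric) is left as an assumption.

```shell
# backlog_trend SAMPLE1 SAMPLE2 [SAMPLE3 ...]
# Hypothetical helper: compares the oldest and newest integer backlog samples
# (oldest first) and prints growing, draining, or flat.
backlog_trend() {
  if [ "$#" -lt 2 ]; then
    echo "need-more-samples"
    return 1
  fi
  first="$1"
  for last in "$@"; do :; done   # after the loop, $last holds the final sample
  if [ "$last" -gt "$first" ]; then
    echo "growing"
  elif [ "$last" -lt "$first" ]; then
    echo "draining"
  else
    echo "flat"
  fi
}

# Example with three depth readings taken five minutes apart:
# backlog_trend 100 150 400
```

A "growing" result alongside flat invocation counts in the query above is the pattern this card describes.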

Universal First 3 Queries

When you do not know where to start, run these three queries first to establish the failure domain.

Query 1: Function Execution Trend

let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize total=count(), failed=countif(success == false), p95=percentile(duration, 95) by bin(timestamp, 5m), operation_Name
| order by timestamp asc

Query 2: Host and Listener Events

let appName = "func-myapp-prod";
traces
| where timestamp > ago(24h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Host shutdown", "listener", "unable to start", "Error indexing method", "Syncing triggers", "scale")
| project timestamp, severityLevel, message
| order by timestamp desc

Query 3: Dominant Exceptions and Dependencies

let appName = "func-myapp-prod";
exceptions
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| summarize ExceptionCount=count() by type, outerMessage
| order by ExceptionCount desc

Decision Matrix

Observation	Most Likely Card	Confidence
First invocation slow after idle or scale event	Card 1 (Cold Start)	High
HTTP/Timer/Queue/Event Hub/Blob/Cosmos trigger not firing	Card 2 (Trigger Failures)	High
Host startup logs mention indexing, binding, or extension issues	Card 3 (Binding Errors)	High
Durations cluster near timeout boundary	Card 4 (Timeout)	High
Slowdown followed by worker recycle or OOM traces	Card 5 (Memory/CPU Exhaustion)	High
Incident begins right after deployment or trigger sync	Card 6 (Deployment Failures)	High
Backlog or lag grows while invocations rise too slowly	Card 7 (Scale Out Problems)	Medium-High
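
The matrix can also be kept as a small lookup for use in incident scripts. This is a sketch only: `suggest_card` and its keyword arguments are hypothetical names invented for illustration, while the card labels are taken from the table above.

```shell
# suggest_card OBSERVATION_KEY
# Hypothetical helper: maps a short observation keyword to the matching card.
suggest_card() {
  case "$1" in
    cold-start)           echo "Card 1 (Cold Start)" ;;
    trigger-not-firing)   echo "Card 2 (Trigger Failures)" ;;
    binding-error)        echo "Card 3 (Binding Errors)" ;;
    timeout)              echo "Card 4 (Timeout)" ;;
    resource-exhaustion)  echo "Card 5 (Memory/CPU Exhaustion)" ;;
    post-deploy)          echo "Card 6 (Deployment Failures)" ;;
    backlog-growth)       echo "Card 7 (Scale Out Problems)" ;;
    *)
      # Unknown observation: fall back to the Universal First 3 Queries.
      echo "Universal First 3 Queries"
      return 1
      ;;
  esac
}

# Example:
# suggest_card timeout
```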
