Observability in Azure Functions¶
Azure Functions has built-in integration with Application Insights and Azure Monitor to provide comprehensive observability for your serverless applications.
Data Flow Diagram¶
graph TD
Trigger[Function Trigger] -->|Invocation| Function[Azure Function]
Function -->|Trace/Log| AI[Application Insights]
AI -->|HTTPS| LAW[Log Analytics Workspace]
Function -->|Metrics| AM[Azure Monitor Metrics]
AM -->|Rule Evaluation| Alerts[Alerts]
Key Monitoring Areas¶
- Execution Logs: Detailed information about each function execution, including traces, exceptions, and requests.
- Host Metrics: Platform-level performance data such as CPU, memory usage, and function invocation counts.
- Invocation Tracing: Correlation of events across different services (e.g., tracking a message from a queue through function processing).
Keep host telemetry and invocation telemetry separate during analysis:
- Host metrics describe the Functions runtime and underlying App Service worker behavior.
    - CPU percentage
    - Memory working set
    - Instance count
    - Host restarts or runtime initialization traces
- Invocation metrics describe what each function execution experienced.
    - Invocation count
    - Success rate
    - Duration percentiles
    - Trigger-specific retries or throttling symptoms
If duration rises while host metrics remain stable, the problem is often dependency latency or upstream backlog rather than worker saturation. If both host metrics and invocation metrics degrade together, investigate plan capacity, scaling, and runtime configuration first.
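As an illustration, a query along these lines (a sketch against the standard Application Insights `requests` table; the app name is a placeholder) tracks the invocation-side signals over time so they can be compared against host metrics for the same window:

```kusto
requests
| where timestamp > ago(6h)
| where cloud_RoleName == "my-function-app"
| summarize
    InvocationCount = count(),
    SuccessRate = round(100.0 * countif(success == true) / count(), 2),
    P50Duration = percentile(duration, 50),
    P95Duration = percentile(duration, 95)
    by bin(timestamp, 15m)
| order by timestamp asc
```

If P95Duration climbs while the platform CPU and memory charts stay flat, start with dependencies and upstream backlog rather than worker capacity.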
Log Categories in Log Analytics¶
When diagnostic logs are enabled, you can find function logs in these tables:
- FunctionAppLogs: Traces generated by the function host and your application code.
- AppServiceHTTPLogs: Details about incoming HTTP requests to your function.
Depending on trigger type and hosting plan, operators often also enable adjacent App Service categories for broader context:
- Useful in most environments
    - FunctionAppLogs
    - AppServiceHTTPLogs
    - AppServiceConsoleLogs
    - AppServicePlatformLogs
- Enable selectively
    - AppServiceAuditLogs
    - AppServiceIPSecAuditLogs
FunctionAppLogs is the best default source for runtime execution evidence, but HTTP-triggered workloads benefit from AppServiceHTTPLogs when you need the request envelope and web server perspective alongside Application Insights telemetry.
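Once FunctionAppLogs is flowing into a workspace, a query like the following surfaces recent warnings and errors (a sketch; the columns used here follow the documented FunctionAppLogs schema):

```kusto
FunctionAppLogs
| where TimeGenerated > ago(1h)
| where Level in ("Warning", "Error")
| project TimeGenerated, FunctionName, Level, Message
| order by TimeGenerated desc
```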
Configuration Examples¶
Connecting Application Insights via CLI¶
To enable Application Insights for a function app, set the APPLICATIONINSIGHTS_CONNECTION_STRING in the app settings.
az functionapp config appsettings set \
--resource-group "my-resource-group" \
--name "my-function-app" \
--settings "APPLICATIONINSIGHTS_CONNECTION_STRING=<connection-string>"
KQL Query Examples¶
Monitor Function Execution Status¶
Summarize function execution results over the last hour.
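The query body is not shown above; a minimal version against the Application Insights `requests` table could look like:

```kusto
requests
| where timestamp > ago(1h)
| summarize InvocationCount = count() by name, success
| order by InvocationCount desc
```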
Analyze Function Duration¶
Find the average and maximum duration of your function executions.
requests
| where timestamp > ago(12h)
| summarize avg(duration), max(duration) by name
| order by avg_duration desc
Find Common Exceptions¶
List the top errors occurring in your function app.
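A sketch of such a query against the `exceptions` table, grouping on the standard `type` and `outerMessage` fields:

```kusto
exceptions
| where timestamp > ago(24h)
| summarize ExceptionCount = count() by type, outerMessage
| order by ExceptionCount desc
| take 10
```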
Correlate Failed Invocations With Dependencies¶
requests
| where timestamp > ago(1h)
| where success == false
| project operation_Id, name, duration, resultCode, cloud_RoleName
| join kind=leftouter (
dependencies
| where timestamp > ago(1h)
| project operation_Id, DependencyName=name, DependencyType=type, DependencySuccess=success, DependencyDuration=duration
) on operation_Id
| order by duration desc
Review Host-Level Traces¶
traces
| where timestamp > ago(30m)
| where cloud_RoleName == "my-function-app"
| where message has_any ("Host started", "Function started", "Executing", "Failed")
| project timestamp, severityLevel, message
| order by timestamp desc
Sample output:
timestamp severityLevel message
------------------------- ------------- ---------------------------------------------------------
2026-04-06T01:05:00Z 3 Executed 'ProcessOrders' (Failed, Id=a1b2c3d4, Duration=2145ms)
2026-04-06T01:05:00Z 2 Executing 'ProcessOrders' (Reason='New queue message detected')
Monitoring Baseline¶
For Azure Functions, split monitoring into these four operational views:
- Invocation success
    - Failed executions
    - Retry patterns
    - Trigger-specific backlog symptoms
- Performance
    - Function duration
    - Cold-start symptoms for HTTP triggers
    - Dependency latency
- Host behavior
    - Host restarts
    - Scaling changes
    - Runtime version or configuration changes
- Application behavior
    - Traces
    - Exceptions
    - Custom business events if instrumented
CLI Workflow¶
Confirm application settings¶
az functionapp config appsettings list \
--resource-group "my-resource-group" \
--name "my-function-app" \
--query "[?name=='APPLICATIONINSIGHTS_CONNECTION_STRING']"
Sample output:
[
  {
    "name": "APPLICATIONINSIGHTS_CONNECTION_STRING",
    "slotSetting": false,
    "value": "<connection-string>"
  }
]
Check Function App configuration and runtime¶
az functionapp show \
--resource-group "my-resource-group" \
--name "my-function-app" \
--query "{state:state, defaultHostName:defaultHostName, kind:kind, reserved:reserved}"
Sample output:
{
"defaultHostName": "my-function-app.azurewebsites.net",
"kind": "functionapp,linux",
"reserved": true,
"state": "Running"
}
Query recent failed executions¶
az monitor app-insights query \
--app "my-app-insights" \
--analytics-query "requests | where timestamp > ago(30m) | where success == false | summarize count() by name" \
--output table
Sample output:
Name                 Count_
-------------------  ------
ProcessOrders        14
HttpWebhookReceiver  3
Diagnostic Settings Strategy¶
Azure Functions usually sends rich application telemetry through Application Insights, but diagnostic settings still matter because they capture host and platform context that can be missing from request telemetry alone.
Recommended categories to enable¶
- Minimum baseline
    - FunctionAppLogs
    - AppServicePlatformLogs
    - AllMetrics
- Add for HTTP-heavy workloads
    - AppServiceHTTPLogs
- Add for startup or custom logging investigations
    - AppServiceConsoleLogs
Create diagnostic settings for a function app¶
az monitor diagnostic-settings create \
--name "diag-functionapp-observability" \
--resource "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Web/sites/my-function-app" \
--workspace "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--logs '[
{
"category": "FunctionAppLogs",
"enabled": true
},
{
"category": "AppServiceHTTPLogs",
"enabled": true
},
{
"category": "AppServiceConsoleLogs",
"enabled": true
},
{
"category": "AppServicePlatformLogs",
"enabled": true
}
]' \
--metrics '[
{
"category": "AllMetrics",
"enabled": true
}
]'
Check categories before rollout¶
az monitor diagnostic-settings categories list \
--resource "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Web/sites/my-function-app"
Use the category list in CI or during standards reviews so Linux, Windows, and plan-specific deployments are validated against the same observability baseline.
Additional KQL for Timeout and Plan Behavior¶
Detect long-running executions nearing timeout¶
Timeout symptoms are often visible before full failures occur. This query highlights functions with long-running executions and separates successful completions from failures.
requests
| where timestamp > ago(24h)
| where cloud_RoleName == "my-function-app"
| summarize InvocationCount=count(), P95Duration=percentile(duration, 95), MaxDuration=max(duration), Failures=countif(success == false) by name
| order by P95Duration desc
Sample output:
| name | InvocationCount | P95Duration | MaxDuration | Failures | Interpretation |
|---|---|---|---|---|---|
| ProcessOrders | 1842 | 00:00:52 | 00:01:18 | 14 | Queue processing is running near timeout risk; review dependency latency and batch sizing. |
| HttpWebhookReceiver | 9211 | 00:00:04 | 00:00:11 | 3 | HTTP trigger latency is normal and not the main incident driver. |
Correlate host restarts with invocation degradation¶
let HostEvents =
traces
| where timestamp > ago(6h)
| where cloud_RoleName == "my-function-app"
| where message has_any ("Host started", "Host initialized", "Stopping JobHost", "Generating", "Host lock lease")
| summarize HostTraceCount=count() by bin(timestamp, 10m);
let InvocationFailures =
requests
| where timestamp > ago(6h)
| where cloud_RoleName == "my-function-app"
| summarize FailedInvocations=countif(success == false), TotalInvocations=count() by bin(timestamp, 10m);
HostEvents
| join kind=fullouter InvocationFailures on timestamp
| project timestamp=coalesce(timestamp, timestamp1), HostTraceCount=coalesce(HostTraceCount, 0), FailedInvocations=coalesce(FailedInvocations, 0), TotalInvocations=coalesce(TotalInvocations, 0)
| order by timestamp desc
Sample output:
| timestamp | HostTraceCount | FailedInvocations | TotalInvocations | Interpretation |
|---|---|---|---|---|
| 2026-04-06T01:00:00Z | 12 | 18 | 220 | Host churn and invocation failures rose together; inspect scaling or runtime recycle causes. |
| 2026-04-06T01:10:00Z | 1 | 17 | 225 | Failures continued after host stabilized; investigate downstream dependencies. |
Compare dependency latency with execution duration¶
requests
| where timestamp > ago(4h)
| where cloud_RoleName == "my-function-app"
| summarize AvgExecution=avg(duration), P95Execution=percentile(duration, 95) by operation_Id, name
| join kind=inner (
dependencies
| where timestamp > ago(4h)
| summarize AvgDependency=avg(duration), FailedDependencies=countif(success == false) by operation_Id
) on operation_Id
| summarize AvgExecution=avg(AvgExecution), P95Execution=max(P95Execution), AvgDependency=avg(AvgDependency), FailedDependencies=sum(FailedDependencies) by name
| order by P95Execution desc
Sample output:
| name | AvgExecution | P95Execution | AvgDependency | FailedDependencies | Interpretation |
|---|---|---|---|---|---|
| ProcessOrders | 00:00:09 | 00:00:41 | 00:00:07 | 28 | Most execution time is spent waiting on dependencies. |
| GenerateInvoice | 00:00:03 | 00:00:07 | 00:00:01 | 0 | Healthy function; not part of the slowdown. |
Execution Duration and Timeout Monitoring¶
Treat timeout monitoring as a combination of three signals:
- Duration percentiles
    - P50 shows normal behavior.
    - P95 and P99 show customer-impacting tail latency.
- Near-timeout executions
    - Long-running successes are an early warning before hard failures.
- Timeout exceptions
    - Runtime-generated errors or canceled dependency calls confirm the ceiling was reached.
For queue, Service Bus, and Event Hub triggers, rising duration often means upstream backlog will grow even before invocation failures appear. For HTTP triggers, near-timeout executions often map directly to customer-facing latency.
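One way to watch the near-timeout signal is to flag executions approaching the configured limit. The sketch below assumes a 5-minute `functionTimeout` (the Consumption plan default); adjust `TimeoutMs` to match the value in host.json. The `duration` column in the `requests` table is measured in milliseconds.

```kusto
let TimeoutMs = 300000.0;  // assumed 5-minute timeout; match functionTimeout in host.json
requests
| where timestamp > ago(24h)
| where cloud_RoleName == "my-function-app"
| summarize
    Total = count(),
    NearTimeout = countif(duration > 0.8 * TimeoutMs),  // long-running early warnings
    NearTimeoutFailures = countif(duration > 0.8 * TimeoutMs and success == false)
    by name
| where NearTimeout > 0
| order by NearTimeout desc
```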
Consumption Plan vs Premium Plan Monitoring Differences¶
The same KQL patterns work across plans, but the operational meaning differs:
Consumption plan¶
- Expect cold starts and instance churn to appear more often in host traces.
- Prioritize:
    - Invocation duration percentiles
    - Instance count changes
    - Queue backlog or retry signals
- Watch for:
    - Burst traffic exceeding scale-out speed
    - Short-lived startup penalties after idle periods
Premium plan¶
- Pre-warmed instances reduce cold-start noise, so persistent latency is more likely to be application or dependency related.
- Prioritize:
    - Sustained CPU and memory saturation on workers
    - Long-running executions consuming concurrency
    - Dependency bottlenecks hidden behind stable host startup behavior
Practical interpretation¶
If a Consumption app shows a short spike in host initialization traces, that can be normal. If a Premium app shows the same pattern repeatedly, review runtime restarts, deployment churn, or configuration drift instead of assuming expected scale behavior.
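To quantify that pattern, a sketch that counts host initialization traces per hour (the "Host started" message text is an assumption based on typical Functions host logging; adjust to the traces your runtime version emits):

```kusto
traces
| where timestamp > ago(24h)
| where cloud_RoleName == "my-function-app"
| where message has "Host started"
| summarize HostStarts = count() by bin(timestamp, 1h)
| order by timestamp desc
```

Occasional spikes on a Consumption plan usually track scale-out; a steady hourly baseline of restarts on a Premium plan deserves investigation.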
Practical Alert Examples¶
Alert on failed executions¶
az monitor scheduled-query create \
--name "func-failed-executions" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--condition "count 'requests | where timestamp > ago(5m) | where cloud_RoleName == \"my-function-app\" and success == false' > 5" \
--description "Azure Functions failed execution count exceeded threshold" \
--evaluation-frequency "5m" \
--window-size "5m" \
--severity 2 \
--action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-app-oncall"
Alert on exception bursts¶
az monitor scheduled-query create \
--name "func-exception-burst" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--condition "count 'exceptions | where timestamp > ago(10m) | where cloud_RoleName == \"my-function-app\"' > 20" \
--description "Exception volume is above the normal Functions baseline" \
--evaluation-frequency "5m" \
--window-size "10m" \
--severity 3 \
--action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-app-oncall"
Trigger-Specific Investigation Tips¶
- HTTP trigger
    - Compare response latency and failure rate before checking host internals
- Queue or Service Bus trigger
    - Correlate failures with dependency timeouts and upstream queue depth
- Timer trigger
    - Watch for missed schedules or host restarts around the expected execution time
- Event-driven trigger
    - Use operation_Id to connect trigger ingestion to dependency calls and final outcome
Triage Workflow¶
- Failed requests
    - Which function names are failing?
- Exception details
    - Are failures grouped by one exception type?
- Dependency correlation
    - Is the root cause external, such as SQL, Storage, or HTTP API latency?
- Host traces
    - Did the host restart or scale during the incident window?
- Platform metrics
    - Is CPU, memory, or concurrency pressure contributing to the symptom?
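The first two steps can be combined into a single starter query that joins failed requests to their exceptions via `operation_Id` (a sketch using standard Application Insights tables):

```kusto
requests
| where timestamp > ago(1h)
| where success == false
| project timestamp, operation_Id, name, resultCode, duration
| join kind=leftouter (
    exceptions
    | where timestamp > ago(1h)
    | project operation_Id, ExceptionType = type, ExceptionMessage = outerMessage
) on operation_Id
| order by timestamp desc
```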
Workbook Suggestions¶
- Invocation count and failure rate by function name
- Duration trend with percentile breakdown
- Top dependency failures by function
- Exception trend after deployment
- Host trace samples for the latest failed execution
Dashboard and Workbook Recommendations¶
Built-in experiences to use first¶
- Application Insights Application Dashboard
    - Requests, failures, and dependencies for quick service health review.
- Metrics explorer for the function app
    - Execution count, execution units, response time, and instance count.
- Log Analytics workbooks
    - Best for combining requests, dependencies, exceptions, and traces in one view.
Workbook tabs to build¶
- Execution health
    - Invocation count by function name
    - Failure rate by trigger type
    - P95 and P99 duration trend
- Host behavior
    - Host restart traces
    - Instance count
    - CPU and memory utilization
- Dependency view
    - Slowest dependencies
    - Failed dependency trend
    - Top operations by dependency wait time
- Timeout risk
    - Functions closest to timeout thresholds
    - Long-running successes vs failures
    - Retry-heavy executions
High-value dashboard pins¶
- Pin a chart for failed invocations by function name.
- Pin a chart for P95 duration.
- Pin a chart for instance count to reveal scale behavior.
- Pin a Logs tile for host traces containing "Host started" or "Stopping JobHost".
Common Pitfalls¶
Mistake 1: Treating host metrics as proof of business health¶
What happens: CPU and memory look stable, so teams assume the app is healthy while queue backlog or dependency latency keeps growing.
Correction: Pair host metrics with invocation duration, failure rate, and trigger backlog evidence.
Mistake 2: Alerting only on failed invocations¶
What happens: You learn about issues only after customers are already impacted.
Correction: Add alerts for high P95 duration, host restart bursts, and dependency failure spikes so degradation is caught earlier.
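Following the same inline-condition style as the failed-execution alert above, a P95 duration alert might look like this (the 5000 ms threshold and alert name are illustrative; tune the threshold per function and plan):

```shell
az monitor scheduled-query create \
--name "func-p95-duration" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--condition "count 'requests | where timestamp > ago(15m) | where cloud_RoleName == \"my-function-app\" | summarize P95 = percentile(duration, 95) | where P95 > 5000' > 0" \
--description "P95 function duration exceeded the latency budget" \
--evaluation-frequency "5m" \
--window-size "15m" \
--severity 3 \
--action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-app-oncall"
```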
Mistake 3: Ignoring plan-specific behavior¶
What happens: Consumption cold starts are investigated as incidents, or Premium resource saturation is dismissed as expected scaling.
Correction: Interpret traces and metrics in the context of the hosting plan before escalating.
Cost Considerations¶
Functions observability spend is usually driven by Application Insights ingestion and retention rather than by platform metrics.
- Telemetry volume estimate
    - A function app processing 1 million invocations per day with request, dependency, and trace telemetry can easily generate multiple GB/day depending on sampling and custom logs.
- Primary optimization levers
    - Use adaptive or fixed-rate sampling for low-value successful requests.
    - Keep exceptions unsampled when they are needed for incident response.
    - Reduce verbose traces in hot execution paths.
    - Route only required diagnostic categories to Log Analytics.
- When to review cost
    - After introducing new dependency instrumentation
    - After enabling verbose startup logging
    - After moving from a low-volume timer workload to bursty queue or HTTP traffic
If cost rises suddenly, compare telemetry volume by table first. Many Functions environments discover that traces and dependency telemetry expanded faster than failed-request volume.
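The per-table comparison can be run against the workspace's `Usage` table, where `Quantity` is recorded in MB:

```kusto
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize BillableGB = round(sum(Quantity) / 1024.0, 2) by DataType
| order by BillableGB desc
```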