Observability in Azure Functions

Azure Functions has built-in integration with Application Insights and Azure Monitor to provide comprehensive observability for your serverless applications.

Data Flow Diagram

graph TD
    Trigger[Function Trigger] -->|Invocation| Function[Azure Function]
    Function -->|Trace/Log| AI[Application Insights]
    AI -->|HTTPS| LAW[Log Analytics Workspace]
    Function -->|Metrics| AM[Azure Monitor Metrics]
    AM -->|Rule Evaluation| Alerts[Alerts]

Key Monitoring Areas

  • Execution Logs: Detailed information about each function execution, including traces, exceptions, and requests.
  • Host Metrics: Platform-level performance data such as CPU, memory usage, and function invocation counts.
  • Invocation Tracing: Correlation of events across different services (e.g., tracking a message from a queue through function processing).

Keep host telemetry and invocation telemetry separate during analysis:

  • Host metrics describe the Functions runtime and underlying App Service worker behavior.
    • CPU percentage
    • Memory working set
    • Instance count
    • Host restarts or runtime initialization traces
  • Invocation metrics describe what each function execution experienced.
    • Invocation count
    • Success rate
    • Duration percentiles
    • Trigger-specific retries or throttling symptoms

If duration rises while host metrics remain stable, the problem is often dependency latency or upstream backlog rather than worker saturation. If both host metrics and invocation metrics degrade together, investigate plan capacity, scaling, and runtime configuration first.
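The divergence rule above can be sketched as a small triage helper. This is an illustrative sketch, not an Azure API: the signal names, the relative-change inputs, and the 20% cutoff are all assumptions you would replace with your own baselines.

```python
def classify_degradation(host_cpu_delta_pct: float,
                         invocation_p95_delta_pct: float,
                         threshold_pct: float = 20.0) -> str:
    """Rough first-look triage per the rule above.

    host_cpu_delta_pct: relative change in host CPU vs. your baseline.
    invocation_p95_delta_pct: relative change in P95 duration vs. baseline.
    threshold_pct: illustrative cutoff for a 'significant' change.
    """
    host_degraded = host_cpu_delta_pct > threshold_pct
    invocations_degraded = invocation_p95_delta_pct > threshold_pct
    if invocations_degraded and not host_degraded:
        return "check dependencies and upstream backlog"
    if invocations_degraded and host_degraded:
        return "check plan capacity, scaling, and runtime configuration"
    if host_degraded:
        return "check host-only pressure such as scale events"
    return "no significant degradation"

# Stable host, rising invocation P95 -> look downstream first.
print(classify_degradation(host_cpu_delta_pct=2.0,
                           invocation_p95_delta_pct=45.0))
# → check dependencies and upstream backlog
```

The point of the sketch is the ordering of checks, not the thresholds: start where the two signal families disagree.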

Log Categories in Log Analytics

When diagnostic logs are enabled, you can find function logs in these tables:

  • FunctionAppLogs: Traces generated by the function host and your application code.
  • AppServiceHTTPLogs: Details about incoming HTTP requests to your function.

Depending on trigger type and hosting plan, operators often also enable adjacent App Service categories for broader context:

  • Useful in most environments
    • FunctionAppLogs
    • AppServiceHTTPLogs
    • AppServiceConsoleLogs
    • AppServicePlatformLogs
  • Enable selectively
    • AppServiceAuditLogs
    • AppServiceIPSecAuditLogs

FunctionAppLogs is the best default source for runtime execution evidence, but HTTP-triggered workloads benefit from AppServiceHTTPLogs when you need the request envelope and web server perspective alongside Application Insights telemetry.

Configuration Examples

Connecting Application Insights via CLI

To enable Application Insights for a function app, set the APPLICATIONINSIGHTS_CONNECTION_STRING in the app settings.

az functionapp config appsettings set \
    --resource-group "my-resource-group" \
    --name "my-function-app" \
    --settings "APPLICATIONINSIGHTS_CONNECTION_STRING=<connection-string>"

KQL Query Examples

Monitor Function Execution Status

Summarize function execution results over the last hour.

requests
| where timestamp > ago(1h)
| summarize count() by success, name
| order by name asc

Analyze Function Duration

Find the average and maximum duration of your function executions.

requests
| where timestamp > ago(12h)
| summarize avg(duration), max(duration) by name
| order by avg_duration desc

Find Common Exceptions

List the top errors occurring in your function app.

exceptions
| summarize count() by problemId, outerMessage
| order by count_ desc

Correlate Failed Invocations With Dependencies

requests
| where timestamp > ago(1h)
| where success == false
| project operation_Id, name, duration, resultCode, cloud_RoleName
| join kind=leftouter (
    dependencies
    | where timestamp > ago(1h)
    | project operation_Id, DependencyName=name, DependencyType=type, DependencySuccess=success, DependencyDuration=duration
) on operation_Id
| order by duration desc

Review Host-Level Traces

traces
| where timestamp > ago(30m)
| where cloud_RoleName == "my-function-app"
| where message has_any ("Host started", "Function started", "Executing", "Failed")
| project timestamp, severityLevel, message
| order by timestamp desc

Sample output:

timestamp                  severityLevel  message
-------------------------  -------------  ---------------------------------------------------------
2026-04-06T01:05:00Z       3              Executed 'ProcessOrders' (Failed, Id=a1b2c3d4, Duration=2145ms)
2026-04-06T01:05:00Z       2              Executing 'ProcessOrders' (Reason='New queue message detected')

Monitoring Baseline

For Azure Functions, split monitoring into these four operational views:

  1. Invocation success
    • Failed executions
    • Retry patterns
    • Trigger-specific backlog symptoms
  2. Performance
    • Function duration
    • Cold-start symptoms for HTTP triggers
    • Dependency latency
  3. Host behavior
    • Host restarts
    • Scaling changes
    • Runtime version or configuration changes
  4. Application behavior
    • Traces
    • Exceptions
    • Custom business events if instrumented

CLI Workflow

Confirm application settings

az functionapp config appsettings list \
    --resource-group "my-resource-group" \
    --name "my-function-app" \
    --query "[?name=='APPLICATIONINSIGHTS_CONNECTION_STRING']"

Sample output:

[
  {
    "name": "APPLICATIONINSIGHTS_CONNECTION_STRING",
    "value": "<connection-string>"
  }
]

Check Function App configuration and runtime

az functionapp show \
    --resource-group "my-resource-group" \
    --name "my-function-app" \
    --query "{state:state, defaultHostName:defaultHostName, kind:kind, reserved:reserved}"

Sample output:

{
  "defaultHostName": "my-function-app.azurewebsites.net",
  "kind": "functionapp,linux",
  "reserved": true,
  "state": "Running"
}

Query recent failed executions

az monitor app-insights query \
    --app "my-app-insights" \
    --analytics-query "requests | where timestamp > ago(30m) | where success == false | summarize count() by name" \
    --output table

Sample output:

name                 count_
-------------------  ------
ProcessOrders        7
HttpWebhookReceiver  2

Diagnostic Settings Strategy

Azure Functions usually sends rich application telemetry through Application Insights, but diagnostic settings still matter because they capture host and platform context that can be missing from request telemetry alone.

  • Minimum baseline
    • FunctionAppLogs
    • AppServicePlatformLogs
    • AllMetrics
  • Add for HTTP-heavy workloads
    • AppServiceHTTPLogs
  • Add for startup or custom logging investigations
    • AppServiceConsoleLogs

Create diagnostic settings for a function app

az monitor diagnostic-settings create \
    --name "diag-functionapp-observability" \
    --resource "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Web/sites/my-function-app" \
    --workspace "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
    --logs '[
        {
            "category": "FunctionAppLogs",
            "enabled": true
        },
        {
            "category": "AppServiceHTTPLogs",
            "enabled": true
        },
        {
            "category": "AppServiceConsoleLogs",
            "enabled": true
        },
        {
            "category": "AppServicePlatformLogs",
            "enabled": true
        }
    ]' \
    --metrics '[
        {
            "category": "AllMetrics",
            "enabled": true
        }
    ]'

Check categories before rollout

az monitor diagnostic-settings categories list \
    --resource "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Web/sites/my-function-app"

Use the category list in CI or during standards reviews so Linux, Windows, and plan-specific deployments are validated against the same observability baseline.

Additional KQL for Timeout and Plan Behavior

Detect long-running executions nearing timeout

Timeout symptoms are often visible before full failures occur. This query highlights functions with long-running executions and separates successful completions from failures.

requests
| where timestamp > ago(24h)
| where cloud_RoleName == "my-function-app"
| summarize InvocationCount=count(), P95Duration=percentile(duration, 95), MaxDuration=max(duration), Failures=countif(success == false) by name
| order by P95Duration desc

Sample output:

name                 InvocationCount  P95Duration  MaxDuration  Failures  Interpretation
-------------------  ---------------  -----------  -----------  --------  ------------------------------------------------------------------------------------------
ProcessOrders        1842             00:00:52     00:01:18     14        Queue processing is running near timeout risk; review dependency latency and batch sizing.
HttpWebhookReceiver  9211             00:00:04     00:00:11     3         HTTP trigger latency is normal and not the main incident driver.

Correlate host restarts with invocation degradation

let HostEvents =
    traces
    | where timestamp > ago(6h)
    | where cloud_RoleName == "my-function-app"
    | where message has_any ("Host started", "Host initialized", "Stopping JobHost", "Generating", "Host lock lease")
    | summarize HostTraceCount=count() by bin(timestamp, 10m);
let InvocationFailures =
    requests
    | where timestamp > ago(6h)
    | where cloud_RoleName == "my-function-app"
    | summarize FailedInvocations=countif(success == false), TotalInvocations=count() by bin(timestamp, 10m);
HostEvents
| join kind=fullouter InvocationFailures on timestamp
| project timestamp=coalesce(timestamp, timestamp1), HostTraceCount=coalesce(HostTraceCount, 0), FailedInvocations=coalesce(FailedInvocations, 0), TotalInvocations=coalesce(TotalInvocations, 0)
| order by timestamp desc

Sample output:

timestamp             HostTraceCount  FailedInvocations  TotalInvocations  Interpretation
--------------------  --------------  -----------------  ----------------  --------------------------------------------------------------------------------------------
2026-04-06T01:00:00Z  12              18                 220               Host churn and invocation failures rose together; inspect scaling or runtime recycle causes.
2026-04-06T01:10:00Z  1               17                 225               Failures continued after host stabilized; investigate downstream dependencies.

Compare dependency latency with execution duration

requests
| where timestamp > ago(4h)
| where cloud_RoleName == "my-function-app"
| summarize AvgExecution=avg(duration), P95Execution=percentile(duration, 95) by operation_Id, name
| join kind=inner (
    dependencies
    | where timestamp > ago(4h)
    | summarize AvgDependency=avg(duration), FailedDependencies=countif(success == false) by operation_Id
) on operation_Id
| summarize AvgExecution=avg(AvgExecution), P95Execution=max(P95Execution), AvgDependency=avg(AvgDependency), FailedDependencies=sum(FailedDependencies) by name
| order by P95Execution desc

Sample output:

name             AvgExecution  P95Execution  AvgDependency  FailedDependencies  Interpretation
---------------  ------------  ------------  -------------  ------------------  -----------------------------------------------------
ProcessOrders    00:00:09      00:00:41      00:00:07       28                  Most execution time is spent waiting on dependencies.
GenerateInvoice  00:00:03      00:00:07      00:00:01       0                   Healthy function; not part of the slowdown.

Execution Duration and Timeout Monitoring

Treat timeout monitoring as a combination of three signals:

  1. Duration percentiles
    • P50 shows normal behavior.
    • P95 and P99 show customer-impacting tail latency.
  2. Near-timeout executions
    • Long-running successes are an early warning before hard failures.
  3. Timeout exceptions
    • Runtime-generated errors or canceled dependency calls confirm the ceiling was reached.

For queue, Service Bus, and Event Hub triggers, rising duration often means upstream backlog will grow even before invocation failures appear. For HTTP triggers, near-timeout executions often map directly to customer-facing latency.
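The near-timeout early warning from signal 2 can be sketched as a simple ratio check. This is an illustrative sketch: the 80% warning line is an assumption, and you supply the timeout for your plan (the 5-minute value below is the Consumption plan default).

```python
def near_timeout_fraction(durations_ms, timeout_ms, warn_ratio=0.8):
    """Fraction of executions running past warn_ratio of the timeout.

    Long-running executions past this line are the early warning
    described above, even while they still succeed.
    """
    if not durations_ms:
        return 0.0
    warn_line = timeout_ms * warn_ratio
    return sum(1 for d in durations_ms if d >= warn_line) / len(durations_ms)

# Example: durations (ms) against the 5-minute Consumption default timeout.
sample = [12000, 45000, 251000, 290000, 310000]
print(near_timeout_fraction(sample, timeout_ms=300_000))
# → 0.6
```

A rising fraction here is worth an alert on its own, before hard timeout failures appear in the exceptions table.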

Consumption Plan vs Premium Plan Monitoring Differences

The same KQL patterns work across plans, but the operational meaning differs:

Consumption plan

  • Expect cold starts and instance churn to appear more often in host traces.
  • Prioritize:
    • Invocation duration percentiles
    • Instance count changes
    • Queue backlog or retry signals
  • Watch for:
    • Burst traffic exceeding scale-out speed
    • Short-lived startup penalties after idle periods

Premium plan

  • Pre-warmed instances reduce cold-start noise, so persistent latency is more likely to be application or dependency related.
  • Prioritize:
    • Sustained CPU and memory saturation on workers
    • Long-running executions consuming concurrency
    • Dependency bottlenecks hidden behind stable host startup behavior

Practical interpretation

If a Consumption app shows a short spike in host initialization traces, that can be normal. If a Premium app shows the same pattern repeatedly, review runtime restarts, deployment churn, or configuration drift instead of assuming expected scale behavior.

Practical Alert Examples

Alert on failed executions

az monitor scheduled-query create \
    --name "func-failed-executions" \
    --resource-group "my-resource-group" \
    --scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
    --condition "count 'requests | where timestamp > ago(5m) | where cloud_RoleName == \"my-function-app\" and success == false' > 5" \
    --description "Azure Functions failed execution count exceeded threshold" \
    --evaluation-frequency "5m" \
    --window-size "5m" \
    --severity 2 \
    --action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-app-oncall"

Alert on exception bursts

az monitor scheduled-query create \
    --name "func-exception-burst" \
    --resource-group "my-resource-group" \
    --scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
    --condition "count 'exceptions | where timestamp > ago(10m) | where cloud_RoleName == \"my-function-app\"' > 20" \
    --description "Exception volume is above the normal Functions baseline" \
    --evaluation-frequency "5m" \
    --window-size "10m" \
    --severity 3 \
    --action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-app-oncall"

Trigger-Specific Investigation Tips

  • HTTP trigger
    • Compare response latency and failure rate before checking host internals
  • Queue or Service Bus trigger
    • Correlate failures with dependency timeouts and upstream queue depth
  • Timer trigger
    • Watch for missed schedules or host restarts around the expected execution time
  • Event-driven trigger
    • Use operation_Id to connect trigger ingestion to dependency calls and final outcome

Triage Workflow

  1. Failed requests
    • Which function names are failing?
  2. Exception details
    • Are failures grouped by one exception type?
  3. Dependency correlation
    • Is the root cause external, such as SQL, Storage, or HTTP API latency?
  4. Host traces
    • Did the host restart or scale during the incident window?
  5. Platform metrics
    • Is CPU, memory, or concurrency pressure contributing to the symptom?

Workbook Suggestions

  • Invocation count and failure rate by function name
  • Duration trend with percentile breakdown
  • Top dependency failures by function
  • Exception trend after deployment
  • Host trace samples for the latest failed execution

Dashboard and Workbook Recommendations

Built-in experiences to use first

  • Application Insights Application Dashboard
    • Requests, failures, and dependencies for quick service health review.
  • Metrics explorer for the function app
    • Execution count, execution units, response time, and instance count.
  • Log Analytics workbooks
    • Best for combining requests, dependencies, exceptions, and traces in one view.

Workbook tabs to build

  • Execution health
    • Invocation count by function name
    • Failure rate by trigger type
    • P95 and P99 duration trend
  • Host behavior
    • Host restart traces
    • Instance count
    • CPU and memory utilization
  • Dependency view
    • Slowest dependencies
    • Failed dependency trend
    • Top operations by dependency wait time
  • Timeout risk
    • Functions closest to timeout thresholds
    • Long-running successes vs failures
    • Retry-heavy executions

High-value dashboard pins

  • Pin a chart for failed invocations by function name.
  • Pin a chart for P95 duration.
  • Pin a chart for instance count to reveal scale behavior.
  • Pin a Logs tile for host traces containing Host started or Stopping JobHost.

Common Pitfalls

Mistake 1: Treating host metrics as proof of business health

What happens: CPU and memory look stable, so teams assume the app is healthy while queue backlog or dependency latency keeps growing.

Correction: Pair host metrics with invocation duration, failure rate, and trigger backlog evidence.

Mistake 2: Alerting only on failed invocations

What happens: You learn about issues only after customers are already impacted.

Correction: Add alerts for high P95 duration, host restart bursts, and dependency failure spikes so degradation is caught earlier.
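To make the P95 correction concrete, here is a minimal sketch of a nearest-rank P95 check over a window of durations. The window values and the 1000 ms threshold are illustrative assumptions, not Azure defaults.

```python
import math

def p95(values):
    """Nearest-rank P95: the tail-latency signal suggested above."""
    ranked = sorted(values)
    idx = max(0, math.ceil(0.95 * len(ranked)) - 1)
    return ranked[idx]

# Durations (ms) from one evaluation window; threshold is an example SLO.
window = [120, 135, 150, 160, 175, 190, 210, 240, 900, 4200]
P95_THRESHOLD_MS = 1000
if p95(window) > P95_THRESHOLD_MS:
    print("fire: P95 duration above threshold")
# → fire: P95 duration above threshold
```

Note how the average of this window stays modest while the tail crosses the line; that gap is exactly why a failures-only alert stays quiet during early degradation.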

Mistake 3: Ignoring plan-specific behavior

What happens: Consumption cold starts are investigated as incidents, or Premium resource saturation is dismissed as expected scaling.

Correction: Interpret traces and metrics in the context of the hosting plan before escalating.

Cost Considerations

Functions observability spend is usually driven by Application Insights ingestion and retention rather than by platform metrics.

  • Telemetry volume estimate
    • A function app processing 1 million invocations per day with request, dependency, and trace telemetry can easily generate multiple GB/day depending on sampling and custom logs.
  • Primary optimization levers
    • Use adaptive or fixed-rate sampling for low-value successful requests.
    • Keep exceptions unsampled when they are needed for incident response.
    • Reduce verbose traces in hot execution paths.
    • Route only required diagnostic categories to Log Analytics.
  • When to review cost
    • After introducing new dependency instrumentation
    • After enabling verbose startup logging
    • After moving from a low-volume timer workload to bursty queue or HTTP traffic

If cost rises suddenly, compare telemetry volume by table first. Many Functions environments discover that traces and dependency telemetry expanded faster than failed-request volume.
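The volume estimate above can be sanity-checked with back-of-envelope arithmetic. Every input here is an assumption you measure for your own app: items per invocation and average item size vary widely with custom logging and dependency instrumentation.

```python
def daily_ingestion_gb(invocations_per_day: int,
                       items_per_invocation: float,
                       avg_item_kb: float,
                       sampling_rate: float = 1.0) -> float:
    """Rough Application Insights ingestion estimate in GB/day."""
    items = invocations_per_day * items_per_invocation * sampling_rate
    return items * avg_item_kb / (1024 * 1024)

# 1M invocations/day, ~5 telemetry items each, ~1.5 KB per item, no sampling:
print(round(daily_ingestion_gb(1_000_000, 5, 1.5), 2))
# → 7.15
```

Around 7 GB/day at these assumed rates matches the "multiple GB/day" observation above, and the `sampling_rate` parameter shows why sampling successful requests is the highest-leverage cost control.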
