Monitoring¶

This guide describes how to monitor Azure Functions in production using Azure Monitor and Application Insights. It combines metrics, logs, traces, and dashboards into a practical operational workflow.

Platform Guide

For scaling architecture and plan comparison, see Scaling.

Language Guide

For Python deployment specifics, see the Python Tutorial.

Prerequisites¶

A running Function App in Consumption, Flex Consumption, Premium, or Dedicated.
An Application Insights resource connected to the app.
Access to Azure Monitor metrics and Log Analytics query permissions.
Azure CLI installed and authenticated.
Resource placeholders ready for commands.

RG=<resource-group>
APP_NAME=<app-name>
SUBSCRIPTION_ID=<subscription-id>

When to Use¶

Use the signal that best answers the operational question.

Scenario	Primary approach	Why	Secondary approach
Traffic increase/decrease	Metrics	Fast trend view with low query cost	Logs
Error spike after deployment	Logs (`requests`, `exceptions`)	Rich failure context	Traces
Latency regression	Metrics + KQL percentile	Quantify p95/p99 drift	Live Metrics
External dependency incident	Dependencies	Clear target/result visibility	Exceptions
Host recycle/cold start analysis	Traces	Runtime lifecycle evidence	Instance metrics
Configuration change impact	Activity logs	Control-plane history	Logs + traces

Procedure¶

Monitoring architecture¶

Azure Functions emits multiple telemetry streams:

Platform metrics in Azure Monitor (execution count, failures, instance activity).
Application telemetry in Application Insights (requests, dependencies, traces, exceptions).
Activity logs for control-plane changes.

Use all three for complete operational visibility.

flowchart LR
    A[Function App Runtime] --> B[Azure Monitor Metrics]
    A --> C[Application Insights]
    A --> D[Activity Log]
    B --> E[Metric Alerts]
    C --> F["KQL / Log Analytics"]
    C --> G[Workbooks]
    F --> H[Log Alerts]
    E --> I[Action Group]
    H --> I

Enable Application Insights¶

Set the connection string in app settings:

az functionapp config appsettings set \
    --resource-group <resource-group> \
    --name <app-name> \
    --settings APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=<masked>;IngestionEndpoint=https://<region>.in.applicationinsights.azure.com/"

Prefer connection strings over legacy instrumentation-key-only configuration.

Core metrics to track¶

Track a small set of high-signal metrics first:

Signal	Why it matters
Execution count	Detect traffic shifts and workload volume
Execution duration	Detect latency regressions and cold start symptoms
Failure count/rate	Detect runtime and dependency instability
Instance count	Observe scale behavior per plan
Queue or backlog depth	Detect processing lag in event-driven flows

Backlog metrics

Queue-length and lag metrics usually come from the messaging service (for example, Storage Queue or Service Bus), not only from the Function App resource.

Query metrics with Azure CLI:

APP_ID=$(az functionapp show \
    --resource-group "$RG" \
    --name "$APP_NAME" \
    --query id \
    --output tsv)

az monitor metrics list \
    --resource "$APP_ID" \
    --metric "Function Execution Count" "Function Execution Units" \
    --interval PT5M \
    --aggregation Total Average \
    --start-time 2026-04-05T00:00:00Z \
    --end-time 2026-04-05T01:00:00Z \
    --output table

Sample output (PII masked):

Cost    Interval    Metric                    TimeStamp                   Total    Average
0       PT5M        Function Execution Count  2026-04-05T00:00:00Z       184      6.13
0       PT5M        Function Execution Units  2026-04-05T00:00:00Z       42       1.40

Live Metrics stream¶

Use Live Metrics during deployments and incidents for near real-time visibility:

Open Application Insights.
Select Live Metrics.
Watch request rate, failures, and server response time during rollout.

This is especially useful during slot swaps and traffic ramp-up windows.

Log Analytics and KQL basics¶

Application Insights data is queryable with KQL.

Recent failed invocations¶

requests
| where timestamp > ago(1h)
| where success == false
| project timestamp, name, resultCode, duration, operation_Id
| order by timestamp desc

Slow operations over time¶

requests
| where timestamp > ago(24h)
| summarize p95_duration=percentile(duration, 95), avg_duration=avg(duration) by bin(timestamp, 5m)
| render timechart

Exceptions by type¶

exceptions
| where timestamp > ago(7d)
| summarize failures=count() by type, outerMessage
| order by failures desc

End-to-end correlation¶

union requests, dependencies, traces, exceptions
| where operation_Id == "<operation-id>"
| project timestamp, itemType, name, message, resultCode, duration
| order by timestamp asc

Host startup events¶

traces
| where timestamp > ago(24h)
| where message has_any ("Host started", "Host initialized", "Stopping JobHost")
| project timestamp, severityLevel, cloud_RoleName, message
| order by timestamp desc

Dependency health by target¶

dependencies
| where timestamp > ago(6h)
| summarize total_calls=count(), failed_calls=countif(success == false), p95_duration=percentile(duration, 95) by target, type
| extend failure_rate = toreal(failed_calls) / iif(total_calls == 0, 1.0, toreal(total_calls))
| order by failure_rate desc, failed_calls desc

Dashboards and workbooks¶

Build a workbook that answers these operational questions:

Is availability stable?
Are failures isolated to a function, dependency, or region?
Did a deployment change latency or error distribution?
Is queue backlog growing faster than throughput?

Recommended workbook visuals:

Timechart of request count and failure rate.
P95/P99 duration trend by function name.
Exceptions by type and operation.
Dependency failure trend for external calls.
Queue depth trend alongside execution rate.

Sampling and data volume control¶

Adjust Application Insights sampling in host.json when telemetry volume grows.

{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 5,
        "excludedTypes": "Request;Exception"
      }
    }
  }
}

Keep request and exception data unsampled for reliable incident triage.

Operational monitoring decision flow¶

flowchart TD
    A[Alert or Incident] --> B{"Availability/latency issue?"}
    B -- Yes --> C[Check metrics first]
    B -- No --> D[Check logs and traces]
    C --> E{Anomaly detected?}
    E -- Yes --> F[Run focused KQL]
    E -- No --> G[Review Activity Log]
    D --> H{Error signature found?}
    H -- Yes --> I[Correlate by operation_Id]
    H -- No --> J[Use Live Metrics and widen time range]
    F --> K[Mitigate and tune alerts]
    G --> K
    I --> K
    J --> K

Operational monitoring routine¶

Daily:

Check failure trend and top exception signatures.
Verify queue backlog and processing lag.

Per deployment:

Monitor Live Metrics during release window.
Compare before/after latency and failure ratio.

Weekly:

Review dashboard trends and adjust alert sensitivity.
Validate telemetry cost and sampling strategy.

Verification¶

Validate that monitoring is working end-to-end after changes.

Trigger at least one function invocation.
Confirm metrics appear in 5-minute bins.
Confirm logs and traces are queryable.
Confirm workbook visuals show data.
Confirm alert rules evaluate without data-source errors.

Metric verification command:

az monitor metrics list \
    --resource "$APP_ID" \
    --metric "Function Execution Count" \
    --interval PT5M \
    --aggregation Total \
    --start-time 2026-04-05T00:00:00Z \
    --end-time 2026-04-05T00:30:00Z \
    --query "value[0].timeseries[0].data[?total > \`0\`].[timeStamp,total]" \
    --output table

Log verification command:

requests
| where timestamp > ago(15m)
| summarize total_requests=count(), failed_requests=countif(success == false)

Expected result: total_requests is greater than 0, metric timestamps align with test traffic, and dependency calls appear in dependencies for external calls.

Rollback / Troubleshooting¶

Missing telemetry in Application Insights¶

Metrics appear but requests table is empty.
Live Metrics stream has no flow.
Verify APPLICATIONINSIGHTS_CONNECTION_STRING exists.
Verify endpoint/region value is correct.
Restart Function App after config changes.
Validate egress/network rules for telemetry ingestion.

az functionapp config appsettings list \
    --resource-group "$RG" \
    --name "$APP_NAME" \
    --query "[?name=='APPLICATIONINSIGHTS_CONNECTION_STRING'].value" \
    --output tsv

Sampling too aggressive¶

Log request counts are much lower than platform metrics.
Exception evidence is sparse during incidents.
Inspect host.json sampling settings.
Exclude Request;Exception from sampling.
Increase maxTelemetryItemsPerSecond temporarily for investigations.

Rollback:

Revert to last known-good sampling settings.
Redeploy and re-run verification queries.

Monitoring only HTTP success and ignoring non-HTTP triggers.
Missing downstream dependency metrics.
Over-sampling that removes needed forensic signals.
No version marker in logs, making release impact hard to isolate.