# High Latency / Slow Responses

## 1. Summary

This playbook handles incidents where an Azure Functions app shows slow responses, elevated p95/p99 latency, and intermittent timeouts. Use it when performance degradation is user-visible, even if the failure rate is still low.

### Troubleshooting decision flow

```mermaid
flowchart TD
    A[Latency alert or customer report] --> B{Mostly first request after idle?}
    B -->|Yes| H1[H1 Cold start delays]
    B -->|No| C{One dependency target has high p95?}
    C -->|Yes| H2[H2 Slow downstream dependency]
    C -->|No| D{Traffic grows but completion lags?}
    D -->|Yes| H3[H3 Concurrency saturation]
    D -->|No| E{Worker restart or pressure traces?}
    E -->|Yes| H4[H4 Plan-level resource limits]
    E -->|No| F[Correlate operation_Id end-to-end]
    H1 --> G[Mitigate and verify p95 improvement]
    H2 --> G
    H3 --> G
    H4 --> G
    F --> G
```
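
The same decision order can be written down as a small triage helper, which is sometimes handy in incident tooling or runbook automation. This is a sketch only: `Observations`, its field names, and `triage` are hypothetical names chosen for illustration and are not part of any Azure SDK. The checks simply mirror the four flowchart questions in order.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Answers to the flowchart questions (hypothetical field names)."""
    slow_mostly_after_idle: bool = False
    one_dependency_high_p95: bool = False
    traffic_outpaces_completion: bool = False
    worker_restart_or_pressure_traces: bool = False


def triage(obs: Observations) -> str:
    """Return the first hypothesis whose question is answered 'yes'."""
    if obs.slow_mostly_after_idle:
        return "H1 Cold start delays"
    if obs.one_dependency_high_p95:
        return "H2 Slow downstream dependency"
    if obs.traffic_outpaces_completion:
        return "H3 Concurrency saturation"
    if obs.worker_restart_or_pressure_traces:
        return "H4 Plan-level resource limits"
    # No branch matched: fall back to single-invocation correlation.
    return "Correlate operation_Id end-to-end"
```

Because the checks run in flowchart order, an incident that shows both cold-start and dependency signals is routed to H1 first; the other hypotheses still need to be disproved before final attribution.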

### Scope and severity

| Indicator | Sev 3 | Sev 2 | Sev 1 |
|---|---|---|---|
| HTTP p95 increase vs baseline | < 2x | 2-5x | > 5x |
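
As a worked example of the severity thresholds, the sketch below maps a baseline and current p95 to a severity level. The function name is hypothetical, and the handling of the exact 2x and 5x boundaries is an assumption: the table does not say which side they fall on, so this sketch treats the 2-5x range as inclusive.

```python
def classify_severity(baseline_p95_ms: float, current_p95_ms: float) -> str:
    """Classify an incident by how far p95 has moved from its baseline."""
    if baseline_p95_ms <= 0:
        raise ValueError("baseline_p95_ms must be positive")
    ratio = current_p95_ms / baseline_p95_ms
    if ratio > 5:
        return "Sev 1"
    if ratio >= 2:  # assumption: exactly 2x and exactly 5x count as Sev 2
        return "Sev 2"
    return "Sev 3"
```

For instance, a baseline of 300 ms and a current p95 of 900 ms is a 3x increase, which this sketch reports as Sev 2.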
## 2. Common Misreadings
- Assuming all latency spikes are cold starts without dependency-level evidence.
- Blaming function code before checking whether downstream target p95 dominates duration.
- Running KQL without a cloud_RoleName filter, which mixes in telemetry from unrelated apps.
## 3. Competing Hypotheses
### H1: Cold start delays
- Startup initialization and instance allocation delay first requests after idle or rapid scale-out.
### H2: Slow downstream dependency
- One API/database/storage dependency dominates response time while function code stays stable.
### H3: Concurrency saturation
- In-flight work grows faster than completion throughput, increasing queueing delay and tail latency.
### H4: Plan-level resource limits
- CPU/memory/socket/connection/thread limits cause broad slowness and worker lifecycle churn.
## 4. What to Check First
### First 10-minute checklist
1. Confirm incident window and primary impacted function operation.
2. Compare request p95/p99 trend by function for the same window.
3. Check dependency p95 by target to find concentration.
4. Check host startup/scale traces around first latency jump.
5. Check if traffic growth outpaced successful completion.
### Portal checks
- Application Insights -> Performance -> Operations: p95/p99 by operation.
- Application Insights -> Dependencies: slow targets and failure concentration.
- Function App -> Diagnose and solve problems: startup and performance detectors.
### CLI Investigation Commands
```bash
az account set \
  --subscription "$SUBSCRIPTION_ID"

az monitor metrics list \
  --resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" \
  --metric "AverageResponseTime" "Requests" "Http5xx" \
  --interval PT1M \
  --aggregation Average Total \
  --offset 2h \
  --output table

az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "requests | where timestamp > ago(1h) | where cloud_RoleName =~ '$APP_NAME' | summarize P95Ms=percentile(duration,95), Failures=countif(success==false), Invocations=count() by operation_Name | order by P95Ms desc" \
  --output table

az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "dependencies | where timestamp > ago(1h) | where cloud_RoleName =~ '$APP_NAME' | summarize Calls=count(), Failed=countif(success==false), P95Ms=percentile(duration,95) by target, type | order by P95Ms desc" \
  --output table
```
Example output:
```text
MetricName            TimeGrain  Average   Total
--------------------  ---------  --------  ------
AverageResponseTime   PT1M       330       0
AverageResponseTime   PT1M       1420      0
Requests              PT1M       0         158
Http5xx               PT1M       0         15

operation_Name                 P95Ms   Failures   Invocations
-----------------------------  ------  ---------  -----------
Functions.HttpIngress          2458    3          340

target                         type  Calls  Failed  P95Ms
-----------------------------  ----  -----  ------  -----
api.partner.internal           HTTP  328    2       1260
```
### Decision trigger points
- Prioritize H1 when first invocation after idle is consistently slow.
- Prioritize H2 when one dependency target has clear p95 concentration.
- Prioritize H3 when p95 increases with traffic and completion lag.
- Prioritize H4 when worker restart, drain, or resource-pressure traces accompany broad slowness.
## 5. Evidence to Collect
### Mandatory artifacts
- Incident timeline with UTC timestamps and deployment/configuration changes.
- Request duration trend (p50/p95/p99) for at least 60 minutes.
- Dependency summary by target/type with p95 and failure rate.
- Host lifecycle traces (startup, shutdown, scale, drain, worker start).
### Sample Log Patterns
```text
# Abnormal: severe cold start
[2026-04-04T11:05:10Z] Host started (12543ms)

# Abnormal: dependency timeout
[2026-04-04T11:16:02Z] Executing 'Functions.HttpIngress' (Id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
[2026-04-04T11:16:32Z] Dependency call failed: target=api.partner.internal, error=TimeoutException, duration=30011ms
[2026-04-04T11:16:32Z] Executed 'Functions.HttpIngress' (Failed, Duration=30218ms)

# Abnormal: concurrency pressure
[2026-04-04T11:22:45Z] Requests in flight=412, completed per second=25, queued=187
[2026-04-04T11:22:46Z] Function execution delayed due to host concurrency limits.

# Normal: warm baseline
[2026-04-04T11:40:00Z] Host started (200ms)
[2026-04-04T11:40:01Z] Dependency call success: target=api.partner.internal, duration=84ms
```
### KQL Queries with Example Output
#### Query 1: Function execution summary (from kql.md #1)
```kusto
let appName = "func-myapp-prod";
requests
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
    Invocations = count(),
    Failures = countif(success == false),
    FailureRatePercent = round(100.0 * countif(success == false) / count(), 2),
    P95Ms = percentile(duration, 95)
  by FunctionName = operation_Name
| order by Failures desc, P95Ms desc
```

| FunctionName | Invocations | Failures | FailureRatePercent |
|---|---|---|---|
| Functions.HttpIngress | 50 | 0 | 0.00 |
#### Query 2: Cold start analysis (from kql.md #3)
```kusto
let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Initializing Host", "Host lock lease acquired")
| summarize StartupEvents=count() by bin(timestamp, 15m)
| join kind=leftouter (
    requests
    | where timestamp > ago(6h)
    | where cloud_RoleName =~ appName
    | where operation_Name startswith "Functions."
    | summarize FirstInvocation=min(timestamp), FirstDurationMs=arg_min(timestamp, toreal(duration / 1ms)) by bin(timestamp, 15m)
) on timestamp
| order by timestamp desc
```

| timestamp | StartupEvents | FirstInvocation | FirstDurationMs |
|---|---|---|---|
| 2026-04-04T11:30:00Z | 83 | 2026-04-04T11:30:00.003Z | 3024.9 |
#### Query 3: Dependency call failures (from kql.md #4)
```kusto
let appName = "func-myapp-prod";
dependencies
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| summarize
    Calls=count(),
    Failed=countif(success == false),
    FailureRatePercent=round(100.0 * countif(success == false) / count(), 2),
    P95Ms=percentile(duration, 95)
  by target, type
| order by Failed desc, P95Ms desc
```

| target | type | Calls | Failed |
|---|---|---|---|
| api.partner.internal | HTTP | 28 | 0 |
### Data quality checks
- Ensure all queries filter cloud_RoleName and consistent time windows.
- Confirm IDs are masked (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx).
## 6. Validation and Disproof by Hypothesis
### H1: Cold start delays
#### Signals that support
- Slow requests cluster at idle-to-active transitions.
- Startup events increase near latency spike windows.
#### Signals that weaken
- Latency remains high during sustained warm traffic.
- No startup events near latency windows.
#### What to verify with INLINE KQL
```kusto
let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Initializing Host", "Host lock lease acquired")
| summarize StartupEvents=count() by bin(timestamp, 15m)
| join kind=leftouter (
    requests
    | where timestamp > ago(6h)
    | where cloud_RoleName =~ appName
    | where operation_Name startswith "Functions."
    | summarize FirstInvocation=min(timestamp), FirstDurationMs=arg_min(timestamp, toreal(duration / 1ms)) by bin(timestamp, 15m)
) on timestamp
| order by timestamp desc
```

| timestamp | StartupEvents | FirstInvocation | FirstDurationMs |
|---|---|---|---|
| 2026-04-04T11:30:00Z | 83 | 2026-04-04T11:30:00.003Z | 3024.9 |

#### How to Read This

Rising StartupEvents together with elevated FirstDurationMs supports H1. If startup events rise but first duration stays low, H1 is weaker.

#### CLI investigation

```bash
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "traces | where timestamp > ago(6h) | where cloud_RoleName =~ '$APP_NAME' | where message has_any ('Host started','Initializing Host','Host lock lease acquired') | summarize StartupEvents=count() by bin(timestamp, 15m) | order by timestamp desc" \
  --output table
```
Example output:
```text
timestamp                StartupEvents
-----------------------  -------------
2026-04-04T11:30:00.000Z 83
```

### H2: Slow downstream dependency

#### Signals that support

- One target has sustained high dependency p95.
- Request latency follows that target's latency profile.

#### Signals that weaken

- Dependency p95 is normal while request p95 is high.
- No concentration by target.

#### What to verify with INLINE KQL

```kusto
let appName = "func-myapp-prod";
dependencies
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| summarize
    Calls=count(),
    Failed=countif(success == false),
    FailureRatePercent=round(100.0 * countif(success == false) / count(), 2),
    P95Ms=percentile(duration, 95)
  by target, type
| order by Failed desc, P95Ms desc
```

| target | type | Calls | Failed | FailureRatePercent | P95Ms |
|---|---|---|---|---|---|
| api.partner.internal | HTTP | 328 | 2 | 0.61 | 1310 |
| sql-prod-eastus.database.windows.net | SQL | 340 | 0 | 0.00 | 88 |

#### How to Read This

A single target with high P95Ms and high concentration strongly supports H2. Low failure rate does not disprove H2 when latency is dominant.
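
One way to make the "single target" reading mechanical is to check whether exactly one target exceeds a p95 threshold. The sketch below is illustrative: `dominant_slow_target` is a hypothetical helper, the 1000 ms default is an assumption to tune per workload, and the rows reuse the example output for this hypothesis. Failure counts are deliberately ignored, since a low failure rate does not rule out H2.

```python
from typing import Optional


def dominant_slow_target(rows: list, slow_p95_ms: float = 1000.0) -> Optional[str]:
    """Return the only target whose P95Ms exceeds the threshold, else None."""
    slow = [row["target"] for row in rows if row["P95Ms"] > slow_p95_ms]
    # Zero slow targets, or several, means no single-target concentration.
    return slow[0] if len(slow) == 1 else None


# Rows copied from the example dependency summary.
example_rows = [
    {"target": "api.partner.internal", "type": "HTTP", "Calls": 328, "Failed": 2, "P95Ms": 1310},
    {"target": "sql-prod-eastus.database.windows.net", "type": "SQL", "Calls": 340, "Failed": 0, "P95Ms": 88},
]

print(dominant_slow_target(example_rows))
```

If several targets cross the threshold at once, the helper returns `None`, which points away from H2 and toward a broader cause such as H3 or H4.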

#### CLI investigation

```bash
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "dependencies | where timestamp > ago(1h) | where cloud_RoleName =~ '$APP_NAME' | summarize Calls=count(), Failed=countif(success==false), FailureRatePercent=round(100.0*countif(success==false)/count(),2), P95Ms=percentile(duration,95) by target, type | order by Failed desc, P95Ms desc" \
  --output table
```
Example output:
```text
target                          type  Calls  Failed  FailureRatePercent  P95Ms
------------------------------  ----  -----  ------  ------------------  -----
api.partner.internal            HTTP  328    2       0.61                1310
```

### H3: Concurrency saturation

#### Signals that support

- Request volume increases while completion rate plateaus.
- p95 and p99 rise together and remain elevated.

#### Signals that weaken

- Latency spikes during low traffic.
- High latency only on first invocation after idle.

#### What to verify with INLINE KQL

```kusto
let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
    Invocations=count(),
    Failures=countif(success == false),
    FailureRatePercent=round(100.0 * countif(success == false) / count(), 2),
    P95Ms=percentile(duration, 95),
    P99Ms=percentile(duration, 99)
  by FunctionName=operation_Name, bin(timestamp, 5m)
| order by timestamp desc
```

| timestamp | FunctionName | Invocations | Failures | FailureRatePercent | P95Ms | P99Ms |
|---|---|---|---|---|---|---|
| 2026-04-04T11:20:00Z | Functions.HttpIngress | 120 | 3 | 2.50 | 2640 | 4018 |
| 2026-04-04T11:10:00Z | Functions.HttpIngress | 40 | 0 | 0.00 | 5120 | 8125 |

#### How to Read This

Rising load plus sustained p95/p99 growth supports H3. Combine with scale/worker traces to separate transient burst from saturation.
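
A rough way to size the queueing delay is Little's law: average time in the system is approximately work in flight divided by completion throughput. The sketch below applies it to the values from the sample concurrency trace in this playbook (412 in flight, 25 completed per second). The function name is hypothetical, the estimate assumes a roughly steady state (which may not hold during a short burst), and it assumes the in-flight figure counts all work currently in the system.

```python
def estimated_time_in_system_s(in_flight: int, completed_per_second: float) -> float:
    """Little's law estimate: W = L / lambda, in seconds."""
    if completed_per_second <= 0:
        raise ValueError("completed_per_second must be positive")
    return in_flight / completed_per_second


# Values from the sample trace: in flight=412, completed per second=25.
print(round(estimated_time_in_system_s(412, 25), 2))  # 16.48
```

An estimate of roughly 16 seconds per request at that backlog is far above typical HTTP latency targets, which is consistent with saturation rather than a transient spike.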

#### CLI investigation

```bash
az monitor metrics list \
  --resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" \
  --metric "Requests" "AverageResponseTime" \
  --interval PT1M \
  --aggregation Total Average \
  --offset 2h \
  --output table

az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "traces | where timestamp > ago(2h) | where cloud_RoleName =~ '$APP_NAME' | where message has_any ('worker','instance','concurrency','drain','scale') | project timestamp, severityLevel, message | order by timestamp desc" \
  --output table
```
Example output:
```text
MetricName            TimeGrain  Total  Average
--------------------  ---------  -----  -------
Requests              PT1M       164    0
AverageResponseTime   PT1M       0      1490

timestamp                severityLevel  message
-----------------------  -------------  ---------------------------------------------------------
2026-04-04T11:22:46.000Z 1              Function execution delayed due to host concurrency limits.
2026-04-04T11:22:45.000Z 1              Requests in flight=412, completed per second=25, queued=187
```

### H4: Plan-level resource limits

#### Signals that support

- Multiple pressure signals: latency increase, retries, intermittent failures.
- Frequent worker starts, drain events, or host shutdown traces.

#### Signals that weaken

- One dependency fully explains the latency increase.
- No lifecycle or pressure traces during incident window.

#### What to verify with INLINE KQL

```kusto
let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any ("scale", "instance", "worker", "concurrency", "drain", "Host shutdown", "Host is shutting down")
| project timestamp, severityLevel, message
| order by timestamp desc
```

| timestamp | severityLevel | message |
|---|---|---|
| 2026-04-04T11:32:20Z | 1 | Worker process started and initialized. |
| 2026-04-04T11:31:50Z | 1 | Worker process started and initialized. |
| 2026-04-04T11:31:20Z | 1 | Host is shutting down. |
| 2026-04-04T11:30:50Z | 1 | Entering drain mode for instance replacement. |

#### How to Read This

Repeated restart/drain patterns during high latency support H4. Validate together with request and dependency trends before final attribution.

#### CLI investigation

```bash
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "traces | where timestamp > ago(6h) | where cloud_RoleName =~ '$APP_NAME' | where message has_any ('scale','instance','worker','concurrency','drain','Host shutdown','Host is shutting down') | project timestamp, severityLevel, message | order by timestamp desc" \
  --output table

az functionapp plan show \
  --resource-group "$RG" \
  --name "$PLAN_NAME" \
  --output json
```
Example output:
```text
timestamp                severityLevel  message
-----------------------  -------------  -----------------------------------------------
2026-04-04T11:31:20.000Z 1              Host is shutting down.
```
```json
{
  "name": "plan-func-prod",
  "sku": {
    "tier": "ElasticPremium",
    "name": "EP1"
  },
  "maximumElasticWorkerCount": 20
}
```

### Correlation query for single slow invocation

Use this when you have a known operation_Id.

```kusto
let opId = "<operation-id>";
union isfuzzy=true
(
    requests
    | where operation_Id == opId
    | project timestamp, itemType="request", name=operation_Name, success, resultCode, duration, details=tostring(url)
),
(
    dependencies
    | where operation_Id == opId
    | project timestamp, itemType="dependency", name=target, success, resultCode, duration, details=tostring(data)
),
(
    exceptions
    | where operation_Id == opId
    // duration is a real (milliseconds) in requests and dependencies, so the null placeholder is typed real to keep one column
    | project timestamp, itemType="exception", name=type, success=bool(false), resultCode="", duration=real(null), details=outerMessage
),
(
    traces
    | where operation_Id == opId
    | project timestamp, itemType="trace", name="trace", success=bool(true), resultCode="", duration=real(null), details=message
)
| order by timestamp asc
```

| timestamp | itemType | name | success | resultCode | duration | details |
|---|---|---|---|---|---|---|
| 2026-04-04T11:16:02.000Z | request | Functions.HttpIngress | false | 500 | 30.218 | https://func-myapp-prod.azurewebsites.net/api/orders |
| 2026-04-04T11:16:02.100Z | dependency | api.partner.internal | false | 504 | 30.011 | GET /v1/orders/status |
| 2026-04-04T11:16:32.110Z | exception | System.TimeoutException | false | | | The operation has timed out. |
| 2026-04-04T11:16:32.120Z | trace | trace | true | | | Executed 'Functions.HttpIngress' (Failed, Duration=30218ms) |

## 7. Likely Root Cause Patterns

### Pattern catalog

| Pattern ID | Symptom cluster | Strongest evidence | Likely root cause |
|---|---|---|---|
| P1 | First invocation slow after idle | Startup events + high first duration | Cold start and instance allocation cost |
| P2 | One dependency dominates latency | Target-level p95 concentration | Downstream API/database bottleneck |
| P3 | Tail latency rises with traffic | Load increase + queueing signals | Concurrency saturation |
| P4 | Broad latency with worker churn | Restart/drain/shutdown traces | Plan-level resource constraints |
### Normal vs Abnormal Comparison
| Signal | Normal | Abnormal | Interpretation |
|---|---|---|---|
| Host startup trace | Host started (< 1000ms) | Host started (> 5000ms) repeated | Cold start or recycle pressure |
| Dependency p95 by target | Critical targets < 300ms | Single target > 1000ms sustained | Downstream bottleneck likely |
| Request latency distribution | Stable p95 with brief spikes | Sustained p95/p99 growth | Systemic latency degradation |
| Scale and worker lifecycle | Occasional starts under load | Frequent drain/restart loops | Capacity instability |
| Failures with latency | Independent or low failures | Latency and failures rise together | Timeout/retry amplification |
### Common misdiagnoses
- Declaring H1 without checking H2 and H3 evidence.
## 8. Immediate Mitigations
### H1 mitigations
- Enable always-ready/pre-warmed capacity where plan supports it.
- Minimize startup cost with lazy initialization and dependency trimming.
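
Lazy initialization can be as simple as building an expensive client on first use instead of at import time. The following is a minimal sketch under stated assumptions: `PartnerClient` is a hypothetical stand-in for whatever object is costly to construct, and `get_partner_client` is an illustrative accessor, not an Azure Functions API.

```python
import functools


class PartnerClient:
    """Stand-in for an expensive-to-build client (hypothetical)."""
    instances_created = 0

    def __init__(self) -> None:
        # Real code would open connections or load configuration here.
        PartnerClient.instances_created += 1


@functools.lru_cache(maxsize=1)
def get_partner_client() -> PartnerClient:
    """Create the client on the first call; later calls reuse the cached one."""
    return PartnerClient()
```

The trade-off is that the construction cost moves from startup to the first invocation that actually needs the client, so this helps most when only some code paths use it.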
### H2 mitigations
- Apply per-target timeout budgets aligned to end-to-end SLO.
- Use circuit breaker and fallback for unstable dependencies.
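
To illustrate the circuit breaker and fallback idea, here is a deliberately small sketch. `CircuitBreaker`, its parameters, and the default values are hypothetical and chosen for illustration; a production service would more likely use an established resilience library and add per-call timeouts.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Skip calls to a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        # While open, answer from the fallback until the cool-down elapses.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:  # sketch only: real code should catch narrower errors
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0  # a success closes the breaker again
        return result
```

Wrapping a dependency call as `breaker.call(fetch_status, lambda: cached_status)` (both hypothetical callables) keeps a slow target from consuming the whole request budget once it has failed repeatedly.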
### H3 mitigations
- Limit in-flight work and apply backpressure on hot routes.
- Shift blocking operations off synchronous request path.
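
Limiting in-flight work can be sketched with a non-blocking semaphore: when no slot is free, the caller rejects the request immediately instead of queueing it. `InFlightLimiter` is a hypothetical helper, and the limit applies only within one process; it does not coordinate across scaled-out instances.

```python
import threading


class InFlightLimiter:
    """Reject new work once max_in_flight requests are already running."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call in a finally block so a failed request still frees its slot.
        self._slots.release()
```

A handler would typically call `try_acquire()` first and return a fast "try again later" response (for example HTTP 429) when it gets `False`, which pushes backpressure to callers rather than letting tail latency grow.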
### H4 mitigations
- Increase capacity tier or scale-out headroom.
- Review connection pools, socket reuse, and outbound call patterns.
### Post-mitigation verification
1. Re-run request/dependency KQL with same granularity and window.
2. Confirm p95/p99 reduction is sustained for at least 30 minutes.
3. Confirm timeout and retry rates decline without backlog growth.
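
The "sustained for at least 30 minutes" check in step 2 can be automated over binned p95 values. This is a sketch under assumptions: `p95_recovered` is a hypothetical helper, the input is a list of per-bin P95Ms values ordered oldest first (for example from a 5-minute binned request query), and the target threshold is whatever the team's baseline or SLO defines.

```python
import math


def p95_recovered(p95_bins_ms: list, target_p95_ms: float,
                  bin_minutes: int = 5, sustain_minutes: int = 30) -> bool:
    """True when every bin in the trailing sustain window is at or below target."""
    needed = max(1, math.ceil(sustain_minutes / bin_minutes))
    if len(p95_bins_ms) < needed:
        return False  # not enough post-mitigation data yet
    return all(value <= target_p95_ms for value in p95_bins_ms[-needed:])
```

Requiring a full window of data before returning `True` avoids declaring recovery from one or two good bins right after a mitigation.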
## 9. Prevention
### Engineering controls
- Define SLO alerts for p50, p95, and p99 separately.
- Add synthetic probes for idle-to-first-request latency regression.
- Instrument target-level dependency latency and timeout telemetry.
- Emit metrics for in-flight work, queue delay, and completion throughput.
### Capacity and architecture controls
- Run performance tests with burst, idle, and downstream slowdown scenarios.
- Validate hosting plan choice against concurrency and latency SLO.
### Operational controls
- Maintain baseline workbook with normal startup and dependency signatures.
- Require hypothesis validation/disproof evidence in post-incident reviews.
### Related Labs
- Cold Start Lab
## See Also
- First 10 Minutes
- Troubleshooting Methodology
- KQL Query Library
- Troubleshooting Playbooks
## Sources
- Monitor Azure Functions
- Application Insights telemetry data model
- Kusto Query Language overview
- Azure Functions hosting options