# High Latency / Slow Responses

## 1. Summary
This playbook covers incidents where an Azure Functions app shows slow responses, elevated p95/p99 latency, and intermittent timeouts. Use it when performance degradation is user-visible, even while the failure rate is still low.

### Troubleshooting decision flow

```mermaid
flowchart TD
    A[Latency alert or customer report] --> B{Mostly first request after idle?}
    B -->|Yes| H1[H1 Cold start delays]
    B -->|No| C{One dependency target has high p95?}
    C -->|Yes| H2[H2 Slow downstream dependency]
    C -->|No| D{Traffic grows but completion lags?}
    D -->|Yes| H3[H3 Concurrency saturation]
    D -->|No| E{Worker restart or pressure traces?}
    E -->|Yes| H4[H4 Plan-level resource limits]
    E -->|No| F[Correlate operation_Id end-to-end]
    H1 --> G[Mitigate and verify p95 improvement]
    H2 --> G
    H3 --> G
    H4 --> G
    F --> G
```

## Scope and severity

### H1: Cold start delays

**How to Read This**

Rising `StartupEvents` together with elevated `FirstDurationMs` supports H1. If startup events rise but first duration stays low, H1 is weaker.

#### CLI investigation

```bash
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "traces | where timestamp > ago(6h) | where cloud_RoleName =~ '$APP_NAME' | where message has_any ('Host started','Initializing Host','Host lock lease acquired') | summarize StartupEvents=count() by bin(timestamp, 15m) | order by timestamp desc" \
  --output table
```
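To triage the query output offline, a minimal sketch, assuming you flattened the result into `{"timestamp", "StartupEvents"}` dicts (the 3x-over-median threshold and the 5-event floor are illustrative choices, not Azure defaults):

```python
# Flag 15-minute bins whose StartupEvents count is well above the
# median baseline -- a rough cold-start-churn detector for H1.
# Input shape is an assumption about how you flattened the query output.
from statistics import median

def suspicious_bins(bins, factor=3.0, floor=5):
    counts = [b["StartupEvents"] for b in bins]
    baseline = median(counts) if counts else 0
    threshold = max(baseline * factor, floor)
    return [b for b in bins if b["StartupEvents"] >= threshold]

bins = [
    {"timestamp": "11:00", "StartupEvents": 2},
    {"timestamp": "11:15", "StartupEvents": 3},
    {"timestamp": "11:30", "StartupEvents": 30},
]
print([b["timestamp"] for b in suspicious_bins(bins)])  # ['11:30']
```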

### H2: Slow downstream dependency

#### Signals that support
- One target has sustained high dependency p95.
- Request latency follows that target's latency profile.

#### Signals that weaken

- Dependency p95 is normal while request p95 is high.
- No concentration by target.

#### What to verify with inline KQL

```kusto
let appName = "func-myapp-prod";
dependencies
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| summarize
    Calls=count(),
    Failed=countif(success == false),
    FailureRatePercent=round(100.0 * countif(success == false) / count(), 2),
    P95Ms=percentile(duration, 95)
    by target, type
| order by Failed desc, P95Ms desc
```
**How to Read This**
A single target with high P95Ms and high concentration strongly supports H2. Low failure rate does not disprove H2 when latency is dominant.
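If you prefer to sanity-check the numbers offline against an exported sample of dependency calls, the summarize above can be mirrored in plain Python. The record shape and `duration_ms` field are assumptions about your export, and nearest-rank p95 will differ slightly from Kusto's percentile estimate:

```python
# Offline mirror of the dependencies summarize: group raw calls by
# (target, type), then compute call count, failure rate, and p95.
import math
from collections import defaultdict

def summarize_dependencies(calls):
    groups = defaultdict(list)
    for c in calls:
        groups[(c["target"], c["type"])].append(c)
    rows = []
    for (target, dep_type), items in groups.items():
        durations = sorted(i["duration_ms"] for i in items)
        failed = sum(1 for i in items if not i["success"])
        # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
        rank = max(math.ceil(0.95 * len(durations)) - 1, 0)
        rows.append({
            "target": target,
            "type": dep_type,
            "Calls": len(items),
            "Failed": failed,
            "FailureRatePercent": round(100.0 * failed / len(items), 2),
            "P95Ms": durations[rank],
        })
    # Match the KQL ordering: Failed desc, then P95Ms desc.
    return sorted(rows, key=lambda r: (-r["Failed"], -r["P95Ms"]))
```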

#### CLI investigation

```bash
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "dependencies | where timestamp > ago(1h) | where cloud_RoleName =~ '$APP_NAME' | summarize Calls=count(), Failed=countif(success==false), FailureRatePercent=round(100.0*countif(success==false)/count(),2), P95Ms=percentile(duration,95) by target, type | order by Failed desc, P95Ms desc" \
  --output table
```

```text
target                type  Calls  Failed  FailureRatePercent  P95Ms
--------------------  ----  -----  ------  ------------------  -----
api.partner.internal  HTTP  328    2       0.61                1310
```

### H3: Concurrency saturation

#### Signals that support
- Request volume increases while completion rate plateaus.
- p95 and p99 rise together and remain elevated.

#### Signals that weaken
- Latency spikes during low traffic.
- High latency only on first invocation after idle.

#### What to verify with inline KQL

```kusto
let appName = "func-myapp-prod";
requests
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
    Invocations=count(),
    Failures=countif(success == false),
    FailureRatePercent=round(100.0 * countif(success == false) / count(), 2),
    P95Ms=percentile(duration, 95),
    P99Ms=percentile(duration, 99)
    by FunctionName=operation_Name, bin(timestamp, 5m)
| order by timestamp desc
```
**How to Read This**
Rising load plus sustained p95/p99 growth supports H3. Combine with scale/worker traces to separate transient burst from saturation.

#### CLI investigation

```bash
az monitor metrics list \
  --resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" \
  --metric "Requests" "AverageResponseTime" \
  --interval PT1M \
  --aggregation Total Average \
  --offset 2h \
  --output table
```

```bash
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "traces | where timestamp > ago(2h) | where cloud_RoleName =~ '$APP_NAME' | where message has_any ('worker','instance','concurrency','drain','scale') | project timestamp, severityLevel, message | order by timestamp desc" \
  --output table
```

```text
MetricName           TimeGrain  Total  Average
-------------------  ---------  -----  -------
Requests             PT1M       164    0
AverageResponseTime  PT1M       0      1490
```

```text
timestamp                 severityLevel  message
------------------------  -------------  ----------------------------------------------------------
2026-04-04T11:22:46.000Z  1              Function execution delayed due to host concurrency limits.
2026-04-04T11:22:45.000Z  1              Requests in flight=412, completed per second=25, queued=187
```
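The queue trace above also allows a quick arithmetic sanity check. By Little's law (W = L / λ, strictly valid for steady-state averages), 412 requests in flight draining at 25 completions per second implies roughly 16.5 seconds in system, which points at saturation rather than a transient burst:

```python
# Little's law sanity check: W = L / lambda, where L is in-flight work
# and lambda is completion throughput. Values taken from the sample
# trace above (in flight=412, completed per second=25).
def estimated_latency_seconds(in_flight, completed_per_second):
    if completed_per_second <= 0:
        raise ValueError("throughput must be positive")
    return in_flight / completed_per_second

print(round(estimated_latency_seconds(412, 25), 1))  # 16.5
```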

### H4: Plan-level resource limits

#### Signals that support
- Multiple pressure signals: latency increase, retries, intermittent failures.
- Frequent worker starts, drain events, or host shutdown traces.

#### Signals that weaken
- One dependency fully explains the latency increase.
- No lifecycle or pressure traces during incident window.

#### What to verify with inline KQL

```kusto
let appName = "func-myapp-prod";
traces
| where timestamp > ago(6h)
| where cloud_RoleName =~ appName
| where message has_any ("scale", "instance", "worker", "concurrency", "drain", "Host shutdown", "Host is shutting down")
| project timestamp, severityLevel, message
| order by timestamp desc
```
**How to Read This**
Repeated restart/drain patterns during high latency support H4. Validate together with request and dependency trends before final attribution.

#### CLI investigation

```bash
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "traces | where timestamp > ago(6h) | where cloud_RoleName =~ '$APP_NAME' | where message has_any ('scale','instance','worker','concurrency','drain','Host shutdown','Host is shutting down') | project timestamp, severityLevel, message | order by timestamp desc" \
  --output table
```

```bash
az functionapp plan show \
  --resource-group "$RG" \
  --name "$PLAN_NAME" \
  --output json
```

```text
timestamp                 severityLevel  message
------------------------  -------------  -----------------------
2026-04-04T11:31:20.000Z  1              Host is shutting down.
```

```json
{
  "name": "plan-func-prod",
  "sku": {
    "tier": "ElasticPremium",
    "name": "EP1"
  },
  "maximumElasticWorkerCount": 20
}
```
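As a hedged helper (not part of any Azure SDK), the plan JSON above can be checked for scale-out headroom; `current_workers` would come from your own instance-count telemetry, and the 80% near-ceiling cutoff is an illustrative choice:

```python
# Compare current worker count against the plan's elastic worker
# ceiling from the `az functionapp plan show` JSON shown above.
# A utilization near 1.0 during high latency supports H4.
import json

def scale_headroom(plan_json, current_workers):
    plan = json.loads(plan_json)
    limit = plan.get("maximumElasticWorkerCount") or 1
    return {
        "sku": plan["sku"]["name"],
        "limit": limit,
        "utilization": current_workers / limit,
        "near_ceiling": current_workers >= 0.8 * limit,  # illustrative cutoff
    }
```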

### Correlation query for single slow invocation

Use this when you have a known `operation_Id`.

```kusto
let opId = "<operation-id>";
union isfuzzy=true
(
    requests
    | where operation_Id == opId
    | project timestamp, itemType="request", name=operation_Name, success, resultCode, duration, details=tostring(url)
),
(
    dependencies
    | where operation_Id == opId
    | project timestamp, itemType="dependency", name=target, success, resultCode, duration, details=tostring(data)
),
(
    exceptions
    | where operation_Id == opId
    | project timestamp, itemType="exception", name=type, success=bool(false), resultCode="", duration=timespan(null), details=outerMessage
),
(
    traces
    | where operation_Id == opId
    | project timestamp, itemType="trace", name="trace", success=bool(true), resultCode="", duration=timespan(null), details=message
)
| order by timestamp asc
```
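The same correlation can be done offline against exported rows. This sketch assumes each exported record keeps `operation_Id` and an ISO-8601 `timestamp` (so lexicographic sort equals chronological sort); the keyword-argument table names are hypothetical:

```python
# Merge exported request/dependency/exception/trace rows for one
# operation_Id into a single timeline, mirroring the KQL union above.
def build_timeline(op_id, **tables):
    rows = []
    for item_type, records in tables.items():
        for r in records:
            if r.get("operation_Id") == op_id:
                rows.append({"itemType": item_type, **r})
    # ISO-8601 timestamps sort chronologically as plain strings.
    return sorted(rows, key=lambda r: r["timestamp"])
```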

## 7. Likely Root Cause Patterns

### Pattern catalog
| Pattern ID | Symptom cluster | Strongest evidence | Likely root cause |
|---|---|---|---|
| P1 | First invocation slow after idle | Startup events + high first duration | Cold start and instance allocation cost |
| P2 | One dependency dominates latency | Target-level p95 concentration | Downstream API/database bottleneck |
| P3 | Tail latency rises with traffic | Load increase + queueing signals | Concurrency saturation |
| P4 | Broad latency with worker churn | Restart/drain/shutdown traces | Plan-level resource constraints |

### Normal vs Abnormal Comparison

| Signal | Normal | Abnormal | Interpretation |
| --- | --- | --- | --- |
| Host startup trace | `Host started` (< 1000 ms) | `Host started` (> 5000 ms), repeated | Cold start or recycle pressure |
| Dependency p95 by target | Critical targets < 300 ms | Single target > 1000 ms sustained | Downstream bottleneck likely |
| Request latency distribution | Stable p95 with brief spikes | Sustained p95/p99 growth | Systemic latency degradation |
| Scale and worker lifecycle | Occasional starts under load | Frequent drain/restart loops | Capacity instability |
| Failures with latency | Independent or low failures | Latency and failures rise together | Timeout/retry amplification |

### Common misdiagnoses

- Declaring H1 without checking H2 and H3 evidence.

## 8. Immediate Mitigations

### H1 mitigations

- Enable always-ready/pre-warmed capacity where the plan supports it.
- Minimize startup cost with lazy initialization and dependency trimming.

### H2 mitigations

- Apply per-target timeout budgets aligned to the end-to-end SLO.
- Use circuit breakers and fallbacks for unstable dependencies.

### H3 mitigations

- Limit in-flight work and apply backpressure on hot routes.
- Shift blocking operations off the synchronous request path.

### H4 mitigations

- Increase the capacity tier or scale-out headroom.
- Review connection pools, socket reuse, and outbound call patterns.

### Post-mitigation verification

1. Re-run the request/dependency KQL with the same granularity and window.
2. Confirm the p95/p99 reduction is sustained for at least 30 minutes.
3. Confirm timeout and retry rates decline without backlog growth.
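The verification steps above reduce to a small check: with p95 exported in 5-minute bins, a mitigation counts as verified only when every bin in the last 30 minutes (six consecutive bins) sits at or below the target. The bin shape and target value are assumptions about your telemetry:

```python
# Verify a sustained p95 improvement: all of the most recent
# `window_bins` 5-minute bins must be at or below target, so a single
# transient dip is not declared a fix.
def mitigation_verified(p95_bins_ms, target_ms, window_bins=6):
    if len(p95_bins_ms) < window_bins:
        return False  # not enough post-mitigation data yet
    return all(p95 <= target_ms for p95 in p95_bins_ms[-window_bins:])
```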

## 9. Prevention

### Engineering controls

- Define SLO alerts for p50, p95, and p99 separately.
- Add synthetic probes for idle-to-first-request latency regression.
- Instrument target-level dependency latency and timeout telemetry.
- Emit metrics for in-flight work, queue delay, and completion throughput.

### Capacity and architecture controls

- Run performance tests with burst, idle, and downstream-slowdown scenarios.
- Validate the hosting plan choice against concurrency and latency SLOs.

### Operational controls

- Maintain a baseline workbook with normal startup and dependency signatures.
- Require hypothesis validation/disproof evidence in post-incident reviews.

### Related Labs

- Cold Start Lab

## See Also

- First 10 Minutes
- Troubleshooting Methodology
- KQL Query Library
- Troubleshooting Playbooks

## Sources

- Monitor Azure Functions
- Application Insights telemetry data model
- Kusto Query Language overview
- Azure Functions hosting options