Scaling Queries¶
KQL queries for analyzing cold starts, scaling behavior, and host lifecycle events.
flowchart LR
A[Cold start query] --> B[Scaling events timeline]
B --> C[Host startup and shutdown]
C --> D[Instance count trend]
D --> E[Scaling diagnosis]
Cold start analysis¶
let appName = "func-myapp-prod";
// Measures earliest request latency per time bin (proxy for cold-start impact).
AppTraces
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
| where Message has_any ("Host started", "Initializing Host")
| summarize StartupEvents=count() by bin(TimeGenerated, 15m)
| join kind=leftouter (
AppRequests
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
| where OperationName startswith "Functions."
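// arg_min returns the earliest TimeGenerated in each bin together with that request's DurationMs.
| summarize arg_min(TimeGenerated, DurationMs) by Bin=bin(TimeGenerated, 15m)
| project FirstInvocation=TimeGenerated, FirstDurationMs=DurationMs, TimeGenerated=Bin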
) on TimeGenerated
| order by TimeGenerated desc
Example result:
| TimeGenerated | StartupEvents | FirstInvocation | FirstDurationMs |
|---|---|---|---|
| 2026-04-04T11:30:00Z | 83 | 2026-04-04T11:30:00.003Z | 3.0249 |
| 2026-04-04T11:15:00Z | 19 | 2026-04-04T11:29:25.000Z | 1600.4633 |
How to interpret:
| Indicator | Normal | Warning | Critical |
|---|---|---|---|
| StartupEvents per 15m bin | Consistent with plan type (FC1: dozens per bin is normal; Y1/EP: 1-3 per bin) | Sudden spike vs baseline (for example 10x normal) | Startup events with no subsequent successful invocations |
| FirstDurationMs after startup | < 1000ms | 1000-5000ms | > 5000ms |
FC1 Flex Consumption
Flex Consumption plans scale by spinning up many worker instances rapidly. Seeing 50-100+ startup events in a 15-minute bin is normal under load. Focus on whether startup events correlate with successful invocations, not the raw count.
Normal vs abnormal
Normal: One startup event and first invocation under 1 second.
Abnormal: Multiple startup events combined with a first invocation over 5 seconds indicate cold start pressure or host recycling.
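The FirstDurationMs thresholds from the interpretation table can be applied directly in KQL. The following is a minimal sketch, not part of the original query: it uses the two rows from the example result as inline data, and the `ColdStartSeverity` column and the `NoInvocation` label are hypothetical names chosen for illustration.

```kusto
// Inline sample rows copied from the example result; in practice, pipe the
// output of the cold start query into the extend step instead.
datatable(TimeGenerated:datetime, StartupEvents:long, FirstDurationMs:real)
[
    datetime(2026-04-04T11:30:00Z), 83, 3.0249,
    datetime(2026-04-04T11:15:00Z), 19, 1600.4633
]
// ColdStartSeverity is a hypothetical column name. A null FirstDurationMs means
// the left outer join found no invocation in that bin.
| extend ColdStartSeverity = case(
    isnull(FirstDurationMs), "NoInvocation",
    FirstDurationMs < 1000, "Normal",
    FirstDurationMs <= 5000, "Warning",
    "Critical")
| order by TimeGenerated desc
```

Treat a `NoInvocation` bin as a prompt to check whether startup events were followed by any successful invocations, since the table lists that case as critical.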
Scaling events timeline¶
let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
| where Message has_any ("scale", "instance", "worker", "concurrency", "drain")
| project TimeGenerated, SeverityLevel, Message
| order by TimeGenerated desc
Example result:
| TimeGenerated | SeverityLevel | Message |
|---|---|---|
| 2026-04-04T11:32:20Z | 1 | Worker process started and initialized. |
| 2026-04-04T11:31:50Z | 1 | Worker process started and initialized. |
| 2026-04-04T11:31:20Z | 1 | Worker process started and initialized. |
| 2026-04-04T11:30:50Z | 1 | Worker process started and initialized. |
How to interpret:
| Indicator | Normal | Warning | Critical |
|---|---|---|---|
| Scale events under sustained load | Present | Delayed | Missing |
| New instance allocation after scale out command | < 60s | 60-180s | > 180s |
| Frequent drain/recycle messages | Rare | Intermittent | Continuous |
Normal vs abnormal
Normal: "Scaling out" followed by "New instance allocated" and then stable processing.
Abnormal: Repeated "Drain mode" and recycle logs without sustained capacity growth indicate unstable workers or platform constraints.
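To judge whether drain messages are rare, intermittent, or continuous, it can help to count them per time bin rather than read the raw timeline. A sketch under the assumption that the same keywords the timeline query filters on are sufficient; the output column names are hypothetical.

```kusto
let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
// has is a case-insensitive term match, consistent with has_any in the timeline query.
| summarize
    DrainMessages = countif(Message has "drain"),
    WorkerMessages = countif(Message has "worker"),
    ScaleMessages = countif(Message has "scale")
    by bin(TimeGenerated, 15m)
| order by TimeGenerated asc
```

Bins in which `DrainMessages` stays above zero while the other counts do not grow are the ones worth inspecting in the raw timeline.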
Host startup/shutdown events¶
let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(12h)
| where AppRoleName =~ appName
| where Message has_any ("Host started", "Job host started", "Host shutdown", "Host is shutting down", "Stopping JobHost")
| project TimeGenerated, SeverityLevel, Message
| order by TimeGenerated desc
Example result:
| TimeGenerated | SeverityLevel | Message |
|---|---|---|
| 2026-04-04T11:36:20Z | 1 | Host started (64ms) |
| 2026-04-04T11:32:30Z | 1 | Job host started |
| 2026-04-04T11:32:20Z | 1 | Host is shutting down |
| 2026-04-04T11:30:00Z | 1 | Host started (82ms) |
How to interpret:
| Indicator | Normal | Warning | Critical |
|---|---|---|---|
| Host start count per hour | 1-2 | 3-5 | > 5 |
| Start-stop cycle interval | N/A | 10-30m | < 10m |
| Shutdown messages with errors nearby | None | Occasional | Repeated |
| Multiple Host started entries in short succession on FC1 | Normal scaling behavior | Review only if accompanied by error bursts | Persistent restarts with failures and no successful invocations |
Normal vs abnormal
Normal: One startup event after deployment or planned restart.
Abnormal: Repeated startup/shutdown cycling in short intervals usually indicates crash loops, configuration churn, or failing dependencies.
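The host start count per hour from the interpretation table can be computed rather than counted by eye. This is a minimal sketch: it assumes start lines begin with "Host started", as in the example result, and the `HostStarts` and `Severity` column names are hypothetical.

```kusto
let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(12h)
| where AppRoleName =~ appName
// startswith avoids also matching "Job host started", which would count a start twice.
| where Message startswith "Host started"
| summarize HostStarts = count() by bin(TimeGenerated, 1h)
// Thresholds taken from the table: 1-2 normal, 3-5 warning, more than 5 critical.
| extend Severity = case(
    HostStarts <= 2, "Normal",
    HostStarts <= 5, "Warning",
    "Critical")
| order by TimeGenerated desc
```

Hours with no host starts produce no row. On FC1, where many starts in quick succession are normal scaling behavior, read `Severity` together with nearby errors rather than on its own.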
Instance count over time¶
let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
| summarize InstanceCount = dcount(AppRoleInstance) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
Example result:
| TimeGenerated | InstanceCount |
|---|---|
| 2026-04-04T11:00:00Z | 1 |
| 2026-04-04T11:05:00Z | 3 |
| 2026-04-04T11:10:00Z | 5 |
| 2026-04-04T11:15:00Z | 5 |
| 2026-04-04T11:20:00Z | 2 |
How to interpret:
| Indicator | Normal | Warning | Critical |
|---|---|---|---|
| Instance count under load | Increases as traffic grows | Flat despite rising backlog | Drops to zero unexpectedly |
| Instance count after load subsides | Decreases gradually | Remains over-provisioned for extended periods | Oscillates rapidly up and down |
Reading instance counts
This query counts distinct AppRoleInstance values that emitted at least one trace within each 5-minute bin. Because it reads only AppTraces, an instance counts as active only if it logged a trace during the bin; an instance that was running but logged nothing is not counted. Bins with zero traces produce no row, which is different from an instance count of zero.
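If gaps in the result make a chart hard to read, `make-series` can emit a row for every bin. A sketch, assuming a zero-filled series is acceptable for visualization; a 0 here means that no trace was observed in the bin, not that the platform reported zero instances.

```kusto
let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
// default = 0 fills bins that contain no traces.
| make-series InstanceCount = dcount(AppRoleInstance) default = 0
    on TimeGenerated from bin(ago(6h), 5m) to now() step 5m
// Expand the series arrays back into one row per bin.
| mv-expand TimeGenerated to typeof(datetime), InstanceCount to typeof(long)
| order by TimeGenerated asc
```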