Scaling Queries¶

KQL queries for analyzing cold starts, scaling behavior, and host lifecycle events.

flowchart LR
    A[Cold start query] --> B[Scaling events timeline]
    B --> C[Host startup and shutdown]
    C --> D[Instance count trend]
    D --> E[Scaling diagnosis]

Cold start analysis¶

let appName = "func-myapp-prod";
// Measures earliest request latency per time bin (proxy for cold-start impact).
AppTraces
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
| where Message has_any ("Host started", "Initializing Host")
| summarize StartupEvents=count() by bin(TimeGenerated, 15m)
| join kind=leftouter (
    AppRequests
    | where TimeGenerated > ago(6h)
    | where AppRoleName =~ appName
    | where OperationName startswith "Functions."
    | summarize FirstInvocation=min(TimeGenerated), FirstDurationMs=arg_min(TimeGenerated, DurationMs) by bin(TimeGenerated, 15m)
) on TimeGenerated
| order by TimeGenerated desc

Example result:

TimeGenerated	StartupEvents	FirstInvocation	FirstDurationMs
2026-04-04T11:30:00Z	83	2026-04-04T11:30:00.003Z	3.0249
2026-04-04T11:15:00Z	19	2026-04-04T11:29:25.000Z	1600.4633

How to interpret:

Indicator	Normal	Warning	Critical
StartupEvents per 15m bin	Consistent with plan type (FC1: dozens per bin is normal; Y1/EP: 1-3 per bin)	Sudden spike vs baseline (for example 10x normal)	Startup events with no subsequent successful invocations
FirstDurationMs after startup	< 1000ms	1000-5000ms	> 5000ms

FC1 Flex Consumption

Flex Consumption plans scale by spinning up many worker instances rapidly. Seeing 50-100+ startup events in a 15-minute bin is normal under load. Focus on whether startup events correlate with successful invocations, not the raw count.

Normal vs abnormal

Normal: One startup event and first invocation under 1 second.

Abnormal: Multiple startup events plus first invocation over 5 seconds indicates cold start pressure or host recycling.

Scaling events timeline¶

let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
| where Message has_any ("scale", "instance", "worker", "concurrency", "drain")
| project TimeGenerated, SeverityLevel, Message
| order by TimeGenerated desc

Example result:

TimeGenerated	SeverityLevel	Message
2026-04-04T11:32:20Z	1	Worker process started and initialized.
2026-04-04T11:31:50Z	1	Worker process started and initialized.
2026-04-04T11:31:20Z	1	Worker process started and initialized.
2026-04-04T11:30:50Z	1	Worker process started and initialized.

How to interpret:

Indicator	Normal	Warning	Critical
Scale events under sustained load	Present	Delayed	Missing
New instance allocation after scale out command	< 60s	60-180s	> 180s
Frequent drain/recycle messages	Rare	Intermittent	Continuous

Normal vs abnormal

Normal: Scaling out followed by New instance allocated and then stable processing.

Abnormal: Repeated Drain mode and recycle logs without sustained capacity growth indicate unstable workers or platform constraints.

Host startup/shutdown events¶

let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(12h)
| where AppRoleName =~ appName
| where Message has_any ("Host started", "Job host started", "Host shutdown", "Host is shutting down", "Stopping JobHost")
| project TimeGenerated, SeverityLevel, Message
| order by TimeGenerated desc

Example result:

TimeGenerated	SeverityLevel	Message
2026-04-04T11:36:20Z	1	Host started (64ms)
2026-04-04T11:32:30Z	1	Job host started
2026-04-04T11:32:20Z	1	Host is shutting down
2026-04-04T11:30:00Z	1	Host started (82ms)

How to interpret:

Indicator	Normal	Warning	Critical
Host start count per hour	1-2	3-5	> 5
Start-stop cycle interval	N/A	10-30m	< 10m
Shutdown messages with errors nearby	None	Occasional	Repeated
Multiple `Host started` entries in short succession on FC1	Normal scaling behavior	Review only if accompanied by error bursts	Persistent restarts with failures and no successful invocations

Normal vs abnormal

Normal: One startup event after deployment or planned restart.

Abnormal: Repeated startup/shutdown cycling in short intervals usually indicates crash loops, configuration churn, or failing dependencies.

Instance count over time¶

let appName = "func-myapp-prod";
AppTraces
| where TimeGenerated > ago(6h)
| where AppRoleName =~ appName
| summarize InstanceCount = dcount(AppRoleInstance) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

Example result:

TimeGenerated	InstanceCount
2026-04-04T11:00:00Z	1
2026-04-04T11:05:00Z	3
2026-04-04T11:10:00Z	5
2026-04-04T11:15:00Z	5
2026-04-04T11:20:00Z	2

How to interpret:

Indicator	Normal	Warning	Critical
Instance count under load	Increases as traffic grows	Flat despite rising backlog	Drops to zero unexpectedly
Instance count after load subsides	Decreases gradually	Remains over-provisioned for extended periods	Oscillates rapidly up and down

Reading instance counts

This query counts distinct AppRoleInstance values that emitted any trace within each 5-minute bin. An instance that processes at least one request or logs one trace during the bin is counted as active. Bins with zero traces produce no row, which is different from an instance count of zero.

Scaling Queries¶

Cold start analysis¶

Scaling events timeline¶

Host startup/shutdown events¶

Instance count over time¶

See Also¶

Sources¶