First 10 Minutes: Scaling Issues¶
When queue backlogs grow, executions remain flat despite demand, or cold start spikes degrade throughput, use this checklist to narrow down the cause within the first 10 minutes.
Prerequisites¶
- Azure CLI access to the production subscription.
- Access to Application Insights and Log Analytics.
- Health endpoint implemented at GET /api/health.
Set shared variables:
RG="rg-myapp-prod"
APP_NAME="func-myapp-prod"
PLAN_NAME="plan-myapp-prod"
SUBSCRIPTION_ID="<subscription-id>"
APP_INSIGHTS_NAME="appi-myapp-prod"
WORKSPACE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
STORAGE_NAME="stmyappprod"
flowchart TD
A[Backlog or lag rising] --> B[Compare executions vs source backlog]
B --> C{Executions scaling with demand?}
C -->|No| D[Check listener and scale-controller signals]
C -->|Yes| E[Check dependency bottlenecks]
D --> F{Plan limit reached?}
F -->|Yes| G[Increase plan capacity]
F -->|No| H[Review host and trigger configuration]
E --> I[Target downstream service constraints]
1) Check Azure status and regional incidents¶
Rule out platform-wide scaling limitations.
Check in Portal¶
Azure portal → Service Health → Health advisories.
Filter for the production region and services: Azure Functions, Storage, Azure Monitor.
Check with Azure CLI¶
az account set --subscription "$SUBSCRIPTION_ID"
az rest --method get \
--url "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/providers/Microsoft.ResourceHealth/events?api-version=2022-10-01&\$filter=eventType eq 'ServiceIssue' and status eq 'Active'"
How to Read This¶
| Signal | Interpretation | Action |
|---|---|---|
| No active service issues | Scaling issue is app or config-level | Continue to Step 2 |
| Active incident on Functions or Storage | Platform constraint | Monitor advisory, document for post-mortem |
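The Resource Health response can be long; a quick local check for incidents touching the services this app depends on can be sketched as below. The sample payload is hypothetical (`events.json` is a saved copy of the `az rest` response), though `properties.impact[].impactedService` is the field the events API uses for affected services.

```shell
# Hypothetical saved copy of the Resource Health events response from Step 1.
cat > events.json <<'EOF'
{"value": [{"properties": {"title": "Storage latency", "impact": [{"impactedService": "Storage"}]}}]}
EOF

# Flag only incidents that hit the services this app depends on.
if grep -qE '"impactedService": "(Storage|Azure Functions)"' events.json; then
  verdict="platform incident"
else
  verdict="no relevant incidents"
fi
echo "$verdict"
```

If the verdict is "no relevant incidents", continue to Step 2; otherwise document the advisory for the post-mortem and keep triaging app-level causes in parallel.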
2) Check execution rate vs backlog¶
Determine whether scale-out is keeping pace with incoming demand.
Check with Azure CLI¶
# Function execution metrics
az monitor metrics list \
--resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" \
--metric "FunctionExecutionCount" \
--interval PT1M \
--aggregation Total \
--offset 30m \
--output table
# Queue depth (if queue-triggered). QueueMessageCount is exposed on the
# queueServices sub-resource, not on the storage account itself.
az monitor metrics list \
--resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/$STORAGE_NAME/queueServices/default" \
--metric "QueueMessageCount" \
--interval PT1M \
--aggregation Average \
--offset 30m \
--output table
Example Output¶
# Execution metrics - normal
MetricName TimeGrain Total
----------------------- --------- -------
FunctionExecutionCount PT1M 156
FunctionExecutionCount PT1M 162
FunctionExecutionCount PT1M 148
# Execution metrics - scaling issue
MetricName TimeGrain Total
----------------------- --------- -------
FunctionExecutionCount PT1M 12
FunctionExecutionCount PT1M 8
FunctionExecutionCount PT1M 5
# Queue depth - growing backlog
MetricName TimeGrain Average
----------------- --------- ---------
QueueMessageCount PT1M 120
QueueMessageCount PT1M 860
QueueMessageCount PT1M 2140
FC1 Flex Consumption Metrics
Flex Consumption plans use OnDemandFunctionExecutionCount and OnDemandFunctionExecutionUnits instead of FunctionExecutionCount. If standard metrics return empty, use the FC1-specific metric names.
How to Read This¶
| Pattern | Interpretation | Action |
|---|---|---|
| Queue depth stable + executions track demand | Normal drain behavior | No scaling issue |
| Queue depth up + executions flat | Scaling bottleneck or trigger stall | Check trigger listener and scale controller |
| Queue depth up + executions up but slow | Downstream dependency bottleneck | Check dependency health |
| Executions dropping to zero | Host or trigger failure | Check host logs immediately |
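To turn the sampled queue-depth values into a single growth number, a minimal sketch (using the example QueueMessageCount averages above; substitute your own samples, oldest first):

```shell
# Hypothetical samples: QueueMessageCount averages from the metrics query,
# one value per minute, oldest first.
samples="120 860 2140"

# Growth per minute = (last - first) / (number of one-minute intervals).
growth=$(echo "$samples" | awk '{
  n = NF
  rate = (n > 1) ? ($n - $1) / (n - 1) : 0
  printf "%d", rate
}')
echo "queue growth: ${growth} msgs/min"
```

A sustained positive growth rate while FunctionExecutionCount stays flat points at a scale-out bottleneck rather than normal drain behavior.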
3) Check scale controller activity¶
The scale controller decides when to add or remove instances.
Check with KQL¶
let appName = "func-myapp-prod";
traces
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where message has_any ("scale", "instance", "worker", "concurrency", "drain", "Scaling out", "New instance")
| project timestamp, severityLevel, message
| order by timestamp desc
Example Output¶
# Normal scaling
timestamp message
--------------------------- ----------------------------------------------
2026-04-04T11:32:20Z Worker process started and initialized.
2026-04-04T11:31:50Z Worker process started and initialized.
2026-04-04T11:31:20Z Scaling out to 4 instances.
# Problematic - drain loop
timestamp message
--------------------------- ----------------------------------------------
2026-04-04T11:32:20Z Drain mode enabled.
2026-04-04T11:31:50Z Worker process started and initialized.
2026-04-04T11:31:20Z Drain mode enabled.
2026-04-04T11:30:50Z Worker process started and initialized.
How to Read This¶
| Pattern | Interpretation | Action |
|---|---|---|
| Scaling out followed by stable workers | Normal scale behavior | No action |
| Repeated Drain mode + restarts | Unstable workers, crash loop | Check application errors and memory |
| No scale messages despite backlog | Scale controller not triggering | Check trigger connection and host.json |
| Scale events present but latency high | Not enough instances for demand | Check plan limits |
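A drain loop is easy to eyeball in a short trace window but harder in an hour of logs. A rough heuristic, sketched against a hypothetical local export of the KQL results above: if drain events roughly match worker starts, workers are being recycled rather than scaled.

```shell
# Hypothetical export: trace messages from the KQL query, saved locally.
cat > traces.txt <<'EOF'
2026-04-04T11:32:20Z  Drain mode enabled.
2026-04-04T11:31:50Z  Worker process started and initialized.
2026-04-04T11:31:20Z  Drain mode enabled.
2026-04-04T11:30:50Z  Worker process started and initialized.
EOF

drains=$(grep -c "Drain mode" traces.txt)
starts=$(grep -c "started and initialized" traces.txt)
echo "drains=$drains starts=$starts"

# Roughly one drain per worker start suggests a crash/drain loop
# rather than healthy scale-in.
if [ "$drains" -ge "$starts" ] && [ "$drains" -gt 1 ]; then
  echo "possible drain loop"
fi
```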
4) Check plan limits¶
Each hosting plan has maximum instance limits that cap scale-out.
Plan instance limits¶
| Plan | Max Instances (Default) | Max Instances (Configurable) |
|---|---|---|
| Consumption (Y1) | 200 | — |
| Flex Consumption (FC1) | 100 | Up to 1000 |
| Premium EP1 | 20 | Up to 100 |
| Premium EP2 | 20 | Up to 100 |
| Premium EP3 | 20 | Up to 100 |
| Dedicated | Depends on plan SKU | Manual scaling |
Check with Azure CLI¶
# Check current plan and configuration
az functionapp show \
--name "$APP_NAME" \
--resource-group "$RG" \
--query "{plan:appServicePlanId, state:state, maxWorkers:siteConfig.numberOfWorkers}" \
--output table
# Check Premium plan scale limits
az functionapp plan show \
--name "$PLAN_NAME" \
--resource-group "$RG" \
--query "{sku:sku.name, maximumElasticWorkerCount:maximumElasticWorkerCount, numberOfWorkers:numberOfWorkers}" \
--output table
How to Read This¶
| Signal | Interpretation | Action |
|---|---|---|
| Current instances at max limit | Plan ceiling reached | Increase maximumElasticWorkerCount or upgrade plan |
| Current instances well below max | Scale controller not requesting more | Check trigger config and host.json batching |
| maximumElasticWorkerCount = 1 | Scale-out effectively disabled | Set appropriate max worker count |
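A quick ceiling check, as a sketch: plug in `numberOfWorkers` and `maximumElasticWorkerCount` from the `az functionapp plan show` query above (the values here are hypothetical placeholders).

```shell
# Hypothetical values taken from `az functionapp plan show` output.
current_workers=20
max_workers=20

if [ "$current_workers" -ge "$max_workers" ]; then
  echo "at plan ceiling: raise maximumElasticWorkerCount or upgrade the plan"
else
  echo "headroom available: $((max_workers - current_workers)) instances"
fi
```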
5) Check host.json concurrency settings¶
Misconfigured concurrency can bottleneck throughput even with many instances.
Key host.json settings¶
{
"extensions": {
"queues": {
"batchSize": 16,
"newBatchThreshold": 8,
"maxDequeueCount": 5,
"visibilityTimeout": "00:00:30"
},
"serviceBus": {
"maxConcurrentCalls": 16,
"maxConcurrentSessions": 8
},
"eventHubs": {
"maxEventBatchSize": 100,
"prefetchCount": 300
}
}
}
Check with Azure CLI¶
# Inspect general app and runtime site configuration
az functionapp config show \
--name "$APP_NAME" \
--resource-group "$RG" \
--output json
# host.json is not exposed by az functionapp config show.
# Inspect host.json in the deployed package or via Kudu/SCM (site/wwwroot/host.json).
How to Read This¶
| Setting | Too Low | Recommended | Too High |
|---|---|---|---|
| queues.batchSize | 1 (serial processing) | 16 (default) | 32+ (may cause OOM) |
| serviceBus.maxConcurrentCalls | 1 | 16 | 64+ (check dependencies) |
| eventHubs.maxEventBatchSize | 1 | 100 | 1000+ (check processing time) |
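Once you have a copy of host.json (e.g. from the deployed package or Kudu), a quick sanity check on the queue batch size can be sketched as below; the host.json contents here are a hypothetical example of a throttled configuration.

```shell
# Hypothetical host.json pulled from the deployed package.
cat > host.json <<'EOF'
{
  "extensions": {
    "queues": { "batchSize": 1, "newBatchThreshold": 0 }
  }
}
EOF

# Extract batchSize; default to 16 (the platform default) when absent.
batch=$(sed -n 's/.*"batchSize"[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p' host.json)
echo "batchSize=${batch:-16}"
if [ "${batch:-16}" -le 1 ]; then
  echo "serial processing: raise batchSize toward the default of 16"
fi
```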
6) Check recent deployments and configuration changes¶
az monitor activity-log list \
--resource-group "$RG" \
--offset 2h \
--status Succeeded \
--output table
Correlate configuration changes (especially to host.json and concurrency settings) with the onset of the scaling problem.
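The activity log mixes reads and unrelated writes; a sketch of narrowing it to the operations most likely to affect scaling, run against a hypothetical excerpt of the table output above (exact operation names vary by resource type):

```shell
# Hypothetical excerpt of the activity-log table output, saved locally.
cat > activity.txt <<'EOF'
2026-04-04T10:05:11Z  Microsoft.Web/sites/config/write        Succeeded
2026-04-04T09:40:02Z  Microsoft.Web/serverfarms/write         Succeeded
2026-04-04T08:12:45Z  Microsoft.Storage/storageAccounts/read  Succeeded
EOF

# Site-config and plan writes in the window are the prime suspects
# for a scaling regression.
grep -E "sites/config/write|serverfarms/write" activity.txt
```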
Fast routing after triage¶
| What you see | Likely area | Next action |
|---|---|---|
| Backlog growing, executions flat | Trigger/listener failure | Use Functions Not Executing playbook |
| At plan instance limit | Capacity ceiling | Increase limit or upgrade plan |
| Host.json concurrency too low | Configuration | Tune batch size and concurrency settings |
| Workers crashing and restarting | Stability | Use Out of Memory playbook |
| Queue depth rising with slow processing | Downstream bottleneck | Use High Latency checklist |