First 10 Minutes: Scaling Issues¶
When queue backlogs grow, executions remain flat despite demand, or cold start spikes degrade throughput, use this checklist to narrow down the cause within the first 10 minutes.
Prerequisites¶
- Azure CLI access to the production subscription.
- Access to Application Insights and Log Analytics.
- Health endpoint implemented at GET /api/health.
Set shared variables:
RG="rg-myapp-prod"
APP_NAME="func-myapp-prod"
PLAN_NAME="plan-myapp-prod"
SUBSCRIPTION_ID="<subscription-id>"
APP_INSIGHTS_NAME="appi-myapp-prod"
WORKSPACE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
STORAGE_NAME="stmyappprod"
flowchart TD
A[Backlog or lag rising] --> B[Compare executions vs source backlog]
B --> C{Executions scaling with demand?}
C -->|No| D[Check listener and scale-controller signals]
C -->|Yes| E[Check dependency bottlenecks]
D --> F{Plan limit reached?}
F -->|Yes| G[Increase plan capacity]
F -->|No| H[Review host and trigger configuration]
E --> I[Target downstream service constraints]
1) Check Azure status and regional incidents¶
Rule out platform-wide scaling limitations.
Check in Portal¶
Azure portal → Service Health → Health advisories.
Filter for the production region and services: Azure Functions, Storage, Azure Monitor.
Check with Azure CLI¶
az account set --subscription "$SUBSCRIPTION_ID"
az rest --method get \
--url "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/providers/Microsoft.ResourceHealth/events?api-version=2022-10-01&\$filter=eventType eq 'ServiceIssue' and status eq 'Active'"
How to Read This¶
| Signal | Interpretation | Action |
|---|---|---|
| No active service issues | Scaling issue is app or config-level | Continue to Step 2 |
| Active incident on Functions or Storage | Platform constraint | Monitor advisory, document for post-mortem |
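The Resource Health response can be long; a quick local check for incidents touching the services this app depends on can be sketched as below. The sample payload is hypothetical (`events.json` is a saved copy of the `az rest` response), though `properties.impact[].impactedService` is the field the events API uses for affected services.

```shell
# Hypothetical saved copy of the Resource Health events response from Step 1.
cat > events.json <<'EOF'
{"value": [{"properties": {"title": "Storage latency", "impact": [{"impactedService": "Storage"}]}}]}
EOF

# Flag only incidents that hit the services this app depends on.
if grep -qE '"impactedService": "(Storage|Azure Functions)"' events.json; then
  verdict="platform incident"
else
  verdict="no relevant incidents"
fi
echo "$verdict"
```

If the verdict is "no relevant incidents", continue to Step 2; otherwise document the advisory for the post-mortem and keep triaging app-level causes in parallel.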
2) Check execution rate vs backlog¶
Determine whether scale-out is keeping pace with incoming demand.
Check with Azure CLI¶
# Function execution metrics
az monitor metrics list \
--resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" \
--metric "FunctionExecutionCount" \
--interval PT1M \
--aggregation Total \
--offset 30m \
--output table
# Queue depth (if queue-triggered). QueueMessageCount is exposed on the
# queueServices sub-resource, not on the storage account itself.
az monitor metrics list \
--resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/$STORAGE_NAME/queueServices/default" \
--metric "QueueMessageCount" \
--interval PT1M \
--aggregation Average \
--offset 30m \
--output table
Example Output¶
# Execution metrics - normal
MetricName TimeGrain Total
----------------------- --------- -------
FunctionExecutionCount PT1M 156
FunctionExecutionCount PT1M 162
FunctionExecutionCount PT1M 148
# Execution metrics - scaling issue
MetricName TimeGrain Total
----------------------- --------- -------
FunctionExecutionCount PT1M 12
FunctionExecutionCount PT1M 8
FunctionExecutionCount PT1M 5
# Queue depth - growing backlog
MetricName TimeGrain Average
----------------- --------- ---------
QueueMessageCount PT1M 120
QueueMessageCount PT1M 860
QueueMessageCount PT1M 2140
FC1 Flex Consumption Metrics
Flex Consumption plans use OnDemandFunctionExecutionCount and OnDemandFunctionExecutionUnits instead of FunctionExecutionCount. If standard metrics return empty, use the FC1-specific metric names.
How to Read This¶
| Pattern | Interpretation | Action |
|---|---|---|
| Queue depth stable + executions track demand | Normal drain behavior | No scaling issue |
| Queue depth up + executions flat | Scaling bottleneck or trigger stall | Check trigger listener and scale controller |
| Queue depth up + executions up but slow | Downstream dependency bottleneck | Check dependency health |
| Executions dropping to zero | Host or trigger failure | Check host logs immediately |
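To turn the sampled queue-depth values into a single growth number, a minimal sketch (using the example QueueMessageCount averages above; substitute your own samples, oldest first):

```shell
# Hypothetical samples: QueueMessageCount averages from the metrics query,
# one value per minute, oldest first.
samples="120 860 2140"

# Growth per minute = (last - first) / (number of one-minute intervals).
growth=$(echo "$samples" | awk '{
  n = NF
  rate = (n > 1) ? ($n - $1) / (n - 1) : 0
  printf "%d", rate
}')
echo "queue growth: ${growth} msgs/min"
```

A sustained positive growth rate while FunctionExecutionCount stays flat points at a scale-out bottleneck rather than normal drain behavior.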
3) Check scale controller activity¶
The scale controller decides when to add or remove instances.
Check with KQL¶
let appName = "func-myapp-prod";
traces
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where message has_any ("scale", "instance", "worker", "concurrency", "drain", "Scaling out", "New instance")
| project timestamp, severityLevel, message
| order by timestamp desc
Example Output¶
# Normal scaling
timestamp message
--------------------------- ----------------------------------------------
2026-04-04T11:32:20Z Worker process started and initialized.
2026-04-04T11:31:50Z Worker process started and initialized.
2026-04-04T11:31:20Z Scaling out to 4 instances.
# Problematic - drain loop
timestamp message
--------------------------- ----------------------------------------------
2026-04-04T11:32:20Z Drain mode enabled.
2026-04-04T11:31:50Z Worker process started and initialized.
2026-04-04T11:31:20Z Drain mode enabled.
2026-04-04T11:30:50Z Worker process started and initialized.
How to Read This¶
| Pattern | Interpretation | Action |
|---|---|---|
| Scaling out followed by stable workers | Normal scale behavior | No action |
| Repeated Drain mode + restarts | Unstable workers, crash loop | Check application errors and memory |
| No scale messages despite backlog | Scale controller not triggering | Check trigger connection and host.json |
| Scale events present but latency high | Not enough instances for demand | Check plan limits |
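A drain loop is easy to eyeball in a short trace window but harder in an hour of logs. A rough heuristic, sketched against a hypothetical local export of the KQL results above: if drain events roughly match worker starts, workers are being recycled rather than scaled.

```shell
# Hypothetical export: trace messages from the KQL query, saved locally.
cat > traces.txt <<'EOF'
2026-04-04T11:32:20Z  Drain mode enabled.
2026-04-04T11:31:50Z  Worker process started and initialized.
2026-04-04T11:31:20Z  Drain mode enabled.
2026-04-04T11:30:50Z  Worker process started and initialized.
EOF

drains=$(grep -c "Drain mode" traces.txt)
starts=$(grep -c "started and initialized" traces.txt)
echo "drains=$drains starts=$starts"

# Roughly one drain per worker start suggests a crash/drain loop
# rather than healthy scale-in.
if [ "$drains" -ge "$starts" ] && [ "$drains" -gt 1 ]; then
  echo "possible drain loop"
fi
```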
4) Check plan limits¶
Each hosting plan has maximum instance limits that cap scale-out.
Plan instance limits¶
| Plan | Max Instances (Default) | Max Instances (Configurable) |
|---|---|---|
| Consumption (Y1) | 200 | — |
| Flex Consumption (FC1) | 100 | Up to 1000 |
| Premium EP1 | 20 | Up to 100 |
| Premium EP2 | 20 | Up to 100 |
| Premium EP3 | 20 | Up to 100 |
| Dedicated | Depends on plan SKU | Manual scaling |
Check with Azure CLI¶
# Check current plan and configuration
az functionapp show \
--name "$APP_NAME" \
--resource-group "$RG" \
--query "{plan:appServicePlanId, state:state, maxWorkers:siteConfig.numberOfWorkers}" \
--output table
# Check Premium plan scale limits
az functionapp plan show \
--name "$PLAN_NAME" \
--resource-group "$RG" \
--query "{sku:sku.name, maximumElasticWorkerCount:maximumElasticWorkerCount, numberOfWorkers:numberOfWorkers}" \
--output table
How to Read This¶
| Signal | Interpretation | Action |
|---|---|---|
| Current instances at max limit | Plan ceiling reached | Increase maximumElasticWorkerCount or upgrade plan |
| Current instances well below max | Scale controller not requesting more | Check trigger config and host.json batching |
| maximumElasticWorkerCount = 1 | Scale-out effectively disabled | Set appropriate max worker count |
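A quick ceiling check, as a sketch: plug in `numberOfWorkers` and `maximumElasticWorkerCount` from the `az functionapp plan show` query above (the values here are hypothetical placeholders).

```shell
# Hypothetical values taken from `az functionapp plan show` output.
current_workers=20
max_workers=20

if [ "$current_workers" -ge "$max_workers" ]; then
  echo "at plan ceiling: raise maximumElasticWorkerCount or upgrade the plan"
else
  echo "headroom available: $((max_workers - current_workers)) instances"
fi
```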
5) Check host.json concurrency settings¶
Misconfigured concurrency can bottleneck throughput even with many instances.
Key host.json settings¶
{
"extensions": {
"queues": {
"batchSize": 16,
"newBatchThreshold": 8,
"maxDequeueCount": 5,
"visibilityTimeout": "00:00:30"
},
"serviceBus": {
"maxConcurrentCalls": 16,
"maxConcurrentSessions": 8
},
"eventHubs": {
"maxEventBatchSize": 100,
"prefetchCount": 300
}
}
}
Check with Azure CLI¶
# Inspect general app and runtime site configuration
az functionapp config show \
--name "$APP_NAME" \
--resource-group "$RG" \
--output json
# host.json is not exposed by az functionapp config show.
# Inspect host.json in the deployed package or via Kudu/SCM (site/wwwroot/host.json).
How to Read This¶
| Setting | Too Low | Recommended | Too High |
|---|---|---|---|
| queues.batchSize | 1 (serial processing) | 16 (default) | 32+ (may cause OOM) |
| serviceBus.maxConcurrentCalls | 1 | 16 | 64+ (check dependencies) |
| eventHubs.maxEventBatchSize | 1 | 100 | 1000+ (check processing time) |
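Once you have a copy of host.json (e.g. from the deployed package or Kudu), a quick sanity check on the queue batch size can be sketched as below; the host.json contents here are a hypothetical example of a throttled configuration.

```shell
# Hypothetical host.json pulled from the deployed package.
cat > host.json <<'EOF'
{
  "extensions": {
    "queues": { "batchSize": 1, "newBatchThreshold": 0 }
  }
}
EOF

# Extract batchSize; default to 16 (the platform default) when absent.
batch=$(sed -n 's/.*"batchSize"[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p' host.json)
echo "batchSize=${batch:-16}"
if [ "${batch:-16}" -le 1 ]; then
  echo "serial processing: raise batchSize toward the default of 16"
fi
```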
6) Check recent deployments and configuration changes¶
az monitor activity-log list \
--resource-group "$RG" \
--offset 2h \
--status Succeeded \
--output table
Correlate configuration changes (especially to host.json and concurrency settings) with the onset of the scaling problem.
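The activity log mixes reads and unrelated writes; a sketch of narrowing it to the operations most likely to affect scaling, run against a hypothetical excerpt of the table output above (exact operation names vary by resource type):

```shell
# Hypothetical excerpt of the activity-log table output, saved locally.
cat > activity.txt <<'EOF'
2026-04-04T10:05:11Z  Microsoft.Web/sites/config/write        Succeeded
2026-04-04T09:40:02Z  Microsoft.Web/serverfarms/write         Succeeded
2026-04-04T08:12:45Z  Microsoft.Storage/storageAccounts/read  Succeeded
EOF

# Site-config and plan writes in the window are the prime suspects
# for a scaling regression.
grep -E "sites/config/write|serverfarms/write" activity.txt
```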
Fast routing after triage¶
| What you see | Likely area | Next action |
|---|---|---|
| Backlog growing, executions flat | Trigger/listener failure | Use Functions Not Executing playbook |
| At plan instance limit | Capacity ceiling | Increase limit or upgrade plan |
| Host.json concurrency too low | Configuration | Tune batch size and concurrency settings |
| Workers crashing and restarting | Stability | Use Out of Memory playbook |
| Queue depth rising with slow processing | Downstream bottleneck | Use High Latency checklist |