Lab Guide: Queue Backlog Scaling on Y1 Consumption (Real Evidence)¶
This lab reproduces and analyzes a real queue backlog burst on Azure Functions Consumption (Y1) using telemetry collected on 2026-04-07. The objective is to prove what happened, what did not happen, and why a 2000-message burst drained in about 7.3 minutes without scale-out.
Lab Metadata¶
| Field | Value |
|---|---|
| Difficulty | Intermediate to advanced |
| Duration | 45-60 minutes |
| Hosting plan tested | Azure Functions Consumption (Y1), Linux |
| Runtime | Python 3.11 / Functions v4 |
| Resource Group | rg-lab-y1-shared (koreacentral) |
| Function App | labshared-func |
| Storage Account | labsharedstorage |
| Application Insights | labshared-insights |
| Queue | work-items |
| Function code path | apps/python/blueprints/queue_processor.py |
| Producer path | labs/queue-backlog-scaling/producer.py |
| Queue connection setting | QueueStorage |
| Evidence date (UTC) | 2026-04-07 |
Scope of this lab
This document replaces fabricated backlog data with a single real run from Y1 Consumption. Every metric, timestamp, and conclusion in this guide is constrained to the captured evidence listed in this lab.
1) Background¶
Queue backlog incidents are often misread as "the app is broken" or "the scale controller failed." In practice, backlog behavior is a throughput-shaping pattern where enqueue rate, per-message work time, batching, and scale decisions interact over time.
In this run, the producer injected 2000 messages in about 20.7 seconds (~97 msg/s), while each message included a configured 5-second processing delay. The workload did not fail, did not poison, and did not retry. The critical behavior was that Y1 remained on a single instance while draining.
Workload and trigger configuration¶
Queue trigger behavior came from host.json queue settings:
{
"extensions": {
"queues": {
"batchSize": 16,
"newBatchThreshold": 8,
"maxDequeueCount": 5,
"visibilityTimeout": "00:00:30",
"messageEncoding": "none"
}
}
}
Application log format in apps/python/blueprints/queue_processor.py:
These values are emitted in the message column of traces and must be parsed from message, not from customDimensions.
Failure progression model for this incident¶
flowchart LR
A[Burst injection starts\n08:23:09 UTC] --> B[2000 messages injected\nends 08:23:31 UTC]
B --> C[Single Y1 instance processes queue]
C --> D[Per-minute throughput ramps up\n25 -> 518]
D --> E[Backlog age increases over time]
E --> F[Drain completes ~08:30:29 UTC]
F --> G[No retry and no poison] Why this scenario matters¶
- It separates burst absorption from persistent failure.
- It shows that "no scale-out" can still end in successful drain.
- It demonstrates why message age can rise even with 100% success.
- It prevents incorrect root-cause claims such as poison loop or dependency bottleneck when those signals are absent.
Timeline view¶
gantt
title Queue Backlog Burst and Drain (UTC, 2026-04-07)
dateFormat HH:mm:ss
axisFormat %H:%M:%S
section Injection
Producer sends 2000 messages :done, inj, 08:23:09, 22s
section Drain
Single-instance drain phase :done, drn, 08:23:09, 7m20s
section Completion
Queue reaches fully drained :done, fin, 08:30:29, 1s 2) Hypothesis¶
Primary hypothesis (H1)¶
If a high-burst queue workload is injected into Y1 Consumption with steady 5-second message processing, backlog growth is primarily caused by burst-vs-drain imbalance and scale lag, where scale lag manifests as no additional instances during the incident window.
Competing hypotheses¶
| ID | Hypothesis | Result | Evidence summary |
|---|---|---|---|
| H1 | Scale lag/backlog due to insufficient active workers | Confirmed | DistinctInstances = 1 for all processed records; drain completed without scale-out |
| H2 | Poison/retry loop dominates throughput loss | Disproved | MaxDequeueCount = 1; all 2000 messages succeeded first try |
| H3 | Processing-time regression causes slowdown | Disproved | AvgProcessingMs = 5000; processing delay remained steady as configured |
| H4 | Dependency bottleneck throttles handler | Disproved | This workload made no dependency calls |
Proof criteria and observed outcome¶
| Criterion | Expected if true | Observed |
|---|---|---|
| Scale behavior | More than one worker if scale-out occurs | One worker only (DistinctInstances = 1) |
| Retry pattern | Elevated dequeue counts if poison/retry issue | All dequeue counts remained 1 |
| Processing consistency | Near configured delay if no regression | AvgProcessingMs = 5000 |
| End-state | Eventual drain if workload is healthy | Full drain around 08:30:29 UTC |
Hypothesis decision flow¶
flowchart TD
A[Backlog observed] --> B{Any retries or poison?}
B -->|No, dequeueCount=1| C{"Any dependency calls/failures?"}
B -->|Yes| X[Investigate poison loop path]
C -->|No| D{Scaled out beyond one instance?}
C -->|Yes| Y[Investigate dependency bottleneck]
D -->|No| E[Confirm single-instance burst absorption]
D -->|Yes| Z[Investigate post-scale throughput mismatch] 3) Runbook¶
Prerequisites¶
- Azure CLI authenticated and subscription selected.
- Query access to Application Insights component
labshared-insights. - Function app and queue resources already deployed in
rg-lab-y1-shared. - Repository paths available locally:
apps/python/blueprints/queue_processor.pylabs/queue-backlog-scaling/producer.py
Variables¶
SUBSCRIPTION_ID="<subscription-id>"
RG="rg-lab-y1-shared"
APP_NAME="labshared-func"
STORAGE_NAME="labsharedstorage"
AI_NAME="labshared-insights"
QUEUE_NAME="work-items"
Step 1: Confirm target resources¶
az account set --subscription "$SUBSCRIPTION_ID"
az functionapp show \
--resource-group "$RG" \
--name "$APP_NAME" \
--output table
az storage account show \
--resource-group "$RG" \
--name "$STORAGE_NAME" \
--output table
az monitor app-insights component show \
--app "$AI_NAME" \
--resource-group "$RG" \
--output table
Step 2: Verify queue trigger configuration in app source¶
Review queue settings and processing code before evidence collection:
- Check
host.jsonqueue config includes:batchSize = 16newBatchThreshold = 8maxDequeueCount = 5visibilityTimeout = "00:00:30"messageEncoding = "none"
Step 3: Run burst injection¶
Use labs/queue-backlog-scaling/producer.py to enqueue 2000 messages to work-items.
# Option A: connection-string auth
export STORAGE_CONNECTION_STRING=$(az storage account show-connection-string \
--resource-group "$RG" \
--name "$STORAGE_NAME" \
--query connectionString \
--output tsv)
python labs/queue-backlog-scaling/producer.py --count 2000 --batch-size 32 --delay-between-batches 0.05
# Option B: identity auth (requires Storage Queue Data Contributor on your user)
export STORAGE_ACCOUNT_NAME="$STORAGE_NAME"
python labs/queue-backlog-scaling/producer.py --count 2000 --use-identity
Reference incident timings from the 2026-04-07 real run:
| Event | Time (UTC) |
|---|---|
| Injection start | 08:23:09 |
| Injection end | 08:23:31 |
| Total sent | 2000 |
| Effective send rate | ~97 msg/s over 20.7s |
Step 4: Collect trace-derived processing evidence¶
Use parse message with * against the message column.
az monitor app-insights query \
--apps "$AI_NAME" \
--resource-group "$RG" \
--analytics-query "traces
| where timestamp between (datetime(2026-04-07T08:22:00Z) .. datetime(2026-04-07T08:35:00Z))
| where cloud_RoleName == 'labshared-func'
| where message has 'QueueProcessed'
| parse message with * 'sequence=' sequence:int ' ageMs=' ageMs:long ' dequeueCount=' dequeueCount:int ' processingMs=' processingMs:int ' instanceId=' instanceId:string ' delayMs=' delayMs:int
| summarize
TotalProcessed=count(),
AvgAgeMs=round(avg(ageMs), 0),
MaxAgeMs=max(ageMs),
MinAgeMs=min(ageMs),
P50AgeMs=round(percentile(ageMs, 50), 0),
P95AgeMs=round(percentile(ageMs, 95), 0),
AvgProcessingMs=round(avg(processingMs), 0),
MaxDequeueCount=max(dequeueCount),
DistinctInstances=dcount(instanceId)"
Per-minute drain curve:
az monitor app-insights query \
--apps "$AI_NAME" \
--resource-group "$RG" \
--analytics-query "traces
| where timestamp between (datetime(2026-04-07T08:22:00Z) .. datetime(2026-04-07T08:35:00Z))
| where cloud_RoleName == 'labshared-func'
| where message has 'QueueProcessed'
| parse message with * 'sequence=' sequence:int ' ageMs=' ageMs:long ' dequeueCount=' dequeueCount:int ' processingMs=' processingMs:int ' instanceId=' instanceId:string ' delayMs=' delayMs:int
| summarize Processed=count() by Minute=bin(timestamp, 1m)
| order by Minute asc"
Per-minute message age progression:
az monitor app-insights query \
--apps "$AI_NAME" \
--resource-group "$RG" \
--analytics-query "traces
| where timestamp between (datetime(2026-04-07T08:22:00Z) .. datetime(2026-04-07T08:35:00Z))
| where cloud_RoleName == 'labshared-func'
| where message has 'QueueProcessed'
| parse message with * 'sequence=' sequence:int ' ageMs=' ageMs:long ' dequeueCount=' dequeueCount:int ' processingMs=' processingMs:int ' instanceId=' instanceId:string ' delayMs=' delayMs:int
| summarize
AvgAgeSec=round(avg(ageMs) / 1000.0, 1),
P50AgeSec=round(percentile(ageMs, 50) / 1000.0, 1),
P95AgeSec=round(percentile(ageMs, 95) / 1000.0, 1),
MaxAgeSec=round(max(ageMs) / 1000.0, 1)
by Minute=bin(timestamp, 1m)
| order by Minute asc"
Step 5: Collect request-duration evidence¶
az monitor app-insights query \
--apps "$AI_NAME" \
--resource-group "$RG" \
--analytics-query "requests
| where timestamp between (datetime(2026-04-07T08:22:00Z) .. datetime(2026-04-07T08:35:00Z))
| where cloud_RoleName == 'labshared-func'
| where name has 'queue_processor'
| summarize
TotalInvocations=count(),
SuccessCount=countif(success == true),
FailCount=countif(success == false),
AvgDurationMs=round(avg(toreal(duration / 1ms)), 2),
P50DurationMs=round(percentile(toreal(duration / 1ms), 50), 2),
P95DurationMs=round(percentile(toreal(duration / 1ms), 95), 2),
MaxDurationMs=round(max(toreal(duration / 1ms)), 2)"
Why request duration differs from processing delay
processingMs in the app log represents the in-handler processing delay (5000 ms). requests.duration is the end-to-end function execution time measured by the Functions host, which includes queue message deserialization, binding setup, worker dispatch overhead, and the handler itself. It is not a batch metric — each request maps to one message — but host-level overhead causes it to exceed the pure handler time.
Step 6: Exclude dependency and poison hypotheses¶
Dependency calls: Source inspection of apps/python/blueprints/queue_processor.py confirms the handler performs no outbound HTTP, database, or storage dependency calls — it only parses the message, sleeps, and logs. The following scoped query validates that no dependency telemetry was emitted during the incident window:
az monitor app-insights query \
--apps "$AI_NAME" \
--resource-group "$RG" \
--analytics-query "dependencies
| where timestamp between (datetime(2026-04-07T08:22:00Z) .. datetime(2026-04-07T08:35:00Z))
| where cloud_RoleName == 'labshared-func'
| where operation_Name has 'queue_processor'
| summarize DependencyCalls=count()"
Poison-loop check: Trace evidence shows MaxDequeueCount = 1 across all 2000 messages, confirming every message succeeded on its first attempt. No poison inspection step is required for this run.
4) Experiment Log¶
Run context¶
| Item | Value |
|---|---|
| Date | 2026-04-07 (UTC) |
| App | labshared-func |
| Plan | Y1 Consumption |
| Queue | work-items |
| Injection window | 08:23:09-08:23:31 |
| Drain completion | ~08:30:29 |
Processing summary (App Insights traces)¶
| Metric | Value |
|---|---|
| TotalProcessed | 2000 |
| AvgAgeMs | 260,587 (~4.3 min) |
| MaxAgeMs | 398,417 (~6.6 min) |
| MinAgeMs | 21,893 (~22s) |
| P50AgeMs | 276,678 (~4.6 min) |
| P95AgeMs | 379,586 (~6.3 min) |
| AvgProcessingMs | 5000 |
| MaxDequeueCount | 1 |
| DistinctInstances | 1 |
Interpretation:
- Backlog age increased because injection rate exceeded single-instance drain rate.
- Processing delay was stable and intentional (
5000 ms). - No retries occurred (
dequeueCountnever exceeded 1). - Y1 did not scale out for this burst (
DistinctInstances = 1).
Request metrics (requests table)¶
| Metric | Value |
|---|---|
| TotalInvocations | 2000 |
| SuccessCount | 2000 |
| FailCount | 0 |
| AvgDurationMs | 16,146.12 |
| P50DurationMs | 14,979.58 |
| P95DurationMs | 24,945.77 |
| MaxDurationMs | 24,979.56 |
Interpretation:
- 100% success confirms this was not a failure-driven backlog.
- Request duration exceeds in-handler
processingMsbecause it includes host overhead (deserialization, worker dispatch, binding setup) — not because of batching. - Each request corresponds to one message (
TotalInvocations = 2000 = TotalProcessed).
Drain curve (per-minute)¶
| Time (UTC) | Processed | Cumulative |
|---|---|---|
| 08:23 | 25 | 25 |
| 08:24 | 115 | 140 |
| 08:25 | 200 | 340 |
| 08:26 | 275 | 615 |
| 08:27 | 355 | 970 |
| 08:28 | 445 | 1415 |
| 08:29 | 518 | 1933 |
| 08:30 | 67 | 2000 |
Drain interpretation:
- Throughput ramped up over time (
25 → 518per minute). The peak rate of 518/min exceeds the theoretical steady-state forbatchSize=16+newBatchThreshold=8(max 24 concurrent × 12 completions/min = 288/min) because per-minute bins capture messages that started processing in a prior minute and completed in the current one, and becauseFUNCTIONS_WORKER_PROCESS_COUNTmay allow multiple language worker processes sharing the sameWEBSITE_INSTANCE_ID. - Ramp-up behavior matches burst absorption, not platform error.
- Final minute contains only residual tail (
67) before full drain.
Message age over time (per-minute)¶
| Time (UTC) | AvgAge(s) | P50Age(s) | P95Age(s) | MaxAge(s) |
|---|---|---|---|---|
| 08:23 | 31.8 | 31.8 | 41.7 | 41.7 |
| 08:24 | 76.7 | 76.2 | 100.4 | 100.5 |
| 08:25 | 133.4 | 133.9 | 158.0 | 158.9 |
| 08:26 | 190.2 | 190.8 | 214.8 | 215.8 |
| 08:27 | 246.1 | 246.3 | 270.8 | 272.1 |
| 08:28 | 301.8 | 302.9 | 326.1 | 329.2 |
| 08:29 | 356.9 | 357.6 | 380.1 | 383.3 |
| 08:30 | 389.3 | 388.4 | 398.3 | 398.4 |
Age interpretation:
- Age rose nearly linearly during drain.
- Growth is expected when burst enqueue rate is higher than initial dequeue throughput.
- This trend alone does not imply failures; it must be correlated with retry and success data.
Evidence timeline¶
sequenceDiagram
participant Producer
participant Queue as work-items
participant Func as labshared-func (single instance)
participant AI as App Insights
Producer->>Queue: Inject 2000 messages (08:23:09-08:23:31)
Func->>Queue: Start dequeue and processing
Func->>AI: QueueProcessed traces (dequeueCount=1)
Note over Func,AI: DistinctInstances remains 1
Func->>AI: requests success=true for all invocations
Note over Queue,Func: Per-minute throughput ramps 25 -> 518
Note over Queue: Queue fully drained around 08:30:29 Final verdict for this run¶
| Question | Answer |
|---|---|
| Did Y1 scale out? | No. Stayed at one instance throughout processing evidence. |
| Was there poison or retry amplification? | No. All messages were first-try success (dequeueCount=1). |
| Was processing time regressed? | No. Processing stayed at configured 5000 ms. |
| Was there dependency bottleneck? | No. This workload had no dependency calls. |
| Why did backlog age rise? | Burst injection rate exceeded single-instance drain rate during early drain. |
| What pattern does this represent? | Burst absorption on one worker, not persistent throughput failure. |
Expected Evidence¶
Before Trigger (Baseline)¶
| Signal | Expected value |
|---|---|
| Queue trigger settings | batchSize=16, newBatchThreshold=8, maxDequeueCount=5, visibilityTimeout=00:00:30, messageEncoding=none |
| Application log shape | QueueProcessed sequence=... ageMs=... dequeueCount=... processingMs=... instanceId=... delayMs=... |
| Processing delay | processingMs around 5000 |
During Incident¶
| Signal | Expected value |
|---|---|
| Injection burst | 2000 messages in ~20.7s (08:23:09 to 08:23:31) |
| Drain profile | Per-minute processed count ramps 25 -> 115 -> 200 -> 275 -> 355 -> 445 -> 518 |
| Scale behavior | DistinctInstances = 1 (no scale-out) |
| Retry/poison behavior | MaxDequeueCount = 1 and no retry-driven amplification |
| Success rate | SuccessCount = 2000, FailCount = 0 |
After Recovery¶
| Signal | Expected value |
|---|---|
| End of drain | Queue fully drained around 08:30:29 UTC |
| Total elapsed from injection start | ~7.3 minutes |
| Max observed age | ~398.4 seconds (398,417 ms) |
| Outcome class | Burst absorbed successfully without scale-out |
Evidence Timeline¶
sequenceDiagram
participant Producer
participant Queue as work-items
participant Func as labshared-func (single instance)
participant AI as App Insights
Producer->>Queue: Inject 2000 messages (08:23:09-08:23:31)
Func->>Queue: Start dequeue and processing
Func->>AI: QueueProcessed traces (dequeueCount=1)
Note over Func,AI: DistinctInstances remains 1
Func->>AI: requests success=true for all invocations
Note over Queue,Func: Per-minute throughput ramps 25 → 518
Note over Queue: Queue fully drained around 08:30:29 Evidence chain: why this proves the conclusion¶
- Injection rate (~97 msg/s) created immediate pressure.
- Trace parsing showed exactly one active instance during the full window.
- Retry indicators stayed flat (
dequeueCount=1), excluding poison-loop tax. - Processing delay remained fixed (
5000 ms), excluding handler regression. - All invocations succeeded and queue drained, proving healthy but bounded single-instance throughput.
Clean Up¶
If this environment was dedicated to lab runs, remove resources:
If using shared resources (rg-lab-y1-shared), skip deletion and only clear test queue data in your operational process.
Related Playbook¶
See Also¶
- Troubleshooting methodology
- First 10 minutes triage
- KQL investigation guide
- Evidence map
- Other lab guides
Sources¶
- https://learn.microsoft.com/azure/azure-functions/functions-bindings-storage-queue-trigger
- https://learn.microsoft.com/azure/azure-functions/functions-host-json
- https://learn.microsoft.com/azure/azure-functions/functions-scale
- https://learn.microsoft.com/azure/azure-functions/monitor-functions
- https://learn.microsoft.com/azure/azure-monitor/logs/log-query-overview