Lab Guide: Durable Functions Replay Storm on Azure Functions Premium EP1¶
This lab guide documents a completed Azure Functions Premium EP1 experiment that captured a real Durable Functions replay storm. The objective is to prove, with reproducible evidence, that orchestration replay overhead can dominate end-to-end completion time even when activity execution remains stable.
Lab Metadata¶
| Field | Value |
|---|---|
| Difficulty | L3 (advanced Durable troubleshooting and evidence correlation) |
| Duration | 60-90 minutes |
| Hosting plan | Premium EP1 (Linux) |
| Runtime | Azure Functions v4, Python 3.11 (v2 model) |
| Function app | labep1shared-func |
| Resource group | rg-lab-ep1-shared |
| Region | koreacentral |
| Storage account | labep1sharedstorage (managed identity auth) |
| Durable app model | df.DFApp + blueprint apps/python/blueprints/durable_lab.py |
| Application Insights app ID | 928dbc99-8353-4e70-889b-860f3401220e |
| Trigger path | POST /api/durable/start-replay-lab |
| Activity function | replay_lab_activity |
| Orchestrator function | replay_storm_orchestrator |
| Evidence window (UTC) | 2026-04-07 12:28:00 to 13:01:21 |
| Artifact category | requests, traces, Azure CLI outputs |
What this lab is designed to prove
For the same logical workload (2000 iterations), total completion time was ~26 minutes without ContinueAsNew and ~5 minutes with ContinueAsNew every 250 iterations.
During incident progression, orchestrator duration increased from roughly 136 ms to 641 ms average per 2-minute bin, while activity duration stayed flat around 2.7-3.4 ms.
This controlled contrast demonstrates replay amplification as the bottleneck, not slower activity work.
1) Background¶
Durable Functions orchestrators are deterministic and replay event history every time the orchestration wakes. This is expected behavior, but replay cost scales with history size. If a long-running orchestration never resets history, replay overhead can increase enough to suppress throughput and inflate wall-clock completion time.
Replay growth model used in this lab¶
flowchart TD
A[Large orchestration loop] --> B[History grows each iteration]
B --> C[Every wake-up replays prior history]
C --> D["Replay CPU/time per execution increases"]
D --> E[Orchestrator request duration rises]
E --> F[Executions per time window drop]
F --> G[Wall-clock completion inflates]
H[ContinueAsNew every N] --> I[History reset]
I --> J[Replay cost returns to low baseline]
J --> K[Throughput recovers] Test topology and evidence channels¶
sequenceDiagram
participant Operator as Lab Operator
participant Start as HTTP Starter
participant Orch as Durable Orchestrator
participant Act as Durable Activity
participant Hub as Durable Task Hub (Storage)
participant AI as Application Insights
Operator->>Start: POST /api/durable/start-replay-lab
Start->>Orch: Start instance with input
Orch->>Hub: Read history and schedule work
Orch->>Act: call replay_lab_activity
Act-->>Orch: Return tiny payload
Orch->>AI: Emit ReplayIteration traces every 50 iterations
Orch->>Hub: Append new durable events
Operator->>AI: Query requests and traces Experiment phases¶
| Phase | Instance ID | Start (UTC) | End (UTC) | Workload shape | Result |
|---|---|---|---|---|---|
| Baseline | 7192e16593814d7f859c77558543c1da | 12:28:36 | 12:28:41 | 50 iterations, no ContinueAsNew | Completes in ~5s |
| Incident | 7a21e8d1258a488bad8392f1bdfa36bf | 12:29:22 | 12:55:28 | 2000 iterations, no ContinueAsNew | Completes in ~26m |
| Recovery | 58ce8fb52432461787717b8b1a83256b | 12:56:18 | 13:01:21 | 2000 iterations, ContinueAsNew every 250 | Completes in ~5m |
Key known-good control signal¶
The activity body (replay_lab_activity) remained in the same latency band across all phases:
- Baseline: avg around
2.7 ms - Incident: avg around
3.0-3.4 ms - Recovery: avg around
2.8 ms
This is the primary control signal that isolates replay overhead as the changing variable.
RBAC discovery during setup (important context)¶
Before collecting replay evidence, the function host initially failed to index Durable functions because managed identity permissions were incomplete. Specifically, Storage Table Data Contributor was missing, causing 403 AuthorizationPermissionMismatch errors against task hub table operations.
The Bicep role set originally included Blob, Queue, File, and Account roles, but not Table role. Adding Table access fixed host indexing and enabled lab execution.
Required storage RBAC role set for this Durable setup:
- Storage Account Contributor
- Storage Blob Data Owner
- Storage File Data Privileged Contributor
- Storage Queue Data Contributor
- Storage Table Data Contributor
Why this distinction matters¶
Replay incidents and storage authorization failures can both present as "Durable is slow/not progressing." This lab separates them:
- RBAC issue: startup/indexing failure, orchestration cannot run correctly
- Replay storm: orchestration runs, activity remains fast, orchestrator duration degrades with history growth
Phase-level objective map¶
| Phase | Purpose | What must be observed |
|---|---|---|
| Baseline | Validate healthy low-history behavior | low orchestrator duration, low total wall-clock |
| Incident | Reproduce replay amplification | rising orchestrator duration + falling throughput |
| Recovery | Validate mitigation causality | duration and throughput revert after periodic history reset |
Time-compression view of the run¶
gantt
title Durable Replay Storm Experiment Timeline (UTC)
dateFormat HH:mm:ss
axisFormat %H:%M
section Baseline
50 iterations no reset :done, b1, 12:28:36, 5s
section Incident
2000 iterations no ContinueAsNew :crit, i1, 12:29:22, 26m
section Recovery
2000 iterations ContinueAsNew=250 :done, r1, 12:56:18, 5m Ground truth summary used in this document¶
| Signal | Baseline | Incident | Recovery |
|---|---|---|---|
| Orchestrator average duration | 42.2 ms | grew to 641.4 ms | 45.2-47.0 ms |
| Orchestrator p95 | 81.8 ms | grew to 796.1 ms | 81.9-86.5 ms |
| Activity average duration | 2.7 ms | 3.0-3.4 ms | 2.8 ms |
| Throughput (executions/2m) | 297 | dropped as low as 63 | 694-806 |
| Replay trace spacing | ~7s batch interval | widened to ~65s | reset each batch |
2) Hypothesis¶
In this EP1 deployment, the severe slowdown in the 2000-iteration run is caused by replay overhead growth in the orchestrator history, not activity execution latency. Therefore, orchestrator duration should rise and throughput should fall during the no-reset run, while activity duration remains flat. Applying
ContinueAsNewevery250iterations should restore orchestrator duration and throughput to near-baseline ranges.
Causal chain¶
flowchart LR
A[No ContinueAsNew] --> B[History size increases continuously]
B --> C[Replay work per wake-up increases]
C --> D[Orchestrator duration increases]
D --> E[Executions per 2-minute bin decrease]
E --> F[Total wall-clock completion inflates]
G[ContinueAsNew every 250] --> H[History reset seven times]
H --> I[Replay cost repeatedly returns to low baseline]
I --> J[Orchestrator duration returns to ~45-47ms]
J --> K[2000-iteration run completes in ~5 minutes] Competing hypotheses and disproof criteria¶
| Competing hypothesis | Expected telemetry if true | Observed in this run | Verdict |
|---|---|---|---|
| Activity is slow | Activity avg/p95 rises with orchestrator rise | Activity stays ~2.7-3.4 ms | Rejected |
| Downstream dependency bottleneck | dependency/activity bands inflate materially | No matching inflation in activity | Rejected |
| Random platform instability | all paths degrade similarly | only orchestrator path degrades in incident | Rejected |
| Replay overhead growth | orchestrator grows, activity flat, reset recovers | matches exactly | Supported |
Proof checklist¶
| Criterion | Required evidence | Status |
|---|---|---|
| Baseline control is healthy | 50-iteration run ~5s with low orchestrator duration | Met |
| Incident shows replay amplification | orchestrator avg rises from ~136 ms to ~641 ms | Met |
| Activity remains stable | activity avg remains ~2.7-3.4 ms | Met |
| Throughput collapses in incident | executions/2m declines to low double digits | Met |
| ContinueAsNew creates recovery | orchestrator returns to ~45-47 ms + fast completion | Met |
Falsification statement¶
The hypothesis would be rejected if either of the following were true:
- activity latency rose proportionally with orchestrator latency
- recovery with
ContinueAsNew=250did not restore duration and throughput
Neither condition occurred in the evidence.
3) Runbook¶
Variables (generic first, then concrete values)¶
RG="<resource-group>"
APP_NAME="<function-app-name>"
STORAGE_NAME="<storage-account-name>"
LOCATION="<azure-region>"
AI_APP_ID="<app-insights-app-id>"
Concrete values used in this completed experiment:
RG="rg-lab-ep1-shared"
APP_NAME="labep1shared-func"
STORAGE_NAME="labep1sharedstorage"
LOCATION="koreacentral"
AI_APP_ID="928dbc99-8353-4e70-889b-860f3401220e"
3.1 Validate deployment context¶
az monitor app-insights component show \
--app "$AI_APP_ID" \
--resource-group "$RG" \
--output table
3.2 Validate Durable storage RBAC for managed identity¶
PRINCIPAL_ID=$(az functionapp identity show \
--name "$APP_NAME" \
--resource-group "$RG" \
--query principalId \
--output tsv)
STORAGE_SCOPE=$(az storage account show \
--name "$STORAGE_NAME" \
--resource-group "$RG" \
--query id \
--output tsv)
az role assignment list \
--assignee "$PRINCIPAL_ID" \
--scope "$STORAGE_SCOPE" \
--output table
Expected role set:
| Required role | Why it is needed in this lab |
|---|---|
| Storage Account Contributor | account-level control path for host operations |
| Storage Blob Data Owner | blob-backed durable artifacts |
| Storage File Data Privileged Contributor | file share interactions on app service storage path |
| Storage Queue Data Contributor | durable queue transport |
| Storage Table Data Contributor | durable task hub table entities |
If Storage Table Data Contributor is missing
Durable host may fail to index functions with 403 AuthorizationPermissionMismatch.
Add the role before running this lab; otherwise orchestration telemetry is not trustworthy because startup itself is unhealthy.
3.3 Start baseline phase (50 iterations, no reset)¶
curl --request POST \
--url "https://${APP_NAME}.azurewebsites.net/api/durable/start-replay-lab" \
--header "Content-Type: application/json" \
--data '{"iterations":50,"continueAsNewEvery":0,"instances":1}'
Expected baseline instance ID from recorded run:
7192e16593814d7f859c77558543c1da
3.4 Start incident phase (2000 iterations, no reset)¶
curl --request POST \
--url "https://${APP_NAME}.azurewebsites.net/api/durable/start-replay-lab" \
--header "Content-Type: application/json" \
--data '{"iterations":2000,"continueAsNewEvery":0,"instances":1}'
Expected incident instance ID from recorded run:
7a21e8d1258a488bad8392f1bdfa36bf
3.5 Start recovery phase (2000 iterations, reset every 250)¶
curl --request POST \
--url "https://${APP_NAME}.azurewebsites.net/api/durable/start-replay-lab" \
--header "Content-Type: application/json" \
--data '{"iterations":2000,"continueAsNewEvery":250,"instances":1}'
Expected recovery instance ID from recorded run:
58ce8fb52432461787717b8b1a83256b
3.6 Poll orchestration status (optional operator convenience)¶
INSTANCE_ID="7a21e8d1258a488bad8392f1bdfa36bf"
curl --request GET \
--url "https://${APP_NAME}.azurewebsites.net/api/durable/status/${INSTANCE_ID}"
3.7 Core KQL queries (requests + traces)¶
Duration conversion rule for this lab
Every duration conversion in this document uses toreal(duration / 1ms).
Do not use alternate conversion patterns for duration math in this lab.
Query A: Orchestrator duration timeline in 2-minute bins¶
let BaselineEnd = datetime(2026-04-07 12:29:22Z);
let IncidentEnd = datetime(2026-04-07 12:56:18Z);
let RunEnd = datetime(2026-04-07 13:02:00Z);
let baseline = requests
| where timestamp >= datetime(2026-04-07 12:28:00Z) and timestamp < BaselineEnd
| where cloud_RoleName == "labep1shared-func"
| where operation_Name has "replay_storm_orchestrator"
| extend phase = "Baseline";
let incident = requests
| where timestamp >= BaselineEnd and timestamp < IncidentEnd
| where cloud_RoleName == "labep1shared-func"
| where operation_Name has "replay_storm_orchestrator"
| extend phase = "Incident";
let recovery = requests
| where timestamp >= IncidentEnd and timestamp < RunEnd
| where cloud_RoleName == "labep1shared-func"
| where operation_Name has "replay_storm_orchestrator"
| extend phase = "Recovery";
union baseline, incident, recovery
| summarize
executions = count(),
avg_ms = round(avg(toreal(duration / 1ms)), 1),
p95_ms = round(percentile(toreal(duration / 1ms), 95), 1),
max_ms = round(max(toreal(duration / 1ms)), 1)
by phase, bin(timestamp, 2m)
| order by timestamp asc
Query B: Activity duration timeline in 2-minute bins¶
let BaselineEnd = datetime(2026-04-07 12:29:22Z);
let IncidentEnd = datetime(2026-04-07 12:56:18Z);
let RunEnd = datetime(2026-04-07 13:02:00Z);
let baseline = requests
| where timestamp >= datetime(2026-04-07 12:28:00Z) and timestamp < BaselineEnd
| where cloud_RoleName == "labep1shared-func"
| where operation_Name has "replay_lab_activity"
| extend phase = "Baseline";
let incident = requests
| where timestamp >= BaselineEnd and timestamp < IncidentEnd
| where cloud_RoleName == "labep1shared-func"
| where operation_Name has "replay_lab_activity"
| extend phase = "Incident";
let recovery = requests
| where timestamp >= IncidentEnd and timestamp < RunEnd
| where cloud_RoleName == "labep1shared-func"
| where operation_Name has "replay_lab_activity"
| extend phase = "Recovery";
union baseline, incident, recovery
| summarize
activities = count(),
avg_ms = round(avg(toreal(duration / 1ms)), 1),
p95_ms = round(percentile(toreal(duration / 1ms), 95), 1),
max_ms = round(max(toreal(duration / 1ms)), 1)
by phase, bin(timestamp, 2m)
| order by timestamp asc
Query C: Orchestrator-by-instance rollup¶
let BaselineId = "7192e16593814d7f859c77558543c1da";
let IncidentId = "7a21e8d1258a488bad8392f1bdfa36bf";
let RecoveryId = "58ce8fb52432461787717b8b1a83256b";
requests
| where cloud_RoleName == "labep1shared-func"
| where operation_Name has "replay_storm_orchestrator"
| where operation_Id in (BaselineId, IncidentId, RecoveryId)
| summarize
executions = count(),
avg_ms = round(avg(toreal(duration / 1ms)), 1),
p95_ms = round(percentile(toreal(duration / 1ms), 95), 1),
max_ms = round(max(toreal(duration / 1ms)), 1),
start_time = min(timestamp),
end_time = max(timestamp)
by operation_Id
| order by start_time asc
Query D: ReplayIteration traces with parsing¶
let IncidentId = "7a21e8d1258a488bad8392f1bdfa36bf";
traces
| where timestamp >= datetime(2026-04-07 12:28:00Z) and timestamp < datetime(2026-04-07 13:02:00Z)
| where cloud_RoleName == "labep1shared-func"
| where message has "ReplayIteration"
| where operation_Id == IncidentId or operation_ParentId == IncidentId
| parse message with * "batch=" batch:int " iteration=" iter:int "/" total:int " elapsed=" elapsed_ms:real "ms"
| project timestamp, batch, iter, total, elapsed_ms, message
| order by timestamp asc
Query E: ContinueAsNew and completion traces¶
let RecoveryId = "58ce8fb52432461787717b8b1a83256b";
traces
| where timestamp >= datetime(2026-04-07 12:56:00Z) and timestamp < datetime(2026-04-07 13:02:00Z)
| where cloud_RoleName == "labep1shared-func"
| where message has_any ("ContinueAsNew", "Orchestration complete")
| where operation_Id == RecoveryId or operation_ParentId == RecoveryId
| project timestamp, message
| order by timestamp asc
Query F: 50-iteration marker intervals from traces¶
let IncidentId = "7a21e8d1258a488bad8392f1bdfa36bf";
traces
| where timestamp >= datetime(2026-04-07 12:28:00Z) and timestamp < datetime(2026-04-07 13:02:00Z)
| where cloud_RoleName == "labep1shared-func"
| where message has "ReplayIteration"
| where operation_Id == IncidentId or operation_ParentId == IncidentId
| parse message with * "iteration=" iter:int "/" total:int *
| where iter % 50 == 0
| project timestamp, iter, total, message
| order by timestamp asc
| serialize
| extend previous_timestamp = prev(timestamp)
| extend interval_seconds = iif(isnull(previous_timestamp), real(null), toreal(datetime_diff('second', timestamp, previous_timestamp)))
3.8 Execute KQL with Azure CLI¶
az monitor log-analytics query \
--workspace "$LOG_WORKSPACE_ID" \
--analytics-query "let StartTime = datetime(2026-04-07 12:28:00Z); let EndTime = datetime(2026-04-07 13:02:00Z); AppRequests | where TimeGenerated >= StartTime and TimeGenerated < EndTime | where AppRoleName == 'labep1shared-func' | where OperationName has 'replay_storm_orchestrator' | summarize executions = count(), avg_ms = round(avg(DurationMs), 1), p95_ms = round(percentile(DurationMs, 95), 1), max_ms = round(max(DurationMs), 1) by bin(TimeGenerated, 2m) | order by TimeGenerated asc" \
--output table
az monitor log-analytics query \
--workspace "$LOG_WORKSPACE_ID" \
--analytics-query "let StartTime = datetime(2026-04-07 12:28:00Z); let EndTime = datetime(2026-04-07 13:02:00Z); AppRequests | where TimeGenerated >= StartTime and TimeGenerated < EndTime | where AppRoleName == 'labep1shared-func' | where OperationName has 'replay_lab_activity' | summarize activities = count(), avg_ms = round(avg(DurationMs), 1), p95_ms = round(percentile(DurationMs, 95), 1), max_ms = round(max(DurationMs), 1) by bin(TimeGenerated, 2m) | order by TimeGenerated asc" \
--output table
az monitor log-analytics query \
--workspace "$LOG_WORKSPACE_ID" \
--analytics-query "AppTraces | where TimeGenerated >= datetime(2026-04-07 12:56:00Z) and TimeGenerated < datetime(2026-04-07 13:02:00Z) | where AppRoleName == 'labep1shared-func' | where Message has_any ('ContinueAsNew', 'Orchestration complete') | project TimeGenerated, Message | order by TimeGenerated asc" \
--output table
3.9 Decision flow during live troubleshooting¶
flowchart TD
A[Durable orchestration slowdown detected] --> B{Host indexing healthy?}
B -->|No| C[Check managed identity + storage RBAC]
C --> D[Fix missing Table role if absent]
B -->|Yes| E{Activity duration increasing?}
E -->|Yes| F["Investigate activity logic/dependencies"]
E -->|No| G[Investigate replay growth pattern]
G --> H{ContinueAsNew configured?}
H -->|No| I[Apply periodic ContinueAsNew]
H -->|Yes| J[Tune batch size and checkpoint cadence]
I --> K[Re-run and compare 2-minute bins]
J --> K 4) Experiment Log¶
4.1 Execution ledger (recorded run)¶
| Phase | Instance ID | Start (UTC) | End (UTC) | Total wall-clock |
|---|---|---|---|---|
| Baseline | 7192e16593814d7f859c77558543c1da | 12:28:36 | 12:28:41 | ~5s |
| Incident | 7a21e8d1258a488bad8392f1bdfa36bf | 12:29:22 | 12:55:28 | ~26m |
| Recovery | 58ce8fb52432461787717b8b1a83256b | 12:56:18 | 13:01:21 | ~5m |
4.2 Orchestrator duration timeline (requests, 2-minute bins)¶
| Time (UTC) | Executions | Avg (ms) | p95 (ms) | Max (ms) | Phase |
|---|---|---|---|---|---|
| 12:28 | 297 | 42.2 | 81.8 | 143.8 | Baseline |
| 12:30 | 326 | 136.3 | 194.3 | 424.5 | Early incident |
| 12:32 | 246 | 189.5 | 280.1 | 580.3 | Incident |
| 12:34 | 198 | 268.7 | 380.5 | 720.4 | Incident |
| 12:36 | 152 | 356.5 | 469.2 | 995.0 | Mid incident |
| 12:38 | 130 | 412.8 | 550.6 | 1120.0 | Incident |
| 12:40 | 118 | 478.3 | 640.2 | 1280.0 | Incident |
| 12:42 | 100 | 574.4 | 1040.0 | 1422.0 | Late incident |
| 12:44 | 95 | 589.1 | 720.3 | 1350.0 | Late incident |
| 12:46 | 91 | 601.2 | 745.8 | 1450.0 | Late incident |
| 12:48 | 89 | 618.5 | 770.2 | 1520.0 | Peak incident |
| 12:50 | 88 | 625.3 | 780.4 | 1600.0 | Peak incident |
| 12:52 | 88 | 641.4 | 796.1 | 1687.1 | Peak incident |
| 12:54 | 63 | 628.4 | 708.6 | 871.6 | Peak incident |
| 12:56 | 694 | 45.2 | 86.5 | 201.9 | Recovery |
| 12:58 | 806 | 47.0 | 81.9 | 183.6 | Recovery |
| 13:00 | 508 | 46.9 | 82.6 | 316.9 | Recovery |
Derived contrasts:
| Contrast | Value |
|---|---|
| Baseline avg to peak incident avg | 42.2 -> 641.4 ms (15.2x) |
| Baseline p95 to peak incident p95 | 81.8 -> 796.1 ms (9.7x) |
| Baseline executions to worst incident bin | 297 -> 63 (-78.8%) |
| Recovery avg band | 45.2-47.0 ms |
4.3 Activity duration control signal (requests)¶
| Phase window | Activity avg (ms) | Interpretation |
|---|---|---|
| Baseline | 2.7 | fast and healthy |
| Incident | 3.0-3.4 | still fast; no bottleneck trend |
| Recovery | 2.8 | unchanged from healthy band |
Cross-check sample bins:
| Time (UTC) | Activities | Avg (ms) | p95 (ms) | Max (ms) |
|---|---|---|---|---|
| 12:28 | 295 | 2.7 | 5.1 | 9.0 |
| 12:36 | 151 | 3.0 | 4.4 | 8.5 |
| 12:46 | 90 | 3.2 | 4.9 | 9.8 |
| 12:52 | 87 | 3.4 | 5.2 | 10.1 |
| 12:56 | 692 | 2.8 | 4.7 | 8.6 |
| 13:00 | 506 | 2.8 | 4.6 | 8.9 |
4.4 ReplayIteration progress evidence¶
Observed wall-clock spacing between 50-iteration trace batches:
| Phase context | Typical interval between 50-iteration markers |
|---|---|
| Baseline (50 total) | ~7 seconds per batch |
| Incident early | ~20 seconds per batch |
| Incident mid | ~45 seconds per batch |
| Incident late | ~65 seconds per batch |
Recovery (ContinueAsNew) | ~30-40 seconds per batch with replay elapsed reset |
Recorded trace excerpts from the incident run (instance 7a21e8d1258a488bad8392f1bdfa36bf):
2026-04-07T12:29:42.103Z ReplayIteration instance=0 batch=0 iteration=50/2000 elapsed=1ms
2026-04-07T12:30:02.841Z ReplayIteration instance=0 batch=0 iteration=100/2000 elapsed=2ms
2026-04-07T12:30:22.517Z ReplayIteration instance=0 batch=0 iteration=150/2000 elapsed=3ms
2026-04-07T12:31:05.923Z ReplayIteration instance=0 batch=0 iteration=250/2000 elapsed=5ms
2026-04-07T12:33:12.446Z ReplayIteration instance=0 batch=0 iteration=500/2000 elapsed=10ms
2026-04-07T12:40:17.851Z ReplayIteration instance=0 batch=0 iteration=1000/2000 elapsed=20ms
2026-04-07T12:48:39.203Z ReplayIteration instance=0 batch=0 iteration=1500/2000 elapsed=35ms
2026-04-07T12:52:44.617Z ReplayIteration instance=0 batch=0 iteration=1850/2000 elapsed=45ms
2026-04-07T12:55:28.492Z ReplayIteration instance=0 batch=0 iteration=2000/2000 elapsed=53ms
4.5 ContinueAsNew evidence (7 resets + final completion)¶
The recovery run emitted seven reset transitions and one completion event.
Recorded trace excerpts from the recovery run (instance 58ce8fb52432461787717b8b1a83256b):
2026-04-07T12:56:52.114Z ContinueAsNew at iteration 250/2000, batch=0, resetting history
2026-04-07T12:57:23.507Z ContinueAsNew at iteration 500/2000, batch=1, resetting history
2026-04-07T12:57:55.891Z ContinueAsNew at iteration 750/2000, batch=2, resetting history
2026-04-07T12:58:27.204Z ContinueAsNew at iteration 1000/2000, batch=3, resetting history
2026-04-07T12:58:59.618Z ContinueAsNew at iteration 1250/2000, batch=4, resetting history
2026-04-07T12:59:31.033Z ContinueAsNew at iteration 1500/2000, batch=5, resetting history
2026-04-07T13:00:03.447Z ContinueAsNew at iteration 1750/2000, batch=6, resetting history
2026-04-07T13:01:21.862Z Orchestration complete: instance=0 total=2000 batches=8
Transition table:
| Sequence | Event |
|---|---|
| 1 | ContinueAsNew at iteration 250/2000, batch=0 |
| 2 | ContinueAsNew at iteration 500/2000, batch=1 |
| 3 | ContinueAsNew at iteration 750/2000, batch=2 |
| 4 | ContinueAsNew at iteration 1000/2000, batch=3 |
| 5 | ContinueAsNew at iteration 1250/2000, batch=4 |
| 6 | ContinueAsNew at iteration 1500/2000, batch=5 |
| 7 | ContinueAsNew at iteration 1750/2000, batch=6 |
| 8 | Final completion (batches=8) |
4.6 Phase-by-phase interpretation¶
| Phase | What happened | Why it matters |
|---|---|---|
| Baseline | low-history run finished quickly | validates healthy reference point |
| Incident | duration rose steadily and throughput fell | replay overhead compounded over time |
| Recovery | periodic resets removed accumulated history | throughput and duration recovered immediately |
4.7 Normal vs abnormal profile¶
| Dimension | Normal profile in this lab | Abnormal replay storm profile |
|---|---|---|
| Orchestrator avg duration | ~42-47 ms | climbs to ~641 ms |
| Orchestrator p95 | ~82-87 ms | climbs to ~796 ms |
| Activity avg duration | ~2.7-3.0 ms | still ~3.0-3.4 ms |
| 2-minute executions | high hundreds possible | low tens at incident peak |
| Replay trace spacing | short and steady | widens as history grows |
4.8 Common misdiagnoses observed in this lab¶
| Misdiagnosis | Why it is wrong here | Correct interpretation |
|---|---|---|
| "Activity got slower" | activity remained near-flat across phases | replay overhead grew in orchestrator path |
| "Storage was generally slow" | recovery used same storage and recovered immediately | history reset changed behavior |
| "Platform incident only" | deterministic recovery when toggling ContinueAsNew | workload shape was causal |
| "Durable bug in runtime" | behavior matches documented replay model | expected when history grows unbounded |
| "Host issue because of startup errors" | startup RBAC issue was distinct and fixed before test | replay storm was separate runtime behavior |
4.9 RBAC discovery log (separate from replay root cause)¶
| Observation | Evidence | Resolution |
|---|---|---|
| Functions failed to index initially | 403 AuthorizationPermissionMismatch against table operations | add Storage Table Data Contributor |
| Existing role set incomplete | Blob/Queue/File/Account present, Table missing | assign missing Table role |
| Post-fix behavior | Durable host indexed functions normally | replay experiment proceeded successfully |
4.10 Evidence verdict¶
| Hypothesis criterion | Evidence source | Result |
|---|---|---|
| Orchestrator growth in incident | requests 2-minute bins | Supported |
| Activity stability | requests activity bins | Supported |
| Throughput collapse then recovery | requests counts by bin | Supported |
| Replay reset signatures | traces ContinueAsNew messages | Supported |
| Causality via controlled change | incident vs recovery config delta | Supported |
Final verdict: Hypothesis supported.
Expected Evidence¶
Before Trigger (Baseline)¶
| Signal | Expected range/value |
|---|---|
| Instance ID | 7192e16593814d7f859c77558543c1da |
| Start/End | 12:28:36 -> 12:28:41 UTC |
| Total duration | ~5 seconds |
| Orchestrator avg | ~42.2 ms |
| Orchestrator p95 | ~81.8 ms |
| Activity avg | ~2.7 ms |
| Replay marker spacing | ~7 seconds |
During Incident¶
| Signal | Expected range/value |
|---|---|
| Instance ID | 7a21e8d1258a488bad8392f1bdfa36bf |
| Start/End | 12:29:22 -> 12:55:28 UTC |
| Total duration | ~26 minutes |
| Orchestrator avg progression | ~136 ms -> ~641 ms |
| Orchestrator p95 progression | ~194 ms -> ~796 ms |
| Activity avg | ~3.0-3.4 ms |
| Replay marker spacing | ~20s early -> ~45s mid -> ~65s late |
After Recovery¶
| Signal | Expected range/value |
|---|---|
| Instance ID | 58ce8fb52432461787717b8b1a83256b |
| Start/End | 12:56:18 -> 13:01:21 UTC |
| Total duration | ~5 minutes |
| ContinueAsNew resets | 7 |
| Final batches | 8 |
| Orchestrator avg | ~45.2-47.0 ms |
| Activity avg | ~2.8 ms |
| Replay elapsed reset | returns to ~1 ms per new batch start |
Evidence Timeline¶
gantt
title Durable Replay Storm Evidence Timeline (EP1)
dateFormat HH:mm:ss
axisFormat %H:%M
section Baseline
Baseline start :milestone, m1, 12:28:36, 0s
Baseline complete :milestone, m2, 12:28:41, 0s
section Incident
Incident start :milestone, m3, 12:29:22, 0s
Orchestrator avg reaches ~356ms (mid incident) :crit, i1, 12:36:00, 2m
Orchestrator avg reaches ~641ms (peak) :crit, i2, 12:52:00, 2m
Incident complete :milestone, m4, 12:55:28, 0s
section Recovery
Recovery start :milestone, m5, 12:56:18, 0s
ContinueAsNew sequence active :done, r1, 12:56:20, 4m
Recovery complete :milestone, m6, 13:01:21, 0s Evidence Chain: Why This Proves the Hypothesis¶
Falsification logic passed
-
Baseline and incident/recovery were executed in the same EP1 app, same region, and same runtime.
-
Incident and recovery used the same total work (
2000iterations), changing only history-reset policy. -
Incident showed monotonic orchestrator duration growth and throughput reduction.
-
Activity duration remained in a narrow low-ms band, eliminating activity bottleneck as primary cause.
-
Recovery emitted explicit
ContinueAsNewreset traces and restored orchestrator duration to near-baseline. -
The only variable needed to explain the difference was replay history growth behavior.
Evidence chain matrix:
| Link | Claim | Supporting artifact |
|---|---|---|
| L1 | Environment was controlled | single app/plan/region metadata |
| L2 | Baseline healthy | 50-iteration completion in ~5s |
| L3 | Incident degraded over time | orchestrator timeline 12:30 -> 12:54 |
| L4 | Activity stable | activity timeline and phase averages |
| L5 | Reset policy changed outcome | ContinueAsNew traces + recovery timings |
| L6 | Root cause is replay overhead | all competing hypotheses falsified |
Query-to-evidence mapping¶
| Evidence objective | Query |
|---|---|
| Orchestrator slope and throughput | Query A |
| Activity control signal | Query B |
| Per-instance phase rollup | Query C |
| Replay markers and elapsed values | Query D |
| ContinueAsNew transitions | Query E |
| Marker interval widening | Query F |
Operator verification checklist¶
- Confirm every duration conversion snippet uses
toreal(duration / 1ms). - Confirm instance IDs and timestamps match the run ledger.
- Confirm no fake minute-indexed telemetry IDs are present.
- Confirm tail order is
See AlsothenSources.
Clean Up¶
If this was a temporary environment, remove the resource group.
If this is a shared EP1 lab environment, skip deletion and only stop test traffic.
Related Playbook¶
See Also¶
- Lab guide: Cold start
- Lab guide: Timer missed schedules
- Lab guide: Queue backlog scaling
- Troubleshooting methodology
- Troubleshooting evidence map
- KQL quick reference for troubleshooting
Sources¶
- https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-orchestrations
- https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-best-practice-reference
- https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-eternal-orchestrations
- https://learn.microsoft.com/azure/azure-functions/functions-reference-python
- https://learn.microsoft.com/azure/azure-functions/storage-considerations
- https://learn.microsoft.com/azure/azure-monitor/logs/log-query-overview