Lab Guide: Durable Functions Replay Storm on Azure Functions Premium EP1¶

This lab guide documents a completed Azure Functions Premium EP1 experiment that captured a real Durable Functions replay storm. The objective is to prove, with reproducible evidence, that orchestration replay overhead can dominate end-to-end completion time even when activity execution remains stable.

Lab Metadata¶

Field	Value
Difficulty	L3 (advanced Durable troubleshooting and evidence correlation)
Duration	60-90 minutes
Hosting plan	Premium EP1 (Linux)
Runtime	Azure Functions v4, Python 3.11 (v2 model)
Function app	`labep1shared-func`
Resource group	`rg-lab-ep1-shared`
Region	`koreacentral`
Storage account	`labep1sharedstorage` (managed identity auth)
Durable app model	`df.DFApp` + blueprint `apps/python/blueprints/durable_lab.py`
Application Insights app ID	`928dbc99-8353-4e70-889b-860f3401220e`
Trigger path	`POST /api/durable/start-replay-lab`
Activity function	`replay_lab_activity`
Orchestrator function	`replay_storm_orchestrator`
Evidence window (UTC)	2026-04-07 12:28:00 to 13:01:21
Artifact category	`requests`, `traces`, Azure CLI outputs

What this lab is designed to prove

For the same logical workload (2000 iterations), total completion time was ~26 minutes without ContinueAsNew and ~5 minutes with ContinueAsNew every 250 iterations.

During incident progression, orchestrator duration increased from roughly 136 ms to 641 ms average per 2-minute bin, while activity duration stayed flat around 2.7-3.4 ms.

This controlled contrast demonstrates replay amplification as the bottleneck, not slower activity work.

1) Background¶

Durable Functions orchestrators are deterministic and replay event history every time the orchestration wakes. This is expected behavior, but replay cost scales with history size. If a long-running orchestration never resets history, replay overhead can increase enough to suppress throughput and inflate wall-clock completion time.

Replay growth model used in this lab¶

flowchart TD
    A[Large orchestration loop] --> B[History grows each iteration]
    B --> C[Every wake-up replays prior history]
    C --> D["Replay CPU/time per execution increases"]
    D --> E[Orchestrator request duration rises]
    E --> F[Executions per time window drop]
    F --> G[Wall-clock completion inflates]
    H[ContinueAsNew every N] --> I[History reset]
    I --> J[Replay cost returns to low baseline]
    J --> K[Throughput recovers]

Test topology and evidence channels¶

sequenceDiagram
    participant Operator as Lab Operator
    participant Start as HTTP Starter
    participant Orch as Durable Orchestrator
    participant Act as Durable Activity
    participant Hub as Durable Task Hub (Storage)
    participant AI as Application Insights

    Operator->>Start: POST /api/durable/start-replay-lab
    Start->>Orch: Start instance with input
    Orch->>Hub: Read history and schedule work
    Orch->>Act: call replay_lab_activity
    Act-->>Orch: Return tiny payload
    Orch->>AI: Emit ReplayIteration traces every 50 iterations
    Orch->>Hub: Append new durable events
    Operator->>AI: Query requests and traces

Experiment phases¶

Phase	Instance ID	Start (UTC)	End (UTC)	Workload shape	Result
Baseline	`7192e16593814d7f859c77558543c1da`	12:28:36	12:28:41	50 iterations, no `ContinueAsNew`	Completes in ~5s
Incident	`7a21e8d1258a488bad8392f1bdfa36bf`	12:29:22	12:55:28	2000 iterations, no `ContinueAsNew`	Completes in ~26m
Recovery	`58ce8fb52432461787717b8b1a83256b`	12:56:18	13:01:21	2000 iterations, `ContinueAsNew` every 250	Completes in ~5m

Key known-good control signal¶

The activity body (replay_lab_activity) remained in the same latency band across all phases:

Baseline: avg around 2.7 ms
Incident: avg around 3.0-3.4 ms
Recovery: avg around 2.8 ms

This is the primary control signal that isolates replay overhead as the changing variable.

RBAC discovery during setup (important context)¶

Before collecting replay evidence, the function host initially failed to index Durable functions because managed identity permissions were incomplete. Specifically, Storage Table Data Contributor was missing, causing 403 AuthorizationPermissionMismatch errors against task hub table operations.

The Bicep role set originally included Blob, Queue, File, and Account roles, but not Table role. Adding Table access fixed host indexing and enabled lab execution.

Required storage RBAC role set for this Durable setup:

Storage Account Contributor
Storage Blob Data Owner
Storage File Data Privileged Contributor
Storage Queue Data Contributor
Storage Table Data Contributor

Why this distinction matters¶

Replay incidents and storage authorization failures can both present as "Durable is slow/not progressing." This lab separates them:

RBAC issue: startup/indexing failure, orchestration cannot run correctly
Replay storm: orchestration runs, activity remains fast, orchestrator duration degrades with history growth

Phase-level objective map¶

Phase	Purpose	What must be observed
Baseline	Validate healthy low-history behavior	low orchestrator duration, low total wall-clock
Incident	Reproduce replay amplification	rising orchestrator duration + falling throughput
Recovery	Validate mitigation causality	duration and throughput revert after periodic history reset

Time-compression view of the run¶

gantt
    title Durable Replay Storm Experiment Timeline (UTC)
    dateFormat  HH:mm:ss
    axisFormat  %H:%M
    section Baseline
    50 iterations no reset                 :done, b1, 12:28:36, 5s
    section Incident
    2000 iterations no ContinueAsNew       :crit, i1, 12:29:22, 26m
    section Recovery
    2000 iterations ContinueAsNew=250      :done, r1, 12:56:18, 5m

Ground truth summary used in this document¶

Signal	Baseline	Incident	Recovery
Orchestrator average duration	`42.2 ms`	grew to `641.4 ms`	`45.2-47.0 ms`
Orchestrator p95	`81.8 ms`	grew to `796.1 ms`	`81.9-86.5 ms`
Activity average duration	`2.7 ms`	`3.0-3.4 ms`	`2.8 ms`
Throughput (executions/2m)	`297`	dropped as low as `63`	`694-806`
Replay trace spacing	~7s batch interval	widened to ~65s	reset each batch

2) Hypothesis¶

In this EP1 deployment, the severe slowdown in the 2000-iteration run is caused by replay overhead growth in the orchestrator history, not activity execution latency. Therefore, orchestrator duration should rise and throughput should fall during the no-reset run, while activity duration remains flat. Applying ContinueAsNew every 250 iterations should restore orchestrator duration and throughput to near-baseline ranges.

Causal chain¶

flowchart LR
    A[No ContinueAsNew] --> B[History size increases continuously]
    B --> C[Replay work per wake-up increases]
    C --> D[Orchestrator duration increases]
    D --> E[Executions per 2-minute bin decrease]
    E --> F[Total wall-clock completion inflates]
    G[ContinueAsNew every 250] --> H[History reset seven times]
    H --> I[Replay cost repeatedly returns to low baseline]
    I --> J[Orchestrator duration returns to ~45-47ms]
    J --> K[2000-iteration run completes in ~5 minutes]

Competing hypotheses and disproof criteria¶

Competing hypothesis	Expected telemetry if true	Observed in this run	Verdict
Activity is slow	Activity avg/p95 rises with orchestrator rise	Activity stays ~`2.7-3.4 ms`	Rejected
Downstream dependency bottleneck	dependency/activity bands inflate materially	No matching inflation in activity	Rejected
Random platform instability	all paths degrade similarly	only orchestrator path degrades in incident	Rejected
Replay overhead growth	orchestrator grows, activity flat, reset recovers	matches exactly	Supported

Proof checklist¶

Criterion	Required evidence	Status
Baseline control is healthy	50-iteration run ~5s with low orchestrator duration	Met
Incident shows replay amplification	orchestrator avg rises from ~136 ms to ~641 ms	Met
Activity remains stable	activity avg remains ~2.7-3.4 ms	Met
Throughput collapses in incident	executions/2m declines to low double digits	Met
ContinueAsNew creates recovery	orchestrator returns to ~45-47 ms + fast completion	Met

Falsification statement¶

The hypothesis would be rejected if either of the following were true:

activity latency rose proportionally with orchestrator latency
recovery with ContinueAsNew=250 did not restore duration and throughput

Neither condition occurred in the evidence.

3) Runbook¶

Variables (generic first, then concrete values)¶

RG="<resource-group>"
APP_NAME="<function-app-name>"
STORAGE_NAME="<storage-account-name>"
LOCATION="<azure-region>"
AI_APP_ID="<app-insights-app-id>"

Concrete values used in this completed experiment:

RG="rg-lab-ep1-shared"
APP_NAME="labep1shared-func"
STORAGE_NAME="labep1sharedstorage"
LOCATION="koreacentral"
AI_APP_ID="928dbc99-8353-4e70-889b-860f3401220e"

3.1 Validate deployment context¶

az functionapp show \
  --name "$APP_NAME" \
  --resource-group "$RG" \
  --output table

az storage account show \
  --name "$STORAGE_NAME" \
  --resource-group "$RG" \
  --output table

az monitor app-insights component show \
  --app "$AI_APP_ID" \
  --resource-group "$RG" \
  --output table

3.2 Validate Durable storage RBAC for managed identity¶

PRINCIPAL_ID=$(az functionapp identity show \
  --name "$APP_NAME" \
  --resource-group "$RG" \
  --query principalId \
  --output tsv)

STORAGE_SCOPE=$(az storage account show \
  --name "$STORAGE_NAME" \
  --resource-group "$RG" \
  --query id \
  --output tsv)

az role assignment list \
  --assignee "$PRINCIPAL_ID" \
  --scope "$STORAGE_SCOPE" \
  --output table

Expected role set:

Required role	Why it is needed in this lab
Storage Account Contributor	account-level control path for host operations
Storage Blob Data Owner	blob-backed durable artifacts
Storage File Data Privileged Contributor	file share interactions on app service storage path
Storage Queue Data Contributor	durable queue transport
Storage Table Data Contributor	durable task hub table entities

If Storage Table Data Contributor is missing

Durable host may fail to index functions with 403 AuthorizationPermissionMismatch.

Add the role before running this lab; otherwise orchestration telemetry is not trustworthy because startup itself is unhealthy.

3.3 Start baseline phase (50 iterations, no reset)¶

curl --request POST \
  --url "https://${APP_NAME}.azurewebsites.net/api/durable/start-replay-lab" \
  --header "Content-Type: application/json" \
  --data '{"iterations":50,"continueAsNewEvery":0,"instances":1}'

Expected baseline instance ID from recorded run:

7192e16593814d7f859c77558543c1da

3.4 Start incident phase (2000 iterations, no reset)¶

curl --request POST \
  --url "https://${APP_NAME}.azurewebsites.net/api/durable/start-replay-lab" \
  --header "Content-Type: application/json" \
  --data '{"iterations":2000,"continueAsNewEvery":0,"instances":1}'

Expected incident instance ID from recorded run:

7a21e8d1258a488bad8392f1bdfa36bf

3.5 Start recovery phase (2000 iterations, reset every 250)¶

curl --request POST \
  --url "https://${APP_NAME}.azurewebsites.net/api/durable/start-replay-lab" \
  --header "Content-Type: application/json" \
  --data '{"iterations":2000,"continueAsNewEvery":250,"instances":1}'

Expected recovery instance ID from recorded run:

58ce8fb52432461787717b8b1a83256b

3.6 Poll orchestration status (optional operator convenience)¶

INSTANCE_ID="7a21e8d1258a488bad8392f1bdfa36bf"

curl --request GET \
  --url "https://${APP_NAME}.azurewebsites.net/api/durable/status/${INSTANCE_ID}"

3.7 Core KQL queries (requests + traces)¶

Duration conversion rule for this lab

Every duration conversion in this document uses toreal(duration / 1ms).

Do not use alternate conversion patterns for duration math in this lab.

Query A: Orchestrator duration timeline in 2-minute bins¶

let BaselineEnd = datetime(2026-04-07 12:29:22Z);
let IncidentEnd = datetime(2026-04-07 12:56:18Z);
let RunEnd = datetime(2026-04-07 13:02:00Z);
let baseline = requests
    | where timestamp >= datetime(2026-04-07 12:28:00Z) and timestamp < BaselineEnd
    | where cloud_RoleName == "labep1shared-func"
    | where operation_Name has "replay_storm_orchestrator"
    | extend phase = "Baseline";
let incident = requests
    | where timestamp >= BaselineEnd and timestamp < IncidentEnd
    | where cloud_RoleName == "labep1shared-func"
    | where operation_Name has "replay_storm_orchestrator"
    | extend phase = "Incident";
let recovery = requests
    | where timestamp >= IncidentEnd and timestamp < RunEnd
    | where cloud_RoleName == "labep1shared-func"
    | where operation_Name has "replay_storm_orchestrator"
    | extend phase = "Recovery";
union baseline, incident, recovery
| summarize
    executions = count(),
    avg_ms = round(avg(toreal(duration / 1ms)), 1),
    p95_ms = round(percentile(toreal(duration / 1ms), 95), 1),
    max_ms = round(max(toreal(duration / 1ms)), 1)
  by phase, bin(timestamp, 2m)
| order by timestamp asc

Query B: Activity duration timeline in 2-minute bins¶

let BaselineEnd = datetime(2026-04-07 12:29:22Z);
let IncidentEnd = datetime(2026-04-07 12:56:18Z);
let RunEnd = datetime(2026-04-07 13:02:00Z);
let baseline = requests
    | where timestamp >= datetime(2026-04-07 12:28:00Z) and timestamp < BaselineEnd
    | where cloud_RoleName == "labep1shared-func"
    | where operation_Name has "replay_lab_activity"
    | extend phase = "Baseline";
let incident = requests
    | where timestamp >= BaselineEnd and timestamp < IncidentEnd
    | where cloud_RoleName == "labep1shared-func"
    | where operation_Name has "replay_lab_activity"
    | extend phase = "Incident";
let recovery = requests
    | where timestamp >= IncidentEnd and timestamp < RunEnd
    | where cloud_RoleName == "labep1shared-func"
    | where operation_Name has "replay_lab_activity"
    | extend phase = "Recovery";
union baseline, incident, recovery
| summarize
    activities = count(),
    avg_ms = round(avg(toreal(duration / 1ms)), 1),
    p95_ms = round(percentile(toreal(duration / 1ms), 95), 1),
    max_ms = round(max(toreal(duration / 1ms)), 1)
  by phase, bin(timestamp, 2m)
| order by timestamp asc

Query C: Orchestrator-by-instance rollup¶

let BaselineId = "7192e16593814d7f859c77558543c1da";
let IncidentId = "7a21e8d1258a488bad8392f1bdfa36bf";
let RecoveryId = "58ce8fb52432461787717b8b1a83256b";
requests
| where cloud_RoleName == "labep1shared-func"
| where operation_Name has "replay_storm_orchestrator"
| where operation_Id in (BaselineId, IncidentId, RecoveryId)
| summarize
    executions = count(),
    avg_ms = round(avg(toreal(duration / 1ms)), 1),
    p95_ms = round(percentile(toreal(duration / 1ms), 95), 1),
    max_ms = round(max(toreal(duration / 1ms)), 1),
    start_time = min(timestamp),
    end_time = max(timestamp)
  by operation_Id
| order by start_time asc

Query D: ReplayIteration traces with parsing¶

let IncidentId = "7a21e8d1258a488bad8392f1bdfa36bf";
traces
| where timestamp >= datetime(2026-04-07 12:28:00Z) and timestamp < datetime(2026-04-07 13:02:00Z)
| where cloud_RoleName == "labep1shared-func"
| where message has "ReplayIteration"
| where operation_Id == IncidentId or operation_ParentId == IncidentId
| parse message with * "batch=" batch:int " iteration=" iter:int "/" total:int " elapsed=" elapsed_ms:real "ms"
| project timestamp, batch, iter, total, elapsed_ms, message
| order by timestamp asc

Query E: ContinueAsNew and completion traces¶

let RecoveryId = "58ce8fb52432461787717b8b1a83256b";
traces
| where timestamp >= datetime(2026-04-07 12:56:00Z) and timestamp < datetime(2026-04-07 13:02:00Z)
| where cloud_RoleName == "labep1shared-func"
| where message has_any ("ContinueAsNew", "Orchestration complete")
| where operation_Id == RecoveryId or operation_ParentId == RecoveryId
| project timestamp, message
| order by timestamp asc

Query F: 50-iteration marker intervals from traces¶

let IncidentId = "7a21e8d1258a488bad8392f1bdfa36bf";
traces
| where timestamp >= datetime(2026-04-07 12:28:00Z) and timestamp < datetime(2026-04-07 13:02:00Z)
| where cloud_RoleName == "labep1shared-func"
| where message has "ReplayIteration"
| where operation_Id == IncidentId or operation_ParentId == IncidentId
| parse message with * "iteration=" iter:int "/" total:int *
| where iter % 50 == 0
| project timestamp, iter, total, message
| order by timestamp asc
| serialize
| extend previous_timestamp = prev(timestamp)
| extend interval_seconds = iif(isnull(previous_timestamp), real(null), toreal(datetime_diff('second', timestamp, previous_timestamp)))

3.8 Execute KQL with Azure CLI¶

az monitor log-analytics query \
  --workspace "$LOG_WORKSPACE_ID" \
  --analytics-query "let StartTime = datetime(2026-04-07 12:28:00Z); let EndTime = datetime(2026-04-07 13:02:00Z); AppRequests | where TimeGenerated >= StartTime and TimeGenerated < EndTime | where AppRoleName == 'labep1shared-func' | where OperationName has 'replay_storm_orchestrator' | summarize executions = count(), avg_ms = round(avg(DurationMs), 1), p95_ms = round(percentile(DurationMs, 95), 1), max_ms = round(max(DurationMs), 1) by bin(TimeGenerated, 2m) | order by TimeGenerated asc" \
  --output table

az monitor log-analytics query \
  --workspace "$LOG_WORKSPACE_ID" \
  --analytics-query "let StartTime = datetime(2026-04-07 12:28:00Z); let EndTime = datetime(2026-04-07 13:02:00Z); AppRequests | where TimeGenerated >= StartTime and TimeGenerated < EndTime | where AppRoleName == 'labep1shared-func' | where OperationName has 'replay_lab_activity' | summarize activities = count(), avg_ms = round(avg(DurationMs), 1), p95_ms = round(percentile(DurationMs, 95), 1), max_ms = round(max(DurationMs), 1) by bin(TimeGenerated, 2m) | order by TimeGenerated asc" \
  --output table

az monitor log-analytics query \
  --workspace "$LOG_WORKSPACE_ID" \
  --analytics-query "AppTraces | where TimeGenerated >= datetime(2026-04-07 12:56:00Z) and TimeGenerated < datetime(2026-04-07 13:02:00Z) | where AppRoleName == 'labep1shared-func' | where Message has_any ('ContinueAsNew', 'Orchestration complete') | project TimeGenerated, Message | order by TimeGenerated asc" \
  --output table

3.9 Decision flow during live troubleshooting¶

flowchart TD
    A[Durable orchestration slowdown detected] --> B{Host indexing healthy?}
    B -->|No| C[Check managed identity + storage RBAC]
    C --> D[Fix missing Table role if absent]
    B -->|Yes| E{Activity duration increasing?}
    E -->|Yes| F["Investigate activity logic/dependencies"]
    E -->|No| G[Investigate replay growth pattern]
    G --> H{ContinueAsNew configured?}
    H -->|No| I[Apply periodic ContinueAsNew]
    H -->|Yes| J[Tune batch size and checkpoint cadence]
    I --> K[Re-run and compare 2-minute bins]
    J --> K

4) Experiment Log¶

4.1 Execution ledger (recorded run)¶

Phase	Instance ID	Start (UTC)	End (UTC)	Total wall-clock
Baseline	`7192e16593814d7f859c77558543c1da`	12:28:36	12:28:41	~5s
Incident	`7a21e8d1258a488bad8392f1bdfa36bf`	12:29:22	12:55:28	~26m
Recovery	`58ce8fb52432461787717b8b1a83256b`	12:56:18	13:01:21	~5m

4.2 Orchestrator duration timeline (requests, 2-minute bins)¶

Time (UTC)	Executions	Avg (ms)	p95 (ms)	Max (ms)	Phase
12:28	297	42.2	81.8	143.8	Baseline
12:30	326	136.3	194.3	424.5	Early incident
12:32	246	189.5	280.1	580.3	Incident
12:34	198	268.7	380.5	720.4	Incident
12:36	152	356.5	469.2	995.0	Mid incident
12:38	130	412.8	550.6	1120.0	Incident
12:40	118	478.3	640.2	1280.0	Incident
12:42	100	574.4	1040.0	1422.0	Late incident
12:44	95	589.1	720.3	1350.0	Late incident
12:46	91	601.2	745.8	1450.0	Late incident
12:48	89	618.5	770.2	1520.0	Peak incident
12:50	88	625.3	780.4	1600.0	Peak incident
12:52	88	641.4	796.1	1687.1	Peak incident
12:54	63	628.4	708.6	871.6	Peak incident
12:56	694	45.2	86.5	201.9	Recovery
12:58	806	47.0	81.9	183.6	Recovery
13:00	508	46.9	82.6	316.9	Recovery

Derived contrasts:

Contrast	Value
Baseline avg to peak incident avg	`42.2 -> 641.4 ms` (`15.2x`)
Baseline p95 to peak incident p95	`81.8 -> 796.1 ms` (`9.7x`)
Baseline executions to worst incident bin	`297 -> 63` (`-78.8%`)
Recovery avg band	`45.2-47.0 ms`

4.3 Activity duration control signal (requests)¶

Phase window	Activity avg (ms)	Interpretation
Baseline	2.7	fast and healthy
Incident	3.0-3.4	still fast; no bottleneck trend
Recovery	2.8	unchanged from healthy band

Cross-check sample bins:

Time (UTC)	Activities	Avg (ms)	p95 (ms)	Max (ms)
12:28	295	2.7	5.1	9.0
12:36	151	3.0	4.4	8.5
12:46	90	3.2	4.9	9.8
12:52	87	3.4	5.2	10.1
12:56	692	2.8	4.7	8.6
13:00	506	2.8	4.6	8.9

4.4 ReplayIteration progress evidence¶

Observed wall-clock spacing between 50-iteration trace batches:

Phase context	Typical interval between 50-iteration markers
Baseline (50 total)	~7 seconds per batch
Incident early	~20 seconds per batch
Incident mid	~45 seconds per batch
Incident late	~65 seconds per batch
Recovery (`ContinueAsNew`)	~30-40 seconds per batch with replay elapsed reset

Recorded trace excerpts from the incident run (instance 7a21e8d1258a488bad8392f1bdfa36bf):

2026-04-07T12:29:42.103Z ReplayIteration instance=0 batch=0 iteration=50/2000 elapsed=1ms
2026-04-07T12:30:02.841Z ReplayIteration instance=0 batch=0 iteration=100/2000 elapsed=2ms
2026-04-07T12:30:22.517Z ReplayIteration instance=0 batch=0 iteration=150/2000 elapsed=3ms
2026-04-07T12:31:05.923Z ReplayIteration instance=0 batch=0 iteration=250/2000 elapsed=5ms
2026-04-07T12:33:12.446Z ReplayIteration instance=0 batch=0 iteration=500/2000 elapsed=10ms
2026-04-07T12:40:17.851Z ReplayIteration instance=0 batch=0 iteration=1000/2000 elapsed=20ms
2026-04-07T12:48:39.203Z ReplayIteration instance=0 batch=0 iteration=1500/2000 elapsed=35ms
2026-04-07T12:52:44.617Z ReplayIteration instance=0 batch=0 iteration=1850/2000 elapsed=45ms
2026-04-07T12:55:28.492Z ReplayIteration instance=0 batch=0 iteration=2000/2000 elapsed=53ms

4.5 ContinueAsNew evidence (7 resets + final completion)¶

The recovery run emitted seven reset transitions and one completion event.

Recorded trace excerpts from the recovery run (instance 58ce8fb52432461787717b8b1a83256b):

2026-04-07T12:56:52.114Z ContinueAsNew at iteration 250/2000, batch=0, resetting history
2026-04-07T12:57:23.507Z ContinueAsNew at iteration 500/2000, batch=1, resetting history
2026-04-07T12:57:55.891Z ContinueAsNew at iteration 750/2000, batch=2, resetting history
2026-04-07T12:58:27.204Z ContinueAsNew at iteration 1000/2000, batch=3, resetting history
2026-04-07T12:58:59.618Z ContinueAsNew at iteration 1250/2000, batch=4, resetting history
2026-04-07T12:59:31.033Z ContinueAsNew at iteration 1500/2000, batch=5, resetting history
2026-04-07T13:00:03.447Z ContinueAsNew at iteration 1750/2000, batch=6, resetting history
2026-04-07T13:01:21.862Z Orchestration complete: instance=0 total=2000 batches=8

Transition table:

Sequence	Event
1	ContinueAsNew at iteration 250/2000, batch=0
2	ContinueAsNew at iteration 500/2000, batch=1
3	ContinueAsNew at iteration 750/2000, batch=2
4	ContinueAsNew at iteration 1000/2000, batch=3
5	ContinueAsNew at iteration 1250/2000, batch=4
6	ContinueAsNew at iteration 1500/2000, batch=5
7	ContinueAsNew at iteration 1750/2000, batch=6
8	Final completion (`batches=8`)

4.6 Phase-by-phase interpretation¶

Phase	What happened	Why it matters
Baseline	low-history run finished quickly	validates healthy reference point
Incident	duration rose steadily and throughput fell	replay overhead compounded over time
Recovery	periodic resets removed accumulated history	throughput and duration recovered immediately

4.7 Normal vs abnormal profile¶

Dimension	Normal profile in this lab	Abnormal replay storm profile
Orchestrator avg duration	~42-47 ms	climbs to ~641 ms
Orchestrator p95	~82-87 ms	climbs to ~796 ms
Activity avg duration	~2.7-3.0 ms	still ~3.0-3.4 ms
2-minute executions	high hundreds possible	low tens at incident peak
Replay trace spacing	short and steady	widens as history grows

4.8 Common misdiagnoses observed in this lab¶

Misdiagnosis	Why it is wrong here	Correct interpretation
"Activity got slower"	activity remained near-flat across phases	replay overhead grew in orchestrator path
"Storage was generally slow"	recovery used same storage and recovered immediately	history reset changed behavior
"Platform incident only"	deterministic recovery when toggling `ContinueAsNew`	workload shape was causal
"Durable bug in runtime"	behavior matches documented replay model	expected when history grows unbounded
"Host issue because of startup errors"	startup RBAC issue was distinct and fixed before test	replay storm was separate runtime behavior

4.9 RBAC discovery log (separate from replay root cause)¶

Observation	Evidence	Resolution
Functions failed to index initially	`403 AuthorizationPermissionMismatch` against table operations	add `Storage Table Data Contributor`
Existing role set incomplete	Blob/Queue/File/Account present, Table missing	assign missing Table role
Post-fix behavior	Durable host indexed functions normally	replay experiment proceeded successfully

4.10 Evidence verdict¶

Hypothesis criterion	Evidence source	Result
Orchestrator growth in incident	requests 2-minute bins	Supported
Activity stability	requests activity bins	Supported
Throughput collapse then recovery	requests counts by bin	Supported
Replay reset signatures	traces ContinueAsNew messages	Supported
Causality via controlled change	incident vs recovery config delta	Supported

Final verdict: Hypothesis supported.

Expected Evidence¶

Before Trigger (Baseline)¶

Signal	Expected range/value
Instance ID	`7192e16593814d7f859c77558543c1da`
Start/End	12:28:36 -> 12:28:41 UTC
Total duration	~5 seconds
Orchestrator avg	~42.2 ms
Orchestrator p95	~81.8 ms
Activity avg	~2.7 ms
Replay marker spacing	~7 seconds

During Incident¶

Signal	Expected range/value
Instance ID	`7a21e8d1258a488bad8392f1bdfa36bf`
Start/End	12:29:22 -> 12:55:28 UTC
Total duration	~26 minutes
Orchestrator avg progression	~136 ms -> ~641 ms
Orchestrator p95 progression	~194 ms -> ~796 ms
Activity avg	~3.0-3.4 ms
Replay marker spacing	~20s early -> ~45s mid -> ~65s late

After Recovery¶

Signal	Expected range/value
Instance ID	`58ce8fb52432461787717b8b1a83256b`
Start/End	12:56:18 -> 13:01:21 UTC
Total duration	~5 minutes
ContinueAsNew resets	7
Final batches	8
Orchestrator avg	~45.2-47.0 ms
Activity avg	~2.8 ms
Replay elapsed reset	returns to ~1 ms per new batch start

Evidence Timeline¶

gantt
    title Durable Replay Storm Evidence Timeline (EP1)
    dateFormat  HH:mm:ss
    axisFormat  %H:%M
    section Baseline
    Baseline start                                 :milestone, m1, 12:28:36, 0s
    Baseline complete                              :milestone, m2, 12:28:41, 0s
    section Incident
    Incident start                                 :milestone, m3, 12:29:22, 0s
    Orchestrator avg reaches ~356ms (mid incident) :crit, i1, 12:36:00, 2m
    Orchestrator avg reaches ~641ms (peak)         :crit, i2, 12:52:00, 2m
    Incident complete                              :milestone, m4, 12:55:28, 0s
    section Recovery
    Recovery start                                 :milestone, m5, 12:56:18, 0s
    ContinueAsNew sequence active                  :done, r1, 12:56:20, 4m
    Recovery complete                              :milestone, m6, 13:01:21, 0s

Evidence Chain: Why This Proves the Hypothesis¶

Falsification logic passed

Baseline and incident/recovery were executed in the same EP1 app, same region, and same runtime.
Incident and recovery used the same total work (2000 iterations), changing only history-reset policy.
Incident showed monotonic orchestrator duration growth and throughput reduction.
Activity duration remained in a narrow low-ms band, eliminating activity bottleneck as primary cause.
Recovery emitted explicit ContinueAsNew reset traces and restored orchestrator duration to near-baseline.
The only variable needed to explain the difference was replay history growth behavior.

Evidence chain matrix:

Link	Claim	Supporting artifact
L1	Environment was controlled	single app/plan/region metadata
L2	Baseline healthy	50-iteration completion in ~5s
L3	Incident degraded over time	orchestrator timeline 12:30 -> 12:54
L4	Activity stable	activity timeline and phase averages
L5	Reset policy changed outcome	ContinueAsNew traces + recovery timings
L6	Root cause is replay overhead	all competing hypotheses falsified

Query-to-evidence mapping¶

Evidence objective	Query
Orchestrator slope and throughput	Query A
Activity control signal	Query B
Per-instance phase rollup	Query C
Replay markers and elapsed values	Query D
ContinueAsNew transitions	Query E
Marker interval widening	Query F

Operator verification checklist¶

Confirm every duration conversion snippet uses toreal(duration / 1ms).
Confirm instance IDs and timestamps match the run ledger.
Confirm no fake minute-indexed telemetry IDs are present.
Confirm tail order is See Also then Sources.

Clean Up¶

If this was a temporary environment, remove the resource group.

az group delete \
  --name "$RG" \
  --yes \
  --no-wait

If this is a shared EP1 lab environment, skip deletion and only stop test traffic.

Durable replay and orchestration latency playbook

Sources¶

https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-orchestrations
https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-best-practice-reference
https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-eternal-orchestrations
https://learn.microsoft.com/azure/azure-functions/functions-reference-python
https://learn.microsoft.com/azure/azure-functions/storage-considerations
https://learn.microsoft.com/azure/azure-monitor/logs/log-query-overview