Troubleshooting Architecture Overview¶

This page answers the first practical question during an Azure Functions incident: which platform segment is failing?

Before opening a deep playbook, map the symptom to the correct layer: hosting plan behavior, trigger delivery, scale controller decisions, worker startup, Durable Functions state coordination, or downstream service integration. That classification tells you which telemetry to query first and which mitigation is realistic.

Why this page exists¶

Playbooks explain specific scenarios. During active incidents, engineers usually need one faster artifact:

A hosting and execution map that explains where invocations actually run
A trigger and binding flow that shows where events can be delayed, dropped, or rejected
A scale model that explains backlog growth, burst behavior, and instance creation timing
A cold start path that explains why the first execution is slow even when application code is healthy
A Durable Functions model that explains why orchestrations can appear stuck while storage state is still changing

Use this page to route quickly to the right evidence and the right troubleshooting document.

1) Hosting Plan Architecture (where startup and scale behavior change)¶

Azure Functions does not have one runtime shape. The failure domain changes materially by hosting plan.

flowchart LR
    A[Client or Event Source] --> B["Azure Functions Front Door / Trigger Source"]
    B --> C{Hosting plan}
    C -->|Consumption| D[Shared serverless workers\nScale to zero]
    C -->|Flex Consumption| E[Serverless workers with newer scale model\nPer-function scaling focus]
    C -->|Premium| F[Prewarmed and always-ready workers\nNo scale-to-zero]
    C -->|Dedicated| G[App Service Plan instances\nManual or autoscale]

    D -. FP-PLAN-01 .-> D1[Cold start after idle or scale-out]
    E -. FP-PLAN-02 .-> E1[Trigger-specific scale behavior and networking assumptions]
    F -. FP-PLAN-03 .-> F1[Prewarmed capacity insufficient for burst]
    G -. FP-PLAN-04 .-> G1[Too few instances or noisy-neighbor within plan]

Hosting plan troubleshooting differences¶

Plan	Core behavior	Common incident pattern	First evidence to check	Diagnostic implication
Consumption	Event-driven instances, can scale to zero	Slow first execution, backlog during burst, timeout near cold scale-out	`traces`, `requests`, execution count metrics	Always consider cold start and scale latency before blaming app code
Flex Consumption	Newer serverless model with better networking and scaling controls	Trigger-specific delivery confusion, cold start reduced but still possible, storage or Event Grid assumptions for some triggers	`traces`, trigger configuration, platform settings	Check trigger transport design, deployment region, and plan-specific trigger support
Premium	Prewarmed/always-ready instances, VNet support, no scale-to-zero	Burst exceeds ready capacity, worker recycle, long-running workloads under pressure	instance count metrics, `requests`, `dependencies`	Cold start is less dominant; concurrency and downstream bottlenecks rise in priority
Dedicated	Runs on App Service Plan instances	App appears stable but under-provisioned, restarts due to shared plan pressure, scaling requires explicit plan action	App Service metrics, Activity Log, instance health	Treat as fixed-capacity web workload first, serverless elasticity second

Failure points and playbooks¶

Failure Point	Symptom	First evidence	Playbook
FP-PLAN-01 Consumption cold scale	first request after idle is slow, queue trigger starts late	host startup traces, request duration percentiles	High Latency
FP-PLAN-02 Flex trigger mismatch	blob or event-driven execution pattern differs from assumption	listener traces, trigger source configuration	Functions Not Executing
FP-PLAN-03 Premium ready-capacity gap	p95 worsens during bursts even without scale-to-zero	instance count, request volume, dependency latency	High Latency
FP-PLAN-04 Dedicated capacity ceiling	throughput flattens, queue backlog grows, CPU or memory pressure rises	plan metrics, worker restart correlation	Queue Piling Up and Out of Memory / Worker Crash

2) Trigger and Binding Architecture (where invocations begin or stall)¶

Most Azure Functions incidents begin before user code runs. The event must first reach the trigger listener, be accepted by the host, and then complete any binding reads and writes.

flowchart TD
    A[Event source or caller] --> B{Trigger type}
    B -->|HTTP| C[Functions front end and host routing]
    B -->|"Queue / Service Bus"| D[Listener polls or receives messages]
    B -->|Event Hubs| E[Partition processor and checkpoint store]
    B -->|"Blob / Event Grid"| F[Event delivery or storage scan path]
    B -->|Timer| G[Schedule monitor]
    C --> H[Functions host]
    D --> H
    E --> H
    F --> H
    G --> H
    H --> I[Language worker]
    I --> J[Function code]
    J --> K[Input and output bindings]
    K --> L[Storage, Service Bus, Event Hubs, Cosmos DB, HTTP APIs]

    C -. FP-TRIG-01 .-> C1[HTTP auth, route, or host availability issue]
    D -. FP-TRIG-02 .-> D1["Queue or Service Bus listener auth/backlog issue"]
    E -. FP-TRIG-03 .-> E1[Partition lag or checkpoint delay]
    F -. FP-TRIG-04 .-> F1[Blob delivery path or Event Grid subscription issue]
    H -. FP-TRIG-05 .-> H1[Host cannot initialize listener]
    K -. FP-TRIG-06 .-> K1["Binding write/read failure after execution"]

Request flow examples by trigger type¶

flowchart LR
    A[HTTP client] --> B[Functions endpoint]
    B --> C[Host route table]
    C --> D[Function invocation]
    D --> E[Output binding or HTTP response]

    F[Queue message] --> G[Queue trigger listener]
    G --> H[Dequeue and lock]
    H --> I[Function invocation]
    I --> J["Checkpoint / delete / poison handling"]

    K[Event Hub event] --> L[Event processor]
    L --> M[Partition ownership]
    M --> N[Function invocation]
    N --> O[Checkpoint write]

Trigger and binding diagnostic routing¶

Failure Point	Typical symptom	Why it fails	First evidence	Primary next step
FP-TRIG-01 HTTP path	`404`, `401`, `403`, or `5xx` before expected handler path	route mismatch, auth policy, host unavailable	`requests` + `traces`	verify route config, auth settings, and host startup
FP-TRIG-02 Queue / Service Bus listener	backlog rises while executions stay flat	listener start failed, connection broken, RBAC mismatch	`traces`, queue metrics, dead-letter growth	validate connection settings and listener startup
FP-TRIG-03 Event Hubs lag	partitions fall behind, checkpoint delay grows	slow handler, checkpoint failure, too few instances	consumer lag metrics, `traces`, dependencies	correlate backlog with execution time and checkpoint success
FP-TRIG-04 Blob / Event Grid path	blob created but no invocation	Event Grid subscription missing, storage events delayed, trigger configuration mismatch	Event Grid subscription state, `traces`	validate storage event path and listener configuration
FP-TRIG-05 Host listener init	app is running but trigger never becomes active	storage auth, extension init, host startup failure	startup traces and exceptions	open trigger-focused playbook
FP-TRIG-06 Binding output	function logs success but downstream artifact missing	output binding auth/config failure, retries exhausted	`dependencies`, traces, target service metrics	validate binding identity and downstream permissions

Integration map for common Azure services¶

Service	Typical role in Functions architecture	Common failure mode	First evidence source
Azure Storage	required host storage, queue/blob triggers, Durable state	auth failure, lease/checkpoint problems, trigger silence	`traces`, storage metrics, Activity Log
Event Hubs	high-volume event ingestion	checkpoint lag, partition imbalance, consumer auth	Event Hubs metrics, `traces`, dependency errors
Service Bus	queue/topic processing and dead-letter patterns	listener auth failure, lock loss, backlog growth	queue depth metrics, `traces`, exceptions
Cosmos DB	trigger source, output binding target, app dependency	lease container issues, throttling, auth failure	`dependencies`, traces, Cosmos metrics
Event Grid	delivery path for blob and event-based workflows	subscription misconfiguration or delivery failure	Event Grid subscription health, traces

3) Scale Controller and Instance Management (where backlog and burst failures originate)¶

Azure Functions scaling is a control loop, not instant elasticity. Backlog, target-based heuristics, and instance readiness all matter.

flowchart TD
    A[Incoming load or event backlog] --> B[Scale controller observes trigger metrics]
    B --> C{Need more workers?}
    C -->|No| D[Keep current instance count]
    C -->|Yes| E[Request additional instances]
    E --> F[Allocate workers for hosting plan]
    F --> G[Start host and listeners on new instances]
    G --> H[Instances begin processing]

    B -. FP-SCALE-01 .-> B1[Metrics do not reflect actual bottleneck yet]
    E -. FP-SCALE-02 .-> E1[Scale request delayed or capped]
    F -. FP-SCALE-03 .-> F1[Instance allocation slow in burst]
    G -. FP-SCALE-04 .-> G1[Host startup or specialization delay]
    H -. FP-SCALE-05 .-> H1[More instances exist but downstream still throttles]

Scale-out decision flow¶

flowchart LR
    A[Queue depth, partition lag, HTTP demand, concurrency pressure] --> B{Trigger-specific thresholds crossed?}
    B -->|No| C[Remain at current size]
    B -->|Yes| D{Hosting plan can add instances now?}
    D -->|No| E[Delay, throttle, or stay capped]
    D -->|Yes| F[Provision instance]
    F --> G[Load Functions host]
    G --> H[Replay config and start listeners]
    H --> I[Begin invocations]
    I --> J{Backlog falling?}
    J -->|Yes| K[Stabilize]
    J -->|No| L[Need more scale or dependency fix]

Scale troubleshooting matrix¶

Failure Point	Symptom pattern	First evidence	Interpretation
FP-SCALE-01 Observation gap	backlog grows before instance count rises	trigger-specific metrics, queue depth, partition lag	scale loop has not reacted yet or trigger metrics are delayed
FP-SCALE-02 Scale cap	load rises but instance count plateaus	plan limits, app settings, autoscale config	not all scale problems are runtime bugs
FP-SCALE-03 Allocation delay	new instances appear too slowly during spike	platform metrics and burst timeline	provisioning latency can dominate burst recovery
FP-SCALE-04 Startup delay on new worker	instance count rises but executions lag	startup traces on new instances	this is scale success followed by cold initialization cost
FP-SCALE-05 Downstream bottleneck	more workers increase dependency failures instead of throughput	`dependencies` failure rate, target throttling metrics	scale-out can amplify pressure on Service Bus, Cosmos DB, SQL, or APIs

Diagnostic commands for scale and instance state¶

az functionapp show --resource-group "<resource-group>" --name "<app-name>" --output json
az functionapp plan show --resource-group "<resource-group>" --name "<plan-name>" --output json
az monitor metrics list --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Web/sites/<app-name>" --metric "FunctionExecutionCount,FunctionExecutionUnits,Http5xx,Requests" --interval PT1M --aggregation "Total,Average,Maximum" --offset 1h --output table
az monitor metrics list --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-name>" --metric "QueueMessageCount" --interval PT1M --aggregation "Average,Maximum" --offset 1h --output table
az monitor activity-log list --resource-group "<resource-group>" --offset 2h --max-events 100 --output table

4) Cold Start Flow (where first-execution latency appears)¶

Cold start is not one step. It is a chain: instance allocation, host specialization, runtime loading, extension startup, application import, and first dependency initialization.

sequenceDiagram
    participant S as Scale Controller
    participant I as New Instance
    participant H as Functions Host
    participant W as Language Worker
    participant F as Function Code
    participant D as Dependency

    S->>I: Allocate worker for load or first request
    I->>H: Specialize site and mount app content
    H->>H: Load host.json and extensions
    H->>W: Start language worker process
    W->>F: Import user code and initialize globals
    F->>D: Open first dependency connection
    D-->>F: Return auth/session/metadata
    F-->>H: Ready for invocation
    H-->>I: Execute first invocation

Cold start interpretation¶

Slow first invocation can come from platform work before your handler begins.
A healthy steady-state p95 does not disprove cold start.
Cold-start impact is workload-dependent and should be treated as a range (often sub-second to several seconds), not a fixed single value.
Premium reduces but does not eliminate initialization cost; tune always-ready and prewarmed instance capacity for burst scenarios.
Queue or Event Hub workloads can show cold-start symptoms as delayed event pickup instead of slow HTTP response.

Cold-start failure points¶

Failure Point	Symptom	What it usually means	First evidence
FP-COLD-01 allocation delay	no worker available immediately	scale-to-zero or burst allocation latency	request timeline, instance count trend
FP-COLD-02 specialization delay	app files or site config take time to load	worker still preparing environment	startup traces
FP-COLD-03 extension startup	trigger listeners load slowly or fail	binding extension init problem	`traces` around host initialization
FP-COLD-04 language worker init	Python/Node/.NET isolated process starts slowly	package size, imports, runtime mismatch	worker startup traces
FP-COLD-05 dependency warm-up	first call to Key Vault, Cosmos DB, Storage, SQL is slow	network/auth/token/cache initialization	`dependencies` first-call duration

Cold-start diagnostic commands¶

az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(2h) | where Message has_any ('Initializing Host','Host started','Generating 0 job function','Starting JobHost') | project TimeGenerated, AppRoleInstance, Message | order by TimeGenerated desc" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppRequests | where TimeGenerated > ago(2h) | summarize p50=percentile(DurationMs,50), p95=percentile(DurationMs,95), maxDuration=max(DurationMs) by bin(TimeGenerated, 5m) | order by TimeGenerated desc" --output table
az functionapp config appsettings list --resource-group "<resource-group>" --name "<app-name>" --output table

5) Durable Functions Orchestration Architecture (where replay and state coordination confuse diagnosis)¶

Durable Functions is not a single long-running process. It is a state-driven workflow pattern backed by storage. The orchestrator replays history, the runtime persists checkpoints, and activity functions run as separate executions.

flowchart TD
    A[Client starts orchestration] --> B[Durable client binding]
    B --> C[Orchestration trigger]
    C --> D[Read execution history from durable store]
    D --> E[Replay orchestrator code deterministically]
    E --> F{Schedule activity, timer, or sub-orchestration?}
    F -->|Activity| G[Queue work item]
    F -->|Timer| H[Persist wake-up event]
    F -->|External event| I[Wait for event in durable store]
    G --> J[Activity function executes]
    J --> K[Write result to durable store]
    H --> D
    I --> D
    K --> D

    C -. FP-DUR-01 .-> C1[Orchestrator trigger not advancing]
    D -. FP-DUR-02 .-> D1[History or storage access issue]
    E -. FP-DUR-03 .-> E1[Replay storm or non-deterministic code]
    G -. FP-DUR-04 .-> G1[Activity backlog or queue saturation]
    K -. FP-DUR-05 .-> K1["Checkpoint/state write failure"]

Durable state-machine view¶

stateDiagram-v2
    [*] --> Pending
    Pending --> Running: orchestration starts
    Running --> WaitingForActivity: schedule activity
    WaitingForActivity --> Running: activity result written
    Running --> WaitingForTimer: durable timer created
    WaitingForTimer --> Running: timer fires
    Running --> WaitingForExternalEvent: external event expected
    WaitingForExternalEvent --> Running: event received
    Running --> Completed: final output stored
    Running --> Failed: unhandled exception or storage failure
    WaitingForActivity --> Failed: repeated activity failure or poison state
    WaitingForExternalEvent --> Terminated: manual terminate

Durable troubleshooting matrix¶

Failure Point	Typical symptom	First evidence	Why it is misleading
FP-DUR-01 trigger not advancing	orchestration remains `Running` with no visible progress	orchestration status, traces, queue activity	app may be healthy while durable control queue is stuck
FP-DUR-02 history/store issue	replay cannot reconstruct state cleanly	storage auth failures, table/queue access traces	looks like code bug but may be storage availability or RBAC
FP-DUR-03 replay storm	high CPU/log volume with little forward progress	repeated replay traces, same history events	replays are expected, but excessive replays indicate design or storage pressure
FP-DUR-04 activity backlog	orchestration scheduled work but activities lag	queue depth, execution counts, activity failures	orchestrator status alone hides downstream queue saturation
FP-DUR-05 state write failure	orchestration appears to retry or stall after activity success	storage exceptions, durable backend errors	activity succeeded but checkpoint did not persist

Durable diagnostic commands¶

az functionapp function list --resource-group "<resource-group>" --name "<app-name>" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppTraces | where TimeGenerated > ago(2h) | where Message has_any ('DurableTask','orchestration','replay','activity') | project TimeGenerated, SeverityLevel, Message | order by TimeGenerated desc" --output table
az storage queue list --account-name "<storage-name>" --auth-mode login --output table
az storage table list --account-name "<storage-name>" --auth-mode login --output table

6) Azure Service Integration Path (where dependencies amplify failure)¶

Azure Functions usually sits in the middle of other services rather than at the edge alone. That means incidents often reflect integration failure, not runtime failure.

flowchart LR
    A[Function invocation] --> B{Integration type}
    B -->|Host storage| C[Azure Storage account]
    B -->|Streaming input| D[Event Hubs]
    B -->|Messaging| E[Service Bus]
    B -->|Document data| F[Cosmos DB]
    B -->|Output events| G[Event Grid or HTTP API]

    C -. FP-INT-01 .-> C1[Host lease, checkpoint, or auth failure]
    D -. FP-INT-02 .-> D1[Checkpoint lag or consumer auth failure]
    E -. FP-INT-03 .-> E1[Lock loss, dead-letter growth, auth failure]
    F -. FP-INT-04 .-> F1[429 throttling, lease issue, auth failure]
    G -. FP-INT-05 .-> G1[Delivery or downstream API failure]

Integration failure mapping¶

Integration	What breaks first	Fastest evidence	Common misread
Azure Storage	host startup, trigger checkpoints, Durable state	storage auth exceptions, queue/table/blob metrics	assuming storage is just an app dependency rather than a platform dependency
Event Hubs	lag and uneven partition progress	consumer group lag, checkpoint traces	blaming Functions scale before checking partition ownership
Service Bus	backlog, retries, dead-letter growth	queue depth, lock-lost exceptions, trace warnings	treating poison messages as scaling problem only
Cosmos DB	throttling or trigger lease contention	`dependencies`, 429 trend, lease container health	blaming handler latency without checking RU limits
Event Grid / APIs	delivery failures after successful execution	dependency failures, delivery metrics	concluding the function failed when only the output path failed

7) Observability Coverage and Fast Evidence Commands¶

flowchart TD
    A[Architecture segment] --> B[Best first evidence source]
    C[Hosting plan and scale] --> D[Metrics + Activity Log]
    E[Trigger startup and listener health] --> F[Application Insights traces]
    G[User-facing failures] --> H[requests + exceptions]
    I[Dependency issues] --> J[dependencies + service metrics]
    K[Durable state progression] --> L[traces + storage artifacts]

Quick evidence commands by component¶

az functionapp show --resource-group "<resource-group>" --name "<app-name>" --output json
az functionapp config appsettings list --resource-group "<resource-group>" --name "<app-name>" --output table
az functionapp function list --resource-group "<resource-group>" --name "<app-name>" --output table
az monitor activity-log list --resource-group "<resource-group>" --offset 24h --max-events 100 --output table
az monitor metrics list --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Web/sites/<app-name>" --metric "Requests,Http5xx,FunctionExecutionCount,FunctionExecutionUnits" --interval PT1M --aggregation "Total,Average,Maximum" --offset 2h --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppRequests | where TimeGenerated > ago(2h) | summarize total=count(), failed=countif(Success == false), p95=percentile(DurationMs,95) by bin(TimeGenerated, 5m) | order by TimeGenerated asc" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppDependencies | where TimeGenerated > ago(2h) | summarize failed=countif(Success == false), p95=percentile(DurationMs,95) by Target | order by failed desc" --output table

8) Fast routing examples¶

Example A: HTTP requests fail only after idle periods.
- Start with hosting plan architecture and cold start flow.
- Open: High Latency, Quick Diagnosis Cards, and First 10 Minutes.
Example B: Queue depth rises while the app shows healthy in the portal.
- Start with trigger and binding architecture, then scale controller behavior.
- Open: Functions Not Executing and Queue Piling Up.
Example C: Durable orchestration stays Running for a long time with repeated logs.
- Start with Durable Functions orchestration architecture.
- Open: Durable Orchestration Stuck and Methodology.
Example D: More instances appear during load, but errors shift to Service Bus or Cosmos DB.
- Start with scale controller and integration path.
- Open: High Latency and Evidence Map.

How to use this page during incidents

Do not treat scale-out, cold start, or replay behavior as proof of root cause by themselves. Use this page to identify the most likely failure layer, then confirm with time-correlated telemetry from Application Insights, Activity Log, and the integrated Azure service.