Reliability¶
Reliability in Azure Functions is a design concern, not only an operations concern. Your trigger model, hosting plan, retry policy, and network topology jointly determine failure behavior.
Prerequisites¶
Before you finalize reliability design decisions, verify these prerequisites:

- You know the trigger semantics for each workload (at-most-once, at-least-once, checkpoint-driven).
- You have a defined business SLO/SLA for latency, recovery time, and acceptable data loss.
- You can map each critical dependency (storage, messaging, identity, database, DNS, network).
- You have access to the Azure CLI (az) and monitoring telemetry (Application Insights, Metrics, Log Analytics).
- You have ownership for poison/dead-letter triage and replay procedures.
Main Content¶
Reliability layers¶
Design for reliability across four layers:

1. Trigger semantics (delivery guarantees, retries, checkpointing)
2. Function behavior (idempotency, timeout, exception handling)
3. Platform behavior (scale transitions, zone support, host restarts)
4. Dependency behavior (throttling, transient failure, private network reachability)
Retry strategy¶
Azure Functions supports built-in retry behavior for supported triggers. Common retry models:

- Fixed delay retry
- Exponential backoff retry

Use retries for transient failures only. Non-transient failures should route to dead-letter/poison handling paths.
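Whichever retry model you choose, it helps to reason about the resulting delay schedule explicitly. A minimal sketch (plain Python, not tied to any Functions API) that computes a capped exponential backoff schedule with optional jitter:

```python
import random

def backoff_delays(base: float, factor: float, max_delay: float,
                   attempts: int, jitter: float = 0.0) -> list:
    """Compute capped exponential backoff delays, with optional jitter.

    jitter=0.25 means each delay is reduced by up to 25% at random,
    which spreads out retries from many concurrent consumers.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(base * (factor ** attempt), max_delay)
        if jitter:
            delay = random.uniform(delay * (1 - jitter), delay)
        delays.append(delay)
    return delays

# Without jitter, a 5s base and growth factor of 3 reproduce the
# 5s/15s/45s schedule used in the sequence diagram in this section.
print(backoff_delays(5, 3, 60, 4))  # [5, 15, 45, 60]
```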
Cross-language retry annotation patterns¶
```python
import azure.functions as func

app = func.FunctionApp()

@app.function_name(name="ProcessQueue")
@app.queue_trigger(arg_name="msg", queue_name="orders", connection="AzureWebJobsStorage")
def process_queue(msg: func.QueueMessage) -> None:
    # Handle the message idempotently; raise on transient failures so the
    # host redelivers it (governed by maxDequeueCount in host.json).
    pass
```
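For triggers that support trigger-level retry policies (for example, timer and Event Hubs triggers), the policy can also be declared in function.json instead of in code. A sketch of an exponential backoff policy, assuming the trigger type supports built-in retries:

```json
{
  "retry": {
    "strategy": "exponentialBackoff",
    "maxRetryCount": 5,
    "minimumInterval": "00:00:05",
    "maximumInterval": "00:01:00"
  }
}
```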
Retry flow with exponential backoff timing¶
```mermaid
sequenceDiagram
    autonumber
    participant T as Trigger
    participant F as Function
    participant D as Dependency
    T->>F: Delivery #1
    F->>D: Call
    D-->>F: 503 transient failure
    F-->>T: Throw exception
    Note over T,F: Retry #1 after 5s
    T->>F: Delivery #2
    F->>D: Call
    D-->>F: Timeout
    F-->>T: Throw exception
    Note over T,F: Retry #2 after 15s
    T->>F: Delivery #3
    F->>D: Call
    D-->>F: 429 throttled
    F-->>T: Throw exception
    Note over T,F: Retry #3 after 45s
    T->>F: Delivery #4
    F->>D: Call
    D-->>F: Success
    F-->>T: Ack/Complete
```

host.json retry configuration examples¶
Use these examples as host-level reliability templates. Trigger-level retry declarations still apply where supported by language/runtime bindings.
Fixed delay retry config

```json
{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "clientRetryOptions": {
        "mode": "fixed",
        "tryTimeout": "00:01:00",
        "delay": "00:00:05",
        "maxDelay": "00:00:05",
        "maxRetries": 5
      }
    }
  }
}
```
Exponential backoff retry config

```json
{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "clientRetryOptions": {
        "mode": "exponential",
        "tryTimeout": "00:01:00",
        "delay": "00:00:02",
        "maxDelay": "00:01:00",
        "maxRetries": 8
      }
    }
  }
}
```
Max retry count settings

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "maxDequeueCount": 8,
      "visibilityTimeout": "00:00:30",
      "batchSize": 16,
      "newBatchThreshold": 8
    }
  }
}
```
Retry scope matters

clientRetryOptions governs retries between the Functions host's Service Bus client and the messaging service itself; it does not re-execute your function. Function execution retries are configured separately through trigger/runtime support (for example, maxDequeueCount for Storage queues or the entity's max delivery count for Service Bus).
Poison message handling¶
For queue-based triggers, repeated failure eventually moves messages to poison/dead-letter paths (service-specific behavior). Design requirements:

- preserve original payload and correlation metadata,
- alert on poison queue growth,
- provide a replay workflow after remediation,
- prevent infinite retry loops.
Do not drop poison messages
Poison events are high-signal reliability data. Route them to explicit triage and replay pipelines.
Queue-specific poison behaviors¶
Storage Queue trigger

- Every failed processing attempt increments dequeueCount.
- When dequeueCount exceeds maxDequeueCount, the runtime moves the message to <queue-name>-poison.
- Preserve these fields for replay and forensics:
    - id
    - dequeueCount
    - insertionTime
    - nextVisibleTime
    - custom correlationId (if present)
Service Bus trigger

- Messages are dead-lettered after the max delivery count is exceeded or after an explicit dead-letter action.
- Capture deadLetterReason and deadLetterErrorDescription before replay.
- Typical reasons include lock lost, deserialization failure, or business validation failure.
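The metadata fields above can be captured in a small quarantine envelope before a message is parked, so the original payload survives intact for replay. A hypothetical helper (names are illustrative, not part of any SDK):

```python
import json
from datetime import datetime, timezone

def quarantine_envelope(payload: bytes, *, message_id: str, dequeue_count: int,
                        source_queue: str, correlation_id=None) -> str:
    """Wrap a poison message with the metadata needed for triage and replay."""
    return json.dumps({
        "quarantinedAt": datetime.now(timezone.utc).isoformat(),
        "sourceQueue": source_queue,
        "messageId": message_id,
        "dequeueCount": dequeue_count,
        "correlationId": correlation_id,
        # Original payload preserved verbatim for replay and forensics.
        "payload": payload.decode("utf-8"),
    })
```

A triage worker can later parse the envelope, fix the root cause, and resubmit `payload` to the original queue.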
```mermaid
flowchart TD
    A[Queued Message] --> B[Function Invocation]
    B -->|Success| C[Complete Message]
    B -->|Failure| D["Abandon/Release Lock"]
    D --> E{Retry budget remaining?}
    E -->|Yes| B
    E -->|No| F[Poison Queue or Dead-letter Queue]
    F --> G[Triage + Root Cause]
    G --> H{Remediated?}
    H -->|Yes| I[Replay Pipeline]
    H -->|No| J[Escalate + Quarantine]
```

Timeout design¶
Timeout boundaries are part of reliability behavior.
| Plan | Default | Maximum |
|---|---|---|
| Consumption (classic) | 5 min | 10 min |
| Flex Consumption | 30 min | Unbounded |
| Premium | 30 min (common default) | Unbounded |
| Dedicated | 30 min (common default) | Unbounded |
If your business process exceeds timeout bounds, redesign to asynchronous orchestration.
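The timeout boundary itself is set per app in host.json via functionTimeout. For example, on a plan that allows it, a one-hour limit might look like:

```json
{
  "version": "2.0",
  "functionTimeout": "01:00:00"
}
```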
Availability zones and high availability¶
Zone-aware architecture options are strongest on Premium, Dedicated, and Flex Consumption plans.

- Premium, Dedicated, and Flex Consumption can be designed for zone-resilient deployments (region permitting).
- Zone-resilient design should include zone-redundant dependencies (storage, messaging, data stores).
- Consumption designs should emphasize retry/idempotency and multi-region recovery patterns where needed.
```mermaid
flowchart LR
    subgraph Region[Azure Region]
        subgraph Z1[Zone 1]
            F1[Function Workers]
        end
        subgraph Z2[Zone 2]
            F2[Function Workers]
        end
        subgraph Z3[Zone 3]
            F3[Function Workers]
        end
        LB["Front Door / Traffic Manager"]
        SB[(Service Bus Premium ZR)]
        ST[(Storage Account ZRS)]
    end
    LB --> F1
    LB --> F2
    LB --> F3
    F1 --> SB
    F2 --> SB
    F3 --> SB
    F1 --> ST
    F2 --> ST
    F3 --> ST
```

Idempotency is mandatory¶
Because retries and duplicate deliveries are normal in distributed systems, handlers must be idempotent. Idempotency patterns:

- deterministic operation keys,
- upsert instead of blind insert,
- de-duplication table/cache,
- exactly-once effects at the domain boundary where feasible.
Python idempotency example¶
```python
import json
from datetime import datetime, timezone

import azure.functions as func
from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableServiceClient

app = func.FunctionApp()

@app.function_name(name="ProcessOrder")
@app.queue_trigger(arg_name="msg", queue_name="orders", connection="AzureWebJobsStorage")
def process_order(msg: func.QueueMessage) -> None:
    payload = json.loads(msg.get_body().decode("utf-8"))
    operation_id = payload["operationId"]

    # Local development connection string; use configuration or managed
    # identity in production.
    table_service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")
    table_service.create_table_if_not_exists("processedoperations")
    table_client = table_service.get_table_client("processedoperations")

    try:
        # The (PartitionKey, RowKey) insert is the idempotency gate:
        # a second delivery of the same operation_id fails the insert.
        table_client.create_entity({
            "PartitionKey": "order-processing",
            "RowKey": operation_id,
            "processedAt": datetime.now(timezone.utc).isoformat(),
        })
    except ResourceExistsError:
        # Duplicate delivery: idempotent no-op
        return

    # Side effect executes once per operation_id
```
Dependency resilience¶
Protect downstream dependencies using:

- timeout budgets per call,
- transient retry with jitter,
- circuit breaking,
- bulkheading (separate processing lanes for critical/non-critical work).
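The circuit-breaking idea can be sketched in a few lines of plain Python: open the circuit after N consecutive failures, fail fast while open, and allow a trial call after a cooldown. This is an illustrative minimal version, not a production library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    fails fast while open, and half-opens after a cooldown period."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

In practice you would use an established resilience library rather than hand-rolling this, but the state machine is the same.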
Reliability architecture pattern¶
```mermaid
flowchart LR
    In[Trigger Event] --> Fn[Function Handler]
    Fn -->|Success| Ok["Commit / Ack"]
    Fn -->|Transient error| Rt[Retry Policy]
    Rt --> Fn
    Fn -->|Exceeded retries| P["Poison / Dead-letter"]
    P --> Ops[Alert + Triage + Replay]
```

CLI validation examples (PII masked)¶
Use CLI checks during reviews and incidents to confirm reliability-related configuration and telemetry.
Inspect function app reliability settings

```shell
az functionapp config show \
  --resource-group "rg-functions-prod" \
  --name "func-reliability-prod" \
  --query "{alwaysOn:alwaysOn,http20Enabled:http20Enabled,ftpsState:ftpsState,minTlsVersion:minTlsVersion}" \
  --output json
```

Query failure and retry metrics

```shell
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/rg-functions-prod/providers/Microsoft.Web/sites/func-reliability-prod" \
  --metric "FunctionExecutionCount,FunctionExecutionUnits,FunctionExecutionFailureCount" \
  --interval "PT5M" \
  --aggregation "Total" \
  --output table

az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/rg-functions-prod/providers/Microsoft.ServiceBus/namespaces/sb-functions-prod" \
  --metric "DeadletteredMessages,IncomingMessages,SuccessfulRequests,ServerErrors" \
  --interval "PT5M" \
  --aggregation "Total" \
  --output table
```
Troubleshooting matrix¶
| Symptom | Likely Cause | Validation Path |
|---|---|---|
| Sudden spike in retries with eventual success | Downstream transient throttling | Check dependency 429/503 in traces and compare with retry timing |
| Messages accumulate in poison queue | Non-transient exception or schema mismatch | Inspect poison payload and verify handler version + contract changes |
| Duplicate business records | Missing idempotency key or non-atomic side effects | Correlate duplicate entities by operation key and retry attempts |
| Frequent timeout failures | Function timeout too low or dependency latency regression | Review timeout settings and dependency latency percentile |
| Dead-letter growth in Service Bus | Lock lost, max delivery exceeded, or explicit dead-letter | Query deadLetterReason and check lock duration |
| Regional incident causes prolonged outage | Single-region architecture with no failover path | Validate multi-region topology and failover runbook |
Reliability checklist¶
- Define retry policy per trigger type.
- Enforce idempotency in every async handler.
- Define poison queue alert + replay process.
- Align timeout with business SLA.
- Validate zone strategy on Premium/Dedicated where required.
Operations Guide
For runbook details, see Operations: Retries and Poison Handling.
Advanced Topics¶
Durable Functions reliability patterns¶
Durable Functions improves reliability for long-running orchestration, but reliability still depends on deterministic orchestrator logic and safe activity retries.

- Keep orchestrator functions deterministic.
- Put side effects in activity functions, not orchestrators.
- Configure activity retry policies with bounded max attempts and backoff.
- Use compensation activities for partially completed workflows.
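The bounded-retry-plus-compensation idea can be sketched independently of Durable Functions itself. This is plain illustrative Python (no Durable APIs): each step pairs an action with a compensator, each action gets a bounded number of attempts, and a permanent failure unwinds completed steps in reverse order:

```python
def run_with_compensation(steps, max_attempts: int = 3):
    """Run (action, compensate) steps in order; retry each action up to
    max_attempts, and on permanent failure undo completed steps in reverse.
    Returns the number of steps completed on success."""
    completed = []
    for action, compensate in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                completed.append(compensate)
                break
            except Exception:
                if attempt == max_attempts:
                    # Permanent failure: compensate in reverse order.
                    for undo in reversed(completed):
                        undo()
                    raise
    return len(completed)
```

In Durable Functions the equivalent structure uses activity functions with retry options and explicit compensation activities; the control flow above is the pattern, not the API.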
Exactly-once processing patterns¶
Exactly-once transport is rarely available end-to-end; achieve exactly-once effects by combining idempotency and atomic state transitions.

1. Inbox table pattern
    - Record the processed event key before the side effect.
    - Skip the side effect when the key already exists.
2. Outbox pattern
    - Persist the state change and outbound event atomically.
    - Publish from an outbox worker with retry and dedupe.
3. Upsert + version check
    - Require an expected version/etag for updates.
    - Reject stale duplicates safely.
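A minimal sketch of the inbox table pattern, using SQLite for illustration (a real Functions app would use its transactional data store): the event-key insert and the side effect commit in one transaction, so a duplicate delivery either sees the key and skips, or rolls back cleanly.

```python
import sqlite3

def process_once(conn: sqlite3.Connection, event_key: str, side_effect) -> bool:
    """Inbox pattern: record the event key and run the side effect in one
    transaction; duplicates are skipped. Returns True if processed."""
    conn.execute("CREATE TABLE IF NOT EXISTS inbox (event_key TEXT PRIMARY KEY)")
    try:
        with conn:  # transaction: key insert + side effect commit together
            conn.execute("INSERT INTO inbox (event_key) VALUES (?)", (event_key,))
            side_effect(conn)
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: idempotent no-op
```

The primary-key constraint plays the role of the deterministic operation key described in the idempotency section.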
Multi-region failover¶
Choose a strategy based on workload criticality and recovery objectives:

- Active-passive: lower cost, simpler operations, longer failover time.
- Active-active: higher complexity, better regional fault tolerance.
Health check probes¶
Health endpoints and synthetic probes improve early detection of reliability regressions.

- Provide a lightweight /api/healthz endpoint for liveness checks.
- Add readiness checks for critical dependencies.
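The readiness side can be sketched as a plain dependency-check aggregator whose report an HTTP-triggered /api/healthz function might return as JSON. This is illustrative plain Python with hypothetical check names, not a Functions API:

```python
def readiness(checks: dict) -> dict:
    """Run named dependency checks and aggregate a readiness report.
    Each check is a zero-argument callable that raises on failure."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"fail: {exc}"
    status = "ready" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": status, "checks": results}
```

Keep each check cheap and bounded by a short timeout so the probe itself never becomes a reliability hazard.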
Language-Specific Details¶
Use language-specific guidance for runtime nuances, extension bundles, and host configuration details:

- Python: Python Guide, host.json for Python, Python troubleshooting
- Node.js: Node.js Guide
- .NET: .NET Guide
- Java: Java Guide
See Also¶
- Triggers and bindings
- Scaling
- Security
- Operations: Retries and Poison Handling
- Troubleshooting methodology
Sources¶
- Microsoft Learn: Design reliable Azure Functions applications
- Microsoft Learn: Azure Functions reliability in Azure Well-Architected Framework
- Microsoft Learn: Azure Functions host.json reference
- Microsoft Learn: Azure Queue Storage trigger and bindings
- Microsoft Learn: Azure Service Bus trigger and bindings