
Reliability

Reliability in Azure Functions is a design concern, not only an operations concern. Your trigger model, hosting plan, retry policy, and network topology jointly determine failure behavior.

Prerequisites

Before you finalize reliability design decisions, verify these prerequisites:
- You know the trigger semantics for each workload (at-most-once, at-least-once, checkpoint-driven).
- You have a defined business SLO/SLA for latency, recovery time, and acceptable data loss.
- You can map each critical dependency (storage, messaging, identity, database, DNS, network).
- You have access to Azure CLI (az) and monitoring telemetry (Application Insights, Metrics, Log Analytics).
- You have ownership for poison/dead-letter triage and replay procedures.

Main Content

Reliability layers

Design for reliability across four layers:
1. Trigger semantics (delivery guarantees, retries, checkpointing)
2. Function behavior (idempotency, timeout, exception handling)
3. Platform behavior (scale transitions, zone support, host restarts)
4. Dependency behavior (throttling, transient failure, private network reachability)

Retry strategy

Azure Functions provides built-in retry policies for the triggers that support them. Common retry models:
- Fixed delay retry
- Exponential backoff retry

Use retries for transient failures only. Route non-transient failures to dead-letter/poison handling paths.
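As a rough illustration of the transient/non-transient split, the sketch below classifies a dependency failure by HTTP status code. The status set and the names TRANSIENT_STATUS, FailureKind, and classify_failure are assumptions made for this example and are not part of any Azure SDK; tune the set per dependency.

```python
from enum import Enum

# Assumed set of retryable HTTP statuses; adjust for each dependency.
TRANSIENT_STATUS = frozenset({408, 429, 500, 502, 503, 504})


class FailureKind(Enum):
    TRANSIENT = "transient"          # safe to retry with backoff
    NON_TRANSIENT = "non_transient"  # route to poison/dead-letter handling


def classify_failure(status_code: int) -> FailureKind:
    """Classify a dependency response so the handler knows whether to retry."""
    if status_code in TRANSIENT_STATUS:
        return FailureKind.TRANSIENT
    return FailureKind.NON_TRANSIENT
```

A handler could re-raise on FailureKind.TRANSIENT so the trigger retries, and forward the payload to a triage path otherwise.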

Cross-language retry annotation patterns

The handlers below declare no retry attribute. For the Storage Queue trigger, redelivery is governed by maxDequeueCount in host.json, and each handler signals failure by raising an exception.

Python

import azure.functions as func
app = func.FunctionApp()

@app.function_name(name="ProcessQueue")
@app.queue_trigger(arg_name="msg", queue_name="orders", connection="AzureWebJobsStorage")
def process_queue(msg: func.QueueMessage) -> None:
    # Handle message idempotently; raise on transient failures
    pass

Node.js

const { app } = require('@azure/functions');

app.storageQueue('processQueue', {
  queueName: 'orders',
  connection: 'AzureWebJobsStorage',
  handler: async (message, context) => {
    // Handle idempotently; throw on transient failures
  }
});

.NET (isolated worker)

using Microsoft.Azure.Functions.Worker;

public class ProcessQueue
{
    [Function("ProcessQueue")]
    public void Run([QueueTrigger("orders", Connection = "AzureWebJobsStorage")] string message)
    {
        // Handle idempotently; throw on transient failures
    }
}

Retry flow with exponential backoff timing

sequenceDiagram
    autonumber
    participant T as Trigger
    participant F as Function
    participant D as Dependency
    T->>F: Delivery #1
    F->>D: Call
    D-->>F: 503 transient failure
    F-->>T: Throw exception
    Note over T,F: Retry #1 after 5s
    T->>F: Delivery #2
    F->>D: Call
    D-->>F: Timeout
    F-->>T: Throw exception
    Note over T,F: Retry #2 after 15s
    T->>F: Delivery #3
    F->>D: Call
    D-->>F: 429 throttled
    F-->>T: Throw exception
    Note over T,F: Retry #3 after 45s
    T->>F: Delivery #4
    F->>D: Call
    D-->>F: Success
    F-->>T: Ack/Complete
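
The delays in the diagram above (5s, 15s, 45s) are consistent with a first delay of 5 seconds multiplied by 3 on each retry. That multiplier is inferred from the diagram, not taken from a documented default. A minimal sketch of the arithmetic, with a hypothetical helper name and an assumed cap:

```python
def backoff_delays(first_delay_s: float, factor: float, retries: int, max_delay_s: float) -> list[float]:
    """Return the delay before each retry: first_delay_s * factor**n, capped at max_delay_s."""
    return [min(first_delay_s * factor ** n, max_delay_s) for n in range(retries)]


# Three retries, starting at 5 s, tripling each time, capped at 60 s.
print(backoff_delays(5, 3, 3, 60))  # prints [5, 15, 45]
```

The cap matters once retries continue: a fourth retry would be 135 s uncapped and is held at 60 s here.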

host.json retry configuration examples

Use these examples as host-level reliability templates. Trigger-level retry declarations still apply where supported by language/runtime bindings.

Fixed delay retry config

{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "clientRetryOptions": {
        "mode": "fixed",
        "tryTimeout": "00:01:00",
        "delay": "00:00:05",
        "maxDelay": "00:00:05",
        "maxRetries": 5
      }
    }
  }
}

Exponential backoff retry config

{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "clientRetryOptions": {
        "mode": "exponential",
        "tryTimeout": "00:01:00",
        "delay": "00:00:02",
        "maxDelay": "00:01:00",
        "maxRetries": 8
      }
    }
  }
}

Max retry count settings

{
  "version": "2.0",
  "extensions": {
    "queues": {
      "maxDequeueCount": 8,
      "visibilityTimeout": "00:00:30",
      "batchSize": 16,
      "newBatchThreshold": 8
    }
  }
}

Retry scope matters

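clientRetryOptions controls how the Service Bus client inside the Functions host retries its own calls to the messaging service. It does not re-run your function. Retries of function executions are configured separately, through whatever retry support the trigger and runtime provide.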

Poison message handling

For queue-based triggers, repeated failure eventually moves messages to poison/dead-letter paths (service-specific behavior). Design requirements:
- preserve original payload and correlation metadata,
- alert on poison queue growth,
- provide replay workflow after remediation,
- prevent infinite retry loops.

Do not drop poison messages

Poison events are high-signal reliability data. Route them to explicit triage and replay pipelines.

Queue-specific poison behaviors

Storage Queue trigger
- Every processing attempt increments dequeueCount.
- When a message has failed maxDequeueCount times, the runtime moves it to <queue-name>-poison.
- Preserve these fields for replay and forensics:
  - id
  - dequeueCount
  - insertionTime
  - nextVisibleTime
  - custom correlationId (if present)

Service Bus trigger
- Messages are dead-lettered after max delivery count or explicit dead-letter action.
- Capture deadLetterReason and deadLetterErrorDescription before replay.
- Typical reasons include lock lost, deserialization failure, or business validation failure.

flowchart TD
    A[Queued Message] --> B[Function Invocation]
    B -->|Success| C[Complete Message]
    B -->|Failure| D["Abandon/Release Lock"]
    D --> E{Retry budget remaining?}
    E -->|Yes| B
    E -->|No| F[Poison Queue or Dead-letter Queue]
    F --> G[Triage + Root Cause]
    G --> H{Remediated?}
    H -->|Yes| I[Replay Pipeline]
    H -->|No| J[Escalate + Quarantine]
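
The flow above can be approximated with a small in-memory simulation. This is a sketch of the retry-budget logic only, not the Functions runtime's implementation; drain, handler, and the record shape are hypothetical names chosen for the example.

```python
from collections import deque
from typing import Callable


def drain(messages: list[str], handler: Callable[[str], None], max_dequeue_count: int):
    """Process messages; return (completed, poison) lists.

    Simulation only: a message that fails max_dequeue_count times is moved
    to the poison list instead of being retried forever.
    """
    queue = deque((body, 0) for body in messages)  # (body, dequeue_count)
    completed, poison = [], []
    while queue:
        body, dequeue_count = queue.popleft()
        dequeue_count += 1  # every delivery attempt counts
        try:
            handler(body)
            completed.append(body)
        except Exception:
            if dequeue_count >= max_dequeue_count:
                # Retry budget exhausted: keep payload and attempt count for triage.
                poison.append({"body": body, "dequeueCount": dequeue_count})
            else:
                queue.append((body, dequeue_count))  # redelivered later
    return completed, poison


def handler(body: str) -> None:
    if body == "bad":
        raise ValueError("schema mismatch")  # non-transient failure


completed, poison = drain(["ok", "bad"], handler, max_dequeue_count=3)
```

Here "ok" completes on the first attempt, while "bad" is attempted three times and then lands in the poison list with its payload and attempt count intact.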

Timeout design

Timeout boundaries are part of reliability behavior.

| Plan | Default timeout | Maximum timeout |
| --- | --- | --- |
| Consumption (classic) | 5 min | 10 min |
| Flex Consumption | 30 min | Unbounded |
| Premium | 30 min (common default) | Unbounded |
| Dedicated | 30 min (common default) | Unbounded |

If your business process exceeds timeout bounds, redesign to asynchronous orchestration.
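
One way to stay inside a timeout bound while moving toward asynchronous processing is to work against an explicit time budget and hand off whatever remains. The sketch below assumes a hypothetical helper, process_within_budget, with an injectable clock so it can be tested. It illustrates the idea and is not a Functions API.

```python
import time
from typing import Callable, Iterable


def process_within_budget(
    items: Iterable[str],
    work: Callable[[str], None],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> list[str]:
    """Process items until the time budget is spent; return the unprocessed remainder."""
    deadline = clock() + budget_s
    remaining = list(items)
    while remaining and clock() < deadline:
        work(remaining.pop(0))
    return remaining  # hand these to a follow-up message or orchestration step
```

The budget should sit comfortably below the configured function timeout, leaving room for the slowest single item and for enqueuing the remainder.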

Availability zones and high availability

Zone-aware architecture options are strongest on Premium, Dedicated, and Flex Consumption plans.
- Premium, Dedicated, and Flex Consumption can be designed for zone-resilient deployments (region permitting).
- Zone-resilient design should include zone-redundant dependencies (storage, messaging, data stores).
- Consumption designs should emphasize retry/idempotency and multi-region recovery patterns where needed.

flowchart LR
    subgraph Region[Azure Region]
        subgraph Z1[Zone 1]
            F1[Function Workers]
        end
        subgraph Z2[Zone 2]
            F2[Function Workers]
        end
        subgraph Z3[Zone 3]
            F3[Function Workers]
        end
        LB["Front Door / Traffic Manager"]
        SB[(Service Bus Premium ZR)]
        ST[(Storage Account ZRS)]
    end
    LB --> F1
    LB --> F2
    LB --> F3
    F1 --> SB
    F2 --> SB
    F3 --> SB
    F1 --> ST
    F2 --> ST
    F3 --> ST

Idempotency is mandatory

Because retries and duplicate deliveries are normal in distributed systems, handlers must be idempotent. Idempotency patterns:
- deterministic operation keys,
- upsert instead of blind insert,
- de-duplication table/cache,
- exactly-once effects at domain boundary where feasible.
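
A deterministic operation key can be derived from the business fields that identify one operation, so that every redelivery of the same event maps to the same key. The sketch below is one possible approach; operation_key and the field names in the usage note are hypothetical.

```python
import hashlib
import json


def operation_key(event: dict, fields: tuple[str, ...]) -> str:
    """Derive a stable key from the business fields that identify one operation."""
    identity = {name: event[name] for name in fields}
    # Canonical JSON: sorted keys and fixed separators make the encoding stable.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

For example, operation_key(event, ("orderId", "action")) ignores volatile fields such as a delivery attempt counter, so a retried message produces the same key as the first delivery.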

Python idempotency example

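import json
from datetime import datetime, timezone

import azure.functions as func
from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableServiceClient

app = func.FunctionApp()

@app.function_name(name="ProcessOrder")
@app.queue_trigger(arg_name="msg", queue_name="orders", connection="AzureWebJobsStorage")
def process_order(msg: func.QueueMessage) -> None:
    payload = json.loads(msg.get_body().decode("utf-8"))
    operation_id = payload["operationId"]

    # Local emulator connection string; read it from app settings outside local development.
    table_service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")
    # create_table_if_not_exists is a TableServiceClient method and returns a TableClient.
    table_client = table_service.create_table_if_not_exists("processedoperations")

    try:
        # RowKey is unique within a partition, so inserting the same
        # operation_id a second time raises ResourceExistsError.
        table_client.create_entity({
            "PartitionKey": "order-processing",
            "RowKey": operation_id,
            "processedAt": datetime.now(timezone.utc).isoformat()
        })
    except ResourceExistsError:
        # Duplicate delivery: idempotent no-op
        return
    # Any other exception propagates so the message is retried, not silently dropped.

    # Side effect executes at most once per operation_id. If it can fail after
    # the marker is written, make it safe to replay or remove the marker on failure.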

Dependency resilience

Protect downstream dependencies using:
- timeout budgets per call,
- transient retry with jitter,
- circuit breaking,
- bulkheading (separate processing lanes for critical/non-critical work).
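
As an illustration of circuit breaking, the sketch below stops calling a dependency after a run of consecutive failures and allows a single trial call once a cool-down has elapsed. CircuitBreaker and CircuitOpenError are hypothetical names, and the thresholds are assumptions; a production system would more likely use an established resilience library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency calls are suspended")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Sharing one breaker per dependency across invocations on the same worker keeps a failing downstream service from absorbing every retry.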

Reliability architecture pattern

flowchart LR
    In[Trigger Event] --> Fn[Function Handler]
    Fn -->|Success| Ok["Commit / Ack"]
    Fn -->|Transient error| Rt[Retry Policy]
    Rt --> Fn
    Fn -->|Exceeded retries| P["Poison / Dead-letter"]
    P --> Ops[Alert + Triage + Replay]

CLI validation examples (PII masked)

Use CLI checks during reviews and incidents to confirm reliability-related configuration and telemetry.

Inspect function app reliability settings

az functionapp config show \
  --resource-group "rg-functions-prod" \
  --name "func-reliability-prod" \
  --query "{alwaysOn:alwaysOn,http20Enabled:http20Enabled,ftpsState:ftpsState,minTlsVersion:minTlsVersion}" \
  --output json

Query failure and retry metrics

az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/rg-functions-prod/providers/Microsoft.Web/sites/func-reliability-prod" \
  --metric "FunctionExecutionCount,FunctionExecutionUnits,FunctionExecutionFailureCount" \
  --interval "PT5M" \
  --aggregation "Total" \
  --output table

az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/rg-functions-prod/providers/Microsoft.ServiceBus/namespaces/sb-functions-prod" \
  --metric "DeadletteredMessages,IncomingMessages,SuccessfulRequests,ServerErrors" \
  --interval "PT5M" \
  --aggregation "Total" \
  --output table

Troubleshooting matrix

| Symptom | Likely Cause | Validation Path |
| --- | --- | --- |
| Sudden spike in retries with eventual success | Downstream transient throttling | Check dependency 429/503 in traces and compare with retry timing |
| Messages accumulate in poison queue | Non-transient exception or schema mismatch | Inspect poison payload and verify handler version + contract changes |
| Duplicate business records | Missing idempotency key or non-atomic side effects | Correlate duplicate entities by operation key and retry attempts |
| Frequent timeout failures | Function timeout too low or dependency latency regression | Review timeout settings and dependency latency percentile |
| Dead-letter growth in Service Bus | Lock lost, max delivery exceeded, or explicit dead-letter | Query deadLetterReason and check lock duration |
| Regional incident causes prolonged outage | Single-region architecture with no failover path | Validate multi-region topology and failover runbook |

Reliability checklist

  • Define retry policy per trigger type.
  • Enforce idempotency in every async handler.
  • Define poison queue alert + replay process.
  • Align timeout with business SLA.
  • Validate zone strategy on Premium/Dedicated/Flex Consumption where required.

Operations Guide

For runbook details, see Operations: Retries and Poison Handling.

Advanced Topics

Durable Functions reliability patterns

Durable Functions improves reliability for long-running orchestration, but reliability still depends on deterministic orchestrator logic and safe activity retries.
- Keep orchestrator functions deterministic.
- Put side effects in activity functions, not orchestrators.
- Configure activity retry policies with bounded max attempts and backoff.
- Use compensation activities for partially completed workflows.
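
The compensation idea can be sketched in plain Python, independent of the Durable Functions API: run steps in order and, when one fails, undo the completed steps in reverse. run_with_compensation and the step names in the usage note are hypothetical. In a real orchestrator each action and compensation would be an activity call.

```python
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_with_compensation(steps: list[Step]) -> bool:
    """Run actions in order; on failure, undo completed steps in reverse order."""
    undo_stack: list[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(undo_stack):
                undo()  # compensations should themselves be idempotent
            return False
        undo_stack.append(compensate)
    return True
```

With steps such as reserve/release and charge/refund, a failure in a third step would trigger refund and then release, leaving no partially completed workflow behind.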

Exactly-once processing patterns

Exactly-once transport is rarely available end-to-end; achieve exactly-once effects by combining idempotency and atomic state transitions.
1. Inbox table pattern
   - Record processed event key before side effect.
   - Skip side effect when key already exists.
2. Outbox pattern
   - Persist state change and outbound event atomically.
   - Publish from outbox worker with retry and dedupe.
3. Upsert + version check
   - Require expected version/etag for updates.
   - Reject stale duplicates safely.
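
The inbox and outbox patterns can be combined in a single local transaction. The sketch below uses an in-memory SQLite database purely to show the atomicity; the table names, open_store, and handle_event are hypothetical, and a real system would use its own transactional store.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE inbox (event_key TEXT PRIMARY KEY);
        CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
        """
    )
    return conn


def handle_event(conn: sqlite3.Connection, event_key: str, order_id: str) -> bool:
    """Apply one event at most once; return False for a duplicate delivery."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # Inbox: the primary key rejects an event key seen before.
            conn.execute("INSERT INTO inbox (event_key) VALUES (?)", (event_key,))
            # State change and outbound event are persisted together.
            conn.execute(
                "INSERT OR REPLACE INTO orders (order_id, status) VALUES (?, 'confirmed')",
                (order_id,),
            )
            conn.execute(
                "INSERT INTO outbox (payload) VALUES (?)",
                (f"order-confirmed:{order_id}",),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # inbox key already present: skip side effects
```

A separate outbox worker would then read the outbox table and publish each row with retry, which is why downstream consumers still need de-duplication.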

Multi-region failover

Choose strategy based on workload criticality and recovery objectives:
- Active-passive: lower cost, simpler operations, longer failover time.
- Active-active: higher complexity, better regional fault tolerance.

Health check probes

Health endpoints and synthetic probes improve early detection of reliability regressions.
- Provide a lightweight /api/healthz endpoint for liveness checks.
- Add readiness checks for critical dependencies.
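
A readiness check can be kept separate from the HTTP plumbing. The sketch below aggregates per-dependency probes into a status code and JSON body. readiness_report and the probe names in the usage note are hypothetical, and wiring the result into an HTTP-triggered function is left out.

```python
import json
from typing import Callable


def readiness_report(checks: dict[str, Callable[[], bool]]) -> tuple[int, str]:
    """Run dependency checks; return (HTTP status code, JSON body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a failing probe must not crash the endpoint
    status = 200 if all(results.values()) else 503
    return status, json.dumps({"ready": status == 200, "checks": results})
```

Called with probes such as {"storage": ..., "database": ...}, it returns 503 as soon as any probe fails or raises, which lets a synthetic monitor distinguish "host is up" from "host can do useful work".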

Language-Specific Details

Use language-specific guidance for runtime nuances, extension bundles, and host configuration details:
- Python: Python Guide, host.json for Python, Python troubleshooting
- Node.js: Node.js Guide
- .NET: .NET Guide
- Java: Java Guide

See Also

Sources