Reliability Best Practices for Azure Functions¶

Reliable Azure Functions systems are built around the runtime execution model: most event-driven triggers (queues, Event Hubs, Service Bus, blob) deliver at-least-once, retries are normal, host instances can restart at any time, and dependencies can throttle. HTTP triggers behave differently — they are synchronous request/response and do not retry automatically. This guide focuses on safe execution patterns that keep behavior correct during duplicates, retries, scale-out, and partial outages.

Platform and operations context

Keep this guide focused on implementation choices. For platform behavior details and recovery runbooks, see Platform Reliability, Retries and Poison Handling, and Recovery.

Why This Matters¶

Function instances are ephemeral. The runtime can:

start multiple instances during scale-out,
recycle an instance during host upgrade or health event,
reassign future invocations to different workers.

Any in-memory state can disappear or become inconsistent across instances.

Design style	State location	Behavior during restart/scale-out	Reliability risk	Recommended usage
Stateless (preferred)	External durable store	New instance can continue using shared state	Low when idempotency is implemented	Default for trigger handlers and API operations
Stateful in-memory cache	Worker process memory	Cache is lost on recycle; not shared across instances	High duplicate or skipped operations	Avoid for correctness-critical flows
Stateful local filesystem	Ephemeral local disk	Files may disappear on cold start or host move	High checkpoint/data loss risk	Use only for temporary non-critical artifacts
Durable orchestration state	Durable Functions backend	Replay and checkpoint survive restarts	Medium (determinism discipline required)	Multi-step workflows with wait states/compensation

What breaks when a function is stateful¶

In-memory dedup caches miss duplicates after restart.
Per-instance counters diverge during parallel scale-out.
"Last processed timestamp" stored in memory causes skipped or duplicated work.
Temporary local files are lost across cold starts and host moves.

Stateful handlers fail under retries and scale

If correctness depends on local memory, retries and replays will eventually create data corruption or duplicated side effects.

Externalize state safely¶

Use external stores based on state type:

Dedup/index state: Table Storage, Cosmos DB, SQL unique constraints.
Workflow state: Durable Functions instance state.
Checkpoint state: trigger-native checkpointing where provided.
Long-running orchestration progress: Durable orchestrator history.

Recommended Practices¶

Idempotency is mandatory for every trigger¶

At-least-once delivery means a message or event can be processed more than once. Build every handler so duplicate invocation is safe.

Trigger-specific deduplication patterns¶

Trigger type	Duplicate source	Recommended dedup key	Practical implementation
HTTP	Client retries, gateway retries, timeout retries	`Idempotency-Key` header or domain operation ID	Persist key + result status before external side effects
Storage Queue	Visibility timeout expiration, host restart	`MessageId` + business key	Upsert into processed-operations table with unique key
Service Bus	Lock loss, redelivery, transient failures	`MessageId` or domain command ID	Store processed command IDs with TTL and unique index
Event Hubs	Partition replay/checkpoint gaps	Event ID + partition + sequence constraints	Idempotent projection writes and version checks
Timer	Missed schedule catch-up, overlapping execution	Scheduled occurrence timestamp	Persist last completed schedule occurrence

Idempotent means effect-safe, not just exception-safe

Returning 200 twice is not enough. The business side effect (charge, shipment, status transition) must happen once, or happen multiple times with identical final state.

Retry-safe processing design¶

Treat each invocation as potentially repeated.

Validate input and compute deterministic operation key.
Check whether operation already completed.
Execute side effect with timeout budget.
Persist completion marker atomically.
Ack/complete message only after durable success.

flowchart TD
    S[Invocation starts] --> K[Compute deterministic operation key]
    K --> D{Already completed?}
    D -->|Yes| R1[Return cached result or ack]
    D -->|No| E[Execute side effect]
    E --> T{Failure type}
    T -->|Transient| RT[Retry with backoff]
    RT --> S
    T -->|Terminal| DL[Send to poison queue or DLQ with reason]
    E --> OK{Succeeded?}
    OK -->|Yes| M[Persist completion marker atomically]
    M --> R2[Ack or complete message]

Queue trigger safety (Storage Queue)¶

visibilityTimeout controls message reappearance window.
dequeueCount increments on each receive.
maxDequeueCount moves failed messages to poison queue.

Operational guidance:

Set visibilityTimeout longer than normal processing time plus downstream jitter.
Keep processing idempotent because visibility expiration can cause concurrent duplicate processing.
Alert on poison queue growth rate, not just absolute count.

Service Bus trigger safety¶

Delivery lock can expire if processing exceeds lock duration.
Retries can redeliver same message.
Max delivery count routes to dead-letter queue.

Operational guidance:

Keep handler duration inside lock window or enable lock renewal where supported.
Move terminal business-rule failures to dead-letter with explicit reason metadata.
Build replay tooling that can requeue selectively after remediation.

HTTP trigger safety with client retries¶

Clients, proxies, and API gateways can retry when they see 408/429/5xx or connection resets.

Design pattern:

Require Idempotency-Key for state-changing operations.
Store key with operation status (started, completed, failed-retryable, failed-terminal).
Return existing result when key already completed.
Expire keys with retention aligned to client retry behavior.

Timer trigger safety¶

Timer schedules can be delayed by host restarts and scale events.

Design pattern:

Read schedule metadata and detect missed occurrences.
Process windows by deterministic range (from, to) instead of "now only" logic.
Use singleton/lease patterns for non-partitionable maintenance jobs.

Storage dependency resilience (AzureWebJobsStorage)¶

AzureWebJobsStorage is a runtime dependency for core host behavior and many triggers.

Failure modes and impact:

Unavailable storage: triggers may stop polling or fail initialization.
Throttled storage: increased latency, repeated retries, backlog growth.
Misconfigured identity or connection: startup failure, binding errors, trigger silence.

Failure mode	Runtime symptom	Operational impact	Primary mitigation	Detection signal
Unavailable host storage	Trigger listener initialization fails or stops polling	Invocation gap and delayed processing	Ensure private/public path alignment and storage availability SLO	Host startup failures and trigger listener errors
Throttled host storage	Binding/lease/checkpoint operations slow down	Retry amplification and backlog growth	Capacity planning, retry budget tuning, isolate heavy app data workloads	Storage throttle metrics and queue lag growth
Identity/RBAC misconfiguration	Host startup errors, authentication failures to storage	App appears healthy but no trigger progress	Validate managed identity role assignments pre-production	Authorization failure logs and startup diagnostics
DNS/network mismatch to storage endpoint	Intermittent or persistent connection timeouts	Unpredictable trigger execution reliability	Align private endpoint, DNS links, and route policy for runtime subnet	Connection timeout spikes after network policy changes

Mitigations:

Use supported identity-based configuration where available.
Monitor host startup failures and storage-related exceptions.
Separate application data storage from host storage when isolation is needed.
Test failure behavior in staging by simulating storage deny/throttle conditions.

Do not treat host storage as optional

If host storage access fails, reliability degrades before business logic runs. Guardrails must include configuration validation and RBAC checks.

Durable Functions: when to use orchestrations¶

Use plain functions when work is short, stateless, and independently retryable. Use Durable Functions when you need deterministic workflow state across retries and restarts.

Choose Durable when you need:

fan-out/fan-in with durable checkpoints,
long-running human or external-system waits,
compensation workflows,
deterministic replayed orchestration logic.

Keep plain functions when you need:

single-step event handling,
low-latency stateless transformations,
simple queue-to-API processing.

Decision criteria	Plain Functions	Durable Functions
Workflow length	Short, single-step units	Multi-step or long-running workflows
State management	Externalized manually in app/storage	Built-in orchestration history and checkpoints
Retry model	Trigger/runtime retries per invocation	Deterministic replay with orchestrator semantics
Human interaction/waits	Complex to implement reliably	First-class pattern (wait for external event/timer)
Compensation/saga	Manual compensation logic	Structured orchestrations and compensation flows
Latency sensitivity	Better for low-latency stateless handlers	Additional orchestration overhead

Timeout and cancellation handling¶

Plan-specific execution limits must drive handler design.

Break long tasks into chunked operations.
Use cancellation tokens or equivalent cancellation signals.
Stop creating new outbound calls once cancellation is requested.
Persist safe resume checkpoints before exiting.

Example cancellation-aware strategy:

Read cancellation signal at loop boundaries.
Complete current atomic unit.
Persist checkpoint state.
Exit gracefully so retry resumes from checkpoint.

Hosting plan	Default timeout behavior	Maximum timeout behavior	Reliability design guidance
Consumption (Y1)	Default is short-running oriented	Bounded maximum (plan-specific)	Keep handlers chunked and checkpoint frequently
Flex Consumption (FC1)	Default around 30 minutes	Supports long-running/unbounded scenarios by configuration	Use explicit cancellation and progress checkpoints
Premium (EP)	Suitable for long-running workloads	Extended runtime support by plan/runtime settings	Prefer resilient chunking even when long timeout is available
Dedicated (App Service Plan)	App Service-style execution behavior	Configurable with host/runtime constraints	Guard against runaway executions with explicit budgets

Common Mistakes / Anti-Patterns¶

Do not treat poison/dead-letter paths as final discard queues.

Minimum production pattern:

capture payload, trigger metadata, and failure reason,
classify transient versus terminal failures,
create replay process with dedup protection,
track mean time to remediate poison items.

Trigger-specific poison handling¶

Storage Queue: monitor <queue>-poison; create poison processor.
Service Bus: monitor DLQ counts and age; preserve dead-letter reason.
Event-based systems: implement side-channel failed-event store if no native poison queue exists.

Trigger family	Native poison/dead-letter path	What to retain for replay	Replay control recommendation
Storage Queue	`<queue>-poison` via `maxDequeueCount`	Payload, `dequeueCount`, insertion/expiration metadata	Reprocess with dedup key check before side effects
Service Bus	Entity DLQ with dead-letter reason/description	Payload, `MessageId`, dead-letter reason, delivery count	Replay selectively after root-cause tagging
Event Hubs/Event Grid style eventing	No universal poison queue in all patterns	Event envelope, partition/offset or event ID, failure category	Store failed events in side channel and build idempotent re-drive pipeline
HTTP	No native poison queue	Request body hash, idempotency key, response status timeline	Client-safe retry contract and operator replay endpoint

Message processing lifecycle (retry and poison path)¶

flowchart TD
    T[Trigger event arrives] --> H[Function handler starts]
    H --> D{Duplicate already processed?}
    D -->|Yes| A1[Ack or return cached result]
    D -->|No| X[Execute business side effect]
    X --> S{Success?}
    S -->|Yes| C[Persist completion marker]
    C --> A2[Ack or complete message]
    S -->|No transient| R[Runtime retry path]
    R --> H
    S -->|No terminal| P[Poison queue or DLQ]
    P --> O[Ops triage and replay]
    O --> H

Validation Checklist¶

[ ] Every trigger handler has explicit idempotency key logic.
[ ] Side effects are guarded by dedup store or domain unique constraint.
[ ] Queue visibilityTimeout and maxDequeueCount are tuned and documented.
[ ] Service Bus lock duration/renewal and DLQ strategy are documented.
[ ] HTTP write operations require idempotency key and replay-safe status model.
[ ] Timer jobs handle missed schedules and avoid overlap issues.
[ ] AzureWebJobsStorage identity/permissions/config are validated in pre-production.
[ ] Timeout budget exists for each dependency call path.
[ ] Poison and dead-letter queues have alerts, dashboard, and replay runbook.
[ ] Durable Functions is used for multi-step workflows that need durable state.