Reliability Best Practices¶

Reliability in Azure App Service comes from deliberate architecture and operational discipline. This guide focuses on practical design decisions that reduce outage frequency, shorten recovery time, and improve user trust.

Reliability Goals¶

For production systems, reliability should be expressed through measurable targets:

Availability target (for example, 99.9% or higher)
Recovery Time Objective (RTO)
Recovery Point Objective (RPO)
Error budget and incident response thresholds
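
To make an availability target concrete, it helps to convert the percentage into a downtime budget. The sketch below is plain arithmetic, not an App Service API; the function name and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    availability_pct is a percentage such as 99.9; window_days is the
    measurement window (30 days is an assumption, not a platform rule).
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves roughly 43.2 minutes of total downtime, which is the error budget that incident thresholds can be set against.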

Reliability is multi-layered

Platform health, application behavior, and dependency resilience all contribute to end-to-end uptime.

Prerequisites¶

Before implementing advanced reliability patterns, ensure:

Structured logging and distributed tracing are enabled
A health endpoint exists and validates critical dependencies
Backups are configured and periodically tested
Incident runbooks have clear ownership and escalation

Health Check Probe Configuration¶

Health checks are the first reliability control in App Service. They allow unhealthy instances to be detected and removed from rotation.

Design Principles for Health Endpoints¶

Keep the response lightweight and deterministic
Validate required dependencies (database, cache, key services)
Separate readiness from deep diagnostics where possible
Return clear status codes and a minimal payload

Example Health Check Configuration¶

# Set the health check path (assumes $RG and $APP_NAME are already defined)
az webapp config set \
    --resource-group "$RG" \
    --name "$APP_NAME" \
    --generic-configurations '{"healthCheckPath":"/healthz"}'

What a Good `/healthz` Should Validate¶

Process liveness and request handling loop
Dependency connectivity with timeout guard
Critical configuration presence
Version metadata for rollout debugging
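
The list above can be sketched as a small, framework-agnostic function that a `/healthz` route could call. This is a minimal sketch under assumptions: `evaluate_health`, the `checks` mapping, and the two-second default timeout are hypothetical names and values, not part of App Service or any specific web framework.

```python
import concurrent.futures
import json
import time
from typing import Callable, Dict, Tuple


def evaluate_health(
    checks: Dict[str, Callable[[], bool]],
    timeout_seconds: float = 2.0,
    version: str = "unknown",
) -> Tuple[int, str]:
    """Run dependency checks under a shared deadline.

    Returns (HTTP status code, JSON body). Any check that fails, raises,
    or does not finish before the deadline marks the instance unhealthy.
    """
    results: Dict[str, bool] = {}
    deadline = time.monotonic() + timeout_seconds
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = bool(future.result(timeout=remaining))
            except Exception:
                # Timeout or an exception inside the check: report unhealthy.
                results[name] = False
    finally:
        # Do not hold the response while a hung check is still running.
        pool.shutdown(wait=False)

    healthy = all(results.values())
    body = json.dumps(
        {
            "status": "ok" if healthy else "unhealthy",
            "version": version,  # version metadata for rollout debugging
            "checks": results,
        }
    )
    return (200 if healthy else 503), body
```

A route handler would pass in cheap connectivity probes (for example, a lightweight database ping) and return the status code and body unchanged. Returning 503 when any check fails keeps the endpoint honest, which is what lets the platform act on it.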

Portal view: Health check blade¶

The Health check blade is where the platform enforces the rotation-removal behavior that every reliability strategy depends on. The state shown captures both the most common anti-pattern and the platform's own warning about it: the Health check checkbox is unchecked, and the blue banner states that a single-instance plan cannot remove an unhealthy instance from rotation; the instance can only be replaced after a full hour of failing pings. This is why this guide pairs /healthz configuration with the minimum-two-instances rule: without the checkbox enabled and at least two instances, neither removal from rotation nor fast replacement is available. Enable the checkbox, set the path to /healthz, and verify that the platform now removes unhealthy instances from rotation instead of waiting an hour to replace them.

Do not fake healthy status

If key dependencies are unavailable, return unhealthy. Hiding failures delays detection and increases incident impact.

Multi-Region Deployment Patterns¶

Single-region design is often acceptable for low-criticality apps, but production-critical systems should plan for regional disruption.

Pattern 1: Active-Passive¶

Primary region serves all traffic
Secondary region stays warm for failover
DNS or a global traffic manager performs failover routing

Pros:

Simpler operational model
Lower write consistency complexity

Cons:

Secondary capacity may be underutilized
Failover testing is mandatory to avoid surprise failures

Pattern 2: Active-Active¶

Multiple regions serve traffic concurrently
Global routing distributes traffic by priority, latency, or geography
Requires stronger data and session architecture

Pros:

Better regional resilience
Potentially lower user latency by geography

Cons:

Higher architecture and operations complexity
Harder consistency and incident isolation

Reliability Architecture with Health Checks¶

flowchart TD
    U[Users] --> G[Global traffic routing]
    G --> R1[Region A App Service]
    G --> R2[Region B App Service]
    R1 --> H1[/healthz probes/]
    R2 --> H2[/healthz probes/]
    R1 --> D1[(Regional data tier)]
    R2 --> D2[(Regional data tier)]
    H1 --> M[Monitoring and alerting]
    H2 --> M
    M --> O[On-call and incident runbook]

Graceful Shutdown and SIGTERM Handling¶

App Service can recycle instances during scale, patching, and deployment events. Applications must handle termination signals gracefully.

Plan for a short grace period — and treat the exact value as a configurable upper bound

When App Service recycles a worker, it sends SIGTERM to the application process and then sends SIGKILL after a grace period. The exact duration depends on the hosting model:

Code-based apps on App Service have a relatively short grace window (often described as on the order of a few tens of seconds) for in-flight request drain and cleanup.
Custom containers default to 5 seconds before SIGKILL. You can extend this up to 1800 seconds by setting the WEBSITES_CONTAINER_STOP_TIME_LIMIT app setting (see the Microsoft Learn "Configure a custom container" reference).

Treat the grace period as small and configurable, not as a fixed platform guarantee. In-flight request completion, connection draining, and cleanup work must finish well within whatever window your hosting model provides.

Why SIGTERM Handling Matters¶

Prevents abrupt termination of in-flight requests
Reduces partial writes and data corruption risk
Enables cleaner release transitions and scale-in events

Graceful Shutdown Checklist¶

Trap termination signals in the application runtime
Stop accepting new requests quickly (close the listener or start failing the health endpoint)
Finish in-flight requests within the platform's grace window for your hosting model
For custom containers, raise WEBSITES_CONTAINER_STOP_TIME_LIMIT if 5 seconds is insufficient (but keep it bounded)
Flush logs/telemetry and close outbound connections cleanly
Move long-running cleanup (large flush, archive upload) outside the request path so it does not depend on shutdown
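
The checklist above can be sketched in Python as a small coordinator that records the SIGTERM, refuses new work, and waits for in-flight requests within a bounded window. This is a minimal sketch, not a platform API: `GracefulShutdown` and its method names are hypothetical, and the 5-second default simply mirrors the custom container default mentioned earlier.

```python
import signal
import threading
import time


class GracefulShutdown:
    """Tracks in-flight requests and drains them after SIGTERM."""

    def __init__(self, grace_seconds: float = 5.0):
        self.grace_seconds = grace_seconds
        self.stopping = False  # set by the signal handler
        self._in_flight = 0
        self._cond = threading.Condition()

    def install(self) -> None:
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame) -> None:
        # Keep the handler trivial: only record that shutdown has begun.
        self.stopping = True

    def try_begin_request(self) -> bool:
        """Return False once shutdown has started so new work is refused."""
        with self._cond:
            if self.stopping:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self) -> bool:
        """Wait for in-flight requests; return False if the window expired."""
        deadline = time.monotonic() + self.grace_seconds
        with self._cond:
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(timeout=remaining)
        return True
```

In a real service, the main loop would call `install()` at startup, wrap each request in `try_begin_request()` and `end_request()`, and after the flag is set call `drain()` before flushing telemetry and exiting. Set `grace_seconds` below the window your hosting model actually provides so that cleanup finishes before SIGKILL.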

Language-specific implementation

Use this document for design guidance, then implement the signal handlers as described in the runtime section of your language guide.

Retry and Circuit Breaker Patterns¶

Most production outages are dependency-driven, not web-tier-driven. Reliability requires controlled failure handling for downstream calls.

Retry Best Practices¶

Retry only transient failure types
Use exponential backoff with jitter
Set upper bounds on attempts and timeout budget
Avoid infinite retries
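
The practices above can be sketched as a single helper. This is an illustrative sketch under assumptions: `retry_with_backoff`, `TransientError`, and all default values are hypothetical, and the "full jitter" variant (a random delay up to the capped exponential value) is one common choice rather than the only one.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    total_budget: float = 15.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation, retrying only the exception types listed in retry_on."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempt limit reached: surface the failure
            # Exponential backoff, capped, with full jitter.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = rng() * cap
            if time.monotonic() - start + delay > total_budget:
                raise  # the next wait would exceed the overall time budget
            sleep(delay)
    raise AssertionError("unreachable")
```

Exceptions outside `retry_on` propagate immediately, so non-transient failures are never retried. The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting.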

Circuit Breaker Best Practices¶

Open the circuit when the error threshold is exceeded
Fail fast while the circuit is open
Probe with a limited number of half-open requests
Emit breaker state metrics for visibility
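
A minimal, single-threaded sketch of these states is shown below. The class name, thresholds, and the injectable `clock` are assumptions made for illustration; a production breaker would also need locking and a cap on concurrent half-open probes, and most teams would use an established resilience library rather than this code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        """Current state; suitable for exporting as a metric."""
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            # Fail fast without touching the dependency.
            raise CircuitOpenError("dependency circuit is open")
        try:
            result = operation()
        except Exception:
            if state == "half_open":
                # The probe failed: reopen and restart the timer.
                self._opened_at = self._clock()
            else:
                self._failures += 1
                if self._failures >= self.failure_threshold:
                    self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each downstream call in `breaker.call(...)` and recording `breaker.state` on a timer or on each call gives operators the visibility the last bullet asks for.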

Retries can amplify outages

Without backoff and limits, retries create thundering herds against already-failing dependencies.

Backup and Restore Strategy¶

Backups are a reliability control for data/configuration recovery, not a substitute for high availability.

Strategy Components¶

Scheduled app backups with defined retention
Database-native backup alignment with app backup schedule
Storage account durability review
Documented restore sequence and ownership

Validation Requirements¶

Perform periodic restore drills in non-production
Measure real restore time against RTO target
Verify restored app boots and passes health checks

# Example: show the current backup configuration
az webapp config backup show \
    --resource-group "$RG" \
    --webapp-name "$APP_NAME"

Minimum Instance Count for High Availability¶

For production workloads, run at least two instances.

Why minimum two instances matters:

Reduces single-instance failure impact
Supports rolling maintenance without complete service loss
Improves resilience during transient host issues

Single instance is a single failure domain

One instance in production means any restart, crash, or host issue can become user-visible downtime.
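
As a rough illustration of why a second instance helps, the standard formula for redundant components can be applied. This is a simplified sketch: it assumes instance failures are independent, which is not fully true in practice (a bad deployment or a shared dependency outage affects every instance), and the function name and the 99% figure are hypothetical.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances, assuming independent failures.

    instance_availability is a fraction such as 0.99. The service counts as
    available when at least one instance is up.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return 1 - (1 - instance_availability) ** instances


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} instance(s): {combined_availability(0.99, n):.6f}")
```

Under that independence assumption, two instances that are each available 99% of the time give about 99.99% combined. Treat the result as an upper bound, since correlated failures reduce the real benefit.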

Failure Scenario Planning¶

Scenario A: One Instance Unhealthy¶

Expected behavior:

Health checks detect the unhealthy instance and remove it from rotation
Remaining instances continue serving traffic

Scenario B: Dependency Latency Spike¶

Expected behavior:

Timeout and retry policies engage
Circuit breaker opens if sustained failures occur
Alerting notifies operators before full outage

Scenario C: Regional Outage¶

Expected behavior:

Global routing removes affected region
Secondary region serves traffic within RTO
Post-failover validation runbook executes

Reliability Operating Model¶

Pre-Production¶

Load and chaos-style failure testing
Dependency failure simulations
Runbook dry-runs for failover and restore

Production¶

SLO dashboards and error budget tracking
Alert tuning to reduce noise and improve signal
Weekly review of high-severity incident trends

Post-Incident¶

Root cause analysis with timeline accuracy
Preventive action items with ownership and deadlines
Documentation updates in operations and best-practices sections
