Reliability Best Practices

Reliability in Azure App Service comes from deliberate architecture and operational discipline. This guide focuses on practical design decisions that reduce outage frequency, shorten recovery time, and improve user trust.

Reliability Goals

For production systems, reliability should be expressed through measurable targets:

  • Availability target (for example, 99.9% or higher)
  • Recovery Time Objective (RTO)
  • Recovery Point Objective (RPO)
  • Error budget and incident response thresholds
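
To make the availability target and error budget above concrete, the arithmetic can be sketched as follows. This is a minimal illustration, not a prescribed SLO method; the function names `allowed_downtime` and `error_budget_remaining` are hypothetical, and the 30-day window is an assumption.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, window: timedelta) -> timedelta:
    """Downtime budget implied by an availability target over a window."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return window * (1.0 - availability_pct / 100.0)


def error_budget_remaining(
    availability_pct: float, window: timedelta, downtime_so_far: timedelta
) -> float:
    """Fraction of the downtime budget still unspent (0.0 when exhausted)."""
    budget = allowed_downtime(availability_pct, window)
    if budget <= timedelta(0):
        return 0.0  # a 100% target leaves no budget at all
    return max(0.0, 1.0 - downtime_so_far / budget)


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    print(allowed_downtime(99.9, window))
    print(error_budget_remaining(99.9, window, timedelta(minutes=10)))
```

A 99.9% target over a 30-day window leaves about 43.2 minutes of downtime, which is a useful sanity check when setting RTO: a single recovery that takes longer than the whole budget means the target and the RTO are inconsistent.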

Reliability is multi-layered

Platform health, application behavior, and dependency resilience all contribute to end-to-end uptime.

Prerequisites

Before implementing advanced reliability patterns, ensure:

  • Structured logging and distributed tracing are enabled
  • Health endpoint exists and validates critical dependencies
  • Backups are configured and periodically tested
  • Incident runbooks have clear ownership and escalation

Health Check Probe Configuration

Health checks are the first reliability control in App Service. They allow unhealthy instances to be detected and removed from rotation.

Design Principles for Health Endpoints

  • Keep response lightweight and deterministic
  • Validate required dependencies (database, cache, key services)
  • Separate readiness from deep diagnostics where possible
  • Return clear status codes and minimal payload

Example Health Check Configuration

az webapp config set \
    --resource-group $RG \
    --name $APP_NAME \
    --generic-configurations '{"healthCheckPath":"/healthz"}'

What a Good /healthz Should Validate

  • Process liveness and request handling loop
  • Dependency connectivity with timeout guard
  • Critical configuration presence
  • Version metadata for rollout debugging
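
The list above can be sketched as a framework-neutral function that a /healthz route handler would call. This is an assumption-laden sketch: `evaluate_health`, the shape of the `checks` mapping, and the payload fields are hypothetical and not part of any App Service API. The only platform behavior relied on is that a non-2xx status is treated as unhealthy.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple


def evaluate_health(
    checks: Dict[str, Callable[[], bool]],
    timeout_s: float = 2.0,
    version: str = "unknown",
) -> Tuple[int, dict]:
    """Run dependency checks under one shared time budget.

    Returns (http_status, payload): 200 only if every check passes in time,
    otherwise 503 so the platform sees the instance as unhealthy.
    """
    results: Dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        deadline = time.monotonic() + timeout_s
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = "ok" if future.result(timeout=remaining) else "failed"
            except FutureTimeout:
                results[name] = "timeout"  # timeout guard: never hang the probe
            except Exception:
                results[name] = "error"  # a raising check counts as unhealthy
    finally:
        # Do not block the response on a hung check; let it finish in the background.
        pool.shutdown(wait=False, cancel_futures=True)

    healthy = all(state == "ok" for state in results.values())
    payload = {
        "status": "healthy" if healthy else "unhealthy",
        "version": version,  # version metadata for rollout debugging
        "checks": results,
    }
    return (200 if healthy else 503), payload
```

Each entry in `checks` would wrap something cheap, such as a `SELECT 1` against the database or a presence test for a required setting. The payload stays small and contains no secrets or stack traces.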

Portal view: Health check blade

Health check blade for a Web App with two tabs — Health check (active) and Instances — and a command bar showing Save, Discard, Refresh, Troubleshoot, Metrics, and Send us your feedback actions. The first blue info banner reads "Your site has a single instance which will not be removed if it becomes unhealthy. However, after one hour of continuous unhealthy pings, the instance will be replaced. You can still set up Azure Monitor Alerts based on the health status." The second blue info banner reads "Health check is being moved to Configuration. Click here to go to the new experience." A descriptive line explains "Health check increases your application's availability by removing unhealthy instances from the load balancer. If your instance remains unhealthy, it will be replaced" with a Learn more link. The Health check enablement checkbox below is unchecked, so no path or threshold fields are visible. The left navigation shows Monitoring expanded with Alerts, Metrics, Logs, Health check (active), and Application Insights entries.

The Health check blade is the platform's enforcement point for the rotation-removal behavior that every reliability strategy depends on. The state shown captures both the most common anti-pattern and the platform's own warning about it: the Health check checkbox is unchecked, and the first blue banner states that a single-instance plan will not remove an unhealthy instance from rotation; the instance is only replaced after a full hour of failing pings. This is why this guide pairs /healthz configuration with the minimum two instances rule: health check must be enabled and at least two instances must be running before the load balancer can route around an unhealthy instance. Enable the checkbox, set the path to /healthz, scale to two or more instances, and verify that an unhealthy instance is removed from rotation after the configured number of failed pings instead of lingering until the one-hour replacement.

Do not fake healthy status

If key dependencies are unavailable, return unhealthy. Hiding failures delays detection and increases incident impact.

Multi-Region Deployment Patterns

Single-region design is often acceptable for low-criticality apps, but production-critical systems should plan for regional disruption.

Pattern 1: Active-Passive

  • Primary region serves all traffic
  • Secondary region stays warm for failover
  • DNS or traffic manager performs failover routing

Pros:

  • Simpler operational model
  • Lower write consistency complexity

Cons:

  • Secondary capacity may be underutilized
  • Failover testing is mandatory to avoid surprise failures

Pattern 2: Active-Active

  • Multiple regions serve traffic concurrently
  • Global routing distributes traffic by weight, latency, or geography (priority-based routing sends all traffic to one region and belongs to the active-passive pattern)
  • Requires stronger data and session architecture

Pros:

  • Better regional resilience
  • Potentially lower user latency by geography

Cons:

  • Higher architecture and operations complexity
  • Harder consistency and incident isolation

Reliability Architecture with Health Checks

flowchart TD
    U[Users] --> G[Global traffic routing]
    G --> R1[Region A App Service]
    G --> R2[Region B App Service]
    R1 --> H1["/healthz probes"]
    R2 --> H2["/healthz probes"]
    R1 --> D1[(Regional data tier)]
    R2 --> D2[(Regional data tier)]
    H1 --> M[Monitoring and alerting]
    H2 --> M
    M --> O[On-call and incident runbook]

Graceful Shutdown and SIGTERM Handling

App Service can recycle instances during scale, patching, and deployment events. Applications must handle termination signals gracefully.

Plan for a short grace period, and treat the exact value as a configurable upper bound

When App Service recycles a worker, it sends SIGTERM to the application process and then sends SIGKILL after a grace period. The exact duration depends on the hosting model:

  • Code-based apps on App Service have a relatively short grace window (often described as on the order of a few tens of seconds) for in-flight request drain and cleanup.
  • Custom containers default to 5 seconds before SIGKILL. You can extend this up to 1800 seconds by setting the WEBSITES_CONTAINER_STOP_TIME_LIMIT app setting (see the Microsoft Learn "Configure a custom container" reference).

Treat the grace period as small and configurable, not as a fixed platform guarantee. In-flight request completion, connection draining, and cleanup work must finish well within whatever window your hosting model provides.

Why SIGTERM Handling Matters

  • Prevents abrupt termination of in-flight requests
  • Reduces partial writes and data corruption risk
  • Enables cleaner release transitions and scale-in events

Graceful Shutdown Checklist

  • Trap termination signals in application runtime
  • Stop accepting new requests quickly (close the listener, or start failing the health endpoint so the load balancer stops routing to the instance)
  • Finish in-flight requests within the platform's grace window for your hosting model
  • For custom containers, raise WEBSITES_CONTAINER_STOP_TIME_LIMIT if 5 seconds is insufficient (but keep it bounded)
  • Flush logs/telemetry and close outbound connections cleanly
  • Move long-running cleanup (large flush, archive upload) outside the request path so it does not depend on shutdown
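
The checklist above can be sketched as a small coordinator that a request loop consults. This is a minimal sketch under stated assumptions: the class `GracefulShutdown` and its method names are hypothetical, the server integration is omitted, and `grace_seconds` should be set below whatever window the hosting model actually provides (for custom containers, below WEBSITES_CONTAINER_STOP_TIME_LIMIT).

```python
import signal
import threading
import time


class GracefulShutdown:
    """Tracks in-flight requests and performs a bounded drain after SIGTERM."""

    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self.stopping = False  # plain flag: safe to set from a signal handler
        self._in_flight = 0
        self._idle = threading.Condition()

    def install(self) -> None:
        """Register the SIGTERM handler (must be called from the main thread)."""
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        # Keep the handler trivial; draining happens in normal code.
        self.stopping = True

    def try_begin_request(self) -> bool:
        """Return False once shutdown has started so new work is refused."""
        with self._idle:
            if self.stopping:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def drain(self) -> bool:
        """Wait for in-flight requests; True if all finished within the grace window."""
        deadline = time.monotonic() + self.grace_seconds
        with self._idle:
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # out of time: log what was abandoned, then exit
                self._idle.wait(timeout=remaining)
        return True
```

In use, the main loop would poll `stopping`, stop accepting connections once it is set, call `drain()`, then flush telemetry and exit. Returning `False` from `drain()` is the signal to log which work was cut off rather than to keep waiting past the platform's SIGKILL.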

Language-specific implementation

Use this document for design guidance, then implement the signal handlers as described in the runtime section of your language guide.

Retry and Circuit Breaker Patterns

Most production outages are dependency-driven, not web-tier-driven. Reliability requires controlled failure handling for downstream calls.

Retry Best Practices

  • Retry only transient failure types
  • Use exponential backoff with jitter
  • Set upper bounds on attempts and timeout budget
  • Avoid infinite retries
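
The retry rules above can be sketched with exponential backoff and "full jitter" (a random delay between zero and the current cap). This is an illustrative sketch, not a library API: `retry_with_backoff`, `TransientError`, and every default value are assumptions. In real code, `retry_on` would list the timeout and throttling exceptions of the client library in use.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def retry_with_backoff(
    op: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay_s: float = 0.2,
    max_delay_s: float = 5.0,
    budget_s: float = 15.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call op(), retrying only listed exception types within bounded limits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = time.monotonic() + budget_s
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempt limit reached: surface the last error
            # Exponential growth, capped so one wait never dominates the budget.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            delay = rng() * cap  # full jitter spreads callers apart
            if time.monotonic() + delay > deadline:
                raise  # overall timeout budget would be exceeded
            sleep(delay)
    raise AssertionError("unreachable")
```

Exceptions not listed in `retry_on` propagate immediately, which keeps validation errors and authorization failures from being retried. The `sleep` and `rng` parameters exist only so the policy can be exercised in tests without real waiting.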

Circuit Breaker Best Practices

  • Open circuit when error threshold exceeded
  • Fail fast while circuit is open
  • Probe with limited half-open requests
  • Emit breaker state metrics for visibility
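
The breaker behavior above can be sketched as a small state machine. This sketch makes simplifying assumptions that should be stated plainly: it trips on consecutive failures rather than an error rate, it is not thread-safe, and in the half-open state every call acts as a probe. `CircuitBreaker` and `CircuitOpenError` are hypothetical names, not a library API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        """'closed', 'open', or 'half_open'; suitable for export as a metric."""
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"
        return "open"

    def call(self, op: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency circuit is open; failing fast")
        try:
            result = op()
        except Exception:
            self._failures += 1
            # A failed probe reopens at once; otherwise trip at the threshold.
            if state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream call in `breaker.call(...)` and translate `CircuitOpenError` into a fast fallback or a 503, so a failing dependency does not tie up request threads for the full timeout.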

Retries can amplify outages

Without backoff and limits, retries create thundering herds against already-failing dependencies.

Backup and Restore Strategy

Backups are a reliability control for data/configuration recovery, not a substitute for high availability.

Strategy Components

  • Scheduled app backups with defined retention
  • Database-native backup alignment with app backup schedule
  • Storage account durability review
  • Documented restore sequence and ownership

Validation Requirements

  • Perform periodic restore drills in non-production
  • Measure real restore time against RTO target
  • Verify restored app boots and passes health checks

# Example: list current backup configuration
az webapp config backup show \
    --resource-group $RG \
    --webapp-name $APP_NAME

Minimum Instance Count for High Availability

For production workloads, run at least two instances.

Why minimum two instances matters:

  • Reduces single-instance failure impact
  • Supports rolling maintenance without complete service loss
  • Improves resilience during transient host issues

Single instance is a single failure domain

One instance in production means any restart, crash, or host issue can become user-visible downtime.

Failure Scenario Planning

Scenario A: One Instance Unhealthy

Expected behavior:

  • Health checks detect the unhealthy instance and remove it from rotation
  • Remaining instances continue serving traffic

Scenario B: Dependency Latency Spike

Expected behavior:

  • Timeout and retry policy engages
  • Circuit breaker opens if sustained failures occur
  • Alerting notifies operators before full outage

Scenario C: Regional Outage

Expected behavior:

  • Global routing removes affected region
  • Secondary region serves traffic within RTO
  • Post-failover validation runbook executes

Reliability Operating Model

Pre-Production

  • Load and chaos-style failure testing
  • Dependency failure simulations
  • Runbook dry-runs for failover and restore

Production

  • SLO dashboards and error budget tracking
  • Alert tuning to reduce noise and improve signal
  • Weekly review of high-severity incident trends

Post-Incident

  • Root cause analysis with timeline accuracy
  • Preventive action items with ownership and deadlines
  • Documentation updates in operations and best-practices sections
