Health and Recovery Operations¶

Maintain availability by combining health checks, automatic remediation, and diagnostics. This guide focuses on platform-native recovery controls for Azure App Service.

flowchart TD
    Probe[Health Probe Pings /health] --> Status{Status Code?}
    Status -- 2xx --> Healthy[Instance Healthy]
    Status -- Non-2xx / Timeout --> Unhealthy[Instance Unhealthy]

    Unhealthy --> Action{Recovery Action}
    Action -- Remove --> LB[Remove from Load Balancer]
    Action -- Restart --> AutoHeal[Auto-Heal Restarts Instance]

    LB --> Retry[Wait and Retry Probe]
    AutoHeal --> Retry
    Retry --> Probe

Prerequisites¶

Existing App Service app with at least one active instance
A lightweight health endpoint (for example /health)
Azure Monitor access for metrics and activity logs
Shell variables set:
- RG (resource group that contains the app)
- APP_NAME (name of the App Service app)

When to Use¶

Procedure¶

Define a Reliable Health Endpoint Contract¶

Your health endpoint should:

return an HTTP status in the 200-299 range (typically 200) for a healthy state
avoid expensive dependency checks by default
respond quickly (typically under 1 second)
include optional deep checks behind a separate path when needed

Do not over-couple health checks

If your liveness probe requires every downstream dependency to be healthy, transient external failures can trigger unnecessary instance removal.
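
A minimal sketch of this contract, using only the Python standard library. The names HealthHandler, make_server, and check_dependencies, the port 8000, and the /health/deep path are illustrative assumptions rather than anything App Service requires. The liveness path does no downstream work, while the optional deep check lives on a separate path.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def check_dependencies() -> dict:
    """Hypothetical deep check; replace with real dependency probes."""
    return {"database": "ok"}


class HealthHandler(BaseHTTPRequestHandler):
    def _send_json(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        if self.path == "/health":
            # Liveness: no downstream calls, so the probe stays fast.
            self._send_json(200, {"status": "healthy"})
        elif self.path == "/health/deep":
            # Deep check: kept off the path that the platform probes.
            results = check_dependencies()
            ok = all(value == "ok" for value in results.values())
            self._send_json(200 if ok else 503, {"checks": results})
        else:
            self._send_json(404, {"error": "not found"})

    def log_message(self, format, *args) -> None:
        # Silence per-request logging in this sketch.
        pass


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    # Binds to localhost for local experiments only.
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
```

To try it locally, call make_server().serve_forever() and request http://127.0.0.1:8000/health. A failing dependency then changes only the /health/deep response, so a transient external outage does not fail the liveness probe.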

Enable App Service Health Check¶

az webapp config set \
  --resource-group $RG \
  --name $APP_NAME \
  --health-check-path "/health" \
  --output json

Verify setting:

az webapp config show \
  --resource-group $RG \
  --name $APP_NAME \
  --query "{healthCheckPath:healthCheckPath,minimumTls:minTlsVersion,alwaysOn:alwaysOn}" \
  --output json

Understand Platform Default Probe Behavior¶

Once health check is enabled, the App Service platform probes each instance on a fixed cadence. Knowing the defaults helps you reason about removal latency and set realistic recovery SLOs.

Behavior	Default	Configurable via
Probe interval per instance	approximately 1 minute	not configurable
Consecutive failures before removal	10	`WEBSITE_HEALTHCHECK_MAXPINGFAILURES` app setting (range 2-10)
Single-instance safety net	platform never removes the only running instance	platform-enforced

Tune the failure threshold when you need faster or more lenient removal:

az webapp config appsettings set \
  --resource-group $RG \
  --name $APP_NAME \
  --settings WEBSITE_HEALTHCHECK_MAXPINGFAILURES=5 \
  --output json
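
To reason about removal latency, the table values can be turned into a rough estimate. This is a back-of-the-envelope sketch that assumes removal takes about threshold × probe interval from the moment an instance starts failing. The function name is hypothetical and the real timing varies because the interval is only approximate.

```python
PROBE_INTERVAL_SECONDS = 60  # approximate platform probe interval per instance
MIN_FAILURES, MAX_FAILURES = 2, 10  # allowed range for the app setting


def estimated_removal_seconds(max_ping_failures: int = 10) -> int:
    """Rough time from the first failed probe to removal from rotation."""
    if not MIN_FAILURES <= max_ping_failures <= MAX_FAILURES:
        raise ValueError(
            "WEBSITE_HEALTHCHECK_MAXPINGFAILURES must be between 2 and 10"
        )
    return max_ping_failures * PROBE_INTERVAL_SECONDS


if __name__ == "__main__":
    for threshold in (10, 5, 2):
        minutes = estimated_removal_seconds(threshold) / 60
        print(f"threshold={threshold}: about {minutes:.0f} minutes to removal")
```

Under these assumptions the default threshold of 10 implies roughly 10 minutes before an unhealthy instance leaves rotation, which is worth comparing against your recovery SLO before keeping the default.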

Plan for at least 2 instances in production

Health check will not remove the only running instance, even if it is failing probes — this prevents total outage when no other instance is available to take traffic. To get the benefit of platform-side instance rotation, run at least 2 instances so an unhealthy instance can be replaced while a healthy one continues serving requests.

Configure Auto-Heal for Memory Pressure¶

az webapp config auto-heal update \
  --resource-group $RG \
  --name $APP_NAME \
  --auto-heal-enabled true \
  --auto-heal-action Restart \
  --auto-heal-memory-private-set-kb 1500000 \
  --auto-heal-memory-private-set-duration "00:05:00" \
  --output json

Configure Auto-Heal for Slow Requests¶

az webapp config auto-heal update \
  --resource-group $RG \
  --name $APP_NAME \
  --auto-heal-enabled true \
  --auto-heal-action Restart \
  --auto-heal-slow-requests-count 50 \
  --auto-heal-slow-requests-interval "00:05:00" \
  --auto-heal-slow-requests-time "00:00:10" \
  --output json

Inspect effective rules:

az webapp config auto-heal show \
  --resource-group $RG \
  --name $APP_NAME \
  --output json

Capture Recovery Signals¶

Tail live platform logs:

az webapp log tail \
  --resource-group $RG \
  --name $APP_NAME

List relevant activity events:

az monitor activity-log list \
  --resource-group $RG \
  --offset 1d \
  --max-events 50 \
  --query "[?contains(operationName.value, 'Microsoft.Web/sites/restart') || contains(operationName.value, 'AutoHeal')].{time:eventTimestamp,status:status.value,operation:operationName.localizedValue}" \
  --output table

Portal view: Availability and Performance diagnostic¶

The Availability and Performance diagnostic is the Portal equivalent of the recovery-signal queries above and the entry point for nearly every incident triage flow in this guide.

Its two KPI tiles map directly to the SLO checks elsewhere in this document. Failed Requests (0% in the example capture) is the error-budget signal, and App Performance (63 ms at the 90th percentile in the example capture) is the latency signal that drives the early-warning alerts described in the advanced topics.

The left-nav catalog (Web App Down, CPU Usage, Memory Usage, Web App Restarted, Health Check feature, SNAT Port Exhaustion, Process List, ...) is the menu of specialized diagnostics. When az monitor activity-log list shows repeated Auto-Heal events, drill into Web App Restarted. When 4xx spikes appear in the request chart (the example capture shows three), drill into Http 4xx errors.

Use this blade as the first stop during an incident before reaching for az webapp restart, because a manual restart erases the in-memory context that the AI-powered Diagnostics preview and the Linux drill-downs need to identify a root cause.

Build an Operational Recovery Runbook¶

Recommended sequence when incidents occur:

Confirm symptom scope (single instance vs whole app)
Check health check status and endpoint latency
Review auto-heal trigger frequency
Restart app only if automatic recovery is insufficient
Scale out temporarily if saturation persists
Capture logs, metrics, and timelines for post-incident review

Manual restart command:

az webapp restart \
  --resource-group $RG \
  --name $APP_NAME \
  --output json

Verification¶

Control plane validation:

az webapp config show \
  --resource-group $RG \
  --name $APP_NAME \
  --query "{healthCheckPath:healthCheckPath}" \
  --output json

az webapp config auto-heal show \
  --resource-group $RG \
  --name $APP_NAME \
  --query "{enabled:autoHealEnabled,action:autoHealRules.actions.actionType}" \
  --output json

Data plane validation:

curl --silent --show-error --include \
  "https://$APP_NAME.azurewebsites.net/health"

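Expected result: an HTTP 200 status line in the response headers. Repeat the request a few times to confirm that latency stays stable and within the endpoint contract (typically under 1 second).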
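
The data plane check can also be scripted so that status and latency are evaluated together. The sketch below uses only the Python standard library. The probe and is_healthy helpers are hypothetical names, and the 1-second budget is taken from the health endpoint contract earlier in this guide.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> tuple[int, float]:
    """Return (status_code, elapsed_seconds); status 0 means no HTTP response."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered with a non-2xx status
    except OSError:
        status = 0  # connection failure or timeout
    return status, time.monotonic() - start


def is_healthy(status: int, elapsed: float, budget_seconds: float = 1.0) -> bool:
    """Healthy means a 2xx status delivered within the latency budget."""
    return 200 <= status <= 299 and elapsed <= budget_seconds
```

For example, status, elapsed = probe("https://<app-name>.azurewebsites.net/health") followed by is_healthy(status, elapsed) gives a single pass/fail result. Note that urlopen follows redirects, so a redirecting endpoint reports the status of the final response.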

Example Incident Timeline (PII-masked)¶

2026-04-03T09:12:20Z  alert   MemoryPercentage > 90 for 5m
2026-04-03T09:13:10Z  action  Auto-Heal restart triggered
2026-04-03T09:14:02Z  probe   /health returned 200
2026-04-03T09:16:00Z  metric  Error rate back to baseline
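
A timeline in this shape can be turned into recovery durations for post-incident review. The sketch below parses the four example lines and measures the gaps between events. The function names and the 300-second target are hypothetical placeholders, not values from this guide.

```python
from datetime import datetime

# Timeline copied from the example above.
TIMELINE = """\
2026-04-03T09:12:20Z  alert   MemoryPercentage > 90 for 5m
2026-04-03T09:13:10Z  action  Auto-Heal restart triggered
2026-04-03T09:14:02Z  probe   /health returned 200
2026-04-03T09:16:00Z  metric  Error rate back to baseline
"""

TARGET_RTO_SECONDS = 300  # hypothetical target; substitute your own RTO


def parse_timeline(text: str) -> list[tuple[datetime, str, str]]:
    """Split each line into (timestamp, event kind, message)."""
    events = []
    for line in text.strip().splitlines():
        timestamp, kind, message = line.split(None, 2)
        when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
        events.append((when, kind, message))
    return events


def seconds_between(events, first_kind: str, last_kind: str) -> float:
    """Seconds from the first event of one kind to the last of another."""
    start = next(when for when, kind, _ in events if kind == first_kind)
    end = next(when for when, kind, _ in reversed(events) if kind == last_kind)
    return (end - start).total_seconds()


if __name__ == "__main__":
    events = parse_timeline(TIMELINE)
    recovery = seconds_between(events, "alert", "metric")
    restart = seconds_between(events, "action", "probe")
    print(f"alert to baseline: {recovery:.0f}s (target {TARGET_RTO_SECONDS}s)")
    print(f"restart to healthy probe: {restart:.0f}s")
```

For the example timeline, the alert-to-baseline interval is 220 seconds and the restart-to-healthy-probe interval is 52 seconds.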

Rollback / Troubleshooting¶

Health check keeps failing¶

Confirm endpoint path is correct
Ensure endpoint does not require authentication
Ensure dependencies used by health endpoint are reachable

Frequent auto-heal restarts¶

Increase thresholds to reduce false positives
Investigate memory leaks or long-running requests
Correlate restart times with traffic spikes

Single instance remains unhealthy¶

Verify the plan runs at least 2 instances so there is capacity to rotate; health check never removes the only running instance
Check startup latency and warm-up behavior
Review deployment slot and recent release changes

Advanced Topics¶

Liveness, Readiness, and Deep Health Patterns¶

Liveness: quick process check (/health)
Readiness: dependency readiness (/ready)
Deep diagnostics: detailed component checks (/health/deep)

Route platform probes to liveness, and use readiness/deep checks in pipelines and synthetic monitors.

Recovery-Oriented Alerting Strategy¶

Design alerts by stage:

Early warning: rising latency or queue depth
Trigger warning: repeated 5xx bursts
Recovery failure: repeated auto-heal loops

This helps detect when automatic recovery is not sufficient.
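
The recovery-failure stage can be sketched as a sliding-window check over restart timestamps, for example ones collected from the activity log. The function name, the threshold of 3 restarts, and the 30-minute window are illustrative assumptions to tune against your own baseline.

```python
from datetime import datetime, timedelta


def is_restart_loop(
    restart_times: list[datetime],
    threshold: int = 3,
    window: timedelta = timedelta(minutes=30),
) -> bool:
    """Return True if at least `threshold` restarts fall within any `window`."""
    times = sorted(restart_times)
    start = 0
    for end, current in enumerate(times):
        # Drop restarts that are older than the window ending at `current`.
        while current - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


if __name__ == "__main__":
    base = datetime(2026, 4, 3, 9, 0, 0)
    burst = [base + timedelta(minutes=m) for m in (0, 8, 19)]
    spread = [base + timedelta(minutes=m) for m in (0, 50, 110)]
    print("burst is a loop:", is_restart_loop(burst))
    print("spread is a loop:", is_restart_loop(spread))
```

A result of True is a signal to escalate to a person rather than wait for another automatic restart.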

Chaos and Resilience Testing¶

Periodically test:

deliberate dependency timeout
temporary DNS failure scenarios
controlled memory stress

Capture observed recovery time and compare with target RTO.

Enterprise Considerations

Maintain a shared incident playbook with predefined ownership, communication channels, and rollback criteria. Treat repeated auto-heal events as reliability debt, not as normal steady state.

Language-Specific Details¶

For language-specific operational guidance, see:
- Node.js Guide
- Python Guide
- Java Guide
- .NET Guide