Health and Recovery Operations

This guide covers production health checks and recovery operations: probe tuning, restart behavior, and incident response patterns.

Prerequisites

The application exposes a reliable health endpoint (for example, /health)
The SRE runbook defines a recovery time objective (RTO)

export RG="rg-aca-prod"
export APP_NAME="app-python-api-prod"
export ENVIRONMENT_NAME="aca-env-prod"

Health Probe Configuration

Configure startup, liveness, and readiness probes in your Container App template:

flowchart LR
    U[User Traffic] --> R[Readiness Probe]
    P[Platform Runtime] --> L[Liveness Probe]
    S[Container Startup] --> ST[Startup Probe]
    R --> D[Receives Requests]
    L --> E[Restart Decision]

Probe paths must reflect real dependency posture

If the readiness probe depends on a downstream service that is unavailable, the app stops receiving traffic even though the container itself is healthy. Separate process health from dependency health where appropriate.

az containerapp update \
  --name "$APP_NAME" \
  --resource-group "$RG" \
  --yaml "./infra/containerapp-health.yaml"
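As a hedged illustration of the process-health versus dependency-health split, the sketch below writes a probe fragment to a scratch file for review. The file name `probes-fragment.yaml`, the port `8000`, and the `/health/live` and `/health/ready` paths are assumptions, not values from this guide. The fragment would still need to be merged under `properties.template.containers[]` in the YAML file passed to `az containerapp update`.

```shell
# Write an illustrative probe fragment to a scratch file (file name is hypothetical).
PROBE_FRAGMENT="${PROBE_FRAGMENT:-./probes-fragment.yaml}"

cat > "$PROBE_FRAGMENT" <<'EOF'
# Fragment only: merge under properties.template.containers[] in the app YAML.
probes:
  - type: Startup            # checked while the app boots
    httpGet:
      path: /health
      port: 8000             # assumed listener port
    periodSeconds: 10
    failureThreshold: 10
  - type: Liveness           # process health only
    httpGet:
      path: /health/live     # hypothetical path, no dependency checks
      port: 8000
    periodSeconds: 10
    failureThreshold: 3
  - type: Readiness          # may include dependency checks
    httpGet:
      path: /health/ready    # hypothetical path
      port: 8000
    periodSeconds: 5
    failureThreshold: 3
EOF

echo "Wrote $PROBE_FRAGMENT"
```

Keeping the liveness path free of dependency checks means a downstream outage removes the app from rotation without also triggering restarts.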

Validate environment and platform-level status:

az resource show \
  --resource-group "$RG" \
  --resource-type "Microsoft.App/managedEnvironments" \
  --name "$ENVIRONMENT_NAME" \
  --output json
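The full JSON is verbose, so a script may want only the provisioning state. The sketch below wraps the same command in a hypothetical helper named `environment_state`. It assumes the resource exposes `properties.provisioningState` and reuses the `RG` and `ENVIRONMENT_NAME` variables exported earlier.

```shell
# Hypothetical helper: print only the managed environment's provisioning state.
environment_state() {
  az resource show \
    --resource-group "$RG" \
    --resource-type "Microsoft.App/managedEnvironments" \
    --name "$ENVIRONMENT_NAME" \
    --query "properties.provisioningState" \
    --output tsv
}

# Example usage:
# [ "$(environment_state)" = "Succeeded" ] || echo "environment is not in a Succeeded state"
```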

Restart and Recovery Workflows

Restart a revision when transient faults occur:

az containerapp revision restart \
  --name "$APP_NAME" \
  --resource-group "$RG" \
  --revision "${APP_NAME}--stable"

For persistent failures, roll traffic back to a healthy revision (see revisions guide).

Prefer rollback over repeated restart loops

If failures continue after one restart cycle, route traffic to a known-good revision and investigate offline.
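A rollback of this kind can be sketched with `az containerapp ingress traffic set`. The helper name `rollback_to_revision` is hypothetical. The sketch assumes the app runs in multiple revision mode and that the name of a known-good revision has already been identified.

```shell
# Hypothetical helper: send 100% of ingress traffic to a known-good revision.
rollback_to_revision() {
  if [ -z "$1" ]; then
    echo "usage: rollback_to_revision <revision-name>" >&2
    return 2
  fi
  az containerapp ingress traffic set \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --revision-weight "$1=100"
}

# Example usage (the revision name is a placeholder):
# rollback_to_revision "${APP_NAME}--previous"
```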

Recovery Action Matrix

Symptom	First Action	Escalation Action
Sporadic probe failures	Restart revision once	Increase probe delay and inspect dependency latency
All replicas failing readiness	Check configuration/secrets rollout	Shift traffic to prior healthy revision
Repeated liveness restarts	Inspect memory/CPU pressure and startup logs	Reduce resource contention and redeploy
Environment-wide instability	Validate managed environment health	Activate incident response and failover runbook

Verification Steps

Check revision states and recent failures:

az containerapp revision list \
  --name "$APP_NAME" \
  --resource-group "$RG" \
  --output table
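When many revisions exist, a JMESPath filter can narrow the list to the ones that need attention. The sketch below assumes each revision reports `properties.healthState` and `properties.trafficWeight`. Confirm both field names against `--output json` before relying on them. The helper name `list_unhealthy_revisions` is hypothetical.

```shell
# Hypothetical helper: show only revisions whose reported health is not "Healthy".
list_unhealthy_revisions() {
  az containerapp revision list \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --query "[?properties.healthState!='Healthy'].{name:name, health:properties.healthState, traffic:properties.trafficWeight}" \
    --output table
}

# Example usage:
# list_unhealthy_revisions
```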

Review system logs for probe failures:

az containerapp logs show \
  --name "$APP_NAME" \
  --resource-group "$RG" \
  --type system \
  --follow false

Example output (PII masked):

2026-04-02T09:10:21Z Probe failed: readiness check returned HTTP 503
2026-04-02T09:10:31Z Restarting container due to failed liveness probe
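To get a quick count from a log capture, the lines can be piped through a small filter. The sketch below assumes the message shapes shown in the example output above. Real system log messages may be worded differently, so adjust the patterns to match. The helper name `summarize_probe_events` is hypothetical.

```shell
# Hypothetical helper: count probe failures and restarts in log text read from stdin.
summarize_probe_events() {
  awk '
    /Probe failed/         { failures++ }   # e.g. "Probe failed: readiness check ..."
    /Restarting container/ { restarts++ }   # e.g. "Restarting container due to ..."
    END { printf "probe_failures=%d restarts=%d\n", failures, restarts }
  '
}

# Example usage:
# az containerapp logs show --name "$APP_NAME" --resource-group "$RG" \
#   --type system --follow false | summarize_probe_events
```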

Troubleshooting

Frequent restarts

Increase initialDelaySeconds for slow startup workloads.
Confirm probe path and port match the application listener.
Check downstream dependency outages causing readiness failures.
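When tuning delays, it helps to estimate how long a container has before a probe gives up. A rough upper bound, following the convention used in the Kubernetes probe documentation, is the initial delay plus the period multiplied by the failure threshold. The helper name `probe_budget_seconds` is hypothetical, and the result is an approximation that ignores probe timeouts.

```shell
# Hypothetical helper: approximate seconds before a probe is considered failed.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
probe_budget_seconds() {
  echo $(( $1 + $2 * $3 ))
}

probe_budget_seconds 10 10 10   # prints 110
```

If the application routinely needs longer than this budget to start, the probe settings are a likelier cause of restarts than the application itself.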

App never becomes ready

Inspect app logs for startup exceptions.
Verify secrets and configuration are available at startup.
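Both checks can be run from the CLI. The sketch below lists the secrets defined on the app and prints recent console output. The helper name `inspect_startup` and the tail length of 50 lines are illustrative choices.

```shell
# Hypothetical helper: list the app's secrets, then show recent console logs.
inspect_startup() {
  az containerapp secret list \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --output table

  az containerapp logs show \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --type console \
    --tail 50
}

# Example usage:
# inspect_startup
```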

Advanced Topics

Separate startup and readiness logic to reduce false positives.
Add synthetic probes from outside the environment for end-to-end health.
Trigger automated recovery playbooks from alert rules.
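A synthetic probe can start as a retried HTTP check run from a host outside the environment. The sketch below is a minimal version under stated assumptions: the helper name `check_with_retries` is hypothetical, and `APP_URL` is a placeholder for the app's public endpoint.

```shell
# Hypothetical helper: run a check command up to N times, pausing between attempts.
# $1 = attempts, $2 = delay in seconds, remaining arguments = command to run
check_with_retries() {
  cwr_attempts="$1"
  cwr_delay="$2"
  shift 2
  cwr_try=1
  while [ "$cwr_try" -le "$cwr_attempts" ]; do
    if "$@"; then
      return 0              # check passed
    fi
    cwr_try=$((cwr_try + 1))
    if [ "$cwr_try" -le "$cwr_attempts" ]; then
      sleep "$cwr_delay"    # wait before the next attempt
    fi
  done
  return 1                  # every attempt failed
}

# Example usage (APP_URL is a placeholder):
# check_with_retries 3 5 curl --fail --silent --show-error --max-time 5 "$APP_URL/health" \
#   || echo "synthetic probe failed"
```

Retrying before alerting filters out single dropped requests, so an alert rule fires only on sustained failure.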