Health and Recovery Operations¶
This guide covers production health checks and recovery operations: probe tuning, restart behavior, and incident response patterns.
Prerequisites¶
- Application exposes a reliable health endpoint (for example,
/health) - SRE runbook defines recovery time objective (RTO)
export RG="rg-aca-prod"
export APP_NAME="app-python-api-prod"
export ENVIRONMENT_NAME="aca-env-prod"
Health Probe Configuration¶
Configure startup, liveness, and readiness probes in your Container App template:
flowchart TD
U[User Traffic] --> R[Readiness Probe]
P[Platform Runtime] --> L[Liveness Probe]
S[Container Startup] --> ST[Startup Probe]
R --> D[Receives Requests]
L --> E[Restart Decision] HTTP probes can include dependency-aware logic, but they should reflect the health signal you actually want the platform to act on. Microsoft Learn notes that HTTP probes "let you implement custom logic to check the status of application dependencies before reporting a healthy status," so keep basic process health separate from stricter dependency checks when you don't want every downstream failure to block readiness.
az containerapp update \
--name "$APP_NAME" \
--resource-group "$RG" \
--yaml "./infra/containerapp-health.yaml"
Validate environment and platform-level status:
az resource show \
--resource-group "$RG" \
--resource-type "Microsoft.App/managedEnvironments" \
--name "$ENVIRONMENT_NAME" \
--output json
Restart and Recovery Workflows¶
Restart a revision when transient faults occur:
az containerapp revision restart \
--name "$APP_NAME" \
--resource-group "$RG" \
--revision "${APP_NAME}--stable"
For persistent failures, roll traffic back to a healthy revision (see revisions guide).
Prefer rollback over repeated restart loops
If failures continue after one restart cycle, route traffic to a known-good revision and investigate offline.
Recovery Action Matrix¶
| Symptom | First Action | Escalation Action |
|---|---|---|
| Sporadic probe failures | Restart revision once | Increase probe delay and inspect dependency latency |
| All replicas failing readiness | Check configuration/secrets rollout | Shift traffic to prior healthy revision |
| Repeated liveness restarts | Inspect memory/CPU pressure and startup logs | Reduce resource contention and redeploy |
| Environment-wide instability | Validate managed environment health | Activate incident response and failover runbook |
Verification Steps¶
Check revision states and recent failures:
Review system logs for probe failures:
az containerapp logs show \
--name "$APP_NAME" \
--resource-group "$RG" \
--type system \
--follow false
Example output (PII masked):
2026-04-02T09:10:21Z Probe failed: readiness check returned HTTP 503
2026-04-02T09:10:31Z Restarting container due to failed liveness probe
Troubleshooting¶
Frequent restarts¶
- Increase
initialDelaySecondsfor slow startup workloads. - Confirm probe path and port match the application listener.
- Check downstream dependency outages causing readiness failures.
App never becomes ready¶
- Inspect app logs for startup exceptions.
- Verify secrets and configuration are available at startup.
Advanced Topics¶
- Separate startup and readiness logic to reduce false positives.
- Add synthetic probes from outside the environment for end-to-end health.
- Trigger automated recovery playbooks from alert rules.