Health and Recovery Operations¶
This guide covers production health checks and recovery operations: probe tuning, restart behavior, and incident response patterns.
Prerequisites¶
- Application exposes a reliable health endpoint (for example, `/health`)
- SRE runbook defines recovery time objective (RTO)
export RG="rg-aca-prod"
export APP_NAME="app-python-api-prod"
export ENVIRONMENT_NAME="aca-env-prod"
Health Probe Configuration¶
Configure startup, liveness, and readiness probes in your Container App template. The flowchart shows which decision each probe drives:
flowchart LR
U[User Traffic] --> R[Readiness Probe]
P[Platform Runtime] --> L[Liveness Probe]
S[Container Startup] --> ST[Startup Probe]
R --> D[Receives Requests]
L --> E[Restart Decision]
Probe paths must reflect real dependency posture
If readiness requires unavailable downstream services, the app can stay unavailable even when the container is healthy. Separate process-health from dependency-health where appropriate.
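As a rough sketch of that separation, the snippet below writes a probe section to `./infra/containerapp-health.yaml` so it can be reviewed before it is applied. The container name, image, port, the `/health/ready` path, and all timing values are illustrative assumptions, not values from this guide. A real template would typically also carry the rest of the app definition.

```shell
# Sketch: write an example probe section to the template file.
# Container name, image, port, paths, and thresholds are assumptions.
mkdir -p ./infra
cat > ./infra/containerapp-health.yaml <<'EOF'
properties:
  template:
    containers:
      - name: app-python-api            # hypothetical container name
        image: myregistry.azurecr.io/app-python-api:stable  # hypothetical image
        probes:
          - type: Startup               # gives the app time to boot
            httpGet:
              path: /health
              port: 8000                # assumed listener port
            periodSeconds: 10
            failureThreshold: 10        # roughly 100 s of startup allowance
          - type: Liveness              # process health only; failures lead to a restart
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 3
          - type: Readiness             # may include dependency checks; failures stop traffic
            httpGet:
              path: /health/ready       # hypothetical dependency-aware endpoint
              port: 8000
            periodSeconds: 5
            failureThreshold: 3
EOF
```

Keeping the liveness probe on a process-only path means a downstream outage takes the replica out of rotation without also triggering container restarts.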
Apply the probe configuration from the template file:
az containerapp update \
--name "$APP_NAME" \
--resource-group "$RG" \
--yaml "./infra/containerapp-health.yaml"
Validate environment and platform-level status:
az resource show \
--resource-group "$RG" \
--resource-type "Microsoft.App/managedEnvironments" \
--name "$ENVIRONMENT_NAME" \
--output json
Restart and Recovery Workflows¶
Restart a revision when transient faults occur:
az containerapp revision restart \
--name "$APP_NAME" \
--resource-group "$RG" \
--revision "${APP_NAME}--stable"
For persistent failures, roll traffic back to a healthy revision (see revisions guide).
Prefer rollback over repeated restart loops
If failures continue after one restart cycle, route traffic to a known-good revision and investigate offline.
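The rollback step can be sketched as a small helper around `az containerapp ingress traffic set`. This assumes the app runs in multiple-revision mode and that you already know the name of a healthy revision. The function name and the `AZ` override are hypothetical conveniences for previewing the command.

```shell
# Sketch: send all traffic to a known-good revision.
# Set AZ=echo to print the command instead of running it.
rollback_to_revision() {
  app_name="$1"
  resource_group="$2"
  good_revision="$3"
  "${AZ:-az}" containerapp ingress traffic set \
    --name "$app_name" \
    --resource-group "$resource_group" \
    --revision-weight "${good_revision}=100"
}

# Usage (the revision name is an assumption):
# rollback_to_revision "$APP_NAME" "$RG" "${APP_NAME}--stable"
```

Routing traffic away first keeps the failing revision available for inspection while users are served by the healthy one.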
Recovery Action Matrix¶
| Symptom | First Action | Escalation Action |
|---|---|---|
| Sporadic probe failures | Restart revision once | Increase probe delay and inspect dependency latency |
| All replicas failing readiness | Check configuration/secrets rollout | Shift traffic to prior healthy revision |
| Repeated liveness restarts | Inspect memory/CPU pressure and startup logs | Reduce resource contention and redeploy |
| Environment-wide instability | Validate managed environment health | Activate incident response and failover runbook |
Verification Steps¶
Check revision states and recent failures:
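One way to do this, shown as a hedged sketch, is `az containerapp revision list` with a JMESPath projection. The selected property names (`healthState`, `replicas`, `trafficWeight`) are assumed from the revision resource shape and may vary by API version. The helper name and the `AZ` override are hypothetical.

```shell
# Sketch: list revisions with their health and traffic share.
# Set AZ=echo to print the command instead of running it.
list_revision_health() {
  "${AZ:-az}" containerapp revision list \
    --name "$1" \
    --resource-group "$2" \
    --query "[].{name:name, active:properties.active, health:properties.healthState, replicas:properties.replicas, traffic:properties.trafficWeight}" \
    --output table
}

# Usage:
# list_revision_health "$APP_NAME" "$RG"
```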
Review system logs for probe failures:
az containerapp logs show \
--name "$APP_NAME" \
--resource-group "$RG" \
--type system \
--follow false
Example output (PII masked):
2026-04-02T09:10:21Z Probe failed: readiness check returned HTTP 503
2026-04-02T09:10:31Z Restarting container due to failed liveness probe
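To turn log lines like these into a quick signal, a small filter can count probe failures and liveness restarts. This is a sketch that matches the wording of the example output above. Real system log text may differ, so treat the patterns as assumptions to adjust.

```shell
# Sketch: summarize probe-related events from log text on stdin.
# Patterns follow the example output; adjust them to the real log wording.
summarize_probe_events() {
  awk '
    /Probe failed/          { failed++ }
    /failed liveness probe/ { restarts++ }
    END { printf "probe_failures=%d liveness_restarts=%d\n", failed, restarts }
  '
}

# Usage:
# az containerapp logs show --name "$APP_NAME" --resource-group "$RG" \
#   --type system --follow false | summarize_probe_events
```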
Troubleshooting¶
Frequent restarts¶
- Increase `initialDelaySeconds` for slow startup workloads.
- Confirm probe path and port match the application listener.
- Check downstream dependency outages causing readiness failures.
App never becomes ready¶
- Inspect app logs for startup exceptions.
- Verify secrets and configuration are available at startup.
Advanced Topics¶
- Separate startup and readiness logic to reduce false positives.
- Add synthetic probes from outside the environment for end-to-end health.
- Trigger automated recovery playbooks from alert rules.
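A synthetic probe from outside the environment can be as simple as a retried HTTP check. The sketch below assumes `curl` is available on the machine running the check. The function name, retry counts, and the `CURL` override are hypothetical. The app's public hostname could be read with `az containerapp show --query properties.configuration.ingress.fqdn --output tsv`.

```shell
# Sketch: external health check with bounded retries.
# Arguments: URL, attempts (default 3), delay in seconds (default 5).
synthetic_probe() {
  url="$1"
  attempts="${2:-3}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses
    if "${CURL:-curl}" --fail --silent --show-error --max-time 5 \
         --output /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Usage (the hostname is a placeholder):
# synthetic_probe "https://<app-fqdn>/health" 3 5 || echo "raise alert"
```

A non-zero exit status is what an alert rule or recovery playbook would key on.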