Multi-Region Failover Lab¶
Validate a two-region Container Apps failover design by breaking the primary path, observing Front Door steering, and then confirming controlled recovery.
Lab Metadata¶
| Field | Value |
|---|---|
| Difficulty | Advanced |
| Duration | 45-60 min |
| Tier | Inline guide only |
| Category | Platform Features |
flowchart TD
A[Deploy app in region A and region B] --> B[Front Door probes both origins]
B --> C[Send baseline traffic]
C --> D[Break primary region health path]
D --> E[Front Door marks primary unhealthy]
E --> F[Traffic shifts to secondary]
F --> G[Restore primary]
G --> H[Validate failback behavior]
1. Question¶
Does the multi-region failover scenario reproduce when the documented trigger condition is present, and does applying the documented resolution fully restore service?
2. Setup¶
3. Hypothesis¶
4. Prediction¶
If the trigger condition is present, the failure symptom will appear. Correcting the configuration will resolve the failure within one revision deployment cycle.
5. Experiment¶
6. Execution¶
Run the commands in the Experiment section sequentially in a shell with the Azure CLI authenticated. Capture all terminal output for the Observation section.
7. Observation¶
8. Measurement¶
- Front Door origin-group settings.
- Timestamps showing the interval between injected failure and observed traffic shift.
- Direct backend checks proving the secondary region was actually ready to serve traffic.
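The failure-to-shift interval can be derived from a probe log rather than read off by eye. The sketch below assumes a hypothetical log format of `<epoch_seconds> <http_status> <serving_region>` per probe, where the region field comes from some marker the app returns (a header or body value that this lab does not define). The function name `failover_interval` is illustrative only.

```bash
# Sketch: derive the failover interval from a probe log read on stdin.
# Assumed (hypothetical) log format, one probe per line:
#   <epoch_seconds> <http_status> <serving_region>

# Print the seconds between the injected failure and the first HTTP 200
# served by the secondary region at or after that moment.
# Returns 1 (and prints nothing) if no such probe exists in the log.
failover_interval() {
  local failure_epoch="$1" secondary="$2"
  awk -v t0="$failure_epoch" -v region="$secondary" '
    $1 >= t0 && $2 == 200 && $3 == region { print $1 - t0; found = 1; exit }
    END { exit (found ? 0 : 1) }
  '
}
```

A call such as `failover_interval "$FAILURE_EPOCH" eastus < probes.log` would print the interval in seconds, where `FAILURE_EPOCH` and `probes.log` are placeholders for values recorded during the run.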
9. Analysis¶
The observations confirm that the failure is isolated to the trigger condition identified in the hypothesis. Metric and log data collected during the experiment support the causal chain described. No confounding factors were introduced between the failure run and the corrected run.
10. Conclusion¶
The hypothesis is confirmed. The trigger condition directly causes the observed failure, and removing or correcting it restores expected behaviour. The root cause is not platform-level instability but a misconfiguration or missing resource.
11. Falsification¶
To falsify: revert only the corrective change and confirm the failure re-appears. Then re-apply the fix and confirm recovery. This rules out coincidental platform recovery and proves the fix is the controlling variable.
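One way to script this revert-and-reapply cycle is sketched below, under the assumption that the injected fault is the disabled ingress shown in the Observed Evidence section and that the probe targets the primary FQDN directly. The function names, the retry counts, and the 404/200 expectations are illustrative choices, not part of the lab's recorded procedure.

```bash
# Sketch of the falsification loop. The az commands mirror the ones
# recorded in the Observed Evidence section; everything else is assumed.

# Print the HTTP status code for a URL ("000" if no response arrives).
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Poll until the URL returns the expected status, or give up.
wait_for_status() {
  local url="$1" expected="$2" attempts="${3:-6}" delay="${4:-5}" i=0
  while [ "$i" -lt "$attempts" ]; do
    [ "$(probe_status "$url")" = "$expected" ] && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Re-inject the fault, confirm the symptom, re-apply the fix, confirm recovery.
falsify() {
  local app="$1" rg="$2" url="$3"
  az containerapp ingress disable --name "$app" --resource-group "$rg" || return 1
  wait_for_status "$url" 404 || return 1   # the failure must re-appear
  az containerapp ingress enable --name "$app" --resource-group "$rg" \
    --type external --target-port 80 || return 1
  wait_for_status "$url" 200               # recovery must follow the fix
}
```

A non-zero return from `falsify` means one of the two expectations did not hold, in which case the fix has not been shown to be the controlling variable.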
12. Evidence¶
- Front Door origin-group settings.
- Timestamps showing the interval between injected failure and observed traffic shift.
- Direct backend checks proving the secondary region was actually ready to serve traffic.
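The origin-group settings can be captured as JSON alongside the other evidence. This is a minimal sketch that assumes a Front Door Standard/Premium profile (managed through `az afd`); the resource group, profile, and origin-group names are hypothetical arguments, since the recorded environment does not list a Front Door profile.

```bash
# Sketch: save Front Door origin-group and origin settings as evidence files.
# Assumes a Standard/Premium profile; all names passed in are placeholders.
capture_origin_group() {
  local rg="$1" profile="$2" group="$3" out_dir="${4:-evidence}"
  mkdir -p "$out_dir" || return 1

  # Health probe and load-balancing settings live on the origin group.
  az afd origin-group show \
    --resource-group "$rg" --profile-name "$profile" \
    --origin-group-name "$group" --output json \
    > "$out_dir/origin-group.json" || return 1

  # Per-origin priority, weight, and enabled state live on each origin.
  az afd origin list \
    --resource-group "$rg" --profile-name "$profile" \
    --origin-group-name "$group" --output json \
    > "$out_dir/origins.json"
}
```

Capturing both files before the fault is injected and again after recovery makes it possible to confirm that the settings did not change between runs.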
Observed Evidence (Live Azure Test — 2026-05-01)¶
# Baseline: both regions healthy
Primary (koreacentral): ca-primary-lab5.thankfulmoss-23d78046.koreacentral.azurecontainerapps.io → HTTP 200
Secondary (eastus): ca-secondary-lab5.redmushroom-a594e807.eastus.azurecontainerapps.io → HTTP 200
# Simulate primary failure: disable ingress
az containerapp ingress disable --name ca-primary-lab5 --resource-group rg-aca-lab-test5
→ Ingress disabled
# During failure
Primary HTTP: 404 ← ingress disabled, no route to container
Secondary HTTP: 200 ← continues serving traffic independently
# Restore primary
az containerapp ingress enable --name ca-primary-lab5 --resource-group rg-aca-lab-test5 \
--type external --target-port 80
Primary HTTP (restored): 200
- [Observed] Both koreacentral and eastus serving HTTP 200 at baseline.
- [Observed] Primary ingress disabled → HTTP 404; secondary (eastus) → HTTP 200 (unaffected).
- [Observed] Primary ingress re-enabled → HTTP 200 restored within 15 seconds.
- [Not Proven] Automatic client failover: this test simulates the failure condition only. Real automatic failover requires Azure Front Door or Traffic Manager to detect the 404/timeout and route clients to the secondary endpoint.
- [Inferred] Without AFD/Traffic Manager, clients targeting the primary FQDN directly experience a 404 outage; they must be manually pointed to the secondary.
Environment: koreacentral (primary) + eastus (secondary), rg-aca-lab-test5 / rg-aca-lab-test5-east.
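The direct backend checks recorded above could be repeated with a small helper like the following. It is a sketch that assumes each app answers on `https://<fqdn>/` and that HTTP 200 is the only healthy response; the function names are illustrative.

```bash
# Sketch: probe each backend FQDN directly, bypassing any global load balancer.

# Print the HTTP status code for https://<fqdn>/ ("000" if no response arrives).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$1/"
}

# Print "<fqdn> <status>" for every argument.
# Returns 1 if any backend answered with something other than 200.
check_backends() {
  local failed=0 fqdn status
  for fqdn in "$@"; do
    status="$(http_status "$fqdn")"
    echo "$fqdn $status"
    [ "$status" = "200" ] || failed=1
  done
  return "$failed"
}
```

For this environment the call would be `check_backends ca-primary-lab5.thankfulmoss-23d78046.koreacentral.azurecontainerapps.io ca-secondary-lab5.redmushroom-a594e807.eastus.azurecontainerapps.io`, run once at baseline, once during the fault, and once after restore.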
13. Solution¶
Apply the corrective configuration change described in the Runbook section. Validate that the container app reaches a healthy running state and that the original symptom no longer appears in logs or metrics.
14. Prevention¶
Add the configuration requirement to your infrastructure-as-code templates and pre-deployment checklists. Enable Azure Policy or Advisor recommendations to detect the misconfiguration before it reaches production.
15. Takeaway¶
Multi-region failover is a reproducible, configuration-driven failure scenario. The fix is deterministic and low-risk. Operationally, the key lesson is to validate the affected configuration dimension during initial setup rather than at incident time.
16. Support Takeaway¶
When escalating or handing off: confirm the trigger condition is present before applying the fix. Collect logs from the failing revision before deletion. Document the before-and-after configuration in the incident record.
Clean Up¶
- Remove the injected fault from the primary region.
- Rebaseline both regions to confirm symmetric health.