Bad Revision Rollout and Rollback¶
1. Summary¶
Symptom¶
- Error rate spikes immediately after traffic shifts to new revision.
- New revision healthy on startup, but business flows fail.
- Partial traffic split causes intermittent failures.
Why this scenario is confusing¶
Revision health, replica readiness, and business correctness are not the same thing. A rollout can look healthy at the platform level while still breaking application behavior, and partial traffic splits can make the incident appear intermittent rather than rollout-driven.
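The distinction can be made concrete with a small sketch: tally business-level failures per revision from application logs. The log lines and field layout below are illustrative stand-ins, not real Container Apps output.

```shell
# Readiness probes can pass on both revisions while one of them fails real
# business flows; counting failures per revision exposes the difference.
# Sample log lines (revision name, outcome) are illustrative only.
logs='ca-myapp--rev1 OK
ca-myapp--rev2 ERROR payment_declined
ca-myapp--rev1 OK
ca-myapp--rev2 ERROR payment_declined
ca-myapp--rev2 OK'

rates=$(printf '%s\n' "$logs" | awk '
  { total[$1]++; if ($2 == "ERROR") errors[$1]++ }
  END { for (r in total) printf "%s errors=%d/%d\n", r, errors[r], total[r] }
' | sort)
printf '%s\n' "$rates"
```

In real incidents the same per-revision grouping comes from the KQL query in section 4; the point of the sketch is that the grouping key must be the revision, not the app.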
Troubleshooting decision flow¶
flowchart TD
A[Post-rollout errors detected] --> B{Errors isolated to new revision?}
B -->|Yes| C[Shift traffic to last known good revision]
B -->|No| D[Investigate shared dependency or platform issue]
C --> E{Service recovered after rollback?}
E -->|Yes| F[Freeze rollout and perform diff analysis]
E -->|No| G[Escalate to network/identity/runtime playbooks]
2. Common Misreadings¶

- "Health is green, so rollout is good." Health probes may not reflect real business success.
- "Roll back everything." Controlled rollback to last known good revision is usually sufficient.
3. Competing Hypotheses¶
| Hypothesis | Typical Evidence For | Typical Evidence Against |
|---|---|---|
| H1: Code or configuration regression in new revision | Failures correlate exactly with traffic shift | Same errors existed before rollout |
| H2: Secret or dependency drift between revisions | Old revision succeeds with same traffic | Both revisions fail similarly |
| H3: Incomplete canary analysis | Errors only in subset of requests/routes | Full health and business checks passed pre-rollout |
4. What to Check First¶
Metrics¶
- Error rate and request latency, broken down by revision, before and after the traffic shift.
Logs¶
let AppName = "ca-myapp";
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("error", "exception", "timeout", "failed")
| summarize errors=count() by RevisionName_s, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
Platform Signals¶
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --query "[].{name:name,active:properties.active,traffic:properties.trafficWeight,health:properties.healthState}" --output table
az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress.traffic" --output json
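To make the split check concrete, here is a dependency-free sketch that scans the `ingress.traffic` JSON for the largest revision weight. The JSON shape is approximated from typical `az containerapp show` output, and `awk` stands in for `jq` or a JMESPath `--query` so the sketch runs anywhere.

```shell
# Detect a partial rollout: no single revision holds 100% of traffic.
# Sample JSON approximates the ingress.traffic array shape.
traffic='[
  {
    "revisionName": "ca-myapp--0000001",
    "weight": 80
  },
  {
    "revisionName": "ca-myapp--0000002",
    "weight": 20
  }
]'

max_weight=$(printf '%s\n' "$traffic" | awk -F': *' '
  /"weight"/ { w = $2 + 0; if (w > max) max = w }
  END { print max + 0 }
')
echo "largest revision weight: $max_weight"
[ "$max_weight" -lt 100 ] && echo "partial split: correlate errors by revision"
```

A partial split is exactly the condition under which intermittent failures can masquerade as random instability, so this check decides whether the error-by-revision correlation in section 4 is mandatory.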
5. Evidence to Collect¶
Required Evidence¶
| Evidence | Command/Query | Purpose |
|---|---|---|
| Revision list | az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table | Compare active revisions, traffic, and health state |
| Traffic config | az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress.traffic" --output json | Confirm current traffic split |
| Error-by-revision KQL | KQL on ContainerAppConsoleLogs_CL | Verify whether failures align with the new revision |
| Console logs | az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console | Capture business or runtime failures during rollout |
| Rollback command result | az containerapp ingress traffic set --name "$APP_NAME" --resource-group "$RG" --revision-weight "<stable-revision>=100" | Test whether service recovers after rollback |
Useful Context¶
- Time of the traffic shift
- Canary percentage and duration
- Revision-level differences in image, env, secret refs, and scale settings
- Whether both revisions depend on the same external services
Observed revision status output used during rollback decisions:
Name Active TrafficWeight Replicas HealthState RunningState
----------------- -------- --------------- ---------- ------------- ------------
ca-myapp--0000001 True 100 1 Healthy Running
6. Validation and Disproof by Hypothesis¶
H1: Code or configuration regression in new revision¶
Signals that support:
- Failures correlate exactly with traffic shift.
- Errors are concentrated in the new revision.
- Rollback to the previous revision restores service.
Signals that weaken:
- Same errors existed before rollout.
- Both revisions fail similarly under the same traffic.
- No recovery occurs after rollback.
What to verify:
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --query "[].{name:name,active:properties.active,traffic:properties.trafficWeight,health:properties.healthState}" --output table
az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress.traffic" --output json
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console
let AppName = "ca-myapp";
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("error", "exception", "timeout", "failed")
| summarize errors=count() by RevisionName_s, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
Disproof logic: If the error pattern predates the rollout or persists after moving traffic back to the stable revision, the new revision alone is not the full explanation.
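The first half of that disproof can be scripted: compare error timestamps from the query results against the time of the traffic shift. The epoch-second timestamps below are illustrative.

```shell
# If any errors predate the traffic shift, the new revision alone does not
# explain the incident. Epoch-second timestamps are illustrative.
shift_time=1700000000
error_times='1699999400
1700000060
1700000120'

pre_shift=$(printf '%s\n' "$error_times" | awk -v t="$shift_time" '
  $1 < t { n++ }
  END { print n + 0 }
')
if [ "$pre_shift" -gt 0 ]; then
  echo "$pre_shift error(s) predate the rollout: widen the investigation"
fi
```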
H2: Secret or dependency drift between revisions¶
Signals that support:
- Old revision succeeds with the same traffic.
- New revision depends on changed secret references or downstream settings.
- Failures appear only when the new revision exercises a changed dependency path.
Signals that weaken:
- Both revisions fail similarly.
- Rollback does not change the failure rate.
- No meaningful configuration drift exists between revisions.
What to verify:
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console
Disproof logic: If the stable revision also fails under the same dependency path, focus on shared infrastructure or dependency health instead of revision-specific drift.
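Drift between revisions can be checked mechanically by diffing flattened env and secret-reference listings. The `key=value` listings below are illustrative stand-ins for per-revision configuration output.

```shell
# Diff env vars and secret references between the stable and new revisions.
# The listings are illustrative; real values come from revision config output.
stable='API_KEY=secretref:api-key-v1
DB_HOST=db-prod'
candidate='API_KEY=secretref:api-key-v2
DB_HOST=db-prod'

printf '%s\n' "$stable"    | sort > /tmp/stable.env
printf '%s\n' "$candidate" | sort > /tmp/candidate.env
drift=$(diff /tmp/stable.env /tmp/candidate.env || true)
if [ -n "$drift" ]; then
  echo "revision drift detected:"
  printf '%s\n' "$drift"
fi
```

An empty diff here is itself evidence: with no meaningful drift between revisions, H2 weakens and attention shifts back to H1 or shared dependencies.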
H3: Incomplete canary analysis¶
Signals that support:
- Errors only affect a subset of requests or routes.
- Partial traffic split causes intermittent failures.
- Platform health checks passed, but business flows were not fully validated.
Signals that weaken:
- Full health and business checks passed pre-rollout.
- Failures are uniform across all traffic paths.
- The issue is reproducible even with 100% traffic on the stable revision.
What to verify:
az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress.traffic" --output json
az containerapp ingress traffic set --name "$APP_NAME" --resource-group "$RG" --revision-weight "<stable-revision>=100"
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table
Disproof logic: If controlled rollback does not improve service or if failures are not isolated to canary traffic, canary analysis gaps are secondary, not primary.
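One way to quantify "isolated to canary traffic" is to compare the canary's share of errors with its share of traffic. The counts below are illustrative; real numbers come from the error-by-revision query.

```shell
# If the canary carries 20% of traffic but produces ~94% of errors, the
# failures are revision-correlated, not random instability.
canary_weight=20
canary_errors=47
total_errors=50

error_share=$(( canary_errors * 100 / total_errors ))
echo "canary: ${error_share}% of errors on ${canary_weight}% of traffic"
if [ "$error_share" -gt "$canary_weight" ]; then
  echo "errors concentrated in canary revision"
fi
```

When the error share roughly matches the traffic weight instead, failures are uniform across revisions, which points away from the canary and toward a shared cause.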
7. Likely Root Cause Patterns¶
| Pattern | Frequency | First Signal | Typical Resolution |
|---|---|---|---|
| New revision regression | Very common | Errors start immediately after traffic shift | Roll back and diff code/config |
| Secret or dependency drift | Common | Old revision succeeds, new one fails | Align secret refs and dependency settings |
| Canary too narrow | Common | Failures only in subset of routes/flows | Expand validation before wider rollout |
| Shared dependency outage | Occasional | Both revisions fail similarly | Fix dependency, not rollout |
| Traffic split misread as random instability | Occasional | Intermittent failures during partial rollout | Correlate errors by revision and traffic weight |
8. Immediate Mitigations¶
- Compare error trends by revision and confirm regression scope.
- Shift traffic to stable revision and verify recovery.
- Diff image, env, secret refs, and scale settings between revisions.
- Fix regression and run controlled canary before full rollout.
9. Prevention¶
- Use gradual traffic shifting with rollback guardrails.
- Define release gates on business metrics, not only health probes.
- Keep automated revision comparison artifacts in CI/CD.
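A release gate on business metrics, rather than probes alone, can be as small as a threshold comparison in the pipeline. The rates and margin below are illustrative.

```shell
# Gate promotion on business error rate, not just probe health.
# Rates (errors per 100 requests) and the margin are illustrative.
baseline_rate=0.4
canary_rate=2.1
margin=0.5

gate=$(awk -v b="$baseline_rate" -v c="$canary_rate" -v m="$margin" '
  BEGIN { r = (c > b + m) ? "fail" : "pass"; print r }
')
echo "release gate: $gate"
if [ "$gate" = "fail" ]; then
  echo "holding rollout: business metrics regressed against baseline"
fi
```

Wiring a check like this into CI/CD, alongside automated revision comparison artifacts, turns the manual diff analysis from section 8 into a standing guardrail.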