Bad Revision Rollout and Rollback¶
1. Summary¶
Symptom¶
- Error rate spikes immediately after traffic shifts to new revision.
- New revision healthy on startup, but business flows fail.
- Partial traffic split causes intermittent failures.
Why this scenario is confusing¶
Revision health, replica readiness, and business correctness are not the same thing. A rollout can look healthy at the platform level while still breaking application behavior, and partial traffic splits can make the incident appear intermittent rather than rollout-driven.
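The distinction can be made concrete with a small sketch: tally business-level failures per revision from application logs. The log lines and field layout below are illustrative stand-ins, not real Container Apps output.

```shell
# Readiness probes can pass on both revisions while one of them fails real
# business flows; counting failures per revision exposes the difference.
# Sample log lines (revision name, outcome) are illustrative only.
logs='ca-myapp--rev1 OK
ca-myapp--rev2 ERROR payment_declined
ca-myapp--rev1 OK
ca-myapp--rev2 ERROR payment_declined
ca-myapp--rev2 OK'

rates=$(printf '%s\n' "$logs" | awk '
  { total[$1]++; if ($2 == "ERROR") errors[$1]++ }
  END { for (r in total) printf "%s errors=%d/%d\n", r, errors[r], total[r] }
' | sort)
printf '%s\n' "$rates"
```

In real incidents the same per-revision grouping comes from the KQL query in section 4; the point of the sketch is that the grouping key must be the revision, not the app.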
Troubleshooting decision flow¶
flowchart TD
A[Post-rollout errors detected] --> B{Errors isolated to new revision?}
B -->|Yes| C[Shift traffic to last known good revision]
B -->|No| D[Investigate shared dependency or platform issue]
C --> E{Service recovered after rollback?}
E -->|Yes| F[Freeze rollout and perform diff analysis]
E -->|No| G[Escalate to network/identity/runtime playbooks]
2. Common Misreadings¶

- "Health is green, so rollout is good." Health probes may not reflect real business success.
- "Roll back everything." Controlled rollback to last known good revision is usually sufficient.
3. Competing Hypotheses¶
| Hypothesis | Typical Evidence For | Typical Evidence Against |
|---|---|---|
| H1: Code or configuration regression in new revision | Failures correlate exactly with traffic shift | Same errors existed before rollout |
| H2: Secret or dependency drift between revisions | Old revision succeeds with same traffic | Both revisions fail similarly |
| H3: Incomplete canary analysis | Errors only in subset of requests/routes | Full health and business checks passed pre-rollout |
4. What to Check First¶
Metrics¶
- Error rate and request latency, broken down by revision, before and after the traffic shift.
Logs¶
let AppName = "ca-myapp";
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("error", "exception", "timeout", "failed")
| summarize errors=count() by RevisionName_s, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
Platform Signals¶
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --query "[].{name:name,active:properties.active,traffic:properties.trafficWeight,health:properties.healthState}" --output table
az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress.traffic" --output json
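To make the split check concrete, here is a dependency-free sketch that scans the `ingress.traffic` JSON for the largest revision weight. The JSON shape is approximated from typical `az containerapp show` output, and `awk` stands in for `jq` or a JMESPath `--query` so the sketch runs anywhere.

```shell
# Detect a partial rollout: no single revision holds 100% of traffic.
# Sample JSON approximates the ingress.traffic array shape.
traffic='[
  {
    "revisionName": "ca-myapp--0000001",
    "weight": 80
  },
  {
    "revisionName": "ca-myapp--0000002",
    "weight": 20
  }
]'

max_weight=$(printf '%s\n' "$traffic" | awk -F': *' '
  /"weight"/ { w = $2 + 0; if (w > max) max = w }
  END { print max + 0 }
')
echo "largest revision weight: $max_weight"
[ "$max_weight" -lt 100 ] && echo "partial split: correlate errors by revision"
```

A partial split is exactly the condition under which intermittent failures can masquerade as random instability, so this check decides whether the error-by-revision correlation in section 4 is mandatory.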
5. Evidence to Collect¶
Required Evidence¶
| Evidence | Command/Query | Purpose |
|---|---|---|
| Revision list | az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table | Compare active revisions, traffic, and health state |
| Traffic config | az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress.traffic" --output json | Confirm current traffic split |
| Error-by-revision KQL | KQL on ContainerAppConsoleLogs_CL | Verify whether failures align with the new revision |
| Console logs | az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console | Capture business or runtime failures during rollout |
| Rollback command result | az containerapp ingress traffic set --name "$APP_NAME" --resource-group "$RG" --revision-weight "<stable-revision>=100" | Test whether service recovers after rollback |
Useful Context¶
- Time of the traffic shift
- Canary percentage and duration
- Revision-level differences in image, env, secret refs, and scale settings
- Whether both revisions depend on the same external services
Observed revision status output used during rollback decisions:
Name Active TrafficWeight Replicas HealthState RunningState
----------------- -------- --------------- ---------- ------------- ------------
ca-myapp--0000001 True 100 1 Healthy Running
6. Validation and Disproof by Hypothesis¶
H1: Code or configuration regression in new revision¶
Signals that support:
- Failures correlate exactly with traffic shift.
- Errors are concentrated in the new revision.
- Rollback to the previous revision restores service.
Signals that weaken:
- Same errors existed before rollout.
- Both revisions fail similarly under the same traffic.
- No recovery occurs after rollback.
What to verify:
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --query "[].{name:name,active:properties.active,traffic:properties.trafficWeight,health:properties.healthState}" --output table
az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress.traffic" --output json
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console
let AppName = "ca-myapp";
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("error", "exception", "timeout", "failed")
| summarize errors=count() by RevisionName_s, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
Disproof logic: If the error pattern predates the rollout or persists after moving traffic back to the stable revision, the new revision alone is not the full explanation.
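The first half of that disproof can be scripted: compare error timestamps from the query results against the time of the traffic shift. The epoch-second timestamps below are illustrative.

```shell
# If any errors predate the traffic shift, the new revision alone does not
# explain the incident. Epoch-second timestamps are illustrative.
shift_time=1700000000
error_times='1699999400
1700000060
1700000120'

pre_shift=$(printf '%s\n' "$error_times" | awk -v t="$shift_time" '
  $1 < t { n++ }
  END { print n + 0 }
')
if [ "$pre_shift" -gt 0 ]; then
  echo "$pre_shift error(s) predate the rollout: widen the investigation"
fi
```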
H2: Secret or dependency drift between revisions¶
Signals that support:
- Old revision succeeds with the same traffic.
- New revision depends on changed secret references or downstream settings.
- Failures appear only when the new revision exercises a changed dependency path.
Signals that weaken:
- Both revisions fail similarly.
- Rollback does not change the failure rate.
- No meaningful configuration drift exists between revisions.
What to verify:
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console
Disproof logic: If the stable revision also fails under the same dependency path, focus on shared infrastructure or dependency health instead of revision-specific drift.
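Drift between revisions can be checked mechanically by diffing flattened env and secret-reference listings. The `key=value` listings below are illustrative stand-ins for per-revision configuration output.

```shell
# Diff env vars and secret references between the stable and new revisions.
# The listings are illustrative; real values come from revision config output.
stable='API_KEY=secretref:api-key-v1
DB_HOST=db-prod'
candidate='API_KEY=secretref:api-key-v2
DB_HOST=db-prod'

printf '%s\n' "$stable"    | sort > /tmp/stable.env
printf '%s\n' "$candidate" | sort > /tmp/candidate.env
drift=$(diff /tmp/stable.env /tmp/candidate.env || true)
if [ -n "$drift" ]; then
  echo "revision drift detected:"
  printf '%s\n' "$drift"
fi
```

An empty diff here is itself evidence: with no meaningful drift between revisions, H2 weakens and attention shifts back to H1 or shared dependencies.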
H3: Incomplete canary analysis¶
Signals that support:
- Errors only affect a subset of requests or routes.
- Partial traffic split causes intermittent failures.
- Platform health checks passed, but business flows were not fully validated.
Signals that weaken:
- Full health and business checks passed pre-rollout.
- Failures are uniform across all traffic paths.
- The issue is reproducible even with 100% traffic on the stable revision.
What to verify:
az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress.traffic" --output json
az containerapp ingress traffic set --name "$APP_NAME" --resource-group "$RG" --revision-weight "<stable-revision>=100"
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table
Disproof logic: If controlled rollback does not improve service or if failures are not isolated to canary traffic, canary analysis gaps are secondary, not primary.
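One way to quantify "isolated to canary traffic" is to compare the canary's share of errors with its share of traffic. The counts below are illustrative; real numbers come from the error-by-revision query.

```shell
# If the canary carries 20% of traffic but produces ~94% of errors, the
# failures are revision-correlated, not random instability.
canary_weight=20
canary_errors=47
total_errors=50

error_share=$(( canary_errors * 100 / total_errors ))
echo "canary: ${error_share}% of errors on ${canary_weight}% of traffic"
if [ "$error_share" -gt "$canary_weight" ]; then
  echo "errors concentrated in canary revision"
fi
```

When the error share roughly matches the traffic weight instead, failures are uniform across revisions, which points away from the canary and toward a shared cause.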
7. Likely Root Cause Patterns¶
| Pattern | Frequency | First Signal | Typical Resolution |
|---|---|---|---|
| New revision regression | Very common | Errors start immediately after traffic shift | Roll back and diff code/config |
| Secret or dependency drift | Common | Old revision succeeds, new one fails | Align secret refs and dependency settings |
| Canary too narrow | Common | Failures only in subset of routes/flows | Expand validation before wider rollout |
| Shared dependency outage | Occasional | Both revisions fail similarly | Fix dependency, not rollout |
| Traffic split misread as random instability | Occasional | Intermittent failures during partial rollout | Correlate errors by revision and traffic weight |
8. Immediate Mitigations¶
- Compare error trends by revision and confirm regression scope.
- Shift traffic to stable revision and verify recovery.
- Diff image, env, secret refs, and scale settings between revisions.
- Fix regression and run controlled canary before full rollout.
9. Prevention¶
- Use gradual traffic shifting with rollback guardrails.
- Define release gates on business metrics, not only health probes.
- Keep automated revision comparison artifacts in CI/CD.
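A release gate on business metrics, rather than probes alone, can be as small as a threshold comparison in the pipeline. The rates and margin below are illustrative.

```shell
# Gate promotion on business error rate, not just probe health.
# Rates (errors per 100 requests) and the margin are illustrative.
baseline_rate=0.4
canary_rate=2.1
margin=0.5

gate=$(awk -v b="$baseline_rate" -v c="$canary_rate" -v m="$margin" '
  BEGIN { r = (c > b + m) ? "fail" : "pass"; print r }
')
echo "release gate: $gate"
if [ "$gate" = "fail" ]; then
  echo "holding rollout: business metrics regressed against baseline"
fi
```

Wiring a check like this into CI/CD, alongside automated revision comparison artifacts, turns the manual diff analysis from section 8 into a standing guardrail.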