Revision Update Downtime During Container Apps Deployments¶
Status: Draft - Awaiting Execution
The experiment is designed but not yet executed. This draft targets a real customer-facing issue pattern reported in GitHub issues #1166 and #1305: transient 502/503 errors, such as `upstream connect error or disconnect/reset before headers`, during revision updates.
1. Question¶
When updating an Azure Container App revision (new image or configuration change), what causes transient 502/503 responses during the transition window, and which deployment pattern achieves practical zero-downtime behavior?
2. Why this matters¶
This is a classic support case because the workload may be healthy before and after deployment, yet users still see a burst of failures during rollout. The most common customer interpretations are "Container Apps had an outage during deploy" or "Envoy randomly returned 502/503 even though the new revision eventually became healthy."
Support relevance:
- single revision mode is often chosen by default for simplicity
- many teams assume `minReplicas=1` guarantees zero downtime
- health probes are frequently missing or mapped to endpoints that do not reflect actual readiness
- platform-generated errors are brief, hard to capture, and easy to dismiss without continuous traffic generation
The experiment is intended to separate three possibilities:
- revision-switch behavior in single revision mode
- backend unavailability caused by readiness/probe timing
- mitigation value of multiple revisions, traffic splitting, and warm replicas
3. Customer symptom¶
Typical ticket phrasing:
- "Every deployment causes a few seconds of 502/503."
- "We see `upstream connect error or disconnect/reset before headers` during image updates."
- "The app is stable normally, but config changes create a short outage."
- "Multiple revision mode seems better than single revision mode, but we need evidence."
- "Setting `minReplicas` to 1 reduced failures, but did not fully eliminate them."
4. Hypothesis¶
- Single revision mode causes a brief backend gap during revision replacement, especially when the old revision stops accepting traffic before the new revision is fully ready.
- Multiple revision mode with explicit traffic control allows the old revision to continue serving until the new revision is confirmed healthy, enabling zero-downtime rollout.
- Health probe misconfiguration lengthens the failure window by making the new revision appear ready too late or by recycling it unnecessarily.
- `minReplicas > 0` reduces the cold-start contribution but does not by itself guarantee zero downtime during a revision transition.
- The observed 502/503 responses are generated by the Envoy ingress layer during the interval when no eligible backend exists for the route.
5. Environment¶
| Parameter | Value |
|---|---|
| Service | Azure Container Apps |
| SKU / Plan | Consumption |
| Region | Korea Central |
| Runtime | Python 3.11 custom container |
| OS | Linux |
| Revision modes tested | Single, Multiple |
| Baseline app | Simple HTTP service with /, /health, /ready, /version, /slowstart |
| Ingress | External, target port 8080 |
| Traffic pattern | Continuous external curl loop during update |
| Logging | Log Analytics + Container Apps system/console logs |
| Date tested | Not yet executed |
6. Variables¶
Experiment type: Config / behavior comparison across rollout scenarios
Controlled:
- same Container Apps environment and region
- same app image family and target port
- same CPU/memory sizing
- same traffic generator cadence during each rollout
- same registry source and same baseline revision before each test
- same log collection and observation window
Independent variables:
- revision mode: `single` vs `multiple`
- update type: image update vs configuration-only update
- `minReplicas`: `0`, `1`, `2`
- probe profile: correct readiness vs delayed readiness vs intentionally weak/misleading readiness
- traffic movement method: automatic replacement vs manual traffic split
Observed:
- HTTP status counts during deployment (`200`, `502`, `503`, timeout)
- first failure timestamp and last failure timestamp
- total downtime window in seconds
- p50/p95 latency during transition
- revision creation, activation, provisioning, and healthy timestamps
- replica start/terminate events
- Envoy-style error body content
- system log events indicating readiness/probe state or revision activation delays
7. Instrumentation¶
Planned evidence sources:
- Continuous curl traffic generator with per-request timestamp, latency, and HTTP code
- ContainerAppSystemLogs_CL for revision lifecycle, probe failures, replica assignment, and termination events
- ContainerAppConsoleLogs_CL for application startup and readiness timestamps
- Azure CLI for revision, replica, ingress, and traffic-weight inspection
- Optional portal checks for revision health state and traffic assignment
Recommended app log markers:
- `APP_START`
- `READY_FALSE`
- `READY_TRUE`
- `REQUEST_RECEIVED`
- `VERSION=<revision_marker>`
Traffic generator script¶
#!/usr/bin/env bash
# Usage: ./traffic.sh <url> [interval_seconds] [output_csv]
# Note: %{errormsg} requires curl 7.75.0 or newer; %3N timestamps require GNU date.
set -euo pipefail
URL="$1"
INTERVAL_SECONDS="${2:-0.2}"
OUTPUT_FILE="${3:-traffic.csv}"
printf 'ts_utc,epoch_ms,http_code,time_total,remote_ip,errormsg\n' > "$OUTPUT_FILE"
while true; do
  ts_utc="$(date -u +"%Y-%m-%dT%H:%M:%S.%3NZ")"
  epoch_ms="$(date +%s%3N)"
  curl_result="$({ curl -skS "$URL" \
    -o /tmp/aca-body.$$ \
    -w '%{http_code},%{time_total},%{remote_ip},%{errormsg}' \
    --max-time 10; } 2>&1 || true)"
  printf '%s,%s,%s\n' "$ts_utc" "$epoch_ms" "$curl_result" >> "$OUTPUT_FILE"
  # Preserve the response body (e.g. Envoy error text) for evidence capture.
  if [[ -s /tmp/aca-body.$$ ]]; then
    sed 's/^/BODY: /' /tmp/aca-body.$$ >> "${OUTPUT_FILE%.csv}.body.log"
    : > /tmp/aca-body.$$
  fi
  sleep "$INTERVAL_SECONDS"
done
Summary helpers¶
# Overall HTTP code counts
cut -d, -f3 traffic.csv | sort | uniq -c
# Requests slower than 1 second
awk -F, 'NR==1{next} $4+0 > 1 {count++} END{print count+0}' traffic.csv
# Non-200 responses with timestamps (the first and last rows bound the downtime window)
awk -F, 'NR==1{next} $3 != 200 {print $1,$3}' traffic.csv
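The failure-window and latency numbers requested in the worksheets can be derived directly from the generator output. A minimal sketch, assuming the CSV schema produced by the traffic generator above (`epoch_ms` in column 2, `http_code` in column 3, `time_total` in column 4):

```shell
# failure_window: seconds between the first and the last non-200 row.
failure_window() {
  awk -F, '
    NR > 1 && $3 != 200 {
      if (first == "") first = $2
      last = $2
    }
    END {
      if (first == "") { print "no non-200 responses captured"; exit }
      printf "failure window: %.1f s\n", (last - first) / 1000
    }' "$1"
}

# latency_percentile: approximate pNN latency over successful requests.
latency_percentile() {
  awk -F, -v p="$2" '
    NR > 1 && $3 == 200 { v[n++] = $4 + 0 }
    END {
      if (n == 0) { print "no samples"; exit }
      # simple exchange sort; sample counts per scenario are small
      for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
          if (v[j] < v[i]) { t = v[i]; v[i] = v[j]; v[j] = t }
      idx = int((p / 100) * (n - 1) + 0.5)
      printf "p%s: %.3f s\n", p, v[idx]
    }' "$1"
}
```

Usage: `failure_window traffic.csv` and `latency_percentile traffic.csv 95`.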
Log Analytics queries¶
// System log timeline around the rollout window
ContainerAppSystemLogs_CL
| where TimeGenerated between (datetime(2026-04-12T00:00:00Z) .. datetime(2026-04-12T01:00:00Z))
| where ContainerAppName_s == "ca-revision-update"
| project TimeGenerated, Reason_s, Log_s, ReplicaName_s, RevisionName_s
| order by TimeGenerated asc
// Console markers from the application
ContainerAppConsoleLogs_CL
| where TimeGenerated between (datetime(2026-04-12T00:00:00Z) .. datetime(2026-04-12T01:00:00Z))
| where ContainerAppName_s == "ca-revision-update"
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated asc
// Probe-related warnings/errors during rollout
ContainerAppSystemLogs_CL
| where ContainerAppName_s == "ca-revision-update"
| where Log_s has_any ("Probe", "readiness", "liveness", "startup", "unhealthy", "terminated")
| project TimeGenerated, RevisionName_s, Reason_s, Log_s
| order by TimeGenerated asc
8. Procedure¶
8.1 Infrastructure setup¶
# Resource group
az group create \
--name rg-aca-revision-update-lab \
--location koreacentral
# Log Analytics
az monitor log-analytics workspace create \
--resource-group rg-aca-revision-update-lab \
--workspace-name law-aca-revision-update \
--location koreacentral
LAW_ID=$(az monitor log-analytics workspace show \
--resource-group rg-aca-revision-update-lab \
--workspace-name law-aca-revision-update \
--query customerId -o tsv)
LAW_KEY=$(az monitor log-analytics workspace get-shared-keys \
--resource-group rg-aca-revision-update-lab \
--workspace-name law-aca-revision-update \
--query primarySharedKey -o tsv)
# Container Apps environment
az containerapp env create \
--name cae-revision-update-lab \
--resource-group rg-aca-revision-update-lab \
--location koreacentral \
--logs-workspace-id "$LAW_ID" \
--logs-workspace-key "$LAW_KEY"
# ACR
az acr create \
--name acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--sku Basic \
--admin-enabled true \
--location koreacentral
8.2 Test application¶
The application should expose clear readiness and version signals.
import os
import time
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)

START = time.monotonic()
STARTUP_DELAY = int(os.getenv("STARTUP_DELAY_SECONDS", "0"))
VERSION = os.getenv("APP_VERSION", "v1")
READY_AFTER = START + STARTUP_DELAY


def is_ready():
    return time.monotonic() >= READY_AFTER


@app.route("/")
def index():
    return jsonify({
        "status": "ok",
        "version": VERSION,
        "ready": is_ready(),
        "utc": datetime.now(timezone.utc).isoformat(),
        "revision": os.getenv("CONTAINER_APP_REVISION", "unknown")
    })


@app.route("/health")
def health():
    return jsonify({"status": "healthy", "version": VERSION}), 200


@app.route("/ready")
def ready():
    if is_ready():
        return jsonify({"status": "ready", "version": VERSION}), 200
    return jsonify({"status": "starting", "version": VERSION}), 503


@app.route("/version")
def version():
    return jsonify({"version": VERSION}), 200


@app.route("/slowstart")
def slowstart():
    # Listed in the environment table; assumed behavior: block until the
    # configured startup delay has elapsed, then respond normally.
    remaining = READY_AFTER - time.monotonic()
    if remaining > 0:
        time.sleep(remaining)
    return jsonify({"status": "ok", "version": VERSION}), 200
The accompanying Dockerfile (`requirements.txt` must list at least `flask` and `gunicorn`):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
# Declare the build arg so `az acr build --build-arg APP_VERSION=...` takes
# effect, and surface it to the app as an environment variable.
ARG APP_VERSION=v1
ENV APP_VERSION=$APP_VERSION
EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "1", "--timeout", "120", "app:app"]
8.3 Build and push baseline images¶
az acr build \
--registry acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--image revision-lab:v1 \
--build-arg APP_VERSION=v1 \
.
az acr build \
--registry acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--image revision-lab:v2 \
--build-arg APP_VERSION=v2 \
.
8.4 Create the baseline Container App¶
ACR_USER=$(az acr credential show \
--name acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--query username -o tsv)
ACR_PASS=$(az acr credential show \
--name acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--query 'passwords[0].value' -o tsv)
az containerapp create \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--environment cae-revision-update-lab \
--image acrrevisionupdatelab.azurecr.io/revision-lab:v1 \
--registry-server acrrevisionupdatelab.azurecr.io \
--registry-username "$ACR_USER" \
--registry-password "$ACR_PASS" \
--ingress external \
--target-port 8080 \
--cpu 0.25 \
--memory 0.5Gi \
--min-replicas 1 \
--max-replicas 3 \
--revision-suffix v1 \
--env-vars APP_VERSION=v1 STARTUP_DELAY_SECONDS=0
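Before starting the traffic generator, capture the app's external FQDN. A small sketch; it assumes `jq` is installed, and the `jq` path mirrors the standard ingress property (`properties.configuration.ingress.fqdn`), so verify it against your CLI output:

```shell
# Extract the ingress FQDN from `az containerapp show -o json` output.
get_fqdn() {
  jq -r '.properties.configuration.ingress.fqdn'
}

# Usage:
# FQDN=$(az containerapp show \
#   --name ca-revision-update \
#   --resource-group rg-aca-revision-update-lab \
#   -o json | get_fqdn)
# ./traffic.sh "https://$FQDN/" 0.2 traffic.csv
```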
8.5 Probe profiles¶
Profile A: correct readiness¶
probes:
- type: Readiness
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 2
periodSeconds: 3
timeoutSeconds: 2
failureThreshold: 10
- type: Liveness
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 2
failureThreshold: 3
Profile B: delayed readiness / long transition¶
probes:
- type: Readiness
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 2
failureThreshold: 12
Profile C: misleading health endpoint¶
probes:
- type: Readiness
httpGet:
path: /health
port: 8080
initialDelaySeconds: 1
periodSeconds: 3
timeoutSeconds: 2
failureThreshold: 3
8.6 Test scenarios¶
| Scenario | Revision mode | Update type | minReplicas | Probe profile | Expected risk |
|---|---|---|---|---|---|
| S1 | Single | image update v1 -> v2 | 1 | A | brief failure window possible |
| S2 | Single | env/config change | 1 | A | brief failure window possible |
| S3 | Multiple | image update + manual traffic shift | 1 | A | zero-downtime candidate |
| S4 | Single | image update | 0 / 1 / 2 | A | compare warm capacity effect |
| S5 | Single and Multiple | image update | 1 | B and C | probe-driven transition differences |
8.7 Execution steps per scenario¶
- Confirm the baseline revision is healthy and serving only `200` responses.
- Start the traffic generator against the external FQDN and keep it running for the entire deployment.
- Record current revision list and traffic weights:
az containerapp revision list \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
-o table
az containerapp ingress traffic show \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab
- Trigger the scenario-specific update.
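Correlating traffic rows with revision lifecycle events is much easier when each trigger is timestamped the moment the command is issued. A small helper (a sketch; the `%3N` format requires GNU date, and `rollout-markers.log` is an assumed file name):

```shell
# Append a UTC marker line to rollout-markers.log and echo it to the console.
mark() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)" "$1" \
    | tee -a rollout-markers.log
}

# Usage: mark "S1 image update triggered"
```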
Scenario S1: single revision image update¶
az containerapp revision set-mode \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--mode single
az containerapp update \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
--revision-suffix s1img
Scenario S2: single revision config change¶
az containerapp update \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--set-env-vars APP_VERSION=v1 CONFIG_MARKER=$(date +%s) \
--revision-suffix s2cfg
Scenario S3: multiple revision gradual traffic shift¶
az containerapp revision set-mode \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--mode multiple
az containerapp update \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
--revision-suffix s3multi
# Pin 100% of traffic to the known-good revision first; shift weights only after the new revision reports healthy:
az containerapp ingress traffic set \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--revision-weight ca-revision-update--v1=100 ca-revision-update--s3multi=0
az containerapp ingress traffic set \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--revision-weight ca-revision-update--v1=90 ca-revision-update--s3multi=10
az containerapp ingress traffic set \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--revision-weight ca-revision-update--v1=50 ca-revision-update--s3multi=50
az containerapp ingress traffic set \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--revision-weight ca-revision-update--v1=0 ca-revision-update--s3multi=100
Scenario S4: minReplicas comparison¶
for MIN in 0 1 2; do
  # Reset the image to v1 first so every iteration exercises a real v1 -> v2
  # image update (note: the reset itself creates a revision in single mode).
  az containerapp update \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --image acrrevisionupdatelab.azurecr.io/revision-lab:v1 \
    --revision-suffix "base${MIN}"
  az containerapp update \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --min-replicas "$MIN"
  az containerapp update \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
    --revision-suffix "min${MIN}"
done
Scenario S5: probe behavior comparison¶
Apply profiles B and C separately, then repeat the image update while traffic generation continues.
- Continue traffic for at least 2 minutes after the rollout appears stable.
- Export traffic logs, revision list, traffic weights, and system/console logs immediately after each scenario.
- Reset to the baseline state before the next scenario.
9. Expected signal¶
- S1 / S2 single revision: a short cluster of `502`/`503` or connection-reset style responses during the handoff if the new revision is not yet traffic-eligible.
- S3 multiple revision: no external failures while traffic remains on the old revision until the new revision is ready.
- S4 minReplicas: `minReplicas=0` should show the worst transition behavior; `1` and `2` should shorten the failure window but may still not eliminate it in single revision mode.
- S5 probe variants: delayed or misleading readiness should lengthen the transition or cause premature routing to an unready backend.
Expected external error text, if the hypothesis is correct:
- `upstream connect error or disconnect/reset before headers`
- `503 service unavailable`
- a brief `502 bad gateway`, depending on the exact Envoy/backend state
10. Results¶
Not yet executed.
Use the following capture tables during execution.
10.1 Scenario summary¶
| Scenario | Revision mode | Update type | minReplicas | Probe profile | Total requests | 200 count | 502 count | 503 count | Timeout count | Failure window | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | |||||||||||
| S2 | |||||||||||
| S3 | |||||||||||
| S4-0 | 0 | ||||||||||
| S4-1 | 1 | ||||||||||
| S4-2 | 2 | ||||||||||
| S5-B | B | ||||||||||
| S5-C | C |
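The count columns in the table above can be pre-filled from each scenario's traffic CSV. A sketch assuming the generator's schema; curl reports timeouts and connection failures as code `000`:

```shell
# Print the request-count portion of a scenario row:
# total, 200, 502, 503, and timeout (code 000) counts.
summary_row() {
  awk -F, -v scen="$1" '
    NR > 1 { total++; c[$3]++ }
    END {
      printf "| %s | %d | %d | %d | %d | %d |\n",
        scen, total, c["200"] + 0, c["502"] + 0, c["503"] + 0, c["000"] + 0
    }' "$2"
}

# Usage: summary_row S1 traffic-s1.csv
```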
10.2 Transition timing worksheet¶
| Scenario | Update started | New revision created | First ready signal | Old revision removed from traffic | First external failure | Last external failure | Stable 200 resumes | Duration |
|---|---|---|---|---|---|---|---|---|
| S1 | ||||||||
| S2 | ||||||||
| S3 |
10.3 Latency summary worksheet¶
| Scenario | Baseline p50 | Baseline p95 | Transition p50 | Transition p95 | Max latency | Comments |
|---|---|---|---|---|---|---|
| S1 | ||||||
| S2 | ||||||
| S3 |
10.4 Representative log evidence¶
Capture:
- first non-200 response body from traffic generator
- system log events for revision activation and termination
- application log entries showing readiness transition
- revision/traffic outputs before and after the update
11. Interpretation¶
Interpretation will be added after execution. Use calibrated evidence tags only.
Planned interpretation questions:
- Did single revision mode ever leave Envoy without an eligible backend? [Observed/Not Proven pending execution]
- Did multiple revision mode eliminate the external failure window when traffic remained on the previous healthy revision? [Observed/Not Proven pending execution]
- Did probe design alter the size of the external outage window or only the time to full activation? [Measured/Not Proven pending execution]
- Did `minReplicas` reduce transition failures, or mainly reduce startup latency? [Measured/Not Proven pending execution]
12. What this proves¶
Pending execution. The final section should only claim evidence directly supported by:
- traffic generator output
- revision traffic assignment state
- system/console log timestamps
- repeated scenario comparison across revision modes and probe settings
13. What this does NOT prove¶
Even after execution, this experiment will not by itself prove:
- behavior across every region or workload profile
- behavior for internal ingress, TCP ingress, or Dapr-enabled apps
- behavior under heavy concurrent load beyond the selected curl cadence
- whether every `502` vs `503` variant comes from the exact same internal platform path
- whether customer-specific apps with long startup, sidecars, or external dependencies will match the same timing
14. Support takeaway¶
Expected support guidance if the hypothesis is confirmed:
- If the customer requires near-zero downtime during deployment, prefer multiple revision mode and shift traffic only after the new revision is healthy.
- Treat `minReplicas` as a mitigation, not a guarantee, especially in single revision mode.
- Validate that readiness probes reflect real traffic readiness, not just process liveness.
- When customers report brief deploy-time `502`/`503` responses, collect:
- exact deployment timestamp
- revision mode
- traffic weights
- probe configuration
- min/max replica settings
- sample error body showing Envoy text
- Escalation quality improves significantly when support includes both external traffic evidence and revision lifecycle logs for the same minute.
15. Reproduction notes¶
- Keep the traffic generator running before, during, and after the update; one-off manual curls often miss the failure window.
- Record UTC timestamps everywhere; correlation across traffic logs and Log Analytics is otherwise painful.
- Use the same baseline revision before each scenario so that results are comparable.
- If multiple revision traffic-set commands require exact revision names, fetch them immediately after the update rather than assuming the suffix-only format.
- If the failure window is too small to capture at `0.2s` intervals, retest at `0.05s` intervals for a shorter period.
- Do not mix scale-to-zero behavior with revision update behavior unless that is the explicit variable under test.
16. Related guide / official docs¶
- GitHub issue motivation: azure-container-apps #1166
- GitHub issue motivation: azure-container-apps #1305
- Microsoft Learn: Manage revisions in Azure Container Apps
- Microsoft Learn: Health probes in Azure Container Apps
- Microsoft Learn: Enable ingress in Azure Container Apps
- Related experiment: Scale-to-Zero First Request 503/Timeout
- Related experiment: Startup, Readiness, and Liveness Probe Interactions