Revision Update Downtime During Container Apps Deployments¶
Status: Draft - Awaiting Execution
The experiment is designed but not yet executed. This draft targets a real customer-facing issue pattern reported in GitHub issues #1166 and #1305: transient 502/503 errors, such as `upstream connect error or disconnect/reset before headers`, during revision updates.
1. Question¶
When updating an Azure Container App revision (new image or configuration change), what causes transient 502/503 responses during the transition window, and which deployment pattern achieves practical zero-downtime behavior?
2. Why this matters¶
This is a classic support case because the workload may be healthy before and after deployment, yet users still see a burst of failures during rollout. The most common customer interpretations are "Container Apps had an outage during deploy" or "Envoy randomly returned 502/503 even though the new revision eventually became healthy."
Support relevance:
- single revision mode is often chosen by default for simplicity
- many teams assume `minReplicas=1` guarantees zero downtime
- health probes are frequently missing or mapped to endpoints that do not reflect actual readiness
- platform-generated errors are brief, hard to capture, and easy to dismiss without continuous traffic generation
The experiment is intended to separate three possibilities:
- revision-switch behavior in single revision mode
- backend unavailability caused by readiness/probe timing
- mitigation value of multiple revisions, traffic splitting, and warm replicas
3. Customer symptom¶
Typical ticket phrasing:
- "Every deployment causes a few seconds of 502/503."
- "We see `upstream connect error or disconnect/reset before headers` during image updates."
- "The app is stable normally, but config changes create a short outage."
- "Multiple revision mode seems better than single revision mode, but we need evidence."
- "Setting `minReplicas` to 1 reduced failures, but did not fully eliminate them."
4. Hypothesis¶
- Single revision mode causes a brief backend gap during revision replacement, especially when the old revision stops accepting traffic before the new revision is fully ready.
- Multiple revision mode with explicit traffic control allows the old revision to continue serving until the new revision is confirmed healthy, enabling zero-downtime rollout.
- Health probe misconfiguration lengthens the failure window by making the new revision appear ready too late or by recycling it unnecessarily.
- `minReplicas > 0` reduces the cold-start contribution but does not by itself guarantee zero downtime during a revision transition.
- The observed 502/503 responses are generated by the Envoy ingress layer during the interval when no eligible backend exists for the route.
5. Environment¶
| Parameter | Value |
|---|---|
| Service | Azure Container Apps |
| SKU / Plan | Consumption |
| Region | Korea Central |
| Runtime | Python 3.11 custom container |
| OS | Linux |
| Revision modes tested | Single, Multiple |
| Baseline app | Simple HTTP service with /, /health, /ready, /version, /slowstart |
| Ingress | External, target port 8080 |
| Traffic pattern | Continuous external curl loop during update |
| Logging | Log Analytics + Container Apps system/console logs |
| Date tested | Not yet executed |
6. Variables¶
Experiment type: Config / behavior comparison across rollout scenarios
Controlled:
- same Container Apps environment and region
- same app image family and target port
- same CPU/memory sizing
- same traffic generator cadence during each rollout
- same registry source and same baseline revision before each test
- same log collection and observation window
Independent variables:
- revision mode: `single` vs `multiple`
- update type: image update vs configuration-only update
- `minReplicas`: `0`, `1`, `2`
- probe profile: correct readiness vs delayed readiness vs intentionally weak/misleading readiness
- traffic movement method: automatic replacement vs manual traffic split
Observed:
- HTTP status counts during deployment (`200`, `502`, `503`, timeout)
- first failure timestamp and last failure timestamp
- total downtime window in seconds
- p50/p95 latency during transition
- revision creation, activation, provisioning, and healthy timestamps
- replica start/terminate events
- Envoy-style error body content
- system log events indicating readiness/probe state or revision activation delays
7. Instrumentation¶
Planned evidence sources:
- Continuous curl traffic generator with per-request timestamp, latency, and HTTP code
- ContainerAppSystemLogs_CL for revision lifecycle, probe failures, replica assignment, and termination events
- ContainerAppConsoleLogs_CL for application startup and readiness timestamps
- Azure CLI for revision, replica, ingress, and traffic-weight inspection
- Optional portal checks for revision health state and traffic assignment
Recommended app log markers:
- `APP_START`
- `READY_FALSE`
- `READY_TRUE`
- `REQUEST_RECEIVED`
- `VERSION=<revision_marker>`
Traffic generator script¶
#!/usr/bin/env bash
# Usage: ./traffic.sh <url> [interval_seconds] [output_csv]
# Note: %{errormsg} requires curl 7.75.0 or newer; %3N timestamps require GNU date.
set -euo pipefail
URL="$1"
INTERVAL_SECONDS="${2:-0.2}"
OUTPUT_FILE="${3:-traffic.csv}"
printf 'ts_utc,epoch_ms,http_code,time_total,remote_ip,errormsg\n' > "$OUTPUT_FILE"
while true; do
  ts_utc="$(date -u +"%Y-%m-%dT%H:%M:%S.%3NZ")"
  epoch_ms="$(date +%s%3N)"
  curl_result="$({ curl -skS "$URL" \
    -o /tmp/aca-body.$$ \
    -w '%{http_code},%{time_total},%{remote_ip},%{errormsg}' \
    --max-time 10; } 2>&1 || true)"
  printf '%s,%s,%s\n' "$ts_utc" "$epoch_ms" "$curl_result" >> "$OUTPUT_FILE"
  # Preserve the response body (e.g. Envoy error text) for evidence capture.
  if [[ -s /tmp/aca-body.$$ ]]; then
    sed 's/^/BODY: /' /tmp/aca-body.$$ >> "${OUTPUT_FILE%.csv}.body.log"
    : > /tmp/aca-body.$$
  fi
  sleep "$INTERVAL_SECONDS"
done
Summary helpers¶
# Overall HTTP code counts
cut -d, -f3 traffic.csv | sort | uniq -c
# Requests slower than 1 second
awk -F, 'NR==1{next} $4+0 > 1 {count++} END{print count+0}' traffic.csv
# Non-200 responses with timestamps (the first and last rows bound the downtime window)
awk -F, 'NR==1{next} $3 != 200 {print $1,$3}' traffic.csv
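The failure-window and latency numbers requested in the worksheets can be derived directly from the generator output. A minimal sketch, assuming the CSV schema produced by the traffic generator above (`epoch_ms` in column 2, `http_code` in column 3, `time_total` in column 4):

```shell
# failure_window: seconds between the first and the last non-200 row.
failure_window() {
  awk -F, '
    NR > 1 && $3 != 200 {
      if (first == "") first = $2
      last = $2
    }
    END {
      if (first == "") { print "no non-200 responses captured"; exit }
      printf "failure window: %.1f s\n", (last - first) / 1000
    }' "$1"
}

# latency_percentile: approximate pNN latency over successful requests.
latency_percentile() {
  awk -F, -v p="$2" '
    NR > 1 && $3 == 200 { v[n++] = $4 + 0 }
    END {
      if (n == 0) { print "no samples"; exit }
      # simple exchange sort; sample counts per scenario are small
      for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
          if (v[j] < v[i]) { t = v[i]; v[i] = v[j]; v[j] = t }
      idx = int((p / 100) * (n - 1) + 0.5)
      printf "p%s: %.3f s\n", p, v[idx]
    }' "$1"
}
```

Usage: `failure_window traffic.csv` and `latency_percentile traffic.csv 95`.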
Log Analytics queries¶
// System log timeline around the rollout window
ContainerAppSystemLogs_CL
| where TimeGenerated between (datetime(2026-04-12T00:00:00Z) .. datetime(2026-04-12T01:00:00Z))
| where ContainerAppName_s == "ca-revision-update"
| project TimeGenerated, Reason_s, Log_s, ReplicaName_s, RevisionName_s
| order by TimeGenerated asc
// Console markers from the application
ContainerAppConsoleLogs_CL
| where TimeGenerated between (datetime(2026-04-12T00:00:00Z) .. datetime(2026-04-12T01:00:00Z))
| where ContainerAppName_s == "ca-revision-update"
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated asc
// Probe-related warnings/errors during rollout
ContainerAppSystemLogs_CL
| where ContainerAppName_s == "ca-revision-update"
| where Log_s has_any ("Probe", "readiness", "liveness", "startup", "unhealthy", "terminated")
| project TimeGenerated, RevisionName_s, Reason_s, Log_s
| order by TimeGenerated asc
8. Procedure¶
8.1 Infrastructure setup¶
# Resource group
az group create \
--name rg-aca-revision-update-lab \
--location koreacentral
# Log Analytics
az monitor log-analytics workspace create \
--resource-group rg-aca-revision-update-lab \
--workspace-name law-aca-revision-update \
--location koreacentral
LAW_ID=$(az monitor log-analytics workspace show \
--resource-group rg-aca-revision-update-lab \
--workspace-name law-aca-revision-update \
--query customerId -o tsv)
LAW_KEY=$(az monitor log-analytics workspace get-shared-keys \
--resource-group rg-aca-revision-update-lab \
--workspace-name law-aca-revision-update \
--query primarySharedKey -o tsv)
# Container Apps environment
az containerapp env create \
--name cae-revision-update-lab \
--resource-group rg-aca-revision-update-lab \
--location koreacentral \
--logs-workspace-id "$LAW_ID" \
--logs-workspace-key "$LAW_KEY"
# ACR
az acr create \
--name acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--sku Basic \
--admin-enabled true \
--location koreacentral
8.2 Test application¶
The application should expose clear readiness and version signals.
import os
import time
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)

START = time.monotonic()
STARTUP_DELAY = int(os.getenv("STARTUP_DELAY_SECONDS", "0"))
VERSION = os.getenv("APP_VERSION", "v1")
READY_AFTER = START + STARTUP_DELAY


def is_ready():
    return time.monotonic() >= READY_AFTER


@app.route("/")
def index():
    return jsonify({
        "status": "ok",
        "version": VERSION,
        "ready": is_ready(),
        "utc": datetime.now(timezone.utc).isoformat(),
        "revision": os.getenv("CONTAINER_APP_REVISION", "unknown")
    })


@app.route("/health")
def health():
    return jsonify({"status": "healthy", "version": VERSION}), 200


@app.route("/ready")
def ready():
    if is_ready():
        return jsonify({"status": "ready", "version": VERSION}), 200
    return jsonify({"status": "starting", "version": VERSION}), 503


@app.route("/version")
def version():
    return jsonify({"version": VERSION}), 200


@app.route("/slowstart")
def slowstart():
    # Listed in the environment table; assumed behavior: block until the
    # configured startup delay has elapsed, then respond normally.
    remaining = READY_AFTER - time.monotonic()
    if remaining > 0:
        time.sleep(remaining)
    return jsonify({"status": "ok", "version": VERSION}), 200
The accompanying Dockerfile (`requirements.txt` must list at least `flask` and `gunicorn`):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
# Declare the build arg so `az acr build --build-arg APP_VERSION=...` takes
# effect, and surface it to the app as an environment variable.
ARG APP_VERSION=v1
ENV APP_VERSION=$APP_VERSION
EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "1", "--timeout", "120", "app:app"]
8.3 Build and push baseline images¶
az acr build \
--registry acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--image revision-lab:v1 \
--build-arg APP_VERSION=v1 \
.
az acr build \
--registry acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--image revision-lab:v2 \
--build-arg APP_VERSION=v2 \
.
8.4 Create the baseline Container App¶
ACR_USER=$(az acr credential show \
--name acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--query username -o tsv)
ACR_PASS=$(az acr credential show \
--name acrrevisionupdatelab \
--resource-group rg-aca-revision-update-lab \
--query 'passwords[0].value' -o tsv)
az containerapp create \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--environment cae-revision-update-lab \
--image acrrevisionupdatelab.azurecr.io/revision-lab:v1 \
--registry-server acrrevisionupdatelab.azurecr.io \
--registry-username "$ACR_USER" \
--registry-password "$ACR_PASS" \
--ingress external \
--target-port 8080 \
--cpu 0.25 \
--memory 0.5Gi \
--min-replicas 1 \
--max-replicas 3 \
--revision-suffix v1 \
--env-vars APP_VERSION=v1 STARTUP_DELAY_SECONDS=0
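Before starting the traffic generator, capture the app's external FQDN. A small sketch; it assumes `jq` is installed, and the `jq` path mirrors the standard ingress property (`properties.configuration.ingress.fqdn`), so verify it against your CLI output:

```shell
# Extract the ingress FQDN from `az containerapp show -o json` output.
get_fqdn() {
  jq -r '.properties.configuration.ingress.fqdn'
}

# Usage:
# FQDN=$(az containerapp show \
#   --name ca-revision-update \
#   --resource-group rg-aca-revision-update-lab \
#   -o json | get_fqdn)
# ./traffic.sh "https://$FQDN/" 0.2 traffic.csv
```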
8.5 Probe profiles¶
Profile A: correct readiness¶
probes:
- type: Readiness
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 2
periodSeconds: 3
timeoutSeconds: 2
failureThreshold: 10
- type: Liveness
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 2
failureThreshold: 3
Profile B: delayed readiness / long transition¶
probes:
- type: Readiness
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 2
failureThreshold: 12
Profile C: misleading health endpoint¶
probes:
- type: Readiness
httpGet:
path: /health
port: 8080
initialDelaySeconds: 1
periodSeconds: 3
timeoutSeconds: 2
failureThreshold: 3
8.6 Test scenarios¶
| Scenario | Revision mode | Update type | minReplicas | Probe profile | Expected risk |
|---|---|---|---|---|---|
| S1 | Single | image update v1 -> v2 | 1 | A | brief failure window possible |
| S2 | Single | env/config change | 1 | A | brief failure window possible |
| S3 | Multiple | image update + manual traffic shift | 1 | A | zero-downtime candidate |
| S4 | Single | image update | 0 / 1 / 2 | A | compare warm capacity effect |
| S5 | Single and Multiple | image update | 1 | B and C | probe-driven transition differences |
8.7 Execution steps per scenario¶
- Confirm the baseline revision is healthy and serving only `200` responses.
- Start the traffic generator against the external FQDN and keep it running for the entire deployment.
- Record current revision list and traffic weights:
az containerapp revision list \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
-o table
az containerapp ingress traffic show \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab
- Trigger the scenario-specific update.
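Correlating traffic rows with revision lifecycle events is much easier when each trigger is timestamped the moment the command is issued. A small helper (a sketch; the `%3N` format requires GNU date, and `rollout-markers.log` is an assumed file name):

```shell
# Append a UTC marker line to rollout-markers.log and echo it to the console.
mark() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)" "$1" \
    | tee -a rollout-markers.log
}

# Usage: mark "S1 image update triggered"
```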
Scenario S1: single revision image update¶
az containerapp revision set-mode \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--mode single
az containerapp update \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
--revision-suffix s1img
Scenario S2: single revision config change¶
az containerapp update \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--set-env-vars APP_VERSION=v1 CONFIG_MARKER=$(date +%s) \
--revision-suffix s2cfg
Scenario S3: multiple revision gradual traffic shift¶
az containerapp revision set-mode \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--mode multiple
az containerapp update \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
--revision-suffix s3multi
# Pin 100% of traffic to the known-good revision first; shift weights only after the new revision reports healthy:
az containerapp ingress traffic set \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--revision-weight ca-revision-update--v1=100 ca-revision-update--s3multi=0
az containerapp ingress traffic set \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--revision-weight ca-revision-update--v1=90 ca-revision-update--s3multi=10
az containerapp ingress traffic set \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--revision-weight ca-revision-update--v1=50 ca-revision-update--s3multi=50
az containerapp ingress traffic set \
--name ca-revision-update \
--resource-group rg-aca-revision-update-lab \
--revision-weight ca-revision-update--v1=0 ca-revision-update--s3multi=100
Scenario S4: minReplicas comparison¶
for MIN in 0 1 2; do
  # Reset the image to v1 first so every iteration exercises a real v1 -> v2
  # image update (note: the reset itself creates a revision in single mode).
  az containerapp update \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --image acrrevisionupdatelab.azurecr.io/revision-lab:v1 \
    --revision-suffix "base${MIN}"
  az containerapp update \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --min-replicas "$MIN"
  az containerapp update \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
    --revision-suffix "min${MIN}"
done
Scenario S5: probe behavior comparison¶
Apply profiles B and C separately, then repeat the image update while traffic generation continues.
- Continue traffic for at least 2 minutes after the rollout appears stable.
- Export traffic logs, revision list, traffic weights, and system/console logs immediately after each scenario.
- Reset to the baseline state before the next scenario.
9. Expected signal¶
- S1 / S2 single revision: a short cluster of `502`/`503` or connection-reset style responses during the handoff if the new revision is not yet traffic-eligible.
- S3 multiple revision: no external failures while traffic remains on the old revision until the new revision is ready.
- S4 minReplicas: `minReplicas=0` should show the worst transition behavior; `1` and `2` should shorten the failure window but may still not eliminate it in single revision mode.
- S5 probe variants: delayed or misleading readiness should lengthen the transition or cause premature routing to an unready backend.
Expected external error text, if the hypothesis is correct:
- `upstream connect error or disconnect/reset before headers`
- `503 service unavailable`
- a brief `502 bad gateway`, depending on the exact Envoy/backend state
10. Results¶
Not yet executed.
Use the following capture tables during execution.
10.1 Scenario summary¶
| Scenario | Revision mode | Update type | minReplicas | Probe profile | Total requests | 200 count | 502 count | 503 count | Timeout count | Failure window | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | |||||||||||
| S2 | |||||||||||
| S3 | |||||||||||
| S4-0 | 0 | ||||||||||
| S4-1 | 1 | ||||||||||
| S4-2 | 2 | ||||||||||
| S5-B | B | ||||||||||
| S5-C | C |
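The count columns in the table above can be pre-filled from each scenario's traffic CSV. A sketch assuming the generator's schema; curl reports timeouts and connection failures as code `000`:

```shell
# Print the request-count portion of a scenario row:
# total, 200, 502, 503, and timeout (code 000) counts.
summary_row() {
  awk -F, -v scen="$1" '
    NR > 1 { total++; c[$3]++ }
    END {
      printf "| %s | %d | %d | %d | %d | %d |\n",
        scen, total, c["200"] + 0, c["502"] + 0, c["503"] + 0, c["000"] + 0
    }' "$2"
}

# Usage: summary_row S1 traffic-s1.csv
```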
10.2 Transition timing worksheet¶
| Scenario | Update started | New revision created | First ready signal | Old revision removed from traffic | First external failure | Last external failure | Stable 200 resumes | Duration |
|---|---|---|---|---|---|---|---|---|
| S1 | ||||||||
| S2 | ||||||||
| S3 |
10.3 Latency summary worksheet¶
| Scenario | Baseline p50 | Baseline p95 | Transition p50 | Transition p95 | Max latency | Comments |
|---|---|---|---|---|---|---|
| S1 | ||||||
| S2 | ||||||
| S3 |
10.4 Representative log evidence¶
Capture:
- first non-200 response body from traffic generator
- system log events for revision activation and termination
- application log entries showing readiness transition
- revision/traffic outputs before and after the update
11. Interpretation¶
Interpretation will be added after execution. Use calibrated evidence tags only.
Planned interpretation questions:
- Did single revision mode ever leave Envoy without an eligible backend? [Observed/Not Proven pending execution]
- Did multiple revision mode eliminate the external failure window when traffic remained on the previous healthy revision? [Observed/Not Proven pending execution]
- Did probe design alter the size of the external outage window or only the time to full activation? [Measured/Not Proven pending execution]
- Did `minReplicas` reduce transition failures, or mainly reduce startup latency? [Measured/Not Proven pending execution]
12. What this proves¶
Pending execution. The final section should only claim evidence directly supported by:
- traffic generator output
- revision traffic assignment state
- system/console log timestamps
- repeated scenario comparison across revision modes and probe settings
13. What this does NOT prove¶
Even after execution, this experiment will not by itself prove:
- behavior across every region or workload profile
- behavior for internal ingress, TCP ingress, or Dapr-enabled apps
- behavior under heavy concurrent load beyond the selected curl cadence
- whether every `502` vs `503` variant comes from the exact same internal platform path
- whether customer-specific apps with long startup, sidecars, or external dependencies will match the same timing
14. Support takeaway¶
Expected support guidance if the hypothesis is confirmed:
- If the customer requires near-zero downtime during deployment, prefer multiple revision mode and shift traffic only after the new revision is healthy.
- Treat `minReplicas` as a mitigation, not a guarantee, especially in single revision mode.
- Validate that readiness probes reflect real traffic readiness, not just process liveness.
- When customers report brief deploy-time `502`/`503` responses, collect:
- exact deployment timestamp
- revision mode
- traffic weights
- probe configuration
- min/max replica settings
- sample error body showing Envoy text
- Escalation quality improves significantly when support includes both external traffic evidence and revision lifecycle logs for the same minute.
15. Reproduction notes¶
- Keep the traffic generator running before, during, and after the update; one-off manual curls often miss the failure window.
- Record UTC timestamps everywhere; correlation across traffic logs and Log Analytics is otherwise painful.
- Use the same baseline revision before each scenario so that results are comparable.
- If multiple revision traffic-set commands require exact revision names, fetch them immediately after the update rather than assuming the suffix-only format.
- If the failure window is too small to capture at `0.2s` intervals, retest at `0.05s` intervals for a shorter period.
- Do not mix scale-to-zero behavior with revision update behavior unless that is the explicit variable under test.
16. Related guide / official docs¶
- GitHub issue motivation: azure-container-apps #1166
- GitHub issue motivation: azure-container-apps #1305
- Microsoft Learn: Manage revisions in Azure Container Apps
- Microsoft Learn: Health probes in Azure Container Apps
- Microsoft Learn: Enable ingress in Azure Container Apps
- Related experiment: Scale-to-Zero First Request 503/Timeout
- Related experiment: Startup, Readiness, and Liveness Probe Interactions