
Revision Update Downtime During Container Apps Deployments

Status: Draft - Awaiting Execution

Experiment designed but not yet executed. This draft targets a real customer-facing issue pattern reported in GitHub issues #1166 and #1305: transient 502/503 errors such as "upstream connect error or disconnect/reset before headers" during revision updates.

1. Question

When updating an Azure Container App revision (new image or configuration change), what causes transient 502/503 responses during the transition window, and which deployment pattern achieves practical zero-downtime behavior?

2. Why this matters

This is a classic support case because the workload may be healthy before and after deployment, yet users still see a burst of failures during rollout. The most common customer interpretations are "Container Apps had an outage during deploy" or "Envoy randomly returned 502/503 even though the new revision eventually became healthy."

Support relevance:

  • single revision mode is often chosen by default for simplicity
  • many teams assume minReplicas=1 guarantees zero downtime
  • health probes are frequently missing or mapped to endpoints that do not reflect actual readiness
  • platform-generated errors are brief, hard to capture, and easy to dismiss without continuous traffic generation

The experiment is intended to separate three possibilities:

  1. revision-switch behavior in single revision mode
  2. backend unavailability caused by readiness/probe timing
  3. mitigation value of multiple revisions, traffic splitting, and warm replicas

3. Customer symptom

Typical ticket phrasing:

  • "Every deployment causes a few seconds of 502/503."
  • "We see upstream connect error or disconnect/reset before headers during image updates."
  • "The app is stable normally, but config changes create a short outage."
  • "Multiple revision mode seems better than single revision mode, but we need evidence."
  • "Setting minReplicas to 1 reduced failures, but did not fully eliminate them."

4. Hypothesis

  1. Single revision mode causes a brief backend gap during revision replacement, especially when the old revision stops accepting traffic before the new revision is fully ready.
  2. Multiple revision mode with explicit traffic control allows the old revision to continue serving until the new revision is confirmed healthy, enabling zero-downtime rollout.
  3. Health probe misconfiguration lengthens the failure window by making the new revision appear ready too late or by recycling it unnecessarily.
  4. minReplicas > 0 reduces cold-start contribution but does not by itself guarantee zero-downtime during revision transition.
  5. The observed 502/503 responses are generated by the Envoy ingress layer during the interval when no eligible backend exists for the route.

5. Environment

Parameter              Value
Service                Azure Container Apps
SKU / Plan             Consumption
Region                 Korea Central
Runtime                Python 3.11 custom container
OS                     Linux
Revision modes tested  Single, Multiple
Baseline app           Simple HTTP service with /, /health, /ready, /version, /slowstart
Ingress                External, target port 8080
Traffic pattern        Continuous external curl loop during update
Logging                Log Analytics + Container Apps system/console logs
Date tested            Not yet executed

6. Variables

Experiment type: Config / behavior comparison across rollout scenarios

Controlled:

  • same Container Apps environment and region
  • same app image family and target port
  • same CPU/memory sizing
  • same traffic generator cadence during each rollout
  • same registry source and same baseline revision before each test
  • same log collection and observation window

Independent variables:

  • revision mode: single vs multiple
  • update type: image update vs configuration-only update
  • minReplicas: 0, 1, 2
  • probe profile: correct readiness vs delayed readiness vs intentionally weak/misleading readiness
  • traffic movement method: automatic replacement vs manual traffic split

Observed:

  • HTTP status counts during deployment (200, 502, 503, timeout)
  • first failure timestamp and last failure timestamp
  • total downtime window in seconds
  • p50/p95 latency during transition
  • revision creation, activation, provisioning, and healthy timestamps
  • replica start/terminate events
  • Envoy-style error body content
  • system log events indicating readiness/probe state or revision activation delays

7. Instrumentation

Planned evidence sources:

  • Continuous curl traffic generator with per-request timestamp, latency, and HTTP code
  • ContainerAppSystemLogs_CL for revision lifecycle, probe failures, replica assignment, and termination events
  • ContainerAppConsoleLogs_CL for application startup and readiness timestamps
  • Azure CLI for revision, replica, ingress, and traffic-weight inspection
  • Optional portal checks for revision health state and traffic assignment

Recommended app log markers:

  • APP_START
  • READY_FALSE
  • READY_TRUE
  • REQUEST_RECEIVED
  • VERSION=<revision_marker>

Traffic generator script

#!/usr/bin/env bash
# Usage: ./traffic-gen.sh <url> [interval_seconds] [output_csv]
# Requires GNU date (%N) and curl >= 7.75 for %{errormsg}.
set -euo pipefail

URL="$1"
INTERVAL_SECONDS="${2:-0.2}"
OUTPUT_FILE="${3:-traffic.csv}"
BODY_TMP="/tmp/aca-body.$$"
trap 'rm -f "$BODY_TMP"' EXIT

printf 'ts_utc,epoch_ms,http_code,time_total,remote_ip,errormsg\n' > "$OUTPUT_FILE"

while true; do
  ts_utc="$(date -u +"%Y-%m-%dT%H:%M:%S.%3NZ")"
  epoch_ms="$(date +%s%3N)"
  curl_result="$({ curl -skS "$URL" \
    -o "$BODY_TMP" \
    -w '%{http_code},%{time_total},%{remote_ip},%{errormsg}' \
    --max-time 10; } 2>&1 || true)"
  printf '%s,%s,%s\n' "$ts_utc" "$epoch_ms" "$curl_result" >> "$OUTPUT_FILE"

  # Keep only non-200 bodies (e.g. Envoy error text); saving 200 bodies would bloat the log
  http_code="${curl_result%%,*}"
  if [[ "$http_code" != "200" && -s "$BODY_TMP" ]]; then
    sed 's/^/BODY: /' "$BODY_TMP" >> "${OUTPUT_FILE%.csv}.body.log"
  fi
  : > "$BODY_TMP"

  sleep "$INTERVAL_SECONDS"
done

Summary helpers

# Overall HTTP code counts (skip the CSV header)
tail -n +2 traffic.csv | cut -d, -f3 | sort | uniq -c

# Requests slower than 1 second
awk -F, 'NR==1{next} $4+0 > 1 {count++} END{print count+0}' traffic.csv

# Approximate downtime window (first and last non-200)
awk -F, 'NR==1{next} $3 != 200 {print $1,$3}' traffic.csv
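Section 6 also lists p50/p95 latency as an observed variable, but no helper above computes it. A nearest-rank percentile sketch over the CSV (self-contained here with a stand-in sample; point `pct` at traffic.csv during a real run):

```shell
# Stand-in sample in the traffic.csv format so the helper runs on its own
cat > /tmp/traffic-sample.csv <<'EOF'
ts_utc,epoch_ms,http_code,time_total,remote_ip,errormsg
t1,0,200,0.10,ip,
t2,0,200,0.20,ip,
t3,0,200,0.30,ip,
t4,0,200,0.40,ip,
t5,0,503,9.99,ip,
t6,0,200,0.50,ip,
EOF

# Nearest-rank percentile of time_total (column 4) over 200 responses only
pct() {
  awk -F, 'NR>1 && $3==200 {print $4}' "$1" | sort -g |
    awk -v p="$2" '{a[NR]=$1} END{if (NR){i=int((NR*p+99)/100); if(i<1)i=1; print a[i]}}'
}

p50=$(pct /tmp/traffic-sample.csv 50)
p95=$(pct /tmp/traffic-sample.csv 95)
echo "p50=${p50} p95=${p95}"   # -> p50=0.30 p95=0.50 for the sample above
```

Filtering to 200 responses keeps the percentile comparable across scenarios; failed requests are counted separately by the code-count helper.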

Log Analytics queries

// System log timeline around the rollout window
ContainerAppSystemLogs_CL
| where TimeGenerated between (datetime(2026-04-12T00:00:00Z) .. datetime(2026-04-12T01:00:00Z))
| where ContainerAppName_s == "ca-revision-update"
| project TimeGenerated, Reason_s, Log_s, ReplicaName_s, RevisionName_s
| order by TimeGenerated asc
// Console markers from the application
ContainerAppConsoleLogs_CL
| where TimeGenerated between (datetime(2026-04-12T00:00:00Z) .. datetime(2026-04-12T01:00:00Z))
| where ContainerAppName_s == "ca-revision-update"
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated asc
// Probe-related warnings/errors during rollout
ContainerAppSystemLogs_CL
| where ContainerAppName_s == "ca-revision-update"
| where Log_s has_any ("Probe", "readiness", "liveness", "startup", "unhealthy", "terminated")
| project TimeGenerated, RevisionName_s, Reason_s, Log_s
| order by TimeGenerated asc
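The same queries can be exported from the CLI for evidence attachments. A sketch using az monitor log-analytics query, guarded so it only executes when az and the LAW_ID from section 8.1 are available; the one-hour timespan is an assumption to adjust per rollout window:

```shell
# Reuse the system-log query text; narrow the timespan to the rollout window
QUERY='ContainerAppSystemLogs_CL
| where ContainerAppName_s == "ca-revision-update"
| project TimeGenerated, Reason_s, Log_s, ReplicaName_s, RevisionName_s
| order by TimeGenerated asc'

# Export as JSON evidence (requires an authenticated az session and LAW_ID)
if command -v az >/dev/null 2>&1 && [ -n "${LAW_ID:-}" ]; then
  az monitor log-analytics query \
    --workspace "$LAW_ID" \
    --analytics-query "$QUERY" \
    --timespan PT1H \
    -o json > systemlogs-evidence.json
fi
```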

8. Procedure

8.1 Infrastructure setup

# Resource group
az group create \
  --name rg-aca-revision-update-lab \
  --location koreacentral

# Log Analytics
az monitor log-analytics workspace create \
  --resource-group rg-aca-revision-update-lab \
  --workspace-name law-aca-revision-update \
  --location koreacentral

LAW_ID=$(az monitor log-analytics workspace show \
  --resource-group rg-aca-revision-update-lab \
  --workspace-name law-aca-revision-update \
  --query customerId -o tsv)

LAW_KEY=$(az monitor log-analytics workspace get-shared-keys \
  --resource-group rg-aca-revision-update-lab \
  --workspace-name law-aca-revision-update \
  --query primarySharedKey -o tsv)

# Container Apps environment
az containerapp env create \
  --name cae-revision-update-lab \
  --resource-group rg-aca-revision-update-lab \
  --location koreacentral \
  --logs-workspace-id "$LAW_ID" \
  --logs-workspace-key "$LAW_KEY"

# ACR
az acr create \
  --name acrrevisionupdatelab \
  --resource-group rg-aca-revision-update-lab \
  --sku Basic \
  --admin-enabled true \
  --location koreacentral

8.2 Test application

The application should expose clear readiness and version signals.

import os
import time
from datetime import datetime, timezone
from flask import Flask, jsonify

app = Flask(__name__)
START = time.monotonic()
STARTUP_DELAY = int(os.getenv("STARTUP_DELAY_SECONDS", "0"))
VERSION = os.getenv("APP_VERSION", "v1")
READY_AFTER = START + STARTUP_DELAY

def is_ready():
    return time.monotonic() >= READY_AFTER

@app.route("/")
def index():
    return jsonify({
        "status": "ok",
        "version": VERSION,
        "ready": is_ready(),
        "utc": datetime.now(timezone.utc).isoformat(),
        "revision": os.getenv("CONTAINER_APP_REVISION", "unknown")
    })

@app.route("/health")
def health():
    return jsonify({"status": "healthy", "version": VERSION}), 200

@app.route("/ready")
def ready():
    if is_ready():
        return jsonify({"status": "ready", "version": VERSION}), 200
    return jsonify({"status": "starting", "version": VERSION}), 503

@app.route("/version")
def version():
    return jsonify({"version": VERSION}), 200

@app.route("/slowstart")
def slowstart():
    # Deliberately slow response (2 s) so in-flight requests can span a rollout
    time.sleep(2)
    return jsonify({"status": "slow-ok", "version": VERSION}), 200
Dockerfile (an ARG/ENV pair is added so the --build-arg APP_VERSION in 8.3 actually reaches the app):

FROM python:3.11-slim
ARG APP_VERSION=v1
ENV APP_VERSION=$APP_VERSION
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "1", "--timeout", "120", "app:app"]
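The Dockerfile copies a requirements.txt that the section never shows. A minimal version would be the following; the version pins are assumptions, and any recent Flask/gunicorn pair should work:

```
flask==3.0.3
gunicorn==22.0.0
```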

8.3 Build and push baseline images

az acr build \
  --registry acrrevisionupdatelab \
  --resource-group rg-aca-revision-update-lab \
  --image revision-lab:v1 \
  --build-arg APP_VERSION=v1 \
  .

az acr build \
  --registry acrrevisionupdatelab \
  --resource-group rg-aca-revision-update-lab \
  --image revision-lab:v2 \
  --build-arg APP_VERSION=v2 \
  .

8.4 Create the baseline Container App

ACR_USER=$(az acr credential show \
  --name acrrevisionupdatelab \
  --resource-group rg-aca-revision-update-lab \
  --query username -o tsv)

ACR_PASS=$(az acr credential show \
  --name acrrevisionupdatelab \
  --resource-group rg-aca-revision-update-lab \
  --query 'passwords[0].value' -o tsv)

az containerapp create \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --environment cae-revision-update-lab \
  --image acrrevisionupdatelab.azurecr.io/revision-lab:v1 \
  --registry-server acrrevisionupdatelab.azurecr.io \
  --registry-username "$ACR_USER" \
  --registry-password "$ACR_PASS" \
  --ingress external \
  --target-port 8080 \
  --cpu 0.25 \
  --memory 0.5Gi \
  --min-replicas 1 \
  --max-replicas 3 \
  --revision-suffix v1 \
  --env-vars APP_VERSION=v1 STARTUP_DELAY_SECONDS=0
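The traffic generator in section 7 needs the app's public URL. A sketch for resolving it, assuming an authenticated az session; it falls back to a placeholder when az is unavailable:

```shell
APP_NAME=ca-revision-update
RG=rg-aca-revision-update-lab

# Resolve the external ingress FQDN of the baseline app
FQDN=""
if command -v az >/dev/null 2>&1; then
  FQDN=$(az containerapp show --name "$APP_NAME" --resource-group "$RG" \
    --query properties.configuration.ingress.fqdn -o tsv) || FQDN=""
fi
TARGET_URL="https://${FQDN:-<app-fqdn>}/"
echo "$TARGET_URL"
```

Start the generator with this URL before triggering any scenario so the pre-rollout baseline is captured.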

8.5 Probe profiles

Each profile below goes under the container's probes array in the app template (applied, for example, with az containerapp update --yaml).

Profile A: correct readiness

probes:
  - type: Readiness
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 2
    periodSeconds: 3
    timeoutSeconds: 2
    failureThreshold: 10
  - type: Liveness
    httpGet:
      path: /health
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 10
    timeoutSeconds: 2
    failureThreshold: 3

Profile B: delayed readiness / long transition

probes:
  - type: Readiness
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 10
    timeoutSeconds: 2
    failureThreshold: 12

Profile C: misleading health endpoint

probes:
  - type: Readiness
    httpGet:
      path: /health
      port: 8080
    initialDelaySeconds: 1
    periodSeconds: 3
    timeoutSeconds: 2
    failureThreshold: 3

8.6 Test scenarios

Scenario  Revision mode        Update type                          minReplicas  Probe profile  Expected risk
S1        Single               image update v1 -> v2                1            A              brief failure window possible
S2        Single               env/config change                    1            A              brief failure window possible
S3        Multiple             image update + manual traffic shift  1            A              zero-downtime candidate
S4        Single               image update                         0 / 1 / 2    A              compare warm capacity effect
S5        Single and Multiple  image update                         1            B and C        probe-driven transition differences

8.7 Execution steps per scenario

  1. Confirm baseline revision is healthy and serving only 200 responses.
  2. Start the traffic generator against the external FQDN and keep it running for the entire deployment.
  3. Record current revision list and traffic weights:
az containerapp revision list \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  -o table

az containerapp ingress traffic show \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab
  4. Trigger the scenario-specific update.

Scenario S1: single revision image update

az containerapp revision set-mode \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --mode single

az containerapp update \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
  --revision-suffix s1img

Scenario S2: single revision config change

az containerapp update \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --set-env-vars APP_VERSION=v1 CONFIG_MARKER=$(date +%s) \
  --revision-suffix s2cfg

Scenario S3: multiple revision gradual traffic shift

az containerapp revision set-mode \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --mode multiple

az containerapp update \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
  --revision-suffix s3multi

# Pin traffic to the previous revision while the new one provisions:
az containerapp ingress traffic set \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --revision-weight ca-revision-update--v1=100 ca-revision-update--s3multi=0

# After confirming the new revision is healthy, shift gradually:
az containerapp ingress traffic set \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --revision-weight ca-revision-update--v1=90 ca-revision-update--s3multi=10

az containerapp ingress traffic set \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --revision-weight ca-revision-update--v1=50 ca-revision-update--s3multi=50

az containerapp ingress traffic set \
  --name ca-revision-update \
  --resource-group rg-aca-revision-update-lab \
  --revision-weight ca-revision-update--v1=0 ca-revision-update--s3multi=100

Scenario S4: minReplicas comparison

for MIN in 0 1 2; do
  # Reset to the v1 baseline and set the replica floor in one update,
  # so the measured transition is the v2 image rollout alone
  az containerapp update \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --image acrrevisionupdatelab.azurecr.io/revision-lab:v1 \
    --min-replicas "$MIN" \
    --revision-suffix "base-min${MIN}"

  az containerapp update \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --image acrrevisionupdatelab.azurecr.io/revision-lab:v2 \
    --revision-suffix "min${MIN}"
done

Scenario S5: probe behavior comparison

Apply profile B and C separately, then repeat the image update while traffic generation continues.

  5. Continue traffic for at least 2 minutes after the rollout appears stable.
  6. Export traffic logs, revision list, traffic weights, and system/console logs immediately after each scenario.
  7. Reset to the baseline state before the next scenario.

9. Expected signal

  • S1 / S2 single revision: short cluster of 502/503 or connection-reset style responses during the handoff if the new revision is not yet traffic-eligible.
  • S3 multiple revision: no external failures while traffic remains on the old revision until the new revision is ready.
  • S4 minReplicas: minReplicas=0 should show the worst transition behavior; 1 and 2 should shorten the failure window but may still not eliminate it in single revision mode.
  • S5 probe variants: delayed or misleading readiness should lengthen the transition or create premature routing to an unready backend.

Expected external error text, if the hypothesis is correct:

  • upstream connect error or disconnect/reset before headers
  • 503 service unavailable
  • brief 502 bad gateway depending on exact Envoy/backend state

10. Results

Not yet executed.

Use the following capture tables during execution.

10.1 Scenario summary

Scenario  Revision mode  Update type  minReplicas  Probe profile  Total requests  200 count  502 count  503 count  Timeout count  Failure window  Notes
S1
S2
S3
S4-0                                  0
S4-1                                  1
S4-2                                  2
S5-B                                               B
S5-C                                               C

10.2 Transition timing worksheet

Scenario Update started New revision created First ready signal Old revision removed from traffic First external failure Last external failure Stable 200 resumes Duration
S1
S2
S3
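The Duration column can be filled mechanically from the traffic CSV by subtracting the first failing epoch_ms from the last. A self-contained sketch (swap the sample file for traffic.csv during analysis; -1 means no failures were observed):

```shell
# Sample rows in the traffic.csv format: one 600 ms burst of failures
cat > /tmp/traffic-window.csv <<'EOF'
ts_utc,epoch_ms,http_code,time_total,remote_ip,errormsg
t1,1000,200,0.1,ip,
t2,2000,503,0.1,ip,
t3,2600,502,0.1,ip,
t4,3000,200,0.1,ip,
EOF

# First-to-last non-200 span in milliseconds (column 2 is epoch_ms)
window_ms=$(awk -F, 'NR>1 && $3!=200 {if(!first)first=$2; last=$2}
  END{print (first ? last-first : -1)}' /tmp/traffic-window.csv)
echo "failure window: ${window_ms} ms"
```

This measures first-to-last failure; strictly the outage extends until the next stable 200, which the timing worksheet captures separately.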

10.3 Latency summary worksheet

Scenario Baseline p50 Baseline p95 Transition p50 Transition p95 Max latency Comments
S1
S2
S3

10.4 Representative log evidence

Capture:

  • first non-200 response body from traffic generator
  • system log events for revision activation and termination
  • application log entries showing readiness transition
  • revision/traffic outputs before and after the update

11. Interpretation

Interpretation will be added after execution. Use calibrated evidence tags only.

Planned interpretation questions:

  1. Did single revision mode ever leave Envoy without an eligible backend? [Observed/Not Proven pending execution]
  2. Did multiple revision mode eliminate the external failure window when traffic remained on the previous healthy revision? [Observed/Not Proven pending execution]
  3. Did probe design alter the size of the external outage window or only the time to full activation? [Measured/Not Proven pending execution]
  4. Did minReplicas reduce transition failures or mainly reduce startup latency? [Measured/Not Proven pending execution]

12. What this proves

Pending execution. The final section should only claim evidence directly supported by:

  • traffic generator output
  • revision traffic assignment state
  • system/console log timestamps
  • repeated scenario comparison across revision modes and probe settings

13. What this does NOT prove

Even after execution, this experiment will not by itself prove:

  • behavior across every region or workload profile
  • behavior for internal ingress, TCP ingress, or Dapr-enabled apps
  • behavior under heavy concurrent load beyond the selected curl cadence
  • whether every 502 vs 503 variant comes from the exact same internal platform path
  • whether customer-specific apps with long startup, sidecars, or external dependencies will match the same timing

14. Support takeaway

Expected support guidance if the hypothesis is confirmed:

  1. If the customer requires near-zero downtime during deployment, prefer multiple revision mode and shift traffic only after the new revision is healthy.
  2. Treat minReplicas as a mitigation, not a guarantee, especially in single revision mode.
  3. Validate that readiness probes reflect real traffic readiness, not just process liveness.
  4. When customers report brief deploy-time 502/503, collect:
    • exact deployment timestamp
    • revision mode
    • traffic weights
    • probe configuration
    • min/max replica settings
    • sample error body showing Envoy text
  5. Escalation quality improves significantly when support includes both external traffic evidence and revision lifecycle logs for the same minute.

15. Reproduction notes

  • Keep the traffic generator running before, during, and after the update; one-off manual curls often miss the failure window.
  • Record UTC timestamps everywhere; correlation across traffic logs and Log Analytics is otherwise painful.
  • Use the same baseline revision before each scenario so that results are comparable.
  • If multiple revision traffic-set commands require exact revision names, fetch them immediately after the update rather than assuming the suffix-only format.
  • If the failure window is too small to capture at 0.2s intervals, retest at 0.05s intervals for a shorter period.
  • Do not mix scale-to-zero behavior with revision update behavior unless that is the explicit variable under test.
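When exact revision names are needed for traffic-set commands, fetching them beats assuming the suffix-only format. A sketch, assuming an authenticated az session; the JMESPath sort key properties.createdTime is an assumption based on the revision schema:

```shell
# Fetch the newest revision name rather than reconstructing it from the suffix
LATEST_REV=""
if command -v az >/dev/null 2>&1; then
  LATEST_REV=$(az containerapp revision list \
    --name ca-revision-update \
    --resource-group rg-aca-revision-update-lab \
    --query "reverse(sort_by([], &properties.createdTime))[0].name" -o tsv) || LATEST_REV=""
fi
echo "latest revision: ${LATEST_REV:-unknown}"
```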