Burst Scaling Queueing Before Replica Add¶

Status: Draft - Awaiting Execution

Experiment designed but not yet executed. This draft targets a common Container Apps support question: under sudden HTTP bursts, how much latency/error budget is lost while HTTP autoscaling is still deciding to add replicas, and which knobs actually reduce the failure window?

1. Question¶

Under sudden HTTP burst traffic, how much queueing happens before Azure Container Apps adds replicas, and which HTTP scaling settings actually reduce 5xx rates and tail latency during scale-out?

2. Why this matters¶

Customers often interpret burst-time 502/503 responses or very high p95/p99 latency as evidence that Container Apps "could not handle load," even when average load is low and steady-state throughput is acceptable. In practice, the confusing part is the transition window between the moment traffic surges and the moment new replicas become available.

This matters for support because:

many workloads are mostly idle, then receive webhook, campaign, or API burst traffic
minReplicas=1 is commonly assumed to solve all burst issues, even though it only removes scale-from-zero delay
the default HTTP scaler threshold (concurrentRequests=10) may fit steady traffic but react too slowly to short spikes
customers often tune cooldownPeriod expecting faster scale-out, even though it primarily affects scale-in timing
tail latency and 5xx responses during the first 10-60 seconds can be more important than average throughput

The experiment is intended to separate:

cold-start contribution when the app begins at 0 replicas
warm burst queueing when the app begins at 1 replica
the effect of concurrentRequests threshold on scale trigger sensitivity
the effect of pollingInterval on how quickly scaling reacts to the burst
the non-effect of cooldownPeriod on the initial scale-out decision

3. Customer symptom¶

Typical ticket phrasing:

"Short traffic spikes cause timeouts before autoscale catches up."
"Container Apps eventually scales out, but users already got 503 or very slow responses."
"minReplicas=1 helped cold starts but did not stop burst failures."
"We lowered cooldownPeriod, but burst latency did not improve."
"What should we tune first: min replicas, concurrent requests, or polling interval?"

4. Hypothesis¶

Default HTTP scaling with concurrentRequests=10 may not react quickly enough for short bursts, causing transient queueing on the initial replica before additional replicas are ready.
pollingInterval is a major contributor to burst response lag; lower values such as 10s should reduce time-to-scale compared with 30s or 60s.
cooldownPeriod affects how long replicas remain after traffic falls, but should not materially improve initial scale-out speed.
minReplicas > 0 removes scale-from-zero cold start, but does not by itself eliminate queueing during a burst that exceeds one replica's capacity.
Lower concurrentRequests thresholds (for example 5) should trigger scale-out earlier than 10, 20, or 50, reducing p99 latency and 5xx rate at the cost of more aggressive scaling.

5. Environment¶

Parameter	Value
Service	Azure Container Apps
SKU / Plan	Consumption
Region	Korea Central
Runtime	Python 3.11 custom container
OS	Linux
Test app	HTTP service with configurable per-request latency and request timing headers
Ingress	External, target port `8080`
Scaling	HTTP rule with configurable `concurrentRequests`; environment-level `pollingInterval` and `cooldownPeriod`
Load tools	`hey` or `wrk` from external Linux VM / Cloud Shell
Logging	Log Analytics + Container Apps system/console logs + Azure Monitor metrics
Date tested	Not yet executed

6. Variables¶

Experiment type: Performance / scaling behavior comparison

Controlled:

same Container Apps environment and region
same app image and code
same CPU / memory allocation per replica
same target port and ingress mode
same maximum replica limit unless explicitly changed
same burst duration, request payload size, and client location for comparable runs
same baseline artificial server latency (for example 250 ms) unless noted otherwise

Independent variables:

starting replica count: 0 vs 1
HTTP scaler concurrentRequests: 5, 10, 20, 50
environment pollingInterval: 10, 30, 60 seconds
environment cooldownPeriod: baseline 300 seconds, optional comparison 60 seconds
burst profile: short spike, step burst, sustained burst

Observed:

time from burst start to first additional replica creation
time from burst start to first additional replica becoming ready
replica count over time
HTTP status distribution (200, 429, 502, 503, client timeout)
p50, p95, p99, and max latency during the burst window
request queueing signal inferred from app-side wait time and latency inflation
scale-related system log messages
app-side timestamps showing when each replica began serving requests

Independent run definition: one clean deployment or configuration state, followed by one burst test after confirming the intended starting replica count and idle/warm condition

Planned runs per configuration: minimum 5 independent runs

Warm-up exclusion rule: exclude pre-burst verification requests from latency analysis; include all burst requests

Primary metrics: time to second replica ready, burst-window p99 latency, burst-window 5xx rate

Meaningful effect threshold:

time-to-scale change of >= 10s
p99 latency change of >= 25%
absolute 5xx rate change of >= 1 percentage point

7. Instrumentation¶

Planned evidence sources:

External load generator: hey for repeatable concurrent HTTP bursts; optional wrk for higher-rate bursts
ContainerAppSystemLogs_CL: revision lifecycle, replica creation, scaling events, environment-level events
ContainerAppConsoleLogs_CL: app log markers for request arrival, service start, and per-replica identifiers
Azure CLI: environment configuration, revision inspection, replica listing
Azure Monitor metrics: request count, response code breakdown, latency, replica count if exposed in the selected metric namespace

Recommended application log markers:

APP_START
REQUEST_START
REQUEST_END
REPLICA_ID
ARTIFICIAL_DELAY_MS
INFLIGHT_COUNT

Test application example¶

import os
import time
import socket
from datetime import datetime, timezone
from flask import Flask, jsonify, request

app = Flask(__name__)
HOSTNAME = socket.gethostname()
APP_START_MONO = time.monotonic()
APP_START_UTC = datetime.now(timezone.utc).isoformat()
BASE_DELAY_MS = int(os.getenv("BASE_DELAY_MS", "250"))
EXTRA_DELAY_MS = int(os.getenv("EXTRA_DELAY_MS", "0"))
INFLIGHT = 0

@app.route("/")
def index():
    global INFLIGHT
    request_start = time.monotonic()
    request_start_utc = datetime.now(timezone.utc).isoformat()
    INFLIGHT += 1
    inflight_at_start = INFLIGHT

    try:
        time.sleep((BASE_DELAY_MS + EXTRA_DELAY_MS) / 1000)
        return jsonify({
            "status": "ok",
            "hostname": HOSTNAME,
            "app_start_utc": APP_START_UTC,
            "request_start_utc": request_start_utc,
            "response_utc": datetime.now(timezone.utc).isoformat(),
            "service_time_ms": BASE_DELAY_MS + EXTRA_DELAY_MS,
            "inflight_at_start": inflight_at_start,
            "uptime_seconds": round(time.monotonic() - APP_START_MONO, 3)
        })
    finally:
        INFLIGHT -= 1

Load test scripts¶

Short burst with `hey`¶

#!/usr/bin/env bash
set -euo pipefail

URL="$1"
TOTAL_REQUESTS="${2:-400}"
CONCURRENCY="${3:-100}"
OUTDIR="${4:-results/hey-short-burst}"

mkdir -p "$OUTDIR"

hey \
  -n "$TOTAL_REQUESTS" \
  -c "$CONCURRENCY" \
  -disable-keepalive \
  -o csv \
  "$URL" > "$OUTDIR/hey.csv"

Sustained burst with `wrk`¶

#!/usr/bin/env bash
set -euo pipefail

URL="$1"
DURATION="${2:-90s}"
CONNECTIONS="${3:-200}"
THREADS="${4:-4}"
OUTDIR="${5:-results/wrk-sustained-burst}"

mkdir -p "$OUTDIR"

wrk \
  --latency \
  -t "$THREADS" \
  -c "$CONNECTIONS" \
  -d "$DURATION" \
  "$URL" | tee "$OUTDIR/wrk.txt"

Step-burst pattern using `curl`¶

#!/usr/bin/env bash
set -euo pipefail

URL="$1"
OUTFILE="${2:-results/step-burst.csv}"

mkdir -p "$(dirname "$OUTFILE")"
printf 'phase,ts_utc,http_code,time_total,errormsg\n' > "$OUTFILE"

run_phase() {
  local phase="$1"
  local requests_per_phase="$2"
  local parallelism="$3"

  seq "$requests_per_phase" | xargs -I{} -P "$parallelism" bash -c '
    ts_utc="$(date -u +"%Y-%m-%dT%H:%M:%S.%3NZ")"
    result="$({ curl -skS "$0" -o /dev/null -w "%{http_code},%{time_total},%{errormsg}" --max-time 15; } 2>&1 || true)"
    printf "%s,%s,%s\n" "$1" "$ts_utc" "$result"
  ' "$URL" "$phase" >> "$OUTFILE"
}

run_phase warmup 20 2
sleep 10
run_phase burst_25 100 25
sleep 15
run_phase burst_100 400 100
sleep 15
run_phase burst_200 800 200

Replica / metric collection helpers¶

#!/usr/bin/env bash
set -euo pipefail

RG="$1"
APP="$2"
OUTFILE="${3:-results/replicas.csv}"

mkdir -p "$(dirname "$OUTFILE")"
printf 'ts_utc,replica_count,replicas\n' > "$OUTFILE"

while true; do
  ts_utc="$(date -u +"%Y-%m-%dT%H:%M:%S.%3NZ")"
  replicas_json="$(az containerapp replica list --resource-group "$RG" --name "$APP" --output json)"
  replica_count="$(python3 -c 'import json,sys; print(len(json.load(sys.stdin)))' <<< "$replicas_json")"
  replicas_flat="$(python3 -c 'import json,sys; data=json.load(sys.stdin); print(";".join(sorted(x.get("name","unknown") for x in data)))' <<< "$replicas_json")"
  printf '%s,%s,%s\n' "$ts_utc" "$replica_count" "$replicas_flat" >> "$OUTFILE"
  sleep 2
done

Kusto queries¶

// Scaling / replica lifecycle timeline
ContainerAppSystemLogs_CL
| where ContainerAppName_s == "ca-burst-scaling"
| where TimeGenerated between (datetime(2026-04-12T00:00:00Z) .. datetime(2026-04-12T01:00:00Z))
| project TimeGenerated, RevisionName_s, ReplicaName_s, Reason_s, Log_s
| order by TimeGenerated asc

// Application-side request markers and replica identifiers
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "ca-burst-scaling"
| where TimeGenerated between (datetime(2026-04-12T00:00:00Z) .. datetime(2026-04-12T01:00:00Z))
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated asc

// Error-focused view during the burst window
ContainerAppSystemLogs_CL
| where ContainerAppName_s == "ca-burst-scaling"
| where Log_s has_any ("scale", "replica", "failed", "503", "502", "timeout")
| project TimeGenerated, ReplicaName_s, Reason_s, Log_s
| order by TimeGenerated asc

8. Procedure¶

8.1 Infrastructure setup¶

# Resource group
az group create \
  --name rg-aca-burst-scaling-lab \
  --location koreacentral

# Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group rg-aca-burst-scaling-lab \
  --workspace-name law-aca-burst-scaling \
  --location koreacentral

LAW_ID=$(az monitor log-analytics workspace show \
  --resource-group rg-aca-burst-scaling-lab \
  --workspace-name law-aca-burst-scaling \
  --query customerId -o tsv)

LAW_KEY=$(az monitor log-analytics workspace get-shared-keys \
  --resource-group rg-aca-burst-scaling-lab \
  --workspace-name law-aca-burst-scaling \
  --query primarySharedKey -o tsv)

# Container Apps environment
az containerapp env create \
  --name cae-burst-scaling-lab \
  --resource-group rg-aca-burst-scaling-lab \
  --location koreacentral \
  --logs-workspace-id "$LAW_ID" \
  --logs-workspace-key "$LAW_KEY"

# ACR
az acr create \
  --name acrburstscalinglab \
  --resource-group rg-aca-burst-scaling-lab \
  --sku Basic \
  --admin-enabled true \
  --location koreacentral

8.2 Build and push the test image¶

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "1", "--threads", "8", "--timeout", "120", "app:app"]

az acr build \
  --registry acrburstscalinglab \
  --resource-group rg-aca-burst-scaling-lab \
  --image burst-scaling-app:v1 \
  --file Dockerfile .

8.3 Deploy baseline Container App¶

ACR_USER=$(az acr credential show \
  --name acrburstscalinglab \
  --resource-group rg-aca-burst-scaling-lab \
  --query username -o tsv)

ACR_PASS=$(az acr credential show \
  --name acrburstscalinglab \
  --resource-group rg-aca-burst-scaling-lab \
  --query "passwords[0].value" -o tsv)

az containerapp create \
  --name ca-burst-scaling \
  --resource-group rg-aca-burst-scaling-lab \
  --environment cae-burst-scaling-lab \
  --image acrburstscalinglab.azurecr.io/burst-scaling-app:v1 \
  --registry-server acrburstscalinglab.azurecr.io \
  --registry-username "$ACR_USER" \
  --registry-password "$ACR_PASS" \
  --target-port 8080 \
  --ingress external \
  --min-replicas 0 \
  --max-replicas 10 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 10 \
  --cpu 0.5 \
  --memory 1Gi \
  --env-vars BASE_DELAY_MS=250 EXTRA_DELAY_MS=0

Record the public URL:

APP_URL=$(az containerapp show \
  --name ca-burst-scaling \
  --resource-group rg-aca-burst-scaling-lab \
  --query properties.configuration.ingress.fqdn -o tsv)
APP_URL="https://${APP_URL}/"

8.4 Environment-level scaler timing variants¶

Container Apps environment settings must be adjusted between scenario groups.

# Example: set polling interval to 10s and cooldown to 300s
az containerapp env update \
  --name cae-burst-scaling-lab \
  --resource-group rg-aca-burst-scaling-lab \
  --keda-config polling-interval=10 cooldown-period=300

# Example: revert to 30s polling interval
az containerapp env update \
  --name cae-burst-scaling-lab \
  --resource-group rg-aca-burst-scaling-lab \
  --keda-config polling-interval=30 cooldown-period=300

If the CLI syntax for --keda-config changes, capture the exact command version used in the final execution notes.

8.5 App-level scaling variants¶

For each concurrentRequests threshold, update the app and wait for the revision to stabilize.

az containerapp update \
  --name ca-burst-scaling \
  --resource-group rg-aca-burst-scaling-lab \
  --min-replicas 1 \
  --max-replicas 10 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 5

Repeat for 10, 20, and 50.

8.6 Scenario matrix¶

Scenario	Start state	`concurrentRequests`	`pollingInterval`	`cooldownPeriod`	Burst pattern	Goal
1	`minReplicas=0`, verified 0 replicas	10	30s	300s	short burst	cold start + scale-out combined
2	`minReplicas=1`, warm single replica	10	30s	300s	short burst	warm burst baseline
3	`minReplicas=1`, warm single replica	5 / 10 / 20 / 50	30s	300s	short burst	concurrency threshold comparison
4	`minReplicas=1`, warm single replica	10	10s / 30s / 60s	300s	short burst	polling interval comparison
5	`minReplicas=1`, warm single replica	10	30s	60s / 300s	short burst	prove cooldown affects scale-in, not initial scale-out
6	`minReplicas=1`, warm single replica	best two thresholds	best polling interval	300s	sustained burst	see whether short-burst gains persist

8.7 Execution sequence per independent run¶

Apply the intended environment and app scaling configuration.
Wait until the latest revision is healthy and the previous scenario's traffic has ceased.
For start-from-zero scenarios, confirm zero replicas with az containerapp replica list.
For warm scenarios, send a few low-rate requests and confirm exactly one active replica if possible.
Start the replica polling helper at 2s intervals.
Start one load script (hey, wrk, or step burst) and capture raw output.
Immediately after the burst, export system and console logs for the matching time window.
Record:
- burst start timestamp
- first non-200 timestamp
- first second-replica observed timestamp
- first second-replica ready/serving timestamp
- last non-200 timestamp
Allow the app to settle, then stop polling and archive all outputs in a scenario/run-specific folder.
Repeat until at least 5 independent runs exist for the configuration.

9. Expected signal¶

Scenario 1 (0 replicas) should show the worst user experience because cold start and queueing overlap; latency will be dominated by first replica startup plus scale-out lag.
Scenario 2 (1 replica) should avoid the cold-start penalty but still show elevated p99 latency and possible 503/timeout responses while the first replica is saturated.
Lower concurrentRequests thresholds should produce earlier replica growth, lower p99, and lower 5xx rates.
Shorter pollingInterval should shift the first scale event earlier by roughly one polling cycle relative to slower settings.
Changing cooldownPeriod alone should not materially change the timestamp of the first scale-out event under the same burst profile.

10. Results¶

10.1 Scenario summary table¶

Populate after execution.

Scenario	Runs	Median time to 2nd replica ready	Median burst `p99`	Median `5xx` rate	Notes
Cold start + scale-out	-	-	-	-	Pending
Warm single replica baseline	-	-	-	-	Pending
Threshold = 5	-	-	-	-	Pending
Threshold = 10	-	-	-	-	Pending
Threshold = 20	-	-	-	-	Pending
Threshold = 50	-	-	-	-	Pending
Polling = 10s	-	-	-	-	Pending
Polling = 30s	-	-	-	-	Pending
Polling = 60s	-	-	-	-	Pending

10.2 Per-run template¶

Run	Scenario	Burst start	First non-200	First extra replica seen	First extra replica serving	Last non-200	`p99` latency	`5xx` rate
1	Example	-	-	-	-	-	-	-

10.3 Raw artifacts to preserve¶

hey.csv or wrk.txt
replica polling CSV
exported Kusto query results
revision / replica CLI snapshots
optional charts: replica count vs time, request latency vs time, non-200 count vs time

11. Interpretation¶

To be completed only after data collection. Use explicit evidence tags.

Planned interpretation prompts:

Observed: Did new replicas appear only after a measurable queueing/error window had already begun?
Measured: How many seconds elapsed between burst start and scale-out / ready-to-serve timestamps?
Correlated: Did lower concurrentRequests or shorter pollingInterval align with lower p99 and lower 5xx?
Inferred: Is the dominant factor scaler detection delay, replica startup time, or both together?
Not Proven: If bursts are highly variable, avoid over-claiming generality beyond the tested app profile.

12. What this proves¶

This section must stay evidence-bound after execution. The intended proof targets are:

whether queueing/error windows are measurable before extra replicas begin serving burst traffic
whether lowering concurrentRequests reduces burst-time tail latency and/or 5xx for this workload
whether shorter pollingInterval changes time-to-scale in a practically meaningful way
whether cooldownPeriod affects initial burst protection or only post-burst scale-in behavior
whether minReplicas=1 removes cold start without fully solving burst queueing

13. What this does NOT prove¶

Even after execution, this experiment will not by itself prove:

behavior for non-HTTP scalers such as Service Bus, Kafka, or custom KEDA triggers
behavior for workloads with very different startup costs, CPU profiles, or upstream dependency latency
platform-wide guarantees for all regions, environments, or future Container Apps releases
exact internal implementation details of the managed Envoy / KEDA integration beyond what is externally observable
the best production setting for every customer cost/performance tradeoff

14. Support takeaway¶

Planned support guidance, pending execution:

if the symptom is first burst after idle, check whether the issue is scale-from-zero rather than generic throughput
if the symptom is burst failures despite minReplicas=1, focus on warm scale-out sensitivity (concurrentRequests, burst shape, app service time)
validate whether the customer expects cooldownPeriod to improve scale-out; it usually should not
if request bursts are short and steep, test lower HTTP concurrency thresholds and compare p99/5xx, not only average latency
capture replica timeline and request timeline together; either one alone is usually insufficient

15. Reproduction notes¶

Keep client location stable; cross-region client variance can mask burst-time effects.
Disable keep-alive for the short-burst test if the goal is to maximize concurrency pressure at ingress.
Use a deliberately modest app service time (for example 250 ms) so the first replica can saturate under burst load without the app being unrealistically slow.
For scale-from-zero scenarios, confirm zero replicas immediately before the burst; waiting too long after verification can invalidate the run.
Preserve exact CLI versions and command syntax, especially if az containerapp env update --keda-config behavior changes.
If hey and wrk disagree substantially, prefer the raw per-request artifact that best matches the customer traffic pattern under investigation.