Traffic Routing and Canary Failure Lab¶

Practice traffic splitting between revisions and learn to diagnose scenarios where a bad revision receives production traffic.

Lab Metadata¶

Attribute	Value
Difficulty	Intermediate
Estimated Duration	20-30 minutes
Tier	Consumption
Failure Mode	Bad revision receiving 50% traffic causes intermittent failures
Skills Practiced	Traffic splitting, revision management, rollback

1) Background¶

Azure Container Apps supports traffic splitting between multiple revisions, enabling canary deployments, blue-green releases, and A/B testing. When activeRevisionsMode is set to Multiple, you can assign traffic weights to each revision.
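Because the weights assigned across revisions are expected to total 100, a small pre-check before changing traffic can catch typos. The following is a minimal sketch only: `weight_sum` is a hypothetical helper and the revision names are placeholders, neither of which is part of this lab's files.

```shell
#!/usr/bin/env bash
# weight_sum NAME=WEIGHT [NAME=WEIGHT ...]
# Hypothetical helper: prints the total of the integer weights, or fails
# if an argument is not of the form name=integer.
weight_sum() {
    local pair weight sum=0
    for pair in "$@"; do
        weight="${pair##*=}"
        if [[ "$pair" != *=* || ! "$weight" =~ ^[0-9]+$ ]]; then
            echo "invalid weight: $pair" >&2
            return 1
        fi
        sum=$((sum + 10#$weight))   # 10# forces base 10 for values like 08
    done
    echo "$sum"
}

# Revision names below are placeholders, not real revisions.
weight_sum "ca-demo--stable=90" "ca-demo--canary=10"   # prints 100
```

The same name=weight pairs are the form accepted by `--revision-weight` on `az containerapp ingress traffic set`, so a total other than 100 is a signal to stop before applying the change.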

A common failure scenario occurs when:

1. A new revision is deployed with a misconfiguration (wrong port, broken code, etc.).
2. Traffic is split between the good and bad revisions.
3. Users experience intermittent failures: some requests succeed (good revision), while others fail (bad revision).

This lab simulates this scenario by:

1. Deploying a healthy baseline revision
2. Creating a bad revision with an incorrect target port (9999)
3. Splitting traffic 50/50 between good and bad revisions
4. Observing intermittent failures and practicing rollback

Architecture¶

flowchart LR
    A[User Request] --> B[Ingress Controller]
    B --> C{Traffic Split}
    C -->|50%| D[Good Revision<br/>Port 80 ✓]
    C -->|50%| E[Bad Revision<br/>Port 9999 ✗]
    D --> F[HTTP 200]
    E --> G[HTTP 502]

2) Hypothesis¶

IF traffic is split 50/50 between a healthy revision and a revision with an incorrect target port, THEN approximately 50% of requests will fail with 502 errors.

Variable	Control State	Experimental State
Revision Count	1 (healthy)	2 (healthy + bad)
Traffic Split	100% to healthy	50% healthy, 50% bad
Request Success Rate	~100%	~50%
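The "~50%" in the hypothesis is an average, so a short request loop will rarely land on exactly half. The sketch below estimates the expected number of failures and its spread. It assumes each request independently reaches the bad revision with probability equal to its traffic weight (a simple binomial model, which is an assumption of this example rather than a documented guarantee). `expected_failures` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# expected_failures REQUESTS BAD_WEIGHT_PERCENT
# Prints the mean and standard deviation of the number of failed requests,
# assuming each request independently lands on the bad revision with
# probability BAD_WEIGHT_PERCENT/100 (binomial model; an assumption).
expected_failures() {
    local requests="$1" bad_weight="$2"
    awk -v n="$requests" -v w="$bad_weight" 'BEGIN {
        p = w / 100
        printf "%.1f %.1f\n", n * p, sqrt(n * p * (1 - p))
    }'
}

expected_failures 10 50    # prints: 5.0 1.6
expected_failures 100 50   # prints: 50.0 5.0
```

Under this model, 10 requests give about 5 failures with a spread of roughly 1.6, so results such as 3 or 7 failures are still consistent with a 50/50 split. A larger sample narrows the relative spread.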

3) Runbook¶

Deploy Baseline Infrastructure¶

export RG="rg-aca-lab-traffic"
export LOCATION="koreacentral"

az group create --name "$RG" --location "$LOCATION"

az deployment group create \
    --name "lab-traffic" \
    --resource-group "$RG" \
    --template-file "./labs/traffic-routing-canary/infra/main.bicep" \
    --parameters baseName="labtraffic"

Capture Resource Names¶

export APP_NAME="$(az deployment group show \
    --resource-group "$RG" \
    --name "lab-traffic" \
    --query "properties.outputs.containerAppName.value" \
    --output tsv)"

export APP_FQDN="$(az containerapp show \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --query "properties.configuration.ingress.fqdn" \
    --output tsv)"

Verify Baseline (Before Trigger)¶

# Confirm single revision with 100% traffic
az containerapp revision list \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --output table

Expected output:

Name                          Active    TrafficWeight    HealthState
----------------------------  --------  ---------------  -----------
ca-labtraffic-xxxxxx--xxxxx   True      100              Healthy

# Confirm endpoint is fully reachable (failures are reported instead of silently skipped)
for i in {1..5}; do
    curl --silent --fail --max-time 10 "https://${APP_FQDN}" > /dev/null && echo "Request $i: OK" || echo "Request $i: FAILED"
done

Expected: All 5 requests succeed.

Trigger the Failure¶

cd labs/traffic-routing-canary
./trigger.sh

The trigger script:

1. Records the current healthy revision name
2. Creates a new revision with target port 9999 (no process listening)
3. Splits traffic 50/50 between good and bad revisions

Observe the Failure¶

# Check revision list - should show two revisions
az containerapp revision list \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --query "[].{name:name,active:properties.active,traffic:properties.trafficWeight,health:properties.healthState}" \
    --output table

Expected output:

Name                          Active    Traffic    Health
----------------------------  --------  ---------  --------
ca-labtraffic-xxxxxx--xxxxx   True      50         Healthy
ca-labtraffic-xxxxxx--yyyyy   True      50         Healthy

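Note: Both revisions may show "Healthy". Whether the bad revision is flagged depends on its health probes: a container that is running but not listening on the configured target port can still pass if no probe checks that port. Treat the request results, not the Health column, as the primary evidence.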

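# Test multiple requests - observe intermittent failures
# --max-time keeps a hung request from stalling the loop
for i in {1..10}; do
    STATUS=$(curl --silent --output /dev/null --max-time 10 --write-out "%{http_code}" "https://${APP_FQDN}")
    echo "Request $i: HTTP $STATUS"
done

Expected: Roughly half of the requests return 200 and the rest return 502, or 000 when curl times out without receiving a response. With only 10 requests, an uneven result such as 7 successes and 3 failures is normal.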
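To compare the observed failure rate with the hypothesis, the status codes from a longer loop can be tallied rather than read by eye. This is a minimal sketch: `summarize_statuses` is a hypothetical helper, and it treats anything other than 200 as a failure, which is an assumption chosen for this lab.

```shell
#!/usr/bin/env bash
# summarize_statuses: read one HTTP status code per line on stdin and print
# totals. Anything other than 200 counts as a failure, including the 000
# that curl writes when no HTTP response was received.
summarize_statuses() {
    awk '
        NF == 0 { next }                      # skip blank lines
        { total++; if ($1 == "200") ok++ }
        END {
            failed = total - ok
            rate = (total > 0) ? int(100 * failed / total + 0.5) : 0
            printf "total=%d ok=%d failed=%d failure_rate=%d%%\n", total, ok, failed, rate
        }
    '
}

# Offline demonstration with made-up status codes:
printf '200\n502\n200\n000\n' | summarize_statuses
# prints: total=4 ok=2 failed=2 failure_rate=50%

# Usage sketch against the lab endpoint (needs network access and APP_FQDN):
#   for i in {1..20}; do
#       curl --silent --output /dev/null --max-time 10 \
#           --write-out "%{http_code}\n" "https://${APP_FQDN}"
#   done | summarize_statuses
```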

# View current traffic distribution
az containerapp ingress traffic show \
    --name "$APP_NAME" \
    --resource-group "$RG"

Fix the Issue (Rollback)¶

Roll back by sending 100% of traffic to the good revision:

# Get the good revision name (the one created first)
GOOD_REVISION=$(az containerapp revision list \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --query "sort_by([].{name:name,created:properties.createdTime}, &created)[0].name" \
    --output tsv)

# Send all traffic to good revision
az containerapp ingress traffic set \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --revision-weight "${GOOD_REVISION}=100"
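The rollback above picks the good revision by creation time, which is only safe while the oldest active revision is the healthy baseline, as in this lab. The selection rule can be checked offline with a small sketch like the one below. `oldest_revision` and `newest_revision` are hypothetical helpers, and the timestamps and revision names are made-up sample data.

```shell
#!/usr/bin/env bash
# Both helpers read "createdTime<TAB>name" lines on stdin.
# ISO 8601 timestamps in the same format and timezone sort chronologically
# as plain strings, so a byte-wise sort is enough here.
oldest_revision() {
    LC_ALL=C sort | head -n 1 | cut -f 2
}

newest_revision() {
    LC_ALL=C sort | tail -n 1 | cut -f 2
}

# Made-up sample data: two revisions created a day apart.
sample=$'2024-05-02T10:15:00+00:00\tca-demo--bad\n2024-05-01T09:00:00+00:00\tca-demo--good'

printf '%s\n' "$sample" | oldest_revision   # prints: ca-demo--good
printf '%s\n' "$sample" | newest_revision   # prints: ca-demo--bad
```

Against the live app, the same input shape could be produced with `az containerapp revision list --query "[].[properties.createdTime, name]" --output tsv`. Confirm the selected name against the revision list before changing traffic.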

Optionally, deactivate the bad revision:

BAD_REVISION=$(az containerapp revision list \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --query "sort_by([].{name:name,created:properties.createdTime}, &created)[-1].name" \
    --output tsv)

az containerapp revision deactivate \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --revision "$BAD_REVISION"

Verify the Fix¶

# Confirm traffic is 100% to good revision
az containerapp ingress traffic show \
    --name "$APP_NAME" \
    --resource-group "$RG"

# Test multiple requests - all should succeed
for i in {1..10}; do
    STATUS=$(curl --silent --output /dev/null --max-time 10 --write-out "%{http_code}" "https://${APP_FQDN}")
    echo "Request $i: HTTP $STATUS"
done

Expected: All requests return HTTP 200.

4) Experiment Log¶

Step	Action	Expected
1	Deploy baseline	Single healthy revision
2	Test baseline	100% success rate
3	Run trigger.sh	Two revisions at 50/50
4	Test requests	~50% failure rate
5	Rollback traffic	100% to good revision
6	Test after rollback	100% success rate

Expected Evidence¶

During Failure¶

Evidence Source	Expected State
`az containerapp revision list`	2 revisions, both Active
`az containerapp ingress traffic show`	50/50 split
Request loop	~50% HTTP 502

After Rollback¶

Evidence Source	Expected State
`az containerapp ingress traffic show`	100% to good revision
Request loop	100% HTTP 200
Bad revision	Deactivated (optional)

Clean Up¶

az group delete --name "$RG" --yes --no-wait
