Lab Guides

Hands-on troubleshooting labs for Azure Container Apps with deployable infrastructure and scripted failure/recovery flows.

All sample outputs in the lab guides are PII-scrubbed and use the placeholder names `ca-myapp`, `cae-myapp`, and `job-myapp`.

Available Labs

| Lab | Description | Difficulty | Duration | Guide | Lab Files |
| --- | --- | --- | --- | --- | --- |
| ACR Image Pull Failure | Reproduces `ImagePullBackOff` from a non-existent image tag, then fixes image publishing/update. | Beginner | 20-30 min | Guide | Directory |
| Revision Failover and Rollback | Deploys a healthy revision, then breaks ingress port on a new revision and restores traffic. | Intermediate | 20-30 min | Guide | Directory |
| Scale Rule Mismatch | Uses unrealistic HTTP scaling thresholds to show non-scaling under load, then corrects KEDA settings. | Intermediate | 25-35 min | Guide | Directory |
| Probe and Port Mismatch | App listens on port 3000 while ingress targets 8000, causing probe failures until target port is fixed. | Beginner | 20-25 min | Guide | Directory |
| Managed Identity Key Vault Failure | App uses managed identity to read Key Vault secret but fails without `Key Vault Secrets User` role assignment. | Intermediate | 25-35 min | Guide | Directory |
| Revision Provisioning Failure | Revision fails because container env var references a missing secret; fixed by setting secret and deploying new revision. | Intermediate | 20-30 min | Guide | Directory |
| Ingress Target Port Mismatch | Diagnose and fix ingress failures caused by target port misconfiguration. | Beginner | 15-20 min | Guide | Directory |
| Traffic Routing Canary Failure | Diagnose traffic splitting failures when a bad revision receives production traffic. | Intermediate | 20-30 min | Guide | Directory |
| Dapr Integration | Troubleshoot Dapr sidecar and component configuration issues. | Intermediate | 35-45 min | Guide | Directory |
| Observability and Tracing | Set up OpenTelemetry and Application Insights, troubleshoot missing traces and metrics. | Intermediate | 35-45 min | Guide | Directory |

Suggested Learning Path

How to Use These Labs Effectively

Use this section when you want a repeatable learning loop (reproduce → observe → fix → verify).

```mermaid
flowchart TD
    A[Choose Lab by Symptom] --> B[Deploy Lab Infrastructure]
    B --> C[Trigger Failure]
    C --> D[Collect Evidence]
    D --> E[Apply Targeted Fix]
    E --> F[Verify Recovery]
    F --> G[Capture Lessons Learned]
```

Run labs like incident drills

Treat each lab as an on-call simulation. Time-box your investigation and record which signal (revision state, system log, console log, metrics) gave you the fastest root-cause clue.
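Recording signals during a drill can be as simple as appending timestamped notes to a file. The sketch below is one possible approach, not part of the lab files: the function name `drill_note`, the `DRILL_LOG` variable, and the default file name `drill-notes.tsv` are all hypothetical.

```shell
# Hypothetical helper (not part of the lab files): append one tab-separated
# line with a UTC timestamp, the signal you checked, and what it told you.
drill_note() {
  printf '%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" \
    >> "${DRILL_LOG:-drill-notes.tsv}"
}

# Example call during a drill:
# drill_note "system log" "pull error for missing image tag"
```

Comparing the timestamps of the first and last note gives a rough measure of how long the investigation took.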

Reuse one naming convention across all labs

Keep variable names consistent between labs (`$RG`, `$APP_NAME`, `$ENVIRONMENT_NAME`, `$ACR_NAME`, `$LOCATION`) so your troubleshooting muscle memory transfers cleanly.
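One way to enforce the convention is a small guard that fails fast when a variable is missing. This is a minimal sketch under the assumption that you run it in the same shell session as the lab commands; the function name `require_lab_vars` is hypothetical and not part of the lab files.

```shell
# Hypothetical helper (not part of the lab files): report every shared lab
# variable that is unset or empty, and return non-zero if any is missing.
require_lab_vars() {
  _lab_missing=0
  for _lab_name in RG APP_NAME ENVIRONMENT_NAME ACR_NAME LOCATION; do
    # Indirect lookup of the variable named by $_lab_name. eval is acceptable
    # here only because the names come from the fixed list above.
    eval "_lab_value=\${$_lab_name:-}"
    if [ -z "$_lab_value" ]; then
      echo "missing: $_lab_name" >&2
      _lab_missing=1
    fi
  done
  return "$_lab_missing"
}

# Example: stop before running any lab script if the context is incomplete.
# require_lab_vars || echo "set the variables listed above first"
```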

Lab Selection Matrix

| Lab | Primary Symptom | First Signal to Check | Typical Root Cause | Fastest Recovery |
| --- | --- | --- | --- | --- |
| ACR Image Pull Failure | Revision never starts | `ContainerAppSystemLogs_CL` pull errors | Bad image tag / registry auth | Push valid image + update app image |
| Revision Failover and Rollback | New revision unhealthy | `az containerapp revision list` | Risky config change in latest revision | Shift traffic back to healthy revision |
| Scale Rule Mismatch | Load increases, replicas do not | Replica count + KEDA events | Threshold too high / max replicas too low | Tune scale rule and retry load |
| Probe and Port Mismatch | Probe failures, no stable ready state | Probe failure warnings | App bind port != ingress target port | Align target port and roll out a new revision |
| Managed Identity Key Vault Failure | Route returns 500/403 | App logs with identity errors | Missing role assignment on Key Vault scope | Assign RBAC role and re-verify |
| Revision Provisioning Failure | Revision stuck/failed provisioning | Revision lifecycle events | `secretRef` points to missing secret | Add secret and redeploy revision |
| Ingress Target Port Mismatch | External endpoint unreachable | Ingress target port config | Target port doesn't match app listen port | Fix target port to match app |
| Traffic Routing Canary Failure | Intermittent failures (~50%) | Traffic weight and revision health | Bad revision receiving traffic | Roll back traffic to healthy revision |
| Dapr Integration | Dapr calls fail | System logs with Dapr errors | Sidecar not enabled or component misconfigured | Enable Dapr and fix component YAML |
| Observability and Tracing | No traces in App Insights | Application Insights query | Connection string not set | Configure OTel and connection string |
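Several rows in the matrix point at system logs as the first signal. The sketch below shows one way to build a query against `ContainerAppSystemLogs_CL`. It assumes the environment sends logs to a Log Analytics workspace and that the table exposes the `ContainerAppName_s`, `RevisionName_s`, and `Log_s` columns; the function name `build_system_log_query` and the `WORKSPACE_ID` variable are hypothetical.

```shell
# Hypothetical helper: print a KQL query for the 20 most recent system log
# entries of one container app. Column names are assumptions; adjust them to
# match your workspace schema.
build_system_log_query() {
  app_name="$1"
  cat <<EOF
ContainerAppSystemLogs_CL
| where ContainerAppName_s == '${app_name}'
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
| take 20
EOF
}

# Usage (WORKSPACE_ID is an assumed variable holding the workspace customer ID):
# az monitor log-analytics query \
#   --workspace "$WORKSPACE_ID" \
#   --analytics-query "$(build_system_log_query "$APP_NAME")" \
#   --output table
```

Keeping the query in a function makes it easy to reuse the same filter across labs and change only the app name.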

Step-by-Step: Standard Lab Execution Pattern

Prepare shell variables

```
export RG="rg-aca-lab-shared"
export LOCATION="koreacentral"
export ENVIRONMENT_NAME="cae-myapp"
export APP_NAME="ca-myapp"
export ACR_NAME="acrmyapp"
```

Expected output: no output (environment variables set in your shell).

Validate CLI context
```
az account show --output table
az extension add --name containerapp --upgrade
```
Expected output: active subscription metadata and extension upgrade confirmation.

Deploy the chosen lab infrastructure

```
az deployment group create \
  --name "lab-run" \
  --resource-group "$RG" \
  --template-file "./labs/<lab-name>/infra/main.bicep" \
  --parameters baseName="labrun"
```

Expected output pattern:

```
"provisioningState": "Succeeded"
```

Trigger failure and collect signals
```
./labs/<lab-name>/trigger.sh
./labs/<lab-name>/verify.sh
```
Expected output: one or more failure indicators (for example `ImagePullBackOff`, `ProbeFailed`, `403 Forbidden`, or a non-scaling replica count).

Apply targeted fix and verify recovery

```
# Use the specific fix command from each lab guide
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table
```

Expected output pattern: at least one Healthy revision with intended traffic weight.
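The recovery check can also be scripted rather than read from the table output. The sketch below assumes the revision list JSON exposes `properties.healthState` and `properties.trafficWeight`; the function names `healthy_serving_revisions` and `verify_recovery` are hypothetical and not part of the lab files.

```shell
# Hypothetical helper: list active revisions that are Healthy and receive traffic.
# The JMESPath field names are assumptions about the CLI's JSON output.
healthy_serving_revisions() {
  az containerapp revision list \
    --name "$1" \
    --resource-group "$2" \
    --query "[?properties.healthState=='Healthy' && properties.trafficWeight > \`0\`].name" \
    --output tsv
}

# Hypothetical helper: succeed only if at least one such revision exists.
verify_recovery() {
  revisions="$(healthy_serving_revisions "$1" "$2")"
  if [ -z "$revisions" ]; then
    echo "no healthy revision is receiving traffic" >&2
    return 1
  fi
  echo "$revisions"
}

# Usage: verify_recovery "$APP_NAME" "$RG"
```

A non-zero exit status makes the check usable in a retry loop or as the last line of a lab script.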

Clean up resources
```
./labs/<lab-name>/cleanup.sh
```
Expected output: deletion completed or a Succeeded state for cleanup actions.

Expected vs Actual Investigation Template

| Checkpoint | Expected State | Typical Failure State | Action |
| --- | --- | --- | --- |
| Revision health | `Healthy` and active | `Failed` or stuck provisioning | Inspect system logs and revision events |
| Replica status | Running replicas under load | 0 replicas or repeated restart | Check probes, scale settings, and runtime logs |
| Route behavior | HTTP 200 with expected payload | 5xx, timeout, or connection refused | Validate ingress + target port + dependencies |
| Identity access | Token retrieval and authorized resource call | 401/403 in console logs | Verify managed identity and RBAC scope |
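The route and identity rows of the template can be turned into a quick triage helper. This is a minimal sketch, assuming you probe the app endpoint with `curl` (which reports `000` when no HTTP response arrives); the function name `triage_http_status` and the `FQDN` variable are hypothetical.

```shell
# Hypothetical helper: map an HTTP status code to the matching action from
# the investigation template.
triage_http_status() {
  case "$1" in
    200)     echo "ok: compare the payload with the expected output" ;;
    401|403) echo "identity access: verify managed identity and RBAC scope" ;;
    5??|000) echo "route behavior: validate ingress, target port, and dependencies" ;;
    *)       echo "unclassified status $1: inspect console logs" ;;
  esac
}

# Usage (FQDN is an assumed variable holding the app's ingress hostname):
# code="$(curl --silent --output /dev/null --write-out '%{http_code}' \
#   --max-time 10 "https://$FQDN")"
# triage_http_status "$code"
```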
