Lab Guides
Hands-on troubleshooting labs for Azure Container Apps with deployable infrastructure and scripted failure/recovery flows.
All sample outputs in lab guides are PII-scrubbed and use ca-myapp, cae-myapp, and job-myapp naming.
Available Labs
| Lab | Description | Difficulty | Duration | Guide | Lab Files |
|---|---|---|---|---|---|
| ACR Image Pull Failure | Reproduces ImagePullBackOff from a non-existent image tag, then fixes image publishing/update. | Beginner | 20-30 min | Guide | Directory |
| Revision Failover and Rollback | Deploys a healthy revision, then breaks ingress port on a new revision and restores traffic. | Intermediate | 20-30 min | Guide | Directory |
| Scale Rule Mismatch | Uses unrealistic HTTP scaling thresholds to show non-scaling under load, then corrects KEDA settings. | Intermediate | 25-35 min | Guide | Directory |
| Probe and Port Mismatch | App listens on port 3000 while ingress targets 8000, causing probe failures until target port is fixed. | Beginner | 20-25 min | Guide | Directory |
| Managed Identity Key Vault Failure | App uses managed identity to read Key Vault secret but fails without Key Vault Secrets User role assignment. | Intermediate | 25-35 min | Guide | Directory |
| Revision Provisioning Failure | Revision fails because container env var references a missing secret; fixed by setting secret and deploying new revision. | Intermediate | 20-30 min | Guide | Directory |
| Ingress Target Port Mismatch | Diagnose and fix ingress failures caused by target port misconfiguration. | Beginner | 15-20 min | Guide | Directory |
| Traffic Routing Canary Failure | Diagnose traffic splitting failures when a bad revision receives production traffic. | Intermediate | 20-30 min | Guide | Directory |
| Dapr Integration | Troubleshoot Dapr sidecar and component configuration issues. | Intermediate | 35-45 min | Guide | Directory |
| Observability and Tracing | Set up OpenTelemetry and Application Insights, troubleshoot missing traces and metrics. | Intermediate | 35-45 min | Guide | Directory |
Suggested Learning Path
- ACR Image Pull Failure
- Probe and Port Mismatch
- Revision Failover and Rollback
- Revision Provisioning Failure
- Scale Rule Mismatch
- Managed Identity Key Vault Failure
- Ingress Target Port Mismatch
- Traffic Routing Canary Failure
- Dapr Integration
- Observability and Tracing
How to Use These Labs Effectively
Use this section when you want a repeatable learning loop (reproduce → observe → fix → verify).
flowchart TD
A[Choose Lab by Symptom] --> B[Deploy Lab Infrastructure]
B --> C[Trigger Failure]
C --> D[Collect Evidence]
D --> E[Apply Targeted Fix]
E --> F[Verify Recovery]
F --> G[Capture Lessons Learned]

Run labs like incident drills
Treat each lab as an on-call simulation. Time-box your investigation and record which signal (revision state, system log, console log, metrics) gave you the fastest root-cause clue.
Reuse one naming convention across all labs
Keep variable names consistent between labs ($RG, $APP_NAME, $ENVIRONMENT_NAME, $ACR_NAME, $LOCATION) so your troubleshooting muscle memory transfers cleanly.
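One way to keep this convention honest is a small guard that fails fast when any of the shared variables is unset. This is a minimal sketch, not part of the lab files; the function name `require_lab_vars` is hypothetical.

```shell
# Sketch: verify the shared lab variables are set before running lab commands.
# require_lab_vars is a hypothetical helper name, not part of the lab repository.
require_lab_vars() {
  missing=0
  for name in RG APP_NAME ENVIRONMENT_NAME ACR_NAME LOCATION; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_lab_vars || exit 1` at the top of a lab script stops the run before any Azure command executes with an empty name.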
Lab Selection Matrix
| Lab | Primary Symptom | First Signal to Check | Typical Root Cause | Fastest Recovery |
|---|---|---|---|---|
| ACR Image Pull Failure | Revision never starts | ContainerAppSystemLogs_CL pull errors | Bad image tag / registry auth | Push valid image + update app image |
| Revision Failover and Rollback | New revision unhealthy | az containerapp revision list | Risky config change in latest revision | Shift traffic back to healthy revision |
| Scale Rule Mismatch | Load increases, replicas do not | Replica count + KEDA events | Threshold too high / max replicas too low | Tune scale rule and retry load |
| Probe and Port Mismatch | Probe failures, no stable ready state | Probe failure warnings | App bind port != ingress target port | Align target port and roll out a new revision |
| Managed Identity Key Vault Failure | Route returns 500/403 | App logs with identity errors | Missing role assignment on Key Vault scope | Assign RBAC role and re-verify |
| Revision Provisioning Failure | Revision stuck/failed provisioning | Revision lifecycle events | secretRef points to missing secret | Add secret and redeploy revision |
| Ingress Target Port Mismatch | External endpoint unreachable | Ingress target port config | Target port doesn't match app listen port | Fix target port to match app |
| Traffic Routing Canary Failure | Intermittent failures (~50%) | Traffic weight and revision health | Bad revision receiving traffic | Roll back traffic to healthy revision |
| Dapr Integration | Dapr calls fail | System logs with Dapr errors | Sidecar not enabled or component misconfigured | Enable Dapr and fix component YAML |
| Observability and Tracing | No traces in App Insights | Application Insights query | Connection string not set | Configure OTel and connection string |
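Several of the first-signal checks in the matrix can be wrapped in small helpers. The following is a sketch that assumes the `containerapp` CLI extension is installed and that `$APP_NAME` and `$RG` are set. The function names are hypothetical, and the `--query` field names (`properties.healthState`, `properties.trafficWeight`, `properties.replicas`) are assumed from the revision resource shape, so adjust them if your CLI version returns different fields.

```shell
# Sketch: first-signal helpers for the matrix above (function names are hypothetical).

# Revision state: health, traffic weight, and replica count per revision.
show_revisions() {
  az containerapp revision list \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --query "[].{name:name, health:properties.healthState, traffic:properties.trafficWeight, replicas:properties.replicas}" \
    --output table
}

# System logs: image pull errors, probe failures, and Dapr sidecar errors surface here.
show_system_logs() {
  az containerapp logs show \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --type system \
    --tail 50
}

# Console logs: application output such as identity or Key Vault access errors.
show_console_logs() {
  az containerapp logs show \
    --name "$APP_NAME" \
    --resource-group "$RG" \
    --type console \
    --tail 50
}
```

Starting with `show_revisions` and then moving to the log helper that matches the symptom keeps the investigation order close to the "First Signal to Check" column.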
Step-by-Step: Standard Lab Execution Pattern
- Prepare shell variables

      export RG="rg-aca-lab-shared"
      export LOCATION="koreacentral"
      export ENVIRONMENT_NAME="cae-myapp"
      export APP_NAME="ca-myapp"
      export ACR_NAME="acrmyapp"

  Expected output: no output (environment variables set in your shell).

- Validate CLI context

  Expected output: active subscription metadata and extension upgrade confirmation.

- Deploy the chosen lab infrastructure

      az deployment group create \
        --name "lab-run" \
        --resource-group "$RG" \
        --template-file "./labs/<lab-name>/infra/main.bicep" \
        --parameters baseName="labrun"

  Expected output pattern:

- Trigger failure and collect signals

  Expected output: one or more failure indicators (for example ImagePullBackOff, ProbeFailed, 403 Forbidden, or non-scaling replica count).

- Apply targeted fix and verify recovery

      # Use the specific fix command from each lab guide
      az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table

  Expected output pattern: at least one Healthy revision with intended traffic weight.

- Clean up resources

  Expected output: deletion completed or a Succeeded state for cleanup actions.
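The commands for the "Validate CLI context" and "Clean up resources" steps are not shown above. The following is a hedged sketch of what those steps could look like with standard Azure CLI commands. It is an assumption rather than the labs' own script, the function names are hypothetical, and deleting the whole resource group assumes nothing else lives in it.

```shell
# Sketch only: assumed commands for the validate and clean-up steps.

# Validate CLI context: show the active subscription, then upgrade the extension.
validate_cli_context() {
  az account show --output table &&
    az extension add --name containerapp --upgrade
}

# Clean up: delete the lab resource group passed as the first argument.
# Refuses to run when the name is empty to avoid a malformed delete call.
cleanup_lab() {
  if [ -z "${1:-}" ]; then
    echo "usage: cleanup_lab <resource-group>" >&2
    return 1
  fi
  az group delete --name "$1" --yes
}
```

A run would call `validate_cli_context` before deploying and `cleanup_lab "$RG"` after verifying recovery. Passing the group name explicitly makes it harder to delete the wrong group by accident.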
Expected vs Actual Investigation Template
| Checkpoint | Expected State | Typical Failure State | Action |
|---|---|---|---|
| Revision health | Healthy and active | Failed or stuck provisioning | Inspect system logs and revision events |
| Replica status | Running replicas under load | 0 replicas or repeated restart | Check probes, scale settings, and runtime logs |
| Route behavior | HTTP 200 with expected payload | 5xx, timeout, or connection refused | Validate ingress + target port + dependencies |
| Identity access | Token retrieval and authorized resource call | 401/403 in console logs | Verify managed identity and RBAC scope |
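The route and identity checkpoints in the template can be turned into a quick triage helper. This sketch maps an HTTP status code to the matching action from the table. `classify_route` and `probe_route` are hypothetical names, and `probe_route` assumes `curl` is available.

```shell
# Sketch: map an HTTP status code to the Action column of the template above.
classify_route() {
  case "$1" in
    200)      echo "ok: expected state" ;;
    401|403)  echo "identity: verify managed identity and RBAC scope" ;;
    5??|000)  echo "route: validate ingress, target port, and dependencies" ;;
    *)        echo "unclassified: consult the lab guide" ;;
  esac
}

# Fetch the status code for a URL; curl reports 000 on timeout or connection refused.
probe_route() {
  curl --silent --output /dev/null --write-out '%{http_code}' --max-time 10 "$1"
}
```

For example, `classify_route "$(probe_route "https://<app-fqdn>/")"` prints the next action for the app's endpoint, where `<app-fqdn>` is a placeholder for your app's ingress hostname.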