Lab Guides
Hands-on troubleshooting labs for Azure Container Apps with deployable infrastructure and scripted failure/recovery flows.
All sample outputs in lab guides are PII-scrubbed and use ca-myapp, cae-myapp, and job-myapp naming.
Available Labs
| Lab | Description | Difficulty | Duration | Guide | Lab Files |
|---|---|---|---|---|---|
| ACR Image Pull Failure | Reproduces ImagePullBackOff from a non-existent image tag, then fixes it by publishing the image and updating the app. | Beginner | 20-30 min | Guide | Directory |
| Revision Failover and Rollback | Deploys a healthy revision, then breaks ingress port on a new revision and restores traffic. | Intermediate | 20-30 min | Guide | Directory |
| Scale Rule Mismatch | Uses unrealistic HTTP scaling thresholds to show non-scaling under load, then corrects KEDA settings. | Intermediate | 25-35 min | Guide | Directory |
| Probe and Port Mismatch | App listens on port 3000 while ingress targets 8000, causing probe failures until target port is fixed. | Beginner | 20-25 min | Guide | Directory |
| Managed Identity Key Vault Failure | App uses managed identity to read Key Vault secret but fails without Key Vault Secrets User role assignment. | Intermediate | 25-35 min | Guide | Directory |
| Revision Provisioning Failure | Revision fails because container env var references a missing secret; fixed by setting secret and deploying new revision. | Intermediate | 20-30 min | Guide | Directory |
| Ingress Target Port Mismatch | Diagnose and fix ingress failures caused by target port misconfiguration. | Beginner | 15-20 min | Guide | Directory |
| Traffic Routing Canary Failure | Diagnose traffic splitting failures when a bad revision receives production traffic. | Intermediate | 20-30 min | Guide | Directory |
| Dapr Integration | Troubleshoot Dapr sidecar and component configuration issues. | Intermediate | 35-45 min | Guide | Directory |
| Observability and Tracing | Set up OpenTelemetry and Application Insights, troubleshoot missing traces and metrics. | Intermediate | 35-45 min | Guide | Directory |
| CD Reconnect RBAC Conflict | Reproduces AppRbacDeployment: The role assignment already exists after a previous CD disconnect left RBAC role assignments behind. | Intermediate | 25-35 min | Guide | Directory |
| Subnet CIDR Exhaustion | Demonstrates ACA environment creation failure when subnet is too small (/29) and resolves by resizing to /27. | Intermediate | 20-30 min | Guide | Inline guide only |
| UDR and NSG Egress Blocked | Shows replica startup failure when required outbound FQDNs are blocked by a UDR/NVA; resolves by allowing required rules. | Advanced | 30-45 min | Guide | Inline guide only |
| Private Endpoint DNS Failure | Reproduces DNS NXDOMAIN when Private DNS Zone is not linked to ACA VNet; resolves by adding VNet link. | Intermediate | 25-35 min | Guide | Inline guide only |
| Egress IP Change | Documents egress IP shift when environment is recreated and shows how to update downstream firewall allowlists. | Intermediate | 20-30 min | Guide | Inline guide only |
| Custom Domain TLS Renewal | Reproduces managed certificate stuck in Pending when CNAME/asuid TXT records are missing or stale. | Intermediate | 20-30 min | Guide | Inline guide only |
| WebSocket and gRPC Ingress | Demonstrates broken WebSocket connection when session affinity is off; resolves by enabling sticky sessions. | Intermediate | 25-35 min | Guide | Inline guide only |
| Session Affinity Failure | Shows state loss across replicas when sticky sessions are disabled; resolves by enabling ingress affinity. | Intermediate | 20-30 min | Guide | Inline guide only |
| Azure Files Mount Failure | Reproduces SMB mount error when storage account key or share name is wrong; resolves by correcting environment storage config. | Intermediate | 25-35 min | Guide | Inline guide only |
| EmptyDir Disk Full | Shows OOMKill-like restart when ephemeral storage is exhausted by log accumulation; resolves by increasing ephemeralStorage limit. | Intermediate | 20-30 min | Guide | Inline guide only |
| Volume Permission Denied | Reproduces permission denied when container UID does not match volume mount ownership; resolves by setting mountOptions. | Intermediate | 25-35 min | Guide | Inline guide only |
| CPU Throttling | Uses a CPU-intensive workload to trigger throttling; shows metrics and resolves by increasing CPU allocation or scaling out. | Intermediate | 25-35 min | Guide | Inline guide only |
| Memory Leak OOMKilled | Injects a memory leak to trigger OOMKilled restarts; resolves by profiling and patching the leak plus setting memory limits. | Advanced | 35-45 min | Guide | Inline guide only |
| Replica Load Imbalance | Demonstrates uneven replica utilization under steady load; resolves by tuning KEDA scale rules and concurrency. | Advanced | 30-40 min | Guide | Inline guide only |
| Docker Hub Rate Limit | Reproduces toomanyrequests pull error from Docker Hub anonymous pulls; resolves by adding authenticated registry credentials. | Beginner | 15-25 min | Guide | Inline guide only |
| Image Size Startup Delay | Shows cold-start latency from a large image (>1 GB); resolves by multi-stage build and layer caching. | Intermediate | 25-35 min | Guide | Inline guide only |
| Multi-Arch Image Mismatch | Reproduces exec format error when an ARM64-only image is pulled on AMD64 ACA host; resolves by building a multi-arch manifest. | Intermediate | 20-30 min | Guide | Inline guide only |
| Log Analytics Ingestion Gap | Demonstrates missing logs when diagnostic settings are not configured; resolves by enabling and linking Log Analytics workspace. | Beginner | 15-25 min | Guide | Inline guide only |
| App Insights Connection String Missing | Shows No telemetry when APPLICATIONINSIGHTS_CONNECTION_STRING env var is absent; resolves by injecting the secret. | Beginner | 15-20 min | Guide | Inline guide only |
| Diagnostic Settings Missing | Reproduces metrics/log gaps when Azure Monitor diagnostic settings are not created; resolves by adding diagnostic setting. | Beginner | 15-20 min | Guide | Inline guide only |
| GitHub Actions OIDC Failure | Reproduces AADSTS70021 when federated credential subject does not match repo/branch; resolves by correcting subject claim. | Intermediate | 25-35 min | Guide | Inline guide only |
| Bicep Deployment Timeout | Shows revision stuck in Provisioning during IaC deploy; resolves by reducing container startup time and tuning probe settings. | Intermediate | 25-35 min | Guide | Inline guide only |
| Revision History Limit | Demonstrates RevisionCountLimitReached (100-revision cap); resolves by deactivating and deleting old revisions. | Beginner | 15-20 min | Guide | Inline guide only |
| Subscription Quota Exceeded | Reproduces QuotaExceeded when core quota is exhausted; resolves by requesting quota increase or moving to another region. | Intermediate | 20-30 min | Guide | Inline guide only |
| Workload Profile Mismatch | Shows cost and performance issues from selecting wrong profile (Consumption vs Dedicated); resolves by switching profile. | Intermediate | 25-35 min | Guide | Inline guide only |
| Min Replicas Cost Surprise | Demonstrates unexpected billing from minReplicas: 2 during off-hours; resolves by setting minReplicas: 0 with cold-start mitigation. | Beginner | 15-20 min | Guide | Inline guide only |
| Scheduled Job Missed | Reproduces missed cron job execution due to UTC timezone mismatch; resolves by correcting cron expression. | Beginner | 15-20 min | Guide | Inline guide only |
| Event Job Storm | Demonstrates queue-backed job storm from low maxExecutions; resolves by tuning KEDA scale rules. | Advanced | 30-40 min | Guide | Inline guide only |
| Dapr State Store Failure | Reproduces Dapr state-store component failure from wrong component name or missing scope; resolves by correcting YAML. | Intermediate | 25-35 min | Guide | Inline guide only |
| Dapr Pub/Sub Failure | Shows messages not delivered when Dapr pub/sub component has wrong topic or missing consumer app scope. | Intermediate | 25-35 min | Guide | Inline guide only |
| EasyAuth Entra ID Failure | Reproduces AADSTS50011 redirect URI mismatch after enabling built-in auth; resolves by updating app registration reply URLs. | Intermediate | 20-30 min | Guide | Inline guide only |
| Multi-Region Failover | Demonstrates traffic failing to shift to secondary region when Front Door health probe is misconfigured. | Advanced | 35-50 min | Guide | Inline guide only |
Suggested Learning Path
- ACR Image Pull Failure
- Probe and Port Mismatch
- Revision Failover and Rollback
- Revision Provisioning Failure
- Scale Rule Mismatch
- Managed Identity Key Vault Failure
- Ingress Target Port Mismatch
- Traffic Routing Canary Failure
- Dapr Integration
- Observability and Tracing
- CD Reconnect RBAC Conflict
How to Use These Labs Effectively
Use this section when you want a repeatable learning loop (reproduce → observe → fix → verify).
flowchart TD
A[Choose Lab by Symptom] --> B[Deploy Lab Infrastructure]
B --> C[Trigger Failure]
C --> D[Collect Evidence]
D --> E[Apply Targeted Fix]
E --> F[Verify Recovery]
F --> G[Capture Lessons Learned]

Run labs like incident drills
Treat each lab as an on-call simulation. Time-box your investigation and record which signal (revision state, system log, console log, metrics) gave you the fastest root-cause clue.
Reuse one naming convention across all labs
Keep variable names consistent between labs ($RG, $APP_NAME, $ENVIRONMENT_NAME, $ACR_NAME, $LOCATION) so your troubleshooting muscle memory transfers cleanly.
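A small guard run at the start of each lab session is one way to enforce the shared naming convention. The sketch below is an assumption and not part of the lab files: `require_lab_vars` is a hypothetical helper that reports whichever of the five variables is unset or empty. It is written in POSIX shell so it behaves the same in bash and sh.

```shell
# Hypothetical helper (not part of the lab files): fail fast when any of the
# shared lab variables is unset or empty.
require_lab_vars() {
  missing=0
  for name in RG APP_NAME ENVIRONMENT_NAME ACR_NAME LOCATION; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Returns 0 when all variables are set, 1 otherwise.
  return "$missing"
}
```

Calling `require_lab_vars || return 1` before any `az` command stops a lab run early instead of letting the CLI fail later with an empty `--resource-group` or `--name` argument.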
Lab Selection Matrix
| Lab | Primary Symptom | First Signal to Check | Typical Root Cause | Fastest Recovery |
|---|---|---|---|---|
| ACR Image Pull Failure | Revision never starts | ContainerAppSystemLogs_CL pull errors | Bad image tag / registry auth | Push valid image + update app image |
| Revision Failover and Rollback | New revision unhealthy | az containerapp revision list | Risky config change in latest revision | Shift traffic back to healthy revision |
| Scale Rule Mismatch | Load increases, replicas do not | Replica count + KEDA events | Threshold too high / max replicas too low | Tune scale rule and retry load |
| Probe and Port Mismatch | Probe failures, no stable ready state | Probe failure warnings | App bind port != ingress target port | Align target port and roll out a new revision |
| Managed Identity Key Vault Failure | Route returns 500/403 | App logs with identity errors | Missing role assignment on Key Vault scope | Assign RBAC role and re-verify |
| Revision Provisioning Failure | Revision stuck/failed provisioning | Revision lifecycle events | secretRef points to missing secret | Add secret and redeploy revision |
| Ingress Target Port Mismatch | External endpoint unreachable | Ingress target port config | Target port doesn't match app listen port | Fix target port to match app |
| Traffic Routing Canary Failure | Intermittent failures (~50%) | Traffic weight and revision health | Bad revision receiving traffic | Rollback traffic to healthy revision |
| Dapr Integration | Dapr calls fail | System logs with Dapr errors | Sidecar not enabled or component misconfigured | Enable Dapr and fix component YAML |
| Observability and Tracing | No traces in App Insights | Application Insights query | Connection string not set | Configure OTel and connection string |
| CD Reconnect RBAC Conflict | AppRbacDeployment failure on reconnect | Role assignment ID in deployment error | Orphaned role assignment from previous CD | Delete conflicting assignment, then reconnect |
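Several rows in the matrix name ContainerAppSystemLogs_CL or system logs as the first signal. As a hedged sketch, the helper below builds a Log Analytics (KQL) query for one app's recent system log entries. The function name `system_log_query` and the `WORKSPACE_ID` variable are hypothetical. The column names are assumed to match the ContainerAppSystemLogs_CL schema in your workspace and should be checked before relying on them.

```shell
# Hypothetical helper: print a KQL query for recent system-log entries of one app.
# Column names (ContainerAppName_s, RevisionName_s, Reason_s, Log_s) are assumed
# to match your workspace's ContainerAppSystemLogs_CL schema.
system_log_query() {
  app_name="$1"
  lookback="${2:-30m}"   # default lookback window: 30 minutes
  printf '%s\n' \
    "ContainerAppSystemLogs_CL" \
    "| where TimeGenerated > ago(${lookback})" \
    "| where ContainerAppName_s == '${app_name}'" \
    "| project TimeGenerated, RevisionName_s, Reason_s, Log_s" \
    "| order by TimeGenerated desc" \
    "| take 50"
}

# Usage (requires an Azure CLI login; WORKSPACE_ID is an assumed variable
# holding the Log Analytics workspace customer ID):
#   az monitor log-analytics query \
#     --workspace "$WORKSPACE_ID" \
#     --analytics-query "$(system_log_query "$APP_NAME")" \
#     --output table
```

Keeping the query in a function makes it easy to reuse the same first check across labs and to widen the window (for example `system_log_query "$APP_NAME" 2h`) when a failure happened earlier in the session.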
Step-by-Step: Standard Lab Execution Pattern
- Prepare shell variables

      export RG="rg-aca-lab-shared"
      export LOCATION="koreacentral"
      export ENVIRONMENT_NAME="cae-myapp"
      export APP_NAME="ca-myapp"
      export ACR_NAME="acrmyapp"

  Expected output: no output (environment variables set in your shell).

- Validate CLI context

  Expected output: active subscription metadata and extension upgrade confirmation.

- Deploy the chosen lab infrastructure

      az deployment group create \
        --name "lab-run" \
        --resource-group "$RG" \
        --template-file "./labs/<lab-name>/infra/main.bicep" \
        --parameters baseName="labrun"

  Expected output pattern:

- Trigger failure and collect signals

  Expected output: one or more failure indicators (for example ImagePullBackOff, ProbeFailed, 403 Forbidden, or non-scaling replica count).

- Apply targeted fix and verify recovery

      # Use the specific fix command from each lab guide
      az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table

  Expected output pattern: at least one Healthy revision with intended traffic weight.

- Clean up resources

  Expected output: deletion completed or a Succeeded state for cleanup actions.
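The validation, signal-collection, and cleanup steps above list only their expected output. The sketch below shows one plausible set of Azure CLI commands for them, inferred from those expected outputs. These are assumptions, not the commands from the individual lab guides. The `run` wrapper and the three function names are hypothetical, and `DRY_RUN=1` (the default here) prints each command instead of executing it. The cleanup function assumes the resource group is dedicated to the lab. Do not use it on a resource group that holds anything you want to keep.

```shell
# Sketch only: commands are assumptions inferred from the expected outputs above.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Validate CLI context: show the active subscription, upgrade the extension.
validate_cli_context() {
  run az account show --output table
  run az extension add --name containerapp --upgrade
}

# Collect signals: revision states plus system log lines for the app.
collect_signals() {
  run az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table
  run az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type system
}

# Clean up: delete the lab resource group (assumes it is dedicated to the lab).
cleanup_lab() {
  run az group delete --name "$RG" --yes --no-wait
}
```

Running the functions once with the default `DRY_RUN=1` lets you review the exact commands. Setting `DRY_RUN=0` executes them against the subscription shown by `az account show`.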
Expected vs Actual Investigation Template
| Checkpoint | Expected State | Typical Failure State | Action |
|---|---|---|---|
| Revision health | Healthy and active | Failed or stuck provisioning | Inspect system logs and revision events |
| Replica status | Running replicas under load | 0 replicas or repeated restart | Check probes, scale settings, and runtime logs |
| Route behavior | HTTP 200 with expected payload | 5xx, timeout, or connection refused | Validate ingress + target port + dependencies |
| Identity access | Token retrieval and authorized resource call | 401/403 in console logs | Verify managed identity and RBAC scope |
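The route and identity rows of the template can be turned into a quick triage helper. The sketch below maps an observed HTTP status code to the matching action from the table. It is illustrative only: `next_action_for_status` is a hypothetical name, the `FQDN` variable in the usage comment is assumed, and the grouping of codes is one reasonable reading of the table rather than a rule from the lab guides.

```shell
# Hypothetical helper: map an observed HTTP status to the next action
# suggested by the investigation template.
next_action_for_status() {
  case "$1" in
    2??)     echo "OK: route healthy; confirm payload" ;;
    401|403) echo "Verify managed identity and RBAC scope" ;;
    # curl reports 000 when no HTTP response was received
    # (timeout or connection refused).
    5??|000) echo "Validate ingress, target port, and dependencies" ;;
    *)       echo "Unclassified: inspect system logs and revision events" ;;
  esac
}

# Usage (FQDN is an assumed variable holding the app's ingress hostname):
#   status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$FQDN/")"
#   next_action_for_status "$status"
```

Recording the status and the suggested action alongside the time of the check gives you an expected-versus-actual entry for each run of a lab.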