Troubleshooting Mental Model¶
This page provides a classification model for Azure Container Apps incidents so you can start with the correct evidence source instead of guessing.
Core idea: classify the problem first, then investigate deeply.
Why this model matters¶
Most incident delays come from category mistakes:
- image pull failures investigated as application code bugs
- scaling issues investigated as networking problems
- identity/auth failures investigated as connectivity issues
- probe failures investigated as container crashes
This classification helps you avoid looking at the wrong logs from the start.
Classification flowchart¶
flowchart TD
A[Observed symptom] --> B{Primary failure signal}
B -->|Revision stuck, image errors| C[Category 1: Provisioning issue]
B -->|Container crashes, probe fails| D[Category 2: Startup issue]
B -->|502/503, DNS errors, connectivity| E[Category 3: Networking issue]
B -->|No scaling, wrong replica count| F[Category 4: Scaling issue]
B -->|401/403, secret not found| G[Category 5: Identity / Config issue]
B -->|OOM, slow response, resource exhaustion| H[Category 6: Runtime degradation]
C --> C1[Start with ContainerAppSystemLogs - pull/provision]
D --> D1[Start with ContainerAppConsoleLogs - startup errors]
E --> E1[Start with ingress and DNS evidence]
F --> F1[Start with scale rules and KEDA metrics]
G --> G1[Start with identity assignment and RBAC]
H --> H1[Start with resource limits and memory trends]
Category summary matrix¶
| Category | Typical Symptoms | First Signal to Check | Common Mistake |
|---|---|---|---|
| Provisioning issue | Revision stuck "Provisioning", image pull errors | ContainerAppSystemLogs - pull/auth errors | Assuming app code is broken |
| Startup issue | Container crashes, probe timeout, unhealthy | ContainerAppConsoleLogs - startup sequence | Checking only system logs |
| Networking issue | 502/503, DNS failures, connectivity errors | Ingress config + DNS resolution | Restarting app without validating network |
| Scaling issue | No scale-out, stuck at min/max replicas | Scale rules + KEDA scaler logs | Looking at application logs only |
| Identity / Config issue | 401/403 to Azure services, secret not found | Identity assignment + RBAC roles | Assuming network block |
| Runtime degradation | OOM restarts, slow response, resource pressure | Resource limits + memory/CPU trends | Looking only at error logs |
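The matrix above can also be expressed as a small lookup structure, which is handy if you want a script or chatbot to suggest a first evidence source. The following is a minimal sketch, not part of any Azure tooling: the `Category` type, the `classify` function, and the keyword lists are illustrative assumptions derived from the "Typical Symptoms" column, and plain substring matching is a deliberately crude stand-in for human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    number: int
    name: str
    first_signal: str
    signals: tuple[str, ...]  # illustrative keywords (assumption), not an official list


CATEGORIES = (
    Category(1, "Provisioning issue", "ContainerAppSystemLogs - pull/auth errors",
             ("revision stuck", "image pull", "manifest unknown")),
    Category(2, "Startup issue", "ContainerAppConsoleLogs - startup sequence",
             ("crash", "probe", "unhealthy")),
    Category(3, "Networking issue", "Ingress config + DNS resolution",
             ("502", "503", "dns", "connection refused")),
    Category(4, "Scaling issue", "Scale rules + KEDA scaler logs",
             ("no scale", "replica count", "min replicas", "max replicas")),
    Category(5, "Identity / Config issue", "Identity assignment + RBAC roles",
             ("401", "403", "secret not found")),
    Category(6, "Runtime degradation", "Resource limits + memory/CPU trends",
             ("oom", "slow response", "throttl")),
)


def classify(symptom: str) -> list[Category]:
    """Return categories whose keywords appear in the symptom, best match first."""
    text = symptom.lower()
    scored = [(sum(k in text for k in c.signals), c) for c in CATEGORIES]
    matches = [(score, c) for score, c in scored if score > 0]
    # Stable sort: ties keep the category order from the matrix.
    matches.sort(key=lambda pair: -pair[0])
    return [c for _, c in matches]
```

For example, `classify("503 from ingress with DNS errors")[0].first_signal` points at the ingress and DNS evidence. An empty result means the symptom description needs more detail before classification.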
1) Category: Provisioning Issue¶
Provisioning issues occur when a new revision cannot be created or deployed successfully.
Typical symptom patterns¶
- Revision stays in "Provisioning" state indefinitely
- Image pull errors: unauthorized, manifest unknown
- ACR authentication failures
- Revision creation timeout
First signal to check¶
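// Table and column names depend on the logging destination. With the Log Analytics
// destination they are typically ContainerAppSystemLogs_CL and Log_s; with Azure Monitor
// diagnostic settings they are ContainerAppSystemLogs and Log. Adjust to match your workspace.
ContainerAppSystemLogs
| where TimeGenerated > ago(1h)
| where Log_s has_any ("pull", "image", "manifest", "auth", "401", "403", "provision", "failed")
| project TimeGenerated, Log_s
| order by TimeGenerated desc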
Common mistakes¶
- Assuming the application code is the problem when the image never started
- Not verifying ACR credentials or managed identity permissions
- Missing the difference between image pull failure and container start failure
Recommended playbooks¶
2) Category: Startup Issue¶
Startup issues occur when the container image is pulled successfully but the application fails to start or pass health probes.
Typical symptom patterns¶
- Container starts but crashes immediately (CrashLoopBackOff)
- Health probe timeout
- Application listening on wrong port
- Missing environment variables or configuration
First signal to check¶
ContainerAppConsoleLogs
| where TimeGenerated > ago(1h)
| where Log_s has_any ("error", "exception", "traceback", "failed", "exit", "bind", "listen", "port")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
Common mistakes¶
- Checking only system logs when the issue is in application startup
- Not verifying that targetPort matches the actual listening port
- Ignoring probe configuration mismatch
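A port mismatch is quick to confirm from inside the container. The sketch below is one possible check, assuming Python is available in the image and that you can run it in a console session on the replica; `port_is_listening` is a hypothetical helper name, and the port you pass should be the value configured as targetPort.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If `port_is_listening("127.0.0.1", 8080)` returns False while the app reports that it started, the app is probably listening on a different port than the one ingress forwards to (8080 here is only an example value).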
Recommended playbooks¶
3) Category: Networking Issue¶
Networking issues occur when the container is running but cannot be reached or cannot reach dependencies.
Typical symptom patterns¶
- 502/503 errors from ingress
- DNS resolution failures for private endpoints
- Service-to-service connectivity failures
- VNet integration issues
First signal to check¶
ContainerAppSystemLogs
| where TimeGenerated > ago(1h)
| where Log_s has_any ("ingress", "upstream", "502", "503", "DNS", "resolve", "connection refused")
| project TimeGenerated, Log_s
| order by TimeGenerated desc
Common mistakes¶
- Restarting the app without checking network configuration
- Assuming DNS works without testing from inside the container
- Missing Private DNS Zone linkage to VNet
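To test DNS from inside the container rather than assuming it works, a short resolution check is usually enough. This is a minimal sketch using only the Python standard library; `resolve` and `is_private_address` are hypothetical helper names, and the hostname in the usage note is a placeholder. For a private endpoint, the expectation is that the name resolves to a private address; a public address often indicates a missing Private DNS Zone link.

```python
import ipaddress
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IP addresses a hostname resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def is_private_address(address: str) -> bool:
    """Return True for addresses in private (non-public) ranges."""
    return ipaddress.ip_address(address).is_private
```

Usage: `addresses = resolve("myvault.vault.azure.net")` followed by `all(is_private_address(a) for a in addresses)`. An empty list means resolution failed outright.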
Recommended playbooks¶
- Ingress Not Reachable
- Internal DNS and Private Endpoint Failure
- Service-to-Service Connectivity Failure
4) Category: Scaling Issue¶
Scaling issues occur when KEDA autoscaling does not respond correctly to load or events.
Typical symptom patterns¶
- Replica count stays at minimum despite traffic
- Scale-out too slow for traffic spikes
- Event-driven scaling not triggering (queue, Kafka, etc.)
- Scale rules not evaluated
First signal to check¶
ContainerAppSystemLogs
| where TimeGenerated > ago(2h)
| where Log_s has_any ("scale", "replica", "KEDA", "scaler", "metric", "trigger")
| project TimeGenerated, Log_s
| order by TimeGenerated desc
Common mistakes¶
- Assuming scaling is broken when the issue is misconfigured rules
- Not checking minReplicas/maxReplicas constraints
- Ignoring scaler authentication/connection issues
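Before concluding that scaling is broken, it helps to rule out the replica bounds themselves. The following sketch encodes that check; `replica_bound_status` is a hypothetical helper, and the three inputs are values you would read from the app's scale configuration and current replica list.

```python
def replica_bound_status(current: int, min_replicas: int, max_replicas: int) -> str:
    """Describe whether the configured replica bounds explain the observed replica count."""
    if min_replicas > max_replicas:
        raise ValueError("minReplicas must not exceed maxReplicas")
    if min_replicas == max_replicas:
        return "pinned: minReplicas equals maxReplicas, so the replica count cannot change"
    if current >= max_replicas:
        return "at maxReplicas: further scale-out is blocked by the upper bound"
    if current <= min_replicas:
        return "at minReplicas: check whether scale rules are firing at all"
    return "within bounds: the bounds are not the limiting factor"
```

A result of "at minReplicas" does not prove a fault on its own; it only tells you to look next at the scale rules and scaler authentication rather than at the bounds.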
Recommended playbooks¶
5) Category: Identity / Config Issue¶
Identity and configuration issues occur when the application cannot authenticate to Azure services or access secrets.
Typical symptom patterns¶
- 401/403 errors when accessing Key Vault, Storage, SQL
- "Secret not found" or reference resolution failures
- Managed identity token acquisition failures
- Missing RBAC role assignments
First signal to check¶
ContainerAppConsoleLogs
| where TimeGenerated > ago(1h)
| where Log_s has_any ("401", "403", "Unauthorized", "Forbidden", "AADSTS", "ManagedIdentity", "secret", "keyvault")
| project TimeGenerated, Log_s
| order by TimeGenerated desc
Common mistakes¶
- Assuming network block when the issue is missing RBAC
- Not verifying managed identity is assigned to the container app
- Confusing system-assigned vs user-assigned identity scope
Recommended playbooks¶
6) Category: Runtime Degradation¶
Runtime degradation means the app starts correctly but performance deteriorates due to resource constraints.
Typical symptom patterns¶
- OOM kills and container restarts
- Increasing latency over time
- CPU throttling under load
- Memory growth without release
First signal to check¶
ContainerAppSystemLogs
| where TimeGenerated > ago(6h)
| where Log_s has_any ("OOM", "killed", "memory", "evicted", "resource", "throttle")
| project TimeGenerated, Log_s
| order by TimeGenerated desc
Common mistakes¶
- Relying only on error logs without checking resource metrics
- Setting memory limits too low for the workload
- Not identifying memory leaks vs legitimate memory usage
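One rough way to separate steady memory growth from normal fluctuation is to fit a line through evenly spaced memory samples and look at the slope. The sketch below is a heuristic under stated assumptions: samples are working-set readings in MiB taken at a fixed interval, `memory_trend` and `looks_like_leak` are hypothetical helpers, and the `min_slope` threshold is an arbitrary example value you would tune per workload. A positive slope alone cannot distinguish a leak from cache warm-up, so treat it as a first filter only.

```python
def memory_trend(samples_mib: list[float]) -> float:
    """Least-squares slope of the samples, in MiB per sample interval."""
    n = len(samples_mib)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mib) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mib))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(samples_mib: list[float], min_slope: float = 1.0) -> bool:
    """Flag sustained growth; min_slope is an example threshold (assumption)."""
    return memory_trend(samples_mib) >= min_slope
```

Readings such as `[100, 110, 120, 130]` give a clearly positive slope, while readings that hover around a stable value give a slope near zero.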
Recommended playbooks¶
Classification workflow in practice¶
1. Pick one dominant symptom and timestamp window.
2. Map to one of the six categories using the flowchart.
3. Run only the first evidence query for that category.
4. If evidence contradicts the category, reclassify immediately.
5. Open the linked playbook and continue with hypothesis-driven analysis.
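The workflow above can be sketched as a small loop: build the first-signal query for the most likely category, run it, and move to the next candidate if nothing supports the classification. This is an illustrative sketch only. The tables, time windows, and keywords are copied from the queries on this page; `build_query`, `triage`, and the `run_query` callable are hypothetical names, and treating "no rows returned" as contradicting evidence is a simplification. The generated query projects only TimeGenerated and Log_s.

```python
from typing import Callable, Optional

# (table, time window, keywords) taken from the first-signal queries on this page.
FIRST_SIGNAL = {
    "Provisioning": ("ContainerAppSystemLogs", "1h",
                     ["pull", "image", "manifest", "auth", "401", "403", "provision", "failed"]),
    "Startup": ("ContainerAppConsoleLogs", "1h",
                ["error", "exception", "traceback", "failed", "exit", "bind", "listen", "port"]),
    "Networking": ("ContainerAppSystemLogs", "1h",
                   ["ingress", "upstream", "502", "503", "DNS", "resolve", "connection refused"]),
    "Scaling": ("ContainerAppSystemLogs", "2h",
                ["scale", "replica", "KEDA", "scaler", "metric", "trigger"]),
    "Identity / Config": ("ContainerAppConsoleLogs", "1h",
                          ["401", "403", "Unauthorized", "Forbidden", "AADSTS",
                           "ManagedIdentity", "secret", "keyvault"]),
    "Runtime degradation": ("ContainerAppSystemLogs", "6h",
                            ["OOM", "killed", "memory", "evicted", "resource", "throttle"]),
}


def build_query(category: str) -> str:
    """Render the first-signal KQL query text for a category."""
    table, window, keywords = FIRST_SIGNAL[category]
    terms = ", ".join(f'"{k}"' for k in keywords)
    return "\n".join([
        table,
        f"| where TimeGenerated > ago({window})",
        f"| where Log_s has_any ({terms})",
        "| project TimeGenerated, Log_s",
        "| order by TimeGenerated desc",
    ])


def triage(candidates: list[str],
           run_query: Callable[[str], list[str]]) -> Optional[str]:
    """Return the first candidate category whose first-signal query returns evidence."""
    for category in candidates:
        if run_query(build_query(category)):
            return category
        # No supporting evidence: reclassify by trying the next candidate.
    return None
```

Here `run_query` stands in for whatever you use to execute KQL against your workspace; it receives the query text and returns the matching rows. A `None` result means none of the candidate categories had supporting evidence in its first signal.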
Anti-patterns this model prevents¶
- Wrong-table bias: querying ContainerAppConsoleLogs for image pull issues.
- Single-signal bias: assuming all 502s are application bugs.
- Identity blind spot: treating auth failures as network problems.
- Scaling assumption: assuming KEDA is broken without checking configuration.
Use category labels in incident notes
Add an explicit category label in the first incident update. Example: "Initial classification: Category 3 (Networking), confidence medium." This keeps the team aligned on which evidence to collect first.
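If incident updates are generated by tooling, the label can be produced consistently with a tiny formatter. This is a sketch, assuming a three-level confidence scale (low, medium, high); only "medium" appears in the example above, so the other levels and the `classification_label` name are assumptions.

```python
def classification_label(number: int, name: str, confidence: str) -> str:
    """Format the category label for the first incident update."""
    allowed = {"low", "medium", "high"}  # assumed scale; adjust to your team's convention
    if confidence not in allowed:
        raise ValueError(f"confidence must be one of {sorted(allowed)}")
    return f"Initial classification: Category {number} ({name}), confidence {confidence}."
```

Calling `classification_label(3, "Networking", "medium")` reproduces the example wording above.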
See Also¶
- Troubleshooting Method
- Detector Map
- Architecture Overview
- Evidence Map
- Decision Tree
- Quick Diagnosis Cards
- Playbooks Index
- KQL Query Library