Troubleshooting Decision Tree¶

Use this page when you need to triage quickly from symptom to likely failure category and then open the right playbook.

The tree is intentionally symptom-first and optimized for the first 10–15 minutes of incident response.

Main triage decision tree¶

flowchart TD
    S[Incident starts: user-visible impact] --> Q1{Is it a 5xx issue?}

    Q1 -->|Yes| Q1A{Intermittent or constant?}
    Q1 -->|No| Q2{Is startup failing?}

    Q1A -->|Intermittent under load| P1[Playbook: HTTP Scaling Not Triggering]
    Q1A -->|Constant after deploy| Q3{Was there a recent revision update?}
    Q1A -->|Constant, no deploy| P2[Playbook: CrashLoop OOM and Resource Pressure]

    Q3 -->|Yes, image pull error| P3[Playbook: Image Pull Failure]
    Q3 -->|Yes, probe failures| P4[Playbook: Probe Failure and Slow Start]
    Q3 -->|Yes, revision not healthy| P5[Playbook: Revision Provisioning Failure]
    Q3 -->|No clear deploy signal| P6[Playbook: Container Start Failure]

    Q2 -->|Yes| Q2A{Image pull or container crash?}
    Q2 -->|No| Q4{Is outbound dependency failing?}

    Q2A -->|Image pull failure| P3
    Q2A -->|Container crash or exit| P6
    Q2A -->|Probe timeout| P4

    Q4 -->|Yes| Q4A{DNS, VNet, or private endpoint?}
    Q4 -->|No| Q5{Is it a scaling issue?}

    Q4A -->|DNS| P9[Playbook: Internal DNS and Private Endpoint Failure]
    Q4A -->|Service-to-service| P10[Playbook: Service-to-Service Connectivity Failure]
    Q4A -->|Private endpoint| P9

    Q5 -->|Yes| Q5A{HTTP scaling or event-driven?}
    Q5 -->|No| P12[Use Methodology: build hypotheses from evidence]

    Q5A -->|HTTP scaling issues| P1
    Q5A -->|Event scaler mismatch| P13[Playbook: Event Scaler Mismatch]
    Q5A -->|Memory or resource pressure| P2

5xx branch deep-dive tree¶

flowchart LR
    A[Observed 5xx] --> B{Status pattern}
    B -->|Mostly 500| C[Check application exceptions and console logs]
    B -->|Mostly 502| D[Check ingress and probe failures]
    B -->|Mostly 503| E[Check replica availability and scaling]

    C --> F{Startup related?}
    F -->|Yes| G[Container Start Failure]
    F -->|No| H[CrashLoop OOM and Resource Pressure]

    D --> I{Probe failures present?}
    I -->|Yes| J[Probe Failure and Slow Start]
    I -->|No| K[Ingress Not Reachable]

    E --> L{Recent revision update?}
    L -->|Yes| M[Revision Provisioning Failure]
    L -->|Scaling events| N[HTTP Scaling Not Triggering]
    L -->|No| O[Event Scaler Mismatch]

Playbook leaves (direct links)¶

Startup and Provisioning¶

Ingress and Networking¶

Scaling and Runtime¶

Identity and Configuration¶

Platform Features¶

Quick reference matrix¶

Symptom Pattern	Most Likely Cause Category	Playbook Link
5xx spikes only during traffic bursts	Replica count insufficient, scaling delay	HTTP Scaling Not Triggering
503 after revision update	Startup/probe sequence failure	Revision Provisioning Failure
502 with probe failures	Probe configuration or app health issue	Probe Failure and Slow Start
Container runs but no responses	Port binding mismatch or app not listening	Container Start Failure
Image pull errors in system logs	ACR authentication or image tag issue	Image Pull Failure
Ingress unreachable externally	Ingress configuration or external access disabled	Ingress Not Reachable
High latency with restarts	Memory growth and OOM kills	CrashLoop OOM and Resource Pressure
Event-driven scaler not firing	KEDA configuration mismatch	Event Scaler Mismatch
Private endpoint dependency unreachable	DNS/private zone configuration	Internal DNS and Private Endpoint Failure
Managed identity token errors	Identity configuration or RBAC issue	Managed Identity Auth Failure
Secret resolution failures	Key Vault reference or secret configuration	Secret and Key Vault Reference Failure
Dapr sidecar not starting	Dapr component or configuration issue	Dapr Sidecar or Component Failure

Triage prompts to ask in order¶

Is it a 5xx issue? If yes, is it intermittent or constant?
Was there a recent revision update or deployment in the incident window?
Is startup failing (image pull, container crash, probe timeout)?
Is outbound dependency failing (DNS, private endpoint, service-to-service)?
Is it a scaling issue (HTTP scaling, event-driven scaling, resource pressure)?

Minimal evidence before choosing a branch¶

15-minute system log timeline (ContainerAppSystemLogs_CL)
Console logs for startup and errors (ContainerAppConsoleLogs_CL)
Revision list to correlate timing (az containerapp revision list)

ContainerAppSystemLogs_CL
| where TimeGenerated > ago(2h)
| where ContainerAppName_s == "<app-name>"
| summarize count() by Reason_s
| order by count_ desc

ContainerAppSystemLogs_CL
| where TimeGenerated > ago(24h)
| where ContainerAppName_s == "<app-name>"
| where Reason_s has_any ("ProbeFailed", "ContainerStarted", "ContainerTerminated", "ImagePullBackOff")
| project TimeGenerated, Reason_s, Log_s
| order by TimeGenerated desc

ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(6h)
| where ContainerAppName_s == "<app-name>"
| where Log_s has_any ("Exception", "Error", "timeout", "failed", "could not")
| project TimeGenerated, Log_s
| order by TimeGenerated desc

CLI triage bundle¶

az containerapp revision list --name $APP_NAME --resource-group $RG --output table
az containerapp logs show --name $APP_NAME --resource-group $RG --type system --tail 50
az containerapp logs show --name $APP_NAME --resource-group $RG --type console --tail 50
az containerapp show --name $APP_NAME --resource-group $RG --query "properties.latestRevisionName"
az monitor activity-log list --resource-group $RG --offset 24h

Avoid branch bias

Do not choose a branch only because it matches a familiar past issue. If the first branch is disproven by timestamps, return to the top and re-classify. Decision trees accelerate triage, but evidence still decides root cause.

Decision Tree Limits¶

This tree is optimized for Azure Container Apps consumption workloads.
Multi-cause incidents can map to more than one branch.
If no branch matches cleanly, use Troubleshooting Method and build explicit competing hypotheses.

Branch-specific first checks¶

If you choose the startup branch¶

Confirm expected port (8000) and startup command alignment.
Check whether probe path depends on unavailable dependencies.
Validate image tag and ACR authentication.

If you choose the networking branch¶

Verify whether only one dependency host fails.
Compare failure windows against outbound-heavy endpoints.
Test DNS resolution from within the container using exec.

If you choose the scaling branch¶

Compare replica count with request volume.
Check KEDA scaling events in system logs.
Identify whether high latency leads 5xx or follows it.

Practical triage examples¶

Intermittent 502 + probe failures + traffic bursts
- Decision tree branch: 5xx → intermittent → scaling candidate.
- Start with HTTP Scaling Not Triggering.
Revision created + immediate 503 + image pull errors
- Decision tree branch: restart/deployment → startup failing.
- Start with Image Pull Failure.
Container starts but exits repeatedly + OOM in logs
- Decision tree branch: runtime → resource pressure.
- Start with CrashLoop OOM and Resource Pressure.