Troubleshooting Mental Model

This page provides a classification model for Azure Container Apps incidents so you can start with the correct evidence source instead of guessing.

Core idea: classify the problem first, then investigate deeply.

Why this model matters

Most incident delays come from misclassifying the failure, for example:

image pull failures investigated as application code bugs
scaling issues investigated as networking problems
identity/auth failures investigated as connectivity issues
probe failures investigated as container crashes

Classifying the incident first keeps you from starting the investigation in the wrong logs.

Classification flowchart

flowchart TD
    A[Observed symptom] --> B{Primary failure signal}
    B -->|Revision stuck, image errors| C[Category 1: Provisioning issue]
    B -->|Container crashes, probe fails| D[Category 2: Startup issue]
    B -->|502/503, DNS errors, connectivity| E[Category 3: Networking issue]
    B -->|No scaling, wrong replica count| F[Category 4: Scaling issue]
    B -->|401/403, secret not found| G[Category 5: Identity / Config issue]
    B -->|OOM, slow response, resource exhaustion| H[Category 6: Runtime degradation]

    C --> C1[Start with ContainerAppSystemLogs - pull/provision]
    D --> D1[Start with ContainerAppConsoleLogs - startup errors]
    E --> E1[Start with ingress and DNS evidence]
    F --> F1[Start with scale rules and KEDA metrics]
    G --> G1[Start with identity assignment and RBAC]
    H --> H1[Start with resource limits and memory trends]

Category summary matrix

Category	Typical Symptoms	First Signal to Check	Common Mistake
Provisioning issue	Revision stuck in "Provisioning", image pull errors	`ContainerAppSystemLogs` - pull/auth errors	Assuming app code is broken
Startup issue	Container crashes, probe timeout, unhealthy	`ContainerAppConsoleLogs` - startup sequence	Checking only system logs
Networking issue	502/503, DNS failures, connectivity errors	Ingress config + DNS resolution	Restarting app without validating network
Scaling issue	No scale-out, stuck at min/max replicas	Scale rules + KEDA scaler logs	Looking at application logs only
Identity / Config issue	401/403 to Azure services, secret not found	Identity assignment + RBAC roles	Assuming network block
Runtime degradation	OOM restarts, slow response, resource pressure	Resource limits + memory/CPU trends	Looking only at error logs
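The flowchart and matrix above can be read as a lookup from failure signal to category and first evidence source. The following is a minimal sketch of that lookup, not a definitive classifier. The `Category` type, the `classify` function, and the exact signal phrases are illustrative assumptions adapted from the flowchart edges, and plain substring matching is deliberately crude.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Category:
    number: int
    name: str
    signals: Tuple[str, ...]  # lowercase phrases adapted from the flowchart edges (assumed wording)
    first_check: str          # starting evidence source named in the flowchart


# Order matters: the first category with a matching signal wins.
CATEGORIES = (
    Category(1, "Provisioning issue", ("revision stuck", "image pull", "manifest unknown"),
             "ContainerAppSystemLogs - pull/provision"),
    Category(2, "Startup issue", ("crash", "probe"),
             "ContainerAppConsoleLogs - startup errors"),
    Category(3, "Networking issue", ("502", "503", "dns", "connectivity"),
             "Ingress and DNS evidence"),
    Category(4, "Scaling issue", ("no scaling", "not scaling", "replica count"),
             "Scale rules and KEDA metrics"),
    Category(5, "Identity / Config issue", ("401", "403", "secret not found"),
             "Identity assignment and RBAC"),
    Category(6, "Runtime degradation", ("oom", "slow response", "resource exhaustion"),
             "Resource limits and memory trends"),
)


def classify(symptom: str) -> Optional[Category]:
    """Return the first category whose signal phrase appears in the symptom text."""
    text = symptom.lower()
    for category in CATEGORIES:
        if any(signal in text for signal in category.signals):
            return category
    return None  # no match: gather more evidence before picking a category


if __name__ == "__main__":
    match = classify("502 from ingress right after deploy")
    if match is not None:
        print(match.number, match.name, "->", match.first_check)
```

A `None` result is intentional: when no signal matches, the sketch declines to guess rather than defaulting to a category.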

1) Category: Provisioning Issue

Provisioning issues occur when a new revision cannot be created or deployed successfully.

Typical symptom patterns

Revision stays in "Provisioning" state indefinitely
Image pull errors: unauthorized, manifest unknown
ACR authentication failures
Revision creation timeout

First signal to check

ContainerAppSystemLogs
| where TimeGenerated > ago(1h)
| where Log_s has_any ("pull", "image", "manifest", "auth", "401", "403", "provision", "failed")
| project TimeGenerated, Log_s
| order by TimeGenerated desc

Common mistakes

Assuming the application code is the problem when the image never started
Not verifying ACR credentials or managed identity permissions
Missing the difference between image pull failure and container start failure
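One way to verify registry credentials before suspecting application code is to inspect the app definition itself. The sketch below assumes the JSON returned by `az containerapp show --output json` has been loaded into a dict, and that it exposes `properties.configuration.registries` and `properties.template.containers`. The function name `registry_auth` and the result labels are hypothetical.

```python
from typing import Dict, List, Tuple


def registry_auth(app: Dict) -> List[Tuple[str, str]]:
    """Pair each container image with the auth method configured for its registry."""
    props = app.get("properties") or {}
    registries = (props.get("configuration") or {}).get("registries") or []
    by_server = {entry.get("server"): entry for entry in registries}

    results = []
    for container in (props.get("template") or {}).get("containers") or []:
        image = container.get("image", "")
        # Assumed image form: "<registry server>/<repository>:<tag>"
        server = image.split("/", 1)[0]
        entry = by_server.get(server)
        if entry is None:
            method = "no matching registry entry"
        elif entry.get("identity"):
            method = "managed identity"
        elif entry.get("passwordSecretRef"):
            method = "username and password secret"
        else:
            method = "registry entry without credentials"
        results.append((image, method))
    return results
```

This only reports what is configured. It does not confirm that the identity or secret actually grants pull access on the registry, so a "managed identity" result still needs a role assignment check.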

Recommended playbooks

2) Category: Startup Issue

Startup issues occur when the container image is pulled successfully but the application fails to start or pass health probes.

Typical symptom patterns

Container starts but crashes immediately (CrashLoopBackOff)
Health probe timeout
Application listening on wrong port
Missing environment variables or configuration

First signal to check

ContainerAppConsoleLogs
| where TimeGenerated > ago(1h)
| where Log_s has_any ("error", "exception", "traceback", "failed", "exit", "bind", "listen", "port")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc

Common mistakes

Checking only system logs when the issue is in application startup
Not verifying the targetPort matches the actual listening port
Ignoring probe configuration mismatch
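The targetPort check can be made explicit by comparing the configured ingress port with the port the application reports in its startup logs. This is a minimal sketch that assumes the app definition from `az containerapp show --output json` is available as a dict with `properties.configuration.ingress.targetPort`. The helper names are hypothetical, and the listening port must be supplied by you from the console logs.

```python
from typing import Dict, Optional


def ingress_target_port(app: Dict) -> Optional[int]:
    """Read the ingress targetPort, or None when ingress is not configured."""
    configuration = (app.get("properties") or {}).get("configuration") or {}
    ingress = configuration.get("ingress") or {}
    return ingress.get("targetPort")


def port_mismatch(app: Dict, listening_port: int) -> bool:
    """True when ingress is configured but targets a different port than the app listens on."""
    target = ingress_target_port(app)
    return target is not None and target != listening_port


if __name__ == "__main__":
    example = {"properties": {"configuration": {"ingress": {"targetPort": 80}}}}
    # The app in this example logs that it is listening on 8000.
    print(port_mismatch(example, 8000))
```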

Recommended playbooks

3) Category: Networking Issue

Networking issues occur when the container is running but cannot be reached or cannot reach dependencies.

Typical symptom patterns

502/503 errors from ingress
DNS resolution failures for private endpoints
Service-to-service connectivity failures
VNet integration issues

First signal to check

ContainerAppSystemLogs
| where TimeGenerated > ago(1h)
| where Log_s has_any ("ingress", "upstream", "502", "503", "DNS", "resolve", "connection refused")
| project TimeGenerated, Log_s
| order by TimeGenerated desc

Common mistakes

Restarting the app without checking network configuration
Assuming DNS works without testing from inside the container
Missing Private DNS Zone linkage to VNet
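Testing DNS from inside the container can be as simple as resolving the dependency's hostname and checking whether the answer is a private address. The sketch below assumes a Python interpreter is available in the container image, which may not be true for your workload. The function names are hypothetical. A private endpoint hostname that resolves only to public addresses is one hint that the Private DNS Zone is not linked to the VNet.

```python
import ipaddress
import socket
from typing import List


def resolve(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated addresses the hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        return []
    # sockaddr is the fifth field; its first element is the address string.
    return sorted({info[4][0] for info in infos})


def resolves_privately(hostname: str) -> bool:
    """True when the name resolves and every address is in a private range."""
    addresses = resolve(hostname)
    if not addresses:
        return False
    # Drop any IPv6 scope suffix ("%eth0") before parsing the address.
    return all(ipaddress.ip_address(a.split("%", 1)[0]).is_private for a in addresses)
```

Run it from a console session in the replica rather than from your workstation, because the two may use different resolvers and see different answers.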

Recommended playbooks

4) Category: Scaling Issue

Scaling issues occur when KEDA autoscaling does not respond correctly to load or events.

Typical symptom patterns

Replica count stays at minimum despite traffic
Scale-out too slow for traffic spikes
Event-driven scaling not triggering (queue, Kafka, etc.)
Scale rules not evaluated

First signal to check

ContainerAppSystemLogs
| where TimeGenerated > ago(2h)
| where Log_s has_any ("scale", "replica", "KEDA", "scaler", "metric", "trigger")
| project TimeGenerated, Log_s
| order by TimeGenerated desc

Common mistakes

Assuming scaling is broken when the issue is misconfigured rules
Not checking minReplicas/maxReplicas constraints
Ignoring scaler authentication/connection issues
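Before concluding that the autoscaler is at fault, it helps to compare the observed replica count with the configured bounds. The following sketch encodes that comparison. The function name and the returned wording are illustrative assumptions, and the three inputs are values you read from the app's scale configuration and current replica list.

```python
def replica_constraint(current: int, min_replicas: int, max_replicas: int) -> str:
    """Describe how the observed replica count relates to minReplicas and maxReplicas."""
    if min_replicas > max_replicas:
        raise ValueError("minReplicas cannot exceed maxReplicas")
    if min_replicas == max_replicas:
        return "fixed: minReplicas equals maxReplicas, so scale rules cannot change the count"
    if current >= max_replicas:
        return "at ceiling: maxReplicas blocks further scale-out"
    if current <= min_replicas:
        return "at floor: check whether any scale rule is firing"
    return "within bounds: scaling is active"


if __name__ == "__main__":
    print(replica_constraint(current=10, min_replicas=1, max_replicas=10))
```

An "at ceiling" or "fixed" result points to configuration rather than to the scaler, which is the distinction the first two mistakes above describe.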

Recommended playbooks

5) Category: Identity / Config Issue

Identity and configuration issues occur when the application cannot authenticate to Azure services or cannot resolve the secrets it references.

Typical symptom patterns

401/403 errors when accessing Key Vault, Storage, SQL
"Secret not found" or reference resolution failures
Managed identity token acquisition failures
Missing RBAC role assignments

First signal to check

ContainerAppConsoleLogs
| where TimeGenerated > ago(1h)
| where Log_s has_any ("401", "403", "Unauthorized", "Forbidden", "AADSTS", "ManagedIdentity", "secret", "keyvault")
| project TimeGenerated, Log_s
| order by TimeGenerated desc

Common mistakes

Assuming network block when the issue is missing RBAC
Not verifying managed identity is assigned to the container app
Confusing the scope of system-assigned and user-assigned identities
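Whether an identity is assigned at all, and which kind, can be read from the app definition before any RBAC investigation. This is a minimal sketch that assumes the JSON from `az containerapp show --output json` is loaded as a dict with a top-level `identity` object containing `type` and `userAssignedIdentities`. The function name `assigned_identities` is hypothetical.

```python
from typing import Dict, List, Tuple


def assigned_identities(app: Dict) -> Tuple[bool, List[str]]:
    """Return (has system-assigned identity, sorted user-assigned identity resource IDs)."""
    identity = app.get("identity") or {}
    # The type string can combine both kinds, so match on a substring.
    identity_type = (identity.get("type") or "None").lower()
    has_system = "systemassigned" in identity_type
    user_ids = sorted((identity.get("userAssignedIdentities") or {}).keys())
    return has_system, user_ids


if __name__ == "__main__":
    has_system, user_ids = assigned_identities({"identity": {"type": "None"}})
    if not has_system and not user_ids:
        print("No managed identity assigned: token acquisition cannot succeed")
```

If both kinds are present, the application code still has to select the intended one, so an assigned identity is a necessary check, not a sufficient one.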

Recommended playbooks

6) Category: Runtime Degradation

Runtime degradation means the app starts correctly but performance deteriorates due to resource constraints.

Typical symptom patterns

OOM kills and container restarts
Increasing latency over time
CPU throttling under load
Memory growth without release

First signal to check

ContainerAppSystemLogs
| where TimeGenerated > ago(6h)
| where Log_s has_any ("OOM", "killed", "memory", "evicted", "resource", "throttle")
| project TimeGenerated, Log_s
| order by TimeGenerated desc

Common mistakes

Relying only on error logs without checking resource metrics
Setting memory limits too low for the workload
Not distinguishing memory leaks from legitimate memory usage
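A rough way to separate a leak from normal usage is to fit a trend line through memory samples exported from your metrics source. The sketch below computes a least-squares slope. The function name, the sample format of (minutes, MiB) pairs, and the interpretation are assumptions, and a slope alone does not prove a leak.

```python
from typing import Sequence, Tuple


def memory_slope(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of memory (MiB) over time (minutes)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / len(samples)
    mean_m = sum(m for _, m in samples) / len(samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return covariance / variance


if __name__ == "__main__":
    steady_growth = [(0, 100), (10, 110), (20, 120)]             # climbs and never releases
    sawtooth = [(0, 100), (10, 140), (20, 100), (30, 140), (40, 100)]  # grows, then releases
    print(memory_slope(steady_growth), memory_slope(sawtooth))
```

A sustained positive slope across a window with steady traffic is more suspicious than the same slope during a load ramp, so compare against request volume before drawing a conclusion.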

Recommended playbooks

CrashLoop OOM and Resource Pressure

Classification workflow in practice

1. Pick one dominant symptom and timestamp window.
2. Map to one of the six categories using the flowchart.
3. Run only the first evidence query for that category.
4. If evidence contradicts the category, reclassify immediately.
5. Open the linked playbook and continue with hypothesis-driven analysis.
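Running only the first evidence query is easier when the six queries live in one place. The sketch below rebuilds each category's query from the table name, lookback window, search terms, and projected columns used in the sections above. The `FIRST_QUERIES` mapping and `first_query` function are hypothetical helpers, and the table and column names are copied as-is from this page, so adjust them if your workspace uses a different schema.

```python
# category number -> (table, lookback, search terms, projected columns), copied from the sections above
FIRST_QUERIES = {
    1: ("ContainerAppSystemLogs", "1h",
        ("pull", "image", "manifest", "auth", "401", "403", "provision", "failed"),
        ("TimeGenerated", "Log_s")),
    2: ("ContainerAppConsoleLogs", "1h",
        ("error", "exception", "traceback", "failed", "exit", "bind", "listen", "port"),
        ("TimeGenerated", "RevisionName_s", "Log_s")),
    3: ("ContainerAppSystemLogs", "1h",
        ("ingress", "upstream", "502", "503", "DNS", "resolve", "connection refused"),
        ("TimeGenerated", "Log_s")),
    4: ("ContainerAppSystemLogs", "2h",
        ("scale", "replica", "KEDA", "scaler", "metric", "trigger"),
        ("TimeGenerated", "Log_s")),
    5: ("ContainerAppConsoleLogs", "1h",
        ("401", "403", "Unauthorized", "Forbidden", "AADSTS", "ManagedIdentity", "secret", "keyvault"),
        ("TimeGenerated", "Log_s")),
    6: ("ContainerAppSystemLogs", "6h",
        ("OOM", "killed", "memory", "evicted", "resource", "throttle"),
        ("TimeGenerated", "Log_s")),
}


def first_query(category: int) -> str:
    """Build the first-signal KQL query text for a category number (1 to 6)."""
    table, lookback, terms, columns = FIRST_QUERIES[category]
    quoted_terms = ", ".join('"' + term + '"' for term in terms)
    return "\n".join([
        table,
        "| where TimeGenerated > ago(" + lookback + ")",
        "| where Log_s has_any (" + quoted_terms + ")",
        "| project " + ", ".join(columns),
        "| order by TimeGenerated desc",
    ])


if __name__ == "__main__":
    print(first_query(4))
```

Keeping the lookback per category preserves the longer windows used for scaling and runtime degradation, where the signal develops more slowly.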

Anti-patterns this model prevents

Wrong-table bias: querying ContainerAppConsoleLogs for image pull issues.
Single-signal bias: assuming all 502s are application bugs.
Identity blind spot: treating auth failures as network problems.
Scaling assumption: assuming KEDA is broken without checking configuration.

Use category labels in incident notes

Add an explicit category label in the first incident update. Example: "Initial classification: Category 3 (Networking), confidence medium." This keeps the team aligned on which evidence to collect first.
