Container Apps Troubleshooting¶
This section is a practical field guide for troubleshooting real-world issues on Azure Container Apps. Use it to quickly narrow symptoms, validate hypotheses, and apply targeted mitigation.
This is not a general tutorial. It is designed to help engineers move from symptom to validated root cause faster during active incidents.
What This Is¶
This is a hypothesis-driven troubleshooting guide built around repeatable incident patterns. Each playbook follows the same reasoning model so you can move from observation to root cause with less guesswork.
How It Works¶
```mermaid
graph LR
A[Observe Symptom] --> B[List Hypotheses]
B --> C[Collect Evidence]
C --> D[Validate / Disprove]
D --> E[Identify Root Cause]
E --> F[Mitigate]
```

Every playbook uses this six-step flow: observe symptoms, enumerate likely causes, gather targeted evidence, validate or disprove each hypothesis, isolate the root cause, then apply mitigation.
Scope¶
| Included | Not Included |
|---|---|
| Commands, tables, snippets | Long conceptual explanations (see Platform) |
| Frequent incidents and fixes | End-to-end deployment tutorials (see Language Guides) |
| Runtime defaults and knobs | Operational guides (see Operations) |
Start Here¶
| Your Situation | Go To |
|---|---|
| First incident, no idea where to start | First 10 Minutes |
| Need to identify the failure category | Detector Map |
| Already know the symptom category | Jump to Playbooks below |
| Want a systematic diagnosis framework | Methodology |
| Need KQL queries to investigate | KQL Query Library |
| Want hands-on practice | Labs below |
Triage Logic¶
When something goes wrong, ask these questions in order (a command sketch follows the list):
1. Is the revision provisioned? Check `az containerapp revision list`.
2. Is the replica running? Check `az containerapp replica list`.
3. Is the health probe failing? Check system logs in Log Analytics.
4. Is the app crashing? Check console logs via log stream or Log Analytics.
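A minimal command sketch for these four checks using the Azure CLI; `APP` and `RG` are placeholders for your app name and resource group, and exact output columns vary by CLI version:

```bash
#!/usr/bin/env bash
APP="my-app"  # placeholder: your container app name
RG="my-rg"    # placeholder: your resource group

# 1. Revision provisioned? Check provisioning state per revision.
az containerapp revision list -n "$APP" -g "$RG" -o table

# 2. Replica running? Add --revision to target a specific revision.
az containerapp replica list -n "$APP" -g "$RG" -o table

# 3. Probe failing? System logs carry platform and probe events.
az containerapp logs show -n "$APP" -g "$RG" --type system --tail 50

# 4. App crashing? Console logs show your process's stdout/stderr.
az containerapp logs show -n "$APP" -g "$RG" --type console --tail 50
```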
Quick Decision Tree¶
```mermaid
graph TD
A[Symptom Observed] --> B{App returns HTTP errors?}
B -->|503 on all requests| C[Startup / Provisioning Failure]
B -->|Intermittent 5xx| D[Scaling / Runtime]
B -->|No HTTP response| E[Networking / Ingress]
B -->|200 but wrong behavior| F[Config / Identity]
C --> C1{Image pull succeeded?}
C1 -->|No, ImagePullBackOff| C2[Image Pull Failure]
C1 -->|Yes, but revision failed| C3{Health probe passing?}
C3 -->|No| C4[Probe Failure / Slow Start]
C3 -->|Yes, container crashing| C5[Container Start Failure]
D --> D1{Replica count correct?}
D1 -->|Stuck at 0| D2[HTTP Scaling Not Triggering]
D1 -->|Scaling but crashing| D3[CrashLoop / OOM]
D1 -->|Scale rule mismatch| D4[Event Scaler Mismatch]
E --> E1{Ingress enabled?}
E1 -->|No| E2[Ingress Not Reachable]
E1 -->|Yes, internal DNS fail| E3[DNS / Private Endpoint]
E1 -->|Yes, svc-to-svc fail| E4[Service Connectivity]
F --> F1{Managed identity error?}
F1 -->|Yes| F2[MI Auth Failure]
F1 -->|No, secret/KV error| F3[Secret / Key Vault Failure]
style C fill:#c62828,color:#fff
style D fill:#ef6c00,color:#fff
style E fill:#1565c0,color:#fff
style F fill:#6a1b9a,color:#fff
```
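To answer the tree's branch questions from a terminal, a hedged sketch with the same `APP`/`RG` placeholders and a hypothetical revision name; the `--query` field names assume recent `containerapp` CLI and API versions:

```bash
# Ingress enabled, and on which target port?
az containerapp ingress show -n "$APP" -g "$RG"

# Replica count correct? Compare actual replicas against the scale settings.
az containerapp show -n "$APP" -g "$RG" --query "properties.template.scale"

# Image pull / revision health: provisioning and health state for one revision.
az containerapp revision show -n "$APP" -g "$RG" --revision "my-app--rev1" \
  --query "properties.{provisioning:provisioningState, health:healthState}"
```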
Representative Log Patterns¶
| Pattern | Indicates | Playbook |
|---|---|---|
| `ImagePullBackOff` + `401 Unauthorized` | Registry auth failure | Image Pull Failure |
| Revision stuck in Provisioning > 5 min | Resource or config error | Revision Provisioning Failure |
| `Replica X exited with code 1` in system logs | Container crash on startup | Container Start Failure |
| `Startup probe failed`, 0 console logs | Wrong entrypoint or port mismatch | Probe Failure and Slow Start |
| `ConnectionRefused` on internal FQDN | Service discovery or DNS issue | Service-to-Service Connectivity |
| Replica count stuck at 0, HTTP requests queuing | Scale rule not triggering | HTTP Scaling Not Triggering |
| `OOMKilled` in system logs | Memory limit exceeded | CrashLoop OOM and Resource Pressure |
| `ManagedIdentityCredential` auth error | MI not assigned or wrong scope | Managed Identity Auth Failure |
| `SecretNotFound` or Key Vault `403` | Secret ref or RBAC misconfiguration | Secret and Key Vault Reference Failure |
| Dapr sidecar connection refused on `:3500` | Dapr not enabled or component error | Dapr Sidecar or Component Failure |
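To hunt for these strings across replicas, one option is querying the backing Log Analytics tables from the CLI. A sketch under two assumptions: the environment uses the classic Log Analytics destination (hence the `_CL` table and `_s` column suffixes), and the `log-analytics` CLI extension is installed:

```bash
WORKSPACE="<workspace-guid>"  # your Log Analytics workspace customer ID

az monitor log-analytics query -w "$WORKSPACE" -o table --analytics-query '
ContainerAppSystemLogs_CL
| where TimeGenerated > ago(1h)
| where Log_s has_any ("ImagePullBackOff", "exited with code",
                       "Startup probe failed", "OOMKilled")
| project TimeGenerated, ContainerAppName_s, Log_s
| order by TimeGenerated desc'
```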
Topics¶
Startup and Provisioning¶
- Image Pull Failure
- Revision Provisioning Failure
- Container Start Failure
- Probe Failure and Slow Start
Ingress and Networking¶
- Ingress Not Reachable
- Internal DNS and Private Endpoint Failure
- Service-to-Service Connectivity Failure
Scaling and Runtime¶
- HTTP Scaling Not Triggering
- Event Scaler Mismatch
- CrashLoop OOM and Resource Pressure
Identity and Configuration¶
- Managed Identity Auth Failure
- Secret and Key Vault Reference Failure
Platform Features¶
- Dapr Sidecar or Component Failure
- Container App Job Execution Failure
- Bad Revision Rollout and Rollback
Quick Start¶
| Need | Start Here |
|---|---|
| First 10 minutes of any incident | First 10 Minutes |
| Reusable KQL queries | KQL Query Library |
| Systematic diagnosis framework | Methodology |
| Symptom-to-playbook routing | Detector Map |
Hands-on Labs¶
Deploy reproduction environments and observe real symptoms:
- ACR Pull Failure
- Revision Failover
- Scale Rule Mismatch
- Probe and Port Mismatch
- Managed Identity Key Vault Failure
- Revision Provisioning Failure
- Ingress Target Port Mismatch
- Traffic Routing Canary Failure
- Dapr Integration
- Observability and Tracing
Architecture and Methodology¶
- Methodology — Systematic root-cause workflow
- Detector Map — Symptom-to-playbook routing tree and error-string mapping
Incident Escalation and Routing Matrix¶
| Signal | Severity Hint | First Escalation Target | Immediate Containment |
|---|---|---|---|
| All requests fail with 5xx after rollout | High | App owner + platform on-call | Route traffic to last healthy revision |
| Region-wide ingress anomalies | Critical | Platform/SRE + cloud operations | Shift traffic or activate fallback path |
| Single endpoint fails with identity errors | Medium | App owner + security/identity | Validate role assignment and token scope |
| Scale-out not triggering under rising traffic | High | App owner + capacity/SRE | Temporarily raise min replicas and tune rules |
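For the two containment actions that touch traffic and capacity, a hedged CLI sketch; the app, group, and revision names are placeholders, and the replica floor of 3 is an arbitrary example:

```bash
APP="my-app"
RG="my-rg"
GOOD_REV="my-app--goodrev"  # placeholder: last known-healthy revision

# Containment: route 100% of traffic back to the healthy revision.
# (Requires the app to run in multiple-revision mode.)
az containerapp ingress traffic set -n "$APP" -g "$RG" \
  --revision-weight "$GOOD_REV=100"

# Containment: temporarily raise the replica floor while scale rules are tuned.
az containerapp update -n "$APP" -g "$RG" --min-replicas 3
```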
Escalate by blast radius, not by stack layer
Assess user impact and affected scope first, then route to the owning team while evidence collection continues.
Preserve a timeline while troubleshooting
Capture timestamps for deployment, first failure, mitigation action, and recovery confirmation. A precise timeline accelerates post-incident reviews.
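One way to reconstruct the deployment end of that timeline after the fact is the subscription activity log; a sketch where the resource group name and the 6-hour lookback window are placeholders:

```bash
# Control-plane events (deployments, updates) to anchor timeline timestamps.
az monitor activity-log list -g "my-rg" --offset 6h \
  --query "[].{time:eventTimestamp, operation:operationName.localizedValue, status:status.value}" \
  -o table
```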