Container Apps Troubleshooting¶
This section is a practical field guide for troubleshooting real-world issues on Azure Container Apps. Use it to quickly narrow symptoms, validate hypotheses, and apply targeted mitigation.
This is not a general tutorial. It is designed to help engineers move from symptom to validated root cause faster during active incidents.
What This Is¶
This is a hypothesis-driven troubleshooting guide built around repeatable incident patterns. Each playbook follows the same reasoning model so you can move from observation to root cause with less guesswork.
How It Works¶
```mermaid
graph LR
A[Observe Symptom] --> B[List Hypotheses]
B --> C[Collect Evidence]
C --> D[Validate / Disprove]
D --> E[Identify Root Cause]
E --> F[Mitigate]
```

Every playbook uses this six-step flow: observe symptoms, enumerate likely causes, gather targeted evidence, validate or disprove each hypothesis, isolate the root cause, then apply mitigation.
Scope¶
| Included | Not Included |
|---|---|
| Commands, tables, snippets | Long conceptual explanations (see Platform) |
| Frequent incidents and fixes | End-to-end deployment tutorials (see Language Guides) |
| Runtime defaults and knobs | Operational guides (see Operations) |
Start Here¶
| Your Situation | Go To |
|---|---|
| First incident, no idea where to start | First 10 Minutes |
| Need to identify the failure category | Detector Map |
| Already know the symptom category | Jump to Playbooks below |
| Want a systematic diagnosis framework | Methodology |
| Need KQL queries to investigate | KQL Query Library |
| Want hands-on practice | Labs below |
Triage Logic¶
When something goes wrong, ask these questions in order (a command sketch follows the list):
1. Is the revision provisioned? Check `az containerapp revision list`.
2. Is the replica running? Check `az containerapp replica list`.
3. Is the health probe failing? Check system logs in Log Analytics.
4. Is the app crashing? Check console logs via log stream or Log Analytics.
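A minimal command sketch for these four checks using the Azure CLI; `APP` and `RG` are placeholders for your app name and resource group, and exact output columns vary by CLI version:

```bash
#!/usr/bin/env bash
APP="my-app"  # placeholder: your container app name
RG="my-rg"    # placeholder: your resource group

# 1. Revision provisioned? Check provisioning state per revision.
az containerapp revision list -n "$APP" -g "$RG" -o table

# 2. Replica running? Add --revision to target a specific revision.
az containerapp replica list -n "$APP" -g "$RG" -o table

# 3. Probe failing? System logs carry platform and probe events.
az containerapp logs show -n "$APP" -g "$RG" --type system --tail 50

# 4. App crashing? Console logs show your process's stdout/stderr.
az containerapp logs show -n "$APP" -g "$RG" --type console --tail 50
```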
Quick Decision Tree¶
```mermaid
graph TD
A[Symptom Observed] --> B{App returns HTTP errors?}
B -->|503 on all requests| C[Startup / Provisioning Failure]
B -->|Intermittent 5xx| D[Scaling / Runtime]
B -->|No HTTP response| E[Networking / Ingress]
B -->|200 but wrong behavior| F[Config / Identity]
C --> C1{Image pull succeeded?}
C1 -->|No, ImagePullBackOff| C2[Image Pull Failure]
C1 -->|Yes, but revision failed| C3{Health probe passing?}
C3 -->|No| C4[Probe Failure / Slow Start]
C3 -->|Yes, container crashing| C5[Container Start Failure]
D --> D1{Replica count correct?}
D1 -->|Stuck at 0| D2[HTTP Scaling Not Triggering]
D1 -->|Scaling but crashing| D3[CrashLoop / OOM]
D1 -->|Scale rule mismatch| D4[Event Scaler Mismatch]
E --> E1{Ingress enabled?}
E1 -->|No| E2[Ingress Not Reachable]
E1 -->|Yes, internal DNS fail| E3[DNS / Private Endpoint]
E1 -->|Yes, svc-to-svc fail| E4[Service Connectivity]
F --> F1{Managed identity error?}
F1 -->|Yes| F2[MI Auth Failure]
F1 -->|No, secret/KV error| F3[Secret / Key Vault Failure]
style C fill:#c62828,color:#fff
style D fill:#ef6c00,color:#fff
style E fill:#1565c0,color:#fff
style F fill:#6a1b9a,color:#fff
```
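To answer the tree's branch questions from a terminal, a hedged sketch with the same `APP`/`RG` placeholders and a hypothetical revision name; the `--query` field names assume recent `containerapp` CLI and API versions:

```bash
# Ingress enabled, and on which target port?
az containerapp ingress show -n "$APP" -g "$RG"

# Replica count correct? Compare actual replicas against the scale settings.
az containerapp show -n "$APP" -g "$RG" --query "properties.template.scale"

# Image pull / revision health: provisioning and health state for one revision.
az containerapp revision show -n "$APP" -g "$RG" --revision "my-app--rev1" \
  --query "properties.{provisioning:provisioningState, health:healthState}"
```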
Representative Log Patterns¶
| Pattern | Indicates | Playbook |
|---|---|---|
| `ImagePullBackOff` + `401 Unauthorized` | Registry auth failure | Image Pull Failure |
| Revision stuck in Provisioning > 5 min | Resource or config error | Revision Provisioning Failure |
| `Replica X exited with code 1` in system logs | Container crash on startup | Container Start Failure |
| `Startup probe failed`, 0 console logs | Wrong entrypoint or port mismatch | Probe Failure and Slow Start |
| `ConnectionRefused` on internal FQDN | Service discovery or DNS issue | Service-to-Service Connectivity |
| Replica count stuck at 0, HTTP requests queuing | Scale rule not triggering | HTTP Scaling Not Triggering |
| `OOMKilled` in system logs | Memory limit exceeded | CrashLoop OOM and Resource Pressure |
| `ManagedIdentityCredential` auth error | MI not assigned or wrong scope | Managed Identity Auth Failure |
| `SecretNotFound` or Key Vault `403` | Secret ref or RBAC misconfiguration | Secret and Key Vault Reference Failure |
| Dapr sidecar connection refused on `:3500` | Dapr not enabled or component error | Dapr Sidecar or Component Failure |
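To hunt for these strings across replicas, one option is querying the backing Log Analytics tables from the CLI. A sketch under two assumptions: the environment uses the classic Log Analytics destination (hence the `_CL` table and `_s` column suffixes), and the `log-analytics` CLI extension is installed:

```bash
WORKSPACE="<workspace-guid>"  # your Log Analytics workspace customer ID

az monitor log-analytics query -w "$WORKSPACE" -o table --analytics-query '
ContainerAppSystemLogs_CL
| where TimeGenerated > ago(1h)
| where Log_s has_any ("ImagePullBackOff", "exited with code",
                       "Startup probe failed", "OOMKilled")
| project TimeGenerated, ContainerAppName_s, Log_s
| order by TimeGenerated desc'
```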
Topics¶
Startup and Provisioning¶
- Image Pull Failure
- Revision Provisioning Failure
- Container Start Failure
- Probe Failure and Slow Start
Ingress and Networking¶
- Ingress Not Reachable
- Internal DNS and Private Endpoint Failure
- Service-to-Service Connectivity Failure
Scaling and Runtime¶
- HTTP Scaling Not Triggering
- Event Scaler Mismatch
- CrashLoop OOM and Resource Pressure
Identity and Configuration¶
- Managed Identity Auth Failure
- Secret and Key Vault Reference Failure
Platform Features¶
- Dapr Sidecar or Component Failure
- Container App Job Execution Failure
- Bad Revision Rollout and Rollback
Quick Start¶
| Need | Start Here |
|---|---|
| First 10 minutes of any incident | First 10 Minutes |
| Reusable KQL queries | KQL Query Library |
| Systematic diagnosis framework | Methodology |
| Symptom-to-playbook routing | Detector Map |
Hands-on Labs¶
Deploy reproduction environments and observe real symptoms:
- ACR Pull Failure
- Revision Failover
- Scale Rule Mismatch
- Probe and Port Mismatch
- Managed Identity Key Vault Failure
- Revision Provisioning Failure
- Ingress Target Port Mismatch
- Traffic Routing Canary Failure
- Dapr Integration
- Observability and Tracing
Architecture and Methodology¶
- Methodology — Systematic root-cause workflow
- Detector Map — Symptom-to-playbook routing tree and error-string mapping
Incident Escalation and Routing Matrix¶
| Signal | Severity Hint | First Escalation Target | Immediate Containment |
|---|---|---|---|
| All requests fail with 5xx after rollout | High | App owner + platform on-call | Route traffic to last healthy revision |
| Region-wide ingress anomalies | Critical | Platform/SRE + cloud operations | Shift traffic or activate fallback path |
| Single endpoint fails with identity errors | Medium | App owner + security/identity | Validate role assignment and token scope |
| Scale-out not triggering under rising traffic | High | App owner + capacity/SRE | Temporarily raise min replicas and tune rules |
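For the two containment actions that touch traffic and capacity, a hedged CLI sketch; the app, group, and revision names are placeholders, and the replica floor of 3 is an arbitrary example:

```bash
APP="my-app"
RG="my-rg"
GOOD_REV="my-app--goodrev"  # placeholder: last known-healthy revision

# Containment: route 100% of traffic back to the healthy revision.
# (Requires the app to run in multiple-revision mode.)
az containerapp ingress traffic set -n "$APP" -g "$RG" \
  --revision-weight "$GOOD_REV=100"

# Containment: temporarily raise the replica floor while scale rules are tuned.
az containerapp update -n "$APP" -g "$RG" --min-replicas 3
```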
Escalate by blast radius, not by stack layer
Assess user impact and affected scope first, then route to the owning team while evidence collection continues.
Preserve a timeline while troubleshooting
Capture timestamps for deployment, first failure, mitigation action, and recovery confirmation. A precise timeline accelerates post-incident reviews.
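One way to reconstruct the deployment end of that timeline after the fact is the subscription activity log; a sketch where the resource group name and the 6-hour lookback window are placeholders:

```bash
# Control-plane events (deployments, updates) to anchor timeline timestamps.
az monitor activity-log list -g "my-rg" --offset 6h \
  --query "[].{time:eventTimestamp, operation:operationName.localizedValue, status:status.value}" \
  -o table
```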