Skip to content

Troubleshooting Methodology

When the quick triage checklist does not isolate the issue, use this systematic workflow to move from symptom to verified root cause.

Diagnostic Flow

flowchart TD
    A[Collect symptom and impact] --> B[Map to affected revision]
    B --> C[Inspect replica behavior]
    C --> D[Validate image/runtime integrity]
    D --> E[Trace identity chain]
    E --> F[Trace network path]
    F --> G[Validate dependencies]
    G --> H[Apply fix and re-verify]

1) Symptom Collection

Capture exact failure mode first: no response, HTTP 5xx, slow response, intermittent timeout, startup failure.

let AppName = "ca-myapp-api";
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("error", "exception", "timeout", "traceback")
| project TimeGenerated, RevisionName_s, ReplicaName_s, Log_s
| order by TimeGenerated desc
  • Record first seen timestamp, blast radius, and user-facing symptom.
  • Decision: if no app logs exist, start with system logs and provisioning path.

2) Revision Analysis

Determine whether failure is isolated to the latest revision or shared across active revisions.

az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --query "[].{name:name,active:properties.active,trafficWeight:properties.trafficWeight,health:properties.healthState}" --output table
let AppName = "ca-myapp-api";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| project TimeGenerated, RevisionName_s, Reason_s, Log_s
| order by TimeGenerated desc
  • Check active vs inactive revisions and traffic percentage.
  • Decision: broken latest revision with healthy prior revision suggests rollback/failover path.

3) Replica Deep Dive

Focus on restart frequency, OOM kill patterns, and resource pressure.

let AppName = "ca-myapp-api";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("restart", "terminated", "OOM", "killed")
| summarize events=count() by RevisionName_s, ReplicaName_s
| order by events desc
  • Frequent restarts with no code tracebacks often point to resource limits or probe failures.
  • Decision: if OOM or throttling signals appear, tune CPU/memory and startup behavior before redeploy.

4) Image Investigation

Validate image tag immutability, base image compatibility, and dependency completeness.

az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.template.containers[0].image" --output tsv
az acr repository show-tags --name "$ACR_NAME" --repository "$APP_NAME" --output table
let AppName = "ca-myapp-api";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("pull", "manifest", "unauthorized", "denied")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
  • Decision: if image pulls fail, treat identity/registry/network as primary branch.

5) Identity Chain

Walk the chain: managed identity assignment → RBAC grant → token retrieval → target resource authorization.

az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "identity" --output json
az role assignment list --assignee "$(az containerapp show --name "$APP_NAME" --resource-group "$RG" --query identity.principalId --output tsv)" --output table
let AppName = "ca-myapp-api";
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("ManagedIdentityCredential", "403", "401", "token")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
  • Decision: identity exists but 403 persists usually means missing role scope or wrong audience.

6) Network Path

Trace path end-to-end: DNS resolution → ingress routing → healthy replica → egress connectivity.

az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress" --output json
az containerapp env show --name "$ENVIRONMENT_NAME" --resource-group "$RG" --query "properties.vnetConfiguration" --output json
let AppName = "ca-myapp-api";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("DNS", "ingress", "connection refused", "timeout")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
  • Decision: ingress healthy but dependency timeout indicates egress or dependency firewall branch.

7) Dependency Mapping

Create an explicit map of downstream dependencies and test each path.

let AppName = "ca-myapp-api";
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("sql", "redis", "cosmos", "storage", "api")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
  • Verify endpoint hostnames, identity scopes, firewall allow rules, and secret versions.
  • Decision: a single failing dependency can produce cascading startup or probe failures.

Decision Tree (Symptom → Likely Cause → Next Investigation)

Symptom Likely cause Investigate next
ImagePullBackOff Registry auth or image tag issue Image Investigation, Identity Chain
Revision Failed before traffic Invalid config or missing secret Revision Analysis, Secret references
502/504 from ingress No healthy replica or wrong target port Replica Deep Dive, Network Path
Intermittent timeout under load Scale rule mismatch or throttling Replica Deep Dive, scaling logs
Works locally, fails in ACA Identity/network/environment mismatch Identity Chain, Network Path

Anti-Patterns (Don't Do This)

Do not redeploy blindly

Repeated redeploys without collecting logs destroys evidence and slows root-cause analysis.

Do not change multiple variables at once

Change one dimension per iteration (image, config, probes, scale) so the result is attributable.

Do not ignore system logs

Console logs show app behavior; system logs usually reveal platform-level failures first.

Verification Loop (Fix Confidence)

After applying a candidate fix, run this loop before declaring incident closure.

flowchart LR
    A[Apply single targeted fix] --> B[Check revision health]
    B --> C[Run endpoint validation]
    C --> D[Review system and console logs]
    D --> E{Residual errors?}
    E -->|Yes| F[Return to hypothesis stage]
    E -->|No| G[Document root cause and closure evidence]

Closure requires evidence, not assumption

A successful deployment operation alone is not proof of recovery. Confirm user-facing behavior and error-rate normalization.

Command Pack by Investigation Stage

Stage Fast Command Expected Signal
Revision status az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table Identify healthy vs failed revisions
Replica health az containerapp replica list --name "$APP_NAME" --resource-group "$RG" --output table Detect restart or zero-replica patterns
Ingress config az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress" --output json Validate external/internal access and target port
Identity baseline az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "identity" --output json Confirm principal assignment
Live logs az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type system Capture platform-level failure reasons

See Also

Sources