Troubleshooting Architecture Overview

This page answers one practical question first: where in the platform can a request, a rollout, or an outbound call fail?

Before deep debugging, map the symptom to a platform segment (Ingress, Environment, Revision, Container, or outbound path). That classification tells you which logs to query first and which playbook to open.

Why this page exists

Playbooks are symptom-driven and detailed. During active incidents, engineers usually need a faster artifact that combines four views:

- A request-path view with failure points
- A runtime model that explains timeout, OOM, and replica behavior
- A deployment path showing where revision and probe failures happen
- A network path showing DNS, VNet routing, and private endpoint issues

Use this architecture map to route quickly to the right playbook.

1) Request Path Architecture (where 5xx can originate)

flowchart LR
    A[Client Browser or API Caller] --> B[Container Apps Ingress]
    B --> C[Environment Load Balancer]
    C --> D[Revision Replica]
    D --> E[Your Container]
    E --> F[Response]

    B -. FP-REQ-01 .-> B1[Failure Point: Ingress routing or TLS issue]
    C -. FP-REQ-02 .-> C1[Failure Point: No healthy replicas]
    D -. FP-REQ-03 .-> D1[Failure Point: Probe failure or container crash]
    E -. FP-REQ-04 .-> E1[Failure Point: Unhandled exception or timeout]

Typical interpretation

- An HTTP 5xx is not a single failure mode.
- A 5xx can originate at Ingress, the Environment, Revision startup, or application code.
- Treat each layer as a competing hypothesis until logs disprove it.

Request-path failure points and playbooks

Failure Point	Typical Symptom	First Evidence	Playbook
FP-REQ-01 Ingress	502/503 bursts, routing issues	`ContainerAppSystemLogs_CL` + HTTP status trend	Ingress Not Reachable
FP-REQ-02 No healthy replicas	Intermittent 5xx during scaling	Replica count, KEDA events	HTTP Scaling Not Triggering
FP-REQ-03 Probe failure	App marked unhealthy, startup failures	Console startup logs + probe messages	Probe Failure and Slow Start
FP-REQ-04 Application code	500 with stack traces or long latency	App/console logs + endpoint-level HTTP logs	CrashLoop OOM and Resource Pressure
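
The "HTTP status trend" evidence in the table can be approximated quickly from any list of status codes. The following is a minimal sketch, not part of the playbooks: it assumes one status code per line on standard input (for example from a probe loop or an exported log), and the function name `tally_status` and the variable `$APP_URL` are hypothetical.

```shell
# Hypothetical helper: summarize HTTP status codes read from stdin, one per line.
# Separates 502/503 (often ingress or replica availability, per the table above)
# from plain 500 (often application code).
tally_status() {
    ok=0; gateway=0; app=0; other=0
    while IFS= read -r code; do
        case "$code" in
            2??|3??) ok=$((ok + 1)) ;;
            502|503) gateway=$((gateway + 1)) ;;
            500)     app=$((app + 1)) ;;
            *)       other=$((other + 1)) ;;
        esac
    done
    echo "ok=$ok gateway_5xx=$gateway app_500=$app other=$other"
}

# Assumed usage: probe an endpoint 20 times, then summarize.
# for i in $(seq 1 20); do
#     curl -s -o /dev/null -w '%{http_code}\n' "$APP_URL"
# done | tally_status
```

A high `gateway_5xx` count with few `app_500` entries points toward FP-REQ-01 or FP-REQ-02 first; the reverse points toward FP-REQ-04. Treat this as a hint for ordering hypotheses, not as proof.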

2) Runtime / Replica Model (memory pressure, OOM, timeout)

flowchart TD
    A[Container Apps Environment] --> B[Revision]
    B --> C[Replica Instances]
    C --> D[Container Process]
    D --> E[Dependency Calls]

    D --> F{Memory growth trend}
    F -->|steady growth| G[Pressure: GC churn and degraded latency]
    F -->|exceeds limit| H[OOM Kill]

    C --> I{Probe fails}
    I -->|yes| J[Replica marked unhealthy]
    I -->|no| K[Request completes]

    H --> L[Platform restarts container]
    J --> M[Traffic shifted to healthy replicas]

Runtime failure mapping

Failure Point	Why it happens	What to check first	Playbook
FP-RUN-01 Memory pressure	Memory leak, large object churn, high concurrency	Memory trend + console logs for kill/restart	CrashLoop OOM and Resource Pressure
FP-RUN-02 OOM Kill	Container exceeded memory limit	Platform kill messages, restart cadence	CrashLoop OOM and Resource Pressure
FP-RUN-03 Probe timeout	Slow startup or unresponsive health endpoint	Probe configuration, startup timing	Probe Failure and Slow Start
FP-RUN-04 Scale mismatch	KEDA rules don't match traffic patterns	KEDA events, replica count over time	Event Scaler Mismatch
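
The memory branch of the runtime diagram (pressure while usage grows, OOM kill once the limit is exceeded) can be sketched as a small classifier. This is an illustration under stated assumptions: the function name `classify_memory` is hypothetical, and the default 80 percent pressure threshold is an arbitrary choice, not a platform value.

```shell
# Hypothetical helper: classify one memory sample against the container limit.
# Arguments: used MiB, limit MiB, optional pressure threshold in percent
# (the default of 80 is an assumption, not a platform setting).
classify_memory() {
    used=$1; limit=$2; threshold=${3:-80}
    if [ "$used" -ge "$limit" ]; then
        echo "FP-RUN-02"    # at or over the limit: expect an OOM kill
    elif [ $((used * 100)) -ge $((limit * threshold)) ]; then
        echo "FP-RUN-01"    # approaching the limit: memory pressure
    else
        echo "ok"
    fi
}

# Example: 450 MiB used against a 512 MiB limit.
# classify_memory 450 512
```

A single sample is not a trend. Running the check over successive samples, and comparing the result with restart cadence in the logs, is what distinguishes steady growth from a one-off spike.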

3) Deployment Path (revision failures and config drift)

flowchart LR
    A[Image push or config change] --> B[New Revision created]
    B --> C[Image pull from registry]
    C --> D[Container start command executes]
    D --> E[Probe evaluation]
    E --> F[Traffic shifted to new revision]

    C -. FP-DEP-01 .-> C1[Image pull failure]
    D -. FP-DEP-02 .-> D1[Startup command or port mismatch]
    E -. FP-DEP-03 .-> E1[Probe path failure]
    F -. FP-DEP-04 .-> F1[Traffic split misconfiguration]

Deployment path failure mapping

Failure Point	Typical signal	Primary playbook
FP-DEP-01 Image pull failure	ACR auth error, image not found	Image Pull Failure
FP-DEP-02 Startup mismatch	Container runs but exits or wrong port	Container Start Failure
FP-DEP-03 Probe failure	Revision never becomes healthy	Probe Failure and Slow Start
FP-DEP-04 Traffic split issue	Wrong revision receiving traffic	Bad Revision Rollout and Rollback
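
Because the deployment stages run in order, the system-log events for a single revision hint at how far the rollout got. The sketch below is a heuristic, not a documented platform contract: it assumes one `Reason_s` value per line on standard input, covering the whole rollout window of one revision, and it uses only the reason names that the system-log query on this page filters for. The function name `deploy_failure_point` is hypothetical. It cannot detect FP-DEP-04, which needs the traffic configuration rather than events.

```shell
# Hypothetical helper: infer the first likely failing deployment stage from
# Reason_s values (one per line on stdin) for a single revision.
deploy_failure_point() {
    pulled=0; started=0; terminated=0; probe_failed=0
    while IFS= read -r reason; do
        case "$reason" in
            PulledImage)         pulled=1 ;;
            ContainerStarted)    started=1 ;;
            ContainerTerminated) terminated=1 ;;
            ProbeFailed)         probe_failed=1 ;;
        esac
    done
    if [ "$pulled" -eq 0 ]; then
        echo "FP-DEP-01"    # no pull event seen: check registry auth and image name
    elif [ "$started" -eq 0 ] || [ "$terminated" -eq 1 ]; then
        echo "FP-DEP-02"    # never started, or started and exited
    elif [ "$probe_failed" -eq 1 ]; then
        echo "FP-DEP-03"    # running, but probes are failing
    else
        echo "none"
    fi
}
```

A missing `PulledImage` event is treated as a pull failure here, which is an assumption: confirm it against the actual log text before opening the Image Pull Failure playbook. Likewise, `ContainerTerminated` can also appear during normal scale-in, so restrict the input to the new revision.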

4) Outbound / Network Path (DNS, Private Endpoints)

flowchart LR
    A[Container Replica] --> B[DNS Resolver Path]
    B --> C[Resolved target IP]
    C --> D[Outbound via Environment VNet]
    D --> E[Internet or Private Endpoint]
    E --> F[Dependency Service]

    B -. FP-NET-01 .-> B1[DNS resolution failure]
    D -. FP-NET-02 .-> D1[VNet routing or NSG issue]
    E -. FP-NET-03 .-> E1[Private endpoint unreachable]
    F -. FP-NET-04 .-> F1[Dependency timeout or refusal]

Outbound path failure mapping

Failure Point	Symptom pattern	Primary playbook
FP-NET-01 DNS failure	Intermittent/constant name lookup failures	Internal DNS and Private Endpoint Failure
FP-NET-02 VNet routing	Outbound blocked by NSG or route table	Service-to-Service Connectivity Failure
FP-NET-03 Private endpoint	Endpoint unreachable despite configuration	Internal DNS and Private Endpoint Failure
FP-NET-04 Dependency failures	Outbound errors isolated to one backend service	Service-to-Service Connectivity Failure
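
The outbound path is sequential, so checking it in order (name resolution, then TCP reachability, then the request itself) narrows the failure point quickly. The following is a minimal sketch of that ordering. The function name `classify_outbound` and the variables `$TARGET_HOST` and `$TARGET_PORT` are hypothetical, and the tools in the usage comment are assumed to be present in your container image.

```shell
# Hypothetical helper: map the exit codes of three checks, run in path order,
# to the first failing segment. An exit code of 0 means the check passed.
classify_outbound() {
    dns_rc=$1; tcp_rc=$2; request_rc=$3
    if [ "$dns_rc" -ne 0 ]; then
        echo "FP-NET-01"                 # name did not resolve
    elif [ "$tcp_rc" -ne 0 ]; then
        echo "FP-NET-02 or FP-NET-03"    # resolved, but no TCP connection
    elif [ "$request_rc" -ne 0 ]; then
        echo "FP-NET-04"                 # connected, but the dependency failed
    else
        echo "none"
    fi
}

# Assumed usage from a shell inside the replica:
# getent hosts "$TARGET_HOST" >/dev/null; dns=$?
# nc -z -w 3 "$TARGET_HOST" "$TARGET_PORT"; tcp=$?
# curl -sf -o /dev/null --max-time 5 "https://$TARGET_HOST/"; req=$?
# classify_outbound "$dns" "$tcp" "$req"
```

A failed TCP connection alone does not distinguish VNet routing or NSG problems from an unreachable private endpoint, so the sketch reports both and leaves the distinction to the playbooks.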

5) Observability Coverage Map

flowchart TD
    A[Component] --> B[Best First Log Source]

    C[Ingress and routing] --> D[ContainerAppSystemLogs_CL]
    E[Revision lifecycle and scaling] --> F[ContainerAppSystemLogs_CL]
    G[Container startup and runtime stderr/stdout] --> H[ContainerAppConsoleLogs_CL]
    I[App-level business errors] --> J[Structured app logs in ContainerAppConsoleLogs_CL]
    K[Deployment and config changes] --> L[Azure Activity Log]
    M[Dependency failures] --> N[App logs + Container logs correlation]

Quick evidence commands by component

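# Platform (system) logs for the app
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type system
# Container stdout/stderr (console) logs
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console
# Control-plane operations on the resource group in the last 24 hours
az monitor activity-log list --resource-group "$RG" --offset 24h
# Revisions for the app
az containerapp revision list --name "$APP_NAME" --resource-group "$RG" --output table
# Current ingress configuration
az containerapp show --name "$APP_NAME" --resource-group "$RG" --query "properties.configuration.ingress"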

// Console log volume in 5-minute bins over the last 2 hours; look for gaps or spikes
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(2h)
| where ContainerAppName_s == "<app-name>"
| summarize count() by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

// Probe, container start/stop, and image-pull events from the last 24 hours, newest first
ContainerAppSystemLogs_CL
| where TimeGenerated > ago(24h)
| where ContainerAppName_s == "<app-name>"
| where Reason_s has_any ("ProbeFailed", "ContainerStarted", "ContainerTerminated", "PulledImage")
| project TimeGenerated, Reason_s, Log_s
| order by TimeGenerated desc

// Console lines from the last 6 hours that mention errors, timeouts, or OOM kills, newest first
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(6h)
| where ContainerAppName_s == "<app-name>"
| where Log_s has_any ("Exception", "Error", "timeout", "OOM", "killed")
| project TimeGenerated, Log_s
| order by TimeGenerated desc
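
During an incident it can be quicker to run these queries from the same terminal as the `az` commands. The sketch below builds a narrowed variant of the system-log query with the app name and time window filled in. The function name `probe_events_query` and the variable `$WORKSPACE_ID` are hypothetical, and the usage comment assumes `az monitor log-analytics query` is available in your CLI installation.

```shell
# Hypothetical helper: print a probe/termination query for one app.
# Arguments: app name, optional look-back window in hours (default 24).
# The app name is inserted verbatim, so it must not contain double quotes.
probe_events_query() {
    app=$1; hours=${2:-24}
    printf '%s\n' \
        "ContainerAppSystemLogs_CL" \
        "| where TimeGenerated > ago(${hours}h)" \
        "| where ContainerAppName_s == \"${app}\"" \
        "| where Reason_s has_any (\"ProbeFailed\", \"ContainerTerminated\")" \
        "| project TimeGenerated, Reason_s, Log_s" \
        "| order by TimeGenerated desc"
}

# Assumed usage; $WORKSPACE_ID is the Log Analytics workspace GUID.
# az monitor log-analytics query --workspace "$WORKSPACE_ID" \
#     --analytics-query "$(probe_events_query "$APP_NAME" 6)" --output table
```

Generating the query text in a function keeps the time window and app name consistent across repeated runs, which helps when comparing evidence from different moments of the same incident.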

6) Fast routing examples

Example A: 5xx appears only during scaling events.
- Start with the runtime/replica model (FP-RUN-03/04), then check the KEDA configuration.
- Open: HTTP Scaling Not Triggering and Event Scaler Mismatch.

Example B: A revision is created but never becomes healthy.
- Start with the deployment path (FP-DEP-01/02/03).
- Open: Image Pull Failure, Container Start Failure, and Probe Failure and Slow Start.

Example C: Outbound calls to private endpoint targets fail.
- Start with the outbound path (FP-NET-01/03).
- Open: Internal DNS and Private Endpoint Failure.
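
The three examples amount to a small lookup from symptom to starting point. As an illustration only, the sketch below encodes them in a function that a team could extend with its own cases. The function name `route_symptom` and the symptom keywords are hypothetical; the failure point IDs come from the examples above.

```shell
# Hypothetical helper: map a coarse symptom keyword to the path and failure
# points to check first, following Examples A to C.
route_symptom() {
    case "$1" in
        scaling)            echo "runtime: FP-RUN-03 FP-RUN-04" ;;
        revision-unhealthy) echo "deployment: FP-DEP-01 FP-DEP-02 FP-DEP-03" ;;
        private-endpoint)   echo "outbound: FP-NET-01 FP-NET-03" ;;
        *)                  echo "unknown"; return 1 ;;
    esac
}

# Example:
# route_symptom revision-unhealthy
```

The non-zero return for unknown symptoms lets a calling script fall back to walking the request path from the top instead of guessing.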

How to use this architecture page during incidents

Do not treat any single metric as proof. Use this page to identify the most likely failure layer, then validate with time-correlated evidence. Move to the linked playbook only after you identify which layer best matches the symptom timing.