Container Apps Labs¶
Azure Container Apps troubleshooting experiments focused on ingress behavior, container lifecycle, OOM observability, networking edge cases, and scaling patterns.
Architecture Overview¶
Azure Container Apps is a managed container platform built on Kubernetes. Understanding the ingress, scaling, and container lifecycle components is essential for diagnosing where failures originate.
graph TB
subgraph "Azure Container Apps Environment"
Client([Client]) --> ENVOY[Envoy Proxy<br/>Ingress Controller]
ENVOY --> REV1[Revision 1<br/>Active]
ENVOY --> REV2[Revision 2<br/>Inactive]
subgraph "Revision (Active)"
direction TB
REV1 --> R1[Replica 1]
REV1 --> R2[Replica 2]
subgraph "Replica"
direction LR
R1 --> APP[App Container<br/>Target Port]
R1 --> SIDE[Sidecar<br/>Optional]
end
end
KEDA[KEDA<br/>Scale Controller] --> REV1
KEDA --> |"scale to 0<br/>when idle"| ZERO([0 Replicas])
subgraph "Container Resources"
direction LR
VCPU[vCPU<br/>0.25 - 4]
CMEM[Memory<br/>0.5Gi - 8Gi]
CGROUP[cgroup<br/>Memory Limit]
end
CGROUP --> |"OOM Kill"| APP
PROBE[Startup / Readiness<br/>/ Liveness Probes] --> APP
end
subgraph "Logging & Metrics"
SYSLOG[ContainerAppSystemLogs]
CONLOG[ContainerAppConsoleLogs]
METRICS[Azure Monitor Metrics]
end
APP --> CONLOG
REV1 --> SYSLOG
REV1 --> METRICS
style ENVOY fill:#4a9eff,color:#fff
style KEDA fill:#ff9800,color:#fff
style CGROUP fill:#e91e63,color:#fff
style PROBE fill:#9c27b0,color:#fff
style CONLOG fill:#4caf50,color:#fff
style SYSLOG fill:#607d8b,color:#fff
Key Components for Troubleshooting¶
| Component | Role | Why It Matters |
|---|---|---|
| Envoy Proxy | Ingress controller handling HTTP routing and TLS | Target port misconfiguration, SNI routing, and host header handling happen here |
| Revision | Immutable deployment unit containing replica configuration | Traffic splitting, blue-green deployment, and rollback operate at revision level |
| Replica | Running container instance within a revision | Each replica has its own cgroup memory limit; OOM kills target processes inside |
| KEDA | Event-driven autoscaler | Scale-to-zero creates cold start latency; scaling decisions affect availability |
| cgroup Memory Limit | Kernel-enforced memory boundary per container | OOM kills are invisible in system logs when multi-process servers absorb worker kills |
| Probes | Startup, readiness, and liveness health checks | Misconfigured probe timing causes restart loops or premature traffic routing |
| ContainerAppConsoleLogs | Application stdout/stderr captured as logs | Often the ONLY evidence source for worker-level OOM kills |
| ContainerAppSystemLogs | Platform lifecycle events (start, stop, crash) | Does NOT capture worker-level OOM kills when PID 1 survives |
Note
These experiments focus on the managed Container Apps environment and its Envoy ingress layer. Underlying Kubernetes control plane behavior is out of scope.
Experiment Status¶
| Experiment | Status | Description |
|---|---|---|
| Scale-to-Zero 503 | Published | First-request failure modes after idle scale-down |
| Target Port Detection | Published | Auto-detection failures causing 502 on running containers |
| OOM Visibility Gap | Published | Observability gaps across metrics and logs for OOM kills |
| Custom DNS Forwarding | Published | Outbound resolution failure with unreachable custom DNS |
| Ingress SNI / Host Header | Published | SNI and host header routing behavior |
| Private Endpoint FQDN vs IP | Published | FQDN vs. direct IP access differences |
| Startup Probes | Published | Probe interaction and failure patterns |
| Revision Update Downtime | Draft | 502/503 errors during revision updates |
| Internal Name Routing | Draft | Internal name vs FQDN routing behavior |
| Burst Scaling Queueing | Draft | Request queueing during burst scaling |
| Scaling Rule Conflicts | Draft | Conflicting KEDA scaling rules behavior |
Published Experiments¶
Scale-to-Zero 503 — Published¶
First-request failure modes after idle scale-down to zero replicas. Documents the cold start window where incoming requests receive 503 errors or experience extended timeouts while the first replica initializes.
Experiment Complete
Completed 2026-04 on Consumption tier (koreacentral). Captures the activation delay, error codes, and the timeline from zero replicas to first successful response.
Target Port Detection — Published¶
Auto-detection failures causing 502 errors on running containers. Demonstrates how Container Apps' ingress port auto-detection can select the wrong port, causing all traffic to fail even though the container is healthy and listening.
Experiment Complete
Completed 2026-04 on Consumption tier (koreacentral). Documents the auto-detection algorithm behavior and the specific conditions that cause detection failure.
OOM Visibility Gap — Published¶
Observability gaps across Azure Monitor metrics, system logs, and console logs when containers are OOM-killed. Reveals that multi-process servers (gunicorn) absorb worker OOM kills without triggering any platform-level telemetry — console logs are the only evidence source.
Experiment Complete
Completed 2026-04 on Consumption tier (koreacentral). Five OOM kills across two variants (gradual and spike). WorkingSetBytes underreports peaks by 2.4×; RestartCount stays 0; SystemLogs contain zero events.
Startup Probes — Published¶
Interaction between startup, readiness, and liveness probes. Investigates failure patterns that emerge from misconfigured probe timing, threshold settings, and the order of probe evaluation during container initialization.
Experiment Complete
Completed 2026-04 on Consumption tier (koreacentral). Four probe scenarios tested: startup-only failure, no-startup with liveness, readiness-only failure, and combined aggressive probes. Documents restart cascades, traffic routing gaps, and probe handoff timing.
Ingress SNI / Host Header — Published¶
How Container Apps ingress handles Server Name Indication (SNI) and host header routing. Demonstrates that Envoy routes by Host header (not SNI), SNI is required for TLS admission, and any app in a shared environment can be reached by manipulating the Host header.
Experiment Complete
Completed 2026-04 on Consumption tier (koreacentral). Eight SNI/Host permutations tested across 3 runs with 100% reproducibility. Key finding: Host header is the routing key; SNI is only a TLS admission gate.
Custom DNS Forwarding — Published¶
Outbound resolution failure when custom DNS servers configured in the Container Apps environment become unreachable. Demonstrates that there is no DNS fallback to Azure Default DNS, that recovery requires VNet DNS change + propagation time + new revision, and that DNS failure also breaks platform-level operations (ACR image pulls).
Experiment Complete
Completed 2026-04-11 on Consumption tier (VNet-injected, koreacentral). 54 probes across 4 phases. All 4 hypothesis points confirmed; unexpected finding that recovery is asymmetric — breaking DNS takes ~30s but restoring takes 2-5 minutes.
Private Endpoint FQDN vs IP — Published¶
Behavioral differences when accessing a Container App via private endpoint FQDN versus direct IP address. Demonstrates that direct IP access fails at the TLS level due to missing SNI — not certificate validation — and that curl --resolve is the correct workaround.
Experiment Complete
Completed 2026-04-12 on Consumption tier (internal-only, VNet-injected, koreacentral). 10 access patterns tested across 5 runs with 100% reproducibility. Key finding: SNI is mandatory for TLS admission; -k and -H Host: do not help because the failure occurs before certificate presentation and before HTTP layer processing.
Draft Experiments¶
Revision Update Downtime — Draft¶
502/503 errors during revision updates. Documents the downtime window when deploying new revisions, conditions that cause failed requests, and mitigation strategies (traffic splitting, minReplicas).
Status: Draft - Awaiting Execution
Designed based on Container Apps GitHub issues #1166, #1305. Awaiting execution.
Internal Name Routing — Draft¶
Internal name vs FQDN routing behavior in Container Apps. Investigates when internal names (without environment domain) work vs fail, the "Connection refused" errors, and DNS resolution differences.
Status: Draft - Awaiting Execution
Designed based on Container Apps GitHub issue #1315. Awaiting execution.
Burst Scaling Queueing — Draft¶
Request queueing behavior during rapid scale-out events. Tests how incoming requests are handled when KEDA triggers scaling faster than replicas can start, and whether Envoy queues or rejects excess traffic.
Status: Draft - Awaiting Execution
Designed based on Container Apps scaling patterns. Awaiting execution.
Scaling Rule Conflicts — Draft¶
Behavior when multiple KEDA scaling rules conflict. Tests what happens when HTTP and custom (queue-based) scaling rules give contradictory signals, and which rule takes precedence.
Status: Draft - Awaiting Execution
Designed based on Container Apps GitHub issues #468, #536, #972. Awaiting execution.
Related Experiments in Other Services¶
- App Service — Memory Pressure (Published) covers plan-level resource contention, relevant when comparing Container Apps scaling and resource isolation.
- App Service — Health Check Eviction (Published) investigates health check cascading failures, conceptually similar to probe misconfiguration in Container Apps.
- Cross-cutting — PE DNS Negative Cache tests DNS negative caching during private endpoint cutover, affecting Container Apps with VNet integration.