Symptom to Hypothesis Mapping¶
A reference for common customer-reported symptoms and the hypotheses worth investigating first. This is not exhaustive — it is a structured starting point for investigation.
How to use this table¶
- Match the customer's reported symptom to the closest entry
- Review the possible hypotheses in priority order
- Start with the listed investigation steps
- Use the evidence levels framework to tag your findings
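The lookup workflow above can be sketched as a small helper. This is an illustrative stand-in, not a real triage API: the trimmed-down symptom table, the `hypotheses_for` name, and the fuzzy-match cutoff are all assumptions for the sketch.

```python
import difflib

# Hypothetical, trimmed-down version of the symptom table below.
# Entries and wording are illustrative only.
SYMPTOM_MAP = {
    "my app is slow": [
        "memory pressure", "cpu throttling", "dependency timeout",
        "cold start", "thread pool exhaustion",
    ],
    "app restarts randomly": [
        "oom kill", "health check failure", "platform instance migration",
    ],
    "connections timing out": [
        "snat port exhaustion", "dns resolution delay",
        "connection pool misconfiguration",
    ],
}

def hypotheses_for(reported_symptom: str) -> list[str]:
    """Return hypotheses (in priority order) for the closest known symptom."""
    matches = difflib.get_close_matches(
        reported_symptom.lower(), list(SYMPTOM_MAP), n=1, cutoff=0.4)
    return SYMPTOM_MAP[matches[0]] if matches else []
```

In practice the "table" is this page, and the matching is done by a human; the sketch only shows that the workflow is a prioritized lookup, not a diagnosis.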
Common symptom mappings¶
| Customer Symptom | Possible Hypotheses | First Investigation Steps |
|---|---|---|
| "My app is slow" | Memory pressure, CPU throttling, dependency timeout, cold start, thread pool exhaustion | App Service plan metrics (CPU, memory), App Insights dependency calls, procfs memory stats |
| "App restarts randomly" | OOM kill, health check failure, platform instance migration, unhandled exception, deployment in progress | Container logs, platform event timeline, memory usage timeline, deployment history |
| "Intermittent 502/503 errors" | Instance recycling, health check timeout, deployment slot swap, scale-in event, upstream dependency failure | Platform events, load balancer health probe config, instance count timeline, error distribution across instances |
| "Cold start takes too long" | Large dependency tree, heavy framework initialization, storage mount delay, package restore, database warmup | Startup trace timeline, deployment package size, init code profiling, dependency count |
| "High CPU but app isn't doing much" | GC pressure, swap thrashing, noisy neighbor (shared plans), background threads, platform overhead | procfs CPU breakdown, cgroup stats, per-process CPU, GC metrics, thread count |
| "Connections timing out" | SNAT port exhaustion, DNS resolution delay, downstream service overload, connection pool misconfiguration | Outbound connection metrics, SNAT port usage, DNS TTL settings, connection pool stats |
| "Deployment succeeds but app doesn't start" | Missing environment variables, incorrect port binding, startup command error, image pull failure, dependency crash | Container startup logs, environment variable audit, port configuration, image pull status |
| "Metrics show high memory but app is fine" | Plan-level vs. app-level metric confusion, committed vs. working set, buffer/cache inclusion | Verify metric scope (plan vs. instance vs. app), check per-instance view, compare with procfs |
| "Requests succeed locally but fail on Azure" | Missing dependencies in deployment, environment variable differences, network policy blocking outbound, platform proxy behavior | Compare local vs. deployed environment, check NSG/firewall rules, verify outbound connectivity |
| "Latency spikes at specific times" | Scheduled scaling events, platform maintenance window, cron job contention, log rotation, certificate renewal | Correlate with platform event timeline, check scheduled tasks, review scaling history |
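For the "Connections timing out" row, a rough SNAT headroom check can be sketched. The 128-port figure is the documented App Service default preallocation per instance; the function name and threshold semantics are hypothetical simplifications (real exhaustion also depends on destination endpoint reuse and pooling).

```python
SNAT_PREALLOCATED_PORTS = 128  # documented App Service default per instance

def snat_exhaustion_risk(concurrent_conns_per_instance: int,
                         preallocated: int = SNAT_PREALLOCATED_PORTS) -> bool:
    """True when concurrent outbound connections from one instance to a
    single destination reach the per-instance SNAT port preallocation.
    Simplified: ignores port reuse, TIME_WAIT hold, and dynamic allocation."""
    return concurrent_conns_per_instance >= preallocated
```

A client that opens a new connection per request (no pooling or keep-alive) reaches this budget far sooner than the request rate alone would suggest, which is why the investigation step checks pooling usage.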
Experiment-backed mappings¶
| Customer Symptom | Experiment-backed Hypothesis | First Investigation Steps | Reference |
|---|---|---|---|
| "Memory is 85% but app still responds" | Memory plateau can hide reclaim pressure; risk shifts to startup delays and restart cascade, not immediate steady-state outage | Correlate MemoryPercentage with cold-start trend, swap indicators, and restart spikes | Memory Pressure |
| "Outbound API calls fail randomly while CPU looks fine" | SNAT exhaustion can cause TimeoutError with normal CPU/memory | Count concurrent outbound connections per destination and verify connection pooling usage | SNAT Exhaustion |
| "One dependency failed and traffic disappeared from one instance" | Health check eviction is binary after threshold failures; unhealthy instance can drop from ~50% to 0% traffic instantly | Compare health probe failures to traffic split timeline and instance state transitions | Health Check Eviction |
| "Files disappear after deploy but not after restart" | Stop/start and deploy recreate the container (local layer lost); /home persists across lifecycle events | Check write path and verify whether files are under /home or the local writable layer | Filesystem Persistence |
| "First request after idle is very slow but eventually works" | Scale-to-zero cold start is often 20-40s; a 503 may not occur if the timeout budget is high | Confirm zero replicas before test, then map cold-request latency against scale-up events | Scale-to-Zero 503 |
| "Container is running but endpoint always times out" | Wrong targetPort on a fresh revision can keep the revision in Activating with startup probe failures | Compare ingress targetPort to the app listen port in container logs and system probe failures | Target Port Detection |
| "503 upstream connect error appears after ingress change" | Wrong port on a running revision tends to fail fast with Envoy connection refused | Validate recent ingress changes and test direct response after port correction | Target Port Detection |
| "Memory drops periodically but restart count stays flat" | Worker-level OOM kills can be invisible in system logs and restart metrics | Query console logs for SIGKILL and correlate with WorkingSetBytes sawtooth pattern | OOM Visibility Gap |
| "Revision flaps unhealthy during deploy" | Probe budget shorter than actual startup time causes deterministic probe-driven restart loops | Compute startup budget vs. measured startup time and inspect ProbeFailed sequence | Startup Probes |
| "No restarts, but app never receives traffic" | Readiness can block routing while the process remains alive; not always a crash problem | Separate readiness vs. liveness outcomes in logs before escalating runtime failure | Startup Probes |
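The "memory drops periodically but restart count stays flat" signature lends itself to a simple timeline heuristic. This is a sketch under stated assumptions: the series shapes, the drop threshold, and the function name are illustrative, not a real metrics API.

```python
def flag_silent_memory_drops(memory_mb: list[float],
                             restart_counts: list[int],
                             drop_mb: float = 200) -> list[int]:
    """Return sample indices where working-set memory fell sharply while
    the cumulative restart counter did not move. Such points are candidates
    for worker-level kills that system logs and restart metrics may miss.
    Threshold (drop_mb) is an illustrative assumption."""
    suspects = []
    for i in range(1, len(memory_mb)):
        dropped = memory_mb[i - 1] - memory_mb[i] >= drop_mb
        restarted = restart_counts[i] > restart_counts[i - 1]
        if dropped and not restarted:
            suspects.append(i)
    return suspects
```

A sawtooth that this heuristic flags is a prompt to search console logs for SIGKILL, per the investigation step above, not proof of an OOM kill on its own.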
Quick interpretation tips¶
- Use experiment links to validate likely failure signatures before escalating to platform incidents.
- Prioritize hypotheses with reproducible evidence in section 11/14 of each experiment.
- If symptom and metric disagree, prefer logs and state-transition evidence over coarse averages.
- Treat probe-driven failures and port-mismatch failures as configuration defects first.
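The probe-budget comparison in the tips above is simple arithmetic. Assuming Kubernetes-style probe semantics (which Container Apps probes follow), the worst case before the probe gives up is `initialDelaySeconds + periodSeconds × failureThreshold`; the helper names and example values are illustrative.

```python
def startup_budget_seconds(initial_delay: int, period: int,
                           failure_threshold: int) -> int:
    """Worst-case seconds a container has to become ready before the
    startup probe exhausts its failure budget and triggers a restart."""
    return initial_delay + period * failure_threshold

def probe_loop_likely(measured_startup_s: float, initial_delay: int,
                      period: int, failure_threshold: int) -> bool:
    """True when measured startup time exceeds the probe budget, which
    produces a deterministic probe-driven restart loop on every deploy."""
    return measured_startup_s > startup_budget_seconds(
        initial_delay, period, failure_threshold)
```

For example, a probe with a 10s initial delay, 5s period, and 3 failures allowed gives a 25s budget; an app that reliably takes 40s to start will flap on every revision until the probe is widened or startup is shortened.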
Guidance¶
Warning
Symptom-to-hypothesis mapping is a starting point, not a diagnosis. Multiple symptoms can share the same root cause, and a single symptom can have multiple independent causes. Always validate hypotheses against observed evidence before concluding.
See also:
- False Positives — signals that suggest problems that don't exist
- Metric Misreads — commonly misinterpreted Azure metrics
- Platform vs App Boundary — framework for boundary analysis