Troubleshooting Mental Model¶
This page provides a practical thinking model for Azure Monitor incidents so you can move from symptom to verified root cause without skipping disproof.
Core idea: treat every incident as a path from symptom → hypothesis → validation → disproof → mitigation. Azure Monitor problems are often misclassified because operators jump from one visible symptom to one familiar cause.
Why this model matters¶
Most troubleshooting delays come from category mistakes:
- ingestion gaps investigated as query bugs
- alert-rule failures investigated as workspace outages
- slow workbooks investigated as portal problems instead of KQL shape problems
- missing application telemetry investigated as platform regression when the connection string changed
This model keeps the investigation falsifiable.
Symptom-to-hypothesis framework¶
flowchart TD
A[Observed symptom] --> B[Classify failure domain]
B --> C[List competing hypotheses]
C --> D[Collect minimum evidence set]
D --> E{Does evidence prove one hypothesis?}
E -->|No| F[Disprove and reclassify]
E -->|Yes| G[Apply targeted mitigation]
G --> H[Run recovery verification] 1) Start with the exact symptom¶
Capture one precise statement before touching tooling.
| Weak statement | Better statement |
|---|---|
| Azure Monitor is broken | AKS container logs stopped arriving in one workspace after 09:10 UTC |
| Alerts do not work | One scheduled query rule did not fire although matching rows existed between 10:00 and 10:15 UTC |
| Queries are slow | One workbook query against AzureDiagnostics times out for 24 hours but succeeds for 30 minutes |
The better statement creates falsifiable boundaries: workload, signal type, time window, and blast radius.
2) Classify the failure domain first¶
Use four domains only.
| Domain | Meaning | Typical symptom |
|---|---|---|
| Source | Telemetry was never emitted correctly | Missing requests, traces, or guest logs |
| Routing | Telemetry was emitted but not delivered correctly | Wrong workspace, missing categories, DCR mismatch |
| Data store | Telemetry arrived but is delayed, expired, or queried in the wrong place | Historical gaps, wrong table, retention confusion |
| Consumer | Data exists, but the query, alert, or workbook is wrong | Slow KQL, bad thresholds, workbook fan-out |
Do not start with root cause labels
Start with a failure domain, not with a favorite explanation such as "platform issue" or "networking issue".
3) Build competing hypotheses¶
For each incident, create at least three plausible explanations.
| Symptom | Good competing hypotheses |
|---|---|
| No logs in workspace | diagnostic setting missing, wrong workspace target, ingestion delay |
| App telemetry gap | SDK misconfiguration, sampling, network egress restriction |
| Alert never fires | no underlying data, bad query logic, wrong evaluation window |
| Slow query | broad time range, weak predicates, workbook/cross-workspace overhead |
If you only have one hypothesis, you are usually guessing.
4) Ask evidence questions, not fix questions¶
Wrong question: What should I change?
Better questions:
- What evidence would prove the data never arrived?
- What evidence would show the workspace is healthy but the query is bad?
- What evidence would disprove a service-side incident?
- What changed immediately before the symptom started?
This prevents premature mitigation that destroys useful evidence.
5) Minimum evidence set¶
flowchart TD
A[Symptom statement] --> B[Control query]
A --> C[Configuration evidence]
A --> D[Change history]
A --> E[Signal-specific table evidence] Use one artifact from each category whenever possible.
| Evidence category | Example |
|---|---|
| Control query | Heartbeat or Usage narrow query |
| Configuration evidence | diagnostic settings, DCR, alert rule, Application Insights settings |
| Change history | Activity Log around incident start |
| Signal-specific evidence | requests, AppRequests, AzureDiagnostics, ContainerLogV2 |
6) Validation logic: prove and disprove¶
Every hypothesis needs both sides.
| Hypothesis | Proves if | Disproves if |
|---|---|---|
| Workspace outage | multiple trivial control queries are slow for unrelated tables | narrow control queries are healthy |
| Diagnostic setting issue | Activity Log shows change and target categories or destination are wrong | settings are unchanged and another routing issue explains gap |
| Query-shape issue | small-window or selective rewrite is dramatically faster | performance stays poor even with narrow selective probes |
| Sampling issue | requests arrive but traces are selectively reduced per configuration | traffic is absent across all signal types |
If you cannot state the disproof condition, the hypothesis is not ready.
7) Common Azure Monitor failure patterns¶
Pattern A: "No data" is really wrong routing¶
- Resource exists and is healthy.
- Metrics may still appear.
- Workspace query is empty because diagnostic settings or DCR target is wrong.
Pattern B: "Application Insights is down" is really app-side config drift¶
- Requests drop after deployment.
- Connection string or SDK package changed.
- Platform is healthy, but source emission changed.
Pattern C: "Query engine is slow" is really data volume plus bad KQL¶
- Control query is fast.
- Slow query uses wide windows,
contains, or joins high-cardinality sets. - Rewritten query proves the workspace is fine.
Pattern D: "Alert is broken" is really threshold or cadence mismatch¶
- Data exists interactively.
- Alert window or aggregation does not align with the observed signal.
- Rule logic fails even though ingestion is healthy.
8) Common anti-patterns¶
| Anti-pattern | Why it slows diagnosis |
|---|---|
| Restarting agents or apps before collecting evidence | destroys timeline and masks intermittent issues |
| Changing diagnostic settings and alert rules together | makes attribution impossible |
| Looking at only one table | creates wrong-table bias |
| Treating portal behavior as proof of platform state | workbook and portal UX can hide the real KQL issue |
| Escalating before running a control query | confuses local design issues with service incidents |
9) Recovery verification model¶
flowchart TD
A[Apply one targeted fix] --> B[Check fresh data or control query]
B --> C[Compare against original symptom]
C --> D{Recovered?}
D -->|No| E[Return to hypotheses]
D -->|Yes| F[Document root cause and prevention] Recovery is not "the command succeeded." Recovery means the original symptom is absent in fresh evidence.
10) Incident note template¶
Use this short structure during investigations:
- Symptom: what is wrong, where, and since when
- Failure domain: source, routing, data store, or consumer
- Top hypotheses: at least three
- Evidence collected: control query, config check, change history, signal-specific query
- Current conclusion: what is disproved, what remains likely
11) When to escalate¶
Escalate to likely Azure-side service impact only when:
- multiple unrelated workloads or tables show the same degradation
- narrow control queries also fail or are abnormally slow
- configuration evidence does not explain the issue
- Service Health or broader tenant evidence fits the incident window
Otherwise, continue disproof inside the relevant playbook.
See Also¶
- Troubleshooting
- Troubleshooting Architecture Overview
- Decision Tree
- Evidence Map
- Playbooks Index
- Slow Query Performance