# Troubleshooting Mental Model
The goal is not to guess the fix first. The goal is to identify the first broken layer and then prove or disprove competing hypotheses quickly.
## Core model
```mermaid
flowchart TD
    A[Observed symptom] --> B[Classify the layer]
    B --> C[Build competing hypotheses]
    C --> D[Collect minimal decisive evidence]
    D --> E[Disprove fast]
    E --> F[Validate likely root cause]
    F --> G[Mitigate and prevent]
```
## The four Azure networking layers
| Layer | Key question | Typical mistake |
|---|---|---|
| Resolution | Did we get the right destination IP? | Treating wrong DNS as packet loss |
| Path | Did Azure choose the intended next hop? | Looking only at NSG without checking routes |
| Policy | Was the chosen path allowed? | Blaming routing when firewall or NSG denied it |
| Target / performance | Did the backend answer correctly and on time? | Calling every slow response a network issue |
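The first two layers can be separated with two cheap probes: one that only resolves the name, and one that only tests TCP reachability to an IP. A minimal Python sketch, assuming a hypothetical destination name and port (the `.invalid` name and port 1433 are illustrative, and a raw `socket` probe only approximates tools like Network Watcher Connection Troubleshoot):

```python
import socket

def resolve_name(hostname):
    """Resolution layer: did we get a destination IP at all?"""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

def tcp_reachable(ip, port, timeout=3):
    """Path, policy, and target layers collapsed into one TCP probe.
    A failure here says nothing about DNS; prove resolution first."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

# Separate the layers: resolve first, then probe the resolved IP.
ip = resolve_name("sqldb.contoso.invalid")  # illustrative name
if ip is None:
    print("first broken layer: resolution")  # route/NSG analysis is premature
elif not tcp_reachable(ip, 1433):
    print("first broken layer: path, policy, or target")
```

Running the probes in this order is what keeps a wrong DNS answer from being misread as packet loss.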
## Use competing hypotheses, not a single favorite
| Symptom | Common first guess | Competing hypotheses you should keep alive |
|---|---|---|
| Cannot reach Private Endpoint | Private Endpoint is broken | wrong DNS, missing VNet link, NSG deny, stale record, target not listening |
| Peering traffic fails | Peering is disconnected | overlap, transit flag mismatch, NSG deny, UDR override |
| Outbound internet fails | Firewall is down | DNS failure, missing NAT path, route-all to wrong hop, target outage |
| High latency | Azure network issue | backend saturation, MTU, path asymmetry, probe mismatch, ISP issue |
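Keeping several hypotheses alive is easier when they are tracked explicitly. A minimal sketch, using the hypothesis names from the Private Endpoint row (the status strings and helper are assumptions, not an official workflow):

```python
# Competing hypotheses for "cannot reach Private Endpoint", all alive at start.
hypotheses = {
    "wrong DNS": "alive",
    "missing VNet link": "alive",
    "NSG deny": "alive",
    "target not listening": "alive",
}

def record_evidence(hypotheses, disproves):
    """Mark every hypothesis contradicted by one piece of evidence,
    and return the hypotheses still worth testing."""
    for h in disproves:
        hypotheses[h] = "disproven"
    return [h for h, status in hypotheses.items() if status == "alive"]

# Example: a lookup from the failing client returning the correct
# private IP disproves the DNS hypothesis, and only that one.
remaining = record_evidence(hypotheses, ["wrong DNS"])
```

Each piece of evidence should strike out only the hypotheses it actually contradicts; the rest stay alive for the next decisive test.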
## Practical classification flow
- Pick one failing source, one failing destination, and one time window.
- Ask whether the failure is primarily resolution, path, policy, or performance.
- Collect only the first decisive artifact for that category.
- Reclassify immediately if the artifact contradicts your initial category.
```mermaid
flowchart LR
    A[One failing source] --> B[One failing destination]
    B --> C[One incident window]
    C --> D[One first hypothesis set]
    D --> E[One decisive test]
```
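The classification steps reduce to a lookup plus a reclassification rule. The artifact descriptions below are a sketch of typical first evidence per category, not an official checklist:

```python
# One decisive artifact per category (illustrative choices).
DECISIVE_ARTIFACT = {
    "resolution": "DNS answer observed from the failing source",
    "path": "effective routes on the source NIC",
    "policy": "effective NSG rules and firewall logs for the flow",
    "performance": "latency measured directly against the backend",
}

def next_step(category, artifact_supports_category):
    """Reclassify immediately when the decisive artifact contradicts
    the chosen category; otherwise keep digging in that layer."""
    if not artifact_supports_category:
        return "reclassify"
    return f"keep investigating via: {DECISIVE_ARTIFACT[category]}"
```

The point of the single decisive artifact is the fast exit: one contradiction sends you back to classification instead of deeper into the wrong layer.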
## Anti-patterns this model prevents
- DNS blindness: troubleshooting TCP before proving name resolution.
- Route blindness: checking NSG or Firewall before proving next hop.
- Portal-only bias: trusting configured state without live test evidence.
- Symptom drift: mixing unrelated client, target, and time windows.
- Single-cause bias: stopping at the first plausible explanation.
## What “good troubleshooting” looks like
| Good habit | Why it matters |
|---|---|
| Compare IP-only and name-based tests | Separates DNS from raw reachability |
| Use effective routes and effective NSG together | Separates path choice from policy outcome |
| Check both sides of peering or hybrid links | Many Azure links are bilateral by design |
| Correlate time-based failures | Intermittent issues need time alignment, not static inspection |
| Document disproven hypotheses | Prevents looping back to already falsified ideas |
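The first habit in the table, comparing an IP-only test with a name-based test against the same destination and port, yields a small decision rule. A sketch, assuming each test has already been reduced to a boolean outcome:

```python
def diagnose(name_test_ok, ip_test_ok):
    """Interpret the pair of results from a name-based test and an
    IP-only test against the same destination and port."""
    if ip_test_ok and not name_test_ok:
        return "resolution"            # raw reachability works, DNS does not
    if not ip_test_ok:
        return "path/policy/target"    # reachability itself fails; DNS is unproven
    return "healthy"                   # both pass at this granularity
```

Only the combination of both results is decisive: either test alone cannot distinguish a DNS failure from a blocked path.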
## The first broken layer wins
If DNS is wrong, route analysis is premature. If the route is wrong, NSG tuning is premature. Move layer by layer.
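The ordering rule can be encoded directly: run the layer checks in order and stop at the first failure. A sketch where the check callables are placeholders for real evidence gathering (lookups, effective routes, effective NSG, backend health):

```python
def first_broken_layer(checks):
    """checks: ordered (layer_name, check_fn) pairs, where check_fn
    returns True when that layer is proven healthy. The first failure
    wins: every layer after it is premature to analyze."""
    for layer, check in checks:
        if not check():
            return layer
    return None  # nothing broken at this granularity
```

The short-circuit is the whole point: a `path` failure stops the walk before `policy`, exactly as the text above prescribes.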