VM Troubleshooting Mental Model
The core VM troubleshooting habit is simple: classify the failure domain first, then collect disproving evidence before committing to a root cause.
Classification model
```mermaid
flowchart TD
    A[Observed symptom] --> B{Failure domain}
    B -->|Admin path broken| C[Connectivity]
    B -->|Runtime degraded| D[Performance]
    B -->|Startup or recovery broken| E[Boot and disk]
    C --> C1[Check network, auth, agent]
    D --> D1[Check CPU, memory, disk, network]
    E --> E1[Check boot artifacts, serial console, snapshot path]
```

Four rules
- Start with the narrowest true symptom: “cannot SSH” is better than “VM is broken.”
- Use competing hypotheses: at least two plausible explanations before taking action.
- Prefer disproof over confirmation: look for evidence that would invalidate your favorite theory.
- Separate platform from guest: Azure state and guest state are not the same thing.
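
The classification flowchart can be sketched as a small lookup, which makes the "narrowest true symptom" rule concrete: a symptom with no specific flag set cannot be classified yet. This is a minimal illustration only. The names `FailureDomain`, `Symptom`, `classify`, and `FIRST_CHECKS` are hypothetical, and checking the boot flag first is an assumption based on the point that access fixes cannot work while boot is broken.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureDomain(Enum):
    CONNECTIVITY = "connectivity"
    PERFORMANCE = "performance"
    BOOT_AND_DISK = "boot and disk"


# First checks per domain, copied from the flowchart.
FIRST_CHECKS = {
    FailureDomain.CONNECTIVITY: ["network", "auth", "agent"],
    FailureDomain.PERFORMANCE: ["CPU", "memory", "disk", "network"],
    FailureDomain.BOOT_AND_DISK: ["boot artifacts", "serial console", "snapshot path"],
}


@dataclass(frozen=True)
class Symptom:
    """A narrowly stated symptom; the flags mirror the flowchart edges."""
    description: str
    admin_path_broken: bool = False
    runtime_degraded: bool = False
    startup_or_recovery_broken: bool = False


def classify(symptom: Symptom) -> Optional[FailureDomain]:
    """Return the failure domain, or None if the symptom is too vague."""
    # Assumption: boot problems win ties, because the admin path
    # cannot work until the VM starts.
    if symptom.startup_or_recovery_broken:
        return FailureDomain.BOOT_AND_DISK
    if symptom.admin_path_broken:
        return FailureDomain.CONNECTIVITY
    if symptom.runtime_degraded:
        return FailureDomain.PERFORMANCE
    return None  # e.g. "VM is broken": narrow the symptom first


if __name__ == "__main__":
    domain = classify(Symptom("cannot SSH", admin_path_broken=True))
    print(domain.value, FIRST_CHECKS[domain])
```

A `None` result is the useful part of the sketch: it signals that the symptom statement needs narrowing before any domain-specific checks begin.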
Typical category mistakes
| Mistake | What it causes | Better move |
|---|---|---|
| treating every access failure as an NSG issue | misses guest firewall, VM agent, or credential problems | check network path and guest readiness together |
| using only CPU for performance diagnosis | misses memory pressure and disk throttling | inspect CPU, memory, disk, and queue/latency together |
| trying RDP/SSH fixes during boot corruption | wastes time on a path that cannot work yet | switch immediately to Boot Diagnostics and Serial Console |
| retrying backup without checking agent state | repeats the same failed snapshot workflow | validate VM agent and extension health first |
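
The second row of the table (CPU-only diagnosis) can be illustrated with a check that evaluates all signals in one pass. This is a sketch under stated assumptions: the `PerfSample` fields, the `pressured_signals` helper, and every threshold value are hypothetical placeholders, not Azure defaults or recommended limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PerfSample:
    """One point-in-time reading; field names are illustrative."""
    cpu_percent: float
    memory_available_mb: float
    disk_latency_ms: float
    disk_queue_depth: float


# Hypothetical thresholds; tune them for the workload and VM size.
CPU_HIGH_PERCENT = 85.0
MEMORY_LOW_MB = 500.0
DISK_LATENCY_HIGH_MS = 20.0
DISK_QUEUE_HIGH = 2.0


def pressured_signals(sample: PerfSample) -> List[str]:
    """List every signal under pressure, not just the CPU."""
    findings = []
    if sample.cpu_percent > CPU_HIGH_PERCENT:
        findings.append("cpu")
    if sample.memory_available_mb < MEMORY_LOW_MB:
        findings.append("memory")
    if sample.disk_latency_ms > DISK_LATENCY_HIGH_MS:
        findings.append("disk latency")
    if sample.disk_queue_depth > DISK_QUEUE_HIGH:
        findings.append("disk queue")
    return findings


if __name__ == "__main__":
    # CPU looks healthy here, yet memory and disk latency are under pressure.
    sample = PerfSample(cpu_percent=20.0, memory_available_mb=150.0,
                        disk_latency_ms=45.0, disk_queue_depth=1.0)
    print(pressured_signals(sample))
```

A CPU-only check would report this sample as healthy, which is exactly the category mistake the table describes.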
Investigation rhythm
```mermaid
graph TD
    A[Symptom] --> B[Hypotheses]
    B --> C[Collect evidence]
    C --> D[Disprove weak hypotheses]
    D --> E[Mitigate]
    E --> F[Prevention update]
```

How to apply this in practice
- Use Quick Diagnosis Cards when speed matters.
- Use the matching First 10 Minutes checklist to stabilize routing.
- Open one canonical playbook and finish the evidence loop before jumping categories.
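
The evidence loop (hypotheses, evidence, disproof) can be sketched as follows. This is an illustration rather than a prescribed tool: the `Hypothesis` class, the `surviving` helper, and the evidence keys such as `nsg_allows_22` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Evidence = Dict[str, bool]


@dataclass(frozen=True)
class Hypothesis:
    name: str
    # Returns True when the collected evidence rules this hypothesis out.
    disproved_by: Callable[[Evidence], bool]


def surviving(hypotheses: List[Hypothesis], evidence: Evidence) -> List[str]:
    """Drop every hypothesis the evidence disproves; return the rest."""
    if len(hypotheses) < 2:
        # Rule: hold at least two plausible explanations before acting.
        raise ValueError("need at least two competing hypotheses")
    return [h.name for h in hypotheses if not h.disproved_by(evidence)]


if __name__ == "__main__":
    # Symptom: "cannot SSH". Evidence keys are hypothetical.
    evidence = {"nsg_allows_22": True, "guest_firewall_allows_22": False}
    hypotheses = [
        Hypothesis("NSG blocks SSH", lambda e: e["nsg_allows_22"]),
        Hypothesis("guest firewall blocks SSH",
                   lambda e: e["guest_firewall_allows_22"]),
    ]
    print(surviving(hypotheses, evidence))
```

Each hypothesis carries the test that would invalidate it, so evidence collection is driven by disproof rather than by confirming a favorite theory.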