VM Troubleshooting Architecture Overview
This page maps the Azure VM control plane, data plane, guest OS, and storage/network dependencies so you can place failures in the right layer before opening a playbook.
Failure-domain overview
graph TD
A[Azure control plane] --> B[Host / compute fabric]
B --> C[VM runtime]
C --> D[Guest OS]
D --> E[Application and agents]
C --> F[OS disk]
C --> G[Data disks]
C --> H[NIC / VNet / NSG / Routes / DNS]
E --> I[Extensions]
E --> J[Backup snapshot workflow]

Where incidents usually start
| Layer | Typical Failure Modes | First Evidence |
|---|---|---|
| Azure control plane | failed start, resize, redeploy, extension orchestration | Activity Log, provisioning state |
| Host / compute fabric | allocation failure, host maintenance, unavailable size | Activity Log, instance view |
| Guest OS boot | boot loop, kernel panic, BCD/GRUB corruption, driver regression | Boot Diagnostics, Serial Console |
| Network path | NSG deny, UDR misroute, DNS failure, guest firewall | Network Watcher, effective routes, guest checks |
| Disk path | IOPS or throughput cap, caching mismatch, snapshot lock | Azure Monitor disk metrics, disk config |
| Guest runtime | CPU saturation, memory pressure, paging, process lockup | VM Insights, Task Manager, top, perfmon, iostat |
| Agent-dependent features | extension failure, backup failure, Run Command issue | VM agent state, extension logs |
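The table above can also be read as a lookup from an observed failure mode to the layer and the first evidence to pull. The following is a minimal sketch of that idea, not part of any Azure SDK. The names `FAILURE_DOMAINS` and `locate` are hypothetical, and the strings are copied from the table.

```python
# Each entry: (layer, typical failure modes, first evidence), copied from the table.
FAILURE_DOMAINS = [
    ("Azure control plane",
     ["failed start", "resize", "redeploy", "extension orchestration"],
     ["Activity Log", "provisioning state"]),
    ("Host / compute fabric",
     ["allocation failure", "host maintenance", "unavailable size"],
     ["Activity Log", "instance view"]),
    ("Guest OS boot",
     ["boot loop", "kernel panic", "BCD/GRUB corruption", "driver regression"],
     ["Boot Diagnostics", "Serial Console"]),
    ("Network path",
     ["NSG deny", "UDR misroute", "DNS failure", "guest firewall"],
     ["Network Watcher", "effective routes", "guest checks"]),
    ("Disk path",
     ["IOPS or throughput cap", "caching mismatch", "snapshot lock"],
     ["Azure Monitor disk metrics", "disk config"]),
    ("Guest runtime",
     ["CPU saturation", "memory pressure", "paging", "process lockup"],
     ["VM Insights", "Task Manager", "top", "perfmon", "iostat"]),
    ("Agent-dependent features",
     ["extension failure", "backup failure", "Run Command issue"],
     ["VM agent state", "extension logs"]),
]


def locate(symptom: str) -> list[tuple[str, list[str]]]:
    """Return (layer, first evidence) for every layer whose failure modes mention the symptom."""
    needle = symptom.lower()
    return [
        (layer, evidence)
        for layer, modes, evidence in FAILURE_DOMAINS
        if any(needle in mode.lower() for mode in modes)
    ]


if __name__ == "__main__":
    for layer, evidence in locate("extension"):
        print(layer, "->", ", ".join(evidence))
```

A term such as `"extension"` matches both the control plane and the agent-dependent layer, which is a useful reminder that one symptom can belong to more than one failure domain until the evidence narrows it down.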
Architectural thinking model
- Classify the symptom surface: connect, perform, boot, or recover.
- Decide whether the first clue is outside or inside the guest.
- Confirm the dependency chain: host, disk, network, guest, agent.
- Only then choose the canonical playbook.
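The four steps above are ordered, and the playbook comes last. One way to make that ordering explicit is a small triage record that reports which step is still outstanding. This is an illustrative sketch only: `Surface`, `Triage`, and `next_step` are hypothetical names, and treating each dependency as a simple "checked" flag is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Surface(Enum):
    """The four symptom surfaces named in the thinking model."""
    CONNECT = "connect"
    PERFORM = "perform"
    BOOT = "boot"
    RECOVER = "recover"


# Dependency chain in the order the model lists it.
DEPENDENCY_CHAIN = ("host", "disk", "network", "guest", "agent")


@dataclass
class Triage:
    surface: Optional[Surface] = None           # step 1
    clue_inside_guest: Optional[bool] = None    # step 2
    checked: set[str] = field(default_factory=set)  # step 3


def next_step(t: Triage) -> str:
    """Return the earliest step of the model that has not been completed yet."""
    if t.surface is None:
        return "classify the symptom surface"
    if t.clue_inside_guest is None:
        return "decide whether the first clue is outside or inside the guest"
    for link in DEPENDENCY_CHAIN:
        if link not in t.checked:
            return f"confirm dependency: {link}"
    return "choose the canonical playbook"


if __name__ == "__main__":
    t = Triage(surface=Surface.CONNECT, clue_inside_guest=False, checked={"host", "disk"})
    print(next_step(t))
```

With the host and disk already checked, the sketch points at the network as the next dependency to confirm, and it only returns the playbook step once the whole chain has been covered.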
Common fault chains
graph TD
A[Boot issue] --> B[Cannot RDP or SSH]
B --> C[Need Boot Diagnostics / Serial Console]
D[DNS or NSG issue] --> E[Cannot connect to VM or dependency]
F[Disk cap reached] --> G[High latency]
G --> H[High CPU / queue / timeout symptoms]
I[VM agent unhealthy] --> J[Extension failures]
I --> K[Backup failures]

What this means for routing
- Connectivity playbooks are for administrative access (RDP or SSH), DNS, routing, and VM-agent-dependent control paths.
- Performance playbooks are for CPU, memory, disk, and saturation or throttling patterns.
- Boot and disk recovery playbooks are for startup, boot repair, serial-console-led diagnosis, and backup/snapshot recovery paths.
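To show how the fault chains and the routing rules fit together, the sketch below walks a visible symptom back to its upstream cause and then picks a playbook family for that cause. The edges are copied from the fault-chain diagram. The cause-to-playbook mapping in `PLAYBOOK_FOR_ROOT` is an assumption based on one reading of the routing bullets, and `root_causes` and `route` are hypothetical helper names.

```python
# Edges copied from the fault-chain diagram: cause -> effects.
FAULT_EDGES: dict[str, list[str]] = {
    "Boot issue": ["Cannot RDP or SSH"],
    "Cannot RDP or SSH": ["Need Boot Diagnostics / Serial Console"],
    "DNS or NSG issue": ["Cannot connect to VM or dependency"],
    "Disk cap reached": ["High latency"],
    "High latency": ["High CPU / queue / timeout symptoms"],
    "VM agent unhealthy": ["Extension failures", "Backup failures"],
}

# ASSUMPTION: one reading of the routing bullets, not an official mapping.
PLAYBOOK_FOR_ROOT: dict[str, str] = {
    "Boot issue": "boot and disk recovery",
    "DNS or NSG issue": "connectivity",
    "Disk cap reached": "performance",
    "VM agent unhealthy": "connectivity",
}


def root_causes(symptom: str) -> list[str]:
    """Follow the edges upstream from a symptom until nodes with no known cause are reached."""
    parents: dict[str, list[str]] = {}
    for cause, effects in FAULT_EDGES.items():
        for effect in effects:
            parents.setdefault(effect, []).append(cause)

    roots, stack, seen = [], [symptom], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        upstream = parents.get(node, [])
        if not upstream:
            roots.append(node)  # nothing upstream: treat as a root cause
        stack.extend(upstream)
    return roots


def route(symptom: str) -> list[str]:
    """Return the playbook families for the root causes behind a symptom."""
    return sorted({PLAYBOOK_FOR_ROOT[r] for r in root_causes(symptom) if r in PLAYBOOK_FOR_ROOT})


if __name__ == "__main__":
    print(root_causes("High CPU / queue / timeout symptoms"))
    print(route("Cannot RDP or SSH"))
```

Routing on the root cause rather than the visible symptom is the design choice here: a failed RDP or SSH session that traces back to a boot issue lands in boot and disk recovery instead of connectivity. A symptom that is not in the diagram yields no route, which signals that the layer still has to be placed by hand.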