Disk Performance Issues
Symptoms
- Administrative or workload impact is visible to users or operators.
- The VM is deployed, but one part of the expected control plane or data path is failing.
- You need a fast way to narrow the problem before making a risky change.
flowchart TD
A[Disk Performance Issues] --> B[Confirm current symptom and blast radius]
B --> C[Collect platform evidence first]
C --> D[Collect guest or workload evidence]
D --> E[Map findings to the most likely hypothesis]
E --> F[Apply the smallest safe fix]
F --> G[Validate recovery and prevention actions]
1. Summary
Use this playbook when sustained disk latency, queue depth, or throughput throttling affects application response time even though CPU and memory headroom look healthy.
It covers disk metrics, guest evidence, host caching mode, VM-level caps, and retiering or remapping actions.
2. Common Misreadings
| Observation | Often misread as | Actually means |
|---|---|---|
| One failed probe or one stale metric | Total VM outage | The issue may be scoped to one path or one recovery dependency. |
| A successful extension deployment | Guest health is good | Extensions can succeed while the underlying guest service still fails. |
| A recent change record | Guaranteed root cause | Recent changes are strong leads, but they still need proof from evidence. |
| A restart fixes the issue | Permanent resolution | Recovery after restart may only hide the real structural cause. |
3. Competing Hypotheses
| Hypothesis | Likelihood | Key discriminator |
|---|---|---|
| Control-plane or configuration drift | High | Azure resource state no longer matches the intended pattern. |
| Guest OS or agent issue | High | Guest or serial evidence shows service, boot, or firewall failure. |
| Capacity or platform dependency bottleneck | Medium | Metrics or SKU limits explain the symptom better than configuration drift. |
| Security control blocked expected behavior | Medium | NSG, ASG, JIT, or policy state changed before the incident. |
| External dependency issue | Low | VM appears healthy, but a downstream service path is broken. |
4. What to Check First
- Review VM instance view
- Review boot diagnostics settings
- Review NIC effective security rules
- Review recent platform changes
5. Evidence to Collect
5.1 KQL Queries
// Disk latency and throughput trend (VM insights schema)
// VM insights writes disk counters with Origin "vm.azm.ms" and Namespace "LogicalDisk".
InsightsMetrics
| where TimeGenerated > ago(6h)
| where Origin == "vm.azm.ms" and Namespace == "LogicalDisk"
| where Name in ("ReadLatencyMs", "WriteLatencyMs", "TransfersPerSecond")
| summarize AvgVal=avg(Val) by bin(TimeGenerated, 5m), Computer, Name
| order by TimeGenerated asc
| Field | Interpretation |
|---|---|
| `TimeGenerated` | Incident sequence and correlation window. |
| `Computer` | Confirms the signal belongs to the affected VM. |
| `Name`, `AvgVal` | Shows which disk metric moved and by how much, which helps separate change-driven, capacity-driven, and guest-driven causes. |
How to read this
Compare these results with the last known healthy window. A change in shape matters more than a single absolute value.
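The comparison against a healthy window can be sketched in a few lines of Python. This is a minimal illustration, not part of the playbook's tooling: the function names, the sample values, and the `threshold` of 2.0 are assumptions. It takes two lists of latency samples (for example, the `AvgVal` column for one metric and one VM) and reports how much the median and the 95th percentile grew.

```python
from statistics import median, quantiles


def compare_windows(baseline, incident):
    """Return growth ratios of the incident window relative to the baseline.

    Both arguments are lists of latency samples in milliseconds.
    Assumes each list has at least two samples and a non-zero baseline.
    """
    def p95(values):
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        return quantiles(values, n=20)[-1]

    return {
        "median_ratio": median(incident) / median(baseline),
        "p95_ratio": p95(incident) / p95(baseline),
    }


def shape_changed(baseline, incident, threshold=2.0):
    """Flag the window when median or p95 grew by at least `threshold` times.

    The threshold is a hypothetical cutoff; tune it per workload.
    """
    ratios = compare_windows(baseline, incident)
    return max(ratios.values()) >= threshold


# Illustrative samples only, not measured data.
healthy = [2, 2, 3, 2, 3, 2, 2, 3]
degraded = [2, 3, 9, 12, 11, 10, 12, 9]
flagged = shape_changed(healthy, degraded)
```

Ratios rather than absolute values keep the check meaningful across disks with very different normal latency.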
// Heartbeat plus incident correlation
Heartbeat
| where TimeGenerated > ago(6h)
| summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId
| join kind=leftouter (
    AzureActivity
    | where TimeGenerated > ago(6h)
    | project ActivityTime=TimeGenerated, OperationNameValue, ResourceId
) on $left._ResourceId == $right.ResourceId
| Field | Interpretation |
|---|---|
| `LastHeartbeat` | Most recent agent check-in. A gap points to a guest or agent problem rather than a disk limit. |
| `_ResourceId` | Confirms the signal belongs to the affected VM. |
| `ActivityTime`, `OperationNameValue` | Shows which control-plane operations ran in the window and when. |
How to read this
Compare `LastHeartbeat` and `ActivityTime` with the incident start time. An operation shortly before the symptom is a strong lead, but it still needs proof from the disk metrics.
5.2 CLI Investigation
# --show-details is needed here: powerState is only returned with instance-view details.
az vm show --resource-group $RG --name $VM_NAME --show-details --query "{powerState:powerState,vmSize:hardwareProfile.vmSize,priority:priority,provisioningState:provisioningState}" --output json
az network nic show --resource-group $RG --name $NIC_NAME --query "{ipConfigs:ipConfigurations[].privateIPAddress,acceleratedNetworking:enableAcceleratedNetworking,networkSecurityGroup:networkSecurityGroup.id}" --output json
Interpretation:
- If the VM is not in a usable power state, fix power and boot issues before guest remediation.
- If the NIC or NSG binding is wrong, repair the control plane before changing guest settings.
- If accelerated networking or disk attachment changed after a resize, validate that the new size still supports the intended feature set.
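The first interpretation rule can be applied mechanically to the JSON that `az vm show` prints. The sketch below is an assumption-laden illustration: `triage_vm_state` is a hypothetical helper, and it expects `powerState` to be populated, which requires requesting instance-view details (`--show-details`). The sample payload is made up for the example.

```python
import json


def triage_vm_state(vm_json):
    """Return a list of findings from the JSON output of the az vm show query.

    Expects the keys used in the query: powerState and provisioningState.
    """
    state = json.loads(vm_json)
    findings = []
    # With instance-view details, a healthy VM reports "VM running".
    if state.get("powerState") != "VM running":
        findings.append("power: VM is not running; fix power and boot issues first")
    if state.get("provisioningState") != "Succeeded":
        findings.append("provisioning: last control-plane operation did not succeed")
    if not findings:
        findings.append("ok: platform state is usable; continue with guest and disk evidence")
    return findings


# Hypothetical payload for illustration.
sample = json.dumps({
    "powerState": "VM deallocated",
    "vmSize": "Standard_D4s_v5",
    "priority": None,
    "provisioningState": "Succeeded",
})
findings = triage_vm_state(sample)
```

Returning findings as a list, rather than stopping at the first problem, keeps every platform-level issue visible before any guest remediation starts.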
6. Validation and Disproof by Hypothesis
Hypothesis 1: Configuration drift or recent change
Proves if: Evidence clearly shows the current state changed from the last healthy pattern in a way that explains the symptom.
Disproves if: Control-plane, guest, and capacity evidence all remain healthy and consistent with baseline.
Recommended validation steps:
- Compare Azure Activity Log against the incident start time.
- Compare current configuration with the approved landing zone pattern.
- Review the most recent guest evidence or boot or connection output.
- Re-test the original symptom after the smallest safe correction.
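Comparing the Activity Log against the incident start time amounts to a window filter. The sketch below assumes the events were already exported into dictionaries with a `time` and an `operation` key; that shape, the helper name, and the two-hour lookback are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def changes_before_incident(events, incident_start, lookback=timedelta(hours=2)):
    """Return events that fall within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [e for e in events if window_start <= e["time"] <= incident_start]
    return sorted(hits, key=lambda e: e["time"], reverse=True)


# Illustrative events only.
incident = datetime(2024, 1, 10, 12, 0)
events = [
    {"time": datetime(2024, 1, 10, 8, 0), "operation": "Microsoft.Compute/virtualMachines/write"},
    {"time": datetime(2024, 1, 10, 11, 30), "operation": "Microsoft.Compute/disks/write"},
    {"time": datetime(2024, 1, 10, 12, 30), "operation": "Microsoft.Compute/virtualMachines/restart/action"},
]
candidates = changes_before_incident(events, incident)
# candidates contains only the 11:30 disk write.
```

Events after the incident start are excluded on purpose: they are usually responses to the symptom, not causes of it.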
Hypothesis 2: Guest OS or service failure
Proves if: Guest or serial evidence shows a service, boot, or firewall failure that starts at or before the symptom.
Disproves if: Guest and serial evidence show the OS, agent, and services healthy throughout the incident window.
Recommended validation steps:
- Compare Azure Activity Log against the incident start time.
- Compare current configuration with the approved landing zone pattern.
- Review the most recent guest evidence or boot or connection output.
- Re-test the original symptom after the smallest safe correction.
Hypothesis 3: Capacity limit or SKU mismatch
Proves if: Disk or VM metrics sit at the disk SKU or VM size limits during the incident, and those limits explain the symptom better than configuration drift.
Disproves if: Disk latency, IOPS, and throughput stay well below both disk and VM limits while the symptom persists.
Recommended validation steps:
- Compare Azure Activity Log against the incident start time.
- Compare current configuration with the approved landing zone pattern.
- Review the most recent guest evidence or boot or connection output.
- Re-test the original symptom after the smallest safe correction.
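For the capacity hypothesis, it helps to distinguish a sustained plateau near a limit from a single spike. The following sketch is one possible check, with assumed names and cutoffs: `ratio=0.95` and `min_run=3` are hypothetical values, and `cap` must come from the published limit for the actual disk SKU or VM size.

```python
def longest_run_near_cap(samples, cap, ratio=0.95):
    """Length of the longest run of consecutive samples at or above ratio * cap."""
    longest = current = 0
    for value in samples:
        if value >= ratio * cap:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a sample below the threshold breaks the run
    return longest


def is_sustained_throttling(samples, cap, min_run=3, ratio=0.95):
    """True when at least `min_run` consecutive samples sit near the cap."""
    return longest_run_near_cap(samples, cap, ratio) >= min_run


# Illustrative IOPS samples against an assumed cap of 5000.
iops = [3000, 4900, 5000, 4950, 4800, 2000, 5000]
run = longest_run_near_cap(iops, cap=5000)  # 4 consecutive samples >= 4750
```

A long run near the cap supports the capacity hypothesis; isolated peaks with normal latency in between usually do not.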
Hypothesis 4: Security control blocking the expected path
Proves if: NSG, ASG, JIT, or policy state changed before the incident and the effective rules now block the expected path.
Disproves if: Security controls match the baseline and the effective rules still allow the expected path.
Recommended validation steps:
- Compare Azure Activity Log against the incident start time.
- Compare current configuration with the approved landing zone pattern.
- Review the most recent guest evidence or boot or connection output.
- Re-test the original symptom after the smallest safe correction.
7. Likely Root Cause Patterns
| Pattern | Evidence | Resolution |
|---|---|---|
| Unsupported feature after resize or redeploy | Settings drift, feature not enabled, or SKU capability mismatch | Move back to a supported size or re-enable the feature with validation. |
| NSG or route or JIT drift | Effective rules do not match the intended admin or workload path | Repair the policy and document the expected flow. |
| Guest service stopped or corrupted | Serial, extension, or guest evidence points to OS-level failure | Repair the guest service, driver, boot loader, or firewall configuration. |
| Performance bottleneck blamed as an outage | CPU, memory, or disk metrics saturate before the user-visible failure | Resize, retier, or redistribute workload pressure. |
8. Immediate Mitigations
- Compare disk SKU limits with VM aggregate disk bandwidth and IOPS ceilings before buying a faster disk.
- Review host caching and move log-heavy or write-heavy paths to `None` where recommended.
- Consider Premium SSD v2 or Ultra Disk when the workload needs elastic or very high data-plane performance.
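The first mitigation, comparing disk limits with the VM ceiling, reduces to taking the smaller of two numbers. The sketch below is a simplified model under stated assumptions: it treats the attached disks as evenly striped so their limits add up, it handles one metric at a time (IOPS or MB/s), and the numbers are placeholders rather than quoted SKU limits.

```python
def effective_ceiling(disk_limits, vm_limit):
    """Return (ceiling, limiter) for one metric, such as IOPS or MB/s.

    disk_limits: per-disk limits for the attached data disks.
    vm_limit: the VM size's aggregate limit for the same metric.
    """
    disks_total = sum(disk_limits)
    if vm_limit < disks_total:
        # The VM cap binds first, so faster disks alone will not help.
        return vm_limit, "vm"
    return disks_total, "disks"


# Placeholder values for illustration only.
striped = effective_ceiling([5000, 5000, 5000], vm_limit=12800)  # (12800, "vm")
single = effective_ceiling([2300], vm_limit=12800)               # (2300, "disks")
```

When the limiter is `"vm"`, the options are resizing the VM or reducing load; when it is `"disks"`, retiering or adding disks is the more direct fix.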
Step-by-step resolution:
- Stabilize the VM or admin path without erasing forensic evidence.
- Correct the control-plane configuration first when Azure intent is clearly wrong.
- Apply guest-side repair only after confirming the platform path is healthy.
- Re-run the original command, probe, or sign-in flow to verify recovery.
- Record the exact evidence that proved the fix, not just that the symptom disappeared.
CLI commands commonly used during fixes:
az vm run-command invoke --resource-group $RG --name $VM_NAME --command-id RunShellScript --scripts "sudo systemctl status walinuxagent"
az vm restart --resource-group $RG --name $VM_NAME
9. Prevention
Prevention checklist
- [ ] Keep a documented healthy baseline for VM size, NIC, disk, and admin-path settings
- [ ] Alert on drift in critical VM security and connectivity controls
- [ ] Test boot diagnostics, Bastion, serial console, and backup restore before production go-live
- [ ] Review SKU feature compatibility before resize operations
- [ ] Capture post-incident evidence and turn it into a reusable guardrail