Disk Performance Issues

Symptoms

Administrative or workload impact is visible to users or operators.
The VM is deployed, but one part of the expected control plane or data path is failing.
You need a fast way to narrow the problem before making a risky change.

flowchart TD
    A[Disk Performance Issues] --> B[Confirm current symptom and blast radius]
    B --> C[Collect platform evidence first]
    C --> D[Collect guest or workload evidence]
    D --> E[Map findings to the most likely hypothesis]
    E --> F[Apply the smallest safe fix]
    F --> G[Validate recovery and prevention actions]

1. Summary

Use this playbook when sustained disk latency, queue depth, or throughput throttling affects application response time even though CPU and memory headroom look healthy.

The playbook covers disk metrics, guest evidence, host caching mode, VM-level caps, and retiering or remapping actions.

2. Common Misreadings

Observation	Often misread as	Actually means
One failed probe or one stale metric	Total VM outage	The issue may be scoped to one path or one recovery dependency.
A successful extension deployment	Guest health is good	Extensions can succeed while the underlying guest service still fails.
A recent change record	Guaranteed root cause	Recent changes are strong leads, but they still need proof from evidence.
A restart fixes the issue	Permanent resolution	Recovery after restart may only hide the real structural cause.

3. Competing Hypotheses

Hypothesis	Likelihood	Key discriminator
Control-plane or configuration drift	High	Azure resource state no longer matches the intended pattern.
Guest OS or agent issue	High	Guest or serial evidence shows service, boot, or firewall failure.
Capacity or platform dependency bottleneck	Medium	Metrics or SKU limits explain the symptom better than configuration drift.
Security control blocked expected behavior	Medium	NSG, ASG, JIT, or policy state changed before the incident.
External dependency issue	Low	VM appears healthy, but a downstream service path is broken.

4. What to Check First

Review VM instance view

az vm get-instance-view --resource-group "$RG" --name "$VM_NAME" --output json

Review the boot diagnostics log

az vm boot-diagnostics get-boot-log --resource-group "$RG" --name "$VM_NAME"

Review NIC effective security rules

az network nic list-effective-nsg --resource-group "$RG" --name "$NIC_NAME" --output json

Review recent platform changes

az monitor activity-log list --resource-group "$RG" --offset 24h --output table

5. Evidence to Collect

5.1 KQL Queries

// Disk latency and throughput trend
// VM insights records "vm.azm.ms" in Origin; logical disk counters use the "LogicalDisk" namespace.
InsightsMetrics
| where TimeGenerated > ago(6h)
| where Origin == "vm.azm.ms"
| where Namespace == "LogicalDisk"
| where Name in ("ReadLatencyMs", "WriteLatencyMs", "TransfersPerSecond")
| summarize AvgVal=avg(Val) by bin(TimeGenerated, 5m), Computer, Name
| order by TimeGenerated asc

Field	Interpretation
`TimeGenerated`	Incident sequence and correlation window.
Resource identifier	Confirms the signal belongs to the affected VM.
Operation or metric value	Explains whether the failure is change-driven, capacity-driven, or guest-driven.

How to read this

Compare these results with the last known healthy window. A change in shape matters more than a single absolute value.
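
One way to make the baseline comparison concrete is to express the incident-window average as a multiple of the healthy-window average. The sketch below assumes you have already read two `AvgVal` figures for the same metric `Name` from the query results; the function name `latency_ratio` and the sample figures are hypothetical.

```shell
# latency_ratio BASELINE_MS CURRENT_MS
# Prints CURRENT_MS / BASELINE_MS rounded to one decimal place,
# or "n/a" when the baseline is zero or negative (no usable healthy window).
latency_ratio() {
    awk -v base="$1" -v cur="$2" 'BEGIN {
        if (base + 0 <= 0) { print "n/a"; exit }
        printf "%.1f\n", cur / base
    }'
}

# Hypothetical figures: 4 ms average while healthy, 22 ms during the incident.
latency_ratio 4 22   # prints 5.5
```

A ratio that stays near 1.0 suggests the absolute value, however high it looks, is normal for this workload; a sustained multiple is the change in shape worth investigating.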

// Heartbeat plus incident correlation
Heartbeat
| where TimeGenerated > ago(6h)
| summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId
| join kind=leftouter (
    AzureActivity
    | where TimeGenerated > ago(6h)
    | project ActivityTime=TimeGenerated, OperationNameValue, ResourceId
) on $left._ResourceId == $right.ResourceId

Field	Interpretation
`LastHeartbeat`	Most recent agent check-in; a stale value means the guest or agent stopped reporting.
`_ResourceId`	Confirms the signal belongs to the affected VM.
`ActivityTime`, `OperationNameValue`	Control-plane operations in the same window, which show whether the incident is change-driven.

How to read this

Line up `ActivityTime` with the start of the latency increase. A steady heartbeat with no matching operation points away from a recent change and toward a capacity or guest-side cause.

5.2 CLI Investigation

az vm show --resource-group "$RG" --name "$VM_NAME" --show-details --query "{powerState:powerState,vmSize:hardwareProfile.vmSize,priority:priority,provisioningState:provisioningState}" --output json

az network nic show --resource-group "$RG" --name "$NIC_NAME" --query "{ipConfigs:ipConfigurations[].privateIPAddress,acceleratedNetworking:enableAcceleratedNetworking,networkSecurityGroup:networkSecurityGroup.id}" --output json

Interpretation:

If the VM is not in a usable power state, fix power and boot issues before guest remediation.
If the NIC or NSG binding is wrong, repair the control plane before changing guest settings.
If accelerated networking or disk attachment changed after a resize, validate that the new size still supports the intended feature set.

6. Validation and Disproof by Hypothesis

Hypothesis 1: Configuration drift or recent change

Proves if: The Activity Log shows a change to the VM or its disks shortly before the latency began, and the current Azure resource state no longer matches the intended pattern.

Disproves if: No VM or disk change precedes the incident, and the current configuration still matches the approved baseline.

Recommended validation steps:

Compare Azure Activity Log against the incident start time.
Compare current configuration with the approved landing zone pattern.
Review the most recent guest evidence or boot or connection output.
Re-test the original symptom after the smallest safe correction.

Hypothesis 2: Guest OS or service failure

Proves if: Guest or serial evidence shows a service, boot, or firewall failure on the affected VM that began in the same window as the symptom.

Disproves if: Guest services and the agent stay healthy throughout the incident, and the guest evidence matches the last healthy baseline.

Recommended validation steps:

Compare Azure Activity Log against the incident start time.
Compare current configuration with the approved landing zone pattern.
Review the most recent guest evidence or boot or connection output.
Re-test the original symptom after the smallest safe correction.

Hypothesis 3: Capacity limit or SKU mismatch

Proves if: Disk or VM-level IOPS or throughput sits at its limit during the latency window, so metrics or SKU limits explain the symptom better than configuration drift.

Disproves if: Disk and VM-level consumption stay well below their limits while latency remains high.

Recommended validation steps:

Compare Azure Activity Log against the incident start time.
Compare current configuration with the approved landing zone pattern.
Review the most recent guest evidence or boot or connection output.
Re-test the original symptom after the smallest safe correction.
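
The capacity check can be sketched as a small classifier over two consumed-percentage readings, one for the disk and one for the VM-level cap. This sketch assumes you take the peak values from platform metrics such as "Data Disk IOPS Consumed Percentage" and "VM Uncached IOPS Consumed Percentage" (confirm the metric names available for your VM); the function name `throttle_verdict` and the default threshold of 95 are hypothetical choices, not platform constants.

```shell
# throttle_verdict DISK_PCT VM_PCT [THRESHOLD]
# Classifies two consumed-percentage readings taken from the incident window.
# The VM-level check runs first: when the VM cap is the binding limit,
# a faster disk alone does not raise the ceiling.
throttle_verdict() {
    awk -v disk="$1" -v vm="$2" -v limit="${3:-95}" 'BEGIN {
        if (vm + 0 >= limit + 0)        print "vm-limited"
        else if (disk + 0 >= limit + 0) print "disk-limited"
        else                            print "below-limits"
    }'
}

# Hypothetical readings: disk at 100 percent, VM cap at 62 percent.
throttle_verdict 100 62   # prints disk-limited
```

A "below-limits" result during a high-latency window counts against this hypothesis and points back to the guest or to a recent change.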

Hypothesis 4: Security control blocking the expected path

Proves if: NSG, ASG, JIT, or policy state changed before the incident, and the blocked path is one the workload depends on.

Disproves if: Effective rules and policy assignments are unchanged since the last healthy window.

Recommended validation steps:

Compare Azure Activity Log against the incident start time.
Compare current configuration with the approved landing zone pattern.
Review the most recent guest evidence or boot or connection output.
Re-test the original symptom after the smallest safe correction.

7. Likely Root Cause Patterns

Pattern	Evidence	Resolution
Unsupported feature after resize or redeploy	Settings drift, feature not enabled, or SKU capability mismatch	Move back to a supported size or re-enable the feature with validation.
NSG or route or JIT drift	Effective rules do not match the intended admin or workload path	Repair the policy and document the expected flow.
Guest service stopped or corrupted	Serial, extension, or guest evidence points to OS-level failure	Repair the guest service, driver, boot loader, or firewall configuration.
Performance bottleneck blamed as an outage	CPU, memory, or disk metrics saturate before the user-visible failure	Resize, retier, or redistribute workload pressure.

8. Immediate Mitigations

Compare disk SKU limits with VM aggregate disk bandwidth and IOPS ceilings before buying a faster disk.
Review host caching and move log-heavy or write-heavy paths to None where recommended.
Consider Premium SSD v2 or Ultra Disk when the workload needs elastic or very high data-plane performance.
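
The comparison in the first item above comes down to simple arithmetic: the usable ceiling is the smaller of the VM-level cap and the combined limit of the disks that serve the workload. The sketch below assumes I/O is spread evenly across the listed disks (for example, a striped volume); the function name `effective_cap` and all figures are hypothetical, so read the real limits from the disk and VM size documentation.

```shell
# effective_cap VM_CAP DISK_LIMIT...
# Prints the binding ceiling followed by which side imposes it:
# "vm" when the VM-level cap is lower than the sum of the disk limits,
# "disks" otherwise. Works for IOPS or MB/s as long as the units match.
effective_cap() {
    vm_cap=$1
    shift
    disk_total=0
    for limit in "$@"; do
        disk_total=$((disk_total + limit))
    done
    if [ "$disk_total" -gt "$vm_cap" ]; then
        echo "$vm_cap vm"
    else
        echo "$disk_total disks"
    fi
}

# Hypothetical figures: VM cap of 6400 IOPS, two disks rated at 5000 IOPS each.
effective_cap 6400 5000 5000   # prints 6400 vm
```

When the result ends in "vm", a faster disk tier would not lift the ceiling; a larger VM size or a redistributed workload is the more likely fix.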

Step-by-step resolution:

Stabilize the VM or admin path without erasing forensic evidence.
Correct the control-plane configuration first when Azure intent is clearly wrong.
Apply guest-side repair only after confirming the platform path is healthy.
Re-run the original command, probe, or sign-in flow to verify recovery.
Record the exact evidence that proved the fix, not just that the symptom disappeared.

CLI commands commonly used during fixes:

az vm run-command invoke --resource-group "$RG" --name "$VM_NAME" --command-id RunShellScript --scripts "sudo systemctl status walinuxagent"

az vm restart --resource-group "$RG" --name "$VM_NAME"

9. Prevention

Prevention checklist

[ ] Keep a documented healthy baseline for VM size, NIC, disk, and admin-path settings
[ ] Alert on drift in critical VM security and connectivity controls
[ ] Test boot diagnostics, Bastion, serial console, and backup restore before production go-live
[ ] Review SKU feature compatibility before resize operations
[ ] Capture post-incident evidence and turn it into a reusable guardrail
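
The baseline and drift items in this checklist can be sketched as a plain text comparison. This sketch assumes the healthy baseline and the current state are each saved as a file with one `key=value` setting per line; that file format, the function name `drift_report`, and the sample keys are hypothetical, not part of any Azure tooling.

```shell
# drift_report BASELINE_FILE CURRENT_FILE
# Prints every line of CURRENT_FILE that has no exact match in BASELINE_FILE,
# that is, settings that changed or were added since the baseline was recorded.
# Prints nothing when the current state matches the baseline.
drift_report() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -Fxv -f "$1" "$2" || true
}

# Hypothetical usage with two saved snapshots:
#   drift_report baseline.txt current.txt
```

Settings that were removed since the baseline do not appear in this output; running the function a second time with the two arguments swapped lists them.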
