High CPU / Memory / Disk¶
1. Summary¶
Symptom¶
One core resource is clearly exhausted: CPU is pinned high, memory is nearly depleted, or disk queue depth and latency are elevated.
Why this scenario is confusing¶
The visible hot resource is not always the originating cause; for example, disk stalls can drive CPU wait time and memory pressure can trigger disk paging.
Troubleshooting decision flow¶
graph TD
A[High resource usage] --> B{Which signal dominates?}
B -->|CPU| C[Check process, credits, runaway work]
B -->|Memory| D[Check leak, paging, cache growth]
B -->|Disk| E[Check queue, latency, IOPS and throughput]
2. Common Misreadings¶
- "High CPU means resize immediately."
- "Low free memory always means a leak."
- "Disk latency is only a storage-tier issue."
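The second misreading is easy to demonstrate on Linux, where `MemFree` in `/proc/meminfo` excludes reclaimable page cache while `MemAvailable` estimates what new workloads can use without swapping. The following is a minimal sketch, not a definitive tool: the helper names and the sample values are illustrative assumptions, and on a real guest the text would come from reading `/proc/meminfo`.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(text):
    """Compare 'free' with 'available' memory as a percentage of total."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    return {
        "free_pct": round(100.0 * info["MemFree"] / total, 1),
        "available_pct": round(100.0 * info["MemAvailable"] / total, 1),
    }


# Illustrative sample only (assumed numbers, not taken from a real incident).
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          400000 kB
MemAvailable:    5200000 kB
Cached:          4900000 kB
"""

summary = memory_summary(SAMPLE)
print(summary)
```

In this sample only 5% of memory is free, yet 65% is available because most of the difference is page cache. A leak is more plausible when available memory, not just free memory, trends downward over time.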
3. Competing Hypotheses¶
- H1: Guest process saturation.
- H2: Burstable-credit depletion or undersized VM.
- H3: Memory pressure causing paging and secondary slowdown.
- H4: Disk throttling or queue buildup.
4. What to Check First¶
- Dominant resource metric and incident duration.
- Process-level evidence from Task Manager or perfmon (Windows), or top, free -m, and iostat (Linux).
- VM SKU and burst-credit behavior if using B-series.
- Disk configuration and caching mode if storage is involved.
5. Evidence to Collect¶
- CPU percentage, credits, and top CPU process.
- Available memory, page faults, and reclaim pressure.
- Disk latency, queue depth, IOPS, throughput.
- Correlation with backup, extension, antivirus, or patch windows.
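The correlation item above can be sketched as a simple interval-overlap check. This is an illustrative example under stated assumptions: the window names, the timestamps, and the `overlapping_windows` helper are hypothetical, and real schedules would come from your backup, antivirus, and patching tools.

```python
from datetime import datetime


def overlapping_windows(incident_start, incident_end, windows):
    """Return the names of scheduled windows that overlap the incident interval."""
    hits = []
    for name, start, end in windows:
        # Two half-open intervals overlap when each starts before the other ends.
        if start < incident_end and incident_start < end:
            hits.append(name)
    return hits


# Hypothetical schedule for one day (assumed values).
windows = [
    ("backup", datetime(2024, 1, 10, 2, 0), datetime(2024, 1, 10, 3, 0)),
    ("antivirus-scan", datetime(2024, 1, 10, 12, 0), datetime(2024, 1, 10, 12, 30)),
    ("patching", datetime(2024, 1, 10, 2, 45), datetime(2024, 1, 10, 4, 0)),
]

hits = overlapping_windows(
    datetime(2024, 1, 10, 2, 30),
    datetime(2024, 1, 10, 3, 15),
    windows,
)
print(hits)
```

An overlap is only a lead, not proof. Confirm it with process-level evidence from the same time range.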
6. Validation and Disproof by Hypothesis¶
H1: Guest process saturation¶
- Supports: one process consistently dominates CPU or memory.
- Weakens: no process-level outlier exists and a platform cap (credits or SKU limit) is evident.
H2: Undersized or burst-limited VM¶
- Supports: zero credits or chronic saturation at expected load.
- Weakens: larger VMs running the same image and workload show the same issue.
H3: Memory pressure¶
- Supports: paging, very low available memory, and response that degrades progressively until a restart.
- Weakens: healthy memory and no paging.
H4: Disk throttling¶
- Supports: high latency or queue despite modest CPU usage.
- Weakens: disk metrics stable while issue persists.
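The supports/weakens rules above can be sketched as a small evaluation over one evidence snapshot. This is a rough illustration, not a diagnostic tool: the `Evidence` fields and every threshold constant are assumptions chosen for the example, and signals that need more than one snapshot (chronic saturation, comparison with larger VMs) are left as "inconclusive".

```python
from dataclasses import dataclass
from typing import Optional

# Assumed thresholds for illustration only; tune them to your workload.
DOMINANT_SHARE = 0.6      # one process using >= 60% of CPU or memory
LOW_MEM_PCT = 10.0        # available memory below 10% counts as very low
HIGH_LATENCY_MS = 20.0    # disk latency treated as high
MODEST_CPU_PCT = 50.0     # CPU usage treated as modest


@dataclass
class Evidence:
    top_process_share: float            # 0..1 share held by the busiest process
    credits_remaining: Optional[float]  # None when the VM is not burstable
    paging_active: bool
    available_mem_pct: float
    disk_latency_ms: float
    cpu_pct: float


def assess(e):
    """Map one evidence snapshot to a verdict per hypothesis."""
    credits_exhausted = e.credits_remaining is not None and e.credits_remaining <= 0
    verdicts = {}

    if e.top_process_share >= DOMINANT_SHARE:
        verdicts["H1"] = "supports"
    elif credits_exhausted:
        verdicts["H1"] = "weakens"  # no outlier and a platform cap is evident
    else:
        verdicts["H1"] = "inconclusive"

    verdicts["H2"] = "supports" if credits_exhausted else "inconclusive"

    if e.paging_active and e.available_mem_pct < LOW_MEM_PCT:
        verdicts["H3"] = "supports"
    elif not e.paging_active and e.available_mem_pct >= LOW_MEM_PCT:
        verdicts["H3"] = "weakens"
    else:
        verdicts["H3"] = "inconclusive"

    if e.disk_latency_ms >= HIGH_LATENCY_MS and e.cpu_pct < MODEST_CPU_PCT:
        verdicts["H4"] = "supports"
    elif e.disk_latency_ms < HIGH_LATENCY_MS:
        verdicts["H4"] = "weakens"
    else:
        verdicts["H4"] = "inconclusive"

    return verdicts


# Hypothetical snapshot: CPU pinned, credits at zero, no process outlier.
result = assess(Evidence(0.15, 0.0, False, 40.0, 4.0, 98.0))
print(result)
```

For this snapshot only H2 is supported, which points toward credit depletion rather than a guest process, memory, or disk cause.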
7. Likely Root Cause Patterns¶
- Runaway process after deployment.
- B-series workload outgrew credit model.
- Memory leak or large page cache growth.
- Data disk tier or VM throughput cap reached.
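For the last pattern, the effective limit is whichever cap is lower, the disk's or the VM's. The sketch below assumes a single data disk and takes all caps as parameters. The numbers in the example are made up for illustration and are not published limits for any disk tier or VM size; look up the real values for your configuration.

```python
def cap_utilization(observed_iops, observed_mbps,
                    disk_iops_cap, disk_mbps_cap,
                    vm_iops_cap, vm_mbps_cap):
    """Compare observed disk load with the lower of the disk and VM caps."""
    effective_iops = min(disk_iops_cap, vm_iops_cap)
    effective_mbps = min(disk_mbps_cap, vm_mbps_cap)
    return {
        "iops_ratio": observed_iops / effective_iops,
        "mbps_ratio": observed_mbps / effective_mbps,
        "iops_limited_by": "disk" if disk_iops_cap <= vm_iops_cap else "vm",
        "mbps_limited_by": "disk" if disk_mbps_cap <= vm_mbps_cap else "vm",
    }


# Hypothetical observation and caps (assumed values, not real tier limits).
result = cap_utilization(
    observed_iops=4900, observed_mbps=90,
    disk_iops_cap=5000, disk_mbps_cap=200,
    vm_iops_cap=6400, vm_mbps_cap=96,
)
print(result)
```

Here IOPS runs at about 98% of the disk cap while throughput runs at about 94% of the VM cap, so upgrading only the disk tier would still leave throughput constrained by the VM size.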
8. Immediate Mitigations¶
- Reduce the hot workload or stop the runaway process if safe.
- Resize or move off B-series when credits are exhausted.
- Add memory or remediate paging source.
- Shift to disk-specific analysis if storage metrics dominate.
9. Prevention¶
- Alert on CPU, guest memory, queue depth, and credits together.
- Plan capacity by workload profile, not by average utilization alone.
- Review scheduled jobs that create periodic spikes.
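The point about average utilization can be illustrated with a small calculation. This sketch uses made-up CPU samples and a nearest-rank percentile helper (`percentile` is a hypothetical name, and other percentile definitions give slightly different values).

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Assumed samples: mostly idle, with a periodic spike from a scheduled job.
cpu_samples = [20] * 18 + [95] * 2

average = mean(cpu_samples)
p95 = percentile(cpu_samples, 95)
print(average, p95)
```

The average of 27.5% suggests ample headroom, while the 95th percentile of 95% shows the VM is nearly saturated during the spikes. Sizing and alerting decisions based on the average alone would miss that.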