Troubleshooting Mental Model
The core storage troubleshooting rule is simple: classify the failure surface first, then collect evidence from that surface before changing configuration.
Classification model
```mermaid
flowchart TD
    A[Observed storage symptom] --> B{Primary signal}
    B -->|Cannot reach endpoint or mount| C[Category 1: Access path]
    B -->|403 or auth rejection| D[Category 2: Security and identity]
    B -->|Slow transfer, 429, 503| E[Category 3: Performance and scale]
    B -->|Missing overwritten deleted data| F[Category 4: Protection and recovery]
```
Category summary
| Category | Typical Symptoms | First Thing to Verify | Common Mistake |
|---|---|---|---|
| Access path | timeout, unreachable endpoint, mount failure, private endpoint confusion | DNS answer and network path | changing RBAC before validating endpoint reachability |
| Security and identity | 403, authorization mismatch, SAS rejected | auth method and scope | assuming Contributor is enough for data-plane access |
| Performance and scale | slow upload, latency spike, 429/503 | server latency vs end-to-end latency | calling every slow transfer a throttle event |
| Protection and recovery | deleted, overwritten, cannot restore | protection state before incident | assuming retention can be enabled after impact and still help |
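
The classification step can be sketched as a small keyword lookup. This is only an illustration in Python: the `Category` enum, the `SIGNALS` keyword lists, and the `classify` function are hypothetical names, and the keywords are loosely taken from the "Typical Symptoms" column rather than from any official tool. Naive substring matching like this will misfire on real tickets, so treat it as a way to make the first triage decision explicit, not as a replacement for reading the error.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    ACCESS_PATH = "Category 1: Access path"
    SECURITY_IDENTITY = "Category 2: Security and identity"
    PERFORMANCE_SCALE = "Category 3: Performance and scale"
    PROTECTION_RECOVERY = "Category 4: Protection and recovery"


# Hypothetical keyword lists (assumption, derived from the summary table).
# Order matters: the first category with a matching keyword wins.
SIGNALS = [
    (Category.PROTECTION_RECOVERY, ("deleted", "overwritten", "restore", "missing")),
    (Category.SECURITY_IDENTITY, ("403", "authorization", "sas rejected")),
    (Category.PERFORMANCE_SCALE, ("429", "503", "slow", "latency")),
    (Category.ACCESS_PATH, ("timeout", "unreachable", "mount", "dns", "private endpoint")),
]


def classify(symptom: str) -> Optional[Category]:
    """Return the first category whose keywords appear in the symptom text."""
    text = symptom.lower()
    for category, keywords in SIGNALS:
        if any(keyword in text for keyword in keywords):
            return category
    # No match: collect more evidence before choosing a playbook.
    return None


if __name__ == "__main__":
    print(classify("Upload returns 403 AuthorizationPermissionMismatch"))
    print(classify("mount failure after DNS change"))
```

A `None` result is deliberate: an unclassified symptom is a prompt to gather more detail, not a reason to guess a category.
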
Practical thinking rules
- Path before permission: confirm where traffic is going before debating who can access it.
- Evidence before remediation: capture error text, timestamp, DNS result, and current config before changing anything.
- Server latency vs end-to-end latency: low server latency with slow transfers usually points away from account saturation and toward the client or the network path between client and service.
- Feature state before incident matters: recovery depends on what was enabled earlier, not what is enabled now.
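
The "evidence before remediation" rule can be made concrete with a small capture helper. The sketch below is a Python illustration using only the standard library; `capture_evidence` and its field names are hypothetical. It records the timestamp, the error text, and the DNS answer for the endpoint, and changes nothing. Capturing the current configuration depends on the tooling in use and is not shown here.

```python
import socket
from datetime import datetime, timezone


def capture_evidence(hostname: str, error_text: str, port: int = 443) -> dict:
    """Record what was observed, without changing any configuration."""
    record = {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "hostname": hostname,
        "error_text": error_text,
    }
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the resolved IP address.
        record["dns_answers"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        # A failed lookup is evidence too, so keep it instead of raising.
        record["dns_answers"] = []
        record["dns_error"] = str(exc)
    return record


if __name__ == "__main__":
    # "localhost" is a placeholder; substitute the storage endpoint hostname.
    print(capture_evidence("localhost", "connection timed out"))
```

Saving this record before any change keeps the "path before permission" question answerable later: the DNS answer shows where traffic was going at the time of the failure.
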
Reclassification trigger points
```mermaid
flowchart LR
    A[Initial hypothesis] --> B{Does evidence contradict it?}
    B -->|No| C[Stay on current playbook]
    B -->|Yes| D[Reclassify symptom]
    D --> E[Open another checklist or playbook]
```
Reclassify immediately when:
- a 403 turns out to be a network rule denial on the wrong endpoint path,
- the evidence gathered for a throttling hypothesis shows low server latency and no 429/503 responses,
- a restore request reveals that soft delete or versioning was never enabled.
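
The three triggers above can be written as an explicit contradiction check. This is a hedged Python sketch: the `Evidence` fields, the `contradiction` function, and the `LOW_SERVER_LATENCY_MS` threshold are all assumptions made for illustration, and the threshold value in particular is arbitrary. The function returns only the reason to reclassify; choosing the next playbook is left to the responder.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical threshold (assumption); pick a value that fits your workload.
LOW_SERVER_LATENCY_MS = 50.0


@dataclass
class Evidence:
    # All field names are hypothetical; fill them from your own logs and metrics.
    status_codes: Set[int] = field(default_factory=set)
    denied_by_network_rule: bool = False
    server_latency_ms: Optional[float] = None
    protection_enabled_before_incident: Optional[bool] = None


def contradiction(hypothesis: str, evidence: Evidence) -> Optional[str]:
    """Return a reason to reclassify, or None to stay on the current playbook."""
    if hypothesis == "security":
        if 403 in evidence.status_codes and evidence.denied_by_network_rule:
            return "403 is a network rule denial, not an identity problem"
    elif hypothesis == "performance":
        throttled = bool(evidence.status_codes & {429, 503})
        low_latency = (
            evidence.server_latency_ms is not None
            and evidence.server_latency_ms < LOW_SERVER_LATENCY_MS
        )
        if low_latency and not throttled:
            return "low server latency and no 429/503: throttling is unlikely"
    elif hypothesis == "recovery":
        if evidence.protection_enabled_before_incident is False:
            return "soft delete or versioning was never enabled"
    return None


if __name__ == "__main__":
    observed = Evidence(status_codes={200}, server_latency_ms=12.0)
    print(contradiction("performance", observed))
```
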
Good incident notes format
- Primary symptom: what the user sees.
- Classification: access, security, performance, or recovery.
- Evidence collected: exact command output or metric snapshot.
- Hypotheses still alive: two or three, not ten.
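
The note format above can be enforced with a small structure. The following Python sketch is one possible shape, not a prescribed schema: `IncidentNote`, `CATEGORIES`, and `MAX_HYPOTHESES` are hypothetical names, and the cap of three reflects the "two or three, not ten" guidance.

```python
from dataclasses import dataclass, field
from typing import List

CATEGORIES = ("access", "security", "performance", "recovery")
MAX_HYPOTHESES = 3  # assumption based on "two or three, not ten"


@dataclass
class IncidentNote:
    primary_symptom: str
    classification: str
    evidence: List[str] = field(default_factory=list)
    hypotheses: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject notes that skip classification or keep too many theories open.
        if self.classification not in CATEGORIES:
            raise ValueError(f"classification must be one of {CATEGORIES}")
        if len(self.hypotheses) > MAX_HYPOTHESES:
            raise ValueError(f"keep at most {MAX_HYPOTHESES} live hypotheses")

    def render(self) -> str:
        """Format the note as the four-line list used in this section."""
        lines = [
            f"- Primary symptom: {self.primary_symptom}",
            f"- Classification: {self.classification}",
            f"- Evidence collected: {'; '.join(self.evidence) or 'none yet'}",
            f"- Hypotheses still alive: {'; '.join(self.hypotheses) or 'none yet'}",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    note = IncidentNote(
        primary_symptom="403 on blob upload",
        classification="security",
        evidence=["exact error text from the client log"],
        hypotheses=["role lacks data-plane scope", "SAS expired"],
    )
    print(note.render())
```

Failing fast on a fourth hypothesis is a design choice: it forces the responder to discard a theory using evidence before adding a new one.
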