Troubleshooting Architecture Overview
This page answers the first question in a storage incident: where in the storage path can this fail? Use it to classify the problem before opening a detailed playbook.
Storage failure path
```mermaid
flowchart LR
    A[Client or workload] --> B[Identity or token]
    B --> C[DNS resolution]
    C --> D[Network path]
    D --> E[Storage account front door]
    E --> F[Service endpoint<br/>Blob Files Queue Table]
    F --> G[Data object share queue table entity]
    B -. FP-SEC-01 .-> B1[RBAC SAS key auth failure]
    C -. FP-ACC-01 .-> C1[Wrong public/private DNS resolution]
    D -. FP-ACC-02 .-> D1[Firewall NSG route port issue]
    E -. FP-PERF-01 .-> E1[Account throttle or service latency]
    F -. FP-PERF-02 .-> F1[Partitioning protocol or transfer inefficiency]
    G -. FP-REC-01 .-> G1[Delete overwrite retention mismatch]
```
Failure domains and first checks
| Failure Point | Typical Symptom | First Evidence | Primary Playbook |
|---|---|---|---|
| FP-SEC-01 Identity and authorization | 403, auth mismatch, token rejected | error code, token/SAS fields, RBAC scope | Authorization Failures |
| FP-ACC-01 DNS and name resolution | public IP instead of private IP, name lookup failure | nslookup, zone link state, endpoint FQDN | Private Endpoint and DNS Issues |
| FP-ACC-02 Connectivity path | timeout, cannot mount, cannot reach account | storage firewall, port test, private endpoint approval | Cannot Access Storage Account |
| FP-PERF-01 Account/service saturation | 429, 503, latency spike | transaction metrics, success rate, server latency | Throttling and Performance Issues |
| FP-PERF-02 Transfer design inefficiency | slow upload/download, many small-file delays | concurrency settings, RTT, object size mix | Slow Upload / Download |
| FP-REC-01 Protection and recovery gap | deleted or overwritten data cannot be restored | retention state, versioning, soft delete, backup | Data Protection and Recovery Issues |
Public and private access model
```mermaid
flowchart TD
    A[Client request for <account>.blob.core.windows.net] --> B{DNS answer}
    B -->|Public IP| C[Public endpoint path]
    B -->|Private IP| D[Private endpoint path]
    C --> E{Firewall allows source?}
    D --> F{Private endpoint approved and routable?}
    E -->|No| G[Access blocked]
    F -->|No| G
    E -->|Yes| H[Storage service]
    F -->|Yes| H
```
The most common misclassification is treating a DNS or routing problem as an authorization problem. If the request is hitting the wrong endpoint path, the auth evidence is often misleading.
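One way to confirm which endpoint path a client is actually on is to resolve the account FQDN from that client and label each answer. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and `is_private` is only an approximation of "private endpoint": it also matches loopback and other reserved ranges, so treat the result as a hint, not proof.

```python
import ipaddress
import socket


def classify_dns_answer(ip: str) -> str:
    """Label one resolved address as 'private' or 'public'.

    Assumption: a private-range address suggests the private endpoint path.
    """
    addr = ipaddress.ip_address(ip)
    return "private" if addr.is_private else "public"


def resolve_endpoint_path(fqdn: str) -> dict:
    """Resolve fqdn from this machine and label every returned address."""
    infos = socket.getaddrinfo(fqdn, 443, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return {info[4][0]: classify_dns_answer(info[4][0]) for info in infos}
```

Run `resolve_endpoint_path` from the affected client, not from an admin workstation, because the DNS answer depends on which resolver and zone links the client uses. If a private endpoint is expected and the answer is labeled `public`, check the path before reading anything into a 403.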
Evidence layers to collect in order
- Symptom evidence: exact error code, timestamp, protocol, target endpoint.
- Path evidence: DNS answer, firewall state, private endpoint state, required port reachability.
- Identity evidence: RBAC role, SAS fields, account key policy, token audience and expiry.
- Performance evidence: server latency, end-to-end latency, transaction spikes, concurrency level.
- Recovery evidence: retention settings that were enabled before the incident.
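The ordering above can be kept honest with a small checklist that reports the first layer still missing evidence. This is an illustrative sketch only: the field names are hypothetical shorthand for the items in the list, not an established schema.

```python
from typing import Dict, List, Optional, Tuple

# Layers in the order they should be collected (field names are illustrative).
EVIDENCE_LAYERS: List[Tuple[str, List[str]]] = [
    ("symptom", ["error_code", "timestamp", "protocol", "target_endpoint"]),
    ("path", ["dns_answer", "firewall_state", "private_endpoint_state",
              "port_reachability"]),
    ("identity", ["rbac_role", "sas_fields", "account_key_policy",
                  "token_audience_and_expiry"]),
    ("performance", ["server_latency", "end_to_end_latency",
                     "transaction_spikes", "concurrency_level"]),
    ("recovery", ["retention_settings_before_incident"]),
]


def next_missing_evidence(
    collected: Dict[str, Dict[str, object]]
) -> Optional[Tuple[str, List[str]]]:
    """Return the first layer with gaps and its missing fields, or None."""
    for layer, fields in EVIDENCE_LAYERS:
        have = collected.get(layer, {})
        missing = [f for f in fields if f not in have]
        if missing:
            return layer, missing
    return None
```

Because the function stops at the first incomplete layer, it discourages jumping to identity or performance evidence while symptom or path evidence is still incomplete.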
Quick routing examples
- A 403 with a valid-looking SAS usually still goes to a security playbook first: a well-formed token can still be expired, under-scoped, or signed with a rotated key.
- Private endpoint configured but traffic resolves to public IP usually belongs in access playbooks.
- Slow transfers with no 429/503 often belong in transfer-performance playbooks, not throttling.
- Missing data after deletion belongs in the recovery playbook, and the key question is whether protection existed before impact.
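The routing examples above can be sketched as a first-pass classifier that returns a failure point ID from the table. This is a simplified illustration, not a complete triage engine: the `Symptoms` fields and the check order are assumptions. The path check deliberately runs before the 403 check, since a request on the wrong endpoint path can make auth evidence misleading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Symptoms:
    # Hypothetical summary of what the responder has observed so far.
    status_code: Optional[int] = None
    private_endpoint_configured: bool = False
    resolves_to_public_ip: bool = False
    slow_transfer: bool = False
    data_missing_after_delete: bool = False


def route_incident(s: Symptoms) -> str:
    """Return the failure point ID to open first, or 'unclassified'."""
    if s.data_missing_after_delete:
        return "FP-REC-01"  # recovery: did protection exist before impact?
    if s.private_endpoint_configured and s.resolves_to_public_ip:
        return "FP-ACC-01"  # wrong DNS path; check before trusting auth errors
    if s.status_code == 403:
        return "FP-SEC-01"  # security first, even if the SAS looks valid
    if s.status_code in (429, 503):
        return "FP-PERF-01"  # throttling or service saturation
    if s.slow_transfer:
        return "FP-PERF-02"  # slow without 429/503: transfer design
    return "unclassified"
```

For example, `route_incident(Symptoms(status_code=403, private_endpoint_configured=True, resolves_to_public_ip=True))` returns `"FP-ACC-01"` rather than `"FP-SEC-01"`.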