Storage Throttling¶
Use this playbook when clients report ServerBusy, 429-like retry behavior, 503 responses, or sudden latency spikes during heavy upload, download, or file-share operations. The root cause is usually a mix of account limits, partition hotspots, concurrency overshoot, or cross-region traffic.
Symptoms¶
- Latency climbs sharply during peak transfer windows.
- Clients report ServerBusy, 503, or SDK retries.
- One container or prefix performs poorly while the rest of the account looks healthy.
- Premium storage was added but the issue remained because request distribution never changed.
Diagnostic Flowchart¶
mermaid flowchart TD A[Reported symptom] --> B[Confirm exact failing operation and time] B --> C{Is the issue configuration, network, or data pattern?} C -->|Configuration| D[Inspect account settings and recent changes] C -->|Network| E[Validate firewall, private access, and DNS evidence] C -->|Data pattern| F[Inspect workload shape, tiering, or replication behavior] D --> G[Run targeted CLI fixes] E --> G F --> G G --> H[Re-test and capture evidence]
Step-by-Step Resolution¶
- Identify the exact storage account, container or share, operation, time window, and calling identity.
- Confirm whether the symptom is isolated to one client, one subnet, one prefix, or the whole account.
- Check the current storage account configuration and compare it with the last known-good state.
- Use KQL to collect evidence before making changes so the eventual root cause is explainable.
- Apply the smallest safe fix first and re-test from the original failing path.
- Update long-term controls so the incident does not recur silently.
Resolution detail¶
- Validate that the issue is reproducible now, not only historical.
- Compare management-plane changes in Azure Activity with the incident timeline.
- Review whether a security, lifecycle, replication, or performance assumption changed without broad communication.
- Prefer reversible changes first, especially during business hours.
- After recovery, capture the design or governance control that would have prevented the issue.
KQL Queries for Diagnostics¶
Server busy responses¶
StorageBlobLogs
| where TimeGenerated > ago(4h)
| where StatusCode in (500, 503) or StatusText has_any ("ServerBusy", "OperationTimedOut")
| summarize Count=count() by StatusText, OperationName, bin(TimeGenerated, 15m)
| order by TimeGenerated desc
How to read it:
- Spikes during predictable transfer windows point to workload shape, not random failure.
- Separate read-heavy and write-heavy operation names to find the hot path.
- Correlate the time range with the exact complaint window and any recent configuration change.
Account-level latency trend¶
AzureMetrics
| where TimeGenerated > ago(6h)
| where ResourceProvider == "MICROSOFT.STORAGE"
| where MetricName in ("SuccessServerLatency", "SuccessE2ELatency", "Transactions")
| summarize AvgValue=avg(Average) by MetricName, bin(TimeGenerated, 15m)
| order by TimeGenerated asc
How to read it:
- Compare latency against transaction volume to see whether load and slowdowns move together.
- If volume is stable but latency spikes, check client network and region placement.
- Correlate the time range with the exact complaint window and any recent configuration change.
Prefix or container concentration¶
StorageBlobLogs
| where TimeGenerated > ago(2h)
| extend Container = tostring(split(Uri, "/")[3])
| summarize Requests=count() by Container, OperationName
| order by Requests desc
How to read it:
- A single container or prefix dominating requests is a hot-partition candidate.
- Use this with application naming analysis for remediation.
- Correlate the time range with the exact complaint window and any recent configuration change.
CLI Commands for Fixes¶
Fix step 1: Measure current SKU and replication configuration¶
az storage account show \
--resource-group $RG \
--name $STORAGE_NAME \
--query "{sku:sku.name,kind:kind,accessTier:accessTier,location:primaryLocation}" \
--output json
- Record the command output in the incident timeline.
- Re-test from the same client identity and network path that originally failed.
- If the change is temporary, document the rollback and a permanent follow-up action.
Fix step 2: Move hot workloads to Premium BlockBlobStorage or Premium FileStorage when justified¶
az storage account update \
--resource-group $RG \
--name $STORAGE_NAME \
--access-tier Hot \
--output json
- Record the command output in the incident timeline.
- Re-test from the same client identity and network path that originally failed.
- If the change is temporary, document the rollback and a permanent follow-up action.
Fix step 3: Use AzCopy with controlled concurrency during bulk transfers¶
azcopy copy "./dataset" "https://$STORAGE_NAME.blob.core.windows.net/$CONTAINER_NAME?<sas-token>" --recursive=true --cap-mbps=800
- Record the command output in the incident timeline.
- Re-test from the same client identity and network path that originally failed.
- If the change is temporary, document the rollback and a permanent follow-up action.
Fix step 4: Scale out into separate accounts when limits or hotspots remain¶
az storage account create \
--resource-group $RG \
--name $STORAGE_NAME_ALT \
--location $LOCATION \
--sku Premium_LRS \
--kind BlockBlobStorage \
--output json
- Record the command output in the incident timeline.
- Re-test from the same client identity and network path that originally failed.
- If the change is temporary, document the rollback and a permanent follow-up action.
Prevention Checklist¶
- [ ] The ownership of this storage account and its policies is documented.
- [ ] Monitoring exists for the symptom class described in this playbook.
- [ ] Teams use long-lived credentials only by exception and with review.
- [ ] Private networking, DNS, and route dependencies are documented where relevant.
- [ ] Blob lifecycle and access tier behavior are explained to data owners.
- [ ] Premium storage or scale-out decisions are backed by measured evidence.
- [ ] Change control captures storage account setting updates that alter runtime behavior.
- [ ] The runbook includes validation and rollback steps.