CNI IP Exhaustion¶
1. Summary¶
Pods fail to schedule or nodes fail to scale because the subnet or pod IP allocation model has run out of usable addresses.
flowchart TD
A[Symptom] --> B[Hypotheses]
B --> C[Evidence]
C --> D[Disprove weak paths]
D --> E[Mitigation] 2. Common Misreadings¶
- The first visible symptom is the root cause.
- Restarting the pod proves the issue is fixed.
- If one namespace is affected, the cluster is healthy.
3. Competing Hypotheses¶
- H1: The node subnet has no free IPs.
- H2: Pod subnet or overlay configuration is undersized for growth.
- H3: Old nodes, NICs, or orphaned resources are still consuming addresses.
- H4: The symptom is actually quota-driven, not IP-driven.
4. What to Check First¶
kubectl describe pod <pod-name> -n <namespace>
az aks show --resource-group $RG --name $CLUSTER_NAME --query networkProfile --output yaml
az network vnet subnet show --resource-group <network-rg> --vnet-name <vnet-name> --name <subnet-name> --output yaml
5. Evidence to Collect¶
- Scheduler and autoscaler events.
- Network profile and plugin mode.
- Subnet size and remaining addresses.
- VMSS or orphaned NIC state in the node resource group.
6. Validation and Disproof by Hypothesis¶
- If node provisioning fails with subnet allocation errors, IP exhaustion is stronger than workload mis-sizing.
- If capacity exists but quota blocks node creation, disprove H1-H3 and handle quota instead.
- If overlay is used, inspect the right address domain before resizing VNets unnecessarily.
7. Likely Root Cause Patterns¶
- Subnet sized for initial cluster only.
- Sudden scale spike with no headroom.
- Orphaned network artifacts after failed operations.
- Wrong assumption about overlay vs direct pod subnet addressing.
8. Immediate Mitigations¶
- Free unused resources and confirm actual IP usage.
- Expand or redesign subnets where supported.
- Lower growth pressure temporarily with workload controls.
- Review whether overlay mode is a better long-term fit.
9. Prevention¶
- Perform IP growth modeling during cluster design.
- Review subnet utilization as part of scaling readiness.
- Keep a clear standard for supported networking models.