Cluster Autoscaler Issues
1. Summary
Use this playbook when pending pods do not trigger scale-out, empty nodes do not scale in, or autoscaler events show repeated failures. In AKS, the most common causes are tight node pool min and max limits, unsupported scheduling constraints, exhausted subnet or quota limits, or incorrect expectations about daemonsets and disruption budgets.
Typical incident window: 10-20 minutes to establish whether the issue is workload-specific, node-specific, or cluster-wide. Time to resolution: 30 minutes to several hours depending on whether the fix is manifest-level, node-level, or Azure control-plane level.
Symptoms
- Pods remain `Pending` with scheduling messages even though the autoscaler is enabled.
- `kubectl describe pod` shows CPU, memory, or node affinity constraints that no current node satisfies.
- AKS activity or autoscaler logs mention scale-up rejection, quota issues, or subnet capacity exhaustion.
- Scale-in never happens because PDBs, local storage, or daemonset overhead keep nodes non-empty.
Diagnostic flowchart
flowchart TD
A[Reported symptom] --> B{Can the object be reproduced now?}
B -->|No| C[Use recent events, Container Insights, and rollout history]
B -->|Yes| D[Capture current state with kubectl and Azure CLI]
D --> E{Is the issue isolated to one workload or node pool?}
E -->|Workload| F[Check image, probes, config, and service wiring]
E -->|Node pool| G[Check node health, autoscaler, subnet, and VMSS state]
F --> H[Validate with KQL and controller logs]
G --> H
H --> I[Apply targeted fix and verify telemetry returns to baseline]
2. Common Misreadings
| Observation | Often Misread As | Actually Means |
|---|---|---|
| Symptom appears in one namespace | Entire cluster outage | The issue may still be isolated to one rollout, one pool, or one ingress class. |
| Azure portal shows cluster healthy | Workload path is healthy | Control plane health does not prove pod, node, or ingress behavior. |
| Restart or reschedule seems to help briefly | Root cause is fixed | Many AKS issues recur until the underlying manifest, node, or network condition is corrected. |
| Monitoring has partial data | Monitoring is the problem | Partial Container Insights data is itself useful evidence about scope and timing. |
3. Competing Hypotheses
| Hypothesis | Likelihood | Key Discriminator |
|---|---|---|
| Node pool limits or autoscaler profile settings prevent scale-out | High | Node pool max count, surge headroom, or profile settings block new nodes. |
| Pod constraints cannot be satisfied by any node pool | High | Node selectors, taints, zones, or resource requests mismatch the available pools. |
| Azure quota or subnet capacity prevents new nodes | Medium | Azure CLI shows VMSS, vCPU, or subnet address space limits reached. |
| Scale-in blocked by workload protections | Medium | PDBs, local storage, or daemonsets keep nodes from becoming removable. |
4. What to Check First
- Confirm the current object state from Kubernetes
- Describe the affected object to capture recent events
- Check AKS cluster and node pool configuration from Azure
- List node pools and autoscaler settings
- Run a fast Container Insights control query
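The first four checks above can be sketched as small shell helpers. This is a minimal sketch, not the playbook's official tooling: the function names and the `$RG` and `$CLUSTER_NAME` variables are illustrative assumptions, and the `--query` projections only select fields that the AKS API is expected to return.

```shell
#!/usr/bin/env bash
# First-pass checks for an autoscaler incident.
# Function names and variables ($RG, $CLUSTER_NAME) are hypothetical.

# List pods stuck in Pending across all namespaces.
list_pending_pods() {
  kubectl get pods --all-namespaces \
    --field-selector=status.phase=Pending \
    --output wide
}

# Describe one pod to capture its recent events.
# Usage: describe_pod <namespace> <pod>
describe_pod() {
  kubectl describe pod "$2" --namespace "$1"
}

# Show cluster provisioning state and the autoscaler profile.
show_cluster() {
  az aks show \
    --resource-group "$RG" \
    --name "$CLUSTER_NAME" \
    --query "{provisioningState:provisioningState,autoScalerProfile:autoScalerProfile}" \
    --output json
}

# List node pools with their autoscaler bounds and current count.
list_nodepools() {
  az aks nodepool list \
    --resource-group "$RG" \
    --cluster-name "$CLUSTER_NAME" \
    --query "[].{name:name,autoscale:enableAutoScaling,min:minCount,max:maxCount,count:count}" \
    --output table
}
```

A pool whose `count` already equals `max` while pods are listed as `Pending` is a quick hint toward the node pool limit hypothesis.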
5. Evidence to Collect
5.1 KQL Queries
KubePodInventory
| where TimeGenerated > ago(30m)
| summarize Restarts=max(ContainerRestartCount), LastSeen=max(TimeGenerated) by ClusterName, Namespace, PodName=Name, ContainerName
| order by Restarts desc
| Column | Example value | Interpretation |
|---|---|---|
| `Restarts` | 14 | Confirms the issue is current and identifies which container is unstable. |
| `LastSeen` | 2026-04-07 09:41:00 | Shows how fresh the inventory signal is. |
| `Namespace` | payments | Helps isolate whether blast radius is limited. |
How to Read This
Start by proving scope. If restart or state anomalies are limited to one namespace or one pool, avoid cluster-wide changes first.
ContainerLogV2
| where TimeGenerated > ago(30m)
| summarize LogLines=count(), LastSeen=max(TimeGenerated) by PodNamespace, PodName
| order by LastSeen desc
| Column | Example value | Interpretation |
|---|---|---|
| `LogLines` | 152 | Confirms whether the pod is emitting logs during failure. |
| `LastSeen` | recent timestamp | Stale logs can indicate the container never reaches full runtime. |
How to Read This
Pair this query with `kubectl logs --previous` so you do not confuse current healthy logs with the failing previous container instance.
KubeEvents
| where TimeGenerated > ago(30m)
| where Reason in ("Failed", "BackOff", "Unhealthy", "NodeNotReady", "FailedScheduling")
| project TimeGenerated, Namespace, Name, Reason, Message
| order by TimeGenerated desc
| Column | Example value | Interpretation |
|---|---|---|
| `Reason` | BackOff | Indicates repeated restart attempts or scheduling failures depending on the object. |
| `Message` | Back-off restarting failed container | Often provides the shortest path to the likely hypothesis. |
How to Read This
Events often age out faster than logs. Capture them early in the incident before recreating pods or nodes.
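Because events age out quickly, it can help to run the event query from a terminal and save the output early. The sketch below assumes a hypothetical `$WORKSPACE_ID` variable holding the Log Analytics workspace GUID, and narrows the query above to `FailedScheduling`. Depending on the Azure CLI version, `az monitor log-analytics query` may require the `log-analytics` extension.

```shell
#!/usr/bin/env bash
# Run a narrowed KubeEvents query from the CLI.
# $WORKSPACE_ID (workspace GUID) and the function name are hypothetical.
failed_scheduling_kql() {
  az monitor log-analytics query \
    --workspace "$WORKSPACE_ID" \
    --analytics-query 'KubeEvents | where TimeGenerated > ago(30m) | where Reason == "FailedScheduling" | project TimeGenerated, Namespace, Name, Reason, Message | order by TimeGenerated desc' \
    --output table
}
```

Redirecting the result to a file, for example `failed_scheduling_kql > events.txt`, keeps a copy even after the events expire from the cluster.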
5.2 CLI Investigation
Interpretation: previous logs are usually more valuable than current logs during restart loops because they contain the container exit path.
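A minimal sketch of fetching previous-instance logs, assuming a hypothetical helper name and that the namespace, pod, and container are already known from the pod listing:

```shell
#!/usr/bin/env bash
# Fetch logs from the previous (terminated) instance of a container.
# Usage: previous_logs <namespace> <pod> <container>
previous_logs() {
  kubectl logs "$2" \
    --namespace "$1" \
    --container "$3" \
    --previous \
    --tail=200
}
```

If the container has never restarted, `kubectl logs --previous` returns an error instead of output, which is itself a useful signal that the pod is not in a restart loop.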
Interpretation: look for probe failures, image pull errors, FailedScheduling, NodeNotReady, or backend controller warnings near the incident start time.
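The event review described above might be scripted as follows. The function names are illustrative. The `cluster-autoscaler-status` ConfigMap in `kube-system` is where the autoscaler is generally expected to publish its status on AKS; treat that name as an assumption to verify on your cluster.

```shell
#!/usr/bin/env bash
# Event and autoscaler status helpers. Function names are hypothetical.

# All recent events, sorted so the newest appear last.
recent_events() {
  kubectl get events --all-namespaces --sort-by=.lastTimestamp
}

# Only scheduling failures, which usually precede a scale-up attempt.
failed_scheduling_events() {
  kubectl get events --all-namespaces \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp
}

# Autoscaler's own view of node group health and recent scale activity.
# Assumes the ConfigMap name cluster-autoscaler-status.
autoscaler_status() {
  kubectl get configmap cluster-autoscaler-status \
    --namespace kube-system \
    --output yaml
}
```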
az vmss list-instances \
--resource-group "$NODE_RESOURCE_GROUP" \
--name "$VMSS_NAME" \
--query "[].{instanceId:instanceId,provisioningState:provisioningState,latestModelApplied:latestModelApplied}" \
--output table
Interpretation: when the problem is node- or ingress-related, VMSS state and model drift provide important Azure-side evidence.
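6. Validation and Disproof by Hypothesis
Node pool limits or autoscaler profile settings prevent scale-out
Proves if: The pool's current node count already equals its configured max count while pods stay `Pending`, or autoscaler events report that scale-up was skipped because the node group reached its maximum size.
Disproves if: The pool is below its max count and the autoscaler still adds no nodes, which points to pod constraints, quota, or subnet capacity instead.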
Pod constraints cannot be satisfied by any node pool
Proves if: `kubectl describe pod` reports node selector, taint, zone, or resource-request mismatches, and no existing pool's labels, taints, zones, or VM size could satisfy them even on a new node.
Disproves if: A new node from an existing pool would fit the pod, so the block lies in pool limits, quota, or subnet capacity.
Azure quota or subnet capacity prevents new nodes
Proves if: Azure CLI shows vCPU quota or subnet address space at its limit, or VMSS instances fail provisioning when the autoscaler requests them.
Disproves if: Quota and subnet headroom are available and recent VMSS instances provision successfully.
Scale-in blocked by workload protections
Proves if: Underused nodes still run pods covered by restrictive PDBs or pods that use local storage, so the nodes never become removable.
Disproves if: Underused nodes hold only freely evictable pods and scale-in still does not occur, which points to autoscaler profile settings.
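The quota, subnet, and PDB discriminators can be checked with a few read-only commands. This is a sketch under stated assumptions: `$LOCATION`, `$VNET_RG`, `$VNET_NAME`, and `$SUBNET_NAME` are hypothetical variables, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Read-only checks for the quota, subnet, and scale-in hypotheses.
# All variable and function names are hypothetical.

# Regional compute usage versus limits (includes vCPU families).
vcpu_usage() {
  az vm list-usage --location "$LOCATION" --output table
}

# Address prefix of the node subnet, to compare against node and pod IP demand.
subnet_prefix() {
  az network vnet subnet show \
    --resource-group "$VNET_RG" \
    --vnet-name "$VNET_NAME" \
    --name "$SUBNET_NAME" \
    --query "{name:name,prefix:addressPrefix}" \
    --output table
}

# PDBs in every namespace; zero allowed disruptions can block node drain.
list_pdbs() {
  kubectl get poddisruptionbudgets --all-namespaces
}
```

In the PDB listing, an `ALLOWED DISRUPTIONS` value of 0 on a workload running on an underused node is consistent with the scale-in hypothesis.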
7. Likely Root Cause Patterns
| Pattern | Evidence | Resolution |
|---|---|---|
| Manifest drift after a rollout | New revision correlates with events, logs, or controller errors | Revert or patch the manifest and validate against staging first |
| Pool-level capacity mismatch | Pending pods, high utilization, or NotReady nodes align to one pool | Tune requests, autoscaler limits, or node pool shape |
| Network or DNS drift | Ingress, image pull, or dependency lookups fail while pods otherwise look normal | Correct DNS, NSG, route, or ingress controller configuration |
| Operational blind spot | Teams deleted or recreated resources before collecting evidence | Add a first-response checklist and automation for evidence capture |
8. Immediate Mitigations and Step-by-Step Resolution
- Inspect pending pod events and the target node pool autoscaler settings together.
- Increase max counts or add a suitable pool only after proving that workload constraints are legitimate.
- Resolve quota or subnet exhaustion before retrying scale actions.
- Tune PDBs, requests, and daemonset placement so scale-in can happen safely.
- Review cost and reliability implications after every autoscaler policy change.
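Example resolution commands. Changing bounds on an autoscaler-enabled pool requires `--update-cluster-autoscaler` together with both `--min-count` and `--max-count`; `$MIN_COUNT` stands for the pool's existing minimum:
az aks nodepool update \
--resource-group "$RG" \
--cluster-name "$CLUSTER_NAME" \
--name "$NODEPOOL_NAME" \
--update-cluster-autoscaler \
--min-count "$MIN_COUNT" \
--max-count 10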
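After changing pool bounds, it can help to confirm what the node pool now reports before waiting on the autoscaler. A minimal sketch, reusing the `$RG`, `$CLUSTER_NAME`, and `$NODEPOOL_NAME` variables from the command above; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Confirm the autoscaler bounds and current node count for one pool.
show_pool_bounds() {
  az aks nodepool show \
    --resource-group "$RG" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$NODEPOOL_NAME" \
    --query "{autoscale:enableAutoScaling,min:minCount,max:maxCount,count:count}" \
    --output table
}
```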
9. Prevention Checklist
- [ ] Create saved Container Insights queries for the symptom family and link them in the team runbook.
- [ ] Require long-flag CLI examples and standardized evidence capture in incident response docs.
- [ ] Review ingress, autoscaler, probes, and node pool settings during every production readiness review.
- [ ] Alert on restart spikes, `NotReady` nodes, and `FailedScheduling` events before customers report impact.
- [ ] Document which changes require platform-team approval, especially around networking, ingress, and security policy.