Node Not Ready
1. Summary
Use this playbook when one or more AKS nodes report NotReady, workloads are being evicted, or pods stop scheduling on a specific pool. The failure is usually a kubelet health issue, disk or memory pressure, CNI connectivity break, or an Azure infrastructure dependency problem visible through VMSS health and Container Insights.
Typical incident window: 10-20 minutes to establish whether the issue is workload-specific, node-specific, or cluster-wide. Time to resolution: 30 minutes to several hours depending on whether the fix is manifest-level, node-level, or Azure control-plane level.
Symptoms
- kubectl get nodes shows one or more nodes in NotReady state.
- Pods on the node report NodeNotReady, Unknown, or repeated eviction events.
- Container Insights shows a node dropping out of KubeNodeInventory or disk pressure metrics spiking.
- VMSS instance view or Azure activity history shows recent extension, reboot, or networking changes.
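The first symptom can be checked quickly by filtering the node list. The sketch below is a hypothetical helper (the function name is not part of this playbook) and assumes the default `kubectl get nodes` column order, where the second column is STATUS.

```shell
# Print the name and status of every node whose STATUS does not start with
# "Ready". Cordoned but healthy nodes ("Ready,SchedulingDisabled") are skipped.
# Assumes the default column order: NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1, $2 }'
}
```

Calling `not_ready_nodes` with no arguments prints one line per affected node, which is convenient to paste into the incident record.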
Diagnostic flowchart
flowchart TD
A[Reported symptom] --> B{Can the object be reproduced now?}
B -->|No| C[Use recent events, Container Insights, and rollout history]
B -->|Yes| D[Capture current state with kubectl and Azure CLI]
D --> E{Is the issue isolated to one workload or node pool?}
E -->|Workload| F[Check image, probes, config, and service wiring]
E -->|Node pool| G[Check node health, autoscaler, subnet, and VMSS state]
F --> H[Validate with KQL and controller logs]
G --> H
H --> I[Apply targeted fix and verify telemetry returns to baseline]
2. Common Misreadings
| Observation | Often Misread As | Actually Means |
|---|---|---|
| Symptom appears in one namespace | Entire cluster outage | The issue may still be isolated to one rollout, one pool, or one ingress class. |
| Azure portal shows cluster healthy | Workload path is healthy | Control plane health does not prove pod, node, or ingress behavior. |
| Restart or reschedule seems to help briefly | Root cause is fixed | Many AKS issues recur until the underlying manifest, node, or network condition is corrected. |
| Monitoring has partial data | Monitoring is the problem | Partial Container Insights data is itself useful evidence about scope and timing. |
3. Competing Hypotheses
| Hypothesis | Likelihood | Key Discriminator |
|---|---|---|
| Kubelet or container runtime on the node is unhealthy | High | Node conditions stop updating and kubelet-related events appear first. |
| Disk pressure or image layer buildup prevents normal kubelet behavior | High | Node condition or InsightsMetrics shows disk usage saturation. |
| CNI or route failure blocks node heartbeat and pod networking | Medium | Pods lose network reachability and node events align with CNI logs. |
| Underlying VMSS or host maintenance event disrupted the node | Medium | Azure instance view or activity logs show platform action near incident start. |
4. What to Check First
- Confirm the current object state from Kubernetes
- Describe the affected object to capture recent events
- Check AKS cluster and node pool configuration from Azure
- List node pools and autoscaler settings
- Run a fast Container Insights control query
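The five checks above can be sketched as one helper. This is a minimal sketch, not the playbook's own commands: the function name, its four arguments, the selected output fields, and the 15-minute window in the KQL query are all assumptions made for illustration.

```shell
# Hypothetical helper. Arguments: node name, resource group, cluster name,
# Log Analytics workspace ID (GUID). All four are caller-supplied placeholders.
first_checks() {
  node="$1"; rg="$2"; cluster="$3"; workspace="$4"

  # 1. Current object state from Kubernetes
  kubectl get nodes --output wide

  # 2. Recent events and conditions for the affected node
  kubectl describe node "$node"

  # 3. AKS cluster configuration from Azure
  az aks show --resource-group "$rg" --name "$cluster" \
    --query "{state:provisioningState,power:powerState.code,version:kubernetesVersion}" \
    --output table

  # 4. Node pools and autoscaler settings
  az aks nodepool list --resource-group "$rg" --cluster-name "$cluster" \
    --query "[].{name:name,count:count,autoscale:enableAutoScaling,min:minCount,max:maxCount}" \
    --output table

  # 5. Fast Container Insights control query: last inventory record per node
  az monitor log-analytics query --workspace "$workspace" \
    --analytics-query "KubeNodeInventory | where TimeGenerated > ago(15m) | summarize LastSeen=max(TimeGenerated) by Computer, Status" \
    --output table
}
```

A call might look like `first_checks "$NODE" "$RG" "$CLUSTER_NAME" "$WORKSPACE_ID"`. Redirecting the output to a timestamped file keeps the evidence available after the node is recycled.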
5. Evidence to Collect
5.1 KQL Queries
KubePodInventory
| where TimeGenerated > ago(30m)
| summarize Restarts=max(ContainerRestartCount), LastSeen=max(TimeGenerated) by ClusterName, Namespace, PodName, ContainerName
| order by Restarts desc
| Column | Example value | Interpretation |
|---|---|---|
| Restarts | 14 | Confirms the issue is current and identifies which container is unstable. |
| LastSeen | 2026-04-07 09:41:00 | Shows how fresh the inventory signal is. |
| Namespace | payments | Helps isolate whether blast radius is limited. |
How to Read This
Start by proving scope. If restart or state anomalies are limited to one namespace or one pool, avoid cluster-wide changes first.
ContainerLogV2
| where TimeGenerated > ago(30m)
| summarize LogLines=count(), LastSeen=max(TimeGenerated) by Namespace, PodName
| order by LastSeen desc
| Column | Example value | Interpretation |
|---|---|---|
| LogLines | 152 | Confirms whether the pod is emitting logs during failure. |
| LastSeen | recent timestamp | Stale logs can indicate the container never reaches full runtime. |
How to Read This
Pair this query with kubectl logs --previous so you do not confuse current healthy logs with the failing previous container instance.
KubeEvents
| where TimeGenerated > ago(30m)
| where Reason in ("Failed", "BackOff", "Unhealthy", "NodeNotReady", "FailedScheduling")
| project TimeGenerated, Namespace, Name, Reason, Message
| order by TimeGenerated desc
| Column | Example value | Interpretation |
|---|---|---|
| Reason | BackOff | Indicates repeated restart attempts or scheduling failures depending on the object. |
| Message | Back-off restarting failed container | Often provides the shortest path to the likely hypothesis. |
How to Read This
Events often age out faster than logs. Capture them early in the incident before recreating pods or nodes.
5.2 CLI Investigation
Interpretation: previous logs are usually more valuable than current logs during restart loops because they contain the container exit path.
Interpretation: look for probe failures, image pull errors, FailedScheduling, NodeNotReady, or backend controller warnings near the incident start time.
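The two interpretations above apply to pod log and event output. A minimal sketch of commands that produce that output follows; the helper name, its arguments, and the 200-line tail are assumptions, not commands taken from this playbook.

```shell
# Hypothetical helper. Arguments: namespace and pod name (caller-supplied).
capture_evidence() {
  namespace="$1"; pod="$2"

  # Logs from the previous container instance, which hold the exit path
  # during restart loops.
  kubectl logs "$pod" --namespace "$namespace" --previous --tail=200

  # Namespace events in time order: probe failures, image pull errors,
  # FailedScheduling.
  kubectl get events --namespace "$namespace" --sort-by=.lastTimestamp

  # Node-level events such as NodeNotReady, across all namespaces.
  kubectl get events --all-namespaces --field-selector involvedObject.kind=Node
}
```

If the pod has no previous container instance, the first command reports an error and the two event queries still run.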
az vmss list-instances \
--resource-group "$NODE_RESOURCE_GROUP" \
--name "$VMSS_NAME" \
--query "[].{instanceId:instanceId,provisioningState:provisioningState,latestModelApplied:latestModelApplied}" \
--output table
Interpretation: when the problem is node- or ingress-related, VMSS state and model drift provide important Azure-side evidence.
6. Validation and Disproof by Hypothesis
Kubelet or container runtime on the node is unhealthy
Proves if: Node conditions stop updating and kubelet-related events appear before any disk, network, or platform signal.
Disproves if: Node conditions keep updating, or a disk, CNI, or VMSS signal explains the timing more directly.
Disk pressure or image layer buildup prevents normal kubelet behavior
Proves if: The node condition or InsightsMetrics shows disk usage saturation at or before the time the node went NotReady.
Disproves if: Disk usage stays below saturation through the incident window, or another signal explains the timing more directly.
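Node conditions are the quickest discriminator between the kubelet and disk-pressure hypotheses. The sketch below assumes two hypothetical helper names and relies on the standard `.status.conditions` field of the Node object.

```shell
# Print every condition on a node as TYPE=STATUS, one per line.
node_conditions() {
  kubectl get node "$1" \
    --output jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Keep only the conditions that indicate trouble: Ready that is not True,
# or any other condition (DiskPressure, MemoryPressure, ...) that is True.
unhealthy_conditions() {
  node_conditions "$1" | awk -F= '
    $1 == "Ready" && $2 != "True" { print }
    $1 != "Ready" && $2 == "True" { print }'
}
```

A result containing only `Ready=Unknown` suggests the kubelet stopped reporting, while `DiskPressure=True` points at the disk hypothesis.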
CNI or route failure blocks node heartbeat and pod networking
Proves if: Pods on the node lose network reachability and node events align with errors in the CNI logs.
Disproves if: Pod networking on the node stays reachable, or the CNI logs are clean around the incident start.
Underlying VMSS or host maintenance event disrupted the node
Proves if: The Azure instance view or activity logs show a platform action, such as a reboot or extension change, near the incident start.
Disproves if: No platform action appears near the incident start, or another signal explains the timing more directly.
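Azure-side evidence for the VMSS hypothesis can be gathered as sketched below. The helper name, its arguments, the selected fields, and the 2-hour lookback are assumptions chosen for illustration.

```shell
# Hypothetical helper. Arguments: node resource group, VMSS name, instance ID.
platform_evidence() {
  node_rg="$1"; vmss="$2"; instance_id="$3"

  # Per-instance status codes (provisioning and power state) with timestamps.
  az vmss get-instance-view \
    --resource-group "$node_rg" \
    --name "$vmss" \
    --instance-id "$instance_id" \
    --query "statuses[].{code:code,status:displayStatus,time:time}" \
    --output table

  # Control-plane operations on the node resource group in the last 2 hours.
  az monitor activity-log list \
    --resource-group "$node_rg" \
    --offset 2h \
    --query "[].{time:eventTimestamp,operation:operationName.localizedValue,status:status.value}" \
    --output table
}
```

Compare the timestamps in both tables with the first NodeNotReady event. A platform operation that precedes it by a few minutes supports this hypothesis.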
7. Likely Root Cause Patterns
| Pattern | Evidence | Resolution |
|---|---|---|
| Manifest drift after a rollout | New revision correlates with events, logs, or controller errors | Revert or patch the manifest and validate against staging first |
| Pool-level capacity mismatch | Pending pods, high utilization, or NotReady nodes align to one pool | Tune requests, autoscaler limits, or node pool shape |
| Network or DNS drift | Ingress, image pull, or dependency lookups fail while pods otherwise look normal | Correct DNS, NSG, route, or ingress controller configuration |
| Operational blind spot | Teams deleted or recreated resources before collecting evidence | Add a first-response checklist and automation for evidence capture |
8. Immediate Mitigations and Step-by-Step Resolution
- Cordon and drain the unhealthy node if workloads can move safely.
- Verify kubelet and CNI health, then remediate disk pressure or extension failures on the backing VMSS instance.
- If networking is broken, inspect subnet routes, NSGs, and CNI daemon logs before forcing image upgrades.
- Recycle or reimage the node only after capturing evidence required for post-incident learning.
- Confirm the node returns to Ready and that pod density and autoscaler behavior normalize.
Example resolution commands:
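# --max-count is accepted only together with --update-cluster-autoscaler and --min-count.
az aks nodepool update \
--resource-group "$RG" \
--cluster-name "$CLUSTER_NAME" \
--name "$NODEPOOL_NAME" \
--update-cluster-autoscaler \
--min-count "$MIN_COUNT" \
--max-count 10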
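The cordon-and-drain step in the mitigation list can be sketched as follows. The helper name, the dry-run default, and the 300-second timeout are assumptions. Note that `--delete-emptydir-data` discards any emptyDir volume contents on the drained pods.

```shell
# Hypothetical helper. Arguments: node name, optional dry-run mode.
# Mode "client" (the default) only previews; "none" applies the change.
cordon_and_drain() {
  node="$1"
  mode="${2:-client}"

  # Drain runs only if the cordon step succeeds.
  kubectl cordon "$node" --dry-run="$mode" &&
    kubectl drain "$node" --dry-run="$mode" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --timeout=300s
}
```

Run `cordon_and_drain "$NODE"` first to preview which pods would be evicted, then `cordon_and_drain "$NODE" none` once evidence has been captured. After the node recovers, `kubectl uncordon "$NODE"` makes it schedulable again.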
9. Prevention Checklist
- [ ] Create saved Container Insights queries for the symptom family and link them in the team runbook.
- [ ] Require long-flag CLI examples and standardized evidence capture in incident response docs.
- [ ] Review ingress, autoscaler, probes, and node pool settings during every production readiness review.
- [ ] Alert on restart spikes, NotReady nodes, and FailedScheduling events before customers report impact.
- [ ] Document which changes require platform-team approval, especially around networking, ingress, and security policy.