Node Not Ready
1. Summary
A node marked NotReady is a cluster-capacity and reliability risk. The cause may be kubelet health, CNI problems, resource pressure, or Azure VM-level issues.
flowchart TD
A[Symptom] --> B[Hypotheses]
B --> C[Evidence]
C --> D[Disprove weak paths]
D --> E[Mitigation]
2. Common Misreadings
- The first visible symptom is the root cause.
- Restarting the pod proves the issue is fixed.
- If one namespace is affected, the cluster is healthy.
3. Competing Hypotheses
- H1: Kubelet or node services are unhealthy.
- H2: Disk, memory, or PID pressure caused readiness degradation.
- H3: CNI or DNS components on the node failed.
- H4: Underlying VM or network resource issues exist in Azure.
4. What to Check First
5. Evidence to Collect
- Node conditions and taints.
- Recent events tied to the node.
- kube-system pod health on the affected node.
- Azure VMSS instance or NIC status if the issue persists.
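The first two evidence items can be read straight from the Node object. The following is a minimal sketch, assuming the node has been exported with `kubectl get node <name> -o json`; the helper name `summarize_node` and the sample node are hypothetical and exist only for illustration.

```python
import json

# Conditions other than "Ready" are healthy when their status is "False".
PROBLEM_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable")


def summarize_node(node: dict) -> dict:
    """Pull readiness, active problem conditions, and taints out of a Node object."""
    conditions = {c["type"]: c for c in node.get("status", {}).get("conditions", [])}
    ready = conditions.get("Ready", {})
    return {
        "name": node["metadata"]["name"],
        # A missing Ready condition is reported as "Unknown" rather than guessed.
        "ready": ready.get("status", "Unknown"),
        "reason": ready.get("reason", ""),
        "problems": [t for t in PROBLEM_TYPES
                     if conditions.get(t, {}).get("status") == "True"],
        "taints": [f"{t['key']}:{t['effect']}"
                   for t in node.get("spec", {}).get("taints") or []],
    }


# Hypothetical sample, trimmed to the fields used above.
SAMPLE_NODE = json.loads("""
{
  "metadata": {"name": "example-node-1"},
  "spec": {"taints": [{"key": "node.kubernetes.io/not-ready", "effect": "NoSchedule"}]},
  "status": {"conditions": [
    {"type": "MemoryPressure", "status": "False", "reason": "KubeletHasSufficientMemory"},
    {"type": "DiskPressure", "status": "True", "reason": "KubeletHasDiskPressure"},
    {"type": "Ready", "status": "False", "reason": "KubeletNotReady"}
  ]}
}
""")

if __name__ == "__main__":
    print(summarize_node(SAMPLE_NODE))
```

Events, kube-system pod health, and Azure VMSS or NIC status come from other sources and are not covered by this sketch.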
6. Validation and Disproof by Hypothesis
- If pressure conditions (MemoryPressure, DiskPressure, or PIDPressure) are present, resource exhaustion (H2) is more likely than API auth issues.
- If only one node in one pool is affected, compare it to healthy nodes in the same pool.
- If all nodes in a pool degrade together, inspect pool-wide image or network changes.
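The single-node versus pool-wide distinction can be made mechanical. This is a rough sketch, assuming each node has already been reduced to a small record with a pool name and a readiness flag; the record shape, the function name `failure_scope`, and the four labels it returns are illustrative choices, not part of any Kubernetes or Azure API.

```python
from collections import defaultdict


def failure_scope(nodes: list) -> dict:
    """Map each pool name to 'healthy', 'single-node', 'pool-wide', or 'partial'."""
    totals = defaultdict(int)
    not_ready = defaultdict(int)
    for node in nodes:
        totals[node["pool"]] += 1
        if not node["ready"]:
            not_ready[node["pool"]] += 1

    scope = {}
    for pool, total in totals.items():
        bad = not_ready[pool]
        if bad == 0:
            scope[pool] = "healthy"
        elif bad == 1:
            # Checked before the pool-wide case, so a one-node pool with its
            # only node down is reported as single-node.
            scope[pool] = "single-node"
        elif bad == total:
            scope[pool] = "pool-wide"
        else:
            scope[pool] = "partial"
    return scope


# Hypothetical inventory for illustration.
NODES = [
    {"name": "a-0", "pool": "system", "ready": True},
    {"name": "a-1", "pool": "system", "ready": False},
    {"name": "b-0", "pool": "batch", "ready": False},
    {"name": "b-1", "pool": "batch", "ready": False},
    {"name": "c-0", "pool": "web", "ready": True},
]

if __name__ == "__main__":
    print(failure_scope(NODES))
```

A "single-node" result points toward comparing the node with healthy peers in the same pool; a "pool-wide" result points toward image or network changes that touched the whole pool.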
7. Likely Root Cause Patterns
- Resource pressure from runaway workloads.
- CNI/daemonset failure after upgrade.
- VMSS instance issues or subnet-level networking trouble.
- Node image drift or failed extension updates.
8. Immediate Mitigations
- Cordon and drain if the node is unstable.
- Scale the pool out if capacity is tight.
- Repair or replace the node if it does not recover quickly.
- Validate daemonset health after recovery.
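The cordon, drain, and post-recovery check steps map onto standard `kubectl` commands. The sketch below only builds the command lines so they can be reviewed before anything runs; the function names and the 120-second drain timeout are assumptions for illustration, and the drain flags should be checked against the workloads on the node (for example, `--delete-emptydir-data` discards emptyDir contents).

```python
import shlex


def isolate_commands(node: str, drain_timeout: str = "120s") -> list:
    """Build the commands that stop scheduling onto a node and evict its pods."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node,
         "--ignore-daemonsets",       # daemonset pods cannot be evicted by drain
         "--delete-emptydir-data",    # allow eviction of pods using emptyDir volumes
         f"--timeout={drain_timeout}"],
    ]


def post_recovery_check(node: str) -> list:
    """Build the command that lists kube-system pods scheduled on the node."""
    return ["kubectl", "get", "pods",
            "--namespace", "kube-system",
            "--field-selector", f"spec.nodeName={node}",
            "--output", "wide"]


if __name__ == "__main__":
    # "example-node-1" is a placeholder node name.
    for cmd in isolate_commands("example-node-1"):
        print(shlex.join(cmd))
    print(shlex.join(post_recovery_check("example-node-1")))
```

Once reviewed, each list can be passed to `subprocess.run(cmd, check=True)`. Scaling the pool and repairing or replacing the VM are Azure-side operations and are left out of this sketch.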
9. Prevention
- Alert on node conditions before workloads are impacted.
- Keep daemonsets and node images current.
- Review pool isolation for noisy workloads.
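One way to alert before workloads are impacted is to flag nodes that still report Ready but already show a pressure condition. This is a minimal sketch, assuming the input is the `items` list from `kubectl get nodes -o json`; the function name `early_warnings` and the sample data are hypothetical, and a real alerting setup would more likely live in the monitoring stack than in a script.

```python
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def early_warnings(nodes: list) -> list:
    """Return (node name, condition type) pairs for Ready nodes already under pressure."""
    warnings = []
    for node in nodes:
        conditions = {c["type"]: c["status"]
                      for c in node.get("status", {}).get("conditions", [])}
        if conditions.get("Ready") != "True":
            # Already NotReady or Unknown: that is an incident, not an early warning.
            continue
        for ctype in PRESSURE_TYPES:
            if conditions.get(ctype) == "True":
                warnings.append((node["metadata"]["name"], ctype))
    return warnings


# Hypothetical sample with only the fields used above.
SAMPLE_ITEMS = [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"},
                               {"type": "MemoryPressure", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "False"},
                               {"type": "DiskPressure", "status": "True"}]}},
    {"metadata": {"name": "node-c"},
     "status": {"conditions": [{"type": "Ready", "status": "True"},
                               {"type": "DiskPressure", "status": "False"}]}},
]

if __name__ == "__main__":
    print(early_warnings(SAMPLE_ITEMS))
```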