Node Not Ready

1. Summary

A node in the NotReady state reduces schedulable capacity and puts the workloads running on it at risk. Common causes are an unhealthy kubelet, CNI failures, resource pressure, and Azure VM-level problems.

flowchart TD
    A[Symptom] --> B[Hypotheses]
    B --> C[Evidence]
    C --> D[Disprove weak paths]
    D --> E[Mitigation]

2. Common Misreadings

The first visible symptom is the root cause.
Restarting the pod proves the issue is fixed.
If only one namespace is affected, the rest of the cluster is healthy.

3. Competing Hypotheses

H1: Kubelet or node services are unhealthy.
H2: Disk, memory, or PID pressure caused readiness degradation.
H3: CNI or DNS components on the node failed.
H4: Underlying VM or network resource issues exist in Azure.

4. What to Check First

kubectl get nodes
kubectl describe node <node-name>
kubectl get pods -n kube-system -o wide
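
On a larger cluster it can help to narrow the node list to the nodes that are not Ready. The following is a minimal sketch, assuming the default `kubectl get nodes` column layout (NAME first, STATUS second); the function name `not_ready_nodes` is hypothetical.

```shell
# Print the names of nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Assumes the default column layout: NAME first, STATUS second.
not_ready_nodes() {
  awk '{
    # STATUS can carry a suffix, e.g. "Ready,SchedulingDisabled"
    split($2, status, ",")
    if (status[1] != "Ready") print $1
  }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

A cordoned but healthy node (`Ready,SchedulingDisabled`) is deliberately not reported, since cordoning alone does not indicate a readiness problem.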

5. Evidence to Collect

Node conditions and taints.
Recent events tied to the node.
kube-system pod health on the affected node.
Azure VMSS instance or NIC status if the issue persists.
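
The evidence items above could be gathered in one pass with a small helper. This is a sketch only: the function name `collect_node_evidence` is hypothetical, and the Azure step is left as a comment because the resource group, scale set name, and instance ID depend on the cluster.

```shell
# Collect conditions, taints, events, and kube-system pods for one node.
# Usage: collect_node_evidence <node-name>
collect_node_evidence() {
  node="$1"

  echo "== Conditions =="
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'

  echo "== Taints =="
  kubectl get node "$node" \
    -o jsonpath='{range .spec.taints[*]}{.key}:{.effect}{"\n"}{end}'

  echo "== Events for the node =="
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$node" \
    --sort-by=.lastTimestamp

  echo "== kube-system pods on the node =="
  kubectl get pods -n kube-system -o wide \
    --field-selector "spec.nodeName=$node"
}

# If the issue persists, the VMSS instance view can be inspected with the
# Azure CLI (all values below are placeholders):
#   az vmss get-instance-view --resource-group <node-resource-group> \
#     --name <vmss-name> --instance-id <instance-id>
```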

6. Validation and Disproof by Hypothesis

If the node reports MemoryPressure, DiskPressure, or PIDPressure, resource exhaustion (H2) is more likely than API auth issues.
If only one node in one pool is affected, compare it to healthy nodes in the same pool.
If all nodes in a pool degrade together, inspect pool-wide image or network changes.
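
The first rule above can be sketched as a small filter that maps node conditions to the hypothesis they most support. The mapping to H1 through H4 is an illustrative heuristic rather than a definitive diagnosis, and the function name `triage_conditions` is hypothetical.

```shell
# Map node conditions to the hypothesis they most support.
# Reads lines of the form "<ConditionType>=<Status>" on stdin.
# The H1-H4 mapping is a heuristic assumption, not a definitive diagnosis.
triage_conditions() {
  awk -F= '
    ($1 == "MemoryPressure" || $1 == "DiskPressure" || $1 == "PIDPressure") && $2 == "True" {
      print "H2: " $1 " is True (resource pressure)"; found = 1
    }
    $1 == "NetworkUnavailable" && $2 == "True" {
      print "H3: NetworkUnavailable is True (node networking not configured)"; found = 1
    }
    $1 == "Ready" && $2 == "Unknown" {
      print "H1/H4: Ready is Unknown (kubelet stopped reporting status)"; found = 1
    }
    END {
      if (!found) print "No single condition stands out; compare with a healthy node in the same pool"
    }
  '
}

# Usage (requires cluster access):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | triage_conditions
```

Not every CNI plugin sets the NetworkUnavailable condition, so its absence does not rule out H3.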

7. Likely Root Cause Patterns

Resource pressure from runaway workloads.
CNI/daemonset failure after upgrade.
VMSS instance issues or subnet-level networking trouble.
Node image drift or failed extension updates.

8. Immediate Mitigations

Cordon and drain if the node is unstable.
Scale the pool out if capacity is tight.
Repair or replace the node if it does not recover quickly.
Validate daemonset health after recovery.
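
The cordon-and-drain step might look like the sketch below. The function name `cordon_and_drain` and the 300-second timeout are assumptions; note that `--delete-emptydir-data` discards any data that pods keep in emptyDir volumes on the node.

```shell
# Cordon a node so nothing new is scheduled on it, then evict its pods.
# Usage: cordon_and_drain <node-name>
cordon_and_drain() {
  node="$1"
  # Stop here if the cordon fails, so the drain never runs on an uncordoned node
  kubectl cordon "$node" || return 1
  kubectl drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s
}

# Related steps (all values are placeholders):
#   az aks nodepool scale --resource-group <resource-group> \
#     --cluster-name <cluster-name> --name <pool-name> --node-count <count>
#   kubectl get daemonsets -n kube-system
```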

9. Prevention

Alert on node conditions before workloads are impacted.
Keep daemonsets and node images current.
Review pool isolation for noisy workloads.
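
One way to keep an eye on daemonset health is a periodic check for daemonsets whose ready or up-to-date pod count is below the desired count. This is a minimal sketch, assuming the default column layout of `kubectl get daemonsets -n kube-system --no-headers` (NAME, DESIRED, CURRENT, READY, UP-TO-DATE, ...); the function name `lagging_daemonsets` is hypothetical.

```shell
# Print daemonsets whose READY or UP-TO-DATE count is below DESIRED.
# Reads `kubectl get daemonsets -n <namespace> --no-headers` output on stdin.
# Assumes columns: NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE ...
lagging_daemonsets() {
  awk '($4 + 0) < ($2 + 0) || ($5 + 0) < ($2 + 0) {
    print $1 " desired=" $2 " ready=" $4 " up-to-date=" $5
  }'
}

# Usage (requires cluster access):
#   kubectl get daemonsets -n kube-system --no-headers | lagging_daemonsets
```

Empty output means every daemonset in the namespace has all desired pods ready and on the current revision.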
