Upgrade Failure
1. Summary
An AKS upgrade stalls, partially completes, or leaves workloads unhealthy. The usual causes are incompatibility with the target version, disruption controls such as PodDisruptionBudgets (PDBs) that block node drain, or insufficient pre-upgrade checks.
flowchart TD
A[Symptom] --> B[Hypotheses]
B --> C[Evidence]
C --> D[Disprove weak paths]
D --> E[Mitigation]
2. Common Misreadings
- The first visible symptom is the root cause.
- Restarting the pod proves the issue is fixed.
- If only one namespace is affected, the rest of the cluster is healthy.
3. Competing Hypotheses
- H1: Deprecated APIs, controllers, or CRDs are incompatible with the target version.
- H2: PDBs or workload topology block draining.
- H3: Node image or daemonset components fail during rollout.
- H4: The cluster was upgraded, but workload validation was insufficient.
4. What to Check First
az aks get-upgrades --resource-group "$RG" --name "$CLUSTER_NAME" --output table
kubectl get events -A --sort-by=.lastTimestamp
kubectl get pdb -A
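The PDB listing above is easy to misread on a busy cluster. The following is a minimal sketch, not a definitive check: it assumes the default table layout of `kubectl get pdb -A`, in which ALLOWED DISRUPTIONS is the second-to-last field of each data row. The helper name `blocking_pdbs` is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pdb -A` output on stdin and print
# namespace/name for every budget that currently allows zero disruptions.
# Assumption: default table layout, where ALLOWED DISRUPTIONS is the
# second-to-last field of each data row. The header row (NR == 1) is skipped.
blocking_pdbs() {
  awk 'NR > 1 && NF >= 6 && $(NF-1) == 0 { print $1 "/" $2 }'
}
```

It could be used as `kubectl get pdb -A | blocking_pdbs`. A budget that allows zero disruptions is a candidate for blocking drain, but it only matters if its pods run on the nodes being upgraded.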
5. Evidence to Collect
- Upgrade history and current version.
- Event stream during drain and rescheduling.
- Controller and daemonset health.
- Application readiness after node replacement.
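For the daemonset health item, one possible approach is to compare ready and desired pod counts per daemonset. This sketch assumes the input has been reduced to four whitespace-separated columns (namespace, name, desired, ready); the helper name `unready_daemonsets` is hypothetical.

```shell
# Hypothetical helper: read "namespace name desired ready" rows on stdin and
# print every daemonset whose ready count is below its desired count.
# Assumption: the caller has already reduced kubectl output to these four
# columns, with no header row.
unready_daemonsets() {
  awk 'NF >= 4 && $4 + 0 < $3 + 0 {
    printf "%s/%s ready=%s desired=%s\n", $1, $2, $4, $3
  }'
}
```

One way to produce the four columns is `kubectl get ds -A --no-headers -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady | unready_daemonsets`. Running it during and after the rollout shows whether a node-level component recovers once nodes are replaced.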
6. Validation and Disproof by Hypothesis
- If nodes cannot drain because of PDBs, that supports H2 and rules out explanations based on version compatibility alone (H1).
- If the platform upgrade succeeded but workloads fail afterwards, focus on workload or controller compatibility (H1, H4).
- If only one node pool fails, isolate the node image or pool-specific constraints (H3).
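To test the single-pool case, it can help to list each pool's version and provisioning state side by side. This is a sketch under stated assumptions: the input is three whitespace-separated columns (pool name, version, provisioning state), the target version is passed as an argument, and the helper name `lagging_pools` is hypothetical.

```shell
# Hypothetical helper: $1 is the target Kubernetes version; stdin carries
# "pool version provisioningState" rows. Prints pools that are not on the
# target version or whose provisioning state is not "Succeeded".
# Appending "" forces a string comparison, so 1.29 and 1.290 are not equal.
lagging_pools() {
  awk -v target="$1" \
    'NF >= 3 && (($2 "") != (target "") || $3 != "Succeeded") { print $1, $2, $3 }'
}
```

The rows could come from `az aks nodepool list --resource-group "$RG" --cluster-name "$CLUSTER_NAME" --query "[].[name, orchestratorVersion, provisioningState]" --output tsv`, piped into `lagging_pools` with the target version as its argument. If exactly one pool is printed, that points towards H3 rather than a cluster-wide cause.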
7. Likely Root Cause Patterns
- Deprecated APIs or unsupported operators.
- Singleton workloads with strict disruption budgets.
- Node-level add-on incompatibility.
- No staged upgrade process.
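For the deprecated API pattern, the Kubernetes API server exposes an `apiserver_requested_deprecated_apis` metric for deprecated APIs that clients have requested. The sketch below filters a metrics dump for those series. It assumes the caller is permitted to read the API server `/metrics` endpoint, which may not hold on every cluster; the helper name `deprecated_api_requests` is hypothetical.

```shell
# Hypothetical helper: read a Prometheus-format metrics dump on stdin and keep
# only the series recording requests to deprecated APIs. Comment lines
# (# HELP, # TYPE) are dropped because they do not start with the metric name.
# `|| true` keeps the exit status zero when nothing matches.
deprecated_api_requests() {
  grep '^apiserver_requested_deprecated_apis{' || true
}
```

A possible invocation is `kubectl get --raw /metrics | deprecated_api_requests`. Each matching line carries `group`, `version`, `resource`, and `removed_release` labels, which can be compared against the target version before the upgrade.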
8. Immediate Mitigations
- Pause the rollout: do not extend the upgrade to further node pools or clusters.
- Stabilize affected workloads or pools.
- Restore capacity and validate critical controllers.
- Rework the upgrade plan before the next attempt.
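When restoring capacity, a stalled drain can leave nodes cordoned. The sketch below lists them, assuming the default `kubectl get nodes --no-headers` layout in which the second field is the status; the helper name `cordoned_nodes` is hypothetical.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose status includes SchedulingDisabled,
# that is, nodes that are cordoned and accept no new pods.
cordoned_nodes() {
  awk 'NF >= 2 && $2 ~ /SchedulingDisabled/ { print $1 }'
}
```

For example, `kubectl get nodes --no-headers | cordoned_nodes` prints one node name per line. Each could then be returned to service with `kubectl uncordon`, but only after confirming that the upgrade operation is no longer acting on that node.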
9. Prevention
- Track the support window of each Kubernetes version in use continuously.
- Test upgrades in lower environments.
- Keep workload APIs and controllers current.
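One way to make upgrades staged and reviewable is to print the plan before running anything. The sketch below only echoes commands: the control plane first, then one node pool at a time. The helper name `print_upgrade_plan` and the argument order are hypothetical, and the version in the usage note is an illustrative value.

```shell
# Hypothetical helper: print, without executing, a staged upgrade plan.
# Arguments: resource group, cluster name, target version, then pool names in
# the order they should be upgraded. The body runs in a subshell, so the
# variables it sets do not leak into the calling shell.
print_upgrade_plan() (
  rg=$1 cluster=$2 version=$3
  shift 3
  # Stage 1: control plane only, leaving node pools on their current version.
  printf 'az aks upgrade --resource-group %s --name %s --kubernetes-version %s --control-plane-only\n' \
    "$rg" "$cluster" "$version"
  # Stage 2: one command per node pool, to be run and validated one at a time.
  for pool in "$@"; do
    printf 'az aks nodepool upgrade --resource-group %s --cluster-name %s --name %s --kubernetes-version %s\n' \
      "$rg" "$cluster" "$pool" "$version"
  done
)
```

For instance, `print_upgrade_plan "$RG" "$CLUSTER_NAME" 1.29.2 canary workers` prints three commands. Running them one at a time in a lower environment, with workload validation between stages, gives each stage a chance to fail before the next begins.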