Network Connectivity Issues
Symptoms
- Administrative or workload impact is visible to users or operators.
- The VM is deployed, but one part of the expected control plane or data path is failing.
- You need a fast way to narrow the problem before making a risky change.
flowchart TD
A[Network Connectivity Issues] --> B[Confirm current symptom and blast radius]
B --> C[Collect platform evidence first]
C --> D[Collect guest or workload evidence]
D --> E[Map findings to the most likely hypothesis]
E --> F[Apply the smallest safe fix]
F --> G[Validate recovery and prevention actions]
1. Summary
Use this playbook when a VM cannot reach internal or external endpoints, dependencies time out, or east-west communication breaks after NSG, NIC, route, or DNS changes.
It covers NIC effective configuration, NSG and route review, DNS, accelerated networking, and dependency reachability.
2. Common Misreadings
| Observation | Often misread as | Actually means |
|---|---|---|
| One failed probe or one stale metric | Total VM outage | The issue may be scoped to one path or one recovery dependency. |
| A successful extension deployment | Guest health is good | Extensions can succeed while the underlying guest service still fails. |
| A recent change record | Guaranteed root cause | Recent changes are strong leads, but they still need proof from evidence. |
| A restart fixes the issue | Permanent resolution | Recovery after restart may only hide the real structural cause. |
3. Competing Hypotheses
| Hypothesis | Likelihood | Key discriminator |
|---|---|---|
| Control-plane or configuration drift | High | Azure resource state no longer matches the intended pattern. |
| Guest OS or agent issue | High | Guest or serial evidence shows service, boot, or firewall failure. |
| Capacity or platform dependency bottleneck | Medium | Metrics or SKU limits explain the symptom better than configuration drift. |
| Security control blocked expected behavior | Medium | NSG, ASG, JIT, or policy state changed before the incident. |
| External dependency issue | Low | VM appears healthy, but a downstream service path is broken. |
4. What to Check First
- Review VM instance view
- Review boot diagnostics settings
- Review NIC effective security rules
- Review recent platform changes
5. Evidence to Collect
5.1 KQL Queries
// NSG flow or denied connection signals
AzureDiagnostics
| where TimeGenerated > ago(6h)
| where Category contains "NetworkSecurityGroup" or Category contains "NetworkWatcher"
| project TimeGenerated, Category, Resource, OperationName, ResultDescription
| order by TimeGenerated desc
| Field | Interpretation |
|---|---|
| TimeGenerated | Incident sequence and correlation window. |
| Resource identifier | Confirms the signal belongs to the affected VM. |
| Operation or metric value | Explains whether the failure is change-driven, capacity-driven, or guest-driven. |
How to read this
Compare these results with the last known healthy window. A change in shape matters more than a single absolute value.
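One way to make "change in shape" concrete is to compare each operation's share of events in the baseline window with its share in the incident window. The sketch below assumes the query results were exported as lists of dicts; the function name `shape_shift` and the sample operation values `allow` and `deny` are hypothetical, not values the query is guaranteed to return.

```python
from collections import Counter


def shape_shift(baseline_rows, incident_rows, key="OperationName"):
    """Return the change in each key's share of events between two windows."""

    def shares(rows):
        counts = Counter(row[key] for row in rows)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()} if total else {}

    base, inc = shares(baseline_rows), shares(incident_rows)
    return {
        k: round(inc.get(k, 0.0) - base.get(k, 0.0), 3)
        for k in sorted(set(base) | set(inc))
    }


# Hypothetical sample rows: the deny share grows from 10% to 50%.
baseline = [{"OperationName": "allow"}] * 9 + [{"OperationName": "deny"}]
incident = [{"OperationName": "allow"}] * 5 + [{"OperationName": "deny"}] * 5
print(shape_shift(baseline, incident))
```

A large positive shift for a deny-type operation is a lead worth following even when the absolute event count looks unremarkable.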
// Heartbeat freshness per computer during the incident window
Heartbeat
| where TimeGenerated > ago(12h)
| summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId
| order by LastHeartbeat asc
| Field | Interpretation |
|---|---|
| TimeGenerated | Incident sequence and correlation window. |
| Resource identifier | Confirms the signal belongs to the affected VM. |
| Operation or metric value | Explains whether the failure is change-driven, capacity-driven, or guest-driven. |
How to read this
Compare these results with the last known healthy window. A change in shape matters more than a single absolute value.
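The heartbeat result can be reduced to a short list of machines that have gone quiet. This is a minimal sketch, assuming the rows were exported with `Computer` and `LastHeartbeat` fields as in the query above; the 15-minute threshold and the function name `stale_computers` are assumptions to tune for your environment.

```python
from datetime import datetime, timedelta, timezone


def stale_computers(rows, now, max_age=timedelta(minutes=15)):
    """Return computers whose last heartbeat is older than max_age, oldest first."""
    stale = [r for r in rows if now - r["LastHeartbeat"] > max_age]
    return [r["Computer"] for r in sorted(stale, key=lambda r: r["LastHeartbeat"])]


# Hypothetical sample rows.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"Computer": "vm-a", "LastHeartbeat": now - timedelta(minutes=2)},
    {"Computer": "vm-b", "LastHeartbeat": now - timedelta(hours=3)},
    {"Computer": "vm-c", "LastHeartbeat": now - timedelta(minutes=40)},
]
print(stale_computers(rows, now))
```

If every stale machine sits on the same subnet or behind the same NSG, that points toward a shared network path rather than a single guest.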
5.2 CLI Investigation
az vm show --resource-group $RG --name $VM_NAME --show-details --query "{powerState:powerState,vmSize:hardwareProfile.vmSize,priority:priority,provisioningState:provisioningState}" --output json
az network nic show --resource-group $RG --name $NIC_NAME --query "{ipConfigs:ipConfigurations[].privateIPAddress,acceleratedNetworking:enableAcceleratedNetworking,networkSecurityGroup:networkSecurityGroup.id}" --output json
Interpretation:
- If the VM is not in a usable power state, fix power and boot issues before guest remediation.
- If the NIC or NSG binding is wrong, repair the control plane before changing guest settings.
- If accelerated networking or disk attachment changed after a resize, validate that the new size still supports the intended feature set.
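The three interpretation rules can be applied mechanically to the JSON the two commands return. The sketch below assumes the output was parsed into dicts with the keys used in the `--query` expressions above; the function name `triage` and the parameters `expected_nsg_id` and `expect_accelerated` are hypothetical and stand in for your documented baseline.

```python
def triage(vm, nic, expected_nsg_id, expect_accelerated):
    """Apply the three interpretation rules in order and return findings."""
    findings = []
    # powerState is only populated when az vm show runs with --show-details.
    if vm.get("powerState") != "VM running":
        findings.append("power or boot: fix before guest remediation")
    # A NIC-level NSG may legitimately be absent when the NSG sits on the subnet.
    if nic.get("networkSecurityGroup") != expected_nsg_id:
        findings.append("binding: repair NIC or NSG in the control plane")
    if bool(nic.get("acceleratedNetworking")) != expect_accelerated:
        findings.append("feature: check that the VM size supports accelerated networking")
    return findings


# Hypothetical sample output from the two commands.
vm = {"powerState": "VM running", "vmSize": "Standard_D4s_v5", "provisioningState": "Succeeded"}
nic = {"ipConfigs": ["10.0.1.4"], "acceleratedNetworking": False, "networkSecurityGroup": None}
print(triage(vm, nic, expected_nsg_id=None, expect_accelerated=True))
```

An empty list means the control-plane view matches the baseline, which shifts attention to guest or dependency evidence.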
6. Validation and Disproof by Hypothesis
Hypothesis 1: Configuration drift or recent change
Proves if: Evidence clearly shows the current state changed from the last healthy pattern in a way that explains the symptom.
Disproves if: Control-plane, guest, and capacity evidence all remain healthy and consistent with baseline.
Recommended validation steps:
- Compare Azure Activity Log against the incident start time.
- Compare current configuration with the approved landing zone pattern.
- Review the most recent guest evidence, boot output, or connection output.
- Re-test the original symptom after the smallest safe correction.
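Comparing the Activity Log against the incident start time amounts to filtering change events to a short window before the incident. This sketch assumes the events were exported as dicts with `time` and `operation` fields; those field names, the two-hour lookback, and the function name `changes_before_incident` are illustrative assumptions.

```python
from datetime import datetime, timedelta


def changes_before_incident(events, incident_start, lookback=timedelta(hours=2)):
    """Return change events inside the lookback window, most recent first."""
    window_start = incident_start - lookback
    hits = [e for e in events if window_start <= e["time"] <= incident_start]
    return sorted(hits, key=lambda e: e["time"], reverse=True)


# Hypothetical sample events.
incident_start = datetime(2024, 1, 1, 12, 0)
events = [
    {"time": datetime(2024, 1, 1, 8, 0), "operation": "Microsoft.Compute/virtualMachines/write"},
    {"time": datetime(2024, 1, 1, 11, 40), "operation": "Microsoft.Network/networkSecurityGroups/securityRules/write"},
    {"time": datetime(2024, 1, 1, 10, 30), "operation": "Microsoft.Network/routeTables/routes/write"},
]
for event in changes_before_incident(events, incident_start):
    print(event["time"], event["operation"])
```

A change that lands in the window is a lead, not a verdict; it still needs to explain the symptom.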
Hypothesis 2: Guest OS or service failure
Proves if: Guest or serial console evidence shows a service, boot, or firewall failure that explains the symptom while the Azure control plane matches the intended pattern.
Disproves if: Guest services, boot output, and the guest firewall are healthy and consistent with baseline.
Recommended validation steps:
- Compare Azure Activity Log against the incident start time.
- Compare current configuration with the approved landing zone pattern.
- Review the most recent guest evidence, boot output, or connection output.
- Re-test the original symptom after the smallest safe correction.
Hypothesis 3: Capacity limit or SKU mismatch
Proves if: Metrics or SKU limits explain the symptom better than configuration drift, for example CPU, memory, or disk saturation that precedes the user-visible failure.
Disproves if: Metrics stay within SKU limits and consistent with baseline throughout the incident window.
Recommended validation steps:
- Compare Azure Activity Log against the incident start time.
- Compare current configuration with the approved landing zone pattern.
- Review the most recent guest evidence, boot output, or connection output.
- Re-test the original symptom after the smallest safe correction.
Hypothesis 4: Security control blocking the expected path
Proves if: NSG, ASG, JIT, or policy state changed before the incident, and the effective rules no longer match the intended admin or workload path.
Disproves if: Effective security rules and policy state match the intended pattern and did not change before the incident.
Recommended validation steps:
- Compare Azure Activity Log against the incident start time.
- Compare current configuration with the approved landing zone pattern.
- Review the most recent guest evidence, boot output, or connection output.
- Re-test the original symptom after the smallest safe correction.
7. Likely Root Cause Patterns
| Pattern | Evidence | Resolution |
|---|---|---|
| Unsupported feature after resize or redeploy | Settings drift, feature not enabled, or SKU capability mismatch | Move back to a supported size or re-enable the feature with validation. |
| NSG, route, or JIT drift | Effective rules do not match the intended admin or workload path | Repair the policy and document the expected flow. |
| Guest service stopped or corrupted | Serial, extension, or guest evidence points to OS-level failure | Repair the guest service, driver, boot loader, or firewall configuration. |
| Performance bottleneck blamed as an outage | CPU, memory, or disk metrics saturate before the user-visible failure | Resize, retier, or redistribute workload pressure. |
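For the NSG drift pattern, it helps to remember that NSG rules are evaluated in priority order, lowest number first, and the first match decides. The sketch below is a deliberately simplified model (one direction, one source prefix, and one port per rule); the rule dict shape, the function name `evaluate`, and the sample rule names are hypothetical and do not mirror the CLI's effective-rules output.

```python
import ipaddress


def evaluate(rules, src_ip, dst_port):
    """Return (rule name, access) of the first matching rule by priority."""
    source = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r["priority"]):
        in_prefix = source in ipaddress.ip_network(rule["source"])
        port_ok = rule["port"] == "*" or int(rule["port"]) == dst_port
        if in_prefix and port_ok:
            return rule["name"], rule["access"]
    return None, "Deny"


# Hypothetical rules: a drifted deny at priority 100 shadows the intended allow.
rules = [
    {"name": "allow-ssh-admin", "priority": 200, "source": "10.0.1.0/24", "port": "22", "access": "Allow"},
    {"name": "deny-ssh-all", "priority": 100, "source": "0.0.0.0/0", "port": "22", "access": "Deny"},
    {"name": "DenyAllInBound", "priority": 65500, "source": "0.0.0.0/0", "port": "*", "access": "Deny"},
]
print(evaluate(rules, "10.0.1.5", 22))
```

Naming the rule that matched, not just the outcome, is what makes the evidence usable in a post-incident record.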
8. Immediate Mitigations
- Check effective NSGs, effective routes, DNS server configuration, and whether the NIC is on the expected subnet.
- Validate accelerated networking if latency or packet processing CPU changed after a resize or NIC recreation.
- Use Network Watcher connectivity tests and packet capture only after you have validated the control plane intent.
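When reading effective routes, the route with the longest matching prefix is the one that carries the traffic. The following sketch illustrates that selection only; it ignores the tie-breaking between user-defined, BGP, and system routes that applies when prefixes are equal. The route dict shape and the function name `select_route` are hypothetical.

```python
import ipaddress


def select_route(routes, dst_ip):
    """Return the route whose prefix is the longest match for dst_ip, or None."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [r for r in routes if dst in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)


# Hypothetical effective routes: a user-defined /24 overrides the VNet-local /16.
routes = [
    {"prefix": "0.0.0.0/0", "next_hop": "Internet"},
    {"prefix": "10.0.0.0/16", "next_hop": "VnetLocal"},
    {"prefix": "10.0.2.0/24", "next_hop": "VirtualAppliance"},
]
print(select_route(routes, "10.0.2.4")["next_hop"])
print(select_route(routes, "8.8.8.8")["next_hop"])
```

If east-west traffic unexpectedly resolves to a virtual appliance hop, check whether a route table was recently associated with the subnet.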
Step-by-step resolution:
- Stabilize the VM or admin path without erasing forensic evidence.
- Correct the control-plane configuration first when Azure intent is clearly wrong.
- Apply guest-side repair only after confirming the platform path is healthy.
- Re-run the original command, probe, or sign-in flow to verify recovery.
- Record the exact evidence that proved the fix, not just that the symptom disappeared.
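Re-running the original probe is easier to record as evidence when it is a small, repeatable check. Here is a minimal TCP reachability sketch, assuming the original symptom was a failed TCP connection; the function name `tcp_probe` and the default timeout are assumptions, and a successful connect proves only that the path is open, not that the application is healthy.

```python
import socket


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call with a hypothetical dependency address, run from the affected VM:
# tcp_probe("10.0.2.4", 1433)
```

Run it before and after the fix from the same source so the two results are directly comparable.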
CLI commands commonly used during fixes:
az vm run-command invoke --resource-group $RG --name $VM_NAME --command-id RunShellScript --scripts "sudo systemctl status walinuxagent"
az vm restart --resource-group $RG --name $VM_NAME
9. Prevention
Prevention checklist
- [ ] Keep a documented healthy baseline for VM size, NIC, disk, and admin-path settings
- [ ] Alert on drift in critical VM security and connectivity controls
- [ ] Test boot diagnostics, Bastion, serial console, and backup restore before production go-live
- [ ] Review SKU feature compatibility before resize operations
- [ ] Capture post-incident evidence and turn it into a reusable guardrail
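The baseline and drift items in the checklist can be backed by a simple comparison of a stored healthy configuration against the current one. This is a sketch under stated assumptions: the flat dict shape, the key names, and the function name `config_drift` are hypothetical, and the sample values are illustrative only.

```python
def config_drift(baseline, current):
    """Return {key: (baseline value, current value)} for every key that differs."""
    keys = sorted(set(baseline) | set(current))
    return {
        k: (baseline.get(k), current.get(k))
        for k in keys
        if baseline.get(k) != current.get(k)
    }


# Hypothetical baseline and current snapshots.
baseline = {"vmSize": "Standard_D4s_v5", "acceleratedNetworking": True, "subnet": "app"}
current = {"vmSize": "Standard_B2s", "acceleratedNetworking": False, "subnet": "app"}
print(config_drift(baseline, current))
```

An alert can fire whenever the returned dict is non-empty for the keys you treat as critical.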