Connectivity Failures¶

1. Summary¶

Use this playbook when workloads cannot reach each other across Azure network boundaries and the failure could be caused by VNet peering, NSG rules, UDRs, firewall policy, or asymmetric path design.

The fastest diagnosis comes from splitting the path into five checks: name resolution, source NIC policy, effective routes, transit path policy, and destination listener health. Azure incidents often start with a symptom that sounds like "the network is down" when only one of those layers is broken.

Symptoms¶

Connections time out or are refused.
Traffic works from one source but fails from another seemingly similar source.
A private endpoint or hybrid path behaves differently after a recent change.
Operators have a healthy-looking control plane but an unhealthy application path.

mermaid flowchart TD A[Connection fails] --> B{Does DNS return the expected private target?} B -->|No| C[Fix name resolution first] B -->|Yes| D{Do effective NSG rules allow the flow?} D -->|No| E[Correct source or destination NSG] D -->|Yes| F{Do effective routes reach the correct next hop?} F -->|No| G[Fix UDR, peering, or gateway transit] F -->|Yes| H{Does a firewall or appliance deny the flow?} H -->|Yes| I[Adjust inspection policy] H -->|No| J[Validate destination listener and health]

2. Common Misreadings¶

Observation	Often Misread As	Actually Means
A VM in the same subnet can connect	Global Azure routing is healthy	The broken path may affect only one NIC, subnet, or peered route domain.
Peering state shows Connected	Peering alone guarantees end-to-end reachability	Peering does not override NSGs, UDRs, firewall policy, or DNS mistakes.
A single ping works	The application path is healthy	ICMP success does not prove TCP 443, SQL 1433, or a load-balanced listener path.
The firewall shows no denies	The firewall is not involved	Traffic may never reach the firewall because a route or NSG blocks it earlier.

3. Competing Hypotheses¶

Hypothesis	Likelihood	Key Discriminator
A source or destination NSG blocks the required flow	High	Effective NSG output lacks an allow rule for the destination port and direction.
A UDR or peering setting sends traffic to the wrong next hop	High	Effective routes show VirtualAppliance, None, or missing remote prefixes unexpectedly.
Gateway transit or forwarded traffic is misconfigured in peering	Medium	Spokes can reach the hub but not remote spoke or on-premises prefixes.
A firewall or NVA policy denies the traffic	Medium	Diagnostics show denies that match the source, destination, or FQDN.
The destination workload is unhealthy or not listening	Medium	The path is allowed but the listener port, health probe, or service process is down.

4. What to Check First¶

Inspect effective routes on the source NIC

az network nic show-effective-route-table \
    --resource-group $RG \
    --name $SOURCE_NIC_NAME

Inspect effective NSG rules on the source NIC

az network nic show-effective-nsg \
    --resource-group $RG \
    --name $SOURCE_NIC_NAME

Inspect peering configuration on both sides

az network vnet peering list \
    --resource-group $RG \
    --vnet-name $VNET_NAME \
    --output table

Run a path test from the source resource

az network watcher test-connectivity \
    --resource-group $RG \
    --source-resource $SOURCE_ID \
    --dest-address $DESTINATION_IP \
    --dest-port 443

Review firewall diagnostics for matching denies

az monitor diagnostic-settings list \
    --resource $FIREWALL_ID

5. Evidence to Collect¶

5.1 KQL Queries¶

Azure Activity changes affecting routing or peering¶

AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue has_any (
    "Microsoft.Network/routeTables/write",
    "Microsoft.Network/virtualNetworks/virtualNetworkPeerings/write",
    "Microsoft.Network/networkSecurityGroups/write"
)
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceGroup, ResourceId
| order by TimeGenerated desc

Column	Interpretation
`OperationNameValue`	Recent writes highlight whether the outage aligns with a route, peering, or NSG change.
`Caller`	Use this to identify who or what automation touched the path.
`ResourceId`	Confirms the exact object changed before symptoms began.

How to Read This

Start with the rows nearest the incident start time. Use them to separate configuration changes from recurring background noise.

Firewall or appliance deny trend¶

AzureDiagnostics
| where TimeGenerated > ago(6h)
| where Category in ("AzureFirewallNetworkRule", "AzureFirewallApplicationRule")
| summarize Denies=countif(action_s == "Deny"), Allows=countif(action_s == "Allow") by Resource, msg_s, bin(TimeGenerated, 15m)
| order by TimeGenerated desc

Column	Interpretation
`Denies`	A spike aligned with the incident window supports firewall involvement.
`msg_s`	The message often reveals rule collection, protocol, or destination context.

How to Read This

Start with the rows nearest the incident start time. Use them to separate configuration changes from recurring background noise.

Network path health from Connection Monitor¶

AzureDiagnostics
| where TimeGenerated > ago(6h)
| where Category == "ConnectionMonitorTestResult"
| project TimeGenerated, TestName=tostring(TestConfigurationName_s), Source=tostring(SourceAddress_s), Destination=tostring(DestinationAddress_s), Status=tostring(ConnectionStatus_s), AvgLatencyMs=todouble(AvgLatencyInMs_d)
| order by TimeGenerated desc

Column	Interpretation
`Status`	Failed or Degraded states help prove the issue is on the monitored path.
`AvgLatencyMs`	Increasing latency before failure suggests a stressed or indirect path.

How to Read This

Start with the rows nearest the incident start time. Use them to separate configuration changes from recurring background noise.

5.2 CLI Investigation¶

Show effective routes for the affected NIC¶

az network nic show-effective-route-table \
    --resource-group $RG \
    --name $SOURCE_NIC_NAME

Sample output:

{"nextHopType":"VirtualAppliance","addressPrefix":"10.40.0.0/16"}

Interpretation:

Unexpected next hops are a common root cause for black holes.
Compare the returned prefixes with the destination network under investigation.

Show effective NSG rules for the same NIC¶

az network nic show-effective-nsg \
    --resource-group $RG \
    --name $SOURCE_NIC_NAME

Sample output:

{"access":"Deny","direction":"Outbound","destinationPortRange":"443"}

Interpretation:

Look for the exact destination port and direction.
If the expected allow is absent, fix policy before touching routes.

Run end-to-end connectivity test¶

az network watcher test-connectivity \
    --resource-group $RG \
    --source-resource $SOURCE_ID \
    --dest-address $DESTINATION_FQDN \
    --dest-port 443

Sample output:

{"connectionStatus":"Unreachable","hops":[{"type":"VirtualAppliance"}]}

Interpretation:

The hop list can show where the path stops progressing.
Use the result to decide whether to inspect firewall, route tables, or destination health next.

6. Validation and Disproof by Hypothesis¶

Hypothesis: Source or destination NSG denies the flow¶

Proves if: Effective NSG results show Deny for the required port and direction, or the expected allow rule is missing.

Disproves if: The same effective NSG output shows an explicit allow and no later deny for the tested path.

az network nic show-effective-nsg \
    --resource-group $RG \
    --name $SOURCE_NIC_NAME

Hypothesis: UDR or peering path is incorrect¶

Proves if: Effective routes lack the destination prefix, or the next hop differs from the intended architecture.

Disproves if: Effective routes show the expected remote prefix and next hop, and test-connectivity reaches the destination.

az network nic show-effective-route-table \
    --resource-group $RG \
    --name $SOURCE_NIC_NAME

Hypothesis: Firewall or NVA policy denies traffic¶

Proves if: Diagnostics show deny events that match the source, destination, or FQDN during the failure window.

Disproves if: Logs show no matching denies and the path fails before the firewall hop.

az monitor log-analytics query \
    --workspace $WORKSPACE_ID \
    --analytics-query "AzureDiagnostics | where TimeGenerated > ago(30m) | where Category has "Firewall" | take 20" \
    --timespan PT30M

Hypothesis: Destination service is unhealthy¶

Proves if: The network path is permitted but listener checks, health probes, or application logs show the endpoint is down.

Disproves if: A probe or listener test proves the destination answers normally once network policy is validated.

az network watcher test-connectivity \
    --resource-group $RG \
    --source-resource $SOURCE_ID \
    --dest-address $DESTINATION_IP \
    --dest-port 443

7. Likely Root Cause Patterns¶

Pattern	Evidence	Resolution
Missing remote prefix in route table	Effective routes show no path to the target CIDR	Add or correct the UDR, peering setting, or gateway advertisement.
Outbound deny in source NSG	Effective NSG output contains Deny outbound for the destination port	Insert the precise allow rule with a documented business reason.
Peering without forwarded traffic or gateway use	Hub reachability works but transitive or hybrid reachability fails	Enable the correct peering flags on both peering objects.
Firewall rule mismatch	Firewall diagnostics show denies for the right path but wrong rule collection or FQDN policy	Update firewall policy and validate with a new connectivity test.
Healthy network, unhealthy destination	Routes and policies are correct but no listener responds	Engage the workload owner and fix the destination service or load balancer target.

8. Immediate Mitigations¶

Restore the last known-good route table or NSG rule set if the outage aligns to a recent change.
Temporarily bypass an incorrect next hop only if the security owner approves the emergency exception.
Recreate the failed path test after every change so rollback and recovery are evidence based.
Keep a UTC incident timeline that records exactly when routes, policies, and results changed.

9. Prevention¶

Prevention checklist¶

[ ] Maintain a dependency map showing which subnets require hub transit, direct peering, or firewall inspection.
[ ] Alert on peering, UDR, and NSG changes in production subscriptions.
[ ] Run scheduled connectivity tests for the top east-west and hybrid paths.
[ ] Review route and policy exceptions quarterly and retire stale temporary changes.
[ ] Validate new workload onboarding with the same packet path tests used in incident response.

Connectivity Failures¶

1. Summary¶

Symptoms¶

2. Common Misreadings¶

3. Competing Hypotheses¶

4. What to Check First¶

5. Evidence to Collect¶

5.1 KQL Queries¶

Azure Activity changes affecting routing or peering¶

Firewall or appliance deny trend¶

Network path health from Connection Monitor¶

5.2 CLI Investigation¶

Show effective routes for the affected NIC¶

Show effective NSG rules for the same NIC¶

Run end-to-end connectivity test¶

6. Validation and Disproof by Hypothesis¶

Hypothesis: Source or destination NSG denies the flow¶

Hypothesis: UDR or peering path is incorrect¶

Hypothesis: Firewall or NVA policy denies traffic¶

Hypothesis: Destination service is unhealthy¶

7. Likely Root Cause Patterns¶

8. Immediate Mitigations¶

9. Prevention¶

Prevention checklist¶

See Also¶

Sources¶