VPN Gateway Troubleshooting
1. Summary
Use this playbook when Site-to-Site (S2S), Point-to-Site (P2S), or ExpressRoute backup/failover paths are unstable, disconnected, or routing traffic inconsistently.
Hybrid incidents often involve more than tunnel status. The real issue may be BGP advertisement mismatch, local network gateway drift, DNS dependency across the tunnel, or failover behavior that was never tested under load.
Symptoms
- Connections time out or are refused.
- Traffic works from one source but fails from another seemingly similar source.
- A private endpoint or hybrid path behaves differently after a recent change.
- The control plane looks healthy to operators, but the application path is not.
```mermaid
flowchart TD
    A[Hybrid path fails] --> B{Is the tunnel or circuit connected?}
    B -->|No| C[Check gateway, local device, shared key, or provider status]
    B -->|Yes| D{Are expected routes learned and preferred?}
    D -->|No| E[Fix BGP, prefixes, or local network gateway settings]
    D -->|Yes| F{Is DNS or application traffic still failing?}
    F -->|Yes| G[Check NSGs, UDRs, and resolver reachability]
    F -->|No| H[Validate failover path and monitor recurrence]
```
2. Common Misreadings
| Observation | Often Misread As | Actually Means |
|---|---|---|
| Gateway shows Connected | All hybrid traffic should work | A connected tunnel can still carry the wrong routes or no useful DNS traffic. |
| A backup VPN tunnel is up | ExpressRoute failover is ready | Failover may still send traffic down the wrong path or use stale DNS assumptions. |
| P2S clients can connect | They can reach every internal resource | P2S address pools, DNS settings, and route pushes may still be wrong. |
| One prefix is reachable | BGP and local prefixes are correct | Another learned route may override the intended path for other subnets. |
3. Competing Hypotheses
| Hypothesis | Likelihood | Key Discriminator |
|---|---|---|
| The VPN tunnel or gateway connection is unhealthy | High | Gateway connection status is Disconnected, NotConnected, or flapping. |
| Local network gateway or BGP prefixes are wrong | High | Learned routes do not include the remote prefixes or prefer the wrong path. |
| Failover design between ExpressRoute and VPN backup is untested or mispreferred | Medium | Traffic still prefers the degraded path or fails to return after failover. |
| P2S clients receive wrong DNS or routes | Medium | Clients connect to the gateway but cannot resolve or reach internal prefixes. |
| Traffic is blocked after leaving the gateway | Medium | The tunnel is healthy but NSGs, UDRs, or firewall rules deny the path deeper inside Azure. |
4. What to Check First
- Show gateway connections and state
- Show BGP peer status
- Show learned routes
- Show local network gateway prefixes
```bash
az network local-gateway show \
  --resource-group "$RG" \
  --name "$LOCAL_GATEWAY_NAME" \
  --query "localNetworkAddressSpace.addressPrefixes"
```
- Show P2S configuration
```bash
az network vnet-gateway show \
  --resource-group "$RG" \
  --name "$VPN_GATEWAY_NAME" \
  --query "vpnClientConfiguration"
```
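The first three checks in this list carry no command. A minimal sketch of how they might be run with the Azure CLI follows, assuming `$RG` and `$VPN_GATEWAY_NAME` are set as in the commands above. The function names are illustrative only and are not part of any tool. Connection status is read per connection with `show`, on the assumption that the `list` response may leave that field unpopulated.

```bash
# Illustrative helpers (function names are hypothetical).
# Assumes $RG and $VPN_GATEWAY_NAME are already set in the shell.

# Print name and status for every connection in the resource group.
# Assumes connection names contain no whitespace.
show_connections() {
  for conn_name in $(az network vpn-connection list \
      --resource-group "$RG" --query "[].name" --output tsv); do
    az network vpn-connection show \
      --resource-group "$RG" --name "$conn_name" \
      --query "{name:name, status:connectionStatus}" --output tsv
  done
}

# Show the state of each BGP peer known to the gateway.
show_bgp_peers() {
  az network vnet-gateway list-bgp-peer-status \
    --resource-group "$RG" --name "$VPN_GATEWAY_NAME" --output table
}

# Show the routes the gateway has learned, including their origin.
show_learned_routes() {
  az network vnet-gateway list-learned-routes \
    --resource-group "$RG" --name "$VPN_GATEWAY_NAME" --output table
}
```

Run `show_connections`, `show_bgp_peers`, and `show_learned_routes` in that order, so that tunnel state is known before routes are interpreted.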
5. Evidence to Collect
5.1 KQL Queries
Gateway-related activity changes
```kusto
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue has_any (
    "Microsoft.Network/virtualNetworkGateways/write",
    "Microsoft.Network/connections/write",
    "Microsoft.Network/localNetworkGateways/write"
)
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceGroup, ResourceId
| order by TimeGenerated desc
```
| Column | Interpretation |
|---|---|
| `OperationNameValue` | Configuration writes often explain why a healthy tunnel stopped carrying the right routes. |
| `Caller` | Useful when automation or provider handoffs changed settings. |
How to Read This
Start with the rows nearest the incident start time. Use them to separate configuration changes from recurring background noise.
Gateway diagnostics and tunnel state
```kusto
AzureDiagnostics
| where TimeGenerated > ago(6h)
| where Category has_any ("GatewayDiagnosticLog", "TunnelDiagnosticLog", "RouteDiagnosticLog")
| project TimeGenerated, Category, Resource, msg_s, status_s
| order by TimeGenerated desc
```
| Column | Interpretation |
|---|---|
| `status_s` | Look for transition patterns that align with outage start or failover. |
| `msg_s` | Messages often reveal negotiation, route, or keepalive problems. |
How to Read This
Start with the rows nearest the incident start time. Use them to separate state transitions that coincide with the outage from recurring background noise.
Hybrid application symptom correlation
```kusto
AzureDiagnostics
| where TimeGenerated > ago(6h)
| where msg_s has_any ("connection timed out", "host unreachable", "temporary failure in name resolution")
| summarize Failures=count() by Resource, msg_s, bin(TimeGenerated, 15m)
| order by TimeGenerated desc
```
| Column | Interpretation |
|---|---|
| `Failures` | Correlate gateway-level instability with application-visible impact. |
| `Resource` | Shows which workloads are most sensitive to the hybrid outage. |
How to Read This
Start with the time bins nearest the incident start time. Use them to separate failures that coincide with gateway instability from recurring background noise.
5.2 CLI Investigation
List gateway connections
Sample output:
Interpretation:
- Flapping state or zero transfer counters during expected traffic is suspicious.
- Compare timestamps with provider-side monitoring.
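Whether the transfer counters move during expected traffic can be checked by sampling them twice. The sketch below reads the `ingressBytesTransferred` and `egressBytesTransferred` properties of a connection, waits, and reads them again. The function names are hypothetical, `$RG` is assumed to be set, and the default wait of 60 seconds is an arbitrary choice.

```bash
# Hypothetical helper: read one numeric property of a connection.
# usage: connection_counter <connection-name> <property>
connection_counter() {
  az network vpn-connection show \
    --resource-group "$RG" --name "$1" \
    --query "$2" --output tsv
}

# Sample both counters twice and succeed only if either one increased.
# usage: counters_advanced <connection-name> [seconds-between-samples]
counters_advanced() {
  ca_conn="$1"
  ca_wait="${2:-60}"
  ca_in1=$(connection_counter "$ca_conn" ingressBytesTransferred)
  ca_out1=$(connection_counter "$ca_conn" egressBytesTransferred)
  sleep "$ca_wait"
  ca_in2=$(connection_counter "$ca_conn" ingressBytesTransferred)
  ca_out2=$(connection_counter "$ca_conn" egressBytesTransferred)
  # Treat an empty value as zero so the comparison below stays numeric.
  ca_in1="${ca_in1:-0}"; ca_out1="${ca_out1:-0}"
  ca_in2="${ca_in2:-0}"; ca_out2="${ca_out2:-0}"
  echo "ingress: $ca_in1 -> $ca_in2, egress: $ca_out1 -> $ca_out2"
  [ "$ca_in2" -gt "$ca_in1" ] || [ "$ca_out2" -gt "$ca_out1" ]
}
```

For example, `counters_advanced "$CONNECTION_NAME" 120 || echo "no traffic moved"` (where `$CONNECTION_NAME` is an assumed variable) flags a tunnel that is nominally up but idle. Generate known test traffic during the wait, otherwise a quiet but healthy tunnel looks the same as a broken one.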
List learned routes
Sample output:
Interpretation:
- Missing prefixes point to BGP or local gateway drift.
- Unexpected more-specific routes can override your intended failover path.
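Missing prefixes can be spotted mechanically by comparing the learned routes against the prefixes you expect. The following is a sketch under two assumptions: that `az network vnet-gateway list-learned-routes` returns its routes under a top-level `value` array, and that each entry carries a `network` field. The function name is hypothetical.

```bash
# Hypothetical helper: print each expected prefix the gateway has NOT learned.
# Assumes $RG and $VPN_GATEWAY_NAME are set.
# usage: missing_prefixes <prefix> [<prefix> ...]
missing_prefixes() {
  mp_learned=$(az network vnet-gateway list-learned-routes \
    --resource-group "$RG" --name "$VPN_GATEWAY_NAME" \
    --query "value[].network" --output tsv)
  for mp_prefix in "$@"; do
    # Exact, whole-line match only; no supernet or subnet logic is applied.
    printf '%s\n' "$mp_learned" | grep -Fxq -- "$mp_prefix" || echo "$mp_prefix"
  done
}
```

Calling `missing_prefixes 10.10.0.0/16 10.20.0.0/16` prints only the prefixes that are absent. Because the match is exact, a prefix covered by a broader learned route, or split into more-specific routes, is still reported as missing and needs a manual look.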
Show BGP peer status
Sample output:
Interpretation:
- Healthy BGP is necessary but not sufficient; still verify learned and advertised routes.
- Disconnected or Idle states justify immediate investigation of peer configuration or provider path.
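To surface only the peers that need attention, the peer list can be filtered on state. This sketch assumes the command output has a top-level `value` array whose entries include `neighbor`, `state`, and `routesReceived`; the function name is hypothetical.

```bash
# Hypothetical helper: list BGP peers whose state is anything but Connected.
# Returns non-zero when at least one such peer exists.
# Assumes $RG and $VPN_GATEWAY_NAME are set.
unhealthy_bgp_peers() {
  ub_bad=$(az network vnet-gateway list-bgp-peer-status \
    --resource-group "$RG" --name "$VPN_GATEWAY_NAME" \
    --query "value[?state!='Connected'].[neighbor, state, routesReceived]" \
    --output tsv)
  if [ -n "$ub_bad" ]; then
    printf '%s\n' "$ub_bad"
    return 1
  fi
  return 0
}
```

An empty result only shows that sessions are established. It says nothing about which routes were exchanged, so the learned and advertised routes still need to be checked.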
6. Validation and Disproof by Hypothesis
Hypothesis: Tunnel or connection instability
Proves if: Connection state is Disconnected or flaps repeatedly in diagnostics.
Disproves if: The connection remains stable and traffic counters advance during the test window.
Hypothesis: Prefix or BGP mismatch
Proves if: Learned routes or local gateway prefixes do not include the expected remote network.
Disproves if: The required prefix is present and preferred as designed.
Hypothesis: Failover preference problem
Proves if: Backup routes exist but remain less preferred or do not activate during a primary-path failure.
Disproves if: Traffic shifts to the intended backup path during a controlled failover test.
```bash
az network vnet-gateway list-advertised-routes \
  --resource-group "$RG" \
  --name "$VPN_GATEWAY_NAME" \
  --peer 203.0.113.10
```
Hypothesis: P2S client route or DNS issue
Proves if: P2S clients connect but cannot resolve or reach remote private networks.
Disproves if: P2S clients receive the documented DNS servers and route set and can reach the required prefixes.
```bash
az network vnet-gateway show \
  --resource-group "$RG" \
  --name "$VPN_GATEWAY_NAME" \
  --query "vpnClientConfiguration"
```
7. Likely Root Cause Patterns
| Pattern | Evidence | Resolution |
|---|---|---|
| Wrong local network gateway prefixes | Tunnel is connected but the destination prefix is absent | Correct the local network gateway address space or BGP advertisements. |
| Shared key or provider issue | Connection status drops or renegotiates frequently | Coordinate with the on-premises or provider team and stabilize the tunnel first. |
| Unverified ExpressRoute to VPN failover | Backup exists but traffic does not shift as expected | Test failover and route preference in a controlled maintenance window. |
| P2S DNS gap | P2S users connect but internal names fail | Publish the correct DNS servers and validate split-horizon resolution. |
| Healthy gateway, blocked inside Azure | Hybrid routes exist but Azure-side NSGs or UDRs still deny the traffic | Move the investigation from gateway state to subnet policy and effective routes. |
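For the last pattern, where the gateway is healthy but traffic is blocked inside Azure, the effective routes and effective NSG rules on the network interface of an affected VM are a reasonable starting point. The sketch below assumes `$RG` holds the resource group of that NIC and that the VM is running, since effective values are only reported for running VMs. The function name is hypothetical.

```bash
# Hypothetical helper: show what Azure actually applies to a NIC.
# usage: show_effective_path <nic-name>
show_effective_path() {
  # Effective routes: system, BGP-learned, and user-defined routes combined.
  az network nic show-effective-route-table \
    --resource-group "$RG" --name "$1" --output table
  # Effective NSG rules from both the subnet and the NIC.
  az network nic list-effective-nsg \
    --resource-group "$RG" --name "$1" --output json
}
```

In the route table, check whether the on-premises prefix points at the virtual network gateway or is overridden by a UDR. In the NSG output, look for a deny rule that matches the failing source, destination, and port.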
8. Immediate Mitigations
- If a recent change caused the issue, restore the last known-good local network gateway or connection settings.
- Prefer the stable path temporarily by adjusting route preference only after verifying the security implications.
- Capture learned routes, advertised routes, and connection state before making simultaneous changes on both sides.
- Validate DNS across the hybrid path after the tunnel returns to Connected status.
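Capturing state before any change is easier to do consistently with a small script. The following is a minimal sketch that writes the connection state, learned routes, and advertised routes to a timestamped directory. It assumes `$RG` and `$VPN_GATEWAY_NAME` are set; the function name, file names, and directory naming are illustrative choices.

```bash
# Hypothetical helper: snapshot gateway evidence before making changes.
# usage: capture_gateway_state <connection-name> <bgp-peer-ip> [output-dir]
capture_gateway_state() {
  cg_conn="$1"
  cg_peer="$2"
  cg_dir="${3:-gateway-evidence-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$cg_dir" || return 1

  az network vpn-connection show \
    --resource-group "$RG" --name "$cg_conn" \
    --output json > "$cg_dir/connection.json"

  az network vnet-gateway list-learned-routes \
    --resource-group "$RG" --name "$VPN_GATEWAY_NAME" \
    --output json > "$cg_dir/learned-routes.json"

  az network vnet-gateway list-advertised-routes \
    --resource-group "$RG" --name "$VPN_GATEWAY_NAME" --peer "$cg_peer" \
    --output json > "$cg_dir/advertised-routes.json"

  # Print the directory so the caller can attach it to the incident record.
  echo "$cg_dir"
}
```

Running it once before the change and once after gives two directories that can be compared with `diff -r`, which makes it clearer which side of the tunnel actually changed.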
9. Prevention
Prevention checklist
- [ ] Test S2S, P2S, and backup failover paths on a regular schedule.
- [ ] Keep local network gateway prefixes and BGP advertisements under strict change control.
- [ ] Record provider contact paths and escalation data in the same runbook as Azure commands.
- [ ] Monitor gateway diagnostics and alert on flapping connections or route changes.
- [ ] Document which applications depend on which hybrid path so triage can prioritize correctly.