Intermittent Network Failures¶
1. Summary¶
Intermittent network failures require time-correlated evidence because the control plane may look healthy between failure windows.
mermaid graph TD A[Intermittent loss] --> B{Random or periodic?} B -->|Periodic| C[Correlate with route / DNS / load changes] B -->|Random| D[Capture packets and connection metrics] C --> E[Identify recurring trigger] D --> E
2. Common Misreadings¶
- "It worked once, so networking is healthy."
- "Random failures mean Azure is flaky."
- "A single successful packet capture disproves the incident."
3. Competing Hypotheses¶
- H1: DNS answers flap because of cache, TTL, or resolver path changes.
- H2: Route, tunnel, or peering state changes during the incident window.
- H3: Burst-driven port, dependency, or policy pressure causes temporary failure.
- H4: The target or provider path is unstable rather than Azure core routing.
4. What to Check First¶
| Check | Data source | Expected good signal |
|---|---|---|
| Time-based spikes | Metrics timeline | No alignment with load or config events |
| DNS behavior | Resolver logs / repeated query output | Stable answer over time |
| Port allocation / connections | SNAT and connection counters | No exhaustion pattern |
| Hybrid or peering state | Connection status timeline | Stable connected state |
5. Evidence to Collect¶
- Failure timeline with UTC timestamps.
- Repeated DNS answers during good and bad windows.
- Connection Monitor or packet capture output during the incident window.
- Metrics or logs showing load, route, or tunnel state changes.
- Provider-side or on-prem evidence if the path leaves Azure.
6. Validation¶
| Hypothesis | Signals that support | Signals that weaken |
|---|---|---|
| H1 DNS flap | answers change across windows | DNS answer remains stable |
| H2 State change | tunnel/peering/route transitions align to failures | control-plane state is steady |
| H3 Burst pressure | failures match traffic bursts or connection churn | failures happen at idle too |
| H4 External instability | issue reproduces beyond Azure local path | Azure-local path alone shows failure |
7. Root Cause Patterns¶
- DNS cache expiry exposed a broken forwarder chain.
- Hybrid tunnel or BGP session flapped under load or provider instability.
- Connection churn or port pressure created short-lived outages.
- Upstream provider loss was misread as Azure VNet instability.
8. Immediate Mitigations¶
- Extend monitoring across good and bad windows before changing design.
- Stabilize DNS forwarders and validate TTL-sensitive records.
- Reduce burst concurrency or reroute around unstable transit.
- Capture packet and route evidence during the next failure window.
9. Prevention¶
- Maintain continuous probes for critical paths.
- Keep UTC-aligned incident timelines and route-change records.
- Add repeated resolver checks for Private Endpoint and hybrid dependencies.
See Also¶
- Latency and Packet Loss
- Hybrid Connectivity Issues
- Monitor Network Paths
- Packet Capture and Diagnostics