Intermittent Network Failures¶

1. Summary¶

Intermittent network failures require time-correlated evidence because the control plane may look healthy between failure windows.

mermaid graph TD A[Intermittent loss] --> B{Random or periodic?} B -->|Periodic| C[Correlate with route / DNS / load changes] B -->|Random| D[Capture packets and connection metrics] C --> E[Identify recurring trigger] D --> E

2. Common Misreadings¶

"It worked once, so networking is healthy."
"Random failures mean Azure is flaky."
"A single successful packet capture disproves the incident."

3. Competing Hypotheses¶

H1: DNS answers flap because of cache, TTL, or resolver path changes.
H2: Route, tunnel, or peering state changes during the incident window.
H3: Burst-driven port, dependency, or policy pressure causes temporary failure.
H4: The target or provider path is unstable rather than Azure core routing.

4. What to Check First¶

Check	Data source	Expected good signal
Time-based spikes	Metrics timeline	No alignment with load or config events
DNS behavior	Resolver logs / repeated query output	Stable answer over time
Port allocation / connections	SNAT and connection counters	No exhaustion pattern
Hybrid or peering state	Connection status timeline	Stable connected state

5. Evidence to Collect¶

Failure timeline with UTC timestamps.
Repeated DNS answers during good and bad windows.
Connection Monitor or packet capture output during the incident window.
Metrics or logs showing load, route, or tunnel state changes.
Provider-side or on-prem evidence if the path leaves Azure.

6. Validation¶

Hypothesis	Signals that support	Signals that weaken
H1 DNS flap	answers change across windows	DNS answer remains stable
H2 State change	tunnel/peering/route transitions align to failures	control-plane state is steady
H3 Burst pressure	failures match traffic bursts or connection churn	failures happen at idle too
H4 External instability	issue reproduces beyond Azure local path	Azure-local path alone shows failure

7. Root Cause Patterns¶

DNS cache expiry exposed a broken forwarder chain.
Hybrid tunnel or BGP session flapped under load or provider instability.
Connection churn or port pressure created short-lived outages.
Upstream provider loss was misread as Azure VNet instability.

8. Immediate Mitigations¶

Extend monitoring across good and bad windows before changing design.
Stabilize DNS forwarders and validate TTL-sensitive records.
Reduce burst concurrency or reroute around unstable transit.
Capture packet and route evidence during the next failure window.

9. Prevention¶

Maintain continuous probes for critical paths.
Keep UTC-aligned incident timelines and route-change records.
Add repeated resolver checks for Private Endpoint and hybrid dependencies.