Troubleshooting Methodology¶

This guide provides a deep-dive into the hypothesis-driven troubleshooting approach used throughout this practical guide.

The Hypothesis-Driven Approach¶

Instead of random searching, we use a structured process to isolate root causes.

graph TD
    A[Observe Symptom] --> B[Generate Hypotheses]
    B --> C[Identify Evidence Needed]
    C --> D[Collect and Analyze Evidence]
    D --> E{Validate Hypothesis?}
    E -- No --> B
    E -- Yes --> F[Confirm Root Cause]
    F --> G[Implement Mitigation]

Evidence Collection Framework¶

We categorize evidence based on its reliability and directness.

Tag	Level	Description
[Observed]	Direct	Directly seen in logs or error messages (e.g., `401 Unauthorized`).
[Measured]	Quantitative	Quantitatively confirmed by metrics (e.g., `99.9% packet loss`).
[Correlated]	Statistical	Two signals move together (e.g., `Latency spikes during high CPU`).
[Inferred]	Logical	A reasonable conclusion based on surrounding evidence.
[Strongly Suggested]	Probable	High probability, but not definitive.
[Not Proven]	Uncertain	Tested but not confirmed.

Layered Diagnosis¶

Troubleshooting should proceed logically through the communication layers:

Client/SDK: Device issues, token expiry, initialization errors.
Network: Firewall, proxy, NAT, bandwidth, jitter.
Service (Azure): ACS resource limits, regional issues, global service health.
Handoff (Carrier/Recipient): Downstream delivery status, recipient policies.

Best Practices¶

Isolate Variables: Change one thing at a time when testing a hypothesis.
Reproduction: Try to reproduce the issue in a controlled environment.
Document Everything: Keep a record of the evidence collected and the hypotheses tested.
Automate Alerts: Use Azure Monitor to alert on critical failure patterns before they become major incidents.

Sources¶

Azure PaaS Troubleshooting Methodology
SRE Incident Management Best Practices