Systematic Troubleshooting Methodology

Use this methodology for incidents that are not solved by a single quick check. The sequence is Observe → Hypothesize → Test → Fix → Verify and is aligned with Azure Monitor and Azure Functions guidance from Microsoft Learn.

Why a method matters

Unstructured debugging increases mean time to recovery (MTTR) and leads to risky changes during outages. A repeatable method helps teams:

preserve evidence,
avoid guess-driven configuration changes,
and produce reusable runbooks after resolution.

Troubleshooting mental model

Use this classification to select the first evidence source before deep dives.

| Category | Examples | First Check | Typical Evidence |
| --- | --- | --- | --- |
| Request path issue | 5xx, timeout, 403, connection refused | `requests` + `exceptions` tables | HTTP status codes, error types |
| App startup issue | Host not starting, container ping failure, health check timeout | `traces` table (host lifecycle) | `Host started` missing, startup duration |
| Runtime degradation | Memory pressure, GIL contention, thread pool starvation | `customMetrics`, process metrics | CPU/memory trends, cold start frequency |
| Dependency / outbound issue | DNS failure, SNAT exhaustion, private endpoint unreachable | `dependencies` table | Failed dependency calls, target resolution |
| Deployment / recycle event | Post-deploy failures, slot swap issues, config drift | Activity Log, `traces` | Deploy events, host restart events |
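
The classification can also be read as a lookup from symptom category to first evidence source. The sketch below is illustrative only: the category keys and the `first_check` helper are hypothetical names introduced here, while the evidence source names are copied from the table.

```python
# Hypothetical lookup that mirrors the classification table.
FIRST_CHECK = {
    "request_path": ["requests", "exceptions"],
    "app_startup": ["traces"],
    "runtime_degradation": ["customMetrics", "process metrics"],
    "dependency_outbound": ["dependencies"],
    "deployment_recycle": ["Activity Log", "traces"],
}


def first_check(category: str) -> list[str]:
    """Return the evidence sources to inspect first for a symptom category."""
    try:
        return FIRST_CHECK[category]
    except KeyError:
        # Unknown categories should be classified by a person, not guessed.
        raise ValueError(f"unclassified symptom category: {category!r}") from None
```

Raising on an unknown category is a deliberate choice in this sketch: it forces the responder to classify the symptom before picking a data source.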

About customMetrics

Most metrics in the `customMetrics` table require explicit instrumentation from your application code. Only a few runtime metrics (for example, `FunctionExecutionCount`) are emitted automatically. Queue processing metrics, latency measurements, and business metrics must be tracked explicitly.

Operations Guide

For monitoring setup and alert configuration, see Monitoring and Alerts.

1) Observe

Start with facts, not assumptions. Capture the incident window, blast radius, impacted triggers, and user-facing symptoms.

Primary telemetry sources:

Metrics: request rate, failures, latency, execution count.
Logs (traces): host lifecycle, trigger listener status, runtime warnings.
Exceptions: top exception types and first-seen timestamps.
Dependencies: failed or slow external calls.
Alerts: who was notified, what threshold fired, and when.

Evidence checklist:

Start time and detection source.
Affected environments and regions.
Last known good time.
Most recent deployment or config change.
Current customer impact.
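
One way to keep the checklist honest during an incident is to record it as a structure that can report its own gaps. This is a minimal sketch; the `IncidentEvidence` class and its field names are assumptions made for illustration, not part of any Azure SDK.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass
class IncidentEvidence:
    """One field per evidence checklist item (field names are illustrative)."""
    start_time: Optional[datetime] = None
    detection_source: Optional[str] = None
    affected_scope: Optional[str] = None      # environments and regions
    last_known_good: Optional[datetime] = None
    last_change: Optional[str] = None         # most recent deployment or config change
    customer_impact: Optional[str] = None

    def missing(self) -> list[str]:
        """Names of checklist items that have not been captured yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

Calling `missing()` before moving on to hypotheses gives a quick list of what still needs to be collected.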

2) Hypothesize

Convert observations into explicit, testable hypotheses. Good hypotheses target one component at a time.

Examples:

"Queue trigger listener is unhealthy because storage auth changed."
"Latency is caused by dependency timeout, not function runtime."
"Blob trigger failed after Flex migration because Event Grid subscription is missing."

Prioritize hypotheses by:

impact severity,
likelihood given recent change history,
speed and safety of validation.
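
The three prioritization criteria can be turned into a rough ordering. The sketch below assumes a 1 to 5 score per criterion and equal weights; both the scale and the `Hypothesis` and `prioritize` names are hypothetical and should be adapted to the team's own practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    statement: str
    severity: int    # 1 (low) to 5 (high): impact if the hypothesis is true
    likelihood: int  # 1 to 5: plausibility given recent change history
    validation: int  # 1 to 5: how fast and safe it is to test


def prioritize(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Order hypotheses so the highest combined score is tested first."""
    # Equal weights are an assumption of this sketch.
    return sorted(
        hypotheses,
        key=lambda h: h.severity + h.likelihood + h.validation,
        reverse=True,
    )
```

A simple sum keeps the ranking easy to explain in incident notes; a weighted score is a reasonable alternative if one criterion should dominate.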

Evidence collection patterns

Before testing, map each hypothesis to the minimum evidence set.

| Hypothesis Type | Evidence Needed | Tool | Example Query or Command |
| --- | --- | --- | --- |
| Request failure is application logic | 5xx trend + top exception type | Log Analytics (`AppRequests`, `AppExceptions`) | `AppRequests \| where TimeGenerated > ago(30m) \| where ResultCode startswith "5" \| summarize count() by OperationName` |
| Host startup regression after change | Startup lifecycle logs + deploy timestamp | Log Analytics (`AppTraces`) + Activity Log | `AppTraces \| where TimeGenerated > ago(30m) \| where Message has "Host started" or Message has "Starting Host"` |
| Outbound dependency timeout | Failed dependency calls by target | Application Insights (`dependencies`) | `dependencies \| where timestamp > ago(30m) \| where success == false \| summarize count(), avg(duration) by target, type` |
| Trigger listener unhealthy | Listener/lock/trigger errors | Log Analytics (`AppTraces`) | `AppTraces \| where TimeGenerated > ago(30m) \| where Message has_any ("listener", "Host lock", "trigger")` |
| Scale bottleneck on event trigger | Backlog growth vs execution flatline | Azure Monitor Metrics | `az monitor metrics list --resource "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RG/providers/Microsoft.Web/sites/$APP_NAME" --metric "FunctionExecutionCount" --interval PT1M --aggregation Total --offset 30m` |
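
When the same query is reused with different incident windows, it can help to generate it from a template so the time filter is never dropped. The sketch below rebuilds the first query in the table; the `requests_5xx_query` helper and its validation pattern are assumptions introduced for illustration.

```python
import re


def requests_5xx_query(window: str = "30m") -> str:
    """Build the 5xx-by-operation KQL query with a bounded time window."""
    # Accept only simple KQL timespan literals such as 15m, 1h, or 2d.
    if not re.fullmatch(r"\d+[smhd]", window):
        raise ValueError(f"unsupported time window: {window!r}")
    return (
        "AppRequests"
        f" | where TimeGenerated > ago({window})"
        ' | where ResultCode startswith "5"'
        " | summarize count() by OperationName"
    )
```

Validating the window before building the string keeps free-form input out of the query text.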

3) Test

Use a minimal set of diagnostic queries and commands that can prove or disprove a hypothesis. Avoid broad, expensive "search everything" approaches during active incidents.

Testing rules:

Define expected result before running a query.
Keep the time range tight (for example, `ago(15m)` or `ago(1h)`).
Compare against baseline if available.
Log findings in incident notes.
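
The testing rules can be sketched as a small record that forces the expected result to be written down before any number is observed. Everything here is hypothetical: the `QueryTest` name, the use of failure rate as the measured value, and the assumed threshold of twice the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryTest:
    hypothesis: str
    expected: str                  # written down before the query is run
    baseline_failure_rate: float
    threshold_ratio: float = 2.0   # assumed: 2x baseline counts as abnormal

    def evaluate(self, observed_failure_rate: float) -> str:
        """Return a one-line finding suitable for incident notes."""
        limit = self.baseline_failure_rate * self.threshold_ratio
        verdict = "supported" if observed_failure_rate > limit else "not supported"
        return (
            f"{self.hypothesis}: {verdict}; expected '{self.expected}'; "
            f"observed {observed_failure_rate:.1%}, "
            f"baseline {self.baseline_failure_rate:.1%}"
        )
```

The returned string carries the expectation, the observation, and the baseline together, so the note remains interpretable after the incident.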

Common test tools:

KQL Query Library
`az monitor metrics list`
`az monitor log-analytics query`
health endpoint (`/api/health`)
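
A health endpoint probe can be scripted with the Python standard library. This is a sketch under assumptions: it presumes the app exposes `/api/health` without requiring a function key and treats only HTTP 200 as healthy. The `probe_health` name and the injectable `opener` parameter are hypothetical and exist to make the function testable without network access.

```python
import urllib.error
import urllib.request


def probe_health(url: str, timeout: float = 5.0,
                 opener=urllib.request.urlopen) -> bool:
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError (4xx/5xx), and socket timeouts all derive
        # from OSError, so any of them is reported as unhealthy.
        return False
```

A short timeout is used so the probe itself does not stall while the incident is active.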

4) Fix

Apply the smallest safe change that addresses the validated cause. During incidents, controlled reversibility is more important than broad refactoring.

Fix guidance:

Prefer rollback when a recent deployment introduced a regression.
If changing app settings, record before/after values (without secrets).
Avoid simultaneous multi-variable changes.
Use staged rollout when possible.
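
Recording before and after values without leaking secrets can be sketched as a dictionary diff with masking. The `diff_settings` helper and the name-based secret heuristic are assumptions of this example; a setting such as a connection string stored under a neutral name would not be caught, so review the output before pasting it into incident notes.

```python
# Assumed naming heuristic for settings that probably hold secrets.
SECRET_MARKERS = ("KEY", "SECRET", "PASSWORD", "CONNECTION", "TOKEN")


def diff_settings(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (old, new)} for changed settings, masking likely secrets."""
    def mask(name, value):
        if value is None:
            return None  # setting was absent on this side
        is_secret = any(marker in name.upper() for marker in SECRET_MARKERS)
        return "<redacted>" if is_secret else value

    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (mask(name, old), mask(name, new))
    return changes
```

A masked entry still shows that the value changed, which is usually enough for attribution without exposing the secret itself.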

Examples of minimal fixes:

Re-enable one disabled function.
Restore one missing app setting.
Recreate one missing Event Grid subscription.
Roll back one deployment artifact.

5) Verify

Verification confirms both restoration and recurrence prevention.

Immediate verification:

Failure rate returns to baseline.
Throughput catches up with incoming demand.
Health endpoint and key user paths succeed.
No new high-severity alerts fire in the observation window.
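
The "returns to baseline" check is more reliable when it covers several consecutive samples, not a single good reading. The sketch below assumes failure rates are sampled at a fixed interval; the `is_recovered` name and the default window of three samples are illustrative choices.

```python
def is_recovered(failure_rates: list[float], baseline: float, window: int = 3) -> bool:
    """True if the most recent `window` samples are all at or below baseline."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(failure_rates) < window:
        return False  # not enough observations to declare recovery
    return all(rate <= baseline for rate in failure_rates[-window:])
```

Returning `False` when there are too few samples errs on the side of keeping the incident open.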

Post-incident verification:

Add alerting for earlier detection.
Add dashboards for leading indicators.
Update playbook with confirmed signal and fix steps.

Troubleshooting decision tree

flowchart TD
    A[Incident detected] --> B{Platform health degraded?}
    B -->|Yes| C[Check Azure Service Health and advisories]
    C --> D[Apply platform mitigation and communicate impact]
    B -->|No| E{"Function app healthy? /api/health"}
    E -->|No| F[Inspect host startup logs and recent config changes]
    F --> G{Recent deployment or setting change?}
    G -->|Yes| H[Rollback or revert smallest change]
    G -->|No| I[Check runtime dependencies and identity access]
    E -->|Yes| J{Trigger-specific failure?}
    J -->|Yes| K[Use trigger playbook and KQL filters]
    J -->|No| L[Investigate latency and dependency bottlenecks]
    H --> M[Verify recovery metrics and error rate]
    I --> M
    K --> M
    L --> M
    M --> N[Document root cause and prevention actions]
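
The branches of the flowchart can also be expressed as a function, which is convenient for runbook automation or for checking that every path ends in an action. This is a sketch: the `next_step` name and its boolean parameters are hypothetical, and the returned strings are copied from the flowchart nodes.

```python
def next_step(platform_degraded: bool, app_healthy: bool,
              recent_change: bool, trigger_specific: bool) -> str:
    """Walk the flowchart branches and return the action at the leaf."""
    if platform_degraded:
        return "Check Azure Service Health and advisories"
    if not app_healthy:
        # The flowchart inspects host startup logs before this branch.
        if recent_change:
            return "Rollback or revert smallest change"
        return "Check runtime dependencies and identity access"
    if trigger_specific:
        return "Use trigger playbook and KQL filters"
    return "Investigate latency and dependency bottlenecks"
```

Every leaf except the platform branch leads to the same verification step in the flowchart, so the caller is expected to verify recovery metrics after acting on the result.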

Anti-patterns to avoid

| Anti-pattern | Why It's Dangerous | Better Approach |
| --- | --- | --- |
| Restarting without evidence | Destroys diagnostic state and erases startup timing clues | Collect `traces`, `exceptions`, and recent Activity Log first, then restart once if needed |
| Expanding incident scope without data | Pulls teams into unrelated systems and delays mitigation | Constrain scope to the confirmed blast radius, then widen only with evidence |
| Applying multiple config changes at once | Creates attribution ambiguity and raises rollback risk | Apply one reversible change at a time and measure its impact window |
| Declaring resolved without a stability window | Leads to reopened incidents and underestimates latent failures | Observe at least one sustained low-error window before closeout |
