Troubleshooting Architecture Overview

This page answers the most important first question in Azure Monitor incidents: where can the monitoring path fail?

Before opening a specific playbook, classify the symptom against the Azure Monitor data path, control path, and consumption path. That routing step prevents common mistakes such as debugging Application Insights SDK configuration when the real issue is diagnostic settings, or blaming Log Analytics query latency when the actual problem is a workbook fanning out across multiple workspaces.

Why this page exists

Playbooks are detailed and symptom-driven. During active incidents, operators usually need a single, faster artifact that provides:

- An end-to-end map of telemetry flow from source to consumer
- A list of failure points where data can be delayed, dropped, blocked, or misrouted
- A routing guide for choosing between ingestion, alerting, or query-performance investigation
- A shared vocabulary for incident notes and escalation

Use this page to map the symptom first, then move to the appropriate playbook.

1) End-to-end Azure Monitor flow

flowchart TD
    A[Workload or Azure resource] --> B[Telemetry source and agent or SDK]
    B --> C[Collection and routing layer]
    C --> D[Azure Monitor data stores]
    D --> E[Queries alerts workbooks APIs]
    E --> F[Operator or downstream system]

    B -. FP-01 .-> B1[Failure point: source misconfiguration, or missing SDK or agent]
    C -. FP-02 .-> C1[Failure point: diagnostic setting, DCR, or network path issue]
    D -. FP-03 .-> D1[Failure point: ingestion delay, retention, or table mismatch]
    E -. FP-04 .-> E1[Failure point: bad query, alert logic, or workbook scope]

Interpretation

- Source problems usually look like zero or partial telemetry from one workload.
- Routing problems usually look like data reaching the wrong workspace, missing categories, or blocked outbound collection.
- Data-store problems usually look like ingestion gaps, delay, retention surprises, or table-specific anomalies.
- Consumer problems usually look like slow queries, broken workbooks, or alert rules that do not reflect the actual data.

2) Telemetry source architecture

Different Azure Monitor signals originate from different collection models.

Signal type	Typical source	Collection mechanism	First failure question
Platform metrics	Azure resources	Native Azure Monitor metrics pipeline	Is the resource emitting the expected metric dimension?
Resource logs	Azure resources	Diagnostic settings	Is the resource sending the right categories to the intended destination?
VM or server logs and metrics	Guest OS	Azure Monitor Agent + DCR	Is AMA installed, healthy, and associated with the correct DCR?
Application traces and requests	Application code	Application Insights SDK, autoinstrumentation, or OpenTelemetry	Is the app sending telemetry with the correct connection string or endpoint?
Container and AKS logs	Cluster or node agents	Container Insights / AMA / extension pipeline	Is the extension enabled and the workspace/DCR mapping correct?

Source-side failure points

Failure point	Typical symptom	First evidence	Primary page
FP-SRC-01 Application SDK missing or broken	Requests missing, traces absent, custom events never arrive	App config and application logs	Missing Application Telemetry
FP-SRC-02 Diagnostic setting disabled or incomplete	Resource logs absent in workspace	Azure Activity Log + diagnostic settings list	No Data in Workspace
FP-SRC-03 AMA unhealthy or not assigned	VM heartbeat stops, guest data disappears	`Heartbeat`, DCR association, extension state	Agent Not Reporting
FP-SRC-04 AKS monitoring add-on or extension misconfigured	`ContainerLogV2` and insights tables stale or empty	AKS add-on state, DCR, extension logs	AKS Container Insights Issues

3) Collection and routing path

flowchart TD
    A[Resource or app emits telemetry] --> B{Collection model}
    B -->|Diagnostic settings| C[Azure resource logs and metrics export]
    B -->|Azure Monitor Agent| D[Guest collection via DCR]
    B -->|Application Insights SDK| E[Ingestion endpoint]
    B -->|Container Insights| F[Cluster extension and DCR]

    C --> G[Workspace Event Hub Storage or Metrics]
    D --> G
    E --> H[Application Insights resource and workspace-based tables]
    F --> G

    C -. FP-ROUTE-01 .-> C1[Wrong destination or missing category]
    D -. FP-ROUTE-02 .-> D1[Missing DCR association or blocked egress]
    E -. FP-ROUTE-03 .-> E1[Wrong connection string, sampling, or private access issue]
    F -. FP-ROUTE-04 .-> F1[Extension disabled, namespace filter, or DCR mismatch]

What to check first on the routing layer

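# Diagnostic settings on the resource: confirm categories and destinations.
az monitor diagnostic-settings list \
    --resource "$RESOURCE_ID"

# DCR associations for the resource (needs the monitor-control-service CLI extension).
az monitor data-collection rule association list \
    --resource "$RESOURCE_ID"

# Application Insights component: confirm the connection string and linked workspace
# (needs the application-insights CLI extension).
az monitor app-insights component show \
    --resource-group "$RG" \
    --app "$APP_INSIGHTS_NAME"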

Routing-layer interpretation

- If a resource emits metrics but no logs, the problem is often diagnostic settings rather than resource health.
- If `Heartbeat` is stale for one VM cohort, DCR association or agent health is more likely than workspace outage.
- If Application Insights requests arrive but custom traces do not, sampling, SDK configuration, or module coverage is more likely than ingestion failure.
- If AKS node metrics appear but container logs do not, inspect namespace filters and Container Insights configuration before debugging query logic.
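
The Heartbeat rule of thumb above can be sketched as a small helper that separates "one cohort is stale" from "everything is stale". This is an illustrative sketch only: the function name, the 15-minute threshold, and the scope labels are assumptions, and the input is assumed to be last-heartbeat timestamps already exported from a query.

```python
from datetime import datetime, timedelta, timezone


def classify_heartbeat_staleness(last_heartbeats, now, max_age=timedelta(minutes=15)):
    """Classify staleness scope from a {computer: last_heartbeat} mapping.

    Returns (scope, stale_computers) where scope is "healthy", "cohort", or "all".
    """
    stale = sorted(
        computer for computer, seen in last_heartbeats.items() if now - seen > max_age
    )
    if not stale:
        scope = "healthy"
    elif len(stale) < len(last_heartbeats):
        # Only a subset is stale: suspect agent health or DCR association.
        scope = "cohort"
    else:
        # Every machine is stale: check the workspace and wider signals too.
        scope = "all"
    return scope, stale


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    beats = {"vm-a": now - timedelta(minutes=2), "vm-b": now - timedelta(hours=1)}
    print(classify_heartbeat_staleness(beats, now))
```

A "cohort" result points toward the Agent Not Reporting playbook; an "all" result is one input to the escalation decision, not proof of a platform incident.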

4) Data-store architecture

Azure Monitor is not a single store. Troubleshooting quality improves when you know which store is supposed to hold the evidence.

Store	Purpose	Typical symptoms when wrong store is queried
Metrics store	Near-real-time numeric platform and custom metrics	Alert seems wrong because operator searched logs instead of metrics dimensions
Log Analytics workspace	Central log and analytics store for resource logs, agent data, and workspace-based Application Insights data	Data appears missing because operator queried the wrong workspace or table
Application Insights tables	Application telemetry schema over workspace-based or classic storage	Requests exist but user looks only at resource logs or vice versa
Activity Log	Control-plane change history	Operators miss the config change that caused the incident

Data-store failure points

Failure point	Typical symptom	First question
FP-DATA-01 Wrong workspace	Query returns nothing even though data exists elsewhere	Are we querying the intended workspace ID and table family?
FP-DATA-02 Ingestion delay or temporary lag	Data appears later than expected	Is there fresh control data in simple tables such as `Heartbeat` or `Usage`?
FP-DATA-03 Retention or plan misunderstanding	Historical data seems missing	Was the data retained in this table plan for the requested period?
FP-DATA-04 Table mismatch	Query hits the wrong schema or old table name	Is the signal expected in `requests`, `AppRequests`, `AzureDiagnostics`, or another table?
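
For FP-DATA-04, a frequent mismatch is using a classic Application Insights table name in a workspace-based query, or the reverse. The sketch below is a minimal lookup for common pairs. The function names are hypothetical, and the mapping is a partial list rather than the full schema.

```python
# Classic Application Insights table names mapped to their
# workspace-based (Log Analytics) equivalents. Partial list only.
CLASSIC_TO_WORKSPACE = {
    "requests": "AppRequests",
    "dependencies": "AppDependencies",
    "exceptions": "AppExceptions",
    "traces": "AppTraces",
    "customEvents": "AppEvents",
    "customMetrics": "AppMetrics",
    "pageViews": "AppPageViews",
    "availabilityResults": "AppAvailabilityResults",
}


def workspace_table_name(table: str) -> str:
    """Return the workspace-based name for a classic table, else the input unchanged."""
    return CLASSIC_TO_WORKSPACE.get(table, table)


def classic_table_name(table: str) -> str:
    """Return the classic name for a workspace-based table, else the input unchanged."""
    reverse = {workspace: classic for classic, workspace in CLASSIC_TO_WORKSPACE.items()}
    return reverse.get(table, table)
```

Names that are not in the mapping, such as `AzureDiagnostics`, pass through unchanged, so the helper is safe to apply to any table name in an incident note.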

5) Consumer architecture: queries, alerts, and workbooks

flowchart TD
    A[Workspace and metrics stores] --> B[Logs query experience]
    A --> C[Scheduled query rules]
    A --> D[Metrics alerts]
    A --> E[Workbooks dashboards and APIs]

    B -. FP-CONS-01 .-> B1[Slow query or wrong table scope]
    C -. FP-CONS-02 .-> C1[Heavy KQL, bad window, or threshold mismatch]
    D -. FP-CONS-03 .-> D1[Wrong metric namespace, dimension, or aggregation]
    E -. FP-CONS-04 .-> E1[Workbook parameter scope or cross-workspace fan-out]

Consumer-side failure patterns

Failure point	Typical symptom	Primary page
FP-CONS-01 Slow KQL or timeouts	Logs and workbooks are slow or fail to load	Slow Query Performance
FP-CONS-02 Alert rule never fires	Data exists but query or threshold logic does not align	Alert Not Firing
FP-CONS-03 Alert storm	Thresholds or dimensions produce excessive noise	Alert Storm
FP-CONS-04 Workbook overhead	Portal visual is slow but direct query is healthy	Slow Query Performance

6) Failure domains by symptom

Observed symptom	Highest-probability domain	Why
No data in a workspace	Source or routing	Telemetry usually failed before or during collection
One app has no requests in Application Insights	Source	SDK, connection string, or app restart issue is likely
VM logs missing but workspace healthy	Agent and DCR	Collection path is per-VM and easy to isolate
Queries time out only on one table	Consumer plus table design	Data exists, but query shape or table volume is unhealthy
Alerts do not reflect visible data	Consumer logic	Alert cadence, dimensions, and query scope often diverge from ad hoc analysis
Cost spikes suddenly	Routing and data volume	Diagnostic setting changes or noisy apps often drive ingestion growth
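
The table above can be approximated as a first-match keyword lookup for triage tooling or chat-ops. This is a rough sketch: the keyword sets and the function name `route_symptom` are illustrative assumptions, and only the domain names come from the table.

```python
from typing import Optional

# (keywords that must all appear, domain) pairs transcribed from the symptom table.
# The keyword choices are assumptions for illustration, not an official taxonomy.
ROUTING_RULES = [
    (("no data", "workspace"), "Source or routing"),
    (("no requests",), "Source"),
    (("vm logs missing",), "Agent and DCR"),
    (("time out",), "Consumer plus table design"),
    (("alerts do not reflect",), "Consumer logic"),
    (("cost spike",), "Routing and data volume"),
]


def route_symptom(symptom: str) -> Optional[str]:
    """Return the first domain whose keywords all occur in the symptom text."""
    text = symptom.lower()
    for keywords, domain in ROUTING_RULES:
        if all(keyword in text for keyword in keywords):
            return domain
    return None  # unclassified: fall back to manual triage


if __name__ == "__main__":
    print(route_symptom("Cost spikes suddenly"))
```

A `None` result is deliberate: an unmatched symptom should go to a human rather than be forced into the nearest domain.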

7) Minimal evidence set for first routing

Use one fast check per layer before going deeper.

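# Confirm the workspace identity and its default retention.
az monitor log-analytics workspace show \
    --resource-group "$RG" \
    --workspace-name "$WORKSPACE_NAME" \
    --query "{id:id,name:name,retentionInDays:retentionInDays}"

# Inspect the log alert rule definition (needs the scheduled-query CLI extension).
az monitor scheduled-query show \
    --resource-group "$RG" \
    --name "$ALERT_RULE_NAME" \
    --output json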

// Control query: does the workspace return fresh agent data?
Heartbeat
| where TimeGenerated > ago(15m)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| take 5

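// Ingestion volume by table over the last 24 hours.
// Quantity is reported in MB; Azure Monitor measures billed volume in decimal GB.
Usage
| where TimeGenerated > ago(24h)
| summarize TotalGB = round(sum(Quantity) / 1000.0, 2) by DataType
| order by TotalGB desc
| take 10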

How to use this evidence set

1. Confirm you are in the correct workspace.
2. Prove the workspace answers a narrow control query.
3. Check whether one table or signal family dominates recent ingestion.
4. If the symptom is alert-specific, inspect the rule before changing data collection.
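
The check for whether one table dominates recent ingestion can be sketched as a share calculation over exported `Usage` results. The function name and the 50% threshold are assumptions for illustration; pick a threshold that fits your workspace.

```python
def dominant_data_types(rows, share_threshold=0.5):
    """Return (data_type, share) for each type at or above the share threshold.

    rows: iterable of (data_type, total_gb) pairs, e.g. exported Usage results.
    """
    rows = list(rows)
    total = sum(gb for _, gb in rows)
    if total <= 0:
        return []  # nothing ingested in the window, so no table can dominate
    return [
        (data_type, round(gb / total, 2))
        for data_type, gb in rows
        if gb / total >= share_threshold
    ]


if __name__ == "__main__":
    sample = [("ContainerLogV2", 80.0), ("Perf", 15.0), ("Heartbeat", 5.0)]
    print(dominant_data_types(sample))
```

An empty result means no single table crosses the threshold, which points away from a single noisy source and toward broader volume growth.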

8) Fast routing examples

Example A: VM metrics stop, but workspace queries are healthy
- Start in the source/routing layer.
- Open: Agent Not Reporting.
Example B: Application requests exist, but workbook takes minutes to load
- Start in the consumer layer.
- Open: Slow Query Performance.
Example C: New cost spike after a platform rollout
- Start in routing and data-volume analysis.
- Open: High Ingestion Cost.
Example D: Azure resource logs disappeared after a change window
- Start with control-plane evidence and diagnostic settings.
- Open: No Data in Workspace.
Example E: App traces disappeared but requests still arrive
- Start at the application telemetry source.
- Open: Missing Application Telemetry.

9) Escalation boundaries

Only escalate as a likely Azure-side platform incident when all of these are true:

- Narrow control queries are also unhealthy.
- Multiple unrelated signal types or workloads are affected.
- Configuration and routing evidence do not explain the symptom.
- Azure Service Health or broader tenant impact signals support the timeline.

If those conditions are not met, stay in the relevant playbook and keep ruling out hypotheses with evidence.
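
The escalation rule is an all-of gate, which can be made explicit in incident tooling. The sketch below assumes each condition has already been assessed by a human and recorded as a boolean; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EscalationEvidence:
    control_queries_unhealthy: bool
    multiple_unrelated_signals_affected: bool
    config_and_routing_ruled_out: bool
    service_health_supports_timeline: bool


def should_escalate(evidence: EscalationEvidence) -> bool:
    """Escalate as a likely platform incident only when every condition holds."""
    return all(
        (
            evidence.control_queries_unhealthy,
            evidence.multiple_unrelated_signals_affected,
            evidence.config_and_routing_ruled_out,
            evidence.service_health_supports_timeline,
        )
    )


if __name__ == "__main__":
    partial = EscalationEvidence(True, True, False, True)
    print(should_escalate(partial))
```

Modelling the gate with `all` rather than a score keeps a single missing condition, such as unexplained configuration drift, from being outvoted by the others.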

Use architecture labels in incident notes

Add the current failure domain to the incident timeline.

Examples:

Initial routing: Source/Routing
Initial routing: Consumer/Query
Reclassified after evidence: Data-store mismatch