
Evidence Map

Data sources to check for each symptom category.

graph TD
    subgraph "Symptoms"
        A[No Data]
        B[Alert Issues]
        C[Cost Issues]
        D[Performance]
        I[AKS Issues]
        J[App Insights Gaps]
    end

    subgraph "Evidence Sources"
        E[Activity Log]
        F[Usage Table]
        G[Heartbeat]
        H[Alert History]
        K[Container Insights Tables]
        L[AppRequests and AppDependencies]
    end

    A --> E & G
    B --> H & E
    C --> F
    D --> F
    I --> K & E & G
    J --> L & E

Evidence by Symptom

No Data in Workspace

| Evidence Source | What to Check | KQL Query |
|---|---|---|
| AzureActivity | Diagnostic setting changes | `AzureActivity \| where OperationNameValue contains "diagnosticSettings"` |
| Heartbeat | Agent connectivity | `Heartbeat \| where TimeGenerated > ago(1h) \| summarize by Computer` |
| Usage | Table ingestion rates | `Usage \| where TimeGenerated > ago(1d) \| summarize sum(Quantity) by DataType` |
| _LogOperation | Ingestion errors | `_LogOperation \| where Level == "Error"` |

Missing Application Telemetry

| Evidence Source | What to Check | KQL Query |
|---|---|---|
| requests | Request data present | `requests \| where timestamp > ago(1h) \| count` |
| dependencies | Dependency tracking | `dependencies \| where timestamp > ago(1h) \| count` |
| traces | Custom logging | `traces \| where timestamp > ago(1h) \| summarize count() by severityLevel` |
| AppServicePlatformLogs | App Service startup | `AppServicePlatformLogs \| where TimeGenerated > ago(1h)` |

Alert Not Firing

| Evidence Source | What to Check | KQL Query |
|---|---|---|
| AzureActivity | Alert rule changes | `AzureActivity \| where OperationNameValue has_any ("alertRules", "metricAlerts", "scheduledQueryRules")` |
| Signal data | Data exists for condition | Check the metric/log table in alert condition |
| Alert history | Past alert instances | Azure Portal → Alerts → History |

Alert Storm

| Evidence Source | What to Check | KQL Query |
|---|---|---|
| AzureActivity | Alert firing rate | `AzureActivity \| where OperationNameValue =~ "Microsoft.Insights/alertRules/activated/action"` |
| Metrics | Metric volatility | Check metric values over time for threshold violations |
| Alert processing rules | Suppression configured | Azure Portal → Alerts → Alert processing rules |

High Ingestion Cost

| Evidence Source | What to Check | KQL Query |
|---|---|---|
| Usage | Data volume by table | `Usage \| where TimeGenerated > ago(1d) \| summarize GB=sum(Quantity)/1000 by DataType \| order by GB desc` |
| Usage | Data volume trend | `Usage \| where TimeGenerated > ago(7d) \| summarize GB=sum(Quantity)/1000 by bin(TimeGenerated, 1d)` |
| _LogOperation | Ingestion anomalies | `_LogOperation \| where Category == "Ingestion"` |

Query Performance

| Evidence Source | What to Check | Notes |
|---|---|---|
| Query text | Time scope present | Ensure the query includes `where TimeGenerated > ago(...)` |
| Query text | Table specified | Avoid `search *` patterns |
| Usage | Table size | Large tables need narrower queries |
| Workspace | Concurrent queries | Check for query throttling |
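
The two query-text checks in the table can be approximated mechanically before a query is saved into a workbook or alert rule. The sketch below is only a plain substring heuristic, not a KQL parser, and the function name `check_kql_scope` is hypothetical. It warns when a query has no `TimeGenerated` filter or uses `search *`, and returns a non-zero status if either warning fires.

```shell
# Hypothetical lint: flag queries that lack a time scope or use `search *`.
# Plain substring matching only; it does not parse KQL.
check_kql_scope() (
    # Subshell body keeps `status` out of the caller's scope.
    status=0
    case "$1" in
        *"TimeGenerated >"*|*"TimeGenerated between"*) ;;
        *) echo "warning: no TimeGenerated filter found"; status=1 ;;
    esac
    case "$1" in
        *"search *"*) echo "warning: 'search *' scans every table"; status=1 ;;
    esac
    exit "$status"
)

# Example: a scoped query passes silently
check_kql_scope 'Heartbeat | where TimeGenerated > ago(1h) | count' && echo "scope looks fine"
```

Because it only matches substrings, the check can miss a time filter written in another form (for example, one applied through a `let` variable), so treat a warning as a prompt to read the query rather than as a verdict.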

Agent Not Reporting

| Evidence Source | What to Check | KQL Query |
|---|---|---|
| Heartbeat | Last heartbeat time | `Heartbeat \| summarize LastHeartbeat=max(TimeGenerated) by Computer` |
| AzureActivity | DCR changes | `AzureActivity \| where OperationNameValue contains "dataCollectionRules"` |
| _LogOperation | Collection errors | `_LogOperation \| where Category == "Collection" and Level == "Error"` |

AKS Container Insights Issues

| Evidence Source | What to Check | KQL Query |
|---|---|---|
| KubeNodeInventory | Cluster and node freshness | `KubeNodeInventory \| where TimeGenerated > ago(30m) \| summarize LastSeen=max(TimeGenerated), Nodes=dcount(Computer) by ClusterName` |
| KubePodInventory | Namespace or pod inventory gaps | `KubePodInventory \| where TimeGenerated > ago(30m) \| summarize Pods=dcount(PodUid), LastSeen=max(TimeGenerated) by ClusterName, Namespace` |
| ContainerLogV2 | Pod log arrival by namespace | `ContainerLogV2 \| where TimeGenerated > ago(30m) \| summarize LogLines=count(), LastSeen=max(TimeGenerated) by _ResourceId, PodNamespace` |
| InsightsMetrics | Metrics path still active | `InsightsMetrics \| where TimeGenerated > ago(30m) \| where Origin startswith "container.azm.ms" \| summarize Samples=count(), LastSeen=max(TimeGenerated) by Namespace` |

Application Insights Gaps

| Evidence Source | What to Check | KQL Query |
|---|---|---|
| AppRequests | Request timeline and ItemCount sampling signal | `AppRequests \| where TimeGenerated > ago(1h) \| summarize Recorded=count(), Estimated=sum(ItemCount), LastSeen=max(TimeGenerated)` |
| AppDependencies | One telemetry type missing while requests still arrive | `AppDependencies \| where TimeGenerated > ago(1h) \| summarize Count=count(), LastSeen=max(TimeGenerated)` |
| AppTraces | Partial logging gaps by severity or role | `AppTraces \| where TimeGenerated > ago(1h) \| summarize Count=count(), LastSeen=max(TimeGenerated) by AppRoleName, SeverityLevel` |
| AzureActivity | Deployment or configuration changes near the gap | `AzureActivity \| where TimeGenerated > ago(24h) \| where ResourceProviderValue has_any ("MICROSOFT.WEB", "MICROSOFT.INSIGHTS")` |

Failure-Domain Matrix

| Failure domain | Relevant tables | Typical symptoms | First KQL check |
|---|---|---|---|
| Source | AppRequests, AppDependencies, AppTraces, Heartbeat, Perf, ContainerLogV2 | One app, VM, or AKS cluster stopped producing expected telemetry while the workspace still receives other data | `union isfuzzy=true (AppRequests \| summarize LastSeen=max(TimeGenerated) by TableName="AppRequests"), (Heartbeat \| summarize LastSeen=max(TimeGenerated) by TableName="Heartbeat"), (ContainerLogV2 \| summarize LastSeen=max(TimeGenerated) by TableName="ContainerLogV2")` |
| Routing | AzureActivity, _LogOperation, Heartbeat, KubeNodeInventory | Diagnostic settings, DCR association, or monitoring extension changed and data stopped shortly after | `AzureActivity \| where TimeGenerated > ago(24h) \| where OperationNameValue has_any ("diagnosticSettings", "dataCollectionRules", "managedClusters") \| project TimeGenerated, OperationNameValue, ResourceGroup, ActivityStatusValue` |
| Data Store | Usage, _Usage, _LogOperation, Operation | Sudden ingestion drop, table growth anomaly, latency, or workspace-side throttling suspicion | `Usage \| where TimeGenerated > ago(2d) \| summarize GB=sum(Quantity)/1000 by DataType, bin(TimeGenerated, 1h) \| order by TimeGenerated desc` |
| Consumer | AppRequests, AppDependencies, Usage, AzureMetrics | Data exists, but a workbook, alert, or query looks empty, late, or too slow | `AppRequests \| where TimeGenerated > ago(1h) \| summarize Count=sum(ItemCount), LastSeen=max(TimeGenerated) by bin(TimeGenerated, 5m)` |

Use the matrix to decide which layer you are debugging: telemetry generation at the source, the Azure Monitor routing path, workspace ingestion and storage behavior, or only the downstream query, workbook, or alert consumer.
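
The Source row's first check generalizes to any set of tables. As a minimal sketch (the helper name `last_seen_query` is hypothetical and not part of any Azure tooling), the shell function below builds the same `union isfuzzy=true` pattern for whichever tables you pass it, so the freshness comparison can cover the tables relevant to the domain under suspicion.

```shell
# Hypothetical helper: print a "last seen per table" KQL query for the
# given table names, following the Source-row pattern in the matrix.
last_seen_query() (
    # Subshell body keeps the working variables out of the caller's scope.
    query='union isfuzzy=true'
    sep=' '
    for table in "$@"; do
        query="${query}${sep}(${table} | summarize LastSeen=max(TimeGenerated) by TableName=\"${table}\")"
        sep=', '
    done
    printf '%s\n' "$query"
)

# Example: compare freshness of two routing-related tables
last_seen_query Heartbeat KubeNodeInventory
```

Paste the printed query into the workspace's Logs view. A table whose `LastSeen` lags well behind the others is the first candidate for a source or routing problem.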

Azure Portal Evidence Locations

| Evidence Type | Portal Location |
|---|---|
| Diagnostic settings | Resource → Diagnostic settings |
| Alert rules | Monitor → Alerts → Alert rules |
| Alert history | Monitor → Alerts → (filter by time) |
| Action groups | Monitor → Alerts → Action groups |
| DCR assignments | Monitor → Data Collection Rules → (select DCR) → Resources |
| Agent health | VM → Extensions + Applications |
| Workspace usage | Log Analytics workspace → Usage and estimated costs |
| Ingestion anomalies | Log Analytics workspace → Insights |

CLI Evidence Collection

# Check diagnostic settings for a resource
az monitor diagnostic-settings list \
    --resource "$RESOURCE_ID"

# List metric alert rules in a resource group
az monitor metrics alert list \
    --resource-group "$RG"

# Check DCR associations for a resource
az monitor data-collection rule association list \
    --resource "$RESOURCE_ID"

# View agent extensions on a VM
az vm extension list \
    --resource-group "$RG" \
    --vm-name "$VM_NAME"
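
The KQL checks in the tables above can also be run from the same shell session. The sketch below is an illustration under stated assumptions rather than part of the original guidance. It wraps `az monitor log-analytics query` (which may require the `log-analytics` CLI extension) in a hypothetical helper named `run_kql`, and it assumes `$WORKSPACE_ID` holds the workspace GUID (customer ID), not the ARM resource ID.

```shell
# Hypothetical helper: run one KQL check against a Log Analytics workspace.
# Assumes WORKSPACE_ID holds the workspace GUID (customer ID), not the ARM ID.
run_kql() {
    # $1 = KQL text, $2 = ISO 8601 lookback duration (default: last hour)
    az monitor log-analytics query \
        --workspace "$WORKSPACE_ID" \
        --analytics-query "$1" \
        --timespan "${2:-PT1H}" \
        --output table
}

# Example (commented out because it needs a signed-in az session):
# run_kql 'Heartbeat | summarize LastHeartbeat=max(TimeGenerated) by Computer' P1D
```

Single-quote the query so the shell passes the pipes and double quotes through to KQL unchanged.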

Change-History Evidence

Use AzureActivity to test whether the incident began right after a change to monitoring configuration, a deployment, identity, or networking. Microsoft Learn troubleshooting guidance repeatedly recommends correlating the first missing-data window with recent control-plane writes before assuming a platform outage.

Query 1: List recent control-plane writes that affect monitoring

AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatusValue in~ ("Success", "Succeeded")  // match either status literal
| where OperationNameValue has_any (
    "diagnosticSettings/write",
    "dataCollectionRules/write",
    "scheduledQueryRules/write",
    "metricAlerts/write",
    "components/write",
    "managedClusters/write")
| project TimeGenerated, ResourceGroup, ResourceProviderValue, OperationNameValue, Caller, ResourceId
| order by TimeGenerated desc

Query 2: Correlate change window with Application Insights gap timing

let GapStart = ago(6h);
let AppTimeline =
    AppRequests
    | where TimeGenerated > GapStart
    | summarize RequestCount = sum(ItemCount) by TimeBucket = bin(TimeGenerated, 15m), AppRoleName;
let MonitorChanges =
    AzureActivity
    | where TimeGenerated > GapStart
    | where OperationNameValue has_any ("components/write", "sites/config/write", "sites/write")
    | summarize Changes = count(), Operations = make_set(OperationNameValue) by TimeBucket = bin(TimeGenerated, 15m), ResourceGroup;
AppTimeline
| join kind=leftouter MonitorChanges on TimeBucket
| order by TimeBucket asc

Query 3: Correlate DCR or AKS changes with missing Container Insights data

let ClusterSignals =
    KubeNodeInventory
    | where TimeGenerated > ago(12h)
    | summarize LastSeen=max(TimeGenerated), Nodes=dcount(Computer) by ClusterName;
AzureActivity
| where TimeGenerated > ago(12h)
| where OperationNameValue has_any ("dataCollectionRules/write", "managedClusters/write", "extensions/write")
| extend ClusterName = extract(@"(?i)managedClusters/([^/]+)", 1, ResourceId)  // case-insensitive; empty for non-AKS resources
| project TimeGenerated, ResourceGroup, ClusterName, OperationNameValue, ResourceId, Caller
| join kind=leftouter ClusterSignals on ClusterName
| order by TimeGenerated desc

When change history is especially useful

  • A rule stopped firing right after someone edited the scope or condition.
  • Container Insights became partial after AKS extension, DCR, or networking changes.
  • Application Insights gaps began after deployment, app-setting, or component updates.
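
The AzureActivity table is populated only when the subscription's activity log is exported to the workspace. If it is empty, the same change history can be read directly from the activity log. The sketch below is an assumption-laden illustration: `recent_writes` is a hypothetical helper name, and the JMESPath filter assumes operation names end in a lowercase `/write`.

```shell
# Hypothetical helper: list recent control-plane writes in a resource group
# straight from the activity log (no workspace required).
recent_writes() {
    # $1 = resource group, $2 = lookback in ##d##h format (default: 24h)
    # The filter is case-sensitive and assumes names end in lowercase "/write".
    az monitor activity-log list \
        --resource-group "$1" \
        --offset "${2:-24h}" \
        --query "[?ends_with(operationName.value, '/write')].{time: eventTimestamp, operation: operationName.value, caller: caller, status: status.value}" \
        --output table
}

# Example (commented out because it needs a signed-in az session):
# recent_writes "$RG" 2d
```

Compare the `time` column against the first missing-data window, as in the AzureActivity queries above.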

See Also

Sources