
# Quick Diagnosis Cards

Use these cards when you need a fast symptom-to-first-check mapping before opening a deeper checklist or playbook.

```mermaid
flowchart TD
    A[Symptom observed] --> B[Quick diagnosis card]
    B --> C[First 10 Minutes checklist]
    C --> D[Detailed playbook or KQL pack]
```

## Card 1: No Data in Workspace

| Field | Guidance |
| --- | --- |
| Primary symptom | Workspace tables are empty or stale |
| First question | Is every table stale, or only one source/table? |
| Check first | Heartbeat, AzureActivity, workspace cap, diagnostic settings, DCR association |
| High-probability causes | Daily cap, missing diagnostic settings, missing DCR, agent path break, ingestion delay |
| Open next | First 10 Minutes: No Data |
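
A minimal first KQL check for this card, run against the affected workspace. It compares freshness across two independent ingestion paths; extend the `union` list with the tables you expect to be populated. If every table is stale at once, suspect a workspace-wide cause such as the daily cap before chasing individual sources.

```kusto
// Latest record per high-signal table. Stale Heartbeat AND stale
// AzureActivity points at a workspace-wide problem (daily cap hit,
// export break) rather than a single broken source.
union withsource=TableName Heartbeat, AzureActivity
| summarize LastRecord = max(TimeGenerated) by TableName
| extend MinutesStale = datetime_diff("minute", now(), LastRecord)
```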

## Card 2: Alert Not Firing

| Field | Guidance |
| --- | --- |
| Primary symptom | Expected alert or notification never arrived |
| First question | Did the signal actually meet the rule logic? |
| Check first | Rule enabled state, scope, window, action group, alert processing rules, ingestion delay |
| High-probability causes | Threshold mismatch, wrong scope, disabled rule, suppression, delivery failure |
| Open next | First 10 Minutes: Alert Not Firing |
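
One quick way to answer the first question is to replay the rule's logic by hand. The sketch below uses `Heartbeat` purely as a stand-in; substitute the saved query, source table, and evaluation window from your alert rule.

```kusto
// Replay the alert evaluation manually: run the rule's query over the
// same window it evaluates, then compare the result to the threshold.
Heartbeat                            // stand-in for the rule's source table
| where TimeGenerated > ago(15m)     // match the rule's evaluation window
| summarize ResultCount = count()
// If ResultCount never crosses the configured threshold, the rule
// logic or threshold is the problem, not the delivery path.
```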

## Card 3: High Cost

| Field | Guidance |
| --- | --- |
| Primary symptom | Daily GB or ingestion bill increased sharply |
| First question | Which table and resource grew first? |
| Check first | Usage, _Usage, DCR list, diagnostic settings, Application Insights sampling |
| High-probability causes | Noisy diagnostic category, DCR rollout, verbose traces, retry storm, solution scope expansion |
| Open next | First 10 Minutes: High Cost |
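
A standard first check against the `Usage` table (its `Quantity` column is reported in MB) that answers "which table grew?" before any DCR or diagnostic-setting review:

```kusto
// Billable volume per table over the last two days, largest first.
Usage
| where TimeGenerated > ago(2d)
| where IsBillable == true
| summarize TotalGB = sum(Quantity) / 1024 by DataType
| order by TotalGB desc
```

Widen the window and bin by day if you need to see when the growth started, not just where.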

## Card 4: Query Timeout

| Field | Guidance |
| --- | --- |
| Primary symptom | Logs, workbook, or alert query is too slow or times out |
| First question | Are narrow control queries also slow? |
| Check first | Control query, top table volume, time range, selective predicates, service health |
| High-probability causes | Large scan scope, weak predicates, heavy join/summarize, hot table volume, workbook scope expansion |
| Open next | First 10 Minutes: Query Timeout |
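
A control query in the sense used above is a deliberately cheap probe, for example:

```kusto
// Narrow, cheap control query: tight time filter, no joins, tiny result.
// If even this is slow, suspect service health or workspace-wide load
// rather than the shape of your original query.
Heartbeat
| where TimeGenerated > ago(10m)
| take 10
```

If the control query is fast, work back through the slow query: shrink the time range first, then check that the most selective `where` predicates come before any `join` or `summarize`.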

## Card 5: Missing Application Telemetry

| Field | Guidance |
| --- | --- |
| Primary symptom | App requests, dependencies, traces, or exceptions are empty or stale |
| First question | Is every telemetry type missing, or only one type or one app role? |
| Check first | Application Insights connection string, workspace linkage, AppRequests, AppDependencies, AppTraces, recent deployment timing |
| High-probability causes | Wrong connection string, SDK initialization gap, disabled collection module, endpoint reachability issue, ingestion delay |
| Open next | Missing Application Telemetry |
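
A quick per-type freshness check for workspace-based Application Insights resources (classic resources use different table names such as `requests` and `dependencies`):

```kusto
// Which telemetry types went quiet, and when? One stale table suggests
// a disabled collection module; all stale suggests connection string,
// SDK initialization, or endpoint reachability.
union withsource=TableName AppRequests, AppDependencies, AppTraces, AppExceptions
| summarize LastRecord = max(TimeGenerated) by TableName
```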

## Card 6: Alert Storm

| Field | Guidance |
| --- | --- |
| Primary symptom | Too many alerts or duplicate notifications arrive for one incident |
| First question | Is one rule flapping, or are several overlapping rules firing together? |
| Check first | Alert rule inventory, action groups, alert processing rules, signal replay, dimension count |
| High-probability causes | Flapping threshold, overlapping scope, high-cardinality dimensions, aggressive evaluation frequency, notification fan-out |
| Open next | Alert Storm |
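
One way to separate a flapping rule from overlapping rules is an Azure Resource Graph query over fired alerts (run in Resource Graph Explorer, not in Log Analytics; the `properties.essentials` field path is an assumption based on the common alert schema and may need adjusting):

```kusto
// Fired alerts in the last day, grouped by the rule that raised them.
// One dominant rule => flapping threshold or high-cardinality dimensions;
// many rules with similar counts => overlapping scopes.
alertsmanagementresources
| where type == "microsoft.alertsmanagement/alerts"
| where todatetime(properties.essentials.startDateTime) > ago(1d)
| extend RuleName = tostring(properties.essentials.alertRule)
| summarize AlertCount = count() by RuleName
| order by AlertCount desc
```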

## Card 7: Agent Not Reporting

| Field | Guidance |
| --- | --- |
| Primary symptom | VM or Arc machine stops sending Heartbeat, Perf, or guest logs |
| First question | Is the failure isolated to one machine, one subnet, or one DCR rollout? |
| Check first | Heartbeat, AMA extension state, DCR association, managed identity, endpoint access |
| High-probability causes | Missing DCR association, unhealthy AMA runtime, identity drift, blocked IMDS or Azure Monitor endpoints, wrong data flows |
| Open next | Agent Not Reporting |
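
A quick scoping query that answers the isolation question before any per-machine debugging:

```kusto
// Machines whose last Heartbeat is older than 15 minutes. A cluster of
// silent machines sharing a subnet or a recent DCR rollout points at
// network or association causes rather than individual agent failures.
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(15m)
| order by LastHeartbeat asc
```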

## Card 8: AKS Container Insights Issues

| Field | Guidance |
| --- | --- |
| Primary symptom | Container Insights is blank, partial, or stale for AKS nodes, pods, or namespaces |
| First question | Is monitoring disabled, or is the failure only in one table path such as ContainerLogV2 or KubePodInventory? |
| Check first | AKS monitoring enablement, Azure Monitor extension state, ama-logs pod health, DCR association, KubeNodeInventory and ContainerLogV2 freshness |
| High-probability causes | Monitoring never enabled, AMA pod failure, DCR or DCE mismatch, namespace filtering, blocked ingestion endpoints |
| Open next | AKS Container Insights Issues |
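
A freshness check that separates the two failure modes named in the first question:

```kusto
// Compare the inventory path against the log path. Fresh inventory but
// stale ContainerLogV2 suggests a log-path problem (DCR data flow,
// namespace filtering) rather than monitoring being disabled outright.
union withsource=TableName KubePodInventory, ContainerLogV2
| summarize LastRecord = max(TimeGenerated) by TableName
```

If both tables are stale, check the ama-logs pods in the cluster's kube-system namespace with kubectl before assuming an ingestion-side cause.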

## Card 9: Application Insights Gaps

| Field | Guidance |
| --- | --- |
| Primary symptom | Application Insights still has data, but there are gaps, partial tables, or unexpected low volume |
| First question | Are the gaps explained by sampling, one missing telemetry type, or a recent deployment/change window? |
| Check first | AppRequests ItemCount, ingestion_time(), table-by-type comparison, app settings, deployment history |
| High-probability causes | Adaptive or fixed sampling, module-specific collection gap, recent config drift, private-link or network issue, short analytics delay |
| Open next | Application Insights Gaps |
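
The ItemCount check mentioned above can be sketched as follows: each stored row's `ItemCount` records how many telemetry items it represents after sampling, so the ratio of stored rows to represented items approximates the effective sampling rate.

```kusto
// Effective sampling rate over the last hour. A value well below 100%
// means "low volume" is sampling, not data loss.
AppRequests
| where TimeGenerated > ago(1h)
| summarize StoredRows = count(), RepresentedItems = sum(ItemCount)
| extend EffectiveSamplingPct = round(100.0 * StoredRows / RepresentedItems, 1)
```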

## Coverage Map

| Symptom family | Card to start with | Typical escalation |
| --- | --- | --- |
| Workspace ingestion loss | Card 1 | No Data in Workspace playbook, then Evidence Map |
| Missing alert | Card 2 | Alert Not Firing checklist, then alert rule validation |
| Cost spike | Card 3 | Usage analysis, DCR review, Application Insights sampling review |
| Query slowness | Card 4 | Query optimization and table-volume checks |
| App telemetry outage | Card 5 | Connection string and SDK path validation |
| Excessive notifications | Card 6 | Rule overlap and suppression review |
| Guest agent outage | Card 7 | AMA runtime and DCR association validation |
| AKS monitoring gap | Card 8 | Container Insights enablement and agent pod validation |
| Partial App Insights visibility | Card 9 | Sampling, deployment, and ingestion-delay correlation |

## How to Use the Cards

  1. Match the incident to one primary symptom.
  2. Run one quick KQL check and one CLI or control-plane check.
  3. Escalate into the linked first-response checklist.
  4. Open the detailed playbook only after narrowing to a small hypothesis set.
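
For step 2, the CLI or control-plane check pairs with the KQL check like this. Both commands are real Azure CLI commands, but the resource IDs below are placeholders; substitute your own.

```shell
# Control-plane check 1: is a data collection rule actually associated
# with the machine that stopped reporting? (Cards 1 and 7)
az monitor data-collection rule association list \
  --resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm>"

# Control-plane check 2: does the noisy or silent resource have
# diagnostic settings at all, and pointing at the right workspace? (Cards 1 and 3)
az monitor diagnostic-settings list \
  --resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<kv>"
```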
