
# Quick Diagnosis Cards

Use these cards when you need a fast symptom-to-first-check mapping before opening a deeper checklist or playbook.

```mermaid
flowchart TD
    A[Symptom observed] --> B[Quick diagnosis card]
    B --> C[First 10 Minutes checklist]
    C --> D[Detailed playbook or KQL pack]
```

## Card 1: No Data in Workspace

| Field | Guidance |
| --- | --- |
| Primary symptom | Workspace tables are empty or stale |
| First question | Is every table stale, or only one source/table? |
| Check first | Heartbeat, AzureActivity, workspace cap, diagnostic settings, DCR association |
| High-probability causes | Daily cap, missing diagnostic settings, missing DCR, agent path break, ingestion delay |
| Open next | First 10 Minutes: No Data |
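
A minimal first KQL check for this card, run against the affected workspace. It compares freshness across two independent ingestion paths; extend the `union` list with the tables you expect to be populated. If every table is stale at once, suspect a workspace-wide cause such as the daily cap before chasing individual sources.

```kusto
// Latest record per high-signal table. Stale Heartbeat AND stale
// AzureActivity points at a workspace-wide problem (daily cap hit,
// export break) rather than a single broken source.
union withsource=TableName Heartbeat, AzureActivity
| summarize LastRecord = max(TimeGenerated) by TableName
| extend MinutesStale = datetime_diff("minute", now(), LastRecord)
```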

## Card 2: Alert Not Firing

| Field | Guidance |
| --- | --- |
| Primary symptom | Expected alert or notification never arrived |
| First question | Did the signal actually meet the rule logic? |
| Check first | Rule enabled state, scope, window, action group, alert processing rules, ingestion delay |
| High-probability causes | Threshold mismatch, wrong scope, disabled rule, suppression, delivery failure |
| Open next | First 10 Minutes: Alert Not Firing |
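
One quick way to answer the first question is to replay the rule's logic by hand. The sketch below uses `Heartbeat` purely as a stand-in; substitute the saved query, source table, and evaluation window from your alert rule.

```kusto
// Replay the alert evaluation manually: run the rule's query over the
// same window it evaluates, then compare the result to the threshold.
Heartbeat                            // stand-in for the rule's source table
| where TimeGenerated > ago(15m)     // match the rule's evaluation window
| summarize ResultCount = count()
// If ResultCount never crosses the configured threshold, the rule
// logic or threshold is the problem, not the delivery path.
```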

## Card 3: High Cost

| Field | Guidance |
| --- | --- |
| Primary symptom | Daily GB or ingestion bill increased sharply |
| First question | Which table and resource grew first? |
| Check first | Usage, _Usage, DCR list, diagnostic settings, Application Insights sampling |
| High-probability causes | Noisy diagnostic category, DCR rollout, verbose traces, retry storm, solution scope expansion |
| Open next | First 10 Minutes: High Cost |
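
A standard first check against the `Usage` table (its `Quantity` column is reported in MB) that answers "which table grew?" before any DCR or diagnostic-setting review:

```kusto
// Billable volume per table over the last two days, largest first.
Usage
| where TimeGenerated > ago(2d)
| where IsBillable == true
| summarize TotalGB = sum(Quantity) / 1024 by DataType
| order by TotalGB desc
```

Widen the window and bin by day if you need to see when the growth started, not just where.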

## Card 4: Query Timeout

| Field | Guidance |
| --- | --- |
| Primary symptom | Logs, workbook, or alert query is too slow or times out |
| First question | Are narrow control queries also slow? |
| Check first | Control query, top table volume, time range, selective predicates, service health |
| High-probability causes | Large scan scope, weak predicates, heavy join/summarize, hot table volume, workbook scope expansion |
| Open next | First 10 Minutes: Query Timeout |
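
A control query in the sense used above is a deliberately cheap probe, for example:

```kusto
// Narrow, cheap control query: tight time filter, no joins, tiny result.
// If even this is slow, suspect service health or workspace-wide load
// rather than the shape of your original query.
Heartbeat
| where TimeGenerated > ago(10m)
| take 10
```

If the control query is fast, work back through the slow query: shrink the time range first, then check that the most selective `where` predicates come before any `join` or `summarize`.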

## Card 5: Missing Application Telemetry

| Field | Guidance |
| --- | --- |
| Primary symptom | App requests, dependencies, traces, or exceptions are empty or stale |
| First question | Is every telemetry type missing, or only one type or one app role? |
| Check first | Application Insights connection string, workspace linkage, AppRequests, AppDependencies, AppTraces, recent deployment timing |
| High-probability causes | Wrong connection string, SDK initialization gap, disabled collection module, endpoint reachability issue, ingestion delay |
| Open next | Missing Application Telemetry |
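
A quick per-type freshness check for workspace-based Application Insights resources (classic resources use different table names such as `requests` and `dependencies`):

```kusto
// Which telemetry types went quiet, and when? One stale table suggests
// a disabled collection module; all stale suggests connection string,
// SDK initialization, or endpoint reachability.
union withsource=TableName AppRequests, AppDependencies, AppTraces, AppExceptions
| summarize LastRecord = max(TimeGenerated) by TableName
```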

## Card 6: Alert Storm

| Field | Guidance |
| --- | --- |
| Primary symptom | Too many alerts or duplicate notifications arrive for one incident |
| First question | Is one rule flapping, or are several overlapping rules firing together? |
| Check first | Alert rule inventory, action groups, alert processing rules, signal replay, dimension count |
| High-probability causes | Flapping threshold, overlapping scope, high-cardinality dimensions, aggressive evaluation frequency, notification fan-out |
| Open next | Alert Storm |
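
One way to separate a flapping rule from overlapping rules is an Azure Resource Graph query over fired alerts (run in Resource Graph Explorer, not in Log Analytics; the `properties.essentials` field path is an assumption based on the common alert schema and may need adjusting):

```kusto
// Fired alerts in the last day, grouped by the rule that raised them.
// One dominant rule => flapping threshold or high-cardinality dimensions;
// many rules with similar counts => overlapping scopes.
alertsmanagementresources
| where type == "microsoft.alertsmanagement/alerts"
| where todatetime(properties.essentials.startDateTime) > ago(1d)
| extend RuleName = tostring(properties.essentials.alertRule)
| summarize AlertCount = count() by RuleName
| order by AlertCount desc
```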

## Card 7: Agent Not Reporting

| Field | Guidance |
| --- | --- |
| Primary symptom | VM or Arc machine stops sending Heartbeat, Perf, or guest logs |
| First question | Is the failure isolated to one machine, one subnet, or one DCR rollout? |
| Check first | Heartbeat, AMA extension state, DCR association, managed identity, endpoint access |
| High-probability causes | Missing DCR association, unhealthy AMA runtime, identity drift, blocked IMDS or Azure Monitor endpoints, wrong data flows |
| Open next | Agent Not Reporting |
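
A quick scoping query that answers the isolation question before any per-machine debugging:

```kusto
// Machines whose last Heartbeat is older than 15 minutes. A cluster of
// silent machines sharing a subnet or a recent DCR rollout points at
// network or association causes rather than individual agent failures.
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(15m)
| order by LastHeartbeat asc
```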

## Card 8: AKS Container Insights Issues

| Field | Guidance |
| --- | --- |
| Primary symptom | Container Insights is blank, partial, or stale for AKS nodes, pods, or namespaces |
| First question | Is monitoring disabled, or is the failure only in one table path such as ContainerLogV2 or KubePodInventory? |
| Check first | AKS monitoring enablement, Azure Monitor extension state, ama-logs pod health, DCR association, KubeNodeInventory and ContainerLogV2 freshness |
| High-probability causes | Monitoring never enabled, AMA pod failure, DCR or DCE mismatch, namespace filtering, blocked ingestion endpoints |
| Open next | AKS Container Insights Issues |
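
A freshness check that separates the two failure modes named in the first question:

```kusto
// Compare the inventory path against the log path. Fresh inventory but
// stale ContainerLogV2 suggests a log-path problem (DCR data flow,
// namespace filtering) rather than monitoring being disabled outright.
union withsource=TableName KubePodInventory, ContainerLogV2
| summarize LastRecord = max(TimeGenerated) by TableName
```

If both tables are stale, check the ama-logs pods in the cluster's kube-system namespace with kubectl before assuming an ingestion-side cause.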

## Card 9: Application Insights Gaps

| Field | Guidance |
| --- | --- |
| Primary symptom | Application Insights still has data, but there are gaps, partial tables, or unexpected low volume |
| First question | Are the gaps explained by sampling, one missing telemetry type, or a recent deployment/change window? |
| Check first | AppRequests ItemCount, ingestion_time(), table-by-type comparison, app settings, deployment history |
| High-probability causes | Adaptive or fixed sampling, module-specific collection gap, recent config drift, private-link or network issue, short analytics delay |
| Open next | Application Insights Gaps |
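
The ItemCount check mentioned above can be sketched as follows: each stored row's `ItemCount` records how many telemetry items it represents after sampling, so the ratio of stored rows to represented items approximates the effective sampling rate.

```kusto
// Effective sampling rate over the last hour. A value well below 100%
// means "low volume" is sampling, not data loss.
AppRequests
| where TimeGenerated > ago(1h)
| summarize StoredRows = count(), RepresentedItems = sum(ItemCount)
| extend EffectiveSamplingPct = round(100.0 * StoredRows / RepresentedItems, 1)
```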

## Coverage Map

| Symptom family | Card to start with | Typical escalation |
| --- | --- | --- |
| Workspace ingestion loss | Card 1 | No Data in Workspace playbook, then Evidence Map |
| Missing alert | Card 2 | Alert Not Firing checklist, then alert rule validation |
| Cost spike | Card 3 | Usage analysis, DCR review, Application Insights sampling review |
| Query slowness | Card 4 | Query optimization and table-volume checks |
| App telemetry outage | Card 5 | Connection string and SDK path validation |
| Excessive notifications | Card 6 | Rule overlap and suppression review |
| Guest agent outage | Card 7 | AMA runtime and DCR association validation |
| AKS monitoring gap | Card 8 | Container Insights enablement and agent pod validation |
| Partial App Insights visibility | Card 9 | Sampling, deployment, and ingestion-delay correlation |

## How to Use the Cards

  1. Match the incident to one primary symptom.
  2. Run one quick KQL check and one CLI or control-plane check.
  3. Escalate into the linked first-response checklist.
  4. Open the detailed playbook only after narrowing to a small hypothesis set.
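
For step 2, the CLI or control-plane check pairs with the KQL check like this. Both commands are real Azure CLI commands, but the resource IDs below are placeholders; substitute your own.

```shell
# Control-plane check 1: is a data collection rule actually associated
# with the machine that stopped reporting? (Cards 1 and 7)
az monitor data-collection rule association list \
  --resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm>"

# Control-plane check 2: does the noisy or silent resource have
# diagnostic settings at all, and pointing at the right workspace? (Cards 1 and 3)
az monitor diagnostic-settings list \
  --resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<kv>"
```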
