First 10 Minutes: Alert Not Firing¶

Quick Context¶

Use this checklist when a metric or log condition appears real but Azure Monitor did not create the expected alert, or created it without delivering a notification. In the first 10 minutes, determine whether the miss is in signal evaluation, rule configuration, suppression, or action group delivery.

flowchart TD
    A[Expected alert missing] --> B{Did the signal truly meet rule logic?}
    B -->|No| C[Fix threshold, aggregation, query, or dimensions]
    B -->|Yes| D{Was the rule enabled and scoped correctly?}
    D -->|No| E[Fix rule state or scope]
    D -->|Yes| F{Did the alert activate but notification fail?}
    F -->|Yes| G[Check action group delivery]
    F -->|No| H[Check ingestion delay or suppression]

Step 1: Identify alert type first¶

Metric alert: validate metric aggregation, dimensions, frequency, and window.
Scheduled query alert: validate the exact KQL, time window, and ingestion delay.
Good signal: you know which evaluation engine applies.
Bad signal: the team is replaying the wrong logic in Logs or Metrics Explorer.

Step 2: Check rule state, scope, and evaluation settings¶

Metric alert:

az monitor metrics alert show \
    --resource-group "$RG" \
    --name "$ALERT_RULE_NAME" \
    --output json

Scheduled query alert:

az monitor scheduled-query show \
    --resource-group "$RG" \
    --name "$ALERT_RULE_NAME" \
    --output json

Good signal: rule is enabled, scope is correct, and window/frequency match the expected symptom.
Bad signal: rule is disabled, scoped to the wrong resource, or uses a window that misses the event.

Step 3: Replay the signal using the same evaluation shape¶

Perf
| where TimeGenerated > ago(2h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU=avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| order by TimeGenerated asc

Good signal: replay confirms the threshold breach exactly as the rule evaluates it.
Bad signal: the chart looked bad, but the evaluated signal never crossed the real threshold.

Step 4: Check for ingestion delay if the alert is log-based¶

Perf
| where TimeGenerated > ago(6h)
| extend DelayMinutes = datetime_diff('minute', ingestion_time(), TimeGenerated)
| summarize AvgDelay=avg(DelayMinutes), P95Delay=percentile(DelayMinutes, 95), MaxDelay=max(DelayMinutes)

Good signal: low delay relative to the alert rule window.
Bad signal: delay approaches or exceeds the evaluation window, causing late or missed alerts.

Step 5: Check action groups connected to the rule¶

az monitor action-group list \
    --resource-group "$RG" \
    --output table

Good signal: the expected action group is attached and active.
Bad signal: wrong action group, missing receiver, or outdated endpoint.

Step 6: Check for alert processing rules or suppression¶

az monitor alert-processing-rule list \
    --resource-group "$RG" \
    --output json

Good signal: no rule suppresses the affected scope or time window.
Bad signal: maintenance or routing rules suppress notifications.

Step 7: Check control-plane activity for recent alert-rule changes¶

AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue has_any ("Microsoft.Insights/metricAlerts", "Microsoft.Insights/scheduledQueryRules", "Microsoft.Insights/actionGroups")
| project TimeGenerated, OperationNameValue, ActivityStatusValue, ResourceId, Caller
| order by TimeGenerated desc

Good signal: no risky change aligned with the first missed alert.
Bad signal: a recent write changed rule settings, scope, or receivers.

Decision Points¶

Signal mismatch: the operator's chart or manual query does not match real rule logic.
Rule configuration issue: disabled state, wrong scope, or incorrect cadence explains the miss.
Suppression issue: alert processing rules prevented notification.
Delivery issue: the alert existed, but action group delivery failed.